Machine Learning for Fake News Detection
on
Fake News Detection Using Machine Learning
Submitted to Guru Gobind Singh Indraprastha University, Delhi (India)
In partial fulfilment of the requirement for the award of the degree
of
Bachelor of Technology
in
Information Technology
Submitted By:
CANDIDATE’S DECLARATION
It is hereby certified that the work which is being presented in the [Link]. Minor
Project report entitled FAKE NEWS DETECTION USING MACHINE LEARNING in partial fulfilment of
the requirements for the award of the degree of Bachelor of Technology and submitted
in the Department of Information Technology, New Delhi (Affiliated to Guru
Gobind Singh Indraprastha University, New Delhi) is an authentic record of our
work carried out during a period from February 2024 to May 2024 under the guidance
of Prof. Tripti Sharma, Professor (IT).
The matter presented in the [Link]. Minor Project Report has not been submitted by us
for the award of any other degree of this or any other institute.
This is to certify that the above statements made by the candidates are correct to the best
of my knowledge. They are permitted to appear in the External Minor Project
Examination.
ACKNOWLEDGEMENT
We would like to extend our sincere thanks to the Head of Department, Prof.
Prabhjot Kaur, for her timely suggestions towards completing our project work. We are
also thankful to Prof. Archana Balyan, Director (O) of MSIT for providing us with
the facilities to carry out our project work.
ABSTRACT
Fake news has become a pervasive issue in today's information landscape, spreading
misinformation and eroding trust in media sources. In response, machine learning techniques
have emerged as powerful tools for automatically detecting and combating fake news. This
research paper delves into the application of machine learning algorithms, including Support
Vector Machines (SVM), Artificial Neural Networks (ANN), and Random Trees (RT), for the
accurate identification of fake news articles.
The paper discusses the benefits and challenges of using machine learning in fake news
detection, highlighting the importance of feature selection, model evaluation, and the
integration of diverse data sources such as textual content, social media metadata, and user
engagement patterns. Additionally, it explores the role of natural language processing
techniques in extracting meaningful features from news articles to improve classification
accuracy.
The research findings underscore the potential of machine learning in mitigating the spread of
fake news, contributing to a more informed and trustworthy media environment.
Keywords: Fake News Detection, Machine Learning, SVM, ANN, RT, Natural Language Processing.
CONTENT
ABSTRACT
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Introduction
1.2 Motivation of the work
1.3 Problem Statement
CHAPTER 2 LITERATURE SURVEY
CHAPTER 3 METHODOLOGY
3.1 Existing System
3.2 Proposed System
3.2.1 Collection of dataset
3.2.2 Selection of attributes
3.2.3 Pre-processing of Data
3.2.4 Balancing of Data
3.2.5 Prediction of Disease
CHAPTER 5 EXPERIMENTAL ANALYSIS
APPENDIX
LIST OF FIGURES
LIST OF TABLES
2 Accuracy Table
LIST OF ABBREVIATIONS
AI Artificial Intelligence
CHAPTER-1
INTRODUCTION
Fake news can originate from various sources, ranging from unintentional
misinformation to deliberate attempts at misleading the public for various
agendas. This complexity underscores the critical need for robust mechanisms
to identify and address fake news promptly and effectively. News
organizations, in particular, face the daunting task of ensuring the accuracy
and credibility of their reporting amidst the rampant spread of
misinformation.
Here is where machine learning emerges as a potent ally in the fight against
fake news. By harnessing the power of machine learning techniques,
organizations can leverage sophisticated algorithms to analyze vast datasets,
identify patterns, and derive insights that facilitate the detection of false
information. Machine learning excels in learning from historical data,
recognizing subtle patterns, and adapting its detection capabilities to evolving
tactics used by purveyors of fake news.
Natural language processing (NLP) techniques enable the analysis of textual content to uncover
linguistic cues indicative of misinformation. Sentiment analysis can gauge the
tone and context of news articles, helping to distinguish between objective
reporting and biased or sensationalized content. Network analysis can identify
the spread of fake news across social media platforms, tracing its origins and
propagation pathways.
The promise of machine learning in fake news detection lies in its ability to
automate and scale the process of identifying and mitigating the impact of
false information. By continuously refining algorithms, incorporating feedback
loops, and leveraging diverse data sources, machine learning empowers
organizations to stay vigilant against the threats posed by fake news.
The motivation behind conducting research on fake news detection stems from
the pressing need to combat the detrimental effects of misinformation in
today's digital era. Fake news has emerged as a pervasive and disruptive force,
capable of influencing public opinion, shaping political discourse, and eroding
trust in media sources. The consequences of fake news extend beyond
individual misinformation; they can sow discord, fuel social divisions, and
undermine democratic processes.
The problem statement revolves around the need for robust mechanisms to
detect and combat fake news effectively. Current methods for identifying fake
news rely heavily on manual fact-checking processes, which are time-
consuming, resource-intensive, and often unable to keep pace with the rapid
dissemination of false information.
Moreover, the sophistication of fake news tactics, including deepfakes,
misleading headlines, and manipulated images, further complicates the task of
distinguishing between genuine and fabricated content. As a result, there is a
pressing need for automated solutions that can analyze vast amounts of data,
detect subtle patterns indicative of fake news, and provide timely and accurate
assessments of information credibility.
CHAPTER-2
LITERATURE SURVEY
[1] Ahmed, S., Hinkelmann, K., Corradini, F.: Combining machine learning
with knowledge engineering to detect fake news in social networks - a survey.
In: Proceedings of the AAAI 2019 Spring Symposium, vol. 12 (2019).
Ahmed et al. (2019): This study explores the integration of machine learning
and knowledge engineering to detect fake news in social networks. It presents
a survey of techniques and methodologies used in this context, highlighting the
importance of combining domain knowledge with machine learning
algorithms for effective fake news detection.
[3] Atodiresei C-S, Tănăselea A, Iftene A. Identifying fake news and fake users
on Twitter. Procedia Comput. Sci. 2018;126:451–461. doi:
10.1016/[Link].2018.07.279.
Atodiresei et al. (2018): This research focuses on identifying fake news and
fake users on Twitter, examining techniques and strategies for detecting and
mitigating misinformation on social media platforms.
[4] Burkhardt JM. History of fake news. Libr. Technol. Rep. 2017;53(8):37.
Burkhardt (2017): Burkhardt delves into the history of fake news, providing
a historical perspective on the evolution of misinformation and its prevalence
in various forms of media.
[5] Castelo, S., Almeida, T., Elghafari, A., Santos, A., Pham, K., Nakamura, E.,
Freire, J.: A topic-agnostic approach for identifying fake news pages. In:
Companion Proceedings of the 2019 World Wide Web Conference on - WWW
2019, pp. 975–980 (2019). 10.1145/3308560.3316739.
Castelo et al. (2019): The study presents a topic-agnostic approach for
identifying fake news pages, showcasing a method that is not dependent on
specific topics but rather focuses on broader patterns and characteristics of
fake news content.
[6] Chen, Y., Conroy, N.J., Rubin, V.L.: Misleading online content:
recognizing clickbait as false news? In: Proceedings of the 2015 ACM on
Workshop on Multimodal Deception Detection - WMDD 2015, Seattle,
Washington, USA, pp. 15–19. ACM Press (2015a). 10.1145/2823465.2823467.
Chen et al. (2015a): Chen and colleagues investigate misleading online
content, particularly clickbait, and its association with false news. The paper
explores the nuances of deceptive content and its potential impact on
information credibility.
[7] Chen Yimin, Conroy Niall J., Rubin Victoria L. News in an online world:
The need for an “automatic crap detector”. Proceedings of the Association for
Information Science and Technology. 2015;52(1):1–4.
Chen et al. (2015b): In another study, Chen et al. emphasize the need for an
"automatic crap detector" in the online news ecosystem, highlighting the
challenges of discerning reliable information from misleading or false content.
[8] Hassan, N., Arslan, F., Li, C., Tremayne, M.: Toward automated fact-
checking: detecting check-worthy factual claims by claimbuster. In:
Proceedings of the 23rd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining - KDD 2017, Halifax, NS, Canada,
pp. 1803–1812. ACM Press (2017). 10.1145/3097983.3098131.
Hassan et al. (2017): The research discusses automated fact-checking and the
development of tools such as ClaimBuster for detecting check-worthy factual
claims. The paper focuses on improving the accuracy and efficiency of fact-
checking processes in combating fake news.
CHAPTER 3
METHODOLOGY
Datasets:
To find patterns in Fake News, news articles first need to be collected and labeled.
Both Fake News and legitimate news need to be represented in roughly equal
amounts. This prevents the frequency of Fake News in the dataset from being
used as a determining factor in classification. Having good data is essential
to producing valid results. Good data in this context is data that is representative
of the real world and is generalizable.
The dataset used to train the classifiers is the ISOT Fake News Dataset, reported to be
the largest available dataset of full-length Fake News articles [8], [9]. The ISOT
dataset contains 21,417 articles labeled Real and 23,481 labeled Fake,
totaling 44,898. FakeNewsNet is another dataset containing full-length
articles; however, it has only 422 labeled articles [6]. Lastly, there
is a set of 180 articles, 90 Fake and 90 Real, collected by the author, which
will be referred to as the Original Data. These two additional datasets will be
used to test the accuracy of the trained classifiers.
Each model will initially be trained with 80% of the ISOT data. The
remaining 20% of the ISOT data will be used to test the accuracy of the
trained classifiers. As mentioned, FakeNewsNet and the Original Data will be
used for testing as well. The reasoning behind using these additional tests is to
make sure we are detecting Fake News and not some other pattern of the
ISOT dataset, such as a style of a particular news organization.
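The 80/20 split described above can be sketched in a few lines of plain Python. This is only an illustration, not the thesis code: the function name, the fixed seed, and the toy corpus are assumptions introduced here.

```python
import random

def split_train_test(articles, train_fraction=0.8, seed=0):
    """Shuffle (text, label) pairs and split them into train and test parts."""
    shuffled = list(articles)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy corpus standing in for the ISOT articles.
toy = [(f"article {i}", "Fake" if i % 2 else "Real") for i in range(10)]
train, test = split_train_test(toy)
print(len(train), len(test))
```

Shuffling before the cut matters when the source file lists all Real articles before all Fake ones; without it, the test portion could contain a single class.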
Each article labeled as Real in the ISOT dataset was collected from Reuters;
all of these articles started with the word “Reuters”. This pattern could easily be
picked up by humans and machines alike. To avoid this issue, the beginning
“Reuters” phrase was removed from each article.
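A minimal sketch of this kind of header removal is shown below. The exact header format in the ISOT files is not given in the text, so the pattern (an optional short dateline, the word "Reuters" optionally in parentheses, and an optional dash or colon) is an assumption, as is the function name.

```python
import re

# Assumed header shape: optional dateline of up to 60 characters, then
# "Reuters" (optionally in parentheses), then an optional dash or colon.
# The real ISOT headers may differ from this guess.
_REUTERS_HEADER = re.compile(r"^[^\n]{0,60}?\(?Reuters\)?\s*[-:]?\s*")

def strip_reuters_header(text):
    """Remove a leading Reuters header from an article, if one is present."""
    return _REUTERS_HEADER.sub("", text, count=1)

print(strip_reuters_header("WASHINGTON (Reuters) - The Senate voted on Tuesday."))
print(strip_reuters_header("No header here."))
```

A pattern this loose would also trim an article that merely mentions Reuters in its first sentence, so a stricter rule may be preferable once the real header format is known.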
Features:
To find patterns, several different features should be tested. Features are
numeric values that describe the text. Examples of these numeric values are the word
count or the number of times a particular punctuation mark is used. Some features
will be more helpful than others; for instance, the number of verbs is more likely to
be useful than the number of times a particular word, such as ‘kitten’, is used.
The goal is to find the features that are most helpful in the detection of Fake News. Next,
each extracted feature will be discussed in detail.
Word counts are among the most easily obtained features that can be extracted from
raw text. A word count is simply a count of each term in a body of text. Word counts
are also called a ‘bag of words’; however, to keep names descriptive, we shall call this
type of feature a count. To get the word counts of texts, scikit-learn’s CountVectorizer
is used; the CountVectorizer tokenizes the data and then counts each term [14]. The
data can be tokenized by word or by n-gram. N-grams are series of n items, such as
words or characters. In this thesis, n-grams refers to groupings of two and three
characters. For instance, the n-grams of the word ‘feature’ would be as follows: ‘fe’,
‘ea’, ‘at’, ‘tu’, ‘ur’, ‘re’, ‘fea’, ‘eat’, ‘atu’, ‘tur’, and ‘ure’. These features will be
referred to as count-word and count-ngram respectively.
Term frequency-inverse document frequency, or TF-IDF, is calculated as
follows: term frequency times the inverse document frequency.
Here, term frequency is the number of times a term appears in a document divided by the
number of terms in the document. The inverse document frequency is the logarithm of
the number of texts (or articles) in the collection divided by the number of texts
in which the term appears. Below is the equation for TF-IDF:
\[
\text{TF-IDF} = \frac{\text{number of term occurrences}}{\text{number of terms in text}} \times \log\left(\frac{\text{number of texts in collection}}{\text{number of texts where term occurs}}\right)
\]
TF-IDF is a way to rank the importance of a term within a text with respect to all the
texts in the collection. It ranks common words as less important (smaller numeric
values) and less used words as more important (larger numeric values). The
implementation used in the software produced in conjunction with this thesis is the
one provided by scikit-learn’s feature extraction module [14].
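To make the count, n-gram and TF-IDF definitions above concrete, here is a small dependency-free sketch. It follows the formula as written in this section rather than scikit-learn's implementation, which by default applies smoothing and normalization and so gives different numbers. The function names and toy collection are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(word, sizes=(2, 3)):
    """Return the character n-grams of `word` for each size in `sizes`."""
    return [word[i:i + n] for n in sizes for i in range(len(word) - n + 1)]

def tf_idf(term, doc, collection):
    """TF-IDF as defined above: (occurrences / terms in text) * log(N / df)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in collection if term in d)  # texts where the term occurs
    return tf * math.log(len(collection) / df) if df else 0.0

# Hypothetical, pre-tokenized toy collection.
docs = [["fake", "news", "fake"], ["real", "news"]]
print(char_ngrams("feature"))         # the bigrams and trigrams listed in the text
print(Counter(docs[0]))               # count-word feature for the first text
print(tf_idf("fake", docs[0], docs))  # term found in one text only: positive value
print(tf_idf("news", docs[0], docs))  # term found in every text: log(1) gives 0.0
```

The last two lines show the ranking behaviour described above: a term that appears in every text contributes nothing, while a term confined to one text is weighted up.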
The terms can be either on a word or n-gram level; these features will be referred to
as TFIDF-word and TFIDF-ngram respectively.
Fake News often uses people’s emotions and preconceptions to manipulate readers [15].
Although sentiment analysis is considered to be separate from Fake News detection,
sentiment analysis could improve Fake News detection. To explore this using data
science, the sentiment of an article needs to be quantified. To achieve this a sentiment
analyzer is required, and several are available. VADER (Valence Aware Dictionary and
sEntiment Reasoner) is one of those tools [16]. VADER is publicly available and is
reported to perform better than other benchmark sentiment analysis tools such as LIWC,
GI, WordNet, and SentiWordNet. This feature will be referred to as VADER. VADER gives
four numbers: a score of how negative the tone of a piece of text is, how positive the
text’s tone is, how neutral it is, and a ‘compound’ score that combines these into a
single overall sentiment value. The negative, positive, and neutral scores range from
0 to 1, and the compound score ranges from -1 to 1. Because some classifiers are not
able to use negative numbers, each VADER score will have 1 added to it, so that the
compound score ranges from 0 to 2. Shifting the range of the VADER scores in this way
does not affect their meaning.
Stop words are common words that are taken out of a text to improve accuracy in some
data science applications. By removing stop words from a body of text we can focus
better on the words which make the text distinct. There are a number of ready-made
lists of stop words; however, not all lists are good for all applications [17].
For instance, a word that is common and useless in one context could be
important in another. Two ready-made English lists are NLTK’s stop word list and
spaCy’s stop word list. These will be referred to as NLTKStop and spaCyStop
respectively.
Part-of-speech (PoS) tagging is the process of labeling what part of speech a word
is, based on the word and the surrounding words. Sentences are formed by using
different parts of speech, so sentences can be analyzed by looking at the patterns
formed by combining them. Exploring these patterns, and where they occur, could
provide valuable insight. NLTK provides PoS tagging capabilities [18].
The NLTK tagging includes different tags for different tenses. For instance, a past
tense verb is not tagged the same as a verb in the present tense. This feature will
help the machine learning algorithms to take into account whether an author is writing
in the present, future, or any other tense. This feature will be referred to as PoS.
Lemmatization is the process of reducing a word to its root form. For example, cats
would be cat and feet would be foot. Computers do not understand that feet and foot
are closely related and therefore cannot take such things into account. However, by
lemmatizing the text we can turn all the forms of a word into the root word, allowing
the classifying algorithms to focus on the root words. NLTK provides lemmatization. A
wrapper function, written by Ken Tsuji, was used in the software produced in
conjunction with this thesis [19].
Although lemmatizing a word loses its tense, this should not be a problem because the
close relationship between different tenses of a word is being revealed. This feature
will be referred to as lemma.
Named entity recognition is the process of identifying persons, organizations, and
other named entities. This is important for algorithms as they do not process the
meaning of words. By labeling words as ‘person’ or ‘organization’, algorithms can pick
up on patterns involving these entities that would otherwise be obscured. For this
thesis, spaCy’s named entity recognition was used. This feature will be referred to as ER.
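Two of the simpler transformations described in this section, shifting the VADER scores and removing stop words, can be sketched as follows. This is an illustration under stated assumptions: the score dictionary is hypothetical analyzer output that uses VADER's neg/neu/pos/compound key names, and the tiny stop list stands in for the NLTK and spaCy lists.

```python
def shift_vader_scores(scores, offset=1.0):
    """Add `offset` to every VADER score so that none is negative."""
    return {name: value + offset for name, value in scores.items()}

def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in `stop_words` (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

# Hypothetical analyzer output for one article.
raw = {"neg": 0.25, "neu": 0.5, "pos": 0.25, "compound": -0.5}
print(shift_vader_scores(raw))

# Tiny illustrative stop list, not the NLTK or spaCy list.
stop_words = {"the", "is", "a", "of"}
print(remove_stop_words("The story is a hoax".split(), stop_words))
```

Because the same constant is added to every score, the ordering of articles by any one score is unchanged, which is why the shift does not alter what the scores mean.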
Classifiers:
As previously mentioned, the extracted features were used to train classifiers. The
classifiers used are now discussed.
Naïve Bayes (NB) is a type of classifier that takes each feature and treats it as
unrelated to any other feature. It then calculates the probability that the particular
feature belongs to a classification. It does that for each feature and then aggregates
the individual probabilities to calculate the final classification. For example, with
count-word it would calculate the probability that the count of the first word belongs
to Fake News as opposed to not. This process continues for every word, and from these
probabilities a final decision is made.
Before describing the next classifier, we will consider decision trees. A decision
tree classifier takes the values of the features and splits them into two groups such
that each group is as close as possible to only having a single classification. This is
repeated until each group consists of a single classification. See Figure 1 for a
visual example of a decision tree. The main issue with decision trees is that they do
not generalize very well. They tend to fit the training data so well that the general
patterns in the data are overlooked. This is where the next classifier comes in.
Random Forests are a type of classifier built out of a collection of decision trees.
However, instead of each decision tree training on all of the data, each decision tree
gets a random subset of the data to train on, making each decision tree in the forest
unique. When classifying, each decision tree in the forest gives its own classification;
whichever classification gets the most votes from the decision trees wins.
Figure 1: Decision Tree
All of the code written for this thesis is provided in Appendix B.
[Link] contains functions for reading datasets from files, splitting training
and testing data, training classifiers, testing classifiers, and printing results.
[Link] contains functions for feature extraction. [Link] uses
the functions from [Link] and [Link] to test the different
features. [Link] simply contains the code used to remove the Reuters
headers from the news articles.
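As an illustration of the two classifiers described in this chapter, the following is a minimal plain-Python sketch rather than the thesis code from Appendix B. It assumes a multinomial Naïve Bayes with add-one smoothing, which the text does not specify, and it shows only the voting step of a Random Forest. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect per-class word counts and class frequencies from tokenized texts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict_naive_bayes(model, tokens):
    """Sum per-word log-probabilities for each class and return the best class."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)     # class prior
        denom = sum(word_counts[label].values()) + len(vocab)  # add-one smoothing
        for word in tokens:
            if word in vocab:                                  # skip unseen words
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def majority_vote(votes):
    """Return the classification that received the most votes from the trees."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data, already tokenized.
docs = [["shocking", "secret", "cure"], ["senate", "passes", "bill"],
        ["shocking", "hoax", "exposed"], ["court", "passes", "ruling"]]
labels = ["Fake", "Real", "Fake", "Real"]
model = train_naive_bayes(docs, labels)
print(predict_naive_bayes(model, ["shocking", "cure"]))
print(majority_vote(["Fake", "Real", "Fake"]))
```

Summing logarithms instead of multiplying raw probabilities is the usual way to aggregate the per-feature probabilities, since the product of many small numbers underflows on long articles.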
CHAPTER 4
RESULT
Using two different models, each extracted feature was tested. The models used were
Random Forest (RF) and Naïve Bayes (NB). There is some difference between the two
classifiers, but a much larger difference between datasets. The following is a detailed
discussion of each set of features. We will compare features and classifiers by their
accuracy, which is the percentage of correct classifications made by the classifier.
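The accuracy measure defined above can be written as a short function. The function name and the example labels are hypothetical and are not taken from the thesis code.

```python
def accuracy(predicted, actual):
    """Percentage of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical labels for five articles; four of the five predictions match.
print(accuracy(["Fake", "Real", "Fake", "Real", "Fake"],
               ["Fake", "Real", "Real", "Real", "Fake"]))
```

On a test set with equal numbers of Fake and Real articles, such as the Original Data, random guessing would score about 50% on average, which is the baseline the results below are implicitly compared against.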
Count-word and Count-ngram: First and most notably, the ISOT testing data gets much
higher accuracy than either the Original dataset or the FakeNewsNet dataset. After
ISOT, the Original dataset gets the next highest accuracy rates.
This suggests that the Original dataset is closer in makeup to the ISOT dataset than
FakeNewsNet is. Next, the data shows that the NB classifier generalizes better
than the RF classifier. The NB classifier gets better accuracy rates with count-ngram.
The RF has no clear winner between count-word and count-ngram.
TFIDF-word and TFIDF-ngram: As seen in Figure 3, the ISOT testing data has the
highest accuracies again. The RF classifier gets better results on the ISOT
dataset than the NB classifier. However, NB generalizes better to the Original
dataset and the FakeNewsNet dataset. TFIDF-word gets better accuracy rates
than TFIDF-ngram. In the case of the RF’s classification of the Original dataset,
TFIDF-word gets 6.47% higher accuracy. Again, the Original dataset is
classified better than the FakeNewsNet dataset. Between TFIDF and Count,
Count-ngram gets the best accuracy results.
ER: Once again, ISOT does best and NB generalizes better. Compared to the previous
features, ER is not as good a feature by itself. However, it cannot be concluded that
ER is not a good feature. More testing with ER combined with other features should
be done before disregarding ER as a feature for Fake News detection.
Accuracy with the ER feature:
Dataset          RF        NB
ISOT News        85.71%    74.93%
FakeNewsNet      54.50%    56.40%
Original Data    60.59%    65.29%
PoS: Here we see for the first time that NB does not generalize better than the RF. Also,
there is an accuracy below 50%, which shows that using this feature to classify is
no better than a random guess. With an accuracy as low as 44.79%, it can be concluded
that PoS by itself is not a good feature for Fake News classification. However, there
is a chance that PoS might be a good feature when combined with another feature.
Accuracy with the PoS feature:
Dataset          RF        NB
ISOT News        93.68%    82.17%
FakeNewsNet      50.23%    44.79%
Original Data    67.06%    60.00%
VADER: As Figure 6 shows, the VADER feature is very detrimental to the accuracy
rates. While this is not enough to conclude that VADER will not be helpful when
combined with other features, it does suggest that VADER alone is not very helpful for
classifying Fake News. Although PoS has an instance of lower accuracy, VADER is
lower overall and is therefore a worse feature.
Stop Word: Once more, ISOT dominates the accuracy rates and the Original
dataset is in second place. Figure 7 shows that NB generalizes much better than the
RF classifier. Although the results are close, the NLTK list of stop words is superior
to the spaCy list for Fake News detection. From the results, we can see that the
Original dataset benefits greatly from NLTKStop and spaCyStop compared to Count-word.
FakeNewsNet also benefits from NLTKStop and spaCyStop, just not as much.
Lemma: Once again, ISOT accuracies are the highest, with Original coming in second.
The NB classifier still generalizes better than the RF classifier. The results from
lemma are better than some of the other features. However, lemma with a Count-word
is not as accurate as Count-word alone, suggesting that the different forms of a word
are helpful to the classifier.
From the results a few more general conclusions can be made. The most notable is that
the accuracy on the ISOT test data is much higher than the accuracies on the other
datasets. From this, it can be concluded that there is a pattern in the ISOT dataset
that is being picked up by the two classifiers. However, it appears that these patterns
do not generalize well to the other available datasets. The patterns that the
classifiers are picking up on could be a pattern found in Reuters articles, or could be
another pattern that exists mainly in the ISOT dataset, such as article topic or
political leaning. All of this suggests that ISOT is not a good dataset to train with.
Next, it can be seen that the Original dataset is classified with better accuracy than
the FakeNewsNet dataset. The Original dataset does not contain articles from Reuters;
hence, Reuters-specific patterns do not explain the jump in accuracy. Therefore, it is
possible that the Fake News within the ISOT and Original datasets is closer in
underlying structure.
For accuracy rates, it can be concluded that Counts and TFIDF generalize better than
ER, PoS, and VADER. More tests should be done with the ER, PoS, and VADER features
before any of them are discarded for providing lower accuracy rates. They may still
benefit accuracy rates when combined with other features, despite not doing well by
themselves. Complete tables of accuracy rates for both classifiers can be found in
Appendix A.
Conclusions:
With the nature of the Internet as it is, Fake News is easily created and distributed.
Fact checking is tedious and time consuming, so automating Fake News detection is
critical. Thus Fake News classifiers should be created. However, a classifier does not
come out of thin air; it must be trained on already existing data. The quality and
quantity of the data are important. Three datasets were used for the research in this
thesis: ISOT, a large dataset of over 40,000 articles; FakeNewsNet, a much smaller
dataset containing 422 articles; and the Original dataset, containing 180 articles,
which was gathered specifically for this research.
A classifier cannot read, so it must have features extracted from the articles. A
feature is a numeric value extracted from the article, such as a word count or a count
of parts of speech, or a more complicated feature, such as a count of the named
entities, like businesses or organizations. But which features work best? Two
different classifiers, Random Forests and Naive Bayes, were trained on the 80% of the
ISOT dataset reserved for training using each of the ten different features:
Count-word, Count-ngram, TFIDF-word, TFIDF-ngram, PoS, ER, Lemma, VADER, NLTKStop, and
spaCyStop. Then each classifier was tested on the remaining 20% from ISOT, all of
FakeNewsNet, and all of the Original dataset. The accuracy results were then examined
and conclusions were drawn.
The ISOT dataset did not generalize well to the other two datasets used for testing:
the 20% testing portion of ISOT got much higher accuracy than the other two datasets.
This could be due to the fact that ISOT got all of its real news from Reuters, so the
classifiers may have ended up being Reuters vs. not-Reuters classifiers. Next, it was
found that Count and TFIDF are better standalone features than PoS, ER, and VADER.
However, these features still have the potential to be used in conjunction with other
features. Although Lemma was one of the better features, it was outperformed by
Count-word, suggesting that some of the removed data was improving the classification.