Fake News Detection Using Machine Learning
Ashutosh Kumar1, Aditi Rath2, Ashutosh Sarangi3, Indrajit Das4, Prativa Das5
Department of Computer Science and Engineering, Siksha ‘O’ Anusandhan (Deemed to be)
University, Bhubaneswar, Odisha, India
[Link]@[Link]
[Link]@[Link]
[Link]@[Link]
2041004164.indrajitdas1@[Link]
Abstract. The massive rise in data sharing around the globe has raised
concerns about the reliability of sources. Users of online social networks are
susceptible to being duped by false information and misleading language, which
in turn affects offline society. For online information to remain reliable, false
information needs to be exposed as soon as possible. This study examines the issues
caused by false information and the intricate relationships between news
sources, subjects, and producers. The objective is to identify fake news writers,
sources, and themes on social media platforms and to assess how effectively they
can be identified. Machine learning algorithms are evaluated on these
characteristics using various approaches. Given suitable training data and
techniques, machine learning algorithms can consistently differentiate between
authentic and fake news, and models can be built for specific domains. The
proposed model is implemented in Python using VS Code. Continuous research
is necessary to adapt detection algorithms to evolving patterns and challenges.
Because machine learning-based detection has limitations, including biased
training data and adversarial attacks, it should be used alongside fact-checking
and expert analysis.
Keywords: Feature Extraction, Fake News, Machine Learning, Text Classifica-
tion, Information Exchange.
1 Introduction
Nowadays, the average person cannot tell fake news from real news because of the
sheer volume of information shared globally. The misleading language of online fake
news spreads swiftly among users of social media sites, and it is already greatly
affecting society offline [1]. Identifying fake news has therefore become an
important problem. Nonetheless, most research concentrates on specific datasets or
domains, most prominently politics [3]. Consequently, an algorithm trained on one
article domain does not produce optimal results when exposed to articles from
multiple domains [4]. Because text structure differs from one news domain to
another, it is challenging to create a universal algorithm that works well in every
domain. In this study, we address the fake news detection problem by proposing an
ensemble machine learning technique. Such methods enable the effective and
efficient training of different machine learning algorithms.
We also conducted thorough testing using four publicly available real-world datasets.
Ongoing research and improvement are required because of issues like the availability
of high-quality labelled datasets, the dynamic nature of fake news, and the possibility
of adversarial attacks. However, advancements in this subject could lead to the
development of reliable systems that can prevent the spread of false information and
maintain information integrity in the digital age.
1.1 Motivation(s)
This study tackles the issues raised by the variety of relationships between
news items, producers, and subjects, and by the unpredictable nature of fake
news. Because machine learning-based techniques offer a more precise and
effective means of confirming the credibility of news sources, they can
mitigate the negative effects of erroneous information [2]. Leveraging machine
learning techniques is an ethically responsible way to mitigate the harm caused
by inaccurate information, and it offers a powerful tool to combat fake news.
1.2 Objectives(s)
The objective is to locate news stories or other resources that present
inaccurate or misleading information. The purpose of such detection
technologies is to enable people to read news and information safely. A further
objective is to determine the main obstacles and constraints that machine
learning-based fake news identification faces, such as problems with data
quality, model robustness, and the dynamic nature of fake news.
1.3 Original Contributions
Our research looks into various textual traits that may be used to differentiate
between real and fake content. We use a range of ensemble techniques that are
not extensively discussed in the current literature to train several machine
learning algorithms [5]. These techniques enable the effective and efficient
training of various machine learning algorithms. We have tested extensively on
four publicly available real-world datasets [6].
1.4 Paper Layout
The paper is organized into sections so that specific information is easy to
locate. Section 1 presents the paper's introduction; Section 2 reviews the
literature; and Section 3 presents our proposed model. Section 4 presents our
findings along with the associated statistical measures. Section 5 concludes the
paper and discusses potential future developments.
2 Literature Survey
Kumar et al. [5] recommend that future research projects use Twitter and Facebook
datasets. In 2020, Thakur et al. [4] examined and addressed several issues in
automated news verification to ascertain the veracity of the news. Using CNN and
deep learning principles, Khan et al. [3] and Wang et al. [1] achieved high accuracy
(around 93.5%) in the detection of fake news in 2018 and 2019. Notwithstanding these
endeavors, numerous obstacles remain. The absence of a standardized dataset is one
of the main obstacles, because it makes it challenging to compare the performance of
different systems. The accuracy of fake news detection algorithms can also be
affected by the diverse datasets that researchers employ. The volume and speed at
which information is created and exchanged online present another difficulty.
Producers of fake news may also employ cunning strategies to spread misleading
information, such as manipulating visual media or setting up fake social media
profiles. Nevertheless, tools already exist to help counter false information.
Examples include fact-checking websites and browser plugins that indicate the
reliability of news sources. Furthermore, media literacy can help combat false
news, and education can support it. Research works such as Shu et al. (2017) and
Ruchansky et al. (2017) demonstrate how well deep learning models work to increase
detection accuracy. There has also been an increasing focus on utilizing user
engagement metrics and social context, as evidenced by the research of Vosoughi et
al. (2018), which combines textual analysis with network-based features. Even with
these developments, important obstacles remain, such as the need for large,
high-quality labeled datasets and the ability of models to adapt to changing fake
news strategies.
3 Proposed Model
In our proposed system, we build on the existing literature to classify news
articles from various domains as authentic or fake. The system is built to be
flexible, using online learning techniques to regularly add new data to the model
and keep it up to date against the ever-changing strategies of fake news.
3.1 Methodologies Used
We selected two datasets for our tests that include a combination of real and
fake articles. Both datasets are freely available for download on Kaggle. The
corpus is preprocessed before being used as input to train the models.
Variables that are not needed, such as the authors, posting date, URL, and
category, are filtered out. Once the pertinent features have been selected in
the data cleaning and exploration stage, the next step is to extract the
linguistic features. Some textual features must be converted into numerical
form before being used as input for the training models.
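A minimal sketch of one possible text-cleaning step is given below. The specific normalization rules (lowercasing, removing URLs, HTML tags, and punctuation) and the function name clean_text are illustrative assumptions; the paper does not list the exact rules that were applied.

```python
import re
import string


def clean_text(text: str) -> str:
    """Normalize one article body (hypothetical helper, not from the paper)."""
    text = text.lower()                                   # case-fold
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)                  # drop HTML tags
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


example = clean_text("BREAKING: Visit https://example.com NOW!!!")
print(example)
```

Applying the same function to every article before feature extraction keeps the vocabulary consistent between the training and test sets.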
3.2 Data-set Description
The datasets used in this study are open-source and freely accessible online.
Real and fake news articles from various domains are included in the data.
Many of the political claims in the articles can be manually verified using
fact-checking websites like [Link] and [Link]. The news article
datasets were acquired via Kaggle [6]. Each article is labeled "true" or
"fake" and consists of four fields: title, content, subject, and date. The
fake-news set has shape (23481, 4), that is, 23481 rows and 4 columns. The
true-news set has shape (21417, 4), that is, 21417 rows and 4 columns.
Several preprocessing steps are needed to train the model successfully. Because
raw data is not suitable for training machine learning models, preprocessing
transforms it into a format the models can use. This transformation requires
several processes and techniques, including feature extraction, data splitting,
feature scaling, handling outliers, data cleaning, and handling missing values.
These steps remove errors and inconsistencies that would otherwise degrade the
ML models, and they make the data suitable for analysis so that the results
are reliable. First, we concatenate the two sets. Next, we drop the columns
that are not needed to make the data leaner. This procedure is often used to
make the text data more consistent and manageable.
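The concatenation and column-dropping steps might look like the following pandas sketch. The tiny frames, the column names (title, text, subject, date), the 0/1 label coding, and the choice of which columns to drop are all assumptions made for illustration; the real frames have shapes (23481, 4) and (21417, 4).

```python
import pandas as pd

# Tiny stand-ins for the fake and true article sets (hypothetical rows).
fake = pd.DataFrame({
    "title": ["Shock claim", "Miracle cure"],
    "text": ["aliens run the senate", "one fruit cures everything"],
    "subject": ["News", "News"],
    "date": ["2017-01-01", "2017-01-02"],
})
true = pd.DataFrame({
    "title": ["Budget vote", "Trade talks"],
    "text": ["the senate passed the budget", "ministers met to discuss trade"],
    "subject": ["politicsNews", "worldnews"],
    "date": ["2017-01-03", "2017-01-04"],
})

# Assumed label coding: 0 = fake, 1 = true.
fake["label"] = 0
true["label"] = 1

# Stack the two sets, drop the columns assumed not to be needed, and shuffle
# so that the two classes are interleaved.
data = pd.concat([fake, true], ignore_index=True)
data = data.drop(columns=["title", "subject", "date"])
data = data.sample(frac=1, random_state=42).reset_index(drop=True)

print(data.shape)
```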
Fig.1. Frequent words in fake news
Fig.2. Frequent words in true news
Figure 1 and Figure 2 show the most frequent words in fake and true news,
respectively. To produce them, the text is tokenized into individual words.
Next, the frequency of each term in the dataset is counted, and the frequencies
are split according to the article's label, that is, genuine or fake.
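The per-label term counting can be sketched with the standard library as follows. Whitespace tokenization, the function name term_frequencies, and the three example headlines are assumptions for illustration only.

```python
from collections import Counter


def term_frequencies(docs, labels):
    """Count token frequencies separately for each label (hypothetical helper)."""
    counts = {}
    for doc, label in zip(docs, labels):
        # Simple whitespace tokenization after lowercasing.
        counts.setdefault(label, Counter()).update(doc.lower().split())
    return counts


docs = [
    "shocking secret cure",
    "senate passes budget",
    "shocking claim goes viral",
]
labels = ["fake", "true", "fake"]

freqs = term_frequencies(docs, labels)
print(freqs["fake"].most_common(1))
```

The resulting counters can be passed to a word-cloud or bar-chart routine to produce plots of the kind shown in Figures 1 and 2.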
3.3 Schematic Layout
Fig.3. Model Diagram
Fig. 3 shows the structure of the fake news identification system. It involves
preparing the data, splitting it into training and test sets, extracting features
using TF-IDF, training a decision tree classifier, and assessing performance
with metrics and a confusion matrix. The diagram highlights the crucial elements
involved in every step of the process and shows how these phases flow.
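The flow in Fig. 3 can be sketched with scikit-learn as shown below. This is an illustration under stated assumptions rather than the authors' exact code: the eight-sentence corpus is invented, the label coding (0 = fake, 1 = true) is assumed, and parameters such as test_size and random_state are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy corpus standing in for the cleaned articles (hypothetical examples).
texts = [
    "aliens secretly control the senate",
    "miracle fruit cures every disease overnight",
    "celebrity clone spotted in secret lab",
    "moon landing filmed in hidden studio",
    "senate passes annual budget bill",
    "ministers meet to discuss trade agreement",
    "central bank holds interest rates steady",
    "city council approves new transit plan",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # assumed coding: 0 = fake, 1 = true

# 1. Split the data, keeping the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# 2. Fit TF-IDF on the training text only, then transform both parts.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 3. Train the decision tree classifier.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train_vec, y_train)

# 4. Evaluate with accuracy and a confusion matrix.
y_pred = clf.predict(X_test_vec)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(acc)
print(cm)
```

Fitting the vectorizer on the training portion only avoids leaking vocabulary statistics from the test set into the model.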
3.4 System Requirements
The device requirements include a laptop running an operating system such as
Windows, a minimum of 8GB of RAM (ideally 16GB or more), and a modern GPU
with CUDA capability, such as the NVIDIA GTX 1650. Version 3.7 or higher of
the Python programming language is required, along with all necessary
dependencies and libraries, including Pandas, scikit-learn, Matplotlib,
NumPy, and WordCloud. Furthermore, sufficient storage space is needed to
hold the runtime files, model, and dataset.
3.5 Proposed Algorithm
In addition to our suggested methodology, we employed the following learning
algorithms to assess how well fake news detection classifiers performed. A
decision tree (DT) is an ML algorithm that generates predictions using a
tree-shaped model of decisions and their possible consequences. The algorithm
operates by recursively dividing the data into subsets according to the most
important attribute at each node of the tree. The main idea underlying DT is
that it learns several decision rules from the complete dataset to build a model
that forecasts the value of the target variable. The most crucial step in the
DT learning process is selecting the appropriate attribute. Different tree
algorithms use different measures to address this issue, such as the gain ratio
in the C4.5 algorithm and information gain in the ID3 algorithm. The final
stage is prediction. To show that the Decision Tree Classifier best serves
our purpose, we have compared its accuracy to four other traditional Machine
Learning Models: Naïve Bayes, Support Vector Machine, Random Forest, and
Logistic Regression models.
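To make the attribute-selection measures concrete, the sketch below computes entropy, information gain (as in ID3), and gain ratio (as in C4.5) for a categorical feature. The function names and the four-sample example are illustrative assumptions. Note that scikit-learn's DecisionTreeClassifier uses the Gini criterion by default and offers "entropy" as an alternative, so this is not necessarily the measure used in the experiments.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(feature_values, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(feature_values)
    if split_info == 0:
        return 0.0  # the feature takes a single value and cannot split the data
    return information_gain(feature_values, labels) / split_info


labels = ["fake", "fake", "true", "true"]
informative = [1, 1, 0, 0]     # hypothetical feature that separates the classes
uninformative = [1, 0, 1, 0]   # hypothetical feature unrelated to the label

print(information_gain(informative, labels), gain_ratio(informative, labels))
print(information_gain(uninformative, labels))
```

A tree builder would evaluate such a score for every candidate attribute at a node and split on the one with the highest value.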
4 Experimentation and Model Evaluation
Model evaluation assesses the accuracy and overall performance of the fake news
detection model and is part of the model comparison process. In addition to
techniques like cross-validation and confusion matrices, this evaluation typically
includes metrics like precision, recall, and F1-score. The evaluation results
indicate how well the model can identify and categorize real and fake news.
4.1 Depiction Results
Fig.4. Article per subject
Figure 4 shows the number of articles in each subject category of our dataset:
Government-News, Middle-east, News, US_News, left-news, politics,
politics-News and world News.
Fig.5. News Percentage
Figure 5 shows the percentage of true news articles and fake news articles in
our dataset.
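The class percentages follow directly from the row counts reported in Section 3.2 (23481 fake and 21417 true articles). The short calculation below reproduces them; only the variable names are introduced here.

```python
# Row counts as reported in Section 3.2.
n_fake = 23481
n_true = 21417

total = n_fake + n_true
pct_fake = round(100 * n_fake / total, 1)
pct_true = round(100 * n_true / total, 1)

print(total, pct_true, pct_fake)  # 44898 47.7 52.3
```

These values round to the 48% and 52% split quoted in the discussion of the results.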
Table 1. Performance Metrics of the Classifier
Fig. 6. Confusion matrix
4.2 Validation
Performance Metrics: We used a variety of criteria to assess the performance
of the approaches. Most of them are based on the confusion matrix (cm), a
table that summarizes the performance of a model on the test set.
Accuracy: The proportion of observations, whether real or fake, that are
predicted correctly. It is the most widely used metric for validating ML
models on classification problems.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$
Recall: Also called the true-positive rate, recall is the proportion of
samples that the model correctly identifies as belonging to the class of
interest (the positive class) out of all samples that actually belong to
that class.
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
F1-Score: The harmonic mean of precision and recall. It can be used as the
objective function to find the best trade-off between precision and recall
when optimizing a model.
$$\text{F1-Score} = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \tag{3}$$
Precision: The proportion of samples predicted as positive that are truly
positive. Precision is especially useful when the cost of false positives
is high.
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
where TP: True Positive, TN: True Negative, FP: False Positive,
FN: False Negative
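The four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. F1 is computed in its standard form, 2TP / (2TP + FP + FN), which equals TP / (TP + (FP + FN)/2) and is the harmonic mean of precision and recall. The function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall, written in terms of raw counts.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not the paper's results.
metrics = classification_metrics(tp=90, tn=85, fp=10, fn=15)
print(metrics)
```

The guards against zero denominators matter in practice when a model never predicts the positive class on a small test set.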
4.3 Discussions on Contributions
The study aimed to create a machine learning model that could reliably
distinguish authentic news stories from fake ones. We used the Kaggle
repository, which offered datasets with real and fake articles for our project.
48% of the samples were real articles, and the remaining 52% were fake.
Although we trained five models, the decision tree classifier produced
outstanding results after extensive training, attaining an accuracy of 99.62%,
precision of 99.7%, recall of 99.52%, and an F1-score of 99.61%.
Data Collection and Preprocessing: We began the investigation by examin-
ing the datasets available for false news recognition on the Kaggle platform.
After giving it some thought, we selected a trustworthy site that provided a
large selection of authentic and fraudulent news articles. The datasets were
cleaned up to ensure homogeneity and remove any biases that would jeop-
ardize the model's efficacy.
Feature Extraction and Selection: A range of feature extraction methods
were employed to extract significant information from the unprocessed
textual input. The text was transformed into numerical representations such
as bag-of-words or TF-IDF (Term Frequency-Inverse Document Frequency)
vectors. We also extracted supplementary context and emotional cues from the
articles using methods like sentiment analysis and n-grams. By focusing on
the most important features, we reduced computational complexity and
improved the model's performance.
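The following sketch shows how n-grams and textbook TF-IDF weights can be derived from raw text. It uses the plain definition tf × log(N / df); library implementations may differ (for example, scikit-learn's TfidfVectorizer applies a smoothed idf and L2 normalization by default). The function names and the two example documents are assumptions for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """Compute textbook TF-IDF weights, one dict of term -> weight per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


docs = ["fake news spreads fast", "real news informs"]
vectors = tfidf(docs)
bigrams = ngrams(docs[0].split(), 2)

print(vectors[0])
print(bigrams)
```

A term that occurs in every document, such as "news" here, receives a weight of zero, which is how TF-IDF down-weights words that do not help separate the classes.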
Model Training and Evaluation: Our principal objective was to build a
decision tree classifier using the pre-processed dataset. We divided the data
into two sets, training and testing, ensuring that each set contained a
realistic distribution of authentic and fake articles. For the decision tree
method, we tuned the hyperparameters to maximize performance metrics such as
F1-score, accuracy, precision, and recall.
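One way to keep the class distribution the same in both sets is a stratified split, sketched below with the standard library. The function name, the 20% test fraction, and the seed are illustrative assumptions; the paper does not state the split ratio that was used.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) preserving the label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["fake"] * 10 + ["real"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```

Hyperparameter tuning would then be carried out on the training indices only, so that the test set remains an unbiased estimate of performance.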
The results showed an accuracy of 99.62%. This metric demonstrates the
model's ability to determine whether an article is authentic. Moreover, the
precision of 99.7% indicates a low rate of false positives, meaning that the
algorithm rarely labels real articles as fake.
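As a quick consistency check, the reported F1-score can be recomputed from the reported precision and recall using the harmonic-mean definition. Only the variable names are introduced here; the input values are the ones stated in this section.

```python
# Reported values from this section, expressed as fractions.
precision = 0.9970
recall = 0.9952

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
f1_percent = round(100 * f1, 2)

print(f1_percent)  # 99.61
```

The recomputed value agrees with the reported F1-score of 99.61%.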
5 Conclusion and Future Scope
Identifying fake news is an important and difficult undertaking in the current
digital world. The spread of misinformation has grown alongside the rapid expansion
of social media and online platforms. This significantly affects social media,
digital marketing efforts, and various anti-social behaviors. No single solution can
stop fake news from spreading; detecting it is a continuous and difficult problem,
and the methods and techniques for identifying and combating it are still evolving.
Detecting fake news requires a multifaceted strategy that combines technical skills,
media literacy, and critical thinking. Many methods, techniques, and algorithms,
including machine learning (ML) and natural language processing (NLP), have been
developed to address this problem. Technology and education are also important in
reducing the spread of false information. Because misinformation is always changing,
identifying fake news remains difficult. New technology, collaboration between
companies, and ongoing research can mitigate these problems and protect information
integrity in the digital age. The decision tree algorithm shows promise for
identifying fake news: it is a popular machine learning model that is simple to use
and understand.
References
1. Albahr, Abdulaziz, and Marwan Albahar. "An empirical comparison of fake news detec-
tion using different machine learning algorithms." International Journal of Advanced Com-
puter Science and Applications 11.9 (2020).
2. Khan, A. I., Shahzad, F., & Ali, S. “Fake news detection: a deep learning approach using
CNN”. IEEE Access, doi: 10.1109/ACCESS.2019.2901590 (2019).
3. Thakur, P., Shah, R. R., & Rana, N. P. “A survey on automated fake news detection:
Trends and challenges.” Information Processing & Management, 57(2), 102026. doi:
10.1016/[Link].2019.102026 (2021).
4. Kumar, R., Singh, R. K., & Roy, P. P. “Fake news detection on social media: A review.
Artificial Intelligence Review”, 54(4), 2997-3030. doi: 10.1007/s10462-020-09981-4
(2021).
5. Allcott, Hunt, and Matthew Gentzkow. "Social media and fake news in the 2016 election."
Journal of economic perspectives 31.2 (2017).
6. Conroy, Nadia K., Victoria L. Rubin, and Yimin Chen. "Automatic deception detection:
Methods for finding fake news." Proceedings of the association for information science
and technology 52.1 (2015)
7. Shu, Kai, et al. "Fake news detection on social media: A data mining perspective." ACM
SIGKDD explorations newsletter 19.1 (2017)
8. Rashkin, Hannah, et al. "Truth of varying shades: Analyzing language in fake news and
political fact-checking." Proceedings of the 2017 conference on empirical methods in
natural language processing. 2017.
9. Jatakia, Er Tapan, et al. "Home Automation Control System." International Research Jour-
nal of Engineering and Technology, vol. 9, no. 10, 2022.
10. Al-Gburi, Mohamed Khudhair, and Laith Ali Abdul-Rahaim. "Secure Smart Home Auto-
mation and Monitoring System Using Internet of Things." Indonesian Journal of Electrical
Engineering and Computer Science, vol. 28, no. 1, 2022.
11. Abbas, Muhammad, et al. "Smart Android Based Home Automation System Using Inter-
net of Things (IoT)." Sustainability, vol. 14, no. 17, 2022, p. 10717.
12. Rehman, Sadiq Ur, et al. "Low-Cost Smart Home Automation System with Advanced
Features." Quaid-E-Awam University Research Journal of Engineering Science and Tech-
nology Nawabshah, vol. 20, no. 1, 2022, pp. 74-82.