NLP-Project

NLP Project (COMP 586 Spring 2025) - Time-based Fact-checking and Fake News Detection

ToDo

Classical Retriever: Sagar [DONE]
Time-based Re-ranker: Sagar [DONE]
LLM-based classification: Sagar
BERT-based classification: Param [DONE]

Steps to use retrieval + reranking

Install the packages in requirements.txt. You may need a compiler that supports C++14 to build chromadb.
Run ingest.py. You need to do this ONLY once - this will create the chromadb vector store.
Refer to full_flow.py to understand the usage of the retriever and reranker.
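
The retrieval step can be pictured with a small pure-Python stand-in. This is a sketch only: the real retriever queries the ChromaDB store built by ingest.py, and everything named below (Doc, retrieve, the bag-of-words similarity, and the choice to keep documents published on or before the claim date) is a hypothetical illustration, not the API of retriever.py.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    text: str
    published: date  # hypothetical metadata field


def _bow(text: str) -> Counter:
    """Lowercased bag-of-words; a toy replacement for real embeddings."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(claim: str, claim_date: date, docs: list, k: int = 3) -> list:
    """Return up to k (score, Doc) pairs for docs published on or before claim_date."""
    query = _bow(claim)
    candidates = [d for d in docs if d.published <= claim_date]  # timestamp filter (assumed direction)
    scored = [(_cosine(query, _bow(d.text)), d) for d in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


docs = [
    Doc("unemployment rate fell to four percent", date(2019, 5, 1)),
    Doc("unemployment rate rose sharply", date(2021, 3, 1)),
    Doc("city council approves new park", date(2018, 1, 15)),
]
hits = retrieve("the unemployment rate fell", date(2020, 1, 1), docs, k=2)
```

The 2021 document is dropped by the date filter before any similarity is computed, which is the point of filtering first: evidence that did not exist when the claim was made never reaches the reranker.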

Steps to use BERT classification

Install additional dependencies with pip install -r requirements.txt
Train the BERT model by running python train_bert.py
- You can adjust training parameters: python train_bert.py --epochs 4 --batch_size 32
- Training outputs (model, plots, reports) are saved to the outputs/ directory
To use the full pipeline (retrieval → reranking → classification), run python full_flow.py
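
The training flags shown above could be parsed along the following lines. This is a hedged sketch: only --epochs and --batch_size appear in this README, while the default values and the build_parser helper are assumptions made for illustration and are not taken from train_bert.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Fine-tune BERT for three-way fact checking")
    # Defaults are placeholders chosen for illustration only.
    parser.add_argument("--epochs", type=int, default=3, help="number of training epochs")
    parser.add_argument("--batch_size", type=int, default=16, help="examples per batch")
    return parser


# Equivalent to: python train_bert.py --epochs 4 --batch_size 32
args = build_parser().parse_args(["--epochs", "4", "--batch_size", "32"])
print(args.epochs, args.batch_size)
```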

Model Architecture

The project includes four main components:

Retriever: Uses ChromaDB to find semantically relevant documents with a timestamp filter
Re-ranker: Re-ranks documents based on semantic relevance and temporal proximity
BERT Classifier: Fine-tuned BERT model for three-way classification of statements (true, false, unknown)
LLM Classifier: LLM-based classification of statements (still in progress, see ToDo)
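
One way to combine semantic relevance with temporal proximity is a weighted sum of a similarity score and an exponential time-decay term. The sketch below is an assumption about how such a re-ranker might look; the weight alpha, the half_life_days parameter, and the function names are hypothetical and not taken from reranker.py.

```python
from datetime import date


def temporal_proximity(doc_date: date, claim_date: date, half_life_days: float = 365.0) -> float:
    """1.0 when the dates match, halving for every half_life_days of distance."""
    gap = abs((claim_date - doc_date).days)
    return 0.5 ** (gap / half_life_days)


def rerank(candidates, claim_date, alpha=0.7):
    """candidates: iterable of (doc_id, similarity in [0, 1], doc_date).

    Returns doc ids ordered by alpha * similarity + (1 - alpha) * proximity.
    """
    scored = [
        (alpha * sim + (1 - alpha) * temporal_proximity(d, claim_date), doc_id)
        for doc_id, sim, d in candidates
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


candidates = [
    ("old_but_similar", 0.80, date(2010, 1, 1)),
    ("recent", 0.75, date(2019, 12, 1)),
]
order = rerank(candidates, claim_date=date(2020, 1, 1))
```

With these toy numbers the slightly less similar but much more recent document is ranked first, since the ten-year-old document earns almost no proximity credit.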

Performance

The BERT classifier is trained on the labeled dataset and provides:

Three-way classification of statements as true, false, or unknown
Visualization of training progress and confusion matrices
Integration with the retrieval pipeline for evidence-based fact checking
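
For reference, a three-way confusion matrix can be computed with nothing but the standard library. The snippet is a minimal sketch; the label order and the example predictions are made up for illustration and are not results from this project.

```python
LABELS = ["true", "false", "unknown"]  # assumed order


def confusion_matrix(gold, predicted, labels=LABELS):
    """Rows are gold labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, predicted):
        matrix[index[g]][index[p]] += 1
    return matrix


# Toy data, not project output.
gold = ["true", "false", "unknown", "true", "false"]
pred = ["true", "unknown", "unknown", "false", "false"]
matrix = confusion_matrix(gold, pred)

# Accuracy is the sum of the diagonal divided by the number of examples.
accuracy = sum(matrix[i][i] for i in range(len(LABELS))) / len(gold)
```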

Three-way Classification

The BERT model classifies statements into three categories:

True: Statements that are verified as factually accurate
False: Statements that are verified as factually inaccurate
Unknown: Statements where the factual accuracy cannot be determined with confidence

The third label makes the scheme more nuanced than a binary true/false split: it acknowledges that not all statements can be definitively categorized as true or false.
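
Turning a classifier's three output scores into one of these labels amounts to a softmax followed by an argmax. The sketch below assumes a particular index-to-label mapping (ID2LABEL), which is hypothetical; the mapping used by the trained model may differ.

```python
import math

ID2LABEL = {0: "true", 1: "false", 2: "unknown"}  # assumed order


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for a single statement.
label, confidence = predict_label([0.2, 2.1, 0.4])
```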

Fact-Checking Dataset EDA Suite

This repository contains a suite of tools for Exploratory Data Analysis (EDA) of fact-checking datasets. The scripts analyze datasets from the NLP Project focused on fake news detection.

Datasets

The analysis is performed on the following datasets:

datasets/train_set.json
datasets/validate_set.json
datasets/test_set.json
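
A first EDA pass over one of these files might simply count how many examples carry each label. The sketch assumes each file is a JSON array of objects with a label field; that schema and the helper names (load_records, label_distribution) are assumptions, so adjust the key to match the actual data.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Read a JSON file that is assumed to contain a list of objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def label_distribution(records, label_key="label"):
    """Count records per label; records without the key are grouped under None."""
    return Counter(record.get(label_key) for record in records)


# In-memory toy records standing in for the output of load_records on train_set.json.
sample = [
    {"statement": "A", "label": "true"},
    {"statement": "B", "label": "false"},
    {"statement": "C", "label": "true"},
    {"statement": "D"},
]
counts = label_distribution(sample)
```

Grouping unlabeled records under None rather than skipping them makes missing annotations visible in the counts.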

Setup and Installation

Install the required packages:

pip install -r requirements.txt

Make sure all scripts have execute permissions:

chmod +x *.py

