Project Report
on
RAG-Powered Local Document Chatbot
Submitted for Internship/Industrial Training requirements
of
[Link] and [Link] (Dual Degree)
in
Mathematics And Data Science
Submitted by
Nishant Bharati
Scholar No. 2340401103
Project work completed
as part of the
IBM SkillBuild AI Agent Architect Program
Department of Mathematics, Bioinformatics and Computer Applications
Maulana Azad National Institute of Technology Bhopal-462003 (India)
TABLE OF CONTENTS
ABSTRACT
CONCEPT DEFINITIONS (Glossary)
SECTION 1: INTRODUCTION
➢ 1.1. The Problem: Information Overload & Privacy
➢ 1.2. The Solution: Local-First RAG
➢ 1.3. Project Objectives
SECTION 2: SYSTEM ARCHITECTURE & TECHNOLOGY STACK
➢ 2.1. System Architecture
➢ 2.2. Technology Stack
SECTION 3: IMPLEMENTATION
➢ 3.1. Project Structure
➢ 3.2. Core Functionality 1: Persistent Workspaces (SQLite)
➢ 3.3. Core Functionality 2: The RAG Indexing Pipeline
➢ 3.4. Core Functionality 3: The "Live" Persona Chat Logic
SECTION 4: RESULTS & CONCLUSION
➢ 4.1. Results
➢ 4.2. Conclusion
➢ 4.3. Future Work
ABSTRACT
The exponential growth of digital information, siloed in diverse formats
like PDFs, DOCX files, and web pages, presents a significant challenge
for efficient information retrieval. Traditional search methods are often
insufficient, and Large Language Models (LLMs) are constrained by
limited context windows.
This project implements a "Chat with your Documents" application to
solve this problem. It is a comprehensive, multimodal conversational AI
built on a Retrieval-Augmented Generation (RAG) architecture. The
system ingests user-provided documents, processes them into a
searchable vector index, and stores this index in a persistent,
project-based workspace managed by an SQLite database.
The application is built using Python, with Streamlit for the web interface
and LangChain for orchestrating the RAG pipeline. To ensure complete
privacy, cost-free operation, and offline capability, the entire stack runs
locally. Text embedding is handled by a Hugging Face model
(all-MiniLM-L6-v2), and the core conversational reasoning is powered by
a locally hosted LLM (e.g., Llama 3) served via Ollama.
Key features include a project-based system for saving and loading
knowledge bases, "live" persona selection for a personalized user
experience, multimodal input (text, images, and audio), and robust
export options (PDF, DOCX, PPTX). The result is a scalable,
responsive, and fully private tool that transforms static documents into
an interactive, conversational knowledge base.
CONCEPT DEFINITIONS (Glossary)
AI: Artificial Intelligence.
LLM: Large Language Model. The "brain" of the chatbot (e.g., Llama 3, Mistral).
Ollama: A tool for running, serving, and managing open-source LLMs (like Llama 3)
locally on a user's machine.
RAG: Retrieval-Augmented Generation. The process of retrieving relevant data
before asking an LLM to generate an answer.
Embeddings: Numerical representations (vectors) of text, images, or audio, allowing
them to be searched based on meaning.
Vector Store: A specialized database (e.g., FAISS) designed to store and search
embeddings efficiently.
LangChain: A Python framework used to "chain" together components (LLM, vector
store, prompts) to build a complete application.
FAISS: Facebook AI Similarity Search. An open-source library for fast similarity search
over vectors, used here as the local vector store.
Hugging Face: A platform providing open-source ML models. We use it for the local
embedding model.
Streamlit: A Python library used to build and deploy the web-based user interface
(UI) for the app.
SQLite: A lightweight, serverless, disk-based database used to store project
metadata.
Persona: A "system instruction" that defines the AI's tone, style, and goal (e.g.,
"Tutor," "Exam Ready").
Multimodal: The ability to process and understand multiple types of information,
such as text, images, and audio.
SECTION 1: INTRODUCTION
1.1. The Problem: Information Overload & Privacy
In the modern digital landscape, individuals and organizations face a constant deluge of
information. This data is often unstructured and locked in various formats. Manually
extracting specific information is tedious and prone to error.
Furthermore, while cloud-based Large Language Models (LLMs) are powerful, they present
two critical issues:
● Context & Cost: They have limited "context windows" and cannot read thousands of
pages at once. Sending large documents via API is slow and can be expensive.
● Privacy & Security: Using a commercial API requires sending private, sensitive, or
proprietary data to a third-party server, which is not acceptable for many academic,
corporate, or personal use cases.
1.2. The Solution: Local-First RAG
This project solves these problems using a 100% local-first Retrieval-Augmented Generation
(RAG) architecture. The application runs entirely on the user's machine, ensuring that no
data ever leaves their computer.
Indexing: The application reads all documents, splits them into small chunks, and converts
each chunk into a numerical "embedding" using a local Hugging Face model. These
embeddings are stored in a local vector database (FAISS).
Retrieval: When the user asks a question, the app converts the question into an embedding
and uses it to search the local database, retrieving only the top-K most relevant chunks of
text.
Generation: The app feeds this small, relevant context to a local LLM (e.g., Llama 3) served
by Ollama.
This local-first method allows the app to "chat" with thousands of pages of documents at no
API cost, with no network round-trips, and without any data leaving the machine, even while
offline. Response time depends only on the user's local hardware.
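The retrieval and prompt-assembly steps above can be sketched in a few lines. This is a minimal illustration, not the application's code: the bag-of-words `embed` function is a toy stand-in for the Hugging Face embedding model, the sorted list stands in for the FAISS search, and the function names (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (the top-K step)."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the small, relevant context that would be sent to the LLM."""
    context = "\n".join(context_chunks)
    return f"Context: {context}\nQuestion: {question}"


chunks = [
    "FAISS stores embeddings for similarity search",
    "Streamlit renders the web interface",
    "Ollama serves the local LLM",
]
question = "which tool stores embeddings"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

Only the retrieved chunk reaches the prompt, which is how the approach stays within the LLM's context window regardless of how many pages were indexed.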
1.3. Project Objectives
The primary objective is to build a robust, private, and offline-capable conversational AI that
serves as a personal knowledge assistant.
The key goals are:
● Implement a 100% local RAG pipeline, using Ollama for chat and Hugging Face for
embeddings, to ensure user privacy and eliminate all API costs and rate limits.
● Develop a persistent, project-based system using SQLite, allowing users to create, save,
and reload their document collections.
● Incorporate "live" persona switching to allow users to dynamically change the AI's output
style (e.g., from "Tutor" to "Exam Ready").
● Support multimodal RAG by integrating local vision models (e.g., llava) via Ollama.
● Provide robust export options (PDF, DOCX, PPTX) to integrate the AI's answers into a
user's workflow.
SECTION 2: SYSTEM ARCHITECTURE & TECHNOLOGY STACK
2.1. System Architecture
2.2. Technology Stack
SECTION 3: IMPLEMENTATION
3.1. Project Structure
The application is a self-contained system. All data and models (except for the initial
Ollama setup) are stored within the project directory, ensuring portability and privacy.
/your-project-folder/
│
├── [Link]                  # The main Streamlit application
├── [Link]                  # All Python dependencies
├── [Link]                  # SQLite database for project management
│
└── projects/               # Directory for all saved project data
    │
    ├── my_project_1/
    │   ├── faiss_index/
    │   │   ├── [Link]      # The RAG vector index
    │   │   └── [Link]      # The FAISS index metadata
    │   │
    │   └── an_image.png    # A copy of an uploaded image
    │
    └── my_project_2/
        └── ...
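As a small illustration of this layout, the per-project paths could be derived from the project name as follows. The slug rule (lowercase, spaces replaced by underscores) is an assumption inferred from the example path given for the "AI Course Notes" project in Section 3.2, and the helper names `project_slug` and `project_paths` are hypothetical.

```python
from pathlib import Path

# Root folder for all saved projects, as in the directory tree above.
PROJECTS_DIR = Path("projects")


def project_slug(name: str) -> str:
    """Assumed rule: lowercase the name and join its words with underscores."""
    return "_".join(name.lower().split())


def project_paths(name: str) -> dict[str, Path]:
    """Return the project folder and its FAISS index folder."""
    root = PROJECTS_DIR / project_slug(name)
    return {"root": root, "index": root / "faiss_index"}


paths = project_paths("AI Course Notes")
print(paths["index"].as_posix())
```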
3.2. Core Functionality 1: Persistent Workspaces (SQLite)
Persistence is achieved via a [Link] SQLite file. This file contains a single table,
projects, which acts as the master record.
projects table schema:
name (TEXT): The unique project name (e.g., "AI Course Notes").
index_path (TEXT): The file path to its faiss_index folder (e.g.,
projects/ai_course_notes/faiss_index).
image_paths (TEXT): A JSON string list of paths to any associated images.
persona (TEXT): The string name of the persona saved with the project (e.g.,
"Tutor").
The "Load Project" button queries this database, loads the FAISS index from the
index_path into memory (st.session_state.retriever), and sets the app's state.
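A minimal sketch of this table using Python's built-in sqlite3 module is shown below. The column names and types follow the schema above; the PRIMARY KEY constraint (to enforce unique names), the in-memory database, and the function names `init_db`, `save_project`, and `load_project` are assumptions made for illustration.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the projects table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS projects (
               name        TEXT PRIMARY KEY,
               index_path  TEXT NOT NULL,
               image_paths TEXT NOT NULL DEFAULT '[]',
               persona     TEXT
           )"""
    )


def save_project(conn, name, index_path, image_paths, persona) -> None:
    """Insert or overwrite a project row; image paths are stored as a JSON list."""
    conn.execute(
        "INSERT OR REPLACE INTO projects (name, index_path, image_paths, persona) "
        "VALUES (?, ?, ?, ?)",
        (name, index_path, json.dumps(image_paths), persona),
    )
    conn.commit()


def load_project(conn, name):
    """Return the project's metadata as a dict, or None if it is unknown."""
    row = conn.execute(
        "SELECT index_path, image_paths, persona FROM projects WHERE name = ?",
        (name,),
    ).fetchone()
    if row is None:
        return None
    index_path, image_paths, persona = row
    return {
        "index_path": index_path,
        "image_paths": json.loads(image_paths),
        "persona": persona,
    }


conn = sqlite3.connect(":memory:")  # the app would open its on-disk file instead
init_db(conn)
save_project(
    conn,
    "AI Course Notes",
    "projects/ai_course_notes/faiss_index",
    ["projects/ai_course_notes/an_image.png"],
    "Tutor",
)
project = load_project(conn, "AI Course Notes")
print(project)
```

In the application, the returned index_path would then be handed to the FAISS loader and the persona restored into the session state.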
3.3. Core Functionality 2: The RAG Indexing Pipeline
The build_vector_index function is the core of the RAG pipeline.
Text Extraction: It first loops through all uploaded files and the URL, using
pdfplumber, docx, and BeautifulSoup to extract raw text.
Splitting: It uses LangChain's RecursiveCharacterTextSplitter with chunk_size=1000
and chunk_overlap=200.
Embedding: It initializes the local
HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") model. This runs on the
user's CPU/GPU, ensuring data never leaves the machine.
Indexing: It calls FAISS.from_documents(all_chunks, embeddings) to create the
in-memory vector store.
Saving: Finally, it saves the index to disk using vector_index.save_local(index_path).
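To make the effect of chunk_size=1000 and chunk_overlap=200 concrete, the sketch below splits text with a plain fixed-size sliding window. It is a simplified stand-in, not LangChain's algorithm: RecursiveCharacterTextSplitter first tries to break on separators such as paragraphs, lines, and spaces, whereas this version cuts at fixed character offsets. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> list[str]:
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the preceding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 800 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# A 2,500-character dummy document.
text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.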
3.4. Core Functionality 3: The "Live" Persona & Chat Logic
To solve the "stale persona" problem (where changing the persona in the dropdown
had no effect), the application does not store the full, pre-built LangChain RAG
chain. Instead, it stores the building blocks: st.session_state.llm and
st.session_state.retriever.
The RAG chain is rebuilt on-the-fly with every single query. This allows the persona
to be "live":
Python
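# Imports needed by this excerpt (LangChain chain helpers and prompt class)
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

# This logic runs on EVERY user question:
if prompt_parts:
    ...
    # 1. Get the CURRENTLY selected persona from the sidebar
    current_persona_prompt = personae[st.session_state.system_instruction]
    # 2. Build a new prompt template with this "live" persona.
    #    Only the first literal is an f-string, so {context} and {input}
    #    remain template placeholders for LangChain to fill in.
    prompt_template = ChatPromptTemplate.from_template(
        f"{current_persona_prompt}\n\n"
        "Context: {context}\nQuestion: {input}"
    )
    # 3. Build a new, fresh RAG chain from the stored building blocks
    document_chain = create_stuff_documents_chain(st.session_state.llm, prompt_template)
    retrieval_chain = create_retrieval_chain(st.session_state.retriever, document_chain)
    # 4. Run the chain and stream the response
    response_stream = retrieval_chain.stream({"input": text_for_rag})
    ...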
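The reason rebuilding per query keeps the persona "live" can be shown without LangChain. In the sketch below, a plain dict stands in for st.session_state, str.format stands in for the prompt template, and the persona wording is hypothetical; only the persona names and the template layout come from the excerpt above.

```python
# Hypothetical persona instructions; the real texts live in the app's personae dict.
PERSONAE = {
    "Tutor": "Explain the answer in simple terms, step by step.",
    "Exam Ready": "Answer in short bullet points suitable for revision.",
}

# Stand-in for st.session_state: holds the sidebar's current selection.
session_state = {"system_instruction": "Tutor"}


def build_prompt_template(selected_persona: str) -> str:
    """Build the template fresh, so it always reflects the given persona."""
    persona_prompt = PERSONAE[selected_persona]
    return f"{persona_prompt}\n\n" "Context: {context}\nQuestion: {input}"


def render_prompt(question: str, context: str) -> str:
    """Read the persona at call time, then fill in context and question."""
    template = build_prompt_template(session_state["system_instruction"])
    return template.format(context=context, input=question)


context = "RAG retrieves relevant context before generation."
first = render_prompt("What is RAG?", context)

# The user switches persona in the sidebar; the very next question picks it up.
session_state["system_instruction"] = "Exam Ready"
second = render_prompt("What is RAG?", context)
print(second)
```

Had the template been built once and cached, the second call would still carry the "Tutor" instruction, which is the stale-persona behavior described above.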
SECTION 4: RESULTS & CONCLUSION
4.1. Results
The project was evaluated on its qualitative performance, scalability, and privacy.
Privacy & Cost (Local-First): The primary goal was achieved. The app runs 100%
offline, with zero API calls for embeddings or chat. A single user therefore incurs
no API costs or rate limits, usage is bounded only by the local hardware, and
document data never leaves the machine.
Accuracy (RAG Tuning): Initial tests with large chunks (chunk_size=4000) yielded
short answers. By tuning the parameters to chunk_size=1000 and retrieving more
chunks (k=8), the "Detailed Explainer" persona successfully generated long,
comprehensive, and accurate answers, as intended.
Flexibility (Live Personas): The "on-the-fly" chain construction was a success. The
user can switch from "Exam Ready" (getting short, bulleted answers) to "Tutor"
(getting long, simple explanations) for the very next question, all while using the
same underlying data.
Usability (UI/UX): The WhatsApp-style dark mode, streaming responses, "Read
Aloud," and audio inputs provide a modern, professional user experience. The
persistent project system and export features successfully transition the app from a
simple demo to a genuine productivity tool.
4.2. Conclusion
This project successfully demonstrates the power and feasibility of a 100% local-first
RAG architecture. By combining the open-source power of Ollama, LangChain,
Hugging Face, and FAISS, we have created a scalable, persistent, and multimodal
conversational AI that runs entirely on a personal computer.
The application overcomes the core limitations of cloud-based LLMs—namely cost,
rate-limiting, and privacy. It provides a user-friendly interface for anyone to build and
chat with their own specialized, private knowledge base.
4.3. Future Work
While the project meets its stated objectives, the following steps could enhance it
further:
Advanced Multimodal RAG: Currently, images are stored and passed to the LLM.
A more advanced system would use a vision embedding model (like CLIP) to make
the images themselves searchable, allowing a user to ask, "Show me all the
diagrams in my documents."
RAG-on-Audio: The current audio input is transcribed. A future version could
"index" the audio itself, allowing a user to search through hours of lecture recordings.
Agentic Behavior: The system could be expanded using LangGraph to perform
multi-step tasks, such as reading a document, searching the web for new
information, and then combining both to write a summary.