Hybrid Rag Chatbot

The document outlines the development of a production-grade, offline Retrieval-Augmented Generation (RAG) chatbot designed for Finance and Hospitality sectors, focusing on cost-efficiency, privacy, and usability. Key features include local model operation, hybrid retrieval methods, and thread-based conversational memory, all aimed at providing accurate, context-grounded answers while maintaining low latency. The architecture supports various document formats and includes robust ingestion, re-ranking, and summarization capabilities to enhance user experience and operational efficiency.

Uploaded by

Suraj Patra

Hybrid Agentic RAG — Finance & Hospitality

Project Documentation (production-ready, copy-paste for DOCX)

Executive Summary

This project implements a production-grade, offline Retrieval-Augmented
Generation (RAG) chatbot tailored to Finance and Hospitality documents
(policies, SOPs, pricing, audits, customer-service guidelines). It is optimized
for cost-efficiency, privacy, and real-world usability: the solution runs locally
using quantized small models, persistent vector indexes, hybrid
semantic+keyword retrieval, cross-encoder reranking, streaming generation,
and thread-based conversational memory with automatic summarization.
The design prioritizes accuracy, low latency, explainability, and easy
deployment on modest hardware.

Problem Statement

Enterprises in finance and hospitality maintain vast, heterogeneous
document sets that staff must search repeatedly for procedural, compliance,
and policy answers. Existing cloud LLM solutions are expensive, raise privacy
concerns, and require internet connectivity; many organizations cannot
adopt them. The objective here is to provide a private, low-cost, offline-
capable conversational assistant that returns accurate, grounded answers
and cites sources while running on CPU-friendly infrastructure.

Objectives & Evaluation Goals

The system is built to achieve the following: (1) accurate, context-grounded
answers with source citations; (2) low cost via local quantized models and
efficient indexing; (3) fast perceived response through streaming; (4) robust
ingestion of multiple file formats with fallback parsers; (5) conversational
continuity using thread-based memory and automatic summarization; and
(6) demonstrable engineering maturity for hackathon judges or enterprise
pilots. Performance targets include sub-second retrieval, reranking under
200ms, and overall response times that are perceived as fast via streaming.

Key Features (at-a-glance)

• Offline-first generation using a local quantized model (Phi-4 Mini via
  Ollama).
• Robust ingestion pipeline: PDF/DOCX/TXT/CSV support with PyMuPDF
  primary and Docling fallback; optional OCR for scanned documents.
• Smart chunking: section-aware + token-aware splitting with metadata
  (document, page, chunk_id).
• Hybrid retrieval: BM25 keyword scoring combined with semantic
  retrieval (embeddings → FAISS).
• Cross-encoder reranker (optional) to improve top-k precision.
• Controlled agentic behavior via a lightweight Query Analyzer (no tool
  loops or external calls).
• Thread-based conversational memory stored in SQLite with
  configurable memory injection limits.
• Automatic summarization of older exchanges after N turns to keep
  context compact.
• Streaming Server-Sent Events (SSE) for token-by-token or chunked
  output with final JSON containing sources and timings.
• Production-grade FastAPI backend and Tailwind-styled Jinja2 frontend
  for polished UI.

High-level Architecture

User → Query Analyzer → Hybrid Retrieval (BM25 + FAISS) → (Optional)
Reranker → Context Assembler + Memory Injection → Local Model (Ollama /
phi4-mini) → Streaming Response → Session Storage (SQLite).

Each part is modular: ingestion produces chunk metadata and persisted
embeddings; retrieval returns candidate chunks; reranker orders them for
the model; the prompt builder enforces token budgets and strict grounding
instructions; the model runtime streams partial tokens and a final object that
includes citations and timing metrics.

Ingestion & Document Handling

The ingestion pipeline is designed for accuracy and robustness. We attempt
PyMuPDF parsing first (fast and reliable for well-formed PDFs); if it fails or
results in low-quality extraction, we use Docling as a fallback to better
capture layout and section headers. For scanned documents we offer an
optional Tesseract OCR step. Chunking uses a hybrid strategy: prefer section
boundaries (headings) to preserve semantic units, then apply token-based
splitting (configurable chunk size & overlap) to meet model and retrieval
efficiency requirements. Every chunk is annotated with document_name,
page_number, section_title and chunk_id for traceable citations.
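
The token-based splitting step can be sketched as follows. This is a minimal illustration, not the project's chunker: it assumes whitespace-separated words as a stand-in for model tokens, and the `Chunk` class, `split_section` function, default sizes, and `chunk_id` format are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    document_name: str
    page_number: int
    section_title: str
    chunk_id: str
    text: str


def split_section(text, document_name, page_number, section_title,
                  chunk_size=200, overlap=40):
    """Split one section into overlapping windows of whitespace tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()          # assumption: words approximate tokens
    step = chunk_size - overlap    # each window starts this far after the last
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunk_id = f"{document_name}:p{page_number}:c{len(chunks)}"
        chunks.append(Chunk(document_name, page_number, section_title,
                            chunk_id, " ".join(window)))
        if start + chunk_size >= len(tokens):
            break                  # the last window already reached the end
    return chunks
```

Splitting is applied per section, so a chunk never crosses a heading boundary, and the overlap keeps a sentence that straddles two windows retrievable from either one.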

Embeddings, Vector Store, and BM25

We compute vector embeddings with a CPU-friendly embedding model
(bge-small-en-v1.5 or similar), caching results on disk to avoid
recomputation. The vectors are stored in a FAISS index persisted to disk
(HNSW or Flat depending on corpus size). We also build a BM25 index
(rank_bm25) over the chunk texts. At query time, the system performs both
semantic search
(FAISS) and keyword retrieval (BM25), then combines normalized scores with
configurable weights to produce a robust candidate set that captures both
paraphrase and exact-phrase matches common in policy documents.
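
The score combination described above can be sketched in a few lines. This is an illustrative version under stated assumptions: scores arrive as plain dictionaries keyed by chunk id, higher means more relevant for both retrievers (a FAISS distance would need to be inverted first), normalization is min-max, and the function names and default weights are hypothetical.

```python
def min_max(scores):
    """Rescale a {chunk_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cid: 1.0 for cid in scores}  # all equal: treat as full score
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}


def hybrid_rank(bm25_scores, semantic_scores,
                w_bm25=0.4, w_semantic=0.6, top_k=5):
    """Fuse keyword and semantic scores with a weighted sum."""
    bm25_n = min_max(bm25_scores)
    sem_n = min_max(semantic_scores)
    fused = {
        cid: w_bm25 * bm25_n.get(cid, 0.0) + w_semantic * sem_n.get(cid, 0.0)
        for cid in set(bm25_n) | set(sem_n)  # a chunk may appear in one list only
    }
    # Sort by fused score (descending), then chunk id for a stable order.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

A chunk found by only one retriever contributes zero for the other, so it can still surface when its single score is strong.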

Re-ranking (Cross-Encoder)

To improve precision for the final context sent to the generator, the top-N
candidates from hybrid search are optionally re-ranked using a cross-encoder
model (e.g., ms-marco-MiniLM-L-6-v2). This cross-encoder evaluates query–
chunk pair relevance (slower but applied only to the small candidate set),
typically improving factual grounding and reducing hallucination. Reranking
is a configurable toggle to trade latency for precision.
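
A sketch of the "rerank only the top-N" pattern follows. To stay runnable without downloading a model, it takes the pair scorer as a parameter and uses a word-overlap function as a stand-in; in a real setup the scorer could wrap a cross-encoder such as the one named above. The names `rerank`, `overlap_scorer`, `top_n`, and `keep` are hypothetical.

```python
def rerank(query, candidates, score_pair, top_n=20, keep=5):
    """Re-order the first top_n candidates by a query-chunk relevance score.

    candidates: list of (chunk_id, text), already ordered by hybrid score.
    score_pair: callable (query, text) -> float, higher is more relevant.
    """
    head = candidates[:top_n]  # only the small candidate set is scored
    scored = [(score_pair(query, text), cid, text) for cid, text in head]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(cid, text, score) for score, cid, text in scored[:keep]]


def overlap_scorer(query, text):
    """Stand-in scorer (NOT a cross-encoder): share of query words in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

Because the scorer is injected, the same function serves the reranker toggle: when reranking is disabled, the hybrid order can simply be passed through unchanged.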

Query Analyzer and Agent Design (Controlled Agentic Behavior)

Instead of a heavy agent framework, the project uses a lightweight Query
Analyzer that classifies the incoming query (definition, procedure,
numeric/compute, compliance) and selects the retrieval depth, reranking,
and prompt settings accordingly. This yields "agentic" decision-making
without multi-tool loops or external calls, preserving offline operation and low
latency while still enabling intelligent query-dependent behavior.
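
One way such an analyzer could look is a small rule table. The sketch below is an assumption about the approach, not the project's classifier: the keyword patterns, the `RetrievalPlan` fields, and the per-category values are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalPlan:
    category: str
    top_k: int
    use_reranker: bool


# Rules are checked in order; the first matching pattern wins.
_RULES = [
    ("numeric", re.compile(r"\d|\b(calculate|total|how much|percent)\b")),
    ("compliance", re.compile(r"\b(compliance|regulation|audit|allowed|policy)\b")),
    ("procedure", re.compile(r"\b(how do|how to|steps?|process|procedure)\b")),
    ("definition", re.compile(r"\b(what is|what are|define|meaning of)\b")),
]

# Illustrative settings only; real values would be tuned per corpus.
_PLANS = {
    "numeric": RetrievalPlan("numeric", top_k=4, use_reranker=False),
    "compliance": RetrievalPlan("compliance", top_k=8, use_reranker=True),
    "procedure": RetrievalPlan("procedure", top_k=6, use_reranker=True),
    "definition": RetrievalPlan("definition", top_k=3, use_reranker=False),
}


def analyze(query):
    """Map a query to a retrieval plan using keyword rules (no model call)."""
    lowered = query.lower()
    for category, pattern in _RULES:
        if pattern.search(lowered):
            return _PLANS[category]
    return _PLANS["definition"]  # default for anything unmatched
```

The analyzer runs in microseconds and makes no external calls, which is what keeps this form of "agentic" routing compatible with offline operation.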

Prompt Builder & Context Assembly

A dedicated prompt builder assembles model input by merging: (1) a strict
system instruction that enforces grounding and the fallback phrase
(“Information not available in documents”), (2) session memory injection
(summary + last N exchanges), and (3) the top-k chunks (deduplicated and
token-limited) with metadata. The builder enforces an overall token budget
and will compress or truncate lower-ranked chunks to remain within
MAX_CONTEXT_TOKENS.
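
The assembly order and budget enforcement can be sketched as below. The sketch assumes whitespace word counts as a rough token estimate, represents chunks as dictionaries with `document_name`, `page_number`, and `text` keys, and only truncates (it does not compress); `build_prompt` and `count_tokens` are hypothetical names.

```python
SYSTEM_RULE = (
    "Answer only from the provided context and memory; if the information is "
    "not present, reply: 'Information not available in documents'."
)


def count_tokens(text):
    """Rough token estimate: whitespace-separated words (an approximation)."""
    return len(text.split())


def build_prompt(question, chunks, summary="", recent=(), max_context_tokens=300):
    """Merge system rule, memory, and ranked chunks within a token budget."""
    header = [SYSTEM_RULE]
    if summary:
        header.append(f"Conversation summary: {summary}")
    for user_msg, assistant_msg in recent:
        header.append(f"User: {user_msg}\nAssistant: {assistant_msg}")
    footer = f"Question: {question}"
    # Reserve space for header, footer, and the one-word "Context:" marker.
    budget = (max_context_tokens - count_tokens("\n".join(header))
              - count_tokens(footer) - 1)
    context, seen = [], set()
    for chunk in chunks:  # chunks are expected best-first
        if chunk["text"] in seen:
            continue      # drop exact duplicates
        seen.add(chunk["text"])
        label = f"[{chunk['document_name']} p.{chunk['page_number']}]"
        room = budget - count_tokens(label)
        if room <= 0:
            break         # no space left for lower-ranked chunks
        words = chunk["text"].split()[:room]  # truncate if it does not fit
        context.append(label + " " + " ".join(words))
        budget -= count_tokens(label) + len(words)
    return "\n\n".join(header + ["Context:"] + context + [footer])
```

Because chunks are consumed best-first, truncation always falls on the lowest-ranked material that still fits.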

Model Serving & Streaming

The generation layer uses a local runtime manager (Ollama) to serve a
quantized Phi-4 Mini model. We rely on Ollama for local GGUF handling and
streaming endpoints. Streaming is implemented with SSE (Server-Sent
Events) from FastAPI: tokens or chunked strings are yielded as they arrive
from the runtime to deliver perceived low latency. The final SSE event
contains the full answer text, a sources array with chunk metadata, and
timing metrics (retrieval_ms, rerank_ms, generation_ms) for traceability.
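
The event stream described above can be illustrated with a plain generator that formats SSE frames. This is a sketch under assumptions: the event names `token` and `final`, the JSON field names other than the timing keys, and the helper names are hypothetical, and the token iterator stands in for whatever the model runtime yields.

```python
import json
import time


def sse_event(data, event=None):
    """Format one Server-Sent Events frame (terminated by a blank line)."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    for line in data.splitlines() or [""]:
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"


def stream_answer(token_iter, sources, timings):
    """Yield one frame per token, then a final frame with answer metadata."""
    parts = []
    start = time.perf_counter()
    for token in token_iter:
        parts.append(token)
        yield sse_event(json.dumps({"token": token}), event="token")
    generation_ms = round((time.perf_counter() - start) * 1000, 1)
    final = {
        "answer": "".join(parts),
        "sources": sources,
        "timings": dict(timings, generation_ms=generation_ms),
    }
    yield sse_event(json.dumps(final), event="final")
```

In FastAPI, a generator like this can be returned through `StreamingResponse(..., media_type="text/event-stream")`; the client appends `token` events as they arrive and renders sources and timings when the `final` event comes in.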

Session Memory, Summarization & Token Budgeting

Session memory is thread-based (UUID per conversation) and stored in
SQLite. The memory injection strategy is conservative: only a stored
summary (if present) plus the last N exchanges are injected to avoid token
explosion. The system automatically generates a concise factual summary
after a configurable SUMMARY_AFTER_TURNS (default 3), using the same
local model via a summarization prompt that forbids invented facts. If
summarization fails, the system logs the error and retains the raw
messages; data is never deleted on failure. This approach ensures
long-running conversations remain coherent and cost-efficient.
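
A minimal sketch of this memory flow with the standard-library `sqlite3` module is shown below. The table layout, the function names, and the choice to refresh the summary every SUMMARY_AFTER_TURNS exchanges are assumptions made for illustration; the `summarize` callable stands in for the local model call.

```python
import logging
import sqlite3
import uuid

SUMMARY_AFTER_TURNS = 3      # default stated in the documentation
MEMORY_INJECTION_LIMIT = 2   # assumed value: last N exchanges to inject


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS exchanges ("
               "id INTEGER PRIMARY KEY AUTOINCREMENT, "
               "thread_id TEXT, question TEXT, answer TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS summaries ("
               "thread_id TEXT PRIMARY KEY, summary TEXT)")
    return db


def new_thread():
    return str(uuid.uuid4())


def record_exchange(db, thread_id, question, answer, summarize):
    """Store one exchange; refresh the summary every SUMMARY_AFTER_TURNS."""
    db.execute("INSERT INTO exchanges (thread_id, question, answer) "
               "VALUES (?, ?, ?)", (thread_id, question, answer))
    turns = db.execute("SELECT COUNT(*) FROM exchanges WHERE thread_id = ?",
                       (thread_id,)).fetchone()[0]
    if turns % SUMMARY_AFTER_TURNS == 0:
        rows = db.execute("SELECT question, answer FROM exchanges "
                          "WHERE thread_id = ? ORDER BY id",
                          (thread_id,)).fetchall()
        try:
            summary = summarize(rows)
            db.execute("INSERT OR REPLACE INTO summaries (thread_id, summary) "
                       "VALUES (?, ?)", (thread_id, summary))
        except Exception as exc:
            # Raw messages are kept; a failed summary never deletes data.
            logging.warning("summarization failed for %s: %s", thread_id, exc)
    db.commit()


def memory_for_prompt(db, thread_id):
    """Return (summary, last N exchanges in chronological order)."""
    row = db.execute("SELECT summary FROM summaries WHERE thread_id = ?",
                     (thread_id,)).fetchone()
    recent = db.execute("SELECT question, answer FROM exchanges "
                        "WHERE thread_id = ? ORDER BY id DESC LIMIT ?",
                        (thread_id, MEMORY_INJECTION_LIMIT)).fetchall()
    return (row[0] if row else ""), list(reversed(recent))
```

Only the summary and the most recent exchanges are ever read back for the prompt, so the injected memory stays bounded no matter how long the thread grows.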

Frontend & UX

A modern, minimal UI is implemented with server-rendered Jinja2 templates
and Tailwind CSS. The interface includes a sidebar for thread list
management, a document upload panel, and a chat window with streaming
message rendering. The streaming client uses a small SSE JavaScript snippet
that appends incoming token chunks and, upon the final event, renders the
sources and timings in an expandable panel. The UI deliberately balances
simplicity and professionalism for quick judge demos or enterprise
acceptance.

Backend API & Endpoints

Key endpoints:

• POST /upload: single or bulk file upload; triggers parsing, chunking,
  embedding, and indexing.
• GET /documents: list indexed documents.
• POST /documents/delete: remove a document and its vectors.
• POST /chat/new: create a new thread_id.
• POST /chat/{thread_id}/ask: SSE streaming chat endpoint.
• GET /chat/{thread_id}/history: fetch stored session history and
  summary.
• GET /health: returns model runtime connectivity and index readiness.

Each response includes clear status codes, consistent JSON shapes for
programmatic use, and robust error messages.

Configuration & Operational Checklist

All runtime toggles and paths are centralized in [Link]. Important toggles
include ENABLE_RERANKER, ENABLE_STREAMING, ENABLE_SUMMARIZATION,
SUMMARY_AFTER_TURNS, MEMORY_INJECTION_LIMIT, and hybrid weighting
parameters. The pre-demo checklist includes:

1. Install dependencies (python -m venv .venv && .venv/bin/pip install -r
[Link]).

2. Pull models locally: ollama pull phi4-mini and ollama pull
bge-small-en-v1.5 (or equivalent embeddings/reranker models).

3. Start Ollama runtime and verify with the health endpoint.

4. Pre-index sample documents.

5. Run backend and exercise SSE-based endpoints locally.

The README contains exact commands and troubleshooting tips for offline
operation.
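
As an illustration of centralized toggles, the sketch below loads the settings named above from environment variables. It is an assumption about how such a module could be laid out, not the project's configuration file: only the SUMMARY_AFTER_TURNS default of 3 comes from this document, and the other defaults and the `BM25_WEIGHT` / `SEMANTIC_WEIGHT` variable names are hypothetical.

```python
import os
from dataclasses import dataclass

_TRUE = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    enable_reranker: bool
    enable_streaming: bool
    enable_summarization: bool
    summary_after_turns: int
    memory_injection_limit: int
    bm25_weight: float       # hypothetical name for a hybrid weight
    semantic_weight: float   # hypothetical name for a hybrid weight


def load_settings(env=None):
    """Read toggles from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    def flag(name, default):
        return env.get(name, str(default)).strip().lower() in _TRUE

    settings = Settings(
        enable_reranker=flag("ENABLE_RERANKER", True),
        enable_streaming=flag("ENABLE_STREAMING", True),
        enable_summarization=flag("ENABLE_SUMMARIZATION", True),
        summary_after_turns=int(env.get("SUMMARY_AFTER_TURNS", 3)),
        memory_injection_limit=int(env.get("MEMORY_INJECTION_LIMIT", 2)),
        bm25_weight=float(env.get("BM25_WEIGHT", 0.4)),
        semantic_weight=float(env.get("SEMANTIC_WEIGHT", 0.6)),
    )
    if settings.summary_after_turns < 1:
        raise ValueError("SUMMARY_AFTER_TURNS must be at least 1")
    if settings.bm25_weight < 0 or settings.semantic_weight < 0:
        raise ValueError("hybrid weights must be non-negative")
    return settings
```

Passing a plain dictionary instead of the process environment makes the loader easy to exercise in unit tests.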

Testing & Quality Assurance

Unit tests are provided for parser fallback logic, chunker splitting behavior,
hybrid scoring, session summarization triggers, and the model streaming
wrapper (mocked). Pytest and pytest-asyncio are used; model runtime calls
are mocked in tests to validate streaming assembly and final sources
inclusion. Integration tests simulate end-to-end flows with small sample
documents and assert the structure and content of final responses.

Performance & Optimization Techniques

• Quantization: model is run in 4-bit quantized format to reduce
  memory and CPU load.
• Embedding caching: embeddings are cached on disk and reused to
  avoid re-compute on restarts.
• FAISS persisted index: fast nearest neighbor search with HNSW for
  low-latency retrieval.
• Hybrid search: combining BM25 with embedding similarity captures
  both exact and semantic matches.
• Reranking only top-N: cross-encoder applied to a small candidate
  set to balance latency and precision.
• Streaming responses: perceived latency dramatically reduced by
  incremental updates.
• Summarization: bounds the token growth of active sessions to keep
  generation costs low.

Combined, these techniques deliver accurate answers with acceptable
latency on CPU-only hardware.

Security, Compliance & Privacy

All processing happens on-premise or on the demonstration machine. No
external LLM APIs are called, no document data leaves the local
environment, and no keys/credentials are embedded in the repo. Access
controls can be added
easily (reverse proxy, basic auth) for enterprise demos. SQLite session stores
are local and portable; for production deployments, swap to an enterprise DB
and encrypted storage as needed.

Limitations & Known Constraints

• Model quality and latency depend on local hardware; larger corpora
  and heavier models require more RAM or GPU for sub-second
  generation.
• Ollama and certain model binaries must be downloaded before
  offline demos.
• Cross-encoder reranking trades a moderate latency penalty for higher
  precision; tuning may be required per dataset.
• Summarization uses the same model as generation and thus shares its
  hallucination profile; careful prompt engineering and constrained
  context are used to mitigate this.

How to Demonstrate (Demo Script)

1. Start the Ollama runtime and pull phi4-mini.

2. Start the FastAPI server and visit the UI.

3. Upload a small set of finance/hotel policy PDFs. Wait for indexing.

4. Create a new chat, ask a procedural question (e.g., “What is the loan
approval process?”). Observe streaming text and final sources.

5. Ask a second and third question in the same thread; after the third
exchange, demonstrate that a summary has been created and that
subsequent injections include only the summary plus the last two
exchanges.

6. Show /health output and the SQLite table to prove stored summary and
timings.

This flow highlights offline capability, streaming UX, memory compaction,
and explainability.

Future Work & Roadmap

• Add optional fine-tuning (LoRA) on domain-specific datasets to reduce
  hallucination further.
• Replace SQLite with Redis for in-memory session speed, with
  persistence to an enterprise DB for scale.
• Add role-based access, audit trails, and encrypted storage for
  compliance.
• Explore accelerator support (ONNX, OpenVINO) for faster inference on
  CPU.
• Add user-level analytics and a feedback loop to measure accuracy and
  continuously improve retrieval weights.

Appendix — Important Snippets & API Summary

System prompt rule (enforced for all generation):

“Answer only from the provided context and memory; if the information is
not present, reply: ‘Information not available in documents’.”

Essential endpoints:

• POST /upload: upload files and trigger indexing.
• POST /chat/new: returns thread_id.
• POST /chat/{thread_id}/ask: SSE streaming; final event contains
  sources and timings.
• GET /chat/{thread_id}/history: session history including summaries.
• GET /health: validates Ollama, index, embedding readiness.

Sample judge prompts (included in README and recommended for
benchmarking):

• “Explain the loan approval process in a bank in structured step-by-step
  format. Limit to 150 words.”
• “Explain the hotel check-in procedure for a business traveler. Use
  bullet points.”
• “If a hotel room costs $200 per night for 3 nights with a 10% discount
  and 12% tax after discount, show calculation steps and final bill.”
• “What is the cancellation policy of Grand Azure Hotel in
  Bhubaneswar?” (should respond with “Information not available in
  documents” if not indexed).
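
The numeric sample prompt has a single correct answer, which makes it useful for checking model output. The short script below works through the arithmetic with exact decimals; the function name `hotel_bill` and the returned field names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def hotel_bill(rate, nights, discount_pct, tax_pct):
    """Apply the discount to the subtotal, then tax to the discounted amount."""
    subtotal = Decimal(rate) * nights
    discount = (subtotal * Decimal(discount_pct) / 100).quantize(
        CENT, rounding=ROUND_HALF_UP)
    discounted = subtotal - discount
    tax = (discounted * Decimal(tax_pct) / 100).quantize(
        CENT, rounding=ROUND_HALF_UP)
    return {
        "subtotal": subtotal,
        "discount": discount,
        "discounted": discounted,
        "tax": tax,
        "total": discounted + tax,
    }


bill = hotel_bill("200", 3, "10", "12")
for step, amount in bill.items():
    print(f"{step}: {amount}")
```

The steps are: 3 nights at $200 gives $600; a 10% discount removes $60, leaving $540; 12% tax on $540 is $64.80; the final bill is $604.80.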

Closing Statement

This project is engineered to provide the right combination of engineering
rigor, real-world usefulness, and hackathon-ready polish: robust parsing and
chunking, hybrid retrieval with reranking, memory management with
summarization, and offline quantized generation with streaming for great UX.
The system demonstrates enterprise-grade design trade-offs (privacy,
cost-efficiency, explainability, and deployability) while remaining
extensible for future production hardening.
