
Hybrid Rag Chatbot

The document outlines the development of a production-grade, offline Retrieval-Augmented Generation (RAG) chatbot designed for Finance and Hospitality sectors, focusing on cost-efficiency, privacy, and usability. Key features include local model operation, hybrid retrieval methods, and thread-based conversational memory, all aimed at providing accurate, context-grounded answers while maintaining low latency. The architecture supports various document formats and includes robust ingestion, re-ranking, and summarization capabilities to enhance user experience and operational efficiency.

Uploaded by

Suraj Patra

Hybrid Agentic RAG — Finance & Hospitality

Project Documentation (production-ready, copy-paste for DOCX)

Executive Summary

This project implements a production-grade, offline Retrieval-Augmented
Generation (RAG) chatbot tailored to Finance and Hospitality documents
(policies, SOPs, pricing, audits, customer-service guidelines). It is optimized
for cost-efficiency, privacy, and real-world usability: the solution runs locally
using quantized small models, persistent vector indexes, hybrid
semantic+keyword retrieval, cross-encoder reranking, streaming generation,
and thread-based conversational memory with automatic summarization.
The design prioritizes accuracy, low latency, explainability, and easy
deployment on modest hardware.

Problem Statement

Enterprises in finance and hospitality maintain vast, heterogeneous
document sets that staff must search repeatedly for procedural, compliance,
and policy answers. Existing cloud LLM solutions are expensive, raise privacy
concerns, and require internet connectivity; many organizations cannot
adopt them. The objective here is to provide a private, low-cost, offline-
capable conversational assistant that returns accurate, grounded answers
and cites sources while running on CPU-friendly infrastructure.

Objectives & Evaluation Goals

The system is built to achieve the following: (1) accurate, context-grounded
answers with source citations; (2) low cost via local quantized models and
efficient indexing; (3) fast perceived response through streaming; (4) robust
ingestion of multiple file formats with fallback parsers; (5) conversational
continuity using thread-based memory and automatic summarization; and
(6) demonstrable engineering maturity for hackathon judges or enterprise
pilots. Performance targets are sub-second retrieval, reranking in under
200 ms, and streamed output so that the overall response feels fast.

Key Features (at-a-glance)

• Offline-first generation using a local quantized model (Phi-4 Mini via
  Ollama).
• Robust ingestion pipeline: PDF/DOCX/TXT/CSV support with PyMuPDF
  primary and Docling fallback; optional OCR for scanned documents.
• Smart chunking: section-aware + token-aware splitting with metadata
  (document, page, chunk_id).
• Hybrid retrieval: BM25 keyword scoring combined with semantic
  retrieval (embeddings → FAISS).
• Cross-encoder reranker (optional) to improve top-k precision.
• Controlled agentic behavior via a lightweight Query Analyzer (no tool
  loops or external calls).
• Thread-based conversational memory stored in SQLite with
  configurable memory injection limits.
• Automatic summarization of older exchanges after N turns to keep
  context compact.
• Streaming Server-Sent Events (SSE) for token-by-token or chunked
  output with final JSON containing sources and timings.
• Production-grade FastAPI backend and Tailwind-styled Jinja2 frontend
  for polished UI.

High-level Architecture

User → Query Analyzer → Hybrid Retrieval (BM25 + FAISS) → (Optional)
Reranker → Context Assembler + Memory Injection → Local Model (Ollama /
phi4-mini) → Streaming Response → Session Storage (SQLite).

Each part is modular: ingestion produces chunk metadata and persisted
embeddings; retrieval returns candidate chunks; reranker orders them for
the model; the prompt builder enforces token budgets and strict grounding
instructions; the model runtime streams partial tokens and a final object that
includes citations and timing metrics.

Ingestion & Document Handling

The ingestion pipeline is designed for accuracy and robustness. We attempt
PyMuPDF parsing first (fast and reliable for well-formed PDFs); if it fails or
results in low-quality extraction, we use Docling as a fallback to better
capture layout and section headers. For scanned documents we offer an
optional Tesseract OCR step. Chunking uses a hybrid strategy: prefer section
boundaries (headings) to preserve semantic units, then apply token-based
splitting (configurable chunk size & overlap) to meet model and retrieval
efficiency requirements. Every chunk is annotated with document_name,
page_number, section_title and chunk_id for traceable citations.
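
The token-based half of this chunking strategy can be sketched as follows. This is a minimal illustration, not the project's actual chunker: it assumes sections have already been detected by the parser, approximates tokens by whitespace-separated words, and uses a hypothetical chunk_id format of document name plus running index.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    document_name: str
    page_number: int
    section_title: str
    chunk_id: str
    text: str

def chunk_sections(document_name, sections, chunk_size=200, overlap=40):
    """Split pre-detected sections into overlapping word windows.

    sections: iterable of (page_number, section_title, text) tuples.
    Sections are never merged, so a chunk never crosses a heading.
    Whitespace words stand in for real tokenizer tokens (assumption).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for page_number, section_title, text in sections:
        tokens = text.split()
        for start in range(0, len(tokens), step):
            window = tokens[start:start + chunk_size]
            chunk_id = f"{document_name}:{len(chunks)}"  # hypothetical id scheme
            chunks.append(Chunk(document_name, page_number, section_title,
                                chunk_id, " ".join(window)))
            if start + chunk_size >= len(tokens):
                break  # this window already reached the end of the section
    return chunks
```

Because every chunk carries its document, page, and section, a citation can be produced directly from the chunk without a second lookup.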

Embeddings, Vector Store, and BM25

We compute vector embeddings with a CPU-friendly embedding model
(bge-small-en-v1.5 or similar), caching results on disk to avoid
recomputation. The
vectors are stored in a FAISS index persisted to disk (HNSW or Flat depending
on corpus size). We also build a BM25 inverted index (rank_bm25) over the
chunk texts. At query time, the system performs both semantic search
(FAISS) and keyword retrieval (BM25), then combines normalized scores with
configurable weights to produce a robust candidate set that captures both
paraphrase and exact-phrase matches common in policy documents.
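
The score fusion step can be sketched as below. This is one plausible reading of "normalized scores with configurable weights": min-max normalization followed by a weighted sum. The function names and the default weights are assumptions for illustration. Both inputs are assumed to be similarity-style scores where higher is better, so distances from an L2 index would need to be negated first.

```python
def min_max(scores):
    """Rescale a {chunk_id: score} map to [0, 1]; a constant map becomes all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_rank(bm25_scores, dense_scores, w_bm25=0.4, w_dense=0.6, top_k=5):
    """Fuse keyword and semantic scores into one ranked list.

    A chunk returned by only one retriever contributes 0 for the other.
    Ties are broken by chunk id so the ordering is deterministic.
    """
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    ids = set(bm25_n) | set(dense_n)
    fused = {i: w_bm25 * bm25_n.get(i, 0.0) + w_dense * dense_n.get(i, 0.0)
             for i in ids}
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

Normalizing each retriever separately matters because raw BM25 scores are unbounded while embedding similarities usually fall in a narrow range; without it the weights would not mean what they appear to mean.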

Re-ranking (Cross-Encoder)

To improve precision for the final context sent to the generator, the top-N
candidates from hybrid search are optionally re-ranked using a cross-encoder
model (e.g., ms-marco-MiniLM-L-6-v2). This cross-encoder evaluates query–
chunk pair relevance (slower but applied only to the small candidate set),
typically improving factual grounding and reducing hallucination. Reranking
is a configurable toggle to trade latency for precision.
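
The "rerank only the top-N" pattern can be sketched without committing to a particular model. In this illustration score_pair is a hypothetical callable standing in for the cross-encoder, and overlap_score is a toy scorer included only so the example runs on its own.

```python
def rerank(query, candidates, score_pair, top_n=20, keep=5):
    """Re-order the best top_n hybrid candidates with a pairwise scorer.

    candidates: chunk texts, best-first from hybrid search.
    score_pair(query, text) -> float plays the role of the cross-encoder.
    Only the first top_n candidates are scored, which bounds the latency.
    """
    head = candidates[:top_n]
    scored = [(score_pair(query, text), idx, text) for idx, text in enumerate(head)]
    # Highest score first; the original rank breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:keep]]

def overlap_score(query, text):
    """Toy scorer: fraction of query words that also appear in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0

best = rerank(
    "loan approval steps",
    ["hotel breakfast hours",
     "loan approval steps for retail customers",
     "loan pricing"],
    overlap_score,
    keep=2,
)
```

Passing the scorer in as an argument also makes the ENABLE_RERANKER toggle trivial: when it is off, the caller simply skips this function and keeps the hybrid order.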

Query Analyzer and Agent Design (Controlled Agentic Behavior)

Instead of a heavy agent framework, the project uses a lightweight Query
Analyzer that classifies the incoming query (definition, procedure,
numeric/compute, compliance) and selects the retrieval depth, reranking,
and prompt settings accordingly. This yields "agentic" decision-making
without multi-tool loops or external calls, preserving offline operation and low
latency while still enabling intelligent query-dependent behavior.
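
A rule-based analyzer of this kind might look like the sketch below. The keyword patterns, the per-category settings, and the "general" fallback are all hypothetical; the source only states the four categories and that they drive retrieval depth and reranking.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryPlan:
    category: str
    top_k: int          # how many chunks to retrieve
    use_reranker: bool  # whether to spend time on the cross-encoder

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("numeric", re.compile(r"\d|\b(calculate|total|discount|tax|how much)\b", re.I)),
    ("compliance", re.compile(r"\b(compliance|regulation|audit|kyc|aml)\b", re.I)),
    ("procedure", re.compile(r"\b(how (do|to)|steps?|process|procedure)\b", re.I)),
    ("definition", re.compile(r"\b(what is|define|meaning of)\b", re.I)),
]

# Hypothetical per-category settings.
PLANS = {
    "numeric": QueryPlan("numeric", 3, False),
    "compliance": QueryPlan("compliance", 8, True),
    "procedure": QueryPlan("procedure", 6, True),
    "definition": QueryPlan("definition", 3, False),
}

def analyze(query):
    """Classify a query with plain regex rules; no model call, no external tools."""
    for category, pattern in RULES:
        if pattern.search(query):
            return PLANS[category]
    return QueryPlan("general", 5, True)  # fallback when nothing matches
```

Rule order encodes priority: "What is the loan approval process?" contains both a definition cue and a procedure cue, and is treated as a procedure because that rule is checked first.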

Prompt Builder & Context Assembly

A dedicated prompt builder assembles model input by merging: (1) a strict
system instruction that enforces grounding and the fallback phrase
(“Information not available in documents”), (2) session memory injection
(summary + last N exchanges), and (3) the top-k chunks (deduplicated and
token-limited) with metadata. The builder enforces an overall token budget
and will compress or truncate lower-ranked chunks to remain within
MAX_CONTEXT_TOKENS.
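
The budget enforcement can be sketched as follows. This is an illustrative version only: it counts whitespace words instead of model tokens, truncates (rather than compresses) the chunk that overflows, and uses a hypothetical source-tag format for the chunk metadata.

```python
SYSTEM_RULE = ("Answer only from the provided context and memory; if the information "
               "is not present, reply: 'Information not available in documents'.")

def count_tokens(text):
    """Whitespace word count, a rough stand-in for the model tokenizer."""
    return len(text.split())

def build_prompt(question, chunks, summary="", recent=(), max_context_tokens=1500):
    """Assemble system rule, memory, and chunks within a fixed budget.

    chunks: best-first list of dicts with 'text', 'document_name', 'page_number'.
    Memory and the question are always kept; chunks fill whatever is left.
    """
    parts = [SYSTEM_RULE]
    if summary:
        parts.append("Conversation summary: " + summary)
    parts.extend(recent)
    tail = "Question: " + question
    budget = max_context_tokens - sum(count_tokens(p) for p in parts) - count_tokens(tail)
    seen = set()
    for chunk in chunks:
        if chunk["text"] in seen:
            continue  # drop exact duplicates
        seen.add(chunk["text"])
        header = f"[{chunk['document_name']} p.{chunk['page_number']}]"
        room = budget - count_tokens(header)
        if room <= 0:
            break  # nothing left for lower-ranked chunks
        block = header + " " + " ".join(chunk["text"].split()[:room])
        parts.append(block)
        budget -= count_tokens(block)
    parts.append(tail)
    return "\n\n".join(parts)
```

Since chunks arrive best-first, the truncation naturally falls on the lowest-ranked material, which matches the behaviour described above.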

Model Serving & Streaming

The generation layer uses a local runtime manager (Ollama) to serve a
quantized Phi-4 Mini model. We rely on Ollama for local GGUF handling and
streaming endpoints. Streaming is implemented with SSE (Server-Sent
Events) from FastAPI: tokens or chunked strings are yielded as they arrive
from the runtime to deliver perceived low latency. The final SSE event
contains the full answer text, a sources array with chunk metadata, and
timing metrics (retrieval_ms, rerank_ms, generation_ms) for traceability.
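
The event framing can be sketched independently of the web framework. The payload keys and the "final" event name below are assumptions chosen to match the fields listed above; the token iterator stands in for whatever the model runtime yields.

```python
import json
import time

def sse_event(data, event=None):
    """Serialize one SSE frame: optional 'event:' line, a 'data:' line, blank line."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # json.dumps emits a single line, so one 'data:' field is enough.
    lines.append("data: " + json.dumps(data))
    return "\n".join(lines) + "\n\n"

def stream_answer(token_iter, sources, retrieval_ms, rerank_ms):
    """Yield one frame per token, then a final frame with answer, sources, timings."""
    start = time.perf_counter()
    pieces = []
    for token in token_iter:
        pieces.append(token)
        yield sse_event({"token": token})
    generation_ms = round((time.perf_counter() - start) * 1000, 1)
    yield sse_event(
        {
            "answer": "".join(pieces),
            "sources": sources,
            "timings": {
                "retrieval_ms": retrieval_ms,
                "rerank_ms": rerank_ms,
                "generation_ms": generation_ms,
            },
        },
        event="final",
    )
```

In a FastAPI route, a generator like this would typically be wrapped in a StreamingResponse with media type text/event-stream; the client then appends each token frame and treats the named final event as the signal to render sources and timings.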

Session Memory, Summarization & Token Budgeting

Session memory is thread-based (UUID per conversation) and stored in
SQLite. The memory injection strategy is conservative: only a stored
summary (if present) plus the last N exchanges are injected to avoid token
explosion. The system automatically generates a concise factual summary
after a configurable SUMMARY_AFTER_TURNS (default 3), using the same
local model via a summarization prompt that forbids invented facts. If
summarization fails, the system logs the error and retains the raw messages;
data is never deleted on failure. This approach ensures long-running
conversations remain coherent and cost-efficient.
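
A minimal sketch of this memory layer is shown below. The table layout and function names are hypothetical, the summarizer is passed in as a callable standing in for the local model, and the sketch assumes the summary is refreshed on every exchange once the threshold is reached, which the description above does not specify.

```python
import logging
import sqlite3
import uuid

SUMMARY_AFTER_TURNS = 3       # default stated in the text
MEMORY_INJECTION_LIMIT = 2    # assumed value: last two exchanges

def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS threads "
               "(thread_id TEXT PRIMARY KEY, summary TEXT)")
    db.execute("CREATE TABLE IF NOT EXISTS exchanges "
               "(id INTEGER PRIMARY KEY AUTOINCREMENT, thread_id TEXT, "
               "question TEXT, answer TEXT)")
    return db

def new_thread(db):
    thread_id = str(uuid.uuid4())
    db.execute("INSERT INTO threads VALUES (?, NULL)", (thread_id,))
    db.commit()
    return thread_id

def add_exchange(db, thread_id, question, answer, summarize):
    """Store the exchange first, then try to refresh the summary."""
    db.execute("INSERT INTO exchanges (thread_id, question, answer) VALUES (?, ?, ?)",
               (thread_id, question, answer))
    db.commit()
    rows = db.execute("SELECT question, answer FROM exchanges "
                      "WHERE thread_id = ? ORDER BY id", (thread_id,)).fetchall()
    if len(rows) < SUMMARY_AFTER_TURNS:
        return
    try:
        summary = summarize(rows)
    except Exception:
        # Raw messages are already committed; nothing is deleted on failure.
        logging.exception("summarization failed for thread %s", thread_id)
        return
    db.execute("UPDATE threads SET summary = ? WHERE thread_id = ?",
               (summary, thread_id))
    db.commit()

def memory_for_prompt(db, thread_id):
    """Return (summary or None, last N exchanges oldest-first)."""
    summary = db.execute("SELECT summary FROM threads WHERE thread_id = ?",
                         (thread_id,)).fetchone()[0]
    rows = db.execute("SELECT question, answer FROM exchanges WHERE thread_id = ? "
                      "ORDER BY id DESC LIMIT ?",
                      (thread_id, MEMORY_INJECTION_LIMIT)).fetchall()
    return summary, rows[::-1]
```

Committing the exchange before calling the summarizer is what guarantees the "never deletes data on failure" property: a failed summary leaves the previous summary and all raw rows untouched.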

Frontend & UX

A modern, minimal UI is implemented with server-rendered Jinja2 templates
and Tailwind CSS. The interface includes a sidebar for thread list
management, a document upload panel, and a chat window with streaming
message rendering. The streaming client uses a small SSE JavaScript snippet
that appends incoming token chunks and, upon the final event, renders the
sources and timings in an expandable panel. The UI deliberately balances
simplicity and professionalism for quick judge demos or enterprise
acceptance.

Backend API & Endpoints

Key endpoints:

• POST /upload: single or bulk file upload; triggers parsing, chunking,
  embedding, and indexing.
• GET /documents: list indexed documents.
• POST /documents/delete: remove a document and its vectors.
• POST /chat/new: create a new thread_id.
• POST /chat/{thread_id}/ask: SSE streaming chat endpoint.
• GET /chat/{thread_id}/history: fetch stored session history and
  summary.
• GET /health: returns model runtime connectivity and index readiness.

Each response includes clear status codes, consistent JSON shapes for
programmatic use, and robust error messages.

Configuration & Operational Checklist

All runtime toggles and paths are centralized in [Link]. Important toggles
include ENABLE_RERANKER, ENABLE_STREAMING, ENABLE_SUMMARIZATION,
SUMMARY_AFTER_TURNS, MEMORY_INJECTION_LIMIT, and hybrid weighting
parameters. The pre-demo checklist includes:

1. Install dependencies (python -m venv .venv && .venv/bin/pip install -r
[Link]).

2. Pull models locally: ollama pull phi4-mini and ollama pull
bge-small-en-v1.5 (or equivalent embeddings/reranker models).

3. Start Ollama runtime and verify with the health endpoint.

4. Pre-index sample documents.

5. Run backend and exercise SSE-based endpoints locally.

The README contains exact commands and troubleshooting tips for offline
operation.
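
One way to centralize the toggles named above is a small settings object read from environment variables, sketched below. The toggle names come from the text; reading them from the environment, the BM25_WEIGHT and SEMANTIC_WEIGHT names, and every default except SUMMARY_AFTER_TURNS are assumptions for illustration.

```python
import os
from dataclasses import dataclass

def _flag(name, default):
    """Read a boolean toggle; accepts 1/true/yes/on in any letter case."""
    raw = os.environ.get(name, str(default))
    return raw.strip().lower() in {"1", "true", "yes", "on"}

@dataclass(frozen=True)
class Settings:
    enable_reranker: bool
    enable_streaming: bool
    enable_summarization: bool
    summary_after_turns: int
    memory_injection_limit: int
    bm25_weight: float       # hypothetical name for the hybrid weighting
    semantic_weight: float   # hypothetical name for the hybrid weighting

def load_settings():
    """Build an immutable settings object from the current environment."""
    return Settings(
        enable_reranker=_flag("ENABLE_RERANKER", True),
        enable_streaming=_flag("ENABLE_STREAMING", True),
        enable_summarization=_flag("ENABLE_SUMMARIZATION", True),
        summary_after_turns=int(os.environ.get("SUMMARY_AFTER_TURNS", "3")),
        memory_injection_limit=int(os.environ.get("MEMORY_INJECTION_LIMIT", "2")),
        bm25_weight=float(os.environ.get("BM25_WEIGHT", "0.4")),
        semantic_weight=float(os.environ.get("SEMANTIC_WEIGHT", "0.6")),
    )
```

Keeping the object frozen means a toggle cannot change mid-request, which makes demo behaviour reproducible.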

Testing & Quality Assurance

Unit tests are provided for parser fallback logic, chunker splitting behavior,
hybrid scoring, session summarization triggers, and the model streaming
wrapper (mocked). Pytest and pytest-asyncio are used; model runtime calls
are mocked in tests to validate streaming assembly and final sources
inclusion. Integration tests simulate end-to-end flows with small sample
documents and assert the structure and content of final responses.

Performance & Optimization Techniques

• Quantization: model is run in 4-bit quantized format to reduce
  memory and CPU load.
• Embedding caching: embeddings are cached on disk and reused to
  avoid re-compute on restarts.
• FAISS persisted index: fast nearest neighbor search with HNSW for
  low-latency retrieval.
• Hybrid search: combining BM25 with embedding similarity captures
  both exact and semantic matches.
• Reranking only top-N: cross-encoder applied to a small candidate
  set to balance latency and precision.
• Streaming responses: perceived latency dramatically reduced by
  incremental updates.
• Summarization: bounds the token growth of active sessions to keep
  generation costs low.

Together, these techniques deliver accurate answers with acceptable
latency on CPU-only hardware.
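
The embedding cache mentioned in the list above could be as simple as the following sketch. It assumes a content-hash key and one JSON file per vector, neither of which is specified by the source; embed_fn is a hypothetical callable standing in for the real embedding model, and the fake embedder at the bottom exists only to make the example runnable.

```python
import hashlib
import json
import os
import tempfile

class EmbeddingCache:
    """Disk cache keyed by the SHA-256 of the chunk text."""

    def __init__(self, directory, embed_fn):
        self.directory = directory
        self.embed_fn = embed_fn  # stand-in for the embedding model
        os.makedirs(directory, exist_ok=True)

    def _path(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return os.path.join(self.directory, digest + ".json")

    def get(self, text):
        """Return the cached vector, computing and storing it on a miss."""
        path = self._path(text)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as handle:
                return json.load(handle)
        vector = list(self.embed_fn(text))
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(vector, handle)
        return vector

# Demo with a fake embedder that records how often it is called.
calls = []

def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 1.0]

cache_dir = tempfile.mkdtemp()
first = EmbeddingCache(cache_dir, fake_embed).get("refund policy")
# A fresh instance simulates a restart; the vector is read back from disk.
second = EmbeddingCache(cache_dir, fake_embed).get("refund policy")
```

If the embedding model is ever swapped, the model name should be folded into the hash key or the cache directory, otherwise stale vectors from the old model would be reused.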

Security, Compliance & Privacy

All processing happens on-premise or on the demonstration machine. No
external LLM APIs are called, no document data leaves the local environment,
and no keys or credentials are embedded in the repo. Access controls can be added
easily (reverse proxy, basic auth) for enterprise demos. SQLite session stores
are local and portable; for production deployments, swap to an enterprise DB
and encrypted storage as needed.

Limitations & Known Constraints

• Model quality and latency depend on local hardware; larger corpora
  and heavier models require more RAM or GPU for sub-second
  generation.
• Ollama and certain model binaries must be predownloaded prior to
  offline demos.
• Cross-encoder reranking trades a moderate latency penalty for higher
  precision; tuning may be required per dataset.
• Summarization uses the same model as generation and thus shares its
  hallucination profile; careful prompt engineering and constrained
  context are used to mitigate this.

How to Demonstrate (Demo Script)


1. Start the Ollama runtime and pull phi4-mini.

2. Start the FastAPI server and visit the UI.

3. Upload a small set of finance/hotel policy PDFs. Wait for indexing.

4. Create a new chat, ask a procedural question (e.g., “What is the loan
approval process?”). Observe streaming text and final sources.

5. Ask a second and third question in the same thread; after the third
exchange, demonstrate that a summary has been created and that
subsequent injections include only the summary plus the last two
exchanges.

6. Show /health output and the SQLite table to prove stored summary and
timings.

This flow highlights offline capability, streaming UX, memory compaction,
and explainability.

Future Work & Roadmap

• Add optional fine-tuning (LoRA) on domain-specific datasets to reduce
  hallucination further.
• Replace SQLite with Redis for in-memory session speed and
  persistence to an enterprise DB for scale.
• Add role-based access, audit trails, and encrypted storage for
  compliance.
• Explore accelerator support (ONNX, OpenVINO) for faster inference on
  CPU.
• Add user-level analytics and feedback loop to measure accuracy and
  continuously improve retrieval weights.

Appendix — Important Snippets & API Summary

System prompt rule (enforced for all generation):

“Answer only from the provided context and memory; if the information is
not present, reply: ‘Information not available in documents’.”

Essential endpoints:

• POST /upload: upload files and trigger indexing.
• POST /chat/new: returns thread_id.
• POST /chat/{thread_id}/ask: SSE streaming; final event contains
  sources and timings.
• GET /chat/{thread_id}/history: session history including summaries.
• GET /health: validates Ollama, index, embedding readiness.

Sample judge prompts (included in README and recommended for
benchmarking):

• “Explain the loan approval process in a bank in structured step-by-step
  format. Limit to 150 words.”
• “Explain the hotel check-in procedure for a business traveler. Use
  bullet points.”
• “If a hotel room costs $200 per night for 3 nights with a 10% discount
  and 12% tax after discount, show calculation steps and final bill.”
• “What is the cancellation policy of Grand Azure Hotel in
  Bhubaneswar?” (should respond with “Information not available in
  documents” if not indexed).
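
For benchmarking the numeric prompt above, it helps to have a reference answer to compare against. The sketch below works through the arithmetic with exact decimals; the function name and the returned field names are illustrative and not part of the project.

```python
from decimal import Decimal, ROUND_HALF_UP

def hotel_bill(rate, nights, discount_pct, tax_pct):
    """Compute the bill step by step: subtotal, discount, then tax on the remainder."""
    cents = Decimal("0.01")
    subtotal = Decimal(rate) * nights
    discount = (subtotal * Decimal(discount_pct) / 100).quantize(
        cents, rounding=ROUND_HALF_UP)
    discounted = subtotal - discount
    # Tax is applied after the discount, as the prompt specifies.
    tax = (discounted * Decimal(tax_pct) / 100).quantize(
        cents, rounding=ROUND_HALF_UP)
    return {
        "subtotal": subtotal,
        "discount": discount,
        "discounted": discounted,
        "tax": tax,
        "total": discounted + tax,
    }

bill = hotel_bill("200", 3, "10", "12")
# subtotal 600, discount 60.00, discounted 540.00, tax 64.80, total 604.80
```

A model answer that applies the tax before the discount, or to the undiscounted subtotal, will not arrive at $604.80 through the steps the prompt asks for, which makes this a useful check of step-by-step reasoning.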

Closing Statement

This project is engineered to provide the right combination of engineering
rigor, real-world usefulness, and hackathon-ready polish: robust parsing and
chunking, hybrid retrieval with reranking, memory management with
summarization, and offline quantized generation with streaming for great UX.
The system demonstrates enterprise-grade design trade-offs—privacy, cost-
efficiency, explainability, and deployability—while remaining extensible for
future production hardening.
