Hybrid Agentic RAG — Finance & Hospitality
Project Documentation (production-ready, copy-paste for DOCX)
Executive Summary
This project implements a production-grade, offline Retrieval-Augmented
Generation (RAG) chatbot tailored to Finance and Hospitality documents
(policies, SOPs, pricing, audits, customer-service guidelines). It is optimized
for cost-efficiency, privacy, and real-world usability: the solution runs locally
using quantized small models, persistent vector indexes, hybrid
semantic+keyword retrieval, cross-encoder reranking, streaming generation,
and thread-based conversational memory with automatic summarization.
The design prioritizes accuracy, low latency, explainability, and easy
deployment on modest hardware.
Problem Statement
Enterprises in finance and hospitality maintain vast, heterogeneous
document sets that staff must search repeatedly for procedural, compliance,
and policy answers. Existing cloud LLM solutions are expensive, raise privacy
concerns, and require internet connectivity; many organizations cannot
adopt them. The objective here is to provide a private, low-cost,
offline-capable conversational assistant that returns accurate, grounded answers
and cites sources while running on CPU-friendly infrastructure.
Objectives & Evaluation Goals
The system is built to achieve the following: (1) accurate, context-grounded
answers with source citations; (2) low cost via local quantized models and
efficient indexing; (3) fast perceived response through streaming; (4) robust
ingestion of multiple file formats with fallback parsers; (5) conversational
continuity using thread-based memory and automatic summarization; and
(6) demonstrable engineering maturity for hackathon judges or enterprise
pilots. Performance targets are sub-second retrieval, reranking in under
200 ms, and streamed output that keeps perceived response time low.
Key Features (at-a-glance)
Offline-first generation using a local quantized model (Phi-4 Mini via
Ollama).
Robust ingestion pipeline: PDF/DOCX/TXT/CSV support with PyMuPDF
primary and Docling fallback; optional OCR for scanned documents.
Smart chunking: section-aware + token-aware splitting with metadata
(document, page, chunk_id).
Hybrid retrieval: BM25 keyword scoring combined with semantic
retrieval (embeddings → FAISS).
Cross-encoder reranker (optional) to improve top-k precision.
Controlled agentic behavior via a lightweight Query Analyzer (no tool
loops or external calls).
Thread-based conversational memory stored in SQLite with
configurable memory injection limits.
Automatic summarization of older exchanges after N turns to keep
context compact.
Streaming Server-Sent Events (SSE) for token-by-token or chunked
output with final JSON containing sources and timings.
Production-grade FastAPI backend and Tailwind-styled Jinja2 frontend
for polished UI.
High-level Architecture
User → Query Analyzer → Hybrid Retrieval (BM25 + FAISS) → (Optional)
Reranker → Context Assembler + Memory Injection → Local Model (Ollama /
phi4-mini) → Streaming Response → Session Storage (SQLite).
Each part is modular: ingestion produces chunk metadata and persisted
embeddings; retrieval returns candidate chunks; reranker orders them for
the model; the prompt builder enforces token budgets and strict grounding
instructions; the model runtime streams partial tokens and a final object that
includes citations and timing metrics.
Ingestion & Document Handling
The ingestion pipeline is designed for accuracy and robustness. We attempt
PyMuPDF parsing first (fast and reliable for well-formed PDFs); if it fails or
results in low-quality extraction, we use Docling as a fallback to better
capture layout and section headers. For scanned documents we offer an
optional Tesseract OCR step. Chunking uses a hybrid strategy: prefer section
boundaries (headings) to preserve semantic units, then apply token-based
splitting (configurable chunk size & overlap) to meet model and retrieval
efficiency requirements. Every chunk is annotated with document_name,
page_number, section_title and chunk_id for traceable citations.
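The chunking strategy described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual chunker: it assumes whitespace-separated words as a stand-in for model tokens and a crude heading heuristic (numbered or all-caps lines). The names Chunk, split_sections, and chunk_page are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    document_name: str
    page_number: int
    section_title: str
    chunk_id: str
    text: str


# Hypothetical heading heuristic: numbered headings ("3.1 Refunds") or ALL-CAPS lines.
# Real documents will need a more careful rule (or layout info from the parser).
HEADING_RE = re.compile(r"^(\d+(\.\d+)*\s+\S.*|[A-Z][A-Z &/-]{3,})$")


def split_sections(page_text: str) -> list[tuple[str, str]]:
    """Group lines under the most recent heading; text before any heading gets an empty title."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    for line in page_text.splitlines():
        stripped = line.strip()
        if HEADING_RE.match(stripped):
            sections.append((stripped, []))
        elif stripped:
            sections[-1][1].append(stripped)
    return [(title, " ".join(body)) for title, body in sections if body]


def chunk_page(document_name: str, page_number: int, page_text: str,
               chunk_size: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split each section into overlapping windows of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: list[Chunk] = []
    for title, body in split_sections(page_text):
        words = body.split()
        for start in range(0, len(words), step):
            window = words[start:start + chunk_size]
            chunk_id = f"{document_name}:p{page_number}:c{len(chunks)}"
            chunks.append(Chunk(document_name, page_number, title, chunk_id, " ".join(window)))
            if start + chunk_size >= len(words):
                break  # the remaining words are already covered by this window
    return chunks
```

Splitting on section boundaries first means a window never straddles two headings, so each citation can name a single section_title.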
Embeddings, Vector Store, and BM25
We compute vector embeddings with a CPU-friendly embedding model
(bge-small-en-v1.5 or similar), caching results on disk to avoid recomputation. The
vectors are stored in a FAISS index persisted to disk (HNSW or Flat depending
on corpus size). We also build a BM25 inverted index (rank_bm25) over the
chunk texts. At query time, the system performs both semantic search
(FAISS) and keyword retrieval (BM25), then combines normalized scores with
configurable weights to produce a robust candidate set that captures both
paraphrase and exact-phrase matches common in policy documents.
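One possible way to combine the two score lists is min-max normalization followed by a weighted sum, sketched below. The function names and the default weights (0.6 / 0.4) are assumptions for illustration; the sketch also assumes that higher scores are better for both inputs, so FAISS L2 distances would need to be negated or converted first.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant list maps to 1.0 for every entry."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (value - lo) / (hi - lo) for chunk_id, value in scores.items()}


def hybrid_rank(semantic: dict[str, float], bm25: dict[str, float],
                w_semantic: float = 0.6, w_bm25: float = 0.4,
                top_n: int = 10) -> list[tuple[str, float]]:
    """Fuse semantic and keyword scores; a chunk missing from one list scores 0 there."""
    sem, kw = min_max(semantic), min_max(bm25)
    combined = {
        chunk_id: w_semantic * sem.get(chunk_id, 0.0) + w_bm25 * kw.get(chunk_id, 0.0)
        for chunk_id in set(sem) | set(kw)
    }
    # Sort by descending score, then by chunk id so ties are deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:top_n]
```

Normalizing before mixing matters because BM25 scores are unbounded while cosine similarities are not; without it, the keyword side would dominate regardless of the weights.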
Re-ranking (Cross-Encoder)
To improve precision for the final context sent to the generator, the top-N
candidates from hybrid search are optionally re-ranked using a cross-encoder
model (e.g., ms-marco-MiniLM-L-6-v2). This cross-encoder evaluates query–
chunk pair relevance (slower but applied only to the small candidate set),
typically improving factual grounding and reducing hallucination. Reranking
is a configurable toggle to trade latency for precision.
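The reranking step can be sketched with a pluggable pair scorer, as below. This is an illustrative outline only: rerank, PairScorer, and overlap_scorer are hypothetical names, and overlap_scorer is a toy stand-in so the example runs without model weights. In a real deployment the scorer would presumably wrap the cross-encoder's batch scoring call.

```python
from typing import Callable, Sequence

# A scorer takes (query, chunk_text) pairs and returns one relevance score per pair.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: list[dict], score_pairs: PairScorer,
           top_k: int = 3, enabled: bool = True) -> list[dict]:
    """Reorder hybrid-search candidates by pair score; pass through when disabled."""
    if not enabled or not candidates:
        return candidates[:top_k]
    scores = score_pairs([(query, candidate["text"]) for candidate in candidates])
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [{**candidates[i], "rerank_score": float(scores[i])} for i in order[:top_k]]


def overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy stand-in for a cross-encoder: fraction of query words found in the chunk."""
    scores = []
    for query, text in pairs:
        query_words = set(query.lower().split())
        text_words = set(text.lower().split())
        scores.append(len(query_words & text_words) / len(query_words) if query_words else 0.0)
    return scores
```

Keeping the scorer behind a single callable makes the latency/precision toggle a one-argument change and lets tests run without loading a model.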
Query Analyzer and Agent Design (Controlled Agentic Behavior)
Instead of a heavy agent framework, the project uses a lightweight Query
Analyzer that classifies the incoming query (definition, procedure,
numeric/compute, compliance) and selects the retrieval depth, reranking,
and prompt settings accordingly. This yields "agentic" decision-making
without multi-tool loops or external calls, preserving offline operation and low
latency while still enabling intelligent query-dependent behavior.
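A rule-based analyzer along these lines might look like the sketch below. The keyword patterns, the QueryPlan fields, and the per-category settings are all assumptions chosen for illustration; only the four category names come from the description above.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPlan:
    category: str
    top_k: int            # retrieval depth
    use_reranker: bool
    prompt_style: str


# Hypothetical rules, checked in order; the first match wins.
RULES = [
    ("numeric", re.compile(r"\d|\b(calculate|total|cost|tax|discount|how much)\b", re.I)),
    ("compliance", re.compile(r"\b(compliance|regulation|audit|kyc|aml|allowed|permitted)\b", re.I)),
    ("procedure", re.compile(r"\b(how (do|to|can)|steps?|process|procedure)\b", re.I)),
    ("definition", re.compile(r"\b(what is|what are|define|meaning of)\b", re.I)),
]

# Hypothetical settings per category.
PLANS = {
    "definition": QueryPlan("definition", top_k=3, use_reranker=False, prompt_style="concise"),
    "procedure": QueryPlan("procedure", top_k=6, use_reranker=True, prompt_style="step_by_step"),
    "numeric": QueryPlan("numeric", top_k=4, use_reranker=True, prompt_style="show_calculation"),
    "compliance": QueryPlan("compliance", top_k=8, use_reranker=True, prompt_style="quote_and_cite"),
}


def analyze(query: str) -> QueryPlan:
    """Classify the query with keyword rules and return its retrieval/prompt plan."""
    for category, pattern in RULES:
        if pattern.search(query):
            return PLANS[category]
    return PLANS["definition"]  # assumed fallback when no rule matches
```

Because the analyzer is a pure function with no model or network calls, it adds negligible latency and keeps the pipeline fully offline.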
Prompt Builder & Context Assembly
A dedicated prompt builder assembles model input by merging: (1) a strict
system instruction that enforces grounding and the fallback phrase
(“Information not available in documents”), (2) session memory injection
(summary + last N exchanges), and (3) the top-k chunks (deduplicated and
token-limited) with metadata. The builder enforces an overall token budget
and will compress or truncate lower-ranked chunks to remain within
MAX_CONTEXT_TOKENS.
Model Serving & Streaming
The generation layer uses a local runtime manager (Ollama) to serve a
quantized Phi-4 Mini model. We rely on Ollama for local GGUF handling and
streaming endpoints. Streaming is implemented with SSE (Server-Sent
Events) from FastAPI: tokens or chunked strings are yielded as they arrive
from the runtime to deliver perceived low latency. The final SSE event
contains the full answer text, a sources array with chunk metadata, and
timing metrics (retrieval_ms, rerank_ms, generation_ms) for traceability.
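The SSE framing can be illustrated with a small generator, shown below as a sketch under assumptions: the event names "token" and "final" and the payload keys other than retrieval_ms, rerank_ms, and generation_ms are hypothetical, and token_source stands in for whatever iterator the model runtime exposes.

```python
import json
import time
from typing import Iterable, Iterator


def sse_event(data: dict, event: str = "message") -> str:
    """Format one Server-Sent Event frame; json.dumps keeps the payload on a single line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_answer(token_source: Iterable[str], sources: list[dict],
                  retrieval_ms: float, rerank_ms: float) -> Iterator[str]:
    """Yield one frame per token, then a final frame with answer, sources, and timings."""
    started = time.perf_counter()
    parts: list[str] = []
    for token in token_source:
        parts.append(token)
        yield sse_event({"token": token}, event="token")
    generation_ms = round((time.perf_counter() - started) * 1000, 1)
    yield sse_event(
        {
            "answer": "".join(parts),
            "sources": sources,
            "timings": {
                "retrieval_ms": retrieval_ms,
                "rerank_ms": rerank_ms,
                "generation_ms": generation_ms,
            },
        },
        event="final",
    )
```

In FastAPI, a generator like this would typically be wrapped in a StreamingResponse with media type text/event-stream; that wiring is omitted here to keep the sketch dependency-free.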
Session Memory, Summarization & Token Budgeting
Session memory is thread-based (UUID per conversation) and stored in
SQLite. The memory injection strategy is conservative: only a stored
summary (if present) plus the last N exchanges are injected to avoid token
explosion. The system automatically generates a concise factual summary
after a configurable number of turns (SUMMARY_AFTER_TURNS, default 3), using
the same local model with a summarization prompt that forbids invented facts.
If summarization fails, the system logs the error and retains the raw messages;
data is never deleted on failure. This approach keeps long-running
conversations coherent and cost-efficient.
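A minimal sketch of this memory scheme with the standard sqlite3 module follows. The table layout, function names, and the choice to re-summarize every SUMMARY_AFTER_TURNS exchanges are assumptions, as is the injection limit of 2; summarize is a placeholder for the call into the local model.

```python
import logging
import sqlite3
import uuid
from typing import Callable

SUMMARY_AFTER_TURNS = 3      # default stated in the text
MEMORY_INJECTION_LIMIT = 2   # assumed value for "last N exchanges"

logger = logging.getLogger("session_memory")


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) schema: one row per thread, one row per exchange."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS threads (
            thread_id TEXT PRIMARY KEY,
            summary   TEXT NOT NULL DEFAULT ''
        );
        CREATE TABLE IF NOT EXISTS exchanges (
            id        INTEGER PRIMARY KEY AUTOINCREMENT,
            thread_id TEXT NOT NULL REFERENCES threads(thread_id),
            question  TEXT NOT NULL,
            answer    TEXT NOT NULL
        );
    """)
    return conn


def new_thread(conn: sqlite3.Connection) -> str:
    thread_id = str(uuid.uuid4())
    conn.execute("INSERT INTO threads (thread_id) VALUES (?)", (thread_id,))
    conn.commit()
    return thread_id


def add_exchange(conn: sqlite3.Connection, thread_id: str, question: str, answer: str,
                 summarize: Callable[[list], str]) -> None:
    """Store the exchange, then refresh the summary every SUMMARY_AFTER_TURNS turns."""
    conn.execute("INSERT INTO exchanges (thread_id, question, answer) VALUES (?, ?, ?)",
                 (thread_id, question, answer))
    conn.commit()
    rows = conn.execute("SELECT question, answer FROM exchanges WHERE thread_id = ? ORDER BY id",
                        (thread_id,)).fetchall()
    if len(rows) % SUMMARY_AFTER_TURNS != 0:
        return
    try:
        summary = summarize(rows)
    except Exception:
        # Raw messages stay in the table; only the summary refresh is skipped.
        logger.exception("summarization failed for thread %s", thread_id)
        return
    conn.execute("UPDATE threads SET summary = ? WHERE thread_id = ?", (summary, thread_id))
    conn.commit()


def memory_for_prompt(conn: sqlite3.Connection, thread_id: str) -> tuple:
    """Return (summary, last N exchanges in chronological order) for prompt injection."""
    summary = conn.execute("SELECT summary FROM threads WHERE thread_id = ?",
                           (thread_id,)).fetchone()[0]
    recent = conn.execute(
        "SELECT question, answer FROM exchanges WHERE thread_id = ? ORDER BY id DESC LIMIT ?",
        (thread_id, MEMORY_INJECTION_LIMIT)).fetchall()
    return summary, recent[::-1]
```

Because the summary lives in its own column and exchanges are never deleted, a failed or poor summary can always be regenerated from the raw history.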
Frontend & UX
A modern, minimal UI is implemented with server-rendered Jinja2 templates
and Tailwind CSS. The interface includes a sidebar for thread list
management, a document upload panel, and a chat window with streaming
message rendering. The streaming client uses a small SSE JavaScript snippet
that appends incoming token chunks and, upon the final event, renders the
sources and timings in an expandable panel. The UI deliberately balances
simplicity and professionalism for quick judge demos or enterprise
acceptance.
Backend API & Endpoints
Key endpoints:
POST /upload — single or bulk file upload; triggers parsing, chunking,
embedding, and indexing.
GET /documents — list indexed documents.
POST /documents/delete — remove a document and its vectors.
POST /chat/new — create a new thread_id.
POST /chat/{thread_id}/ask — SSE streaming chat endpoint.
GET /chat/{thread_id}/history — fetch stored session history and
summary.
GET /health — returns model runtime connectivity and index readiness.
Every endpoint returns appropriate HTTP status codes, consistent JSON shapes
for programmatic use, and descriptive error messages.
Configuration & Operational Checklist
All runtime toggles and paths are centralized in [Link]. Important toggles
include ENABLE_RERANKER, ENABLE_STREAMING, ENABLE_SUMMARIZATION,
SUMMARY_AFTER_TURNS, MEMORY_INJECTION_LIMIT, and hybrid weighting
parameters. The pre-demo checklist includes:
1. Install dependencies (python -m venv .venv && .venv/bin/pip install -r
[Link]).
2. Pull models locally: ollama pull phi4-mini and ollama pull
bge-small-en-v1.5 (or equivalent embeddings/reranker models).
3. Start Ollama runtime and verify with the health endpoint.
4. Pre-index sample documents.
5. Run backend and exercise SSE-based endpoints locally.
The README contains exact commands and troubleshooting tips for offline
operation.
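As an illustration of how the toggles named in this section might be centralized, the sketch below loads them into a frozen dataclass. Reading from environment variables is an assumption, as are the variable names HYBRID_SEMANTIC_WEIGHT and HYBRID_BM25_WEIGHT and every default except SUMMARY_AFTER_TURNS = 3; the project's actual configuration file may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    enable_reranker: bool
    enable_streaming: bool
    enable_summarization: bool
    summary_after_turns: int
    memory_injection_limit: int
    semantic_weight: float   # hypothetical name for a hybrid weighting parameter
    bm25_weight: float       # hypothetical name for a hybrid weighting parameter


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read toggles from a mapping (os.environ by default) and validate them."""
    env = os.environ if env is None else env

    def flag(name: str, default: bool) -> bool:
        return env.get(name, str(default)).strip().lower() in TRUTHY

    settings = Settings(
        enable_reranker=flag("ENABLE_RERANKER", True),
        enable_streaming=flag("ENABLE_STREAMING", True),
        enable_summarization=flag("ENABLE_SUMMARIZATION", True),
        summary_after_turns=int(env.get("SUMMARY_AFTER_TURNS", "3")),
        memory_injection_limit=int(env.get("MEMORY_INJECTION_LIMIT", "2")),
        semantic_weight=float(env.get("HYBRID_SEMANTIC_WEIGHT", "0.6")),
        bm25_weight=float(env.get("HYBRID_BM25_WEIGHT", "0.4")),
    )
    if settings.summary_after_turns < 1 or settings.memory_injection_limit < 0:
        raise ValueError("turn and memory limits must be positive")
    if min(settings.semantic_weight, settings.bm25_weight) < 0 or \
            settings.semantic_weight + settings.bm25_weight == 0:
        raise ValueError("hybrid weights must be non-negative and not both zero")
    return settings
```

Validating at load time means a mistyped toggle fails at startup rather than in the middle of a demo.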
Testing & Quality Assurance
Unit tests are provided for parser fallback logic, chunker splitting behavior,
hybrid scoring, session summarization triggers, and the model streaming
wrapper (mocked). Pytest and pytest-asyncio are used; model runtime calls
are mocked in tests to validate streaming assembly and final sources
inclusion. Integration tests simulate end-to-end flows with small sample
documents and assert the structure and content of final responses.
Performance & Optimization Techniques
Quantization: model is run in 4-bit quantized format to reduce
memory and CPU load.
Embedding caching: embeddings are cached on disk and reused to
avoid re-compute on restarts.
FAISS persisted index: fast nearest neighbor search with HNSW for
low-latency retrieval.
Hybrid search: combining BM25 with embedding similarity captures
both exact and semantic matches.
Reranking only top-N: cross-encoder applied to a small candidate
set to balance latency and precision.
Streaming responses: tokens are shown as they are generated, which
reduces perceived latency.
Summarization: bounds the token growth of active sessions to keep
generation costs low.
Together, these techniques deliver accurate answers with acceptable
latency on CPU-only hardware.
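Of the techniques listed, embedding caching is the simplest to illustrate. The sketch below is one possible design, not the project's implementation: it keys each cached vector by a SHA-256 hash of the model name and text and stores it as a JSON file. EmbeddingCache and embed_fn are hypothetical names, with embed_fn standing in for the real embedding model.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Sequence


class EmbeddingCache:
    """Disk cache: one JSON file per (model name, text) pair, keyed by SHA-256."""

    def __init__(self, cache_dir: str, model_name: str,
                 embed_fn: Callable[[str], Sequence[float]]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.model_name = model_name
        self.embed_fn = embed_fn

    def _path(self, text: str) -> Path:
        # Including the model name means switching models never reuses stale vectors.
        digest = hashlib.sha256(f"{self.model_name}\n{text}".encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.json"

    def get(self, text: str) -> list[float]:
        """Return the cached vector, computing and storing it on a miss."""
        path = self._path(text)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        vector = [float(x) for x in self.embed_fn(text)]
        path.write_text(json.dumps(vector), encoding="utf-8")
        return vector
```

A file-per-vector layout is easy to inspect but slow for large corpora; a single binary array file or a small SQLite table would be a reasonable alternative at scale.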
Security, Compliance & Privacy
All processing happens on-premises or on the demonstration machine. No
external LLM APIs are called, no document data leaves the local environment,
and no keys or credentials are embedded in the repo. Access controls (reverse
proxy, basic auth) can be added for enterprise demos. SQLite session stores
are local and portable; for production deployments, swap to an enterprise DB
and encrypted storage as needed.
Limitations & Known Constraints
Model quality and latency depend on local hardware; larger corpora
and heavier models require more RAM or GPU for sub-second
generation.
Ollama and the required model binaries must be downloaded before
offline demos.
Cross-encoder reranking trades a moderate latency penalty for higher
precision; tuning may be required per dataset.
Summarization uses the same model as generation and thus shares its
hallucination profile—careful prompt engineering and constrained
context are used to mitigate this.
How to Demonstrate (Demo Script)
1. Start the Ollama runtime and pull phi4-mini.
2. Start the FastAPI server and visit the UI.
3. Upload a small set of finance/hotel policy PDFs. Wait for indexing.
4. Create a new chat, ask a procedural question (e.g., “What is the loan
approval process?”). Observe streaming text and final sources.
5. Ask a second and third question in the same thread; after the third
exchange, demonstrate that a summary has been created and that
subsequent injections include only the summary plus the last two
exchanges.
6. Show /health output and the SQLite table to prove stored summary and
timings.
This flow highlights offline capability, streaming UX, memory compaction,
and explainability.
Future Work & Roadmap
Add optional fine-tuning (LoRA) on domain-specific datasets to reduce
hallucination further.
Replace SQLite with Redis for fast in-memory session access, with
persistence to an enterprise DB for scale.
Add role-based access, audit trails, and encrypted storage for
compliance.
Explore accelerator support (ONNX, OpenVINO) for faster inference on
CPU.
Add user-level analytics and feedback loop to measure accuracy and
continuously improve retrieval weights.
Appendix — Important Snippets & API Summary
System prompt rule (enforced for all generation):
“Answer only from the provided context and memory; if the information is
not present, reply: ‘Information not available in documents’.”
Essential endpoints:
POST /upload — upload files and trigger indexing.
POST /chat/new — returns thread_id.
POST /chat/{thread_id}/ask — SSE streaming; final event contains
sources and timings.
GET /chat/{thread_id}/history — session history including summaries.
GET /health — validates Ollama, index, embedding readiness.
Sample judge prompts (included in README and recommended for
benchmarking):
“Explain the loan approval process in a bank in structured step-by-step
format. Limit to 150 words.”
“Explain the hotel check-in procedure for a business traveler. Use
bullet points.”
“If a hotel room costs $200 per night for 3 nights with a 10% discount
and 12% tax after discount, show calculation steps and final bill.”
“What is the cancellation policy of Grand Azure Hotel in
Bhubaneswar?” (should respond with “Information not available in
documents” if not indexed).
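For the hotel-bill prompt above, a reference answer is useful when judging model output. The sketch below works through the arithmetic with Decimal to avoid floating-point rounding; the function name and the returned field names are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

CENTS = Decimal("0.01")


def hotel_bill(rate: str, nights: int, discount_pct: str, tax_pct: str) -> dict:
    """Apply the discount to the room subtotal, then tax on the discounted amount."""
    subtotal = Decimal(rate) * nights
    discount = (subtotal * Decimal(discount_pct) / 100).quantize(CENTS, ROUND_HALF_UP)
    discounted = subtotal - discount
    tax = (discounted * Decimal(tax_pct) / 100).quantize(CENTS, ROUND_HALF_UP)
    return {
        "subtotal": subtotal,
        "discount": discount,
        "discounted": discounted,
        "tax": tax,
        "total": discounted + tax,
    }


bill = hotel_bill("200", 3, "10", "12")
# Steps: 200 x 3 = 600; 10% discount = 60, leaving 540; 12% tax on 540 = 64.80.
print(bill["total"])  # 604.80
```

A model answer that applies the tax before the discount, or taxes the undiscounted 600, will arrive at a different total, which makes this prompt a quick check on step ordering.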
Closing Statement
This project is engineered to provide the right combination of engineering
rigor, real-world usefulness, and hackathon-ready polish: robust parsing and
chunking, hybrid retrieval with reranking, memory management with
summarization, and offline quantized generation with streaming for great UX.
The system demonstrates enterprise-grade design trade-offs—privacy, cost-
efficiency, explainability, and deployability—while remaining extensible for
future production hardening.