Tutorial 5
Retrieval-Augmented Generation (RAG)
Retrieval-Augmented Generation (RAG) is a powerful architectural paradigm that enhances the capabilities of
Large Language Models (LLMs) by integrating them with external knowledge sources. While modern LLMs are
capable of generating fluent, coherent, and contextually appropriate text, they remain fundamentally limited by the
knowledge encoded within their parameters during training. These limitations manifest in the form of
hallucinations, outdated knowledge, limited domain specialization, and lack of verifiable grounding. RAG
addresses these challenges by combining parametric knowledge (stored in model weights) with non-parametric
knowledge (stored in external databases), thereby enabling language models to produce responses that are both
contextually rich and factually grounded.
Unlike traditional fine-tuning approaches, which modify the internal parameters of a model to encode new
knowledge, RAG systems retrieve relevant information dynamically at inference time. This makes them
particularly suitable for applications that require up-to-date information, domain-specific expertise, traceability, and
cost efficiency.
Background: Limitations of Large Language Models
Large Language Models, such as GPT, LLaMA, and T5, are typically built upon the Transformer architecture.
These models are trained on massive corpora of text to predict the next token in a sequence or reconstruct masked
tokens.
Despite their impressive generative capabilities, LLMs exhibit several inherent drawbacks:
• Hallucination: The model may generate factually incorrect but linguistically plausible information.
• Outdated knowledge: The model’s knowledge is frozen at the time of training.
• Low efficiency in parameterizing knowledge: Encoding vast amounts of factual information into fixed
parameters is inefficient.
• Weak specialization: LLMs may struggle with highly technical or proprietary domain knowledge.
• Limited explainability: Generated answers are not inherently traceable to specific sources.
Fine-tuning can partially address some of these issues, but it introduces new challenges such as catastrophic
forgetting, high computational cost, and lack of modularity. RAG emerges as an alternative solution that avoids
modifying the model’s internal weights while still enabling knowledge adaptation.
Practical Requirements for a RAG-Based Application:
• Domain-Specific Accuracy: RAG leverages external knowledge bases, so answers are grounded in
domain-relevant sources, reducing hallucinations.
• Regular Data Updates: Since RAG retrieves from up-to-date documents, keeping the knowledge base
current ensures the system reflects the latest information.
• Traceability and Explainability: RAG can cite the sources of retrieved documents, making it clear why a
certain answer was generated.
• Cost Control: Efficient retrieval reduces the need for large-scale LLM computation on every query,
helping manage operational costs.
• Data Privacy: The knowledge base and retrieval mechanisms must ensure sensitive information is
protected and access is controlled.
Core Concept of Retrieval-Augmented Generation
Retrieval-Augmented Generation enhances a generative model by providing it with relevant external documents at
inference time. Instead of relying solely on its internal memory, the model retrieves context from a knowledge base
and generates responses grounded in that retrieved information.
The key idea is simple but powerful:
Retrieve first, then generate.
The generative model no longer acts as a standalone knowledge repository but rather as a reasoning and synthesis
engine operating over retrieved evidence.
The RAG Workflow
The Retrieval-Augmented Generation (RAG) workflow typically involves three key stages: user query,
document retrieval, and answer generation. It integrates a retriever module with a generative model to produce
accurate and context-aware responses.
1)User Query Submission
The process begins when a user submits a question or prompt to the RAG system. This query serves as the starting
point for retrieving relevant information from an external knowledge base.
2) Document Retrieval
This stage involves finding the most relevant documents to provide context for the query.
1. Indexing (Offline Phase)
Before handling queries, the knowledge base — which may include articles, books, databases, or internal
company documentation — is processed and indexed. This involves:
• Document Chunking: Large documents are split into smaller, semantically coherent chunks to fit LLM
context windows and improve retrieval precision. Techniques such as sliding windows, hierarchical
segmentation, and metadata augmentation can be used.
• Embedding Generation: Each chunk is transformed into a dense vector representation (embedding) that
captures its semantic meaning. Models like BERT or Sentence-BERT are commonly used.
• Vector Database Storage: Embeddings are stored in vector databases such as FAISS, Qdrant, Pinecone, or
Weaviate, enabling fast similarity searches at scale.
2. Retrieval (Online Phase)
When a query arrives:
• The query is converted into an embedding.
• A similarity search is performed against the indexed embeddings to identify the top-k most relevant chunks.
• Efficient retrieval often relies on Approximate Nearest Neighbor (ANN) algorithms, such as Hierarchical
Navigable Small Worlds (HNSW), which allow fast, scalable searches.
3. Answer Generation
The retrieved documents are combined with the user’s original query and passed to a generative model (e.g.,
BART, T5) to produce the final answer.
Context Augmentation:
The prompt for the generator typically takes the form:
"Given the following documents: [Document 1], [Document 2], … [Document k], answer the following question:
[User Query]."
Generation:
The LLM synthesizes information from the retrieved documents and the query to generate a coherent and
informative response. In this setup, the model acts not as a memory store, but as a reasoning engine that
summarizes and interprets external knowledge.
Benefits of RAG
• Grounded Answers: Responses are based on verifiable external documents rather than relying solely on
the LLM’s internal knowledge, reducing the risk of inaccuracies.
• Reduced Hallucination: By providing direct access to source material, RAG steers the model toward
factual answers.
• Up-to-Date & Domain-Specific Q&A: Updating the knowledge base allows the system to provide the
latest information and handle specialized or proprietary domains without retraining the LLM.
• Improved Explainability: Because answers come from specific documents, sources can be traced,
increasing transparency and trustworthiness.
• Cost-Effective Customization: Instead of fine-tuning large LLMs for every new domain, updating the
retrieval index is computationally efficient and more practical
Knowledge Representation in LLMs:
LLMs store knowledge in two complementary ways:
• Parametric Knowledge: Encoded directly in model parameters during training. Useful for
general knowledge, but static and may become outdated.
• Symbolic (External) Knowledge: Stored in documents, databases, or knowledge graphs.
RAG retrieves this at runtime, enabling dynamic, up-to-date, and domain-specific responses.
Optimization Implication:
RAG allows the LLM to focus on reasoning while symbolic knowledge provides factual
grounding. This reduces computation, memory requirements, and the need for expensive full
model retraining.
Ways to Optimize LLMs with RAG
1. Prompt Engineering
o Carefully structured prompts guide the LLM to focus on relevant information and
produce high-quality answers without changing model weights.
2. Retrieval-Augmented Generation (RAG)
o Separates knowledge from reasoning: the model generates answers based on retrieved
documents, reducing hallucinations and improving factual accuracy.
3. Instruction-Tuning / In-Context Learning
o Models are trained to follow specific instructions or formats, enhancing task
performance without full fine-tuning.
4. Fine-Tuning
o Adjusting model parameters for domain-specific data improves performance for
narrow, stable tasks but is computationally expensive and less flexible than RAG .
RAG vs Fine-tuning:
Applications of RAG
RAG is particularly useful in scenarios requiring up-to-date knowledge, traceability, or
specialized domain expertise:
• Long-tail data distribution
• Frequent knowledge updates
• Verified and traceable answers
• Specialized domain knowledge
• Data privacy preservation
Evolution and Variants of RAG:
RAG systems have evolved to improve retrieval precision, reasoning depth, and integration
with LLMs.
1. Naive RAG (Basic RAG)
Workflow: Chunk documents → generate embeddings → top-k retrieval → concatenate context
→ single-pass generation.
Step 1: Indexing
1. Divide the document into smaller, even chunks, each representing a portion of the
original text.
2. Generate embeddings for each chunk using an encoding model.
3. Store the embeddings in a vector database for efficient retrieval.
Step 2: Retrieval
• When a query is submitted, retrieve the top-k most relevant chunks from the vector
database using vector similarity search.
Step 3: Generation
• Combine the original query with the retrieved documents and feed them into a large
language model (LLM).
• The LLM produces a single-pass, context-aware answer based on the provided
information.
Limitations:
• May include redundant or irrelevant context.
• Reasoning depth is limited since only a single pass over retrieved information is
performed.
2. Advanced RAG
Introduces optimizations at multiple stages to improve retrieval quality and answer generation:
• Index Optimization: Sliding window chunking, hierarchical document structures,
metadata enrichment, summary-based retrieval.
• Pre-Retrieval Processing: Query rewriting, clarification, routing to specialized
retrievers.
• Post-Retrieval Processing: Re-ranking using cross-encoders, filtering irrelevant
documents, confidence estimation.
These techniques reduce noise and improve the relevance and accuracy of generated answers.
3. Modular RAG
• Concept: Decomposes the pipeline into specialized components: Read → Retrieve →
Rewrite → Rerank → Reflect → Generate.
• Benefits: Supports iterative retrieval, multi-hop reasoning, and adaptive search strategies.
• Example Systems: FLARE, Self-RAG dynamically decide when and how to retrieve,
enabling deeper reasoning.
4. RAG Integration Variants
• RAG-Sequence: The generator produces a partial answer, then the retriever performs
another round of retrieval based on this partial answer. The final output is a combination
of all iterative steps, allowing dynamic, multi-step reasoning.
• RAG-Token: Retrieved documents influence the generation token by token, so each
predicted token is guided by the most relevant context. This allows tighter integration
between retrieval and generation.
• RAG with Re-Ranking: After initial retrieval, a reranker rescoring the top-k documents
ensures that the most pertinent context is used, improving answer quality.
• RAG-Fusion: Multiple queries are generated from the original input, documents are
retrieved for each, and results are combined and reranked before generation. This
produces a richer context and supports multi-hop reasoning.
Key Design Questions in RAG
When designing a RAG system, several critical questions must be addressed to optimize
performance, accuracy, and efficiency.
1. What to Retrieve?
The granularity of retrieval can vary depending on the task and knowledge source:
• Tokens – smallest unit, precise but may lack context.
• Phrases – captures small semantic units.
• Chunks – fixed-length segments of text.
• Paragraphs – preserves context and coherence.
• Entities – named concepts or objects.
• Knowledge Graphs – structured, relational knowledge.
2. When to Retrieve?
Retrieval strategies can differ based on the frequency and context needed:
• Once per query – simplest approach, efficient but may miss context changes during
generation.
• Every N generated tokens – ensures up-to-date context during long generations.
• Adaptive retrieval based on model uncertainty – triggers retrieval when the model
detects insufficient context.
3. How to Use Retrieved Content?
Retrieved information can be integrated at different stages of the LLM pipeline:
• Input Layer (Prompt Augmentation): Concatenate retrieved content with the query.
Simple and widely used.
• Intermediate Layer (Model Fusion): Integrate information inside the model
architecture, allowing deeper reasoning.
• Output Layer (Verification/Filtering): Use retrieved content to check or filter the
model’s generated answer.
Key Issues in RAG
Designing an effective RAG system requires careful consideration of what to retrieve, how to
integrate it, and when to retrieve it, as each choice impacts efficiency, accuracy, and
relevance.
1. What to Retrieve (Retrieval Granularity)
The granularity of retrieval determines the coverage, precision, and computational
requirements:
• Token-level (e.g., kNN-LMM, 2019):
o Broad search that recalls a large amount of information but with low accuracy.
o High coverage but includes redundant information.
o Excels in long-tail and cross-domain queries with high computational
efficiency.
o Requires significant storage.
• Phrase-level (e.g., NPM, 2023):
o Slightly higher semantic structure than tokens.
o Balances coverage and precision.
• Chunk-level (e.g., In-Context RAG, 2023):
o Preserves coherent context segments.
o Efficient for long documents and in-context reasoning.
• Paragraph-level / Entity-level (e.g., EasE, 2022):
o Provides richer semantic and structured information.
o Retrieval efficiency is lower and depends on the quality of the source.
• Knowledge Graphs (2023):
o Highly structured and relational knowledge.
o High semantic value but limited by the quality of the KG and slower retrieval.
2. How to Use Retrieved Content (Integration Position)
Retrieved content can be integrated at different stages of the LLM pipeline, with distinct trade-
offs:
• Input / Prompt Layer:
o Retrieved content is concatenated with the query.
o Pros: Simple and easy to implement.
o Cons: Cannot efficiently incorporate multiple knowledge blocks; limited
optimization space.
• Intermediate / Model Layer:
o Retrieved content is fused within the model architecture.
o Pros: Supports multi-block retrieval and deeper reasoning.
o Cons: Adds complexity and requires additional training.
• Output / Prediction Layer:
o Retrieved content is used to verify or filter the generated output.
o Pros: Ensures output is highly relevant to the retrieved information.
o Cons: Lower computational efficiency due to additional processing.
3. When to Retrieve (Retrieval Frequency)
The frequency of retrieval impacts both efficiency and relevance:
• Once per query (e.g., Replug, 2023):
o Single search during the reasoning process.
o Pros: High efficiency with minimal overhead.
o Cons: Retrieved documents may have low relevance for long or multi-step
reasoning.
• Every N generated tokens (e.g., ATLAS, 2023):
o Updates context periodically during generation.
o Pros: Balances efficiency and contextual relevance.
o Cons: May not always achieve the optimal trade-off between speed and
information coverage.
• Adaptive retrieval (e.g., FLARE, 2023):
o Retrieval is triggered based on model uncertainty or information need.
o Pros: Retrieves large amounts of relevant information, improving multi-step
reasoning.
o Cons: Lower efficiency and may include redundant information if not carefully
managed.
Key Technologies and Evaluation of RAG
RAG systems rely on several technologies and optimization techniques to improve retrieval
quality, reasoning depth, and answer relevance. Evaluating their effectiveness is equally
important to ensure accuracy, robustness, and trustworthiness.
Techniques for Better RAG
1. Data Indexing Optimization
Optimizing how documents are indexed improves retrieval efficiency and
precision:
• Small-to-Big Embeddings: Generate embeddings at the sentence level,
expanding the context window during generation.
• Chunk Optimization: Use sliding windows so each chunk covers the entire
text, avoiding semantic ambiguity.
• Summary-Based Retrieval: Retrieve documents first through summaries,
then fetch detailed text blocks.
• Metadata Enhancement:
o Include document metadata such as page, title, or pseudo-metadata.
o Generate hypothetical documents or questions to improve retrieval
relevance.
o Apply metadata filtering/enrichment by annotating documents and
using metadata as additional query filters.
2. Structured Corpus
• Summary-to-Document Retrieval: Retrieve summaries instead of entire
documents to identify relevant nodes and explore associated content.
• Hierarchical Corpus Organization: Documents may contain embedded
objects (tables, charts). First retrieve entity references, then query the
underlying objects (blocks, databases, sub-nodes).
3. Retrieval Source Optimization
• Supports multiple types of sources:
o Unstructured Data: Text documents, articles.
o Structured Data: Triples, subgraphs, tables.
o LLM Memory: Generated text or code.
o Cross-lingual Prompts and Phrases.
4. Knowledge Graph as a Retrieval Source
• GraphRAG: Extract entities from user queries, build subgraphs, and feed
them to the LLM for generation.
• Implementation Steps:
1. Extract key entities from the question using LLMs or other models.
2. Retrieve subgraphs up to a defined depth (e.g., 2 hops).
3. Use the subgraph context for answer generation.
5. Query Optimization
• Adjusting the query can improve retrieval relevance:
o Query Rewriting
o Rewrite-Retrieve-Read
o Query Clarification
6. Embedding Optimization
• Selecting a Suitable Embedding Provider to match domain requirements.
• Fine-tuning the Embedding Model for domain-specific semantic
representation.
7. Retrieval Process Optimization
• Iterative Retrieval: Retrieve information repeatedly to acquire deeper
knowledge.
• Adaptive Retrieval: LLM determines timing and scope dynamically.
8. Hybrid Approaches (RAG + Fine-Tuning)
• Retriever Fine-Tuning & Generator Fine-Tuning:
Techniques:
o R-FT: Minimize KL divergence between retriever distribution and
LLM preferences.
o LM-FT: Maximize likelihood of correct answers given retrieval-
augmented instructions.
o Collaborative fine-tuning enables general-purpose retrieval plugins
with structural information integration.
Evaluating the Effectiveness of RAG
RAG evaluation focuses on both retrieval quality and generation quality:
1. Retriever Evaluation
• Measures the quality of text blocks retrieved by the query.
• Metrics: MRP, Hit Rate, NDCG
2. Generator / Synthesis Evaluation
• Measures the quality of answers synthesized using retrieved documents.
Evaluation Strategies:
• Independent Evaluation: Assess retrieved context or answer relevance separately.
• End-to-End Evaluation: Evaluate the final output generated by the model.
Metrics with Labels:
• Exact Match (EM), Accuracy
Metrics without Labels:
• Fidelity, Relevance, Harmlessness
Evaluation Dimensions:
• Context Relevance: Are retrieved documents relevant to the query?
• Answer Fidelity: Is the answer based on retrieved context?
• Answer Relevance: Is the answer relevant to the user’s query?
Key Capabilities to Measure
• Noise Robustness: Can the model extract useful information from noisy documents?
• Negative Rejection: Does the model refuse to answer when knowledge is absent?
• Information Integration: Can the model combine information from multiple documents
for complex queries?
• Counterfactual Robustness: Can the model recognize potential factual errors in
retrieved content?
Evaluation Methods:
• Human evaluation
• Automatic evaluation using LLMs as judges
• Tools: TruLens, RAGAS, ARES
o Evaluate Answer Fidelity, Answer Relevance, Contextual Relevance
o Use synthetic datasets, fine-tuning, and confidence ranking for automated
assessment
Popular Frameworks
Several open-source and commercial frameworks simplify the implementation of RAG
systems:
• Facebook RAG (original implementation): The seminal paper from
Facebook AI introduced the RAG model that combined a pre-trained retriever
(DPR) and a pre-trained generator (BART).
• LangChain: A popular Python framework designed to build applications with
LLMs. It provides modular components and chains to easily integrate LLMs with
external data sources, including var-
ious retrievers and vector databases.
LlamaIndex: Another Python framework focused on helping developers build
LLM applications over their own data. It provides tools for data indexing, retrieval,
and integration with LLMs.
Haystack: An open-source NLP framework that helps developers build powerful
semantic search and conversational AI applications. It offers flexible components
for building end-to-end RAG pipelines.
RAG has become a standard and highly effective technique for building reliable, factual,
and up-to-date LLM-powered applications across a wide range of use cases.
Fine-Tuning vs. Prompt Engineering
Fine-tuning is a common approach to adapt a language model to specific tasks or domains.
However, it has both advantages and limitations, and prompt engineering can sometimes
provide a lighter alternative.
Cons of Fine-Tuning:
• High Cost: Fine-tuning large models requires significant computational resources.
• Model Capacity Limits: The model may not have enough parameters to fully capture all
the new knowledge. This limitation motivated models like LLaMA, which come in 7B,
13B, and 70B parameter variants.
• Non-Additive Learning: Fine-tuning can overwrite existing knowledge. For example, a
model trained on English that is heavily fine-tuned on Japanese may “forget” its English
capabilities.
Mitigation: Prompt engineering can partially address this by “teaching” the model new tasks
without modifying its parameters. Techniques such as few-shot prompting allow the model to
perform new tasks by providing examples in the input prompt.
Pros of Fine-Tuning:
• Higher Quality Outputs: Fine-tuned models generally provide more accurate and
reliable results compared to prompt-based approaches.
• Reduced Context Size: During inference, fine-tuning eliminates the need to include
large instructions or context in the input, enabling more efficient use of the model’s
context window.
Sentence Embeddings: Comparison
To determine whether two sentences have similar meaning, we typically compare their
embeddings using a similarity metric:
• Cosine Similarity: Measures the cosine of the angle between two vectors.
o If the vectors point in the same direction, the cosine similarity is high, indicating
that the sentences are semantically similar.
The Challenge
• Pretrained models like BERT were not explicitly trained to produce embeddings that are
directly comparable with cosine similarity.
• This means that even if two sentences have similar meaning, their embeddings might not
necessarily point in the same direction.
The Solution
• To make embeddings comparable, we need to fine-tune the model or use specialized
techniques to align the embeddings with our chosen similarity function.
• Approaches include:
o Contrastive Learning: Encourages embeddings of semantically similar sentences
to be close together while pushing dissimilar sentences apart.
o Sentence-BERT (SBERT): A modification of BERT specifically designed to
produce embeddings suitable for cosine similarity comparison.
Sentence-BERT (SBERT) Architecture
Sentence-BERT (SBERT) modifies BERT to produce meaningful sentence embeddings that
can be directly compared using cosine similarity.
1 Siamese Network Structure
SBERT uses a Siamese Network architecture, meaning:
• Sentence A → BERT → Pooling → Sentence Embedding A
• Sentence B → BERT → Pooling → Sentence Embedding B
Both sentences pass through:
• The same BERT architecture
• The same parameters
• The same weights
This shared structure ensures that embeddings are generated in the same vector space, making
them directly comparable.
2 Pooling Layer
Since BERT outputs token-level embeddings, SBERT applies a pooling operation to obtain a
fixed-size sentence embedding.
Common pooling strategies include:
• Mean Pooling – Average of all token embeddings (most commonly used).
• Max Pooling – Maximum value across tokens.
• [CLS] Token – Use the special classification token as the sentence representation.
Optionally, a linear layer can be added after pooling to reduce the embedding size (e.g., from
768 to 512 dimensions) for efficiency.
3 Similarity Computation
Once we obtain:
• Sentence Embedding A
• Sentence Embedding B
We compute:
• Cosine Similarity between the two embeddings.
4 Training Objective
During training:
• The model predicts a cosine similarity score.
• This predicted similarity is compared to a target similarity score.
• The loss function used is typically Mean Squared Error (MSE).
This training process encourages:
• Semantically similar sentences → embeddings close together.
• Dissimilar sentences → embeddings far apart.
Strategies to Teach New Concepts to LLMs
1) Fine-Tuning
Base LLM + Custom Training Data = Fine-Tuned LLM
• Updates model weights.
• Knowledge becomes part of the model.
• High-quality results but computationally expensive.
2) Retrieval-Augmented Generation (RAG)
Base LLM + Vector Database (Embeddings) = RAG System
• Keeps model weights unchanged.
• Retrieves relevant documents during inference.
• Easy to update knowledge externally.
3) Hybrid Approach (Fine-Tuning + RAG)
Fine-Tuned LLM + RAG = Enhanced System
• Combines domain adaptation with external retrieval.
• More accurate and scalable.
Vector Database:
A Vector Database stores fixed-dimension vectors called embeddings. These embeddings
represent text, images, audio, or other data in numerical form.
The main purpose of a vector database is to:
• Search for embeddings that are most similar to a given query vector.
• Use a distance metric, typically:
o Cosine Similarity
o Euclidean Distance
To perform similarity search efficiently, vector databases use algorithms such as:
• K-Nearest Neighbors (KNN)
• Approximate Nearest Neighbor (ANN) methods
How It Works in RAG
1. Documents or web pages are collected.
2. They are split into smaller chunks.
3. Each chunk is converted into an embedding vector.
4. The embeddings are stored in the vector database.
When a user asks a question:
• The query is converted into an embedding.
• The vector database searches for the Top-K most similar vectors.
• The corresponding text chunks are retrieved as context.
• The LLM uses this context to generate the final answer.
Top-K KNN: Naïve Approach
A simple similarity search method is:
1. Compare the query vector with all vectors in the database.
2. Compute the distance (e.g., cosine similarity).
3. Sort them by similarity.
4. Keep the Top-K closest vectors.
However, this brute-force approach is inefficient for large datasets, which is why optimized
ANN algorithms are commonly used in real-world systems.
Similarity Search: Trading Precision for Speed
The naïve KNN approach guarantees accurate results because it compares the query vector with
all vectors in the database. However, this becomes computationally expensive for large datasets.
Instead, we can reduce the number of comparisons while still achieving highly accurate results
with high probability.
In similarity search, the most important metric is usually recall — the ability to retrieve most of
the true nearest neighbors, even if the search is approximate.
To achieve this balance between speed and accuracy, modern systems use Approximate
Nearest Neighbor (ANN) algorithms, such as:
Hierarchical Navigable Small Worlds (HNSW)
HNSW is widely used in production systems.
For example, it powers Qdrant, the open-source vector database used by Twitter’s (X) Grok
LLM for real-time tweet retrieval.
HNSW: Core Idea #1 — Small World Networks
HNSW evolves from the Navigable Small Worlds (NSW) algorithm.
It is inspired by the concept of Six Degrees of Separation, introduced by Milgram’s experiment:
• Participants in Nebraska and Kansas had to deliver a letter to a person in Boston.
• They could only forward the letter to someone they personally knew.
• Most letters reached the target in about 5–6 steps.
This showed that social networks are highly connected with short paths between nodes.
Similarly, in vector search:
• Each vector is connected to a small number of “neighbor” vectors.
• The graph structure allows fast navigation toward relevant nodes.
• The total number of connections per node remains limited.
Navigable Small Worlds (NSW): How It Works
NSW builds a graph where:
• Each node represents a vector (e.g., a document chunk).
• Each node connects to a small number of nearby vectors.
• The graph enables efficient traversal toward similar vectors.
Searching for K-Nearest Neighbors
Given a query like:
“How many encoder layers are there in the Transformer model?”
The algorithm:
1. Starts from a random node.
2. Moves to the closest neighbor.
3. Continues moving toward better (closer) nodes.
4. Stops when no neighbor is closer than the current node (local optimum).
5. Repeats from different random starting points.
6. Keeps the Top-K best results among visited nodes.
This avoids scanning the entire database.
Inserting a New Vector in NSW
To insert a new vector:
1. Use the search procedure to find its Top-K nearest neighbors.
2. Create connections (edges) between the new vector and those neighbors.
3. The graph structure updates dynamically.
HNSW: Core Idea #2 — Hierarchical Structure
To improve NSW further, HNSW introduces a hierarchical structure, inspired by the Skip List
data structure.
A Skip List:
• Maintains sorted elements.
• Allows search and insertion in average O(log N) time.
• Uses multiple layers with different densities.
HNSW applies the same idea to vector search.
HNSW: Hierarchical Navigable Small Worlds
HNSW builds multiple graph layers:
• Upper layers → Sparse graph (few nodes, long-range connections).
• Lower layers → Dense graph (many nodes, short-range connections).
Search Process in HNSW
1. Start at the top (sparse) layer.
2. Quickly navigate toward the approximate region of the query.
3. Move down layer by layer.
4. Refine the search in the dense bottom layer.
5. Return the Top-K nearest neighbors
Why HNSW is Powerful
• Much faster than brute-force KNN.
• Maintains high recall.
• Scales well to millions or billions of vectors.
• Used in modern vector databases and RAG systems.