Chapter 8: Retrieval-Augmented Generation (RAG)
Unit: Advanced DL
What is it?
• LLMs have no knowledge of events after their
training cut-off, and they do not know private
or proprietary information. Retrieval-augmented
generation (RAG) is a technique that helps
address these limitations.
• Retrieval-Augmented Generation (RAG) is the
process of optimizing the output of a large
language model (LLM), so it references an
authoritative knowledge base outside of its
training data sources before generating a response.
• RAG involves several steps: data collection, data
chunking, document embeddings, handling user
queries, and generating responses using an LLM.
RAG
• Retrieval-Augmented Generation (RAG) is an
advanced artificial intelligence (AI) technique.
• It combines the capabilities of a pre-trained large
language model with an external data source.
• This approach merges the generative power of
LLMs with the precision of specialized data search
mechanisms.
• The result is a system that can provide more
nuanced and accurate responses.
Architecture & Key Components
• At a high level, there are two components: the retriever and the generator. As the name suggests, the
retriever is responsible for retrieving the information, and the generator is the LLM, used to generate the
text.
• Retrieval and pre-processing:
• RAGs leverage powerful search algorithms to query external data, such as web pages, knowledge bases, and databases.
• Once retrieved, the relevant information undergoes pre-processing, including tokenization, stemming, and removal of stop
words.
• Grounded generation:
• The pre-processed retrieved information is then seamlessly incorporated into the input of the pre-trained LLM.
• This integration enhances the LLM's context, providing it with a more comprehensive understanding of the topic.
• This augmented context enables the LLM to generate more precise, informative, and engaging responses.
• Key Components:
• The knowledge base
• The retriever
• The integration layer
• The generator
• The ranker
• The output handler
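The pre-processing named under "Retrieval and pre-processing" above (tokenization, stemming, stop-word removal) can be sketched in plain Python. This is only an illustration: the stop-word list is deliberately tiny, and `naive_stem` is a hypothetical suffix-stripping helper standing in for a real stemmer from an NLP library. These steps matter mostly for keyword-based retrieval; embedding models typically take the raw text.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "for", "of", "to", "what", "in"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token: str) -> str:
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("What is the return policy for damaged items?")
print(tokens)  # ['return', 'policy', 'damag', 'item']
```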
Vector Database
• A vector database stores, manages and
indexes high-dimensional vector data.
• Data points are stored as arrays of numbers
called "vectors," which are clustered based on
similarity.
• The best way of chunking a long text will
depend on the types of texts and queries your
system anticipates:
• Each sentence is a chunk
• Each paragraph is a chunk
• Overlapping window of paragraphs
• Vector databases examples:
• ChromaDB, Pinecone, & Faiss
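To make the idea of a vector database concrete, here is a minimal in-memory sketch. `TinyVectorStore` is a hypothetical class written for this example, not the API of ChromaDB, Pinecone, or Faiss, and the three-dimensional vectors are made up by hand. Real vector databases add approximate indexes so they do not have to compare the query against every stored vector.

```python
import math


class TinyVectorStore:
    """Keeps (id, vector, text) records in a list and searches them by brute force."""

    def __init__(self) -> None:
        self._records: list[tuple[str, list[float], str]] = []

    def add(self, doc_id: str, vector: list[float], text: str) -> None:
        self._records.append((doc_id, vector, text))

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def query(self, vector: list[float], k: int = 2) -> list[tuple[str, float]]:
        """Return the ids and cosine scores of the k most similar stored vectors."""
        scored = [(doc_id, self._cosine(vector, v)) for doc_id, v, _ in self._records]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore()
# Hand-made vectors for illustration; a real system would get them from an embedding model.
store.add("returns", [0.9, 0.1, 0.0], "Return policy for damaged items")
store.add("shipping", [0.1, 0.9, 0.0], "Shipping times and costs")
store.add("warranty", [0.7, 0.2, 0.1], "Warranty coverage")

top = store.query([1.0, 0.0, 0.0], k=2)
print([doc_id for doc_id, _ in top])  # ['returns', 'warranty']
```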
RAG Applications with Examples
Example: Advanced Question-Answering System
• Scenario: Imagine a customer support chatbot for an online store. A customer asks, "What is the return policy for a damaged item?"
• RAG in action: The chatbot retrieves the store's return policy document from its knowledge base. RAG then uses this information to generate a clear and concise answer like, "If your item is damaged upon arrival, you can return it free of charge within 30 days of purchase. Please visit our returns page for detailed instructions."
Example: Content Creation and Summarization
• Scenario: You're building a travel website and want to create a summary of the Great Barrier Reef.
• RAG in action: RAG can access and process vast amounts of information about the Great Barrier Reef from various sources. It can then provide a concise summary highlighting key points like its location, size, biodiversity, and conservation efforts.
Example: Educational Tools and Resources
• Scenario: An online learning platform for science courses. A student is studying the human body and has a question about the function of the heart.
• RAG in action: The platform uses RAG to access relevant information about the heart's anatomy and function from the course materials. It then presents the student with an explanation, diagrams, and perhaps even links to video resources, all tailored to their specific learning needs.
How Does RAG Work? (1/5)
Step 1: Data collection
• You must first gather all the
data that is needed for your
application.
• Example: in the case of a
customer support chatbot
for an electronics company,
this can include user
manuals, a product
database, and a list of FAQs.
How Does RAG Work? (2/5)
Step 2: Data chunking
• Data chunking involves breaking down
large datasets into smaller, more
focused pieces.
• This makes it easier to retrieve
relevant information and
improves efficiency by avoiding
unnecessary processing.
• By organizing data into specific
topics, you can ensure that
search results are directly
applicable to the user's query.
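The three chunking strategies listed on the Vector Database slide (sentence, paragraph, overlapping window of paragraphs) can be sketched as follows. This is a minimal version that assumes sentences end in `.`, `!`, or `?` and that paragraphs are separated by blank lines; the function names and the sample manual text are invented for the example.

```python
import re


def sentence_chunks(text: str) -> list[str]:
    """One chunk per sentence, using a simple punctuation-based split."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def paragraph_chunks(text: str) -> list[str]:
    """One chunk per paragraph, where paragraphs are separated by blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def overlapping_windows(paragraphs: list[str], size: int = 2, step: int = 1) -> list[str]:
    """Join `size` consecutive paragraphs, advancing `step` (>= 1) paragraphs each time."""
    windows = []
    for start in range(0, len(paragraphs), step):
        windows.append(" ".join(paragraphs[start : start + size]))
        if start + size >= len(paragraphs):
            break  # this window already reaches the end of the text
    return windows


manual = "Press the power button. Wait for the light.\n\nPair the device.\n\nCharge it fully."
paras = paragraph_chunks(manual)
print(sentence_chunks(paras[0]))
print(overlapping_windows(paras, size=2, step=1))
```

With `step` smaller than `size`, neighbouring windows share paragraphs, so a passage that straddles a chunk boundary still appears whole in at least one chunk.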
How Does RAG Work? (3/5)
Step 3: Document embeddings
• Now that the source data has
been broken down into smaller
parts, it needs to be converted
into a vector representation.
• This involves transforming text
data into embeddings, which are
numeric representations that
capture the semantic meaning
behind text.
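To show the shape of this step, the sketch below turns text chunks into fixed-length vectors. `toy_embed` is a hypothetical stand-in that simply hashes words into buckets; it does not capture semantic meaning the way a trained embedding model does, and it is used here only so the example runs without external dependencies.

```python
import hashlib
import math
import re

DIM = 16  # toy dimensionality; real embedding models use hundreds or thousands


def toy_embed(text: str, dim: int = DIM) -> list[float]:
    """Map text to a fixed-length unit vector by hashing each word into a bucket."""
    vector = [0.0] * dim
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # hashlib gives the same bucket on every run, unlike Python's built-in hash().
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vector[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


chunks = [
    "Return damaged items within 30 days.",
    "Charge the battery fully before first use.",
]
embeddings = [toy_embed(c) for c in chunks]
print(len(embeddings), len(embeddings[0]))  # 2 16
```

In a real pipeline, `toy_embed` would be replaced by a call to an embedding model, and each vector would be stored next to its chunk text.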
How Does RAG Work? (4/5)
Step 4: Handling user queries
• User queries are converted into
embedding or vector representations.
• The same model is used for both
document and query embeddings to
maintain consistency.
• The system compares the query
embedding with document embeddings.
• It retrieves data chunks whose
embeddings are most similar to the
query, using cosine similarity or
Euclidean distance.
• The retrieved chunks are considered the
most relevant to the user’s query.
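The comparison step can be sketched with both metrics named above. The chunk ids and the three-dimensional vectors are invented for illustration; in practice they would come from the same embedding model used for the documents.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, chunk_vecs, k=2, metric="cosine"):
    """Return the ids of the k chunks closest to the query under the chosen metric."""
    if metric == "cosine":
        # Higher similarity means more relevant, so sort in descending order.
        key, reverse = (lambda cid: cosine_similarity(query_vec, chunk_vecs[cid])), True
    elif metric == "euclidean":
        # Smaller distance means more relevant, so sort in ascending order.
        key, reverse = (lambda cid: euclidean_distance(query_vec, chunk_vecs[cid])), False
    else:
        raise ValueError(f"unknown metric: {metric}")
    return sorted(chunk_vecs, key=key, reverse=reverse)[:k]


# Hand-made unit-length vectors (assumed values, not real embeddings).
chunk_vecs = {
    "return_policy": [0.8, 0.6, 0.0],
    "shipping_times": [0.0, 0.6, 0.8],
    "warranty": [0.6, 0.8, 0.0],
}
query_vec = [1.0, 0.0, 0.0]

print(top_k(query_vec, chunk_vecs, metric="cosine"))     # ['return_policy', 'warranty']
print(top_k(query_vec, chunk_vecs, metric="euclidean"))  # ['return_policy', 'warranty']
```

The two metrics agree here because every vector has length 1: for unit vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so both produce the same ranking. For vectors that are not normalized, the rankings can differ.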
How Does RAG Work? (5/5)
Step 5: Generating responses with
an LLM
• The retrieved text chunks, along
with the initial user query, are fed
into a language model.
• The model uses this
information to generate a
coherent response to the user’s
question, typically delivered
through a chat interface.
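A minimal sketch of this step: the retrieved chunks and the user query are combined into one prompt, which is then passed to a model. The prompt wording is only one possible template, and `fake_llm` is a hypothetical placeholder; a real application would call an actual LLM client in its place.

```python
from typing import Callable


def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one grounded prompt."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question: str, chunks: list[str], llm: Callable[[str], str]) -> str:
    """Send the grounded prompt to any text-in, text-out model function."""
    return llm(build_prompt(question, chunks))


def fake_llm(prompt: str) -> str:
    """Placeholder model (hypothetical): only reports the size of the prompt it received."""
    return f"[stub reply to a {len(prompt)}-character prompt]"


question = "What is the return policy for a damaged item?"
retrieved = ["Damaged items can be returned free of charge within 30 days of purchase."]
print(build_prompt(question, retrieved))
print(answer(question, retrieved, fake_llm))
```

Passing the model in as a function keeps the prompt-building logic independent of any particular LLM provider.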
Indexing stage
To carry out the steps required to generate responses with LLMs, you can use a data
framework or retrieval tool such as:
• LlamaIndex: enables LLMs to access and interact with data from various sources by indexing it for
efficient querying, without retraining the model.
• LangChain: Designed to help developers create applications that combine LLMs with external
tools and data sources, like databases, APIs, and search systems.
• FAISS (Facebook AI Similarity Search): A library for efficient similarity search and clustering of
dense vectors, used for fast nearest neighbor searches in vector embeddings.
• Pinecone: A managed vector database that enables efficient retrieval of relevant data chunks for
RAG by storing and searching through large sets of embeddings.
• Elasticsearch: Though primarily a text search engine, it supports vector-based search and can be
integrated with LLMs for RAG, combining traditional keyword search with dense retrieval.
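The last bullet mentions combining keyword search with dense retrieval. One common way to merge two ranked result lists is reciprocal rank fusion (RRF), sketched below in plain Python. This illustrates the general idea only and is not a description of how any particular product implements it; the document ids are invented, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each list contributes 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Assumed outputs of a keyword retriever and a vector retriever for the same query.
keyword_hits = ["faq_returns", "manual_p12", "faq_shipping"]
vector_hits = ["policy_returns", "faq_returns", "manual_p12"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['faq_returns', 'manual_p12', 'policy_returns', 'faq_shipping']
```

Documents found by both retrievers rise to the top, while documents found by only one are kept but ranked lower.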
Benefits
• Cost-efficient AI implementation and AI scaling
• Access to current domain-specific data
• Lower risk of AI hallucinations
• Increased user trust
• Expanded use cases
• Enhanced developer control and model maintenance
• Greater data security
RAG: challenges (1/3)
• Integration complexity:
It can be difficult to integrate a retrieval system with an LLM. This
complexity increases when there are multiple sources of external data
in varying formats.
• To overcome this challenge, separate modules can be designed to
handle different data sources independently.
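The idea of one module per data source can be sketched as a set of small loaders that all return the same record type. The `Document` class, the loader names, and the sample FAQ, product, and manual data are assumptions made for this example; they follow the electronics-company scenario from the data collection step.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass
class Document:
    source: str
    text: str


def load_faq_json(raw: str) -> list[Document]:
    """Parse a JSON list of {"question": ..., "answer": ...} objects."""
    return [Document("faq", f"{item['question']} {item['answer']}") for item in json.loads(raw)]


def load_products_csv(raw: str) -> list[Document]:
    """Parse CSV rows that have `name` and `description` columns."""
    rows = csv.DictReader(io.StringIO(raw))
    return [Document("products", f"{row['name']}: {row['description']}") for row in rows]


def load_manual_text(raw: str) -> list[Document]:
    """Treat each blank-line-separated paragraph of a manual as one document."""
    return [Document("manual", p.strip()) for p in raw.split("\n\n") if p.strip()]


# Each source format gets its own loader; the rest of the pipeline sees only Documents.
LOADERS = {"faq": load_faq_json, "products": load_products_csv, "manual": load_manual_text}


def ingest(sources: dict[str, str]) -> list[Document]:
    docs: list[Document] = []
    for kind, raw in sources.items():
        docs.extend(LOADERS[kind](raw))
    return docs


docs = ingest({
    "faq": '[{"question": "Can I return a damaged item?", "answer": "Yes, within 30 days."}]',
    "products": "name,description\nX100,Wireless headphones\n",
    "manual": "Charge the device fully.\n\nPress the power button.",
})
print([d.source for d in docs])  # ['faq', 'products', 'manual', 'manual']
```

Supporting a new format then means adding one loader and one entry in `LOADERS`, without touching chunking, embedding, or retrieval code.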
RAG: challenges (2/3)
• Scalability
As the amount of data increases, it gets more challenging to maintain
the efficiency of the RAG system. Operations such as generating embeddings,
comparing them with the query, and retrieving data in real time are
computationally intensive and can slow down the system as the size of the
source data increases.
• To address this challenge, you can distribute computational load
across different servers and invest in robust hardware infrastructure.
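Distributing the search load can be sketched by splitting the index into shards, searching each shard separately, and merging the partial results. In this minimal sketch, threads stand in for separate servers, and the shard contents are made-up two-dimensional vectors scored with a plain dot product.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_shard(shard: dict[str, list[float]], query: list[float], k: int):
    """Return the k best (score, chunk_id) pairs from one shard."""
    scored = ((dot(query, vec), chunk_id) for chunk_id, vec in shard.items())
    return heapq.nlargest(k, scored)


def search_all(shards: list[dict[str, list[float]]], query: list[float], k: int = 2) -> list[str]:
    """Query every shard in parallel, then merge the partial top-k lists."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, query, k), shards))
    merged = heapq.nlargest(k, (hit for partial in partials for hit in partial))
    return [chunk_id for _, chunk_id in merged]


# Two shards standing in for two servers (assumed data).
shards = [
    {"a": [0.9, 0.1], "b": [0.2, 0.8]},
    {"c": [0.7, 0.3], "d": [0.1, 0.9]},
]
print(search_all(shards, [1.0, 0.0], k=2))  # ['a', 'c']
```

Because each shard returns only its own top k, the merge step handles at most k results per shard, regardless of how large the shards are.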
RAG: challenges (3/3)
• Data quality
The effectiveness of a RAG system depends heavily on the quality of data being
fed into it. If the source content accessed by the application is poor, the responses
generated will be inaccurate.
• To address this challenge, organizations must invest in a diligent content
curation and refinement process. It is necessary to refine data sources to
enhance their quality.
For commercial applications, it can be beneficial to involve a subject matter expert
to review and fill in any information gaps before using the dataset in a RAG
system.
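Part of content curation can be automated before a human reviewer gets involved. The sketch below drops duplicate chunks and flags very short ones for review; the `min_words` threshold and the sample chunks are assumptions, and a real curation process would apply further checks chosen by the subject matter expert.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical chunks compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def curate(chunks: list[str], min_words: int = 4) -> tuple[list[str], list[str]]:
    """Split chunks into (kept, flagged_for_review), dropping duplicates."""
    kept, flagged, seen = [], [], set()
    for chunk in chunks:
        key = normalize(chunk)
        if key in seen:
            continue  # identical to an earlier chunk after normalization
        seen.add(key)
        if len(key.split()) < min_words:
            flagged.append(chunk)  # too short to be useful alone; send to a reviewer
        else:
            kept.append(chunk)
    return kept, flagged


raw_chunks = [
    "Returns are free within 30 days of purchase.",
    "returns are free within  30 days of purchase.",
    "TODO",
    "Damaged items can be returned free of charge.",
]
kept, flagged = curate(raw_chunks)
print(kept)
print(flagged)  # ['TODO']
```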