CSC 390: Generative AI
Lesson 4:
Retrieval Augmented
Generation
RAG
I N S T R U C TO R :
M U B A R A K M O H A MMA D
LLMs
LANGE LANGUAGE MODELS
GENERATIVE AI 2
RAG Retrieval Augmented
Generation
GENERATIVE AI 3
Large Language Models (LLMs)
▪Large language models (LLMs) are a category of deep
learning models trained on immense amounts of data,
making them capable of understanding and generating
natural language and other types of content to perform a
wide range of tasks.
▪LLMs are built on a type of neural network architecture
called a transformer which excels at handling sequences of
words and capturing patterns in text.
GENERATIVE AI 4
GENERATIVE AI 5
▪LLMs are easily accessible to the public through interfaces
like:
•Anthropic’s Claude,
•DeepSeek
•Open AI’s ChatGPT,
•Microsoft’s Copilot,
•Meta’s Llama models,
•Google’s Gemini assistant,
•And many open source:
◦ Mistral, Qwen, GLM
GENERATIVE AI 6
▪Building an LLM from scratch is a complex and resource-intensive
process.
▪The most popular LLMs are the result of immense amounts of data,
GPUs, energy and human expertise, which is why most are built and
maintained by large tech companies with expansive resources.
▪many of these models are accessible to all developers through APIs.
▪Developers can use pretrained models to build chatbots, knowledge
retrieval systems, automation tools and more.
▪For more control over data and customization, many open-source
models can be deployed locally or in the cloud.
▪Github, Hugging Face, Kaggle and other platforms make AI
development accessible to all
GENERATIVE AI 7
Open LLMs leaderboard
GENERATIVE AI 8
RAG
RETRIEVAL AUGMENTED GENERATION
GENERATIVE AI 9
The Knowledge Problem in AI
▪Imagine you're talking to a brilliant professor who
graduated in 2023. You ask them about a scientific
breakthrough that happened last week.
•Despite their expertise, they can't help you—their
knowledge has a cutoff date.
▪Now imagine that same professor, but with instant access
to a vast library where they can quickly find and read the
most relevant books before answering your question.
•This is the essence of Retrieval-Augmented Generation
(RAG) : it utilizes available up-to-date resources to
enhance the capabilities of LLMs
GENERATIVE AI 10
Domain Private Knowledge
▪Imagine asking ChatGPT about a student’s grade in a
specific course!
▪Imagine a doctor asking DeepSeek about a specific patient’s
medical history!
▪Imagine an accountant asking Gemini about an employee’s
number of sick leaves taken this year!
These are private information stored in private database and
not accessible to external General AI LLMs.
GENERATIVE AI 11
Fundamental Limitations of LLMs
▪Knowledge Cut-off: They only know information from their
training data, which has a cutoff date
▪Hallucination: They can confidently generate incorrect
information when uncertain
▪No Private Knowledge: They can't access your company's
documents, personal notes, or proprietary data
GENERATIVE AI 12
Example Hallucination by Claude
GENERATIVE AI 13
Example Cut-Off Limitation by DeepSeek
▪DeepSeek answer without using RAG
GENERATIVE AI 14
RAG Solution
▪RAG solves these problems by combining the reasoning
power of LLMs with the ability to retrieve relevant
information from external sources.
▪It's not just an incremental improvement—it's a paradigm
shift that makes AI practical for real-world applications.
GENERATIVE AI 15
Example: ChatGPT uses RAG
ChatGPT answer using RAG
GENERATIVE AI 16
How did ChatGPT answer the
question?
▪User: "Who won the Nobel Prize in Physics in 2025?"
▪RAG System:
•1. Searches knowledge base, web search, etc
•2. Finds the official Nobel Prize announcement
•3. Scrapes information from websites and news
•4. Returns: "According to the Nobel Prize website,
[actual winners] won for their work on..."
← Grounded in retrieved facts!
GENERATIVE AI 17
What is RAG?
▪Retrieval-Augmented Generation is a technique that
enhances LLM responses by:
•Retrieving relevant information from a knowledge base:
◦ Internal databases, documents, videos, audios, websites, etc
•Augmenting the prompt with this contextual information:
◦ Using relevant information from the retrieved knowledge and add
it as context to the Prompt
•Generating an answer grounded in retrieved facts:
◦ LLM generates a response (text, image, video, etc) that takes into
consideration the data being provided in the input prompt from
the RAG context information
GENERATIVE AI 18
Enabling Private & Specialized Knowledge
▪RAG transforms generic AI into specialized domain experts.
▪For example: A medical RAG system can access:
•Latest research papers
•Clinical guidelines
•Hospital protocols
•Patient histories (with proper authorization)
GENERATIVE AI 19
Real-World RAG Applications
▪Legal: Lawyers use RAG to search thousands of case files
▪Healthcare: Doctors get evidence-based treatment
recommendations
▪Finance: Analysts query years of market reports
▪Education: Students get personalized tutoring from course
materials
GENERATIVE AI 20
RAG System Architecture
GENERATIVE AI 21
RAG System Architecture
▪Data Ingestion (Offline): It involves the process of
collecting, processing, and storing data in a format that can
be efficiently retrieved and used by the RAG model.
GENERATIVE AI 22
Key Steps in Data Ingestion
1. Data Collection: Gathering information from various sources such as
databases, APIs, web scraping, or file systems.
2. Data Cleaning: Preprocessing the collected data to remove noise,
handle missing values, and standardize formats.
3. Document Splitting: Breaking down large documents into smaller,
manageable chunks for more effective retrieval.
4. Metadata Extraction: Identifying and extracting relevant metadata
from the documents to enhance retrieval capabilities.
5. Embedding Generation: Creating vector representations of the text
chunks to enable semantic search.
6. Indexing and Storage: Organizing and storing the processed data in a
format optimized for quick retrieval, often using vector databases or
search engines.
GENERATIVE AI 23
Augmented Generation
▪Once the information is indexed and stored in a vector
database, the next step is to use this knowledge whenever a
relevant query is issued by the user.
▪The process of finding relevant information to the query is
called Semantic Similarity search
GENERATIVE AI 24
Key Steps in Augmented Generation
▪Query
▪Embedding
▪Similarity search
GENERATIVE AI 25
GENERATIVE AI 26
Vector Databases
GENERATIVE AI 27
Vector Database
▪A vector database is a special kind of database that saves
information in the form of multi-dimensional vectors
representing certain characteristics or qualities.
▪The number of dimensions in each vector can vary widely,
from just a few to several thousand
▪The primary benefit of a vector database is its ability to
swiftly and precisely locate and retrieve data according to
their vector proximity or resemblance.
GENERATIVE AI 28
Traditional vs Vector
databases
▪Traditional databases store simple data like
words and numbers in a table format.
▪Vector databases, however, work with complex
data called vectors and use unique methods for
searching.
▪vector databases are designed to support the
association of that embedding with the object
metadata, which can include a variety of
information such as the structured definition
and object definition
▪While regular databases search for exact data
matches, vector databases look for the closest
match using specific measures of similarity.
GENERATIVE AI 29
GENERATIVE AI 30
GENERATIVE AI 31
GENERATIVE AI 32
Example Vector Databases
▪Chroma DB
▪FAISS (Facebook AI Similarity Search)
▪Redis
▪Milvus
▪MongoDB Atlas
▪Pgvector (postgresSQL)
▪Pinecone
▪Weaviate
▪Qdrant
GENERATIVE AI 33
Feature Chroma Pinecone Weaviate Faiss Qdrant Milvus PGVector
Open-source
Managed Scalable High-Speed Adding
Vector High-
Primary Use LLM Apps Vector Vector Similarity Vector Search
Similarity Performance
Case Development Database for Storage and Search and to
Search AI Search
ML Search Clustering PostgreSQL
OpenAPI v3,
OpenAI, Python/Num TensorFlow, Built into
LangChain, Various
Integration LangChain Cohere, Py, GPU PyTorch, PostgreSQL
LlamaIndex Language
HuggingFace Execution HuggingFace ecosystem
Clients
Scales from Seamless Capable of Cloud-native
Scales to Depends on
Python Highly scaling to handling sets with
Scalability billions of PostgreSQL
notebooks to scalable billions of larger than horizontal
vectors setup
clusters objects RAM scaling
Custom Approximate
Milliseconds Optimized for
Fast similarity Low-latency Fast, HNSW Nearest
Search Speed for millions of low-latency
searches search supports GPU algorithm for Neighbor
objects search
rapid search (ANN)
Supports Advanced
Fully Emphasizes Primarily for Secure multi- Inherits
multi-user filtering on
Data Privacy managed security and research and tenant PostgreSQL’s
with data vector
service replication development architecture security
isolation payloads
PostgreSQL
Programming Python, Python, Java, C++, Python,
Python C++, Python Rust extension
Language JavaScript Go, others Go
(SQL-based)
GENERATIVE AI 34
CRUD using Vector DB
GENERATIVE AI 35
CRUD using Vector Database
▪We will use Chrome DB as our vector database
▪We will apply CRUD operations to Chroma DB using Python
•We need to install Chroma DB using:
◦ pip install chromadb
GENERATIVE AI 36
1. Instantiate Client object
▪You can choose to instantiate a client object of vector
database either:
•in-memory or
•persisted / stored in desk
GENERATIVE AI 37
2. Create DB Collection
GENERATIVE AI 38
▪In real applications, we should first check existing db collections
and see if our collection already exists.
• If our collection already exists then we use the function
get_collection to connect to it and get a reference to access it
• If the collection does not exist, then we create it using the
create_collection function
GENERATIVE AI 39
3. Adding records / documents / data
▪To load records in the vector database, each record consists of 4 parts:
• Id: unique id for each document, normally text
• Original data or Text: the text chunk or part or the full text if small
• Metadata: information as dictionary object about the data
• Vector Embeddings: vector numbers representing the data
GENERATIVE AI 40
Ids metadata data Vector embeddings
doc_0001 {"source": "lecture", Chroma is an open- [-0.03503859,0.05091505,
"topic": "chroma"} source vector database -0.08437242, ……..
doc_0002 {"source": "book", Vector databases store 0.02831112, 0.08777504,
"topic": "vector-db"} embeddings used by… -0.01009738, …….
doc_0003 {"source": "book", SentenceTransformers 0.03914005, -0.03259664,
"topic": "embeddings"} provide local embeddi. -0.02199583, ...........
doc_0004
{"source": "video", You can combine 0.02959949, -0.1179638 ,
"topic": "RAG"} Chroma with retrieval.. 0.01046857, .......
…. …… …… ……
GENERATIVE AI 41
GENERATIVE AI 42
4. Fetch data from DB collection
▪You can get data from the collection using:
•data = my_db_collection.peek(limit=3)
◦ peek: shows a sample (default 10) of the stored documents, including: IDs,
Documents, Metadata, and Embeddings (if available)
•result = my_db_collection.get(ids=["doc2"],
include=["documents", "metadatas", "embeddings"])
◦ fetch specific doc by id
•results = my_db_collection.get(where={"source":"book"})
◦ Filter by metadata
GENERATIVE AI 43
GENERATIVE AI 44
5. Update record
▪Using update function, it is possible to update document,
embeddings, or data / text
◦ If an id is not found in the collection, an error will be logged and the
update will be ignored.
◦ If documents are supplied without corresponding embeddings, the
embeddings will be recomputed with the collection's embedding
function.
◦ If the supplied embeddings are not the same dimension as the
collection, an exception will be raised
GENERATIVE AI 45
6. Delete
▪It is possible to delete records by:
•Ids, or
•Where condition over metadata
▪It is possible to delete a complete collection
GENERATIVE AI 46
Embeddings
GENERATIVE AI 47
Embeddings:
Transform data into Vectors
▪Unstructured data, such as text, images, and audio, lacks a
predefined format, posing challenges for traditional
databases.
▪To leverage this data in artificial intelligence and machine
learning applications, it is transformed into numerical
representations using embeddings.
GENERATIVE AI 48
GENERATIVE AI 49
Embeddings:
Semantic Understanding
▪Embeddings convert text into high-dimensional vectors that
capture semantic meaning.
▪Similar texts have similar vectors.
GENERATIVE AI 50
Visual Representation
GENERATIVE AI 51
Embedding Projector
GENERATIVE AI 52
Many Available Embedding Models
GENERATIVE AI 53
Embedding Leaderboard
GENERATIVE AI 54
Sentence Transformers
▪Sentence Transformers is widely used for Embedding
▪Calculates a fixed-size vector representation (embedding)
given texts or images.
▪Embedding calculation is often efficient, embedding
similarity calculation is very fast.
▪Applicable for a wide range of tasks, such as semantic
textual similarity, semantic search, clustering, classification,
paraphrase mining, and more.
GENERATIVE AI 55
GENERATIVE AI 56
pip install sentence-transformers
GENERATIVE AI 57
RAG Embedding function
GENERATIVE AI 58
Semantic Search /
Similarity Measure
GENERATIVE AI 59
Types of Similarity Measures
▪A vector database stores data as high-dimensional vectors.
When you perform a similarity search, you are asking:
▪“Which stored vectors are most similar (or closest) to my
query vector?”
▪The answer depends on how we define “similarity” — i.e.,
the similarity measure.
▪The most common similarity measures used in machine
learning and vector databases are:
•Euclidean Distance
•Dot Product
•Cosine Similarity
GENERATIVE AI 60
Euclidean Distance
▪Measures the straight-line distance between two points
(vectors) in multidimensional space.
▪It focuses on how far apart two vectors are in magnitude
and direction.
GENERATIVE AI 61
Dot Product
▪Measures how much two vectors align in direction.
▪It’s a measure of projection — how much one vector goes
in the direction of another:
◦ Large positive value → same direction (similar meaning)
◦ Near zero → orthogonal (unrelated)
◦ Large negative value → opposite direction
GENERATIVE AI 62
Cosine Similarity
▪Measures the cosine of the angle between two vectors,
ignoring magnitude.
▪It focuses purely on directional similarity:
◦ +1 → same direction (high similarity)
◦ 0 → orthogonal (unrelated)
◦ -1 → opposite direction (completely dissimilar)
GENERATIVE AI 63
Vector Databases:
The Search Engine for Meaning
▪Traditional databases search by exact matches.
▪Vector databases search by similarity
GENERATIVE AI 64
How Vector Search Works
GENERATIVE AI 65
GENERATIVE AI 66
Semantic Search for RAG
▪Use the query function and pass to it vector embeddings of the
question
• Specify number of results interested in
• Specify what to include in results (original documents,
metadata, distances, etc
GENERATIVE AI 67
Document Loaders
READING FILES OF DIFFERENT TYPES
GENERATIVE AI 68
Document Loaders
▪Allows to load data / documents from multiple sources into
a format that represents documents and associated
metadata
▪Example file types to load for RAG:
•txt
•pdf
•csv
•video
•audio
•website
GENERATIVE AI 69
Example Document Loaders
▪Python libraries like:
•PyPDF2
•python-docx
•csv
•Json
▪Available advance loaders:
•LangChain
•LamaIndex
GENERATIVE AI 70
Loading using python libraries
GENERATIVE AI 71
Specify folder path, supported extension
files, and initialize empty array
GENERATIVE AI 72
For each type, open file and load
GENERATIVE AI 73
GENERATIVE AI 74
Add loaded file content to
documents array
GENERATIVE AI 75
Chunking
GENERATIVE AI 76
Document Chunking:
Breaking Knowledge into Pieces
▪LLMs have context limits (e.g., 4K, 8K, 128K tokens)
▪Smaller chunks = more precise retrieval
▪Better embedding quality for focused content
GENERATIVE AI 77
Fixed-Size Chunking (words or chars)
GENERATIVE AI 78
Semantic Chunking (topic-based)
GENERATIVE AI 79
GENERATIVE AI 80
LLM API Call
GOOGLE GEMINI & DEEPSEEK
GENERATIVE AI 81
LLM API Call
▪An LLM API call involves sending requests to a large
language model (LLM) hosted by a service provider (like
OpenAI, Google, Cohere) to leverage its capabilities, such as
generating text, answering questions, or performing
language-related tasks.
▪This interaction typically involves sending data (e.g., a
prompt) to the API endpoint and receiving a response (e.g.,
generated text).
▪To make an API call, you need two things:
•URL address of the end point of the hosted LLM
•Key
GENERATIVE AI 82
Why an API Key is Needed for LLM API Calls
▪Authentication and Authorization:
• The API key acts as a unique identifier for your application or user, authenticating
your requests to the LLM service. It also authorizes you to access specific API
endpoints and functionalities, ensuring that only authorized entities can interact
with the LLM.
▪Usage Tracking and Billing:
• API keys enable the service provider to track your usage of the LLM, including the
number of requests, the amount of data processed, and the specific models
utilized. This data is essential for billing purposes, allowing the provider to charge
you based on your consumption.
▪Rate Limiting and Resource Management:
• API keys facilitate the implementation of rate limits, which restrict the number of
requests an application can make within a specific timeframe. This helps manage
the load on the LLM infrastructure, prevents abuse, and ensures fair access for all
users.
▪Analytics and Monitoring:
• By associating requests with specific API keys, service providers can gather
valuable insights into API usage patterns, identify potential issues, and monitor the
performance of their LLM services.
GENERATIVE AI 83
URL & Key
▪Depending on the LLM provider, you can get the URL end
point address and a Key
▪Some providers give free of charge keys with a limited
restriction on number of requests
▪Some providers only give paid keys
▪You can try some unified service providers like OpenRouter
GENERATIVE AI 84
Google Gemini
GENERATIVE AI 85
GENERATIVE AI 86
GENERATIVE AI 87
GENERATIVE AI 88
GENERATIVE AI 89
GENERATIVE AI 90
GENERATIVE AI 91
GENERATIVE AI 92
GENERATIVE AI 93
.env file
▪Commonly used to store configuration settings, API keys,
and other sensitive information.
▪It is a plain text file with key-value pairs
▪python-dotenv library is often used to load these variables
into the environment:
•pip install python-dotenv
GENERATIVE AI 94
.env file
# importing os module for environment variables
import os
# importing necessary functions from dotenv library
from dotenv import load_dotenv
# loading variables from .env file
load_dotenv()
# accessing and printing value
print([Link]("MY_KEY"))
GENERATIVE AI 95
GENERATIVE AI 96
Google Gemini Example LLM API call
# please install Google GenAI SDK first: pip install -q -U google-genai
GENERATIVE AI 97
DeepSeek
GENERATIVE AI 98
GENERATIVE AI 99
DeepSeek Example LLM API call
GENERATIVE AI 100
OpenRouter
GENERATIVE AI 101
GENERATIVE AI 102
RAG in Action:
Step-by-Step Example
GENERATIVE AI 103
Step 1: Loading Documents
GENERATIVE AI 104
Step 2: Chunking
GENERATIVE AI 105
Step 3: Embedding & Storage
GENERATIVE AI 106
Step 4: User Query
GENERATIVE AI 107
Step 5: Prompt Construction
GENERATIVE AI 108
Step 6: LLM Generation
GENERATIVE AI 109
Step 7: Response to User
GENERATIVE AI 110