
Vector+Databases Course Slides

This course, led by Paulo Dichone, focuses on the fundamentals of vector databases, targeting entrepreneurs, developers, project managers, and data scientists. Participants will learn how to build vector databases from scratch, understand their functionality, and explore various use cases, including image retrieval and recommendation systems. Prerequisites include programming knowledge, particularly in Python, and a willingness to learn, as the course combines theoretical concepts with hands-on development.
Copyright
© All Rights Reserved

Who Am I?

Paulo Dichone

Software, Cloud, AI Engineer and Instructor
What Is This Course About?

● Vector Databases - Fundamentals (Deep Dive)


Who Is This Course For

● Entrepreneurs
● Developers
● Project Managers
● Data Scientists
● Open-minded learners :)
Course Prerequisites
1. Know Programming (highly preferred…)
a. There will be some Python code

2. This is not a programming course


3. Willingness to learn :)
What you’ll learn

Vector Databases:
● What are they?
● How they work
● Vectorization
● Use cases
● Top 5 Vector Database solutions
● How to build one from scratch (hands-on)
Course Structure

Theory (Fundamental Concepts)

Hands-on
Development Environment setup
● Python
● VS Code (or any other code editor)
● OpenAI API Account and API Key
Set up OpenAI API Account
OpenAI API - Dev Environment Setup

Python (Win, Mac, Linux)

[Link]
Introduction To Vector Databases

● What is a vector database?
● Why vector databases?
● Limitations of traditional databases
What is a Vector Database?
A vector database encodes information as vectors in a
multi-dimensional space so that it can run highly efficient queries
based on similarity.

Key ideas: Dimensionality, Vector, Similarity Search
What is a Vector Database?
A vector - what is it?

[Figure: a vector V drawn as an arrow from tail to head; its length is the magnitude and the way it points is the direction]
Vectors

3.4 miles

Campsite 1 Campsite 2

Direction - toward camp 2


Magnitude - 3.4 miles
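
The campsite picture can be reproduced with a few lines of Python. This is only a sketch: the map coordinates below are hypothetical values chosen so that the distance works out to the 3.4 miles on the slide, and the direction is expressed as an angle because the slide only says "toward camp 2".

```python
import math

# Hypothetical map coordinates in miles (not taken from the slides).
campsite_1 = (0.0, 0.0)
campsite_2 = (3.0, 1.6)

# The vector from campsite 1 to campsite 2 is the component-wise difference.
vector = (campsite_2[0] - campsite_1[0], campsite_2[1] - campsite_1[1])

# Magnitude: the length of the arrow (straight-line distance).
magnitude = math.hypot(*vector)

# Direction: the angle of the arrow, measured counter-clockwise from east.
direction_deg = math.degrees(math.atan2(vector[1], vector[0]))

print(f"magnitude: {magnitude:.1f} miles")
print(f"direction: {direction_deg:.1f} degrees north of east")
```

The same two ideas, length and direction, carry over unchanged to vectors with hundreds of dimensions.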
Why Vector Databases?

How data “shows up” in the world?

80% or more of data is unstructured

Vector databases are especially well suited to
working with these types of data: they index
data by meaning (as vectors), while most other
databases can store unstructured data but
cannot search it by similarity.

Why?
Why Are Vectors Used in a Vector Database?
Because…

Vector databases turn unstructured data into numbers (vectors):

-0.05, -0.0955,..., 0.0722

-0.053, -0.885, 0.1622, …



Why Are Vectors Used in a Vector Database?
We are back to the first question… and the
campground
Vectors in Action
Campsite as vectors

-0.05, -0.0955,..., 0.0722

-0.053, -0.885, 0.1622, …

lake
trail fishing
Benefits - Quick Search for the Best Campsite

[Figure: campsites tagged with words such as lake, water, firepit, entrance and swimming, stored as vectors like -0.05, -0.0955,..., 0.0722 and -0.053, -0.885, 0.1622, …]

● Rapid discoverability
● Efficient organization
Why are Vectors Used in a Vector Database?

1. Efficient Representation of Complex Data
a. Dimensionality - representing data in high-dimensional space
b. Uniformity - data can be converted into a uniform format (numerical vectors)
2. Enabling Similarity Search
3. Leveraging Machine Learning Models
4. Optimizing Performance and Scalability
5. Improving User Experience
a. Real-time interaction (recommendations, search results or data analysis outputs)
Traditional Databases
In contrast to Vector Databases - RDBMS

Key characteristics:
● Structured data: predefined columns and rows
● Schema-based: the database structure must be defined beforehand
● Data manipulation and querying: done through SQL
● ACID compliant: Atomicity, Consistency, Isolation, Durability
● Indexing: speeds up data retrieval
Traditional Databases
How data search works in traditional databases:

Query Database
Traditional Databases
Limitations

● Scalability: hard to deal with complex queries across large tables
● Flexibility: changing the DB’s schema can be disruptive
● Handling Unstructured Data: not well-suited for handling unstructured data (images, text, audio, video)
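
A small sketch of the unstructured-data limitation, using Python's built-in sqlite3 module. The table name and the campsite descriptions are made up for illustration. A keyword query only returns rows that contain the literal text, so a description that means nearly the same thing is missed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE campsites (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO campsites (description) VALUES (?)",
    [
        ("Quiet spot next to the lake",),
        ("Shaded site beside a pond, good for swimming",),
        ("Close to the park entrance",),
    ],
)

# Keyword search: only rows containing the literal text "lake" match.
rows = conn.execute(
    "SELECT description FROM campsites WHERE description LIKE ?", ("%lake%",)
).fetchall()
matches = [row[0] for row in rows]
print(matches)  # the pond site is not returned, although it is also by the water

conn.close()
```

A similarity search over embeddings is meant to close exactly this gap: "pond" and "lake" would be expected to land near each other in vector space.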
Transforming Unstructured Data into Vectors -
Deep dive

[Figure: unstructured data → splitting → embedding → vector (embeddings) → storage in the vector DB]

Example: text… → embedding → -0.00543, -0.0984055,..., 0.075622

Embedding vector keeps the content and the meaning of the text
Text with similar content and meaning will have similar vectors
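
To make "similar vectors" concrete, here is a toy comparison. The three-dimensional vectors are hand-made for illustration only; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

# Hand-made 3-dimensional "embeddings" (illustrative values, not model output).
toy_embeddings = {
    "cat": [0.90, 0.80, 0.10],
    "kitty": [0.85, 0.75, 0.20],
    "europe": [0.10, 0.20, 0.95],
}

# Smaller distance means the two vectors sit closer together in the space.
cat_to_kitty = math.dist(toy_embeddings["cat"], toy_embeddings["kitty"])
cat_to_europe = math.dist(toy_embeddings["cat"], toy_embeddings["europe"])

print(f"cat vs kitty:  {cat_to_kitty:.3f}")
print(f"cat vs europe: {cat_to_europe:.3f}")
```

With these values "cat" ends up far closer to "kitty" than to "europe", which is the property a vector database relies on.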

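[Figure: words such as cat, kitty, flower, eat, europe, run and walk are embedded and stored in the vector DB; "cat" and "kitty" are shown next to vectors like -0.05, -0.0955,..., 0.0722 and -0.053, -0.885, 0.1622, …]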
Vector database (vectorstore) - full overview

Embedding creation and storage

Documents → text splits → embed → stored entries, each pairing an embedding vector with its original text split:

-0.05, -0.0955,..., 0.0722
-0.068, -0.799,..., 0.1276
-0.99, -0.198,..., 0.008
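
The creation-and-storage flow above can be sketched in plain Python. Everything here is an assumption made for illustration: `split_text`, `toy_embed` and the record layout are hypothetical names, and the vowel-count "embedding" merely stands in for a real embedding model.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Split text into character chunks that overlap by `overlap` characters."""
    if not text:
        return []
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


def toy_embed(chunk):
    """Hypothetical stand-in for an embedding model: vowel counts, no real meaning."""
    lowered = chunk.lower()
    return [lowered.count(vowel) for vowel in "aeiou"]


document = (
    "Vector databases store embeddings. "
    "Each embedding is kept next to the text split it came from."
)

# Split the document, embed every split, and keep vector and text together.
chunks = split_text(document)
store = [
    {"id": f"chunk-{n}", "text": chunk, "embedding": toy_embed(chunk)}
    for n, chunk in enumerate(chunks)
]

for record in store:
    print(record["id"], record["embedding"], repr(record["text"]))
```

Keeping the original text next to each vector matters because the vector is only used for searching; the text is what gets returned.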
Vector Database - querying the vector database

query/question → embed → search and compare all entries in the vector DB → pick the most similar entry

Index:
-0.05, -0.0955,..., 0.0722
-0.068, -0.799,..., 0.1276
-0.99, -0.198,..., 0.008
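
The "search and compare all entries, pick the most similar" step can be sketched as a brute-force scan. The index contents and the query vector below are made-up values, and cosine similarity is just one possible choice of measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_vector, entries):
    """Compare the query against every stored entry and return the best match."""
    return max(entries, key=lambda e: cosine_similarity(query_vector, e["embedding"]))


# Made-up index: each entry pairs a text split with its (toy) embedding.
index = [
    {"text": "Campsite by the lake", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Campsite near the trail", "embedding": [0.1, 0.9, 0.1]},
    {"text": "Campsite with a fire pit", "embedding": [0.0, 0.2, 0.9]},
]

query_vector = [0.8, 0.2, 0.1]  # pretend this is the embedded question
best = most_similar(query_vector, index)
print(best["text"])
```

Scanning every entry is fine for a handful of vectors; real vector databases use specialized indexes so they do not have to compare against everything.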
Vector Store - Processing with LLM

Pick the most similar entry → take the most similar entry found and pass it on to the LLM along with the question/query… → LLM result/answer
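
Passing the entry "along with the question" usually means placing both in one prompt. The sketch below shows only that assembly step; the prompt wording, the `build_prompt` helper and the sample texts are assumptions, and the actual LLM call is left out on purpose.

```python
def build_prompt(question, context):
    """Combine the retrieved entry and the user's question into one prompt."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up retrieved entry and question.
retrieved_entry = "Campsite 2 is 3.4 miles from campsite 1 and sits next to the lake."
question = "How far apart are the two campsites?"

prompt = build_prompt(question, retrieved_entry)
print(prompt)
# The prompt would then be sent to an LLM of your choice; that call is omitted here.
```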
Embeddings vs vectors

They essentially refer to the same thing… BUT they have distinct definitions and roles.

Vectors: a mathematical representation of data in an n-dimensional space (each dimension == a feature of the data)

Embeddings: a specific type of vector used in Machine Learning and Artificial Intelligence
Embeddings vs vectors
Key differences and relation

Vectors are generic and used for a wide array of applications (handle numerical computations…)

Embeddings map raw data into a vector space (preserving semantic relationships - meaning)

Bottom line: embeddings are vectors, but not all vectors are embeddings.
Embeddings are vectors that encode semantic similarities between the
items they represent (great for AI applications)
Vector databases - how they work

[Figure: data is transformed into embeddings and stored as vector representations; the query is also transformed into an embedding, and a nearest neighbor search returns the results]
Vector Databases
Advantages
● Data Representation as Vectors: data is vectorized, which brings lots of benefits for searching
● Similarity Search: finding data points closest to a given query vector
● Efficiency in High-Dimensional Searches: use of specialized, highly optimized indexing structures
● Handling Unstructured Data: vector databases are made to deal with unstructured data!
● Schema-less Design: no schema required, allowing more flexibility in handling various data types and structures
Vector Databases Use cases

Vector databases are powerful for handling complex queries

Here are 5 use cases:

1. Image Retrieval & Similarity Search
2. Recommendation Systems
3. Natural Language Processing (NLP)
4. Fraud Detection
5. Bioinformatics
Image Retrieval & Similarity Search
For example: an e-commerce platform can use a vector database for a visual search feature.

[Figure: customer → e-commerce system → nearest neighbor search → results: similar items available for sale]
Recommendation Systems
For example: music streaming service

Uses a vector database to recommend songs

[Figure: the user’s music playlist is embedded (-0.053, -0.885, 0.1622, …) by the streaming service system; a nearest neighbor search in the vector database returns recommendations: “Here are other songs you may like”]
NLP - Natural Language Processing
For example: customer support AI

Uses a vector database to understand and respond to user queries more efficiently

[Figure: the customer asks “Where’s my package?”; the customer support AI embeds the question (-0.053, -0.885, 0.1622, …), runs a nearest neighbor search in the vector database, and responds: “On its way to you… 2 days behind schedule”]
Fraud Detection & Bioinformatics
Fraud detection example: detect fraudulent activities

Uses a vector database to quickly compare user behavior patterns and flag anomalies…

Bioinformatics example: compare genetic sequences

Uses a vector database to compare gene expression profiles from different patient samples.
LLM - Large Language Models

● Large Language Models - what are they?
What Is an LLM?
Large Language Model
● A type of AI algorithm…
● Trained on a large amount of text data
● Gives responses in natural language (and other formats)

[Figure labels: Extensively Trained · Natural Language · The Language Model]
Data Collection and Training…

Data Collection → Train the model (Language Model)

Collection of lots of data & train the model: learn patterns in human language (articles, studies, news…)
How LLMs are trained

There are a few steps involved in training an LLM:

● Unsupervised learning - the model starts to derive relationships between words and concepts, and is then fine-tuned with supervised learning
● Next - training data goes through a Transformer, enabling the LLM to recognize relationships and connections using a self-attention mechanism
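
For readers who want to see what a self-attention mechanism computes, here is a deliberately simplified sketch. It assumes identity projections (real Transformers multiply each token vector by learned query, key and value matrices first), and the two-dimensional token vectors are made-up values.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections (simplified)."""
    dim = len(vectors[0])
    outputs, attention = [], []
    for query in vectors:
        # Score this token against every token in the sequence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        attention.append(weights)
        # The output is a weighted mix of all token vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs, attention


tokens = ["I", "eat", "an", "apple"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.1, 0.9]]  # made-up values

outputs, attention = self_attention(vectors)
for token, weights in zip(tokens, attention):
    print(token, [round(w, 2) for w in weights])
```

Each printed row shows how much one word "looks at" every other word; with these toy vectors "eat" attends more to "apple" than to "I".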
The Transformer Architecture - Overview
The Transformer Architecture is a Neural Network best
suited for text and natural language processing.

[Figure: "I eat an apple." → GPT (Language Model): Transformer with Encoder and Decoder → "Eu como uma maçã."]
The Transformer Architecture - Overview
[Figure: "I eat an apple." → Transformer: Encoder (self-attention) → Encoder-Decoder Attention → Decoder (self-attention) → "Eu como uma maçã."]

Self-Attention layer - looks at the other words in the input to help produce a better encoding for each word.

Encoder-Decoder Attention - this helps the decoder focus on relevant parts of the input sentence.
The Transformer Architecture - Overview

[Figure: the Transformer gets and processes the input (training input & output) through an Encoder and a Decoder; the Self-Attention Mechanism captures relationships between words and understands the context and meaning]
LLMs Available
To name a few…

● LLaMA
● FLAN UL2
● GPT-X
● BLOOM
● ChatGPT
LLMs have many use-cases
They can…

● Organize content
● Translate languages
● Generate text
● Converse naturally with the user (chatbot)
● Do sentiment analysis (humor or tone)
● Summarize/rewrite content
The top 5 Vector Databases

Pinecone

Key Features:
● Managed service - simple deployment and maintenance
● Supports real-time vector indexing & querying - scalability and performance

Unique Selling Points:
● Provides a simple API
● Strong focus on consistency
Milvus
Key Features:
● Open-source and designed for scalability and high-performance
● Supports both CPU and GPU

Unique Selling Points:


● Highly customizable
● Rich API and SDK (in multiple programming languages)
Faiss (Facebook AI Similarity Search)
Key Features:
● Developed by Facebook - efficient similarity search
● Operates mainly in memory for fast data retrieval

Unique Selling Points:


● Provides highly optimized algorithms for similarity search
● Best suited for researchers and developers in AI
Weaviate
Key Features:
● Open-source vector engine - supports GraphQL, RESTful APIs…
● Has features like semantic search, automatic classification and object
recognition

Unique Selling Points:


● Modular infrastructure
● Supports semantic search with a built-in knowledge
graph
Annoy (Approximate Nearest Neighbors Oh Yeah)

Key Features:
● Lightweight, open-source library - fast
● High performance with memory-mapped files - good for large-scale
datasets

Unique Selling Points:


● Prioritizes speed and memory efficiency
● Provides an easy-to-use interface with minimal setup
Chroma - AI Native & Open-source

Key Features:
● Embedding storage and search - stores embeddings and their associated metadata
● Wide range support - integrates with various programming languages
● Performance and scalability
● Open-source
Unique Selling Points:
● Simplicity and developer productivity
● High performance
● Customizable & extensible
● Cost-effective - free and open-source
Building Vector Databases

Hands-on - Building vector databases from scratch
Development Environment Set up

● VS Code
● Python
● OpenAI account and API Key
Hands-on - Create a Vector DB using
Chroma

Vector database with Chroma


Chroma database workflow

Image source:[Link]
Vector Search - How Does It Work?

[Figure: data is transformed into embeddings and stored as vector representations; the query is also transformed into an embedding, and a nearest neighbor search returns the results]
Chroma database - OpenAI Integration
Understanding Large Language Models (LLMs)

Data Collection and Training…

Data Collection → Train the model (Language Model)

Collection of lots of data & train the model: learn patterns in human language (articles, studies, news…)
How Does it Work?
VectorStore holds embeddings - vector representations of the text

But why embeddings?

Because we can easily run a search where we look for the pieces of text that are most similar in the vector space

VectorStore: [0.5..0.2..-0.]

● Capture semantic meaning
● Enabling similarity measures
● Handling high-dimension data
Image source: [Link]
LangChain
An open-source framework for building applications that leverage various LLMs
(Large Language Models).
Additionally, you can combine external sources of data with various LLMs!

[Figure: the LangChain framework connects an LLM's general knowledge with your own data (company documents, PDFs, databases) and with actions to take (for example, sending an email)]
LangChain - How Does it Work?
The framework takes a document, splits it into chunks, and transforms them into a VectorStore (which stores the chunks of data)

Document → Document Chunks → VectorStore [0.5..0.2..-0]
Integrating OpenAI Embeddings with
Chroma
Chroma Vector Database - Metrics and Data
Structures
Metrics for Evaluating Vector Databases:
● Latency: time it takes to complete a query
● Throughput: # of queries that can be processed per unit of
time.
● Precision and Recall: evaluation of accuracy of the search
result
● Memory Usage
● Scalability
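
Precision and recall can be computed once you know which results were returned and which ones were actually relevant. The sketch below assumes you have such a labeled set; the document IDs are invented.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Precision: share of returned results that are relevant.
    Recall: share of relevant items that were returned."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: what the search returned vs. what a human judged relevant.
retrieved_ids = ["doc1", "doc2", "doc3", "doc4"]
relevant_ids = ["doc1", "doc3", "doc7"]

precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are relevant (precision 0.50) and two of the three relevant documents were found (recall about 0.67).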
Chroma Vector Database - Metrics and Data
Structures
Data Structures in Chroma
● Inverted Indexes
● K-d Trees
● Hierarchical Navigable Small World (HNSW) Graphs
● Locality-Sensitive Hashing (LSH)
● Priority Queues
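
As one illustration of why a priority queue appears in this list, the sketch below uses Python's `heapq` module to keep only the k closest vectors while scanning. It is a generic example with made-up two-dimensional vectors and is not a description of how Chroma is implemented internally.

```python
import heapq
import math


def top_k(query, vectors, k):
    """Return the k (distance, id) pairs closest to the query vector."""
    distances = ((math.dist(query, vec), vec_id) for vec_id, vec in vectors.items())
    # nsmallest keeps a bounded heap instead of sorting every distance.
    return heapq.nsmallest(k, distances)


# Made-up 2-dimensional vectors keyed by a label.
vectors = {
    "lake": [0.9, 0.1],
    "water": [0.8, 0.2],
    "firepit": [0.1, 0.9],
    "entrance": [0.4, 0.6],
}

nearest = top_k([1.0, 0.0], vectors, k=2)
for distance, vec_id in nearest:
    print(vec_id, round(distance, 3))
```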
Metrics (Measuring Precision)- Hands on

● Set up a vector database with Chroma


● Insert data
● Perform queries
● Monitor key performance metrics
Measuring Throughput

● How many queries your system can handle per unit of time
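
A minimal way to measure throughput (and average latency along the way) is to time a batch of queries. In this sketch `fake_query` is a placeholder workload; in the hands-on you would pass a function that queries your actual vector database.

```python
import time


def measure(query_fn, queries):
    """Run every query once; return (average latency in seconds, queries per second)."""
    start = time.perf_counter()
    for query in queries:
        query_fn(query)
    elapsed = time.perf_counter() - start
    return elapsed / len(queries), len(queries) / elapsed


def fake_query(query):
    """Placeholder workload standing in for a real vector database query."""
    return sum(i * i for i in range(1000))


avg_latency, throughput = measure(fake_query, ["example query"] * 200)
print(f"average latency: {avg_latency * 1000:.3f} ms")
print(f"throughput: {throughput:.0f} queries/second")
```

The numbers depend on the machine and the workload, so compare runs on the same setup rather than reading them as absolute values.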
Measuring Precision and Recall
● Searching - finding relevant results to the query string
● Recommendations - items with related text strings are recommended…
● Classification - text strings are classified by most relevant and similar labels…

[Figure: "Cat" and "kitty" stored in the Vector Store as vectors such as -0.05, -0.0955,..., 0.0722 and -0.053, -0.885, 0.1622, …]
Measuring Scalability
Create Vector Database with Pinecone

● Create account
● Dashboard overview
● Create a sample index
● Test it out
Use cases..
● Data Analysis (applications that can analyze large amounts of data…)
● Flight booking
● Study helper (learn material faster)
● Money transfer
● Code analyzer (debugging code, learn a large codebase fast)
● Personal AI assistants
● Connect to a variety of APIs (LLMs working with APIs through langchain)
● …
Vector similarity - Deep dive

● A way to capture the closeness or alignment between two data points…
● Similarity influenced by
○ Direction
○ Magnitude
○ and relative position
Common Measures of Vector Similarity

● Cosine Similarity
● Euclidean Distance
● Dot Product
Common Measures of Vector Similarity
● Cosine Similarity
○ Magnitude (distance) doesn’t matter
○ The angle (cosine) is what matters
○ A higher value means more similar
○ Values range between -1 and 1
Common Measures of Vector Similarity
● Cosine Similarity

○ Dot product: for Arrow A = (3, 1) and Arrow B = (2, 2), that’s (3 * 2) + (1 * 2) = 6 + 2 = 8.
○ Magnitude of Arrow A: the square root of (3² + 1²) = √(9 + 1) = √10
○ Magnitude of Arrow B: √(2² + 2²) = √(4 + 4) = √8.
○ Cosine similarity: 8 / (√10 * √8) = 8 / √80

Result: ~0.894
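
The same arithmetic in Python, using the two arrows from the slide, A = (3, 1) and B = (2, 2). The variable names are only for illustration.

```python
import math

arrow_a = (3, 1)
arrow_b = (2, 2)

# Dot product: (3 * 2) + (1 * 2) = 8
dot = sum(a * b for a, b in zip(arrow_a, arrow_b))

# Magnitudes: sqrt(10) and sqrt(8)
mag_a = math.sqrt(sum(a * a for a in arrow_a))
mag_b = math.sqrt(sum(b * b for b in arrow_b))

cosine = dot / (mag_a * mag_b)
print(round(cosine, 3))  # 0.894
```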
Common Measures of Vector Similarity
● Cosine Similarity

Best use:
● Information retrieval and
text mining
Common Measures of Vector Similarity
● Euclidean Distance / L2 Norm

Best use:
● When vector magnitude needs to be considered
● Ideal for clustering tasks (k-means clustering)
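
Euclidean distance is what k-means uses to decide which cluster a point belongs to. The sketch below shows only that assignment step, with invented centroids and points; it is not a full k-means implementation.

```python
import math

# Invented cluster centers and data points.
centroids = {"cluster_a": (1.0, 1.0), "cluster_b": (5.0, 5.0)}
points = [(0.5, 1.5), (4.0, 6.0), (2.0, 2.0)]


def nearest_centroid(point):
    """Pick the centroid with the smallest Euclidean (L2) distance to the point."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))


assignments = [nearest_centroid(p) for p in points]
for point, cluster in zip(points, assignments):
    print(point, "->", cluster)
```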
Common Measures of Vector Similarity
● Dot Product

Best use:
● Image retrieval &
matching
● Music recommendation
● …
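
Unlike cosine similarity, the dot product grows with vector magnitude. A small sketch with made-up vectors: the two songs point in exactly the same direction, but the longer vector gets twice the score.

```python
def dot(a, b):
    """Sum of the element-wise products of two vectors."""
    return sum(x * y for x, y in zip(a, b))


# Made-up vectors: both songs share a direction, the second is twice as long.
user_taste = [1.0, 0.5]
song_short = [2.0, 1.0]
song_long = [4.0, 2.0]

score_short = dot(user_taste, song_short)
score_long = dot(user_taste, song_long)
print(score_short, score_long)
```

Whether that sensitivity to magnitude helps or hurts depends on what the magnitude encodes in your embeddings.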
Common Measures of Vector Similarity - Summary

Cosine Similarity - best use:
● Topic modeling
● Document similarity
● Collaborative filtering

Euclidean Distance - best use:
● Clustering analysis
● Anomaly & fraud detection

Dot Product - best use:
● Image retrieval & matching
● Neural networks & Deep Learning
● Music Recommendation
Hands on - Real world use case

● Vectorization of data
○ Split large text/documents
○ Create embeddings
○ Save them to vector database
● Query (similarity search)


The top 5 Vector Databases

Challenge:
● Explore the other vector databases:
○ Weaviate
○ Faiss
○ Milvus
○ …
Pinecone Summary

● Introduction to Pinecone
● Pinecone basics and set up
● Used LangChain Framework
● Attached an LLM to complete the workflow
Comparison of Vector Databases - Deployment options

Vector Database | Local Deployment | Cloud Deployment | On-premises Deployment
Pinecone | ❌ | ✅ (managed) | ❌
Milvus | ✅ | ✅ (self-hosted) | ✅
Chroma | ✅ | ✅ (self-hosted) | ✅
Weaviate | ✅ | ✅ (self-hosted) | ✅
Faiss | ✅ | ❌ | ✅
Comparison of Vector Databases - Integration and API

Vector Database | Language SDK | REST API | gRPC API
Pinecone | Python, [Link], Go, Rust | ✅ | ✅
Milvus | Python, Java, Go, C++, [Link], RESTful | ✅ | ✅
Chroma | Python | ✅ | ❌
Weaviate | Python, Java, JavaScript, .NET | ✅ | ✅
Faiss | C++, Python | ❌ | ✅


Community and Ecosystem

Vector Database | Open-source | Community | Integration with Frameworks
Pinecone | ❌ | ✅ | ✅
Milvus | ✅ | ✅ | ✅
Chroma | ✅ | ✅ | ✅
Weaviate | ✅ | ✅ | ✅
Faiss | ✅ | ✅ | ✅
Pricing

Vector Database | Free Tier | Pay-as-you-go | Enterprise Plans
Pinecone | ✅ | ✅ | ✅
Milvus | ✅ | ❌ | ❌
Chroma | ✅ | ❌ | ❌
Weaviate | ✅ | ❌ | ✅
Faiss | ✅ | ❌ | ❌
Which Vector Database Should I Use?

It depends on… what you want to accomplish, and on many other factors…
1. Project Requirements
a. Scalability
b. Performance needs
c. Data type
2. Ease of Use and Integration
a. Developer experience
b. Community support
3. Feature Set
a. Advanced features
b. Customization and flexibility
4. Cost and Infrastructure
a. Budget constraints
b. Infrastructure Needs
5. Security and Compliance
a. Data security
Recommended Approach
To determine the best fit:
● Evaluate a shortlist - based on discussed criteria,
shortlist a few databases
● Prototype
● Community feedback
Choosing the Right Vector Database
● Data Type and Volume
○ Text, Images or Audio: Weaviate or Chroma
○ Scalability: vertical and horizontal - Pinecone and Milvus
● Query Performance and Latency
○ Latency requirements - low latency applications - Pinecone (recommendation systems or live
content filtering)
● Accuracy and Precision
○ Metric support - supports similarity metrics (cosine, euclidean…)
○ Tuning capabilities
● Ease of Integration and Use
○ API and Client Libraries
○ Documentation and Community
● Cost considerations
○ Pricing Structure (pay-as-you-go…)
○ Total cost of ownership
● Security and Compliance
○ Data Security
○ Compliance - GDPR, HIPAA
● Vendor Stability and Support
○ Vendor Reputation
○ Support Services
Congratulations! You made it to the end!
● Next steps…
Course Summary
● Foundations of Vector Databases
○ What are they?
○ What problem do vector databases solve?
○ Top 5 vector databases
○ Key Differences
○ Challenges and use cases
○ How to build vector databases from scratch
■ Metrics and data structure
■ Vectorization with abstraction frameworks
○ Hands-on use cases: full AI-based application workflow (with
LLMs)
○ Vector database comparisons
○ How to choose a vector database
Wrap up - Where to Go From Here?

● Keep learning
○ Get more ideas, build more applications, and test different,
lesser-known vector databases!

● Read documentation - That’s where the Gold Is!


● Challenge yourself to keep learning new skills!
Thank you!
Traditional vs Vector Databases - Summary

● Traditional vs Vector databases


○ Limitations and Contrasts

● Vector databases & Embeddings - Overview


● How vector databases work & their advantages
● Vector databases Use cases
Building vector database - Hands-on
● Set up development environment
○ VS Code
○ Python
○ OpenAI API
● Chroma database workflow overview
● Creating a chroma database
● Default embedding function
● Creating OpenAI embeddings
● Vector database metrics
