PEC Gen AI Overview and Techniques

Q: How does Python facilitate the development of machine learning applications?

Python facilitates the development of machine learning applications through its extensive libraries and frameworks, such as Pandas, NumPy, Scikit-learn, and TensorFlow. These libraries provide robust tools for data manipulation, statistical analysis, and model building. Python's simple syntax and readability make it accessible to developers, and its large community provides resources, tutorials, and debugging help.

Q: In what way does stemming differ from lemmatization, and why might lemmatization be preferred in language processing tasks?

Stemming and lemmatization both aim to reduce words to their base or root form. However, stemming uses rule-based algorithms to strip prefixes and suffixes without regard for context, which can produce stems that are not real words or that lose the word's meaning. Lemmatization, on the other hand, reduces words to their dictionary base form while taking context into account, making it more sophisticated. Lemmatization is therefore often preferred in language processing tasks, as it yields better semantic understanding and contextual accuracy.

Q: Explain the role of tokenization in the functioning of language models such as GPT.

Tokenization in language models like GPT breaks text input into smaller units, such as words or subwords, that the model can process. This step is essential because each token is mapped to a numerical ID and then to an embedding vector, which lets the model represent meanings and contexts within the input. Removing redundant elements such as stop words is a separate, optional cleaning step rather than part of tokenization itself.

Q: How does the retrieval-augmented generation (RAG) technique improve the accuracy of AI models' responses?

RAG improves the accuracy of AI models' responses by allowing these models to incorporate external information into response generation. This technique enables models to access real-time or recent data that exceeds their static knowledge, which is limited to what was available at the time of their training. By integrating information retrieval systems, RAG enhances both the precision and contextual relevance of the answers the models generate.

Q: Why might RAG be considered a necessary strategy in the context of LLMs' limitations?

RAG is considered necessary because LLMs have inherent limitations, including static knowledge and the inability to handle date-sensitive information or topics lacking context. RAG mitigates these limitations by augmenting the model's capabilities with external, up-to-date information. This ensures more accurate and contextually relevant responses by bridging the gap between the model's training data and the required current knowledge.

Q: Discuss how set and dictionary data structures in Python differ in terms of their storage capabilities and use cases.

Sets in Python are collections of unordered, unique elements, which are useful for tasks requiring elimination of duplicate entries or membership testing. In contrast, dictionaries store data as key-value pairs, allowing for more complex data relationships and retrieval operations. While sets cannot store duplicate items and do not support indexing, dictionaries map each key to its corresponding value for fast look-up and retrieval.

Q: What are the potential limitations of using APIs for interaction with large language models?

The potential limitations of using APIs for interaction with large language models include the requirement for internet connectivity: every request has to travel to the company's servers and back, which introduces latency and can slow the application down. APIs also have cost implications, as they often require payment for usage. Any device using the company's API must therefore maintain a stable internet connection to perform well.

Q: What are the advantages of using Google Colab for machine learning experiments, particularly regarding computational resources?

Google Colab offers significant advantages for machine learning experiments, particularly in terms of computational resources. It provides free, usage-limited access to GPU and TPU runtimes, which can dramatically accelerate ML computations compared to using a CPU alone. It also offers a cloud-based platform with substantial disk space and RAM, which helps with handling large datasets without the need for personal high-performance hardware.

Q: What are the key differences between supervised, unsupervised, and reinforcement learning in machine learning?

Supervised learning involves training models on labeled data to make predictions or classifications, such as predicting house prices. Unsupervised learning focuses on identifying patterns in unlabeled data, like clustering customers based on purchasing behaviors. Reinforcement learning is distinct in that models learn through trial and error, receiving feedback on the actions they take; it is often applied in game-playing AI.

Q: How do large language models (LLMs) overcome the challenge of understanding complex human language structures?

Large Language Models (LLMs) overcome the challenge of understanding complex human language structures by using transformers as their core architecture. Transformers facilitate the processing and understanding of text via tokenization, embedding, and self-attention mechanisms. The models convert textual inputs into numerical vectors that represent semantic meanings, allowing them to generate contextually appropriate and grammatically correct responses.

The document provides an overview of AI, ML, and DL, explaining their definitions, types, and applications, including generative AI and large language models. It also covers Python programming fundamentals, including syntax, operators, loops, functions, and data structures like lists, tuples, and sets, emphasizing its real-world applications in various fields. Additionally, it introduces Google Colab as a platform for running Python code and debugging with ChatGPT.

Uploaded by

Muhammad Adeel


PEC GEN AI NOTES

MODULE 1
 AI is a discipline, Machine Learning (ML) is a subfield of this discipline, and Deep
Learning (DL) is a further subfield of ML.
 DL is a type of ML that uses Artificial Neural Networks to learn complex patterns
from data. Its models fall into two broad types of technique:
 Discriminative Technique: A discriminative model classifies or differentiates
between categories (e.g. spam or not spam).
 Generative Technique: Generative AI is a subset of AI that aims to generate new
content from given instructions (a prompt).
 AI: Machines doing tasks intelligently. ML: Machines that learn from patterns in data
to make predictions. Gen-AI: Machines creating new, original content (ChatGPT,
DALL-E).

 Large Language Model (LLM): A language model is a type of Gen-AI that generates
text, e.g. autocomplete.

 Machine Learning Model Types:

1) Supervised Learning: Learns from labelled data e.g. predicting house prices.
2) Unsupervised Learning: Finds patterns in unlabeled data e.g. clustering
customers
3) Reinforcement Learning: Learns by trial and error e.g. game-playing AI.
4) Deep learning: Identifies patterns in data using neural networks.
 Large Language Models: Trained on vast amounts of text data to understand
context, grammar, and nuances of language.
 How to interact with LLMs:
 Chat interfaces (ChatGPT, Gemini), LLM APIs (GPT-3.5, GPT-4, and Gemini), Specialized
models (YOLO, Whisper)

 Chat Interfaces: The simplest way to interact with LLMs. Most LLM providers offer such
an interface (ChatGPT, Gemini), and most of these services have a free plan that limits
usage (a cap on the number of prompts) or limits the models available (GPT-3.5 vs GPT-4).
 API (Application Programming Interface): Similar to a waiter at a restaurant. It
takes your prompt and delivers it to the server; the server, not the API, processes
the data; the API then brings the result back to the user. It acts as an agent between you and the server.
Another example is the CR (class representative) at university, who is the only medium between teachers and students.

 Adv of API: Easy to set up and use, faster prototyping, and no need for an expensive GPU,
as all processing runs on the company's servers.

 Disadv of API: Costs money to use. The application can be slower because each request has to go to the
company's server and then come back. Any device using a company's LLM API needs
to be connected to the internet, and its speed can be affected by the connection's speed.

 Pre-trained Specialized models: Smaller models that are trained and fine-tuned to
perform specific tasks e.g. YOLO: object detection, Whisper: speech to text. These
smaller models are often more accurate at their specific tasks. They
can be further fine-tuned on your own data to increase accuracy.

 Where can specialized models be found? Many of them are open source, meaning
they are freely available on the internet for anyone to download, use, and modify.
There are specific platforms where trained models are uploaded alongside
instructions on how to use them. Hugging Face is the most popular source for
pre-trained models and has easy-to-use instructions for most models. These platforms
have communities and forums where questions can be asked as well.

 GPT: Generative Pre-trained Transformer. An LLM trained to generate answers to
our queries using the transformer architecture. The underlying transformer
architecture is shared; there are simply multiple models trained on
broadly similar data. ChatGPT is built on these models. These GPTs are already trained
on large datasets (e.g. GPT-3's training data included English Wikipedia along with much larger web and book corpora) but there's still room for
additional training (fine-tuning) for specific tasks.

 Transformers are the core neural network architecture that most generative language
models, including GPT, are built on. All data processing occurs in the transformer:
decoding your query, converting it into numbers the computer can work with, processing it, and converting the result
back into human language.
 HOW CHATGPT WORKS:
Prompt -> pre-processing -> tokenization (breakdown of text into smaller units like
words or subwords) -> embedding (conversion of tokens into numerical representations, i.e.
vectors) -> self-attention (relating the tokens to one another to capture the meaning of the prompt) -> picks
appropriate model (GPT, DALL-E) -> transformer then generates the next
piece of information -> un-embedding and post-processing -> response in word form.
Tokenizing: Breaking the input into chunks (tokens) that the model can process. Removing
unnecessary or redundant data is a separate cleaning step.
Embedding: Converting tokens into vectors in an N-dimensional vector
space. Vector operations such as addition, subtraction, and dot products can then be applied.
Un-embedding: After many rounds of attention and MLPs, a final vector is
produced that represents the next bit of info. The embedding is then reversed to turn that vector into a
token, i.e. text in the same form as the input prompt.
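The tokenizing and embedding steps above can be illustrated with a minimal sketch. It assumes a plain whitespace tokenizer and a tiny hand-made 3-dimensional embedding table (`toy_embeddings`); real models such as GPT use learned subword tokenizers and far larger learned vectors, so every name and number here is hypothetical.

```python
# Toy illustration only: the vocabulary and the vectors below are made up.
toy_embeddings = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "sleep": [0.0, 0.9, 0.3],
}

def tokenize(text):
    """Split text into lowercase word tokens (real tokenizers use subwords)."""
    return text.lower().split()

def embed(tokens):
    """Look up a vector for each known token, skipping unknown tokens."""
    return [toy_embeddings[t] for t in tokens if t in toy_embeddings]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

tokens = tokenize("Cats sleep")
vectors = embed(tokens)
print(tokens)   # ['cats', 'sleep']
print(vectors)  # [[0.9, 0.1, 0.0], [0.0, 0.9, 0.3]]

# A larger dot product suggests the two vectors point in a more similar direction.
cats_dogs = dot(toy_embeddings["cats"], toy_embeddings["dogs"])
cats_sleep = dot(toy_embeddings["cats"], toy_embeddings["sleep"])
print(cats_dogs > cats_sleep)  # True
```

In this toy table 'cats' and 'dogs' were deliberately given similar vectors, which is why their dot product is larger.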

 Cleaning inputs: Human language has a lot of filler words like ‘a’, ‘the’, ‘an’, called
stop words, that add little value to the text. Removing these words lowers
the number of tokens used and can help in tasks such as search and retrieval. NLTK
is a Python package that can be used to remove stop words.
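Stop-word removal as described above can be sketched in a few lines. This is a minimal illustration, assuming a small hand-picked stop-word set (`STOP_WORDS`) rather than NLTK's much longer list, so it runs without any downloads.

```python
# Assumption: a tiny hand-picked stop-word set; NLTK ships a much longer list.
STOP_WORDS = {"a", "an", "the", "is", "of", "on"}

def remove_stop_words(text):
    """Return the words of `text` that are not in STOP_WORDS."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("The cat is on a mat"))  # ['cat', 'mat']
```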
 STEMMING: A way to strip suffixes (and sometimes prefixes) from words to distill their meaning,
e.g. ‘walks’ becomes ‘walk’. Common suffixes like
‘ed’ and ‘ing’ and plural endings are removed
using common algorithms like PorterStemmer. Depending on the algorithm, comparative forms such as ‘happier’ may also be reduced toward ‘happy’.
The stem need not be a dictionary word: ‘retrieval’ and ‘retrieve’ are typically both reduced to a stem such as ‘retriev’.
Stemming is useful in information retrieval, search, and data mining.
Be careful not to overstem and reduce words to a meaningless form, e.g.
‘university’ to ‘univers’; stemming can make words lose their contextual
meaning. Compound words like ‘whiteboard’ are usually not well handled.
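The idea of suffix stripping can be shown with a toy stemmer. This sketch is not the Porter algorithm; it assumes a hypothetical fixed suffix list (`SUFFIXES`) and strips the first suffix that matches, which is enough to show both the benefit and the overstemming risk.

```python
# Toy suffix stripper, NOT the Porter algorithm. Suffixes are tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["walks", "walked", "walking", "running"]:
    print(word, "->", naive_stem(word))
# walks -> walk
# walked -> walk
# walking -> walk
# running -> runn   (not a real word: blind suffix stripping has limits)
```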

 Lemmatization: More sophisticated than stemming (better contextual awareness).
It reduces a word to its dictionary base form (lemma), e.g. ‘running’ and ‘ran’ are
both reduced to ‘run’, and ‘university’ remains ‘university’.
This helps models group related words and sentences together.
It helps reduce overhead and the dimensionality of the vectors that words produce.
It helps in info retrieval, e.g. a search for ‘best coffee’ can also retrieve results for ‘good coffee’.
Also done using NLTK.
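Lemmatization maps a word form to its dictionary entry rather than cutting off letters. The sketch below assumes a tiny hand-written lookup table (`LEMMAS`); a real lemmatizer, such as the one in NLTK, relies on a full dictionary and part-of-speech information instead.

```python
# Hypothetical mini-dictionary mapping word forms to their lemma.
LEMMAS = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "better": "good",
    "best": "good",
}

def lemmatize(word):
    """Return the dictionary form if known, otherwise the lowercased word unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["Running", "ran", "university", "best"]])
# ['run', 'run', 'university', 'good']
```

Unlike the stemming case, irregular forms such as 'ran' and 'best' are handled because the mapping is looked up instead of derived from the spelling.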

RAG (Retrieval Augmented Generation)

A technique that allows AI models to access and incorporate external information
to generate more accurate and informative responses. Essentially, it's like giving an
AI model access to a vast library of knowledge, enabling it to provide more
comprehensive and reliable answers.
 Hybrid Approach: Combines LLMs with information retrieval systems to enhance
response accuracy and relevance.
 Why RAG? LLMs face limitations in certain tasks:
Static Knowledge: LLMs are limited to knowledge up to their last training and lack
real-time updates.
Contextual Limits: They struggle with highly specific or less common topics without
sufficient context.
Large-scale data handling: Handling large amounts of information while ensuring the
relevance and accuracy of responses is difficult.
Hallucination: Incorrect or misleading content generated by an AI model.
Data Staleness: The model's inability to provide updated info because it was trained
on a fixed dataset that does not include newer data.
 How RAG works:
Retrieval Component: Fetches relevant documents or data from an external
database. It involves the following steps:
1) Chunking: Divides large documents into smaller chunks. For example, if a doc has
100 words, the text is split into chunks of 10 words each -> these chunks are then
tokenized -> then embedded, and the
embeddings are stored in a vector database.
2) Semantic Understanding: The user then writes a prompt asking something
about the document, and this query is also embedded. The query embedding
is matched against the document embeddings, and the chunks most semantically
related to the query are returned. The most relevant chunks are found
through distance-based retrieval:
it calculates the distance between vectors and
filters out irrelevant data by setting a distance threshold. The smaller the
distance between the vectors, the more relevant the data.
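The chunking and distance-based retrieval steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the embedding is a simple bag-of-words count (real systems use learned embeddings stored in a vector database), the distance is cosine distance, and the function names and the 0.9 threshold are hypothetical.

```python
import math
from collections import Counter

def chunk_words(text, size):
    """Split text into chunks of `size` words each (the last chunk may be shorter)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Stand-in embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity; a smaller value means the texts are more similar."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def retrieve(query, chunks, threshold=0.9):
    """Return chunks closer to the query than `threshold`, nearest first."""
    query_vec = embed(query)
    scored = [(cosine_distance(query_vec, embed(c)), c) for c in chunks]
    return [chunk for distance, chunk in sorted(scored) if distance < threshold]

document = "python is used for data science colab runs python code in the browser"
chunks = chunk_words(document, 6)
print(chunks)
# ['python is used for data science', 'colab runs python code in the', 'browser']
print(retrieve("where does colab run code", chunks))
# ['colab runs python code in the']
```

Only the second chunk shares words with the query, so it is the only one whose distance falls under the threshold.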

Generation Component: The LLM generates responses based on both the retrieved data
and its own capabilities.
The retrieval step returns the most relevant passages from the document provided, and these are combined with the
user's query. This combination is called an augmented prompt. The augmented prompt then
goes to the LLM, which generates a coherent response.
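Combining the retrieved text with the user's query is mostly string assembly. The sketch below assumes a hypothetical helper, `build_augmented_prompt`, and a made-up prompt layout; actual RAG systems use their own templates, and the call to the LLM itself is left out.

```python
def build_augmented_prompt(query, retrieved_chunks):
    """Combine retrieved text and the user's query into one prompt string."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        "Answer the question using the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

prompt = build_augmented_prompt(
    "Where does Colab run code?",
    ["colab runs python code in the browser"],
)
print(prompt)  # this string is what would be sent to the LLM
```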
Here's how RAG works:
1) Query: A user submits a query to the language model.
2) Retrieval: The model accesses a knowledge base and retrieves relevant
information related to the query.
3) Augmentation: The retrieved information is integrated into the model's
response generation process.
4) Response Generation: The model generates a response that incorporates both
its own knowledge and the information from the knowledge base
 Benefits of RAG:
Updated knowledge: Access to real-time or recent information
Enhanced Accuracy: More precise and contextually relevant answers
Scalability: Handles large data more effectively.

In conclusion, RAG improves a model's ability to generate accurate responses for
queries that require knowledge which isn't present in the model's pre-training data.

* LOOK UP HOW IT USES WEIGHTS TO PRIORITIZE STUFF

MODULE 2-PYTHON FOR BEGINNERS
 Python: One of the most popular programming languages, used in web
development, data science, AI, scientific computing, automation, and more. Used
by Google, Netflix, Facebook.
 Real World Applications of Python: 1) Web Development: Frameworks like Django
and Flask make building web applications easier. 2) Data Science and ML: Libraries
like Pandas, NumPy, Scikit-learn, and TensorFlow. 3) Automation: Python scripts
automate repetitive tasks. 4) Game Development: Libraries like Pygame are used
to create games.
 Examples: YouTube: Uses Python. Instagram: Uses Django, a Python web
framework. Spotify: Uses Python for data analysis and backend services.
 Why Python? 1) Its syntax is easy to write and understand (similar to plain
English). 2) Readable Code: Python emphasizes readability, making it easier to
learn and debug. 3) Large community and resources: A vast number of tutorials, forums,
and documentation are available. 4) Extensive Libraries and Frameworks: These make
development easier and faster.

 GOOGLE COLAB (primarily runs Python code): We'll be using it to write code.
Open it in a browser and connect to a runtime first. By default this connects to a CPU runtime; at the time of these notes it gave a disk
space of 107.7GB and RAM of 12.7GB (the exact figures can vary).
 We can also connect it to a GPU. Basically it makes ML (Machine Learning)
computations much faster due to faster processing speed.
 There are two blocks labelled Code and Text under the formatting bar in Google
Colab. Code is used for writing code, and Text is used for writing comments/notes/
headings.
 The purpose of Google Colab here is that you'll be copying the code written by
ChatGPT and pasting it here to run and check for errors. If errors do arise, you'll
debug by sending the code back to ChatGPT alongside the errors that
have occurred and asking it to fix them.
 To write a comment in Python, use the symbol # before the comment.
 Variable Names in Python: 1) Must start with a letter or underscore _. 2) There
must be no white space in a variable name. 3) Cannot start with a number. 4)
Variable names can't contain special characters (only letters, digits, and underscores are allowed). 5) Names are case sensitive, i.e. Age is
not the same as age. 6) Keywords cannot be used as variable names.
 Arithmetic Operators in Python: +, -, *, /, % (modulus operator, gives the remainder,
e.g. 9%2=1), // (floor division: divides two numbers and returns the largest integer
less than or equal to the result of the division. It's like regular division, but it
rounds down to the nearest whole number, e.g. 9//2=4), ** (power operator, e.g. 2**3=8; Python does not use ^ for powers)
 Logical Operators in Python: and, or, not (written as lowercase words; Python does not use symbols such as && or ||)
 Comparison Operators: <, >, != (not equal to), <=, >=, == (equal to)
 Conditional Statements: Python uses if, elif, and else; elif takes the place of elseif.
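The operators and the if/elif/else form listed above can be tried together in a short example. The variable names (`age`, `has_ticket`, `status`) are made up for illustration.

```python
# Arithmetic operators
print(9 / 2)    # 4.5  (regular division always gives a float)
print(9 // 2)   # 4    (floor division)
print(9 % 2)    # 1    (remainder)
print(2 ** 3)   # 8    (power)

# Comparison and logical operators inside a conditional statement
age = 20
has_ticket = True
if age >= 18 and has_ticket:
    status = "allowed in"
elif age >= 18 and not has_ticket:
    status = "needs a ticket"
else:
    status = "too young"
print(status)   # allowed in
```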
 LOOPS IN PYTHON:
1) For Loop:
- For each loop (value based loop)
- Index based loop
2) While Loop
 For Loop: (index based loop)
for i in range(10):  # index starts from 0, so the count runs from 0 to 9
    print(i)  # prints the numbers 0 to 9, one per line

for i in range(3, 10):  # it now starts from 3 and ends at 9
    print(i)  # output 3,4,5,...,9

for i in range(3, 10, 2):  # the third entry indicates the jump between integers
    print(i)  # output 3,5,7,9

So the general form is range(start, end, jump), where end is exclusive, e.g. range(3, 9, 2) gives 3,5,7.

 While Loop
i = 0
while i < 10:  # REMEMBER: the line ends with a colon (:), not a semicolon (;)
    print(i)
    i += 2  # output 0,2,4,6,8

 FUNCTIONS IN PYTHON: 1) Built-in functions, 2) User-defined functions (which we
will be generating with ChatGPT)
 Built-in functions: max(), min(). Library functions such as math.sqrt() (calculates the square root of a given
number, e.g. math.sqrt(16)=4.0) need an import first (import math).
Built-in functions don't need to be defined prior to use, e.g.
output = max(3, 4)
print(output)  # gives 4

 User Defined Function Syntax:

def function_name(parameters):  # here def is a keyword
    # body of function
    return expression
e.g.
def multiply(x, y):
    output = x * y
    return output
x = 5
y = 10
output = multiply(x, y)
print(output)  # gives 50

 LISTS IN PYTHON: denoted by square brackets, e.g. clothes = ["shirts", "tie", "pants"]

print(clothes[1])  # gives back tie as the answer (indexing starts at 0)

# applying for loops on lists

1) Index based for loop
for i in range(len(clothes)):  # runs once for each item in the clothes list;
                               # len(clothes) gives the number of items (which is 3),
                               # and range(3) creates the numbers 0 to 2
    print(clothes[i])  # prints the item at the current position i in the list
2) Value based for loop
for item in clothes:  # goes through each item in the clothes list one by one
    print(item)  # prints the current item from the clothes list
**LISTS CAN BE EDITED AFTER THEY HAVE BEEN MADE**
cloth = ["shirts", "ties", "Pants"]
cloth[0] = "Jeans"
print(cloth)  # ['Jeans', 'ties', 'Pants']
 TUPLE IN PYTHON: Similar to a list, but it cannot be changed once it is created, and
it uses parentheses () instead of square brackets [].
e.g. cloth = ("Jeans", "ties", "Pants")
print(type(cloth))  # <class 'tuple'>
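The "cannot be changed" rule for tuples can be seen by attempting an assignment. This short sketch reuses the tuple from the notes and catches the resulting error so the script keeps running; the variable `message` is only there to hold the error text.

```python
cloth = ("Jeans", "ties", "Pants")
print(cloth[0])  # Jeans (reading by index works just like a list)

try:
    cloth[0] = "Shirts"  # tuples do not support item assignment
except TypeError as error:
    message = str(error)
    print("TypeError:", message)
```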
 SETS IN PYTHON: A set is a collection of unique data, meaning elements within a
set cannot be duplicated. ELEMENTS IN A SET ARE UNORDERED. SETS USE CURLY
BRACKETS {}
e.g. numbers = {1, 2, 3, 4, 5, "numbers", 2.45, 3}  # even if you add multiple 3's,
                                                    # the set will contain only one 3
print(numbers)  # e.g. {1, 2, 3, 4, 5, 2.45, 'numbers'}; the order is not guaranteed
 WE CANNOT USE INDEXES TO FETCH ITEMS IN SETS
e.g. print(numbers[2])  # TypeError: 'set' object is not subscriptable
 BUT WE CAN APPLY LOOPS
e.g. for element in numbers:
    print(element)  # prints each element in turn, in no guaranteed order
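Two common uses of sets follow directly from the uniqueness rule: removing duplicates and checking membership. The email values below are made up for illustration.

```python
emails = ["a@x.com", "b@x.com", "a@x.com"]

unique_emails = set(emails)        # duplicates are dropped automatically
print(len(unique_emails))          # 2
print("b@x.com" in unique_emails)  # True (membership test)

unique_emails.add("c@x.com")       # sets can still grow after creation
print(sorted(unique_emails))       # ['a@x.com', 'b@x.com', 'c@x.com']
```

`sorted()` is used for printing because a set has no guaranteed order of its own.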

 DICTIONARY IN PYTHON: ALSO REPRESENTED BY CURLY BRACKETS {} JUST LIKE
SETS. The difference is that a dictionary stores key-value pairs (each element consists of a key and
a value). KEYS are always unique but values may not be.
e.g. country_capitals = {'Germany': 'Berlin', 'Canada': 'Ottawa', 'England': 'London'}
Here Germany and Berlin together form 1 element.
We can access values from dictionaries with this syntax:
print(country_capitals['Canada'])  # Ottawa
IN DICTIONARIES A KEY CAN'T BE REPEATED: IF THE SAME KEY APPEARS TWICE, PYTHON
DISCARDS THE FIRST VALUE AND KEEPS THE LATEST ONE. e.g.
products = {'cans': 1, 'bins': 3, 'cans': 5}
print(products['cans'])  # 5
Applying loops to dictionaries
for element in products.items():  # .items() is new
    print(element)  # ('cans', 5) (next line) ('bins', 3)
To get each item separately
for product_type, rate in products.items():
    print(product_type, "=", rate)  # cans = 5 (next line) bins = 3
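Beyond reading values, dictionaries are usually updated in place. The sketch below shows adding a key, overwriting a key, and looking up a key that may be missing; the product names and numbers are made up.

```python
products = {"cans": 5, "bins": 3}

products["jars"] = 7            # add a new key-value pair
products["bins"] = 4            # assigning to an existing key overwrites its value

print(products.get("lids", 0))  # 0 (default returned because 'lids' is missing)
print("cans" in products)       # True (checks the keys)

total = sum(products.values())
print(total)                    # 16
```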

 HOW TO GET USER INPUT IN PYTHON:

user_input = input('enter a number: ')  # input() always returns text (a string)
print(user_input)
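Because input() returns text, a number typed by the user has to be converted before doing arithmetic with it. The sketch below assumes a hypothetical helper, `ask_for_number`, whose `read` parameter exists only so the example can run without a keyboard.

```python
def ask_for_number(read=input):
    """Read a line of text and convert it to an integer.

    `read` defaults to the built-in input(); it is a parameter only so the
    function can be tried without typing anything (a hypothetical convenience).
    """
    raw = read("enter a number: ")
    return int(raw)  # raises ValueError if the text is not a whole number

# Example with a stand-in reader instead of real keyboard input:
print(ask_for_number(read=lambda prompt: "42") + 1)  # 43
```

In a notebook, calling `ask_for_number()` with no arguments would prompt the user as usual.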

LEET CODE WEBSITE
