Understanding AI and Machine Learning

Uploaded by

dskz.2005
Copyright
© All Rights Reserved

UNIT-1

Artificial intelligence:
Artificial intelligence (AI) refers to computer systems capable of performing
complex tasks that historically only a human could do. These tasks include reasoning,
making decisions, or solving problems. In essence, AI enables machines to simulate
human intelligence and problem-solving capabilities. It’s a fascinating field that has
evolved significantly over time.
Types of AI Technologies:
1. Machine Learning: Machine learning uses algorithms trained on data sets to
create models that allow computer systems to perform tasks like making song
recommendations, identifying optimal travel routes, or translating text between
languages.
2. Deep Learning: Deep learning is a subset of machine learning that involves
neural networks with multiple layers. It excels in tasks like image recognition
and natural language processing.
3. Natural Language Processing (NLP): NLP enables computers to understand
and generate human language. Chatbots and language translation services are
examples of NLP applications.
Debates on True AI: While the term “AI” is commonly used to describe various
technologies today, there is ongoing debate about whether these technologies truly
constitute artificial intelligence. Some argue that much of what we see today is highly
advanced machine learning, which is a stepping stone toward “artificial general
intelligence” (AGI), meaning machines with true human-like intelligence. Nevertheless, when
most people mention AI today, they are usually referring to machine learning-powered
technologies like ChatGPT or computer vision.
Real-World AI Examples: Although humanoid robots like Data from Star Trek or the
Terminator’s T-800 don’t exist yet, you’ve likely interacted with machine learning-
powered services or devices. These technologies allow computers to perform tasks
previously limited to humans, such as:
• Recommending songs based on your preferences.
• Identifying the fastest travel route to a destination.
• Translating text across languages.

How does AI work?


The terms AI and machine learning are often used interchangeably, although they are not
synonyms: machine learning is one component of AI. AI systems combine specialized hardware
with software for writing and training machine learning algorithms. Together these
components make up AI, but none of them is AI on its own.
Generally speaking, AI systems work by collecting data, analyzing it, and looking for
correlations and patterns in it. The patterns found are then used to make predictions
about new data.
AI is based on three cognitive processes:
1. The learning process collects data and creates rules (algorithms) that turn it into
usable information.
2. The reasoning process involves choosing a suitable algorithm to solve a
particular task.
3. The self-correction process continually refines the algorithms so that they
produce the best possible results.

What is Machine Learning?


Machine Learning (ML) is a subset of Artificial Intelligence (AI) that involves developing
algorithms to enhance performance through experience and data use. In simple terms,
ML allows computers to learn from data, making decisions or predictions without
explicit programming. It revolves around creating algorithms that improve performance
over time by processing more data. Unlike traditional programming, ML involves
providing examples and tasks, allowing the computer to determine how to accomplish
the task based on the given examples. For example, to recognize images of cats,
thousands of cat images are provided, and the ML algorithm learns common patterns.
This ability to learn from data makes ML powerful, driving technological advancements
like voice assistants, recommendation systems, self-driving cars, and predictive
analytics.
9 Best Python Libraries for Machine Learning
1. Scikit-learn: This is a comprehensive library for classical machine learning
algorithms. It includes tools for data preprocessing, feature selection, model
training, and evaluation.
2. TensorFlow: Developed by Google, TensorFlow is an open-source deep learning
library widely used for neural network-based applications. It provides a flexible
platform for building and deploying machine learning models.
3. PyTorch: Similar to TensorFlow, PyTorch is an open-source deep learning
library. It is known for its dynamic computation graph, making it more intuitive
for researchers and developers to work with.
4. Keras: Initially a separate high-level neural networks API, Keras has been
integrated into TensorFlow. It provides a user-friendly interface for building and
training deep learning models.
5. NumPy and Pandas: While not specifically designed for machine learning,
NumPy and Pandas are essential for data manipulation and preprocessing tasks.
They are often used in conjunction with machine learning libraries.
6. Matplotlib and Seaborn: These libraries are used for data visualization, helping
researchers and developers understand and communicate the patterns and
insights within the data.
7. NLTK (Natural Language Toolkit): Specifically designed for working with
human language data, NLTK is often used in tasks like text processing and
sentiment analysis.
8. OpenCV (Open Source Computer Vision): Widely used in computer vision
tasks, OpenCV provides a range of tools for image and video analysis.
9. XGBoost and LightGBM: These are popular libraries for gradient boosting, a
machine learning technique that performs well on many kinds of data, particularly
structured (tabular) data.

Machine Learning Definitions


Algorithm: A Machine Learning algorithm is a set of rules and statistical techniques
used to learn patterns from data and draw significant information from it. It is the logic
behind a Machine Learning model. An example of a Machine Learning algorithm is the
Linear Regression algorithm.
Model: A model is the main component of Machine Learning. A model is produced by
training a Machine Learning algorithm on data. It captures the decisions that should be
taken for a given input in order to get the correct output.
Predictor Variable: A feature (or set of features) of the data that can be used to predict
the output.
Response Variable: The output variable that needs to be predicted
by using the predictor variable(s).
Training Data: The Machine Learning model is built using the training data. The
training data helps the model to identify key trends and patterns essential to predict the
output.
Testing Data: After the model is trained, it must be tested to evaluate how accurately it
can predict an outcome. This is done with the testing data set, which the model has not
seen during training.
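As a rough illustration of these definitions, the sketch below fits a one-variable linear regression in plain Python. The data set (hours of sunshine versus ice creams sold) and the function names fit_linear_regression and predict are invented for this example; in practice a library such as Scikit-learn would normally be used.

```python
# Hypothetical data: (predictor variable, response variable) pairs.
data = [(1.0, 12.0), (2.0, 14.0), (3.0, 16.0), (4.0, 18.0), (5.0, 20.0), (6.0, 22.0)]

# Training data builds the model; testing data is held back for evaluation.
train, test = data[:4], data[4:]

def fit_linear_regression(pairs):
    """The algorithm: ordinary least squares for a single predictor."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept  # the model: two learned parameters

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

model = fit_linear_regression(train)

# Mean absolute error on the testing data.
test_error = sum(abs(predict(model, x) - y) for x, y in test) / len(test)
print(model, test_error)
```

Here the algorithm is the fitting procedure, while the model is the pair of numbers it returns.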

Machine Learning Process


The Machine Learning process involves building a Predictive model that can be used to
find a solution for a Problem Statement. To understand the Machine Learning process
let’s assume that you have been given a problem that needs to be solved by using
Machine Learning.

The below steps are followed in a Machine Learning process:


Step 1: Define the objective of the Problem Statement
At this step, we must understand what exactly needs to be predicted. In our case, the
objective is to predict the possibility of rain by studying weather conditions. At this
stage, it is also essential to take mental notes on what kind of data can be used to solve
this problem or the type of approach you must follow to get to the solution.
Step 2: Data Gathering
At this stage, you should be asking questions such as:
• What kind of data is needed to solve this problem?
• Is the data available?
• How can I get the data?
Once you know the types of data that are required, you must understand how you can
obtain this data. Data collection can be done manually or by web scraping. However, if
you’re a beginner and you’re just looking to learn Machine Learning, you don’t have to
worry about getting the data. There are thousands of data resources on the web; you can
just download a data set and get going.
Coming back to the problem at hand, the data needed for weather forecasting includes
measures such as humidity level, temperature, pressure, locality, whether or not you
live in a hill station, etc. Such data must be collected and stored for analysis.
Step 3: Data Preparation
The data you collect is almost never in the right format. You will encounter many
inconsistencies in the data set, such as missing values, redundant variables, and duplicate
values. Removing such inconsistencies is essential because they can lead to
wrong computations and predictions. Therefore, at this stage, you scan the data set
for inconsistencies and fix them.
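A minimal sketch of this step in plain Python is shown below. The weather records are invented, and replacing a missing value with the column mean is only one of several possible strategies; libraries such as Pandas offer similar operations.

```python
# Hypothetical raw weather records; None marks a missing measurement.
raw = [
    {"humidity": 80, "temperature": 18.0},
    {"humidity": 80, "temperature": 18.0},   # exact duplicate
    {"humidity": 65, "temperature": None},   # missing value
    {"humidity": 90, "temperature": 14.0},
]

def drop_duplicates(rows):
    """Keep only the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def fill_missing_with_mean(rows, column):
    """Replace None in one column with the mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    mean = sum(known) / len(known)
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]

clean = fill_missing_with_mean(drop_duplicates(raw), "temperature")
print(clean)
```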
Step 4: Exploratory Data Analysis
Grab your detective glasses because this stage is all about diving deep into data and
finding all the hidden data mysteries. EDA or Exploratory Data Analysis is the
brainstorming stage of Machine Learning. Data Exploration involves understanding the
patterns and trends in the data. At this stage, all the useful insights are drawn and
correlations between the variables are understood.
For example, in the case of predicting rainfall, we know that there is a strong possibility
of rain if the temperature has fallen low. Such correlations must be understood and
mapped at this stage.
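One simple way to quantify such a relationship is the Pearson correlation coefficient. The sketch below computes it in plain Python on invented temperature and rain observations; the numbers are assumptions chosen only to illustrate a negative correlation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical observations: temperature in degrees Celsius, rain (1) or no rain (0).
temperature = [30, 28, 25, 20, 18, 15]
rained = [0, 0, 0, 1, 1, 1]

r = pearson(temperature, rained)
print(round(r, 3))  # a value close to -1 means rain goes with low temperature
```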
Step 5: Building a Machine Learning Model
All the insights and patterns derived during Data Exploration are used to build the
Machine Learning Model. This stage begins by splitting the data set into two
parts: training data and testing data. The training data is used to build
the model. The logic of the model is based on the Machine Learning Algorithm that is
being implemented.
In the case of predicting rainfall, since the output will be in the form of True (if it will
rain tomorrow) or False (no rain tomorrow), we can use a Classification Algorithm such
as Logistic Regression.
Step 6: Model Evaluation & Optimization
After building a model by using the training data set, it is finally time to put the model to
the test. The testing data set is used to check the performance of the model and how
accurately it can predict the outcome. Once the accuracy is calculated, any further
improvements to the model can be implemented at this stage. Methods like hyperparameter
tuning and cross-validation can be used to improve the performance of the model.
Step 7: Predictions
Once the model is evaluated and improved, it is finally used to make predictions. The
final output can be a categorical variable (e.g., True or False) or a continuous
quantity (e.g., the predicted value of a stock).
In our case, for predicting the occurrence of rainfall, the output will be a categorical
variable.
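Steps 5 to 7 can be sketched end to end in plain Python. Everything in this example is an assumption made for illustration: the data is randomly generated, the rule "humid, cool days are rainy" is invented, and logistic regression is trained with a simple stochastic gradient descent loop rather than a library routine.

```python
import math
import random

# Hypothetical data: (humidity %, temperature in Celsius) -> rain next day (1/0).
random.seed(0)
rows = []
for _ in range(200):
    humidity = random.uniform(20, 100)
    temperature = random.uniform(5, 35)
    rows.append(((humidity, temperature), 1 if humidity - temperature > 45 else 0))

# Step 5: split into training and testing data (80/20), then train.
random.shuffle(rows)
train, test = rows[:160], rows[160:]

def scale(features):
    # Bring both features into a similar range so gradient descent behaves well.
    humidity, temperature = features
    return (humidity / 100.0, temperature / 35.0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    x = scale(features)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

weights, bias, rate = [0.0, 0.0], 0.0, 0.5
for _ in range(300):                      # passes over the training data
    for features, label in train:
        error = predict_proba(weights, bias, features) - label
        x = scale(features)
        weights = [w - rate * error * xi for w, xi in zip(weights, x)]
        bias -= rate * error

# Step 6: evaluate accuracy on the testing data.
def predict(features):
    return predict_proba(weights, bias, features) >= 0.5

accuracy = sum(predict(f) == bool(y) for f, y in test) / len(test)

# Step 7: predict for a new, unseen day (very humid and cool).
will_rain = predict((95, 10))
print(accuracy, will_rain)
```

The prediction is a categorical value (True or False), matching the rainfall example in the text.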
Evolution of Machine Learning:
The origins of machine learning trace back to the mid-20th century and are deeply
rooted in statistics, probability theory, and artificial intelligence (AI). Rigorous
mathematical frameworks and algorithmic breakthroughs have driven machine
learning’s evolution from theoretical constructs to real-world applications. From
statistical learning theory to deep learning and reinforcement learning, each milestone
has contributed to making ML more effective and scalable. Some key milestones in ML
theory include:
1. The Birth of Formal Learning Theory (18th – Early 20th Century):
Before the digital era, mathematical principles laid the foundation for learning from
data.
Bayesian Inference (1763, Revived in the 20th Century)
Proposed by Thomas Bayes, Bayesian probability provided a framework for updating
beliefs based on new evidence. It became essential in ML for probabilistic modeling,
decision-making, and deep learning applications (e.g., Bayesian neural networks).
Markov Chains (1906, Andrey Markov)
Introduced the concept of stochastic processes where the next state depends only on
the current state (Markov Property). Used extensively in sequence modeling (e.g.,
speech recognition, reinforcement learning).
2. Turing’s Vision and Early AI Concepts (1940s – 1950s)
Turing’s "Learning Machine" (1950, Alan Turing)
In Computing Machinery and Intelligence, Turing suggested that machines could "learn"
by adjusting parameters based on experience. Proposed the Turing Test to measure
machine intelligence.
Hebbian Learning Rule (1949, Donald Hebb)
Often summarized as "neurons that fire together, wire together", this was an early formal
description of how synaptic connections strengthen in biological learning.
Influenced neural networks, particularly self-organizing maps and unsupervised
learning.
Perceptron Model (1957, Frank Rosenblatt)
The Perceptron, one of the first trainable artificial neural networks, could classify simple
patterns, but a single-layer perceptron cannot solve non-linearly separable problems
(e.g., the XOR problem).
Although limited, it inspired later deep-learning research.
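The limitation can be seen in a small experiment. The sketch below implements the classic perceptron learning rule in plain Python (the function names and the choice of 20 epochs are assumptions for this example) and trains it on the AND and XOR truth tables.

```python
def train_perceptron(samples, epochs=20, rate=1.0):
    """Single-layer perceptron with two inputs and a step activation."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            delta = rate * (target - output)   # zero when the output is correct
            w1 += delta * x1
            w2 += delta * x2
            b += delta
    return w1, w2, b

def accuracy(params, samples):
    w1, w2, b = params
    hits = sum((1 if w1 * x1 + w2 * x2 + b > 0 else 0) == t
               for (x1, x2), t in samples)
    return hits / len(samples)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(train_perceptron(AND), AND)   # linearly separable
xor_accuracy = accuracy(train_perceptron(XOR), XOR)   # not linearly separable
print(and_accuracy, xor_accuracy)
```

AND is learned perfectly, while no single straight line can separate the XOR classes, so at least one XOR case is always misclassified.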
3. The Rise of Computational Learning Theory (1960s – 1980s)
This era saw the emergence of more rigorous theoretical frameworks for learning.
Nearest Neighbor Algorithm (1967, Thomas Cover & Peter Hart)
Analyzed the nearest-neighbor rule behind k-Nearest Neighbors (k-NN), one of the earliest
non-parametric algorithms for pattern recognition. Still widely used for classification
and regression.
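A minimal k-NN classifier fits in a few lines of Python. The points, labels, and the function name knn_classify below are invented for illustration; the method itself is simply "find the k closest training examples and take a majority vote".

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D points belonging to two classes.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]

print(knn_classify(train, (2, 2)), knn_classify(train, (6, 5)))
```

There is no training phase in the usual sense: the "model" is the stored data, which is why k-NN is called non-parametric.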
The Backpropagation Algorithm (1970s, Paul Werbos; popularized in 1986 by Rumelhart,
Hinton & Williams)
A method to train multi-layer neural networks using gradient descent. Enabled deep
learning breakthroughs decades later.
Probably Approximately Correct (PAC) Learning (1984, Leslie Valiant)
A formal model defining conditions under which a learning
algorithm can generalize well to unseen data. Laid the foundation for modern machine
learning generalization theories.
Decision Trees and ID3 Algorithm (1986, Ross Quinlan)
Introduced the ID3 (Iterative Dichotomiser 3) algorithm, which was followed by C4.5.
Decision trees later became the building blocks of ensemble methods such as Random
Forests and paved the way for rule-based learning.
4. Statistical Learning and Kernel Methods (1990s)
During this period, ML evolved from heuristic-driven AI into a mathematically rigorous
field.
Support Vector Machines (1995, Vapnik & Cortes)
SVMs introduced maximum-margin classification, significantly improving model
generalization.
Kernel tricks allowed SVMs to handle non-linearly separable problems.
Boosting Algorithms (1997, Freund & Schapire)
Introduced AdaBoost, a key ensemble method that combines weak learners to form a
strong learner.
Inspired later models like Gradient Boosting Machines (GBM) and XGBoost.
Expectation-Maximization Algorithm (formalized in 1977 by Dempster, Laird & Rubin; widely applied in the 1990s)
Used in Gaussian Mixture Models (GMMs) and Hidden Markov Models (HMMs).
Played a crucial role in speech recognition, bioinformatics, and clustering tasks.
5. The Deep Learning Revolution (2000s – Present)
Deep Belief Networks (2006, Hinton, Osindero, Teh)
Showed that unsupervised pre-training could improve deep network performance.
Sparked renewed interest in neural networks.
AlexNet and ImageNet Breakthrough (2012, Krizhevsky, Sutskever, Hinton)
AlexNet, a deep convolutional neural network (CNN), won the ImageNet competition,
surpassing traditional ML approaches. Marked the dominance of deep learning in
computer vision.
Transformers and Self-Attention (2017, Vaswani et al.)
Introduced in the paper “Attention Is All You Need”, transformers revolutionized natural
language processing (NLP).
Powered state-of-the-art models like BERT, GPT, Gemini, and Claude.
Reinforcement Learning and AlphaGo (2016, DeepMind)
AlphaGo defeated human champions in Go using deep reinforcement learning combined
with tree search.
Popularized RL applications in robotics, gaming, and autonomous systems.
Paradigms for ML:
Machine learning is commonly separated into three main learning
paradigms: supervised learning, unsupervised learning, and reinforcement learning.
These paradigms differ in the tasks they can solve and in how the data is presented to
the computer. Usually, the task and the data directly determine which paradigm should
be used (and in most cases, it is supervised learning). In some cases though, there is a
choice to make. Often, these paradigms can be used together in order to obtain better
results. This chapter gives an overview of what these learning paradigms are and what
they can be used for.
Supervised Learning:
Supervised learning is the most common learning paradigm. In supervised learning, the
computer learns from a set of input-output pairs, which are called labeled examples:

The goal of supervised learning is usually to train a predictive model from these pairs.
A predictive model is a program that is able to guess the output value (a.k.a. label) for
a new unseen input. In a nutshell, the computer learns to predict using examples of
correct predictions. For example, let’s consider a dataset of animal characteristics
(note that typical datasets are much larger):
Our goal is to predict the weight of an animal from its other characteristics, so we rewrite this
dataset as a set of input-output pairs:

The input variables (here, age and sex) are generally called features, and the set of
features representing an example is called a feature vector. From this dataset, we can
learn a predictor in a supervised way using the function Predict:

In[•]:=

Out[•]=

Now we can use this predictor to guess the weight of a new animal:

Out[•]=

This is an example of a regression task (see Chapter 4, Regression) because the output is
numeric. Here is another supervised learning example where the input is text and the
output is a categorical variable ("cat" or "dog"):
In[•]:=

c = Classify[{"This cat is grey." -> "cat", "My cat is fast!" -> "cat",
  "This dog is scary..." -> "dog", "Good dog." -> "dog"}]
Out[•]=

Again, we can use the resulting model to make a prediction:


In[•]:=

Out[•]=
Because the output is categorical, this is an example of a classification task. The image
identification example from the first chapter is another example of classification since
the data consists of labeled examples.
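For readers without access to the Wolfram Language, the same four labeled sentences can be used in a small Python sketch. This is not how Classify works internally; it swaps in a simple Naive Bayes word-count model with add-one smoothing, and the names tokenize and classify are assumptions for this example.

```python
import math
import re
from collections import Counter, defaultdict

examples = [
    ("This cat is grey.", "cat"),
    ("My cat is fast!", "cat"),
    ("This dog is scary...", "dog"),
    ("Good dog.", "dog"),
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# Training phase: count how often each word appears under each label.
word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in examples:
    label_counts[label] += 1
    word_counts[label].update(tokenize(text))
vocabulary = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Inference phase: return the label with the highest log-probability."""
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / len(examples))
        for word in tokenize(text):
            # Add-one (Laplace) smoothing handles words never seen in training.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("A fast cat"), classify("That dog is loud"))
```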
As we can see, supervised learning is separated into two phases: a learning phase
during which a model is produced and a prediction phase during which the model is
used. The learning phase is called the training phase because the model is trained to
perform the task. The prediction phase is called the evaluation phase or inference
phase because the output is inferred (i.e. deduced) from the input.
Regression and classification are the main tasks of supervised learning, but this
paradigm goes beyond these tasks. For example, object detection is an application of
supervised learning for which the output consists of multiple classes and their
corresponding box positions:

Unsupervised Learning
Unsupervised learning is the second most used learning paradigm. It is not used as
much as supervised learning, but it unlocks different types of applications. In
unsupervised learning, there are neither inputs nor outputs; the data is just a set of
examples:

Unsupervised learning can be used for a diverse range of tasks. One of them is
called clustering (see Chapter 6, Clustering), and its goal is to separate data examples
into groups called clusters:

An application of clustering could be to automatically separate customers of a company
to create better marketing campaigns. Clustering is also simply used as an exploration
tool to obtain insights about the data and make informed decisions.
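The customer example can be sketched with k-means, one common clustering method. The code below is a simplified one-dimensional version in plain Python; the spending figures, the starting centers, and the function name kmeans_1d are assumptions for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning values to the nearest center and
    moving each center to the mean of its assigned values."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical yearly spend per customer; no labels are provided.
spend = [12, 15, 14, 90, 95, 88, 13, 92]
centers, clusters = kmeans_1d(spend, centers=[0.0, 100.0])
print(centers, clusters)
```

The algorithm discovers a low-spend and a high-spend group without ever being told which customer belongs where.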
Another classic unsupervised task is called dimensionality reduction. The goal of
dimensionality reduction is to reduce the number of variables in a dataset while trying
to preserve some properties of the data, such as distances between examples. Here is an
example of a dataset of three variables reduced to two variables:

Dimensionality reduction can be used for a variety of tasks, such as compressing the
data, learning with missing labels, creating search engines, or even creating
recommendation systems. Dimensionality reduction can also be used as an exploration
tool to visualize an entire dataset in a reduced space.
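A small sketch can make the idea concrete. The code below reduces two variables to one with principal component analysis (PCA), using the closed-form direction of largest variance for a 2x2 covariance matrix. It is a simplified stand-in for the three-to-two reduction described above, and the points are invented.

```python
from math import atan2, cos, sin

def pca_2d_to_1d(points):
    """Project 2-D points onto their direction of largest variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Angle of the principal axis of a 2x2 covariance matrix.
    angle = 0.5 * atan2(2 * sxy, sxx - syy)
    direction = (cos(angle), sin(angle))
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected

# Hypothetical points lying on a straight line: one variable is enough.
points = [(1, 1), (2, 2), (3, 3), (4, 4)]
direction, projected = pca_2d_to_1d(points)
print(direction, projected)
```

Because these points lie on a line, the one-dimensional projection preserves the distances between neighboring examples exactly.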

Anomaly detection is another task that can be tackled in an unsupervised way. Anomaly
detection concerns the identification of examples that are anomalous, a.k.a. outliers.
Here is an example of anomaly detection performed on a simple numeric dataset:

This task could be useful for detecting fraudulent credit card transactions, cleaning a
dataset, or detecting when something is going wrong in a manufacturing process.
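One simple unsupervised approach for numeric data is to flag values that lie far from the median. The sketch below uses the modified z-score based on the median absolute deviation; the readings and the threshold of 3.5 are assumptions for this example, not values from the text.

```python
from statistics import median

def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    center = median(values)
    deviations = [abs(v - center) for v in values]
    mad = median(deviations)          # median absolute deviation
    if mad == 0:
        return []                     # no spread to compare against in this sketch
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 55, 11]
outliers = find_outliers(readings)
print(outliers)
```

The median is used instead of the mean because a single extreme value shifts the mean but barely moves the median.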
Unsupervised learning is a bit less used than supervised learning, mostly because the
tasks it solves are less common and are harder to implement than predictive tasks.
However, unsupervised learning can be applied to a more diverse set of tasks than
supervised learning. Nowadays, unsupervised learning is a key element of many
machine learning applications and is also used as a tool to explore data. Moreover, many
researchers believe that unsupervised learning is how humans learn most of their
knowledge and will, therefore, be the key to developing future artificially intelligent
systems.
Reinforcement Learning
The third most classic learning paradigm is called reinforcement learning, which is a
way for autonomous agents to learn. Reinforcement learning is fundamentally different
from supervised and unsupervised learning in the sense that the data is not provided as
a fixed set of examples. Rather, the data to learn from is obtained by interacting with an
external system called the environment. The name “reinforcement learning” originates
from behavioral psychology, but it could just as well be called “interactive learning.”
Reinforcement learning is often used to teach agents, such as robots, to learn a given
task. The agent learns by taking actions in the environment and
receiving observations from this environment:

Typically, the agent starts its learning process by acting randomly in the environment,
and then the agent gradually learns from its experience to perform the task better using
a sort of trial-and-error strategy. The learning is usually guided by a reward that is given
to the agent depending on its performance. More precisely, the agent learns a policy that
maximizes this reward. A policy is a model predicting which action to take given
previous actions and observations.
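The trial-and-error loop can be sketched with tabular Q-learning, one standard reinforcement learning algorithm. The environment below is a made-up corridor of five positions with a reward only at the right end; the learning rate, discount factor, and exploration rate are arbitrary choices for illustration.

```python
import random

random.seed(0)
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left or move right

def step(state, action):
    """Environment: returns (next state, reward, episode finished)."""
    next_state = min(max(state + action, 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Q-table: estimated future reward for each (state, action) pair.
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for _ in range(200):                       # episodes
    state, done = 0, False
    while not done:
        if random.random() < epsilon:      # explore: act randomly
            action = random.choice(ACTIONS)
        else:                              # exploit: best known action
            best = max(q[(state, a)] for a in ACTIONS)
            action = random.choice([a for a in ACTIONS if q[(state, a)] == best])
        next_state, reward, done = step(state, action)
        future = 0.0 if done else gamma * max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + future - q[(state, action)])
        state = next_state

# The learned policy: the best action in each non-terminal state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)
```

The agent starts by wandering randomly and ends with a policy that moves right in every position, which is the shortest way to the reward.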
Reinforcement learning can, for example, be used by a robot to learn how to walk in a
simulated environment. Here is a snapshot from the classic Ant-v2 environment:
It is also possible for a real robot to learn without a simulated environment, but real
robots are slow compared to simulated ones and current algorithms have a hard time
learning fast enough. A mitigation strategy consists of learning to simulate the real
environment, a field known as model-based reinforcement learning, which is under
active research.
Reinforcement learning can also be used to teach computers to play games. Famous
examples include AlphaGo, which can beat any human player at the board game Go, or
AlphaStar, which can do the same for the video game StarCraft:
Reinforcement learning is probably the most exciting paradigm since the agent is
learning by interacting, like a living being. Active systems have the potential to learn
better than passive ones because they can decide by themselves what to explore in
order to improve. We can imagine all sorts of applications using this paradigm, from a
farmer robot that learns to improve crop production, to a program that learns to trade
stocks, to a chatbot that learns by having discussions with humans. Unfortunately,
current algorithms need a large amount of data to be effective, which is why most
reinforcement learning applications use virtual environments. Also, reinforcement
learning problems are generally more complicated to handle than supervised and
unsupervised ones. For these reasons, reinforcement learning is less used than other
paradigms in practical applications. As research is progressing, it is likely that
algorithms will need less data to operate and that simpler tools will be dev

Types of Data:
Data refers to the set of observations or measurements used to train machine learning
models. The performance of such models is heavily influenced by both the quality and
quantity of data available for training and testing. Machine learning algorithms cannot
be trained without data. Cutting-edge development in Artificial Intelligence, automation,
and data analysis is powered mostly by vast sets of data.

Properties of Data
• Volume: The scale of data generated every millisecond.
• Variety: Different data types like healthcare, images, videos, and audio.
• Velocity: The speed of data generation and streaming.
• Value: The meaningful insights data provides.
• Veracity: The accuracy and reliability of data.
• Viability: Data's adaptability for integration into systems.
• Security: Preventing tampering and unwanted access.
• Accessibility: Simple access for decision-making.
• Integrity: Accuracy and consistency throughout its lifecycle.
• Usability: Simplicity and interpretability for end-users.

Types of Data in Machine Learning

Based on Structure
1. Structured Data: Structured data is organized in a tabular form of rows and columns.
Spreadsheets and databases frequently contain this type of data.
• Examples: Sales records, customer details, financial transactions.
• Usage: Useful in supervised learning tasks like regression and classification.
2. Unstructured Data: Processing unstructured data is more challenging because it
lacks a preset structure.
• Examples: Text files, pictures, videos, and audio files.
• Usage: Found in speech-to-text systems, image recognition, and natural language
processing (NLP) applications.
3. Semi-Structured Data: This type of data falls somewhere between unstructured and
structured data. It has organizational elements but does not fit neatly into a tabular
format.
• Examples: JSON files, XML files, and NoSQL databases.
• Usage: Often used in web scraping, API responses, and social media analysis.
Based on Representation
• Numerical Data: Features measured in numbers (e.g., age, income).
• Categorical Data: Represents categories or labels (e.g., gender, fruit type).
• Ordinal Data: Categorical data with a meaningful order (e.g., clothing sizes: Small,
Medium, Large).
Based on Labeling
• Labeled Data: Includes input variables and corresponding target outputs.
Example: Features like "age" and "income" with a label like "loan approval
status."
• Unlabeled Data: Contains only input variables without any target labels.
Example: Images without annotations.
From Data to Knowledge
• Data: Data is raw, unprocessed facts, values, text, sounds, or images that have not
been interpreted or analyzed. Without data, training models and driving modern
research or automation would be impossible.
• Information: As data gets processed, interpreted, and organized, it turns into
information. It gives users meaningful insights that can be understood easily
and utilized.
• Knowledge: Knowledge is the product of combining experience, learning,
information, and insights. It allows individuals or businesses to construct
awareness, create ideas, and make well-informed decisions.

Real-World Examples of ML Data

Domain          Data Example
Healthcare      Patient records, lab results, imaging
Finance         Transaction logs, credit history
E-commerce      User reviews, purchase history
Transportation  GPS data, traffic reports
Social Media    Text, images, user engagement metrics

1. Training Data
• Used to train the model.
• The model learns the patterns from this labeled data.
2. Validation Data
• Helps fine-tune the model by evaluating it during training.
• Useful for hyperparameter tuning and early stopping.
3. Testing Data
• Used after training is complete.
• Evaluates how well the model generalizes to unseen data.
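A three-way split can be sketched in plain Python as follows. The 70/15/15 proportions, the fixed seed, and the function name split_dataset are assumptions for this example; other ratios are equally common.

```python
import random

def split_dataset(rows, train=0.7, validation=0.15, seed=0):
    """Shuffle a copy of the rows and cut it into three disjoint parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

rows = list(range(100))   # stand-in for 100 labeled examples
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before splitting avoids accidentally putting, for example, only the oldest records into the training set.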
Matching:
Matching in machine learning involves identifying similar or related data points within
a dataset or across multiple datasets. This process uses algorithms to find patterns
and connections, and is often used in tasks like duplicate detection, recommendation
systems, and entity resolution. Matching approaches can be categorized into rule-based
matching, fuzzy matching, and machine learning-assisted matching, each offering different
strengths depending on the complexity and nature of the data.
Key Concepts:
• Rule-based matching:
This method relies on predefined rules and thresholds to determine matches. For
example, a rule might specify that two customer records are considered a match if their
email addresses are identical.
• Fuzzy matching:
This technique addresses the limitations of rule-based matching by considering
variations and similarities between data points. It often employs algorithms like
Levenshtein distance or Jaro-Winkler to measure string similarity, allowing for matches
even with typos or slight differences in names or addresses.
• Machine learning-assisted matching:
This approach leverages machine learning algorithms trained on labeled data (examples
of matches and non-matches) to learn complex patterns and relationships. This allows
the model to identify matches that might be missed by rule-based or fuzzy matching
methods, especially in large and complex datasets.
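The difference between rule-based and fuzzy matching can be illustrated with the Levenshtein distance mentioned above. The sketch below is a plain Python implementation; the customer names and the threshold of two edits are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]

def is_fuzzy_match(a, b, max_distance=2):
    return levenshtein(a.lower(), b.lower()) <= max_distance

# Rule-based matching: exact equality misses the typo.
exact = "Jon Smith" == "John Smith"
# Fuzzy matching: one missing letter is within the allowed distance.
fuzzy = is_fuzzy_match("Jon Smith", "John Smith")
print(exact, fuzzy)
```

A machine learning-assisted matcher could use such distances as input features and learn the decision threshold from labeled matches and non-matches.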
Examples:
• Product Matching:
In e-commerce, machine learning can be used to match product listings from different
vendors that refer to the same item. This can involve fuzzy matching on product names,
descriptions, and other attributes to identify potential duplicates.
• Customer Matching:
Matching customer records across different systems (e.g., CRM, marketing automation)
can be crucial for creating a unified customer view. Machine learning can help identify
duplicate customer profiles even with slight variations in names, addresses, or contact
information.
• Recommendation Systems:
In recommendation systems, matching algorithms are used to identify users with
similar preferences or items that are frequently purchased together. This allows for
personalized recommendations based on collaborative filtering or other matching
techniques.
• Entity Resolution:
Matching data points across different sources to identify the same real-world entity is a
fundamental task in data management. Machine learning can be used to resolve entities,
such as linking customer records to social media profiles or identifying related business
entities.
• Template Matching:
In computer vision, template matching is used to identify objects by comparing image
sections to predefined templates. This is used in applications like robotics, vehicle
tracking, and medical imaging for object detection.
Benefits of Machine Learning in Matching:
• Improved Accuracy:
Machine learning models can learn complex patterns and relationships, leading to more
accurate matching results compared to rule-based methods.
• Handling Data Variations:
Machine learning-assisted matching can handle variations in data formats, typos, and
inconsistencies, making it more robust than rule-based approaches.
• Scalability:
Machine learning models can be trained on large datasets and efficiently applied to new
data, making them suitable for handling large-scale matching tasks.
Stages in Machine Learning:
Machine learning has become an indispensable tool in today's data-driven world,
powering everything from recommendation systems to predictive analytics. However,
to truly harness the power of machine learning, it's crucial to understand the process
that transforms raw data into actionable insights. Whether you’re a seasoned data
scientist or just beginning your journey, this guide will walk you through the key steps
involved in a machine learning project.
1. Problem Definition
The first step in any machine learning project is to clearly define the problem you want
to solve. This involves understanding the business context and the specific outcomes
you’re aiming to achieve. Ask yourself:
 What is the goal of the project?
 What questions do we want the data to answer?
 How will the results be used?
For example, if you're working on a customer churn prediction model, the goal might be
to identify which customers are likely to leave so that targeted retention strategies can
be implemented.
2. Data Collection
Data is the foundation of any machine learning project. Once the problem is defined, the
next step is to gather the relevant data. This could involve collecting data from internal
databases, APIs, web scraping, or using publicly available datasets. It’s crucial to ensure
that the data collected is relevant, representative, and sufficient in quantity to support
the analysis.
3. Data Cleaning and Preprocessing
Raw data is often messy and incomplete. Before you can feed it into a machine learning
model, it needs to be cleaned and preprocessed. This step includes:
 Handling Missing Values: Filling in or removing missing data.
 Removing Outliers: Eliminating data points that don’t fit the general pattern.
 Data Normalization: Adjusting the data to a standard scale.
 Encoding Categorical Variables: Converting non-numeric data into a format that
the model can understand.
Data preprocessing is critical because the quality of your input data directly impacts the
performance of your machine learning models.
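As a rough illustration of three of the steps listed above (missing values, outliers, and categorical encoding), here is a minimal sketch in plain Python. The function names, the sample values, and the choice of mean imputation and a z-score cutoff of 2 are assumptions made for this example; real projects usually rely on a library such as pandas or scikit-learn.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def remove_outliers(values, max_z=2.0):
    """Drop values lying more than `max_z` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)
    return [v for v in values if abs(v - mu) / sigma <= max_z]

def one_hot(categories):
    """Map each category to a list of 0/1 indicators, one per distinct level."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

filled = impute_mean([20, None, 40, 60])            # None becomes 40
trimmed = remove_outliers([10, 12, 11, 13, 12, 95])  # 95 is dropped
encoded = one_hot(["red", "blue", "red"])            # columns: blue, red
print(filled, trimmed, encoded)
```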
4. Exploratory Data Analysis (EDA)
Exploratory Data Analysis involves investigating the data to discover patterns, spot
anomalies, test hypotheses, and check assumptions. This is usually done through
visualization techniques like scatter plots, histograms, and correlation matrices. EDA
helps you understand the data's underlying structure and provides insights that guide
feature selection and engineering.
5. Feature Engineering and Selection
Features are the inputs that the model uses to make predictions. Feature engineering
involves creating new features from the existing data, which can improve the model's
performance. Feature selection, on the other hand, involves choosing the most relevant
features to reduce the complexity of the model and prevent overfitting. Techniques like
recursive feature elimination, principal component analysis (PCA), and correlation
analysis are commonly used.

6. Model Selection
Choosing the right model is crucial for the success of your machine learning project.
This decision depends on the nature of your problem (e.g., classification, regression,
clustering), the size of your data, and the complexity of the relationships you’re trying to
capture. Common models include:
 Linear Regression: For predicting continuous values.
 Logistic Regression: For binary classification problems.
 Decision Trees and Random Forests: For both classification and regression
tasks.
 Neural Networks: For complex tasks like image and speech recognition.
7. Model Training
Once the model is selected, the next step is to train it using your data. This involves
feeding the cleaned and processed data into the model, allowing it to learn the
relationships between the input features and the target variable. The model’s
parameters are adjusted to minimize error using algorithms like gradient descent.
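To make the idea of adjusting parameters by gradient descent concrete, the sketch below fits a one-feature linear model by repeatedly stepping against the gradient of the mean squared error. The data, learning rate, and epoch count are illustrative assumptions, not values from any particular project.

```python
def train_linear_model(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = train_linear_model(xs, ys)
print(round(w, 3), round(b, 3))
```

A learning rate that is too large makes the updates diverge, which is one reason it is treated as a hyperparameter to tune.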
8. Model Evaluation
After training the model, it’s important to evaluate its performance using metrics
relevant to your problem. Common evaluation metrics include:
 Accuracy: The ratio of correctly predicted instances to the total instances.
 Precision and Recall: Measures for classification problems that provide insights
into the balance between false positives and false negatives.
 Mean Squared Error (MSE): A measure of the difference between actual and
predicted values in regression tasks.
Using techniques like cross-validation helps ensure that the model generalizes well to
unseen data.
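The metrics listed above follow directly from counts of correct and incorrect predictions. The following sketch computes them by hand for a binary problem; the label vectors are made-up examples, and treating class 1 as the positive class is an assumption.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])
print(accuracy, precision, recall, mse)
```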
9. Hyperparameter Tuning
Hyperparameters are the settings that control the learning process of the model (e.g.,
learning rate, number of trees in a random forest). Tuning these hyperparameters can
significantly improve the model’s performance. Techniques like Grid Search and
Random Search are used to find the optimal set of hyperparameters.
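Grid Search simply tries every combination in a predefined grid and keeps the best one. The sketch below shows that loop; `validation_error` is a hypothetical stand-in for "train a model with these settings and return its validation error", and the grid values are assumptions chosen for illustration.

```python
from itertools import product

def validation_error(learning_rate, n_trees):
    """Hypothetical stand-in for training a model and scoring it."""
    return (learning_rate - 0.1) ** 2 + ((n_trees - 100) / 100) ** 2

def grid_search(grid, score_fn):
    """Evaluate every combination in `grid`; return the lowest-error one."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

grid = {"learning_rate": [0.01, 0.1, 1.0], "n_trees": [50, 100, 200]}
best_params, best_score = grid_search(grid, validation_error)
print(best_params, best_score)
```

Random Search replaces the exhaustive loop with a fixed number of randomly sampled combinations, which scales better when the grid is large.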
10. Model Deployment
Once the model is trained, evaluated, and tuned, the next step is deployment. This
involves integrating the model into a production environment where it can start
generating predictions on new data. Model deployment can be done using various tools
and platforms like Docker, AWS, or Azure.
11. Monitoring and Maintenance
The work doesn’t stop once the model is deployed. Continuous monitoring is necessary
to ensure that the model performs well over time, especially as new data becomes
available. Retraining the model with updated data, adjusting features, or even selecting
new models might be necessary to maintain its accuracy and relevance.
Data Acquisition:-
What is Data Acquisition?
The process of collecting and storing data for machine learning from a variety of
sources is known as data acquisition (DAQ).
The procedure entails gathering, examining, and using crucial data to guarantee precise
measurements, instantaneous observation, and knowledgeable decision-making.
Sensors, measuring devices, and a computer work together in DAQ systems to
transform physical parameters into electrical signals, condition and amplify those
signals, and then store them for analysis.
What is Data Acquisition in Machine Learning?
In machine learning, "data acquisition" refers to the procedure of obtaining and
compiling data from diverse sources in order to test and train machine learning
models. In order to enable computers and software to manipulate and modify signals
from real-world occurrences, this technique entails digitizing such signals. Data
Acquisition aims to get a complete and representative dataset that successfully captures
the patterns and changes in the data that are crucial for productive machine learning
results.
The process of acquiring data also includes taking into account the variables that
affect its quality and utility, such as volume, velocity, and diversity.

Successful machine learning begins with data collection, which supplies the raw
information required to train models and draw defensible conclusions. The gathering of
high-quality data is essential for providing machine learning algorithms with the
necessary input to enable them to learn and perform better.

What Does a DAQ System Measure?


A Data Acquisition (DAQ) system is capable of measuring several physical parameters,
such as:
 Temperature: Temperature can be measured using RTDs, thermistors, or
thermocouples in DAQ systems.
 Pressure: In a variety of settings, including industrial operations and medical
equipment, pressure is measured using pressure sensors.
 Voltage: Power systems, electronics, and electrical engineering all depend on the
ability of DAQ devices to monitor the voltage levels in electrical circuits.
 Current: DAQ systems can measure current flow using current sensors or
shunts. Current measurement is essential in electrical systems.
 Strain and Pressure: Deformation and pressure in materials are measured
using strain gauges and pressure sensors, which is crucial for material science
and structural health monitoring.
 Shock and Vibration: In a variety of fields, including mechanical, aeronautical,
and civil engineering, accelerometers and vibration sensors are used to monitor
shock, vibration, and acceleration.
 RPM, Angle, and Discrete Events: DAQ systems are crucial for robotics,
automation, and mechanical systems because they can measure rotational speed,
angle, and discrete events.
 Distance and Displacement: Ultrasonic, laser, and encoder sensors are among
the sensors that DAQ systems can use to detect distance and displacement.
 Weight: Measuring weight is crucial for a number of applications, including
quality control, logistics, and industrial automation.
Components of Data Acquisition System
A data acquisition system is built from the following key components, which together
determine how data is captured and processed:
1. Sensors: Sensors are devices that quantify and translate physical parameters like
voltage, pressure, or temperature into electrical impulses. Later, these signals are sent
to the measuring devices for additional analysis.
2. Signal Conditioner: Signal conditioning is the process of improving raw sensor
signals so they can be reliably understood. To make sure that the signals are
dependable, clear, and compatible with the rest of the system, signal conditioning
procedures include isolation, amplification, and filtering.
 Amplification: Improves accuracy by increasing the signal strength.
 Filtering: Removes unwanted noise from the signal.
 Isolation: Electrically separates the sensor from the DAQ system.
3. Analog-to-digital Converter: After the signals are conditioned, they must be
translated into a digital format that computers can comprehend using an analog-to-
digital converter (ADC). The continuous analog signals are transformed into discrete
digital values so that the system can process and store them.
4. Data Logger: The data logger serves as the operation's central nervous system. A
data logger is a device or software program responsible for managing incoming data,
controlling the acquisition process, and storing the data for subsequent analysis.
5. Data Processing Unit: After receiving data from the ADC, a dedicated card in the
system handles signal-processing tasks such as sampling, buffering, and data transfer.
6. Data Storage : Acquired data is stored in the computer’s memory for real-time
monitoring.
The physical parameters are measured using sensors, which convert the physical
signals into electrical signals. The signals are then conditioned, amplified, and
converted into digital data using analog-to-digital converters (ADCs). The digital
data is then processed, analyzed, and stored using computers and software.
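The chain described above (amplify, filter, then digitize) can be imitated in software to see what each stage does. This is a simplified sketch: the sensor voltages, the gain of 100, the moving-average filter, the 5 V reference, and the 10-bit resolution are all assumptions for illustration, and real DAQ hardware performs the first stages in analog circuitry.

```python
def amplify(samples, gain):
    """Scale every sample by a constant gain."""
    return [gain * s for s in samples]

def moving_average(samples, window=3):
    """Simple low-pass filter: average each sample with its predecessors."""
    out = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def adc(voltage, v_ref=5.0, bits=10):
    """Quantize a voltage in [0, v_ref] to an integer code of `bits` bits."""
    levels = 2 ** bits
    clamped = min(max(voltage, 0.0), v_ref)
    return min(int(clamped / v_ref * levels), levels - 1)

# Hypothetical raw sensor readings in volts; the third one is a noise spike.
sensor_volts = [0.010, 0.012, 0.030, 0.011, 0.013]
conditioned = moving_average(amplify(sensor_volts, gain=100.0))
codes = [adc(v) for v in conditioned]
print(codes)
```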
What are the Major Purposes of Data Acquisition?
Although there are many different and important reasons, some of the most important
ones are as follows:
 Long-term analysis and trend detection: Data acquisition systems make long-term
analysis possible by logging, capturing, and storing measurement data over an
extended period of time.
 Measurement that is accurate and dependable: DAQ systems and equipment
provide measurement that is accurate and dependable, enabling uses like optical
analysis and light intensity monitoring.
 Industry Leading devices: DAQ systems and devices are widely used,
connecting to a variety of sensors and collaborating with contemporary
computers, which makes them an excellent option for scientists and researchers
looking for accurate data.
 Enhanced productivity and dependability of machines: Data capture gives an
organization more control over its operations and enables quicker reaction to
potential breakdowns, which helps keep processes optimized.
 Faster problem analysis and resolution: Real-time data acquisition systems
allow measurements to be produced and shown instantly, which allows
personnel to respond to issues more quickly and get the machine operating at
peak efficiency in less time.
 Reduction of data redundancy: DAQ systems let businesses operate without
interference from extraneous data by making it easier to analyze the information
they have collected.
What are the Different Data Acquisition Options?
Data acquisition (DAQ) systems are made to measure, record, and analyze the data
provided by sensors, transducers, and other devices. Selecting the right DAQ system
depends on the requirements of the particular application. There are various types of
DAQ systems, each with its own advantages and disadvantages. The following are a few
of the options for acquiring data:
 Data loggers: These are compact, lightweight gadgets with extended data
recording capabilities. They are frequently employed in applications like
industrial automation and environmental monitoring where data collection in
the field is required.
 Data acquisition devices: These are plug-and-play items that can be linked via
USB or other interfaces to a computer. They are perfect for projects where
requirements don't alter because they offer set functionality.
 Data acquisition systems: These are modular systems that can be set up to
accommodate certain measurement requirements. They are perfect for complex
systems that need several channels and high-speed data gathering because of
their tremendous versatility.
 Computer-Connected DAQ Modules: These DAQ systems provide an affordable
way to get data by connecting to a computer. Compared with stand-alone
systems, they are frequently lighter and smaller.
 Stand-Alone or Portable DAQ Systems: These are DAQ systems that record
and analyze data without the need for extra hardware because they come with
an integrated computer. They are frequently employed in situations where using
a separate computer is either inconvenient or not possible.
 Modular DAQ Systems: These systems are composed of a chassis and several
modules that can be added or removed. They are very flexible and well suited to
applications that need to acquire data quickly over several channels.
 PXIe Modular DAQ Systems: These are high-performance DAQ systems that
link several modules together via the PXIe (PXI Express) interface. They are
well suited to applications that demand low latency and high channel counts
because they provide fast data capture.
Types of Data Acquisition Sources
 Sensors: Convert physical parameters to electrical signals.
 IoT devices: Collect data from remote sources using secure communication
channels and encryption.
 Network devices: Collect data from network devices using secure
communication channels and encryption.
 Manual data entry: Implement robust access control mechanisms,
authentication, and authorization processes to increase the security of manual
data entry.
 Experiments: Collect primary data through experiments, such as wet lab
experiments like gene sequencing.
 Observations: Collect primary data through observations, such as surveys,
sensors, or in situ collection.
 Simulations: Collect primary data through simulations, such as theoretical
models like climate models.
 Scraping or compiling: Collect primary data through webscraping, text mining,
or compiling data from various sources.
 Institutionalized data banks: Collect secondary data from institutionalized
data banks, such as census or gene sequences.
 Published datasets: Collect secondary data from published datasets, such as
those found on Kaggle, GitHub, or UCI Machine Learning Repository.
 APIs: Collect secondary data through application programming interfaces (APIs),
which allow clients to request data from a website's server.
 Surveys: Collect primary data through surveys, which can be online or offline.
Importance of Data Acquisition in Machine Learning
Data acquisition (DAQ) is the foundational task that precedes any machine learning
project and should not be overlooked. Here's why it matters:
 Fuel for Learning: Unlike biological organisms, which sense the world directly,
machine learning models are pattern-recognition systems that know only the
data they are given. If data quality is poor, the model cannot learn well or make
credible predictions. DAQ ensures that you have the right "fuel" for your
model's learning.
 Quality In, Quality Out: The phrase "garbage in, garbage out" captures this
best. If the data is inaccurate, incomplete, or irrelevant, the model inherits those
flaws and carries them into its predictions. Successful DAQ supplies high-quality
data, which leads to robust and reliable machine learning models.
 Relevance is Key: DAQ ensures that you gather data about the problem your
model is meant to learn. The more relevant the data, the better the model
captures the relationships that matter and the more precise its conclusions.
 Shaping Model Performance: The amount of data you collect often affects the
model's performance, since many machine learning algorithms need large
datasets to learn properly. A sound DAQ strategy collects enough data for the
model to generalize correctly to inputs it has not seen.
The Measurement Process
The measurement process determines how many units of a specific quantity or
quality an object possesses. It is an essential procedure in many disciplines,
such as science, engineering, construction, and daily life. The measurement process
includes the following steps:
 Define the quantity that has to be measured: The defining of the quantity to
be measured is the first step in the measurement process, which also always
includes a comparison with a known quantity of the same kind. Finding the
physical quantity or attribute that has to be measured is part of this process.
 Comparing the object or quantity: The object is compared to a known quantity
of the same kind.
 Transduction: The quantity or item to be measured is "transduced" into an
analogous measurement signal if it cannot be directly compared.
 Transmission and processing of the signal: To generate a measurement
reading, the physical signal is routed through the system and subjected to
processing.
 Calibration: The process of obtaining the reference signal from items with
known quantities is known as calibration.
 Quantization: The measurement is quantized by counting or splitting the signal
into equal and known-sized pieces, and the physical signal is compared with the
reference signal.
Data Acquisition Tools
Data acquisition tools are software and hardware systems for gathering, analyzing,
and recording data from a variety of sensors, instruments, or devices. They are useful
in scientific research, industrial automation, engineering, and other domains where
data gathering and processing are critical. A few tools for acquiring data are:
 DriveSpy: A data collection tool for Windows operating systems created by
Digital Intelligence Forensic Solutions.
 DewesoftX: A software suite for acquiring and analyzing data that provides
strong tools for these tasks.
 LabVIEW: A popular software program used in many different industries that
offers tools for data collection, processing, and visualization.
 Catman: A data acquisition software package that offers tools for data
acquisition, analysis, and visualization, and is commonly used in industrial
automation and engineering.
 Matlab: A software package that provides tools for data acquisition, analysis, and
visualization, and is widely used in various industries.
 FlexPro: A data acquisition software package that offers tools for data
acquisition, analysis, and visualization, and is commonly used in industrial
automation and engineering.
Feature Engineering:-
Feature engineering is the process of turning raw data into useful features that help
improve the performance of machine learning models. It includes choosing, creating and
adjusting data attributes to make the model’s predictions more accurate. The goal is to
make the model better by providing relevant and easy-to-understand information.
A feature or attribute is a measurable property of data that is used as input for
machine learning algorithms. Features can be numerical, categorical or text-based
representing essential data aspects which are relevant to the problem. For example in
housing price prediction, features might include the number of bedrooms, location and
property age.
Importance of Feature Engineering
Feature engineering can significantly influence model performance. By refining
features, we can:
 Improve accuracy: Choosing the right features helps the model learn better,
leading to more accurate predictions.
 Reduce overfitting: Using fewer, more important features helps the model
avoid memorizing the data and perform better on new data.
 Boost interpretability: Well-chosen features make it easier to understand how
the model makes its predictions.
 Enhance efficiency: Focusing on key features speeds up the model’s training
and prediction process, saving time and resources.
Processes Involved in Feature Engineering
Let's look at the processes involved in feature engineering:
1. Feature Creation: Feature creation involves generating new features from domain
knowledge or by observing patterns in the data. It can be:
 Domain-specific: Created based on industry knowledge like business rules.
 Data-driven: Derived by recognizing patterns in data.
 Synthetic: Formed by combining existing features.

2. Feature Transformation: Transformation adjusts features to improve model
learning:
 Normalization & Scaling: Adjust the range of features for consistency.
 Encoding: Converts categorical data to numerical form, e.g. one-hot
encoding.
 Mathematical transformations: Like logarithmic transformations for
skewed data.
3. Feature Extraction: Extracting meaningful features can reduce dimensionality and
improve model accuracy:
 Dimensionality reduction: Techniques like PCA reduce features while
preserving important information.
 Aggregation & Combination: Summing or averaging features to simplify the
model.
4. Feature Selection: Feature selection involves choosing a subset of relevant features
to use:
 Filter methods: Based on statistical measures like correlation.
 Wrapper methods: Select based on model performance.
 Embedded methods: Feature selection integrated within model training.
5. Feature Scaling: Scaling ensures that all features contribute equally to the model:
 Min-Max scaling: Rescales values to a fixed range like 0 to 1.
 Standard scaling: Normalizes to have a mean of 0 and variance of 1.
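The two scaling methods named above reduce to one-line formulas. The sketch below implements both with the standard library; the income values are made up for illustration, and using the population standard deviation (`pstdev`) is a choice made here to match what common scaling utilities do.

```python
from statistics import mean, pstdev

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def standard_scale(values):
    """Shift and scale values to mean 0 and variance 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

incomes = [30000, 45000, 60000, 75000]  # illustrative feature values
scaled = min_max_scale(incomes)
standardized = standard_scale(incomes)
print(scaled)
print(standardized)
```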
Steps in Feature Engineering
Feature engineering can vary depending on the specific problem but the general steps
are:
1. Data Cleansing: Identify and correct errors or inconsistencies in the dataset to
ensure data quality and reliability.
2. Data Transformation: Transform raw data into a format suitable for modeling
including scaling, normalization and encoding.
3. Feature Extraction: Create new features by combining or deriving information
from existing ones to provide more meaningful input to the model.
4. Feature Selection: Choose the most relevant features for the model using
techniques like correlation analysis, mutual information and stepwise
regression.
5. Feature Iteration: Continuously refine features based on model performance by
adding, removing or modifying features for improvement.
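As one example of the selection step, a filter method based on correlation analysis can be sketched as follows. The housing-style feature names and values, the target prices, and the 0.5 threshold are hypothetical; the Pearson coefficient itself is the standard formula.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def select_features(features, target, threshold=0.5):
    """Keep features whose absolute correlation with the target is high."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]

# Hypothetical housing data: three candidate features and a price target.
features = {
    "bedrooms": [1, 2, 3, 4, 5],
    "age_years": [30, 5, 22, 11, 17],
    "constant": [1, 1, 1, 1, 1],
}
price = [100, 150, 210, 240, 310]
selected = select_features(features, price)
print(selected)
```

Correlation only captures linear relationships, so a feature dropped by this filter may still be useful to a non-linear model.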

Common Techniques in Feature Engineering


1. One-Hot Encoding: One-Hot Encoding converts categorical variables into binary
indicators, allowing them to be used by machine learning models.
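import pandas as pd

# Small example frame with one categorical column.
data = {'Color': ['Red', 'Blue', 'Green', 'Blue']}
df = pd.DataFrame(data)

# Create one indicator column per distinct colour.
df_encoded = pd.get_dummies(df, columns=['Color'], prefix='Color')
print(df_encoded)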
Output
Color_Blue Color_Green Color_Red
0 False False True
1 True False False
2 False True False
3 True False False
2. Binning: Binning transforms continuous variables into discrete bins, making them
categorical for easier analysis.
import pandas as pd

data = {'Age': [23, 45, 18, 34, 67, 50, 21]}
df = pd.DataFrame(data)

# right=False makes each bin include its left edge and exclude its right edge.
bins = [0, 20, 40, 60, 100]
labels = ['0-20', '21-40', '41-60', '61+']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
print(df)

Output
Age Age_Group
0 23 21-40
1 45 41-60
2 18 0-20
3 34 21-40
4 67 61+
5 50 41-60
6 21 21-40
3. Text Data Preprocessing:
Involves removing stop-words, stemming and vectorizing text data to prepare it
for machine learning models.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# The stop-word list must be downloaded once before first use.
nltk.download('stopwords', quiet=True)

texts = ["This is a sample sentence.", "Text data preprocessing is important."]
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
vectorizer = CountVectorizer()

def preprocess_text(text):
    # Split on whitespace, drop stop-words, and stem the remaining words.
    words = text.split()
    words = [stemmer.stem(word)
             for word in words if word.lower() not in stop_words]
    return " ".join(words)

cleaned_texts = [preprocess_text(text) for text in texts]
X = vectorizer.fit_transform(cleaned_texts)
print("Cleaned Texts:", cleaned_texts)
print("Vectorized Text:", X.toarray())

4. Feature Splitting: Divides a single feature into multiple sub-features, uncovering
valuable insights and improving model performance.
import pandas as pd

data = {'Full_Address': [
    '123 Elm St, Springfield, 12345', '456 Oak Rd, Shelbyville, 67890']}
df = pd.DataFrame(data)

# Each regex group becomes one new column: street, city, zip code.
df[['Street', 'City', 'Zipcode']] = df['Full_Address'].str.extract(
    r'([0-9]+\s[\w\s]+),\s([\w\s]+),\s(\d+)')
print(df)
Output
Full_Address Street City Zipcode
0 123 Elm St, Springfield, 12345 123 Elm St Springfield 12345
1 456 Oak Rd, Shelbyville, 67890 456 Oak Rd Shelbyville 67890...
Tools for Feature Engineering
There are several tools available for feature engineering. Here are some popular ones:
 Featuretools: Automates feature engineering by extracting and transforming
features from structured data. It integrates well with libraries like pandas and
scikit-learn making it easy to create complex features without extensive coding.
 TPOT: Uses genetic algorithms to optimize machine learning pipelines,
automating feature selection and model optimization. It visualizes the entire
process, helping you identify the best combination of features and algorithms.
 DataRobot: Automates machine learning workflows including feature
engineering, model selection and optimization. It supports time-dependent and
text data and offers collaborative tools for teams to efficiently work on projects.
 Alteryx: Offers a visual interface for building data workflows, simplifying feature
extraction, transformation and cleaning. It integrates with popular data sources
and its drag-and-drop interface makes it accessible for non-programmers.
 H2O.ai: Provides both automated and manual feature engineering tools for a
variety of data types. It includes features for scaling, imputation and encoding
and offers interactive visualizations to better understand model results.
What are the different ways of Data Representation?
Statistics is the process of collecting and analyzing data in large quantities. It is a
branch of mathematics dealing with the collection, analysis, interpretation, and
presentation of numerical facts and figures.
Statistics rests on two concepts:
 Statistical Data
 Statistical Science
Statistics must be expressed numerically and should be collected systematically.

Data Representation
The word data refers to facts about people, things, events, or ideas. A data item can
be a title, an integer, or any other value. After collecting data, the investigator has to
condense it into tabular form to study its salient features. Such an arrangement is
known as the presentation of data.
Data Representation is the process of condensing the collected data into a tabular or
graphical form.
The raw data can be arranged in different orders: ascending, descending, or
alphabetical.
Example: Let the marks obtained by 10 students of class V in a class test, out of 50
according to their roll numbers, be:
39, 44, 49, 40, 22, 10, 45, 38, 15, 50
The data in the given form is known as raw data. The above given data can be placed in
the serial order as shown below:

Roll No.    Marks
1           39
2           44
3           49
4           40
5           22
6           10
7           45
8           38
9           15
10          50

Now suppose you want to analyse the standard of achievement of the students.
Arranging the marks in ascending or descending order gives a better picture.
Ascending order:
10, 15, 22, 38, 39, 40, 44, 45, 49, 50
Descending order:
50, 49, 45, 44, 40, 39, 38, 22, 15, 10
When raw data is arranged in ascending or descending order, it is known as arrayed data.
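The same arrangement can be produced programmatically. A minimal sketch using the marks from the example above and Python's built-in `sorted`:

```python
# Raw data: marks of the 10 students, in roll-number order.
marks = [39, 44, 49, 40, 22, 10, 45, 38, 15, 50]

ascending = sorted(marks)                 # smallest to largest
descending = sorted(marks, reverse=True)  # largest to smallest

print("Ascending: ", ascending)
print("Descending:", descending)
```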
Types of Graphical Data Representation
Bar Chart
A bar chart represents the collected data visually. Quantities such as amounts and
frequencies can be drawn as horizontal or vertical bars, and the bars can be single or
grouped. A bar chart makes it easy to compare different items: by looking at all the
bars, it is easy to see which categories in a group of data stand out from the others.
Now let us understand bar chart by taking this example
Let the marks obtained by 5 students of class V in a class test, out of 10 according to
their names, be:
7,8,4,9,6
The data in the given form is known as raw data. The above given data can be placed in
the bar chart as shown below:

Name        Marks
Akshay      7
Maya        8
Dhanvi      4
Jaslen      9
Muskan      6

Histogram
A histogram is a graphical representation of the distribution of quantitative data. It
looks similar to a bar graph, but the two differ in an important way: a bar graph shows
the frequency of categorical data, that is, data grouped into two or more categories
such as gender or months, whereas a histogram is used for quantitative data.
Line Graph
A graph that uses lines and points to show change over time is known as a line
graph. A line graph could show, for example, the number of animals left on earth, the
growth of the world population, or the daily increase or decrease in the number of
bitcoins. Line graphs show how quantities change over time, and a single line graph
can show two or more such changes together.
Pie Chart
A pie chart is a circular graph that represents numerical proportions as slices. In
most cases it can be replaced by other plots such as a bar chart, box plot, or dot plot.
Research shows that it is difficult to compare the different sections of a given pie
chart, or to compare data across different pie charts.
Frequency Distribution Table


A frequency distribution table summarises the values in a dataset and how often each
one occurs. The table has two columns: the first lists the various outcomes in the
data, and the second lists the frequency of each outcome. Putting the data into such a
table makes it easier to understand and analyze.
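A frequency distribution table like the one described can be built with `collections.Counter` from the standard library. The list of marks here is a made-up example, not data from the text.

```python
from collections import Counter

# Hypothetical marks (out of 10) obtained by eight students.
marks = [7, 8, 4, 9, 6, 8, 7, 8]

frequency = Counter(marks)         # maps each outcome to its count
table = sorted(frequency.items())  # rows ordered by outcome

print("Outcome  Frequency")
for outcome, count in table:
    print(f"{outcome:<8} {count}")
```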
Machine Learning Model Evaluation:
Model evaluation is the process of using metrics to analyze the performance of a
model. Think of training a model like teaching a student: model evaluation is like
giving them a test to see whether they truly learned the subject or just memorized the
answers.
Model development is a multi-step process, and we need to keep a check on how well
the model predicts future data and analyze the model's weaknesses. There are many
metrics for that. Cross-validation is one model evaluation technique that is applied
during the training phase.
Cross-Validation: The Ultimate Practice Test
Cross-validation is a method in which we do not use the whole dataset for training.
Instead, some part of the dataset is reserved for testing the model. There are many
types of cross-validation, of which k-fold cross-validation is the most widely used. In
k-fold cross-validation the original dataset is divided into k subsets, known as folds.
Training and testing are repeated k times: in each round one fold is used for testing
and the remaining k-1 folds are used for training, so every sample is tested exactly
once. Averaging the k scores gives a more reliable estimate of how well the model
generalizes than a single train/test split.
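The fold rotation described above can be sketched in plain Python. This is an illustrative helper (the name k_fold_indices is hypothetical), not the scikit-learn implementation; in practice sklearn.model_selection provides KFold and cross_val_score for the same purpose, usually with shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Ten samples split into five folds: each round tests on 2 and trains on 8
folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: test={test_idx}, train={train_idx}")
```

Each sample lands in the test set exactly once, so the k scores together cover the whole dataset.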
Holdout is the simplest approach. It is used with neural networks as well as with many
classifiers. In this technique the dataset is divided into a training set and a test set,
usually in a ratio such as 70:30 or 80:20. Normally a large percentage of the data is
used for training the model and a small portion of the dataset is used for testing the
model.
Evaluation Metrics for Classification Task
Classification is used to categorize data into predefined labels or classes. To evaluate the
performance of a classification model we commonly use metrics such as accuracy,
precision, recall, F1 score and the confusion matrix. These metrics are useful in assessing
how well the model distinguishes between classes, especially in cases of imbalanced
datasets. By understanding the strengths and weaknesses of each metric, we can select
the most appropriate one for a given classification problem.
In this Python code, we import the iris dataset, which has features such as the length
and width of sepals and petals. The target values are Iris setosa, Iris virginica, and Iris
versicolor. After importing the dataset we divide it into train and test
datasets in the ratio 80:20. Then we create a decision tree classifier and train it. After
that, we make predictions and calculate the accuracy score, precision, recall,
and F1 score. We also plot the confusion matrix.
Importing Libraries and Dataset
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn import metrics  # used later for the confusion matrix
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import (precision_score,
    recall_score, f1_score, accuracy_score)
Now let's load the iris flowers toy dataset from the sklearn.datasets module and then
split it into training and testing parts (for model evaluation) in the 80:20 ratio.
iris = load_iris()
X = iris.data
y = iris.target
# Holdout method: divide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=20, test_size=0.20)
Now, let's train a Decision Tree Classifier model on the training data, and then we will
move on to the evaluation part of the model using different metrics.
# Named clf so that it does not shadow the sklearn tree module
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
1. Accuracy
Accuracy is defined as the ratio of the number of correct predictions to the total number
of predictions. This is the most fundamental metric used to evaluate the model. The
formula is given by:
Accuracy = (Number of correct predictions) / (Total number of predictions)
However, accuracy has a drawback: it can be misleading on an imbalanced dataset.
Suppose a model assigns most of the data to the majority class label. It achieves high
accuracy, yet it fails to classify the minority class labels and therefore performs
poorly on the cases that often matter most.
print("Accuracy:", accuracy_score(y_test,y_pred))
Output:
Accuracy: 0.9333333333333333
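The drawback of accuracy on imbalanced data can be made concrete with a small sketch. The labels below are hypothetical: 95 negatives and 5 positives, with a "model" that always predicts the majority class.

```python
# Hypothetical imbalanced test set: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class
y_pred = [0] * 100

# Accuracy: correct predictions divided by total predictions
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Share of the minority class that the model actually finds
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
minority_recall = true_positives / sum(y_true)

print("Accuracy:", accuracy)
print("Recall on minority class:", minority_recall)
```

The accuracy is 95% even though not a single minority sample is detected, which is why the metrics in the following sections are used alongside accuracy.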
2. Precision and Recall
Precision is the ratio of true positives to the sum of true positives and false
positives. It measures how many of the positive predictions are correct.
Precision = TP / (TP + FP)
The drawback of Precision is that it does not consider the True Negatives and False
Negatives, so a model that misses many positives can still have high precision.
Recall is the ratio of true positives to the sum of true positives and false
negatives. It measures how many of the actual positive samples the model finds.
Recall = TP / (TP + FN)
The drawback of Recall is that it does not penalize false positives, so optimizing for
recall alone often leads to a higher false positive rate.
print("Precision:", precision_score(y_test,y_pred, average="weighted"))
print('Recall:', recall_score(y_test, y_pred,average="weighted"))
Output:
Precision: 0.9435897435897436
Recall: 0.9333333333333333
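To see where these numbers come from, precision and recall can be computed by hand from the TP, FP and FN counts. This is a minimal binary sketch on hypothetical labels; for the three-class iris problem, average="weighted" computes the metric per class and averages the results weighted by the number of samples in each class.

```python
# Hypothetical binary labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted YES, actually YES
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted YES, actually NO
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted NO, actually YES

precision = tp / (tp + fp)  # share of positive predictions that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print("TP, FP, FN:", tp, fp, fn)
print("Precision:", precision)
print("Recall:", recall)
```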
3. F1 score
F1-score is the harmonic mean of precision and recall. Because of the precision-recall
trade-off, increasing precision usually decreases recall and vice versa.
The goal of the F1 score is to combine precision and recall into a single number.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
# calculating f1 score
print('F1 score:', f1_score(y_test, y_pred, average="weighted"))
Output:
F1 score: 0.9327777777777778
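The following sketch shows why the harmonic mean is used rather than a simple average: it stays low unless both precision and recall are high. The helper name f1_from and the input values are illustrative assumptions.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of similar size: F1 lies between them
balanced = f1_from(0.6, 0.75)

# Perfect precision but very low recall: F1 stays close to the weaker value
lopsided = f1_from(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print("Balanced F1:", round(balanced, 4))
print("Lopsided F1:", round(lopsided, 4), "vs arithmetic mean:", arithmetic_mean)
```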
4. Confusion Matrix
A confusion matrix is an N x N matrix where N is the number of target classes. It
tabulates the actual classes against the predicted classes. Some terminology used with
the matrix is as follows:
• True Positives (TP): the output in which both the actual and the predicted values
are YES.
• True Negatives (TN): the output in which both the actual and the predicted values
are NO.
• False Positives (FP): the output in which the actual value is NO but the predicted
value is YES.
• False Negatives (FN): the output in which the actual value is YES but the predicted
value is NO.
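from sklearn import metrics
import matplotlib.pyplot as plt

# Rows are the actual classes, columns are the predicted classes
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix,
    display_labels=[0, 1, 2])
cm_display.plot()
plt.show()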
Output:
Confusion matrix for the output of the model
In the output the accuracy of the model is 93.33%. Precision is approximately 0.944 and
Recall is 0.933. F1 score is approximately 0.933. Finally the confusion matrix is plotted.
Here class labels denote the target classes:
0 = Setosa
1 = Versicolor
2 = Virginica
From the confusion matrix, we see that 8 setosa test cases were correctly predicted. 11
versicolor test cases were also correctly predicted by the model. Of the virginica test
cases, 2 were misclassified and the remaining 9 were correctly predicted.
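How such a matrix is filled in can be sketched without any library. The labels below are hypothetical and the helper name is illustrative; the layout follows the same convention as scikit-learn, with actual classes as rows and predicted classes as columns.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Return an n_classes x n_classes table: rows = actual, columns = predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Hypothetical three-class labels (0, 1, 2)
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 1, 2]

cm = confusion_matrix_counts(y_true, y_pred, n_classes=3)
for row in cm:
    print(row)

# Correct predictions sit on the diagonal, so accuracy is diagonal sum / total
diagonal = sum(cm[i][i] for i in range(3))
print("Accuracy from matrix:", diagonal / len(y_true))
```

Every off-diagonal cell is a misclassification: the row says what the sample really was and the column says what the model predicted.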
5. AUC-ROC Curve
AUC (Area Under the Curve) is an evaluation metric that summarizes how a classification
model performs across different threshold values. The Receiver Operating Characteristic
(ROC) curve plots the model's performance at each threshold. The curve has two
parameters:
• TPR: the True Positive Rate. It follows the same formula as Recall.
• FPR: the False Positive Rate. It is defined as the ratio of false positives to the
sum of false positives and true negatives.
This curve is useful because it helps us determine the model's capacity to distinguish
between different classes. Let us illustrate this with the help of a simple Python example.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 0, 1]
# Predicted scores for the positive class
y_pred = [1, 0, 0.9, 0.2]
auc = np.round(roc_auc_score(y_true, y_pred), 3)
print("Auc", auc)
Output:
Auc 0.75
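The AUC score is a useful metric for evaluating a model because it highlights the model's
capacity to separate the classes. An AUC of 0.5 corresponds to random guessing and an
AUC of 1 to perfect separation, so the closer the score is to 1 the better the model. In
the above code the score of 0.75 means that the model ranks a positive example above
a negative one in three out of four positive-negative pairs.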
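The same example can be worked through by hand. The sketch below computes TPR and FPR at one threshold and then the AUC as the fraction of positive-negative pairs that are ranked correctly (ties counted as half), which is an equivalent way of reading the area under the ROC curve. The function names are illustrative.

```python
def tpr_fpr(y_true, scores, threshold):
    """True positive rate and false positive rate at a given threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def auc_by_ranking(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


y_true = [1, 0, 0, 1]
scores = [1, 0, 0.9, 0.2]

tpr, fpr = tpr_fpr(y_true, scores, threshold=0.5)
auc = auc_by_ranking(y_true, scores)
print("TPR:", tpr, "FPR:", fpr)
print("AUC:", auc)
```

Only the pair (0.2, 0.9) is ranked the wrong way round, which gives 3 correct pairs out of 4.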
Evaluation Metrics for Regression Task
Regression is used to predict continuous values. It is mostly used to find the relation
between a dependent variable and an independent variable. For classification we use a
confusion matrix, accuracy, F1 score, etc. In regression, however, we predict a
numerical value that usually differs somewhat from the actual output, so we measure
the error, which summarizes how close the prediction is to the actual value.
There are many metrics available for evaluating a regression model.
In this Python Code we have implemented a simple regression model using the Mumbai
weather CSV file. This file comprises Day, Hour, Temperature, Relative Humidity, Wind
Speed and Wind Direction. The link for the dataset is here.
We are interested in finding the relationship between Temperature and Relative Humidity.
Here Relative Humidity is the dependent variable and Temperature is the independent
variable. We perform linear regression and use different metrics to evaluate the
performance of our model. To calculate the metrics we make extensive use of the sklearn
library.
# importing the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_absolute_error,
    mean_squared_error, mean_absolute_percentage_error)
Now let's load the data into a pandas DataFrame and then split it into training and
testing parts (for model evaluation) in the 80:20 ratio.
df = pd.read_csv('[Link]')
X = df.iloc[:, 2].values  # Temperature (independent variable)
Y = df.iloc[:, 3].values  # Relative Humidity (dependent variable)
X_train, X_test,Y_train, Y_test = train_test_split(X, Y,test_size=0.20, random_state=0)
Now, let's train a simple linear regression model on the training data, and then we will
move on to the evaluation part of the model using different metrics.
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
regression = LinearRegression()
regression.fit(X_train, Y_train)
Y_pred = regression.predict(X_test)
1. Mean Absolute Error (MAE)
This is the simplest metric used to analyze the loss over the whole dataset. The error
is the difference between the predicted and actual values, and MAE is defined as the
average of the absolute errors: we take the modulus of each error, sum the results and
then divide by the total number of data points. MAE is never negative. The formula of
MAE is given by
MAE = (1 / n) * Σ|At - Ft|
where At is the actual value, Ft is the predicted value and n is the number of data points.
mae = mean_absolute_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Absolute Error", mae)
Output:
Mean Absolute Error 1.7236295632503873
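The calculation behind this metric can be followed step by step on a few numbers. The values below are hypothetical and are not taken from the weather dataset.

```python
# Hypothetical actual and predicted values (not from the weather dataset)
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# Step 1: absolute value (modulus) of each error
absolute_errors = [abs(a - f) for a, f in zip(actual, predicted)]

# Step 2: sum the absolute errors and divide by the number of data points
mae = sum(absolute_errors) / len(actual)

print("Absolute errors:", absolute_errors)
print("Mean Absolute Error:", mae)
```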
2. Mean Squared Error(MSE)
The most commonly used metric is Mean Squared Error or MSE. It is a function used to
calculate the loss. We find the difference between each predicted value and the actual
value, square the result and then average over all data points present in the
dataset. MSE is never negative because the differences are squared. The smaller the
value of MSE, the better the performance of our model. The formula of MSE is given by
MSE = (1 / n) * Σ(At - Ft)^2
where At is the actual value, Ft is the predicted value and n is the number of data points.
mse = mean_squared_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Square Error", mse)
Output:
Mean Square Error 3.9808057060106954
3. Root Mean Squared Error(RMSE)
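RMSE is a popular metric and is an extension of MSE: it is the square root of the MSE,
so it is expressed in the same units as the target variable. It indicates how far the
data points are spread around the best fit line. A lower value means that the data
points lie closer to the best fit line.
RMSE = sqrt(MSE) = sqrt((1 / n) * Σ(At - Ft)^2)
# Taking the square root of the MSE works in every scikit-learn version;
# the squared=False argument is deprecated in recent releases.
rmse = mean_squared_error(y_true=Y_test, y_pred=Y_pred) ** 0.5
print("Root Mean Square Error", rmse)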
Output:
Root Mean Square Error 1.9951956560725306
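MSE and RMSE can be checked by hand in the same way. The sketch below uses hypothetical values and also shows that squaring makes the largest error dominate the result.

```python
import math

# Hypothetical actual and predicted values (not from the weather dataset)
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# Square each error so that large errors are penalized more heavily
squared_errors = [(a - f) ** 2 for a, f in zip(actual, predicted)]

mse = sum(squared_errors) / len(actual)  # average of the squared errors
rmse = math.sqrt(mse)                    # back in the units of the target

print("Squared errors:", squared_errors)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
```

The single error of 1.5 contributes 2.25 of the total 3.5, which is why MSE and RMSE are more sensitive to outliers than MAE.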
4. Mean Absolute Percentage Error (MAPE)
MAPE is used to express the error in terms of percentage. We take the absolute
difference between each actual and predicted value and divide it by the actual value.
The results are then summed up and averaged. The smaller the percentage, the better
the performance of the model. Note that the sklearn function returns the value as a
fraction rather than multiplied by 100, so an output of 0.0233 corresponds to about
2.33%. The formula is given by
MAPE = (1 / n) * Σ(|At - Ft| / |At|) * 100
mape = mean_absolute_percentage_error(Y_test, Y_pred, sample_weight=None,
    multioutput='uniform_average')
print("Mean Absolute Percentage Error", mape)
Output:
Mean Absolute Percentage Error 0.02334408993333347
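The MAPE formula above can be traced on a few hypothetical numbers. The sketch computes the fraction first and then multiplies by 100 to obtain the percentage; it assumes that no actual value is zero, since the formula divides by the actual value.

```python
# Hypothetical actual and predicted values; no actual value is zero
actual = [100.0, 200.0, 50.0, 400.0]
predicted = [110.0, 190.0, 55.0, 400.0]

# Relative error of each prediction: |At - Ft| / |At|
relative_errors = [abs(a - f) / abs(a) for a, f in zip(actual, predicted)]

mape_fraction = sum(relative_errors) / len(actual)  # average as a fraction
mape_percent = mape_fraction * 100                  # expressed as a percentage

print("Relative errors:", relative_errors)
print("MAPE (fraction):", mape_fraction)
print("MAPE (percent):", mape_percent)
```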
Search and learning are fundamental and interconnected concepts within machine
learning and artificial intelligence. While distinct, they often work in conjunction to
achieve complex tasks.
Search in Machine Learning:
Search in machine learning refers to the process of exploring a defined space to find an
optimal solution, a specific pattern, or a desired outcome. This can manifest in several
ways:
• Algorithm Search:
Many machine learning algorithms, particularly in areas like reinforcement learning or
optimization, involve searching through a state space or a parameter space to find the
best policy or model parameters. Examples include A* search for pathfinding or
genetic algorithms for optimizing complex functions.
• Data Search and Retrieval:
Machine learning models are frequently used to enhance search capabilities in
information retrieval systems. Techniques like vector search, neural search, and
semantic search leverage ML to understand the meaning and context of queries, leading
to more relevant results.
• Concept Learning as Search:
In concept learning, the process of identifying and defining concepts within data can be
viewed as a search through the space of possible concept definitions.
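As an illustration of searching a state space, here is a minimal sketch of A* search on a small grid. The grid, the use of unit step costs and the Manhattan-distance heuristic are assumptions made for this example only.

```python
import heapq


def a_star(grid, start, goal):
    """Return the length of the shortest path from start to goal, or None.

    grid is a list of strings in which '#' marks a wall.
    Moves are up, down, left and right, each with cost 1.
    """
    def heuristic(cell):
        # Manhattan distance never overestimates on a 4-connected grid
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    frontier = [(heuristic(start), 0, start)]  # entries are (g + h, g, cell)
    best_cost = {start: 0}
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale entry: a cheaper route to this cell was found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    priority = new_cost + heuristic((nr, nc))
                    heapq.heappush(frontier, (priority, new_cost, (nr, nc)))
    return None  # the goal cannot be reached


# Hypothetical grid: S = start at (0, 0), G = goal at (2, 3), '#' = wall
grid = [
    "S.#.",
    "..#.",
    "...G",
]
steps = a_star(grid, start=(0, 0), goal=(2, 3))
print("Shortest path length:", steps)
```

The heuristic steers the search toward the goal, so fewer cells are expanded than in an uninformed search, while the result is still a shortest path.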
Learning in Machine Learning:
Learning in machine learning refers to the ability of a system to improve its
performance on a specific task over time through exposure to data, without being
explicitly programmed for every scenario. Key aspects of learning include:
• Pattern Recognition and Prediction:
Machine learning algorithms learn from data to identify patterns, build models, and
make predictions or classifications on new, unseen data.
• Adaptation and Improvement:
Learning enables systems to adapt to new information and improve their accuracy and
effectiveness as more data becomes available.
• Types of Learning:
This encompasses various paradigms like supervised learning (learning from labeled
data), unsupervised learning (finding patterns in unlabeled data), and reinforcement
learning (learning through trial and error with a reward system).
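The idea of improving through exposure to data can be illustrated with a very small supervised learning sketch: a one-parameter model y = w * x fitted by gradient descent. The data, the learning rate and the number of steps are assumptions chosen for illustration.

```python
# Hypothetical labeled data that follows the pattern y = 2 * x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0               # initial guess for the model parameter
learning_rate = 0.01  # assumed step size
losses = []

for step in range(100):
    # Mean squared error of the current model on the data
    errors = [w * x - y for x, y in zip(xs, ys)]
    loss = sum(e ** 2 for e in errors) / len(xs)
    losses.append(loss)

    # Gradient of the loss with respect to w, then a small step downhill
    gradient = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    w -= learning_rate * gradient

print("Learned w:", round(w, 4))
print("First loss:", losses[0], "Last loss:", losses[-1])
```

The rule y = 2 * x is never written into the program; the parameter moves toward 2 only because each step reduces the error on the data.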
Interconnection:
Search and learning are often intertwined. Learning can guide and improve search
processes by providing better heuristics or models to navigate the search space
efficiently. Conversely, search can generate valuable data or experiences that are then
used by learning algorithms to refine their models or policies. This symbiotic
relationship is crucial for developing intelligent systems capable of solving intricate
problems.