Disclaimer
Every effort has been made to make this book as accurate as
possible. However, there may be inadvertent typographical or
content errors. The author and publisher shall have no liability
or responsibility to any person or entity regarding any loss or
damage incurred, or alleged to have incurred, directly or
indirectly, by the information contained in this book. You
hereby agree to be bound by this disclaimer, or you may
return this book within 30 days for a refund.
Any error, mistake or discrepancy may be brought to our
notice which shall be taken care of in the forthcoming edition.
The text of this Publication, or any part thereof, may not be
reproduced or transmitted in any form by any means,
electronic or mechanical, including photocopying, recording,
storage in an information retrieval system, or otherwise, without the
prior written permission of the Author and Publisher.
Preface
This book covers theoretical as well as practical knowledge.
As the author, I have tried to cover all the concepts in machine learning. Also
I would like to thank “”
(Masters of Science in
Information Technology – Data Science), who with his
immense knowledge in the field of Data Science helped me in
compiling the entire textbook.
Machine learning
Dr. Abbas Lalani
(PHD, MCA, MBA, MCOM, [Link])
Machine learning
INTRODUCTION:-
Machine Learning, often abbreviated as ML, is
a subset of artificial intelligence (AI) that focuses on the
development of computer algorithms that improve
automatically through experience and by the use of
data.
In simpler terms, machine learning enables computers
to learn from data and make decisions or predictions
without being explicitly programmed to do so.
At its core, machine learning is all about creating and
implementing algorithms that facilitate these decisions
and predictions. These algorithms are designed to
improve their performance over time, becoming more
accurate and effective as they process more data.
In traditional programming, a computer follows a set of
predefined instructions to perform a task. However, in
machine learning, the computer is given a set of
examples (data) and a task to perform, but it's up to the
computer to figure out how to accomplish the task
based on the examples it's given.
For instance, if we want a computer to recognize
images of cats, we don't provide it with specific
instructions on what a cat looks like. Instead, we give it
thousands of images of cats and let the machine
learning algorithm figure out the common patterns and
features that define a cat.
Over time, as the algorithm processes more images, it
gets better at recognizing cats, even when presented
with images it has never seen before.
This ability to learn from data and improve over time
makes machine learning incredibly powerful and
versatile.
It's the driving force behind many of the technological
advancements we see today, from voice assistants and
recommendation systems to self-driving cars and
predictive analytics.
Machine learning vs AI vs deep learning:-
Machine learning is often confused with artificial
intelligence or deep learning. Let's take a look at how
these terms differ from one another.
AI refers to the development of programs that behave
intelligently and mimic human intelligence through a set
of algorithms. The field focuses on three skills:
learning, reasoning, and self-correction to obtain
maximum efficiency. AI can refer to either machine
learning-based programs or even explicitly programmed
computer programs.
Machine learning is a subset of AI, which uses
algorithms that learn from data to make predictions.
These predictions can be generated through supervised
learning, where algorithms learn patterns from existing
data, or unsupervised learning, where they discover
general patterns in data.
ML models can predict numerical values based on
historical data, categorize events as true or false, and
cluster data points based on commonalities.
Deep learning, on the other hand, is a subfield of
machine learning dealing with algorithms based
essentially on multi-layered artificial neural networks
(ANN) that are inspired by the structure of the human
brain.
Unlike conventional machine learning algorithms, deep
learning algorithms are less linear, more complex, and
hierarchical, capable of learning from enormous
amounts of data, and able to produce highly accurate
results.
Language translation, image recognition, and
personalized medicines are some examples of deep
learning applications.
The Importance of Machine Learning:-
In the 21st century, data is the new oil, and machine
learning is the engine that powers this data-driven
world. It is a critical technology in today's digital age,
and its importance cannot be overstated. This is
reflected in the industry's projected growth, with the US
Bureau of Labor Statistics predicting a 26% growth in
jobs between 2023 and 2033.
Here are some reasons why it’s so essential
in the modern world:
Data processing. One of the primary reasons machine
learning is so important is its ability to handle and make
sense of large volumes of data. With the explosion of
digital data from social media, sensors, and other
sources, traditional data analysis methods have
become inadequate.
Machine learning algorithms can process these vast
amounts of data, uncover hidden patterns, and provide
valuable insights that can drive decision-making.
Driving innovation. Machine learning is driving
innovation and efficiency across various sectors. Here
are a few examples:
● Healthcare. Algorithms are used to predict
disease outbreaks, personalize patient treatment
plans, and improve medical imaging accuracy.
● Finance. Machine learning is used for credit
scoring, algorithmic trading, and fraud detection.
● Retail. Recommendation systems, supply
chains, and customer service can all benefit
from machine learning.
● The techniques used also find applications in
sectors as diverse as agriculture, education, and
entertainment.
How Does Machine Learning Work:-
Understanding how machine learning works involves
delving into a step-by-step process that transforms raw
data into valuable insights. Let's break down this
process:
Step 1: Data collection
The first step in the machine learning process is data
collection. Data is the lifeblood of machine learning -
the quality and quantity of your data can directly impact
your model's performance. Data can be collected from
various sources such as databases, text files, images,
audio files, or even scraped from the web.
Step 2: Data preprocessing
Data preprocessing is a crucial step in the machine
learning process. It involves cleaning the data
(removing duplicates, correcting errors), handling
missing data (either by removing it or filling it in), and
normalizing the data (scaling the data to a standard
format).
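The three preprocessing operations named above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records in `raw` are made up, missing values are filled with the column mean, and "normalizing" is taken to mean min-max scaling to the range 0 to 1. Other choices (dropping rows, standardization) are equally valid.

```python
from statistics import mean

# Hypothetical raw records: (hours_studied, hours_slept); None marks a missing value.
raw = [(5.0, 6.0), (2.0, 8.0), (2.0, 8.0), (4.0, None), (1.0, 7.0)]

def preprocess(rows):
    # 1. Cleaning: drop exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(rows))
    # 2. Missing data: fill each gap with the mean of the observed values in that column.
    columns = [list(col) for col in zip(*unique)]
    for col in columns:
        observed = [v for v in col if v is not None]
        fill = mean(observed)
        col[:] = [fill if v is None else v for v in col]
    # 3. Normalizing: min-max scale every column to the range [0, 1].
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # avoid division by zero for constant columns
        scaled.append([(v - lo) / span for v in col])
    return [tuple(row) for row in zip(*scaled)]

clean = preprocess(raw)
print(clean)
```

In practice the same steps are usually done with Pandas and Scikit-learn; the plain-Python version is used here only to make each step visible.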
Step 3: Choosing the right model
Once the data is prepared, the next step is to choose a
machine learning model. There are many types of
models to choose from, including linear regression,
decision trees, and neural networks. The choice of
model depends on the nature of your data and the
problem you're trying to solve.
Step 4: Training the model
After choosing a model, the next step is to train it using
the prepared data. Training involves feeding the data
into the model and allowing it to adjust its internal
parameters to better predict the output.
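To make "adjusting internal parameters" concrete, here is a minimal sketch that trains a one-feature linear model with gradient descent. The data, the learning rate `lr`, and the number of `epochs` are assumptions chosen for illustration; gradient descent is one common training procedure, not the only one.

```python
# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def train(xs, ys, lr=0.05, epochs=2000):
    w, b = 0.0, 0.0  # internal parameters, starting from an arbitrary guess
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step in the direction that lowers the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train(xs, ys)
print(f"learned w={w:.3f}, b={b:.3f}")
```

Each pass nudges `w` and `b` so that the predictions move closer to the known outputs, which is the adjustment described in this step.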
Step 5: Evaluating the model
Once a model is trained, evaluating its performance on
unseen data is essential before deployment. With
MLOps, monitoring doesn’t stop at this initial stage; it
involves ongoing evaluation to detect model drift (when
a model’s performance declines due to changes in data
patterns) and maintaining model quality over time.
Continuous monitoring and retraining workflows help
organizations ensure their models remain effective and
reliable in production environments.
Step 6: Hyperparameter tuning and optimization
Hyperparameters are settings chosen before training, such
as a learning rate or the depth of a tree, rather than
values learned from the data. Beyond tuning for accuracy, hyperparameter
optimization within an MLOps pipeline includes tools for
automated hyperparameter searches, ensuring
efficiency and reproducibility. Many teams employ
MLOps platforms that support hyperparameter tuning,
so experiments are repeatable and well-documented,
allowing for consistent optimization over time.
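An automated hyperparameter search can be as simple as a loop over candidate values scored on held-out data. The sketch below is an assumption-laden illustration: the tiny one-dimensional dataset, the nearest-neighbour model, and the grid of `k` values are all hypothetical, and no MLOps platform is involved.

```python
from collections import Counter

# Hypothetical 1-D dataset: (feature, label) pairs, split into training and validation parts.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.2, 1), (6.0, 1), (6.5, 1), (7.0, 1), (3.0, 0)]
valid = [(1.2, 0), (2.5, 0), (6.2, 1), (5.5, 1)]

def knn_predict(x, k):
    # Majority label among the k training points closest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(k):
    # Share of validation points that are classified correctly for this k.
    return sum(knn_predict(x, k) == y for x, y in valid) / len(valid)

grid = [1, 3, 5]                           # candidate hyperparameter values
results = {k: accuracy(k) for k in grid}   # one recorded score per candidate
best_k = max(grid, key=lambda k: results[k])
print(results, "best k:", best_k)
```

Keeping the full `results` dictionary, rather than only the winner, is what makes the experiment repeatable and easy to document.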
Types of Machine Learning:-
Machine learning can be broadly classified into three
types based on the nature of the learning system and
the data available: supervised learning, unsupervised
learning, and reinforcement learning. Let's delve into
each of these:
Supervised learning:-
Supervised learning is the most common type of
machine learning. In this approach, the model is trained
on a labeled dataset. In other words, the data is
accompanied by a label that the model is trying to
predict. This could be anything from a category label to
a real-valued number.
The model learns a mapping between the input
(features) and the output (label) during the training
process. Once trained, the model can predict the output
for new, unseen data.
Common examples of supervised learning algorithms
include linear regression for regression problems and
logistic regression, decision trees, and support vector
machines for classification problems. In practical terms,
this could look like image recognition: given a dataset
of images where each picture is labeled as "cat," "dog,"
etc., a supervised model can learn to recognize and
categorize new images accurately.
Unsupervised learning:-
Unsupervised learning, on the other hand, involves training the model on an
unlabeled dataset. The model is left to find patterns and
relationships in the data on its own.
This type of learning is often used for clustering and
dimensionality reduction. Clustering involves grouping
similar data points together, while dimensionality
reduction involves reducing the number of random
variables under consideration by obtaining a set of
principal variables.
Common examples of unsupervised learning algorithms
include k-means for clustering problems and
Principal Component Analysis (PCA) for
dimensionality reduction problems. Again, in practical
terms, in the field of marketing, unsupervised learning is
often used to segment a company's customer base. By
examining purchasing patterns, demographic data, and
other information, the algorithm can group customers
into segments that exhibit similar behaviors without any
pre-existing labels.
Reinforcement learning:-
Reinforcement learning is a type of machine learning where an agent learns to
make decisions by interacting with its environment. The
agent is rewarded or penalized (with points) for the
actions it takes, and its goal is to maximize the total
reward.
Unlike supervised and unsupervised learning,
reinforcement learning is particularly suited to problems
where the data is sequential, and the decision made at
each step can affect future outcomes.
Common examples of reinforcement learning include
game playing, robotics, resource management, and
many more.
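The reward-driven loop described above can be sketched with tabular Q-learning, one well-known reinforcement learning algorithm (the text does not name a specific one, so this choice is an assumption). The corridor environment, the reward of 1 at the goal, and the constants `ALPHA`, `GAMMA`, and `EPSILON` are all hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

N_STATES = 5          # corridor cells 0..4; cell 4 is the goal
ACTIONS = (0, 1)      # 0 = step left, 1 = step right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3

def step(state, action):
    """Environment: move one cell; the reward is 1 only on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

q = [[0.0, 0.0] for _ in range(N_STATES)]  # estimated value of each action in each cell

for _ in range(500):                       # episodes
    state = 0
    for _ in range(100):                   # step cap per episode
        # Explore at random sometimes (and whenever both actions look equal).
        explore = random.random() < EPSILON or q[state][0] == q[state][1]
        action = random.choice(ACTIONS) if explore else q[state].index(max(q[state]))
        nxt, reward = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted best future value.
        q[state][action] += ALPHA * (reward + GAMMA * max(q[nxt]) - q[state][action])
        state = nxt
        if state == N_STATES - 1:
            break

policy = [row.index(max(row)) for row in q[:-1]]  # greedy action in each non-goal cell
print(policy)
```

After training, the greedy policy steps right in every cell, because earlier decisions are credited with the discounted reward they eventually lead to.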
Machine Learning Tools:-
In the world of machine learning, having the right tools
is just as important as understanding the concepts.
These tools, which include programming languages and
libraries, provide the building blocks to implement and
deploy machine learning algorithms.
Let's explore some of the most popular tools in
machine learning:
Python for machine learning
Python is a popular language for machine learning due
to its simplicity and readability, making it a great choice
for beginners. It also has a strong ecosystem of
libraries that are tailored for machine learning.
Libraries such as NumPy and Pandas are used for data
manipulation and analysis, while Matplotlib is used for
data visualization. Scikit-learn provides a wide range of
machine learning algorithms, and TensorFlow and
PyTorch are used for building and training neural
networks. PyTorch is particularly popular among
researchers, and PyTorch 2.0 added
features for increased speed and ease of use.
Python remains the dominant language in machine
learning, but it’s worth emphasizing its versatility across
fields with libraries like:
● Hugging Face Transformers for natural language
processing (NLP) and generative AI.
● LangChain for building language model-based
applications.
TensorFlow:-
TensorFlow is a powerful open-source library for
numerical computation, particularly well-suited for
large-scale machine learning. It was developed by the
Google Brain team and supports both CPUs and GPUs.
TensorFlow allows you to build and train complex
neural networks, making it a popular choice for deep
learning applications.
Scikit-learn:-
Scikit-learn is a Python library that provides a wide
range of machine learning algorithms for both
supervised and unsupervised learning. It's known for its
clear API and detailed documentation.
Scikit-learn is often used for data mining and data
analysis, and it integrates well with other Python
libraries like NumPy and Pandas.
Keras:-
Keras is a high-level neural networks API, written in
Python. Earlier versions ran on top of TensorFlow,
CNTK, or Theano, while Keras 3 supports TensorFlow, JAX,
and PyTorch as backends. It was developed with a focus on
enabling fast experimentation.
Keras provides a user-friendly interface for building and
training neural networks, making it a great choice for
beginners in deep learning.
PyTorch:-
PyTorch is an open-source machine learning library
based on the Torch library. It's known for its flexibility
and efficiency, making it popular among researchers.
PyTorch supports a wide range of applications, from
computer vision to natural language processing. One of
its key features is the dynamic computational graph,
which allows for flexible and optimized computation.
The Top Machine Learning Careers:-
Machine learning has opened up a wide range of career
opportunities. From data science to AI engineering,
professionals with machine learning skills are in high
demand. Let's explore some of these career paths:
Data scientist:-
A data scientist uses scientific methods, processes,
algorithms, and systems to extract knowledge and
insights from structured and unstructured data. Machine
learning is a key tool in a data scientist's arsenal,
allowing them to make predictions and uncover patterns
in data.
Key skills:
● Statistical analysis
● Programming (Python)
● Machine learning
● Data visualization
● Problem-solving
Essential tools:
● Python
● SQL
● PowerBI
Machine learning engineer:-
A machine learning engineer designs and
implements machine learning systems. They run
machine learning experiments using programming
languages like Python and R, work with datasets,
and apply machine learning algorithms and
libraries.
Key skills:
● Programming (Python, Java, R)
● Machine learning algorithms
● Statistics
● System design
Research scientist:-
A research scientist in machine learning conducts
research to advance the field of machine learning. They
work in both academic and industry settings,
developing new algorithms and techniques.
Key skills:
● Deep understanding of machine learning
algorithms
● Programming (Python, R)
● Research methodology
● Strong mathematical skills
How to Get Started in Machine Learning:-
Starting a journey in machine learning can seem
daunting, but with the right approach and resources,
anyone can learn this exciting field. Here are some
steps to get you started:
Understand the basics:-
Before diving into machine learning, it's important to
have a strong foundation in mathematics (especially
statistics and linear algebra) and programming (Python
is a popular choice due to its simplicity and the
availability of machine learning libraries).
There are many resources available to learn these
basics. Online platforms like Khan Academy and
Coursera offer courses in mathematics and
programming. Books like "Think Stats" and "Python
Crash Course" are also good starting points.
Choose the right tools:-
Choosing the right tools is crucial in machine learning.
Python, along with libraries like NumPy, Pandas, and
Scikit-learn, is a popular choice due to its simplicity and
versatility.
To get started with these tools, you can follow online
tutorials or take courses on platforms like DataCamp.
Once you're comfortable with the basics, you can start
learning about machine learning algorithms. Start with
simple algorithms like linear regression and decision
trees before moving on to more complex ones like
neural networks.
Work on projects:-
Working on projects is a great way to gain practical
experience and reinforce what you've learned. Start
with simple projects like predicting house prices or
classifying iris species, and gradually take on more
complex projects.
Stay up-to-date:-
Machine learning is a rapidly evolving field, so it's
important to stay up-to-date with the latest
developments. Following relevant blogs, attending
conferences, and participating in online communities
can help you stay informed.
Final Thoughts:-
From healthcare and finance to transportation and
entertainment, machine learning algorithms are driving
innovation and efficiency across various sectors. As
we've seen, getting started in machine learning requires
a strong foundation in mathematics and programming,
a good understanding of machine learning algorithms,
and practical experience working on projects.
The most common and popular machine learning
algorithms:-
Naïve Bayes Classifier Algorithm (Supervised
Learning - Classification)
The Naïve Bayes classifier is based on Bayes’ theorem
and treats every feature as independent of every other
feature, given the class. It allows us to predict a class/category, based on
a given set of features, using probability.
Despite its simplicity, the classifier does surprisingly
well and in some cases outperforms
more sophisticated classification methods.
K Means Clustering Algorithm (Unsupervised
Learning - Clustering):-
The K Means Clustering algorithm is a type of
unsupervised learning, which is used to categorise
unlabelled data, i.e. data without defined categories or
groups. The algorithm works by finding groups within
the data, with the number of groups represented by the
variable K. It then works iteratively to assign each data
point to one of K groups based on the features
provided.
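The iterative assign-and-update loop can be sketched on one-dimensional data. The points and the two starting centroids are hypothetical; real implementations such as Scikit-learn's choose starting centroids more carefully and stop when assignments no longer change.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data; K is the number of starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 8.0, 9.0, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

With K = 2 the points separate into a low group and a high group, and each centroid settles on the mean of its group.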
Support Vector Machine Algorithm (Supervised
Learning - Classification)
Support Vector Machine algorithms are supervised
learning models that analyze data used for
classification and regression analysis. They essentially
filter data into categories, which is achieved by
providing a set of training examples, each marked
as belonging to one or the other of the two categories.
The algorithm then works to build a model that assigns
new values to one category or the other.
Linear Regression (Supervised
Learning/Regression):-
Linear regression is the most basic type of regression.
Simple linear regression allows us to understand the
relationships between two continuous variables.
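For two continuous variables, the best-fitting line can be computed directly with the ordinary least squares formulas. The sketch below assumes made-up data (hours studied against exam score); the function name `fit_line` is illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 61.0, 63.0, 70.0]
slope, intercept = fit_line(hours, scores)
print(f"score = {slope} * hours + {intercept}")
```

The slope summarizes the relationship: in this made-up data, each extra hour of study is associated with 4.5 more points.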
Logistic Regression (Supervised learning –
Classification):-
Logistic regression focuses on estimating the
probability of an event occurring based on the previous
data provided. It is used to cover a binary dependent
variable, that is where only two values, 0 and 1,
represent outcomes.
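The probability estimate comes from passing a weighted sum of the features through the logistic (sigmoid) function. In this sketch the `weights` and `bias` are hand-picked assumptions rather than fitted values, and 0.5 is used as an assumed decision threshold.

```python
import math

def predict_proba(features, weights, bias):
    """Probability of the positive class (1) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into the range (0, 1)

# Hypothetical, hand-picked parameters (not fitted to any data).
weights, bias = [1.2, -0.4], -2.0

p = predict_proba([3.0, 2.0], weights, bias)  # z = -2.0 + 3.6 - 0.8 = 0.8
label = 1 if p >= 0.5 else 0                  # turn the probability into a 0/1 outcome
print(round(p, 4), label)
```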
Artificial Neural Networks (Supervised, Unsupervised and
Reinforcement Learning):-
An artificial neural network (ANN) comprises ‘units’
arranged in a series of layers, each of which connects
to layers on either side. ANNs are inspired by biological
systems, such as the brain, and how they process
information. ANNs are essentially a large number of
interconnected processing elements, working in unison
to solve specific problems.
ANNs also learn by example and through experience,
and they are extremely useful for modeling non-linear
relationships in high-dimensional data or where the
relationship amongst the input variables is difficult to
understand.
Decision Trees (Supervised Learning –
Classification/Regression):-
A decision tree is a flow-chart-like tree structure that
uses a branching method to illustrate every possible
outcome of a decision. Each node within the tree
represents a test on a specific variable – and each
branch is the outcome of that test.
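A decision tree can be represented as nested nodes, where each internal node holds one test and each leaf holds an outcome. The tree below is built by hand purely for illustration; the feature names and thresholds are hypothetical, and a real tree would be learned from data.

```python
# A hypothetical hand-built tree: internal nodes test one variable against a
# threshold, leaves hold the predicted outcome.
tree = {
    "feature": "hours_studied", "threshold": 3.0,
    "left": {"leaf": "fail"},                      # hours_studied <= 3.0
    "right": {                                     # hours_studied > 3.0
        "feature": "hours_slept", "threshold": 4.0,
        "left": {"leaf": "fail"},                  # hours_slept <= 4.0
        "right": {"leaf": "pass"},                 # hours_slept > 4.0
    },
}

def predict(node, sample):
    """Follow the branch chosen by each test until a leaf is reached."""
    while "leaf" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

result = predict(tree, {"hours_studied": 5.0, "hours_slept": 6.0})
print(result)
```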
Random Forests (Supervised Learning –
Classification/Regression):-
Random forests or ‘random decision forests’ is an
ensemble learning method, combining multiple
decision trees to generate better results for classification,
regression and other tasks. Each individual tree may be
a weak predictor, but when combined with others, can produce
excellent results. Each tree in the forest is a ‘decision
tree’ (a tree-like graph or model of decisions): an
input is entered at the top and travels down the tree,
with data being segmented into smaller and smaller
sets, based on specific variables. The forest then combines
the outputs of its trees, for example by majority vote for
classification or by averaging for regression.
Nearest Neighbours (Supervised Learning):-
The K-Nearest-Neighbour algorithm estimates how
likely a data point is to be a member of one group or
another. It essentially looks at the data points around a
single data point to determine what group it is actually
in. For example, if one point is on a grid and the
algorithm is trying to determine what group that data
point is in (Group A or Group B, for example) it would
look at the data points near it to see what group the
majority of the points are in.
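The Group A or Group B example can be sketched as a majority vote among the closest labelled points. The grid points and the choice of `k=3` are assumptions made for illustration, and Euclidean distance is assumed as the measure of closeness.

```python
import math
from collections import Counter

def knn_classify(query, labelled_points, k=3):
    """Label `query` by majority vote among its k nearest labelled points."""
    by_distance = sorted(labelled_points, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical grid points with known groups.
points = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
          ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]

group = knn_classify((2, 2), points, k=3)
print(group)
```

An odd `k` is a common choice for two groups because it avoids tied votes.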
Practicals:-
NOTE:- The practicals use Google Colab or Jupyter
Notebook for implementation.
Supervised Learning - Classification:-
*******Naïve Bayes Classifier Algorithm*******
(Supervised Learning - Classification):- The
Naïve Bayes Classifier is a probabilistic machine
learning model used for classification tasks. It is based
on Bayes' Theorem, which provides a way to calculate
the probability of a class (or label) given the input data
(features). The "naïve" part comes from the assumption
that the features are independent, which simplifies the
computation, even if this assumption is not strictly true.
Key Concepts of Naïve Bayes:-
Bayes' Theorem: Bayes' Theorem is the foundation of
the Naïve Bayes classifier. It describes the probability of
a class label $C$ given a set of features
$X = (x_1, x_2, \dots, x_n)$:
$$P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)}$$
$P(C|X)$: Posterior probability of class $C$ given the features $X$.
$P(X|C)$: Likelihood of features $X$ given class $C$.
$P(C)$: Prior probability of class $C$.
$P(X)$: Evidence or marginal likelihood of the features.
Naïve Assumption: Naïve Bayes assumes that the
features are conditionally independent given the class
label. This simplifies the computation of $P(X|C)$ as:
$$P(X|C) = P(x_1|C) \cdot P(x_2|C) \cdot \dots \cdot P(x_n|C)$$
This assumption greatly simplifies the calculation, even
though in real-world data, the features might not be fully
independent.
Steps to Implement Naïve Bayes Classifier:
Prepare the Data:
Collect labeled data (features and corresponding class
labels).
Split the data into training and test sets.
Calculate the Prior Probability $P(C)$:
The prior probability for each class is simply the
proportion of each class in the dataset:
$$P(C) = \frac{\text{Number of instances of class } C}{\text{Total number of instances}}$$
Calculate the Likelihood $P(x_i|C)$:
For each feature $x_i$, calculate the conditional
probability of observing the feature value given the
class label $C$.
This step depends on the type of features
(categorical, continuous).
For categorical features: Calculate the probability of
each category for a given class:
$$P(x_i = \text{value}|C) = \frac{\text{count of value for class } C}{\text{total count for class } C}$$
For continuous features: Assume a Gaussian (normal)
distribution and compute the mean and variance for
each feature within each class. Use the probability
density function (PDF) of the Gaussian distribution:
$$P(x_i|C) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)$$
where $\mu$ is the mean and $\sigma$ is the standard
deviation of the feature for class $C$.
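For the categorical case, both the prior and the likelihood reduce to counting. The sketch below assumes a made-up dataset with a single categorical feature (weather) and two class labels; the names `prior` and `likelihood` are illustrative only.

```python
from collections import Counter

# Hypothetical labelled samples: (weather, class label).
samples = [("sunny", "play"), ("sunny", "play"), ("rainy", "play"),
           ("rainy", "stay"), ("rainy", "stay")]

class_counts = Counter(label for _, label in samples)  # instances per class
pair_counts = Counter(samples)                         # instances per (value, class) pair

def prior(label):
    """P(C): share of all samples that belong to class `label`."""
    return class_counts[label] / len(samples)

def likelihood(value, label):
    """P(x_i = value | C): share of class `label` samples showing `value`."""
    return pair_counts[(value, label)] / class_counts[label]

p_play = prior("play")                            # 3 of 5 samples
p_sunny_given_play = likelihood("sunny", "play")  # 2 of 3 "play" samples
print(p_play, p_sunny_given_play)
```

Note that a value never seen with a class gets a likelihood of exactly 0, which would zero out the whole product; practical implementations usually add a small smoothing count to avoid this.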
Apply Bayes' Theorem: Use Bayes' Theorem to
compute the posterior probability for each class
$C$ given the input features $X$, up to the constant factor $1/P(X)$:
$$P(C|X) \propto P(C) \cdot P(x_1|C) \cdot P(x_2|C) \cdot \dots \cdot P(x_n|C)$$
Since $P(X)$ is constant for all classes, you can
ignore it during classification and just focus on
maximizing the right-hand side, which also maximizes $P(C|X)$.
Class Prediction:
Calculate the posterior probability for each class.
Predict the class with the highest posterior probability.
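The steps above can be put together in a short Gaussian Naïve Bayes sketch written from scratch. It is an illustration under assumptions, not a replacement for a library implementation: the function names are hypothetical, the scores are computed as sums of logarithms (a common trick for numerical stability that the steps above do not mention), and a tiny constant is added to each variance to avoid division by zero.

```python
import math
from collections import defaultdict

def log_gaussian(x, mu, var):
    """Logarithm of the Gaussian probability density."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def fit(X, y):
    """Estimate P(C) and the per-feature mean and variance for every class."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)
    model = {}
    for label, rows in grouped.items():
        stats = []
        for col in zip(*rows):
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + 1e-9  # small floor avoids var = 0
            stats.append((mu, var))
        model[label] = (len(rows) / len(X), stats)  # (prior, feature statistics)
    return model

def predict(model, row):
    """Return the class with the largest log P(C) + sum of log P(x_i | C)."""
    def log_score(label):
        prior, stats = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, mu, var) for x, (mu, var) in zip(row, stats))
    return max(model, key=log_score)

# Small illustrative data: [hours studied, hours slept] -> 1 = pass, 0 = fail.
X = [[5, 6], [2, 8], [4, 5], [1, 7], [6, 4]]
y = [1, 0, 1, 0, 1]
model = fit(X, y)
prediction = predict(model, [5, 5])
print(prediction)
```

Because the evidence term is the same for every class, it never appears in `log_score`; only the prior and the per-feature likelihoods are compared.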
Types of Naive Bayes Classifiers:
Gaussian Naive Bayes:
Used when the features are continuous and assumed
to follow a normal (Gaussian) distribution.
Multinomial Naive Bayes:
Suitable for discrete count data (e.g., word counts in
text classification). Assumes the features follow a
multinomial distribution.
Bernoulli Naive Bayes:
Used for binary/boolean features (e.g., presence or
absence of a word in text classification).
Advantages of Naive Bayes Classifier:
>>>Simple and fast: It is computationally efficient,
requiring less training time and memory.
>>>Works well with small datasets: Even with small
datasets, Naïve Bayes can give good performance,
especially when the independence assumption holds.
>>>Works well for text classification: Often used in
spam filtering and sentiment analysis due to its
effectiveness with high-dimensional sparse data (e.g.,
word frequencies).
Disadvantages of Naive Bayes Classifier:
>>>Naïve independence assumption: The
independence assumption rarely holds in real-world
data, which can reduce accuracy.
>>>Sensitive to irrelevant features: The model may
perform poorly if irrelevant features are present in the
data.
>>>Requires feature engineering: In some cases,
Naïve Bayes may require significant pre-processing of
the data, such as discretizing continuous variables.
Example:----->1
Predicting Pass/Fail based on Hours
Studied and Hours Slept.
To predict whether a student passes or fails based on
the number of hours studied and hours slept, you can
use machine learning models. Here's a general
approach to building this model:
1. Data Collection:
You need data that includes the following features:
● Hours Studied: Number of hours the student
studied.
● Hours Slept: Number of hours the student slept.
● Pass/Fail: The target variable, which indicates
whether the student passed (1) or failed (0).
Example data:
Hours Studied   Hours Slept   Pass/Fail
5               6             1
2               8             0
4               5             1
1               7             0
6               4             1
2. Preprocessing:
● Handling Missing Data: If there are any missing
values in the dataset, you might need to fill them
with the mean or median or remove the rows
with missing data.
● Normalization/Standardization: It's usually good
practice to scale the data to a similar range,
especially when dealing with features like hours
studied and sleep.
3. Model Selection:
You can use various machine learning algorithms to
predict the pass/fail status. Here are a few options:
● Logistic Regression: A binary classifier that
works well when the classes can be separated by a
roughly linear boundary in the features.
● Decision Tree: Can capture non-linear
relationships and is easy to interpret.
● Random Forest: An ensemble of decision trees,
which can improve accuracy by reducing
overfitting.
● Support Vector Machine (SVM): Another
powerful classifier that works well in
high-dimensional spaces.
● K-Nearest Neighbors (KNN): A simple algorithm
that makes predictions based on the closest data
points.
4. Model Training:
● Split your data into training and testing sets (e.g.,
80% for training and 20% for testing).
● Train the selected model on the training set.
5. Model Evaluation:
After training the model, evaluate its performance on
the testing set using metrics like:
● Accuracy: The percentage of correct predictions.
● Precision, Recall, F1-Score: Especially if the
data is imbalanced (e.g., more students passing
than failing).
● Confusion Matrix: To visualize how many
pass/fail predictions were correct and how many
were incorrect.
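The metrics listed above all derive from four counts in the confusion matrix. The sketch below computes them by hand for binary labels; the true labels and predictions passed to `evaluate` are made up for illustration.

```python
def evaluate(y_true, y_pred):
    """Confusion-matrix counts and derived metrics for binary labels (1 = pass)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # passes predicted as pass
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # fails predicted as fail
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # fails predicted as pass
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # passes predicted as fail
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: actual 0/1, columns: predicted 0/1
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision, "recall": recall, "f1": f1,
    }

# Hypothetical true labels and predictions.
metrics = evaluate([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(metrics)
```

The matrix layout used here (actual classes in rows, predicted classes in columns) matches the convention of Scikit-learn's `confusion_matrix`.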
6. Prediction:
Once the model is trained and evaluated, you can use it
to predict pass/fail based on new inputs (hours studied
and slept).
Example Code (Using Python and
Scikit-learn):
Here’s a simple example using logistic regression
in Python.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Example data
data = {
    'Hours Studied': [5, 2, 4, 1, 6],
    'Hours Slept': [6, 8, 5, 7, 4],
    'Pass/Fail': [1, 0, 1, 0, 1]
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Features and Target
X = df[['Hours Studied', 'Hours Slept']]
y = df['Pass/Fail']
# Train-Test Split (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Feature Scaling (fit on the training set only, then reuse on the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize and train Logistic Regression model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
# Predictions
y_pred = model.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
# labels=[0, 1] keeps the matrix 2x2 even though the test set has a single row
conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
OUTPUT:-
Example:----->2
Building a Simple Neural Network with
Bias:-
We use TensorFlow and Keras to demonstrate a simple
neural network with a bias term.
Here's a code implementation where we create a neural
network.
Implementing a Neural Network with Bias Term:-
Now let's write the code for a simple neural network
with a bias term:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
# Generate a synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2,
                           weights=[0.95, 0.05], flip_y=0, random_state=42)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
# Before applying SMOTE, check the class distribution
print("Class distribution before SMOTE:", pd.Series(y_train).value_counts())
# Apply SMOTE to balance the dataset
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
# Check the class distribution after SMOTE
print("Class distribution after SMOTE:", pd.Series(y_res).value_counts())
# Train a Logistic Regression model
model = LogisticRegression(random_state=42)
model.fit(X_res, y_res)  # Fit on resampled data
# Predict on the test set
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
OUTPUT:-
Explanation:
1. Dense layer: In the Keras model, each Dense
layer by default includes a bias term. The first
Dense layer has 4 neurons and is provided with
2 input features (XOR inputs). The
activation='relu' function is applied to this layer.
2. Bias term: The bias term is automatically added
inside the Dense layer. You don't have to
explicitly mention it when defining the layers; it's
included as part of the neural network layer
architecture. It helps to adjust the output of the
neuron before applying the activation function.
3. Training and Evaluation: The model is compiled
with the adam optimizer and binary
cross-entropy loss function since this is a binary
classification task (XOR). After training, the
model's loss and accuracy are evaluated.
4. Predictions: Finally, the model makes predictions
based on the trained data.
Explanation of Bias in Neural Networks:-
● The bias in the Dense layer is included as part of
the layer's weights, even though you don't
explicitly define it in the layer configuration.
● The bias term is applied to the weighted sum of
the inputs before the activation function is
applied.
● In this simple neural network, the bias term helps
the network learn more accurate representations
of the input data.
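The order of operations described above (weighted sum, then bias, then activation) can be sketched in plain NumPy. This is a minimal illustration, not the Keras implementation: the function names and the hand-picked weights and bias values are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(0, z) element-wise."""
    return np.maximum(0.0, z)

def dense_forward(x, weights, bias):
    """Forward pass of one dense layer: relu(x @ W + b)."""
    z = x @ weights + bias   # the bias is added to the weighted sum first
    return relu(z)           # the activation is applied afterwards

# The four XOR input pairs (2 features each).
xor_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hypothetical parameters for a layer with 2 inputs and 4 neurons.
weights = np.array([[1.0, -1.0, 1.0, 0.5],
                    [1.0, 1.0, -1.0, 0.5]])
bias = np.array([-1.0, 0.0, 0.0, 0.25])

with_bias = dense_forward(xor_inputs, weights, bias)
without_bias = dense_forward(xor_inputs, weights, np.zeros(4))

print("Output with bias:\n", with_bias)
print("Output without bias:\n", without_bias)
```

Without a bias, the all-zero input always produces an all-zero output, whatever the weights are. The bias lets each neuron shift its output independently of the inputs.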
If you want to manually access the bias term:
You can access the learned weights (including the bias)
after training:
# Access the coefficients and intercept directly
weights = model.coef_
bias = model.intercept_
print("Weights:", weights)
print("Bias:", bias)
OUTPUT:-
Conclusion:--
This example demonstrates how a bias term is included
in a neural network, particularly in Keras, and how to
implement it in Python code. You don't need to
manually implement the bias term; it is integrated into
the Dense layer of neural networks. However, you can
access and modify it if needed.
***************Linear Regression*************
(for Regression Problems):-
Goal: Predict a continuous target variable from input
features.
Steps:-
1. Model Assumption: Assumes a linear
relationship between input variables and the
target variable.
2. Objective: Minimize the difference between
the predicted values and actual values (usually
using Mean Squared Error).
3. Learning Process:
● Find the coefficients (weights) that
minimize the error.
● Use gradient descent or a closed-form
solution (Normal Equation) to find the
optimal coefficients.
4. Prediction: Once the model is trained, use the
coefficients to predict the target variable on new
data.
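The two learning options named in step 3 can be compared on a tiny dataset. The sketch below is illustrative only: the data is generated from y = 1 + 2x with no noise, and the learning rate and iteration count are assumptions that happen to converge for this data.

```python
import numpy as np

# Tiny illustrative dataset generated from y = 1 + 2x (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# Closed-form solution (Normal Equation): solve (X^T X) theta = X^T y.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the Mean Squared Error.
theta_gd = np.zeros(2)
learning_rate = 0.05
for _ in range(5000):
    residuals = X @ theta_gd - y
    gradient = (2.0 / len(y)) * (X.T @ residuals)
    theta_gd -= learning_rate * gradient

print("Normal equation:", theta_normal)
print("Gradient descent:", theta_gd)
```

Both approaches recover an intercept close to 1 and a slope close to 2. The closed-form route is convenient for small problems, while gradient descent scales better when the number of features is large.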
Linear Regression is a statistical model used to
understand the relationship between one dependent
variable (target) and one or more independent variables
(predictors). The goal is to fit a linear equation to the
observed data, which helps predict the target based on
new input values.
Linear Regression could in principle be applied to
predict whether a student passes or fails based on
hours studied and hours slept. However, since that
target (pass/fail) is categorical, a Logistic Regression
model is more appropriate. For a continuous outcome
(e.g., a score or grade), Linear Regression works
well.
Let's use Linear Regression to predict a continuous
score based on hours studied and hours slept.
Example:---->1
Predicting Student Scores Based on Hours
Studied and Slept:-
Step-by-Step Python Example using scikit-learn
We’ll use a sample dataset where:
● Hours Studied and Hours Slept are the
features.
● Score is the target variable (continuous).
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# Sample data: Hours Studied, Hours Slept, and Score
data = {
    'Hours Studied': [5, 2, 4, 1, 6, 3, 8, 7, 9, 3],
    'Hours Slept': [6, 8, 5, 7, 4, 6, 5, 7, 8, 6],
    'Score': [75, 55, 65, 50, 80, 60, 85, 70, 90, 62]
}
# Convert to a DataFrame
df = pd.DataFrame(data)
# Features (X) and Target (y)
X = df[['Hours Studied', 'Hours Slept']] # Independent variables
y = df['Score'] # Dependent variable
# Split the data into training and testing sets (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Feature Scaling (optional, but can improve performance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize the Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train_scaled, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test_scaled)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Output results
print(f'Mean Squared Error (MSE): {mse}')
print(f'R-squared (R²): {r2}')
# Example: Predicting a student's score given 6 hours of study and 7 hours of sleep
new_data = np.array([[6, 7]]) # New data point: 6 hours studied, 7 hours slept
new_data_scaled = scaler.transform(new_data) # Scale the data
predicted_score = model.predict(new_data_scaled)
print(f'Predicted Score for 6 hours of study and 7 hours of sleep: {predicted_score[0]}')
OUTPUT:-
Breakdown of Code:-
1. Data Preparation: We create a DataFrame with
Hours Studied, Hours Slept, and Score.
2. Train-Test Split: We split the dataset into a
training set (80%) and a testing set (20%) using
train_test_split.
3. Feature Scaling: We scale the features using
StandardScaler. For ordinary least squares this
step is optional, since scaling does not change
the predictions, but it puts the coefficients on a
comparable scale.
4. Linear Regression Model: We initialize the Linear
Regression model and train it using the scaled
training data.
5. Prediction: The model makes predictions based
on the test set, and we calculate the Mean
Squared Error (MSE) and R-squared (R²) to
evaluate performance.
6. Making Predictions: We predict the score for a
new student who studied for 6 hours and slept
for 7 hours.
Mean Squared Error (MSE): Measures how well the
model's predictions match the actual values. A lower
value is better.
R-squared (R²): The proportion of variance in the
dependent variable that is predictable from the
independent variables. R² is at most 1 and can fall
below 0 on test data when the model predicts worse
than the mean of the target;
values closer to 1 indicate a better fit.
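Both metrics are short formulas, and computing them by hand makes their behaviour easier to see. The sketch below uses made-up scores and predictions (they are not output from the model above), and the helper names mse and r2 are illustrative.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical actual scores and two sets of predictions.
y_true = np.array([50.0, 60.0, 70.0, 80.0])
good_pred = np.array([52.0, 58.0, 71.0, 79.0])
bad_pred = np.array([80.0, 50.0, 90.0, 40.0])

print("Good model: MSE =", mse(y_true, good_pred), "R2 =", r2(y_true, good_pred))
print("Bad model:  MSE =", mse(y_true, bad_pred), "R2 =", r2(y_true, bad_pred))
```

The second set of predictions is worse than always guessing the mean score, so its R² is negative.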
Example:---->2
Predicting House Prices Based on Square
Footage and Number of Bedrooms:-
In this case, we'll assume we have a dataset where:
● Square Footage is the size of the house in
square feet.
● Number of Bedrooms is how many bedrooms the
house has.
● Price is the target variable we want to predict
(the house price).
Steps:
1. Data Collection and Preprocessing.
2. Train-Test Split.
3. Training the Model.
4. Model Evaluation.
5. Making Predictions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
# Sample data: Square Footage, Number of Bedrooms, and House Price
data = {
    'Square Footage': [1500, 1800, 2400, 3000, 3500,
                       4000, 2200, 1800, 2500, 2200],
    'Number of Bedrooms': [3, 3, 4, 5, 4, 5, 3, 3, 4, 3],
    'Price': [400000, 450000, 500000, 600000, 650000,
              700000, 475000, 460000, 525000, 490000]
}
# Convert to a DataFrame
df = pd.DataFrame(data)
# Features (X) and Target (y)
X = df[['Square Footage', 'Number of Bedrooms']] # Independent variables
y = df['Price'] # Dependent variable (House Price)
# Split the data into training and testing sets (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Feature Scaling (optional, but can improve performance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize the Linear Regression model
model = LinearRegression()
# Train the model on the training data
model.fit(X_train_scaled, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test_scaled)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Output results
print(f'Mean Squared Error (MSE): {mse}')
print(f'R-squared (R²): {r2}')
# Example: Predicting the price of a house with 2800 sq ft and 4 bedrooms
new_data = np.array([[2800, 4]]) # New data point: 2800 sq ft, 4 bedrooms
new_data_scaled = scaler.transform(new_data) # Scale the data
predicted_price = model.predict(new_data_scaled)
print(f'Predicted Price for 2800 sq ft and 4 bedrooms: ${predicted_price[0]:,.2f}')
OUTPUT:-
Explanation of Code:-
Data Creation: We define a dictionary that contains:
Square Footage: The size of the house.
Number of Bedrooms: How many bedrooms the
house has.
Price: The price of the house.
The data is then converted into a Pandas
DataFrame.
Splitting Data: We split the data into training and
testing sets using train_test_split(). We use 80% of the
data for training and 20% for testing.
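Feature Scaling: We scale the features (Square
Footage and Number of Bedrooms) using
StandardScaler. This is optional for ordinary least
squares, but it puts features with very different ranges
on a comparable scale, which matters for models that
are sensitive to the scale of input data.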
Training the Model: We initialize the Linear
Regression model and train it using the scaled training
data.
Evaluating the Model: We evaluate the model's
performance using Mean Squared Error (MSE) and
R-squared (R²). These metrics help us understand how
well the model is performing:
● MSE tells us how much error there is
between the predicted and actual prices
(lower is better).
● R² indicates how well the model explains the
variance in the price.
Making Predictions: After training, we can predict
the price of a house with specific values for Square
Footage and Number of Bedrooms.
***************Logistic Regression**********
(for Classification Problems):-
Goal: Predict a binary outcome (0 or 1).
Steps:
1. Model Assumption: Assumes a linear
relationship between input features and the
log-odds of the target variable.
2. Objective: Maximize the likelihood that the
model predicts the correct class.
3. Learning Process:
● Apply the sigmoid function to the weighted
sum of input features.
● Optimize using gradient descent to find
the best weights.
4. Prediction: Apply the learned weights to new
data and classify based on the output probability.
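The learning process in the steps above can be sketched from scratch with NumPy. This is a simplified illustration, not how scikit-learn's LogisticRegression is implemented: the one-feature dataset, learning rate, and iteration count are assumptions, and no regularization is applied.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-feature dataset: the label is 1 for larger x.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Design matrix with a column of ones so weights[0] acts as the bias.
X = np.column_stack([np.ones_like(x), x])
weights = np.zeros(2)
learning_rate = 0.1

for _ in range(1000):
    probabilities = sigmoid(X @ weights)            # sigmoid of the weighted sum
    gradient = X.T @ (probabilities - y) / len(y)   # gradient of the average log loss
    weights -= learning_rate * gradient             # gradient descent step

# Classify: probability of at least 0.5 means class 1.
probabilities = sigmoid(X @ weights)
predictions = (probabilities >= 0.5).astype(int)

print("Learned weights (bias, slope):", weights)
print("Predicted classes:", predictions)
```

The 0.5 threshold is a common default. It can be moved when false positives and false negatives have different costs.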
Example:---->1
Predicting Whether a Customer Will Buy a
Product Based on Age and Income:-
Dataset:
● Age: Customer's age.
● Income: Customer's income in thousands of
dollars.
● Purchase: Whether the customer bought the
product (1 = Yes, 0 = No).
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
# Sample data: Age, Income, and Purchase (1 = Yes, 0 = No)
data = {
    'Age': [22, 45, 35, 50, 23, 60, 25, 30, 40, 55],
    'Income': [30, 60, 50, 80, 25, 90, 35, 40, 70, 85],
    'Purchase': [0, 1, 0, 1, 0, 1, 0, 0, 1, 1] # 1 = Bought, 0 = Not Bought
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Features (X) and Target (y)
X = df[['Age', 'Income']] # Independent variables
y = df['Purchase'] # Dependent variable (Purchase: Yes/No)
# Split the data into training and testing sets (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Feature Scaling (important for Logistic Regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize the Logistic Regression model
model = LogisticRegression()
# Train the model on the training data
model.fit(X_train_scaled, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Output results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
# Example: Predicting whether a customer will buy given an age of 32 and income of 55k
new_data = np.array([[32, 55]]) # New data point: 32 years old, income = 55k
new_data_scaled = scaler.transform(new_data) # Scale the data
predicted_class = model.predict(new_data_scaled)
print(f'Predicted class (Purchase: Yes/No) for age 32 and income 55k: {predicted_class[0]}')
OUTPUT:-
Explanation of the Code:
Data Creation: We define a dictionary with Age,
Income, and Purchase columns. The Purchase column
indicates whether the customer made a purchase (1 for
Yes, 0 for No).
Splitting the Data: We split the data into features (Age
and Income) and the target (Purchase). Then, we
further split the data into training and testing sets using
train_test_split().
Feature Scaling: StandardScaler is used to scale the
features. This matters for scikit-learn's
LogisticRegression because its default L2 regularization
and iterative solver are affected by the scale of input
features.
Training the Logistic Regression Model: We initialize
the Logistic Regression model and train it on the scaled
training data using fit().
Evaluating the Model:
We evaluate the model's accuracy using
accuracy_score. We also print out the confusion matrix
and classification report to get a more detailed
performance evaluation.
The confusion matrix shows the number of correct and
incorrect predictions.
The classification report gives additional metrics like
precision, recall, and F1-score.
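The metrics in the classification report all derive from the four counts in the confusion matrix. The sketch below computes them by hand in plain Python. The labels are made up for illustration and are not output from the model above, and the helper name confusion_counts is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tn, fp, fn, tp

# Hypothetical true labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Accuracy={accuracy} Precision={precision} Recall={recall} F1={f1}")
```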
Making Predictions: After training, we predict whether
a customer will buy the product based on their age and
income. We input a new data point: age = 32 and
income = 55k.
Conclusion:
This example demonstrates how Logistic Regression
can be used for binary classification tasks like
predicting whether a customer will make a purchase
based on input features such as age and income.
Example:---->2
Predicting Disease Presence Based on Age
and Blood Pressure:-
We have the following information:
● Age: The age of the patient.
● Blood Pressure: The patient's blood pressure
level (in mmHg).
● Disease: Whether the patient has the disease (1
= Yes, 0 = No).
Objective:
We will use Logistic Regression to predict if a patient
has the disease (1) or not (0) based on their age and
blood pressure.
Steps:
1. Data Collection and Preprocessing.
2. Train-Test Split.
3. Model Training using Logistic Regression.
4. Model Evaluation.
5. Making Predictions.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
# Sample data: Age, Blood Pressure, and Disease (1 = Yes, 0 = No)
data = {
    'Age': [45, 50, 36, 60, 55, 40, 47, 38, 49, 62],
    'Blood Pressure': [130, 150, 120, 160, 140, 135,
                       145, 128, 138, 155],
    'Disease': [1, 1, 0, 1, 1, 0, 1, 0, 0, 1] # 1 = Disease, 0 = No Disease
}
# Convert to DataFrame
df = pd.DataFrame(data)
# Features (X) and Target (y)
X = df[['Age', 'Blood Pressure']] # Independent variables
y = df['Disease'] # Dependent variable (Disease: Yes/No)
# Split the data into training and testing sets (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Feature Scaling (important for Logistic Regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize the Logistic Regression model
model = LogisticRegression()
# Train the model on the training data
model.fit(X_train_scaled, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Output results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
# Example: Predicting whether a patient with age 55 and blood pressure 145 has the disease
new_data = np.array([[55, 145]]) # New data point: age 55, blood pressure 145
new_data_scaled = scaler.transform(new_data) # Scale the data
predicted_class = model.predict(new_data_scaled)
print(f'Predicted class (Disease: Yes/No) for age 55 and blood pressure 145: {predicted_class[0]}')
OUTPUT:-
********K-Nearest Neighbor(KNN)**********
Goal: Classify data points based on their proximity to
other labeled points.
Steps:
1. Model Assumption: No explicit model is trained.
KNN is a non-parametric method.
2. Objective: Classify a point based on the majority
class of its K nearest neighbors.
3. Learning Process:
● Store the entire training dataset.
● For a new input, calculate the distance
(e.g., Euclidean) to each training point.
● Select the K nearest points and assign the
most frequent class as the prediction.
4. Prediction: Find the K nearest neighbors and
determine the class based on majority voting.
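The whole procedure fits in a few lines of plain Python. The sketch below is a minimal from-scratch version for illustration. The function name knn_predict and the six labelled points are assumptions, and a library implementation such as scikit-learn's uses faster neighbor search.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points forming two clusters.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (2.0, 2.0)))
print(knn_predict(train_points, train_labels, (6.0, 5.0)))
```

There is no training step beyond storing the data, so all the work happens at prediction time.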
Example:---->1
Iris Flower Classification:-
In this example, we will classify different species of Iris
flowers based on their features using the K-Nearest
Neighbors (KNN) algorithm. The dataset consists of
150 samples from 3 species of Iris flowers: Setosa,
Versicolor, and Virginica.
Features:
● Sepal Length (cm)
● Sepal Width (cm)
● Petal Length (cm)
● Petal Width (cm)
Objective:
We will use the K-Nearest Neighbors (KNN) algorithm
to classify the species of Iris flowers based on the
provided features.
Steps:
1. Data Loading and Preprocessing.
2. Train-Test Split.
3. Model Training using KNN.
4. Model Evaluation.
5. Making Predictions.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset from sklearn
iris = load_iris()
# Features (X) and Target (y)
X = iris.data # Features: Sepal Length, Sepal Width, Petal Length, Petal Width
y = iris.target # Target: Species (Setosa, Versicolor, Virginica)
# Split the data into training and testing sets (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Feature Scaling (important for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize the KNN model with k=5 (5 nearest neighbors)
knn = KNeighborsClassifier(n_neighbors=5)
# Train the KNN model on the training data
knn.fit(X_train_scaled, y_train)
# Make predictions on the test data
y_pred = knn.predict(X_test_scaled)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Output results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
# Example: Predicting the species of a new flower with given features
new_data = np.array([[5.5, 2.4, 3.8, 1.1]]) # New flower features (sepal and petal dimensions)
new_data_scaled = scaler.transform(new_data) # Scale the new data
predicted_species = knn.predict(new_data_scaled)
# Convert predicted species label to species name
species_names = iris.target_names
predicted_species_name = species_names[predicted_species[0]]
print(f'Predicted species: {predicted_species_name}')
OUTPUT:-
Explanation of the Code:
Data Loading: We load the Iris dataset from
sklearn.datasets, which already comes with features
and the target variable. The target variable represents
three species of Iris flowers: setosa, versicolor, and
virginica.
Data Splitting: We split the data into training and
testing sets using train_test_split(). 80% of the data is
used for training, and 20% is used for testing.
Feature Scaling: Since KNN is distance-based, we
scale the features using StandardScaler to ensure that
all features are on the same scale.
Training the KNN Model: We initialize the
KNeighborsClassifier with k=5, meaning that the model
will consider the 5 nearest neighbors when making
predictions.
The model is trained using fit() on the scaled training
data.
Making Predictions: After training, we use the model to
predict the species of Iris flowers in the test set using
predict().
We evaluate the model's performance using accuracy,
confusion matrix, and classification report.
Predicting for a New Data Point:
We input the features of a new Iris flower (sepal and
petal dimensions) and predict its species using the
trained KNN model.
Example:---->2
Predicting Book Sales Using KNN:-
We will assume that the book sales depend on
factors like:
● Advertising budget (in thousands of dollars)
● Season (e.g., whether it’s the holiday season or
not)
● Book genre (e.g., fiction, non-fiction)
● Social media buzz (e.g., number of mentions on
social media)
The goal is to predict the number of books sold
based on these factors.
Steps:
1. Data Generation (Synthetic Dataset).
2. Train-Test Split.
3. Model Training using KNN.
4. Model Evaluation.
5. Making Predictions.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
# Generate synthetic dataset for book sales prediction
np.random.seed(42)
# Number of samples
n_samples = 100
# Generate synthetic features
ad_budget = np.random.uniform(5, 100, n_samples) # Advertising budget (in thousands)
season = np.random.choice([0, 1], n_samples) # Season (0 for non-holiday, 1 for holiday)
genre = np.random.choice([0, 1], n_samples) # Genre (0 for fiction, 1 for non-fiction)
social_media_buzz = np.random.randint(50, 1000, n_samples) # Social media buzz (mentions)
# Generate synthetic target (book sales)
# A simple relationship for the target variable (book sales)
book_sales = ((ad_budget * 0.5) + (season * 200) +
              (genre * 100) + (social_media_buzz * 0.1) +
              np.random.normal(0, 30, n_samples))
# Create a DataFrame with features and target
df = pd.DataFrame({
    'ad_budget': ad_budget,
    'season': season,
    'genre': genre,
    'social_media_buzz': social_media_buzz,
    'book_sales': book_sales
})
# Preview the synthetic dataset
print(df.head())
# Features: Advertising budget, season, genre, and social media buzz
X = df.drop(columns=['book_sales'])
# Target: Number of book sales
y = df['book_sales']
# Split the data into training and testing sets (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Feature Scaling (important for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Initialize the KNN regressor model with k=5 (5 nearest neighbors)
knn = KNeighborsRegressor(n_neighbors=5)
# Train the KNN model on the training data
knn.fit(X_train_scaled, y_train)
# Make predictions on the test data
y_pred = knn.predict(X_test_scaled)
# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
# Output results
print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
# Example: Predicting book sales for new data
# Ad budget = 50, holiday season, fiction genre, 300 social media mentions
new_data = np.array([[50, 1, 0, 300]])
new_data_scaled = scaler.transform(new_data) # Scale the new data
predicted_sales = knn.predict(new_data_scaled)
# Output prediction result
print(f'Predicted book sales for the new data: {predicted_sales[0]:.2f} books')
OUTPUT:-
Explanation of the Code:
Data Generation (Synthetic Dataset):
Advertising budget: A random budget between 5,000
and 100,000 dollars.
Season: A binary value (0 or 1), indicating whether it's a
holiday season (1) or not (0).
Genre: A binary value (0 for fiction and 1 for
non-fiction).
Social media buzz: A random value between 50 and
1,000, representing the number of mentions on social
media.
Book sales: A synthetic target variable representing
book sales, with some noise added to make it more
realistic. It is a weighted sum of the features with some
random variation added using np.random.normal.
Feature and Target Selection:
Features (X): The advertising budget, season, genre,
and social media buzz.
Target (y): The number of book sales.
Data Splitting:
We split the data into training and testing sets, with
80% for training and 20% for testing, using the
train_test_split() function.
Feature Scaling:
We apply StandardScaler to scale the features. This is
necessary because KNN is distance-based, and the
scale of features can affect the model's performance.
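A small numeric example shows why scaling changes which neighbor counts as nearest. The five rows below are hypothetical values in the same ranges as the synthetic features (they are not taken from the generated dataset), and the standardization is done by hand with the same formula StandardScaler applies: subtract the column mean and divide by the column standard deviation.

```python
import numpy as np

# Hypothetical rows: [ad_budget (thousands of dollars), social_media_buzz (mentions)]
data = np.array([
    [5.0, 50.0],
    [100.0, 1000.0],
    [50.0, 500.0],   # query book
    [95.0, 510.0],   # candidate A: similar buzz, very different budget
    [52.0, 600.0],   # candidate B: similar budget, somewhat different buzz
])

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Distances on the raw features: the buzz column dominates because its numbers are larger.
raw_a = euclidean(data[2], data[3])
raw_b = euclidean(data[2], data[4])

# Standardize each column, then measure the same distances again.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)
scaled_a = euclidean(scaled[2], scaled[3])
scaled_b = euclidean(scaled[2], scaled[4])

print(f"Raw distances:    A={raw_a:.2f}, B={raw_b:.2f}")
print(f"Scaled distances: A={scaled_a:.2f}, B={scaled_b:.2f}")
```

On the raw values candidate A looks nearer, but after standardization candidate B is nearer, because a 45-thousand-dollar budget gap is large relative to the spread of budgets while a 100-mention gap is small relative to the spread of buzz.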
Model Training:
We initialize the KNeighborsRegressor with k=5 (5
nearest neighbors) and train it using the scaled training
data.
Model Evaluation:
After training the model, we make predictions on the
test set and evaluate the performance using Mean
Absolute Error (MAE) and Mean Squared Error (MSE).
Prediction for New Data:
For a new example with specific values (e.g.,
advertising budget = 50, holiday season, fiction genre,
300 social media mentions), we predict the book sales.
The new data is scaled before making the prediction.
********Support Vector Machine (SVM)********
Goal: Find a hyperplane that best separates classes in
the feature space.
Steps:
1. Model Assumption: The data can be separated
into classes by a hyperplane (in higher
dimensions).
2. Objective: Maximize the margin between the
classes.
3. Learning Process:
● Identify the support vectors (data points
that are closest to the hyperplane).
● Use an optimization technique (e.g.,
Quadratic Programming) to find the
optimal hyperplane.
4. Prediction: Classify new data points based on
which side of the hyperplane they fall on.
Support Vector Machine (SVM) is a supervised
machine learning algorithm primarily used for
classification tasks, but it can also be adapted for
regression (SVR). SVM works by finding the hyperplane
that best separates the data points of different classes.
The goal is to maximize the margin between the
classes, which improves the model's generalization.
How SVM Works:
Linear SVM:
For a linearly separable dataset, the SVM algorithm
tries to find the best possible line (or hyperplane in
higher dimensions) that separates the two classes.
The margin is the distance between the closest points
of the classes (support vectors) to the hyperplane. The
SVM maximizes this margin to improve the model's
ability to generalize.
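The margin can be computed directly once a hyperplane is fixed. The sketch below does not train an SVM; it takes a hand-picked hyperplane w·x + b = 0 and four made-up points (all assumptions for illustration) and measures each point's distance to the hyperplane.

```python
import numpy as np

# Hypothetical separating hyperplane w . x + b = 0 in two dimensions.
w = np.array([1.0, 1.0])
b = -3.0

# Hypothetical points with labels -1 and +1.
points = np.array([[1.0, 1.0], [0.0, 1.0], [2.0, 2.0], [3.0, 2.0]])
labels = np.array([-1, -1, 1, 1])

# Signed score of each point; its sign gives the predicted side.
decision = points @ w + b

# Geometric distance of each point to the hyperplane.
distances = np.abs(decision) / np.linalg.norm(w)

# With the closest points scaled to |w . x + b| = 1, the margin width is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(w)

# Support vectors are the points that sit exactly on the margin boundary.
support_mask = np.isclose(labels * decision, 1.0)

print("Decision values:", decision)
print("Distances:", distances)
print("Margin width:", margin_width)
print("Support vectors:\n", points[support_mask])
```

Only the two points on the margin boundary determine the hyperplane. Moving the other two points further away would not change it.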
Non-Linear SVM:
In real-world datasets, classes are often not linearly
separable. SVM handles this by using a technique
called the kernel trick, which transforms the data into
higher dimensions where a hyperplane can be found.
Common kernels include linear, polynomial, and Radial
Basis Function (RBF).
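The idea of moving to a higher dimension can be shown with the XOR pattern, which no straight line separates in two dimensions. The sketch below uses an explicit feature map for illustration; a real kernel computes the equivalent inner products without building the mapped vectors. The map, the weights w, and the offset b are hand-chosen assumptions, not values learned by an SVM.

```python
import numpy as np

# XOR pattern: not linearly separable in the original 2-D space.
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([-1, 1, 1, -1])

def feature_map(x):
    """Map (x1, x2) to (x1, x2, x1*x2), adding one product feature."""
    return np.array([x[0], x[1], x[0] * x[1]])

# Hypothetical hyperplane in the 3-D feature space.
w = np.array([1.0, 1.0, -2.0])
b = -0.5

# Score each mapped point; the sign is the predicted class.
scores = np.array([w @ feature_map(p) + b for p in points])
predicted = np.sign(scores).astype(int)

print("Scores:", scores)
print("Predicted labels:", predicted)
```

A single plane in the three-dimensional space classifies all four points correctly, which corresponds to a curved boundary back in the original two dimensions.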
SVM for Classification:
In classification, SVM aims to find the hyperplane that
best divides the data into two classes.
The points that are closest to the hyperplane are called
support vectors, which help define the margin.
SVM for Regression (SVR):
The goal in SVM regression is to fit the best line (or
hyperplane) that approximates the data while allowing
some tolerance for errors (defined by a margin of
tolerance, ε).
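The tolerance ε is usually expressed through the epsilon-insensitive loss: errors smaller than ε cost nothing, and larger errors are penalized only for the part that exceeds ε. The sketch below is a minimal plain-Python version with made-up values, and the function name is illustrative.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 7.0]
y_pred = [3.2, 4.0, 7.5]

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)
```

The first and third predictions fall within the tube, so only the second prediction contributes to the loss.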
Example:---->1
Support Vector Machine (SVM) model for predicting
vehicle traffic in Mumbai:-
A real traffic dataset might record factors such as:
● Traffic volume (number of vehicles passing a
specific point)
● Time of day
● Weather conditions (rain, temperature, etc.)
● Day of the week (weekdays vs weekends)
● Public holidays
● Events (such as festivals, sports events, or
concerts)
Since no real dataset is at hand, we walk through a
hypothetical example of how one might approach this
problem using a simple SVM model to predict whether
traffic is heavy or light based on features like time of
day, weather conditions, and day of the week.
Let’s break down the steps:
Problem Setup:
We want to predict whether traffic in Mumbai is heavy
or light using an SVM classifier.
Features:
Time of Day: Categorized into Morning, Afternoon,
Evening, Night (e.g., 0 = Morning, 1 = Afternoon, 2 =
Evening, 3 = Night)
Weather Condition: Weather data might include Rainy
(e.g., 1), Sunny (e.g., 0), or Cloudy (e.g., 2).
Day of Week: Weekday (0-4) or Weekend (5-6), with
weekends typically seeing heavier traffic.
Target:
Traffic Status: A binary classification where:
0: Light Traffic
1: Heavy Traffic
Steps to Implement:
1. Import Libraries
2. Generate Sample Data
3. Preprocess Data (e.g., scaling)
4. Train SVM Model
5. Evaluate the Model
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
# Step 1: Generate sample data (this would usually be loaded from a real dataset)
# Sample data: [Time of Day, Weather Condition, Day of Week] -> Traffic Status (0 = Light, 1 = Heavy)
data = {
    # 0=Morning, 1=Afternoon, 2=Evening, 3=Night
    'Time_of_Day': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2],
    # 0=Sunny, 1=Rainy, 2=Cloudy
    'Weather_Condition': [0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 2, 0, 1, 2, 0],
    # 0-4=Weekdays, 5-6=Weekend
    'Day_of_Week': [0, 1, 2, 3, 4, 5, 5, 6, 6, 0, 1, 2, 3, 4, 5],
    # 0=Light Traffic, 1=Heavy Traffic
    'Traffic_Status': [0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1]
}
# Convert the dictionary into a pandas DataFrame
df = pd.DataFrame(data)
# Step 2: Feature matrix X (independent variables) and target vector y (dependent variable)
X = df[['Time_of_Day', 'Weather_Condition', 'Day_of_Week']]
y = df['Traffic_Status']
# Step 3: Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Step 4: Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Step 5: Train the SVM model (using a linear kernel)
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_scaled, y_train)
# Step 6: Make predictions on the test set
y_pred = svm_classifier.predict(X_test_scaled)
# Step 7: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Output results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
OUTPUT:-
Explanation of the Code:
Data Generation: We create a simple dataset where
each row represents a time instance with features like
Time_of_Day, Weather_Condition, and Day_of_Week,
and the corresponding Traffic_Status (either 0 for light
or 1 for heavy traffic).
Feature Matrix (X) and Target Vector (y):
X consists of the features (Time_of_Day,
Weather_Condition, and Day_of_Week).
y contains the target variable Traffic_Status, which is
the binary outcome (0 for light traffic and 1 for heavy
traffic).
Train-Test Split: We split the dataset into training and
testing sets with 80% of the data used for training and
20% for testing.
Feature Scaling: We scale the features using
StandardScaler. This step is important because SVMs
are sensitive to the scale of input features. Scaling
ensures that each feature contributes equally to the
model.
Model Training: We train the SVM classifier with a
linear kernel. This is suitable since the problem is
relatively simple and we are assuming that the data is
linearly separable.
Prediction: The trained model makes predictions on
the test data, and we evaluate its performance.
Evaluation: We use accuracy, confusion matrix, and
classification report to evaluate the model's
performance.
Example:---->2
Support Vector Machine (SVM) to classify whether a
student would pass or fail an exam based on their
hours studied and hours slept:-
Problem Setup:-
We are tasked with predicting if a student will pass or
fail an exam based on the number of hours studied and
hours slept. We will use a synthetic dataset for this
binary classification task.
Steps for the SVM Example:
1. Create Synthetic Data
2. Preprocess Data
3. Train SVM Model
4. Evaluate the Model
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Step 1: Create a synthetic dataset
# We will create a dataset with two features: Hours Studied and Hours Slept
# Target variable: Pass (1) or Fail (0)
data = {
    'Hours_Studied': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8],
    'Hours_Slept': [8, 7, 6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2, 1, 8],
    'Pass/Fail': [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1]  # 0 = Fail, 1 = Pass
}
# Converting the data into a pandas DataFrame
df = pd.DataFrame(data)
# Step 2: Feature matrix X and target vector y
X = df[['Hours_Studied', 'Hours_Slept']]
y = df['Pass/Fail']
# Step 3: Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Scale the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Step 5: Initialize and train the SVM model with a linear kernel
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_scaled, y_train)
# Step 6: Make predictions on the test set
y_pred = svm_classifier.predict(X_test_scaled)
# Step 7: Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Output results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
OUTPUT:-
Explanation of the Code:
Creating the Dataset:
We have a synthetic dataset with two features:
Hours_Studied and Hours_Slept, and the target
variable Pass/Fail.
The target variable Pass/Fail is binary, where 0
represents failure and 1 represents passing the exam.
Feature Matrix (X) and Target Vector (y):
X contains the features: Hours_Studied and
Hours_Slept.
y contains the target variable Pass/Fail.
Train-Test Split: We split the dataset into training and
testing sets using the train_test_split function. We use
80% of the data for training and 20% for testing. With
only 15 rows, the test set contains just 3 samples, so the
reported metrics are only a rough indication.
Feature Scaling: We scale the features using
StandardScaler to ensure that both features are on the
same scale. This is important because SVMs are
sensitive to the scale of input features.
Model Training: We initialize an SVM classifier with a
linear kernel (assuming the two classes can be separated
reasonably well by a straight line) and train it on the
scaled training data.
Making Predictions:
After training the model, we use it to predict the
Pass/Fail status for the test data.
Model Evaluation: We evaluate the model's
performance using accuracy, confusion matrix, and
classification report. These metrics show how well the
model performs on the unseen test data.
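Because the dataset is so small, a single train-test split can be misleading. The following is a minimal sketch, not part of the original example, that evaluates the same linear SVM with cross-validation. The choice of cv=3 and the use of make_pipeline are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Same values as the Hours_Studied / Hours_Slept dataset above
hours_studied = [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8]
hours_slept = [8, 7, 6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2, 1, 8]
passed = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
X = np.column_stack([hours_studied, hours_slept])
y = np.array(passed)

# The pipeline refits the scaler inside every fold, so the held-out fold
# never influences the scaling statistics
model = make_pipeline(StandardScaler(), SVC(kernel='linear'))
scores = cross_val_score(model, X, y, cv=3)  # cv=3 is an arbitrary choice for this tiny dataset
print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())
```

Averaging over several folds uses every row for testing once, which gives a steadier estimate than three test samples can.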
****************Decision Trees*******************
Goal: Split the data into branches based on feature
values, leading to predictions.
Steps:
Model Assumption: Recursively split the data based
on feature values to create branches.
Objective: Maximize information gain (for
classification) or minimize variance (for regression) at
each split.
Learning Process: For each node, choose the best
feature to split the data.
Recursively repeat the process until a stopping
condition is met (e.g., maximum depth).
Prediction: Follow the branches for a new data point to
make a prediction.
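The "maximize information gain" objective can be illustrated with a small sketch. The helper names entropy and information_gain and the toy age/buy arrays are hypothetical, introduced only for this illustration. Note that scikit-learn's DecisionTreeClassifier uses Gini impurity by default; passing criterion='entropy' selects the information-gain criterion.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature, labels, threshold):
    """Entropy reduction obtained by splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

# Toy data: younger people do not buy, older people do
age = np.array([25, 30, 35, 40, 45, 50, 55, 60])
buy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

gain_good = information_gain(age, buy, threshold=42.5)  # separates the classes perfectly
gain_poor = information_gain(age, buy, threshold=27.5)  # isolates only one sample
print(gain_good, gain_poor)
```

A tree-building algorithm evaluates many candidate thresholds in this way and keeps the split with the highest gain.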
Example:---->1
predict whether a person will buy a product
based on two features: Age and Income.
This is a simple binary classification
problem where we predict whether a person
will buy a product (1) or not (0):-
Problem Setup:-
Features:
Age: Age of the person (numerical feature)
Income: Annual income of the person (numerical
feature)
Target Variable:
Buy: Binary classification target where:
1: Person will buy the product
0: Person will not buy the product
We will use a simple synthetic dataset and apply a
Decision Tree to classify the target.
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Step 1: Create a synthetic dataset
# We create a dataset with two features: Age and Income, and the target variable Buy (1 or 0)
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80],
    'Income': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80],
    'Buy': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]  # Target variable (0 = No, 1 = Yes)
}
# Converting the data into a pandas DataFrame
df = pd.DataFrame(data)
# Step 2: Feature matrix X and target vector y
X = df[['Age', 'Income']]  # Features
y = df['Buy']  # Target variable (whether the person buys the product)
# Step 3: Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Initialize and train the Decision Tree model
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
# Step 5: Make predictions on the test set
y_pred = dt_classifier.predict(X_test)
# Step 6: Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Output results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
# Step 7: Visualize the Decision Tree (optional)
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
plot_tree(dt_classifier, feature_names=['Age', 'Income'], class_names=['No', 'Yes'], filled=True, rounded=True)
plt.show()
OUTPUT:-
Explanation of the Code:
Creating the Dataset:
We define a simple dataset with Age, Income, and a
target variable Buy (binary: 1 for "buy", 0 for "no buy").
The data is stored in a pandas DataFrame.
Feature Matrix (X) and Target Vector (y):
The feature matrix X consists of the columns Age and
Income.
The target vector y contains the binary outcomes
(whether the person buys the product or not).
Train-Test Split:
We split the dataset into training (80%) and testing
(20%) sets using train_test_split from scikit-learn. The
random_state=42 ensures reproducibility.
Model Training:
We initialize the DecisionTreeClassifier and fit it to the
training data. The decision tree will learn patterns
based on the feature values and split the data into
decision nodes to classify whether a person will buy the
product.
Prediction: After training the model, we use it to predict
the target variable on the test data using predict().
Model Evaluation:
We evaluate the model's performance using:
Accuracy: Percentage of correct predictions.
Confusion Matrix: Shows the number of true positives,
true negatives, false positives, and false negatives.
Classification Report: Provides precision, recall, and
F1-score for both classes (buy or not buy).
Visualizing the Decision Tree (Optional): We use
plot_tree() to visualize the decision tree structure, which
shows how the model makes decisions based on the
features.
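When a plot is not convenient, the learned rules can also be printed as text. This is a minimal sketch using scikit-learn's export_text on the same Age/Income values; the max_depth=2 limit is an assumption added here to keep the printed rules short, and the tree is fitted on all rows rather than on a training split.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Same synthetic values as in the example above
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80],
    'Income': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80],
    'Buy': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
})
X = df[['Age', 'Income']]
y = df['Buy']

# max_depth=2 is an illustrative stopping condition, not part of the original example
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)
rules = export_text(tree, feature_names=['Age', 'Income'])
print(rules)
```

Each indented line is one threshold test, and each "class:" line is a leaf, so the text can be read top to bottom like nested if/else statements.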
Example:---->2
predicting whether a person will survive or
not survive based on age and fare (the
amount paid for the ticket) in the famous
Titanic dataset. This is a classic binary
classification problem, where we predict the
survival status of passengers on the
Titanic:-
Problem Setup:
Features:
Age: Age of the passenger.
Fare: The fare the passenger paid for the ticket.
Target Variable:
Survived: Binary classification target where:
1: Survived
0: Did not survive
We will use the Titanic dataset, which we download from
OpenML using scikit-learn's fetch_openml function.
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.datasets import fetch_openml
# Step 1: Load Titanic Dataset
# Fetch the Titanic dataset from OpenML
titanic_data = fetch_openml(name='titanic', version=1)
# Create DataFrame with correct column names
df = pd.DataFrame(data=titanic_data.data, columns=titanic_data.feature_names)
df['Survived'] = titanic_data.target  # Add the target column (Survived)
# Display the first few rows of the dataset
print(df.head())
# Step 2: Preprocess Data
# For simplicity, we select the 'age' and 'fare' features and drop missing values
# (the feature columns in the OpenML version of the dataset are lowercase)
df = df[['age', 'fare', 'Survived']].dropna()
# Features (X) and Target (y)
X = df[['age', 'fare']]
y = df['Survived']
# Step 3: Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Initialize and train the Decision Tree model
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
# Step 5: Make predictions on the test set
y_pred = dt_classifier.predict(X_test)
# Step 6: Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Output results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
# Step 7: Visualize the Decision Tree (optional)
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
plot_tree(dt_classifier, feature_names=['age', 'fare'], class_names=['Not Survived', 'Survived'], filled=True, rounded=True)
plt.show()
OUTPUT:-
Explanation of the Code:
Loading the Titanic Dataset:
We use fetch_openml to load the Titanic dataset from
OpenML.
The dataset includes both features and the target
variable Survived (whether the passenger survived).
Preprocessing Data:
We select the relevant columns: age, fare, and
Survived (the feature columns in the OpenML dataset are
lowercase).
We drop any rows with missing values using .dropna().
For simplicity, we focus only on age and fare.
Feature Matrix (X) and Target Vector (y):
X contains the features age and fare.
y contains the target variable Survived.
Train-Test Split:
We split the data into training and testing sets using
train_test_split (80% train, 20% test).
Model Training:
We initialize a DecisionTreeClassifier and fit it on the
training data.
Prediction:
We use the trained model to predict the survival status
on the test set.
Evaluation:
We evaluate the model using:
Accuracy: Percentage of correct predictions.
Confusion Matrix: Shows the number of true positives,
true negatives, false positives, and false negatives.
Classification Report: Provides precision, recall, and
F1-score for both classes (survived or not survived).
Visualization (Optional): We visualize the decision tree
using plot_tree() to understand how the model is
making its decisions.
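The evaluation metrics listed above can be derived by hand from the four cells of the confusion matrix. The label vectors below are hypothetical values chosen for illustration, not output from the Titanic model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = survived, 0 = did not survive)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are correct
recall = tp / (tp + fn)                      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

classification_report prints the same precision, recall, and F1 values, computed once for each class.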
**************Random Forests******************
Goal: Build multiple decision trees and aggregate their
predictions.
Steps:
Model Assumption: Build many decision trees using
bootstrapped data (random subsets of the training set).
Objective: Use bagging (Bootstrap Aggregating) to
improve the stability and accuracy of a single decision
tree.
Learning Process:
Randomly select subsets of features and data to train
multiple decision trees.
Average predictions (for regression) or use majority
voting (for classification).
Prediction: Aggregate the predictions from all the trees
in the forest.
Random Forest is an ensemble learning method that
constructs a multitude of decision trees at training
time. For classification, it outputs the class that is the
mode (most frequent) of the classes predicted by the
individual trees; for regression, it outputs the average of
their predictions. It is effective for both classification
and regression tasks.
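The bagging idea described above can be sketched by hand with ordinary decision trees. This is an illustration under stated assumptions (a synthetic dataset from make_classification, 25 trees, hard majority voting), not how RandomForestClassifier is implemented internally; scikit-learn's forest combines trees by averaging their predicted class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features='sqrt' makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X[idx], y[idx]))

# Collect every tree's prediction, then take the majority vote per sample
votes = np.array([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) > 0.5).astype(int)
train_accuracy = (majority == y).mean()
print('Training accuracy of the voted ensemble:', train_accuracy)
```

Each tree sees a slightly different dataset, so their individual errors differ, and voting tends to cancel them out.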
Example:---->1
predict whether a passenger survived on
the Titanic based on features such as Age,
Fare, Pclass, and Sex:-
Problem Setup:
We will use the famous Titanic dataset and predict
the Survived column (whether the passenger
survived or not) using the following features:
● Pclass: Passenger class (1st, 2nd, 3rd)
● Sex: Gender of the passenger
● Age: Age of the passenger
● Fare: The fare the passenger paid for the ticket
Steps:
1. Load and Preprocess the Titanic Dataset.
2. Train a Random Forest Model.
3. Evaluate the Model.
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder
# Step 1: Load Titanic Dataset
# We load the Titanic dataset from a CSV file
# (the link below is incomplete in this text; use the URL or local path of a Titanic CSV file)
url = "[Link]o/datasets/master/[Link]"
df = pd.read_csv(url)
# Step 2: Preprocess Data
# We will focus on the 'Pclass', 'Sex', 'Age', 'Fare' columns for simplicity
# .copy() gives an independent DataFrame that is safe to modify
df = df[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].copy()
# Handle missing values (Age has some missing values)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Convert categorical 'Sex' to numerical values (1 for male, 0 for female)
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])
# Step 3: Feature matrix (X) and target vector (y)
X = df[['Pclass', 'Sex', 'Age', 'Fare']]  # Features
y = df['Survived']  # Target (1: Survived, 0: Not Survived)
# Step 4: Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 5: Initialize and train the Random Forest model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
# Step 6: Make predictions on the test set
y_pred = rf_classifier.predict(X_test)
# Step 7: Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Output results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
OUTPUT:-
Explanation of the Code:
Load Titanic Dataset:
We load the Titanic dataset using pandas.read_csv(). If
you have the dataset in a local file, you can load it
similarly. This dataset contains columns like Pclass,
Sex, Age, Fare, and Survived.
Preprocess Data:
We select only the necessary features: Pclass, Sex,
Age, Fare, and Survived.
We handle missing values in the Age column by filling
them with the mean age of the passengers.
The Sex feature is categorical, so we convert it to
numerical values using LabelEncoder (1 for male, 0 for
female).
Feature Matrix (X) and Target Vector (y):
X contains the features Pclass, Sex, Age, and Fare.
y contains the target variable Survived (1 for survived, 0
for not).
Train-Test Split:
We split the data into 80% training and 20% testing sets
using train_test_split.
Model Training:
We initialize a RandomForestClassifier with 100 trees
(n_estimators=100) and train it on the training set using
fit().
Prediction:
We use the trained model to predict the survival status
of passengers in the test set using predict().
Model Evaluation:
We evaluate the model's performance using
accuracy_score, confusion_matrix, and
classification_report.
Accuracy: The percentage of correctly predicted
passengers.
Confusion Matrix: It shows the number of true
positives, true negatives, false positives, and false
negatives.
Classification Report: Provides precision, recall, and
F1-score for both classes (survived and not survived).
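The "1 for male, 0 for female" mapping follows from how LabelEncoder works: it sorts the distinct values and numbers them in that order. A minimal sketch with a hypothetical four-element list makes this visible.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['male', 'female', 'female', 'male'])

print(list(le.classes_))                      # sorted classes: 'female' comes first
print(codes.tolist())                         # encoded values
print(le.inverse_transform([0, 1]).tolist())  # map codes back to the original labels
```

The classes_ attribute is the place to check the mapping whenever the category names are not known in advance.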
Example:---->2
using a Random Forest classifier to predict
whether a person will purchase a product
based on their demographic features. We
will use a synthetic dataset for this task:
Problem Setup:
We want to predict whether a person will purchase a
product (Purchased column) based on their Age,
Salary, and Gender.
Dataset Setup:
● Age: Age of the person.
● Salary: Salary of the person.
● Gender: Gender of the person (Male or Female).
● Purchased: Whether the person purchased the
product (1 for Yes, 0 for No).
Steps:
1. Create the synthetic dataset.
2. Preprocess the data.
3. Train the Random Forest Classifier.
4. Evaluate the model.
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
# Step 1: Create synthetic dataset
data = {
    'Age': [22, 25, 47, 52, 46, 56, 55, 23, 35, 26, 38, 40, 60, 50, 27],
    'Salary': [19000, 20000, 43000, 45000, 41000, 50000, 55000, 21000, 38000, 24000, 37000, 42000, 60000, 50000, 23000],
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'Purchased': [0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0]  # Target variable
}
# Create a pandas DataFrame
df = pd.DataFrame(data)
# Step 2: Preprocess Data
# Convert categorical 'Gender' to numerical values (Male = 1, Female = 0)
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])
# Step 3: Feature matrix (X) and target vector (y)
X = df[['Age', 'Salary', 'Gender']]  # Features
y = df['Purchased']  # Target (1: Purchased, 0: Not Purchased)
# Step 4: Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 5: Initialize and train the Random Forest model
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
# Step 6: Make predictions on the test set
y_pred = rf_classifier.predict(X_test)
# Step 7: Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
# Output results
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')
# Step 8: Visualizing the feature importance
importances = rf_classifier.feature_importances_
features = X.columns
# Create a DataFrame for easy plotting
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
# Plot (horizontal bars, so the importance values run along the x-axis)
plt.figure(figsize=(8, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], color='skyblue')
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importances')
plt.show()
OUTPUT:-
Explanation of the Code:
Create Synthetic Dataset:
We generate a synthetic dataset with 15 samples
where we have 3 features: Age, Salary, and Gender.
The target variable Purchased indicates whether the
person bought the product (1 for yes, 0 for no).
Preprocessing:
The Gender feature is categorical, so we encode it to
numerical values using LabelEncoder (Male = 1,
Female = 0).
We prepare the feature matrix X (which includes
Age, Salary, and Gender) and the target vector y
(which contains the Purchased column).
Train-Test Split:
We split the dataset into training and test sets (80% for
training and 20% for testing) using train_test_split.
Model Training:
We initialize the RandomForestClassifier with 100 trees
(n_estimators=100) and train it using the training set.
Prediction: We make predictions on the test set using
the trained model.
Model Evaluation: We evaluate the model using
accuracy_score, confusion_matrix, and
classification_report.
Accuracy: Percentage of correct predictions.
Confusion Matrix: It shows the number of true
positives, true negatives, false positives, and false
negatives.
Classification Report: It provides precision, recall, and
F1-score for both classes (purchased and not
purchased).
Feature Importance Visualization:
We plot the feature importances using the Random
Forest model's feature_importances_ attribute to see
which feature is most important for predicting
purchases.
Unsupervised Learning
In unsupervised learning, the model learns
patterns from unlabeled data.
************K-Means Clustering***********
Goal: Partition the data into K clusters based on
similarity.
Steps:
Model Assumption: There are K clusters, and each
cluster has a centroid.
Objective: Minimize the variance within each cluster.
Learning Process:
Randomly initialize K centroids.
Assign each data point to the nearest centroid.
Recalculate centroids as the mean of points in each
cluster.
Repeat until centroids converge (no change).
Prediction: Assign new points to the nearest centroid.
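The learning process above can be written out in a few lines of NumPy. This is a minimal sketch of the assign/update loop, not scikit-learn's implementation; to keep the result reproducible, the hypothetical kmeans function takes the starting centroids as an argument instead of drawing them at random.

```python
import numpy as np

def kmeans(X, init_centroids, max_iter=100):
    """Plain K-Means (assign, then update) starting from the given centroids."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(max_iter):
        # Assignment step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0], [3, 3], [8, 3]], dtype=float)
centroids, labels = kmeans(X, init_centroids=X[[0, 3]])  # start from two of the data points
print(centroids)
print(labels)

# Prediction: a new point goes to the cluster with the nearest centroid
new_points = np.array([[2.0, 1.0], [9.0, 5.0]])  # hypothetical unseen points
new_labels = np.linalg.norm(new_points[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
print(new_labels)
```

Different starting centroids can lead to different final clusters, which is why library implementations typically run several initializations and keep the best result.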
K-Means clustering is an unsupervised machine
learning algorithm used to partition a dataset into
clusters based on similarity. It groups data points into K
clusters, where each data point belongs to the cluster
with the nearest mean. Here's an example to
demonstrate how K-Means clustering works:
Example:---->1
Consider a dataset of 2D points (with x and
y coordinates) and we want to cluster them
into 3 clusters:
Steps Involved:
1. Choose the number of clusters: We set K=3,
meaning we want to find 3 clusters.
2. Initialize centroids: Randomly select K data points as
initial centroids.
3. Assign data points to the nearest centroid: Assign
each data point to the cluster whose centroid is nearest.
4. Recompute centroids: After all points are assigned to
clusters, compute the new centroids of each cluster as
the mean of all points within that cluster.
5. Repeat steps 3-4: Repeat the process until the
centroids no longer change or a maximum number of
iterations is reached.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Create sample 2D data points
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0], [3, 3], [8, 3]])
# Plot the data points
plt.scatter(X[:, 0], X[:, 1], s=150, c='blue', label='Data Points')
# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
# Get the centroids
centroids = kmeans.cluster_centers_
# Get the labels (cluster assignments)
labels = kmeans.labels_
# Plot the data points with cluster labels
for i in range(3):
    plt.scatter(X[labels == i, 0], X[labels == i, 1], s=150, label=f'Cluster {i+1}')
# Plot the centroids
plt.scatter(centroids[:, 0], centroids[:, 1], s=300, c='red', marker='X', label='Centroids')
# Add labels and legend
plt.xlabel('X')
plt.ylabel('Y')
plt.title('K-Means Clustering Example')
plt.legend()
plt.show()
OUTPUT:
Explanation of Code:
Data: The X array contains 8 data points with 2
features each (x and y coordinates).
KMeans: The KMeans(n_clusters=3) initializes the
K-Means algorithm to find 3 clusters.
Fitting the model: Calling .fit(X) on the KMeans object
applies the K-Means algorithm to the data.
Centroids and Labels: The centroids are accessed via
kmeans.cluster_centers_, and the cluster assignments
for each point are accessed via kmeans.labels_.
Visualization: We use matplotlib to plot the data points,
showing different colors for different clusters, and
highlight the centroids with red "X"s.
Visual Output:
The data points are grouped into three clusters.
Each cluster is represented by different colors.
The centroids (the centers of each cluster) are marked
with red "X"s.
Conclusion:
This example demonstrates how K-Means clustering
works by partitioning the 2D data points into 3 clusters
based on similarity, and it visualizes the clusters and
their centroids. The K-Means algorithm is very efficient
and widely used for clustering tasks.
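In this example K was fixed in advance. One common way to choose K when it is not known is the elbow heuristic: fit K-Means for several values of K and look at how the within-cluster sum of squares (the inertia_ attribute) drops. The sketch below uses synthetic make_blobs data and n_init=10, both of which are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 true clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)  # sum of squared distances to the nearest centroid

for k, value in zip(range(1, 7), inertias):
    print(k, round(value, 1))
```

The inertia falls steeply while K is below the true number of clusters and only slowly afterwards; the bend in that curve suggests a reasonable K.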
Example:---->2
Generate a synthetic dataset of 2D data
points that form several distinct clusters,
and then apply K-Means clustering to
identify the clusters:-
Steps:-
1. We'll generate random data that forms 3 distinct
clusters.
2. We'll use the K-Means algorithm to find the
clusters.
3. We'll visualize the clusters and the centroids.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Generate a synthetic dataset with 3 clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)
# Plot the data points
plt.scatter(X[:, 0], X[:, 1], s=50, c='blue', label='Data Points')
# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
# Get the centroids
centroids = kmeans.cluster_centers_
# Get the labels (cluster assignments)
labels = kmeans.labels_
# Plot the data points with cluster labels
for i in range(3):
    plt.scatter(X[labels == i, 0], X[labels == i, 1], s=50, label=f'Cluster {i+1}')
# Plot the centroids
plt.scatter(centroids[:, 0], centroids[:, 1], s=500, c='red', marker='*', label='Centroids')
# Add labels and legend
plt.xlabel('X')
plt.ylabel('Y')
plt.title('K-Means Clustering on Synthetic Data')
plt.legend()
plt.show()
OUTPUT:-
Explanation of the Code:
Data Generation: make_blobs generates synthetic data
that forms 3 clusters. The number of samples is set to
300, and the standard deviation of each cluster
(cluster_std) is set to 0.60.
KMeans Application: KMeans(n_clusters=3) initializes
the K-Means algorithm to find 3 clusters.
We fit the model to the dataset using kmeans.fit(X).
Visualization:
The data points are scattered, and clusters are
visualized using different colors.
The centroids of the clusters are plotted as large red
stars (marker='*').
************Hierarchical Clustering*************
Goal: Build a tree-like structure (dendrogram) to
represent the clustering.
Steps:
Model Assumption: Start with each data point as its
own cluster and iteratively merge the closest clusters.
Objective: Minimize the distance between clusters as
they are merged.
Learning Process:
Compute the distance between every pair of data points
or clusters.
Merge the closest clusters until one large cluster
remains.
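Prediction: Cut the dendrogram at a chosen height (or
number of clusters) to obtain the cluster assignments.
Unlike K-Means, agglomerative clustering has no built-in
rule for assigning new data points to clusters.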
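The merging process is recorded in a linkage matrix, which is what the dendrogram is drawn from. A minimal sketch with four hypothetical points shows its layout: each row describes one merge as the two cluster indices joined, the distance at which they were joined, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical points forming two obvious pairs
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 12.0]])

Z = linkage(X, method='ward')
print(Z)  # one row per merge: [cluster_a, cluster_b, distance, new_cluster_size]
```

With n points there are always n - 1 merges, and the last row joins everything into a single cluster. Newly formed clusters receive indices starting at n, so index 4 here refers to the cluster created by the first merge.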
The following example applies Hierarchical Clustering
to a dataset with noise and clusters of different
densities, with detailed steps to help you understand
the algorithm.
Example:---->1
We generate a synthetic 2D dataset that
has clusters with varying densities and
noise. Then we perform Agglomerative
Hierarchical Clustering, visualize the
dendrogram, and then visualize the
clusters:
Steps:
1. Generate a synthetic dataset with noise and
varying cluster densities.
2. Apply Agglomerative Hierarchical Clustering.
3. Visualize the dendrogram to observe the
merging process of clusters.
4. Apply a cut-off to form a specified number of
clusters.
5. Visualize the final clusters based on the
hierarchical clustering.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Generate synthetic data with noise and varying densities
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=[0.60, 1.0, 1.5, 0.75], random_state=42)
# Add some noise to the dataset
noise = np.random.randn(40, 2) * 10  # 40 random points with large variance
X_with_noise = np.vstack([X, noise])
# Apply Agglomerative Hierarchical Clustering
linkage_matrix = linkage(X_with_noise, method='ward')
# Plot the Dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linkage_matrix)
plt.title('Dendrogram for Agglomerative Hierarchical Clustering')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Now apply AgglomerativeClustering with a predefined number of clusters (e.g., 4)
agg_clustering = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = agg_clustering.fit_predict(X_with_noise)
# Visualize the clusters
plt.figure(figsize=(8, 6))
for i in range(4):
    plt.scatter(X_with_noise[labels == i, 0], X_with_noise[labels == i, 1], s=50, label=f'Cluster {i+1}')
plt.title('Clusters from Agglomerative Hierarchical Clustering')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
OUTPUT:-
Explanation of the Code:
Data Generation: make_blobs() is used to create 4
clusters, but each cluster has a different standard
deviation (cluster_std), giving them varying densities.
This helps simulate a more realistic scenario where
clusters are not uniform.
We then introduce noise into the dataset by adding 40
random points scattered across a larger area (noise =
np.random.randn(40, 2) * 10).
Agglomerative Hierarchical Clustering:
The linkage() function from scipy.cluster.hierarchy is
used to perform hierarchical clustering using the ward
method, which at each step merges the pair of clusters
that gives the smallest increase in within-cluster
variance.
The dendrogram helps visualize the merging process of
clusters.
Dendrogram Visualization:
The dendrogram visually represents how clusters are
merged at each step, and the height of the merging
lines indicates the distance between clusters.
The x-axis shows the sample index, while the y-axis
shows the distance between clusters as they are
merged.
Final Clustering:
AgglomerativeClustering from sklearn is used to cut the
dendrogram at a specific level to form 4 clusters.
The labels for each data point are generated with
fit_predict(X_with_noise), which assigns each point to
one of the 4 clusters.
Cluster Visualization:
We plot the data points with different colors based on
the cluster they belong to. This gives us the final view of
how the hierarchical algorithm grouped the data points
into clusters.
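Cutting the dendrogram can also be done directly on the SciPy linkage matrix with fcluster. The sketch below, on a small synthetic dataset chosen for illustration, requests 3 clusters with criterion='maxclust' and compares the result with AgglomerativeClustering using the adjusted Rand index, which ignores how the clusters happen to be numbered.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Small, well-separated synthetic dataset (illustrative only)
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.60, random_state=42)

# SciPy route: build the full hierarchy, then cut it into at most 3 clusters
Z = linkage(X, method='ward')
scipy_labels = fcluster(Z, t=3, criterion='maxclust')  # labels start at 1

# scikit-learn route: ask for 3 clusters directly
sklearn_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

agreement = adjusted_rand_score(scipy_labels, sklearn_labels)
print(agreement)  # 1.0 means the two partitions are identical
```

The SciPy route is convenient when several cut levels need to be tried, because the hierarchy is built once and can be cut repeatedly.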
Example:---->2
Apply Agglomerative Hierarchical Clustering
to concentric circles, a dataset whose
clusters are not linearly separable, to see
how the algorithm groups points based on the
underlying data structure:
Steps:
1. Generate a synthetic dataset with concentric
circles to simulate non-linearly separable
clusters.
2. Apply Agglomerative Hierarchical Clustering.
3. Visualize the Dendrogram to observe the
merging of clusters.
4. Apply a cut-off to form a specified number of
clusters (e.g., 2 or 3).
5. Visualize the final clusters based on the
hierarchical clustering.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Generate synthetic data with concentric circles (non-linearly separable)
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.1, random_state=42)
# Apply Agglomerative Hierarchical Clustering
linkage_matrix = linkage(X, method='ward')
# Plot the Dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linkage_matrix)
plt.title('Dendrogram for Agglomerative Hierarchical Clustering')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Now apply AgglomerativeClustering with a predefined number of clusters (e.g., 2)
agg_clustering = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = agg_clustering.fit_predict(X)
# Visualize the clusters
plt.figure(figsize=(8, 6))
for i in range(2):
    plt.scatter(X[labels == i, 0], X[labels == i, 1], s=50, label=f'Cluster {i+1}')
plt.title('Clusters from Agglomerative Hierarchical Clustering (Concentric Circles)')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()
OUTPUT:-
******Principal Component Analysis ******
Goal: Reduce the dimensionality of the data while
preserving as much variance as possible.
Steps:
Model Assumption: The data lies in a
lower-dimensional subspace.
Objective: Find the principal components (directions
with the most variance).
Learning Process:-
Compute the covariance matrix of the data.
Find the eigenvalues and eigenvectors (principal
components) of the covariance matrix.
Project the data onto the top N principal components.
Prediction: Reduce the dimensionality of new data by
projecting it onto the same components.
Principal Component Analysis (PCA) is a
dimensionality reduction technique that helps reduce
the number of features in a dataset while retaining most
of the variance (information) in the data. It projects the
data onto a lower-dimensional space using a set of
orthogonal axes (principal components), where each
successive component captures as much of the
remaining variance as possible.
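The covariance and eigenvector steps listed above can be carried out directly in NumPy. This is a minimal sketch on hypothetical random data (three features with deliberately different spreads), intended to mirror the steps rather than to replace a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3-D data with a different spread along each axis
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.3])

# Step 1: center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 2: eigen-decomposition (eigh is meant for symmetric matrices, ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort components by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: project onto the top 2 principal components
W = eigvecs[:, :2]
X_reduced = X_centered @ W

explained_ratio = eigvals[:2] / eigvals.sum()
print(X_reduced.shape)
print(explained_ratio)
```

The eigenvalues are the variances along each principal component, so dividing them by their total gives the same kind of explained variance ratio that a PCA library reports.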
Example:---->1
We generate a synthetic 3D dataset and apply PCA to
reduce it to 2D for visualization. This allows us to
observe how PCA identifies the directions (principal
components) of maximum variance and reduces the
data’s dimensionality.
Steps:
1. Generate a synthetic 3D dataset.
2. Apply PCA to reduce the data from 3D to 2D.
3. Visualize the original data and the reduced data.
4. Understand the principal components and their
direction.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
# Generate synthetic 3D data (e.g., 300 samples with 3 features)
X, _ = make_blobs(n_samples=300, centers=1, n_features=3,
                  random_state=42)
# Visualize the original 3D data
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c='blue',
           label='Original Data')
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('X3')
ax.set_title('Original 3D Data')
plt.show()
# Apply PCA to reduce the data to 2D
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# Visualize the reduced 2D data
plt.figure(figsize=(8, 6))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c='red',
            label='Reduced Data (2D)')
plt.title('Reduced Data (2D) after PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
# Explained variance ratio
print("Explained Variance Ratios:",
      pca.explained_variance_ratio_)
print("Cumulative Explained Variance:",
      np.cumsum(pca.explained_variance_ratio_))
# Plotting the directions of the principal components
# Centering the data for visualization purposes
X_centered = X - np.mean(X, axis=0)
# Plot the first two features of the centered data
plt.figure(figsize=(8, 6))
plt.scatter(X_centered[:, 0], X_centered[:, 1],
            c='blue', label='Centered Data')
# Plot the principal components as vectors. Each component is a
# 3D direction, so only its X1 and X2 coordinates are drawn here.
# Each arrow is stretched by the standard deviation along its
# component, so a longer arrow means more variance.
for length, vector in zip(np.sqrt(pca.explained_variance_),
                          pca.components_):
    plt.quiver(0, 0, length * vector[0], length * vector[1],
               angles='xy', scale_units='xy', scale=1,
               color='green')
plt.xlim(-5, 5)
plt.ylim(-5, 5)
plt.xlabel('X1')
plt.ylabel('X2')
plt.title('Principal Components as Vectors')
plt.legend()
plt.grid()
plt.show()
OUTPUT:-
Explanation of the Code:
Data Generation:
We generate a synthetic 3D dataset using
make_blobs(), with 300 data points and 3 features. All
points belong to a single cluster (centers=1), which
keeps the example simple and makes the effect of the
dimensionality reduction easier to observe.
PCA Application:
We apply PCA to reduce the dataset from 3D to 2D
using PCA(n_components=2). The fit_transform()
method computes the principal components and applies
the transformation to the data, reducing its
dimensionality.
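To make the transformation less of a black box, the sketch
below checks that the output of fit_transform() matches
centering the data with pca.mean_ and multiplying by the
transpose of pca.components_. It reuses the same make_blobs
settings as the example. The shifted points in X_new are a
hypothetical stand-in for unseen data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=1, n_features=3, random_state=42)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Manual projection: subtract the learned mean, then project onto
# the learned components (rows of pca.components_)
X_manual = (X - pca.mean_) @ pca.components_.T
print("Manual projection matches fit_transform:",
      np.allclose(X_reduced, X_manual))

# New data is reduced with transform(), which reuses the mean and
# components learned during fit (hypothetical new points)
X_new = X[:5] + 0.5
X_new_reduced = pca.transform(X_new)
print("Reduced new data shape:", X_new_reduced.shape)

# inverse_transform() maps 2D points back to 3D; the result is only
# an approximation because one component was discarded
X_back = pca.inverse_transform(X_reduced)
print("Mean squared reconstruction error:", np.mean((X - X_back) ** 2))
```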
Visualization:
We first visualize the original 3D data using a 3D
scatter plot. After applying PCA, we plot the reduced 2D
data using a 2D scatter plot.
The explained_variance_ratio_ shows how much
variance each principal component explains. This gives
insight into the importance of each component.
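The following sketch shows one way to read these ratios: each
ratio is a component's variance divided by the total variance,
and the cumulative sum can be used to choose how many
components to keep. The 0.90 threshold is an arbitrary value
picked for illustration, not a recommendation from the text.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=1, n_features=3, random_state=42)

# Keep all components so that the ratios sum to 1
pca_full = PCA().fit(X)
ratios = pca_full.explained_variance_ratio_

# Each ratio is that component's variance divided by the total variance
manual = pca_full.explained_variance_ / pca_full.explained_variance_.sum()
print("Ratios match manual computation:", np.allclose(ratios, manual))

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.90  # illustrative value
cumulative = np.cumsum(ratios)
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print("Cumulative explained variance:", cumulative)
print("Components needed:", n_needed)

# scikit-learn can make the same choice when n_components is a
# fraction between 0 and 1 (with the full SVD solver)
pca_auto = PCA(n_components=threshold, svd_solver='full').fit(X)
print("Components chosen by PCA:", pca_auto.n_components_)
```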
Principal Components Visualization:
To further understand PCA, we plot the direction of the
principal components in the X1-X2 plane of the original
feature space (after centering the data). The green
arrows are the projections of the principal component
directions onto that plane, and their lengths are meant
to reflect the variance along each component.
Example:---->2
This example applies Principal Component Analysis (PCA)
to a real-world dataset to show how PCA can be used for
dimensionality reduction in a more practical scenario.
Steps:
1. Load the Iris dataset.
2. Apply PCA to reduce the data from 4 dimensions
to 2 dimensions.
3. Visualize the reduced data.
4. Interpret the explained variance to understand
the importance of each principal component.
5. Visualize the relationship between features and
principal components.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = load_iris()
# Features (sepal length, sepal width, petal length, petal width)
X = iris.data
y = iris.target  # Labels (species)
# Standardize the features before applying PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA to reduce the data to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Plot the reduced 2D data
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y,
            cmap='viridis', edgecolor='k', s=100)
plt.colorbar(label='Species')
plt.title('PCA of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
# Print the explained variance ratio for each component
print("Explained Variance Ratios:",
      pca.explained_variance_ratio_)
print("Cumulative Explained Variance:",
      np.cumsum(pca.explained_variance_ratio_))
# Plot the principal components as vectors on the 2D plot
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y,
            cmap='viridis', edgecolor='k', s=100)
plt.colorbar(label='Species')
plt.title('PCA of Iris Dataset with Principal Components')
# Add the principal component vectors. Each row of
# pca.components_ has 4 entries (one per standardized feature);
# only the first two entries are drawn. Each arrow is stretched
# by the standard deviation along its component.
for length, vector in zip(np.sqrt(pca.explained_variance_),
                          pca.components_):
    plt.quiver(0, 0, length * vector[0], length * vector[1],
               angles='xy', scale_units='xy', scale=1,
               color='red')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
OUTPUT:-
Explanation of the Code:
Loading the Iris Dataset: We load the Iris dataset
using load_iris() from sklearn.datasets. The dataset
contains 150 samples across 4 features (sepal length,
sepal width, petal length, and petal width) and is
already labeled with 3 species.
Standardizing the Features: PCA is sensitive to the
scale of the features, so we standardize the data using
StandardScaler to ensure each feature has a mean of 0
and a standard deviation of 1. This is a common
practice before applying PCA.
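A small sketch of why this matters: the comparison below fits
PCA on the Iris features with and without standardization and
prints the explained variance ratios for both. Without scaling,
features with a larger numeric spread dominate the first
component. The variable names are chosen for this illustration
only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# After standardization every feature has mean 0 and standard deviation 1
print("Std before scaling:", np.round(X.std(axis=0), 3))
print("Std after scaling: ", np.round(X_scaled.std(axis=0), 3))

# Fit PCA on raw and on standardized features and compare
ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_
print("Explained variance ratios (raw):   ", np.round(ratio_raw, 4))
print("Explained variance ratios (scaled):", np.round(ratio_scaled, 4))
```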
Applying PCA: We apply PCA using
PCA(n_components=2) to reduce the data from 4
dimensions to 2 dimensions. The fit_transform() method
computes the principal components and transforms the
data into the new coordinate system defined by these
components.
Visualization of the Reduced Data: After reducing the
data, we plot the first two principal components, with
colors representing different iris species. This allows us
to see how well PCA can separate the data along the
principal components.
Explained Variance: We print the explained variance
ratio for each of the two principal components. This tells
us how much of the total variance is captured by each
component. The cumulative explained variance helps
us understand how much of the original information is
retained after reducing the data.
Principal Components as Vectors: To better
understand the direction of the principal components,
we plot them as vectors on top of the 2D scatter plot.
Each row of pca.components_ is a direction in the
original (standardized) 4-dimensional feature space.
The arrows use only the first two coordinates of each
row, so they indicate how strongly sepal length and
sepal width contribute to each of the two components.
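The relationship between the features and the principal
components can also be inspected numerically. The sketch
below prints, for each Iris feature, its weight in the first
two components as stored in pca.components_. The table layout
is an illustrative choice, not part of the original example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2).fit(X_scaled)

# components_ has shape (n_components, n_features): row k holds the
# weight of every feature in principal component k+1
print("components_ shape:", pca.components_.shape)
for name, w1, w2 in zip(iris.feature_names,
                        pca.components_[0], pca.components_[1]):
    print(f"{name:20s} PC1 weight = {w1:+.3f}   PC2 weight = {w2:+.3f}")

# The component vectors have unit length and are orthogonal to each other
gram = pca.components_ @ pca.components_.T
print("Components are orthonormal:", np.allclose(gram, np.eye(2)))
```

A feature with a weight close to zero contributes little to
that component, while weights of large magnitude mark the
features that drive it. The sign of a whole component is
arbitrary, so only relative signs within a component are
meaningful.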