
Machine Learning Basics and Data Preprocessing

Uploaded by

mhnawab11
Copyright
© All Rights Reserved
Lectures 1 – 3

Contents: Basic Machine Learning, Parametric and Non-parametric Models, Curse
of Dimensionality, Overfitting and Model Selection
Machine learning:
 Machine learning is a branch of Artificial Intelligence that focuses on
developing models and algorithms that let computers learn from data
without being explicitly programmed for every task.
 In simple words, ML teaches the systems to think and understand like
humans by learning from the data.
Machine Learning is mainly divided into three core types: Supervised,
Unsupervised and Reinforcement Learning along with two additional types, Semi-
Supervised and Self-Supervised Learning.
 Supervised Learning: Trains models on labeled data to predict or classify
new, unseen data.
 Unsupervised Learning: Finds patterns or groups in unlabeled data, like
clustering or dimensionality reduction.
 Reinforcement Learning: Learns through trial and error to maximize
rewards, ideal for decision-making tasks.

Module 1: Machine Learning Pipeline


This section covers preprocessing, exploratory data analysis and model
evaluation to prepare data, uncover insights and build reliable models.

Data Preprocessing
ML workflow

Data Cleaning
How to Perform Data Cleaning
The process begins by identifying issues like missing values, duplicates and
outliers. Performing data cleaning involves a systematic process to identify
and remove errors in a dataset. The following steps are essential to perform
data cleaning:
 Remove Unwanted Observations: Eliminate duplicates, irrelevant entries or
redundant data that add noise.
 Fix Structural Errors: Standardize data formats and variable types for
consistency.
 Manage Outliers: Detect and handle extreme values that can skew results,
either by removal or transformation.
 Handle Missing Data: Address gaps using imputation, deletion or advanced
techniques to maintain accuracy and integrity.
Data Preprocessing

Step-by-Step Implementation
Let's walk through the main preprocessing steps.
Step 1: Import Libraries and Load Dataset
We prepare the environment with libraries such as pandas, NumPy, scikit-learn,
matplotlib and seaborn for data manipulation, numerical operations,
visualization and scaling, then load the dataset for preprocessing.
Step 2: Inspect Data Structure and Check Missing Values
We check the dataset's size and data types and identify any incomplete
(missing) data that needs handling.
 df.info(): Prints a concise summary including the count of non-null entries
and the data type of each column.
 df.isnull().sum(): Returns the number of missing values per column.
Step 3: Statistical Summary and Visualizing Outliers
Get numeric summaries like mean, median, min/max and detect unusual
points (outliers). Outliers can skew models if not handled.
 df.describe(): Computes count, mean, standard deviation, min/max and
quartiles for numerical columns.
 Boxplots: Visualize spread and detect outliers using matplotlib’s boxplot().
Step 4: Remove Outliers Using the Interquartile Range (IQR) Method
Remove extreme values beyond a reasonable range to improve model
robustness.
 IQR = Q3 (75th percentile) – Q1 (25th percentile).
 Calculate lower and upper bounds for each column separately.
 Filter data points to keep only those within bounds.
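The IQR filter described in Step 4 can be sketched as follows. This is only a minimal illustration on a small made-up DataFrame: the column names and values are hypothetical stand-ins for the real dataset, and the 1.5 × IQR multiplier is the conventional choice rather than something stated in these notes.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "Glucose": [85, 90, 95, 100, 105, 110, 400],   # 400 is an obvious outlier
    "BMI": [22.0, 24.5, 26.0, 27.5, 29.0, 30.5, 31.0],
})

def remove_outliers_iqr(frame, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] in every column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr   # bounds computed per column
    # Compare column by column, then require every column to be within bounds
    mask = ((frame >= lower) & (frame <= upper)).all(axis=1)
    return frame[mask]

print(df.isnull().sum())            # missing values per column (Step 2)
cleaned = remove_outliers_iqr(df)
print(cleaned)                      # the row with Glucose = 400 is dropped
```

Dropping rows is only one option; clipping values to the bounds keeps the sample size unchanged and may be preferable on small datasets.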
Step 5: Correlation Analysis
Understand relationships between features and the target variable (Outcome).
Correlation gives a first indication of which features are linearly related to
the target.
 df.corr(): Computes pairwise correlation coefficients between columns.
 A seaborn heatmap visualizes the correlation matrix clearly.
 Sorting correlations with corr['Outcome'].sort_values() highlights features
most correlated with the target.
Step 6: Visualize Target Variable Distribution
Check whether the target classes (Diabetes vs Not Diabetes) are balanced, since
class imbalance affects model training and evaluation.
 plt.pie(): Pie chart to display the proportion of each class in the target
variable 'Outcome'.
Step 7: Separate Features and Target Variable
Prepare independent variables (features) and dependent variable (target)
separately for modeling.
 df.drop(columns=[...]): Drops the target column from the features.
 Direct column selection df['Outcome'] selects the target column.
Step 8: Feature Scaling: Normalization and Standardization
Scale features to a common range or distribution, important for many ML
algorithms sensitive to feature magnitudes.
1. Normalization (Min-Max Scaling): Rescales features between 0 and 1. Good for
algorithms like k-NN and neural networks.
 Class: MinMaxScaler from sklearn.
 .fit_transform(): Learns min/max from data and applies scaling.
2. Standardization: Transforms features to have mean = 0 and standard deviation
= 1, useful for normally distributed features.
 Class: StandardScaler from sklearn.
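To make the two scaling formulas concrete, here is a small sketch that computes them by hand with NumPy. It mirrors what MinMaxScaler and StandardScaler are documented to do with their default settings. The feature matrix is made up, and the code assumes no column is constant (otherwise the denominators would be zero).

```python
import numpy as np

# Hypothetical feature matrix: rows are samples, columns are features
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

def min_max_scale(X):
    """Rescale each column to [0, 1] using its own min and max."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

def standardize(X):
    """Shift each column to mean 0 and scale it to standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X_norm = min_max_scale(X)
X_std = standardize(X)
print(X_norm)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

In practice the statistics (min/max or mean/std) should be learned on the training split only and then reused on the test split, so that no information leaks from test data into training.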

Feature Scaling Techniques

Type / Method | Description | Sensitivity to Outliers | Typical Use Cases
Absolute Maximum Scaling | Divides values by the max absolute value in each feature | High | Sparse data, simple scaling
Min-Max Scaling (Normalization) | Scales features to [0, 1] by min-max normalization | High | Neural networks, bounded input features
Normalization (Vector Norm) | Scales each sample vector to unit length (norm = 1) | Not applicable (per row) | Direction-based similarity, text classification
Standardization (Z-Score) | Centers features to mean 0 and scales to unit variance | Moderate | Most ML algorithms, assumes approx. normal data
Robust Scaling | Centers on median and scales using IQR | Low | Data with outliers, skewed distributions

Feature Extraction
Feature extraction is the process of transforming raw data into a simplified and
informative set of features or attributes. This reduces data complexity and
highlights the most relevant information making it easier for machine learning
models to analyze and learn from the data efficiently.
1. Statistical Methods
 Mean: The average value of a dataset.
 Median: The middle value when it is sorted in ascending order.
 Standard Deviation: A measure of the spread or dispersion of a sample.
 Correlation and Covariance: Measures of the linear relationship between
two or more factors.
 Regression Analysis: A way to model the link between a dependent variable
and one or more independent factors.
2. Dimensionality Reduction
 Principal Component Analysis (PCA): It builds new, uncorrelated variables
(principal components) as linear combinations of the original features and
keeps those that account for most of the data's variation, simplifying the
dataset by focusing on the most important components.
 Linear Discriminant Analysis (LDA): It finds the best combination of
features to separate different classes, maximizing class separability for
better classification.
 t-Distributed Stochastic Neighbor Embedding (t-SNE): A technique that
reduces high-dimensional data to two or three dimensions, ideal for
visualizing complex datasets.
3. Signal Processing Methods
 Used for analyzing time-series, audio and sensor data.

4. Image Data Extraction

Techniques for extracting features from images:

1. Histogram of Oriented Gradients (HOG): This technique finds the
distribution of intensity gradients or edge directions in an image. It's used
in object detection and recognition tasks.
2. Convolutional Neural Networks (CNN) Features: They learn hierarchical
features from images through layers of convolutions, ideal for
classification and detection tasks.

2. Exploratory Data Analysis


Exploratory Data Analysis (EDA) is an important step in data science and data
analytics as it visualizes data to understand its main features, find patterns and
discover how different parts of the data are connected.
Handling Missing Data
Missing data is common in many datasets and can affect the quality of our
analysis. During EDA it's important to identify and handle missing data properly to
avoid biased or misleading results. Here’s how to handle it:
1. Understand the patterns and possible causes of missing data. Is it missing
completely at random (MCAR), missing at random (MAR) or missing not at
random (MNAR)? Identifying this helps us find the best way to handle the
missing data.
2. Decide whether to remove missing data or impute (fill in) the missing
values. Removing data can lead to biased outcomes if the missing data isn't
MCAR. Filling values helps to preserve data but should be done carefully.
3. Use appropriate imputation methods like mean or median
imputation, regression imputation or machine learning techniques
like KNN or decision trees based on the data's characteristics.
4. Consider the impact of missing data. Even after imputing, missing data can
cause uncertainty and bias, so interpret the results with caution.
Proper handling of missing data improves the accuracy of our analysis and
prevents misleading conclusions.
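As a small illustration of mean and median imputation, the sketch below fills missing entries in a single column. The data and the function name `impute` are hypothetical, and missing values are represented as `None` for simplicity.

```python
from statistics import mean, median

# Hypothetical column with missing entries marked as None; 120 is an extreme value
ages = [25, 30, None, 35, 120, None, 28]

def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mean(observed)
    return [fill if v is None else v for v in values]

print(impute(ages, "median"))   # fills with 30
print(impute(ages, "mean"))     # fills with 47.6, pulled up by the extreme value
```

The comparison shows why the median is often the safer default when a column contains outliers or is skewed.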

Model Evaluation
Regularization in Machine Learning
Regularization is a technique used in machine learning to prevent overfitting,
where a model fits the training data too closely and performs poorly on unseen
data. By adding a penalty for complexity, regularization encourages simpler,
more generalizable models.
 Prevents overfitting: Adds constraints to the model to reduce the risk of
memorizing noise in the training data.
 Improves generalization: Encourages simpler models that perform better on
new, unseen data.
1. Lasso Regression
 A regression model which uses the L1 Regularization technique is
called LASSO (Least Absolute Shrinkage and Selection
Operator) regression. It adds the absolute value of magnitude of the
coefficient as a penalty term to the loss function(L). This penalty can shrink
some coefficients to zero which helps in selecting only the important
features and ignoring the less important ones.
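$$\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m} |w_i|$$

Where
 m - Number of Features
 n - Number of Examples
 $y_i$ - Actual Target Value
 $\hat{y}_i$ - Predicted Target Value
 $w_i$ - Coefficients of the features
 λ - Regularization parameter that controls the strength of regularization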

2. Ridge Regression
A regression model that uses the L2 regularization technique is called Ridge
regression. It adds the squared magnitude of the coefficient as a penalty term to
the loss function(L). It handles multicollinearity by shrinking the coefficients of
correlated features instead of eliminating them.
$$\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{i=1}^{m} w_i^2$$

Where,
 n = Number of examples or data points
 m = Number of features i.e. predictor variables
 $y_i$ = Actual target value for the ith example
 $\hat{y}_i$ = Predicted target value for the ith example
 $w_i$ = Coefficients of the features
 λ = Regularization parameter that controls the strength of regularization

3. Elastic Net Regression


Elastic Net Regression combines L1 and L2 regularization: it adds both the
absolute norm of the weights and the squared norm of the weights to the loss,
with an extra hyperparameter that controls the ratio between the L1 and L2
penalties.
$$\text{Cost} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \left( \alpha \sum_{i=1}^{m} |w_i| + (1 - \alpha) \sum_{i=1}^{m} w_i^2 \right)$$
Where
 n = Number of examples (data points)
 m = Number of features (predictor variables)
 $y_i$ = Actual target value for the ith example
 $\hat{y}_i$ = Predicted target value for the ith example
 $w_i$ = Coefficients of the features
 λ = Regularization parameter that controls the strength of regularization
 α = Mixing parameter where 0 ≤ α ≤ 1: α = 1 corresponds to Lasso (L1)
regularization, α = 0 corresponds to Ridge (L2) regularization and values
between 0 and 1 provide a balance of both L1 and L2 regularization
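The three penalised costs can be compared side by side with a short NumPy sketch. It assumes the usual form "mean squared error plus penalty" and follows the α convention stated above (α = 1 gives Lasso, α = 0 gives Ridge). The numbers are made up, and `w` is taken to hold only the feature coefficients, not the intercept.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between targets and predictions."""
    return np.mean((y - y_hat) ** 2)

def lasso_cost(y, y_hat, w, lam):
    return mse(y, y_hat) + lam * np.sum(np.abs(w))      # L1 penalty

def ridge_cost(y, y_hat, w, lam):
    return mse(y, y_hat) + lam * np.sum(w ** 2)         # L2 penalty

def elastic_net_cost(y, y_hat, w, lam, alpha):
    l1 = np.sum(np.abs(w))
    l2 = np.sum(w ** 2)
    return mse(y, y_hat) + lam * (alpha * l1 + (1 - alpha) * l2)

# Hypothetical targets, predictions and coefficients
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 6.0])
w = np.array([2.0, -0.5])
lam = 0.1

print(lasso_cost(y, y_hat, w, lam))
print(ridge_cost(y, y_hat, w, lam))
print(elastic_net_cost(y, y_hat, w, lam, alpha=0.5))
```

Setting `alpha` to 1 or 0 in `elastic_net_cost` reproduces the Lasso and Ridge costs respectively, which is a quick sanity check on the mixing convention.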
Confusion Matrix
Confusion matrix is a simple table used to measure how well a classification
model is performing. It compares the predictions made by the model with
the actual results and shows where the model was right or wrong. This
helps you understand where the model is making mistakes so you can
improve it. It breaks down the predictions into four categories:
 True Positive (TP): The model correctly predicted a positive outcome i.e the
actual outcome was positive.
 True Negative (TN): The model correctly predicted a negative outcome i.e
the actual outcome was negative.
 False Positive (FP): The model incorrectly predicted a positive outcome i.e
the actual outcome was negative. It is also known as a Type I error.
 False Negative (FN): The model incorrectly predicted a negative outcome i.e
the actual outcome was positive. It is also known as a Type II error.

It also helps calculate key measures like accuracy, precision and recall, which
give a better idea of performance, especially when the data is imbalanced.
 Accuracy, Precision, Recall and F1-Score
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × Precision × Recall / (Precision + Recall)
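The formulas above translate directly into code. The sketch below uses made-up counts and assumes none of the denominators is zero; the function name is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for a binary classifier evaluated on 100 examples
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```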
AUC-ROC Curve
The AUC-ROC curve is a graph used to check how well a binary classification
model works. It helps us understand how well the model separates the positive
cases, like people with a disease, from the negative cases, like people without
the disease, at different threshold levels. It shows how good the model is at
telling the difference between the two classes by plotting:
 True Positive Rate (TPR): how often the model correctly predicts the
positive cases also known as Sensitivity or Recall.
 False Positive Rate (FPR): how often the model incorrectly predicts a
negative case as positive.
 Specificity: measures the proportion of actual negatives that the model
correctly identifies. It is calculated as 1 - FPR.
The higher the curve the better the model is at making correct predictions.
 ROC Curve : It plots TPR vs. FPR at different thresholds. It represents the
trade-off between the sensitivity and specificity of a classifier.
 AUC(Area Under the Curve): measures the area under the ROC curve. A
higher AUC value indicates better model performance as it suggests a
greater ability to distinguish between classes. An AUC value of 1.0 indicates
perfect performance while 0.5 suggests it is random guessing.
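A minimal sketch of how the ROC points and the AUC could be computed by hand is shown below. It sweeps each distinct score as a threshold and integrates with the trapezoidal rule. The labels and scores are made up, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(roc_points(y_true, scores)))   # 0.75
```

Here one negative example (score 0.4) outranks one positive example (score 0.35), which is why the AUC falls below 1.0.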
Cross-validation
Cross-validation is a technique used to check how well a machine learning
model performs on unseen data and to detect overfitting. It works by:
 Splitting the dataset into several parts.
 Training the model on some parts and testing it on the remaining part.
 Repeating this resampling process multiple times by choosing different
parts of the dataset.
 Averaging the results from each validation step to get the final
performance.
K-Fold Cross Validation splits the dataset into k equal-sized folds. The model is
trained on k-1 folds and tested on the remaining fold. This process is
repeated k times each time using a different fold for testing.
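The splitting and averaging steps can be sketched without any library. The helper names below are illustrative, and `score_fn` is a hypothetical callback that would train a model on the training indices and return a score on the test indices. The sketch does not shuffle; real data is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_validate(n_samples, k, score_fn):
    """Average a user-supplied score over the k train/test splits."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so each data point is used for evaluation once and for training k-1 times.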
Hyperparameter Tuning
Hyperparameter tuning is the process of selecting the optimal values for a
machine learning model's hyperparameters. These are typically set before the
actual training process begins and control aspects of the learning process itself.
They influence the model's performance, its complexity and how fast it learns.
For example, the learning rate and number of neurons in a neural network or the
kernel and its parameters in a support vector machine can significantly
impact how well the model trains and generalizes. The goal of hyperparameter
tuning is to find the values that lead to the best performance on a given task.
These settings can affect both the speed and quality of the model's performance.
 A high learning rate can speed up training but may overshoot the optimal
solution or fail to converge.
 A low learning rate might lead to slower convergence and require more
time and computational resources.

Module 2: Supervised Learning


Supervised learning algorithms are generally categorized into two main types:
 Classification - where the goal is to predict discrete labels or categories
 Regression - where the aim is to predict continuous numerical values.
Supervised Learning
There are many algorithms used in supervised learning each suited to different
types of problems. Some of the most commonly used supervised learning
algorithms are:
1. Linear Regression
This is one of the simplest ways to predict numbers using a straight line. It helps
find the relationship between input and output.

Best Fit Line in Linear Regression


For simple linear regression (with one independent variable), the best-fit line is
represented by the equation
y=mx+b

Where:
 y is the predicted value (dependent variable)
 x is the input (independent variable)
 m is the slope of the line (how much y changes when x changes)
 b is the intercept (the value of y when x = 0)
The best-fit line will be the one that optimizes the values of m (slope) and b
(intercept) so that the predicted y values are as close as possible to the actual
data points.
Here Y is called the dependent or target variable and X is called the
independent variable, also known as the predictor of Y. Many types of functions
or models can be used for regression; a linear function is the simplest. X may
be a single feature or multiple features representing the problem.
Gradient Descent in Linear Regression
Gradient descent is an optimization algorithm used in linear regression to find
the best-fit line for the data. It works by gradually adjusting the line's slope
and intercept to reduce the difference between actual and predicted values.
Figure: two graphs. The left one plots house prices against size to show the
errors measured by the cost function, while the right one shows how gradient
descent moves downhill on the cost curve to minimize error by updating
parameters step by step.
Why Use Gradient Descent for Linear Regression?
Linear regression finds the best-fit line for a dataset by minimizing the error
between the actual and predicted values. This error is measured using the cost
function usually Mean Squared Error (MSE). The goal is to find the model
parameters i.e. the slope m and the intercept b that minimize this cost function.
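A minimal sketch of gradient descent for the line y = mx + b is given below, minimising the MSE. The learning rate, number of epochs and toy data are assumptions chosen so that the example converges; they are not values from these notes.

```python
import numpy as np

def fit_line(x, y, lr=0.05, epochs=2000):
    """Fit y = m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        error = (m * x + b) - y                  # prediction minus target
        grad_m = (2 / n) * np.sum(error * x)     # dMSE/dm
        grad_b = (2 / n) * np.sum(error)         # dMSE/db
        m -= lr * grad_m                         # step against the gradient
        b -= lr * grad_b
    return m, b

# Noiseless toy data generated with slope 2 and intercept 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
m, b = fit_line(x, y)
print(m, b)
```

With a much larger learning rate the updates on this data would overshoot and diverge, which illustrates why the learning rate is treated as a hyperparameter to tune.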

2. Logistic Regression
Used when the output is a "yes or no" type answer. It helps in predicting
categories like pass/fail or spam/not spam.
Cost Function in Logistic Regression

Logistic Regression is a supervised learning algorithm used for classification
problems. To measure how well the model is performing, we use a cost function,
which tells us how far the predicted values are from the actual ones. In Logistic
Regression, the cost function is based on log loss (cross-entropy loss) instead of
mean squared error.

 It measures the error between the predicted probability and the actual
class label (0 or 1).
 Instead of a straight line (like Linear Regression), Logistic Regression works
with probabilities between 0 and 1 using the sigmoid function.

 The cost function penalizes wrong predictions more heavily when the
model is confident but wrong.

The cost function is defined as


$$\text{Cost}(h_\theta(x), y) = -y \cdot \log(h_\theta(x)) - (1 - y) \cdot \log(1 - h_\theta(x))$$
Where:
 $h_\theta(x)$ : Predicted probability using sigmoid
 y : Actual value (0 or 1)

For all training examples, the cost function (log loss) becomes:
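$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$
where m here denotes the number of training examples.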
Why Not Use Mean Squared Error (MSE)
 MSE works well for regression, but in Logistic Regression it creates a non-
convex curve (multiple local minima).
 Log loss ensures a convex cost function, making optimization with
Gradient Descent easier and guaranteeing a global minimum.
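The behaviour of log loss, in particular the heavy penalty for confident but wrong predictions, can be checked with a few lines of Python. The clipping constant `eps` is an implementation detail assumed here to avoid taking log(0); the example probabilities are made up.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, probs, eps=1e-12):
    """Average cross-entropy over all examples."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)            # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

confident_right = log_loss([1], [0.99])   # small loss
confident_wrong = log_loss([1], [0.01])   # large loss
print(confident_right, confident_wrong)
```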

3. Decision Trees
A model that makes decisions by asking a series of simple questions, like a
flowchart. Easy to understand and use.

Example: Predicting Whether a Customer Will Buy a Product Using Two Decision
Trees
Tree 1: Customer Demographics
First tree asks two questions:
1. "Income > $50,000?"
 If Yes, Proceed to the next question.
 If No, "No Purchase"
2. "Age > 30?"
 Yes: "Purchase"
 No: "No Purchase"
Tree 2: Previous Purchases
"Previous Purchases > 0?"
 Yes: "Purchase"
 No: "No Purchase"

Entropy is a measure of the uncertainty of a random variable; it characterizes
the impurity of an arbitrary collection of examples. The higher the entropy, the
greater the information content.
Example:
For the set X = {a,a,a,b,b,b,b,b}
Total instances: 8
Instances of b: 5
Instances of a: 3
$$H(X) = -\left[\frac{3}{8}\log_2\frac{3}{8} + \frac{5}{8}\log_2\frac{5}{8}\right]$$
$$= -(-0.531 - 0.424)$$
$$\approx 0.954$$
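The entropy of the example set can be verified numerically. This short sketch uses base-2 logarithms, matching the worked example; the function name is illustrative.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of the label distribution in `items`."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

X = list("aaabbbbb")     # 3 instances of a, 5 instances of b
H = entropy(X)
print(round(H, 3))       # 0.954
```

A perfectly pure set (all items identical) gives an entropy of 0, and a 50/50 split of two classes gives the maximum of 1 bit.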

4. Support Vector Machines (SVM)

A somewhat more advanced model that tries to draw the best line (or boundary)
to separate different categories of data.

How does Support Vector Machine Algorithm Work?


The key idea behind the SVM algorithm is to find the hyperplane that best
separates two classes by maximizing the margin between them. This margin is
the distance from the hyperplane to the nearest data points (support vectors)
on each side.
Multiple hyperplanes separate the data from two classes

The best hyperplane, also known as the maximum-margin or "hard margin"
hyperplane, is the one that maximizes the distance between the hyperplane and
the nearest data points from both classes. This ensures a clear separation
between the classes. So from the above figure, we choose L2 as the hard-margin
hyperplane. Let's consider a scenario like the one shown below:

Selecting hyperplane for data with outlier

Here, we have one blue ball in the boundary of the red ball.

How does SVM classify the data?


The blue ball in the boundary of red ones is an outlier of blue balls. With a
soft margin, the SVM algorithm can ignore such an outlier and still find the
hyperplane that maximizes the margin for the remaining points, which makes it
fairly robust to isolated outliers.
Hyperplane which is the most optimized one
A soft margin allows for some misclassifications or violations of the margin to
improve generalization. The SVM optimizes the following equation to balance
margin maximization and penalty minimization:
$$\text{Objective Function} = \frac{1}{\text{margin}} + \lambda \sum \text{penalty}$$

The penalty used for violations is often hinge loss which has the following
behavior:
 If a data point is correctly classified and lies on or outside the margin
there is no penalty (loss = 0).
 If a point is incorrectly classified or violates the margin the hinge loss
increases proportionally to the distance of the violation.
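Hinge loss is commonly written as max(0, 1 − y·f(x)), where f(x) is the raw decision value. The sketch below assumes labels of +1 and −1 and uses made-up scores to show the three typical cases.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {+1, -1} and a raw decision value `score`."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(+1, 2.0))    # correct and beyond the margin -> 0.0
print(hinge_loss(+1, 0.5))    # correct side but inside the margin -> 0.5
print(hinge_loss(+1, -1.0))   # misclassified -> 2.0
```

The loss grows linearly with how far the point sits on the wrong side of its margin boundary.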

Mathematical Computation of SVM


Consider a binary classification problem with two classes, labeled as +1 and -1.
We have a training dataset consisting of input feature vectors X and their
corresponding class labels Y. The equation for the linear hyperplane can be
written as:
$$w^T x + b = 0$$
Where:
 w is the normal vector to the hyperplane (the direction perpendicular to
it).
 b is the offset or bias term; the hyperplane lies at a distance of
$|b| / \lVert w \rVert$ from the origin along the normal vector w.

Distance from a Data Point to the Hyperplane


The distance between a data point $x_i$ and the decision boundary can be
calculated as:
$$d_i = \frac{w^T x_i + b}{\lVert w \rVert}$$
where $\lVert w \rVert$ represents the Euclidean norm of the weight vector w.
Linear SVM Classifier
The classifier predicts a label from the sign of the decision function:

$$\hat{y} = \begin{cases} +1 & \text{if } w^T x + b \geq 0 \\ -1 & \text{if } w^T x + b < 0 \end{cases}$$

Where $\hat{y}$ is the predicted label of a data point.


Optimization Problem for SVM
For a linearly separable dataset the goal is to find the hyperplane that maximizes
the margin between the two classes while ensuring that all data points are
correctly classified. This leads to the following optimization problem:
$$\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2$$
Subject to the constraint:
$$y_i (w^T x_i + b) \geq 1 \quad \text{for } i = 1, 2, 3, \ldots, m$$
Where:
 $y_i$ is the class label (+1 or -1) for each training instance.
 $x_i$ is the feature vector for the i-th training instance.
 m is the total number of training instances.
The condition $y_i (w^T x_i + b) \geq 1$ ensures that each data point is
correctly classified and lies on or outside the margin.
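The quantities in this section (signed distance, predicted label and the margin constraint) can be checked numerically. In the sketch below the weight vector, bias and data points are hypothetical values chosen to make the arithmetic easy; no training is performed.

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical normal vector, with ||w|| = 5
b = -5.0                   # hypothetical bias term

def signed_distance(x, w, b):
    """Signed distance from point x to the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

def predict(x, w, b):
    """Label +1 or -1 from the sign of the decision function."""
    return 1 if w @ x + b >= 0 else -1

def satisfies_margin(X, y, w, b):
    """True if every training point meets y_i (w^T x_i + b) >= 1."""
    return bool(np.all(y * (X @ w + b) >= 1))

X = np.array([[3.0, 4.0], [0.0, 0.0]])
y = np.array([1, -1])
print(signed_distance(X[0], w, b))   # 4.0
print(predict(X[1], w, b))           # -1
print(satisfies_margin(X, y, w, b))  # True
```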
5. k-Nearest Neighbors (k-NN)
This model looks at the closest data points (neighbors) to make predictions. Super
simple and based on similarity.
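A minimal k-NN classifier fits in a few lines. This sketch assumes Euclidean distance and a simple majority vote, which are common defaults rather than choices stated in these notes; the training points and labels are made up.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k training points nearest to `query`."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical 2-D training data forming two well-separated groups
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, (2, 2)))   # A
print(knn_predict(train_X, train_y, (8, 7)))   # B
```

Because the prediction depends on raw distances, features should normally be scaled to comparable ranges before applying k-NN.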
6. Naïve Bayes
A quick and smart way to classify things based on probability. It works well for text
and spam detection.
 Introduction to Naive Bayes
 Gaussian Naive Bayes
 Multinomial Naive Bayes
 Bernoulli Naive Bayes
 Complement Naive Bayes
7. Random Forest (Bagging Algorithm)
A powerful model that builds lots of decision trees and combines them for better
accuracy and stability.
 Introduction to Random forest
 Random Forest Classifier
 Random Forest Regression
 Hyperparameter Tuning in Random Forest
Introduction to Ensemble Learning
Ensemble learning combines multiple simple models to create a stronger, smarter
model. There are mainly two types of ensemble learning:
 Bagging that combines multiple models trained independently.
 Boosting that builds models sequentially each correcting the errors of the
previous one.
Module 3: Unsupervised learning
Unsupervised learning algorithms are divided into three main categories based
on their purpose:
 Clustering
 Association Rule Mining
 Dimensionality Reduction.
Unsupervised learning
1. Clustering
Clustering algorithms group data points into clusters based on their similarities or
differences. Types of clustering algorithms are:
Centroid-based Methods:
 K-Means clustering
 Elbow Method for optimal value of k in KMeans
 K-Means++ clustering
 K-Mode clustering
 Fuzzy C-Means (FCM) Clustering
Distribution-based Methods:
 Gaussian mixture models
 Expectation-Maximization Algorithm
 Dirichlet process mixture models (DPMMs)
Connectivity based methods:
 Hierarchical clustering
 Agglomerative Clustering
 Divisive clustering
 Affinity propagation
Density Based methods:
 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
 OPTICS (Ordering Points To Identify the Clustering Structure)
2. Dimensionality Reduction
Dimensionality reduction is used to simplify datasets by reducing the number of
features while retaining the most important information.
 Principal Component Analysis (PCA)
 t-distributed Stochastic Neighbor Embedding (t-SNE)
 Non-negative Matrix Factorization (NMF)
 Independent Component Analysis (ICA)
 Isomap
 Locally Linear Embedding (LLE)
3. Association Rule Mining
Find patterns between items in large datasets typically in market basket analysis.
 Apriori algorithm
 Implementing apriori algorithm
 FP-Growth (Frequent Pattern-Growth)
 ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal)
Module 4: Reinforcement Learning
In reinforcement learning, an agent interacts with an environment and learns
from it based on the rewards it receives.
Reinforcement Learning
1. Model-Based Methods
These methods use a model of the environment to predict outcomes and help the
agent plan actions by simulating potential results.
 Markov decision processes (MDPs)
 Bellman equation
 Value iteration algorithm
 Monte Carlo Tree Search
2. Model-Free Methods
The agent learns directly from experience by interacting with the environment
and adjusting its actions based on feedback.
 Q-Learning
 SARSA
 Monte Carlo Methods
 Reinforce Algorithm
 Actor-Critic Algorithm
 Asynchronous Advantage Actor-Critic (A3C)
Module 5: Semi Supervised Learning
It uses a mix of labeled and unlabeled data, making it helpful when labeling
data is costly or labeled data is very limited.

Semi Supervised Learning


 Semi Supervised Classification
 Self-Training in Semi-Supervised Learning
 Few-shot learning in Machine Learning
Module 6: Forecasting Models
Forecasting models analyze past data to predict future trends, commonly used for
time series problems like sales, demand or stock prices.
 ARIMA (Auto-Regressive Integrated Moving Average)
 SARIMA (Seasonal ARIMA)
 Exponential Smoothing (Holt-Winters)
