ML Unit 1

The document outlines a comprehensive framework for Machine Learning model development, consisting of five stages: problem identification, data collection, data preprocessing, model building, and deployment. It also categorizes Machine Learning into three types: supervised, unsupervised, and reinforcement learning, each with distinct characteristics and applications. Additionally, it discusses techniques like simple and multiple linear regression, residual analysis, and distance measures for identifying influential observations in regression analysis.

Uploaded by

Irfan Nanasana

1. Framework for Machine Learning Model Development
Machine Learning (ML) model development is a systematic and iterative process that converts
raw data into actionable intelligence. A well-defined framework ensures that ML models are
accurate, reliable, scalable, and aligned with business or real-world objectives. The framework
for developing Machine Learning models consists of five integrated stages.

1. Problem and Opportunity Identification

The first and most critical step in the Machine Learning framework is identifying the problem or
opportunity where ML can create value.

In this stage:

● The organization clearly defines the objective of the ML system.

● The type of problem is identified (classification, regression, clustering, anomaly
detection, etc.).
● Key performance indicators (KPIs) and success metrics are finalized.

For example, an organization may want to:

● Predict customer churn,

● Detect fraudulent transactions,
● Forecast sales revenue, or
● Classify customer feedback.

Clearly defining the problem helps in selecting appropriate data, algorithms, and evaluation
metrics. Poor problem formulation can lead to ineffective ML solutions, even with high-quality
data.

2. Collection of Relevant Data (Feature Extraction)

Once the problem is defined, the next step is collecting relevant data required to build the ML
model. This process is known as feature extraction.

Key aspects of this stage include:

● Identifying data sources such as databases, sensors, logs, ERP systems, or external
sources (weather, social media, etc.).
● Ensuring that the collected data contains variables that influence the target outcome.
● Integrating data from multiple sources into a unified dataset.

For example, in predicting vehicle warranty costs, internal data alone may not be sufficient.
External data such as weather conditions, road quality, and usage patterns may also be
required. Feature extraction is often an iterative process, as new useful features may be
discovered during later stages.

3. Data Pre-processing and Feature Engineering

Raw data collected from various sources is often incomplete, noisy, and inconsistent. Therefore,
data preprocessing is a crucial step in the ML framework.

This stage includes:

● Handling missing values (removal, imputation, or substitution).

● Removing noise and outliers.
● Normalizing or standardizing data.
● Encoding categorical variables.
● Transforming data into suitable formats.

Along with preprocessing, feature engineering is performed to improve model performance.

Feature engineering involves:

● Creating new features from existing ones (e.g., ratios, products, bins).
● Selecting relevant features and removing irrelevant ones.
● Reducing dimensionality when required.

Effective feature engineering significantly enhances the learning capability and accuracy of ML
models.
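
The preprocessing and feature engineering steps above can be sketched with pandas. This is a minimal illustration, not a prescribed pipeline: the toy columns (age, income, city) and the ratio feature are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with one missing value and one categorical column.
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "income": [30000.0, 52000.0, 47000.0, 88000.0],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Handle missing values: impute with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Feature engineering: create a ratio feature from existing columns.
df["income_per_year_of_age"] = df["income"] / df["age"]

# Standardize numeric columns to zero mean and unit variance.
numeric_cols = ["age", "income", "income_per_year_of_age"]
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std(ddof=0)

# Encode the categorical column as one-hot indicator columns.
df = pd.get_dummies(df, columns=["city"])

print(df.head())
```

In a real project the imputation value and the scaling statistics would normally be computed on the training split only and then reused on validation and test data.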

4. Model Building and Feature Selection

In this stage, suitable Machine Learning algorithms are selected and trained using the prepared
dataset.

Key activities include:

● Choosing appropriate models based on problem type (e.g., Linear Regression, Decision
Trees, SVMs, Neural Networks).
● Splitting data into training and validation datasets.
● Training models and tuning hyperparameters.
● Evaluating models using performance metrics such as accuracy, precision, recall,
RMSE, etc.
● Performing feature selection to identify features that have a significant impact on the
target variable.

The most accurate model may not always be chosen. Practical considerations such as
computation time, scalability, cost, and ease of deployment are also taken into account while
selecting the final model.
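
As a rough sketch of the split, train, and evaluate activities listed above, the following uses scikit-learn on synthetic data. The data-generating coefficients (3.0 and -2.0) and the 80:20 split are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: two features, known coefficients, Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

# Split into training and validation sets (80:20).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model and evaluate it on data it has not seen.
model = LinearRegression().fit(X_train, y_train)
rmse = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))

print("coefficients:", model.coef_)
print("validation RMSE:", rmse)
```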

5. Model Deployment and Monitoring

The final stage of the ML framework is model deployment, where the trained model is
integrated into a real-world system.

Model deployment can take various forms, such as:

● Business rules,
● Recommendation systems,
● Chatbots,
● Real-time decision systems,
● Automated monitoring systems.

After deployment, continuous monitoring and maintenance are essential to ensure sustained
performance. The model must be evaluated regularly for:

● Performance degradation,
● Data drift,
● Changes in real-world conditions.

If required, the model is retrained using updated data to maintain accuracy and reliability.
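
One possible way to check for data drift is to compare the distribution of a feature at training time with its distribution in live data. The sketch below uses the two-sample Kolmogorov-Smirnov test from SciPy; the notes do not name a specific test, so this choice, the helper name has_drifted, and the 0.01 threshold are all assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)  # feature values seen at training time
shifted = rng.normal(loc=1.0, scale=1.0, size=1000)    # simulated live data after a shift

def has_drifted(reference_values, current_values, alpha=0.01):
    """Flag drift when the two-sample KS test rejects 'same distribution'."""
    result = ks_2samp(reference_values, current_values)
    return bool(result.pvalue < alpha)

print(has_drifted(reference, reference))  # identical samples
print(has_drifted(reference, shifted))    # mean shifted by one standard deviation
```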

Conclusion
The framework for Machine Learning model development provides a structured approach to
building effective ML solutions. It ensures that the model is aligned with the problem objectives,
trained on high-quality data, optimized through feature engineering and model selection, and
successfully deployed in real-world applications. The success of ML projects largely depends on
how innovatively data is used, rather than the mechanical application of algorithms.
2. Types of Machine Learning
Machine Learning (ML) systems can be broadly classified based on the type of supervision
provided during the learning process. The three major types of Machine Learning are:

1. Supervised Learning

2. Unsupervised Learning
3. Reinforcement Learning

Each type differs in the way the learning algorithm is trained and the kind of problems it can
solve.

1. Supervised Learning
Definition

Supervised learning is a type of Machine Learning in which the algorithm is trained using a
labeled dataset, where both the input variables (features) and the output variable (label) are
known.

In supervised learning, the model learns a mapping function from input to output using example
input–output pairs and then uses this learned mapping to predict outputs for unseen data.

Key Characteristics

● Training data contains labels

● A “teacher” guides the learning process
● Used for prediction and classification
● Performance can be measured easily using accuracy or error metrics

Types of Supervised Learning Tasks

(a) Classification

● Output variable is categorical

● Example outputs: spam/ham, fraud/not fraud, pass/fail

Examples:
● Email spam detection
● Disease diagnosis (positive/negative)
● Sentiment analysis (positive/neutral/negative)

(b) Regression

● Output variable is continuous (numeric)

● Example outputs: price, salary, temperature

Examples:

● House price prediction

● Sales forecasting
● Salary prediction

Common Supervised Learning Algorithms

● Linear Regression
● Logistic Regression
● k-Nearest Neighbors (k-NN)
● Support Vector Machines (SVM)
● Decision Trees
● Random Forests
● Neural Networks
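
To illustrate supervised classification with one of the algorithms listed above, here is a minimal sketch using logistic regression on synthetic labeled data. The two Gaussian clusters and their centers are assumptions invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Labeled data: two well-separated classes in a 2-D feature space.
rng = np.random.default_rng(0)
class0 = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(100, 2))
class1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 100 + [1] * 100)

# Hold out a test set so accuracy is measured on unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression().fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print("test accuracy:", accuracy)
```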

Advantages

● High accuracy when sufficient labeled data is available

● Easy to evaluate model performance
● Well understood and widely used

Limitations

● Requires large amounts of labeled data

● Labeling data is time-consuming and expensive
● Not suitable when labels are unavailable
2. Unsupervised Learning
Definition

Unsupervised learning is a type of Machine Learning in which the algorithm is trained using
unlabeled data. The system tries to identify hidden patterns, structures, or relationships in the
data without any predefined output labels.

There is no teacher in unsupervised learning.

Key Characteristics

● No labeled output data

● Learns patterns automatically
● Used for data exploration and pattern discovery
● Results are often subjective and require interpretation

Major Types of Unsupervised Learning Tasks

(a) Clustering

● Groups similar data points together

● No predefined class labels

Examples:

● Customer segmentation
● Grouping similar documents
● Market basket analysis (more commonly treated as association rule learning; see (d) below)

Algorithms:

● K-Means
● DBSCAN
● Hierarchical Clustering
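
A minimal K-Means sketch, assuming two synthetic groups of points and a known number of clusters (k = 2). In practice k is not known in advance and has to be chosen.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two groups of points around (0, 0) and (5, 5).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
])

# Fit K-Means; no labels are supplied, only the number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_

print("cluster centers:\n", kmeans.cluster_centers_)
```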

(b) Dimensionality Reduction

● Reduces the number of features while retaining important information

● Improves visualization and computational efficiency

Examples:

● Principal Component Analysis (PCA)

● t-SNE
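
The following sketch shows PCA reducing three correlated features to one component. The data are synthetic: the three features are assumed to be noisy multiples of a single hidden factor, so one component should retain most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))

# Three observed features that are noisy multiples of one latent factor.
X = latent @ np.array([[1.0, 2.0, -1.0]]) + rng.normal(scale=0.1, size=(200, 3))

# Project the 3-D data onto its first principal component.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
explained = float(pca.explained_variance_ratio_[0])

print("reduced shape:", X_reduced.shape)
print("variance explained by first component:", explained)
```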

(c) Anomaly Detection

● Identifies rare or unusual data points

● Used in fraud detection and fault detection

Examples:

● Credit card fraud detection

● Network intrusion detection

(d) Association Rule Learning

● Discovers relationships between variables

● Commonly used in retail analysis

Example:

● Customers who buy bread and butter also buy milk

Algorithms:

● Apriori
● Eclat
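
Association rules are usually judged by support and confidence. The notes do not define these measures, so the sketch below uses the standard definitions on five made-up transactions: support is the fraction of transactions containing an itemset, and confidence of A → B is support(A ∪ B) / support(A).

```python
# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"eggs", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated probability of `consequent` given `antecedent`."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule: {bread, butter} -> {milk}
rule_support = support({"bread", "butter", "milk"}, transactions)         # 2 of 5 baskets
rule_confidence = confidence({"bread", "butter"}, {"milk"}, transactions)  # 2 of the 3 bread-and-butter baskets

print(rule_support, rule_confidence)
```

Algorithms such as Apriori avoid scoring every possible itemset by pruning those whose subsets already fall below a minimum support.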

Advantages

● No need for labeled data

● Useful for discovering unknown patterns
● Helps in data understanding and preprocessing

Limitations
● Hard to evaluate accuracy
● Results may be difficult to interpret
● Less precise compared to supervised learning

3. Reinforcement Learning
Definition

Reinforcement Learning (RL) is a type of Machine Learning in which an agent learns by
interacting with an environment and improves its behavior based on rewards or penalties
received for its actions.

The goal of reinforcement learning is to learn an optimal policy that maximizes cumulative
reward over time.

Key Components

● Agent: Learner or decision maker

● Environment: The world in which the agent operates
● Action: Choices available to the agent
● Reward: Feedback from the environment
● Policy: Strategy used by the agent

Learning Process

1. The agent observes the environment

2. Selects an action
3. Receives reward or penalty
4. Updates its policy
5. Repeats the process until optimal behavior is learned
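
The loop above can be sketched with tabular Q-learning, one common RL algorithm (the notes do not prescribe a specific one). The environment is an assumed toy problem: a chain of five states in which the agent starts at state 0 and is rewarded only on reaching state 4. The learning rate, discount factor, and exploration rate are illustrative values.

```python
import numpy as np

N_STATES, GOAL = 5, 4          # states 0..4; reaching state 4 ends the episode
LEFT, RIGHT = 0, 1

def step(state, action):
    """Environment: move along the chain; reward 1.0 only on reaching GOAL."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

rng = np.random.default_rng(0)
q = np.zeros((N_STATES, 2))    # action-value table, one row per state
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for _ in range(500):
    state, done = 0, False
    while not done:
        # Observe the state and select an action (epsilon-greedy, random tie-break).
        if rng.random() < epsilon:
            action = int(rng.integers(2))
        else:
            best = np.flatnonzero(q[state] == q[state].max())
            action = int(rng.choice(best))
        # Receive the reward and the next state from the environment.
        next_state, reward, done = step(state, action)
        # Update the value estimate, which implicitly updates the greedy policy.
        target = reward + (0.0 if done else gamma * q[next_state].max())
        q[state, action] += alpha * (target - q[state, action])
        state = next_state

# Greedy policy for the non-terminal states.
policy = [int(np.argmax(q[s])) for s in range(GOAL)]
print(policy)
```

After training, the greedy policy chooses RIGHT in every non-terminal state, which is the shortest path to the reward.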

Examples of Reinforcement Learning

● Game playing (chess, Go, video games)

● Robot navigation and walking
● Self-driving cars
● Resource allocation systems

A famous example is AlphaGo, which learned to play the game of Go by playing millions of
games against itself.

Advantages

● Learns from direct interaction with environment

● Suitable for sequential decision-making problems
● Can handle dynamic and complex environments

Limitations

● Requires large number of interactions

● Computationally expensive
● Difficult to design reward functions

Comparison of Learning Types

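| Aspect   | Supervised      | Unsupervised        | Reinforcement   |
|----------|-----------------|---------------------|-----------------|
| Data     | Labeled         | Unlabeled           | No labels       |
| Feedback | Direct (labels) | No feedback         | Reward/Penalty  |
| Goal     | Predict output  | Find patterns       | Maximize reward |
| Example  | Spam detection  | Customer clustering | Game playing    |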

Conclusion
Supervised, unsupervised, and reinforcement learning represent three fundamental approaches
to Machine Learning, each suited to different types of problems. Supervised learning is ideal for
prediction tasks with labeled data, unsupervised learning is useful for discovering hidden
structures in data, and reinforcement learning is powerful for decision-making in dynamic
environments. Understanding these learning types is essential for selecting the appropriate
Machine Learning approach for a given problem.
3. Simple Linear Regression (SLR)
Sample Python Code for Simple Linear Regression
# Import required libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# Load dataset
mba_salary_df = pd.read_csv("MBA_Salary.csv")

# Define independent (X) and dependent (Y) variables
X = mba_salary_df['Percentage in Grade 10']
Y = mba_salary_df['Salary']

# Add constant term so that OLS estimates an intercept
X = sm.add_constant(X)

# Split data into training and test sets (80:20)
train_X, test_X, train_y, test_y = train_test_split(
    X, Y, train_size=0.8, random_state=100
)

# Fit the Simple Linear Regression model using OLS
slr_model = sm.OLS(train_y, train_X).fit()

# Print regression parameters (intercept and slope)
print(slr_model.params)

# Make predictions on test data
pred_y = slr_model.predict(test_X)
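
The sample above needs the MBA_Salary.csv file. As a self-contained sketch of what OLS computes in the simple (one-predictor) case, the following fits the line with the closed-form least-squares formulas on synthetic data. The numbers used to generate x and y are stand-ins invented for illustration, not values from the dataset.

```python
import numpy as np

rng = np.random.default_rng(100)
x = rng.uniform(40, 95, size=50)                      # stand-in for a percentage score
y = 50000 + 3000 * x + rng.normal(0, 20000, size=50)  # stand-in for salary

# Closed-form least-squares estimates for y = intercept + slope * x.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Fitted values, residuals, and the coefficient of determination.
y_hat = intercept + slope * x
residuals = y - y_hat
r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print("slope:", slope, "intercept:", intercept, "R^2:", r_squared)
```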

Conclusion
Simple Linear Regression is one of the most fundamental and widely used predictive analytics
techniques. It is easy to interpret, computationally efficient, and forms the foundation for multiple
linear regression and other advanced models. However, ensuring that model assumptions are
satisfied and performing proper diagnostics is essential for obtaining reliable and meaningful
results.
4. Multiple Linear Regression (MLR)
5. Residual Analysis (Brief Discussion with Illustrations)
What is Residual Analysis?
Residual analysis is the process of examining the residuals (errors) of a regression model to
validate the assumptions made during model building and to assess the goodness of fit of
the model.

A residual is defined as:

Residual = Actual value (Y) − Predicted value (Ŷ), that is, $e_i = Y_i - \hat{Y}_i$

Residual analysis helps determine whether the regression model is appropriate, reliable, and
statistically valid.

Objectives of Residual Analysis

Residual analysis is performed to:

● Validate regression assumptions

● Detect non-linearity
● Identify heteroscedasticity
● Check normality of errors
● Identify outliers and influential observations

Types of Residual Analysis with Illustrations

1. Normality of Residuals (P–P Plot)

Purpose:
To check whether residuals follow a normal distribution, which is an important assumption of
regression.

Method:

● Use a Probability–Probability (P–P) plot

● If residuals are normally distributed, points lie close to the 45° reference line

Interpretation:

● Points close to line → normality satisfied

● Large deviations → non-normal residuals

✏️ Exam illustration:
Draw a straight diagonal line with points clustered around it.
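
The coordinates of a P–P plot can be computed directly: standardize the residuals, sort them, and compare the normal CDF at each point with the empirical cumulative probability. The sketch below does this for simulated residuals; the plotting position (i − 0.5)/n is one common convention and is an assumption here.

```python
import numpy as np
from scipy.stats import norm

# Simulated residuals that really are normal, for illustration.
rng = np.random.default_rng(0)
residuals = rng.normal(0, 2.0, size=500)

# Standardize and sort the residuals.
z = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))

# x-axis: empirical cumulative probability; y-axis: theoretical normal probability.
empirical_p = (np.arange(1, len(z) + 1) - 0.5) / len(z)
theoretical_p = norm.cdf(z)

# Largest vertical distance from the 45-degree line.
max_gap = float(np.max(np.abs(theoretical_p - empirical_p)))
print("largest deviation from the diagonal:", max_gap)
```

Plotting theoretical_p against empirical_p gives the P–P plot; a small max_gap corresponds to points hugging the diagonal.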

2. Homoscedasticity (Residual Plot)

Purpose:
To verify whether the residuals have constant variance across all predicted values.

Method:

● Plot standardized residuals vs predicted values

Interpretation:

● Random scatter → homoscedasticity (assumption satisfied)

● Funnel or cone shape → heteroscedasticity (assumption violated)

✏️ Exam illustration:
● Good: random dots around zero
● Bad: funnel-shaped spread

3. Detection of Outliers

Purpose:
To identify observations that deviate significantly from the rest of the data and may influence
the regression coefficients.

Common techniques:

● Z-Score (|Z| > 3 indicates outlier)

● Cook’s Distance (> 1 indicates influential observation)
● Leverage values

Interpretation:
Outliers can distort parameter estimates and should be carefully examined before removal.

4. Model Specification Check

Purpose:
To detect incorrect functional form (e.g., missing non-linear relationship).

Method:

● Analyze residual plots for patterns

Interpretation:

● No pattern → correct model specification

● Curved or systematic pattern → incorrect functional form

Importance of Residual Analysis

● Ensures regression assumptions are satisfied
● Improves reliability of predictions
● Helps detect overfitting or underfitting
● Guides model refinement and transformation

Conclusion
Residual analysis is a crucial step in regression modeling that validates the underlying
assumptions and ensures the model’s statistical soundness. By examining residual plots,
normality plots, and outlier measures, one can assess whether the regression model is
appropriate for prediction and inference.
6. Distance Measures for Identifying Influential Observations in Regression Analysis
Introduction
In regression analysis, it is assumed that each observation contributes equally to the
estimation of regression parameters. However, in real-world data, some observations may have
an unusually large influence on the regression model. Such observations are called
influential observations.

Influential observations can:

● Distort regression coefficients

● Reduce model accuracy
● Lead to incorrect conclusions

To detect these observations, several distance and influence measures are used.

What is an Influential Observation?

An influential observation is a data point whose removal causes a significant change in:

● Regression coefficients
● Predicted values
● Model fit statistics

These observations may or may not be outliers, but they strongly affect the regression results.

Distance Measures Used to Identify Influential Observations
1. Z-Score

Definition
Z-score measures how far an observation is from the mean, in terms of standard deviations.

Limitation

● Identifies outliers but does not directly measure influence on regression coefficients
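
In symbols, the Z-score of observation i is z_i = (x_i − mean) / standard deviation. A minimal sketch with made-up values, using the |Z| > 3 rule mentioned earlier in these notes:

```python
import numpy as np

# Hypothetical measurements: nineteen values near 50 and one extreme value.
values = np.array([48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
                   48, 52, 50, 49, 51, 47, 53, 50, 50, 120], dtype=float)

# Z-score: distance from the mean in units of the sample standard deviation.
z_scores = (values - values.mean()) / values.std(ddof=1)

# Flag observations whose absolute Z-score exceeds 3.
outlier_idx = np.flatnonzero(np.abs(z_scores) > 3).tolist()
print(outlier_idx)
```

Note that in very small samples no observation can reach |Z| > 3, because the extreme value itself inflates the standard deviation.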

2. Mahalanobis Distance

Definition

Mahalanobis distance measures the multivariate distance of an observation from the center of
the data, taking into account correlation among independent variables.

Purpose

● Identifies observations that are unusual in feature space

● Detects high-leverage points

Interpretation

● Large Mahalanobis distance → observation is far from the bulk of data

● Such observations may strongly influence regression estimates

Advantage

● Accounts for correlation between predictors

● More reliable than Euclidean distance in regression problems
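
The squared Mahalanobis distance of observation x from the center μ is (x − μ)ᵀ S⁻¹ (x − μ), where S is the covariance matrix of the predictors. The sketch below computes it with NumPy on simulated, strongly correlated predictors, plus one assumed point (2, −2) that is moderate on each axis but breaks the correlation pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated predictors.
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=200)
X = np.column_stack([x1, x2])

# Append a point that is unusual only in combination (index 200).
X = np.vstack([X, [2.0, -2.0]])

# Mahalanobis distance of every row from the data center.
center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - center
d_squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
mahalanobis = np.sqrt(d_squared)

print("most unusual row:", int(np.argmax(mahalanobis)))
```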
3. Leverage Value (Hat Value)

Definition

Leverage measures how far an observation’s independent variable values are from the mean
of all predictors.

Leverage indicates the potential influence of an observation.
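
Leverage values are the diagonal elements h_ii of the hat matrix H = X (XᵀX)⁻¹ Xᵀ, where X is the design matrix including the constant column. A minimal NumPy sketch follows, with made-up x values. The cutoff 2p/n used here is one common rule of thumb and an assumption; other texts use 3p/n.

```python
import numpy as np

# Hypothetical predictor values; the last one is far from the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
n, p = X.shape

# Hat matrix and its diagonal (the leverage values).
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Flag high-leverage points using the assumed 2p/n cutoff.
high_leverage_idx = np.flatnonzero(leverage > 2 * p / n).tolist()
print(leverage.round(3), high_leverage_idx)
```

The leverage values always sum to p, the number of estimated parameters, which is a useful sanity check.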

4. Cook’s Distance

Definition

Cook’s Distance measures the overall influence of an observation by assessing how much the
regression coefficients change when that observation is removed.

Concept

It combines:

● Residual size
● Leverage value

Interpretation

● Cook’s Distance > 1 → highly influential observation

● Large Cook’s distance indicates that removing the observation significantly alters the
regression model
Importance

● Most widely used influence diagnostic

● Directly measures effect on model estimates
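
A standard formula for Cook's Distance combines the residual e_i and the leverage h_ii: D_i = (e_i² / (p · s²)) · h_ii / (1 − h_ii)², where p is the number of estimated parameters and s² is the residual mean square. The sketch below applies it with NumPy to made-up data in which the last point has high leverage and does not follow the trend of the others.

```python
import numpy as np

# Hypothetical data: the first five points follow roughly y = 2x; the last does not.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 15.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

# OLS fit, residuals, and residual mean square.
beta = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta
s_squared = np.sum(residuals ** 2) / (n - p)

# Leverage values from the hat matrix.
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

# Cook's Distance for every observation.
cooks_d = (residuals ** 2 / (p * s_squared)) * leverage / (1 - leverage) ** 2
print(cooks_d.round(3))
```

The last observation has a small residual because it pulls the fitted line toward itself, yet its Cook's Distance is by far the largest. This is why residual size alone is not enough to detect influence.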

5. Influence Plot (Combined Measure)

Definition

An influence plot is a graphical tool that simultaneously displays:

● Standardized residuals
● Leverage values
● Cook’s Distance (bubble size)

Interpretation

● Points with high leverage, large residuals, and large bubble size are most influential
● Helps visually identify problematic observations

Why Identifying Influential Observations is Important

● Prevents misleading regression coefficients
● Improves model stability and accuracy
● Ensures reliable statistical inference
● Helps decide whether to remove or transform data points

Conclusion
Identifying influential observations is a critical step in regression diagnostics. Distance measures
such as Z-score, Mahalanobis distance, leverage values, and Cook’s distance help detect
observations that can disproportionately affect the regression model. Proper analysis of these
measures ensures a robust and reliable regression model.
