Understanding Overfitting in Machine Learning

Overfitting occurs when a machine learning model fits training data too closely, leading to poor generalization on unseen data. Techniques such as cross-validation and early stopping can help mitigate overfitting. The document also discusses the impact of model complexity on performance, illustrated through examples using OneR and various datasets.

Uploaded by

fatmaoztan130

Overfitting

Overfitting
“Overfitting” is a problem that plagues all machine
learning methods. It occurs when a classifier fits the
training data too tightly and doesn’t generalize well
to independent test data.
The green line represents an overfitted
model and the black line represents a
regularized model. While the green line
best follows the training data, it is too
dependent on that data and it is likely to
have a higher error rate on new unseen
data, compared to the black line.
Overfitting

Noisy (roughly linear) data is fitted to a linear function and a
polynomial function. Although the polynomial function is a
perfect fit, the linear function can be expected to generalize
better: if the two functions were used to extrapolate beyond the
fitted data, the linear function should make better predictions.
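The comparison can be sketched in a few lines of Python. This is a minimal illustration, not the figure's actual data: the underlying line (y = 2x + 1), the noise level, the random seed, and the helper names are all assumptions made for the example. The polynomial is built by Lagrange interpolation, so it passes through every training point exactly.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_interpolating_polynomial(xs, ys):
    """Degree n-1 polynomial through all n points (Lagrange form)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return predict

def rmse(model, xs, ys):
    """Root mean squared error of a model on the given points."""
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

def true_line(x):
    return 2.0 * x + 1.0  # assumed underlying relationship

rng = random.Random(0)  # fixed seed so the run is repeatable
train_x = [float(i) for i in range(10)]
train_y = [true_line(x) + rng.gauss(0.0, 1.0) for x in train_x]
new_x = [10.0, 11.0, 12.0]  # points beyond the fitted range
new_y = [true_line(x) for x in new_x]

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

print("training RMSE      line:", rmse(line, train_x, train_y),
      " polynomial:", rmse(poly, train_x, train_y))
print("extrapolation RMSE line:", rmse(line, new_x, new_y),
      " polynomial:", rmse(poly, new_x, new_y))
```

The polynomial has as many coefficients as there are training points, so its training error is zero, yet it is the line that stays close to the underlying relationship outside the fitted range.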
Overfitting
As an example, if the number of parameters is the
same as or greater than the number of observations,
then a model can perfectly predict the training data
simply by memorizing the data in its entirety. Such a
model, though, will typically fail severely when
making predictions.
To lessen the chance or amount of overfitting,
several techniques are available (cross-validation,
early stopping, pruning,…).
Overfitting

This is the numeric version of the weather problem, where
temperature and humidity are numbers rather than nominal
values. If you think about how OneR works, when it comes to
making a rule on the attribute “temperature”, it’s going to make
a complex rule that branches many different ways, almost one
branch for each of the 14 instances in the dataset.
Overfitting

Almost every branch is going to have zero errors: it’s going to
get its training instances exactly right. If we branch on
temperature, we’re going to get a near-perfect rule on the
training data (92.8% accuracy at the most complex setting).
OneR has a parameter that limits the complexity of the rule.
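A rough sketch of why such a rule looks so good on the training data and so poor elsewhere. It is not Weka's OneR: the rule simply memorizes the majority class for every distinct temperature, unseen values fall back to the overall majority class, and leave-one-out evaluation stands in for Weka's cross-validation. The 14 (temperature, play) pairs are assumed to match the numeric weather data distributed with Weka; all function names are hypothetical.

```python
from collections import Counter

# (temperature, play) pairs; assumed to match Weka's numeric weather data.
WEATHER = [
    (85, "no"), (80, "no"), (83, "yes"), (70, "yes"), (68, "yes"),
    (65, "no"), (64, "yes"), (72, "no"), (69, "yes"), (75, "yes"),
    (75, "yes"), (72, "yes"), (81, "yes"), (71, "no"),
]

def train_per_value_rule(rows):
    """One branch per distinct value, predicting that value's majority class."""
    by_value = {}
    for value, label in rows:
        by_value.setdefault(value, Counter())[label] += 1
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    # Values never seen in training fall back to the overall majority class.
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda value: table.get(value, default)

def count_correct(rule, rows):
    return sum(rule(value) == label for value, label in rows)

# Accuracy on the same data the rule was built from.
train_correct = count_correct(train_per_value_rule(WEATHER), WEATHER)

# Leave-one-out: build the rule on 13 instances, test on the held-out one.
loo_correct = 0
for i, held_out in enumerate(WEATHER):
    rest = WEATHER[:i] + WEATHER[i + 1:]
    loo_correct += count_correct(train_per_value_rule(rest), [held_out])

print(f"training: {train_correct}/14, leave-one-out: {loo_correct}/14")
```

Under these assumptions the memorizing rule gets 13 of 14 training instances right (two instances share temperature 72 but differ in class) and only 8 of 14 when each instance is held out in turn.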
Overfitting
Open the numeric weather data. Run OneR with cross-validation. The
rule is based on the “outlook” attribute. Remove the outlook
attribute, and try it again. Now it branches on humidity. If humidity is
less than 82.5%, it’s a “yes” day; if it’s greater than 82.5%, it’s a “no”
day, and that gets 10 out of 14 training instances correct.
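The humidity rule is simple enough to check by hand. A minimal sketch, assuming the 14 (humidity, play) pairs below match the numeric weather data distributed with Weka; the function name is hypothetical.

```python
# (humidity, play) pairs; assumed to match Weka's numeric weather data.
HUMIDITY = [
    (85, "no"), (90, "no"), (86, "yes"), (96, "yes"), (80, "yes"),
    (70, "no"), (65, "yes"), (95, "no"), (70, "yes"), (80, "yes"),
    (70, "yes"), (90, "yes"), (75, "yes"), (91, "no"),
]

def humidity_rule(humidity):
    """The single-threshold rule described above."""
    return "yes" if humidity < 82.5 else "no"

correct = sum(humidity_rule(h) == play for h, play in HUMIDITY)
print(f"{correct} of {len(HUMIDITY)} correct ({correct / len(HUMIDITY):.1%})")
```

A rule with one threshold makes more training errors than a rule with a branch per value, which is the price of a simpler rule.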
Overfitting
Configure the classifier by clicking on it. We
see that there’s a parameter called
minBucketSize, the minimum bucket size
used for discretizing numeric attributes,
and it’s set to 6 by default. As
minBucketSize increases, accuracy on the
training set steadily decreases. The rule is
largest when minBucketSize = 1 and
shrinks as minBucketSize increases.
Change that value to 1. The rule now
branches many different ways on the
“temperature” attribute. This rule is
overfitted to the dataset. It’s a very accurate
rule on the training data, but it won’t
generalize well to independent test data.
minBucketSize = 6: training set 71.4%, cross-validation 50.0%
minBucketSize = 1: training set 92.8%, cross-validation 35.7%
Overfitting
Open “diabetes” dataset. All the attributes are numeric,
and the class is either tested_negative or
tested_positive. Let’s run ZeroR to get a baseline figure
for this dataset which is 65% with cross-validation (cv).
Let’s run OneR with cv, with default parameter settings
– that is a value of 6 for OneR’s parameter that controls
rule complexity. We get 71.5%. We’re evaluating using
cross-validation, and OneR outperforms the baseline
accuracy by quite a bit – 71% versus 65%.
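ZeroR ignores every attribute and always predicts the most frequent class, so its accuracy is just the share of that class. A minimal sketch; the class counts (500 tested_negative, 268 tested_positive) are an assumption taken from the commonly distributed version of the diabetes dataset and are not stated in these slides, and the function name is hypothetical.

```python
from collections import Counter

def zero_r(labels):
    """Return the majority class and the fraction of instances it covers."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# Assumed class distribution of the diabetes dataset (768 instances).
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268

majority, baseline = zero_r(labels)
print(f"ZeroR always predicts {majority}: {baseline:.1%}")
```

Any classifier worth using should beat this figure under the same evaluation scheme.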
Overfitting
If we look at the rule, it
branches on “plas”. This is
the plasma-glucose
concentration. So depending
on which of these regions
the plasma-glucose
concentration falls into,
we’re going to predict a
negative or a positive
outcome. That seems like
quite a sensible rule.
Overfitting
Change OneR’s parameter to make it overfit.
We’ll configure OneR, find the minBucketSize
parameter, and change it to 1.
When we run OneR again, we get 57%
accuracy, quite a bit lower than the ZeroR
baseline of 65%. If you look at the rule, it’s
testing a different attribute, “pedi”, which
is the diabetes pedigree function.
You can see that this attribute has a lot of
different values, and it looks like we’re
branching on pretty well every single one. That
gives us poor performance when evaluated by
cross-validation.
Overfitting
If you were to evaluate it on the training set, you would
see 87.5% accuracy, which is very good for this dataset.
Of course, that figure is completely misleading. The
rule is strongly overfitted to the training dataset, and
doesn’t generalize well to independent test sets. That’s
a good example of overfitting.
minBucketSize = 6: training set 76.4%, cross-validation 71.5%
minBucketSize = 1: training set 87.5%, cross-validation 57.1%
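One quick way to read these figures is the gap between training accuracy and cross-validation accuracy. The sketch below only does arithmetic on the percentages quoted in the text (71.5% is used for the default setting, as in the earlier slide); the variable names are hypothetical.

```python
# Reported diabetes results: minBucketSize -> (training %, cross-validation %).
REPORTED = {
    6: (76.4, 71.5),
    1: (87.5, 57.1),
}

# Gap in percentage points between training and cross-validation accuracy.
gaps = {b: round(train - cv, 1) for b, (train, cv) in REPORTED.items()}

for bucket_size, gap in gaps.items():
    print(f"minBucketSize = {bucket_size}: gap of {gap} percentage points")
```

A small gap suggests the training figure is a fair guide to performance on new data; a large gap is the signature of overfitting.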
Exercises
1. Open the [Link] dataset and inspect the data
using the Edit button of Weka’s Preprocess panel. What is the
maximum accuracy of rules based on temperature and
humidity respectively, in terms of the number of training
instances predicted correctly?
a) temperature : 12 correct instances
b) temperature : 13 correct instances
c) temperature : 14 correct instances
d) humidity: 10 correct instances
e) humidity : 12 correct instances
f) humidity : 14 correct instances
Exercises
2. The following questions investigate the effect of
OneR’s minBucketSize parameter on performance and rule complexity by
drawing graphs where -B (minBucketSize) ranges from 1 to 10.
Open the [Link] dataset, go to the Classify tab, and select OneR. Make a
rough paper-and-pencil plot of accuracy on the training data (on the vertical
axis) against minBucketSize (on the horizontal axis) and compare it with the
graphs below. Which of these shapes do you get?
Exercises
3. Make a rough pencil-and-paper plot of cross-
validation accuracy against minBucketSize. Which of
these do you get?
Exercises
4. Using paper and pencil, plot the size of the rule
generated against minBucketSize. Which of these
plots do you get?
MODEL EVALUATION METRICS
IN CLASSIFICATION
Open the [Link] dataset. Use OneR
with default parameters and test with
cross-validation.
In the classifier output, you can see the detailed
accuracy by class and the associated metrics.
MODEL EVALUATION METRICS
IN CLASSIFICATION
TP Rate = TP/(TP+FN)
4/9 = 0.444 for yes class
2/5 = 0.400 for no class
((0.444*9)+(0.400*5))/14
= 0.429 for WA (weighted average)

FP Rate = FP/(FP+TN)
3/5 = 0.600 for yes class
5/9 = 0.556 for no class
((0.600*9)+(0.556*5))/14
= 0.584 for WA
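These rates can be recomputed from a confusion matrix. A minimal sketch, assuming the matrix below, which is reconstructed from the fractions quoted above (9 actual yes instances of which 4 are predicted yes, 5 actual no instances of which 2 are predicted no); the helper names are hypothetical.

```python
# CONFUSION[actual][predicted] = number of instances (reconstructed, assumed).
CONFUSION = {
    "yes": {"yes": 4, "no": 5},
    "no": {"yes": 3, "no": 2},
}

def counts(confusion, positive):
    """Return (TP, FP, FN, TN) treating `positive` as the positive class."""
    tp = confusion[positive][positive]
    fn = sum(n for pred, n in confusion[positive].items() if pred != positive)
    fp = sum(row[positive] for actual, row in confusion.items() if actual != positive)
    total = sum(sum(row.values()) for row in confusion.values())
    return tp, fp, fn, total - tp - fp - fn

def weighted_average(per_class, confusion):
    """Average per-class values, weighted by the number of actual instances."""
    support = {c: sum(row.values()) for c, row in confusion.items()}
    return sum(per_class[c] * support[c] for c in per_class) / sum(support.values())

tp_rate, fp_rate = {}, {}
for cls in CONFUSION:
    tp, fp, fn, tn = counts(CONFUSION, cls)
    tp_rate[cls] = tp / (tp + fn)
    fp_rate[cls] = fp / (fp + tn)

wa_tp_rate = weighted_average(tp_rate, CONFUSION)
wa_fp_rate = weighted_average(fp_rate, CONFUSION)
print(tp_rate, fp_rate, round(wa_tp_rate, 3), round(wa_fp_rate, 3))
```

Note that the weighted average of the TP rate is simply the overall accuracy, here 6 correct out of 14.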
MODEL EVALUATION METRICS
IN CLASSIFICATION
Precision = TP/(TP+FP)
= 4/(4+3) = 0.571 for yes class
= 2/(2+5) = 0.286 for no class
((0.571*9)+(0.286*5))/14 = 0.469 for WA

Recall = TP/(TP+FN)
= 4/(4+5) = 0.444 for yes class
= 2/(2+3) = 0.400 for no class
((0.444*9)+(0.400*5))/14 = 0.429 for WA
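The same arithmetic as a short script, using the per-class counts that appear in the calculations above. The dictionary layout and function names are hypothetical; the weights in the average are the numbers of actual instances per class (TP + FN).

```python
# Per-class counts taken from the calculations above.
COUNTS = {
    "yes": {"tp": 4, "fp": 3, "fn": 5},
    "no": {"tp": 2, "fp": 5, "fn": 3},
}

def precision(c):
    return c["tp"] / (c["tp"] + c["fp"])

def recall(c):
    return c["tp"] / (c["tp"] + c["fn"])

def weighted_average(metric):
    """Weight each class by its number of actual instances (TP + FN)."""
    support = {cls: c["tp"] + c["fn"] for cls, c in COUNTS.items()}
    total = sum(support.values())
    return sum(metric(COUNTS[cls]) * support[cls] for cls in COUNTS) / total

wa_precision = weighted_average(precision)
wa_recall = weighted_average(recall)

for cls, c in COUNTS.items():
    print(f"{cls}: precision {precision(c):.3f}, recall {recall(c):.3f}")
print(f"WA: precision {wa_precision:.3f}, recall {wa_recall:.3f}")
```

Recall is the same quantity as the TP rate, which is why both weighted averages come out at 0.429.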
MODEL EVALUATION METRICS
IN CLASSIFICATION

F-measure =
2*Precision*Recall/(Precision+Recall)
= 2*0.571*0.444/(0.571+0.444) = 0.500
for yes class
= 2*0.286*0.400/(0.286+0.400) = 0.333
for no class
((0.500*9)+(0.333*5))/14 = 0.440 for WA
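A small sketch of the F-measure as the harmonic mean of precision and recall, evaluated on the per-class values used above. The function name and the way the inputs are laid out are assumptions for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# class -> (precision, recall, number of actual instances), from the slides.
PER_CLASS = {
    "yes": (4 / 7, 4 / 9, 9),
    "no": (2 / 7, 2 / 5, 5),
}

f_scores = {cls: f_measure(p, r) for cls, (p, r, _) in PER_CLASS.items()}
total = sum(n for _, _, n in PER_CLASS.values())
wa_f = sum(f_scores[cls] * n for cls, (_, _, n) in PER_CLASS.items()) / total

print({cls: round(f, 3) for cls, f in f_scores.items()}, round(wa_f, 3))
```

Because it is a harmonic mean, the F-measure is pulled toward the smaller of the two values, so a classifier cannot score well by being strong on precision alone or recall alone.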
MODEL EVALUATION METRICS
IN CLASSIFICATION
MCC = Matthews correlation coefficient
MCC = (TP * TN - FP * FN) / √((TP + FP) * (TP +
FN) * (TN + FP) * (TN + FN))
Like most correlation coefficients, MCC ranges
between -1 and 1, where 1 is perfect
agreement between actuals and predictions,
0 is no better than random prediction, and
-1 is total disagreement.
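As a worked instance, the formula can be applied to the weather results with “yes” as the positive class. The counts TP = 4, FN = 5, FP = 3, TN = 2 are taken from the earlier calculations; the function name and the convention of returning 0 for a zero denominator are assumptions of this sketch.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from the four confusion-matrix counts; 0.0 if the denominator is 0."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

mcc = matthews_corrcoef(tp=4, tn=2, fp=3, fn=5)
print(f"MCC = {mcc:.3f}")
```

The slightly negative value says the predictions agree with the actual classes a little less often than chance would, which fits an accuracy of 6 out of 14.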
MODEL EVALUATION METRICS
IN CLASSIFICATION
An ROC curve (receiver operating characteristic curve)
is a graph showing the performance of a classification
model: it plots the TP rate against the FP rate as the
classification threshold varies.
PRC area is the area under the Precision-Recall
curve (PRC). The PRC shows the relationship
between precision and recall (sensitivity),
and is often considered a more appropriate
measure for unbalanced datasets than the
ROC curve.
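The area under the ROC curve has a convenient interpretation: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way. The scores are made up for illustration and do not come from any dataset in these slides, and the function name is hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical classifier scores (higher means "more likely positive").
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3]

auc = roc_auc(positives, negatives)
print(f"ROC area = {auc:.3f}")
```

An area of 1.0 means every positive is ranked above every negative, while 0.5 corresponds to random ranking.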
