Train-Test Split and Model Evaluation Guide

Coding notes

Uploaded by

cassbozz

BDA

20242 Semester
Train test split
Splits the data into training and test data
• builds the model using the training data
• tests the model using the test data
A model can memorise its training data, so a split is needed to check that it generalises well to new, unseen data
By default, the train_test_split function splits the data into 75% (training set) and 25% (test set)
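A minimal sketch of the default split, assuming scikit-learn is installed. The toy data (100 numbered rows) and random_state=0 are made up for illustration:

```python
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 rows with one feature and a binary label.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# With no test_size given, 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(len(X_train), len(X_test))  # 75 25
```

Fixing random_state only makes the shuffle repeatable; it does not change the 75/25 proportions.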

Evaluating model
Accuracy is the ratio of correctly predicted instances to the total number of instances
• often given as a percentage
• misleading on imbalanced data sets → most informative on balanced data sets
Comparing with a baseline is useful → shows whether a more complex model actually performs better

Imbalanced data sets

One class (minority) has significantly fewer samples than the other class (majority) → leads to poor
performance of ML models because the models can be biased towards the majority class
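A small made-up example of why this matters: a "model" that always predicts the majority class looks accurate but never finds a minority instance. The 95/5 class counts are an assumption chosen for illustration:

```python
# Toy labels (illustrative only): 95 majority-class (0) and 5 minority-class (1) instances.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)        # 0.95
print(minority_found)  # 0
```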

Supervised vs unsupervised
Supervised
• you know what to predict
• labelled data with a target variable
• train test split
• good when model can predict on new unseen data
Unsupervised
• discover unknown patterns
• unlabelled data without a target variable
• No train test split
• clustering

SMOTE
Used especially with imbalanced datasets
Oversampling technique that balances the class distribution in a dataset
Creates synthetic samples for the minority class to balance the distribution
→ improves model performance, reduces bias, and enhances generalisation
→ carries a risk of overfitting, and the synthetic samples may not introduce sufficient variability
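A simplified sketch of the interpolation idea behind SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. This is not the imbalanced-learn implementation; smote_sketch, k, seed and the toy points are assumptions for illustration:

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Other minority points, sorted by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # position along the segment from x to the neighbour
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Toy minority-class points (illustrative only).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(len(new_points))  # 6
```

Because every synthetic point lies between two existing minority points, the new samples add little variability, which is the limitation noted above.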

Classification
Predicts categorical labels
Regression
Predicts continuous values

Generalisability
How well the model performs on data that was not used to build the model

Overfitting & underfitting

Overfitting is when the model fits the training data so closely that it cannot perform accurately on new
unseen data
• model is too complex, e.g. due to too much training or too many input features
Underfitting is when the model is unable to capture the relationship between the input and output
variables accurately → higher error rate on both the training set and unseen data
• model is too simple, e.g. it needs more training or more input features

Regularization
Balancing a model's fit to the data against the model's complexity → guards against overfitting

Accuracy, precision & recall
Accuracy is the ratio of correctly predicted instances to the total number of instances
• good measure when classes are well balanced, meaning the number of instances per class is roughly the same
• can be misleading when data is imbalanced
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision is the ratio of correctly predicted positives to the total predicted positives
• how many of the instances predicted positive were actually positive?
• important to use when the cost of false positives is high → does not account for false negatives
Precision = TP / (TP + FP)

Recall is the ratio of correctly predicted positives to the actual positives
• how many of the positive instances were correctly identified?
• important to use when the cost of false negatives is high → does not account for false positives
Recall = TP / (TP + FN)
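The three ratios can be written out directly from the four counts (TP, TN, FP, FN). A minimal worked example, with made-up counts chosen for illustration:

```python
def accuracy(tp, tn, fp, fn):
    # Correct predictions over all predictions.
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    # Of everything predicted positive, the share that really was positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # Of everything actually positive, the share that was found.
    return tp / (tp + fn)

# Made-up counts for illustration.
tp, tn, fp, fn = 40, 45, 5, 10

print(accuracy(tp, tn, fp, fn))  # 0.85
print(precision(tp, fp))         # 40 / 45, roughly 0.889
print(recall(tp, fn))            # 0.8
```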

GridSearch
Hyperparameter tuning technique that searches through candidate parameter values to find the best ones for the model
Tries every combination in the grid to find the best set for the model
Evaluates performance for each combination using cross-validation
• cross-validation is a technique for evaluating a model
• splits the dataset into training and validation sets multiple times → ensures that the model's
performance is assessed reliably across different subsets of the data
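A minimal sketch of a grid search with 5-fold cross-validation, assuming scikit-learn is installed. The iris dataset, the KNN model and the candidate values in param_grid are choices made for illustration, not taken from the notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Every combination of these values is tried: 4 x 2 = 8 candidates.
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}

# cv=5: each candidate is scored on 5 different train/validation splits of the training set.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # the test set is only used at the very end
print(best_params, test_accuracy)
```

The search only sees the training set, so the held-out test set still gives an honest estimate for the chosen parameters.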
KNN
Measures similarities among instances (e.g. customers)
It is an instance-based learning algorithm (supervised)
Predicts an outcome for a test instance by finding the K most similar instances in the training data and
aggregating their observed outcomes
Simple model, effective with sufficient training data
By default K = 5
• K = 1 → complex model → risk of overfitting
• K = N → simple model → risk of underfitting
Tuning
• K = number of neighbours
• distance weighting
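A from-scratch sketch of the prediction step, assuming Euclidean distance and an unweighted majority vote (distance weighting is not shown). knn_predict and the toy points are illustrative, not a library API:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=5):
    """Predict the majority label among the k training points closest to query."""
    by_distance = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy training data (illustrative only): three "a" points and four "b" points.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7), (7, 7)]
train_y = ["a", "a", "a", "b", "b", "b", "b"]

print(knn_predict(train_X, train_y, (2, 2), k=3))             # a
print(knn_predict(train_X, train_y, (2, 2), k=len(train_X)))  # b
```

With K = N every training point votes, so the prediction is always the overall majority class, which is the underfitting case above.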
Hamming distance is a metric for comparing two binary strings → the number of bit positions in which
the two strings differ
• compares data points attribute by attribute to see how similar or dissimilar they are
• the result is the number of attributes that differ
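A short sketch of the calculation; hamming_distance is an illustrative helper and the inputs are made up:

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("10110", "10011"))  # 2

# The same count works on lists of categorical attributes.
print(hamming_distance(["red", "small", "round"], ["red", "large", "round"]))  # 1
```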

Logistic regression
Outputs a categorical value → used for classification tasks
Supervised algorithm that classifies data into categories or classes by predicting the
probability that an instance falls into a particular class based on its attributes
• smaller C → stronger regularization
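A sketch of how a probability becomes a class label, assuming a binary problem and the standard logistic (sigmoid) function. The weights, bias and the 0.5 threshold are hypothetical values for illustration, not fitted to any data:

```python
import math

def predict_proba(x, weights, bias):
    """Probability of class 1: the sigmoid of a weighted sum of the attributes."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(x, weights, bias, threshold=0.5):
    """Turn the probability into a categorical output."""
    return 1 if predict_proba(x, weights, bias) >= threshold else 0

weights, bias = [2.0, -1.0], 0.0  # hypothetical, not fitted

print(predict_proba([0.0, 0.0], weights, bias))  # 0.5
print(predict_class([1.0, 0.5], weights, bias))  # 1
print(predict_class([0.0, 2.0], weights, bias))  # 0
```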
Support vector classifier (SVC) is a linear model that outputs a categorical value
• classifies instances by finding the optimal line or hyperplane to separate the classes in feature space

Decision tree
Used for both classification and regression
Builds a hierarchy of if/else questions leading to a decision
Controlling complexity
• build until leaves are pure (impurity of 0) → tree will be 100% accurate on training data
Prevent overfitting
• limit depth, limit the maximum number of leaves, require a minimum number of points in a node to keep splitting
Each rectangular box is a node → a condition based on a feature
• leaf nodes represent final classification
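A minimal sketch of the complexity controls listed above, assuming scikit-learn is installed. The iris dataset and the specific limits (depth 2, 4 leaves, 10 points to split) are choices made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No limits: the tree keeps splitting until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limited depth, number of leaves, and minimum points needed to split a node.
small_tree = DecisionTreeClassifier(
    max_depth=2, max_leaf_nodes=4, min_samples_split=10, random_state=0
).fit(X_train, y_train)

print(full_tree.get_depth(), full_tree.score(X_train, y_train))
print(small_tree.get_depth(), small_tree.score(X_train, y_train))
```

Comparing training and test scores for both trees is one way to see whether the limits trade a little training accuracy for better generalisation.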

Random forest
Builds many decision trees, where each tree differs in random ways → averages the trees' predictions instead of relying on a single tree
It is built on training data
It improves prediction accuracy by combining predictions from multiple trees
Reduces overfitting
For each tree, randomly draws n samples from the original dataset with repetition allowed (bootstrap sample) → each tree's training set has the same size but different contents
max_features is a parameter that sets the size of the random subset of features considered → a high max_features increases the risk of overfitting
Classification: each tree makes "soft predictions" (class probabilities) → the class with the highest average probability is the forest's output
Regression: each tree makes its own prediction → the forest outputs the average of these predictions
Tuning
• number of trees (n_estimators), max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, criterion
• by grid search with cross-validation
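A small sketch of two of the ideas above: drawing a bootstrap sample, and combining soft predictions by averaging. bootstrap_sample, soft_vote and the three trees' probabilities are hypothetical and only illustrate the mechanics:

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with repetition allowed, so some repeat and others are left out."""
    return [rng.choice(rows) for _ in rows]

def soft_vote(tree_probabilities):
    """Average the per-class probabilities of all trees and return the winning class index."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    averaged = [sum(p[c] for p in tree_probabilities) / n_trees for c in range(n_classes)]
    return averaged.index(max(averaged)), averaged

rng = random.Random(0)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
print(len(sample))  # 10, the same size as the original data

# Hypothetical soft predictions from three trees for one instance (two classes).
winner, averaged = soft_vote([[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]])
print(winner)  # 0
```

In this made-up case two of the three trees lean towards class 1, but the averaged probabilities favour class 0, which shows how soft voting differs from counting votes.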

Gradient boosting
Powerful ensemble learning technique used for both regression and classification. Builds a series of weak
learners (decision trees) sequentially, where each new one improves on the previous ones
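A rough sketch of the sequential idea for regression, assuming squared error, 1-D inputs and one-split "stumps" as the weak learners. fit_stump, boost, n_rounds and learning_rate are illustrative names, not a library API:

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (squared error)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value is skipped so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to what is still unexplained."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

# Toy data (illustrative only).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, 2, 2, 5, 5, 6, 6]
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each stump is weak on its own; the improvement comes from every round fitting the errors left by the rounds before it.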
Neural network (ANN) & multilayer perceptron (MLP)
ANN is a computational model inspired by the way biological neural networks in the human brain
process information
MLP is a type of ANN that consists of multiple layers of neurons
• input layer (receives input features from the dataset)
• hidden layer(s) (process the inputs received from the input layer)
• output layer (produces the final output of the network)
Tuning MLP
• number of hidden layers
• number of units in each layer.
• regularization
• K of input → which is important

Scaling
Adjusting the range and distribution of numerical features in your dataset → ensures all features
contribute equally to the model
Important for SVM and neural networks
MinMaxScaler rescales every feature to lie between 0 and 1
Applied before supervised ML
Train + test sets should be scaled the same way (fit the scaler on the training set, then apply it to both)
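A from-scratch sketch of min-max scaling for a single feature, showing the minimum and maximum being learned from the training data only and then reused on the test data. fit_min_max, apply_min_max and the numbers are illustrative, and the sketch assumes the training column is not constant:

```python
def fit_min_max(train_column):
    """Learn the scaling parameters from the training data only."""
    return min(train_column), max(train_column)

def apply_min_max(column, lo, hi):
    """Rescale with the training minimum and maximum (assumes hi > lo)."""
    return [(v - lo) / (hi - lo) for v in column]

train = [10.0, 20.0, 30.0, 50.0]
test = [15.0, 60.0]

lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)

print(train_scaled)  # [0.0, 0.25, 0.5, 1.0]
print(test_scaled)   # [0.125, 1.25]
```

Test values can land outside 0 to 1 (here 1.25) because the range comes from the training set; that is expected when both sets are scaled the same way.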
Dummy variables
Also called one-hot encoding
If feature F has three values a, b & c → creates three new features Fa, Fb, Fc
A powerful tool in statistical modelling for incorporating categorical data
Adds new columns
Dummy variables for words
• one feature for each word
• value is 1 if the word occurs in the text, otherwise 0 → makes a new column for each word
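A plain-Python sketch of both cases: one column per category value, and one column per word. one_hot, word_dummies and the inputs are illustrative helpers, and splitting text on whitespace is a simplifying assumption:

```python
def one_hot(values):
    """One new 0/1 column per distinct category value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def word_dummies(texts):
    """One new 0/1 column per word: 1 if the word occurs in the text, otherwise 0."""
    tokenised = [t.lower().split() for t in texts]
    vocabulary = sorted({w for words in tokenised for w in words})
    return vocabulary, [[1 if w in words else 0 for w in vocabulary] for words in tokenised]

categories, rows = one_hot(["a", "c", "b", "a"])
print(categories)  # ['a', 'b', 'c']
print(rows)        # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]

vocabulary, word_rows = word_dummies(["good movie", "bad movie"])
print(vocabulary)  # ['bad', 'good', 'movie']
print(word_rows)   # [[0, 1, 1], [1, 0, 1]]
```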

Dummy classifier
Simple baseline model used to evaluate the performance of more complex models
Primary purpose is to provide a benchmark against which the performance of more advanced models
can be compared
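A minimal sketch of using a baseline as a benchmark, assuming scikit-learn is installed. The iris dataset, the most_frequent strategy and the KNN model being compared are choices made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Baseline: always predicts the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = KNeighborsClassifier().fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
model_accuracy = model.score(X_test, y_test)
print(baseline_accuracy, model_accuracy)
```

A model is only worth its extra complexity if it clearly beats this kind of benchmark.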

Confusion matrix
Powerful tool for understanding the performance of a classification model.
Provides a detailed breakdown of correct and incorrect predictions
• true positives → number of instances that are correctly predicted positive
• false positives → number of instances that are incorrectly predicted positive
• true negatives → number of instances that are correctly predicted negative
• false negatives → number of instances that are incorrectly predicted negative
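A short sketch of how the counts in the matrix are tallied from true and predicted labels, assuming a binary problem where 1 is the positive class. confusion_counts and the label lists are made up for illustration:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally the four cells of a binary confusion matrix."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive and truth == positive:
            counts["TP"] += 1  # correctly predicted positive
        elif predicted == positive:
            counts["FP"] += 1  # incorrectly predicted positive
        elif truth == positive:
            counts["FN"] += 1  # incorrectly predicted negative
        else:
            counts["TN"] += 1  # correctly predicted negative
    return counts

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(y_true, y_pred))  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```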

Common questions

Logistic regression approaches the decision boundary by predicting the probability that an instance belongs to a particular class based on its attributes, which determines whether the instance falls on one side of the decision boundary or the other. Support vector classifiers (SVC), on the other hand, find the optimal decision boundary or hyperplane that maximizes the margin between classes in the feature space. SVC explicitly identifies the most critical points (support vectors) that influence the position of the boundary, potentially offering better control over class separation compared to logistic regression.

A decision tree model controls complexity and avoids overfitting by limiting the depth of the tree, setting a maximum number of leaves, and requiring a minimum number of data points in a node to keep splitting. These techniques prevent the model from fitting the training data too closely. The trade-off is that limiting the complexity may lead to underfitting if the model is too simplistic, thus not capturing the underlying data trends adequately.

Gradient boosting differs from other ensemble methods like random forests in its sequential approach to building models. While random forests construct multiple trees independently and combine their predictions, gradient boosting builds a series of weak learners (e.g., decision trees) sequentially, with each new model aiming to correct the errors of the previous ones. This iterative improvement leads to a more focused and potentially more accurate ensemble model, though it can also be more prone to overfitting due to its iterative nature.

Scaling features in machine learning is critical as it ensures that all features contribute equally to the model's training. This is particularly important for algorithms like Support Vector Machines (SVM) and neural networks, where the distance between data points (in SVM) or the gradients of weights (in neural networks) can significantly affect the model's ability to learn efficiently. Without scaling, features with larger ranges can disproportionately affect the model, leading to suboptimal learning.

Cross-validation is considered reliable for model evaluation because it involves splitting the dataset into multiple training and validation subsets. This process is repeated several times, and the model's performance is averaged across these different subsets. This approach minimizes the variability and potential bias associated with a single train-test split and provides a more robust estimate of the model's performance on unseen data.

The train_test_split function contributes to a model's ability to generalize by dividing the data into a training set and a test set, by default in a 75% to 25% ratio. This ensures that the model is evaluated on unseen data (the test set) after being trained, which helps in assessing how well the model generalizes to new data beyond the training set. This is crucial because a model that performs well only on training data is likely overfitting and not generalizing well.

A random forest model improves prediction accuracy over a single decision tree by building multiple decision trees where each tree is trained on different subsets of the data. By averaging the predictions from many trees, random forests reduce the variance that a single decision tree might have, leading to lower risk of overfitting. The aggregation of diverse predictions from multiple trees enhances the model's robustness and overall accuracy.

When using accuracy as a metric for evaluating model performance on imbalanced datasets, the primary challenge is that it can be misleading. In imbalanced datasets, one class significantly outnumbers the other, leading the model to favor predicting the majority class. Thus, a model could achieve high accuracy by simply predicting the majority class most of the time, without truly understanding the data or making meaningful predictions about the minority class. Therefore, accuracy can give a false sense of performance strength in such contexts.

Setting a very high k value in the KNN algorithm increases the risk of underfitting because the model becomes too generalized: it considers a large number of neighbors, which may include instances from different classes. This leads to high bias, since the decision boundary may become too smooth, but reduced variance, because the model is less sensitive to noise in the training data. The balance between bias and variance is crucial for the KNN model's performance.

SMOTE is particularly beneficial in scenarios where datasets are imbalanced with a minority class that has significantly fewer samples. This oversampling technique creates synthetic samples for the minority class to balance the class distribution, which helps improve model performance, reduce bias, and enhance generalization. However, its limitations include the risk of overfitting and the possibility that synthetic samples may not introduce enough variability, potentially leading the model to still generalize poorly in some cases.



Common questions

How do logistic regression and support vector classifiers (SVC) differ in their approach to finding the decision boundary for classification tasks?

How does the decision tree model control complexity to avoid overfitting, and what are the trade-offs involved?

In what ways does gradient boosting differ from other ensemble methods such as random forests?

What is the impact of scaling features in machine learning, particularly when using algorithms like SVM and neural networks?

Why is cross-validation considered a reliable technique for model evaluation, especially when dealing with different subsets of data?

How does the train_test_split function contribute to a model's ability to generalize on new data?

How does a random forest model improve prediction accuracy over a single decision tree?

What challenges might arise when using accuracy as a metric for evaluating model performance on imbalanced datasets?

What are the potential risks associated with setting a very high k value in k-nearest neighbors (KNN) algorithm, and how does it affect model bias and variance?

In what scenarios is the use of SMOTE particularly beneficial, and what are its limitations?
