Train-Test Split and Model Evaluation Guide

Coding notes

Uploaded by

cassbozz
Copyright
© All Rights Reserved

BDA

20242 Semester
Train test split
Splits the data into training and test data
• builds the model using the training data
• tests the model using the test data
A model can simply memorise its training data, so the data must be split: scoring on the held-out test set
shows how well the model generalises to new, unseen data
The train_test_split function splits the data into 75% (training set) and 25% (test set) by default
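
A minimal sketch of the default 75/25 split, assuming scikit-learn's train_test_split; the toy arrays X and y are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 2 features, binary target.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# With no test_size given, scikit-learn holds out 25% of the rows for testing.
# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)
```

The model is then fitted on X_train, y_train only and scored on X_test, y_test.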

Evaluating model
Accuracy is the ratio of correctly predicted instances to the total number of instances
• often given as a percentage
• misleading on imbalanced datasets → rely on it only when the classes are balanced
Comparing with a baseline is useful → shows whether a complex model actually beats a trivial one

Imbalanced data sets


One class (minority) has significantly fewer samples than the other class (majority) → leads to poor
performance of ML models because the models can be biased towards the majority class

Supervised vs unsupervised
Supervised
• you know what to predict
• labelled data with a target variable
• train test split
• good when model can predict on new unseen data
Unsupervised
• discover unknown patterns
• unlabelled data without a target variable
• No train test split
• clustering
SMOTE
Used especially with imbalanced datasets
Oversampling technique for balancing the class distribution in a dataset
Creates synthetic samples for the minority class to balance the distribution
→ improves model performance, reduces bias, and enhances generalisation
→ does carry a risk of overfitting, and the synthetic samples may not introduce sufficient variability
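
A simplified sketch of the core SMOTE idea, assuming plain NumPy: each synthetic point is placed on the line between a minority sample and one of its nearest minority neighbours. The function name smote_sketch and the toy points are hypothetical; in practice one would likely use the SMOTE class from the imbalanced-learn library instead.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from sample i to every minority sample.
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        # Index 0 after sorting is the sample itself, so skip it.
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical minority class with four samples and two features.
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_sketch(X_minority, n_new=6)
print(X_new)
```

Because every synthetic point lies between existing minority points, the new samples stay inside the region the minority class already covers, which is why they may add little variability.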

Classification
Predicts categorical labels
Regression
Predicts continuous values

Generalisability
How well the model applies to data that was not used to build the model

Overfitting & underfitting


Overfitting is when the model fits the training data so well that it cannot perform accurately on new
unseen data
• model is too complex, due to too much training or too many input features
Underfitting is when the model is unable to capture the relationship between the input and output
variables accurately → higher error rate on both the training set and unseen data
• model is too simple; it may need more training or more input features

Regularization
Balancing a model's fit to the data against the model's complexity → guards against overfitting
Accuracy, precision & recall
Accuracy is the ratio of correctly predicted instances to the total number of instances
• good measure when classes are well balanced, meaning the number of instances per class is roughly the same
• can be misleading when data is imbalanced

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision is the ratio of correctly predicted positives to the total predicted positives
• how many of the instances predicted positive were actually positive?
• important when the cost of false positives is high → does not account for false negatives

Precision = TP / (TP + FP)
Recall is the ratio of correctly predicted positives to the actual positives
• how many of the positive instances were correctly identified?
• important when the cost of false negatives is high → does not account for false positives

Recall = TP / (TP + FN)
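
A small worked example of the three metrics on an imbalanced test set. The counts are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision and recall are undefined when their denominator is 0;
    # this sketch returns 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical test set: 95 negatives, 5 positives.
# Model A always predicts "negative".
always_negative = classification_metrics(tp=0, tn=95, fp=0, fn=5)
# Model B finds 4 of the 5 positives at the cost of 10 false alarms.
useful = classification_metrics(tp=4, tn=85, fp=10, fn=1)

print(always_negative)
print(useful)
```

Model A has the higher accuracy (0.95 against 0.89) even though its recall is 0, which is the sense in which accuracy misleads on imbalanced data.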

GridSearch
Hyperparameter tuning technique that searches through candidate parameter values to find the best ones for the model
Tries every combination in the grid to find the best set for the model
Evaluates performance for each combination using cross-validation
• cross-validation is a technique for evaluating a model
• splits the dataset into training and validation sets multiple times → ensures that the model's
performance is assessed reliably across different subsets of the data
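
A minimal sketch of grid search with 5-fold cross-validation, assuming scikit-learn's GridSearchCV. The iris dataset, the KNN model and the grid values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# 4 values x 2 values = 8 combinations, each scored with 5-fold cross-validation.
param_grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)  # the test set is never seen during the search

print(search.best_params_)           # best combination found
print(search.best_score_)            # its mean cross-validated accuracy
print(search.score(X_test, y_test))  # final check on the held-out test set
```

Keeping the test set out of the search means the final score is still an estimate on unseen data.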
KNN
Measures similarities amongst instances (e.g. customers)
It is an instance-based learning algorithm (supervised)
Predicts an outcome for a test instance by finding the K most similar instances in the training data and
aggregating the observed outcomes
Simple model, and effective with sufficient training data
By default K = 5
• K = 1 → complex model → risk of overfitting
• K = N → simple model → risk of underfitting
Tuning
• K = number of neighbours
• distance weighting
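
A rough illustration of the two extremes of K, assuming scikit-learn's KNeighborsClassifier on synthetic data from make_classification (the dataset and its sizes are made up).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for k in (1, 5, len(X_train)):
    # weights="distance" would switch on distance weighting; the default is "uniform".
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))

# After the loop, knn is the K = N model: every query sees all training
# points, so it predicts the same class for every instance.
predictions_k_all = set(knn.predict(X_test))

for k, (train_acc, test_acc) in scores.items():
    print(k, train_acc, test_acc)
```

With K = 1 each training point is its own nearest neighbour, so training accuracy is perfect, which says nothing about unseen data.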
Hamming distance is a metric for comparing two binary strings → the number of bit positions in which
the two strings differ
• compares data points attribute by attribute to find how similar or dissimilar they are
• gives the number of attributes that were different
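
A short sketch of the Hamming distance in plain Python; the example strings and attribute lists are made up.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

# Two binary strings that differ in positions 2 and 4.
bits = hamming_distance("10110", "11100")

# The same idea applied to categorical attributes of two instances.
attributes = hamming_distance(["red", "small", "yes"], ["red", "large", "no"])

print(bits, attributes)
```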

Logistic regression
Outputs a categorical value → for classification tasks
Supervised algorithm which can be used to classify data into categories or classes, by predicting the
probability that an instance falls into a particular class based on its attributes
• smaller C → stronger regularization
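
A minimal sketch of both points, assuming scikit-learn's LogisticRegression on synthetic data: predict_proba returns the class probabilities, and a smaller C shrinks the coefficients. The dataset and the two C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Small C = strong regularization, large C = weak regularization.
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100, max_iter=1000).fit(X, y)

# One row per instance, one column per class; each row sums to 1.
proba = weak.predict_proba(X[:3])
labels = weak.predict(X[:3])

# Stronger regularization pulls the coefficients towards zero.
strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)

print(proba, labels)
print(strong_norm, weak_norm)
```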
Support vector classifier (SVC) is a linear model that outputs a categorical value
• classifies instances by finding the optimal line or hyperplane to separate classes in a feature space
Decision tree
Used for both classification and regression
Builds a hierarchy of if/else questions leading to a decision
Controlling complexity
• build until leaves are pure (impurity close to 0) → tree will be 100% accurate on training data
Prevent overfitting
• limit depth, limit max number of leaves, require a minimum number of points in a node to keep splitting
Each rectangular box is a node → a condition based on a feature
• leaf nodes represent the final classification
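
A sketch contrasting an unrestricted tree with a restricted one, assuming scikit-learn's DecisionTreeClassifier. The breast cancer dataset and the limit values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# No limits: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limits on depth, number of leaves and points needed to split a node.
small_tree = DecisionTreeClassifier(
    max_depth=3, max_leaf_nodes=8, min_samples_split=10, random_state=0
).fit(X_train, y_train)

for name, tree in (("full", full_tree), ("small", small_tree)):
    print(name, tree.get_depth(), tree.get_n_leaves(),
          tree.score(X_train, y_train), tree.score(X_test, y_test))
```

Comparing the training and test scores of the two trees shows how much of the full tree's perfect training accuracy carries over to unseen data.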

Random forest
Builds many decision trees where each tree differs in random ways → averages the trees' predictions
rather than picking a single best tree
It is built on the training data
It improves prediction accuracy by combining predictions from multiple trees
Reduces overfitting
Bootstrap sampling: randomly selects n items from the original dataset with repetition allowed → each
tree's training set is the same size, but randomly different
max_features is a parameter that sets the size of the random subset of features considered → a high
value gives a higher chance of overfitting
Classification: each tree makes a "soft prediction" (class probabilities) → the class with the highest average probability is output by the forest
Regression: each tree makes its own prediction → the forest outputs the average of these predictions
Tuning
• number of trees (n_estimators), max_depth, min_samples_split, min_samples_leaf, max_features,
bootstrap, criterion
• by grid search with cross-validation
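
A sketch of the soft-voting step, assuming scikit-learn's RandomForestClassifier: averaging the per-tree class probabilities by hand should reproduce the forest's own output. The dataset and parameter values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees
    max_features="sqrt",  # size of the random feature subset tried at each split
    bootstrap=True,       # each tree sees a bootstrap sample of the training data
    random_state=0,
).fit(X_train, y_train)

# Average the "soft predictions" of the individual trees ...
avg_proba = np.mean(
    [tree.predict_proba(X_test) for tree in forest.estimators_], axis=0
)
# ... and output the class with the highest average probability.
manual_prediction = forest.classes_[np.argmax(avg_proba, axis=1)]

print(forest.score(X_test, y_test))
```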

Gradient boosting
Powerful ensemble learning technique used for both regression and classification. Builds a series of weak
learners (decision trees) sequentially, where each new one improves on the previous ones
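
A sketch of the sequential improvement, assuming scikit-learn's GradientBoostingClassifier with shallow trees as weak learners. The dataset and parameter values are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=100,   # number of weak learners, added one after another
    learning_rate=0.1,  # how strongly each new tree corrects the previous ones
    max_depth=2,        # shallow trees keep each learner weak
    random_state=0,
).fit(X_train, y_train)

# Test accuracy after 1, 2, ..., 100 trees.
staged_accuracy = [
    np.mean(pred == y_test) for pred in gbm.staged_predict(X_test)
]

print(staged_accuracy[0], staged_accuracy[-1])
print(gbm.train_score_[0], gbm.train_score_[-1])  # training loss, first vs last stage
```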
Neural network (ANN) & multilayer perceptron (MLP)
ANN is a computational model inspired by the way biological neural networks in the human brain
process information
MLP is a type of ANN that consists of multiple layers of neurons
• input layer (receives input features from the dataset)
• hidden layer(s) (process the inputs received from the input layer)
• output layer (produces the final output of the network)
Tuning MLP
• number of hidden layers
• number of units in each layer.
• regularization
• K of input → which is important
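
A sketch of the tuning knobs, assuming scikit-learn's MLPClassifier: hidden_layer_sizes sets the number of hidden layers and the units in each, and alpha sets the regularization strength. The dataset, the layer sizes, and the MinMaxScaler step in front of the network are choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = make_pipeline(
    MinMaxScaler(),  # put every input feature on a 0 to 1 range
    MLPClassifier(
        hidden_layer_sizes=(10, 10),  # two hidden layers with 10 units each
        alpha=0.001,                  # L2 regularization strength
        max_iter=1000,
        random_state=0,
    ),
).fit(X_train, y_train)

net = mlp.named_steps["mlpclassifier"]
# One weight matrix per connection: input→hidden, hidden→hidden, hidden→output.
layer_shapes = [w.shape for w in net.coefs_]

print(layer_shapes)
print(mlp.score(X_test, y_test))
```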

Scaling
adjusting the range and distribution of numerical features in your dataset → ensure all features
contribute equally to the model
Important for SVM and neural networks
MinMaxScaler ensures all features are between 0 and 1
Applied before supervised ML
Train + test should be scaled the same way
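
A minimal sketch of scaling train and test the same way, assuming scikit-learn's MinMaxScaler: the scaler learns the minimum and maximum from the training set only and reuses them for the test set. The numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features on very different ranges.
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.5, 600.0]])

scaler = MinMaxScaler().fit(X_train)       # learn min and max from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # same min and max reused for the test set

print(X_train_scaled)
print(X_test_scaled)
```

Test values can fall outside 0 to 1 (here the second feature does), because the test set is scaled with the training set's range rather than its own.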
Dummy variables
Also called one-hot-encoding
If feature F has three values a, b & c → creates three new features Fa, Fb, Fc
A powerful tool in statistical modelling for incorporating categorical data
Makes new columns
Dummy variables for words
• one feature for each word
• value is 1 if the word occurs in the text, otherwise 0 → makes a new column for each word
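
A sketch of both kinds of dummy variables, assuming pandas' get_dummies for the categorical feature and plain Python for the words. The data is made up, and note that pandas names the new columns F_a, F_b, F_c.

```python
import pandas as pd

# Categorical feature F with the three values a, b and c.
df = pd.DataFrame({"F": ["a", "b", "c", "a"]})
dummies = pd.get_dummies(df, columns=["F"], dtype=int)
print(dummies)

# Dummy variables for words: one column per word, 1 if the word occurs in the text.
texts = ["good movie", "bad movie"]
vocabulary = sorted({word for text in texts for word in text.split()})
word_features = [
    [1 if word in text.split() else 0 for word in vocabulary] for text in texts
]
print(vocabulary)
print(word_features)
```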

Dummy classifier
Simple baseline model used to evaluate the performance of more complex models
Primary purpose is to provide a benchmark against which the performance of more advanced models
can be compared
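
A minimal baseline comparison, assuming scikit-learn's DummyClassifier with the "most_frequent" strategy. The dataset and the decision tree used as the "more complex" model are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: ignores the features and always predicts the majority training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
baseline_predictions = baseline.predict(X_test)

print(baseline_acc, model_acc)
```

A model is only worth its complexity if its score is clearly above the baseline's.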

Confusion matrix
Powerful tool for understanding the performance of a classification model.
Provides a detailed breakdown of correct and incorrect predictions
• true positives → number of instances that are correctly predicted positive
• false positives → number of instances that are incorrectly predicted positive
• true negatives → number of instances that are correctly predicted negative
• false negatives → number of instances that are incorrectly predicted negative
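
A small sketch of the breakdown, assuming scikit-learn's confusion_matrix; the label lists are made up.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

print(matrix)
print(tp, fp, tn, fn)
```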

Common questions

Logistic regression approaches the decision boundary by predicting the probability that an instance belongs to a particular class based on its attributes, which allows it to determine whether it falls on one side of the decision boundary or the other. Support vector classifiers (SVC), on the other hand, find the optimal decision boundary or hyperplane that maximizes the margin between classes in the feature space. SVC explicitly identifies the most critical points (support vectors) that influence the position of the boundary, potentially offering better control over class separation compared to logistic regression.

A decision tree model controls complexity and avoids overfitting by limiting the depth of the tree, setting a maximum number of leaves, and requiring a minimum number of data points in a node to keep splitting. These techniques prevent the model from fitting the training data too closely. The trade-off is that limiting the complexity may lead to underfitting if the model is too simplistic, thus not capturing the underlying data trends adequately.

Gradient boosting differs from other ensemble methods like random forests in its sequential approach to building models. While random forests construct multiple trees independently and combine their predictions, gradient boosting builds a series of weak learners (e.g., decision trees) sequentially, with each new model aiming to correct the errors of the previous ones. This iterative improvement leads to a more focused and potentially more accurate ensemble model, though it can also be more prone to overfitting due to its iterative nature.

Scaling features in machine learning is critical as it ensures that all features contribute equally to the model's training. This is particularly important for algorithms like Support Vector Machines (SVM) and neural networks, where the distance between data points (in SVM) or the gradients of weights (in neural networks) can significantly affect the model's ability to learn efficiently. Without scaling, features with larger ranges can disproportionately affect the model, leading to suboptimal learning.

Cross-validation is considered reliable for model evaluation because it involves splitting the dataset into multiple training and validation subsets. This process is repeated several times, and the model’s performance is averaged across these different subsets. This approach minimizes the variability and potential bias associated with a single train-test split and provides a more robust estimate of the model's performance on unseen data.

The train_test_split function contributes to a model's ability to generalize by dividing the data into a training set and a test set, typically in a 75% to 25% ratio. This ensures that the model is evaluated on unseen data (the test set) after being trained, which helps in assessing how well the model generalizes to new, unseen data beyond the training set. This is crucial because a model that performs well only on training data is likely overfitting and not generalizing well.

A random forest model improves prediction accuracy over a single decision tree by building multiple decision trees where each tree is trained on different subsets of the data. By averaging the predictions from many trees, random forests reduce the variance that a single decision tree might have, leading to lower risk of overfitting. The aggregation of diverse predictions from multiple trees enhances the model's robustness and overall accuracy.

When using accuracy as a metric for evaluating model performance on imbalanced datasets, the primary challenge is that it can be misleading. In imbalanced datasets, one class significantly outnumbers the other, leading the model to favor predicting the majority class. Thus, a model could achieve high accuracy by simply predicting the majority class most of the time, without truly understanding the data or making meaningful predictions about the minority class. Therefore, accuracy can give a false sense of performance strength in such contexts.

Setting a very high k value in the KNN algorithm increases the risk of underfitting because the model becomes too generalized, as it considers a large number of neighbors, which may include instances from different classes. This can lead to high bias, since the decision boundary may become too smooth, but reduced variance because the model is less sensitive to noise in the training data. The balance between bias and variance is crucial for the KNN model's performance.

SMOTE is particularly beneficial in scenarios where datasets are imbalanced with a minority class that has significantly fewer samples. This oversampling technique creates synthetic samples for the minority class to balance the class distribution, which helps improve model performance, reduce bias, and enhance generalization. However, its limitations include the risk of overfitting and the possibility that synthetic samples may not introduce enough variability, potentially leading the model to still generalize poorly in some cases.