
GET 307 Students Note-Machine Learning Workflow1


Machine learning workflow compiled by Dr. I.M. Simeon, ABE Department, FUTMINNA

Machine learning workflow


A machine learning (ML) workflow is a structured, step-by-step process for developing, training,
evaluating, and deploying machine learning models. It organizes the entire lifecycle of a model, from raw
data collection and preparation to a validated and deployable solution. By following a defined workflow,
machine learning projects become more reproducible, interpretable, and efficient. The ML workflow is
also commonly referred to as an ML pipeline, modeling lifecycle, or end-to-end machine learning process.

Figure 1: A typical machine learning workflow

Core Stages of a Machine Learning Workflow:

1. Problem Definition: Identifying the goal, business objective, and key metrics (e.g., accuracy).
2. Data Preparation: Initial processing of raw data to make it usable for analysis and modeling. This
includes data collection, cleaning (handling missing values, outliers, and errors), formatting, and basic
transformations.
3. Exploratory Data Analysis (EDA): Systematic exploration of the dataset to understand its structure
and behavior. It involves summary statistics, visualizations, and correlation analysis to identify
patterns, trends, anomalies, and relationships between variables.
4. Feature Engineering: Creation, selection, and transformation of input variables (features) to improve
model performance. Examples include normalization, encoding categorical variables, creating
interaction terms, and removing irrelevant or redundant features.
5. Data Splitting: Partitioning the dataset into separate subsets to ensure unbiased model assessment.
Typically divides data into training data and test data (and sometimes validation data).
6. Training Data: Subset of the data used to fit the model. The algorithm learns patterns and parameter
values exclusively from this data.
7. Test Data: An independent subset reserved for evaluating the trained model.
It is not seen by the model during training and is used to assess generalization performance.
8. Model Training: Application of a machine learning algorithm to the training data. The model adjusts
its parameters to minimize error or maximize performance based on a defined objective function.
9. Model Evaluation: Quantitative assessment of model performance using test data.
Common metrics include accuracy, precision, recall, RMSE, MAE, or R², depending on the problem
type.
10. Model Improvement: Iterative refinement of the model based on evaluation results.
May involve hyperparameter tuning, feature re-engineering, trying alternative algorithms, or
collecting additional data.
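
The splitting, training, and evaluation stages above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the synthetic data (y = 3x + 2 plus noise), the 80/20 split, and the helper names `fit_line` and `rmse` are all assumptions made for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the shuffle and the noise are reproducible

# Synthetic data (assumed for illustration): y = 3x + 2 plus small random noise
data = [(x, 3.0 * x + 2.0 + random.gauss(0, 0.5)) for x in range(50)]

# Stage 5: data splitting (80% training data, 20% test data)
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Stage 8: model training (least-squares straight line, fitted on training data only)
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(train)

# Stage 9: model evaluation (RMSE on the held-out test data)
def rmse(points, slope, intercept):
    squared_errors = [(y - (slope * x + intercept)) ** 2 for x, y in points]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

test_rmse = rmse(test, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, test RMSE={test_rmse:.2f}")
```

Because the test data never enters `fit_line`, the reported RMSE estimates how the model behaves on data it has not seen.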

Regression and Classification Foundations (Machine Learning)


In machine learning, regression and classification are the two core types of supervised learning problems.
In both cases, the model learns from labeled data, but the nature of the output differs.

Regression deals with predicting continuous numerical values. The goal is to learn the relationship
between input features and real-valued output. Typical examples include predicting crop yield, fuel
consumption, temperature, or exam scores. The model is evaluated using error-based metrics such as Mean
Squared Error (MSE), Mean Absolute Error (MAE), or R².
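
As a rough illustration of these regression metrics, the sketch below computes MSE, MAE, and R² by hand. The numbers in `y_true` and `y_pred` are made up for the example, and `regression_metrics` is a hypothetical helper name.

```python
def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n          # Mean Squared Error
    mae = sum(abs(e) for e in errors) / n          # Mean Absolute Error
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)           # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                       # coefficient of determination
    return mse, mae, r2

# Made-up observed and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, mae, r2)
```

Lower MSE and MAE mean smaller errors, while R² closer to 1 means the predictions explain more of the variation in the observed values.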

Classification focuses on assigning inputs to discrete categories or classes. The objective is to learn
decision boundaries that separate different classes based on the input features. Examples include
classifying soil as fertile or infertile, detecting equipment failure (yes/no), or identifying crop disease
types. Performance is assessed using metrics such as accuracy, precision, recall, F1-score, and confusion
matrices.

In summary, regression predicts “how much”, while classification predicts “which class.”
Understanding this distinction is fundamental before learning specific algorithms or model evaluation
techniques.

Distance-Based and Linear Models (Machine Learning)

In machine learning, models can be broadly grouped by how they learn patterns from data. Distance-
based models and linear models are two foundational categories.

Distance-based models make predictions by comparing a new data point to existing examples in the
dataset using a distance measure (such as Euclidean distance). The most common example is k-Nearest
Neighbors (kNN). In kNN, a data point is classified by the majority class of its k closest neighbors or,
for regression, predicted as the average of their values. These models are simple and intuitive but are
sensitive to feature scaling and become computationally expensive with large datasets.
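
A minimal sketch of kNN classification, written from scratch so that each step is visible. The soil-like points, the labels, and the function name `knn_predict` are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote among the labels of those k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up two-feature training data with two classes
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["infertile", "infertile", "infertile", "fertile", "fertile", "fertile"]

prediction = knn_predict(points, labels, (6.2, 6.4), k=3)
print(prediction)
```

For regression, the vote would be replaced by the average of the neighbors' target values. Note that every prediction scans the whole training set, which is why kNN becomes slow on large datasets.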
For more information about the above figure, see "How KNN works."

Read distance metrics used in KNN here. You will see examples of how to implement KNN in Python.
Study it. You can copy the code and run it in your Jupyter notebook or Colab.

K-Distance in the K-Nearest Neighbors (KNN) Algorithm

In the K-Nearest Neighbors (KNN) algorithm, the K-distance refers to the distance between a query
point and its 𝑘-th nearest neighbor in the training dataset. This distance helps determine which data points
are close enough to influence the final classification or regression decision. Smaller distances indicate
higher similarity between data points.

Steps to Calculate K-Distance in KNN

1. Define 𝐾: Choose the number of nearest neighbors to consider.

2. Calculate Distances: Compute the distance between the input query point and every point in the
training dataset.

3. Sort Distances: Arrange the calculated distances in ascending order.

4. Identify the 𝑘-th Distance: Select the distance value at the 𝑘-th position in the sorted list.

5. Select Neighbors: All points with distances less than or equal to this 𝑘-th distance are used for
classification or regression.
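
The five steps above can be sketched as follows. The training points and the helper name `k_distance` are hypothetical, and Euclidean distance is assumed.

```python
import math

def k_distance(train_points, query, k):
    # Step 2: distance from the query point to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Step 3: sort the distances in ascending order
    sorted_distances = sorted(distances)
    # Step 4: the k-th smallest distance (lists are 0-indexed, hence k - 1)
    kth = sorted_distances[k - 1]
    # Step 5: keep every point whose distance is <= the k-th distance
    neighbors = [p for p, d in zip(train_points, distances) if d <= kth]
    return kth, neighbors

# Step 1: choose K
k = 3
train_points = [(0, 0), (3, 4), (6, 8), (1, 0), (0, 2)]

kth, neighbors = k_distance(train_points, (0, 0), k)
print(kth, neighbors)
```

If several points are tied at the k-th distance, the `<=` rule in step 5 returns more than k neighbors.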

Common Distance Metrics

The distance between points can be computed using different metrics. The most common ones are:
1. Euclidean Distance (L2 Norm)
Straight-line distance between two points 𝑥 and 𝑦 in a 𝑑-dimensional space:

$$d(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$$

2. Manhattan Distance (L1 Norm)

Distance measured along grid lines (city-block distance):


$$d(x, y) = \sum_{j=1}^{d} |x_j - y_j|$$

3. Minkowski Distance

A general form that includes both Euclidean and Manhattan distances:

$$d(x, y) = \left( \sum_{j=1}^{d} |x_j - y_j|^p \right)^{1/p}$$

• When 𝑝 = 2: Euclidean distance

• When 𝑝 = 1: Manhattan distance
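
To see how the Minkowski form covers the other two metrics, the short sketch below evaluates it for p = 1 and p = 2 on one pair of points. The points and the function name `minkowski_distance` are chosen only for illustration.

```python
def minkowski_distance(x, y, p):
    # (sum over j of |x_j - y_j|^p) raised to the power 1/p
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1, 2), (4, 6)

manhattan = minkowski_distance(x, y, p=1)  # |1 - 4| + |2 - 6|
euclidean = minkowski_distance(x, y, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

The same pair of points is 7 units apart along grid lines but 5 units apart in a straight line, so the choice of p changes which neighbors count as closest.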

Numerical Example (𝑲 = 𝟑)

Suppose a query point has the following distances to five training points:
[2.5, 1.2, 3.8, 0.5, 4.2]

Sorted distances:
[0.5, 1.2, 2.5, 3.8, 4.2]

• 𝐾-th value (𝐾 = 3): 2.5


• Result: Any training point with distance ≤ 2.5 is considered a neighbor.

Implementation Example (Python – scikit-learn)


In Python, the NearestNeighbors class from scikit-learn can be used to compute the K-distance.

You can study and run the following code:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Sample training data
X = np.array([[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]])

# Query point
query_point = [[4, 5]]

k = 3

# Fit model
nn = NearestNeighbors(n_neighbors=k)
nn.fit(X)

# Find the distances to the k nearest neighbors (returned in ascending order)
distances, indices = nn.kneighbors(query_point)

# The last distance is the distance to the 3rd nearest neighbor
print(distances[0][-1])  # Outputs the K-distance

Linear models assume a linear relationship between input features and the output.

I. Linear regression predicts a continuous outcome by fitting a straight line (or hyperplane) that
minimizes prediction error.

II. Logistic regression, despite its name, is a classification algorithm. It models the probability that
an input belongs to a particular class using a sigmoid function and assigns a class based on a
probability threshold.
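
A minimal sketch of how logistic regression turns a linear score into a class. The weights and bias below are made-up values standing in for a trained model, and `predict_class` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    # Maps any real number to a value between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    # Linear part: weighted sum of the input features plus a bias
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    # Assign class 1 when the probability reaches the threshold, else class 0
    return probability, int(probability >= threshold)

# Made-up parameters (not learned from data)
weights, bias = [0.8, -0.4], -1.0

probability, label = predict_class([3.0, 1.0], weights, bias)
print(round(probability, 3), label)
```

Raising or lowering `threshold` changes which inputs are labelled positive without retraining the model.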

In essence, distance-based models rely on similarity between data points, while linear models learn
explicit mathematical relationships between inputs and outputs. Both are fundamental tools for
understanding more advanced machine learning methods.

Evaluation Metrics in Machine Learning

Evaluation metrics quantify how well a machine learning model performs, especially on unseen data. The
choice of metric depends on the problem and the consequences of different types of errors.

Accuracy measures the proportion of correct predictions out of all predictions made. It is simple and
intuitive, but can be misleading when classes are imbalanced, since a model can appear accurate by always
predicting the majority class.

Precision measures how many of the instances predicted as positive are actually positive. It answers the
question: When the model predicts “yes,” how often is it correct?

Recall measures how many of the actual positive instances were correctly identified. It answers: Of all
actual positives, how many did the model find? There is often a trade-off between precision and recall.

F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both, and
it is especially useful when dealing with imbalanced datasets where accuracy alone is insufficient.
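
The four metrics above can be computed directly from the counts of true and false positives and negatives. The labels below are invented for illustration, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Made-up labels: 4 actual positives and 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Here the model misses one positive and raises one false alarm, so accuracy is 0.8 while precision, recall, and F1 are each 0.75.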
ROC–AUC (Receiver Operating Characteristic – Area Under the Curve) evaluates a model’s ability
to distinguish between classes across all possible classification thresholds. An AUC value of 1 indicates
perfect discrimination, while 0.5 indicates no better than random guessing.
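
One way to make AUC concrete is its pairwise interpretation: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below uses that definition on made-up scores. It is an illustration, not how libraries necessarily compute it, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(y_true, scores):
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Made-up true labels and predicted scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)
```

Three of the four positive-negative pairs are ranked correctly here, giving an AUC of 0.75.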

In summary, accuracy gives overall correctness, precision focuses on the reliability of positive predictions,
recall focuses on how many actual positives are found, F1 balances precision and recall, and ROC–AUC
measures overall class separability.

Hyperparameter Tuning:

In machine learning, hyperparameters are parameters set before training, controlling the learning
process and influencing model behavior. The process of finding the best combination of
hyperparameters for a given model and dataset is called hyperparameter tuning or optimization.

Examples:

1. Epochs: An epoch is one complete pass of the entire training dataset through a machine
learning model during training. One epoch means that every sample in the training dataset
has been fed through the training model at least once. If your epochs are set to 50, for
example, the model you are training will work through the entire training dataset 50 times.
More epochs generally help the model fit the training data better, but only up to a point:
too many epochs can cause overfitting, where performance on unseen data gets worse. Increase
this number gradually while checking results on validation data, and stop once they no
longer improve.
2. Batch Size: A batch is a set of samples used in one iteration of training. It represents the
number of samples processed before the model's internal parameters are updated. For
example, let's say that you have 80 images and you choose a batch size of 16. This means
the data will be split into 80 / 16 = 5 batches. Once all 5 batches have been fed through the
model, exactly one epoch will be complete. You probably won't need to tweak this number
to get good training results.
3. Learning rate: Controls how large a step the model takes each time it adjusts its parameters
during training, and therefore how fast it learns. Be careful tweaking this number! Even small
differences can have a huge effect on how well your model learns.
4. Number of Hidden Layers: In a neural network, this hyperparameter controls the
complexity of the model.
5. Regularization Strength: Helps prevent overfitting by penalizing complex models.
6. Tree depth: In decision trees, this hyperparameter controls the complexity of the model.
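
To show how epochs, batch size, and learning rate interact, here is a minimal mini-batch gradient descent sketch that fits a single weight. The data (80 samples following y = 2x), the one-parameter model, and the specific hyperparameter values are assumptions for illustration.

```python
import random

random.seed(0)

# Made-up dataset: 80 samples that follow y = 2x exactly
xs = [random.uniform(-1, 1) for _ in range(80)]
data = [(x, 2.0 * x) for x in xs]

# Hyperparameters (set before training)
epochs = 50
batch_size = 16
learning_rate = 0.1

w = 0.0       # the model parameter that training adjusts
updates = 0   # counts how many times the parameter is updated

for epoch in range(epochs):
    random.shuffle(data)
    # 80 samples / batch size 16 = 5 batches per epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradient of the mean squared error with respect to w on this batch
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        # One parameter update per batch, scaled by the learning rate
        w -= learning_rate * grad
        updates += 1

batches_per_epoch = len(data) // batch_size
print(batches_per_epoch, updates, round(w, 4))
```

The parameter is updated once per batch, so 50 epochs of 5 batches give 250 updates in total. Trying a much smaller or much larger `learning_rate` in this sketch is a quick way to see slow convergence or unstable training.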

Common techniques for hyperparameter tuning include grid search, random search, and Bayesian
optimization.
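
Grid search and random search can be sketched as below. The function `validation_error` is a hypothetical stand-in for "train a model with these hyperparameters and measure its error on validation data"; its formula, the candidate values, and the number of random trials are all invented for the example. Bayesian optimization is not shown.

```python
import itertools
import random

def validation_error(learning_rate, max_depth):
    # Hypothetical stand-in for training and validating a real model.
    # By construction the error is lowest at learning_rate=0.1, max_depth=4.
    return (learning_rate - 0.1) ** 2 + 0.01 * (max_depth - 4) ** 2

# Grid search: evaluate every combination of the listed values
learning_rates = [0.001, 0.01, 0.1, 1.0]
max_depths = [2, 4, 6, 8]
grid_best = min(itertools.product(learning_rates, max_depths),
                key=lambda combo: validation_error(*combo))

# Random search: evaluate a fixed number of randomly sampled combinations
random.seed(0)
candidates = [(10 ** random.uniform(-3, 0), random.randint(2, 8))
              for _ in range(20)]
random_best = min(candidates, key=lambda combo: validation_error(*combo))

print("grid search best:", grid_best)
print("random search best:", random_best)
```

Grid search is exhaustive but its cost grows quickly with the number of hyperparameters, while random search covers the same ranges with a budget you choose.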
