Understanding Classification in Machine Learning

Classification is a supervised machine learning technique that involves training a model on labeled data and testing it to predict unseen data classes. It can be categorized into binary and multi-class classification, with various algorithms like KNN, Naïve Bayes, MLP, and SVM available for implementation. Evaluation metrics such as accuracy, precision, and recall are used to assess the performance of classification models.

Uploaded by

steve.martinez

MODULE 4: CLASSIFICATION

WHAT IS CLASSIFICATION?
Classification is a type of supervised machine learning that proceeds in two steps: training a
model on a labeled dataset, and then testing the model on a held-out batch of labeled data to
estimate how well it predicts the class of previously unseen instances. The dataset contains a
class (which we call the label) for each instance, and we use a machine learning algorithm to
train the model. During training, the algorithm learns how the underlying patterns in the data
are associated with the class each instance belongs to. After training, we test the model on a
small sample of labeled data that was not used for training. Testing gives us a fair idea of how
the model will perform on new data.

BINARY AND MULTI-CLASS CLASSIFICATION


A classification problem can be a binary classification or a multi-class classification. The
difference between the two is that there are only two output classes in binary classification, and
each instance in the dataset belongs to exactly one of them. Multi-class classification is a bit more
complicated because an instance can belong to one of more than two available classes.
Many efficient algorithms like Support Vector Machine (SVM), Logistic Regression (LR), and
Perceptrons are designed for binary classification and can be extended to multi-class classification
using heuristics. There are two strategies for doing that:

One v/s Rest:


This heuristic is commonly abbreviated as OvR, for example in Python libraries. The method splits
the multi-class problem into one binary problem per class, in which one side is a single class and
the other side is the set of all remaining classes. Let us understand this using sentiment
detection, which involves three classes: positive, negative, and neutral. In the OvR method, the
classification process will involve decisions that look like this:

[Positive] v/s [Negative, Neutral]? (or Is Positive?)

[Negative] v/s [Positive, Neutral]? (or Is Negative?)

[Neutral] v/s [Positive, Negative]? (or Is Neutral?)

So the multi-class classification problem is broken down into three binary classification problems.
The disadvantage of this approach is that it requires a separate binary classification model for
every class we want to predict. It will also be slow if the number of instances in the dataset or the
number of possible classes is very large.

The end product of each binary classification problem is a probability score, which denotes the
likelihood that the instance belongs to that class. Finally, the argmax of these scores (the class
with the maximum score) is used to predict the class of the instance.
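
The argmax step can be sketched in a few lines of Python. The scores below are made-up values standing in for the outputs of three already-trained binary models, and `ovr_predict` is a hypothetical helper name, not a library function.

```python
# Hypothetical probability scores from three one-vs-rest binary models
# for a single instance (illustrative values, not real model output).
scores = {"Positive": 0.15, "Negative": 0.70, "Neutral": 0.40}


def ovr_predict(class_scores):
    """Return the class whose 'is this class?' model gave the highest score."""
    return max(class_scores, key=class_scores.get)


predicted = ovr_predict(scores)
print(predicted)  # Negative
```

Note that the scores do not need to sum to 1, because each binary model is trained and scored independently of the others.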

One v/s One:


This method is more popularly known as OvO. Like OvR, it splits the multi-class classification
problem into multiple binary classification problems. However, in this method, each binary
classification problem compares exactly one class with one other class. Let us extend the
previous example of sentiment detection with the three classes Positive, Negative, and
Neutral. The binary classification problems created using the OvO method will look
like this:

[Positive] v/s [Negative]?

[Positive] v/s [Neutral]?

[Negative] v/s [Neutral]?

Each unordered pair of classes needs only one model, because [Negative] v/s [Positive] is the same
problem as [Positive] v/s [Negative]. So, for three classes, there will be three binary
classification models. For n classes, the number of models is given by a straightforward formula:

(n * (n - 1)) / 2

(which is the number of all possible pairs, given by nC2)

The downside of this method is that the number of models grows quadratically with the number of
classes, so for more than three classes OvO needs more models than OvR and is generally slower.
The final class label is predicted by voting: each binary model votes for one of its two classes,
and the class that collects the most votes is assigned to the instance.

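
As a minimal sketch of the pair count and the final prediction step, the snippet below enumerates the class pairs with `itertools.combinations` and tallies one vote per pairwise model. The `pairwise_winners` values are made-up stand-ins for the outputs of trained binary models.

```python
from collections import Counter
from itertools import combinations

classes = ["Positive", "Negative", "Neutral"]

# Every unordered pair of classes gets its own binary model: n * (n - 1) / 2 pairs.
pairs = list(combinations(classes, 2))
print(len(pairs))  # 3

# Hypothetical winners reported by the pairwise models for one instance.
pairwise_winners = {
    ("Positive", "Negative"): "Negative",
    ("Positive", "Neutral"): "Neutral",
    ("Negative", "Neutral"): "Negative",
}

# Each pairwise model casts one vote; the class with the most votes is predicted.
votes = Counter(pairwise_winners[pair] for pair in pairs)
predicted = votes.most_common(1)[0][0]
print(predicted)  # Negative
```
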
CLASSIFICATION ALGORITHMS
There are multiple algorithms that can perform classification, and they differ from each other in
their underlying mathematical approach. Each algorithm takes a different route, but all of them
ultimately answer a decision problem. Below we discuss some popular algorithms that can be used
to classify real-world data. For each one, we first introduce the algorithm, then give a brief
theoretical explanation, and finally show a code snippet that implements it in Python.

The code snippet given below loads the dataset and divides it into the train and the test sets. The
dataset we use here is the car evaluation dataset (see footnote 1), which has four output classes
based on the predictor attributes. Since all the features consist of string values, we need to
convert them to valid integer values.

## library imports required
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder

## loading data (file is the path to the dataset CSV)
df = pd.read_csv(file)

## in y we store the classes for each instance
## in X we take all the attributes/features for each instance
y = df['class']
X = df.iloc[:, :-1]

## splitting the X and y in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

## converting the features from string value to integer value
## (the encoder is fitted on the training set only and reused on the test set)
features = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]
ord_enc = OrdinalEncoder()
X_train = X_train.copy()
X_test = X_test.copy()
X_train[features] = ord_enc.fit_transform(X_train[features])
X_test[features] = ord_enc.transform(X_test[features])

## converting the labels to valid integer values
lab_enc = LabelEncoder()
y_train = lab_enc.fit_transform(y_train)
y_test = lab_enc.transform(y_test)

1 Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [Link]. Irvine, CA: University of
California, School of Information and Computer Science.

KNN
KNN stands for K-nearest neighbors, and it is mathematically one of the simplest classification
algorithms. In the KNN training phase, all data-points from the training data are stored in
n-dimensional feature space together with their class labels. During the testing phase, a new
data-point is placed in the same feature space, and its distance (typically the Euclidean distance)
to each labeled data-point is calculated. The new data-point is assigned the class that is most
common among the k labeled data-points nearest to it (the k-nearest neighbors).

Figure 1: Example of K-NN 2

Let us take a look at the image above for a better understanding. The new unlabelled data-point
(which we will encounter in the testing phase) is marked as the green circle. The class assigned to
the green circle depends on the value of k. If the value of k is three, then the green circle will be
classified as a red triangle (two of the three neighbors are red triangles), and if the value of k is
five, then the class assigned will be blue square (three out of five neighbors are blue squares).
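
The effect of k can be reproduced with a small pure-Python sketch. The coordinates below are made-up points chosen to mimic the figure (they are not taken from it), and `knn_predict` is a hypothetical helper that uses Euclidean distance and a majority vote.

```python
import math
from collections import Counter

# Hypothetical labeled 2-D training points: (x, y, label).
training = [
    (1.0, 0.0, "red triangle"),
    (0.0, 1.2, "red triangle"),
    (-1.5, 0.0, "blue square"),
    (0.0, -2.0, "blue square"),
    (2.0, 1.5, "blue square"),
]


def knn_predict(point, data, k):
    """Label `point` by majority vote among its k nearest neighbors (Euclidean)."""
    nearest = sorted(data, key=lambda row: math.dist(point, row[:2]))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]


new_point = (0.0, 0.0)  # plays the role of the green circle
print(knn_predict(new_point, training, k=3))  # red triangle
print(knn_predict(new_point, training, k=5))  # blue square
```

With k=3 the two closest red triangles outvote one blue square; with k=5 all three blue squares are included and the decision flips.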

The code snippet for KNN is given below:

## importing valid library
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

## declaring our model
KNN = KNeighborsClassifier()

## training the model with the training dataset
KNN.fit(X_train, y_train)

## predicting the labels with our model
prediction_KNN = KNN.predict(X_test)

## printing the evaluation metrics for our model
print(metrics.classification_report(y_test, prediction_KNN))

2 [Link]

The output of our print statement will look like:


              precision    recall  f1-score   support

           0       0.81      0.76      0.78        79
           1       0.71      0.50      0.59        10
           2       0.92      0.98      0.95       247
           3       1.00      0.30      0.46        10

    accuracy                           0.90       346
   macro avg       0.86      0.63      0.70       346
weighted avg       0.89      0.90      0.89       346

Naïve Bayes
Naïve Bayes uses Bayes' theorem of conditional probability, with the "naïve" assumption that every
pair of features is conditionally independent given the class. Let us assume that we have a red
fruit with a diameter of around 3 inches. The model treats each feature of the fruit (its color, its
size) as independent of the others, so each feature contributes individually to the probability
that the fruit is of a given type, and there is no need to relate the features to one another. Naïve
Bayes models are fast and easy to build, and they work well for large datasets with a
high-dimensional feature set.
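
A worked sketch of the independence assumption is given below. The fruit names, priors, and likelihoods are made-up illustrative numbers (the text does not supply any), and `naive_bayes_posteriors` is a hypothetical helper. It multiplies the prior of each class by the likelihood of every observed feature separately and then normalizes.

```python
# Hypothetical priors and per-feature likelihoods (illustrative numbers only).
priors = {"apple": 0.5, "cherry": 0.5}
likelihoods = {
    "apple": {"red": 0.6, "about_3_inches": 0.8},
    "cherry": {"red": 0.9, "about_3_inches": 0.01},
}


def naive_bayes_posteriors(observed, priors, likelihoods):
    """Multiply the prior by each feature likelihood independently, then normalize."""
    unnormalized = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[cls][feature]
        unnormalized[cls] = score
    total = sum(unnormalized.values())
    return {cls: score / total for cls, score in unnormalized.items()}


posteriors = naive_bayes_posteriors(["red", "about_3_inches"], priors, likelihoods)
best = max(posteriors, key=posteriors.get)
print(best)  # apple
```
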

The code for classification using Naïve Bayes in Python is given below:

## importing libraries
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

## declaring our model
GNB = GaussianNB()

## training our model
GNB.fit(X_train, y_train)

## predicting labels with our model
prediction_GNB = GNB.predict(X_test)

## printing evaluation metrics for our model
print(metrics.classification_report(y_test, prediction_GNB))

The output of the last print statement will look like this:

              precision    recall  f1-score   support

           0       0.47      0.11      0.18        79
           1       0.00      0.00      0.00        10
           2       0.86      0.77      0.81       247
           3       0.10      1.00      0.17        10

    accuracy                           0.60       346
   macro avg       0.36      0.47      0.29       346
weighted avg       0.72      0.60      0.63       346

MLP
MLP stands for Multi-Layer Perceptron, and to understand how this algorithm works, you will need
to know what a perceptron is. A perceptron is a simplified model of a biological neuron (a nerve
cell) and is a simple binary classifier. The input features are multiplied by weights and summed,
and a linear function of this weighted sum is used for classification. A multi-layer perceptron is a
combination of many perceptrons organized in multiple layers, with at least one hidden layer,
created to perform multi-class classification. The goal is not to recreate a structure as complex as
the human brain, but to borrow the way decision-making works in natural neural networks and use
the same logic to perform predictive analysis with artificial neural networks.

The inputs to a perceptron are the feature values, a weight for each input feature, and a bias. The
weighted sum of the inputs plus the bias is passed through an activation function to generate an
output. The activation function is commonly a sigmoid function, which outputs a value between 0
and 1, or a hyperbolic tangent function (called tanh), which outputs a value between -1 and +1. The
weights are generally small float values (for example, initial values between 0 and 0.3), and the
bias input is 1.
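
The forward pass of a single perceptron can be sketched as follows. The feature values, weights, and bias weight are made-up illustrative numbers, and `perceptron_output` is a hypothetical helper, not part of any library.

```python
import math


def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron_output(features, weights, bias_weight, activation):
    """Weighted sum of the inputs plus the bias term, passed through an activation."""
    bias_input = 1.0  # the bias input is fixed at 1
    z = sum(x * w for x, w in zip(features, weights)) + bias_input * bias_weight
    return activation(z)


# Hypothetical inputs and small weights (illustrative values only).
features = [0.5, -1.0, 2.0]
weights = [0.1, 0.2, 0.3]
bias_weight = 0.05

out_sigmoid = perceptron_output(features, weights, bias_weight, sigmoid)
out_tanh = perceptron_output(features, weights, bias_weight, math.tanh)
print(out_sigmoid)  # a value between 0 and 1
print(out_tanh)     # a value between -1 and +1
```
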

The code for classification using an MLP is given below:

## importing libraries
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

## declaring our model
clf = MLPClassifier(random_state=1, max_iter=300)

## training our model
clf.fit(X_train, y_train)

## predicting labels with our model
prediction_MLP = clf.predict(X_test)

## printing evaluation metrics of our model
print(metrics.classification_report(y_test, prediction_MLP))

The output of the last print statement will look like this:

              precision    recall  f1-score   support

           0       0.89      0.96      0.93        79
           1       0.77      1.00      0.87        10
           2       0.99      0.96      0.97       247
           3       1.00      0.80      0.89        10

    accuracy                           0.96       346
   macro avg       0.91      0.93      0.91       346
weighted avg       0.96      0.96      0.96       346

SVM
SVM, or Support Vector Machine, is a fast, dependable algorithm that can perform very well even
when the dataset is of a limited size. SVM is a non-probabilistic classifier that places the labeled
data in n-dimensional space and tries to separate the classes of data-points using a hyperplane. A
new data-point is placed in the same space and is assigned a class based on the side of the
hyperplane on which it falls.

Figure 2: SVM hyperplanes 3

The image above shows how hyperplanes are drawn in a 2-dimensional XY-plane. H1 is not a good
hyperplane because it does not separate the two classes of data-points. H2 separates the
data-points but leaves only a very small gap (margin) to the nearest black and white points, so the
chance of misclassifying new points is high. H3 is the best-suited hyperplane because it separates
the data-points with the largest margin.
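
The idea of preferring the hyperplane with the larger gap can be illustrated with a short sketch. The points and the two candidate lines below are made-up values (they are not read off the figure), and `margin` is a hypothetical helper that returns the smallest point-to-line distance. This only compares two given lines; it is not how an SVM is trained.

```python
import math

# Hypothetical 2-D points for two classes (illustrative values only).
black = [(3.0, 3.0), (4.0, 2.5)]
white = [(1.0, 1.0), (0.5, 2.0)]


def distance_to_line(point, w, b):
    """Distance from a point to the line w[0]*x + w[1]*y + b = 0."""
    return abs(w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])


def margin(w, b):
    """Smallest distance from any training point to the separating line."""
    return min(distance_to_line(p, w, b) for p in black + white)


# Two candidate lines that both separate the classes.
narrow = ((1.0, 0.0), -1.2)  # vertical line x = 1.2, close to a white point
wide = ((1.0, 1.0), -4.0)    # line x + y = 4, roughly midway between the classes

print(margin(*narrow) < margin(*wide))  # True
```
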

The Python code for classification using SVM is given below:

## importing libraries
from sklearn.svm import SVC
from sklearn import metrics

## declaring our model
SVM = SVC()

## training our model
SVM.fit(X_train, y_train)

## predicting labels with our model
prediction_SVM = SVM.predict(X_test)

## printing evaluation metrics of our model
print(metrics.classification_report(y_test, prediction_SVM))

3 [Link]

The output of the last print statement will look like this:

              precision    recall  f1-score   support

           0       0.85      0.77      0.81        79
           1       0.83      0.50      0.62        10
           2       0.93      0.98      0.95       247
           3       1.00      1.00      1.00        10

    accuracy                           0.92       346
   macro avg       0.90      0.81      0.85       346
weighted avg       0.91      0.92      0.91       346

EVALUATION METRIC
Classification is done on a labeled dataset that is split into train and test sets in a ratio of
approximately 4:1 (80% training, 20% test). The labels may come as part of the original dataset; in
many cases, however, the instances are assigned a class by human annotators, for example when
labeling the sentiment of tweets or the rating of movies. These annotations are considered the
ground truth, and the labels generated by our machine learning models are compared to the
human-generated truth values to compute various metrics. To understand evaluation metrics,
you must understand a few essential terms. Let us take an example of a binary classification
problem where you have to assign each instance to either a Positive or a Negative class. Let us
look at the table given below:

                     Positive (predicted)    Negative (predicted)
Positive (actual)             48                       2
Negative (actual)              6                      44

This table is known as the Confusion Matrix. The rows represent the ground truth, and the columns
represent the predictions by our classifier. From the above table, we know that our model has
correctly predicted 48 Positive instances as Positive and 44 Negative instances as Negative.
There are two Positive instances that are wrongly classified as Negative and six Negative
instances that are wrongly classified as Positive.
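
To show how such a table is built, here is a minimal pure-Python sketch that counts each (actual, predicted) pair. The two label lists are made-up illustrative values and are much shorter than the 100 instances in the table above.

```python
from collections import Counter

# Hypothetical ground-truth and predicted labels (illustrative values only).
y_true = ["Positive", "Positive", "Negative", "Negative", "Positive", "Negative"]
y_pred = ["Positive", "Negative", "Negative", "Positive", "Positive", "Negative"]

# Count each (actual, predicted) pair.
counts = Counter(zip(y_true, y_pred))

labels = ["Positive", "Negative"]
# Rows are actual classes, columns are predicted classes.
matrix = [[counts[(actual, predicted)] for predicted in labels] for actual in labels]
print(matrix)  # [[2, 1], [1, 2]]
```
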

                     Positive (predicted)    Negative (predicted)
Positive (actual)    True Positive (TP)      False Negative (FN)
Negative (actual)    False Positive (FP)     True Negative (TN)

True Positive are the correctly classified Positive instances, and True Negative are the correctly
classified Negative instances.
False Positive are the Negative instances that are wrongly classified as Positive and False Negative
are the Positive instances that are improperly classified as Negative.
The first and most basic evaluation metric is Accuracy. It denotes the fraction of all instances
that our model classifies correctly, and the formula for accuracy is:
Accuracy = Number of Correct Predictions/ Total number of Instances
For binary classification, we can also modify the above formula as:
Accuracy = (TP + TN)/ (TP + TN + FP + FN)
Precision is another important metric that measures the proportion of instances predicted as
Positive that are actually Positive. In other words, it tells us what fraction of the model's
Positive predictions are correct. The formula to find Precision is:
Precision = TP / (TP + FP)
Recall can be defined as the ratio of the correctly classified Positive labels to the total number of
actual Positive labels present in the dataset. It tells us what fraction of the actual Positive
instances in the ground truth our classifier finds. The formula for Recall is:
Recall = TP / (TP + FN)
F-score is another important metric that acts as a balance between Precision and Recall. It is
mathematically expressed as the harmonic mean of Precision and Recall. The formula for F-score
can be given as follows:
F-score = 2 * (Precision * Recall) / (Precision + Recall)
This can also be expressed as:
F-score = TP / [TP + 0.5 (FP + FN)]
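
As a worked example, the formulas can be applied to the counts from the confusion matrix shown earlier (TP = 48, FN = 2, FP = 6, TN = 44). The variable names are chosen here for illustration.

```python
# Counts taken from the example confusion matrix.
tp, fn = 48, 2
fp, tn = 6, 44

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_score = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f_score={f_score:.3f}")
# accuracy=0.920 precision=0.889 recall=0.960 f_score=0.923
```

Both forms of the F-score formula give the same value here: 48 / (48 + 0.5 * (6 + 2)) = 48 / 52, which is approximately 0.923.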
You can calculate any of these metrics from the confusion matrix of your model. In Python, you
can obtain it using the following code:

from sklearn.metrics import confusion_matrix

## y_true is the ground truth, and y_pred is the predicted labels
confusion_matrix(y_true, y_pred)
