Distance Measure in Machine Learning
Distance measures are a core concept in machine learning, particularly in clustering, classification,
and information retrieval tasks. They quantify the similarity or dissimilarity between two data
points. The appropriate distance measure depends on the nature of the data and on the
application.
Types of Distance Measures:
1. Metric Distance Measures
2. Non-Metric Distance Measures
(a) Metric Distance Measures
Euclidean Distance: Euclidean distance is a fundamental distance metric used in Machine Learning
(ML) and various other fields. It represents the straight-line distance between two points in a
multidimensional space.
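As a minimal sketch of the definition above, the straight-line distance can be computed as the square root of the summed squared coordinate differences. The function name `euclidean_distance` is illustrative and not taken from the original notes.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points given as equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Square each coordinate difference, sum them, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The example uses the classic 3-4-5 right triangle, so the result is easy to check by hand.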
Manhattan Distance: Manhattan distance (also known as L1 distance or Taxicab distance) is a
metric used to measure the distance between two points by summing the absolute differences of
their coordinates.
For two points $A = (x_1, x_2, ..., x_n)$ and $B = (y_1, y_2, ..., y_n)$ in an n-dimensional space,
the Manhattan distance is calculated as:
$$d(A, B) = \sum_{i=1}^{n} |x_i - y_i|$$
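A short sketch of this calculation, assuming points are passed as equal-length sequences of numbers (the function name `manhattan_distance` is illustrative):

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan_distance((0, 0), (3, 4)))  # 7
```

For the same pair of points, the Euclidean distance is 5, while the Manhattan distance is 7, because movement is restricted to the coordinate axes.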
Minkowski Distance in Machine Learning
Minkowski distance is a generalized distance metric that includes Euclidean distance and Manhattan
distance as special cases. It is widely used in machine learning, especially in clustering and
classification algorithms.
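Formula
For two points $A = (x_1, x_2, ..., x_n)$ and $B = (y_1, y_2, ..., y_n)$ in an n-dimensional space,
the Minkowski distance of order p (with p ≥ 1) is defined as:
$$d(A, B) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$
Setting p = 1 gives the Manhattan distance, and p = 2 gives the Euclidean distance.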
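The following sketch illustrates how one function with an order parameter `p` can reproduce both special cases. The function name and the check that `p` is at least 1 are assumptions made for this example.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1 for the result to be a metric")
    # Raise each absolute difference to the power p, sum, then take the p-th root.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(minkowski_distance((0, 0), (3, 4), p=1))  # 7.0 (Manhattan)
print(minkowski_distance((0, 0), (3, 4), p=2))  # 5.0 (Euclidean)
```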
(b) Non-Metric Distance Measures
Non-metric measures violate at least one property of a metric (non-negativity, identity,
symmetry, or the triangle inequality), most often the triangle inequality or symmetry. They are
mostly used where relationships between objects are not strictly numerical but are instead based
on ranks or qualitative properties.
Hamming Distance in Machine Learning
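Hamming distance counts the number of positions at which two strings of equal length differ. It is
primarily used for categorical or binary data. Note that on strings of equal length it satisfies all
of the metric properties, even though it compares symbols rather than numerical coordinates.
Formula
For two strings or binary vectors $A = (a_1, ..., a_n)$ and $B = (b_1, ..., b_n)$ of equal length n,
the Hamming distance is calculated as:
$$d_H(A, B) = \sum_{i=1}^{n} [a_i \neq b_i]$$
where $[a_i \neq b_i]$ is 1 if the two symbols differ and 0 otherwise.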
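A minimal sketch of this count, assuming the inputs are strings or sequences of equal length (the function name `hamming_distance` is illustrative):

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # Each comparison yields True (1) or False (0), so the sum counts mismatches.
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("1011101", "1001001"))  # 2
print(hamming_distance("karolin", "kathrin"))  # 3
```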
Cosine Similarity in Machine Learning
Cosine similarity measures the similarity between two vectors in an inner product space. It is
the cosine of the angle between the two vectors, so it reflects their directional alignment
rather than differences in magnitude. It is a similarity score, not a distance metric.
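As a sketch, the cosine of the angle can be computed as the dot product of the two vectors divided by the product of their lengths. The function name and the decision to raise an error for zero vectors are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Perpendicular vectors have similarity 0.
print(cosine_similarity((1, 0), (0, 1)))  # 0.0
# Vectors pointing the same way score close to 1, regardless of their length.
print(cosine_similarity((1, 2), (2, 4)))
```

The second call shows the point made above: doubling a vector changes its magnitude but not its direction, so the similarity stays at (approximately, due to floating-point rounding) 1.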
Proximity Between Binary Patterns
Proximity measures are used to quantify the similarity or dissimilarity between two binary patterns
(bit strings). In machine learning and pattern recognition, these measures help in clustering,
classification, and data retrieval tasks.
When comparing binary data (where elements take values 0 or 1), we use distance (dissimilarity)
measures or similarity measures to evaluate how close or far two binary patterns are.
Application:
Used in error detection and correction (e.g., in digital communications).
Used in DNA sequence comparison.
Applied in cryptography and data clustering.
Application:
Used in categorical data analysis.
Applied in clustering binary data in machine learning.
Application:
Used in text analysis and document similarity (e.g., comparing word sets).
Applied in clustering categorical data.
Popular in recommendation systems (e.g., collaborative filtering).
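To make the idea of proximity between binary patterns concrete, the sketch below computes two commonly used similarity measures for 0/1 vectors: the simple matching coefficient and the Jaccard coefficient. The text above does not name the specific measures it has in mind, so these two choices, as well as all function names, are assumptions made for illustration.

```python
def match_counts(a, b):
    """Return (f11, f10, f01, f00) for two equal-length 0/1 sequences."""
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    f11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    f10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    f01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    f00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return f11, f10, f01, f00

def simple_matching(a, b):
    """Fraction of positions where both patterns agree (1-1 or 0-0)."""
    f11, f10, f01, f00 = match_counts(a, b)
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def jaccard(a, b):
    """Shared 1s divided by positions where at least one pattern has a 1."""
    f11, f10, f01, _ = match_counts(a, b)
    if f11 + f10 + f01 == 0:
        raise ValueError("Jaccard is undefined when both patterns are all zeros")
    return f11 / (f11 + f10 + f01)

a = [1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 1, 0, 0]
print(match_counts(a, b))     # (2, 1, 1, 2)
print(simple_matching(a, b))  # 4 of 6 positions agree
print(jaccard(a, b))          # 0.5
```

The two scores differ because the Jaccard coefficient ignores positions where both patterns are 0, which matters when 0 simply means "absent" (for example, a word that appears in neither document).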
K-Nearest Neighbor (K-NN) Classifier
Definition:
K-Nearest Neighbor (K-NN) is a supervised machine learning algorithm used for classification and
regression tasks. It is a non-parametric and instance-based learning algorithm that classifies new
data points based on the majority vote of their "K" nearest neighbors.
Unlike models that fit an explicit function during training, K-NN stores the entire training
dataset and defers all computation to prediction time, classifying new instances by comparing
them to the stored ones.
How K-NN Works (Step-by-Step Explanation)
1. Choose the number of neighbors (K):
o Select the value of K, which determines how many closest points will be
considered.
2. Compute the distance between the new data point and existing points:
o Common distance measures include Euclidean Distance, Manhattan Distance, and
Minkowski Distance.
3. Find the K-nearest neighbors:
o Identify the K closest data points from the training dataset.
4. Assign a class based on majority voting:
o The class that occurs most often among the K neighbors is assigned to the new data point.
5. Return the predicted class.
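The five steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: it uses Euclidean distance via `math.dist`, the function name `knn_predict` and the toy data are made up for the example, and ties in the vote are resolved in favor of the label that appears first among the nearest neighbors.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Step 2: distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest points.
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # Steps 4 and 5: majority vote, then return the winning class.
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, query=(2, 2), k=3))  # A
print(knn_predict(points, labels, query=(6, 5), k=3))  # B
```

Note that there is no training step: all the work happens inside `knn_predict`, which is why prediction cost grows with the size of the stored dataset.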
Performance Measures for Classifiers
When evaluating a classification model, we use performance measures to assess how well it predicts
the correct class labels. Two essential performance metrics are:
1. Classification Accuracy
2. Confusion Matrix
These metrics help in understanding the model's strengths and weaknesses.
Limitations of Accuracy:
Accuracy is misleading in imbalanced datasets.
Example: In a medical test for a rare disease, if 98 out of 100 people are healthy, a model
that predicts "No Disease" for everyone will be 98% accurate but useless.
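The rare-disease example can be reproduced with a small sketch. Accuracy is taken here as the fraction of correct predictions, and the confusion matrix is reported as four counts for a chosen positive class. The function names and the label strings are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# 98 healthy people, 2 with the disease; the model always predicts "healthy".
y_true = ["healthy"] * 98 + ["disease"] * 2
y_pred = ["healthy"] * 100

print(accuracy(y_true, y_pred))                              # 0.98
print(confusion_counts(y_true, y_pred, positive="disease"))  # (0, 0, 2, 98)
```

The confusion counts expose what the accuracy figure hides: zero true positives and two false negatives, meaning every actual disease case is missed.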
Performance of Regression Algorithms
Regression models predict continuous values (e.g., predicting house prices, stock prices, or
temperature). The performance of regression algorithms is evaluated using different error metrics,
which measure how far the predicted values are from the actual values.
Two important performance metrics for regression are:
1. Mean Absolute Error (MAE)
2. Mean Squared Error (MSE)
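Mean Absolute Error (MAE)
MAE is the average of the absolute differences between the actual values $y_i$ and the predicted
values $\hat{y}_i$ over n samples:
$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Interpretation: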
If MAE = 0, the model makes perfect predictions.
A lower MAE means better model performance.
MAE is in the same unit as the target variable, making it easy to interpret.
Advantages of MAE:
✔️ Simple and easy to understand.
✔️ Does not heavily penalize large errors.
Disadvantages of MAE:
❌ Treats all errors equally, even large ones.
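Mean Squared Error (MSE)
MSE is the average of the squared differences between the actual values $y_i$ and the predicted
values $\hat{y}_i$ over n samples:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Interpretation: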
If MSE = 0, the model makes perfect predictions.
A lower MSE means better model performance.
MSE is more sensitive to large errors than MAE.
Advantages of MSE:
✔️ Penalizes large errors more, making it useful in applications where big mistakes are costly.
Disadvantages of MSE:
❌ Not in the same unit as the target variable (since errors are squared).
❌ Highly sensitive to outliers.
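The difference in outlier sensitivity between the two metrics can be seen in a small sketch. MAE averages the absolute errors and MSE averages the squared errors. The function names and the toy numbers are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]          # errors: 2, 2, 3, 3
y_pred_outlier = [12, 18, 33, 60]  # errors: 2, 2, 3, 20

print(mean_absolute_error(y_true, y_pred))          # 2.5
print(mean_squared_error(y_true, y_pred))           # 6.5
print(mean_absolute_error(y_true, y_pred_outlier))  # 6.75
print(mean_squared_error(y_true, y_pred_outlier))   # 104.25
```

A single bad prediction raises MAE from 2.5 to 6.75 but raises MSE from 6.5 to 104.25, which is the squaring effect described above.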