Robust Mahalanobis Distance in Outlier Detection

The document discusses outlier detection using robust Mahalanobis distance. It begins by defining outliers and their effects. It then discusses Mahalanobis distance as a multi-dimensional generalization of measuring standard deviations from the mean that accounts for correlations. The document outlines robust estimators like the Minimum Covariance Determinant estimator and Rocke estimator that are less sensitive to outliers. It provides the algorithm for the Rocke estimator, which iteratively updates weights to obtain a robust mean vector and covariance matrix.

Uploaded by

Abhi S Gowda

OUTLIER DETECTION USING ROBUST MAHALANOBIS DISTANCE
SEMINAR 2
APARNA BHIDE, M.
PALB 7187
OUTLIERS

◈ An outlier is a data point that drastically affects the estimates and the assumptions of an analysis.
◈ Outliers can cause serious problems in statistical analysis, so detecting them is important.
◈ Not every data point that lies far from the remaining points is necessarily an outlier; such points can be very important for understanding the inherent characteristics of the data set.

CAUSES

◈ Data entry errors (human errors)
◈ Measurement errors (instrument errors)
◈ Experimental errors (data extraction or experiment planning/executing errors)
◈ Intentional (dummy outliers made to test detection methods)
◈ Data processing errors (data manipulation or data set unintended mutations)
◈ Sampling errors (extracting or mixing data from wrong or various sources)
◈ Natural (not an error, novelties in data)

EFFECT OF OUTLIERS

◈ All important and commonly used measures are affected by the presence of outliers.
◈ Outliers can distort the structure of the true model by which the data has been generated.
◈ An outlier might drastically affect the estimates of the model parameters.

METHODS OF DETECTION

One-dimensional data
◈ Three standard deviations of standardized variates
◈ Box and whisker plots, etc.
◈ Chauvenet’s criterion

High-dimensional data
◈ Leverage values
◈ Mahalanobis distance
◈ Subspace and correlation based techniques

MAHALANOBIS DISTANCE

◈ The Mahalanobis distance, introduced by P. C. Mahalanobis in 1936, measures distance relative to the centroid, a base or central point that can be thought of as the overall mean of multivariate data.
◈ It is a multi-dimensional generalization of the idea of measuring how many standard deviations a point P lies from the mean of a distribution.
◈ This distance is zero if P is at the mean, and grows as P moves away from the mean along each principal component axis.
◈ If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to the standard Euclidean distance in the transformed space.
◈ The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.

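The equivalence with Euclidean distance after rescaling can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the simulated data, the mixing matrix and the variable names are illustrative and not from the seminar.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D sample; the mixing matrix is an arbitrary illustration.
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False)
diff = X - mu

# Mahalanobis distance of every row, computed from its definition.
d_mahal = np.sqrt(np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff))

# Rotate onto the principal component axes, then rescale each axis to unit variance.
eigvals, eigvecs = np.linalg.eigh(sigma)
Z = (diff @ eigvecs) / np.sqrt(eigvals)

# Plain Euclidean distance in the transformed space.
d_euclid = np.linalg.norm(Z, axis=1)
max_gap = float(np.max(np.abs(d_mahal - d_euclid)))
print(max_gap)  # expected to be at floating-point noise level
```

The two distance vectors should agree up to rounding, and the covariance of `Z` should be the identity matrix.
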
PRINCIPLE

◈ It represents a covariance-based distance of the data, which can effectively estimate the similarity of sample sets.
◈ For a p-dimensional observation $x = (x_1, x_2, x_3, \ldots, x_p)^T$ with mean vector $\mu = (\mu_1, \mu_2, \mu_3, \ldots, \mu_p)^T$ and covariance matrix $\Sigma$, the Mahalanobis distance is defined as follows:

$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

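As a small illustration of the definition, the sketch below evaluates the distance for two points under a diagonal covariance matrix. It assumes NumPy; the function name `mahalanobis_distance` and the example values are hypothetical.

```python
import numpy as np

def mahalanobis_distance(x, mu, sigma):
    """Return sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solve Sigma z = diff instead of forming the explicit inverse.
    z = np.linalg.solve(np.asarray(sigma, dtype=float), diff)
    return float(np.sqrt(diff @ z))

mu = np.array([0.0, 0.0])
sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])   # variance 4 on axis 1, variance 1 on axis 2

d_a = mahalanobis_distance([2.0, 0.0], mu, sigma)  # one standard deviation along axis 1
d_b = mahalanobis_distance([0.0, 2.0], mu, sigma)  # two standard deviations along axis 2
print(d_a, d_b)
```

Both points are at Euclidean distance 2 from the mean, yet the second is twice as far in Mahalanobis terms because its axis has the smaller variance.
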
MD IN OUTLIER DETECTION

◈ For a sample $X_{n,p}$ of size n and dimension p, we first calculate the mean vector μ and covariance matrix Σ of the sample, and then calculate the Mahalanobis distance of each observation.
◈ To decide whether a point is an outlier, a threshold is needed. For multivariate normal data, the squared Mahalanobis distance approximately follows a chi-square distribution with p degrees of freedom.
◈ Therefore, given a significance level α, an observation with $D_i^2 > \chi^2_p(1-\alpha)$ is flagged as an outlier; otherwise it is not.

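The thresholding rule can be sketched as follows, assuming NumPy and SciPy (`scipy.stats.chi2.ppf` supplies the chi-square quantile). The function name `classical_md_outliers`, the simulated data and the planted outlier are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

def classical_md_outliers(X, alpha=0.01):
    """Flag rows whose squared classical Mahalanobis distance exceeds chi2_p(1 - alpha)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    mu = X.mean(axis=0)                      # classical mean vector
    sigma = np.cov(X, rowvar=False)          # classical covariance matrix
    diff = X - mu
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
    threshold = chi2.ppf(1.0 - alpha, df=p)
    return d2, threshold, d2 > threshold

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
X[0] = [8.0, -8.0, 8.0]                      # one planted gross outlier
d2, threshold, flags = classical_md_outliers(X, alpha=0.01)
print(threshold, int(flags.sum()), bool(flags[0]))
```

Because the mean and covariance here are the classical estimates, the planted point also inflates them; with many outliers this is what motivates the robust estimators discussed later.
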
◈ Mahalanobis distance identifies observations that lie far away from the centroid of the data cloud, giving less weight to variables with large variances or to groups of highly correlated variables.
◈ This distance is often preferred to the Euclidean distance, which ignores the covariance structure and treats all variables equally.
◈ A test statistic for $D_i^2$ can be given as follows; it has an approximate F distribution with p and n − p degrees of freedom.

$$F = \frac{(n - p)\, n}{(n^2 - 1)\, p}\, D_i^2$$

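The statistic is a simple rescaling of the squared distance. A minimal sketch in plain Python, where the function name `md_f_statistic` and the input values are hypothetical:

```python
def md_f_statistic(d2, n, p):
    """Rescale a squared Mahalanobis distance: (n - p) * n * d2 / ((n^2 - 1) * p)."""
    if n <= p:
        raise ValueError("the statistic needs n > p")
    return (n - p) * n * d2 / ((n ** 2 - 1) * p)

# Illustrative values only: a squared distance of 12 with n = 100 and p = 6.
f_value = md_f_statistic(d2=12.0, n=100, p=6)
print(f_value)
```

The result would then be compared with an upper quantile of the F distribution with p and n − p degrees of freedom.
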
Why robust methods

The classical Mahalanobis distance is based on the sample mean vector and the sample covariance matrix.
Since the classical mean vector and covariance matrix estimates are sensitive to outliers, the classical Mahalanobis distance is also sensitive to outliers.
The purpose of robust estimation is to produce an efficient estimator in the presence of outliers, while minimizing bias.
◈ This is done by reducing the influence of the outliers on the estimator.
◈ To evaluate robust estimators, the usual properties such as bias and precision are of interest.
◈ The breakdown point is the smallest fraction of contamination that can cause the bias of the estimator to become infinitely large, at which point the estimator is said to break down.

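A one-dimensional illustration of breakdown, using only the Python standard library: the sample mean is ruined by a single bad value, while the median stays bounded until about half the sample is replaced. The data and the helper `contaminate` are made up for this sketch.

```python
import statistics

clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7]

def contaminate(data, bad_value, k):
    """Replace the first k observations with bad_value."""
    return [bad_value] * k + data[k:]

for k in (0, 1, 3, 4):
    sample = contaminate(clean, 1e6, k)
    # The mean is carried away as soon as k >= 1;
    # the median only follows once more than half the values are bad (k = 4 of 7).
    print(k, statistics.mean(sample), statistics.median(sample))
```
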
Types of robust estimators

Many robust estimation methods have been proposed for the mean vector and covariance matrix, such as
▪ Stahel-Donoho (SD) estimator
▪ Minimum Volume Ellipsoid (MVE) estimator
▪ Minimum Covariance Determinant (MCD) estimator
▪ S estimator
▪ Rocke estimator

MCD Estimator

◈ One of the first affine equivariant and highly robust estimators of multivariate location and scatter.
◈ The idea is to use a trimmed scale for $\hat{\sigma}$.
◈ Let $d_{(1)}(\hat{\mu}, \hat{\Sigma}) \le d_{(2)}(\hat{\mu}, \hat{\Sigma}) \le \cdots \le d_{(n)}(\hat{\mu}, \hat{\Sigma})$ be the ordered squared distances $d^2(y_i, \hat{\mu}, \hat{\Sigma})$, and for $1 \le h < n$ define the trimmed scale as

$$\hat{\sigma} = \sum_{i=1}^{h} d_{(i)}$$

◈ y(MCD) = sub-sample of $n/2 \le h < n$ observations whose covariance matrix has the smallest determinant.

$$\hat{\mu}_{(MCD)} = \frac{1}{h} \sum_{i \in y_{(MCD)}} y_i$$

$$\hat{\Sigma}_{(MCD)} = \frac{k_{MCD}(h, n, v)}{h - 1} \sum_{i \in y_{(MCD)}} \left(y_i - \hat{\mu}_{(MCD)}\right)\left(y_i - \hat{\mu}_{(MCD)}\right)^T$$

◈ The proportionality term $k_{MCD}(h, n, v)$ is crucial to ensure consistency and (approximate) unbiasedness of $\hat{\Sigma}_{(MCD)}$.

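Searching all size-h subsets is infeasible, so practical implementations approximate the MCD. The sketch below uses concentration steps (C-steps) from random starts, an idea taken from the FAST-MCD family of algorithms rather than from the seminar; it returns the raw subset mean and covariance and does not apply the factor $k_{MCD}$. It assumes NumPy; the function name `mcd_sketch` and the test data are illustrative.

```python
import numpy as np

def mcd_sketch(Y, h, n_starts=30, n_steps=30, seed=0):
    """Approximate the raw MCD subset by C-steps; no consistency factor is applied."""
    Y = np.asarray(Y, dtype=float)
    n, p = Y.shape
    rng = np.random.default_rng(seed)
    best_det, best_subset = np.inf, None
    for start in range(n_starts):
        subset = rng.choice(n, size=p + 1, replace=False)   # small random start
        for step in range(n_steps):
            mu = Y[subset].mean(axis=0)
            cov = np.cov(Y[subset], rowvar=False)
            diff = Y - mu
            # pinv guards against a near-singular covariance from a tiny start subset.
            d2 = np.einsum("ij,ij->i", diff @ np.linalg.pinv(cov), diff)
            new_subset = np.sort(np.argsort(d2)[:h])         # keep the h closest points
            if np.array_equal(new_subset, np.sort(subset)):
                break                                        # subset is stable
            subset = new_subset
        det = np.linalg.det(np.cov(Y[subset], rowvar=False))
        if det < best_det:
            best_det, best_subset = det, subset
    mu = Y[best_subset].mean(axis=0)
    cov = np.cov(Y[best_subset], rowvar=False)               # divisor h - 1
    return mu, cov, best_subset

rng = np.random.default_rng(42)
Y = rng.standard_normal((100, 2))
Y[:10] += 10.0                       # 10% shifted contamination (illustrative)
mu_mcd, cov_mcd, subset = mcd_sketch(Y, h=75)
print(mu_mcd, len(subset))
```

In this toy example the selected subset is expected to avoid the shifted points, so the location estimate stays near the origin.
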
Rocke estimator

◈ Proposed by Rocke (1996).
◈ Improved and empirically compared with other estimators by Maronna and Yohai (2017).
◈ They pointed out that its robustness was superior to that of other estimators when the data dimension was larger than 15.

Initial Value

◈ The initial value has a significant influence on the Rocke estimator.
◈ The subsampling approach usually employed for computing the starting values is very expensive for large dimensions.
◈ This study demonstrates that a semi-deterministic equivariant procedure for outlier detection dramatically improves both the computing times and the statistical performance of the estimators.
◈ The Rocke estimator applies the non-monotonic weight function of Rocke (1996), iteratively updates the weight of each sample point, and finally obtains a robust mean vector and covariance matrix estimate.
◈ When the change in the scale of the Mahalanobis distances between iterations becomes very small, the iteration is stopped, and a robust mean vector and covariance matrix are obtained.

Algorithm for Rocke estimator
It includes the following 4 steps

Step 1

◈ Centre and scale the data, and then obtain the initial mean vector $\mu_0$ and covariance matrix $V_0$ of the sample data.

Step 2

◈ Let $\hat{\sigma} = \hat{\sigma}(d_1, d_2, \ldots, d_n)$ represent the scale estimate of the Mahalanobis distances, and solve for it from

$$\frac{1}{n} \sum_{i=1}^{n} \rho\!\left(\frac{d_i}{\hat{\sigma}}\right) = \delta$$

where $\delta \in (0, 1)$ controls the size of the breakdown point.
◈ When, 1  n  p  1, Rocke estimator can
  
n 2
achieve the highest finite sample breakdown,
where p represent the data dimension.
◈ The relationship between the ρ function and the weight function W is: ρ′ = W.
◈ The function ρ is given by:

$$\rho(t) = \begin{cases} 0, & 0 \le t \le 1 - \gamma \\ \left(\dfrac{t - 1}{4\gamma}\right)\left[3 - \left(\dfrac{t - 1}{\gamma}\right)^2\right] + \dfrac{1}{2}, & 1 - \gamma < t < 1 + \gamma \\ 1, & t \ge 1 + \gamma \end{cases}$$

where γ denotes the weight range.

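The relation ρ′ = W can be checked numerically. The sketch below codes the piecewise ρ above together with the weight function $W(t) = \frac{3}{4\gamma}\left[1 - \left(\frac{t-1}{\gamma}\right)^2\right]$ on $[1-\gamma, 1+\gamma]$ used in Step 3, and compares W with a central finite difference of ρ. The function names and the test point are illustrative.

```python
def rocke_rho(t, gamma):
    """Piecewise rho: 0 up to 1 - gamma, 1 from 1 + gamma, cubic in between."""
    if t <= 1.0 - gamma:
        return 0.0
    if t >= 1.0 + gamma:
        return 1.0
    u = (t - 1.0) / gamma
    return 0.25 * u * (3.0 - u * u) + 0.5

def rocke_weight(t, gamma):
    """Weight W = rho': zero outside [1 - gamma, 1 + gamma]."""
    if t < 1.0 - gamma or t > 1.0 + gamma:
        return 0.0
    u = (t - 1.0) / gamma
    return 0.75 / gamma * (1.0 - u * u)

gamma, t, step = 0.5, 1.2, 1e-6
finite_diff = (rocke_rho(t + step, gamma) - rocke_rho(t - step, gamma)) / (2 * step)
print(finite_diff, rocke_weight(t, gamma))   # the two values should agree closely
```

The weight is largest at t = 1 and vanishes for both very small and very large scaled distances, which is what makes it non-monotonic.
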
Since the Mahalanobis distance d approximately follows the chi-square distribution $\chi^2_p$ with p degrees of freedom, $W(d/\hat{\sigma}) = 0$ when the value of $d/\hat{\sigma}$ is outside $[\chi^2_p(\alpha), \chi^2_p(1-\alpha)]$, where 1 − α is the confidence level. When p is large, the $\chi^2_p$ distribution tends to be symmetric, and

$$\chi^2_p(0.5) \approx p$$

$$\chi^2_p(1 - \alpha) - p \approx p - \chi^2_p(\alpha)$$

Let $\gamma = \min\left(\dfrac{\chi^2_p(1 - \alpha)}{p} - 1,\ 1\right)$, and calculate $\hat{\sigma}$ by the fixed point method.

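A sketch of how γ and the scale $\hat{\sigma}$ could be computed. Two assumptions are made here: the chi-square quantile is passed in as a number (12.592 is the tabulated 0.95 quantile for p = 6), and the scale equation is solved by bisection rather than by the fixed point method named above, because the mean of ρ is non-increasing in the scale. The value δ = 0.5, the distances and the function names are illustrative.

```python
def rocke_rho(t, gamma):
    if t <= 1.0 - gamma:
        return 0.0
    if t >= 1.0 + gamma:
        return 1.0
    u = (t - 1.0) / gamma
    return 0.25 * u * (3.0 - u * u) + 0.5

def weight_range(chi2_quantile, p):
    """gamma = min(chi2_p(1 - alpha) / p - 1, 1)."""
    return min(chi2_quantile / p - 1.0, 1.0)

def mean_rho(distances, sigma, gamma):
    return sum(rocke_rho(d / sigma, gamma) for d in distances) / len(distances)

def m_scale(distances, gamma, delta=0.5, tol=1e-12):
    """Solve mean(rho(d_i / sigma)) = delta by bisection (distances must be positive)."""
    lo = hi = max(distances)
    while mean_rho(distances, lo, gamma) <= delta:   # shrink until the mean exceeds delta
        lo /= 2.0
    while mean_rho(distances, hi, gamma) > delta:    # grow until the mean is at most delta
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if mean_rho(distances, mid, gamma) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

distances = [3.2, 4.1, 5.0, 5.8, 6.3, 7.7, 9.5, 30.0]   # made-up squared distances
gamma = weight_range(12.592, p=6)                        # capped at 1 for small p
sigma_hat = m_scale(distances, gamma, delta=0.5)
print(gamma, sigma_hat, mean_rho(distances, sigma_hat, gamma))
```

For small p the cap applies and γ = 1; for larger p the ratio drops below 2 and the weight window narrows around t = 1.
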
Step 3

◈ The Mahalanobis distances are first computed from the initial values; the new mean vector and covariance matrix are then calculated using the following weight function:

$$W(t) = \frac{3}{4\gamma}\left[1 - \left(\frac{t - 1}{\gamma}\right)^2\right] I(1 - \gamma \le t \le 1 + \gamma)$$

◈ Different weights are applied to different samples through the following equations to calculate μ and the unnormalized covariance matrix C, from which the final covariance matrix Σ is obtained:

$$\sum_{i=1}^{n} W\!\left(\frac{d_i}{\hat{\sigma}}\right)(x_i - \mu) = 0$$

$$\frac{1}{n}\sum_{i=1}^{n} W\!\left(\frac{d_i}{\hat{\sigma}}\right)(x_i - \mu)(x_i - \mu)^T = C$$

$$\Sigma = \frac{C}{|C|^{1/p}}$$

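One weighted update can be sketched as below, assuming NumPy. Solving the first equation for μ gives a weighted mean, and dividing C by $|C|^{1/p}$ yields a matrix with unit determinant. The function names are illustrative, and the median of the distances is used as a stand-in for $\hat{\sigma}$ purely to keep the example short; it is not the scale estimate of Step 2.

```python
import numpy as np

def rocke_weight(t, gamma):
    """Vectorised W(t): zero outside [1 - gamma, 1 + gamma]."""
    t = np.asarray(t, dtype=float)
    u = (t - 1.0) / gamma
    inside = (t >= 1.0 - gamma) & (t <= 1.0 + gamma)
    return np.where(inside, 0.75 / gamma * (1.0 - u ** 2), 0.0)

def weighted_update(X, d, sigma_hat, gamma):
    """Return (mu, C, Sigma) from one pass of the weighted equations."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    w = rocke_weight(np.asarray(d, dtype=float) / sigma_hat, gamma)
    if w.sum() <= 0:
        raise ValueError("all weights are zero; check sigma_hat and gamma")
    mu = (w[:, None] * X).sum(axis=0) / w.sum()     # solves sum w_i (x_i - mu) = 0
    diff = X - mu
    C = (w[:, None] * diff).T @ diff / n            # (1/n) sum w_i (x_i - mu)(x_i - mu)^T
    Sigma = C / np.linalg.det(C) ** (1.0 / p)       # normalised so that det(Sigma) = 1
    return mu, C, Sigma

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 3))
diff0 = X - X.mean(axis=0)
d = np.einsum("ij,ij->i", diff0 @ np.linalg.inv(np.cov(X, rowvar=False)), diff0)
sigma_hat = float(np.median(d))      # stand-in scale for illustration only
mu, C, Sigma = weighted_update(X, d, sigma_hat, gamma=1.0)
print(mu, np.linalg.det(Sigma))
```
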
Step 4

◈ Repeat the second and third steps until the new scale estimate $\hat{\sigma}_{new}$ and the previous one $\hat{\sigma}_{old}$ satisfy $|\hat{\sigma}_{old} - \hat{\sigma}_{new}| < tol$, where tol is the preset error tolerance. The iteration then stops, yielding the final stable mean vector and covariance matrix, from which the Mahalanobis distances of the sample data are calculated.

Numerical simulation example

◈ The classical Mahalanobis distance, the Mahalanobis distance based on the MCD estimator, and the Mahalanobis distance based on the Rocke estimator are applied to detect outliers in simulation studies.
◈ Mixture-distribution data sets are generated, consisting of standard normal data and contaminated data.
◈ The mixture distribution data is
N(0,1) + ε (λ N(0,1) + 10p)
where ε is the contamination ratio and λ is a constant that determines the scatter of the outliers.
◈ In these simulation studies, consider n = 100, p = 6 and n = 300, p = 30, with ε = 0, 0.2 and λ = 0, 0.5, and take $\chi^2_p(0.99)$ as the threshold.

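A hedged sketch of how such a contaminated sample might be generated with NumPy. The slide's formula is only partly legible, so this code makes explicit assumptions: a fraction ε of the rows is replaced by outliers, and each outlier is λ times a standard normal vector shifted by 10 in every coordinate. The function name `contaminated_sample` and the `shift` parameter are hypothetical.

```python
import numpy as np

def contaminated_sample(n, p, eps, lam, shift=10.0, seed=0):
    """Standard normal data with a fraction eps of shifted, rescaled outliers."""
    rng = np.random.default_rng(seed)
    n_out = int(round(eps * n))
    X = rng.standard_normal((n, p))
    # Assumption: outliers are lam * N(0, I) moved by `shift` along every axis.
    X[:n_out] = lam * rng.standard_normal((n_out, p)) + shift
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[:n_out] = True
    return X, is_outlier

X, is_outlier = contaminated_sample(n=100, p=6, eps=0.2, lam=0.0)
print(X.shape, int(is_outlier.sum()))
```

With λ = 0 the outliers collapse onto a single point, and with λ = 0.5 they form a tight cluster around it.
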
◈ The above three Mahalanobis distances are calculated and then used to identify the outliers in the data.
◈ The following quantities are also calculated:
▪ Number of Real Outliers (NRO)
▪ Number of correctly detected outliers (DENO)
▪ Rate of outliers contained in the data (DaOR)
▪ Detected Outliers Rate (DeOR)

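The slides name these quantities without giving formulas, so the sketch below rests on assumed definitions: DaOR is taken as NRO divided by the sample size, and DeOR as DENO divided by NRO. The function name `detection_metrics` and the toy labels are illustrative.

```python
def detection_metrics(is_outlier, flagged):
    """Compute NRO, DENO, DaOR and DeOR under the assumed definitions."""
    n = len(is_outlier)
    nro = sum(is_outlier)                                   # real outliers
    deno = sum(1 for real, flag in zip(is_outlier, flagged) if real and flag)
    daor = nro / n                                          # assumed: NRO / n
    deor = deno / nro if nro else float("nan")              # assumed: DENO / NRO
    return {"NRO": nro, "DENO": deno, "DaOR": daor, "DeOR": deor}

truth   = [True, True, False, False, False, True, False, False, False, False]
flagged = [True, False, False, True, False, True, False, False, False, False]
metrics = detection_metrics(truth, flagged)
print(metrics)
```
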
[Figure: λ = 0, n = 100, p = 6, ε = 0 and ε = 0.2. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
[Figure: λ = 0, n = 300, p = 30, ε = 0 and ε = 0.2. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
[Figure: λ = 0.5, n = 100, p = 6, ε = 0 and ε = 0.2. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
[Figure: λ = 0.5, n = 300, p = 30, ε = 0 and ε = 0.2. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]

Empirical Analysis

◈ The proposed methodology was applied to the Breast Cancer Wisconsin (Diagnostic) Data Set (1995).
◈ The data set contains 30 test variables and a total of 569 observations, of which 357 were diagnosed as benign and 212 as malignant.
◈ Classification by the characteristic variables of a sample can distinguish between benign and malignant cases.
◈ Therefore, relative to the data diagnosed as benign, the data diagnosed as malignant is equivalent to contaminated data.
◈ Data diagnosed as malignant is added to the data diagnosed as benign as contamination, and $\chi^2_p(0.99)$ is used as the threshold.
◈ In the following, we apply the above three methods to detect outliers, and obtain the number and proportion of detected outliers and the scatter plot.
◈ Since the actual data may already contain a certain proportion of outliers, it is not advisable to add too much data when the data diagnosed as malignant is used as the outliers.
◈ The first 200 observations diagnosed as benign were taken, and 0 and 16 (the 100th to 115th) of the observations diagnosed as malignant were added in turn.
◈ We also calculate the number of outliers added (NOA), the number of outliers detected (NOD), the number of correctly detected outliers (DENO), and the detected outliers rate (DeOR).
[Figure: Test results for the data diagnosed as benign. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
[Figure: Test results after adding data diagnosed as malignant. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]

CONCLUSION

◈ Since the classical Mahalanobis distance is greatly affected by outliers, it deviates substantially in both low-dimensional and high-dimensional data sets when outliers are present. Therefore, the outliers cannot be accurately detected.
◈ The Mahalanobis distance based on the MCD estimator can detect outliers more accurately in low-dimensional data, but it produces large deviations in high-dimensional data.
◈ The Rocke estimator is more robust in both low-dimensional and high-dimensional data sets.
◈ In particular, its robustness advantage is more obvious in high-dimensional data sets, where it can accurately detect outliers.
◈ Through numerical simulation and empirical analysis, the accuracy and practicability of the methodology in high-dimensional data sets are validated.
◈ Thus, this methodology provides a new robust method for effectively detecting outliers in high-dimensional real data analysis.

📖 REFERENCES 📖
HARDIN, J. AND ROCKE, D.M., 2004, Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Comput. Stat. Data Anal., 44:625–638.
HUBERT, M., ROUSSEEUW, P.J. AND VAN AELST, S., 2008, High-Breakdown Robust Multivariate Methods. Stat. Sci., 23:92–119.
LI, X., DENG, S.R., LI, L.F. AND JIANG, Y.C., 2019, Outlier Detection Based on Robust Mahalanobis Distance and Its Application. Open J. Stat., 9:15–26.
MARONNA, R.A. AND YOHAI, V.J., 2017, Robust and Efficient Estimation of Multivariate Scatter and Location. Comput. Stat. Data Anal., 109:64–75.
HUBERT, M. AND DEBRUYNE, M., 2010, Minimum Covariance Determinant. WIREs Comput. Stat. (John Wiley & Sons, Inc.), 2:36–43.
