Robust Mahalanobis Distance in Outlier Detection

The document discusses outlier detection using robust Mahalanobis distance. It begins by defining outliers and their effects. It then discusses Mahalanobis distance as a multi-dimensional generalization of measuring standard deviations from the mean that accounts for correlations. The document outlines robust estimators like the Minimum Covariance Determinant estimator and Rocke estimator that are less sensitive to outliers. It provides the algorithm for the Rocke estimator, which iteratively updates weights to obtain a robust mean vector and covariance matrix.

Uploaded by

Abhi S Gowda

OUTLIER DETECTION USING ROBUST MAHALANOBIS DISTANCE
SEMINAR 2
APARNA BHIDE, M.
PALB 7187
OUTLIERS

◈ An outlier is a data point that drastically affects the estimates and assumptions of an analysis.
◈ Outliers can cause serious problems in statistical analysis, so detecting them deserves attention.
◈ Not every data point that lies far from the remaining points is necessarily an outlier; such points can be very important for understanding the inherent characteristics of the data set.
CAUSES

◈ Data entry errors (human errors)
◈ Measurement errors (instrument errors)
◈ Experimental errors (data extraction or experiment planning/executing errors)
◈ Intentional (dummy outliers made to test detection methods)
◈ Data processing errors (data manipulation or unintended mutations of the data set)
◈ Sampling errors (extracting or mixing data from wrong or various sources)
◈ Natural (not an error, novelties in data)
EFFECT OF OUTLIERS

◈ Commonly used measures, such as the sample mean and covariance, are affected by the presence of outliers.
◈ Outliers can distort the apparent structure of the true model that generated the data.
◈ An outlier might drastically affect the estimates of the model parameters.
METHODS OF DETECTION

One-dimensional data
◈ Three standard deviations of standardized variates
◈ Box and whisker plots etc.
◈ Chauvenet’s criterion

High-dimensional data
◈ Leverage values
◈ Mahalanobis distance
◈ Subspace and correlation based techniques
MAHALANOBIS DISTANCE

◈ The Mahalanobis distance, introduced by P. C. Mahalanobis in 1936, measures distance relative to the centroid: a base or central point that can be thought of as the overall mean of multivariate data.
◈ It is a multi-dimensional generalization of the idea of measuring how many standard deviations away a point P is from the mean of a distribution.
◈ This distance is zero if P is at the mean, and grows as P moves away from the mean along each principal component axis.
◈ If each of these axes is re-scaled to have unit variance, then the Mahalanobis distance corresponds to the standard Euclidean distance in the transformed space.
◈ The Mahalanobis distance is thus unitless and scale-invariant, and takes into account the correlations of the data set.
PRINCIPLE

◈ It represents a covariance distance of data, which can effectively estimate the similarity of sample sets.
◈ For a p-dimensional observation $x = (x_1, x_2, \ldots, x_p)^T$ with mean vector $\mu = (\mu_1, \mu_2, \ldots, \mu_p)^T$ and covariance matrix $\Sigma$, the Mahalanobis distance is defined as follows:

$$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$
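As a rough illustration of the definition, the sketch below computes the Mahalanobis distance of a few points with NumPy. The function name `mahalanobis_distances` and the toy mean, covariance, and points are assumptions made for this example only.

```python
import numpy as np

def mahalanobis_distances(X, mu, Sigma):
    """Return D_M(x_i) for every row x_i of X (hypothetical helper)."""
    diff = X - mu
    # Solve Sigma z = diff^T instead of forming the explicit inverse.
    solved = np.linalg.solve(Sigma, diff.T).T
    d2 = np.einsum("ij,ij->i", diff, solved)   # squared distances
    return np.sqrt(np.maximum(d2, 0.0))

# Toy example: two strongly correlated variables.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
X = np.array([[0.0, 0.0],     # the mean itself
              [1.0, 1.0],     # along the correlation direction
              [1.0, -1.0]])   # against the correlation direction
d = mahalanobis_distances(X, mu, Sigma)
print(d)
```

The last two points are equally far from the mean in Euclidean terms, but the point lying against the correlation direction gets a much larger Mahalanobis distance, which is the effect of accounting for the covariance structure.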
MD IN OUTLIER DETECTION

◈ For a sample $X_{n \times p}$ of size n and dimension p, we first calculate the mean vector $\mu$ and covariance matrix $\Sigma$ of the sample, and then calculate the Mahalanobis distance of each observation.
◈ To identify whether a point is an outlier, a threshold is needed. For multivariate normal data, the squared Mahalanobis distance approximately follows a chi-square distribution with p degrees of freedom.
◈ Therefore, given a significance level $\alpha$ (confidence level $1 - \alpha$), an observation whose squared distance satisfies $d^2 > \chi^2_p(1 - \alpha)$ is flagged as an outlier; otherwise it is not.
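The procedure above can be sketched as follows, assuming NumPy and SciPy are available. The function name `flag_outliers_classical`, the default $\alpha = 0.01$, and the toy data with one planted gross error are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2

def flag_outliers_classical(X, alpha=0.01):
    """Flag rows whose squared classical Mahalanobis distance exceeds
    the (1 - alpha) quantile of chi-square with p degrees of freedom."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    diff = X - mu
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    threshold = chi2.ppf(1.0 - alpha, df=p)
    return d2, threshold, d2 > threshold

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
X[0] = [8.0, -8.0, 8.0]            # one planted gross error
d2, threshold, is_outlier = flag_outliers_classical(X)
print(threshold, is_outlier[0], int(is_outlier.sum()))
```

With a single planted point the classical estimates are barely disturbed, so the point is flagged; with many outliers the mean and covariance themselves are pulled toward the contamination and this rule becomes unreliable.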
◈ Mahalanobis distance identifies observations that lie far away from the centroid of the data cloud, giving less weight to variables with large variances or to groups of highly correlated variables.
◈ This distance is often preferred to the Euclidean distance, which ignores the covariance structure and treats all variables equally.
◈ A test statistic for $D_i^2$ can be given as follows, which has an approximate F distribution with p and n − p degrees of freedom:

$$\frac{(n - p)\, n\, D_i^2}{(n^2 - 1)\, p} \sim F_{p,\, n - p}$$
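A minimal sketch of how this statistic might be used, assuming SciPy for the F quantile. The helper name `f_statistic`, the 1% level, and the squared distance of 25 are hypothetical values chosen for the example.

```python
from scipy.stats import f

def f_statistic(d2, n, p):
    """Scale a squared Mahalanobis distance into the F-type statistic."""
    return (n - p) * n * d2 / ((n**2 - 1) * p)

n, p = 100, 6
critical = f.ppf(0.99, p, n - p)     # upper 1% point of F(p, n - p)
stat = f_statistic(25.0, n, p)       # hypothetical squared distance of 25
is_outlier = stat > critical
print(stat, critical, is_outlier)
```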
Why robust methods
The classical Mahalanobis distance is based on the sample mean vector and sample covariance matrix.
Since the classical mean vector and covariance matrix are sensitive to outliers, the classical Mahalanobis distance is also sensitive to outliers.
The purpose of robust estimation is to produce an efficient estimator in the presence of outliers, while minimizing bias.
◈ This is done by reducing the influence of the outliers on the estimator.
◈ To evaluate robust estimators, the usual properties such as bias and precision are of interest.
◈ The breakdown point is the smallest fraction of contamination that can cause the bias of the estimator to become infinitely large, at which point the estimator is said to break down.
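To make the breakdown point concrete, here is a small univariate illustration (the slides concern the multivariate case; one dimension is an assumption made for simplicity). The sample mean can be carried arbitrarily far by a single gross error, while the median stays bounded until about half of the sample is contaminated. The helper `contaminate` and the value 1e6 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.standard_normal(100)

def contaminate(x, fraction, value=1e6):
    """Replace the first `fraction` of the sample with a gross error."""
    y = x.copy()
    k = int(round(fraction * len(y)))
    y[:k] = value
    return y

for fraction in (0.01, 0.25, 0.49):
    y = contaminate(clean, fraction)
    print(f"{fraction:.2f}  mean={np.mean(y):12.1f}  median={np.median(y):8.3f}")
```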
Types of robust estimators

Many robust estimation methods have been proposed for the mean vector and covariance matrix, such as
◈ Stahel-Donoho (SD) estimator
◈ Minimum Volume Ellipsoid (MVE) estimator
◈ Minimum Covariance Determinant (MCD) estimator
◈ S estimator
◈ Rocke estimator
MCD Estimator

◈ One of the first affine equivariant and highly robust estimators of multivariate location and scatter.
◈ The idea is to use a trimmed scale for $\hat{\sigma}$.
◈ Let $d_{(1)}(\hat{\mu}, \hat{\Sigma}) \le d_{(2)}(\hat{\mu}, \hat{\Sigma}) \le \ldots \le d_{(n)}(\hat{\mu}, \hat{\Sigma})$ be the ordered squared distances $d^2(y_i, \hat{\mu}, \hat{\Sigma})$, and for $1 \le h < n$ define the trimmed scale as

$$\hat{\sigma} = \sum_{i=1}^{h} d_{(i)}$$
◈ $y_{(MCD)}$ = sub-sample of $n/2 \le h < n$ observations whose covariance matrix has the smallest determinant.

$$\hat{\mu}_{(MCD)} = \frac{1}{h} \sum_{i \in y_{(MCD)}} y_i$$

$$\hat{\Sigma}_{(MCD)} = \frac{k_{MCD}(h, n, v)}{h - 1} \sum_{i \in y_{(MCD)}} (y_i - \hat{\mu}_{(MCD)})(y_i - \hat{\mu}_{(MCD)})^T$$

◈ The proportionality term $k_{MCD}(h, n, v)$ is crucial to ensure consistency and (approximate) unbiasedness of $\hat{\Sigma}_{(MCD)}$.
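The slides do not say how the minimising sub-sample is searched for. The sketch below uses repeated concentration steps from random starts, a common heuristic swapped in here as an assumption, and it returns the raw scatter (that is, it takes $k_{MCD} = 1$). The function name `mcd_sketch`, the default `h`, and the toy data are all hypothetical.

```python
import numpy as np

def mcd_sketch(Y, h=None, n_starts=20, n_csteps=30, seed=0):
    """Approximate the MCD sub-sample by repeated concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    if h is None:
        h = (n + p + 1) // 2                 # assumed default with n/2 <= h < n
    best_logdet, best_idx = np.inf, None
    for _ in range(n_starts):
        idx = rng.choice(n, size=p + 1, replace=False)   # small random start
        for _ in range(n_csteps):
            mu = Y[idx].mean(axis=0)
            S = np.cov(Y[idx], rowvar=False) + 1e-10 * np.eye(p)  # guard singular starts
            diff = Y - mu
            d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(S, diff.T).T)
            new_idx = np.sort(np.argsort(d2)[:h])        # keep the h closest points
            if np.array_equal(new_idx, np.sort(idx)):
                break                                     # subset no longer changes
            idx = new_idx
        sign, logdet = np.linalg.slogdet(np.cov(Y[idx], rowvar=False))
        if sign > 0 and logdet < best_logdet:
            best_logdet, best_idx = logdet, idx
    mu = Y[best_idx].mean(axis=0)
    S = np.cov(Y[best_idx], rowvar=False)    # divides by h - 1; k_MCD taken as 1
    return mu, S, best_idx

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 2))
Y[:20] += 10.0                               # 20% shifted contamination
mu_mcd, S_mcd, subset = mcd_sketch(Y)
print(mu_mcd, Y.mean(axis=0))
```

In this toy run the classical mean is dragged toward the shifted points, while the location computed from the selected sub-sample stays near the origin. Without the consistency factor the raw scatter underestimates the covariance of the clean data.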
Rocke estimator

◈ Proposed by Rocke.
◈ Improved and empirically compared to other estimates by Maronna and Yohai (2017).
◈ They pointed out that its robustness is superior to that of other estimates when the data dimension is larger than 15.
Initial Value

◈ The initial value has a significant influence on the Rocke estimator.
◈ The subsampling approach usually employed for computing the starting values is very expensive for large dimensions.
◈ This study demonstrates that a semi-deterministic equivariant procedure for outlier detection dramatically improves both the computing times and the statistical performance of the estimators.
◈ The Rocke estimator applies the non-monotonic weight function of Rocke (1996), iteratively updates the weight of each sample point, and finally obtains a robust mean vector and covariance matrix estimator.
◈ When the change in the scale of the Mahalanobis distances between iterations becomes very small, the iteration is stopped, and a robust mean vector and covariance matrix are obtained.
Algorithm for Rocke estimator
It includes the following 4 steps.
Step 1

◈ Centre and scale the data, then obtain the initial mean vector $\mu_0$ and covariance matrix $V_0$ of the sample data.
Step 2

◈ Let $\hat{\sigma} = \hat{\sigma}(d_1, d_2, \ldots, d_n)$ represent the scale estimate of the Mahalanobis distances, and solve it from

$$\frac{1}{n} \sum_{i=1}^{n} \rho\left( \frac{d_i}{\hat{\sigma}} \right) = \delta$$

where $\delta \in (0, 1)$ controls the size of the breakdown point.
◈ When $\delta = \frac{1}{n} \left\lfloor \frac{n - p + 1}{2} \right\rfloor$, the Rocke estimator can achieve the highest finite-sample breakdown point, where p represents the data dimension.
◈ The relationship between the $\rho$ function and the weight function W is $\rho' = W$.
◈ The function $\rho$ is given by:

$$\rho(t) = \begin{cases} 0, & 0 \le t \le 1 - \gamma \\ \dfrac{t - 1}{4\gamma} \left[ 3 - \left( \dfrac{t - 1}{\gamma} \right)^2 \right] + \dfrac{1}{2}, & 1 - \gamma < t < 1 + \gamma \\ 1, & t \ge 1 + \gamma \end{cases}$$

where $\gamma$ denotes the weight range.
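The $\rho$ function and its derivative W can be written down directly. The sketch below is one possible NumPy transcription; the function names `rocke_rho` and `rocke_weight` and the value $\gamma = 0.5$ are assumptions for the example. It also checks numerically that $\rho' = W$ at one interior point.

```python
import numpy as np

def rocke_rho(t, gamma):
    """Rocke's rho: 0 below 1 - gamma, 1 above 1 + gamma, cubic in between."""
    t = np.asarray(t, dtype=float)
    u = (t - 1.0) / gamma
    middle = u / 4.0 * (3.0 - u**2) + 0.5
    return np.where(t <= 1.0 - gamma, 0.0,
                    np.where(t >= 1.0 + gamma, 1.0, middle))

def rocke_weight(t, gamma):
    """W = rho': non-zero only for 1 - gamma <= t <= 1 + gamma."""
    t = np.asarray(t, dtype=float)
    u = (t - 1.0) / gamma
    return np.where(np.abs(u) <= 1.0, 3.0 / (4.0 * gamma) * (1.0 - u**2), 0.0)

gamma = 0.5
t0, step = 1.2, 1e-6
numeric_derivative = (rocke_rho(t0 + step, gamma) - rocke_rho(t0 - step, gamma)) / (2 * step)
print(float(numeric_derivative), float(rocke_weight(t0, gamma)))
```

Because W vanishes both for very small and for very large arguments, the weight function is non-monotonic: points far from the bulk get zero weight rather than merely reduced weight.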
Since the Mahalanobis distance d approximately follows the chi-square distribution $\chi^2_p$ with p degrees of freedom, $W(d/\hat{\sigma}) = 0$ when d lies outside $[\chi^2_p(\alpha), \chi^2_p(1 - \alpha)]$, where $1 - \alpha$ is the confidence level. When p is large, the $\chi^2_p$ distribution tends to be symmetric, and:

$$\chi^2_p(0.5) \approx p$$

$$\chi^2_p(1 - \alpha) - p \approx p - \chi^2_p(\alpha)$$

Let $\gamma = \min\left( \frac{\chi^2_p(1 - \alpha)}{p} - 1,\ 1 \right)$, and calculate $\hat{\sigma}$ by the fixed-point method.
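A possible sketch of Step 2, assuming SciPy for the chi-square quantile. Two things here are not from the slides: the scale equation is solved by bisection instead of the fixed-point method (bisection is swapped in because the mean of $\rho(d_i/\hat{\sigma})$ is non-increasing in $\hat{\sigma}$, so a bracketed root is easy to find), and the values $\alpha = 0.05$ and $\delta = \lfloor (n - p + 1)/2 \rfloor / n$ are assumptions for illustration. The distances are simulated chi-square draws.

```python
import numpy as np
from scipy.stats import chi2

def rocke_rho(t, gamma):
    t = np.asarray(t, dtype=float)
    u = (t - 1.0) / gamma
    middle = u / 4.0 * (3.0 - u**2) + 0.5
    return np.where(t <= 1.0 - gamma, 0.0,
                    np.where(t >= 1.0 + gamma, 1.0, middle))

def rocke_gamma(p, alpha):
    """gamma = min(chi2_p(1 - alpha) / p - 1, 1)."""
    return min(chi2.ppf(1.0 - alpha, df=p) / p - 1.0, 1.0)

def solve_scale(d, gamma, delta, n_iter=200):
    """Solve mean(rho(d / sigma)) = delta for sigma by bisection."""
    d = np.asarray(d, dtype=float)

    def excess(s):
        return np.mean(rocke_rho(d / s, gamma)) - delta

    lo = d.min() / (1.0 + gamma)   # every d_i / lo >= 1 + gamma, so mean rho = 1
    hi = d.max()
    while excess(hi) > 0.0:        # enlarge until mean rho drops below delta
        hi *= 2.0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
n, p = 300, 30
d = rng.chisquare(p, size=n)       # stand-in for squared distances
gamma = rocke_gamma(p, alpha=0.05)
delta = ((n - p + 1) // 2) / n     # assumed choice of delta
sigma_hat = solve_scale(d, gamma, delta)
print(gamma, delta, sigma_hat)
```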
Step 3

◈ The Mahalanobis distance is obtained from the initial value; then the new mean vector and covariance matrix are calculated using the following weight function:

$$W(t) = \frac{3}{4\gamma} \left[ 1 - \left( \frac{t - 1}{\gamma} \right)^2 \right] I(1 - \gamma \le t \le 1 + \gamma)$$

◈ Different weights are applied to different samples through the following equations, which give $\mu$ and the unnormalized scatter matrix C, from which the final covariance matrix $\Sigma$ is obtained:

$$\sum_{i=1}^{n} W\left( \frac{d_i}{\hat{\sigma}} \right) (x_i - \mu) = 0$$

$$\frac{1}{n} \sum_{i=1}^{n} W\left( \frac{d_i}{\hat{\sigma}} \right) (x_i - \mu)(x_i - \mu)^T = C$$

$$\Sigma = \frac{C}{|C|^{1/p}}$$
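One reweighting pass of Step 3 might look like the sketch below. The function name `weighted_update` is hypothetical, the distances are taken to be squared classical distances, and the median of the distances is used as a crude stand-in for the scale $\hat{\sigma}$ (an assumption for this example, not the M-scale of Step 2). The value $\gamma = 1$ is also assumed.

```python
import numpy as np

def rocke_weight(t, gamma):
    t = np.asarray(t, dtype=float)
    u = (t - 1.0) / gamma
    return np.where(np.abs(u) <= 1.0, 3.0 / (4.0 * gamma) * (1.0 - u**2), 0.0)

def weighted_update(X, d, sigma, gamma):
    """One pass of the weighted mean and scatter equations."""
    n, p = X.shape
    w = rocke_weight(d / sigma, gamma)
    mu = (w[:, None] * X).sum(axis=0) / w.sum()   # solves sum w_i (x_i - mu) = 0
    diff = X - mu
    C = (w[:, None] * diff).T @ diff / n          # unnormalized scatter
    sign, logdet = np.linalg.slogdet(C)
    Sigma = C / np.exp(logdet / p)                # C / |C|^(1/p), so det(Sigma) = 1
    return mu, Sigma, w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
mu0, V0 = X.mean(axis=0), np.cov(X, rowvar=False)
diff0 = X - mu0
d = np.einsum("ij,ij->i", diff0, np.linalg.solve(V0, diff0.T).T)
sigma_hat = np.median(d)                          # crude stand-in for the scale
mu, Sigma, w = weighted_update(X, d, sigma_hat, gamma=1.0)
print(np.linalg.det(Sigma))
```

The sketch assumes that enough points receive non-zero weight for C to be non-singular; a production implementation would need to guard against the opposite case.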
Step 4

◈ Repeat the second and third steps until $\hat{\sigma}_{new}$ and the $\hat{\sigma}_{old}$ obtained in the previous iteration satisfy $|\hat{\sigma}_{old} - \hat{\sigma}_{new}| < tol$, where tol is the preset error tolerance. The iteration then stops, giving the final stable mean vector and covariance matrix, from which the Mahalanobis distance of the sample data is calculated.
Numerical simulation example
◈ The classical Mahalanobis distance, the Mahalanobis distance based on the MCD estimator, and the Mahalanobis distance based on the Rocke estimator are applied to detect outliers in simulation studies.
◈ Mixture distribution data sets are generated, consisting of standard normal data and contaminated data.
◈ The mixture distribution data is
N(0,1) + ε(λN(0,1) + 10p)
where ε is the contamination ratio and λ is a constant that determines the scatter of the outliers.
◈ In this simulation study, consider (n = 100, p = 6) and (n = 300, p = 30), with ε = 0, 0.2 and λ = 0, 0.5, and take $\chi^2_p(0.99)$ as the threshold.
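One way such a data set could be generated is sketched below. Several choices are assumptions rather than something the slides specify: a fraction ε of the n rows is replaced (not appended) by contaminated rows, each contaminated row is λ times a standard normal vector plus a constant shift of 10 in every coordinate, and the function name `simulate_mixture` is hypothetical.

```python
import numpy as np

def simulate_mixture(n, p, eps, lam, shift=10.0, seed=0):
    """Standard normal data with a fraction eps of shifted, lam-scaled rows."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    n_out = int(round(eps * n))
    is_outlier = np.zeros(n, dtype=bool)
    is_outlier[:n_out] = True
    # Contaminated rows: lam * N(0, I) + shift (assumed reading of the mixture).
    X[:n_out] = lam * rng.standard_normal((n_out, p)) + shift
    return X, is_outlier

X, is_outlier = simulate_mixture(n=100, p=6, eps=0.2, lam=0.0)
print(X.shape, int(is_outlier.sum()))
```

With λ = 0 all contaminated rows coincide at a single point, and a larger λ spreads them out, which matches the description of λ as controlling the scatter of the outliers.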
◈ The above three Mahalanobis distances are calculated and then used to identify the outliers in the data.
◈ The following are also calculated:
  ◈ Number of Real Outliers (NRO)
  ◈ Number of correctly detected outliers (DENO)
  ◈ Rate of outliers contained in the data (DaOR)
  ◈ Detected Outliers Rate (DeOR)
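The slides name these quantities without giving formulas, so the sketch below rests on assumed definitions: NRO is the count of true outliers, DENO the count of true outliers that were flagged, DaOR = NRO / n, and DeOR = DENO / NRO. The function name `detection_summary` and the toy labels are hypothetical.

```python
import numpy as np

def detection_summary(is_true_outlier, is_flagged):
    """Summarise detection results under the assumed definitions above."""
    is_true_outlier = np.asarray(is_true_outlier, dtype=bool)
    is_flagged = np.asarray(is_flagged, dtype=bool)
    n = is_true_outlier.size
    nro = int(is_true_outlier.sum())
    deno = int((is_true_outlier & is_flagged).sum())
    return {
        "NRO": nro,
        "DENO": deno,
        "DaOR": nro / n,
        "DeOR": deno / nro if nro else float("nan"),
    }

truth = [True, True, False, False, False]
flagged = [True, False, True, False, False]
summary = detection_summary(truth, flagged)
print(summary)
```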
[Figure: λ = 0, n = 100, p = 6, ε = 0 and ε = 0.2. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
[Figure: λ = 0, n = 300, p = 30, ε = 0 and ε = 0.2. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
[Figure: λ = 0.5, n = 100, p = 6, ε = 0 and ε = 0.2. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
[Figure: λ = 0.5, n = 300, p = 30, ε = 0 and ε = 0.2. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
Empirical Analysis
◈ The proposed methodology was applied to analyze the Breast Cancer Wisconsin (Diagnostic) Data Set (1995).
◈ The data set contains 30 test variables and a total of 569 samples, of which 357 were diagnosed as benign and 212 as malignant.
◈ Classification by the characteristic variables of a sample can distinguish between benign and malignant cases.
◈ Therefore, relative to the data diagnosed as benign, the data diagnosed as malignant are equivalent to contaminated data.
◈ Data diagnosed as malignant are added to the data diagnosed as benign as contamination, and $\chi^2_p(0.99)$ is used as the threshold.
◈ In the following, we apply the above three methods to detect outliers, and obtain the number and proportion of detected outliers and the scatter plots.
◈ Since the actual data may already contain a certain proportion of outliers, it is not advisable to add too much data when the data diagnosed as malignant are used as the outliers.
◈ The first 200 samples diagnosed as benign were taken, and 0 and then 16 (the 100th to 115th) of the samples diagnosed as malignant were added in turn.
◈ We also calculate the number of outliers added (NOA), the number of outliers detected (NOD), the number of correctly detected outliers (DENO), and the detected outliers rate (DeOR).
[Figure: Test results for the data diagnosed as benign. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
[Figure: Test results after adding data diagnosed as malignant. (a) Rocke detection; (b) Classical detection; (c) MCD detection.]
CONCLUSION

◈ Since the classical Mahalanobis distance is greatly affected by outliers, it deviates substantially in both low-dimensional and high-dimensional data sets when outliers are present. Therefore, the outliers cannot be accurately detected.
◈ The Mahalanobis distance based on the MCD estimator can detect outliers more accurately in low-dimensional data, but it produces large deviations in high-dimensional data.
◈ The Rocke estimator is more robust in both low-dimensional and high-dimensional data sets.
◈ In particular, its robustness advantage is more obvious in high-dimensional data sets, where it can accurately detect outliers.
◈ Through numerical simulation and empirical analysis, the accuracy and practicability of the methodology in high-dimensional data sets are validated.
◈ Thus, this methodology can provide a new robust method for effectively detecting outliers in high-dimensional real data analysis.
REFERENCES
HARDIN, J. AND ROCKE, D.M., 2004, Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Comput. Stat. Data Anal., 44:625–638.
HUBERT, M., ROUSSEEUW, P.J. AND VAN AELST, S., 2008, High-Breakdown Robust Multivariate Methods. Stat. Sci., 23:92–119.
LI, X., DENG, S.R., LI, L.F. AND JIANG, Y.C., 2019, Outlier Detection Based on Robust Mahalanobis Distance and Its Application. Open J. Stat., 9:15–26.
MARONNA, R.A. AND YOHAI, V.J., 2017, Robust and Efficient Estimation of Multivariate Scatter and Location. Comput. Stat. Data Anal., 109:64–75.
HUBERT, M. AND DEBRUYNE, M., 2010, Minimum Covariance Determinant. Wiley Interdiscip. Rev. Comput. Stat., 2:36–43.
