
Understanding Data Objects and Attributes

Chapter 2 of 'Data Mining: Concepts and Techniques' discusses the importance of understanding data objects and their attributes, covering types of attributes and their applications. It also delves into basic statistical descriptions, including measures of central tendency and dispersion, as well as data visualization techniques. The chapter emphasizes the significance of data mining in various applications, such as business intelligence and web search engines.

Uploaded by

Sriya

Data Mining:

Concepts and Techniques

— Chapter 2 —
Getting to Know Your Data
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights
reserved.
1
2
Chapter 2: Getting to Know Your
Data

 Data Objects and Attribute Types

 What Kind of Applications Are Targeted?

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

3
Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
 In sales database: objects may be customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Data objects are also called samples, examples, instances, data points,
objects, tuples.
 If data objects are stored in a database, they are data tuples.
 Data objects are described by attributes.

Database rows -> data objects; columns -> attributes.

4
Attributes
 Attribute (or dimension, feature,
variable): a data field, representing a
characteristic or feature of a data object.
 E.g., customer_ID, name, address

 Types:
 Nominal

 Binary

 Ordinal

 Numeric: quantitative

 Interval-scaled

 Ratio-scaled
5
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red,
white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important

e.g., gender
 Asymmetric binary: outcomes not equally important.

e.g., medical test (positive vs. negative)

Convention: assign 1 to most important outcome
(e.g., HIV positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude
between successive values is not known.
 Size = {small, medium, large}, grades, army rankings
6
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in °C or °F, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of one value as being a multiple (or ratio) of
another (10 K is twice as high as 5 K).
 E.g., temperature in Kelvin, length, counts,
monetary quantities

7
Discrete vs. Continuous
Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values


E.g., zip codes, profession, or the set of words in
a collection of documents
 Sometimes, represented as integer variables

 Note: Binary attributes are a special case of

discrete attributes
 Continuous Attribute
 Has real numbers as attribute values


E.g., temperature, height, or weight
 Practically, real values can only be measured and

represented using a finite number of digits


 Continuous attributes are typically represented as

floating-point variables
8
9
Chapter 1: Applications of Data
Mining
 Where there are data, there are data mining applications.
 Bioinformatics & software engineering
 Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological network
analysis.
 Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 From major dedicated data mining systems/tools (e.g., MS SQL-Server
Analysis Manager, Oracle Data Mining Tools) to invisible data mining.

10
Chapter 1: Applications of Data
Mining
 Two highly successful and popular application examples of data mining:
business intelligence and search engines.
 Business intelligence (BI) technologies provide historical, current, and
predictive views of business operations.
 Examples include reporting, online analytical processing, business
performance management, competitive intelligence, benchmarking, and
predictive analytics.
 “How important is business intelligence?”
Without data mining, many businesses may not be able to perform effective
market analysis, compare customer feedback on similar products, discover
the strengths and weaknesses of their competitors, retain highly valuable
customers, and make smart business decisions.

11
Chapter 1: Applications of Data
Mining
Clearly, data mining is the core of business intelligence.
Online analytical processing tools in business intelligence rely on data
warehousing and multidimensional data mining.
Classification and prediction techniques are the core of predictive analytics in
business intelligence.
Moreover, clustering plays a central role in customer relationship management,
which groups customers based on their similarities.
Using characterization mining techniques, we can better understand features
of each customer group and develop customized customer reward programs.

12
Chapter 1: Applications of Data
Mining
 Web Search Engines
 A Web search engine is a specialized computer server that searches for
information on the Web.
 The search results of a user query are often returned as a list (sometimes
called hits).
 The hits may consist of web pages, images, and other types of files.
 Search engines differ from web directories in that web directories are
maintained by human editors whereas search engines operate algorithmically
or by a mixture of algorithmic and human input.
 Various data mining techniques are used in all aspects of search engines,
ranging from crawling (e.g., deciding which pages should be crawled and
the crawling frequencies), indexing (e.g., selecting pages to be indexed and
deciding to which extent the index should be constructed), and searching
(e.g., deciding how pages should be ranked, which advertisements should be
added, and how the search results can be personalized or made “context
aware”).
13
Chapter 1: Applications of Data
Mining
 Search engines pose grand challenges to data mining.
 First, they have to handle a huge and ever-growing amount of
data.
 Second, Web search engines often have to deal with online
data.
 Another challenge is maintaining and incrementally updating a
model on fast growing data streams.
 Third, Web search engines often have to deal with queries that
are asked only a very small number of times.

14
15
Basic Statistical Descriptions of Data

Basic Statistical Descriptions can be used to identify the


properties of data and highlight which data values should
be treated as noise or outliers.

16
Measuring the Central Tendency
 Let x1, x2, ..., xN be a set of N values or observations, such as for some numeric
attribute X, like salary. The mean of this set of values is

\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]

 Sometimes, each value xi in a set may be associated with a weight wi for i = 1, ..., N.
The weights reflect the significance, importance, or occurrence
frequency attached to their respective values. In this case, we can compute

\[ \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} \]

 This is called the weighted arithmetic mean or the weighted average.

17
Measuring the Central Tendency

 A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values.
 Even a small number of extreme values can corrupt the mean.
 For example, the mean score of a class in an exam could be pulled
down quite a bit by a few very low scores.
 To offset the effect caused by a small number of extreme values,
we can instead use the trimmed mean, which is the mean
obtained after chopping off values at the high and low extremes.
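
The three means described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names, the parameter `k` (how many values to chop off each end), and the exam scores are all hypothetical.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values.

    k is a hypothetical parameter for this sketch; it must leave
    at least one value in the middle.
    """
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return mean(kept)


# Hypothetical exam scores; the single very low score pulls the mean down.
scores = [5, 70, 72, 75, 78, 80, 95]
plain = mean(scores)
trimmed = trimmed_mean(scores, 1)
weighted = weighted_mean([60, 80], [1, 3])
```

With these made-up scores the plain mean is below every score but the lowest, while the trimmed mean sits in the middle of the bulk of the data.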

18
Measuring the Central Tendency
 Median:
 Middle value if odd number of values, or average of the middle two values
otherwise
 Estimated by interpolation (for grouped data):

\[ \text{median} = L_1 + \left( \frac{N/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]

 where L1 is the lower boundary of the median interval, N is the number of
values in the entire data set, (∑freq)l is the sum of the frequencies of all of the
intervals that are lower than the median interval, freq_median is the frequency of
the median interval, and width is the width of the median interval.
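
As a rough sketch of the interpolation for grouped data, the function below walks the intervals until it reaches the median interval and then interpolates inside it. The `(lower_boundary, width, frequency)` tuple layout and the sample intervals are assumptions made for this example.

```python
def grouped_median(intervals):
    """Approximate median of grouped data by linear interpolation.

    intervals: list of (lower_boundary, width, frequency) tuples sorted
    by lower boundary. This layout is an assumption for the sketch.
    """
    n = sum(freq for _, _, freq in intervals)
    below = 0  # sum of frequencies of all intervals below the current one
    for lower, width, freq in intervals:
        if freq > 0 and below + freq >= n / 2:
            # This is the median interval: interpolate within it.
            return lower + (n / 2 - below) / freq * width
        below += freq
    raise ValueError("no data")


# Hypothetical grouped data: intervals [0,10), [10,20), [20,30).
groups = [(0, 10, 2), (10, 10, 5), (20, 10, 3)]
approx_median = grouped_median(groups)
```

Here N = 10, the median interval is [10, 20) with two values below it, so the estimate is 10 + (5 - 2) / 5 × 10 = 16.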

19
Measuring the Central Tendency
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal. In general, a data set with
two or more modes is multimodal. At the other extreme, if
each data value occurs only once, then there is no mode.
 For unimodal numeric data that are moderately skewed
(asymmetrical), we have the following empirical relation:
 Empirical formula:

\[ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \]

 This implies that the mode for unimodal frequency curves that are
moderately skewed can easily be approximated if the mean and
median values are known.

20
Measuring the Central Tendency
 The midrange can also be used to assess the
central tendency of a numeric data set.
 It is the average of the largest and smallest
values in the set.
 This measure is easy to compute using the SQL
aggregate functions, max() and min().
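
A small sketch of the mode, the midrange, and the empirical mode approximation, using Python's standard `statistics` module. The data values and the helper names `midrange` and `approx_mode` are hypothetical; `approx_mode` simply solves mean − mode ≈ 3 × (mean − median) for the mode.

```python
from statistics import mean, median, multimode


def midrange(values):
    """Average of the largest and smallest values."""
    return (max(values) + min(values)) / 2


def approx_mode(mean_value, median_value):
    """Solve mean - mode ≈ 3 * (mean - median) for the mode."""
    return 3 * median_value - 2 * mean_value


# Hypothetical unimodal, right-skewed data.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
modes = multimode(data)          # list of most frequent values
mid = midrange(data)
estimate = approx_mode(mean(data), median(data))
```

`multimode` returns every value tied for the highest frequency, so a bimodal data set yields a two-element list. The empirical estimate is only a rough guide and need not match the true mode.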

21
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively and negatively skewed data
• In a unimodal frequency curve with a perfectly symmetric data distribution, the mean,
median, and mode are all at the same center value.
• Data in most real applications are not symmetric. They may instead be either
positively skewed, where the mode occurs at a value that is smaller than the median,
or negatively skewed, where the mode occurs at a value greater than the median.
[Figure: frequency curves for symmetric, positively skewed, and negatively skewed data]
June 4, 2025 Data Mining: Concepts and Techniques 22
Measuring the Dispersion of
Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked;
add whiskers, and plot outliers individually
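 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \]

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]

 Standard deviation s (or σ) is the square root of variance s² (or σ²)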

23
Measuring the Dispersion of
Data

• The measures include range, quantiles, quartiles,


percentiles, and the interquartile range.
• The five-number summary, which can be displayed as a
boxplot, is useful in identifying outliers.
• Variance and standard deviation also indicate the spread
of a data distribution.
• Let x1, x2, ..., xN be a set of observations for some numeric
attribute, X.
• The range of the set is the difference between the largest
(max()) and smallest (min()) values.

24
Measuring the Dispersion of
Data
• Quantiles are points taken at regular intervals of a data
distribution, dividing it into essentially equal size consecutive
sets.
• The 2-quantile is the data point dividing the lower and upper
halves of the data distribution. It corresponds to the median.
• The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-fourth
of the data distribution. They are more commonly referred to as
quartiles.
• The 100-quantiles are more commonly referred to as
percentiles; they divide the data distribution into 100 equal-
sized consecutive sets.
25
Measuring the Dispersion of
Data

• The quartiles give an indication of a


distribution’s center, spread, and shape. The
first quartile, denoted by Q1, is the 25th
percentile. It cuts off the lowest 25% of the
data.
26
Measuring the Dispersion of
Data

• The distance between the first and third quartiles
is a simple measure of spread that gives the range
covered by the middle half of the data. This
distance is called the interquartile range (IQR)
and is defined as

\[ IQR = Q_3 - Q_1 \]

27
Five-number Summary, Boxplots
and Outliers
 Five-number summary of a distribution
 Minimum, Q1, Median(Q2), Q3, Maximum
 Boxplot
 Boxplots are a popular way of visualizing a
distribution. A boxplot incorporates the five-number
summary as follows:
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended to
Minimum and Maximum observations
 Outliers: points beyond a specified outlier threshold,
plotted individually
28
Five-number Summary, Boxplots
and Outliers

Boxplot. Figure 2.3 shows boxplots for unit price data for items sold at
four branches of AllElectronics during a given time period.
For branch 1, we see that the median price of items sold is $80, Q1 is
$60, and Q3 is $100.
Notice that two outlying observations for this branch were plotted
individually, as their values of 175 and 202 lie more than 1.5 times
the IQR (here 40) above Q3, that is, above 100 + 1.5 × 40 = 160.
Boxplots can be computed in O(n log n) time.
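
The outlier rule for branch 1 can be checked with a short sketch. The function names are hypothetical, and the quartile method (`method="inclusive"` in Python's `statistics.quantiles`) is one convention among several, so quartiles computed by other tools may differ slightly.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def outlier_fences(q1, q3, k=1.5):
    """Values below q1 - k*IQR or above q3 + k*IQR count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Branch 1 from the text: Q1 = 60, Q3 = 100, so IQR = 40.
low, high = outlier_fences(60, 100)
flagged = [v for v in (80, 175, 202) if v < low or v > high]

# Hypothetical small data set to show the five-number summary.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

The upper fence comes out at 160, so 175 and 202 are flagged while the median price of 80 is not.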
29
Five-number Summary, Boxplots and
Outliers

Variance and Standard Deviation


• Variance and standard deviation are measures of data dispersion.
• They indicate how spread out a data distribution is.
• A low standard deviation means that the data observations tend to
be very close to the mean, while a high standard deviation
indicates that the data are spread out over a large range of
values.
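• The variance of N observations, x1, x2, ..., xN, for a numeric attribute X
is

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right) - \bar{x}^2 \]

where \(\bar{x}\) is the mean value of the observations. The standard deviation, σ,
of the observations is the square root of the variance, σ².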
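
As an illustration, the population variance can be computed either from the squared deviations or from the sum of squares minus the squared mean; both give the same result. The data values and function names below are hypothetical.

```python
from math import sqrt


def population_variance(values):
    """Mean of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_variance_sums(values):
    """Equivalent form: (1/N) * sum(x_i^2) - mean^2."""
    n = len(values)
    return sum(x * x for x in values) / n - (sum(values) / n) ** 2


# Hypothetical observations with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
var = population_variance(data)
var_sums = population_variance_sums(data)
std = sqrt(var)
```

The second form needs only running sums of x and x², which is why it scales well, although it can lose precision when the mean is large relative to the spread.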
Graphic Displays of Basic Statistical
Descriptions of Data
 Histogram: x-axis shows values, y-axis represents frequencies

 Quantile plot: each value xi is paired with fi, indicating that approximately
100 fi % of the data are ≤ xi

 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate
distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and plotted as
points in the plane. If the plotted points pattern slopes from lower left to
upper right, this means that the values of X increase as the values of Y
increase, suggesting a positive correlation. If the pattern of plotted points
slopes from upper left to lower right, the values of X increase as the values
of Y decrease, suggesting a negative correlation

31
Quantile Plot
 Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
 Plots quantile information

For data xi sorted in increasing order, fi
indicates that approximately 100 fi% of the data
are below or equal to the value xi
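
A minimal sketch of how the (fi, xi) pairs for a quantile plot might be computed. It assumes the common convention fi = (i − 0.5) / N for the i-th smallest value, which the slide itself does not state; the data values are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N.

    The (i - 0.5) / N convention is an assumption of this sketch.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical unit prices.
points = quantile_plot_points([74, 40, 47, 75, 43])
fractions = [f for f, _ in points]
```

Plotting the pairs with f on the horizontal axis and x on the vertical axis gives the quantile plot.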



Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
 View: Is there a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2
for each quantile. Unit prices of items sold at Branch 1 tend to
be lower than those at Branch 2.

33
Positively and Negatively Correlated
Data

 The left half fragment is positively correlated
 The right half is negatively correlated

34
35
Data Visualization
 Data visualization aims to communicate data clearly and effectively through graphical
representation.
 Data visualization has been used extensively in many applications—for example, at
work for reporting, managing business operations, and tracking progress of tasks.
 Categorization of visualization methods:
 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations

36
Pixel-Oriented Visualization
Techniques
 A simple way to visualize the value of a dimension is to use a pixel where the color of the pixel
reflects the dimension’s value.
 For a data set of m dimensions, pixel-oriented techniques create m windows on the screen, one
for each dimension.
 The colors of the pixels reflect the corresponding values.
 Pixel-oriented visualization of four attributes by sorting all customers in income ascending order.

(a) Income (b) Credit limit (c) Transaction volume (d) Age
Pixel-Oriented Visualization
Techniques

• The pixel colors are chosen so that the smaller the value,
the lighter the shading.
• Using pixel based visualization, we can easily
observe the following: credit limit increases as
income increases; customers whose income is in the
middle range are more likely to purchase more from
AllElectronics; there is no clear correlation between
income and age.

38
Laying Out Pixels in Circle
Segments
 To save space and show the connections among multiple
dimensions, space filling is often done in a circle segment

(a) Representing a data record in circle segment (b) Laying out pixels in circle segment
39
Geometric Projection Visualization
Techniques
 A drawback of pixel-oriented visualization techniques is that they cannot help us much
in understanding the distribution of data in a multidimensional space.
 Geometric projection techniques help users find interesting projections of
multidimensional data sets. The central challenge that geometric projection
techniques try to address is how to visualize a high-dimensional space on a 2-D
display.

40
Geometric Projection
Visualization Techniques
 A scatter plot displays 2-D data points
using Cartesian coordinates.
 A third dimension can be added using
different colors or shapes to represent
different data points.
 Figure 2.13 shows an example, where X and
Y are two spatial attributes and the third
dimension is represented by different
shapes.
 Through this visualization, we can see that
points of types “+” and “ X” tend to be
colocated.
41
Geometric Projection
Visualization Techniques
 A 3-D scatter plot uses three axes in a
Cartesian coordinate system. If it also uses
color, it can display up to 4-D data points
(Figure 2.14).

42
Geometric Projection
Visualization Techniques
 For data sets with more than four
dimensions, scatter plots are usually
ineffective.
 The scatter-plot matrix technique is a
useful extension to the scatter plot.
 For an n dimensional data set, a scatter-
plot matrix is an n x n grid of 2-D scatter
plots that provides a visualization of each
dimension with every other dimension.

43
Geometric Projection
Visualization Techniques
 Figure 2.15 shows an example, which
visualizes the Iris data set.
 The data set consists of 50 samples from
each of three species of Iris flowers.
 There are five dimensions in the data set:
length and width of sepal and petal, and
species.
 The scatter-plot matrix becomes less
effective as the dimensionality increases.

44
Geometric Projection
Visualization Techniques

45
Icon-Based Visualization
Techniques
 Icon-based visualization techniques use small icons to represent multidimensional data values.
 Two popular icon-based techniques: Chernoff faces and stick figures.
 Chernoff faces were introduced in 1973 by statistician Herman Chernoff.
 They display multidimensional data of up to 18 variables (or dimensions) as a cartoon human
face.
 Chernoff faces help reveal trends in the data. Components of the face, such as the eyes, ears,
mouth, and nose, represent values of the dimensions by their shape, size, placement, and
orientation.
 For example, dimensions can be mapped to the following facial characteristics: eye size, eye
spacing, nose length, nose width, mouth curvature, mouth width, mouth openness, pupil size,
eyebrow slant, eye eccentricity, and head eccentricity.

 Chernoff Faces

46
Stick Figure
 The stick figure visualization technique maps multidimensional
data to five-piece stick figures, where each figure has four
limbs and a body.
 Two dimensions are mapped to the display (x and y) axes and
the remaining dimensions are mapped to the angle and/or length
of the limbs.
 Figure 2.18 shows census data, where age and income are
mapped to the display axes, and the remaining dimensions
(gender, education, and so on) are mapped to stick figures.
 If the data items are relatively dense with respect to the two
display dimensions, the resulting visualization shows texture
patterns, reflecting data trends.

47
Stick Figure
[Figure: census data showing age, income, gender, education, etc.; used by permission of G. Grinstein, University of Massachusetts at Lowell]
A 5-piece stick figure (1 body and 4 limbs with different angle/length).
Two attributes are mapped to the axes; the remaining attributes are mapped to the angle or length of the limbs. Look at the texture pattern.
48
Hierarchical Visualization
Techniques
 Visualization of the data using a hierarchical partitioning into
subsets
 Worlds-within-Worlds
 “Worlds-within-Worlds,” also known as n-Vision, is a
representative hierarchical visualization method.

49
Worlds-within-Worlds

 Suppose we want to visualize a 6-D data set, where the dimensions are
F, X1, ..., X5. We want to observe how dimension F changes with respect to the
other dimensions.
 We can first fix the values of dimensions X3, X4, X5 to some selected values,
say, c3, c4, c5.
 We can then visualize F, X1, X2 using a 3-D plot, called a world, as shown in
Figure 2.19.
 The position of the origin of the inner world is located at the point (c3, c4, c5) in
the outer world, which is another 3-D plot using dimensions X3, X4, X5.
 A user can interactively change, in the outer world, the location of the origin of
the inner world.
 The user then views the resulting changes of the inner world. Moreover, a user
can vary the dimensions used in the inner world and the outer world.
 Given more dimensions, more levels of worlds can be used, which is why the
method is called “worlds-within worlds.”
50
Tree-Map
 As another example of hierarchical visualization methods, tree-maps display hierarchical data as a set
of nested rectangles.
 For example, Figure 2.20 shows a tree-map visualizing Google news stories.
 All news stories are organized into seven categories, each shown in a large rectangle of a unique color.
 Within each category (i.e., each rectangle at the top level), the news stories are further partitioned into
smaller subcategories.

Newsmap: Google News Stories in 2005
Visualizing Complex Data and
Relations
 Visualizing non-numerical data: text and social networks
 For example, many people on the Web tag various objects such as pictures, blog entries, and product reviews.
 Tag cloud: visualizing user-generated tags

 The importance of a tag is represented by font size/color.
 Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
 Besides text data, there are also methods to visualize relationships, such as
visualizing social networks.
53
Similarity and Dissimilarity
 Similarity

Numerical measure of how alike two data objects
are

Value is higher when objects are more alike

Often falls in the range [0,1]

Return the value 0 if the objects are unalike.
 Dissimilarity (e.g., distance)

Numerical measure of how different two data
objects are

Lower when objects are more alike

Minimum dissimilarity is often 0

Return the value 0 if the objects are alike.
 Proximity refers to a similarity or dissimilarity
54
Data Matrix and Dissimilarity
Matrix
Two data structures are commonly used:
 Data matrix (used to store the data objects)

 Dissimilarity matrix (used to store dissimilarity values for pairs of objects)

Data matrix (or object-by-attribute structure)

 This structure stores the n data objects in the form of a relational table, or n-by-p
matrix (n objects × p attributes)
 Each row corresponds to an object. As part of our notation, we may use f to index
through the p attributes.
 Two modes
 A data matrix is made up of two entities or “things,” namely rows (for objects) and
columns (for attributes). Therefore, the data matrix is often called a two-mode
matrix.
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]
55
Dissimilarity Matrix
• Dissimilarity matrix (or object-by-object structure): This structure stores a
collection of proximities that are available for all pairs of n objects.
• It is often represented by an n-by-n table:

\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]

• where d(i, j) is the measured dissimilarity or “difference” between objects i and j.
• In general, d(i, j) is a non-negative number that is close to 0 when objects i and j
are highly similar or “near” each other, and becomes larger the more they differ.
• d(i, i) = 0; that is, the difference between an object and itself is 0.
• d(i, j) = d(j, i). The matrix is symmetric.
• Single mode: the dissimilarity matrix contains one kind of entity (dissimilarities) and
so is called a one-mode matrix.
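
To make the two structures concrete, the sketch below builds a lower-triangular dissimilarity matrix from a small data matrix. Euclidean distance is used as the dissimilarity only for illustration; the function name and the data points are hypothetical.

```python
from math import dist


def dissimilarity_matrix(data):
    """Lower-triangular matrix: row i holds d(i, 0), ..., d(i, i-1), 0.

    Only the lower triangle is stored because d(i, j) = d(j, i)
    and d(i, i) = 0.
    """
    return [[dist(data[i], data[j]) for j in range(i)] + [0.0]
            for i in range(len(data))]


# Hypothetical data matrix: 3 objects (rows) by 2 attributes (columns).
data = [(0, 0), (3, 4), (6, 8)]
matrix = dissimilarity_matrix(data)
```

Row 0 holds only the zero on the diagonal, row 1 holds d(1, 0) and 0, and so on, mirroring the triangular layout of the n-by-n table.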
56
Proximity Measure for Nominal
Attributes
 Can take 2 or more states, e.g., red, yellow,
blue, green
 The dissimilarity between two objects i and j
can be computed based on the ratio of
mismatches:

\[ d(i, j) = \frac{p - m}{p} \]

 m: # of matches (i.e., the number of
attributes for which i and j are in the same
state), p: total # of variables
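
A short sketch of the mismatch ratio d(i, j) = (p − m) / p for nominal attributes. The function name and the two example objects are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Fraction of attributes on which the two objects disagree."""
    p = len(obj_i)                                      # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # matches
    return (p - m) / p


# Hypothetical objects described by colour, marital status, occupation.
a = ("red", "single", "engineer")
b = ("red", "married", "engineer")
d_ab = nominal_dissimilarity(a, b)
```

The two objects agree on two of three attributes, so the dissimilarity is 1/3.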
57
Proximity Measure for Binary
Attributes
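 A contingency table for binary data

                     Object j
                  1      0      sum
Object i   1      q      r      q + r
           0      s      t      s + t
           sum    q + s  r + t  p

Here q is the number of attributes equal to 1 for both objects, r the number equal to 1
for object i but 0 for object j, s the number equal to 0 for object i but 1 for object j,
and t the number equal to 0 for both.

 Distance measure for symmetric binary variables (for symmetric binary
attributes, each state is equally valuable):

\[ d(i, j) = \frac{r + s}{q + r + s + t} \]

 Distance measure for asymmetric binary variables (for asymmetric binary
attributes, the two states are not equally important, such as the positive (1)
and negative (0) outcomes of a disease test):

\[ d(i, j) = \frac{r + s}{q + r + s} \]

 Jaccard coefficient (similarity measure for asymmetric binary variables):

\[ sim_{Jaccard}(i, j) = \frac{q}{q + r + s} = 1 - d(i, j) \]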

58
Dissimilarity between Binary
Variables
 Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      Y      N       N       N       N
 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N 0

\[ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]

\[ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]

\[ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
59
Dissimilarity between Binary
Variables
0 1
d ( jack , mary )  0.33
2  0 1
11
d ( jack , jim )  0.67
111
1 2
d ( jim , mary )  0.75
11 2

 These measurements suggest that Jim and Mary


are unlikely to have a similar disease because they
have the highest dissimilarity value among the
three pairs.
 Of the three patients, Jack and Mary are the most
likely to have a similar disease.
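
The three values can be reproduced with a short sketch of the asymmetric binary dissimilarity (r + s) / (q + r + s), where q counts attributes that are 1 for both objects, r those that are 1 only for the first, and s those that are 1 only for the second. Only the six asymmetric attributes (fever, cough, test-1 to test-4) are encoded, with Y and P as 1 and N as 0; the function name is hypothetical.

```python
def asymmetric_binary_dissimilarity(obj_i, obj_j):
    """(r + s) / (q + r + s); negative matches (0, 0) are ignored."""
    pairs = list(zip(obj_i, obj_j))
    q = sum(1 for a, b in pairs if a == 1 and b == 1)
    r = sum(1 for a, b in pairs if a == 1 and b == 0)
    s = sum(1 for a, b in pairs if a == 0 and b == 1)
    return (r + s) / (q + r + s)


# Fever, cough, test-1, test-2, test-3, test-4 (gender is left out).
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)

d_jack_mary = asymmetric_binary_dissimilarity(jack, mary)
d_jack_jim = asymmetric_binary_dissimilarity(jack, jim)
d_jim_mary = asymmetric_binary_dissimilarity(jim, mary)
```

Note that the sketch would divide by zero for two objects with no positive values at all; a real implementation would need to decide how to treat that case.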

60
Dissimilarity of Numeric Data:
Minkowski Distance
 Distance measures that are commonly used for computing the dissimilarity of
objects described by numeric attributes include the Euclidean,
Manhattan, and Minkowski distances.
 h = 1: Manhattan (city block, L1 norm) distance
 E.g., the Hamming distance: the number of bits that are
different between two binary vectors

\[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]

 h = 2: Euclidean (L2 norm) distance

\[ d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]

 h → ∞: “supremum” (Lmax norm, L∞ norm) distance
 This is the maximum difference between any component (attribute) of
the vectors

\[ d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}| \]

61
Dissimilarity of Numeric Data:
Minkowski Distance
 Minkowski distance: A popular distance measure

\[ d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h} \]

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric

62
Example
 Euclidean distance and Manhattan
distance. Let x1 = (1, 2) and x2 = (3, 5)
represent two objects.
 The Euclidean distance between the two is
\(\sqrt{2^2 + 3^2} = \sqrt{13} \approx 3.61\).
 The Manhattan distance between the two is
2 +3 =5.
 Supremum distance
 The second attribute gives the greatest
difference between values for the objects,
which is 5 -2 =3. This is the supremum
distance between both objects.
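
The three distances in this example can be reproduced with a small sketch of the Minkowski distance of order h, plus the supremum distance as its limiting case. The function names are hypothetical.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum(x, y):
    """Limit of the Minkowski distance as h grows: the largest difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x1, x2 = (1, 2), (3, 5)
manhattan = minkowski(x1, x2, 1)   # h = 1
euclidean = minkowski(x1, x2, 2)   # h = 2
sup = supremum(x1, x2)             # h -> infinity
```

For these two points the distances shrink as h grows: Manhattan is largest, Euclidean is in between, and the supremum distance is smallest.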
63
Proximity Measures for
Ordinal Attributes
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
 Suppose f is an attribute from a set of ordinal
attributes, and Mf represents the number of possible states of f
 Replace xif by its rank rif ∈ {1, ..., Mf}
 Map the range of each variable onto [0, 1] by
replacing the i-th object in the f-th variable by

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

 Compute the dissimilarity using methods for
interval-scaled variables
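
A minimal sketch of the rank normalization z = (r − 1) / (M − 1) for one ordinal attribute. The state names and the function name are hypothetical, and the attribute is assumed to have at least two states.

```python
def normalize_ranks(values, ordered_states):
    """Map ordinal values onto [0, 1] via z = (rank - 1) / (M - 1).

    ordered_states lists the M possible states from lowest to highest;
    M must be at least 2.
    """
    m = len(ordered_states)
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


# Hypothetical ordinal attribute with three states.
states = ["fair", "good", "excellent"]
z = normalize_ranks(["fair", "excellent", "good"], states)
```

Once the values are on [0, 1], any numeric distance, such as Manhattan or Euclidean, can be applied to them.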
64
Attributes of Mixed Type
 A database may contain all attribute types
 Nominal, symmetric binary, asymmetric binary, numeric, ordinal.
 One may use a weighted formula to combine their effects

\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]

 Suppose that the data set contains p attributes of mixed type.
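
The weighted formula can be sketched as below. The sketch assumes the usual convention, not spelled out on the slide, that the indicator δ for attribute f is 0 when that attribute should be skipped for the pair (for example, a missing value) and 1 otherwise, and that each per-attribute dissimilarity d has already been scaled to [0, 1]. The function name and input values are hypothetical.

```python
def mixed_dissimilarity(contributions):
    """Combine per-attribute dissimilarities for one pair of objects.

    contributions: list of (delta, d) pairs, one per attribute, where
    delta is the 0/1 indicator and d the attribute's dissimilarity.
    """
    total = sum(delta for delta, _ in contributions)
    if total == 0:
        raise ValueError("no attribute contributes to this pair")
    return sum(delta * d for delta, d in contributions) / total


# Hypothetical pair: four attributes, the third one skipped (delta = 0).
pair = [(1, 0.0), (1, 1.0), (0, 0.5), (1, 0.5)]
d_mixed = mixed_dissimilarity(pair)
```

Only the three attributes with δ = 1 count, giving (0 + 1 + 0.5) / 3 = 0.5.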

65
Cosine Similarity

 A document can be represented by thousands of attributes, each recording the


frequency of a particular word (such as keywords) or phrase in the document.
 Cosine similarity is a measure of similarity that can be used to compare
documents or, say, give a ranking of documents with respect to a given vector of
query words.
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} \]

where · indicates the vector dot product and ||d|| is the length of
vector d

66
Example: Cosine Similarity

 Ex: Find the similarity between documents 1 and 2.


 Suppose that d1 and d2 are the first two term-frequency vectors in Table

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0) 0.5=(42)0.5 =
6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1) 0.5=(17)0.5
= 4.12
cos(d1, d2 ) = 0.94
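
The worked example can be checked with a short sketch using the same two term-frequency vectors. The function name is hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
sim = cosine_similarity(d1, d2)
```

Rounded to two decimals, the result agrees with the value computed by hand above.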

67
