
Understanding Data Objects and Attributes

Chapter 2 of 'Data Mining: Concepts and Techniques' discusses the importance of understanding data objects and their attributes, covering types of attributes and their applications. It also delves into basic statistical descriptions, including measures of central tendency and dispersion, as well as data visualization techniques. The chapter emphasizes the significance of data mining in various applications, such as business intelligence and web search engines.

Uploaded by

Sriya

Data Mining:

Concepts and Techniques

— Chapter 2 —
Getting to Know Your Data
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights
reserved.
1
2
Chapter 2: Getting to Know Your
Data

 Data Objects and Attribute Types

 What Kind of Applications Are Targeted?

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

3
Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
 In sales database: objects may be customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Data objects are also called samples, examples, instances, data points,
objects, tuples.
 If data objects are stored in a database, they are data tuples.
 Data objects are described by attributes.

Database rows -> data objects; columns -> attributes.

4
Attributes
 Attribute (or dimensions, features,
variables): a data field, representing a
characteristic or feature of a data object.
 E.g., customer_ID, name, address

 Types:
 Nominal
 Binary
 Ordinal
 Numeric: quantitative
 Interval-scaled
 Ratio-scaled
5
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red,
white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important

e.g., gender
 Asymmetric binary: outcomes not equally important.

e.g., medical test (positive vs. negative)

Convention: assign 1 to most important outcome
(e.g., HIV positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude
between successive values is not known.
 Size = {small, medium, large}, grades, army rankings
6
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval

Measured on a scale of equal-sized units

Values have order
 E.g., temperature in °C or °F, calendar dates

No true zero-point
 Ratio

Inherent zero-point

We can speak of one value as being a multiple (or
ratio) of another value
(e.g., 10 K is twice as high as 5 K).
 e.g., temperature in Kelvin, length, counts,
monetary quantities

7
Discrete vs. Continuous
Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values


E.g., zip codes, profession, or the set of words in
a collection of documents
 Sometimes, represented as integer variables

 Note: Binary attributes are a special case of

discrete attributes
 Continuous Attribute
 Has real numbers as attribute values


E.g., temperature, height, or weight
 Practically, real values can only be measured and

represented using a finite number of digits

 Continuous attributes are typically represented as

floating-point variables
8
Chapter 2: Getting to Know Your
Data

 Data Objects and Attribute Types

 What Kind of Applications Are Targeted?

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

9
Chapter 1: Applications of Data
Mining
 Where there are data, there are data mining applications.
 Bioinformatics & software engineering
 Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological network
analysis.
 Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
 Collaborative analysis & recommender systems
 Basket data analysis to targeted marketing
 From major dedicated data mining systems/tools (e.g., MS SQL-Server
Analysis Manager, Oracle Data Mining Tools) to invisible data mining.

10
Chapter 1: Applications of Data
Mining
 Two highly successful and popular application examples of data mining:
business intelligence and search engines.
 Business intelligence (BI) technologies provide historical, current, and
predictive views of business operations.
 Examples include reporting, online analytical processing, business
performance management, competitive intelligence, benchmarking, and
predictive analytics.
 “How important is business intelligence?”
Without data mining, many businesses may not be able to perform effective
market analysis, compare customer feedback on similar products, discover
the strengths and weaknesses of their competitors, retain highly valuable
customers, and make smart business decisions.

11
Chapter 1: Applications of Data
Mining
Clearly, data mining is the core of business intelligence.
Online analytical processing tools in business intelligence rely on data
warehousing and multidimensional data mining.
Classification and prediction techniques are the core of predictive analytics in
business intelligence.
Moreover, clustering plays a central role in customer relationship management,
which groups customers based on their similarities.
Using characterization mining techniques, we can better understand features
of each customer group and develop customized customer reward programs.

12
Chapter 1: Applications of Data
Mining
 Web Search Engines
 A Web search engine is a specialized computer server that searches for
information on the Web.
 The search results of a user query are often returned as a list (sometimes
called hits).
 The hits may consist of web pages, images, and other types of files.
 Search engines differ from web directories in that web directories are
maintained by human editors whereas search engines operate algorithmically
or by a mixture of algorithmic and human input.
 Various data mining techniques are used in all aspects of search engines,
ranging from crawling (e.g., deciding which pages should be crawled and
the crawling frequencies), indexing (e.g., selecting pages to be indexed and
deciding to which extent the index should be constructed), and searching
(e.g., deciding how pages should be ranked, which advertisements should be
added, and how the search results can be personalized or made “context
aware”).
13
Chapter 1: Applications of Data
Mining
 Search engines pose grand challenges to data mining.
 First, they have to handle a huge and ever-growing amount of
data.
 Second, Web search engines often have to deal with online
data.
 Another challenge is maintaining and incrementally updating a
model on fast growing data streams.
 Third, Web search engines often have to deal with queries that
are asked only a very small number of times.

14
Chapter 2: Getting to Know Your
Data

 Data Objects and Attribute Types

 What Kind of Applications Are Targeted?

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

15
Basic Statistical Descriptions of Data

Basic Statistical Descriptions can be used to identify the

properties of data and highlight which data values should
be treated as noise or outliers.

16
Measuring the Central Tendency
 Let x_1, x_2, …, x_N be a set of N values or observations, such as for some numeric
attribute X, like salary. The mean of this set of values is

\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]

 Sometimes, each value x_i in a set may be associated with a weight w_i for i = 1, …, N.
The weights reflect the significance, importance, or occurrence
frequency attached to their respective values. In this case, we can compute

\[ \bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} \]

 This is called the weighted arithmetic mean or the weighted average.
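
The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the salary values (in thousands) are assumptions chosen for the example, not part of the slides.

```python
def arithmetic_mean(values):
    """Plain arithmetic mean: sum of the values divided by their count."""
    if not values:
        raise ValueError("values must be non-empty")
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Illustrative salary values (in thousands); assumed data, not from the slides.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(arithmetic_mean(salaries))            # 58.0
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
```

With all weights equal, the weighted mean reduces to the plain mean.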

17
Measuring the Central Tendency

 A major problem with the mean is its sensitivity to extreme (e.g.,

outlier) values.
 Even a small number of extreme values can corrupt the mean.
 For example, the mean score of a class in an exam could be pulled
down quite a bit by a few very low scores.
 To offset the effect caused by a small number of extreme values,
we can instead use the trimmed mean, which is the mean
obtained after chopping off values at the high and low extremes.
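
A minimal sketch of the trimmed mean, assuming the same fraction of values is dropped from each end. The function name, the `fraction` parameter, and the exam scores are hypothetical.

```python
def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end.

    `fraction` is the share removed at EACH extreme (e.g. 0.1 drops the
    lowest 10% and the highest 10%); it must be in [0, 0.5).
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical exam scores with one very low score pulling the mean down.
scores = [5, 62, 65, 68, 70, 71, 73, 75, 78, 80]
print(sum(scores) / len(scores))  # plain mean, affected by the score of 5
print(trimmed_mean(scores, 0.1))  # 70.25 (drops 5 and 80)
```

Trimming too large a fraction discards useful information, so the fraction is usually kept small.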

18
Measuring the Central Tendency
 Median:
 Middle value if odd number of values, or average of the middle two values
otherwise
 Estimated by interpolation (for grouped data):

\[ \text{median} = L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]

 where L_1 is the lower boundary of the median interval, N is the number of
values in the entire data set, (∑freq)_l is the sum of the frequencies of all of the
intervals that are lower than the median interval, freq_median is the frequency of
the median interval, and width is the width of the median interval.
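
The interpolation can be sketched as follows, assuming the grouped data are given as (lower boundary, upper boundary, frequency) triples sorted by lower boundary. The function name and the intervals are made up for illustration.

```python
def grouped_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    `intervals` is a list of (lower, upper, freq) tuples sorted by lower
    boundary. Implements L1 + (N/2 - cumulative_freq_below) / freq_median * width.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("total frequency must be positive")
    cumulative = 0  # sum of frequencies of intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:  # this is the median interval
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("intervals do not cover the median")


# Hypothetical grouped data: N = 50, so the median lies in the 20-30 interval.
groups = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]
print(grouped_median(groups))  # 22.5
```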

19
Measuring the Central Tendency
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal. In general, a data set with
two or more modes is multimodal. At the other extreme, if
each data value occurs only once, then there is no mode.
 For unimodal numeric data that are moderately skewed
(asymmetrical), we have the following empirical relation:
 Empirical formula:

\[ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \]
 This implies that the mode for unimodal frequency curves that are
moderately skewed can easily be approximated if the mean and
median values are known.
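
As a sketch, the mode(s) can be found by counting occurrences, and the empirical relation mean − mode ≈ 3 × (mean − median) can be rearranged to mode ≈ 3 × median − 2 × mean. The function names and sample values are assumptions for illustration.

```python
from collections import Counter


def modes(values):
    """Return all modes in ascending order; empty list if every value is unique."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:  # each value occurs only once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)


def approximate_mode(mean, median):
    """Estimate the mode of moderately skewed unimodal data.

    Rearranged from mean - mode ~= 3 * (mean - median).
    """
    return 3 * median - 2 * mean


# Illustrative values (assumed data): 52 and 70 each occur twice, so bimodal.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(modes(sample))             # [52, 70]
print(approximate_mode(58, 54))  # 46
```

The approximation is only meaningful for unimodal, moderately skewed data, as stated above.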

20
Measuring the Central Tendency
 The midrange can also be used to assess the
central tendency of a numeric data set.
 It is the average of the largest and smallest
values in the set.
 This measure is easy to compute using the SQL
aggregate functions, max() and min().

21
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric,
positively and negatively skewed data
• In a unimodal frequency curve with a perfectly
symmetric data distribution, the mean,
median, and mode are all at the same
center value.
• Data in most real applications are not
symmetric.
• They may instead be either
positively skewed, where the mode
occurs at a value that is smaller than
the median, or negatively skewed, where the mode
occurs at a value greater than the median.
[Figure: frequency curves for symmetric, positively skewed, and negatively skewed data]
June 4, 2025 Data Mining: Concepts and Techniques 22
Measuring the Dispersion of
Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked;
add whiskers, and plot outliers individually
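 Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \]

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2 \]

 Standard deviation s (or σ) is the square root of variance s^2 (or
σ^2)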

23
Measuring the Dispersion of
Data

• The measures include range, quantiles, quartiles,

percentiles, and the interquartile range.
• The five-number summary, which can be displayed as a
boxplot, is useful in identifying outliers.
• Variance and standard deviation also indicate the spread
of a data distribution.
• Let x_1, x_2, …, x_N be a set of observations for some numeric
attribute, X.
• The range of the set is the difference between the largest
(max()) and smallest (min()) values.

24
Measuring the Dispersion of
Data
• Quantiles are points taken at regular intervals of a data
distribution, dividing it into essentially equal size consecutive
sets.
• The 2-quantile is the data point dividing the lower and upper
halves of the data distribution. It corresponds to the median.
• The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-fourth
of the data distribution. They are more commonly referred to as
quartiles.
• The 100-quantiles are more commonly referred to as
percentiles; they divide the data distribution into 100 equal-
sized consecutive sets.
25
Measuring the Dispersion of
Data

• The quartiles give an indication of a

distribution’s center, spread, and shape. The
first quartile, denoted by Q1, is the 25th
percentile. It cuts off the lowest 25% of the
data.
26
Measuring the Dispersion of
Data

• The distance between the first and third quartiles

is a simple measure of spread that gives the range
covered by the middle half of the data. This
distance is called the interquartile range (IQR)
and is defined as

\[ IQR = Q_3 - Q_1 \]

27
Five-number Summary, Boxplots
and Outliers
 Five-number summary of a distribution
 Minimum, Q1, Median(Q2), Q3, Maximum
 Boxplot
 Boxplots are a popular way of visualizing a
distribution. A boxplot incorporates the five-number
summary as follows:
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended to
Minimum and Maximum observations
 Outliers: points beyond a specified outlier threshold,
plotted individually
28
Five-number Summary, Boxplots
and Outliers

Boxplot. Figure 2.3 shows boxplots for unit price data for items sold at
four branches of AllElectronics during a given time period.
For branch 1, we see that the median price of items sold is $80, Q1 is
$60, and Q3 is $100.
Notice that two outlying observations for this branch were plotted
individually, as their values of 175 and 202 lie more than 1.5 times
the IQR (here 40, so 1.5 × IQR = 60) above Q3, i.e., above $160.
Boxplots can be computed in O(n log n) time.
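
The five-number summary and the 1.5 × IQR rule can be sketched as below. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of Python's `statistics.quantiles`, and the function names are hypothetical. The branch 1 quartiles (Q1 = 60, Q3 = 100) and the values 175 and 202 come from the example above.

```python
import statistics


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)


def outlier_fences(q1, q3, k=1.5):
    """Return (low, high) thresholds: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Branch 1 from the boxplot example: Q1 = 60, Q3 = 100, so IQR = 40.
low, high = outlier_fences(60, 100)
flagged = [v for v in (175, 202) if v < low or v > high]
print(low, high)  # 0.0 160.0
print(flagged)    # [175, 202]
```

Sorting dominates the cost of the summary, which is consistent with the O(n log n) bound mentioned above.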
29
Five-number Summary, Boxplots and
Outliers

Variance and Standard Deviation

• Variance and standard deviation are measures of data dispersion.
• They indicate how spread out a data distribution is.
• A low standard deviation means that the data observations tend to
be very close to the mean, while a high standard deviation
indicates that the data are spread out over a large range of
values.
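• The variance of N observations, x_1, x_2, …, x_N, for a numeric attribute X
is

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2 \]

where \bar{x} is the mean value of the observations. The standard deviation, σ,
of the observations is the square root of the variance, σ^2.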
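
A short sketch of the population variance, computed both from the definition and from the equivalent one-pass form (mean of squares minus squared mean). The function names and the sample data are assumptions.

```python
import math


def population_variance(values):
    """Variance as the average squared deviation from the mean (divides by N)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_variance_shortcut(values):
    """Same quantity via (1/N) * sum(x_i^2) - mean^2."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean ** 2


# Hypothetical observations with mean 5.
observations = [2, 4, 4, 4, 5, 5, 7, 9]
variance = population_variance(observations)
print(variance)             # 4.0
print(math.sqrt(variance))  # 2.0 (standard deviation)
```

The shortcut form needs only running sums of x and x², which is what makes the computation scalable, although it can lose precision when the mean is large relative to the spread.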
Graphic Displays of Basic Statistical
Descriptions of Data
 Histogram: x-axis shows values, y-axis represents frequencies
 Quantile plot: each value x_i is paired with f_i, indicating that approximately
100 f_i % of the data are ≤ x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate
distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and plotted as
points in the plane. If the plotted points pattern slopes from lower left to
upper right, this means that the values of X increase as the values of Y
increase, suggesting a positive correlation. If the pattern of plotted points
slopes from upper left to lower right, the values of X increase as the values
of Y decrease, suggesting a negative correlation

31
Quantile Plot
 Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
 Plots quantile information

For data x_i sorted in increasing order, f_i
indicates that approximately 100 f_i % of the data
are below or equal to the value x_i
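
The slide does not say how f_i is computed. The sketch below assumes one common convention, f_i = (i − 0.5) / N for the i-th smallest value, and only produces the (f_i, x_i) pairs that would be plotted; the function name and data are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N.

    The (i - 0.5) / N convention is an assumption; other conventions exist.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


# Hypothetical unit prices.
points = quantile_plot_points([30, 10, 20, 40])
print(points)  # [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```

Plotting f_i on the x-axis against x_i on the y-axis gives the quantile plot.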


Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
 View: Is there a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2
for each quantile. Unit prices of items sold at Branch 1 tend to
be lower than those at Branch 2.

33
Positively and Negatively Correlated
Data

 The left half fragment is positively
correlated
 The right half is negatively correlated

34
Chapter 2: Getting to Know Your
Data

 Data Objects and Attribute Types

 What Kind of Applications Are Targeted?

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

35
Data Visualization
 Data visualization aims to communicate data clearly and effectively through graphical
representation.
 Data visualization has been used extensively in many applications—for example, at
work for reporting, managing business operations, and tracking progress of tasks.
 Categorization of visualization methods:
 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations

36
Pixel-Oriented Visualization
Techniques
 A simple way to visualize the value of a dimension is to use a pixel where the color of the pixel
reflects the dimension’s value.
 For a data set of m dimensions, pixel-oriented techniques create m windows on the screen, one
for each dimension.
 The colors of the pixels reflect the corresponding values.
 Pixel-oriented visualization of four attributes by sorting all customers in income ascending order.

(a) Income (b) Credit limit (c) Transaction volume (d) Age
37
Pixel-Oriented Visualization
Techniques

• The pixel colors are chosen so that the smaller the value,
the lighter the shading.
• Using pixel based visualization, we can easily
observe the following: credit limit increases as
income increases; customers whose income is in the
middle range are more likely to purchase more from
AllElectronics; there is no clear correlation between
income and age.

38
Laying Out Pixels in Circle
Segments
 To save space and show the connections among multiple
dimensions, space filling is often done in a circle segment

(a) Representing a data record in circle segment
(b) Laying out pixels in circle segment
39
Geometric Projection Visualization
Techniques
 A drawback of pixel-oriented visualization techniques is that they cannot help us much
in understanding the distribution of data in a multidimensional space.
 Geometric projection techniques help users find interesting projections of
multidimensional data sets. The central challenge that geometric projection
techniques try to address is how to visualize a high-dimensional space on a 2-D
display.

40
Geometric Projection
Visualization Techniques
 A scatter plot displays 2-D data points
using Cartesian coordinates.
 A third dimension can be added using
different colors or shapes to represent
different data points.
 Figure 2.13 shows an example, where X and
Y are two spatial attributes and the third
dimension is represented by different
shapes.
 Through this visualization, we can see that
points of types “+” and “×” tend to be
colocated.
41
Geometric Projection
Visualization Techniques
 A 3-D scatter plot uses three axes in a
Cartesian coordinate system. If it also uses
color, it can display up to 4-D data points
(Figure 2.14).

42
Geometric Projection
Visualization Techniques
 For data sets with more than four
dimensions, scatter plots are usually
ineffective.
 The scatter-plot matrix technique is a
useful extension to the scatter plot.
 For an n dimensional data set, a scatter-
plot matrix is an n x n grid of 2-D scatter
plots that provides a visualization of each
dimension with every other dimension.

43
Geometric Projection
Visualization Techniques
 Figure 2.15 shows an example, which
visualizes the Iris data set.
 The data set consists of 50 samples from
each of three species of Iris flowers (150 samples in total).
 There are five dimensions in the data set:
length and width of sepal and petal, and
species.
 The scatter-plot matrix becomes less
effective as the dimensionality increases.

44
Geometric Projection
Visualization Techniques

45
Icon-Based Visualization
Techniques
 Icon-based visualization techniques use small icons to represent multidimensional data values.
 Two popular icon-based techniques: Chernoff faces and stick figures.
 Chernoff faces were introduced in 1973 by statistician Herman Chernoff.
 They display multidimensional data of up to 18 variables (or dimensions) as a cartoon human
face.
 Chernoff faces help reveal trends in the data. Components of the face, such as the eyes, ears,
mouth, and nose, represent values of the dimensions by their shape, size, placement, and
orientation.
 For example, dimensions can be mapped to the following facial characteristics: eye size, eye
spacing, nose length, nose width, mouth curvature, mouth width, mouth openness, pupil size,
eyebrow slant, eye eccentricity, and head eccentricity.

 Chernoff Faces

46
Stick Figure
 The stick figure visualization technique maps multidimensional
data to five-piece stick figures, where each figure has four
limbs and a body.
 Two dimensions are mapped to the display (x and y) axes and
the remaining dimensions are mapped to the angle and/or length
of the limbs.
 Figure 2.18 shows census data, where age and income are
mapped to the display axes, and the remaining dimensions
(gender, education, and so on) are mapped to stick figures.
 If the data items are relatively dense with respect to the two
display dimensions, the resulting visualization shows texture
patterns, reflecting data trends.

47
Stick Figure
[Figure: census data showing age, income, gender, education, etc.; used by permission of G. Grinstein, University of Massachusetts at Lowell]
A 5-piece stick figure (1 body and 4 limbs with different angle/length).
Two attributes are mapped to the axes; the remaining attributes are mapped to the angle or length of the limbs. Look at the texture pattern.
48
Hierarchical Visualization
Techniques
 Visualization of the data using a hierarchical partitioning into
subsets
 Worlds-within-Worlds
 “Worlds-within-Worlds,” also known as n-Vision, is a
representative hierarchical visualization method.

49
Worlds-within-Worlds

 Suppose we want to visualize a 6-D data set, where the dimensions are
F, X1, …, X5. We want to observe how dimension F changes with respect to the
other dimensions.
 We can first fix the values of dimensions X3, X4, X5 to some selected values,
say, c3, c4, c5.
 We can then visualize F, X1, X2 using a 3-D plot, called a world, as shown in
Figure 2.19.
 The position of the origin of the inner world is located at the point (c3, c4, c5) in
the outer world, which is another 3-D plot using dimensions X3, X4, X5.
 A user can interactively change, in the outer world, the location of the origin of
the inner world.
 The user then views the resulting changes of the inner world. Moreover, a user
can vary the dimensions used in the inner world and the outer world.
 Given more dimensions, more levels of worlds can be used, which is why the
method is called “worlds-within-worlds.”
50
Tree-Map
 As another example of hierarchical visualization methods, tree-maps display hierarchical data as a set
of nested rectangles.
 For example, Figure 2.20 shows a tree-map visualizing Google news stories.
 All news stories are organized into seven categories, each shown in a large rectangle of a unique color.
 Within each category (i.e., each rectangle at the top level), the news stories are further partitioned into
smaller subcategories.

Newsmap: Google News Stories in 2005
51
Visualizing Complex Data and
Relations
 Visualizing non-numerical data: text and social networks
 For example, many people on the Web tag various objects such as pictures, blog entries, and product reviews.
 Tag cloud: visualizing user-generated tags

 Data Objects and Attribute Types

 What Kind of Applications Are Targeted?

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

53
Similarity and Dissimilarity
 Similarity

Numerical measure of how alike two data objects
are

Value is higher when objects are more alike

Often falls in the range [0,1]

Return the value 0 if the objects are unalike.
 Dissimilarity (e.g., distance)

Numerical measure of how different two data
objects are

Lower when objects are more alike

Minimum dissimilarity is often 0

Return the value 0 if the objects are alike.
 Proximity refers to a similarity or dissimilarity
54
Data Matrix and Dissimilarity
Matrix
Two data structures are commonly used:
 Data matrix (used to store the data objects)
 Dissimilarity matrix (used to store dissimilarity values for pairs of objects)

Data matrix (or object-by-attribute structure)

 This structure stores the n data objects in the form of a relational table, or n-by-p
matrix (n objects × p attributes)
 Each row corresponds to an object. As part of our notation, we may use f to index
through the p attributes.
 Two modes
 A data matrix is made up of two entities or “things,” namely rows (for objects) and
columns (for attributes). Therefore, the data matrix is often called a two-mode
matrix.

\[ \begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix} \]
55
Dissimilarity Matrix
• Dissimilarity matrix (or object-by-object structure): This structure stores a
collection of proximities that are available for all pairs of n objects.
• It is often represented by an n-by-n table:

\[ \begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix} \]

• where d(i, j) is the measured dissimilarity or “difference” between objects i and j.
• In general, d(i, j) is a non-negative number that is close to 0 when objects i and j
are highly similar or “near” each other, and becomes larger the more they differ.
• d(i, i) = 0; that is, the difference between an object and itself is 0.
• d(i, j) = d(j, i); the matrix is symmetric.
• Single mode: the dissimilarity matrix contains one kind
of entity (dissimilarities) and
so is called a one-mode matrix.
56
Proximity Measure for Nominal
Attributes
 Can take 2 or more states, e.g., red, yellow,
blue, green
 The dissimilarity between two objects i and j
can be computed based on the ratio of
mismatches
\[ d(i, j) = \frac{p - m}{p} \]

 m: # of matches(i.e., the number of

attributes for which i and j are in the same
state), p: total # of variables
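
A minimal sketch of the mismatch ratio d(i, j) = (p − m) / p for two objects described by nominal attributes. The function name and the attribute values are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Ratio of mismatching attributes: (p - m) / p.

    p is the total number of attributes and m the number of attributes
    on which the two objects are in the same state.
    """
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    if p == 0:
        raise ValueError("objects must have at least one attribute")
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


# Hypothetical objects: (color, marital status, occupation).
first = ("red", "single", "teacher")
second = ("red", "married", "teacher")
print(nominal_dissimilarity(first, second))  # one mismatch out of three attributes
```

The corresponding similarity can be taken as 1 − d(i, j) = m / p.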
57
Proximity Measure for Binary
Attributes
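 A contingency table for binary data

                        Object j
                    1       0       sum
Object i    1       q       r       q + r
            0       s       t       s + t
            sum     q + s   r + t   p

Here q is the number of attributes that equal 1 for both objects i and j, r is the
number that equal 1 for i but 0 for j, s is the number that equal 0 for i but 1 for j,
and t is the number that equal 0 for both; p = q + r + s + t.

 Distance measure for symmetric binary variables (for symmetric binary
attributes, each state is equally valuable):

\[ d(i, j) = \frac{r + s}{q + r + s + t} \]

 Distance measure for asymmetric binary variables (for asymmetric binary
attributes, the two states are not equally
important, such as the positive (1) and negative (0)
outcomes of a disease test):

\[ d(i, j) = \frac{r + s}{q + r + s} \]

 Jaccard coefficient (similarity measure
for asymmetric binary variables):

\[ sim_{Jaccard}(i, j) = \frac{q}{q + r + s} \]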

58
Dissimilarity between Binary
Variables
 Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
 Gender is a symmetric attribute
 The remaining attributes are asymmetric binary
 Let the values Y and P be 1, and the value N be 0

\[ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]

\[ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]

\[ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
59
Dissimilarity between Binary
Variables
0 1
d ( jack , mary )  0.33
2  0 1
11
d ( jack , jim )  0.67
111
1 2
d ( jim , mary )  0.75
11 2

 These measurements suggest that Jim and Mary

are unlikely to have a similar disease because they
have the highest dissimilarity value among the
three pairs.
 Of the three patients, Jack and Mary are the most
likely to have a similar disease.
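
The three values in this example can be reproduced with a small sketch. It assumes the asymmetric measure d = (r + s) / (q + r + s) over the six asymmetric attributes only (Fever, Cough, Test-1 to Test-4), with Y and P encoded as 1 and N as 0. The function names are hypothetical.

```python
def contingency_counts(a, b):
    """Count (q, r, s, t): 1/1, 1/0, 0/1 and 0/0 attribute pairs."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def asymmetric_binary_dissimilarity(a, b):
    """(r + s) / (q + r + s): negative matches (t) are ignored."""
    q, r, s, _ = contingency_counts(a, b)
    if q + r + s == 0:
        raise ValueError("no positive values to compare")
    return (r + s) / (q + r + s)


# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (gender is excluded).
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]

print(round(asymmetric_binary_dissimilarity(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_dissimilarity(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_dissimilarity(jim, mary), 2))   # 0.75
```

The Jaccard similarity for the same pair is 1 minus the asymmetric dissimilarity.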

60
Dissimilarity of Numeric Data:
Minkowski Distance
 Distance measures are commonly used for computing the dissimilarity of
objects described by numeric attributes. These measures include the Euclidean,
Manhattan, and Minkowski distances.
 h = 1: Manhattan (city block, L1 norm) distance
 E.g., the Hamming distance: the number of bits that are
different between two binary vectors

\[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]

 h = 2: Euclidean (L2 norm) distance

\[ d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]

 h → ∞: “supremum” (Lmax norm, L∞ norm) distance
 This is the maximum difference between any component (attribute) of
the vectors

\[ d(i, j) = \max_{f} |x_{if} - x_{jf}| \]
61
Dissimilarity of Numeric Data:
Minkowski Distance
 Minkowski distance: A popular distance measure

\[ d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h} \]

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric

62
Example
 Euclidean distance and Manhattan
distance. Let x1 = (1, 2) and x2 = (3, 5)
represent two objects.
 The Euclidean distance between the two is
√(2^2 + 3^2) = √13 ≈ 3.61.
 The Manhattan distance between the two is
2 + 3 = 5.
 Supremum distance
 The second attribute gives the greatest
difference between values for the objects,
which is 5 − 2 = 3. This is the supremum
distance between both objects.
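
The three distances in this example can be checked with a short sketch of the Minkowski distance for order h ≥ 1, plus the supremum distance as the limiting case. The function names are assumptions for illustration.

```python
def minkowski_distance(x, y, h):
    """Minkowski distance of order h (h = 1: Manhattan, h = 2: Euclidean)."""
    if h < 1:
        raise ValueError("h must be at least 1")
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum_distance(x, y):
    """Largest absolute difference over all attributes (h -> infinity)."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return max(abs(a - b) for a, b in zip(x, y))


x1, x2 = (1, 2), (3, 5)
print(minkowski_distance(x1, x2, 1))            # 5.0  (Manhattan)
print(round(minkowski_distance(x1, x2, 2), 2))  # 3.61 (Euclidean)
print(supremum_distance(x1, x2))                # 3
```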
63
Proximity Measures for
Ordinal Attributes
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Can be treated like interval-scaled
 Suppose f is an attribute from a set of ordinal
attributes, and M_f represents the number of possible states of f
 Replace each x_if by its rank r_if ∈ {1, …, M_f}
 Map the range of each variable onto [0, 1] by
replacing the rank of the i-th object in the f-th variable by

\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]

 Compute the dissimilarity using methods for
interval-scaled variables
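
The rank normalization can be sketched as follows, assuming the ordered list of states is known in advance. The function name and the size values are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] via z = (rank - 1) / (M - 1).

    `ordered_states` lists the M possible states from lowest to highest.
    """
    m = len(ordered_states)
    if m < 2:
        raise ValueError("need at least two states to normalize")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (m - 1) for v in values]


sizes = ["small", "large", "medium"]
z = ordinal_to_unit_interval(sizes, ["small", "medium", "large"])
print(z)                # [0.0, 1.0, 0.5]
print(abs(z[0] - z[1])) # 1.0, distance between "small" and "large"
```

Once mapped, any numeric distance (for example Manhattan or Euclidean) can be applied to the z values.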
64
Attributes of Mixed Type
 A database may contain all attribute types
 Nominal, symmetric binary, asymmetric binary, numeric,

ordinal.
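 One may use a weighted formula to combine their effects

\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]

 Suppose that the data set contains p attributes of mixed type. The indicator
δ_ij^(f) is 0 if x_if or x_jf is missing, or if x_if = x_jf = 0 and attribute f is
asymmetric binary; otherwise it is 1. d_ij^(f) is the dissimilarity contributed by attribute f.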

65
Cosine Similarity

 A document can be represented by thousands of attributes, each recording the

frequency of a particular word (such as keywords) or phrase in the document.
 Cosine similarity is a measure of similarity that can be used to compare
documents or, say, give a ranking of documents with respect to a given vector of
query words.
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency
vectors), then
 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of
vector d

66
Example: Cosine Similarity

 Ex: Find the similarity between documents 1 and 2.

 Suppose that d1 and d2 are the first two term-frequency vectors in Table

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 =
6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5
= 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
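
The calculation above can be reproduced with a short sketch. The function name is an assumption; the two vectors are the term-frequency vectors d1 and d2 from the example.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

A value close to 1 means the two documents use terms in similar proportions, regardless of document length.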

