Understanding Data Objects and Attributes
— Chapter 2 —
Getting to Know Your Data
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign
Simon Fraser University
©2011 Han, Kamber, and Pei. All rights
reserved.
Chapter 2: Getting to Know Your
Data
Data Visualization
Data Objects
Data sets are made up of data objects.
A data object represents an entity.
Examples:
In a sales database: objects may be customers, store items, sales
In a medical database: patients, treatments
In a university database: students, professors, courses
Data objects are also called samples, examples, instances, data points,
objects, or tuples.
If data objects are stored in a database, they are data tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
Attributes
Attribute (or dimensions, features,
variables): a data field, representing a
characteristic or feature of a data object.
E.g., customer_ID, name, address
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red,
white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome
(e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude
between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of a value as being a multiple (or ratio)
of another value (10 K is twice as high as 5 K).
E.g., temperature in Kelvin, length, counts,
monetary quantities
Discrete vs. Continuous
Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in
a collection of documents
Sometimes represented as integer variables
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite number of digits
Continuous attributes are typically represented as floating-point variables
Chapter 1: Applications of Data
Mining
Where there are data, there are data mining applications.
Bioinformatics & software engineering
Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological network
analysis.
Web page analysis: from web page classification and clustering to the PageRank and
HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
From major dedicated data mining systems/tools (e.g., MS SQL-Server
Analysis Manager, Oracle Data Mining Tools) to invisible data mining.
Chapter 1: Applications of Data
Mining
Two highly successful and popular application examples of data mining:
business intelligence and search engines.
Business intelligence (BI) technologies provide historical, current, and
predictive views of business operations.
Examples include reporting, online analytical processing, business
performance management, competitive intelligence, benchmarking, and
predictive analytics.
“How important is business intelligence?”
Without data mining, many businesses may not be able to perform effective
market analysis, compare customer feedback on similar products, discover
the strengths and weaknesses of their competitors, retain highly valuable
customers, and make smart business decisions.
Chapter 1: Applications of Data
Mining
Clearly, data mining is the core of business intelligence.
Online analytical processing tools in business intelligence rely on data
warehousing and multidimensional data mining.
Classification and prediction techniques are the core of predictive analytics in
business intelligence.
Moreover, clustering plays a central role in customer relationship management,
which groups customers based on their similarities.
Using characterization mining techniques, we can better understand features
of each customer group and develop customized customer reward programs.
Chapter 1: Applications of Data
Mining
Web Search Engines
A Web search engine is a specialized computer server that searches for
information on the Web.
The search results of a user query are often returned as a list (sometimes
called hits).
The hits may consist of web pages, images, and other types of files.
Search engines differ from web directories in that web directories are
maintained by human editors whereas search engines operate algorithmically
or by a mixture of algorithmic and human input.
Various data mining techniques are used in all aspects of search engines,
from crawling (e.g., deciding which pages should be crawled and
how often) and indexing (e.g., selecting pages to be indexed and
deciding to what extent the index should be constructed) to searching
(e.g., deciding how pages should be ranked, which advertisements should be
added, and how the search results can be personalized or made “context
aware”).
Chapter 1: Applications of Data
Mining
Search engines pose grand challenges to data mining.
First, they have to handle a huge and ever-growing amount of
data.
Second, Web search engines often have to deal with online
data.
Another challenge is maintaining and incrementally updating a
model on fast growing data streams.
Third, Web search engines often have to deal with queries that
are asked only a very small number of times.
Basic Statistical Descriptions of Data
Measuring the Central Tendency
Let x1, x2, ..., xN be a set of N values or observations for some numeric
attribute X, such as salary. The mean of this set of values is
\[ \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
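The mean is simply the sum of the observations divided by their count. A minimal Python sketch follows; the salary values (in thousands of dollars) are illustrative assumptions and do not come from the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count N."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)

# Hypothetical salary observations, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(mean(salaries))
```

Note that the single large value (110) pulls the mean upward, which is why the median is often preferred for skewed data.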
Measuring the Central Tendency
Median:
Middle value if odd number of values, or average of the middle two values
otherwise
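Estimated by interpolation (for grouped data):
\[ \text{median} \approx L_1 + \left( \frac{N/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]
where L_1 is the lower boundary of the median interval, N is the number of values, (\sum freq)_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.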
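As a rough sketch, the exact median and the interpolated median for grouped data could be computed as follows. The grouped version assumes the standard interpolation rule (lower boundary of the median interval plus a proportional share of its width); the function names and the `(lower, upper, freq)` tuple layout are hypothetical choices for illustration.

```python
def median(values):
    """Middle value for an odd count, average of the two middle values otherwise."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def grouped_median(intervals):
    """Approximate the median from (lower, upper, freq) intervals sorted by lower bound."""
    n = sum(freq for _, _, freq in intervals)
    below = 0  # total frequency of the intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and below + freq >= n / 2:
            # Interpolate inside the interval that contains the median.
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq
    raise ValueError("no observations in any interval")
```

For example, with intervals `[(0, 10, 2), (10, 20, 6), (20, 30, 2)]` the median interval is 10 to 20, and the estimate lands halfway through it.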
Measuring the Central Tendency
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal. In general, a data set with
two or more modes is multimodal. At the other extreme, if
each data value occurs only once, then there is no mode.
For unimodal numeric data that are moderately skewed
(asymmetrical), we have the following empirical relation:
Empirical formula:
\[ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \]
This implies that the mode for unimodal frequency curves that are
moderately skewed can easily be approximated if the mean and
median values are known.
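The mode rules above can be sketched in a few lines of Python. The helper names are hypothetical, and `approximate_mode` assumes the usual empirical relation for moderately skewed unimodal data, mean - mode ≈ 3 × (mean - median), rearranged for the mode.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values; an empty list means there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs only once, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

def approximate_mode(mean_value, median_value):
    """Estimate the mode from mean - mode ~ 3 * (mean - median)."""
    return mean_value - 3 * (mean_value - median_value)
```

A result with one, two, or three entries corresponds to unimodal, bimodal, or trimodal data respectively.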
Measuring the Central Tendency
The midrange can also be used to assess the
central tendency of a numeric data set.
It is the average of the largest and smallest
values in the set.
This measure is easy to compute using the SQL
aggregate functions, max() and min().
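The slide mentions SQL's max() and min(); the same computation in Python is a one-liner, shown here only as an illustrative sketch.

```python
def midrange(values):
    """Average of the largest and smallest values in the set."""
    if not values:
        raise ValueError("the midrange of an empty data set is undefined")
    return (max(values) + min(values)) / 2
```

Because it depends only on the two extremes, the midrange is very sensitive to outliers.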
Symmetric vs. Skewed Data
Median, mean and mode of symmetric,
positively and negatively skewed data
[Figure: three frequency curves labeled symmetric, positively skewed, and negatively skewed]
• Data in most real applications are not symmetric.
• They may instead be either positively skewed, where the mode
occurs at a value that is smaller than the median, or
negatively skewed, where the mode occurs at a value greater
than the median.
June 4, 2025 Data Mining: Concepts and Techniques 22
Measuring the Dispersion of
Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 – Q1
Five number summary: min, Q1, median, Q3, max
Boxplot: ends of the box are the quartiles; median is marked;
add whiskers, and plot outliers individually
Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
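Variance and standard deviation (sample: s, population: σ)
Variance (algebraic, scalable computation):
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2\right] \]
\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2 \]
Standard deviation s (or σ) is the square root of the variance s^2 (or σ^2)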
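The phrase "algebraic, scalable computation" suggests that variance can be accumulated in a single pass from running sums. A hedged sketch of both the textbook two-pass definition and a one-pass variant follows; the function names are hypothetical, and the one-pass form can lose precision when the mean is large relative to the spread.

```python
import math

def sample_variance(values):
    """Two-pass sample variance with the n - 1 denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

def population_variance_one_pass(values):
    """Population variance from running sums: mean of squares minus squared mean."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:  # a single scan, so partial sums can be merged across partitions
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("variance of an empty data set is undefined")
    mu = total / n
    return total_sq / n - mu * mu

def standard_deviation(variance):
    """Standard deviation is the square root of the variance."""
    return math.sqrt(variance)
```

Because only the count, the sum, and the sum of squares are needed, partial results from separate data partitions can be added together before the final division.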
Measuring the Dispersion of
Data
• Quantiles are points taken at regular intervals of a data
distribution, dividing it into essentially equal size consecutive
sets.
• The 2-quantile is the data point dividing the lower and upper
halves of the data distribution. It corresponds to the median.
• The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-fourth
of the data distribution. They are more commonly referred to as
quartiles.
• The 100-quantiles are more commonly referred to as
percentiles; they divide the data distribution into 100
equal-sized consecutive sets.
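Python's standard library exposes `statistics.quantiles`, which returns the q - 1 cut points for a chosen q. The sketch below uses illustrative data (an assumption, not from the slides); note that the library interpolates between neighboring observations, so a cut point is not always an actual data value.

```python
import statistics

def cut_points(values, q):
    """Return the q - 1 cut points that split the data into q roughly equal parts."""
    return statistics.quantiles(values, n=q)

# Hypothetical observations used only for illustration.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
quartiles = cut_points(data, 4)      # three cut points: Q1, median, Q3
percentiles = cut_points(data, 100)  # ninety-nine cut points
```

The single 2-quantile and the middle quartile both coincide with the median of the data.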
Five-number Summary, Boxplots
and Outliers
Five-number summary of a distribution
Minimum, Q1, Median(Q2), Q3, Maximum
Boxplot
Boxplots are a popular way of visualizing a
distribution. A boxplot incorporates the five-number
summary as follows:
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the box
Whiskers: two lines outside the box extended to
Minimum and Maximum observations
Outliers: points beyond a specified outlier threshold,
plotted individually
Five-number Summary, Boxplots
and Outliers
Boxplot. Figure 2.3 shows boxplots for unit price data for items sold at
four branches of AllElectronics during a given time period.
For branch 1, we see that the median price of items sold is $80, Q1 is
$60, and Q3 is $100.
Notice that two outlying observations for this branch were plotted
individually, as their values of 175 and 202 lie more than 1.5 times
the IQR (here 40, so 1.5 x IQR = 60) above Q3, that is, above 160.
Boxplots can be computed in O(n log n) time.
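The outlier rule for branch 1 can be checked with a short sketch. The quartile values (Q1 = 60, Q3 = 100) come from the example above; the function names and the default multiplier `k = 1.5` are illustrative choices.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) thresholds: k * IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3, k=1.5):
    """True when the value falls outside the fences."""
    low, high = outlier_fences(q1, q3, k)
    return value < low or value > high

# Branch 1 of the example: Q1 = 60, Q3 = 100, so IQR = 40.
low, high = outlier_fences(60, 100)
flagged = [v for v in (150, 175, 202) if is_outlier(v, 60, 100)]
```

Here the upper fence is 160, so 175 and 202 are flagged while a price of 150 would not be.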
Quantile Plot
Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
Plots quantile information
For data xi sorted in increasing order, fi
indicates that approximately 100 fi% of the data
are below or equal to the value xi
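The slide does not say how fi is computed. One common convention, assumed here, is fi = (i - 0.5) / N for the i-th smallest of N values; the sketch below produces the (fi, xi) pairs that a quantile plot would draw.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_plot_points([40, 10, 30, 20])
```

Plotting fi on the horizontal axis against xi on the vertical axis gives the quantile plot.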
Positively and Negatively Correlated
Data
Data Visualization
Data visualization aims to communicate data clearly and effectively through graphical
representation.
Data visualization has been used extensively in many applications—for example, at
work for reporting, managing business operations, and tracking progress of tasks.
Categorization of visualization methods:
Pixel-oriented visualization techniques
Geometric projection visualization techniques
Icon-based visualization techniques
Hierarchical visualization techniques
Visualizing complex data and relations
Pixel-Oriented Visualization
Techniques
A simple way to visualize the value of a dimension is to use a pixel where the color of the pixel
reflects the dimension’s value.
For a data set of m dimensions, pixel-oriented techniques create m windows on the screen, one
for each dimension.
The colors of the pixels reflect the corresponding values.
Pixel-oriented visualization of four attributes by sorting all customers in income ascending order.
Laying Out Pixels in Circle
Segments
To save space and show the connections among multiple
dimensions, space filling is often done in a circle segment
Geometric Projection
Visualization Techniques
A scatter plot displays 2-D data points
using Cartesian coordinates.
A third dimension can be added using
different colors or shapes to represent
different data points.
Figure 2.13 shows an example, where X and
Y are two spatial attributes and the third
dimension is represented by different
shapes.
Through this visualization, we can see that
points of types “+” and “X” tend to be
colocated.
Geometric Projection
Visualization Techniques
A 3-D scatter plot uses three axes in a
Cartesian coordinate system. If it also uses
color, it can display up to 4-D data points
(Figure 2.14).
Geometric Projection
Visualization Techniques
For data sets with more than four
dimensions, scatter plots are usually
ineffective.
The scatter-plot matrix technique is a
useful extension to the scatter plot.
For an n dimensional data set, a scatter-
plot matrix is an n x n grid of 2-D scatter
plots that provides a visualization of each
dimension with every other dimension.
Geometric Projection
Visualization Techniques
Figure 2.15 shows an example, which
visualizes the Iris data set.
The data set consists of 150 samples, 50 from
each of three species of Iris flowers.
There are five dimensions in the data set:
length and width of sepal and petal, and
species.
The scatter-plot matrix becomes less
effective as the dimensionality increases.
Icon-Based Visualization
Techniques
Icon-based visualization techniques use small icons to represent multidimensional data values.
Two popular icon-based techniques: Chernoff faces and stick figures.
Chernoff faces were introduced in 1973 by statistician Herman Chernoff.
They display multidimensional data of up to 18 variables (or dimensions) as a cartoon human
face.
Chernoff faces help reveal trends in the data. Components of the face, such as the eyes, ears,
mouth, and nose, represent values of the dimensions by their shape, size, placement, and
orientation.
For example, dimensions can be mapped to the following facial characteristics: eye size, eye
spacing, nose length, nose width, mouth curvature, mouth width, mouth openness, pupil size,
eyebrow slant, eye eccentricity, and head eccentricity.
Chernoff Faces
Stick Figure
The stick figure visualization technique maps multidimensional
data to five-piece stick figures, where each figure has four
limbs and a body.
Two dimensions are mapped to the display (x and y) axes and
the remaining dimensions are mapped to the angle and/or length
of the limbs.
Figure 2.18 shows census data, where age and income are
mapped to the display axes, and the remaining dimensions
(gender, education, and so on) are mapped to stick figures.
If the data items are relatively dense with respect to the two
display dimensions, the resulting visualization shows texture
patterns, reflecting data trends.
Stick Figure
[Figure: a census data figure showing age, income, gender, education, etc.; used by permission of G. Grinstein, University of Massachusetts at Lowell]
A 5-piece stick figure (1 body and 4 limbs with different angle/length)
Two attributes are mapped to the axes; the remaining attributes are mapped to the angle or length of the limbs. Look at the texture pattern.
Hierarchical Visualization
Techniques
Visualization of the data using a hierarchical partitioning into
subsets
Worlds-within-Worlds
“Worlds-within-Worlds,” also known as n-Vision, is a
representative hierarchical visualization method.
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects
are
Value is higher when objects are more alike
Often falls in the range [0,1]
Return the value 0 if the objects are unalike.
Dissimilarity (e.g., distance)
Numerical measure of how different two data
objects are
Lower when objects are more alike
Minimum dissimilarity is often 0
Return the value 0 if the objects are alike.
Proximity refers to a similarity or dissimilarity
Data Matrix and Dissimilarity
Matrix
Two data structures that are commonly used
Data matrix (used to store the data objects)
This structure stores the n data objects in the form of a relational table, or n-by-p
matrix (n objects x p attributes)
Each row corresponds to an object. As part of our notation, we may use f to index
through the p attributes.
Two modes
A data matrix is made up of two entities or “things,” namely rows (for objects) and
columns (for attributes). Therefore, the data matrix is often called a two-mode
matrix.
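\[
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
\]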
Dissimilarity Matrix
• Dissimilarity matrix (or object-by-object structure): This structure stores a
collection of proximities that are available for all pairs of n objects.
• It is often represented by an n-by-n table:
\[
\begin{bmatrix}
0 \\
d(2,1) & 0 \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]
• where d(i, j) is the measured dissimilarity or “difference” between objects i and j.
• In general, d(i, j) is a non-negative number that is close to 0 when objects i and j
are highly similar or “near” each other, and becomes larger the more they differ.
• d(i, i) = 0; that is, the difference between an object and itself is 0.
• d(i, j) = d(j, i). The matrix is symmetric.
Single mode
The dissimilarity matrix contains one kind of entity (dissimilarities) and
so is called a one-mode matrix.
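To make the two structures concrete, the sketch below turns a data matrix (a list of rows) into a dissimilarity matrix. The distance function is passed in; Manhattan distance is used here purely as an illustrative choice, and all names are hypothetical.

```python
def manhattan(a, b):
    """Sum of absolute attribute differences between two objects."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dissimilarity_matrix(data, distance):
    """Build the symmetric n-by-n matrix of pairwise dissimilarities."""
    n = len(data)
    matrix = [[0.0] * n for _ in range(n)]  # d(i, i) = 0 on the diagonal
    for i in range(n):
        for j in range(i):  # compute only the lower triangle
            d = distance(data[i], data[j])
            matrix[i][j] = d
            matrix[j][i] = d  # mirror it, since d(i, j) = d(j, i)
    return matrix

# A small data matrix: three objects (rows) described by two attributes (columns).
data_matrix = [(1, 2), (3, 5), (1, 2)]
d = dissimilarity_matrix(data_matrix, manhattan)
```

Only n(n - 1)/2 distances are actually computed, which is why the matrix is usually drawn as a lower triangle.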
Proximity Measure for Nominal
Attributes
Can take 2 or more states, e.g., red, yellow,
blue, green
The dissimilarity between two objects i and j
can be computed based on the ratio of
mismatches
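\[ d(i, j) = \frac{p - m}{p} \]
where m is the number of matches (attributes for which i and j are in the same state) and p is the total number of attributes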
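A minimal sketch of the mismatch ratio d(i, j) = (p - m) / p, where m counts matching attributes and p is the number of attributes. The example objects are hypothetical.

```python
def nominal_dissimilarity(a, b):
    """Fraction of nominal attributes on which two objects disagree."""
    if len(a) != len(b) or not a:
        raise ValueError("objects must have the same, non-zero number of attributes")
    p = len(a)
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matches
    return (p - m) / p

# Hypothetical objects described by (color, marital status, occupation).
obj_i = ("red", "single", "engineer")
obj_j = ("red", "married", "teacher")
print(nominal_dissimilarity(obj_i, obj_j))
```

The corresponding similarity is simply 1 - d(i, j), that is, m / p.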
Dissimilarity between Binary
Variables
Example
Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
Jack   M        Y       N       P        N        N        N
Mary   F        Y       N       P        N        P        N
Jim    M        Y       P       N        N        N        N
Gender is a symmetric attribute
The remaining attributes are asymmetric binary
Let the values Y and P be 1, and the value N be 0
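\[ d(\text{jack}, \text{mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]
\[ d(\text{jack}, \text{jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]
\[ d(\text{jim}, \text{mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]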
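The three results in this example can be reproduced with a short sketch. It assumes the usual asymmetric binary measure d = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects, r counts those that are 1 only for the first, and s those that are 1 only for the second; matches on 0 are ignored. Only the six asymmetric attributes (fever, cough, and the four tests) are encoded, and gender is left out.

```python
def asymmetric_binary_dissimilarity(a, b):
    """d = (r + s) / (q + r + s); negative matches (0, 0) are ignored."""
    q = r = s = 0
    for x, y in zip(a, b):
        if x == 1 and y == 1:
            q += 1
        elif x == 1 and y == 0:
            r += 1
        elif x == 0 and y == 1:
            s += 1
    if q + r + s == 0:
        raise ValueError("no attribute is 1 for either object")
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0.
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim = (1, 1, 0, 0, 0, 0)
```

The smallest value belongs to Jack and Mary, suggesting those two patients are the most likely to have a similar condition.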
Dissimilarity of Numeric Data:
Minkowski Distance
Distance measures that are commonly used for computing the dissimilarity of
objects described by numeric attributes include the Euclidean,
Manhattan, and Minkowski distances.
h = 1: Manhattan (city block, L1 norm) distance
E.g., the Hamming distance: the number of bits that are
different between two binary vectors
\[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}| \]
h = 2: Euclidean (L2 norm) distance
\[ d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2} \]
h → ∞: “supremum” (Lmax norm, L∞ norm) distance
This is the maximum difference between any component (attribute) of
the vectors
\[ d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}| \]
Dissimilarity of Numeric Data:
Minkowski Distance
Minkowski distance: A popular distance measure
\[ d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h} \]
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Example
Euclidean distance and Manhattan
distance. Let x1 = (1, 2) and x2 = (3, 5)
represent two objects.
The Euclidean distance between the two is
$\sqrt{2^2 + 3^2} = \sqrt{13} \approx 3.61$.
The Manhattan distance between the two is
2 + 3 = 5.
Supremum distance
The second attribute gives the greatest
difference between values for the objects,
which is 5 - 2 = 3. This is the supremum
distance between the two objects.
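The three distances in this example are all instances of the Minkowski distance with order h = 1, h = 2, and h → ∞. A hedged Python sketch follows; the function name and the use of `math.inf` to request the supremum distance are illustrative choices.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)  # limit of the formula as h grows without bound
    if h < 1:
        raise ValueError("h must be at least 1")
    return sum(d ** h for d in diffs) ** (1 / h)

x1 = (1, 2)
x2 = (3, 5)
manhattan = minkowski(x1, x2, 1)
euclidean = minkowski(x1, x2, 2)
supremum = minkowski(x1, x2, math.inf)
```

For these two objects the three calls reproduce the values worked out above: 5, about 3.61, and 3.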
Proximity Measures for
Ordinal Attributes
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
Suppose f is an attribute from a set of ordinal
attributes and M_f represents the number of possible states of f
replace x_{if} by its rank r_{if} ∈ {1, ..., M_f}
map the range of each variable onto [0, 1] by
replacing the rank of the i-th object in the f-th variable by
\[ z_{if} = \frac{r_{if} - 1}{M_f - 1} \]
compute the dissimilarity using methods for
interval-scaled variables
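The rank normalization z = (r - 1) / (M - 1) can be sketched as follows. The `SIZE_RANKS` mapping is a hypothetical encoding of the ordinal attribute Size = {small, medium, large}.

```python
def normalize_rank(rank, num_states):
    """Map a rank in 1..M onto [0, 1] via (rank - 1) / (M - 1)."""
    if num_states < 2:
        raise ValueError("an ordinal attribute needs at least two states")
    if not 1 <= rank <= num_states:
        raise ValueError("rank must lie between 1 and the number of states")
    return (rank - 1) / (num_states - 1)

# Hypothetical rank encoding for an ordinal attribute.
SIZE_RANKS = {"small": 1, "medium": 2, "large": 3}
z = {name: normalize_rank(r, len(SIZE_RANKS)) for name, r in SIZE_RANKS.items()}
```

After this step the normalized values can be fed to any numeric distance, for example the absolute difference between z values.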
Attributes of Mixed Type
A database may contain all attribute types
Nominal, symmetric binary, asymmetric binary, numeric,
ordinal.
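Suppose that the data set contains p attributes of mixed type.
One may use a weighted formula to combine their effects
\[ d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \]
where the indicator \delta_{ij}^{(f)} is 0 if x_{if} or x_{jf} is missing, or if both are 0 and f is asymmetric binary; otherwise it is 1.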
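A rough sketch of the weighted combination is given below. Several details are assumptions not stated on the slide: missing values are represented as `None`, an attribute is skipped (indicator 0) when a value is missing or when both values are 0 on an asymmetric binary attribute, numeric differences are divided by a supplied value range, and nominal or binary attributes contribute 0 on a match and 1 on a mismatch. The schema layout and names are hypothetical.

```python
def mixed_dissimilarity(a, b, schema):
    """Combine per-attribute dissimilarities; schema holds (kind, value_range) pairs."""
    weighted_sum = 0.0
    indicator_sum = 0
    for x, y, (kind, value_range) in zip(a, b, schema):
        if x is None or y is None:
            continue  # indicator 0: a value is missing
        if kind == "asymmetric_binary" and x == 0 and y == 0:
            continue  # indicator 0: a negative match carries no information
        if kind == "numeric":
            d = abs(x - y) / value_range  # scaled into [0, 1]
        elif kind in ("nominal", "symmetric_binary", "asymmetric_binary"):
            d = 0.0 if x == y else 1.0
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        weighted_sum += d
        indicator_sum += 1
    if indicator_sum == 0:
        raise ValueError("no comparable attributes")
    return weighted_sum / indicator_sum

# Hypothetical schema: a nominal color, a numeric score with range 10, one test result.
schema = [("nominal", None), ("numeric", 10.0), ("asymmetric_binary", None)]
```

Ordinal attributes can be handled by first normalizing their ranks onto [0, 1] and then treating them as numeric with a range of 1.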
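Cosine Similarity
\[ \cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\|d_1\| \, \|d_2\|} \]
where d_1 \cdot d_2 is the dot product of the two vectors and \|d\| is the Euclidean length of vector d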
Example: Cosine Similarity
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 x 4.12) = 0.94
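The worked example can be reproduced with a few lines of Python: the dot product of the two term-frequency vectors divided by the product of their Euclidean lengths. The function name is an illustrative choice.

```python
import math

def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))
```

A value close to 1 means the two documents use terms in similar proportions, regardless of document length.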