COIS 448: Data Mining &
Business Intelligence
Getting to Know Your Data
Information Systems Department
Faculty of Computing and Information Technology Rabigh
King Abdulaziz University
Slide adapted from Dr. Arda
6 Data Objects and Attributes
Data sets are made up of data objects.
A data object represents an entity.
Examples:
sales database: customers, store items, sales
medical database: patients, treatments
university database: students, professors, courses
Also called samples, examples, instances, data points, objects, or tuples.
Data objects are described by attributes.
Database rows -> data objects; columns -> attributes.
Attribute (also called dimension, feature, or variable): a data field representing a
characteristic or feature of a data object.
E.g., customer_ID, name, address
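The row/column view above can be sketched in a few lines of Python. The customer values below are invented for illustration; only the attribute names come from the slide.

```python
# Each dictionary is one data object (a database row);
# each key is an attribute (a database column).
# The values are hypothetical, made up for this sketch.
customers = [
    {"customer_ID": 1, "name": "Sara", "address": "Rabigh"},
    {"customer_ID": 2, "name": "Omar", "address": "Jeddah"},
]

# The attributes shared by all data objects in this small data set.
attributes = list(customers[0].keys())

# One attribute viewed across all data objects (one column).
names = [c["name"] for c in customers]
```

Here `customers` plays the role of the data set, each element is a data object, and `attributes` lists the columns.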
8 types of attributes
9 types of attributes
10 types of attributes
11 Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data
12 Noise
Noise refers to modification of original values
Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on
a television screen
[Figure: two panels, “Two Sine Waves” and “Two Sine Waves + Noise”]
13 Outliers
Outliers are data objects with characteristics that are considerably different than most
of the other data objects in the data set
14 Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their probabilities)
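The first three handling strategies can be sketched as follows. The records are hypothetical, `None` is assumed to mark a missing value, and using the mean as the estimate is only one possible choice.

```python
from statistics import mean

# Hypothetical records; None marks a missing age.
records = [
    {"name": "A", "age": 25},
    {"name": "B", "age": None},
    {"name": "C", "age": 35},
]

# Strategy 1: eliminate data objects that have a missing value.
complete = [r for r in records if r["age"] is not None]

# Strategy 2: estimate the missing value (here: the mean of the known ages).
known_ages = [r["age"] for r in records if r["age"] is not None]
estimate = mean(known_ages)
imputed = [
    {**r, "age": r["age"] if r["age"] is not None else estimate}
    for r in records
]

# Strategy 3: ignore the missing value during analysis
# (the statistic is computed over the known values only).
average_age = mean(known_ages)
```

Eliminating objects shrinks the data set, while estimating keeps every object at the cost of introducing a guessed value.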
15 Duplicate Data
Data set may include data objects that are duplicates, or almost
duplicates of one another
Major issue when merging data from heterogeneous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
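A minimal sketch of merging near-duplicate records for the email example above. The records and the `normalize` helper are hypothetical, and matching on a normalized name is an assumption made for illustration; different people can share a name, so real data cleaning needs stronger matching rules.

```python
# Hypothetical customer records merged from two sources.
records = [
    {"name": "Sara Ali", "email": "sara@example.com"},
    {"name": "sara ali ", "email": "s.ali@example.org"},
    {"name": "Omar Said", "email": "omar@example.com"},
]

def normalize(name):
    # Lowercase and collapse whitespace so near-identical spellings compare equal.
    return " ".join(name.lower().split())

merged = {}
for r in records:
    key = normalize(r["name"])
    # Keep one entry per person and collect all of that person's email addresses.
    merged.setdefault(key, {"name": key, "emails": []})["emails"].append(r["email"])

deduplicated = list(merged.values())
```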
16 Basic Statistical Descriptions of Data
Mean
The most common and effective numeric measure of the “center” of a set of data is the (arithmetic)
mean.
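Let x1, x2, . . . , xN be a set of N values or observations, such as for some numeric attribute X, like salary.
The mean of this set of values is
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}
Example: To find the mean of 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110, add the 12 values and divide by 12:
\bar{x} = \frac{30 + 36 + \cdots + 110}{12} = \frac{696}{12} = 58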
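The mean of the salary values can be checked with a short Python sketch. The variable names are illustrative.

```python
from statistics import mean

# Salary data from the example, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Mean by the definition: sum of the values divided by their count.
total = sum(salaries)
average = total / len(salaries)

# The standard library gives the same result.
library_average = mean(salaries)
```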
17 Basic Statistical Descriptions of Data
Median
For skewed (asymmetric) data, a better measure of the center of data is the median, which is the
middle value in a set of ordered data values. It is the value that separates the higher half of a data
set from the lower half.
Example: Median.
Let’s find the median of the salary data, in thousands of dollars
(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110):
There is an even number of observations (12); therefore, the median is not unique.
It can be any value between the two middlemost values, 52 and 56 (that is, the sixth and
seventh values in the list).
By convention, we assign the average of the two middlemost values as the median; that is,
\frac{52 + 56}{2} = \frac{108}{2} = 54
The median is therefore $54,000.
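A short sketch of the same convention in Python: for an even number of values, average the two middlemost ones. Python's `statistics.median` follows this convention as well.

```python
from statistics import median

# Salary data from the example, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

data = sorted(salaries)
n = len(data)

if n % 2 == 1:
    # Odd count: the single middle value.
    manual_median = data[n // 2]
else:
    # Even count: average of the two middlemost values.
    manual_median = (data[n // 2 - 1] + data[n // 2]) / 2

library_median = median(salaries)
```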
18 Basic Statistical Descriptions of Data
Mode
The mode for a set of data is the value that occurs most frequently in the set. Therefore,
it can be determined for qualitative and quantitative attributes. It is possible for the greatest
frequency to correspond to several different values, which results in more than one mode.
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and
trimodal. In general, a data set with two or more modes is multimodal.
At the other extreme, if each data value occurs only once, then there is no mode.
Example: Mode.
(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
These data, from the previous example, are bimodal. The two modes are $52,000 and $70,000.
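The modes can be found by counting how often each value occurs. The sketch below uses `statistics.multimode` (available from Python 3.8), which returns every value that reaches the highest frequency.

```python
from collections import Counter
from statistics import multimode

# Salary data from the example, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Frequency of each value.
counts = Counter(salaries)
highest = max(counts.values())

# All values that occur with the highest frequency.
manual_modes = sorted(v for v, c in counts.items() if c == highest)

library_modes = multimode(salaries)
```

Note that when every value occurs exactly once, `multimode` returns all of the values, whereas the slide's convention says such a data set has no mode.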
19 Basic Statistical Descriptions of Data
Midrange
The midrange can also be used to assess the central
tendency of a numeric data set. It is the average of the
largest and smallest values in the set.
Example: Midrange.
(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
The midrange of the data from the previous example is
(30,000 + 110,000) / 2 = $70,000.
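The midrange calculation, sketched in Python:

```python
# Salary data from the example, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Midrange: average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2
```

Because it depends only on the two extreme values, the midrange is pulled strongly by the single high salary of 110.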
20 Basic Statistical Descriptions of Data
In a unimodal frequency curve with a perfectly symmetric data distribution, the mean,
median, and mode are all at the same center value, as shown in Figure 2.1(a). Data in
most real applications are not symmetric. They may instead be either positively
skewed, where the mode occurs at a value that is smaller than the median (Figure
2.1b), or negatively skewed, where the mode occurs at a value greater than the
median (Figure 2.1c).
21 Measuring the Dispersion of Data: Range, Quartiles, Variance,
Standard Deviation, and Interquartile Range
Measures of data dispersion include range, quantiles, quartiles, percentiles, and
the interquartile range. The five-number summary, which can be
displayed as a boxplot, is useful in identifying outliers. Variance and
standard deviation also indicate the spread of a data distribution.
Range, Quartiles, and Interquartile Range
To start off, let’s study the range, quantiles, quartiles, percentiles,
and the interquartile range as measures of data dispersion.
Let x1, x2, … , xN be a set of observations for some numeric attribute
X.
Range
The range of the set is the difference between the largest value, max(X),
and the smallest value, min(X): range = max(X) - min(X).
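As a quick illustration, the range of the salary data used in the earlier examples can be computed like this:

```python
# Salary data from the earlier examples, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Range: largest value minus smallest value.
data_range = max(salaries) - min(salaries)
```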
22 Measuring the Dispersion of Data
Quantiles
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially
equal-sized consecutive sets.
Quartiles
The kth q-quantile for a given data distribution is the value x such that at most k/q of the data
values are less than x and at most (q-k)/q of the data values are more than x, where k is an
integer such that 0 < k < q. There are q-1 q-quantiles.
The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It
corresponds to the median.
The 4-quantiles are the three data points that split the data distribution into four equal parts;
each part represents one-fourth of the data distribution. They are more commonly referred to
as quartiles, as in Figure 2.2.
The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets.
The median, quartiles, and percentiles are the most widely used forms of quantiles.
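A hedged sketch of computing q-quantiles with Python's `statistics.quantiles` (Python 3.8+), using the salary data from the earlier examples. Software packages interpolate between data points in different ways, so the cut points may differ slightly from values picked by hand; the two `method` options below illustrate this.

```python
from statistics import median, quantiles

# Salary data from the earlier examples, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# 2-quantile: one cut point, which corresponds to the median.
two_quantile = quantiles(salaries, n=2)

# 4-quantiles (quartiles): three cut points.
# "exclusive" is the library default; "inclusive" treats the data
# as containing the extreme values of the population.
quartiles_exclusive = quantiles(salaries, n=4, method="exclusive")
quartiles_inclusive = quantiles(salaries, n=4, method="inclusive")

# 100-quantiles (percentiles): 99 cut points.
percentiles = quantiles(salaries, n=100)
```

In each case there are q-1 cut points for q equal-sized groups, matching the definition above.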
23 Measuring the Dispersion of Data
The quartiles give an indication of a distribution’s center, spread, and shape. The
first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of
the data. The third quartile, denoted by Q3, is the 75th percentile. It cuts off the
lowest 75% (or highest 25%) of the data. The second quartile is the 50th
percentile. As the median, it gives the center of the data distribution.
The distance between the first and third quartiles is a simple measure of spread
that gives the range covered by the middle half of the data. This distance is called
the interquartile range (IQR) and is defined as
IQR = Q3 - Q1
24 Measuring the Dispersion of Data
Example of Interquartile range
The quartiles are the three values that split the sorted data set into four equal parts. The
data from the previous example contain 12 observations, already sorted in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (in thousands of
dollars)
Thus, the quartiles for these data are the third, sixth, and ninth values, respectively, in the
sorted list.
Therefore, Q1 = $47,000 and Q3 = $63,000.
Thus, the interquartile range is
IQR = 63 - 47 = 16, that is, $16,000.
(Note that the sixth value, $52,000, is a median, although this data set has two middlemost
values, $52,000 and $56,000, since the number of data values is even.)
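The example's way of picking quartiles can be sketched as below. It follows the simplified convention used on this slide (take the values at positions N/4, N/2, and 3N/4 of the sorted list) and assumes N is divisible by 4; library routines usually interpolate and may return slightly different values.

```python
# Salary data from the example, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

data = sorted(salaries)
n = len(data)  # 12, divisible by 4

# Positions N/4, N/2, 3N/4 counted from 1, so subtract 1 for Python indexing.
q1 = data[n // 4 - 1]      # third value
q2 = data[n // 2 - 1]      # sixth value
q3 = data[3 * n // 4 - 1]  # ninth value

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1
```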
25 Variance and Standard Deviation
Variance and standard deviation are measures of data dispersion. They
indicate how spread out a data distribution is.
A low standard deviation means that the data observations tend to be
very close to the mean, while a high standard deviation indicates that
the data are spread out over a large range of values.
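For reference, the usual definitions can be written as follows. This sketch assumes the population form, which divides by N (consistent with setting N = 12 in the example that follows); the sample variance divides by N - 1 instead. Here \bar{x} is the mean of the N observations.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2
         = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2,
\qquad
\sigma = \sqrt{\sigma^2}
```

The standard deviation \sigma is the square root of the variance, so it is expressed in the same units as the data.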
26 Variance and Standard Deviation
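Example: Variance and standard deviation.
(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
In an earlier example, we found that the mean salary is $58,000.
To determine the variance and standard deviation of the data from that
example, we set N = 12 and use Eq. (2.6) to obtain
\sigma^2 = \frac{1}{12}\left(30^2 + 36^2 + 47^2 + \cdots + 110^2\right) - 58^2 \approx 379.17
\sigma \approx \sqrt{379.17} \approx 19.47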
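The calculation can be checked with a short Python sketch. It assumes the population variance (divide by N = 12), which is what `statistics.pvariance` and `statistics.pstdev` compute; `statistics.variance` would divide by N - 1 instead.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Salary data from the example, in thousands of dollars.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

n = len(salaries)
mean_salary = sum(salaries) / n

# Population variance: mean of the squares minus the square of the mean.
variance = sum(x * x for x in salaries) / n - mean_salary ** 2
std_dev = sqrt(variance)

# The standard library equivalents.
library_variance = pvariance(salaries)
library_std_dev = pstdev(salaries)
```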
27 Exercises
Example:
Suppose that the data for analysis includes the attribute age. The age
values for the data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35,
35, 35, 36, 40, 45, 46, 52, 70.
(a) What are the mean and median of the data?
(b) What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile
(Q3) of the data?
(e) What is the interquartile range (IQR) of the data?
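After working the exercise by hand, a sketch like the following can be used to check the answers. The quartiles come from `statistics.quantiles` (Python 3.8+) with its default interpolation, which is an assumption here; other conventions may give slightly different values for (d) and (e).

```python
from statistics import mean, median, multimode, quantiles

# Age values from the exercise.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

age_mean = mean(ages)                     # (a)
age_median = median(ages)                 # (a)
age_modes = multimode(ages)               # (b)
age_midrange = (min(ages) + max(ages)) / 2  # (c)
q1, q2, q3 = quantiles(ages, n=4)         # (d)
age_iqr = q3 - q1                         # (e)

print(age_mean, age_median, age_modes, age_midrange, q1, q3, age_iqr)
```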