Data Objects and Quality in Mining

Here are the answers to the questions about the sample age data: (a) The mean is approximately 29.96 and the median is 25. (b) The modes are 25 and 35; the data are bimodal. (c) The midrange is 41.5. (d) The first quartile (Q1) is approximately 20 and the third quartile (Q3) is approximately 35.

Uploaded by

يزيد زايد


COIS 448: Data Mining & Business Intelligence

Getting to Know Your Data

Information Systems Department

Faculty of Computing and Information Technology Rabigh
King Abdulaziz University

Slide adapted from Dr. Arda

Data Objects and Attributes

• Data sets are made up of data objects.
• A data object represents an entity. Examples:
  • Sales database: customers, store items, sales
  • Medical database: patients, treatments
  • University database: students, professors, courses
• Data objects are also called samples, examples, instances, data points, objects, or tuples.
• Data objects are described by attributes.
• Database rows correspond to data objects; columns correspond to attributes.
• An attribute (also called a dimension, feature, or variable) is a data field representing a characteristic or feature of a data object, e.g., customer_ID, name, address.
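As a rough illustration of the row/column view above, the following Python sketch models each customer as a data object whose fields are its attributes. The `Customer` class and the sample values are hypothetical, chosen only to mirror the customer_ID, name, address example.

```python
from dataclasses import dataclass, fields


@dataclass
class Customer:
    # Each field is one attribute (dimension, feature, variable).
    customer_id: int
    name: str
    address: str


# Each instance is one data object (sample, instance, tuple), i.e. one row.
customers = [
    Customer(1, "Alice", "12 Palm St"),
    Customer(2, "Bob", "7 Harbor Rd"),
]

# The attribute names play the role of the column headers.
attribute_names = [f.name for f in fields(Customer)]
```

Here `customers` corresponds to the rows of a table and `attribute_names` to its columns.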
Types of Attributes
Data Quality
• What kinds of data quality problems are there?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:
  • Noise and outliers
  • Missing values
  • Duplicate data
Noise
• Noise refers to the modification of original values.
• Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen.

[Figure: two panels, “Two Sine Waves” and “Two Sine Waves + Noise”]
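The two-panel figure can be approximated in code. The sketch below is only an illustration: the two frequencies (1 Hz and 3 Hz), the sampling rate, and the Gaussian noise level are assumptions, since the slide does not specify them.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def two_sine_waves(t):
    # Sum of two sine waves; the 1 Hz and 3 Hz frequencies are assumed.
    return math.sin(2 * math.pi * 1 * t) + math.sin(2 * math.pi * 3 * t)


times = [i / 100 for i in range(200)]  # 2 seconds sampled at 100 Hz
clean = [two_sine_waves(t) for t in times]

# Noise modifies the original values: add zero-mean Gaussian noise (sigma = 0.5).
noisy = [x + random.gauss(0, 0.5) for x in clean]

# Average absolute distortion introduced by the noise.
mean_abs_error = sum(abs(n - c) for n, c in zip(noisy, clean)) / len(clean)
```

Plotting `clean` and `noisy` against `times` would give something close to the two panels.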

Outliers
• Outliers are data objects whose characteristics differ considerably from those of most other data objects in the data set.
Missing Values
• Reasons for missing values:
  • Information is not collected (e.g., people decline to give their age and weight).
  • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).

• Handling missing values:
  • Eliminate data objects
  • Estimate missing values
  • Ignore the missing value during analysis
  • Replace with all possible values (weighted by their probabilities)
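The first three handling strategies can be sketched in a few lines of Python. The records are hypothetical, and mean imputation is just one possible way to estimate a missing value, not necessarily the one intended in the slides.

```python
from statistics import mean

# Hypothetical records; None marks a missing age.
records = [
    {"name": "A", "age": 25},
    {"name": "B", "age": None},
    {"name": "C", "age": 35},
    {"name": "D", "age": 30},
]

# 1. Eliminate data objects that have a missing value.
complete = [r for r in records if r["age"] is not None]

# 2. Estimate missing values (here: use the mean of the known ages).
known_mean = mean(r["age"] for r in complete)
imputed = [
    {**r, "age": r["age"] if r["age"] is not None else known_mean}
    for r in records
]

# 3. Ignore the missing value during analysis (skip it when aggregating).
average_age = mean(r["age"] for r in records if r["age"] is not None)
```

Eliminating objects shrinks the data set, while imputation keeps every object at the cost of introducing an estimated value.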
Duplicate Data
• A data set may include data objects that are duplicates, or almost duplicates, of one another.
  • This is a major issue when merging data from heterogeneous sources.
• Example: the same person with multiple email addresses.
• Data cleaning: the process of dealing with duplicate data issues.
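A minimal cleaning sketch for the "same person, multiple email addresses" case follows. The records and the `normalize` helper are hypothetical, and matching on the name alone is a simplifying assumption: it would wrongly merge two different people who share a name.

```python
# Hypothetical customer records merged from two sources.
records = [
    {"name": "Sara Ali", "email": "sara@example.com"},
    {"name": "sara ali ", "email": "s.ali@example.org"},
    {"name": "Omar Khan", "email": "omar@example.com"},
    {"name": "Sara Ali", "email": "sara@example.com"},  # exact duplicate
]


def normalize(name):
    # Lowercase and collapse whitespace so near-identical names compare equal.
    return " ".join(name.lower().split())


merged = {}
for rec in records:
    key = normalize(rec["name"])
    entry = merged.setdefault(key, {"name": rec["name"].strip(), "emails": []})
    if rec["email"] not in entry["emails"]:
        entry["emails"].append(rec["email"])

people = list(merged.values())
```

After merging, each person appears once and carries the list of all email addresses seen for them.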
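Basic Statistical Descriptions of Data
• Mean
The most common and effective numeric measure of the “center” of a set of data is the (arithmetic) mean.
Let x1, x2, ..., xN be a set of N values or observations, such as for some numeric attribute X, like salary. The mean of this set of values is

\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N} \]

Example: To find the mean of 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (salary, in thousands of dollars), sum the 12 values and divide by 12:

\[ \bar{x} = \frac{696}{12} = 58 \]

Thus, the mean salary is $58,000.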
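The mean of the salary values can be checked with a short Python sketch. The variable names are illustrative; `statistics.mean` is the standard-library function.

```python
from statistics import mean

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars

manual_mean = sum(salaries) / len(salaries)  # (x1 + x2 + ... + xN) / N
library_mean = mean(salaries)                # same result via the standard library
```

Both expressions evaluate to 58, that is, a mean salary of $58,000.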
• Median
For skewed (asymmetric) data, a better measure of the center of the data is the median, which is the middle value in a set of ordered data values. It is the value that separates the higher half of a data set from the lower half.
• Example: Median
Let’s find the median of the salary data, in thousands of dollars (30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110).
There is an even number of observations; therefore, the median is not unique. It can be any value between the two middlemost values of 52 and 56 (that is, the sixth and seventh values in the list).
By convention, we assign the average of the two middlemost values as the median; that is,

\[ \frac{52 + 56}{2} = \frac{108}{2} = 54 \]

Thus, the median is $54,000.
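The even/odd rule for the median can be sketched as follows. `median_of` is a hypothetical helper written for illustration; the standard library's `statistics.median` follows the same averaging convention for an even number of values.

```python
def median_of(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middlemost values.
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars
salary_median = median_of(salaries)
```

For the salary data the two middlemost values are 52 and 56, so `salary_median` is 54.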
• Mode
The mode for a set of data is the value that occurs most frequently in the set. Therefore, it can be determined for qualitative and quantitative attributes. It is possible for the greatest frequency to correspond to several different values, which results in more than one mode. Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more modes is multimodal. At the other extreme, if each data value occurs only once, then there is no mode.

• Example: Mode
(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
The data from the previous example are bimodal. The two modes are $52,000 and $70,000.
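A small sketch of the mode definition, including the multimodal and "no mode" cases described above. The `modes` helper is hypothetical; it returns an empty list when every value occurs only once, which matches the convention in the text rather than that of `statistics.multimode`.

```python
from collections import Counter


def modes(values):
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars
salary_modes = modes(salaries)
```

Because the function only counts occurrences, it also works for qualitative attributes such as color names.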
• Midrange
The midrange can also be used to assess the central tendency of a numeric data set. It is the average of the largest and smallest values in the set.
• Example: Midrange
(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
The midrange of the data from the previous example is (30 + 110) / 2 = 70, that is, $70,000.
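For completeness, the midrange computation can be written directly from its definition; the variable names are illustrative.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars

# Average of the smallest and largest values.
midrange = (min(salaries) + max(salaries)) / 2
```

Note that the midrange depends only on the two extreme values, so a single unusually large value such as 110 shifts it strongly.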
In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode are all at the same center value, as shown in Figure 2.1(a). Data in most real applications are not symmetric. They may instead be either positively skewed, where the mode occurs at a value that is smaller than the median (Figure 2.1b), or negatively skewed, where the mode occurs at a value greater than the median (Figure 2.1c).
Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range
• The measures include range, quantiles, quartiles, percentiles, and the interquartile range. The five-number summary, which can be displayed as a boxplot, is useful in identifying outliers. Variance and standard deviation also indicate the spread of a data distribution.
• Range, Quartiles, and Interquartile Range
To start off, let’s study the range, quantiles, quartiles, percentiles, and the interquartile range as measures of data dispersion.
Let x1, x2, ..., xN be a set of observations for some numeric attribute X.
• Range
The range of the set is the difference between the largest (max()) and smallest (min()) values.
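Applied to the salary values used in the earlier examples, the range definition is a one-line computation. This is only a sketch; the variable names are illustrative.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars

# Range = largest value minus smallest value.
value_range = max(salaries) - min(salaries)
```

Here the range is 110 - 30 = 80, that is, $80,000.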
• Quantiles
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets.
The kth q-quantile for a given data distribution is the value x such that at most k/q of the data values are less than x and at most (q-k)/q of the data values are more than x, where k is an integer such that 0 < k < q. There are q-1 q-quantiles.
• The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It corresponds to the median.
• Quartiles
The 4-quantiles are the three data points that split the data distribution into four equal parts; each part represents one-fourth of the data distribution. They are more commonly referred to as quartiles, as in Figure 2.2.
• The 100-quantiles are more commonly referred to as percentiles; they divide the data distribution into 100 equal-sized consecutive sets.
• The median, quartiles, and percentiles are the most widely used forms of quantiles.
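The claim that there are q-1 q-quantiles, and that the 2-quantile is the median, can be checked with Python's `statistics.quantiles`. This is a sketch, not the slides' method: by default the function interpolates between data points, so its quartile values can differ from those obtained with simpler hand conventions.

```python
from statistics import median, quantiles

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars

two_quantile = quantiles(salaries, n=2)    # 1 cut point: the median
quartiles = quantiles(salaries, n=4)       # 3 cut points: Q1, Q2, Q3
percentiles = quantiles(salaries, n=100)   # 99 cut points

# Number of cut points for q = 2, 4, 100 is q - 1 in each case.
cut_point_counts = (len(two_quantile), len(quartiles), len(percentiles))
```

The single 2-quantile equals `median(salaries)`, and the middle quartile is the same value.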
• The quartiles give an indication of a distribution’s center, spread, and shape. The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third quartile, denoted by Q3, is the 75th percentile. It cuts off the lowest 75% (or highest 25%) of the data. The second quartile is the 50th percentile. As the median, it gives the center of the data distribution.
• The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as

\[ \mathrm{IQR} = Q_3 - Q_1 \]
• Example: Interquartile range
The quartiles are the three values that split the sorted data set into four equal parts. The data from the previous example contain 12 observations, already sorted in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110 (in thousands of dollars)
Thus, the quartiles for this data are the third, sixth, and ninth values, respectively, in the sorted list.
Therefore, Q1 = $47,000 and Q3 = $63,000.
Thus, the interquartile range is
IQR = 63 - 47 = 16, that is, $16,000.
(Note that the sixth value is a median, $52,000, although this data set has two medians since the number of data values is even.)
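The positional convention used in this example (third, sixth, and ninth of 12 sorted values) can be sketched as below. `positional_quartiles` is a hypothetical helper that only handles N divisible by 4. The 1.5 × IQR fence at the end is a common rule of thumb for flagging suspected outliers; it is an addition here and is not stated in the slides.

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars


def positional_quartiles(sorted_values):
    # Assumed convention: Qk is the value at 1-based position k * N / 4,
    # which is well defined only when N is a multiple of 4.
    n = len(sorted_values)
    if n % 4 != 0:
        raise ValueError("this simple convention assumes N is a multiple of 4")
    return tuple(sorted_values[k * n // 4 - 1] for k in (1, 2, 3))


q1, q2, q3 = positional_quartiles(sorted(salaries))
iqr = q3 - q1

# Rule of thumb (not from the slides): flag values beyond 1.5 * IQR from Q1 or Q3.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
suspected_outliers = [x for x in salaries if x < low_fence or x > high_fence]
```

With Q1 = 47 and Q3 = 63 the IQR is 16, the fences are 23 and 87, and only the value 110 falls outside them.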
Variance and Standard Deviation
• Variance and standard deviation are measures of data dispersion. They indicate how spread out a data distribution is.
• A low standard deviation means that the data observations tend to be very close to the mean, while a high standard deviation indicates that the data are spread out over a large range of values.
• Example: Variance and standard deviation
(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
In the previous example we found that the mean salary is $58,000. To determine the variance and standard deviation of the data from that example, we set N = 12 and use Eq. (2.6) to obtain (with \bar{x} denoting the mean)

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2 = \frac{1}{12}\left(30^2 + 36^2 + \cdots + 110^2\right) - 58^2 \approx 379.17 \]

\[ \sigma \approx \sqrt{379.17} \approx 19.47 \]
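As a cross-check, the computation can be sketched in Python under the assumption that Eq. (2.6) is the population variance, which divides by N rather than N - 1. `statistics.pvariance` and `statistics.pstdev` are the standard-library functions for that definition.

```python
from statistics import pstdev, pvariance

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # thousands of dollars

n = len(salaries)
mean_salary = sum(salaries) / n

# Population variance: average squared deviation from the mean (divide by N).
manual_variance = sum((x - mean_salary) ** 2 for x in salaries) / n
manual_std = manual_variance ** 0.5

# The same quantities from the standard library.
library_variance = pvariance(salaries)
library_std = pstdev(salaries)
```

Both routes give a variance of about 379.17 and a standard deviation of about 19.47 (in thousands of dollars). Using `statistics.variance` instead would divide by N - 1 and give a larger value.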
Exercises

Example:
Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order)

13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

(a) What are the mean and median of the data?
(b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
(e) What is the interquartile range (IQR) of the data?
