What Is Data Quality?

Unit 1 of the Data Analytics document covers essential concepts related to data quality, outliers, missing data treatment, data preprocessing, and data processing stages. It emphasizes the importance of data quality for accurate decision-making and outlines methods for detecting and treating outliers and missing values. Additionally, it details the steps involved in data processing, from collection to storage, highlighting the need for high-quality data in organizational contexts.

Uploaded by

naresh.cse

DATA ANALYTICS UNIT–I

Data Quality:
What is Data Quality?
There are many definitions of data quality. In general, data quality is an assessment of how
usable the data is and how well it fits the context in which it is used.

Why is Data Quality Important?

Improving data quality is a critical concern because data is at the core of all activities
within an organization. Poor data quality leads to inaccurate reporting, which results in inaccurate
decisions and, ultimately, economic damage.

Many factors help in measuring data quality, such as:

• Data Accuracy: Data are accurate when the data values stored in the database correspond
to real-world values.
• Data Uniqueness: A measure of unwanted duplication existing within or across systems
for a particular field, record, or data set.
• Data Consistency: The degree to which the data are free of violations of the semantic rules
defined over the dataset.
• Data Completeness: The degree to which values are present in a data collection.
• Data Timeliness: The extent to which the age of the data is appropriate for the task at
hand.
Other factors can also be taken into consideration, such as Availability, Ease of Manipulation,
and Believability.
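
Two of these factors, completeness and uniqueness, are easy to turn into numbers. The following is a
minimal sketch only: the records, the field names and the two helper functions are hypothetical and
are not part of the original notes.

```python
# Hypothetical toy records; None marks a missing value.
records = [
    {"id": 1, "city": "Pune", "age": 34},
    {"id": 2, "city": None, "age": 29},
    {"id": 3, "city": "Delhi", "age": None},
    {"id": 3, "city": "Delhi", "age": None},  # unwanted duplicate of the previous row
]


def completeness(rows, field):
    """Fraction of rows in which `field` has a value."""
    present = sum(1 for row in rows if row[field] is not None)
    return present / len(rows)


def uniqueness(rows):
    """Fraction of rows that are distinct (1.0 means no duplicated rows)."""
    distinct = {tuple(sorted(row.items())) for row in rows}
    return len(distinct) / len(rows)


print(completeness(records, "city"))  # share of rows with a city
print(completeness(records, "age"))   # share of rows with an age
print(uniqueness(records))            # share of distinct rows
```

Accuracy, consistency and timeliness need outside knowledge (real-world values, semantic rules, the
task at hand), so they cannot be computed from the data alone in this way.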

OUTLIERS:

• An outlier is a point or an observation that deviates significantly from the other observations.
• Outliers need close attention from analysts and data scientists; otherwise they can result in
wildly wrong estimates. Simply put, an outlier is an observation that lies far away from, and
diverges from, the overall pattern in a sample.
• Reasons for outliers: experimental errors or “special circumstances”.
• There is no rigid mathematical definition of what constitutes an outlier; determining whether
or not an observation is an outlier is ultimately a subjective exercise.
• There are various methods of outlier detection. Some are graphical, such as normal probability
plots. Others are model-based. Box plots are a hybrid.
Types of Outliers:


Outliers can be of two types:

Univariate: These outliers can be found by looking at the distribution of a single variable.
Multivariate: Multivariate outliers are outliers in an n-dimensional space. To find them, you
have to look at distributions in multiple dimensions.

Impact of Outliers on a dataset:

Outliers can drastically change the results of data analysis and statistical modelling. They have
numerous unfavourable impacts on a data set:
• They increase the error variance and reduce the power of statistical tests.
• If the outliers are non-randomly distributed, they can decrease normality.
• They can bias or influence estimates that may be of substantive interest.
• They can also violate the basic assumptions of regression, ANOVA and other statistical models.
Detecting Outliers:

The most commonly used method to detect outliers is visualization. Various visualization methods
can be used, such as the box plot, histogram and scatter plot (the box plot and the scatter plot
are the ones used for visualization in these notes).

Outlier treatments are of three types:

Retention:
• Retention means keeping the outlier in the data set. Because there is no rigid mathematical
definition of what constitutes an outlier, determining whether or not an observation is an outlier
is ultimately a subjective exercise, and an extreme value is not automatically an error.
Exclusion:
• Depending on the purpose of the study, it is necessary to decide whether outliers will be
removed (excluded) from the data, and which ones, since they could highly bias the final results
of the analysis.

Rejection:
• Rejection of outliers is more acceptable in areas of practice where the underlying model of the
process being measured and the usual distribution of measurement error are confidently known.
• An outlier resulting from an instrument reading error may be excluded, but it is desirable that
the reading is at least verified.

Other treatment methods

The outliers package in R can be used to detect and treat outliers in data.
Outlier detection from graphical representation:
– Scatter plot and box plot

– In a box plot, the observations that fall beyond the whiskers are treated as outliers in the data
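
The box plot rule can also be applied numerically. The sketch below assumes the common convention
that the whiskers extend 1.5 times the interquartile range (IQR) beyond the quartiles; the factor
1.5, the function name and the sample readings are assumptions for illustration, not part of the
original notes.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 10, 95]  # hypothetical sample
print(iqr_outliers(readings))
```

Whether a flagged value is then retained, excluded or rejected is still a judgment call, as
described above.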

Missing Data treatment:
Missing Values
• Missing data in the training data set can reduce the power or fit of a model, or can lead to a
biased model, because the behavior of a variable and its relationship with other variables cannot
be analyzed correctly. It can lead to wrong predictions or classifications.

• In R, missing values are represented by the symbol NA (not available).

• Impossible values are represented by the symbol NaN (not a number); for example, R returns NaN
for 0/0. Dividing a non-zero number by zero gives ‘Inf’ (infinity) or ‘-Inf’.

PMM approach to treat missing values:

• Predictive Mean Matching (PMM) is a semi-parametric imputation approach.
• It is similar to the regression method except that, for each missing value, it fills in a value
drawn randomly from among the observed donor values, that is, the values of the observations
whose regression-predicted values are closest to the regression-predicted value for the missing
value from the simulated regression model.
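
The donor idea can be sketched in a few lines. This is a simplified illustration under stated
assumptions, not a full PMM implementation: it uses one predictor and a single least-squares fit,
whereas full PMM also simulates the regression parameters. The function names, the number of donors
k and the sample data are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def pmm_impute(xs, ys, k=3, rng=None):
    """Fill the None entries of ys by predictive mean matching on xs."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    filled = []
    for x, y in zip(xs, ys):
        if y is not None:
            filled.append(y)
            continue
        target = a + b * x  # regression prediction for the missing case
        # Donors: the k observed cases whose predictions are closest to the target.
        donors = sorted(observed, key=lambda xy: abs((a + b * xy[0]) - target))[:k]
        filled.append(rng.choice(donors)[1])  # borrow an actually observed value
    return filled


xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, None, 8.2, 9.8, 12.1]  # hypothetical data with one gap
print(pmm_impute(xs, ys))
```

Because the imputed value is always borrowed from a real observation, it stays within the range of
values that actually occur in the data.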


Data Pre-processing:
Preprocessing in Data Mining: Data preprocessing is a data mining technique that is used to
transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:

1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle them. It
involves handling missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing from the data. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are
missing within a tuple.

2. Fill the Missing values:

There are various ways to do this task. You can choose to fill the missing values manually, with
the attribute mean, or with the most probable value.
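
Both options can be sketched as follows. The helper names, the threshold max_missing and the sample
values are hypothetical, and the most frequent observed value is used here as a simple stand-in for
"the most probable value"; that choice is an assumption, not the only possibility.

```python
import statistics


def drop_sparse_tuples(rows, max_missing=1):
    """Ignore (drop) tuples that have more than `max_missing` missing values."""
    return [row for row in rows if sum(v is None for v in row) <= max_missing]


def fill_with_mean(values):
    """Replace None in a numeric attribute with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def fill_with_mode(values):
    """Replace None with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = statistics.mode(observed)
    return [mode if v is None else v for v in values]


print(drop_sparse_tuples([(1, "a", 3.0), (2, None, None), (3, "b", None)]))
print(fill_with_mean([10, None, 20, 30]))
print(fill_with_mode(["a", None, "b", "a"]))
```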

(b). Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by
faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. Binning, also called discretization, is a
technique for reducing the cardinality (the total number of unique values for a dimension) of
continuous and discrete data. Binning groups related values together in bins to reduce the number
of distinct values.
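
One common variant is smoothing by bin means: the sorted values are split into bins of equal size
and every value is replaced by the mean of its bin. The sketch below assumes this variant; the
function name, the bin size and the price list are illustrative only.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size`, replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical values
print(smooth_by_bin_means(prices, bin_size=3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Nine distinct-looking values are reduced to three, which is the cardinality reduction described
above.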

2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may
be linear (having one independent variable) or multiple (having multiple independent
variables).
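
For the linear case with one independent variable, smoothing means replacing each observed value
with the value on the fitted line. A minimal sketch, assuming an ordinary least-squares fit; the
function names and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def smooth_by_regression(xs, ys):
    """Replace each y by the value the fitted line predicts for its x."""
    a, b = fit_line(xs, ys)
    return [a + b * x for x in xs]


xs = [0, 1, 2, 3]
noisy = [1.0, 3.2, 4.8, 7.0]  # hypothetical noisy measurements
print(smooth_by_regression(xs, noisy))
```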
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall
outside the clusters.
2. Data Transformation:
This step transforms the data into forms suitable for the mining process.
It involves the following:

1. Normalization:
Normalization is a technique often applied as part of data preparation in Data Analytics
through machine learning. The goal of normalization is to change the values of numeric
columns in the dataset to a common scale, without distorting differences in the ranges of
values. Not every dataset requires normalization for machine learning. It is done in order
to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
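
Min-max scaling is one common way to do this. The sketch below assumes that method; the function
name, its parameters and the handling of a constant column are choices made for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they span [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # Constant column: there is no spread to preserve, so map everything to new_min.
        return [new_min for _ in values]
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]


print(min_max_scale([10, 20, 30]))             # -> [0.0, 0.5, 1.0]
print(min_max_scale([10, 20, 30], -1.0, 1.0))  # -> [-1.0, 0.0, 1.0]
```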
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the
mining process.

3. Discretization:
Discretization is the process of transforming continuous variables, models or functions into a
discrete form. We do this by creating a set of contiguous intervals (or bins) that span the
range of the variable, model or function. Continuous data is measured, while discrete data is
counted.
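
A simple way to build such intervals is to cut the range into bins of equal width. A minimal
sketch, assuming equal-width bins; the function name and the ages are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Return, for each value, the index (0 .. n_bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        index = int((v - lo) / width) if width > 0 else 0
        labels.append(min(index, n_bins - 1))  # the maximum belongs to the last bin
    return labels


ages = [3, 17, 25, 42, 60, 78, 90]  # hypothetical continuous attribute
print(equal_width_bins(ages, n_bins=3))
# -> [0, 0, 0, 1, 1, 2, 2]
```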

4. Concept Hierarchy Generation:

Here, attributes are converted from a lower level to a higher level in the hierarchy. For example,
the attribute “city” can be converted to “country”.
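
In code, climbing one level of the hierarchy is a lookup. The mapping below is a hypothetical
example, as are the function name and the "Unknown" fallback.

```python
# Hypothetical city -> country hierarchy.
CITY_TO_COUNTRY = {
    "Hyderabad": "India",
    "Mumbai": "India",
    "Paris": "France",
}


def generalize(cities, hierarchy):
    """Replace each low-level value by its parent in the concept hierarchy."""
    return [hierarchy.get(city, "Unknown") for city in cities]


print(generalize(["Mumbai", "Paris", "Hyderabad", "Oslo"], CITY_TO_COUNTRY))
# -> ['India', 'France', 'India', 'Unknown']
```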

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data. When working with such volumes,
analysis becomes harder. Data reduction techniques are used to address this. They aim to increase
storage efficiency and reduce data storage and analysis costs.
The various steps in data reduction are:
1. Data Cube Aggregation:

Aggregation operation is applied to data for the construction of the data cube.
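
As an illustration, quarterly figures can be aggregated into yearly totals, which is one roll-up
along the time dimension of a data cube. The sales rows and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical (year, quarter, amount) sales rows.
sales = [
    ("2022", "Q1", 100), ("2022", "Q2", 150), ("2022", "Q3", 120), ("2022", "Q4", 130),
    ("2023", "Q1", 200), ("2023", "Q2", 180),
]


def aggregate_by_year(rows):
    """Sum the amounts over quarters, leaving one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


print(aggregate_by_year(sales))
# -> {'2022': 500, '2023': 380}
```

Six rows are reduced to two, at the cost of losing the quarterly detail.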

2. Attribute Subset Selection:

Only the highly relevant attributes should be used; the rest can be discarded. To perform
attribute selection, one can use the significance level and the p-value of each attribute. An
attribute with a p-value greater than the significance level can be discarded.
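
Given p-values for the attributes, the rule is a one-line filter. The p-values below are
hypothetical (in practice they would come from a fitted statistical model), and the significance
level of 0.05 is an assumed, conventional choice.

```python
def select_attributes(p_values, alpha=0.05):
    """Keep the attributes whose p-value does not exceed the significance level."""
    return [name for name, p in p_values.items() if p <= alpha]


# Hypothetical p-values for three attributes.
p_values = {"age": 0.003, "income": 0.040, "shoe_size": 0.620}
print(select_attributes(p_values))
# -> ['age', 'income']
```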

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression
models.

4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the
original data can be retrieved after reconstruction from the compressed data, the reduction is
called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction
are wavelet transforms and PCA (Principal Component Analysis).
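
The difference between lossless and lossy reduction can be seen with one level of the Haar
transform, the simplest wavelet transform. This is a sketch under assumptions: the choice of the
Haar wavelet, the function names and the signal are illustrative and not taken from the notes.

```python
def haar_step(signal):
    """One level of the Haar transform on an even-length list: pairwise averages and details."""
    averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    details = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return averages, details


def inverse_haar_step(averages, details):
    """Rebuild the signal from the averages and details."""
    signal = []
    for a, d in zip(averages, details):
        signal.extend([a + d, a - d])
    return signal


signal = [2, 4, 6, 8, 10, 10, 3, 1]  # hypothetical data
averages, details = haar_step(signal)

# Lossless: keeping every coefficient gives back the original data.
print(inverse_haar_step(averages, details))

# Lossy: storing only the averages (details set to zero) halves the data
# but yields an approximation of the original.
print(inverse_haar_step(averages, [0] * len(details)))
```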

Data Processing:
Data processing occurs when data is collected and translated into usable information. It is usually
performed by a data scientist or a team of data scientists, and it is important that it be done
correctly so as not to negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more readable format (graphs,
documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized
by employees throughout an organization.

Six stages of data processing

1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data
lakes and data warehouses. It is important that the data sources available are trustworthy and well-
built so the data collected (and later used as information) is of the highest possible quality.

2. Data preparation
Once the data is collected, it enters the data preparation stage. Data preparation, often referred
to as “pre-processing”, is the stage at which raw data is cleaned up and organized for the following
stage of data processing. During preparation, raw data is diligently checked for any errors. The purpose
of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create
high-quality data for the best business intelligence.


3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse
like Redshift), and translated into a language that it can understand. Data input is the first stage in
which raw data begins to take the form of usable information.

4. Processing
During this stage, the data inputted to the computer in the previous stage is actually processed for
interpretation. Processing is done using machine learning algorithms, though the process itself may
vary slightly depending on the source of data being processed (data lakes, social networks, connected
devices etc.) and its intended use (examining advertising patterns, medical diagnosis from connected
devices, determining customer needs, etc.).

5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable to non-data scientists. It is
translated, readable, and often in the form of graphs, videos, images, plain text, etc.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for
future use. While some information may be put to use immediately, much of it will serve a purpose
later on. When data is properly stored, it can be quickly and easily accessed by members of the
organization when needed.

* End of Unit-1 *
