DATA ANALYTICS UNIT–I
Data Quality:
What is Data Quality?
There are many definitions of data quality. In general, data quality is an assessment of how
usable the data is and how well it fits the context it serves.
Why Data Quality is Important?
Enhancing data quality is a critical concern because data is at the core of all activities
within an organization. Poor data quality leads to inaccurate reporting, which results in inaccurate
decisions and, ultimately, economic damage.
Many factors help in measuring data quality, such as:
Data Accuracy: Data are accurate when data values stored in the database correspond
to real-world values.
Data Uniqueness: A measure of unwanted duplication existing within or across systems
for a particular field, record, or data set.
Data Consistency: The degree to which the data is free of violations of semantic rules defined over the dataset.
Data Completeness: The degree to which values are present in a data collection.
Data Timeliness: The extent to which the age of the data is appropriate for the task at
hand.
Other factors can be taken into consideration such as Availability, Ease of Manipulation,
Believability.
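Two of the factors above, completeness and uniqueness, can be measured directly on a data set. The following is a minimal sketch, assuming a small made-up list of customer records in which None marks a missing value; the function names completeness and uniqueness are illustrative and not part of any library.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "b@example.com", "age": None},
    {"id": 3, "email": "a@example.com", "age": 29},
    {"id": 4, "email": None, "age": 41},
]

def completeness(rows, field):
    """Share of rows in which `field` has a value (1.0 = fully complete)."""
    present = sum(1 for row in rows if row[field] is not None)
    return present / len(rows)

def uniqueness(rows, field):
    """Share of non-missing values of `field` that are distinct (1.0 = no duplicates)."""
    values = [row[field] for row in rows if row[field] is not None]
    return len(set(values)) / len(values)

print(completeness(records, "age"))   # 3 of the 4 ages are present
print(uniqueness(records, "email"))   # 2 distinct addresses among 3 values
```

A score below 1.0 on either measure points to the rows that need attention before analysis.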
OUTLIERS:
An outlier is a point or an observation that deviates significantly from the
other observations.
Outlier is a term commonly used by analysts and data scientists because outliers
need close attention; otherwise they can result in wildly wrong estimates. Simply speaking,
an outlier is an observation that appears far away from, and diverges from, the overall pattern in a
sample.
Reasons for outliers: Outliers arise from experimental errors or “special circumstances”.
There is no rigid mathematical definition of what constitutes an outlier; determining whether
or not an observation is an outlier is ultimately a subjective exercise.
There are various methods of outlier detection. Some are graphical such as normal probability
plots. Others are model-based. Box plots are a hybrid.
Types of Outliers:
Outliers can be of two types:
Univariate: These outliers can be found when we look at the distribution of a single variable.
Multivariate: Multivariate outliers are outliers in an n-dimensional space.
In order to find them, you have to look at distributions in multiple dimensions.
Impact of Outliers on a dataset:
Outliers can drastically change the results of data analysis and statistical modelling. They have
numerous unfavourable impacts on a data set:
They increase the error variance and reduce the power of statistical tests.
If the outliers are non-randomly distributed, they can decrease normality.
They can bias or influence estimates that may be of substantive interest.
They can also affect the basic assumptions of regression, ANOVA and other statistical models.
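The effect on estimates and on variance can be seen with a small numeric comparison. This is only an illustrative sketch with made-up readings, in which a single extreme value (95) is appended to an otherwise tight sample.

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 10, 12]
with_outlier = values + [95]  # one hypothetical extreme reading

for name, data in [("without outlier", values), ("with outlier", with_outlier)]:
    print(name,
          "mean =", round(statistics.mean(data), 2),
          "median =", statistics.median(data),
          "variance =", round(statistics.variance(data), 2))
```

The mean and the variance move sharply once the extreme value is included, while the median barely changes, which is why robust statistics are often preferred when outliers are suspected.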
Detecting Outliers:
The most commonly used method to detect outliers is visualization. Various visualization methods are used,
such as the box plot, histogram, and scatter plot (above, the box plot and scatter plot have been used for
visualization).
Outlier treatments are of three types:
Retention:
The outlier is kept in the data set. There is no rigid mathematical definition of what constitutes
an outlier, and determining whether or not an observation is an outlier is ultimately a subjective
exercise, so an unusual observation should not be discarded automatically.
Exclusion:
Depending on the purpose of the study, it is necessary to decide whether, and which, outliers will
be removed/excluded from the data, since they could heavily bias the final results of the
analysis.
Rejection:
Rejection of outliers is more acceptable in areas of practice where the underlying model of the
process being measured and the usual distribution of measurement error are confidently
known.
An outlier resulting from an instrument reading error may be excluded, but it is desirable that
the reading is at least verified.
Other treatment methods:
The outliers package in R can be used to detect and treat outliers in data.
Outlier detection from graphical representation:
– Scatter plot and box plot
– Observations that fall outside the whiskers of the box plot are treated as outliers in the data
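The box-plot rule can also be applied numerically. The sketch below assumes the common convention that the whiskers end 1.5 times the interquartile range (IQR) beyond the quartiles; the multiplier k, the function name iqr_outliers and the sample readings are illustrative assumptions, and the quartiles come from Python's statistics.quantiles with its default method, so other tools may place the fences slightly differently.

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(readings))  # [95]
```

Whether a flagged value is then retained, excluded or rejected remains a judgment call, as described above.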
Missing Data treatment:
Missing Values
Missing data in the training data set can reduce the power / fit of a model or can lead to a
biased model because we have not analyzed the behavior and relationship with other variables
correctly. It can lead to wrong prediction or classification.
In R, missing values are represented by the symbol NA (not available).
Impossible values (e.g., dividing zero by zero) are represented by the symbol NaN (not a number),
while R outputs the result of dividing a non-zero number by zero as ‘Inf’ (Infinity) or ‘-Inf’.
PMM approach to treat missing values:
• Predictive Mean Matching (PMM) is a semi-parametric imputation approach.
• It is similar to the regression method except that, for each missing value, it fills in a value
chosen randomly from among the observed donor values, that is, the values of observations
whose regression-predicted values are closest to the regression-predicted value for the missing
value from the simulated regression model.
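The matching step can be illustrated with a deliberately simplified sketch. It assumes a single predictor x, an ordinary least-squares line fitted once on the complete cases (a point estimate, not the simulated regression model mentioned above), and k = 3 donors; the function name pmm_impute and the sample data are hypothetical.

```python
import random

def pmm_impute(x, y, k=3, seed=0):
    """Fill None entries of y using a simplified predictive mean matching."""
    rng = random.Random(seed)
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]

    # Ordinary least-squares line fitted on the complete cases only.
    n = len(observed)
    mean_x = sum(xi for xi, _ in observed) / n
    mean_y = sum(yi for _, yi in observed) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in observed)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in observed)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    imputed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            imputed.append(yi)  # observed values are kept as they are
            continue
        target = predict(xi)
        # Donors: the k observed cases whose predictions are closest to the target.
        donors = sorted(observed, key=lambda d: abs(predict(d[0]) - target))[:k]
        # Borrow the observed value of one randomly chosen donor.
        imputed.append(rng.choice(donors)[1])
    return imputed

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, None, 8.2, 9.8, None, 14.1, 16.0]
print(pmm_impute(x, y))
```

Because every filled-in value is borrowed from a real observation, the imputed values always stay within the range of values actually seen in the data.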
Data Pre-processing:
Preprocessing in Data Mining: Data preprocessing is a data mining technique that is used to
transform raw data into a useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. Data cleaning is done to handle them. It
involves handling missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing from the data set. It can be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are
missing within a tuple.
2. Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by
attribute mean or the most probable value.
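Both options can be sketched in a few lines. The example assumes a small made-up table stored as a list of dictionaries in which None marks a missing value; the helper fill_with_mean is illustrative only.

```python
import statistics

rows = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 31, "income": None},
    {"age": 47, "income": 61000},
]

# Option 1: ignore (drop) every tuple that has a missing value.
complete_rows = [r for r in rows if None not in r.values()]

# Option 2: fill each missing value with the mean of its attribute.
def fill_with_mean(data, field):
    mean = statistics.mean(r[field] for r in data if r[field] is not None)
    return [{**r, field: mean if r[field] is None else r[field]} for r in data]

filled = fill_with_mean(fill_with_mean(rows, "age"), "income")
print(len(complete_rows))  # only the fully observed tuples remain
print(filled)
```

Dropping tuples here discards half of the table, which shows why that option suits only large data sets.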
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by
faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. Binning, also called discretization, is a
technique for reducing the cardinality (the total number of unique values for a dimension) of
continuous and discrete data. Binning groups related values together in bins to reduce
the number of distinct values.
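One common variant, smoothing by bin means, is sketched below under the assumption of equal-size (equal-frequency) bins. The function name smooth_by_bin_means and the price values are made up for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 4))
# [9.0, 9.0, 9.0, 9.0, 22.75, 22.75, 22.75, 22.75, 29.25, 29.25, 29.25, 29.25]
```

The eleven distinct input values are reduced to three distinct values, one per bin.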
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may
be linear (having one independent variable) or multiple (having multiple independent
variables).
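For the linear case, smoothing amounts to fitting a line and replacing each noisy value with its fitted value. The following is a minimal sketch using the closed-form least-squares formulas; the function name fit_line and the measurements are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
ys = [2.2, 3.9, 6.3, 7.8, 10.1]  # hypothetical noisy measurements
slope, intercept = fit_line(xs, ys)
smoothed = [intercept + slope * x for x in xs]  # noisy values replaced by fitted ones
print(round(slope, 2), round(intercept, 2))
print([round(v, 2) for v in smoothed])
```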
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall
outside the clusters.
2. Data Transformation:
This step transforms the data into forms suitable for the mining process.
It involves the following ways:
1. Normalization:
Normalization is a technique often applied as part of data preparation in Data Analytics
through machine learning. The goal of normalization is to change the values of numeric
columns in the dataset to a common scale, without distorting differences in the ranges of
values. Not every dataset requires normalization for machine learning. It is done in order
to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
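Scaling into either range can be done with min-max normalization, one of several possible techniques. The sketch below assumes a made-up income column; the function name min_max_scale is illustrative.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

incomes = [20000, 35000, 50000, 80000]
print(min_max_scale(incomes))             # [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(incomes, -1.0, 1.0))  # [-1.0, -0.5, 0.0, 1.0]
```

The relative spacing of the values is preserved; only the scale changes.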
2. Attribute Selection:
In this strategy, also known as attribute construction, new attributes are constructed from the
given set of attributes to help the mining process.
3. Discretization:
Discretization is the process through which we can transform continuous variables, models
or functions into a discrete form. We do this by creating a set of contiguous intervals (or
bins) that go across the range of our desired variable/model/function. Continuous data is
measured, while discrete data is counted.
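A simple way to build such intervals is equal-width discretization, sketched below. It assumes the data has at least two distinct values; the function name discretize and the ages are made up.

```python
def discretize(values, n_bins):
    """Map each value to the index of its equal-width interval (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    # min(...) keeps the maximum value inside the last interval.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 22, 35, 41, 58, 63, 78]
print(discretize(ages, 3))  # [0, 0, 0, 1, 2, 2, 2]
```

Here the intervals are 18 to 38, 38 to 58 and 58 to 78, so the continuous age becomes one of three discrete labels.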
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the
attribute “city” can be converted to “country”.
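The city-to-country example can be expressed as a lookup table. This is a sketch with a hypothetical mapping and a made-up list of orders.

```python
from collections import Counter

# Hypothetical hierarchy: each city (lower level) maps to its country (higher level).
city_to_country = {
    "Hyderabad": "India",
    "Mumbai": "India",
    "Paris": "France",
    "Lyon": "France",
}

orders = ["Hyderabad", "Paris", "Mumbai", "Lyon", "Hyderabad"]
countries = [city_to_country[city] for city in orders]
print(Counter(countries))  # counts per country instead of per city
```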
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data. When working with a
huge volume of data, analysis becomes harder. Data reduction techniques are used to overcome
this. They aim to increase storage efficiency and reduce data storage and analysis
costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
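As a small illustration, quarterly sales can be rolled up into yearly totals per branch. The sketch assumes made-up sales facts with the dimensions year, quarter and branch.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, branch, amount).
sales = [
    (2023, "Q1", "North", 120), (2023, "Q2", "North", 150),
    (2023, "Q1", "South", 90), (2023, "Q2", "South", 110),
    (2024, "Q1", "North", 130), (2024, "Q1", "South", 100),
]

# Roll the quarter dimension up: one total per (year, branch) cell.
cube = defaultdict(int)
for year, _quarter, branch, amount in sales:
    cube[(year, branch)] += amount

print(dict(cube))
```

Six detailed rows are reduced to four aggregated cells, which is the reduction this step aims for.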
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; all the rest can be discarded. To perform
attribute selection, one can use the level of significance and the p-value of the attribute. An
attribute having a p-value greater than the significance level can be discarded.
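The selection rule itself is a simple filter. The sketch below assumes the p-values have already been obtained elsewhere, for example from a fitted regression model; the attribute names, p-values and the 0.05 significance level are all made up.

```python
# Hypothetical p-values, e.g. taken from a fitted regression model.
p_values = {"age": 0.003, "income": 0.041, "zip_code": 0.620, "height": 0.180}
significance_level = 0.05

selected = [name for name, p in p_values.items() if p <= significance_level]
discarded = [name for name, p in p_values.items() if p > significance_level]
print(selected)   # ['age', 'income']
print(discarded)  # ['zip_code', 'height']
```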
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example, regression
models.
4. Dimensionality Reduction:
This reduces the size of the data by encoding mechanisms. It can be lossy or lossless. If the
original data can be retrieved after reconstruction from the compressed data, the reduction is
called lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
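PCA can be illustrated on two-dimensional data, where the leading eigenvector of the 2x2 covariance matrix has a closed form. The sketch below reduces each point to a single score along the direction of greatest variance; the function name first_principal_component and the points are hypothetical, and a real analysis would normally rely on a numerical library.

```python
import math

def first_principal_component(points):
    """Project 2-D points onto the direction of greatest variance."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)

    # Largest eigenvalue and its eigenvector, in closed form for a 2x2 matrix.
    largest = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if b != 0:
        vx, vy = b, largest - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    scores = [x * vx + y * vy for x, y in centered]  # one number per point
    explained = largest / (a + c)                    # share of variance kept
    return scores, explained

points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1), (5.0, 9.8)]
scores, explained = first_principal_component(points)
print(len(scores), round(explained, 3))
```

Keeping one score per point instead of two coordinates is a lossy reduction: the small share of variance not captured by the first component cannot be recovered.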
Data Processing:
Data processing occurs when data is collected and translated into usable information. It is usually
performed by a data scientist or a team of data scientists, and it is important for data processing to be done
correctly so as not to negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more readable format (graphs,
documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized
by employees throughout an organization.
Six stages of data processing
1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data
lakes and data warehouses. It is important that the data sources available are trustworthy and well-
built so the data collected (and later used as information) is of the highest possible quality.
2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred
to as “pre-processing”, is the stage at which raw data is cleaned up and organized for the following
stage of data processing. During preparation, raw data is diligently checked for any errors. The purpose
of this step is to eliminate bad data (redundant, incomplete, or incorrect data) and begin to create
high-quality data for the best business intelligence.
3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse
like Redshift), and translated into a language that it can understand. Data input is the first stage in
which raw data begins to take the form of usable information.
4. Processing
During this stage, the data entered into the computer in the previous stage is actually processed for
interpretation. Processing is done using machine learning algorithms, though the process itself may
vary slightly depending on the source of data being processed (data lakes, social networks, connected
devices etc.) and its intended use (examining advertising patterns, medical diagnosis from connected
devices, determining customer needs, etc.).
5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable to non-data scientists. It is
translated, readable, and often in the form of graphs, videos, images, plain text, etc.
6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for
future use. While some information may be put to use immediately, much of it will serve a purpose
later on. When data is properly stored, it can be quickly and easily accessed by members of the
organization when needed.
*** End of Unit-1 ***