Data Preprocessing Mod 2

Data preprocessing in data mining is essential for cleaning, integrating, transforming, and reducing raw data to improve its quality and ensure accurate analysis. Key steps include data cleaning, integration, transformation, and reduction, each employing various techniques to handle issues like missing values, inconsistencies, and irrelevant information. Effective data preprocessing enhances model performance, decision-making, and overall data analysis efficiency.

Uploaded by

aneenaandrews7

MODULE 2

Data Preprocessing in Data Mining

Real-world data is often incomplete, noisy, and inconsistent, which can lead to incorrect results
if used directly. Data preprocessing in data mining is the process of cleaning and preparing raw
data so it can be used effectively for analysis and model building.
• Real data contains missing and incorrect values
• Data may come from multiple sources
• Large datasets often have irrelevant information
• Clean data gives better mining results
Steps in Data Preprocessing
Some key steps in data preprocessing are:

1. Data Cleaning
It is the process of identifying and correcting errors or inconsistencies in the dataset. Its common
tasks include:
• Handling missing values
• Removing duplicate records
• Correcting wrong or inconsistent data
• Handling outliers
Techniques used:
• Mean Imputation: Replaces missing values with the average of the attribute.
• Median Imputation: Replaces missing values with the middle value, useful when outliers
exist.
• Mode Imputation: Replaces missing values with the most frequent value.
• Deletion Method: Removes records that contain missing values.
• Interquartile Range (IQR): Flags values that lie far outside the range between Q1 and Q3,
commonly more than 1.5 × IQR below Q1 or above Q3.
• Z-Score Method: Identifies outliers by how many standard deviations a value lies from the mean.
• Binning: Smooths noisy data by grouping values into bins.
• Regression Smoothing: Uses regression to predict and smooth noisy values.
• Duplicate Detection: Identifies and removes repeated records.
Examples:
• Replacing missing age values with the average age
• Removing repeated rows in a dataset
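
The two examples above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `impute` and `drop_duplicates`, the use of `None` to mark a missing value, and the sample ages are assumptions made for this sketch, not part of any particular tool.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Return a copy of `values` with None replaced by the chosen statistic."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("cannot impute: every value is missing")
    # Pick the statistic by name; an unknown name raises KeyError.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](present)
    return [fill if v is None else v for v in values]

def drop_duplicates(rows):
    """Remove repeated rows (given as tuples) while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Hypothetical data: None marks a missing age.
ages = [25, None, 31, 40, None, 24]
print(impute(ages, "mean"))    # missing ages filled with the average age
print(impute(ages, "median"))  # missing ages filled with the middle value
print(drop_duplicates([(1, "x"), (2, "y"), (1, "x")]))
```

Median imputation is usually the safer choice when the attribute contains extreme values, because a single outlier can pull the mean a long way.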

2. Data Integration
It involves merging data from various sources into a single, unified dataset. It can be challenging
due to differences in data formats, structures, and meanings.
• Used when data comes from databases, files, or APIs
• Removes redundancy between datasets
• Resolves conflicts in data values
Techniques used:
• Schema Matching: Aligns attributes from different data sources.
• Entity Resolution: Identifies records that refer to the same real-world entity.
• Correlation Analysis: Finds and removes redundant attributes.
• Data Conflict Resolution: Resolves inconsistencies in units or data values.
• Duplicate Elimination: Removes overlapping records after integration.
Example: Merging customer data from sales and marketing databases
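
A small sketch of such a merge is given below. It assumes that both sources identify a customer by an ID, that the marketing source uses different field names (`cust_id`, `mail`), and that conflicts are resolved by keeping the first non-empty value seen. The field names, the `SCHEMA_MAP` table and the sample records are all hypothetical.

```python
# Schema matching: map source-specific field names onto one shared schema.
SCHEMA_MAP = {"cust_id": "customer_id", "mail": "email"}

def normalize(record):
    """Rename fields so every source uses the same attribute names."""
    return {SCHEMA_MAP.get(field, field): value for field, value in record.items()}

def integrate(*sources):
    """Merge lists of customer dicts into one record per customer_id."""
    unified = {}
    for source in sources:
        for raw in source:
            record = normalize(raw)
            merged = unified.setdefault(record["customer_id"], {})
            for field, value in record.items():
                # Conflict rule (an assumption): keep the first non-empty value.
                if merged.get(field) in (None, ""):
                    merged[field] = value
    return list(unified.values())

sales = [
    {"customer_id": 1, "name": "Asha", "email": ""},
    {"customer_id": 2, "name": "Ravi", "email": "ravi@example.com"},
]
marketing = [{"cust_id": 1, "mail": "asha@example.com", "segment": "retail"}]

customers = integrate(sales, marketing)
print(customers)
```

Real entity resolution is harder than an exact ID match, since the same customer may appear under slightly different names or addresses; the exact-key join here is only the simplest case.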

3. Data Transformation
Data transformation converts data into a suitable form so that data mining algorithms can work
effectively.
• Bring data into a common format
• Improve mining efficiency
• Make data suitable for modeling
Techniques used:
• Min-Max Normalization: Scales data into a fixed range, usually 0 to 1.
• Z-Score Normalization: Transforms data using mean and standard deviation.
• Decimal Scaling: Normalizes data by moving the decimal point.
• Log Transformation: Reduces data skewness using logarithmic scaling.
• One-Hot Encoding: Converts categories into binary columns.
• Label Encoding: Assigns numeric labels to categorical values.
• Aggregation: Combines detailed data into summarized form.
Examples:
• Converting salary values into a fixed range (0–1)
• Changing text labels like Male/Female into numeric values
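
The normalization and encoding techniques listed above can be written out directly. The sketch below uses only the Python standard library; the function names are chosen for this illustration, the population standard deviation is used for the z-score (a sample standard deviation would be an equally valid choice), and categories are ordered alphabetically, which is an arbitrary convention.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so the smallest maps to new_min, the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant attribute: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def z_score(values):
    """Express each value as a number of standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer giving every |v| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

def label_encode(labels):
    """Assign an integer code to each category (alphabetical order)."""
    codes = {c: i for i, c in enumerate(sorted(set(labels)))}
    return [codes[label] for label in labels]

def one_hot(labels):
    """Turn each label into a binary vector with one column per category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

print(min_max([20000, 50000, 80000]))        # salaries scaled into 0..1
print(label_encode(["Male", "Female", "Male"]))
print(one_hot(["Male", "Female", "Male"]))
```

Label encoding implies an order between categories that may not exist, so one-hot encoding is often preferred for nominal attributes when the number of categories is small.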

4. Data Reduction
It reduces the dataset's size while maintaining key information. This can be done through feature
selection, which chooses the most relevant features, and feature extraction, which transforms the
data into a lower-dimensional space while preserving important details.
• Improves processing speed
• Saves storage space
• Makes analysis easier
Techniques used:
• Principal Component Analysis (PCA): Reduces dimensions by projecting data onto
principal components.
• Linear Discriminant Analysis (LDA): Reduces dimensions while maximizing class
separation.
• Filter Methods: Select features based on statistical measures.
• Wrapper Methods: Select features using model performance.
• Embedded Methods: Perform feature selection during model training.
• Simple Random Sampling: Selects data points randomly from the dataset.
• Stratified Sampling: Samples data proportionally from each class.
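
As an illustration of the last two techniques, the sketch below draws a stratified sample: it groups rows by class and then takes a simple random sample of the same fraction from each group. The function name, the `label_of` callback, the fixed `seed` and the rule of keeping at least one row per class are assumptions of this sketch.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample roughly `fraction` of the rows from every class separately."""
    rng = random.Random(seed)          # fixed seed so the sample is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per class, and never ask for more than exist.
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 80 rows of class "a" and 20 rows of class "b".
rows = [("a", i) for i in range(80)] + [("b", i) for i in range(20)]
picked = stratified_sample(rows, label_of=lambda row: row[0], fraction=0.1)
print(len(picked))
```

Compared with simple random sampling over the whole dataset, this keeps the class proportions of the sample close to those of the original data, which matters most when one class is rare.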

Benefits of Data Preprocessing

• Improves data quality
• Increases accuracy of mining results
• Reduces errors in models
• Makes data easier to understand
Advantages
• Improved Data Quality: Ensures data is clean, consistent, and reliable for analysis.
• Better Model Performance: Reduces noise and irrelevant data, leading to more accurate
predictions and insights.
• Efficient Data Analysis: Streamlines data for faster and easier processing.
• Enhanced Decision-Making: Provides clear and well-organized data for better business
decisions.

Data Cleaning in Data Mining

• Data cleaning is a key stage of the data mining process: it makes sure that the data being used
is free of errors and contains the necessary information. Typical tasks include correcting
errors, deleting invalid or duplicate records, and managing missing or incomplete records.
Thorough data cleaning is necessary before data mining, because conclusions drawn from
uncleaned data could be misleading or even wrong. This makes it an important exercise for
anyone handling big data, as it lays the groundwork for accurate and usable outcomes.

What is Data Cleaning in Data Mining?

• Data cleaning in data mining is the process of identifying and then correcting or removing
errors and inconsistencies in data so that analysis is accurate. Raw data usually contains
inaccuracies, outliers, missing entries, duplicates, and noise, all of which degrade results if
not handled correctly. In data cleaning, the raw data is pre-processed into a format that is
suitable for mining activities such as pattern recognition and predictive modelling. The aim
is to raise the overall quality of the data so that the data mining process yields meaningful
conclusions.

Characteristics of Data Cleaning:

• Accuracy: The values entered must be correct. Inaccurate data leads to wrong conclusions.
• Coherence: Related data elements must make logical sense together, so that values which
depend on one another are consistent.
• Validity: Values must conform to the rules of the data schema and to business rules. This
includes checking that entries fall within expected ranges and follow expected formats.
• Uniformity: Data formats and measurement units must be consistent within the data set. For
instance, all dates should use one format, and all numerical values of an attribute should use
the same unit.
• Data Verification: Data is compared against other trusted sources or formulas to verify its
accuracy. Verification helps uncover errors that would otherwise go unnoticed during data
preparation.
• Clean Data Backflow: Once data has been cleaned, the cleaned data must be fed back into the
source system to replace the erroneous data. This keeps data quality consistent over time
and ensures that the same mistakes are not repeated in later analyses.

Steps for Cleaning Data

1. Remove duplicates & irrelevant data
Delete repeated entries and data that is not useful for your analysis.
2. Fix structural errors
Correct spelling mistakes, inconsistent names, and formatting issues
(e.g., “N/A” and “Not Applicable” should be the same).
3. Remove or check outliers
Identify unusual values.
Remove them only if they are errors; keep them if they are meaningful.
4. Handle missing data
Fill missing values carefully or remove them, since many tools can’t handle missing data.
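
Step 3 can be illustrated with the two outlier checks named earlier, the IQR rule and the z-score. The sketch below only flags candidate values; whether to remove them remains a judgment call, as the step says. The multiplier `k=1.5`, the default threshold of 3 and the sample readings are conventional or made-up values chosen for this sketch.

```python
from statistics import mean, pstdev, quantiles

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def z_score_outliers(values, threshold=3.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))
print(z_score_outliers(readings, threshold=2.0))
```

On very small samples a single extreme value inflates the standard deviation, so a z-score threshold of 3 may never trigger; the IQR rule is less sensitive to this because quartiles are barely affected by one extreme value.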

What is noisy data?

Noisy data is a data set that contains extra meaningless data. Almost all data sets will contain a
certain amount of unwanted noise. Noisy data can be filtered and processed into a higher quality
data set.

Noisy data unnecessarily increases the amount of storage space required and can adversely affect
the results of any data mining analysis. Statistical analysis can use information gleaned from
historical data to weed out noisy data and facilitate data mining.
Noisy data can be caused by hardware failures, programming errors, and gibberish input from
speech or optical character recognition programs. Spelling errors, industry abbreviations and
slang can also impede machine reading. Natural fluctuations in sensors and measurement can add
extra noise to readings. Gathering too broad a data set can also make it hard to analyze.

Types of noisy data

Random noise is extra information that has no correlation to the underlying data that is somehow
introduced into the measurements or data set. It may also be called white noise.
Misclassified data is information that is incorrectly labeled or sorted in a data set. This can be
caused by human error or as a fault during data importing.

Uncontrolled variables are extra factors that affect the data but are not accounted for. They can
make the data look random when it is not or introduce patterns that aren't there.

Superfluous data is extra information that is completely unrelated to the information being
examined. There may be so much extra information that what you are looking for is completely
hidden.

How to clean noisy data

There are many methods to remove noise and produce the cleanest possible data. The exact
methods and implementations will depend on the data being worked on and the end goals.

Filtering is removing unwanted data. This can be as simple as removing certain categories or
types of data from the analysis. Analysts may also filter out outliers, such as unusually high or
low readings or ones that lie very far from the mean of the data set.

Data binning is where the data is sorted into groups or categories to remove some of the random
variance between entries.
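
One common form of this is smoothing by bin means: the values are sorted, split into bins of equal size, and each value is replaced by the mean of its bin. The sketch below assumes equal-frequency bins; the function name and the sample values are illustrative only.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, chunk them into bins, and replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, bin_size=3))
```

Larger bins smooth more aggressively but also hide more of the real variation, so the bin size is a trade-off rather than a fixed rule.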

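Linear regression is a mathematical method that fits a straight line describing the relationship
between the data and another variable. It shows how closely the data follows that relationship,
and noisy values can be smoothed by replacing them with the values the fitted line predicts.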
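
A minimal sketch of regression smoothing with one input variable follows. It fits a line by ordinary least squares and returns the fitted values in place of the noisy ones. The function names and the sample points are assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth(xs, ys):
    """Replace each observed y by the value the fitted line predicts."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]       # roughly y = 2x + 1 with small noise
print(fit_line(xs, ys))
print(smooth(xs, ys))
```

This only makes sense when the relationship really is close to linear; fitting a line to curved data would replace noise with a systematic error.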

Data Transformation

Data Transformation involves converting raw data from multiple heterogeneous sources into a
clean, standardized and analysis-ready format before loading it into the data warehouse.
Transformation ensures that the data stored in the warehouse is consistent, accurate and aligned
with business rules, ultimately improving the quality of reporting and decision-making.
The transformations can be divided into two categories:

Simple Data Transformations involve basic tasks like cleansing, standardization, aggregation and
filtering, used to prepare data for analysis or reporting through straightforward manipulation
techniques.

Complex Data Transformations involve advanced tasks like integration, migration, replication
and enrichment. They require techniques such as data modeling, mapping and validation and are
used to prepare data for machine learning, advanced analytics or data warehousing.

Applications of Data Transformation

Applications for data transformation are found in a number of industries:

Business Intelligence (BI): Transforming data for use in real-time reporting and decision-making
using BI technologies.

Healthcare: Ensuring interoperability across various healthcare systems by standardizing
medical records.

Financial Services: Compiling and de-identifying financial information for reporting and
compliance needs.

Retail: Improving customer experience by transforming data into an analytics-ready format
and analyzing customer behavior.

Customer Relationship Management (CRM): Converting customer data so that firms can gain
insights into consumer behavior, tailor marketing strategies and increase customer satisfaction.

Challenges in Data Transformation

While transformation improves data quality, several challenges must be addressed:

Handling Large Volumes: High data volumes may slow down transformation or require
scaling.

Maintaining Data Accuracy: Incorrect transformation logic can introduce new errors.

Performance Optimization: Complex transformations may increase ETL execution time.

Managing Changing Business Rules: As business needs evolve, transformation logic must be
updated regularly.
Data Integration:

Data integration is a type of data transformation. As part of the data transformation process, a
data pipeline may bring data in from multiple sources. It then combines this data to provide a
single, unified view of a data set across the enterprise.
For example, data about customers in a retail business may be split across multiple systems:
e-commerce sales, marketing emails, website analytics, advertising campaigns, web search, etc.
Bringing this data together and presenting it as a unified customer record can help businesses
answer questions such as where they acquire their highest-spend customers, which types of users
are most likely to refer their friends, and more.

Types of data integration

There are several ways to perform data integration:

Data warehousing. The traditional approach in which engineers physically move data into a
single location and a minimal number of tables. Data warehousing uses data modeling techniques
that differ from those of relational database systems and enable faster querying for BI use cases.

Virtualized integration. Virtualized integration provides access to numerous data sources from
within a single location. This approach misses out on some of the performance benefits provided
by data warehousing (though it may utilize techniques such as caching to speed up subsequent
access). The upside is that business users can access data where it currently lives without waiting
for engineers to import it into the warehouse.

Data mesh. Bringing data into a data warehouse typically relies on a centralized data
engineering team to create new data pipelines. This can create a bottleneck as the team’s
queue fills up with new requests. A data mesh architecture addresses this by modeling data as a
set of interconnected domains. A mesh architecture provides a self-service data platform that
makes it easier for teams to create their own data pipelines and for business users to find and
use the data sets those teams produce.

Data Reduction in Data Mining

Data reduction is a technique used in data mining to reduce the size of a dataset while still
preserving the most important information. This can be beneficial in situations where the dataset
is too large to be processed efficiently, or where the dataset contains a large amount of irrelevant
or redundant information.
There are several different data reduction techniques that can be used in data mining, including:

1. Data Sampling: This technique involves selecting a subset of the data to work with, rather
than using the entire dataset. This can be useful for reducing the size of a dataset while still
preserving the overall trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the
dataset, either by removing features that are not relevant or by combining multiple features
into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that
are most relevant to the task at hand.

Note that data reduction involves a trade-off between the accuracy and the size of the data. The
more aggressively the data is reduced, the greater the risk that the resulting model is less
accurate and less generalizable.

Methods of data reduction:

These are explained below.

1. Data Cube Aggregation:

This technique aggregates data into a simpler form. For example, suppose the data gathered for
your analysis covers the years 2012 to 2014 and records your company's revenue every three
months. If the analysis concerns annual sales rather than quarterly figures, the data can be
summarized so that it reports the total sales per year instead of per quarter.
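
The quarterly-to-annual roll-up described above can be sketched as follows. The revenue figures are invented for the illustration, and the function name and the `(year, quarter)` key layout are assumptions.

```python
from collections import defaultdict

def aggregate_by_year(quarterly):
    """Sum quarterly figures, keyed by (year, quarter), into yearly totals."""
    totals = defaultdict(int)
    for (year, _quarter), revenue in quarterly.items():
        totals[year] += revenue
    return dict(totals)

# Made-up quarterly revenue for 2012 to 2014.
quarterly_revenue = {
    (2012, "Q1"): 100, (2012, "Q2"): 120, (2012, "Q3"): 90,  (2012, "Q4"): 110,
    (2013, "Q1"): 130, (2013, "Q2"): 125, (2013, "Q3"): 115, (2013, "Q4"): 140,
    (2014, "Q1"): 150, (2014, "Q2"): 160, (2014, "Q3"): 145, (2014, "Q4"): 170,
}
annual_revenue = aggregate_by_year(quarterly_revenue)
print(annual_revenue)
```

Twelve values are reduced to three. The quarterly detail is lost, so this form of reduction suits analyses that only need the yearly level.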

2. Dimension reduction:
When some attributes are only weakly relevant, we keep just the attributes required for our
analysis. This reduces data size by eliminating outdated or redundant features.
3. Data Compression:

The data compression technique reduces the size of the files using different encoding
mechanisms (such as Huffman encoding and run-length encoding). It can be divided into two
types based on the compression technique.
• Lossless Compression -
Encoding techniques such as run-length encoding give a simple, modest reduction in data size.
Lossless compression algorithms allow the exact original data to be restored from the
compressed data.
• Lossy Compression -
Methods such as the discrete wavelet transform and PCA (principal component analysis) are
examples of this kind of compression. For example, the JPEG image format is lossy, yet the
decompressed image still conveys the same meaning as the original. In lossy compression,
the decompressed data may differ from the original data but remains useful enough to
retrieve information from it.
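
Run-length encoding is simple enough to show in full. The sketch below stores each run of repeated characters as a (character, count) pair and shows that decoding restores the input exactly, which is what makes the method lossless. The function names are chosen for this illustration.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (char, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

original = "AAAABBBCCD"
encoded = rle_encode(original)
print(encoded)
print(rle_decode(encoded) == original)   # lossless: the round trip is exact
```

The saving depends on the data: long runs compress well, while text with few repeated neighbours can end up larger after encoding than before.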

4. Numerosity Reduction:

In this reduction technique, the actual data is replaced with mathematical models or smaller
representations of the data. With parametric methods, only the model parameters need to be
stored instead of the actual data. Non-parametric methods include clustering, histograms, and
sampling.
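
As one non-parametric example, an equal-width histogram replaces the individual values with a start point, a bin width and one count per bin. The function name, the choice of equal-width bins and the sample values below are assumptions of this sketch.

```python
def equal_width_histogram(values, bins):
    """Summarize values as (lowest value, bin width, list of counts per bin)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the end, so clamp it to the last bin.
        index = min(int((v - lo) / width), bins - 1) if width else 0
        counts[index] += 1
    return lo, width, counts

values = [1, 2, 2, 3, 5, 7, 8, 9, 10]
print(equal_width_histogram(values, bins=3))
```

Nine numbers are reduced to a start value, a width and three counts. The exact values cannot be recovered, only how many fell into each interval.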

5. Discretization & Concept Hierarchy Operation:

Data discretization techniques divide attributes of a continuous nature into intervals. Many
continuous attribute values are replaced by a small number of interval labels, so that mining
results are shown in a concise and easily understandable way.
• Top-down discretization -
If you first choose one or a few points (so-called breakpoints or split points) to divide the
whole range of attribute values, and then repeat this on the resulting intervals until done, the
process is known as top-down discretization, also called splitting.
• Bottom-up discretization -
If you first treat all the continuous values as potential split points and then discard some of
them by merging neighboring values into intervals, the process is called bottom-up
discretization, also called merging.
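
Once split points have been chosen, by whichever of the two strategies, replacing values with interval labels is straightforward. In the sketch below the split points (18 and 60) and the labels are hypothetical, and a value equal to a split point is placed in the upper interval.

```python
from bisect import bisect_right

def discretize(values, split_points, labels):
    """Map each value to the label of the interval it falls in.

    `split_points` must be sorted, and there must be one more label than
    split points. A value equal to a split point goes to the upper interval.
    """
    if len(labels) != len(split_points) + 1:
        raise ValueError("need exactly one more label than split points")
    return [labels[bisect_right(split_points, v)] for v in values]

ages = [5, 17, 18, 42, 70]
print(discretize(ages, split_points=[18, 60], labels=["child", "adult", "senior"]))
```

The labels form the lowest level of a concept hierarchy for the attribute: individual ages roll up into age groups, which could in turn be grouped further.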

Advantages & Disadvantages of Data Reduction in Data Mining

Data reduction in data mining can have a number of advantages and disadvantages.

Advantages:

1. Improved efficiency: Data reduction can help to improve the efficiency of machine learning
algorithms by reducing the size of the dataset. This can make it faster and more practical to
work with large datasets.
2. Improved performance: Data reduction can help to improve the performance of machine
learning algorithms by removing irrelevant or redundant information from the dataset. This
can help to make the model more accurate and robust.
3. Reduced storage costs: Data reduction can help to reduce the storage costs associated with
large datasets by reducing the size of the data.
4. Improved interpretability: Data reduction can help to improve the interpretability of the results
by removing irrelevant or redundant information from the dataset.

Disadvantages:

1. Loss of information: Data reduction can result in a loss of information, if important data is
removed during the reduction process.
2. Impact on accuracy: Data reduction can impact the accuracy of a model, as reducing the size
of the dataset can also remove important information that is needed for accurate predictions.
3. Impact on interpretability: Data reduction can make it harder to interpret the results, as
removing irrelevant or redundant information can also remove context that is needed to
understand the results.
4. Additional computational costs: Data reduction can add additional computational costs to the
data mining process, as it requires additional processing time to reduce the data.

In conclusion, data reduction has both advantages and disadvantages. It can improve the
efficiency and performance of machine learning algorithms by reducing the size of the dataset.
However, it can also result in a loss of information and make it harder to interpret the results.
It's important to weigh the pros and cons of data reduction and carefully assess the risks and
benefits before implementing it.
