Data Warehousing and
Mining
Course Code: CSC504
[Link] pandey
Unit 2
Introduction to Data Mining, Data Exploration and Data Preprocessing
Data Mining Task Primitives, Architecture, KDD Process, Issues in Data
Mining, Types of Attributes; Statistical Description of Data; Data
Visualization; Measuring Similarity and Dissimilarity. Why Preprocessing?
Data Cleaning; Data Integration; Data Reduction: Attribute Subset Selection,
Histograms, Clustering and Sampling; Data Transformation & Data
Discretization: Normalization, Binning, Histogram Analysis and Concept
Hierarchy Generation.
Data mining is the process of extracting useful information from huge sets of data.
It is also described as mining knowledge from data.
What types of data can be mined?
There are three main types: database data, data warehouse data and transactional data.
1) Database data (RDBMS)
• A set of tables, each with rows (tuples) and columns (attributes).
• While mining, we can search for trends, i.e. make predictions.
E.g. customer data can be mined to predict the credit risk of new customers based on previous
data.
2. Data warehouse data:
• A collection of data integrated from different sources to support querying and decision making.
• In a data warehouse, data is stored in a multidimensional structure (data cube) in which each dimension
corresponds to an attribute.
3. Transactional data:
• Each record is called a transaction (a sale, a flight booking, a user's clicks on a web page).
• A transaction has a transaction ID and a list of the items making up the transaction.
• From a transactional database, we can mine frequent patterns.
Other types of data: sequence data (e.g. stock market data), data streams (continuously
transmitted), spatial data (maps), hypertext, multimedia (audio or video), web data (web
pages), etc.
Functionalities:
1) Concept/class descriptions: Data is associated with classes or concepts.
Descriptions can be produced in 2 ways: data characterisation and data discrimination.
• Data characterisation: summarises the general features of a class.
Output → general overview
• Data discrimination: compares the features of one class against those of other classes.
Output → bar charts or curves
2. Mining frequent patterns, associations & correlations:
Frequent patterns: patterns that occur most often in the data.
• Frequent itemset
• Frequent subsequence
• Frequent substructure
Association analysis:
Association analysis identifies relationships between items.
Example: finding items that are frequently purchased together.
E.g. dry fruits and chocolate, or bread and jam.
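As a rough illustration of association analysis, the sketch below computes support and confidence, two standard measures that these notes do not name explicitly. The transactions are made-up illustrative data, and the function names are hypothetical.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "jam", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"dry fruits", "chocolate"},
    {"bread", "jam", "chocolate"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

bread_jam_support = support({"bread", "jam"}, transactions)
bread_to_jam_conf = confidence({"bread"}, {"jam"}, transactions)
print(bread_jam_support, bread_to_jam_conf)
```

Here bread and jam appear together in 3 of the 5 transactions, and jam appears in 3 of the 4 transactions that contain bread.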
Correlation analysis:
- A mathematical technique.
- Shows how strongly a pair of attributes is related.
E.g. taller people tend to weigh more.
3. Classification and regression for predictive analysis:
Classification: the process of finding a model that distinguishes between classes of data items.
A decision tree is one model commonly used for classification.
E.g. classifying Kit Kat and Munch chocolates by wrapper colour, where the rule based on wrapper
colour is the model.
Regression: a statistical methodology used for numeric prediction, including estimating missing
values (based on previous data).
E.g. in the series 1, 2, 3, 5, 6, 7, 9, 10 the missing values 4 and 8 can be predicted.
4. Cluster analysis: a technique used to group similar data points into clusters, revealing
hidden patterns and structures within datasets.
5. Outlier analysis (anomaly mining): Among the data items in a database, there may be
some that do not follow the general behaviour of the data. Such items are called outliers
(noise / exceptions). E.g. 2, 4, 6, 8 are outliers if we expect odd numbers.
Data Mining Task Primitives
A data mining task can be specified in the form of a data mining query, which is
input to the data mining system.
A data mining query is defined in terms of data mining task primitives.
These primitives allow the user to interactively communicate with the data
mining system during discovery to direct the mining process or examine the
findings from different angles or depths.
The data mining primitives specify the following:
1. The set of task-relevant data to be mined (the relevant attributes, or the part of the data of interest).
2. The kind of knowledge to be mined (the functionalities).
3. Background knowledge to be used in the discovery process (prior knowledge
about the domain is required); concept hierarchies are used for this.
4. Interestingness measures and thresholds for pattern evaluation (patterns scoring below the
threshold are considered uninteresting; those above it are considered interesting).
5. Representation for visualizing the discovered patterns (the results should be presented
properly, e.g. as tables, patterns or graphs).
Applications of Data Mining
Data Mining architecture
Data Mining refers to the detection and extraction of new patterns from the already collected
data. Data mining is the amalgamation of the field of statistics and computer science aiming to
discover patterns in incredibly large datasets and then transform them into a comprehensible
structure for later use.
A detailed description of the parts of the data mining architecture follows:
1. Data Sources: Databases, the World Wide Web (WWW) and data warehouses are data sources. The data in these
sources may be in the form of plain text, spreadsheets, or other media such as photos or videos. The WWW is one of the
biggest sources of data.
2. Database Server: The database server contains the actual data ready to be processed. It handles
data retrieval as requested by the user.
3. Data Mining Engine: One of the core components of the data mining architecture. It performs all kinds of data
mining techniques such as association, classification, characterization, clustering, prediction, etc.
4. Pattern Evaluation Modules: They are responsible for finding interesting patterns in the data, and sometimes they also
interact with the database servers to produce the result of the user requests.
5. Graphical User Interface: Because the user cannot be expected to understand the full complexity of the data mining
process, the graphical user interface helps the user communicate effectively with the data mining system.
6. Knowledge Base: The knowledge base is an important part of the architecture that guides the
search for result patterns. The data mining engine may also take inputs from the knowledge base, which
may contain data from user experiences. Its objective is to make the results more
accurate and reliable.
Types of Data Mining architecture:
1. No Coupling: The no-coupling architecture retrieves data from particular data sources. It does not use a
database for retrieving the data, even though a database is usually an efficient and accurate way to do so. The no-coupling
architecture is considered poor and is only used for very simple data mining processes.
2. Loose Coupling: In the loose coupling architecture, the data mining system retrieves data from the database and stores the data
back in those systems. This approach suits memory-based data mining.
3. Semi-Tight Coupling: The data mining system uses several advantageous features of the data warehouse system, including sorting,
indexing and aggregation. In this architecture, intermediate results can be stored in the database for better performance.
4. Tight Coupling: In this architecture, the data warehouse is treated as one of the most important components, and its
features are used to perform data mining tasks. This architecture provides scalability, performance and integrated
information.
Advantages of Data Mining:
•Assists in preventing future adversities by accurately predicting future trends.
•Contributes to the making of important decisions.
•Compresses data into valuable information.
•Provides new trends and unexpected patterns.
•Helps to analyze huge data sets.
•Aids companies to find, attract and retain customers.
•Helps the company to improve its relationship with the customers.
•Assists Companies to optimize their production according to the likability of a certain
product thus saving costs to the company.
Disadvantages of Data Mining:
•The work is intensive and requires high-performance teams and staff training.
•Large investments may be required, since data collection can consume many resources
and therefore carry a high cost.
•Lack of security could also put the data at huge risk, as the data may contain private
customer details.
•Inaccurate data may lead to the wrong output.
•Huge databases are quite difficult to manage.
Knowledge Discovery in Databases (KDD) :
Knowledge Discovery in Databases (KDD) refers to the overall process of discovering knowledge in data, including the
application of data mining methods.
It draws on a wide variety of fields, including Artificial Intelligence, Pattern Recognition, Machine
Learning, Statistics and Data Visualization.
The main goal is to extract knowledge from large databases. This is achieved by using data
mining algorithms to identify useful patterns according to predefined measures and thresholds.
The overall process of finding and interpreting patterns from data involves the repeated application of the following
steps:
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Representation
Data Cleaning
Data cleaning is the removal of noisy, irrelevant and inconsistent data from the data collection.
It includes handling missing values.
It includes cleaning noisy data, where noise is a random error or variance in a measured value.
In this step, noise and inconsistent data are removed.
Data Integration
Data integration is the combining of heterogeneous data from multiple sources into a common source (a data
warehouse).
I.e., in this step, multiple data sources may be combined into a single data source.
Data Selection
Data selection is defined as the process where data relevant to the analysis is decided and retrieved from the data
collection. This step in the KDD process is identifying and selecting the relevant data for analysis.
Data Transformation
Data transformation is the process of transforming data into the form required by the mining procedure.
This step involves reducing the data dimensionality, aggregating the data, normalizing it and discretizing it to prepare it for
further analysis.
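As one illustration of the normalization mentioned in this step, the sketch below shows min-max and z-score normalization. The income values and function names are hypothetical, and these two methods are common choices rather than the only ones.

```python
import statistics

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [(v - lo) * scale + new_min for v in values]

def z_score_normalize(values):
    """Rescale values to mean 0 and (population) standard deviation 1.

    Assumes the values are not all identical.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

incomes = [20000, 35000, 50000, 80000]  # hypothetical incomes
scaled = min_max_normalize(incomes)
z_scores = z_score_normalize(incomes)
print(scaled)
print(z_scores)
```

Min-max normalization keeps the relative spacing of values but is sensitive to extreme values. Z-score normalization is often preferred when the minimum and maximum are unknown or unreliable.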
Data Mining
This is the heart of the KDD process and involves applying various data mining techniques to the transformed data to
discover hidden patterns, trends, relationships, and insights. A few of the most common data mining techniques include
clustering, classification, association rule mining, and anomaly detection.
Pattern Evaluation
After the data mining, the next step is to evaluate the discovered patterns to determine their usefulness and relevance. This
involves assessing the quality of the patterns, evaluating their significance, and selecting the most promising patterns for
further analysis.
Knowledge Representation
This step involves representing the knowledge extracted from the data in a way humans can easily understand and use. This
can be done through visualizations, reports, or other forms of communication that provide meaningful insights into the data.
Challenges of Data Mining or Issues in Data Mining
1]Data Quality
The quality of data used in data mining is one of the most significant challenges. The accuracy, completeness, and
consistency of the data affect the accuracy of the results obtained. The data may contain errors, omissions, duplications, or
inconsistencies, which may lead to inaccurate results. Moreover, the data may be incomplete, meaning that some attributes or
values are missing, making it challenging to obtain a complete understanding of the data.
2]Data Complexity
Data complexity refers to the vast amounts of data generated by various sources, such as sensors, social media, and the
internet of things (IoT). The complexity of the data may make it challenging to process, analyze, and understand. In addition,
the data may be in different formats, making it challenging to integrate into a single dataset.
3]Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more data is collected, stored, and analyzed, the
risk of data breaches and cyber-attacks increases.
4]Scalability
Data mining algorithms must be scalable to handle large datasets efficiently. As the size of the dataset increases, the time and
computational resources required to perform data mining operations also increase.
5]Interpretability
Data mining algorithms can produce complex models that are difficult to interpret. This is because the algorithms use a
combination of statistical and mathematical techniques to identify patterns and relationships in the data. Moreover, the
models may not be intuitive, making it challenging to understand how the model arrived at a particular conclusion.
Types of attributes:
The initial phase of data preprocessing involves categorizing attributes into different types, which serves as a
foundation for subsequent data processing steps. Attributes can be broadly classified into two main types:
1. Qualitative (Nominal (N), Ordinal (O), Binary (B)).
2. Quantitative (Numeric, Discrete, Continuous)
Attributes Types:
An attribute is a property or characteristic of a data object. E.g. gender is a characteristic of a person.
The attributes may have values like:
1. Nominal attributes
2. Binary attributes
3. Ordinal attributes
4. Numeric attributes
5. Discrete versus continuous attributes
Qualitative Attributes:
1. Nominal Attributes :
Nominal attributes, as related to names, refer to categorical data where the values represent different categories or labels without any
inherent order or ranking. These attributes are often used to represent names or labels associated with objects, entities, or concepts.
Example :
2. Binary attributes:
• A nominal attribute that has only two states, 0 or 1, is called a binary attribute, where 0 means that the attribute
is absent and 1 means that it is present.
• Symmetric binary variable: both states (0 and 1) are equally valuable, so there is no preference for which
outcome is coded 0 and which is coded 1.
• For example: the marital status of a person is "Married" or "Unmarried". Both are equally valuable, so it is arbitrary
which is represented as 0 (absent) and which as 1 (present).
• Asymmetric binary variable: the outcomes of the states are not equally important. An example of such a variable is
the presence or absence of a relatively rare attribute. For example: a person is "handicapped" or "not handicapped".
The most important outcome is usually coded as 1 (present) and the other is coded as 0 (absent).
3. Ordinal attributes: An ordinal attribute is an attribute whose states have a meaningful order or rank.
The intervals between states are uneven, so arithmetic operations are not possible;
however, logical (comparison) operations may be applied. For example, age treated as an ordinal attribute can have three
states based on uneven age ranges. Similarly, income can be treated as an ordinal attribute
categorized as low, medium or high based on the income value.
Quantitative Attributes:
4. Numeric attributes: Numeric attributes are quantifiable. They are measured as a quantity, which
can have an integer or real value. They can be of two types:
(a) Interval-scaled attributes: Interval-scaled attributes are measured on a linear scale of equal-sized units.
Examples: temperature in degrees Celsius and calendar dates. These attributes allow ordering, comparing and
quantifying the difference between values, so addition and subtraction are meaningful. They have no true
zero point, however, so multiplication and division (ratios) are not meaningful.
For instance, if a liquid is at 40 degrees and we add 10 degrees, it will be 50 degrees.
However, a liquid at 40 degrees does not have twice the temperature of a liquid at 20 degrees, because 0
degrees does not represent "no temperature".
(b) Ratio-scaled attributes: Ratio-scaled attributes are numeric attributes with an inherent zero point, so
one value can be expressed as a multiple of another. Examples: weight, height and counts. Some ratio-scaled
measurements are positive values on a non-linear scale (for example, an approximately exponential scale),
which is why they may need special handling.
There are three different ways to handle the ratio-scaled variables:
• As interval scaled variables. The drawback of handling them as interval scaled is that it can distort the
result.
• As continuous ordinal scale.
• Transforming the data (for example, logarithmic transformation) and then treating the results as
interval scaled variables.
5. Discrete versus continuous attributes:
• If an attribute can take any value between two specified values it is called continuous; otherwise it is
discrete. An attribute may be continuous on one scale and discrete on another.
• For example: if we measure the amount of water consumed by counting the individual water
molecules it is discrete; otherwise it is continuous.
• Examples of continuous attributes include time spent waiting, direction of travel and water consumed.
• Examples of discrete attributes include the voltage output of a digital device and a person's age in years.
Statistical Description of Data:
Basic statistical descriptions can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers.
For data preprocessing tasks, we want to learn about data characteristics
regarding the central tendency and dispersion of the data.
Measures of central tendency include mean, median, and mode.
1. Measures of central tendency
2. Dispersion of data
3. Graphic displays of basic statistical descriptions of data
Measure of central tendency
1. Mean:
• It is one of the most common measures of central tendency.
• It can be used with both continuous and discrete attributes.
• It is mostly used with continuous attributes.
• The mean is equal to the sum of all the values in a data set divided by the total number of values in the
data set:
µ = (x1 + x2 + ... + xN) / N
For example, the mean µ of the following data is:
85 55 89 66 25 14 96 78 87 45 92
µ = 732/11 ≈ 66.5
• The mean is the only measure of central tendency for which the sum of the deviations of each value from it
is always zero.
• If weights wi are associated with the values xi, then the weighted arithmetic mean (weighted average) is:
µ = (w1x1 + w2x2 + ... + wNxN) / (w1 + w2 + ... + wN)
• The mean has one disadvantage: it is highly susceptible to the influence of outliers. In such
situations the median is a better measure of central tendency.
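The mean and weighted mean described above can be checked with a short sketch. The data values are the ones from the example in these notes. The weights in the second call are made up for illustration, and `weighted_mean` is a hypothetical helper name.

```python
values = [85, 55, 89, 66, 25, 14, 96, 78, 87, 45, 92]

# Arithmetic mean: sum of all values divided by the number of values.
mean = sum(values) / len(values)  # 732 / 11

# The deviations from the mean always sum to (approximately) zero.
deviation_sum = sum(v - mean for v in values)

def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

print(round(mean, 1))
print(weighted_mean([10, 20], [1, 3]))  # hypothetical values and weights
```

Running this gives a mean of about 66.5 for the example data.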
2. Median:
Another measure of the center of data is the median. Suppose that a given data set of N distinct values is sorted in numerical order.
•If N is odd, the median is the middle value of the ordered set;
•If N is even, the median is the average of the middle two values.
Example: Median. Suppose we have the following values for salary (in thousands of dollars), shown in increasing order:
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
There is an even number of observations (i.e., 12); therefore, the median is not unique. It can be any value between the two
middlemost values of 52 and 56 (that is, the sixth and seventh values in the list). By convention, we assign the
average of the two middlemost values as the median, that is
(52+56)/2 = 108/2 = 54
Thus, the median is $54,000.
In probability and statistics, the median generally applies to numeric data; however,
we may extend the concept to ordinal data.
Suppose that a given data set of N values for an attribute X is sorted in increasing
order.
•If N is odd, then the median is the middle value of the ordered set.
•If N is even, then the median is not unique.
In this case, the median is the two middlemost values and any value in between.
3. Mode
Another measure of central tendency is the mode. The mode for a set of data is the value
that occurs most frequently in the set.
It is possible for the greatest frequency to correspond to several different values, which
results in more than one mode.
•Data sets with one, two, or three modes: called unimodal, bimodal, and trimodal.
•At the other extreme, if each data value occurs only once, then there is no mode.
Example: Mode. Suppose we have the following values for salary (in thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
The data above is bimodal, i.e. the two modes are 52 and 70.
4. Midrange
The midrange can also be used to assess the central tendency of a numeric
data set.
It is the average of the largest and smallest values in the set.
Example: Midrange. Suppose we have the following values for salary (in
thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
The midrange of the data is
(30,000+110,000)/2=$70,000
Thus, the midrange is $70,000
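The four measures of central tendency can be computed for the salary example with Python's standard `statistics` module. This is only a sketch for checking the worked values. The midrange has no library function here, so it is computed directly.

```python
import statistics

# Salary values (in thousands of dollars) from the examples in these notes.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(salaries)
median = statistics.median(salaries)    # average of the two middle values
modes = statistics.multimode(salaries)  # every value with the highest frequency
midrange = (min(salaries) + max(salaries)) / 2

print(mean, median, modes, midrange)
```

Note that `statistics.multimode` requires Python 3.8 or later.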
Central Tendency Measures for different attributes:
Central Tendency Measures for Numerical Attributes: Mean, Median, Mode
Central Tendency Measures for Categorical Attributes:
•Central Tendency Measures for Nominal Attributes: Mode
•Central Tendency Measures for Ordinal Attributes: Mode, Median
Example:What are central tendency measures (mean, median, mode) for the following attributes?
Solution:
attr1={2,4,4,6,8,24}
mean=(2+4+4+6+8+24)/6=8 average of all values
median=(4+6)/2=5 avg. of two middle values
mode = 4 most frequent item
attr2={2,4,7,10,12}
mean=(2+4+7+10+12)/5=7 average of all values
median=7 middle value
mode = none (no mode) all values have the same frequency.
attr3={xs,s,s,s,m,m,l}
mean is meaningless for categorical attributes.
median=s middle value
mode = s most frequent item
DISPERSION
DISPERSION measures the extent to which the items vary from some central value.
         series a   series b   series c
         100        98         1
         100        99         2
         100        100        3
         100        101        4
         100        102        490
total    500        500        500
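All three series have the same total and mean, yet they are spread very differently. The sketch below makes that concrete using the range and the population standard deviation. These two measures are chosen here for illustration and are not named in the table itself.

```python
import statistics

series = {
    "a": [100, 100, 100, 100, 100],
    "b": [98, 99, 100, 101, 102],
    "c": [1, 2, 3, 4, 490],
}

summary = {}
for name, values in series.items():
    mean = statistics.mean(values)
    value_range = max(values) - min(values)  # largest minus smallest value
    std_dev = statistics.pstdev(values)      # population standard deviation
    summary[name] = (mean, value_range, std_dev)
    print(name, mean, value_range, round(std_dev, 2))
```

Series a has no spread at all, series b varies slightly around 100, and series c is dominated by one extreme value.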
1. EUCLIDEAN DISTANCE
2. MINKOWSKI DISTANCE
3. JACCARD DISTANCE
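The notes list these three measures without formulas. The sketch below uses the standard textbook definitions: Minkowski distance of order h, Euclidean distance as the special case h = 2, and Jaccard distance as one minus the ratio of shared items to all items. Function names and sample inputs are illustrative assumptions.

```python
def minkowski_distance(p, q, h):
    """Minkowski distance of order h between two equal-length numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

def euclidean_distance(p, q):
    """Euclidean distance is the Minkowski distance with h = 2."""
    return minkowski_distance(p, q, 2)

def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1 - len(a & b) / len(union)

print(euclidean_distance((1, 2), (4, 6)))
print(minkowski_distance((1, 2), (4, 6), 1))  # h = 1 is the Manhattan distance
print(jaccard_distance({"bread", "jam"}, {"bread", "butter"}))
```

Euclidean and Minkowski distances suit numeric attributes, while Jaccard distance suits set-valued or asymmetric binary data.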
Data Visualization
What is Data Visualization and Why is It Important?
Data visualization uses charts, graphs and maps to present information clearly and simply. It turns complex data into
visuals that are easy to understand.
With large amounts of data in every industry, visualization helps spot patterns and trends quickly, leading to faster
and smarter decisions.
Common Types of Data Visualization
There are various types of visualizations where each has a unique purpose in data representation. Here are the most
common types:
1. Charts: They are used to compare data points across different categories or to show trends over time. Examples:
Bar Charts, Line Charts and Pie Charts
2. Graphs: They are used to visualize relationships between variables, which makes it easier to analyze correlations,
trends and outliers. Examples: Scatter Plots, Histograms
3. Maps: They are used to display geographical data, which provides spatial context to trends and patterns. Examples:
Geographic Maps, Heat Maps
4. Dashboards: They combine multiple visualizations into a single interface, which provides real-time insights and interactive
features for users to explore data
5. Histogram: Usually shows the distribution of values of a single variable.
• Divide the values into bins and show a bar plot of the number of objects in each bin.
• The height of each bar indicates the number of objects.
• Shape of histogram depends on the number of bins
• Example: Petal Width
6. Box plots, also known as box-and-whisker plots, are a powerful
visualization tool in data mining for summarizing and
comparing distributions of numerical data. They display key
statistical measures like the median, quartiles, and potential
outliers, allowing for quick assessment of data spread and
skewness.
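The numbers a box plot draws can be sketched without any plotting library. The example reuses the salary values from the median example in these notes. The 1.5 × IQR rule for flagging outliers is a common convention assumed here, and quartile values depend on the interpolation method, so other tools may report slightly different quartiles.

```python
import statistics

# Salary values (in thousands of dollars) from the median example.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Quartiles Q1, Q2 (median) and Q3, using the module's default method.
q1, median, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in salaries if v < lower_fence or v > upper_fence]

print(min(salaries), q1, median, q3, max(salaries))
print(outliers)
```

With these values the salary of 110 falls above the upper fence and would be drawn as a separate point.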
7. Scatter Plot
In data mining, a scatter plot is a visualization tool used to
represent the relationship between two numerical variables. It
displays data points as individual dots on a two-dimensional
plane, with the position of each dot indicating the values of the
two variables for a specific data point. Scatter plots are valuable
for identifying patterns, trends, and correlations within
datasets.
8. Parallel coordinates are a visualization technique used in
data mining to represent and analyze high-dimensional data.
They allow users to explore relationships, patterns, and outliers
within datasets containing multiple variables.
9. Chernoff faces are a data visualization technique that
represents multivariate data using a human face, where
different facial features correspond to different variables. This
method leverages the human brain's ability to quickly and
easily interpret facial expressions to identify patterns and
relationships in complex datasets.
10. Timelines in data mining refer to both the historical
development of data mining techniques and the representation
of time-related data within the field. Data mining has evolved
significantly since its origins in the 1950s, with key periods
including the initial development of techniques, the KDD era,
and the current era of big data and advanced analytics.
Furthermore, data mining often analyzes time-series data (data
recorded at regular time intervals) to identify trends,
seasonality, and other patterns.
Data Preprocessing
•Definition: Technique to transform raw data into understandable format.
•Need: Real-world data = incomplete, inconsistent, noisy, error-prone.
•Purpose: Prepares data for further processing and analysis.
•Applications:
◦ Customer Relationship Management (CRM)
◦ Neural Networks (Rule-based systems)
•Importance in ML:
◦ Encodes dataset so algorithms can interpret it properly.
🔹 Steps in Data Preprocessing
1. Data Cleaning
•Goal: Remove noise, fix missing/inconsistent values.
•Methods:
◦ Missing values: Fill/delete.
◦ Smoothing: Remove noise (important for ML).
• Techniques: 1. Binning: Divide data into equal-sized bins.
• 2. Regression: Fit data to regression models.
• 3. Clustering: Group into similar clusters.
◦ Handle inconsistencies: Human errors, wrong entries.
◦ Remove duplicates: Prevent data bias.
4. Data Reduction
•Goal: Reduce data volume for faster access and less storage.
•Methods:
◦ Select relevant features; discard low-importance ones.
◦ Use encoding for size reduction.
◦ Lossless: No data lost after compression.
◦ Lossy: Some data lost.
◦ Aggregation: Combine data (e.g., daily to monthly summaries).
5. Data Discretization
•Converts continuous data into intervals.
•Reduces data complexity by dividing values into categories or bins.
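A minimal sketch of discretization, assuming equal-width binning (one of several possible schemes). The ages, the number of bins and the function name are hypothetical.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    labels = []
    for v in values:
        index = int((v - lo) / width) if width else 0
        labels.append(min(index, k - 1))  # the maximum value goes in the last bin
    return labels

ages = [5, 12, 19, 23, 37, 41, 58, 64, 80]  # hypothetical continuous values
bin_labels = equal_width_bins(ages, 3)
print(bin_labels)
```

Each bin here spans 25 years, so the nine ages collapse into three categories.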
6. Data Sampling
•Used when datasets are large or constrained by time/memory.
•Select a subset that reflects the original data's pattern.
•Ensures work with manageable data size.
Question:
In real-world data, tuples with missing values for some attributes are a common occurrence. Describe various
methods for handling this problem.
Answer – Methods to handle missing values:
1. Ignore the Tuple
◦ Used when missing data is negligible.
◦ The record is removed entirely.
2. Fill with Global Constant
◦ Replace missing value with a constant like “Unknown” or “0”.
3. Fill with Attribute Mean/Median/Mode
◦ Numerical attributes → use mean or median.
◦ Categorical → use mode (most frequent value).
4. Fill with Most Probable Value
◦ Use a predictive model (like regression or classification) to estimate missing value.
5. Forward/Backward Fill (Time-Series)
◦ Fill with previous or next known value.
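Several of the methods above can be sketched in plain Python. Missing values are represented as `None`, which is an assumption of this example, and the data and function names are made up.

```python
import statistics

def fill_with_constant(values, constant):
    """Replace every missing value with a fixed constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing numeric values with the mean of the known values."""
    mean = statistics.mean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]

def fill_with_mode(values):
    """Replace missing categorical values with the most frequent known value."""
    mode = statistics.mode(v for v in values if v is not None)
    return [mode if v is None else v for v in values]

def forward_fill(values):
    """Replace each missing value with the last known value before it."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if nothing is known yet
    return filled

incomes = [30, None, 50, 40, None]      # hypothetical numeric attribute
colours = ["red", None, "blue", "red"]  # hypothetical categorical attribute

print(fill_with_constant(colours, "Unknown"))
print(fill_with_mean(incomes))
print(fill_with_mode(colours))
print(forward_fill(incomes))
```

Ignoring the tuple corresponds to simply filtering out records that contain `None`. Filling with the most probable value needs a trained model and is not shown.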
Data Cleaning –
• Definition: Data cleaning is the process of identifying and removing (or correcting) inaccurate,
incomplete, or irrelevant data from a dataset.
• Purpose: To restore, remodel, or eliminate dirty or crude data.
• Techniques:
◦ Can be done using batch processing (via scripting), or
◦ Interactively using data cleansing tools.
•After Cleaning:
◦ The dataset should be consistent with other related datasets.
◦ Discrepancies usually occur due to:
• User entry mistakes
• Storage or transmission errors
• Conflicting data dictionary definitions
•Note: Some data cleaning methods are explained in the following section.
2.1 Missing Values
•Problem: In datasets, many records may have missing values for certain attributes (e.g., income).
✅ Methods to Handle Missing Values:
1. Ignore the Tuple
Skip the record if class label is missing.
Not useful if many attributes are missing.
2. Fill Manually
Best for small datasets.
Involves user effort to fill values.
3. Use a Global Constant
Replace missing values with a fixed label like “Unknown” or “-∞”.
4. Use Central Tendency (Mean/Median)
Example: If average income = $25,000 → use it for missing income values.
5. Use Class-Based Mean/Median
Use the mean/median of records in the same class as the missing value's record.
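Method 5 can be sketched as follows. The records, the class labels and the `fill_with_class_mean` helper are hypothetical, and the sketch assumes every class has at least one known value.

```python
from collections import defaultdict

# Hypothetical records: the class label is the credit-risk category.
records = [
    {"risk": "low", "income": 40000},
    {"risk": "low", "income": None},
    {"risk": "low", "income": 50000},
    {"risk": "high", "income": 20000},
    {"risk": "high", "income": None},
]

def fill_with_class_mean(records, class_key, attr):
    """Fill a missing attribute with the mean over records of the same class."""
    known = defaultdict(list)
    for r in records:
        if r[attr] is not None:
            known[r[class_key]].append(r[attr])
    class_means = {c: sum(vals) / len(vals) for c, vals in known.items()}
    filled = []
    for r in records:
        copy = dict(r)  # leave the input records unchanged
        if copy[attr] is None:
            copy[attr] = class_means[copy[class_key]]
        filled.append(copy)
    return filled

filled_records = fill_with_class_mean(records, "risk", "income")
for r in filled_records:
    print(r)
```

The low-risk record is filled with the low-risk mean and the high-risk record with the high-risk mean, rather than one overall average.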
2.2 Noisy Data
✅ What is Noisy Data?
•Noise = Random error or variance in measured variable.
•Arises due to:
◦ Instrument errors
◦ Data input issues
◦ Technological limitations
✅ Examples:
Encoding Error: e.g., Gender = Z
Invalid Value: e.g., Age = 200
Inconsistent Entry: e.g., Date_of_joining = 32-Aug-2007
🛠 Techniques to Handle Noisy Data
(1) Binning
•Smooths data by grouping sorted values into bins ("buckets").
•Works by analyzing neighborhood values.
Types of Binning:
•(a) By Mean: Replace each value in bin with bin’s mean.
•(b) By Median: Replace each value with bin’s median.
•(c) By Boundary: Replace each value with the closest boundary (min or max) of the bin.
Approach:
1. Sort the dataset.
2. Divide into equal-depth bins (each bin has equal number of values).
Ex. 2.12.1 – Data Smoothing question
Sorted data:
3, 7, 8, 13, 22, 22, 22, 26, 26, 28, 30, 37
Total records: 12
Bins = 3 → Each bin = 4 values (Equal-frequency binning)
🧮 Step 1: Equal Frequency Bins
•Bin 1: 3, 7, 8, 13
•Bin 2: 22, 22, 22, 26
•Bin 3: 26, 28, 30, 37
🧮 Step 2: Smoothing by Bin Mean
•Replace all values with mean of the bin (rounded to the nearest integer here)
•Bin 1: (3+7+8+13)/4 = 7.75 ≈ 8 → 8, 8, 8, 8
•Bin 2: (22+22+22+26)/4 = 23 → 23, 23, 23, 23
•Bin 3: (26+28+30+37)/4 = 30.25 ≈ 30 → 30, 30, 30, 30
🧮 Step 3: Smoothing by Bin Boundary
•Replace values with nearest bin boundary (min/max)
•Bin 1: min=3, max=13 → 3, 3, 3, 13 (8 is equally close to both boundaries; the lower one is used here)
•Bin 2: min=22, max=26 → 22, 22, 22, 26
•Bin 3: min=26, max=37 → 26, 26, 26, 37
•Final Result Summary:
•🔸 Bin Mean Smoothing:
•8, 8, 8, 8, 23, 23, 23, 23, 30, 30, 30, 30
•🔸 Bin Boundary Smoothing:
•3, 3, 3, 13, 22, 22, 22, 26, 26, 26, 26, 37
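The worked example can be reproduced with a short sketch. Function names are hypothetical. The code assumes the number of values divides evenly into the bins and sends ties to the lower boundary.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted values into n_bins bins of equal size (assumes even division)."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_mean(bins):
    """Replace every value in a bin with the bin's mean (not rounded)."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundary(bins):
    """Replace every value with the nearer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        # Ties (equal distance to both boundaries) go to the lower boundary.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

data = [3, 7, 8, 13, 22, 22, 22, 26, 26, 28, 30, 37]
bins = equal_frequency_bins(data, 3)
by_mean = smooth_by_mean(bins)
by_boundary = smooth_by_boundary(bins)
print(bins)
print(by_mean)
print(by_boundary)
```

The unrounded bin means are 7.75, 23 and 30.25.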
(2) Outlier Analysis by Clustering
•Outliers are extreme data values that differ significantly from other observations in the dataset.
•Outlier Analysis involves identifying these unusual or anomalous data points.
•Outliers can be detected through clustering, where similar values are grouped into clusters.
•Data points that fall outside these clusters are considered outliers.
Question:
Develop a model to predict the salary of college graduates with 10 years of work experience using linear regression.
Given Data:
Years of Experience (x): 3, 8, 9, 13, 3, 6, 11, 21, 1, 16
Salary (y) in $100: 30, 57, 64, 72, 36, 43, 59, 90, 20, 83
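One way to approach this question is ordinary least squares, which fits the line y = intercept + slope · x. The sketch below is an illustration rather than the official worked solution, and `fit_line` is a hypothetical helper name.

```python
experience = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
salary = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

def fit_line(xs, ys):
    """Least-squares estimates of slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(experience, salary)
predicted = intercept + slope * 10  # 10 years of experience

print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

With this data the fitted line is roughly y = 23.2 + 3.54x, giving a predicted salary of about 58.6 in the units of the given data.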
📘 DATA INTEGRATION -
Definition:
A data preprocessing technique that combines data from multiple sources to provide a unified view.
🔹 Common Sources:
Databases, data cubes, flat files.
🔹 Key Benefit:
Helps in data analysis by building a data warehouse.
🔸 Approaches to Data Integration:
1. Tight Coupling: Combines data from different sources into a single location.
◦ Uses ETL process: Extraction, Transformation, Loading.
2. Loose Coupling: Data stays in original sources.
◦ Interface is used to query and transform data from source.
🔸 Data Integration Techniques:
1. Manual Integration: Done manually by data analyst.
◦ Suitable for small datasets.
◦ Time-consuming for large or recurring tasks.
2. Middleware Integration:
◦ Middleware collects, normalizes, and stores data.
◦ Acts as bridge between legacy and modern systems.
3. Application-Based Integration:
◦ Uses software to extract, transform, load (ETL) data.
◦ Automates data transfer between systems.
4. Uniform Access Integration:
◦ Data is not moved; stays in original sources.
◦ Provides a virtual unified view to users.
📌 Example: Data Integration in an E-commerce Company
Scenario:
An e-commerce company (like Flipkart or Amazon) stores different types of information in separate systems:
Source | Type of Data
CRM System | Customer details (name, address, phone number)
Order Management System | Order details (date, payment, delivery status)
Inventory System | Product details (stock, price, description)
➤ Problem: All this data is stored in different systems, which makes it difficult to generate combined reports or perform analysis.
➤ Solution (Using Data Integration): The company uses the ETL process:
◦ Extract: Data is extracted from all three systems.
◦ Transform: The data is converted into a consistent format (e.g., standardizing date format, name format).
◦ Load: All data is loaded into a central data warehouse.
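The three ETL steps above can be sketched roughly as follows. Everything here is hypothetical: the record layouts, field names, sample values, and the `etl` function are made up for illustration, and the "warehouse" is just an in-memory list.

```python
from datetime import datetime

# Extract: hypothetical records pulled from the three source systems.
crm = [{"customer_id": 1, "name": "asha RAO"}]
orders = [{"order_id": 501, "customer_id": 1, "product_id": "P9",
           "date": "05/03/2024", "status": "Delivered"}]
inventory = [{"product_id": "P9", "description": "USB cable"}]


def to_iso_date(text):
    """Transform: convert DD/MM/YYYY (assumed source format) to YYYY-MM-DD."""
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()


def etl(crm, orders, inventory):
    # Transform: standardize name format and index the lookups by key.
    customers = {c["customer_id"]: c["name"].title() for c in crm}
    products = {p["product_id"]: p["description"] for p in inventory}
    warehouse = []
    for o in orders:
        # Load: one unified row per order.
        warehouse.append({
            "customer": customers[o["customer_id"]],
            "product": products[o["product_id"]],
            "order_date": to_iso_date(o["date"]),
            "delivery_status": o["status"],
        })
    return warehouse


report = etl(crm, orders, inventory)
print(report)
```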
➤ Result: The company can now generate a unified report, such as:
"Which customer purchased which product, when, and what is the delivery status?"
📘 Issues in Data Integration (with Examples)
When integrating data from multiple sources, several challenges arise. Below are key issues:
1️⃣Entity Identification Problem
Issue: Same real-world entity may have different names in different sources.
Example: One dataset has customer_id, another has customer_number, but both refer to the same customer.
✅ Solution:
• Use Schema Integration with metadata (data about data) to match attribute types, ranges, and meanings.
• Use Structural Integration to ensure attributes follow the same rules and relationships.
2️⃣Redundancy and Correlation Analysis
•Issue: Unnecessary repetition of data or data that can be derived from other fields.
•Example: If one table has age and another has date of birth, age is redundant as it can be calculated.
✅ Solution:
•Use Correlation Analysis to detect redundant attributes.
•Apply Chi-square (χ²) test for nominal data and correlation coefficient for numeric data.
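For numeric attributes, the correlation coefficient mentioned above can be computed as in this minimal sketch. It reuses the age / date-of-birth example; the birth years and the reference year 2024 are made-up values, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


birth_year = [1990, 1985, 2000, 1975, 1995]   # hypothetical values
age = [2024 - y for y in birth_year]          # age derived from birth year

r = pearson_r(birth_year, age)
print(round(r, 3))
```

A coefficient close to +1 or -1 means one attribute can be derived almost exactly from the other, so one of the two is a candidate for removal.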
3️⃣Tuple Duplication
Issue: Duplicate rows (tuples) may appear if data is not normalized.
•Example: Same transaction may appear twice if both billing and order tables are integrated.
✅ Solution:
Clean and deduplicate data during integration using unique keys or record IDs.
4️⃣Data Conflict Detection and Resolution
Issue: Same data may have conflicting values across datasets.
•Example: Hotel price is $100 in one dataset and ₹7500 in another due to currency differences.
✅ Solution:
Resolve conflicts by identifying the correct source or converting values to a common format.
5️⃣Data Warehousing
Meaning: Stores the unified data in a single location for analysis.
•Example: Combines all customer, sales, and product data into one central warehouse.
✅ Pros & Cons:
Allows complex queries, but increases storage and maintenance cost.
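Chi-square (χ²) Test
• Used to test correlation of nominal data.
• The form used here is the chi-square test of independence (the "Goodness of Fit" test is a related variant that compares one variable against an expected distribution).
• It tests whether two categorical variables are independent or related.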
• It helps in feature selection and relationship analysis between variables.
🔹 Step 1: Define the Hypotheses
•Null Hypothesis (H₀):
The two variables are independent (i.e., no relationship).
→ Rating and Restaurant Size are not related.
•Alternate Hypothesis (H₁):
The two variables are not independent (i.e., they are related).
→ Rating and Restaurant Size have a relationship.
🔹 Step 2: Create the Contingency Table
This table shows the distribution of data across two variables.
Observed Values (O) – These are actual collected data.
📊 Contingency Table (Observed Values O):
🧠 Example Summary:
•You have observed data (O)
•Calculate expected values (E)
•Apply χ² formula
•Compare with table value
•Draw conclusion
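The steps in this summary can be sketched as below. The observed counts are made up for illustration and are not the contingency table from the slides; the row/column meaning (rating by restaurant size) is only assumed from the hypotheses above. The critical value 3.841 is the standard χ² table value for 1 degree of freedom at the 0.05 significance level.

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total   # expected value E
            stat += (o - e) ** 2 / e                          # (O - E)^2 / E
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


# Hypothetical observed counts: rows = rating (good, bad), columns = size (small, large).
observed = [[20, 30],
            [30, 20]]

stat, dof = chi_square(observed)
critical = 3.841                      # table value for dof = 1, alpha = 0.05
reject_h0 = stat > critical           # True means the variables appear related
print(stat, dof, reject_h0)
```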
📘 2 DATA REDUCTION
📌 What is Data Reduction?
When we have a large amount of data (like terabytes), analyzing or mining it can be very slow and difficult.
That’s where Data Reduction techniques help us by:
•Reducing the size of the dataset
•Retaining important information
✅ 2.1 Methods of Data Reduction
Below are the main methods used for data reduction:
(1) Data Cube Aggregation
(2) Dimensionality Reduction
  (a) Step-wise Forward Selection
  (b) Step-wise Backward Selection
(3) Data Compression
  (a) Wavelet Transform
  (b) Principal Component Analysis (PCA)
(4) Numerosity Reduction
  a. Parametric Methods
  b. Non-Parametric Methods
(5) Data Transformation and Discretization
🔷 (1) Data Cube Aggregation
👉 What is it?
•This technique summarizes the data in a simpler format.
•Example:
Imagine you have sales data every 3 months from 2012 to 2014.
Instead of analyzing each quarter, you can aggregate the data yearly.
So:
•From: Quarterly sales
•To: Annual sales
This makes it:
•Easier to analyze
•Faster to process
📊 Diagram Explanation:
In the data cube diagram:
•Three dimensions are shown: Attribute, Spatial, and Time
•Aggregation moves to a higher level (e.g., quarterly → yearly)
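The quarterly-to-yearly roll-up described above might look like this in code. The sales figures are hypothetical placeholders, and `aggregate_by_year` is a name chosen for this sketch.

```python
from collections import defaultdict

# Hypothetical quarterly sales (year, quarter, amount); not taken from the slides.
quarterly_sales = [
    (2012, "Q1", 100), (2012, "Q2", 120), (2012, "Q3", 90), (2012, "Q4", 140),
    (2013, "Q1", 110), (2013, "Q2", 130), (2013, "Q3", 95), (2013, "Q4", 150),
    (2014, "Q1", 120), (2014, "Q2", 140), (2014, "Q3", 100), (2014, "Q4", 160),
]


def aggregate_by_year(rows):
    """Roll quarterly rows up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)   # 12 quarterly rows reduced to 3 yearly rows
```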
🔷 (2) Dimensionality Reduction
👉 What is it?
•Sometimes, not all attributes are useful for analysis.
•Dimensionality reduction helps by removing less important or redundant attributes.
🔹 Benefits:
•Reduces the number of features
•Improves processing time and model performance
🧠 Approaches to Dimensionality Reduction:
🧩 (a) Step-wise Forward Selection
✅ How it works:
•Start with an empty set of attributes.
•Gradually add the most useful attributes based on importance.
📌 Example:
Dataset attributes:
X1, X2, X3, X4, X5, X6
Start with: { }
Step 1: Add X1 → {X1}
Step 2: Add X2 → {X1, X2}
Step 3: Add X5 → {X1, X2, X5}
✅ Final selected set: {X1, X2, X5}
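A greedy version of this procedure is sketched below. The slides do not say how "usefulness" is measured, so the sketch assumes a caller-supplied `score` function and stops when the best remaining attribute improves the score by less than `min_gain`. The merit values are invented so that the run reproduces the example above.

```python
def forward_selection(attributes, score, min_gain=0.05):
    """Start from an empty set and repeatedly add the attribute that helps most."""
    selected = []
    remaining = list(attributes)
    while remaining:
        current = score(selected)
        gains = {a: score(selected + [a]) - current for a in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:      # nothing useful left to add
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical per-attribute merit; a real score would come from a model or statistic.
MERIT = {"X1": 0.50, "X2": 0.30, "X3": 0.02, "X4": 0.01, "X5": 0.15, "X6": 0.00}


def toy_score(subset):
    return sum(MERIT[a] for a in subset)


chosen = forward_selection(MERIT, toy_score)
print(chosen)
```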
🧩 (b) Step-wise Backward Selection
✅ How it works:
•Start with all attributes.
•Gradually remove the least useful one at a time.
📌 Example:
Start with: {X1, X2, X3, X4, X5, X6}
Step 1: Remove least important → {X1, X2, X3, X4, X5}
Step 2: Remove one more → {X1, X2, X3, X5}
Step 3: Remove again → {X1, X2, X5}
✅ Final selected set: {X1, X2, X5}
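Backward selection can be sketched the same way. As before, the `score` function, the `max_loss` threshold, and the merit values are assumptions for illustration; they are chosen so that X6, X4 and X3 are dropped in that order, as in the example above.

```python
def backward_selection(attributes, score, max_loss=0.05):
    """Start from all attributes and repeatedly drop the least useful one."""
    selected = list(attributes)
    while len(selected) > 1:
        current = score(selected)
        # Loss in score if each attribute were removed on its own.
        losses = {a: current - score([b for b in selected if b != a])
                  for a in selected}
        worst = min(losses, key=losses.get)
        if losses[worst] > max_loss:    # every remaining attribute matters
            break
        selected.remove(worst)
    return selected


# Hypothetical per-attribute merit; a real score would come from a model or statistic.
MERIT = {"X1": 0.50, "X2": 0.30, "X3": 0.02, "X4": 0.01, "X5": 0.15, "X6": 0.00}


def toy_score(subset):
    return sum(MERIT[a] for a in subset)


kept = backward_selection(MERIT, toy_score)
print(kept)
```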
🧩 (c) Combination of Forward and Backward Selection
✅ How it works: Uses both techniques together:
•Adds important attributes (forward)
•Removes less important ones (backward)
🔹 Advantage:
• Faster and more accurate selection process.
Method | Description
Data Cube Aggregation | Summarizes detailed data (e.g., quarterly → yearly)
Step-wise Forward Selection | Starts from empty set, adds best attributes
Step-wise Backward Selection | Starts with full set, removes worst ones
Forward + Backward Combination | Mix of both methods for best results
📘 (3) Data Compression
Purpose:
Reduces file size using encoding methods (e.g., Huffman, Run-Length Encoding).
🔹 Types:
•Lossless Compression:
◦ No data loss.
◦ Restores original data.
◦ Uses Run-Length Encoding.
•Lossy Compression:
◦ Some data loss (e.g., JPEG, PCA, Wavelet).
◦ Decompressed data may differ but still useful.
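Run-Length Encoding, named above as a lossless method, is simple enough to sketch in full. The function names and the sample string are made up for illustration; the round trip at the end shows that the original data is restored exactly.

```python
from itertools import groupby


def rle_encode(text):
    """Replace each run of repeated characters with a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Rebuild the original string from (character, count) pairs."""
    return "".join(ch * count for ch, count in pairs)


original = "AAAABBBCCD"
encoded = rle_encode(original)
restored = rle_decode(encoded)
print(encoded)
print(restored == original)   # lossless: decoding gives back the input
```

The saving depends on the data: long runs compress well, while text with few repeats can end up larger after encoding.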
(a) Wavelet Transform
•Converts vector X to X′ (same length).
•Compressed data = strongest wavelet coefficients.
•Useful for data cube, sparse, skewed data.
(b) Principal Component Analysis (PCA)
•Finds a smaller set of uncorrelated components (combinations of the original attributes) that capture most of the variation in the data.
•Reduces dimensionality and data size.
•Works well on sparse, skewed data.
📘 (4) Numerosity Reduction
Purpose:
Replaces original data with smaller representations.
🔹 Types:
•Parametric Methods
•Non-Parametric Methods
(a) Parametric Methods
Use statistical models to estimate data (not store actual data).
🔹 (i) Regression:
•Models: Simple or Multiple Linear Regression
•Equation: y = ax + b
◦ a: Slope
◦ b: Intercept
•Predicts data using known variables.
🔹 (ii) Log-Linear Model:
•Estimates probabilities in multi-dimensional space.
•Uses fewer data combinations.
(b) Non-Parametric Methods
•Store actual data subsets.
•Techniques: Histograms, clustering, sampling.
Non-Parametric Methods
1. Histogram:
◦ Represents data using binning to approximate frequency distribution.
◦ Common method of data reduction.
2. Clustering:
◦ Divides the entire dataset into groups/clusters.
◦ Cluster representation is used instead of the actual data.
◦ Helps detect outliers in the data.
3. Sampling:
◦ Reduces data size by selecting a small random subset.
◦ Types:
• (a) Simple Random Sampling: Equal probability for all items.
• (b) Sampling without Replacement: Selected object not returned.
• (c) Sampling with Replacement: Selected object returned to population.
• (d) Stratified Sampling: Draws proportional samples from each partition (useful in skewed data).
4. Data Cube Aggregation:
◦ Moves data from detailed to a fewer number of dimensions.
◦ Helps in reducing volume without losing analysis information.
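The sampling variants listed under Sampling above can be sketched with Python's standard `random` module. The population, the strata, the 10% fraction, and the helper `stratified_sample` are all hypothetical; the seed is fixed only so that repeated runs give the same sample.

```python
import random
from collections import defaultdict

rng = random.Random(42)                 # fixed seed for repeatable runs
population = list(range(1, 101))        # hypothetical record IDs 1..100

# Simple random sampling without replacement: no record is picked twice.
without_replacement = rng.sample(population, 10)

# Simple random sampling with replacement: a record may be picked again.
with_replacement = rng.choices(population, k=10)


def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction from every stratum (at least one record each)."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Skewed hypothetical data: 90 "young" customers and 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
picked = stratified_sample(customers, key=lambda r: r[0], fraction=0.1, rng=rng)
print(len(picked))
```

Stratifying guarantees that the small "senior" group is represented, which plain random sampling cannot promise on skewed data.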
(5) Data Transformation and Discretization
•Used to convert unstructured data into structured format.
•Important when data is moved to a new data warehouse.
•Helps in identifying patterns in well-structured data.
(i) Data Smoothing:
•Removes noise (distorted or meaningless data).
•Highlights special features in data.
•Detects trends and modifications in the data.
(ii) Data Aggregation:
•Collects data from multiple sources into one format.
•Summarizes data for reporting.
•Crucial for analyzing customer behavior, trends, etc.
(iii) Discretization:
•Converts continuous data into intervals or labels.
•Makes data easier to handle and analyze.
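One common way to discretize is equal-width binning, sketched below. The slides do not fix a particular method here, so this is an assumption; the ages and the choice of three bins are made up, and the sketch assumes the values are not all identical (otherwise the bin width would be zero).

```python
from collections import Counter


def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals (bin index 0..k-1)."""
    low, high = min(values), max(values)
    width = (high - low) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - low) / width), k - 1) for v in values], width


ages = [5, 12, 19, 25, 33, 41, 58, 64, 70, 80]     # hypothetical continuous data
labels, width = equal_width_bins(ages, k=3)

# Counting values per bin gives a histogram of the discretized attribute.
histogram = Counter(labels)
print(labels)
print(sorted(histogram.items()))
```

Each bin index can then be replaced by an interval or a label of the analyst's choosing.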
(iv) Normalization:
Normalization is the process of transforming data into a standard scale or range without distorting differences in the ranges of
values. It is one of the essential data pre-processing techniques for preparing data for mining and analysis.
🔸 Purpose of Normalization:
To scale data so that it fits within a specific range (e.g., [0, 1]).
To prevent attributes with large ranges from dominating those with smaller ranges.
To make mining and data analysis more efficient.
It helps machine learning algorithms to converge faster and perform better.
✅ Types of Normalization (Step-by-Step)
🔸 1. Min-Max Normalization
🔹 Purpose:
Scales the data to a fixed range, usually [0, 1].
🔹 Formula: