Data Preprocessing & Classification Techniques
Data Preprocessing & Classification Techniques
AY 2025-2026 SEM-V
Unit II - Syllabus
• These techniques help prepare the data for more accurate analysis.
Data Normalization
• Used in data mining to transform the values of a dataset into a
common scale.
• This is important because many machine learning algorithms are
sensitive to the scale of the input features and can produce better
results when the data is normalized.
• Normalization is used to scale the data of an attribute so that it falls
in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0.
• It is generally useful for classification algorithms.
Need of Data Normalization
• Normalization is generally required when we are dealing with
attributes on a different scale, otherwise, it may lead to a dilution in
effectiveness of an important equally important attribute(on lower
scale) because of other attribute having values on larger scale.
• when multiple attributes are there but attributes have values on
different scales, this may lead to poor data models while
performing data mining operations.
• So they are normalized to bring all the attributes on the same scale.
Methods of Data Normalization
• 1. Min-Max Normalization-
• Linear transformation is performed on the original data. Minimum and maximum value
from data is fetched and each value is replaced according to the following formula.
• Min(A) - It is the minimum absolute value A.
• Max(A) - It is maximum absolute value of A.
• v’ - It is the new value of each attribute data.
• v - It is the old value of each attribute data.
• new_max(A), new_min(A) is the max and min value within the range
• (i.e boundary value of range required) respectively.
• It is used to transform data into a standard normal distribution, ensuring that all
features are on the same scale.
• This process helps to avoid the dominance of certain features over others due to
differences in their scales, which can significantly impact the performance of
model.
In this technique, values are normalized based on mean and standard deviation
of the data A.
The formula used is:
v', v is the new and old of each entry in data respectively. σA, A is the standard
deviation and mean of A respectively.
Why Z-Score Normalization is Necessary?
• Normalization is necessary when dealing with features on different scales.
Without normalization, features with larger scales can dominate those with
smaller scales, leading to biased results in machine learning models.
• Z-score normalization addresses this issue by scaling the data based on its
statistical properties, making it particularly useful for algorithms that rely
on distance calculations, such as K-nearest neighbors and clustering
algorithms
• Z-score normalization ensures that all features are treated equally,
enabling the algorithm to identify meaningful patterns and relationships
more effectively.
70, 80, 90, 100, 110 find Z-score
• Mean= (70+80+90+100+110)/5 = 90
• Standard deviation= sqrt( (90-70)^2+ (90-80)^2+….(90-110)^2)
• = 14.14213
• Z-score = (80- 90)/(14.14213) = -0.70710678
• datasets with too many features can cause issues like slow computation
and overfitting.
• Dimensionality reduction helps to reduce the number of features while
retaining key information.
• Techniques like principal component analysis (PCA), singular value
decomposition (SVD) and linear discriminant analysis (LDA) convert data
into a lower-dimensional space while preserving important details.
• Example: when you are building a model to predict house prices with features
like bedrooms, square footage and location. If you add too many features such
as room condition or flooring type, the dataset becomes large and complex.
• Feature Selection
• The most relevant features that contribute the most to the
target variable are selected and irrelevant data are removed.
• This involves selecting a subset of relevant features from the
dataset.
• Feature selection is often performed to remove irrelevant or
redundant features from the dataset.
• It can be done using various techniques such as correlation
analysis, mutual information, and principal component
analysis (PCA).
• Feature Extraction
• This involves transforming the data into a lower-dimensional
space while preserving the important information.
• Feature extraction is often used when the original features
are high-dimensional and complex.
• It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix
factorization (NMF).
2. Numerosity Reduction:
• Data Sampling: This technique involves selecting a subset of the data points to work
with, rather than using the entire dataset. This can be useful for reducing the size of a
dataset while still preserving the overall trends and patterns in the data.
• Clustering: This technique involves grouping similar data points together and then
representing each group by a single representative data point.
• Data Aggregation: This technique involves combining multiple data points into a
single data point by applying a summarization function.
• Data Generalization: This technique involves replacing a data point with a more
general data point that still preserves the important information.
• Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset
3. Wavelet Transforms
• The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied
to a data vector X, transforms it to a numerically different vector, X’, of wavelet coefficients.
• The two vectors are of the same length. When applying this technique to data reduction, we
consider each tuple as an n-dimensional data vector, that is, X={x1,x2, … ,xn}, depicting n
measurements made on the tuple from n database attributes.
• For example, all wavelet coefficients larger than some user-specified threshold can be retained.
All other coefficients are set to 0.
• The resulting data representation is therefore very sparse, so that operations that can take
advantage of data sparsity are computationally very fast if performed in wavelet space.
• The technique also works to remove noise without smoothing out the main features of the data,
making it effective for data cleaning as well.
• Given a set of coefficients, an approximation of the original data can be constructed by applying
the inverse of the DWT used.
4. Data Compression:
• Reducing the size of data by encoding it in a more compact form, making it easier to AIA5 25/8
store and process
• Lossless Compression -
• Encoding techniques (Run Length Encoding) allow a simple and minimal data size
reduction. Lossless data compression uses algorithms to restore the precise original data
from the compressed data.
• Lossy Compression -
• Methods such as the Discrete Wavelet transform technique, PCA (principal component
analysis) are examples of this compression. For e.g., the JPEG image format is a lossy
compression, but we can find the meaning equivalent to the original image. In lossy-
data compression, the decompressed data may differ from the original data but are
useful enough to retrieve information from them.
Data Compression
Original Data
Approximated
Importance of data preprocessing
• Data preprocessing is an important step in the data process.
• It refers to the cleaning, transforming, and integrating of data in order to make it ready
for analysis.
• The goal of data preprocessing is to improve the quality of the data and to make it more
suitable.
b. Classification
• Typical applications
• Credit/loan approval
• target marketing, performance prediction,
• manufacturing
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category it is
Prediction
• something that may go to happen in the future.
• And just like that in prediction, we identify or predict the missing or
unavailable data for a new observation based on the previous data
that we have and based on the future assumptions
• In prediction, the output is a continuous value
• The model used to predict the unknown value is called a predictor
• the accuracy of the predictor refers to how well a given predictor
can guess the value of predicted attribute for a new data.
What Is Association Rule Mining?
• Association rule mining:
• Association rule mining is a technique used in data mining to discover frequent
patterns, relationships or associations among a set of items in large datasets.
• It is particularly useful in market basket analysis, where the goal is to understand
customer purchasing behavior by finding associations between different
products.
• Motivation: finding regularities in data
• What products were often purchased together? — Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
What Is Association Mining?
• Market Basket Analysis
• Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures)
that appear frequently in a data set.
• Finding frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data.
• it helps in data classification, clustering, and other data mining tasks.
• frequent pattern mining has become an important data mining task and a focused
theme in data mining research.
• Frequent itemset mining leads to the discovery of associations and correlations among items in
large transactional or relational data sets.
• The discovery of interesting correlation relationships among huge amounts of business
transaction records can help in many business decision-making processes such as catalog
design, cross-marketing, and customer shopping behavior analysis.
• A typical example of frequent itemset mining is market basket analysis.
• This process analyzes customer buying habits by finding associations between the different
items that customers place in their “shopping baskets”
• discovery of these associations can help retailers develop marketing strategies by gaining
insight into which items are frequently purchased together by customers.
• For instance, if customers are buying milk, how likely are they to also buy bread on the same
trip
• Suppose, as manager of an AllElectronics branch, to learn more about the
buying habits of your customers
• market basket analysis may be performed on the retail data of customer
transactions at your store.
• use these results to plan marketing or advertising strategies, or in the design of a
new catalog.
• In one strategy, items that are frequently purchased together can be placed in
proximity to further encourage the combined sale of such items.
• If customers who purchase computers also tend to buy antivirus software at
the same time, then placing the hardware display close to the software display
may help increase the sales of both items.
• In an alternative strategy, placing hardware and software at opposite ends of
the store may entice customers who purchase such items to pick up other items
along the way.
• Eg. information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following
association rule:
• Lift
Measures how much more likely Y is to be bought when X is bought compared to being bought
independently of X.
Formula:
• Lift values:
• >1 : Positive association (items co-occur more often than expected).
• =1 : No association (items co-occur as expected under independence).
• <1 : Negative association (items co-occur less often than expected).
Apriori Algorithm
• A classic algorithm for mining frequent itemsets for boolean association rules.
• Based on the principle that any subset of a frequent itemset must also be frequent.
• Steps:
• Generate candidate itemsets: Generate all possible itemsets of a given length.
• Count support: Count the support for each candidate itemset.
• Prune: Remove itemsets that do not meet the minimum support threshold.
• Repeat: Increase the length of itemsets by 1 and repeat the process until no more
frequent itemsets are found.
Generating Association Rules
• From the frequent itemsets identified, generate association rules.
• For each frequent itemset L, generate all non-empty subsets of L.
• For every non-empty subset s of L, output the rule s⇒(L−s) if the confidence of the rule
meets the minimum confidence threshold.
Implementation of Apriori Algorithm
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
The Apriori Algorithm
• Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in
Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
The Apriori Algorithm
• Find frequent itemsets using an iterative level-wise approach based on candidate generation.
• Input:
• D, a database of transactions;
• min sup, the minimum support count threshold.
• Output: L, frequent itemsets in D.
• Method:
(1) L1 = find frequent 1-itemsets(D);
(2) for (k = 2;Lk-1≠ϕ ;k++) {
(3) Ck = apriori gen(Lk-1);
(4) for each transaction t ϵ D {// scan D for counts
(5) Ct = subset(Ck, t); // get the subsets of t that are candidates
(6) for each candidate c ϵ Ct
(7) [Link]++;
(8) }
(9) Lk =(c ϵCk|[Link] ≥ min_sup) }
(10) return L = UkLk;
The Apriori Algorithm
• procedure apriori gen(Lk-1:frequent (k -1)-itemsets)
(1) for each itemset l1 ϵ Lk-1
(2) for each itemset l2 ϵ Lk-1
(3) if (l1[1] = l2[1]) ^ (l1[2] = l2[2])
^...^(l1[k -2] = l2[k -2])^(l1[k -1] < l2[k -1]) then {
(4) c = l1 ∞ l2; // join step: generate candidates
(5) if has infrequent_subset(c, Lk-1) then
(6) delete c; // prune step: remove unfruitful candidate
(7) else add c to Ck;
(8) }
(9) return Ck;
procedure has infrequent_subset(c: candidate k-itemset)
Lk-1: frequent (k -1)-itemsets); // use prior knowledge
(1) for each (k -1)-subset s of c
(2) if s ∉ Lk-1 then
(3) return TRUE;
(4) return FALSE;
Mining Association Rules—an Example
C3 Itemset L3
3rd scan Itemset sup
{B, C, E}
{B, C, E} 2
Why Is Frequent Pattern or Association Mining an
Essential Task in Data Mining?
• Foundation for many essential data mining tasks
• Association, correlation, causality
• Sequential patterns, temporal or cyclic association, partial periodicity, spatial and
multimedia association
• Broad applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis
• Web log (click stream) analysis, DNA sequence analysis, etc.
Generating Association Rules from Frequent Itemsets
• Once the frequent itemsets from transactions in a database D have been found, it
is straightforward to generate strong association rules from them (where strong
association rules satisfy both minimum support and minimum confidence).
• This can be done using Eq. for confidence, which we show again here for
completeness:
• Example:
• Min support 50 % , Min confidence: 60%
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Bayesian Classification: Why?
( x )2
1
and P(xk|Ci) is g ( x, , ) e 2 2
2
P(X | C i) g ( xk , Ci , Ci )
Naïve Bayes Classifier: Training Dataset
age income studentcredit_rating
buys_computer
<=30 high no fair no
Class: <=30 high no excellent no
C1:buys_computer = ‘yes’ 31…40 high no fair yes
>40 medium no fair yes
C2:buys_computer = ‘no’
>40 low yes fair yes
>40 low yes excellent no
Data to be classified: 31…40 low yes excellent yes
X = (age <=30, <=30 medium no fair no
Income = medium, <=30 low yes fair yes
Student = yes >40 medium yes fair yes
Credit_rating = Fair) <=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Naïve Bayes Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
age income studentcredit_rating
buys_computer
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 <=30 high no fair no
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 <=30 high no excellent no
31…40 high no fair yes
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444 >40 medium no fair yes
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 >40 low yes fair yes
>40 low yes excellent no
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 31…40 low yes excellent yes
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 <=30 medium no fair no
<=30 low yes fair yes
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667 >40 medium yes fair yes
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4 <=30 medium yes excellent yes
31…40 medium no excellent yes
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair) 31…40 high yes fair yes
>40 medium no excellent no
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
Naïve Bayes Classifier: An Example
• Play Tennis
The learning phase for
tennis example
P(Play=Yes) =
9/14
P(Play=No) = 5/14
We have four variables, we calculate for each we calculate
the conditional probability table
• Ex. Suppose a dataset with 1000 tuples, income=low (0), income= medium (990),
and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their “uncorrected” counterparts
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore loss of accuracy
• Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayes Classifier
• How to deal with these dependencies? Bayesian Belief Networks
Model Evaluation and Selection
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves
84
Evaluation of Classifiers
•Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
Confusion Matrix –
• a table that compares a classification model's predicted values to the
actual values in a dataset, and is used to evaluate the model's
performance. It's also known as an error matrix.
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)
86
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
A\P C ¬C Class Imbalance Problem:
C TP FN P
One class may be rare, e.g.
¬C FP TN N
fraud, or HIV-positive
P’ N’ All
Significant majority of the
87
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive are
actually positive
• Recall: completeness – what % of positive tuples did the classifier label as positive?
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and recall,
88
Classifier Evaluation Metrics: Example
89
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
• At i-th iteration, use Di as test set and others as training set
• Leave-one-out: k folds where k = # of tuples, for small sized data
• *Stratified cross-validation*: folds are stratified so that class dist. in each fold is
approx. the same as that in the initial data
90
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the
training set
• Several bootstrap methods, and a common one is .632 boostrap
• A data set with d tuples is sampled d times, with replacement, resulting in a training set of d
samples. The data tuples that did not make it into the training set end up forming the test set.
About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test
set (since (1 – 1/d)d ≈ e-1 = 0.368)
• Repeat the sampling procedure k times, overall accuracy of the model:
91
Model Selection: ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true
positive rate and the false positive rate
Vertical axis
• The area under the ROC curve is a represents the true
measure of the accuracy of the model positive rate
• Rank the test tuples in decreasing Horizontal axis rep.
order: the one that is most likely to the false positive rate
belong to the positive class appears at The plot also shows a
the top of the list diagonal line
• The closer to the diagonal line (i.e., the A model with perfect
closer the area is to 0.5), the less accuracy will have an
accurate is the model area of 1.0
92
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules
93
Thank You