0% found this document useful (0 votes)

4 views94 pages

Data Preprocessing & Classification Techniques

Uploaded by

Samarth Agarwal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views94 pages

Data Preprocessing & Classification Techniques

Uploaded by

Samarth Agarwal

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

MIT Art Design and Technology University

MIT School of Computing, Pune

23CSE3007 –Data Mining & Analytics

Class - T.Y. (SEM-V), AIA

Unit - II Data Preprocessing and Classification

Prof. Swati Kadam

AY 2025-2026 SEM-V
Unit II - Syllabus

Unit II – Data Preprocessing and Classification

09 hours
Data cleaning and preprocessing techniques, Handling missing values and
outliers, normalization, data transformation and standardization: Log and
exponential transform, normal transform. Classification and prediction,
Association Rule mining, Bayesian classification: Naïve- Bayes classifier, Model
selection and model evaluation for Classifier
Data Pre-processing
Data Preprocessing Techniques
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
Data Cleaning
• Process of identifying and correcting errors or inconsistencies in the
dataset
• involves handling missing values, removing duplicates, and
correcting incorrect or outlier data to ensure the dataset is accurate
and reliable.
• Clean data is essential for effective analysis, as it improves the
quality of results and enhances the performance of data models.
Handling missing values and outliers
• Missing values occur when data is absent from a dataset.
• We can either ignore the rows with missing data or fill the gaps
manually, with the attribute mean, or by using the most probable
value.
• This ensures the dataset remains accurate and complete for
analysis.
To handle missing values
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant
• Use attribute mean
• Use the most probable value calculated by decision tree, Bayesian
formalism
• Use a measure of central tendency for the attribute (e.g., the mean
or median)
To handle Noisy Data or outliers
• Noise is a random error or variance in a measured variable.
• Noisy Data or outliers are irrelevant or incorrect data that is
difficult for machines to interpret, often caused by errors in data
collection or entry. It can be handled in several ways:
• 1. Binning Method: The data is sorted into equal segments, and each segment is smoothed
by replacing values with the mean or boundary values.
• 2. Regression: Data can be smoothed by fitting it to a regression function, either linear or
multiple, to predict values.
• Linear regression involves finding the “best” line to fit two attributes (or variables) so that
one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where more than two
attributes are involved and the data are fit to a multidimensional surface
• 3. Outlier analysis by Clustering
Binning
• Binning methods smooth a sorted data value by consulting its “neighborhood,” that is,
the values around it.
• The sorted values are distributed into a number of “buckets,” or bins.
• Because binning methods consult the neighborhood of values, they perform local
smoothing.
• example, fig 3.2 the data for price are first sorted and then partitioned into equal-
frequency bins of size 3 (i.e., each bin contains three values).
• In smoothing by bin means, each value in a bin is replaced by the mean value of the
bin.
• For example, the mean of the values 4, 8, and 15 in Bin 1 is 9.
• Therefore, each original value in this bin is replaced by the value 9.
• Similarly, smoothing by bin medians can be employed, in which each bin value is
replaced by the bin median. In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries.
• Each bin value is then replaced by the closest boundary value.
• Outlier analysis by Clustering: This
method groups similar data points
together, with outliers either being
undetected or falling outside the
clusters. These techniques help
remove noise and improve data
quality.
• Removing Duplicates:
• It involves identifying and eliminating repeated data entries to
ensure accuracy and consistency in the dataset.
• This process prevents errors and ensures reliable analysis by keeping
only unique records.
• Data decay, also known as data degradation, refers to the gradual
deterioration of data quality over time.
• It is a common phenomenon that affects all types of data, including
customer data, marketing data, and business data
• As data ages, it can become outdated or inaccurate due to a variety of
factors, such as changes in customer behavior, business operations, or
market trends.
• For example, contact information for customers may change as they
switch jobs, or product prices may fluctuate over time. move
Data Integration
• It involves merging data from various sources into a single, unified dataset.
• It can be challenging due to differences in data formats, structures, and meanings.
• Techniques like record linkage and data fusion help in combining data efficiently,
ensuring consistency and accuracy.
• Record Linkage is the process of identifying and matching records from different
datasets that refer to the same entity, even if they are represented differently. It helps in
combining data from various sources by finding corresponding records based on
common identifiers or attributes.
• Data Fusion involves combining data from multiple sources to create a more
comprehensive and accurate dataset. It integrates information that may be inconsistent
or incomplete from different sources, ensuring a unified and richer dataset for analysis.
Data Transformation
• It involves converting data into a format suitable for analysis.
• It is an important step in data analysis process that involves the conversion,
cleaning, and organizing of data into accessible formats. It ensures that the
information is accessible, consistent, secure, and finally recognized by the
intended business users.
• Common techniques include
• Normalization, which scales data to a common range; (normal Transform). It is done
in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
• Standardization, which adjusts data to have zero mean and unit variance;
• Discretization: Converting continuous data into discrete categories for easier
analysis. Discretization. techniques include binning, histogram analysis, cluster
analysis, decision tree analysis, and correlation analysis.
• Data Aggregation: Combining multiple data points into a summary form, such as
averages or totals, to simplify analysis.
• Concept Hierarchy Generation: Organizing data into a hierarchy of concepts to
provide a higher-level view for better understanding and analysis. For Example-The
attribute “city” can be converted to “country”.

• These techniques help prepare the data for more accurate analysis.
Data Normalization
• Used in data mining to transform the values of a dataset into a
common scale.
• This is important because many machine learning algorithms are
sensitive to the scale of the input features and can produce better
results when the data is normalized.
• Normalization is used to scale the data of an attribute so that it falls
in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0.
• It is generally useful for classification algorithms.
Need of Data Normalization
• Normalization is generally required when we are dealing with
attributes on a different scale, otherwise, it may lead to a dilution in
effectiveness of an important equally important attribute(on lower
scale) because of other attribute having values on larger scale.
• when multiple attributes are there but attributes have values on
different scales, this may lead to poor data models while
performing data mining operations.
• So they are normalized to bring all the attributes on the same scale.
Methods of Data Normalization
• 1. Min-Max Normalization-
• Linear transformation is performed on the original data. Minimum and maximum value
from data is fetched and each value is replaced according to the following formula.
• Min(A) - It is the minimum absolute value A.
• Max(A) - It is maximum absolute value of A.
• v’ - It is the new value of each attribute data.
• v - It is the old value of each attribute data.

• new_max(A), new_min(A) is the max and min value within the range
• (i.e boundary value of range required) respectively.

• Min-Max Normalization maps a value v of A to v' in the range

• [new_min(A),new_max(A)] by computing.
example
• 1000,2000,3000,9000 normalize to min:0 , max:1
• new_max(A)=1 , as given in question- max=1
• new_min(A)=0, as given in question- min=0
• max(A)=9000,
• min(A)=1000,
• v = 2000, putting all values in the formula, we get
• v '= (2000-1000) X (1-0)
----------------- + 0 = 0 .125
9000-1000

v=9000, putting all values in the formula, we get

v'=(9000-1000) X (1-0)
----------------- + 0 =1
9000-1000
2. Normalization by Decimal Scaling

Where j is the smallest integer such that Max(|ν’|) < 1

3. Standardization or z-Score Normalization(zero-mean)

• It is used to transform data into a standard normal distribution, ensuring that all
features are on the same scale.
• This process helps to avoid the dominance of certain features over others due to
differences in their scales, which can significantly impact the performance of
model.
In this technique, values are normalized based on mean and standard deviation
of the data A.
The formula used is:
v', v is the new and old of each entry in data respectively. σA, A is the standard
deviation and mean of A respectively.
Why Z-Score Normalization is Necessary?
• Normalization is necessary when dealing with features on different scales.
Without normalization, features with larger scales can dominate those with
smaller scales, leading to biased results in machine learning models.
• Z-score normalization addresses this issue by scaling the data based on its
statistical properties, making it particularly useful for algorithms that rely
on distance calculations, such as K-nearest neighbors and clustering
algorithms
• Z-score normalization ensures that all features are treated equally,
enabling the algorithm to identify meaningful patterns and relationships
more effectively.
70, 80, 90, 100, 110 find Z-score

• Mean= (70+80+90+100+110)/5 = 90
• Standard deviation= sqrt( (90-70)^2+ (90-80)^2+….(90-110)^2)
• = 14.14213
• Z-score = (80- 90)/(14.14213) = -0.70710678

• Z-scores: [-1.41421356 -0.70710678 0. 0.70710678

1.41421356]
Concept Hierarchy Generation for nominal Data
• Organizing data into a hierarchy of concepts
to provide a higher-level view for better
understanding and analysis.
• For Example-The attribute “city” can be
converted to “country”.
• attributes such as street can be generalized
to higher-level concepts, like city or
country.
• Many hierarchies for nominal attributes are
implicit within the database schema and
can be automatically defined at the schema
definition level.
Importance of Data Transformation
• Data transformation is important because it improves data quality, compatibility, and utility.
The procedure is critical for companies and organizations that depend on data to make informed
decisions because it assures the data's accuracy, reliability, and accessibility across many systems and
applications.
• Improved Data Quality: Data transformation eliminates mistakes, inserts in missing information,
and standardizes formats, resulting in higher-quality, more dependable, and accurate data.
• Enhanced Compatibility: By converting data into a suitable format, companies may avoid possible
compatibility difficulties when integrating data from many sources or systems.
• Simplified Data Management: Data transformation is the process of evaluating and modifying data
to maximize storage and discoverability, making it simpler to manage and maintain.
• Broader Application: Transformed data is more useable and applicable in a larger variety of
scenarios, allowing enterprises to get the most out of their data.
• Faster Queries: By standardizing data and appropriately storing it in a warehouse, query
performance and BI tools may be enhanced, resulting in less friction during analysis.
Data Transformation Techniques and Tools
• Programmatic Transformation: Automated using scripts in Python, R, or SQL.
• ETL Tools: Automate extract-transform-load processes for large-scale data (e.g., Talend,
Informatica).
• Normalization/Standardization: Use MinMaxScaler, StandardScaler from Scikit-learn.
• Categorical Encoding such as One-hot get_dummies (Pandas) and LabelEncoder (Scikit-learn)
• fillna (Pandas)
• SimpleImputer (Scikit-learn) for mean/median/mode imputation.
• Feature Engineering: Create new features using apply, map, transform in Pandas.
• Aggregation/Grouping: Use groupby in Pandas for sum, mean, count, etc.
• Text Preprocessing: Tokenization, stemming, stop-word removal using NLTK and SpaCy.
• Dimensionality Reduction: Use Scikit-learn's PCA and TruncatedSVD.
Log transformation
• Log transformation is a way to change data that has very large
numbers, very small numbers or a skewed shape. It works by taking
the logarithm of each number in the data which helps to “compress”
the large values and spread out the small ones.
• We use log transformation when:
• Some numbers are extremely large compared to others.
• Data is not following a normal (bell-shaped) distribution.
• We want to make patterns easier to spot for analysis.
• Log transformation shrinks big numbers and stretches small numbers
so the data becomes more balanced.
• Log transformation helps by:
• Making the data more symmetrical (normal-shaped).
• Making patterns between variables clearer.
• Making predictions more stable and reliable.
• Reducing the effect of extreme values (outliers).
Exponential transform
• a technique used to convert data values by applying an exponential
function, often to address skewness or improve model
performance.
• This transformation can make data distributions more symmetric
and linearize relationships, especially when dealing with
exponential growth or decay patterns.
• It's particularly useful when dealing with skewed data or when
certain statistical methods require data to be normally distributed.
• Y= e^x
• Benefits:
• Addressing Skewness: It can reduce the impact of outliers and
make the data distribution more symmetrical.
• Stabilizing Variance: It can make the variability of data more
consistent across different ranges.
• Linearizing Relationships: It can transform non-linear relationships
into linear ones, making them easier to model.
• When to use:
• When data exhibits a skewed distribution.
• When data has exponential growth or decay patterns.
• When statistical methods require normally distributed data.
• Example:
• If you have stock prices that tend to grow exponentially, applying a
logarithmic transformation (a type of exponential transformation)
can help stabilize the variance and make it easier to model trends.
Data Reduction
• It reduces the dataset's size while maintaining key information.
• This can be done through feature selection, which chooses the most
relevant features, and feature extraction, which transforms the data
into a lower-dimensional space while preserving important details.
It uses various reduction techniques such as,
1. Dimensionality Reduction (e.g., Principal Component Analysis
2. Numerosity Reduction
3. Wavelet Transform
4. Data Compression
1. Dimensionality Reduction

• datasets with too many features can cause issues like slow computation
and overfitting.
• Dimensionality reduction helps to reduce the number of features while
retaining key information.
• Techniques like principal component analysis (PCA), singular value
decomposition (SVD) and linear discriminant analysis (LDA) convert data
into a lower-dimensional space while preserving important details.

• Example: when you are building a model to predict house prices with features
like bedrooms, square footage and location. If you add too many features such
as room condition or flooring type, the dataset becomes large and complex.
• Feature Selection
• The most relevant features that contribute the most to the
target variable are selected and irrelevant data are removed.
• This involves selecting a subset of relevant features from the
dataset.
• Feature selection is often performed to remove irrelevant or
redundant features from the dataset.
• It can be done using various techniques such as correlation
analysis, mutual information, and principal component
analysis (PCA).
• Feature Extraction
• This involves transforming the data into a lower-dimensional
space while preserving the important information.
• Feature extraction is often used when the original features
are high-dimensional and complex.
• It can be done using techniques such as PCA, linear
discriminant analysis (LDA), and non-negative matrix
factorization (NMF).
2. Numerosity Reduction:

• Numerosity reduction is a technique used in data mining to reduce

the number of data points in a dataset while still preserving the
most important information.
• This can be beneficial in situations where the dataset is too large to
be processed efficiently, or where the dataset contains a large
amount of irrelevant or redundant data points.
Techniques for Numerosity Reduction

• Data Sampling: This technique involves selecting a subset of the data points to work
with, rather than using the entire dataset. This can be useful for reducing the size of a
dataset while still preserving the overall trends and patterns in the data.
• Clustering: This technique involves grouping similar data points together and then
representing each group by a single representative data point.
• Data Aggregation: This technique involves combining multiple data points into a
single data point by applying a summarization function.
• Data Generalization: This technique involves replacing a data point with a more
general data point that still preserves the important information.
• Data Compression: This technique involves using techniques such as lossy or lossless
compression to reduce the size of a dataset
3. Wavelet Transforms

• The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied
to a data vector X, transforms it to a numerically different vector, X’, of wavelet coefficients.
• The two vectors are of the same length. When applying this technique to data reduction, we
consider each tuple as an n-dimensional data vector, that is, X={x1,x2, … ,xn}, depicting n
measurements made on the tuple from n database attributes.
• For example, all wavelet coefficients larger than some user-specified threshold can be retained.
All other coefficients are set to 0.
• The resulting data representation is therefore very sparse, so that operations that can take
advantage of data sparsity are computationally very fast if performed in wavelet space.
• The technique also works to remove noise without smoothing out the main features of the data,
making it effective for data cleaning as well.
• Given a set of coefficients, an approximation of the original data can be constructed by applying
the inverse of the DWT used.
4. Data Compression:
• Reducing the size of data by encoding it in a more compact form, making it easier to AIA5 25/8
store and process
• Lossless Compression -
• Encoding techniques (Run Length Encoding) allow a simple and minimal data size
reduction. Lossless data compression uses algorithms to restore the precise original data
from the compressed data.
• Lossy Compression -
• Methods such as the Discrete Wavelet transform technique, PCA (principal component
analysis) are examples of this compression. For e.g., the JPEG image format is a lossy
compression, but we can find the meaning equivalent to the original image. In lossy-
data compression, the decompressed data may differ from the original data but are
useful enough to retrieve information from them.
Data Compression

Original Data Compressed

Data
lossless

Original Data
Approximated
Importance of data preprocessing
• Data preprocessing is an important step in the data process.
• It refers to the cleaning, transforming, and integrating of data in order to make it ready
for analysis.
• The goal of data preprocessing is to improve the quality of the data and to make it more
suitable.

• It improves accuracy and reliability:

• Preprocessing data removes missing or inconsistent data values resulting from human
or computer error, which can improve the accuracy and quality of a dataset, making it
more reliable.
• It makes data consistent:
• When collecting data, it's possible to have data duplicates, and discarding them during
preprocessing can ensure the data values for analysis are consistent, which helps
produce accurate results.
Importance of data preprocessing
• It increases the data's algorithm readability. Preprocessing enhances the data's
quality and makes it easier for machine learning algorithms to read, use, and interpret it.
• Increasing productivity: Clean data can increase overall productivity.
• Improving customer relations: Clean data can lead to fewer errors, which can
improve customer relations.
• Enabling better understanding of the audience: Clean data can help businesses build
a better understanding of their audience.
• Helping people understand data: Data visualization helps people see, interact with,
and better understand data.
• Enabling models to extract meaningful patterns: Clean data ensures that training
data mirrors reality, which enables models to extract meaningful patterns.
Supervised vs. Unsupervised Learning

• Supervised learning (classification)

• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of training data is unknown
• Given a set of measurements, observations, etc. with the aim of establishing the
existence of classes or clusters in the data
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
General Approach to Classification
• Data classification is a two-step process
• 1. a learning step (where a classification model is constructed)
• 2. a classification step (where the model is used to predict class labels for given
data)
• first step, a classifier is built describing a predetermined set of data
classes or concepts.
• classification algorithm builds the classifier by analyzing or “learning from”
a training set made up of database tuples and their associated class
labels.
• A tuple, X, is an attribute vector, X ={x1, x2, : : : , xn}, depicting n
measurements made on the tuple from n database attributes, respectively,
A1, A2, .. , An.
• Each tuple, X, is assumed to belong to a predefined class as determined
by another database attribute called the class label attribute.
• class label attribute is discrete-valued and unordered.
• It is categorical (or nominal) in that each value serves as a
category or class.
• The individual tuples making up the training set are referred
to as training tuples
• data tuples can be referred to as samples, examples,
instances, data points, or objects.
• This first step of the classification process can also be viewed as the
learning of a mapping or function, y =f(X) hat can predict the
associated class label y of a given tuple X.
a. Learning

b. Classification
• Typical applications
• Credit/loan approval
• target marketing, performance prediction,
• manufacturing
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category it is
Prediction
• something that may go to happen in the future.
• And just like that in prediction, we identify or predict the missing or
unavailable data for a new observation based on the previous data
that we have and based on the future assumptions
• In prediction, the output is a continuous value
• The model used to predict the unknown value is called a predictor
• the accuracy of the predictor refers to how well a given predictor
can guess the value of predicted attribute for a new data.
What Is Association Rule Mining?
• Association rule mining:
• Association rule mining is a technique used in data mining to discover frequent
patterns, relationships or associations among a set of items in large datasets.
• It is particularly useful in market basket analysis, where the goal is to understand
customer purchasing behavior by finding associations between different
products.
• Motivation: finding regularities in data
• What products were often purchased together? — Beer and diapers?!
• What are the subsequent purchases after buying a PC?
• What kinds of DNA are sensitive to this new drug?
• Can we automatically classify web documents?
What Is Association Mining?
• Market Basket Analysis
• Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures)
that appear frequently in a data set.
• Finding frequent patterns plays an essential role in mining associations,
correlations, and many other interesting relationships among data.
• it helps in data classification, clustering, and other data mining tasks.
• frequent pattern mining has become an important data mining task and a focused
theme in data mining research.
• Frequent itemset mining leads to the discovery of associations and correlations among items in
large transactional or relational data sets.
• The discovery of interesting correlation relationships among huge amounts of business
transaction records can help in many business decision-making processes such as catalog
design, cross-marketing, and customer shopping behavior analysis.
• A typical example of frequent itemset mining is market basket analysis.
• This process analyzes customer buying habits by finding associations between the different
items that customers place in their “shopping baskets”
• discovery of these associations can help retailers develop marketing strategies by gaining
insight into which items are frequently purchased together by customers.
• For instance, if customers are buying milk, how likely are they to also buy bread on the same
trip
• Suppose, as manager of an AllElectronics branch, to learn more about the
buying habits of your customers
• market basket analysis may be performed on the retail data of customer
transactions at your store.
• use these results to plan marketing or advertising strategies, or in the design of a
new catalog.
• In one strategy, items that are frequently purchased together can be placed in
proximity to further encourage the combined sale of such items.
• If customers who purchase computers also tend to buy antivirus software at
the same time, then placing the hardware display close to the software display
may help increase the sales of both items.
• In an alternative strategy, placing hardware and software at opposite ends of
the store may entice customers who purchase such items to pick up other items
along the way.
• Eg. information that customers who purchase computers also tend to buy
antivirus software at the same time is represented in the following
association rule:

• Rule support and confidence are two measures of rule interestingness.

• They respectively reflect the usefulness and certainty of discovered
rules. A support of 2% for means that 2% of all the transactions under
analysis show that computer and antivirus software are purchased
together.
• A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software.
• Typically, association rules are considered interesting if they satisfy both
a minimum support threshold and a minimum confidence threshold.
• These thresholds can be a set by users or domain experts
Basic Concepts
• Itemset
TID Items
– A collection of one or more items
– Item: A single article or product. 1 Bread, Milk
• Example: {Milk, Bread, Diaper} 2 Bread, Diaper, Beer, Eggs
– k-itemset 3 Milk, Diaper, Beer, Coke
• An itemset that contains k items 4 Bread, Milk, Diaper, Beer
• Support count ()-The number of transactions containing the 5 Bread, Milk, Diaper, Coke
itemset.
– Frequency of occurrence of an itemset
– E.g. ({Milk, Bread,Diaper}) = 2
• Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup
threshold
• Example Consider a simple dataset of transactions:
Transaction ID Items
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Coke}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Coke}
• Confidence :
Measures how often items in Y appear in transactions that contain X. It is an indication of how frequently
the rule X⇒Y is found to be true.
confidence(X=>Y) =P(Y|X)

• Lift
Measures how much more likely Y is to be bought when X is bought compared to being bought
independently of X.
Formula:
• Lift values:
• >1 : Positive association (items co-occur more often than expected).
• =1 : No association (items co-occur as expected under independence).
• <1 : Negative association (items co-occur less often than expected).
Apriori Algorithm
• A classic algorithm for mining frequent itemsets for boolean association rules.
• Based on the principle that any subset of a frequent itemset must also be frequent.
• Steps:
• Generate candidate itemsets: Generate all possible itemsets of a given length.
• Count support: Count the support for each candidate itemset.
• Prune: Remove itemsets that do not meet the minimum support threshold.
• Repeat: Increase the length of itemsets by 1 and repeat the process until no more
frequent itemsets are found.
Generating Association Rules
• From the frequent itemsets identified, generate association rules.
• For each frequent itemset L, generate all non-empty subsets of L.
• For every non-empty subset s of L, output the rule s⇒(L−s) if the confidence of the rule
meets the minimum confidence threshold.
Implementation of Apriori Algorithm
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4 = {abcd}
The Apriori Algorithm
• Pseudo-code:
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=∅; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in
Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
The Apriori Algorithm
• Find frequent itemsets using an iterative level-wise approach based on candidate generation.
• Input:
• D, a database of transactions;
• min sup, the minimum support count threshold.
• Output: L, frequent itemsets in D.
• Method:
(1) L1 = find frequent 1-itemsets(D);
(2) for (k = 2;Lk-1≠ϕ ;k++) {
(3) Ck = apriori gen(Lk-1);
(4) for each transaction t ϵ D {// scan D for counts
(5) Ct = subset(Ck, t); // get the subsets of t that are candidates
(6) for each candidate c ϵ Ct
(7) [Link]++;
(8) }
(9) Lk =(c ϵCk|[Link] ≥ min_sup) }
(10) return L = UkLk;
The Apriori Algorithm
• procedure apriori gen(Lk-1:frequent (k -1)-itemsets)
(1) for each itemset l1 ϵ Lk-1
(2) for each itemset l2 ϵ Lk-1
(3) if (l1[1] = l2[1]) ^ (l1[2] = l2[2])
^...^(l1[k -2] = l2[k -2])^(l1[k -1] < l2[k -1]) then {
(4) c = l1 ∞ l2; // join step: generate candidates
(5) if has infrequent_subset(c, Lk-1) then
(6) delete c; // prune step: remove unfruitful candidate
(7) else add c to Ck;
(8) }
(9) return Ck;
procedure has infrequent_subset(c: candidate k-itemset)
Lk-1: frequent (k -1)-itemsets); // use prior knowledge
(1) for each (k -1)-subset s of c
(2) if s ∉ Lk-1 then
(3) return TRUE;
(4) return FALSE;
Mining Association Rules—an Example

Transaction-id Items bought Min. support 50%

Min. confidence 50%
10 A, B, C
20 A, C
Frequent pattern Support
30 A, D
{A} 75%
40 B, E, F
{B} 50%
{C} 50%
For rule A ⇒ C: {A, C} 50%
support = support({A}∪{C}) = 50%
confidence = support({A}∪{C})/support({A}) = 66.6%
Another Example
Itemset sup
Itemset sup
Database TDB {A} 2
L1 {A} 2
Tid Items C1 {B} 3
{B} 3
10 A, C, D {C} 3
1st scan {C} 3
20 B, C, E {D} 1
{E} 3
30 A, B, C, E {E} 3
40 B, E
C2 Itemset sup C2
Itemset
{A, B} 1
L2 Itemset sup 2nd scan {A, B}
{A, C} 2
{A, C} 2 {A, C}
{A, E} 1
{B, C} 2
{B, C} 2 {A, E}
{B, E} 3
{B, E} 3 {B, C}
{C, E} 2
{C, E} 2 {B, E}
{C, E}

C3 Itemset L3
3rd scan Itemset sup
{B, C, E}
{B, C, E} 2
Why Is Frequent Pattern or Association Mining an
Essential Task in Data Mining?
• Foundation for many essential data mining tasks
• Association, correlation, causality
• Sequential patterns, temporal or cyclic association, partial periodicity, spatial and
multimedia association
• Broad applications
• Basket data analysis, cross-marketing, catalog design, sale campaign analysis
• Web log (click stream) analysis, DNA sequence analysis, etc.
Generating Association Rules from Frequent Itemsets

• Once the frequent itemsets from transactions in a database D have been found, it
is straightforward to generate strong association rules from them (where strong
association rules satisfy both minimum support and minimum confidence).
• This can be done using Eq. for confidence, which we show again here for
completeness:
• Example:
• Min support 50 % , Min confidence: 60%

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Bayesian Classification: Why?

• A statistical classifier: performs probabilistic prediction, i.e., predicts class

membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian classifier, has
comparable performance with decision tree and selected neural network classifiers
• Incremental: Each training example can incrementally increase/decrease the
probability that a hypothesis is correct — prior knowledge can be combined with
observed data
• Standard: Even when Bayesian methods are computationally intractable, they can
provide a standard of optimal decision making against which other methods can be
measured
Bayes’ Theorem: Basics
• Total probability Theorem: M
P(B)   P(B | A )P( A )
i i
i 1
• Bayes’ Theorem:
P( H | X)  P(X | H ) P( H )  P(X | H ) P( H ) / P(X)
P(X)
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (i.e., posteriori probability): the probability
that the hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the
hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40, medium income
Prediction Based on Bayes’ Theorem

• Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the

Bayes’ theorem

P( H | X)  P(X | H ) P( H )  P(X | H ) P( H ) / P(X)

P(X)
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X)
for all the k classes
• Practical difficulty: It requires initial knowledge of many probabilities, involving
significant computational cost
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X)
• This can be derived from Bayes’ theorem
P(X | C )P(C )
P(C | X)  i i
i P(X)
• Since P(X) is constant for all classes, only

needs to be maximized P(C | X)  P(X | C )P(C )

i i i
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no
dependence relation between attributes):
n
P( X | C i)   P( x | C i)  P( x | C i)  P( x | C i)  ... P( x | C i)
k 1 2 n
k 1
• This greatly reduces the computation cost: Only counts the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,
D| (# of tuples of Ci in D)
• If Ak is continous-valued, P(xk|Ci) is usually computed based on Gaussian distribution
with a mean μ and standard deviation σ

( x )2
1 
and P(xk|Ci) is g ( x,  ,  )  e 2 2

2 

P(X | C i)  g ( xk , Ci , Ci )
Naïve Bayes Classifier: Training Dataset
age income studentcredit_rating
buys_computer
<=30 high no fair no
Class: <=30 high no excellent no
C1:buys_computer = ‘yes’ 31…40 high no fair yes
>40 medium no fair yes
C2:buys_computer = ‘no’
>40 low yes fair yes
>40 low yes excellent no
Data to be classified: 31…40 low yes excellent yes
X = (age <=30, <=30 medium no fair no
Income = medium, <=30 low yes fair yes
Student = yes >40 medium yes fair yes
Credit_rating = Fair) <=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Naïve Bayes Classifier: An Example
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
• Compute P(X|Ci) for each class
age income studentcredit_rating
buys_computer
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 <=30 high no fair no
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 <=30 high no excellent no
31…40 high no fair yes
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444 >40 medium no fair yes
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 >40 low yes fair yes
>40 low yes excellent no
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 31…40 low yes excellent yes
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 <=30 medium no fair no
<=30 low yes fair yes
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667 >40 medium yes fair yes
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4 <=30 medium yes excellent yes
31…40 medium no excellent yes
• X = (age <= 30 , income = medium, student = yes, credit_rating = fair) 31…40 high yes fair yes
>40 medium no excellent no
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
Naïve Bayes Classifier: An Example
• Play Tennis
The learning phase for
tennis example
P(Play=Yes) =
9/14
P(Play=No) = 5/14
We have four variables, we calculate for each we calculate
the conditional probability table

Outlook Play=Ye Play=No Temperatur Play=Yes Play=No

s e
Sunny 2/9 3/5 Hot 2/9 2/5
Overcast 4/9 0/5 Mild 4/9 2/5
Rain 3/9 2/5 Cool 3/9 1/5
Humidity Play=Ye Play=N
s o Wind Play=Yes Play=No
High Strong 3/9 3/5
3/9 4/5
Normal Weak 6/9 2/5
6/9 1/5
The test phase for the tennis example
• Test Phase
– Given a new instance of variable values,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Given calculated Look up tables
P(Outlook=Sunny|Play=Yes) = 2/9 P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = P(Temperature=Cool|Play==No) = 1/5
3/9 P(Huminity=High|Play=No) = 4/5
P(Huminity=High|Play=Yes) = 3/9 P(Wind=Strong|Play=No) = 3/5
P(Wind=Strong|Play=Yes) = 3/9 P(Play=No) = 5/14
P(Play=Yes) = 9/14
P(Yes|x’): [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
P(No|x’): [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206

Given the fact P(Yes|x’) < P(No|x’), we label x’ to be

“No”.
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise,
the predicted prob. will be zero
n
P( X | C i )   P( x k | C i )
k 1

• Ex. Suppose a dataset with 1000 tuples, income=low (0), income= medium (990),
and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their “uncorrected” counterparts
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence, therefore loss of accuracy
• Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve Bayes Classifier
• How to deal with these dependencies? Bayesian Belief Networks
Model Evaluation and Selection

• Evaluation metrics: How can we measure accuracy? Other metrics to consider?

• Use validation test set of class-labeled tuples instead of training set when assessing
accuracy
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap

• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves

84
Evaluation of Classifiers
•Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set

Confusion Matrix –
• a table that compares a classification model's predicted values to the
actual values in a dataset, and is used to evaluate the model's
performance. It's also known as an error matrix.
Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:

Actual class\Predicted buy_computer buy_computer Total
class = yes = no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

• Given m classes, an entry, CMi,j in a confusion matrix indicates # of

tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals

86
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
A\P C ¬C  Class Imbalance Problem:
C TP FN P
 One class may be rare, e.g.
¬C FP TN N
fraud, or HIV-positive
P’ N’ All
 Significant majority of the

negative class and minority of

• Classifier Accuracy, or
recognition rate: percentage of the positive class
test set tuples that are correctly  Sensitivity: True Positive
classified recognition rate
Accuracy = (TP + TN)/All  Sensitivity = TP/P

• Error rate: 1 – accuracy, or  Specificity: True Negative

Error rate = (FP + FN)/All recognition rate

 Specificity = TN/N

87
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive are
actually positive

• Recall: completeness – what % of positive tuples did the classifier label as positive?
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and recall,

• Fß: weighted measure of precision and recall

• assigns ß times as much weight to recall as to precision

88
Classifier Evaluation Metrics: Example

Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%)

cancer = yes 90 210 300 30.00 (sensitivity
cancer = no 140 9560 9700 98.56 (specificity)
Total 230 9770 10000 96.40 (accuracy)

• Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%

89
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
• At i-th iteration, use Di as test set and others as training set
• Leave-one-out: k folds where k = # of tuples, for small sized data
• *Stratified cross-validation*: folds are stratified so that class dist. in each fold is
approx. the same as that in the initial data

90
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the
training set
• Several bootstrap methods, and a common one is .632 boostrap
• A data set with d tuples is sampled d times, with replacement, resulting in a training set of d
samples. The data tuples that did not make it into the training set end up forming the test set.
About 63.2% of the original data end up in the bootstrap, and the remaining 36.8% form the test
set (since (1 – 1/d)d ≈ e-1 = 0.368)
• Repeat the sampling procedure k times, overall accuracy of the model:

91
Model Selection: ROC Curves
• ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true
positive rate and the false positive rate
 Vertical axis
• The area under the ROC curve is a represents the true
measure of the accuracy of the model positive rate
• Rank the test tuples in decreasing  Horizontal axis rep.
order: the one that is most likely to the false positive rate
belong to the positive class appears at  The plot also shows a
the top of the list diagonal line
• The closer to the diagonal line (i.e., the  A model with perfect
closer the area is to 0.5), the less accuracy will have an
accurate is the model area of 1.0
92
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules
93
Thank You

Data Preprocessing & Classification Techniques
No ratings yet
Data Preprocessing & Classification Techniques
115 pages
Understanding Regression in Data Processing
No ratings yet
Understanding Regression in Data Processing
11 pages
Data Preprocessing Techniques Overview
No ratings yet
Data Preprocessing Techniques Overview
57 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
8 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
8 pages
Mod2 - Data Preprocessing
No ratings yet
Mod2 - Data Preprocessing
78 pages
Data Preprocessing Techniques Overview
No ratings yet
Data Preprocessing Techniques Overview
11 pages
Essential Data Preprocessing Techniques
No ratings yet
Essential Data Preprocessing Techniques
41 pages
Essential Data Pre-processing Techniques
No ratings yet
Essential Data Pre-processing Techniques
25 pages
Data Preprocessing in Data Mining
No ratings yet
Data Preprocessing in Data Mining
52 pages
Data Preparation Techniques for Analysis
No ratings yet
Data Preparation Techniques for Analysis
45 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
24 pages
Data Preprocessing Techniques in Mining
No ratings yet
Data Preprocessing Techniques in Mining
33 pages
Essential Data Preprocessing Techniques
No ratings yet
Essential Data Preprocessing Techniques
16 pages
Data Preprocessing Techniques Explained
100% (1)
Data Preprocessing Techniques Explained
39 pages
Unit II - Data Mining
No ratings yet
Unit II - Data Mining
7 pages
Essential Data Preprocessing Techniques
No ratings yet
Essential Data Preprocessing Techniques
40 pages
Data Preprocessing in Machine Learning
No ratings yet
Data Preprocessing in Machine Learning
40 pages
Data Preprocessing in Data Mining
No ratings yet
Data Preprocessing in Data Mining
55 pages
Unit 3 Data Preprocessing
No ratings yet
Unit 3 Data Preprocessing
28 pages
DM 3rd - Docs
No ratings yet
DM 3rd - Docs
14 pages
Data Preprocessing Techniques in Mining
No ratings yet
Data Preprocessing Techniques in Mining
18 pages
Data Preprocessing Techniques Overview
No ratings yet
Data Preprocessing Techniques Overview
17 pages
Essential Data Preprocessing Techniques
No ratings yet
Essential Data Preprocessing Techniques
3 pages
DMDW Notes
No ratings yet
DMDW Notes
61 pages
Understanding Data and Big Data Concepts
No ratings yet
Understanding Data and Big Data Concepts
51 pages
Data Reduction in Preprocessing Steps
No ratings yet
Data Reduction in Preprocessing Steps
25 pages
Data Normalization Techniques in Mining
No ratings yet
Data Normalization Techniques in Mining
25 pages
Essential Steps in Data Preprocessing
No ratings yet
Essential Steps in Data Preprocessing
34 pages
Essential Data Preprocessing Techniques
No ratings yet
Essential Data Preprocessing Techniques
52 pages
Data Normalization and Preprocessing Techniques
No ratings yet
Data Normalization and Preprocessing Techniques
180 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
12 pages
Data Preprocessing for Machine Learning
No ratings yet
Data Preprocessing for Machine Learning
65 pages
Essential Data Preprocessing Techniques
No ratings yet
Essential Data Preprocessing Techniques
55 pages
Data Science: Essential Preprocessing Techniques
No ratings yet
Data Science: Essential Preprocessing Techniques
35 pages
Data Preprocessing in Machine Learning
No ratings yet
Data Preprocessing in Machine Learning
27 pages
Part3 250724 204502
No ratings yet
Part3 250724 204502
20 pages
Data Preprocessing Techniques in Mining
No ratings yet
Data Preprocessing Techniques in Mining
17 pages
Data Pre-Processing Techniques Explained
No ratings yet
Data Pre-Processing Techniques Explained
8 pages
Data Pre-Processing in Data Mining
No ratings yet
Data Pre-Processing in Data Mining
88 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
31 pages
Data Preprocessing Techniques in Python
No ratings yet
Data Preprocessing Techniques in Python
14 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
12 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
77 pages
Data Preprocessing in Data Mining
No ratings yet
Data Preprocessing in Data Mining
35 pages
Big Data Analytics: Data Cleaning Techniques
No ratings yet
Big Data Analytics: Data Cleaning Techniques
22 pages
Data Pre-Processing in Data Mining
No ratings yet
Data Pre-Processing in Data Mining
37 pages
Essential Steps in Data Preprocessing
No ratings yet
Essential Steps in Data Preprocessing
31 pages
Essential Data Preprocessing Techniques
No ratings yet
Essential Data Preprocessing Techniques
52 pages
Data Pre-Processing Techniques Explained
No ratings yet
Data Pre-Processing Techniques Explained
28 pages
Data Preprocessing Techniques Explained
No ratings yet
Data Preprocessing Techniques Explained
43 pages
Data Analysis Planning in Research
No ratings yet
Data Analysis Planning in Research
85 pages
Data Preprocessing Notes in PDF
No ratings yet
Data Preprocessing Notes in PDF
50 pages
Mod 2
No ratings yet
Mod 2
68 pages
Data Preprocessing Techniques Overview
No ratings yet
Data Preprocessing Techniques Overview
22 pages
Data Preprocessing in Data Mining
No ratings yet
Data Preprocessing in Data Mining
25 pages
Data Preprocessing in Data Mining
No ratings yet
Data Preprocessing in Data Mining
53 pages
Data Warehouse Concepts Explained
No ratings yet
Data Warehouse Concepts Explained
21 pages
Online Resume Builder Project Overview
100% (2)
Online Resume Builder Project Overview
14 pages
Global Supply Chain Management Syllabus
No ratings yet
Global Supply Chain Management Syllabus
5 pages
FOC Control of DFIG with MRAS Observer
No ratings yet
FOC Control of DFIG with MRAS Observer
7 pages
Ship Service Time Forecasting Models
No ratings yet
Ship Service Time Forecasting Models
5 pages
Assertion and Reason Chemistry Questions
No ratings yet
Assertion and Reason Chemistry Questions
6 pages
MASW: Analyzing Surface Wave Dispersion
No ratings yet
MASW: Analyzing Surface Wave Dispersion
14 pages
Temperature and Pressure Measurement Basics
No ratings yet
Temperature and Pressure Measurement Basics
25 pages
Lesson Plan: Statistics & Probability 11
100% (2)
Lesson Plan: Statistics & Probability 11
3 pages
BEF MD Refresher
No ratings yet
BEF MD Refresher
6 pages
Finding Shippers 101
No ratings yet
Finding Shippers 101
14 pages
Understanding Sphere Properties and Formulas
No ratings yet
Understanding Sphere Properties and Formulas
20 pages
Joule Ridge Reactive Power Lab Insights
No ratings yet
Joule Ridge Reactive Power Lab Insights
15 pages
Green-Tao Theorem on Prime Progressions
No ratings yet
Green-Tao Theorem on Prime Progressions
60 pages
EViews 8 Command Ref
No ratings yet
EViews 8 Command Ref
723 pages
Slurry TBM Excavation Management in Singapore
No ratings yet
Slurry TBM Excavation Management in Singapore
18 pages
Counting Problems: James Tanton
No ratings yet
Counting Problems: James Tanton
28 pages
Python and SQL Programming Exercises
No ratings yet
Python and SQL Programming Exercises
24 pages
Gibbs Energy and Cell Reactions in Electrochemistry
No ratings yet
Gibbs Energy and Cell Reactions in Electrochemistry
7 pages
Chlorophyll Content and Spectra Analysis
No ratings yet
Chlorophyll Content and Spectra Analysis
19 pages
A Causal Test of The Strength of Weak Ties PDF
No ratings yet
A Causal Test of The Strength of Weak Ties PDF
8 pages
BS 6683 (1985) Installation and Use of Valves
No ratings yet
BS 6683 (1985) Installation and Use of Valves
18 pages
Grade 6 Math Lesson Plan: Division of Integers
No ratings yet
Grade 6 Math Lesson Plan: Division of Integers
9 pages
Jinja Macros and SQL in dbt Cloud
No ratings yet
Jinja Macros and SQL in dbt Cloud
20 pages
Class 11 Maths: Probability Solutions
No ratings yet
Class 11 Maths: Probability Solutions
10 pages
Tool Condition Monitoring Using Machine Tool Spindle Current
No ratings yet
Tool Condition Monitoring Using Machine Tool Spindle Current
13 pages
DG41CN SpecUpdate03
No ratings yet
DG41CN SpecUpdate03
6 pages
Multi - Choice Question Paper (MSR-WI-1)
100% (1)
Multi - Choice Question Paper (MSR-WI-1)
8 pages
Credit Analyst Interview Master Guide 30QA
No ratings yet
Credit Analyst Interview Master Guide 30QA
6 pages
Human-Computer Interaction Workbook
No ratings yet
Human-Computer Interaction Workbook
30 pages

Data Preprocessing & Classification Techniques

Uploaded by

Data Preprocessing & Classification Techniques

Uploaded by

MIT Art Design and Technology University

MIT School of Computing, Pune

23CSE3007 –Data Mining & Analytics

Class - T.Y. (SEM-V), AIA

Unit - II Data Preprocessing and Classification

Prof. Swati Kadam

Unit II – Data Preprocessing and Classification

• Min-Max Normalization maps a value v of A to v' in the range

v=9000, putting all values in the formula, we get

Where j is the smallest integer such that Max(|ν’|) < 1

• Z-scores: [-1.41421356 -0.70710678 0. 0.70710678

• Numerosity reduction is a technique used in data mining to reduce

Original Data Compressed

• It improves accuracy and reliability:

• Supervised learning (classification)

• Rule support and confidence are two measures of rule interestingness.

Transaction-id Items bought Min. support 50%

• A statistical classifier: performs probabilistic prediction, i.e., predicts class

• Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the

P( H | X)  P(X | H ) P( H )  P(X | H ) P( H ) / P(X)

needs to be maximized P(C | X)  P(X | C )P(C )

Outlook Play=Ye Play=No Temperatur Play=Yes Play=No

Given the fact P(Yes|x’) < P(No|x’), we label x’ to be

• Evaluation metrics: How can we measure accuracy? Other metrics to consider?

Example of Confusion Matrix:

• Given m classes, an entry, CMi,j in a confusion matrix indicates # of

negative class and minority of

• Error rate: 1 – accuracy, or  Specificity: True Negative

Error rate = (FP + FN)/All recognition rate

• Fß: weighted measure of precision and recall

Actual Class\Predicted class cancer = yes cancer = no Total Recognition(%)

• Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%

You might also like