Data Mining – Concepts, Techniques & Applications
UNIT 1
Introduction to Data Mining
• Definition: Process of discovering patterns,
correlations, and knowledge from large datasets.
• Core step of Knowledge Discovery in Databases
(KDD).
• Integrates statistics, machine learning, and
database systems.
Roots of Data Mining
• Statistics – data analysis, hypothesis testing.
• Machine Learning & AI – pattern recognition,
classification.
• Database Systems – efficient storage & retrieval.
• Information Retrieval – searching & indexing.
The Data Mining Process
Steps in Knowledge Discovery in Databases (KDD):
1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Presentation
Large Datasets & Data Warehousing
• Large Datasets: High volume, variety, velocity
(Big Data).
• Require scalable storage & parallel processing.
• Data Warehouse (DW): Central repository of
integrated data.
• Supports OLAP (Online Analytical Processing).
Stages of Data Mining Process
• Problem Definition – set business/scientific goals.
• Data Preparation – preprocessing, integration.
• Model Building – choose algorithms &
techniques.
• Evaluation – accuracy, interpretability.
• Deployment – integrate into decision making.
Task Primitives in Data Mining
• Task-relevant Data (attributes, tuples).
• Knowledge to be mined (association, classification).
• Background knowledge (hierarchies).
• Interestingness measures (support, confidence).
• Visualization (graphs, reports).
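The interestingness measures named above can be made concrete with a small sketch. It assumes the usual definitions (support = fraction of transactions containing an itemset, confidence = support of the whole rule divided by support of its antecedent). The `transactions` list and the function names are hypothetical, invented purely for illustration.

```python
# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    return both / base if base else 0.0

# Rule: bread -> milk
rule_support = support({"bread", "milk"}, transactions)            # 2 of 4 transactions
rule_confidence = confidence({"bread"}, {"milk"}, transactions)    # 2 of the 3 bread transactions
print(rule_support, round(rule_confidence, 3))
```

A mining query would typically keep only rules whose support and confidence exceed user-chosen thresholds.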
Data Mining Techniques
• Classification – predict labels (Decision Trees, SVM).
• Clustering – group similar data (K-means, DBSCAN).
• Association Rule Mining – discover correlations (Apriori,
FP-Growth).
• Regression – predict continuous values.
• Anomaly Detection – rare event/outlier detection.
• Sequential Pattern Mining – discover sequence/time
trends.
Knowledge Representation
• Rules (if-then statements).
• Decision Trees.
• Graphs & Networks.
• Visualization – charts, dashboards.
Data Mining Query Languages
• SQL-like extensions for mining.
• Define task-relevant data & patterns to mine.
• Apply constraints & interestingness measures.
• Example: DMQL (Data Mining Query Language).
Business Aspects of Data Mining
Applications: Market Basket Analysis, Customer
Segmentation, Fraud Detection, Risk Management.
Challenges: Privacy, scalability, interpretability.
Impact: Enables data-driven decision making.
Data Preprocessing in Data Mining
CLEANING • INTEGRATION • TRANSFORMATION • REDUCTION
Introduction
• Data preprocessing transforms raw data
into usable form.
• Real-world data is often incomplete, noisy,
and inconsistent.
• High-quality data improves mining
accuracy and efficiency.
Sources of Poor Data Quality
• Missing values (e.g., NaN, blanks).
• Noisy data (measurement errors, outliers).
• Inconsistent data (date formats, typos).
• Redundant data (duplicates).
• Irrelevant attributes.
Steps in Data Preprocessing
1. Data Cleaning – fix errors, missing values.
2. Data Integration – unify multiple sources.
3. Data Transformation – normalize,
aggregate.
4. Data Reduction – reduce dimensionality.
Data Cleaning
• Handle missing values – mean/median,
predictive models.
• Handle noise – binning, regression smoothing,
outlier removal.
• Handle inconsistencies – format unification,
correcting typos.
Data Transformation
Normalization – Min-Max scaling, Z-score.
Example: [50, 80, 100] → [0, 0.6, 1]
Aggregation – e.g., monthly → yearly sales.
Discretization – continuous → categorical (e.g., Age
groups).
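The two normalization methods can be sketched in a few lines. The sketch reproduces the Min-Max example above; the function names are hypothetical, and the Z-score version assumes the population standard deviation.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre values on the mean and divide by the population std deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

scaled = min_max([50, 80, 100])       # (80 - 50) / (100 - 50) = 0.6
standardized = z_score([50, 80, 100])
print(scaled)
```

Min-Max scaling keeps the values in a fixed range but is sensitive to outliers; Z-scores have mean 0 and unit standard deviation but no fixed range.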
Data Reduction
Feature Selection (Attribute Selection):
Remove irrelevant/redundant attributes.
Example: ID numbers don’t help prediction.
Dimensionality Reduction:
Principal Component Analysis (PCA).
Singular Value Decomposition (SVD).
Numerosity Reduction:
Replace detailed data with models (histograms, clustering).
Reduce size while preserving integrity.
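As one illustration of numerosity reduction, the sketch below replaces a list of raw values with an equal-width histogram (bin ranges plus counts). The function name, the sample values, and the choice of four bins are assumptions made for the example.

```python
from collections import Counter

def equal_width_histogram(values, n_bins):
    """Summarise `values` as n_bins equal-width buckets: ((start, end), count)."""
    lo, hi = min(values), max(values)
    if hi == lo:                                  # all values identical
        return [((lo, hi), len(values))]
    width = (hi - lo) / n_bins
    counts = Counter()
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [((lo + i * width, lo + (i + 1) * width), counts[i])
            for i in range(n_bins)]

raw_values = [1, 3, 4, 8, 9, 10, 15, 18, 20, 21]   # hypothetical data
histogram = equal_width_histogram(raw_values, n_bins=4)
bin_counts = [count for _, count in histogram]
print(histogram)
```

Ten raw values are reduced to four (range, count) pairs; the individual values are lost, but the overall distribution is preserved.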
Data Cleaning vs Data Transformation vs Data Reduction
Workflow Example
Steps:
1. Fill missing Age with mean.
2. Replace '?' in Income with median.
3. Normalize Income to [0,1].
4. Drop irrelevant columns.
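The four workflow steps can be sketched on a tiny table. The records, the `ID` column (treated here as the irrelevant one), and the encoding of a missing Age as `None` are all hypothetical choices for the illustration.

```python
from statistics import mean, median

# Hypothetical records: None marks a missing Age, '?' a missing Income.
records = [
    {"ID": 1, "Age": 25,   "Income": "30000"},
    {"ID": 2, "Age": None, "Income": "?"},
    {"ID": 3, "Age": 35,   "Income": "50000"},
    {"ID": 4, "Age": 45,   "Income": "70000"},
]

# 1. Fill missing Age with the mean of the observed ages.
age_mean = mean(r["Age"] for r in records if r["Age"] is not None)

# 2. Replace '?' in Income with the median of the observed incomes.
income_median = median(float(r["Income"]) for r in records if r["Income"] != "?")

cleaned = []
for r in records:
    age = age_mean if r["Age"] is None else r["Age"]
    income = income_median if r["Income"] == "?" else float(r["Income"])
    cleaned.append({"ID": r["ID"], "Age": age, "Income": income})

# 3. Normalize Income to [0, 1] with Min-Max scaling.
lo = min(row["Income"] for row in cleaned)
hi = max(row["Income"] for row in cleaned)
for row in cleaned:
    row["Income"] = (row["Income"] - lo) / (hi - lo)

# 4. Drop irrelevant columns (here, the ID column).
prepared = [{k: v for k, v in row.items() if k != "ID"} for row in cleaned]
print(prepared)
```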
Regression & Model Building
EVALUATION WITH RMSE AND R²
Introduction to Regression
• Regression predicts continuous outcomes.
• Examples: House prices, sales revenue,
student scores
Types of Regression
• Simple Linear Regression – one predictor.
• Multiple Linear Regression – multiple predictors.
• Polynomial Regression – non-linear
relationships.
• Other variants: Ridge, Lasso, Logistic
(classification).
Model Building Process
• Define problem – target variable (Y).
• Collect and preprocess data.
• Split into training & test sets.
• Fit model using training data.
• Evaluate using test data & metrics.
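The process above can be sketched end to end with simple linear regression fitted by ordinary least squares. The data are hypothetical (generated to lie near y = 2x + 1), and the split simply holds out the last two points; a real project would normally shuffle before splitting.

```python
from math import sqrt

# Hypothetical data lying close to y = 2x + 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]

# Split into training and test sets (first 6 points train, last 2 test).
split = 6
x_train, y_train = xs[:split], ys[:split]
x_test, y_test = xs[split:], ys[split:]

def fit_simple_linear(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Fit on the training data only.
slope, intercept = fit_simple_linear(x_train, y_train)

# Evaluate on the held-out test data.
predictions = [slope * x + intercept for x in x_test]
test_rmse = sqrt(sum((a - p) ** 2 for a, p in zip(y_test, predictions)) / len(y_test))
print(round(slope, 2), round(intercept, 2), round(test_rmse, 3))
```

Keeping the test points out of the fitting step is what makes the reported error an estimate of performance on unseen data.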
Evaluation Metrics
• RMSE – Root Mean Squared Error.
• R² – Coefficient of Determination.
• Both give complementary insights.
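Root Mean Squared Error (RMSE)
Formula: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
where $y_i$ is the actual value, $\hat{y}_i$ the predicted value, and $n$ the number of observations.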
Measures average magnitude of errors.
Lower RMSE = better model.
Sensitive to large errors (squares them).
RMSE Example
Actual: [5, 7, 9], Predicted: [4.8, 7.5, 8.7]
Errors squared: [0.04, 0.25, 0.09]
RMSE = sqrt(0.38 / 3) = sqrt(0.1267) ≈ 0.356
Interpretation: Predictions off by ~0.36 units.
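The calculation can be checked with a short sketch. The function name `rmse` is a hypothetical helper; the inputs are the Actual and Predicted values from the example. Note that the unrounded mean squared error is 0.38 / 3 ≈ 0.1267.

```python
from math import sqrt

def rmse(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

rmse_value = rmse([5, 7, 9], [4.8, 7.5, 8.7])
print(round(rmse_value, 3))   # 0.356
```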
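Coefficient of Determination (R²)
Formula: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$, with $SS_{res} = \sum_{i}(y_i - \hat{y}_i)^2$ and $SS_{tot} = \sum_{i}(y_i - \bar{y})^2$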
Measures how well the model explains variance in
the data.
Range: 1 = perfect fit,
0 = no better than always predicting the mean,
<0 = worse than predicting the mean.
What is Variance?
Variance measures how spread out data is from the
mean.
• High variance → Data is widely spread (e.g.,
house prices).
• Low variance → Data is clustered near the mean
(e.g., human heights).
Regression and Variance
Regression explains how much of the variation in the target
(Y) is captured by predictors (X).
• Total Variation (SStot): Total sum of squares, the overall spread of Y around its mean.
• Residual Variation (SSres): Residual sum of squares, the spread left unexplained by the
model.
• Explained Variation: Portion captured by the model (SStot − SSres).
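R² Example
Using the RMSE example data: Actual: [5, 7, 9], Predicted: [4.8, 7.5, 8.7], mean of Actual = 7.
SSres = 0.04 + 0.25 + 0.09 = 0.38; SStot = 4 + 0 + 4 = 8.
R² = 1 − 0.38 / 8 = 0.9525
The model explains 95.25% of the variance in the data.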
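The 95.25% figure is consistent with the data from the RMSE example (Actual: [5, 7, 9], Predicted: [4.8, 7.5, 8.7]), as the sketch below shows. The function name `r_squared` is a hypothetical helper, and the sketch assumes R² is computed as 1 minus the residual sum of squares over the total sum of squares.

```python
def r_squared(actual, predicted):
    """1 - SSres / SStot, where SStot is measured around the mean of `actual`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot

r2 = r_squared([5, 7, 9], [4.8, 7.5, 8.7])   # SSres = 0.38, SStot = 8
print(round(r2, 4))   # 0.9525
```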
Explained vs Unexplained Variance
Explained Variance: 95.25%
Unexplained Variance: 4.75%
RMSE vs R²
• RMSE: measures error magnitude.
• R²: measures variance explained.
• Good model: Low RMSE & High R².
• Always evaluate on test data.
Discretization & Concept Hierarchies
DATA PREPROCESSING IN DATA MINING
Introduction
• Discretization: Converts continuous data
into categorical values.
• Concept Hierarchies: Organize attributes
into multiple levels of abstraction.
• Improves interpretability and supports
OLAP operations.
Why Discretization?
• Data mining algorithms (esp. decision trees,
association rules) often work better with
categorical/abstracted data.
• Makes patterns more interpretable for humans.
• Process of converting continuous attributes into
discrete/categorical attributes.
◦ Example:
◦ Age (continuous): 1, 7, 13, 25, 40, 70 →
◦ Age (discrete): {Child, Teen, Adult, Senior}.
Discretization Methods
Unsupervised: Equal-width, Equal-frequency binning.
◦ Equal-width: [1–25] [26–50] [51–75] [76–100]
◦ Equal-frequency: Bin1, Bin2, Bin3 (roughly equal number of values per bin)
Supervised: Class label-based, Decision tree splits.
Top-down splitting vs Bottom-up merging approaches.
• Top-down splitting (recursive partitioning): Start with one
interval → split recursively.
• Bottom-up merging: Start with many small intervals → merge
based on similarity/statistics
Example of Discretization
Age values: [5, 7, 13, 25, 40, 45, 70]
Equal-width (4 bins over the range 1–100):
{1–25, 26–50, 51–75, 76–100}
Equal-frequency (3 bins):
Bin1={5,7}, Bin2={13,25,40}, Bin3={45,70}
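Both binning schemes from this example can be sketched as below. The function names are hypothetical. The equal-width version assumes integer values inside a known range (1 to 100 here), and the equal-frequency version places the cut points at rounded multiples of n / k, a rule chosen because it reproduces the 2-3-2 split shown above; other tie-breaking rules are equally valid.

```python
def equal_width_bin(value, low, high, n_bins):
    """Return the (start, end) integer interval that `value` falls into.
    Assumes low <= value <= high and that the range divides evenly."""
    width = (high - low + 1) // n_bins
    idx = min((value - low) // width, n_bins - 1)
    start = low + idx * width
    return (start, start + width - 1)

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of roughly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [round(i * n / n_bins) for i in range(n_bins + 1)]
    return [ordered[cuts[i]:cuts[i + 1]] for i in range(n_bins)]

ages = [5, 7, 13, 25, 40, 45, 70]
width_bins = [equal_width_bin(a, low=1, high=100, n_bins=4) for a in ages]
freq_bins = equal_frequency_bins(ages, n_bins=3)
print(width_bins)
print(freq_bins)
```

Equal-width bins are simple but can leave some bins empty (76–100 holds no ages here), while equal-frequency bins adapt to where the data actually lie.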
Concept Hierarchies
Organizing attribute values into levels of granularity.
Forms of Concept Hierarchies
Schema hierarchy:
◦ Defined by database schema.
◦ Example: Location: Street → City → State → Country.
Set grouping hierarchy:
◦ User or domain expert defines groups.
◦ Example: Age groups:
◦ Young = {0–20},
◦ Middle-aged = {21–50},
◦ Senior = {51+}.
Automatic hierarchy generation:
◦ System detects hierarchies by clustering or data distribution.
Examples of Hierarchies
Location: Street → City → State → Country.
Time: Second → Minute → Hour → Day →
Month → Year.
Product: Item → Category → Department.
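A concept hierarchy can be represented as a chain of lookup tables, which makes rolling data up to a coarser level a simple aggregation. The sketch below follows the Location hierarchy (City → State → Country); the sales figures and the `roll_up` helper are hypothetical.

```python
from collections import Counter

# Two levels of the Location hierarchy as lookup tables.
city_to_state = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Chennai": "Tamil Nadu"}
state_to_country = {"Maharashtra": "India", "Tamil Nadu": "India"}

# Hypothetical sales recorded at the city level.
city_sales = [("Mumbai", 120), ("Pune", 80), ("Chennai", 100)]

def roll_up(rows, mapping):
    """Aggregate (key, amount) pairs one level up the hierarchy."""
    totals = Counter()
    for key, amount in rows:
        totals[mapping[key]] += amount
    return dict(totals)

state_sales = roll_up(city_sales, city_to_state)
country_sales = roll_up(state_sales.items(), state_to_country)
print(state_sales)     # {'Maharashtra': 200, 'Tamil Nadu': 100}
print(country_sales)   # {'India': 300}
```

This is the same operation OLAP tools call roll-up; drilling down goes the other way and requires the finer-grained data to still be available.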
Applications
• Discretization can improve accuracy in classification
and association rule mining.
• Concept hierarchies allow mining at multiple
levels of abstraction.
• Useful in Business Intelligence, OLAP, and
Knowledge Discovery.
UCI Repository of Datasets
Introduction
• UCI Machine Learning Repository maintained by
University of California, Irvine.
• Started in 1987, widely used for Machine
Learning & Data Mining research.
• Provides benchmark datasets for classification,
regression, clustering, etc.
Characteristics
• Wide variety of tasks: Classification, Regression,
Clustering, Time-series.
• Dataset sizes: small (hundreds) to large (millions).
• Well-documented with metadata and attribute
details.
Categories of Datasets
• Classification: Iris, Breast Cancer, Car Evaluation.
• Regression: Housing, Air Quality.
• Clustering: Wine dataset.
• Association: Retail datasets.
• Time-Series: EEG, stock market datasets.
Popular Datasets
• Iris: 150 samples, 4 attributes, 3 flower
species.
• Adult Census Income: Predict income >50k
based on census data.
• Wine: Chemical analysis of wines from
Italy.
• Car Evaluation: Predict car acceptability.
Applications
• Algorithm Testing and Benchmarking.
• Educational use in ML and Data Mining courses.
• Industry testing before applying to private data.
• Fair comparison of research results.
Using UCI Datasets
1. Select dataset relevant to problem.
2. Preprocess: cleaning, transformation,
normalization.
3. Apply mining techniques: classification,
regression, clustering.
4. Evaluate performance: Accuracy, RMSE, R², etc.
5. Compare with benchmark results.
Limitations
• Many datasets are small compared to modern
Big Data.
• Datasets are mostly clean; real-world data is
noisier.
• Still very useful for benchmarking and teaching.