
Data Mining: Concepts and Techniques

The document provides an overview of data mining, detailing its definition, processes, techniques, and applications. It covers the data mining process, including data cleaning, integration, transformation, and various mining techniques such as classification and clustering. Additionally, it discusses the importance of data preprocessing, evaluation metrics for regression models, and the UCI Machine Learning Repository as a resource for datasets.

Data Mining – Concepts, Techniques & Applications
UNIT 1
Introduction to Data Mining
• Definition: Process of discovering patterns,
correlations, and knowledge from large datasets.
• Core step of Knowledge Discovery in Databases
(KDD).
• Integrates statistics, machine learning, and
database systems.
Roots of Data Mining
• Statistics – data analysis, hypothesis testing.
• Machine Learning & AI – pattern recognition,
classification.
• Database Systems – efficient storage & retrieval.
• Information Retrieval – searching & indexing.
The Data Mining
Process
Steps in Knowledge Discovery in Databases (KDD):
1. Data Cleaning
2. Data Integration
3. Data Selection
4. Data Transformation
5. Data Mining
6. Pattern Evaluation
7. Knowledge Presentation
Large Datasets & Data
Warehousing
• Large Datasets: High volume, variety, velocity
(Big Data).
• Require scalable storage & parallel processing.
• Data Warehouse (DW): Central repository of
integrated data.
• Supports OLAP (Online Analytical Processing).
Stages of Data Mining
Process
• Problem Definition – set business/scientific goals.
• Data Preparation – preprocessing, integration.
• Model Building – choose algorithms &
techniques.
• Evaluation – accuracy, interpretability.
• Deployment – integrate into decision making.
Task Primitives in Data
Mining
• Task-relevant Data (attributes, tuples).
• Knowledge to be mined (association, classification).
• Background knowledge (hierarchies).
• Interestingness measures (support, confidence).
• Visualization (graphs, reports).
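The interestingness measures named above can be sketched in a few lines. This is a minimal illustration, not a full association-rule miner; the transactions and the rule {bread} → {milk} are hypothetical.

```python
# Toy market-basket data (hypothetical, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


# Rule under test: {bread} -> {milk}
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence)
```

Here bread and milk appear together in 2 of 4 transactions (support 0.5), and 2 of the 3 bread transactions also contain milk (confidence 2/3).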
Data Mining Techniques
• Classification – predict labels (Decision Trees, SVM).
• Clustering – group similar data (K-means, DBSCAN).
• Association Rule Mining – discover correlations (Apriori,
FP-Growth).
• Regression – predict continuous values.
• Anomaly Detection – rare event/outlier detection.
• Sequential Pattern Mining – discover sequence/time
trends.
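As one concrete instance of the techniques listed above, here is a minimal K-means sketch. It assumes one-dimensional points and fixed starting centroids (both simplifications chosen for illustration; practical implementations usually work on vectors and pick starting centroids randomly).

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical data with two obvious groups.
final_centroids, final_clusters = k_means_1d([1, 2, 3, 10, 11, 12], [1.0, 12.0])
print(final_centroids, final_clusters)
```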
Knowledge
Representation
• Rules (if-then statements).
• Decision Trees.
• Graphs & Networks.
• Visualization – charts, dashboards.
Data Mining Query
Languages
• SQL-like extensions for mining.
• Define task-relevant data & patterns to mine.
• Apply constraints & interestingness measures.
• Example: DMQL (Data Mining Query Language).
Business Aspects of
Data Mining
Applications: Market Basket Analysis, Customer
Segmentation, Fraud Detection, Risk Management.
Challenges: Privacy, scalability, interpretability.
Impact: Enables data-driven decision making.
Data Preprocessing in Data Mining
CLEANING • INTEGRATION • TRANSFORMATION • REDUCTION
Introduction
• Data preprocessing transforms raw data
into usable form.
• Real-world data is often incomplete, noisy,
and inconsistent.
• High-quality data improves mining
accuracy and efficiency.
Sources of Poor Data
Quality
• Missing values (e.g., NaN, blanks).
• Noisy data (measurement errors, outliers).
• Inconsistent data (date formats, typos).
• Redundant data (duplicates).
• Irrelevant attributes.
Steps in Data
Preprocessing
1. Data Cleaning – fix errors, missing values.
2. Data Integration – unify multiple sources.
3. Data Transformation – normalize,
aggregate.
4. Data Reduction – reduce dimensionality.
Data Cleaning
• Handle missing values – mean/median,
predictive models.
• Handle noise – binning, regression smoothing,
outlier removal.
• Handle inconsistencies – format unification,
correcting typos.
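Noise handling by binning can be sketched as follows. This is one common variant (smoothing by bin means over equal-size bins); the input values are hypothetical and the bin size of 3 is an arbitrary choice.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into consecutive bins of bin_size,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        bin_mean = sum(chunk) / len(chunk)
        smoothed.extend([bin_mean] * len(chunk))
    return smoothed


# Hypothetical noisy measurements.
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)
print(smoothed)
```

Each group of three sorted values collapses to its mean (9, 22, and 29 here), which damps small fluctuations at the cost of detail.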
Data Transformation
• Normalization – Min-Max scaling, Z-score.
  Example (Min-Max): [50, 80, 100] → [0, 0.6, 1]
• Aggregation – e.g., monthly → yearly sales.
• Discretization – continuous → categorical (e.g., Age groups).
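The two normalization methods can be sketched as below. The function names are illustrative, and the Z-score version assumes the population standard deviation (the sample standard deviation is an equally common choice).

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scaled = min_max_scale([50, 80, 100])
standardized = z_score([50, 80, 100])
print(scaled)
```

For 80, Min-Max gives (80 − 50) / (100 − 50) = 0.6, matching the slide example.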
Data Reduction
Feature Selection (Attribute Selection):
• Remove irrelevant/redundant attributes.
• Example: ID numbers don’t help prediction.
Dimensionality Reduction:
• Principal Component Analysis (PCA).
• Singular Value Decomposition (SVD).
Numerosity Reduction:
• Replace detailed data with models (histograms, clustering).
• Reduce size while preserving integrity.


Data Cleaning vs Data Transformation vs Data Reduction
Workflow Example

Steps:
1. Fill missing Age with mean.
2. Replace '?' in Income with median.
3. Normalize Income to [0,1].
4. Drop irrelevant columns.
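The four workflow steps can be sketched on a tiny table. The records, the use of None for a missing Age, and the choice of ID as the irrelevant column are all assumptions made for illustration.

```python
from statistics import mean, median

# Hypothetical raw records: None marks a missing Age, '?' a missing Income.
rows = [
    {"ID": 1, "Age": 25, "Income": "30000"},
    {"ID": 2, "Age": None, "Income": "50000"},
    {"ID": 3, "Age": 35, "Income": "?"},
    {"ID": 4, "Age": 45, "Income": "70000"},
]

# 1. Fill missing Age with the mean of the known ages.
age_mean = mean(r["Age"] for r in rows if r["Age"] is not None)
for r in rows:
    if r["Age"] is None:
        r["Age"] = age_mean

# 2. Replace '?' in Income with the median of the known incomes.
known_incomes = [float(r["Income"]) for r in rows if r["Income"] != "?"]
income_median = median(known_incomes)
for r in rows:
    r["Income"] = income_median if r["Income"] == "?" else float(r["Income"])

# 3. Normalize Income to [0, 1] with Min-Max scaling.
lo = min(r["Income"] for r in rows)
hi = max(r["Income"] for r in rows)
for r in rows:
    r["Income"] = (r["Income"] - lo) / (hi - lo)

# 4. Drop the irrelevant ID column.
cleaned = [{k: v for k, v in r.items() if k != "ID"} for r in rows]
print(cleaned)
```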
Regression & Model Building
EVALUATION WITH RMSE AND R²
Introduction to
Regression
• Regression predicts continuous outcomes.
• Examples: House prices, sales revenue, student scores.
Types of Regression
• Simple Linear Regression – one predictor.
• Multiple Linear Regression – multiple predictors.
• Polynomial Regression – non-linear
relationships.
• Other variants: Ridge, Lasso, Logistic
(classification).
Model Building Process
• Define problem – target variable (Y).
• Collect and preprocess data.
• Split into training & test sets.
• Fit model using training data.
• Evaluate using test data & metrics.
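The process above can be sketched end to end with simple linear regression. The data are hypothetical (roughly y = 2x + 1 with small noise), and the ordered hold-out split is a simplification; a real split would usually be shuffled.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: one predictor X, target Y.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]

# Split into training and test sets (first six rows train, last two test).
train_x, test_x = xs[:6], xs[6:]
train_y, test_y = ys[:6], ys[6:]

# Fit on training data only.
intercept, slope = fit_simple_linear(train_x, train_y)

# Evaluate on the held-out test data with RMSE.
predictions = [intercept + slope * x for x in test_x]
squared_errors = [(y - p) ** 2 for y, p in zip(test_y, predictions)]
test_rmse = (sum(squared_errors) / len(squared_errors)) ** 0.5
print(intercept, slope, test_rmse)
```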
Evaluation Metrics
• RMSE – Root Mean Squared Error.
• R² – Coefficient of Determination.
• Both give complementary insights.
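Root Mean Squared Error (RMSE)
Formula: RMSE = sqrt( (1/n) × Σ (y_i − ŷ_i)² )
where y_i is the actual value, ŷ_i the predicted value, and n the number of observations.
Measures the typical magnitude of prediction errors, in the same units as the target.
Lower RMSE = better model.
Sensitive to large errors (squares them).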
RMSE Example
Actual: [5, 7, 9], Predicted: [4.8, 7.5, 8.7]
Errors squared: [0.04, 0.25, 0.09]
Mean squared error = 0.38 / 3 ≈ 0.127
RMSE = sqrt(0.127) ≈ 0.356

Interpretation: Predictions off by ~0.36 units.
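The example can be checked with a short function. This is a minimal sketch; the function name is illustrative.

```python
from math import sqrt


def rmse(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


rmse_value = rmse([5, 7, 9], [4.8, 7.5, 8.7])
print(rmse_value)
```

The result is roughly 0.36, in line with the interpretation above.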


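Coefficient of Determination (R²)
Formula: R² = 1 − SSres / SStot
where SSres = Σ (y_i − ŷ_i)² and SStot = Σ (y_i − ȳ)², with ȳ the mean of the actual values.
Measures how well the model explains variance in the data.
Range: 1 = perfect,
0 = no improvement over predicting the mean,
<0 = worse than predicting the mean.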
What is Variance?
Variance measures how spread out data is from the
mean.
• High variance → Data is widely spread (e.g.,
house prices).
• Low variance → Data is clustered near the mean
(e.g., human heights).
Regression and Variance
Regression explains how much of the variation in the target
(Y) is captured by predictors (X).
• Total Variance (SStot): Overall spread of Y, measured as the sum of squared deviations from its mean.
• Residual Variance (SSres): Spread left unexplained by the model, measured as the sum of squared prediction errors.
• Explained Variance: Portion captured by the model (SStot − SSres).


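R² Example
Assuming the same values as the RMSE example (Actual: [5, 7, 9], Predicted: [4.8, 7.5, 8.7]), which reproduce the stated figure:
SSres = 0.04 + 0.25 + 0.09 = 0.38
SStot = (5 − 7)² + (7 − 7)² + (9 − 7)² = 8
R² = 1 − 0.38 / 8 = 0.9525
The model explains 95.25% of the variance in the data.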
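A minimal sketch of the R² computation, assuming the example reuses the actual and predicted values from the RMSE example (an assumption; those values do reproduce the 95.25% figure).

```python
def r_squared(actual, predicted):
    """R² = 1 - SSres / SStot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


r2 = r_squared([5, 7, 9], [4.8, 7.5, 8.7])
print(r2)
```

Predicting the mean for every observation gives R² = 0, which is the "no improvement" baseline.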


Explained vs Unexplained Variance
Explained Variance: 95.25%
Unexplained Variance: 4.75%
RMSE vs R²
• RMSE: measures error magnitude.
• R²: measures variance explained.
• Good model: Low RMSE & High R².
• Always evaluate on test data.
Discretization & Concept Hierarchies
DATA PREPROCESSING IN DATA MINING
Introduction
• Discretization: Converts continuous data
into categorical values.
• Concept Hierarchies: Organize attributes
into multiple levels of abstraction.
• Improves interpretability and supports
OLAP operations.
Why Discretization?
• Data mining algorithms (esp. decision trees,
association rules) often work better with
categorical/abstracted data.
• Makes patterns more interpretable for humans.
• Process of converting continuous attributes into
discrete/categorical attributes.
◦ Example:
◦ Age (continuous): 1, 7, 13, 25, 40, 70 →
◦ Age (discrete): {Child, Teen, Adult, Senior}.
Discretization Methods
Unsupervised: Equal-width, Equal-frequency binning.
◦ Equal-width: [1–25] [26–50] [51–75] [76–100]
◦ Equal-frequency: Bin1, Bin2, Bin3
Supervised: Class label-based, Decision tree splits.
Top-down splitting vs Bottom-up merging approaches.
• Top-down splitting (recursive partitioning): Start with one
interval → split recursively.
• Bottom-up merging: Start with many small intervals → merge
based on similarity/statistics.
Example of Discretization
Age values: [5, 7, 13, 25, 40, 45, 70]
Equal-width (4 bins over the range 1–100):
{1–25, 26–50, 51–75, 76–100}
Equal-frequency (3 bins):
Bin1={5,7}, Bin2={13,25,40}, Bin3={45,70}
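Both binning schemes can be sketched as below. The function names are illustrative, and the equal-width bins are modeled as half-open intervals starting at 1 (so [1, 26) stands for the integer range 1–25), which is an assumption about how the slide's ranges are meant.

```python
def equal_width_bins(values, lo, hi, k):
    """Assign each value to one of k equal-width, half-open bins over [lo, hi)."""
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        if not lo <= v < hi:
            raise ValueError(f"{v} is outside [{lo}, {hi})")
        bins[int((v - lo) / width)].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding near-equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]


ages = [5, 7, 13, 25, 40, 45, 70]
width_bins = equal_width_bins(ages, 1, 101, 4)  # 1–25, 26–50, 51–75, 76–100
freq_bins = equal_frequency_bins(ages, 3)
print(width_bins)
print(freq_bins)
```

With these ages the 76–100 bin stays empty, which illustrates a known weakness of equal-width binning on skewed data.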
Concept Hierarchies
Organizing attribute values into levels of granularity.
Forms of Concept Hierarchies
Schema hierarchy:
◦ Defined by database schema.
◦ Example: Location: Street → City → State → Country.
Set grouping hierarchy:
◦ User or domain expert defines groups.
◦ Example: Age groups:
◦ Young = {0–20},
◦ Middle-aged = {21–50},
◦ Senior = {51+}.
Automatic hierarchy generation:
◦ System detects hierarchies by clustering or data distribution.
Examples of Hierarchies
Location: Street → City → State → Country.
Time: Second → Minute → Hour → Day →
Month → Year.
Product: Item → Category → Department.
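Rolling data up a hierarchy such as City → State → Country can be sketched with plain lookups. The mappings and sales figures are hypothetical.

```python
from collections import Counter

# Hypothetical hierarchy levels: City -> State -> Country.
CITY_TO_STATE = {
    "Irvine": "California",
    "San Diego": "California",
    "Austin": "Texas",
}
STATE_TO_COUNTRY = {"California": "USA", "Texas": "USA"}

# Hypothetical sales recorded at the City level.
sales_by_city = [("Irvine", 120), ("San Diego", 80), ("Austin", 50)]


def roll_up(records, mapping):
    """Aggregate (key, amount) pairs one level up the hierarchy."""
    totals = Counter()
    for key, amount in records:
        totals[mapping[key]] += amount
    return dict(totals)


by_state = roll_up(sales_by_city, CITY_TO_STATE)
by_country = roll_up(by_state.items(), STATE_TO_COUNTRY)
print(by_state, by_country)
```

Each call moves the totals one level up, which is the same idea as an OLAP roll-up operation.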
Applications
• Discretization can improve accuracy in classification
and association rule mining.
• Concept hierarchies allow mining at multiple
levels of abstraction.
• Useful in Business Intelligence, OLAP, and
Knowledge Discovery.
UCI Repository of Datasets
Introduction
• UCI Machine Learning Repository maintained by
University of California, Irvine.
• Started in 1987, widely used for Machine
Learning & Data Mining research.
• Provides benchmark datasets for classification,
regression, clustering, etc.
Characteristics
• Wide variety of tasks: Classification, Regression,
Clustering, Time-series.
• Dataset sizes: small (hundreds) to large (millions).
• Well-documented with metadata and attribute
details.
Categories of Datasets
• Classification: Iris, Breast Cancer, Car Evaluation.
• Regression: Housing, Air Quality.
• Clustering: Wine dataset.
• Association: Retail datasets.
• Time-Series: EEG, stock market datasets.
Popular Datasets
• Iris: 150 samples, 4 attributes, 3 flower
species.
• Adult Census Income: Predict income >50k
based on census data.
• Wine: Chemical analysis of wines from
Italy.
• Car Evaluation: Predict car acceptability.
Applications
• Algorithm Testing and Benchmarking.
• Educational use in ML and Data Mining courses.
• Industry testing before applying to private data.
• Fair comparison of research results.
Using UCI Datasets
1. Select dataset relevant to problem.
2. Preprocess: cleaning, transformation,
normalization.
3. Apply mining techniques: classification,
regression, clustering.
4. Evaluate performance: Accuracy, RMSE, R², etc.
5. Compare with benchmark results.
Limitations
• Many datasets are small compared to modern
Big Data.
• Datasets are mostly clean; real-world data is
noisier.
• Still very useful for benchmarking and teaching.
