Data Mining
Content
Data mining Introduction
KDD
What is (not) Data Mining?
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
– Querying or searching
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, [Link])
– Finding trends and patterns
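The contrast between looking something up and mining a pattern can be sketched in a few lines of Python. This is a minimal illustration only: the records are invented toy data, and `lookup_phone` and `o_prefix_share` are hypothetical helper names, not part of any library.

```python
from collections import Counter

# Toy records, invented purely for illustration: (surname, city, phone).
directory = [
    ("O'Brien", "Boston", "555-0101"),
    ("O'Reilly", "Boston", "555-0102"),
    ("O'Rourke", "Boston", "555-0103"),
    ("Smith", "Boston", "555-0104"),
    ("Smith", "Denver", "555-0105"),
    ("Garcia", "Denver", "555-0106"),
    ("O'Brien", "Denver", "555-0107"),
]


def lookup_phone(surname, city):
    """Not data mining: retrieve a stored fact with an exact-match query."""
    return [phone for s, c, phone in directory if s == surname and c == city]


def o_prefix_share(records):
    """Closer to data mining: summarise a pattern across all records.

    Returns, per city, the fraction of surnames that start with "O'".
    """
    totals = Counter()
    with_prefix = Counter()
    for surname, city, _ in records:
        totals[city] += 1
        if surname.startswith("O'"):
            with_prefix[city] += 1
    return {city: with_prefix[city] / totals[city] for city in totals}


print(lookup_phone("Smith", "Denver"))
print(o_prefix_share(directory))
```

The first function answers a question whose answer is already stored; the second produces a summary that no single record contains.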
Data Mining: Classification Schemes
Decisions in data mining
– Kinds of databases to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted
Data mining tasks
– Descriptive data mining
– Predictive data mining
Decisions in data mining
Databases to be mined
Relational, transactional, object-oriented, spatial, time-
series, text, multi-media, heterogeneous, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend, deviation and outlier
analysis, etc.
Multiple/integrated functions and mining at multiple
levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, neural network, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, etc.
Data mining tasks/techniques
Predictive modeling
Use some variables to predict unknown or future values
of other variables
Descriptive modeling
Find human-interpretable patterns that describe the
data.
Data mining tasks/techniques
Predictive Modeling:
Classification: Assigning data instances to predefined
classes (e.g., decision trees, neural networks, support
vector machines).
Regression: Predicting continuous numerical values
(e.g., linear regression). Logistic regression, despite its
name, is normally used for classification.
Time Series Analysis: Analyzing data points collected at
specific time intervals (e.g., ARIMA, exponential
smoothing).
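Two of the predictive techniques named above are small enough to sketch directly. The block below is a minimal illustration, assuming one input variable for the regression and a fixed smoothing factor `alpha`. The function names and the toy data are invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    smoothed = [series[0]]  # the first smoothed value is the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Toy data, invented for illustration: the points lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)

# Predict an unseen value of y from a new x.
print(slope * 5 + intercept)

# Smooth a short series; the last smoothed value serves as a one-step forecast.
print(exponential_smoothing([10, 20, 30], alpha=0.5))
```

Both functions use known values to estimate unknown or future ones, which is what makes them predictive rather than descriptive.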
Descriptive Modeling:
Clustering: Grouping similar data points together (e.g.,
k-means, hierarchical clustering).
Association Rule Mining: Discovering relationships
between items (e.g., market basket analysis).
Outlier Detection: Identifying abnormal data points.
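As one concrete instance of descriptive modeling, the standard measures used in market basket analysis (support, confidence, and lift of a rule "antecedent implies consequent") can be computed as follows. This is a sketch on invented toy baskets, and `rule_metrics` is a hypothetical helper name.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Count transactions containing each side of the rule, and both together.
    count_a = sum(1 for t in transactions if antecedent <= t)
    count_c = sum(1 for t in transactions if consequent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = count_both / n
    confidence = count_both / count_a if count_a else 0.0
    lift = confidence / (count_c / n) if count_c else 0.0
    return support, confidence, lift


# Toy baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

support, confidence, lift = rule_metrics(baskets, {"bread"}, {"milk"})
print(support, confidence, lift)
```

A lift below 1 means the consequent appears less often in baskets containing the antecedent than it does overall, so a rule can have high confidence and still be uninteresting.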
CRISP-DM: Framework for Data Mining
CRISP-DM stands for Cross-Industry Standard Process for Data
Mining.
Widely adopted methodology
Provides a structured approach for planning & executing DM
projects.
Designed to be adaptable across various industries and
applications.
Key Characteristics of CRISP-DM
Iterative: The process is not strictly linear. You may need to
revisit previous phases as you progress.
Flexible: It can be adapted to various project sizes.
CRISP-DM: Data Mining Operations
1. Business Understanding:
   1. Determine business objectives and requirements.
   2. Assess situation and resources.
   3. Determine data mining goals.
2. Data Understanding:
   1. Collect initial data.
   2. Describe data.
   3. Explore data.
   4. Verify data quality.
3. Data Preparation:
   1. Select and Clean data.
   2. Construct data.
4. Data Modeling:
   1. Select modeling techniques.
   2. Generate test design.
   3. Build and Assess models.
5. Evaluation:
   1. Evaluate results.
   2. Review process.
   3. Determine next steps.
6. Deployment:
   1. Plan deployment.
   2. Plan monitoring and maintenance.
   3. Produce final report.
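The Data Understanding and Data Preparation phases are the most mechanical of the six, so they lend themselves to a short sketch. The block below assumes records held as Python dictionaries with `None` marking a missing value. The field names, the threshold, and the helper functions are all invented for illustration.

```python
import statistics

# Toy records, invented for illustration; None marks a missing value.
raw = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 45, "income": None},
    {"age": 29, "income": 39000},
    {"age": 51, "income": 75000},
]


def describe(records, field):
    """Describe data / verify data quality: summary statistics and missing count."""
    values = [r[field] for r in records if r[field] is not None]
    return {
        "count": len(values),
        "missing": len(records) - len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
    }


def select_complete(records, fields):
    """Select and clean data: keep only records with no missing value in `fields`."""
    return [r for r in records if all(r[f] is not None for f in fields)]


def add_high_income_flag(records, threshold):
    """Construct data: derive a new attribute from an existing one."""
    return [{**r, "high_income": r["income"] >= threshold} for r in records]


print(describe(raw, "age"))
clean = select_complete(raw, ["age", "income"])
prepared = add_high_income_flag(clean, threshold=50000)
print(prepared)
```

Dropping incomplete records is only one cleaning strategy; whether it is acceptable depends on how much data is lost, which is why the description step reports the missing count first.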
Components of Data Mining
Data Source: This is the origin of the data, which can be databases,
data warehouses, or other repositories.
Data Warehouse Server: This component retrieves relevant data
from the data source based on user requests.
Data Mining Engine: The heart of the data mining process, it
applies various algorithms and techniques to extract patterns from
the data.
Pattern Evaluation Module: Assesses the discovered patterns
based on predefined criteria to determine their significance and
usefulness.
Graphical User Interface (GUI): This provides a user-friendly
interface for interaction with the data mining system.
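The way these components hand data to one another can be sketched as a chain of functions, one per component. This is a simplified illustration under stated assumptions: the data source is an in-memory list, the "GUI" is plain text output, and every name and record below is invented for the example.

```python
from collections import Counter
from itertools import combinations

# Data source: toy records invented for illustration.
DATA_SOURCE = [
    {"region": "north", "items": ["tea", "sugar", "milk"]},
    {"region": "north", "items": ["tea", "milk"]},
    {"region": "south", "items": ["coffee", "sugar"]},
    {"region": "north", "items": ["tea", "sugar"]},
]


def warehouse_server(source, region):
    """Retrieve only the records relevant to the user's request."""
    return [r for r in source if r["region"] == region]


def mining_engine(records):
    """Extract candidate patterns: how often each pair of items co-occurs."""
    pair_counts = Counter()
    for r in records:
        for pair in combinations(sorted(set(r["items"])), 2):
            pair_counts[pair] += 1
    return pair_counts


def pattern_evaluation(pair_counts, n_records, min_support):
    """Keep only patterns whose support meets the predefined threshold."""
    return {
        pair: count / n_records
        for pair, count in pair_counts.items()
        if count / n_records >= min_support
    }


def user_interface(patterns):
    """Stand-in for the GUI: render the surviving patterns as text lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]


records = warehouse_server(DATA_SOURCE, "north")
patterns = pattern_evaluation(mining_engine(records), len(records), min_support=0.6)
for line in user_interface(patterns):
    print(line)
```

Keeping evaluation separate from the engine means the significance criterion (here a support threshold) can change without touching the pattern extraction step.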
Data Mining Architecture
Predictive Analytics
It is the use of data to predict future trends and events.
Attempts to answer the question, “What might happen next?”
It leverages historical data, statistical modeling, and machine
learning algorithms to identify patterns and make forecasts.
It works by identifying correlations between different
elements in selected datasets.
There are broadly two types of predictive analytics models:
– classification models
– regression models
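The idea of identifying correlations between elements of a dataset can be made concrete with the Pearson correlation coefficient, one common measure of linear association. The sketch below uses invented toy values, and the variable names `ad_spend` and `sales` are assumptions for illustration only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values, invented for illustration.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson(ad_spend, sales)
print(round(r, 4))
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one, so a variable that correlates strongly with the target is a natural candidate input for a classification or regression model.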
Predictive Analytics Challenges
Data Quality: Inaccurate, incomplete, or biased data can lead to
unreliable models.
Data Availability: Insufficient or limited data can hinder model
development.
Model Complexity: Complex models can be difficult to interpret and
explain.
Overfitting: Models that are too closely fitted to the training data
may not perform well on new data.
Ethical Considerations: Concerns about privacy, bias, and fairness
in model development and deployment.
Computational Resources: Handling large datasets and complex
models requires significant computational power.
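Overfitting is usually detected by comparing performance on the training data with performance on data held out from training. The sketch below exaggerates the effect with a model that simply memorises its training examples. The data, labels, and function names are invented for illustration.

```python
def accuracy(model, examples):
    """Fraction of (input, label) examples the model labels correctly."""
    correct = sum(1 for x, label in examples if model(x) == label)
    return correct / len(examples)


def make_memorizer(train, default):
    """An extreme overfitter: a lookup table of the training examples."""
    table = dict(train)
    return lambda x: table.get(x, default)


def threshold_rule(x):
    """A much simpler model: one cut-off value."""
    return "high" if x >= 50 else "low"


# Toy data, invented for illustration.
train = [(12, "low"), (35, "low"), (48, "low"), (55, "high"), (71, "high"), (90, "high")]
test = [(20, "low"), (40, "low"), (60, "high"), (80, "high")]

memorizer = make_memorizer(train, default="low")

print("memorizer:", accuracy(memorizer, train), accuracy(memorizer, test))
print("threshold:", accuracy(threshold_rule, train), accuracy(threshold_rule, test))
```

Both models are perfect on the training set, but only the simpler one keeps its accuracy on the held-out set. That gap between training and held-out performance is the signal to look for.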
Predictive Analytics Applications
Finance: Fraud detection, credit risk assessment, investment
portfolio optimization, market trend prediction.
Healthcare: Disease outbreak prediction, patient risk assessment,
drug discovery, personalized medicine.
Retail: Customer segmentation, demand forecasting, inventory
management, recommendation systems.
Marketing: Customer churn prediction, campaign optimization,
targeted advertising.
Manufacturing: Predictive maintenance, supply chain optimization,
quality control.
Insurance: Risk assessment, fraud detection, customer churn
prediction.