Data Warehousing and Mining
Roadmap
What is a Warehouse?
Warehouse Architecture
[Figure: warehouse architecture. From top to bottom: clients (query & analysis), the warehouse with its metadata, an integration layer, and multiple sources.]
Why a Warehouse?
[Figure: a question mark above two sources.]
Query-Driven Approach
[Figure: query-driven approach. Clients query a mediator, which reaches each source through its own wrapper.]
Advantages of Warehousing
Advantages of Query-Driven
OLTP vs. OLAP
OLTP (On-Line Transaction Processing): describes processing at operational sites
OLAP (On-Line Analytical Processing): describes processing at the warehouse
Data Marts
ROLAP vs. MOLAP
ROLAP: Relational On-Line Analytical Processing
MOLAP: Multi-Dimensional On-Line Analytical Processing
Implementing a Warehouse
Monitoring
Integrating
Processing
Managing
Design Issues
Tools required for:
Development: design & edit schemas, views, scripts, rules, queries, reports
Planning & Analysis: what-if scenarios (schema changes, refresh rates), capacity planning
Warehouse Management: performance monitoring, usage patterns, exception reporting
System & Network Management: measure traffic (sources, warehouse, clients)
Workflow Management: reliable scripts for cleaning & analyzing data
Data Mining
Data Mining is:
The efficient discovery of previously unknown, valid, potentially useful, understandable patterns in large datasets
The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner
Examples of Large Datasets
WALMART: 20M transactions per day
MOBIL: 100 TB geological databases
AT&T: 300M calls per day
NASA, EOS project: 50 GB per hour
Examples of Data Mining Applications
Fraud detection: credit cards, phone cards
Marketing: customer targeting
Data Warehousing: Walmart
Astronomy
Molecular biology
How Data Mining is used
1. Identify the problem
2. Use data mining techniques to transform the data into information
3. Act on the information
4. Measure the results
The Data Mining Process
1. Understand the domain
2. Create a dataset:
   Select the interesting attributes
   Data cleaning and preprocessing
3. Choose the data mining task and the specific algorithm
4. Interpret the results, and possibly return to 2
Data Mining Tasks
Classification
Regression
Clustering
Dependencies and associations
Summarization
Data Mining Methods
1. Decision Tree Classifiers: used for modeling, classification
2. Association Rules: used to find associations between sets of attributes
3. Sequential patterns: used to find temporal associations in time series
4. Hierarchical clustering: used to group customers, web users, etc.
Are All the Discovered Patterns Interesting?
Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.
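Support and confidence, named above as objective measures, can be computed directly from a transaction list. The following is a minimal sketch, assuming the usual definitions for an association rule A -> B: support is the fraction of transactions containing both A and B, and confidence is support(A and B) divided by support(A). The market-basket data and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


# Hypothetical market-basket transactions (illustration only).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

# Rule under test: bread -> milk
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

A rule is usually reported only if both values exceed user-chosen thresholds; the thresholds themselves are a design choice, not something this sketch fixes.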
Why Data Preprocessing?
Why can Data be Incomplete?
Why can Data be Noisy/Inconsistent?
Data Cleaning
Major Tasks in Data Preprocessing
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration: integration of multiple databases or files
Data transformation: normalization and aggregation
Data reduction: obtains a representation that is reduced in volume but produces the same or similar analytical results
Data discretization: part of data reduction but with particular importance, especially for numerical data
How to Handle Missing Data?
How to Handle Noisy Data? Smoothing techniques
Simple Discretization Methods: Binning
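Example: customer ages
[Figure: two histograms of customer ages; y-axis: number of values.]
Equi-width binning (every bin spans the same range):
0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80
Equi-depth binning (every bin holds roughly the same number of values):
0-22 22-31 32-38 38-44 44-48 48-55 55-62 62-80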
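The two binning styles in the customer-age example can be sketched as follows: equal-width bins split the value range into intervals of the same size, while equal-frequency bins (often called equi-depth) put roughly the same number of values in each bin. This is an illustrative sketch only; the ages and the function names are hypothetical.

```python
def equi_width_bins(values, k):
    """Split the range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values are identical.
        return [list(values)] + [[] for _ in range(k - 1)]
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would land in bin k, so clamp it to the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


def equi_depth_bins(values, k):
    """Sort the values and cut them into k bins of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


# Hypothetical customer ages (illustration only).
ages = [3, 18, 22, 25, 27, 31, 33, 35, 38, 41, 44, 52]

width_bins = equi_width_bins(ages, 4)
depth_bins = equi_depth_bins(ages, 4)
print([len(b) for b in width_bins])
print([len(b) for b in depth_bins])
```

With skewed data, equal-width bins can end up nearly empty or overloaded, which is why the equal-frequency boundaries in the example above are unevenly spaced.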
Cluster Analysis
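[Figure: scatter plot of salary against age; points group into clusters, and a point outside every cluster is labeled as an outlier.]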
Regression
Example of linear regression: y = x + 1
[Figure: regression line with x = age and y = salary; a sample point (X1, Y1) is marked.]
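A line such as y = x + 1 can be fitted to noisy observations with ordinary least squares, and the fitted values can then stand in for the noisy ones. The sketch below assumes simple one-variable least squares; the data points and the name `fit_line` are hypothetical and were chosen to lie near y = x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations scattered around y = x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y with the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
```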
Data Integration
Data Transformation
Normalization: Why normalization?
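One motivation for normalization is that attributes on very different scales (for example salary and age) would otherwise contribute unequally to distance-based methods. The slide does not name specific methods, so the two below, min-max scaling and z-score standardization, are assumptions chosen because they are common. The salary values are hypothetical.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical salaries (illustration only).
salaries = [30000, 50000, 70000, 90000]

scaled = min_max(salaries)
standardized = z_score(salaries)
print(scaled)
print(standardized)
```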
Data Reduction Strategies
Data Compression
[Figure: lossless compression recovers the original data exactly from the compressed data; otherwise only an approximation of the original data is recovered.]
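The difference between recovering the original data exactly and recovering an approximation can be shown in a few lines. This sketch uses Python's standard `zlib` module for the lossless case and plain rounding as a stand-in for a lossy scheme; the rounding step and the sensor-style readings are assumptions made for illustration.

```python
import zlib

# Hypothetical, highly repetitive readings (illustration only).
readings = [20.13, 20.17, 20.11, 25.42, 25.39] * 100
raw = ",".join(f"{r:.2f}" for r in readings).encode("ascii")

# Lossless: decompressing gives back the original bytes exactly.
packed = zlib.compress(raw)
restored = zlib.decompress(packed)
lossless_ok = restored == raw

# Lossy (stand-in): keep only whole numbers, so detail is lost for good.
approx = [round(r) for r in readings]
max_error = max(abs(a - r) for a, r in zip(approx, readings))

print(len(raw), len(packed), lossless_ok, max_error)
```

The lossy version is smaller to store but can only reproduce the data to within the reported error.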
Histograms
[Figure: histogram; x-axis ticks from 10000 to 90000 in steps of 20000, y-axis counts from 0 to 40.]
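A histogram reduces data by storing one count per bucket instead of every value. The sketch below assumes equal-width buckets of 20000 over the range 0 to 100000, which is one reading of the axis ticks in the figure; the salary values and the function name are hypothetical.

```python
def histogram(values, lo, hi, width):
    """Count integer values per equal-width bucket over [lo, hi)."""
    n_buckets = (hi - lo) // width
    counts = [0] * n_buckets
    for v in values:
        if lo <= v < hi:  # values outside the range are ignored
            counts[(v - lo) // width] += 1
    return counts


# Hypothetical salaries (illustration only).
salaries = [12000, 18000, 25000, 31000, 34000, 47000,
            52000, 55000, 58000, 71000, 88000]

counts = histogram(salaries, lo=0, hi=100000, width=20000)
print(counts)
```

Eleven values are replaced by five counts here; the saving grows with the size of the dataset, at the cost of losing the position of each value inside its bucket.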
Clustering
Sampling
[Figure: raw data and the corresponding cluster/stratified sample.]
The number of samples drawn from each cluster/stratum is proportional to its size. Thus, the sample represents the data better and outliers are avoided.
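Drawing from each stratum in proportion to its size can be sketched as below. The assumptions are labeled here and in the code: records are grouped by a caller-supplied key, every stratum contributes at least one record, and the age-group data is hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample about `fraction` of each stratum, where key(record) names the stratum."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for members in strata.values():
        # Assumption: keep at least one record from every stratum.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical records: (age group, customer id).
records = (
    [("young", i) for i in range(60)]
    + [("middle", i) for i in range(30)]
    + [("senior", i) for i in range(10)]
)

sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)
print(len(sample))
```

A simple random sample of the same size could miss the smallest stratum entirely; sampling per stratum guarantees it is represented.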
Example: Benefits for Healthcare Industry
Evidence-based medicine
Policymaking in public health
More value for money and cost saving
Early detection and/or prevention of disease
Prevention of hospital errors
Management of pandemic diseases
Non-invasive diagnosis and decision support
Adverse drug event
Example: Usage in Digital Media Industry
Ad Targeting
Yield Optimization
Ad Sales Analysis
Bid Price Optimization
Website Optimization
Attribution Analysis
Click Fraud Analysis
Network Usage Analysis