
Data Mining Fundamentals and Applications

The document provides an overview of data mining, including its definition, importance, and various applications across sectors like e-commerce, healthcare, and banking. It outlines the Knowledge Discovery in Databases (KDD) process, which includes data cleaning, transformation, mining, and evaluation, as well as the types of data and patterns involved in mining. Major issues such as data quality, integrity, privacy, and scalability are also discussed, emphasizing the need for effective techniques to handle complex datasets.

Uploaded by

roboreacts17
Copyright
© All Rights Reserved

Chapter 1: Introduction

• Fundamentals of Data Mining
• Kinds of Pattern, and Technologies Used
• Applications of Data Mining
• Major Issues
• Data Objects and Attribute Types
• Basic Statistical Descriptions of Data
• Data Visualization

Fundamentals of Data Mining
What is Data Mining?
Definition: Data mining is the process of extracting hidden patterns and useful
knowledge from large volumes of raw data.
Analogy: Like gold mining, where you dig through tons of soil to extract a few grams of gold,
data mining sifts a huge amount of raw data to extract a small amount of valuable information.
It helps organizations make decisions based on hidden trends and insights that are not
visible in the raw data.
Example: Market Basket Analysis (in Supermarkets)
When customers buy groceries, data mining can find patterns like: “People who buy
bread often buy butter too.”
This helps stores place these products together or create combo offers, increasing sales.
Why Data Mining:
Data mining is important because it helps us make better decisions by finding useful
patterns and insights hidden in large amounts of data.

To Discover Hidden Patterns:

• It finds unknown relationships or trends in data that are not easily visible.
• Example: Identifying that customers who buy coffee often buy snacks too.

To Make Better Decisions:

• Businesses use mined data to plan marketing, increase sales, or reduce costs.
• Example: Amazon recommends products you are likely to buy.

To Predict Future Trends:

• Helps in forecasting based on past data.
• Example: Banks predicting which customers might default on loans.

To Improve Customer Satisfaction:

• By understanding customer needs and behavior.
• Example: Netflix suggesting shows based on your past viewing history.

To Detect Fraud or Risk:

• Used in finance and insurance to find unusual or risky activities.
• Example: Detecting fraudulent credit card transactions.

To Support Research and Innovation:

• Scientists and researchers use data mining to analyze patterns in medicine, climate, etc.
• Example: Identifying disease patterns from patient data.

KDD Process (Knowledge Discovery in Databases):

1. Databases / Flat Files (Data Sources)

• These are the original sources of data, for example business transaction
databases, Excel sheets, CSV files, or online data repositories.
• The data here is raw, unorganized, and may contain errors.

Example: Customer purchase records, sensor readings, patient data, etc.

2. Cleaning and Integration

• This step removes noise, handles missing values, and resolves inconsistencies in the data.
• If data comes from multiple sources, it is merged (integrated) into one consistent
format.
• The goal is to get clean, consistent, and accurate data.

Example:
• Removing duplicate customer entries
• Filling missing ages with average values
• Converting all prices to the same currency
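
The three cleaning examples above can be sketched in a few lines of Python. This is a minimal illustration only: the records, the field names, and the exchange rate are all made up for the example, and real data would need more careful duplicate matching.

```python
from statistics import mean

# Hypothetical raw records: a duplicate entry, a missing age, mixed currencies.
raw_records = [
    {"customer_id": 1, "age": 30, "price": 10.0, "currency": "USD"},
    {"customer_id": 1, "age": 30, "price": 10.0, "currency": "USD"},  # duplicate
    {"customer_id": 2, "age": None, "price": 800.0, "currency": "INR"},
    {"customer_id": 3, "age": 40, "price": 20.0, "currency": "USD"},
]

# Assumed, illustrative conversion rates into USD (not real market rates).
TO_USD = {"USD": 1.0, "INR": 0.0125}


def clean(records):
    # 1. Remove duplicate customer entries, keeping the first occurrence.
    seen, unique = set(), []
    for rec in records:
        if rec["customer_id"] not in seen:
            seen.add(rec["customer_id"])
            unique.append(dict(rec))

    # 2. Fill missing ages with the average of the known ages.
    average_age = mean(r["age"] for r in unique if r["age"] is not None)
    for rec in unique:
        if rec["age"] is None:
            rec["age"] = average_age

    # 3. Convert every price to the same currency (USD here).
    for rec in unique:
        rec["price"] = rec["price"] * TO_USD[rec["currency"]]
        rec["currency"] = "USD"
    return unique


cleaned = clean(raw_records)
for rec in cleaned:
    print(rec)
```

Filling with the mean is only one possible choice; the median or a model-based estimate may suit skewed data better.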

3. Selection and Transformation

• From the cleaned data, only relevant data for analysis is selected.
• The selected data is then transformed into a suitable format for mining.
• Transformation includes normalization, aggregation, and converting data types.

Example:
• Selecting customer ID, age, and spending pattern
• Converting income into ranges: Low / Medium / High
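
As a rough sketch of selection and transformation, the snippet below keeps only the relevant attributes, converts income into Low / Medium / High ranges, and applies min-max normalization to age. The records and the income cut-offs (30000 and 80000) are assumptions chosen for illustration.

```python
# Hypothetical cleaned records; names and values are invented for the example.
customers = [
    {"customer_id": 1, "name": "A", "age": 25, "income": 20000},
    {"customer_id": 2, "name": "B", "age": 40, "income": 55000},
    {"customer_id": 3, "name": "C", "age": 55, "income": 120000},
]


def income_range(income, low_cutoff=30000, high_cutoff=80000):
    """Map a numeric income to a range label (cut-offs are assumptions)."""
    if income < low_cutoff:
        return "Low"
    if income < high_cutoff:
        return "Medium"
    return "High"


def min_max(values):
    """Rescale values linearly to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Selection: keep only the attributes relevant to the analysis.
selected = [
    {"customer_id": c["customer_id"], "age": c["age"], "income": c["income"]}
    for c in customers
]

# Transformation: normalize age and convert income into ranges.
ages_scaled = min_max([c["age"] for c in selected])
for record, scaled in zip(selected, ages_scaled):
    record["age_scaled"] = scaled
    record["income_range"] = income_range(record["income"])

print(selected)
```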

4. Data Warehouse

• All the selected and transformed data is stored in a data warehouse.
• The data warehouse acts as a central storage for analysis.
• It allows efficient querying and retrieval of data for mining.

Example: Company stores cleaned and formatted customer data in a warehouse before
running mining algorithms.
5. Data Mining

• This is the core step where algorithms are applied to extract patterns, trends, and
relationships.
• Techniques include classification, clustering, association, and prediction.
• This produces patterns or models that explain the data.

Example: Finding that “customers who buy milk and bread often buy butter.”

6. Evaluation and Presentation

• The discovered patterns are evaluated to identify which ones are useful and valid.
• Then, the results are presented visually using graphs, charts, or reports for better
understanding.

Example: Visualizing frequent item sets, decision trees, or customer clusters.

7. Knowledge

• The final output of the process is actionable knowledge: useful insights that can
help in decision-making.
• This knowledge can be applied in business strategy, research, healthcare,
marketing, etc.

Example: “Target young professionals for online sales promotions” (derived from mined
patterns).

On What Kind of Data:

Database Oriented Datasets

1) Relational Database:

• Data is stored in a structured format within a Database Management System (DBMS),
usually a relational database.
• Organized into tables (relations) with rows (tuples) and columns (attributes).
• Each row represents an object, identified by a unique key.
• Relationships between entities can also be represented as tables (e.g., purchases, works
at).
• Data can be queried to retrieve information.
• Queries can use aggregate operations like sum, average, count, max, and min.
• Data mining can find trends, patterns, deviations, and predictions.
Example: AllElectronics Database
• Tables: customer, item, employee, branch
• Relationships: purchases (customer buys items), items sold (items in a transaction),
works at (employee works at a branch)
• Analysis: Predicting customer credit risk based on age, income, and past credit data.

2) Data Warehouse:

• Central repository of data collected from multiple sources, organized for analysis.
• Data from multiple databases integrated, cleaned, and transformed.
• Stores historical and summarized data.
• Structured as multidimensional data cubes.
• Supports OLAP operations like drill-down (more detail) and roll-up (more
summary).
• Analyze data at different levels (city, country, quarter, month, etc.).
• Fast access to summarized information.
• Multidimensional data mining to discover patterns.
Example: AllElectronics Data Cube
• Dimensions: City (Chicago, New York, Toronto, Vancouver), Time (Q1–Q4), Item
type (computer, phone, home entertainment, security)
• Measure: Total sales amount
• OLAP operations: Drill-down to view monthly sales; Roll-up to view country-level
sales
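
A roll-up from city level to country level can be sketched as a simple re-aggregation along the location dimension. The sales figures below are invented for illustration, and the tuple layout and helper name `total_by` are assumptions, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, quarter, item_type, sales_amount).
sales = [
    ("Chicago", "Q1", "computer", 100),
    ("Chicago", "Q2", "phone", 50),
    ("New York", "Q1", "computer", 200),
    ("Toronto", "Q1", "phone", 80),
    ("Vancouver", "Q2", "security", 40),
]

# Concept hierarchy for the location dimension: city -> country.
CITY_TO_COUNTRY = {
    "Chicago": "USA",
    "New York": "USA",
    "Toronto": "Canada",
    "Vancouver": "Canada",
}


def total_by(facts, key):
    """Sum the sales measure, grouped by whatever `key` extracts from a fact."""
    totals = defaultdict(int)
    for fact in facts:
        totals[key(fact)] += fact[3]
    return dict(totals)


# City-level view (more detail).
sales_by_city = total_by(sales, key=lambda f: f[0])
# Roll-up: climb the hierarchy from city to country (more summary).
sales_by_country = total_by(sales, key=lambda f: CITY_TO_COUNTRY[f[0]])

print(sales_by_city)     # {'Chicago': 150, 'New York': 200, 'Toronto': 80, 'Vancouver': 40}
print(sales_by_country)  # {'USA': 350, 'Canada': 120}
```

Drill-down is the reverse direction: grouping by a finer key, such as month instead of quarter, provided the facts are stored at that finer level.
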
3) Transactional Database:
• Data capturing events or transactions, often in a database or flat file.
• Each transaction has a unique ID.
• Transaction lists the items or actions involved.
• Can include additional tables for details (items, salesperson, branch, etc.).
• Market basket analysis (find which items are frequently bought together).
• Identify patterns and associations between items.

Example: AllElectronics Transactional Database

• Transaction Table:

Trans ID    List of Item IDs
T100        I1, I3, I8, I16
T200        I2, I8

• Analysis: Identify items often sold together (e.g., printers with computers) to create
marketing strategies.
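
To make "items often sold together" concrete, the sketch below counts how often each pair of items appears in the same transaction. T100 and T200 come from the table above; T300 and T400 are hypothetical additions so that some pairs repeat, and the threshold `min_count` is an assumption.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
    "T300": ["I1", "I8"],        # hypothetical extra transaction
    "T400": ["I1", "I3", "I8"],  # hypothetical extra transaction
}

# Count every unordered pair of distinct items within each transaction.
pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

# Keep the pairs that co-occur in at least `min_count` transactions.
min_count = 2
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_count}

print(frequent_pairs)
```

Counting all pairs directly is fine for a handful of transactions; large datasets typically rely on pruning strategies such as the Apriori algorithm.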

4) Advanced Datasets:

Advanced Dataset Type | Definition | Example
Temporal / Sequence Data | Data that is ordered over time or in a sequence. | Stock exchange data; historical sales records; biological sequences (DNA, proteins).
Data Streams | Continuous, real-time flow of data from sensors or monitoring systems. | Video surveillance; IoT sensor data; network traffic logs.
Spatial Data | Data related to physical locations or geographic space. | Maps; city poverty rates vs. distance from highways; GIS surveys.
Text Data | Unstructured or semi-structured textual information. | Product reviews; social media posts; scientific literature.
Multimedia Data | Data in forms of images, videos, or audio. | Identifying objects in images; detecting goals in sports videos; classifying videos/audio.
Graph / Networked Data | Data representing relationships or connections between entities. | Social networks; citation networks; web graphs.
Web Data | Data collected from the World Wide Web, often heterogeneous. | E-commerce websites; web page link structures; user activity logs.
Multiple / Mixed Data Types | Combination of several data types present in one application. | Web pages with text, images, videos, and hyperlinks; bioinformatics datasets combining sequences, networks, and 3D structures.

Kinds of Pattern, and Technologies Used

On What Kind of Patterns:
1. Association Patterns:
• Association patterns discover relationships between items in large datasets. They answer
the question: “Which items occur together frequently?”
• Identify co-occurrence of items in transactions.
• Used in market basket analysis, cross-selling, or promotions.
• Example:
  • In a supermarket: Customers who buy bread also often buy butter.
  • Rule: {Bread} → {Butter} with a confidence of 80%, meaning that 80% of the transactions that contain bread also contain butter.
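
The support and confidence of a rule such as {Bread} → {Butter} can be computed directly from a list of baskets. The baskets below are hypothetical and were chosen so that the confidence comes out at 80%; the function names are illustrative.

```python
# Hypothetical baskets: 5 contain bread, and 4 of those also contain butter.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "jam"},
    {"bread", "butter", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"butter", "jam"},
    {"eggs"},
    {"milk"},
    {"jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"bread", "butter"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)

print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
# support = 40%, confidence = 80%
```

Support says how common the combination is overall, while confidence says how reliable the rule is once the left-hand side is present.
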
2. Correlation Patterns:
• Correlation patterns identify how one item or variable changes in relation to another.
• Measure strength and direction of relationships between variables.
• Helps in predicting one variable based on another.
• Example:
  • Height and weight of people: As height increases, weight tends to increase.
  • Stock market: As interest rates rise, stock prices may fall.
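
One common way to measure the strength and direction of a linear relationship is the Pearson correlation coefficient, which ranges from -1 to +1. The sketch below computes it from scratch; the height and weight values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical measurements: taller people tend to weigh more.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 75, 88]

r = pearson(heights_cm, weights_kg)
print(f"r = {r:.3f}")  # close to +1: strong positive correlation

# A perfectly decreasing relationship gives r = -1.
print(pearson([1, 2, 3], [3, 2, 1]))
```

A value near 0 indicates no linear relationship, although a non-linear relationship may still exist.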

3. Classification Patterns:

• Classification patterns assign data items to predefined categories or classes based on
their attributes.
• Predict categorical labels for new data.
• Used in credit scoring, medical diagnosis, and spam detection.
• Example:
  • Email filtering: Classify emails as Spam or Not Spam.
  • Bank: Classify customers as High Risk or Low Risk for loans based on income, age,
and credit history.

4. Clustering Patterns:

• Clustering patterns group data items into clusters or groups such that items in the
same cluster are similar, and items in different clusters are dissimilar.
• Identify natural groupings without predefined classes.
• Useful for segmentation, anomaly detection, and pattern discovery.
• Example:
  • Market segmentation: Group customers into clusters based on buying behavior.
  • Social network: Cluster users based on interests or interactions.
5. Outlier Patterns: Outlier analysis is the process of identifying data points that are significantly
different from the rest of the dataset. These points are known as outliers.

Outlier analysis is used when some data cannot be grouped into any class or cluster, because
its attributes/features differ from those of every other class or cluster.

• Example: Credit card fraud – someone usually spends ₹2,000–₹5,000 per month,
but suddenly a ₹2,00,000 transaction appears.
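
One simple way to flag such a value, among several possible rules, is the interquartile range (IQR) test: anything more than 1.5 × IQR below Q1 or above Q3 is treated as a suspected outlier. The spending figures are hypothetical, and the 1.5 multiplier is a common convention rather than a fixed law.

```python
from statistics import quantiles

# Hypothetical monthly card spend in rupees; the last value is the suspicious one.
monthly_spend = [2000, 2500, 3000, 3200, 3500, 4000, 4500, 5000, 200000]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, _, q3 = quantiles(monthly_spend, n=4)
iqr = q3 - q1

# Fences: values outside [lower, upper] are flagged as suspected outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in monthly_spend if x < lower or x > upper]

print(outliers)  # [200000]
```

The IQR rule is used here instead of a mean-and-standard-deviation rule because a single extreme value inflates the standard deviation and can hide itself.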

Not all discovered patterns are interesting (useful).

A pattern is interesting if it is:
• Understandable – easy to interpret.
• Valid – holds on new/test data.
• Novel – previously unknown.
• Useful – helps decision making.
• Surprising – goes against expectations.

Technologies Used:
Statistics: Provides mathematical tools and methods for analyzing data, estimating trends,
and validating patterns discovered through mining. Statistical measures help quantify
relationships and probabilities in the data.

Machine Learning: Supplies algorithms for predictive modeling and pattern recognition.
Techniques like classification, regression, clustering, and anomaly detection are all derived
from machine learning.

Pattern Recognition: Helps identify recurring patterns or regularities in data, which is
essential for understanding complex structures like images, sequences, or signals.

Visualization: Data visualization aids in interpreting mining results, making it easier to
explore patterns, trends, and anomalies visually. Graphs, charts, and interactive plots are
used to communicate findings effectively.

Algorithms: Efficient algorithms are necessary to process large datasets and perform
mining tasks like searching, clustering, or association rule discovery.

High-Performance Computing: Data mining often requires processing huge datasets,
which demands fast computation, parallel processing, and optimized storage to reduce
runtime.

Applications: Data mining is applied across many domains such as business, healthcare,
finance, and social media. The insights obtained are tailored to specific application needs,
like fraud detection or recommendation systems.

Information Retrieval: Techniques from information retrieval help in finding relevant
data efficiently, especially from unstructured or semi-structured sources like text
documents and the web.

Data Warehouse: Data warehouses provide integrated, cleaned, and historical data that
serve as a foundation for data mining. They allow multidimensional analysis and support
large-scale mining.

Database Systems: Databases store structured data in tables, which can be queried and
preprocessed before mining. Database management ensures data integrity, consistency, and
security.
Applications of Data Mining
E-commerce: Data mining analyzes customer data to understand buying behavior, provide
personalized recommendations, and offer cross-sells and up-sells; for example, Amazon
suggests “Frequently Bought Together” items based on your previous purchases.

Healthcare: Data mining analyzes medical records to predict diseases and improve
patient care; for example, a hospital using patient blood sugar levels, BMI, and family
history to predict the risk of diabetes for early intervention.

Banking: Data mining examines transaction data to detect fraud and assess risk; for
example, Visa or Mastercard detecting unusual transactions, like multiple international
purchases in a short time, and alerting the customer.

Social media: Data mining interprets user-generated content to understand trends and
sentiment; for example, Twitter analyzing tweets about the latest iPhone launch to see
whether users are positive, negative, or neutral.

Major Issues in Data Mining

1. Data Quality:
Data quality refers to how accurate, complete, and reliable the data is. Poor quality data—
such as missing values, errors, duplicates, or inconsistencies—can lead to incorrect
patterns and wrong decisions.

For example, if a retail dataset has missing purchase records, data mining might wrongly
identify which products are popular, affecting recommendations. Ensuring high-quality
data through cleaning and preprocessing is essential before mining.

2. Data Integrity:
Data integrity means that the data is consistent, trustworthy, and unaltered across all
sources. If data is corrupted or inconsistent, the analysis may produce misleading results.
For example, if a bank’s customer database has conflicting account balances due to
synchronization errors, predictive models for loan approval could fail. Maintaining
integrity ensures that the mined patterns truly reflect reality.

3. Privacy and Security Concerns:

Data mining often deals with sensitive personal, financial, or medical information.
Privacy concerns arise when data is used without consent, and security concerns arise
when unauthorized access leads to data breaches.

For example, mining patient records without anonymization can expose personal health
information. Proper encryption, anonymization, and access controls are needed to protect
data.

4. Handling Complex and Unstructured Data:

Many real-world datasets are unstructured, like text documents, social media posts,
images, and videos, which are harder to analyze than structured tables. Traditional mining
techniques are often insufficient for such data.

For example, analyzing millions of tweets to detect sentiment requires natural language
processing and specialized algorithms. Handling this complexity is a major challenge in
data mining.

5. Scalability:
Scalability is the ability of data mining algorithms to handle extremely large datasets
efficiently. As data grows in volume, traditional algorithms may become too slow or
require too much memory.

For example, processing terabytes of e-commerce transactions or social media data needs
distributed computing and parallel processing techniques. Scalable algorithms ensure
timely and efficient analysis even with big data.
Data Objects and Attribute Types
Data objects are the basic entities or items in a dataset that we want to analyze using data
mining. Each data object represents a real-world entity and is described by a set of
attributes.

For example, in a customer database, each customer is a data object, and in a hospital
database, each patient or each medical test result can be a data object. Data objects are
essentially the “rows” in a structured dataset.

Attributes are the properties or characteristics of a data object, often represented as
“columns” in a dataset. For example: customer_id, customer_name, customer_age.

Types of Attributes:

1. Qualitative (Categorical) Attributes

Qualitative attributes describe qualities or categories of data rather than numeric
measurements.

a) Nominal:
Nominal attributes represent categories with no inherent order. They are simply used to
label or classify objects.
Example: Gender (Male, Female), Blood Type (A, B, AB, O).
b) Ordinal:
Ordinal attributes represent categories with a meaningful order, but the difference
between values is not numerically meaningful.
Example: Customer satisfaction (Low, Medium, High), Education Level (High School,
Bachelor’s, Master’s).

c) Binary:

A binary attribute is a special nominal attribute with only two states: 0 or 1, where 0
typically means that the attribute is absent and 1 means that it is present.

Binary attributes can be of two types:

• Symmetric Binary: Both outcomes are equally important.
Example: Gender (Male/Female), Pass/Fail.
• Asymmetric Binary: One outcome is more significant than the other.
Example: Disease presence (Yes/No), Clicked ad (Yes/No) – here, “Yes” is more
important.

2. Quantitative (Numeric) Attributes

Quantitative attributes represent quantity or measurements. They are numeric in nature.

a) Ratio:
Numeric attributes with a true zero, where both differences and ratios are meaningful.
Example: Age (0–100 years), Income ($0–$100,000).

b) Interval:
Numeric attributes where differences are meaningful but there is no true zero.
Example: Temperature in Celsius (difference matters, 0°C is not “no temperature”), Year
of Birth.

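c) Discrete:
Numeric attributes that take a finite or countably infinite set of values, such as integer counts.
Example: Number of children in a family (0, 1, 2…), Number of cars owned.
d) Continuous:
Numeric attributes that can take any value within a range.
Example: Height (e.g., 165.5 cm), Weight (e.g., 72.3 kg).
Note: Discrete vs. continuous describes the set of possible values, so it classifies attributes
independently of the ratio vs. interval distinction.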

Basic Statistical Descriptions of Data

1. Measuring Central Tendency

Central tendency describes the “center” or typical value of a dataset. It helps summarize
data with a single representative value. Common measures include:

a) Mean:
The arithmetic average of all values in a dataset. It is sensitive to extreme values
(outliers).
Example: For ages {20, 25, 30, 35, 40}, Mean = (20+25+30+35+40)/5 = 30.

b) Median:
The middle value when the data is arranged in ascending or descending order. It is robust
to outliers.
Example: For ages {20, 25, 30, 35, 100}, Median = 30 (not affected by 100).

c) Mode:
The most frequently occurring value in the dataset.
Example: For purchases {1, 2, 2, 3, 4}, Mode = 2 (unimodal: the dataset has a single mode).
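
The three examples above can be checked with Python's standard `statistics` module. The snippet also shows the point about outliers: replacing 40 with 100 moves the mean but leaves the median unchanged.

```python
from statistics import mean, median, multimode

ages = [20, 25, 30, 35, 40]
ages_with_outlier = [20, 25, 30, 35, 100]
purchases = [1, 2, 2, 3, 4]

print(mean(ages))                  # 30
print(mean(ages_with_outlier))     # 42 (pulled up by the outlier 100)
print(median(ages_with_outlier))   # 30 (unaffected by the outlier)
print(multimode(purchases))        # [2] (a list, since data can have several modes)
```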

2. Measuring Dispersion

Dispersion indicates how spread out the data values are around the central value. It shows
the variability or consistency of the dataset. Common measures include:

a) Variance:
Average of the squared differences from the mean. It measures how far data points
deviate from the mean.
Example: For {2, 4, 6}, Mean = 4, Variance = [(2–4)² + (4–4)² + (6–4)²]/3 = 8/3 ≈ 2.67.

b) Standard Deviation:
Square root of variance. It expresses dispersion in the same units as the data.
Example: SD = √2.67 ≈ 1.63.
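
The variance example above divides by N, which is the population variance. As a quick check with the standard `statistics` module, `pvariance` and `pstdev` reproduce those figures, while `variance` divides by N − 1 (the sample variance) and gives a different number for the same data.

```python
from statistics import pstdev, pvariance, variance

data = [2, 4, 6]

population_variance = pvariance(data)  # divides by N     -> 8/3
population_sd = pstdev(data)           # sqrt of the above
sample_variance = variance(data)       # divides by N - 1 -> 8/2

print(round(population_variance, 2))  # 2.67
print(round(population_sd, 2))        # 1.63
print(sample_variance)                # 4
```
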
Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-number
summary as follows: the ends of the box are at the quartiles (Q1 and Q3), so that the
box length is the interquartile range (IQR = Q3 − Q1). The median is marked by a line within the box. Two
lines (called whiskers) outside the box extend to the smallest (Minimum) and largest
(Maximum) observations.

Five Number Summary:

The Five-Number Summary is a way to describe a dataset using five key values:
1. Minimum (Min): The smallest value in the dataset.
2. First Quartile (Q1): The 25th percentile; 25% of data lies below this value.
3. Median (Q2): The 50th percentile; the middle value of the dataset.
4. Third Quartile (Q3): The 75th percentile; 75% of data lies below this value.
5. Maximum (Max): The largest value in the dataset.

Example Dataset:

12, 15, 18, 19, 21, 23, 24, 26, 28, 30

Step 1: Order the data (already in ascending order):

12, 15, 18, 19, 21, 23, 24, 26, 28, 30

Step 2: Identify Min and Max:

• Min = 12
• Max = 30

Step 3: Median (Q2):

• Middle value = average of 5th and 6th values → (21 + 23)/2 = 22

Step 4: First Quartile (Q1):

• Lower half of data: 12, 15, 18, 19, 21
• Median of lower half = 18 → Q1 = 18

Step 5: Third Quartile (Q3):

• Upper half of data: 23, 24, 26, 28, 30
• Median of upper half = 26 → Q3 = 26

Five-Number Summary:

• Min = 12
• Q1 = 18
• Median (Q2) = 22
• Q3 = 26
• Max = 30
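
The steps above follow the "median of each half" convention for quartiles. A minimal sketch of that procedure is shown below; the function name is illustrative, and other quartile conventions (for example, interpolation-based ones used by some libraries) can give slightly different Q1 and Q3 values.

```python
from statistics import median


def five_number_summary(values):
    """Five-number summary using the median-of-halves rule.

    Assumes at least two values. For an odd count, the middle value
    is left out of both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[-half:]
    return {
        "min": data[0],
        "q1": median(lower_half),
        "median": median(data),
        "q3": median(upper_half),
        "max": data[-1],
    }


summary = five_number_summary([12, 15, 18, 19, 21, 23, 24, 26, 28, 30])
print(summary)
# {'min': 12, 'q1': 18, 'median': 22.0, 'q3': 26, 'max': 30}
```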

Symmetric vs Skewed Distributions:

1. Symmetric Distribution

A distribution is symmetric if its left and right sides are mirror images around the
central point.
• For a symmetric distribution with a single peak: Mean = Median = Mode
• The tails on both sides of the central value are equal in length.
• Examples: Normal distribution, Uniform distribution (symmetric, though it has no single mode)

2. Skewed Distribution

A distribution is skewed if it is not symmetric, i.e., one tail is longer than the other.

Types of Skewness:

a) Right-skewed (positive skew):

A distribution is right-skewed when the tail on the right side (larger values) is longer
than the tail on the left. This means most of the data is concentrated on the smaller values,
with a few larger values stretching out to the right.

Relationship of mean, median, and mode (typical case): Mode < Median < Mean

The mean is pulled toward the longer right tail, so it is larger than the median and mode.

Example: Income distribution in many countries where most people earn moderate
amounts, but a few earn extremely high amounts.

b) Left-skewed (negative skew):

A distribution is left-skewed when the tail on the left side (smaller values) is longer than
the tail on the right. This means most of the data is concentrated on the higher values,
with a few smaller values stretching out to the left.

Relationship of mean, median, and mode (typical case): Mean < Median < Mode

The mean is pulled toward the longer left tail, so it is smaller than the median and mode.

Example: Exam scores in a class where most students score high, but a few score very
low.
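
The direction of skew can be checked numerically by comparing the mean with the median, or by computing a skewness coefficient. The sketch below uses the moment-based coefficient (the average cubed z-score), which is one of several definitions; both datasets are hypothetical.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Moment-based skewness: the mean of cubed z-scores (population form)."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical incomes (in thousands): a few very high earners -> right skew.
incomes = [20, 22, 25, 27, 30, 32, 35, 40, 150]
# Hypothetical exam scores: most high, one very low -> left skew.
scores = [30, 85, 88, 90, 92, 93, 95, 96, 98]

for name, data in [("incomes", incomes), ("scores", scores)]:
    print(name, "mean > median:", mean(data) > median(data),
          "skewness > 0:", skewness(data) > 0)
```

For the incomes the mean exceeds the median and the coefficient is positive; for the scores both comparisons flip.
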
Data Visualization
Data visualization is the process of representing data in graphical or visual formats
such as charts, graphs, and plots, to help humans understand patterns, trends, and
insights in the data easily.

Instead of looking at raw numbers, visualization allows us to see the story behind the
data.

Purpose / Importance:

1. Simplifies complex data: Makes large datasets easy to understand.
2. Identifies patterns and trends: Helps spot trends, correlations, and outliers.
3. Supports decision-making: Visual insights help in making informed decisions.
4. Communicates results effectively: Easier for stakeholders to understand compared to
raw tables.
5. Detects errors: Anomalies or mistakes in data can be identified quickly.

Types of Data Visualization:

1. Charts and Graphs:
o Bar Chart: Compare categories (e.g., sales per region).
o Line Chart: Show trends over time (e.g., monthly revenue).
o Pie Chart: Show proportions of a whole (e.g., market share).
2. Plots:
o Scatter Plot: Show relationships between two variables (e.g., height vs.
weight).
o Histogram: Show frequency distribution of a single variable.
o Box Plot (or Whisker Plot): Show summary statistics and detect outliers.
3. Advanced Visualizations:
o Heat maps: Show intensity or correlation between variables.
o Tree Maps / Sunburst Charts: Show hierarchical data.
o Geospatial Maps: Show data over geographic locations.

Example: Suppose you have exam scores for 50 students. A histogram can quickly show
how many students scored in the ranges 0–20, 21–40, etc., instead of checking each score
individually.
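
The counting behind a histogram can be sketched without any plotting library: each score is assigned to a range, and the number of scores per range is drawn as a bar. The scores below are a small hypothetical sample, not the 50 students mentioned above.

```python
# Hypothetical exam scores.
scores = [12, 35, 38, 47, 52, 55, 58, 63, 67, 71, 74, 78, 85, 91, 100]

# Inclusive score ranges, matching the 0-20, 21-40, ... grouping in the text.
bins = [(0, 20), (21, 40), (41, 60), (61, 80), (81, 100)]

# Count how many scores fall into each range.
counts = {f"{lo}-{hi}": sum(lo <= s <= hi for s in scores) for lo, hi in bins}

# Draw a text histogram: one '#' per student in the range.
for label, n in counts.items():
    print(f"{label:>6} | {'#' * n}")
```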

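1. Quantile Plot: A quantile is a value below which a given proportion of the data lies; the
quartiles, for example, divide the sorted data into four equal-sized parts. A quantile plot
displays each sorted data value against the approximate proportion of the data that lies below it,
so the whole distribution can be read from one graph.

Example:
Dataset: 2, 4, 7, 10, 12, 15, 18, 20, 22, 25

• Q1 (25th percentile) = 7
• Q2 (50th percentile) = 13.5 (median)
• Q3 (75th percentile) = 20

Use: Quantiles help understand the distribution of data and detect outliers.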
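
The points of a quantile plot can be computed as follows. This sketch assumes one common convention, in which the i-th smallest of N values is paired with the fraction f = (i − 0.5) / N; other conventions for f exist, and the function name is illustrative.

```python
def quantile_plot_points(values):
    """Pair each sorted value with f = (i - 0.5) / N (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


data = [2, 4, 7, 10, 12, 15, 18, 20, 22, 25]
points = quantile_plot_points(data)

# Each pair is (fraction of data below, data value); plot f on the x-axis.
for f, x in points:
    print(f"{f:.2f}  {x}")
```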

2. Quantile-Quantile (Q-Q) Plot: A Q-Q plot is a graphical tool for comparing two
distributions by plotting their corresponding quantiles against each other. The two
distributions can be two datasets, or one dataset and a theoretical distribution such as
the normal distribution. If the two distributions match, the points lie approximately on
the straight line y = x.
• Deviations from the line indicate differences between the two distributions.

Example Use Cases:

• Checking if data is normally distributed.
• Comparing two datasets to see if they come from the same distribution.
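
For two samples of the same size, the Q-Q points are simply the sorted values paired up in order. The sketch below covers only that simple case (unequal sizes need interpolated quantiles, which is not shown); the data and function name are hypothetical.

```python
def qq_points(sample_a, sample_b):
    """Pair the i-th smallest value of each sample (equal sizes assumed)."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch only handles samples of equal size")
    return list(zip(sorted(sample_a), sorted(sample_b)))


# Hypothetical unit prices at two branches; branch 2 is branch 1 shifted by 10.
branch_1 = [55, 40, 60, 45, 50]
branch_2 = [70, 50, 65, 55, 60]

points = qq_points(branch_1, branch_2)
print(points)  # [(40, 50), (45, 55), (50, 60), (55, 65), (60, 70)]
```

Here every point lies on the line y = x + 10, parallel to y = x: the two samples have the same shape, but branch 2's values are consistently higher.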

3. Scatter Plot: A scatter plot is a graphical representation of two variables (x and y)
where each point represents one observation.

Purpose:

o Show relationship or correlation between two variables.
o Detect patterns, clusters, and outliers.

Example:

• x-axis: Hours studied
• y-axis: Exam score
• Each student = one point
• Pattern: Positive correlation → more hours studied → higher scores

THANK-YOU
