DM Notes Unit - 1

Data mining is the process of extracting useful information from large datasets to transform it into actionable knowledge. It involves various steps including data cleaning, integration, and applying algorithms to discover patterns, which can be classified into descriptive and predictive tasks. The architecture of data mining includes data sources, cleaning processes, mining engines, and evaluation modules, and it is applicable across various data types and industries.

Why is DM used?

- The overall goal of the data mining process is to extract information from large data
sets or databases and transform it into an understandable structure for further use.
- Data → Knowledge → Action → Goal
- Netflix collects user ratings of movies (data) → What types of movies you will like
(knowledge) → Recommend new movies to you (action) → Users stay with Netflix
(goal)
- Gene sequences of cancer patients (data) → Which genes lead to cancer? (knowledge)
→ Appropriate treatment (action) → Save life (goal)

What is Data Mining?

- Data mining refers to extracting or “mining” knowledge from large amounts of data.
- It can also be considered the process of automatically discovering useful information from
large data repositories.
- It is also often referred to as “Knowledge Discovery in Databases (KDD)” or “knowledge mining”.

Explain Data Mining Architecture

The architecture of a data mining system consists of the following components:

1. Data Sources (Bottom Layer):

- This is where all raw data is stored, such as databases, data warehouses, the WWW
(web), and other information repositories.
- It contains raw, unprocessed data.
2. Cleaning, Integration & Selection:
- The sources hold far more data than is needed, so only the relevant part is passed on,
after data cleaning, data integration, and data selection.

Extra: ----------------------------

- Cleaning: Removes noise, handles missing values


- Integration: Combines data from multiple sources
- Selection: Chooses only relevant data for analysis
3. Database & Database Server:
- Data may come from many places (databases, data warehouses, etc.), so this component
holds it in a single, standardized format.
- It acts as a storage manager for pre-processed data (retrieved data, metadata,
intermediate results).
- It supports queries and retrieves the selected, cleaned data for mining.
4. Data Mining Engine:
- This is the core functional component of the DM architecture.
- Here, data mining algorithms such as classification, clustering, association rule mining,
and regression are applied to the pre-processed data to uncover
hidden patterns.
5. Pattern Evaluation Module:
- It filters out irrelevant, redundant, or unhelpful patterns.
- It passes on only those patterns that are useful to the user.
6. Knowledge Base:
- Finding hidden patterns and filtering out unnecessary ones requires
some extra information.
- The knowledge base contains this background knowledge.

Extra: ---------------------------- Background knowledge includes:


o Concept Hierarchies

(Defines relationships between higher-level and lower-level concepts.)

Example: "City → State → Country"

o Ontologies

(A formal structure showing how concepts relate in a domain.)

Example: In healthcare, symptoms → diseases → treatments.

o User Constraints

(Rules or limitations specified by the user to restrict the search space.)

Example: Only mine patterns from users aged 18–25.

o Previously Discovered Patterns

(Stores known or previously accepted patterns to avoid repetition.)

7. Graphical User Interface (GUI):


- Allows the user to interact with the system
- Displays results through charts, tables, graphs, etc.
- Also allows feedback to improve pattern evaluation or re-mining

Explain KDD (Knowledge Discovery in Databases)

- KDD (Knowledge Discovery in Databases) is the overall process of discovering useful
knowledge (patterns, trends, or rules) from large amounts of data.
- Data mining is just one step of KDD.
- KDD process = Data Preprocessing + Data Mining + Data Post-Processing

The steps in the KDD process are:

1. Selection: Relevant data is fetched from a larger dataset (SQL queries can be used).
2. Preprocessing: Data is cleaned to handle missing values and remove noise and
inconsistencies.
3. Transformation: The cleaned data is converted into a format suitable for mining.
4. Data Mining: Algorithms are applied to extract patterns.
5. Pattern Evaluation: Irrelevant, redundant, or unhelpful patterns are filtered out.
6. Knowledge Presentation: The final patterns are displayed using graphs, tables, trees, pie
charts, or whatever form the user wants.
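
The six KDD steps can be sketched end to end on a toy dataset. This is only an illustration under assumptions: the records, the column layout (customer, store, item), and the "at least half of the baskets" threshold are all hypothetical and not part of the notes.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, store, item). None marks a missing value.
raw = [
    (1, "north", "milk"), (1, "north", "bread"),
    (2, "south", "milk"),
    (3, "north", "Bread "), (3, "north", "milk"),
    (4, "south", None), (4, "south", "eggs"),
]

# 1. Selection: keep only the columns relevant to the question (customer, item).
selected = [(cid, item) for cid, _store, item in raw]

# 2. Preprocessing: drop records with a missing item, fix inconsistent spelling.
cleaned = [(cid, item.strip().lower()) for cid, item in selected if item]

# 3. Transformation: group the items into one basket (set) per customer.
baskets = {}
for cid, item in cleaned:
    baskets.setdefault(cid, set()).add(item)

# 4. Data mining: count how many baskets contain each item.
counts = Counter(item for basket in baskets.values() for item in basket)

# 5. Pattern evaluation: keep items found in at least half of the baskets (assumed threshold).
min_count = len(baskets) / 2
frequent = {item: n for item, n in counts.items() if n >= min_count}

# 6. Knowledge presentation: print a small table, most frequent item first.
for item, n in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(f"{item:<6} {n}/{len(baskets)} baskets")
```

In a real system each step would be far more involved (SQL for selection, proper imputation for missing values, and so on); the point here is only the order of the stages.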

On what kinds of data can DM be done?

Data mining can be performed on many kinds of data, such as:

- Relational Database: Data is stored in tables (rows and columns).
Examples: MySQL, Oracle, SQL Server
- Data Warehouses: A data warehouse is a repository of information collected from
multiple sources. (It is constructed after pre-processing the data.)
Examples: Stock Market, D-Mart, Big Bazaar
- Transactional Database: Data is represented as a set of transactions. Each
transaction is identified by a Transaction ID (TID) and contains a set of items.
Examples: Online shopping on Flipkart, Amazon
- World Wide Web (WWW): A massive source of unstructured and semi-structured
data.
- Other information repositories: files such as CSV, Excel, JSON
- Multimedia databases: audio, images, video
- Sensor data: IoT devices, weather logs, smartwatches

What kinds of patterns can be mined?

Data mining functionalities can be classified into two categories:

1. Descriptive

2. Predictive

Descriptive:

- These tasks characterize the general properties of the data stored in a database.
- Descriptive tasks are used to find patterns in the data.
- E.g.: clusters, trends, etc.

Predictive:

- These tasks predict the value of one attribute on the basis of the values of other
attributes.
- E.g.: predicting customer footfall or product sales at a store during a festival

What’s Characterization and Discrimination?

Class Characterization:

- Focuses on summarizing the properties or features of a specific class in a dataset.


- Describes common or representative patterns associated with the target class.
- Helps in understanding the general behavior of the data within that class.

Data discrimination:

- It’s also known as Class Discrimination or Class Comparison.


- Focuses on comparing two or more classes to find distinctive features.
- Identifies which attributes best differentiate one class from another.
- Helps in decision-making, segmentation, and classification.

What is Frequent Pattern Mining?

- Frequent patterns are patterns that occur frequently in data. The main
kinds of frequent patterns are:
- Frequent Itemset: A set of items that frequently appear together.
- Frequent Subsequence: A sequence of events that occurs frequently, such as
purchasing a laptop, followed by a digital camera, and then a memory card.
- Frequent Substructure: A substructure can refer to different structural forms (e.g.,
graphs, trees, or lattices) that may be combined with itemsets or subsequences.
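
As a rough illustration of frequent itemsets, the sketch below counts every 1-item and 2-item combination by brute force and keeps those that meet a minimum support. The transactions and the 60% threshold are assumptions made for the example, and this is plain counting, not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (one set of items per purchase).
transactions = [
    {"laptop", "camera", "memory card"},
    {"laptop", "camera"},
    {"laptop", "memory card"},
    {"camera"},
    {"laptop", "camera", "memory card"},
]
min_support = 0.6  # assumed threshold: itemset must appear in >= 60% of transactions

# Count every itemset of size 1 and 2 that occurs in each transaction.
counts = Counter()
for t in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(t), size):
            counts[itemset] += 1

# Keep only the itemsets whose relative frequency reaches the threshold.
n = len(transactions)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}

for itemset, sup in sorted(frequent.items()):
    print(itemset, sup)
```

Here ("camera", "laptop") is kept because it appears in 3 of 5 transactions, while ("camera", "memory card") appears in only 2 and is dropped.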

Explain Association Analysis with an example:

Association analysis:

- It is the process of uncovering relationships among data and determining association
rules.
- It is used to discover interesting relationships and associations among items or
events in large datasets.
- It finds patterns like “if X, then Y” (known as association rules).
- Rules are evaluated using:
o Support: how often X and Y occur together, as a fraction of all transactions
o Confidence: how often Y is bought when X is bought
- E.g.: If a customer buys a computer, will they also buy software (such as an anti-virus) along with it?

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

Here 1% of all transactions contain both a computer and software, and 50% of the customers who bought a computer also bought software.
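
Support and confidence for a rule X ⇒ Y can be computed directly from their definitions. The following is a minimal sketch; the five transactions are made up for illustration, and the helper names `support` and `confidence` are my own.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical purchase data.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

s = support(transactions, {"computer", "software"})        # 2 of 5 transactions
c = confidence(transactions, {"computer"}, {"software"})   # 2 of the 3 computer buyers
print(f"support = {s:.0%}, confidence = {c:.0%}")
```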

Explain Mining of Correlations:

- Correlation mining is a data mining technique used to identify statistical
relationships between two or more variables in a dataset.
- It helps determine whether variables are positively correlated, negatively
correlated, or not correlated at all.
- It measures:
o Strength of the relationship (how strong is the association?)
o Direction of the relationship
▪ Positive Correlation (+): both variables increase together
▪ Negative Correlation (-): one increases while the other decreases
▪ Zero Correlation (0): no relationship
- Example: Correlation between TV advertising and sales: +0.95 (approximate)
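
One common way to measure both strength and direction is the Pearson correlation coefficient, which ranges from -1 to +1. The notes do not say which coefficient they have in mind, so treat Pearson as an assumption here; the advertising and sales figures below are also invented.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data: TV advertising spend and resulting sales.
tv_spend = [10, 20, 30, 40, 50]
sales = [25, 38, 62, 75, 98]

r = pearson(tv_spend, sales)
print(f"r = {r:+.2f}")  # close to +1: strong positive correlation

# Reversing one variable flips the direction of the relationship.
r_neg = pearson(tv_spend, [-s for s in sales])
```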

Explain Classification and Regression for Predictive Analysis:

1) Classification:
- Classification is the process of building a model that can predict the class label of
data objects whose label is unknown.
- The model is derived from the analysis of a set of training data whose class
labels are already known. The model learns the relationship between attributes and
class labels.
- Example: automatically classifying whether an email is spam or non-spam.

2) Regression:
- It is used to predict missing or unavailable numerical values rather than class
labels.
- Regression analysis is generally used for numeric prediction.
- Example: predicting the price of a house based on its size (in square feet).
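
The house-price example can be sketched with simple linear regression (a least-squares straight line). This is one regression technique among many, chosen here for brevity; the sizes and prices are made-up numbers.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: house size in square feet and sale price.
sizes = [1000, 1500, 2000, 2500]
prices = [200_000, 290_000, 410_000, 500_000]

slope, intercept = fit_line(sizes, prices)

# Predict a numeric value (not a class label) for an unseen house.
predicted = slope * 1800 + intercept
print(f"price per extra sq ft: {slope:.0f}, predicted price for 1800 sq ft: {predicted:.0f}")
```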

Explain Cluster Analysis and Outlier Analysis for Descriptive Analysis:

1) Clustering:
- Clustering groups similar data objects together without using class labels.
- In many cases, class-labeled data may simply not exist at the beginning, and clustering
can be used to generate class labels for groups of data.
- Clustering follows this principle:
o Maximize intra-class similarity (within the same cluster)
o Minimize inter-class similarity (between different clusters)
- Example: Grouping customers into segments based on age, income, and purchase
behavior.
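
A small k-means sketch shows how customers can be grouped without any class labels. k-means is just one clustering method and is my choice for the illustration; the customer data, the choice of k = 2, and the simplistic "first k points" initialization are all assumptions.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means: returns (clusters, centroids)."""
    centroids = list(points[:k])  # simplistic initialization: the first k points
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical customers: (age, income in thousands).
customers = [(22, 25), (45, 80), (25, 30), (48, 85), (23, 27), (50, 90)]

clusters, centroids = kmeans(customers, k=2)
for i, members in enumerate(clusters):
    print(f"segment {i}: {members}")
```

The younger, lower-income customers end up in one segment and the older, higher-income customers in the other, which matches the principle of high similarity within a cluster and low similarity between clusters.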

2) Outlier Analysis:
- Outlier analysis identifies data points that do not follow the general pattern of the
dataset.
- Outliers may be treated as rare or exceptional cases of interest, or as noise
to be removed.
- Example: In a student attendance dataset, a student with very low attendance
compared to others may be an outlier.
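
The attendance example can be sketched with the widely used 1.5 × IQR rule, which flags values far outside the middle 50% of the data. The rule and the attendance percentages are assumptions for illustration; the notes do not prescribe a specific detection method.

```python
from statistics import quantiles

# Hypothetical attendance percentages for ten students.
attendance = [92, 88, 95, 90, 85, 91, 89, 35, 93, 87]

# Quartiles and interquartile range of the data.
q1, _median, q3 = quantiles(attendance, n=4)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged (a common convention).
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [a for a in attendance if a < low or a > high]

print("outliers:", outliers)
```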

Are all patterns in data mining interesting? If not, which techniques are used to evaluate and
select interesting patterns?

- A data mining system can generate thousands or even millions of patterns, but not
all of them are interesting or useful.
- So no, not all patterns in DM are interesting.
- Techniques to evaluate and select interesting patterns:
o Objective Measures of Interestingness:
▪ Objective measures quantify the quality or interestingness of patterns
based on statistical significance or measures derived from the data.
▪ These measures include support, confidence, lift, and various
statistical tests.

o Subjective Measures of Interestingness:
▪ Subjective measures take into account the user's preferences,
domain knowledge, and specific application requirements.
▪ Users can specify interestingness thresholds or define
constraints to filter and focus on patterns that meet their criteria.
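
As an example of an objective measure, lift compares how often X and Y occur together with how often they would co-occur if they were independent: lift(X ⇒ Y) = support(X ∪ Y) / (support(X) × support(Y)). A lift above 1 suggests a positive association. The sketch below filters candidate rules by an assumed lift threshold of 1.0; the transactions and rules are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

def lift(transactions, x, y):
    """Ratio of observed co-occurrence to what independence would predict."""
    return support(transactions, x | y) / (
        support(transactions, x) * support(transactions, y)
    )

# Hypothetical transactions and candidate rules (X, Y) meaning X => Y.
transactions = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread"},
    {"milk"},
    {"milk", "bread"},
]
rules = [({"bread"}, {"butter"}), ({"bread"}, {"milk"})]

min_lift = 1.0  # assumed threshold, chosen by the user
interesting = [(x, y) for x, y in rules if lift(transactions, x, y) > min_lift]

for x, y in rules:
    print(sorted(x), "=>", sorted(y), f"lift = {lift(transactions, x, y):.3f}")
```

Choosing `min_lift` is where the subjective side comes in: the user decides which threshold makes a pattern worth looking at.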

Which Technologies Are Used in Data Mining?

- Data mining integrates techniques from various fields to extract meaningful patterns
and knowledge from large datasets.

1. Statistics
- A statistical model is a mathematical representation of relationships between
variables.
- It consists of a set of mathematical functions or equations that define the behavior
of the objects.
- Used for:
o Summarizing data (mean, median, standard deviation)
o Making predictions (regression analysis)
o Testing hypotheses (statistical significance)

2. Machine Learning
- Focuses on building systems that can learn from data and make predictions or
decisions on their own.
- There are several types of machine learning: supervised learning, unsupervised
learning, semi-supervised learning, and active learning.

3. Database Systems and Data Warehouses


- Data mining often works with very large datasets or even real-time, fast streaming
data.
- So, it can make good use of scalable database technologies to achieve high efficiency
and scalability on large datasets.

4. Information Retrieval (IR)

- Information retrieval (IR) is the science of searching for documents or for information within
documents.
- Documents can be text or multimedia, and may reside on the Web.

What are the different types of Machine Learning?

- The main types of learning are:

o Supervised Learning (Classification):
▪ Learns from labeled data
▪ Goal: predict a class label
▪ Example: spam vs. non-spam email
o Unsupervised Learning (Clustering):
▪ Learns from unlabeled data
▪ Goal: discover hidden patterns or groupings
▪ Example: segmenting customers by buying behavior
o Semi-Supervised Learning:
▪ Uses a small amount of labeled data plus a large amount of unlabeled data
o Active Learning:
▪ Lets users play an active role in the learning process
▪ The system actively queries the user (a domain expert) for labels

Difference between traditional information retrieval and database systems

| Aspect | Information Retrieval (IR) | Database Systems |
|---|---|---|
| 1. Data Structure | Works with unstructured or semi-structured data, e.g., text documents, web pages, images | Works with structured data in tables (rows & columns), e.g., relational databases like MySQL, Oracle |
| 2. Query Language | Uses simple keyword-based or natural language queries, e.g., "climate change news" | Uses a structured query language (mostly SQL), e.g., SELECT * FROM Customers WHERE Age > 25 |
| 3. Data Model | Often lacks a fixed schema; dynamic and flexible. Focuses on relevance ranking and scoring | Based on well-defined schemas with strict data types and constraints. Focuses on exact matches to user queries |
| 4. Output Nature | Returns a ranked list of documents based on relevance | Returns complete and exact data records |

Which Kinds of Applications Are Targeted in Data Mining?

- Data mining is applied across a wide range of domains, but two highly successful
and popular application areas are:

1. Business Intelligence (BI):


- Business Intelligence technologies provide historical, current, and predictive views of
business operations.
- Common BI Applications are: reporting, OLAP (Online Analytical Processing),
Business performance management, Competitive intelligence, Benchmarking
- Without DM, many businesses may not be able to perform effective market analysis,
compare customer feedback on similar products, discover the strengths &
weaknesses of their competitors, retain highly valuable customers, & make smart
business decisions.

2. Web Search Engines:


- A Web search engine is a specialized computer server that searches for information
on the Web.
- Web search engines use data mining for: Indexing web pages, Ranking results,
Predicting user interests, Query suggestions

What challenges do search engines pose for data mining?

- Huge and ever-growing data:


o The web is constantly expanding with billions of web pages, images, and
videos.
o Mining, indexing, and managing such massive data in real-time is highly
challenging.

- Handling Real-Time (Online) Data:

o Web search engines must process new and dynamic data continuously.

o They often operate on live data streams, requiring rapid analysis and updates.

- Need for Incremental Updates:

o Search engines must incrementally update their models (ranking
algorithms, user profiles, etc.).

o This is difficult because data changes fast, and complete retraining from
scratch is not practical.

- Dealing with Rare or Unique Queries:

o A large portion of search queries are new or are asked only a few times.

o Traditional data mining methods rely on patterns found in frequent data, but
rare queries lack such patterns.

Explain Data Mining Issues

- Data mining faces several challenges that can be grouped into five major categories:

1. Mining Methodology:

o These issues relate to what kind of knowledge is mined and how effectively.

o Mining various and new kinds of knowledge:
▪ Data mining involves a wide range of tasks (e.g., classification, clustering, association).
▪ Different goals require specialized techniques.

o Mining in Multidimensional Space:
▪ Real-world data has many attributes/dimensions.
▪ Patterns can be found across combinations of dimensions.
▪ This is called exploratory or multidimensional data mining.

o Interdisciplinary Nature of Data Mining:
▪ The power of data mining can be substantially enhanced by integrating new
methods from multiple disciplines.

o Handling Uncertainty, Noise, or Incompleteness:
▪ Real data often contains errors, missing values, or noise.
▪ Errors and noise may confuse the data mining process,
leading to the derivation of erroneous patterns.

2. User Interaction:

o Interactive Mining:
▪ The data mining process should be highly interactive. Thus, it is important to
build flexible user interfaces and an exploratory mining environment that
facilitate the user's interaction with the system.

o Incorporation of Background Knowledge:
▪ Background knowledge, constraints, rules, and other information regarding the
domain under study should be incorporated into the knowledge discovery process.

o Presentation and visualization of data mining results:
▪ Mining results must be presented in a clear, vivid, and understandable way.
▪ Visual tools (charts, graphs, dashboards) help non-technical users interpret
results easily.

3. Efficiency and Scalability:

o Efficiency and scalability of data mining algorithms:
▪ Data mining algorithms must be efficient and scalable in order to effectively extract
information from the huge amounts of data held in many data repositories or in
dynamic data streams.
▪ In other words, the running time of a data mining algorithm must be predictable,
short, and acceptable to applications.
▪ Efficiency, scalability, performance, optimization, and the ability to execute in
real time are key criteria for new mining algorithms.

o Parallel, distributed, and incremental mining algorithms:
▪ The giant size of many data sets, the wide distribution of data, and the
computational complexity of some data mining methods are factors that motivate the
development of parallel and distributed data-intensive mining algorithms.

4. Diversity of Database Types:

o Data from multiple sources are connected by the Internet and various kinds of
networks, forming distributed and heterogeneous global information systems.

o Discovering knowledge from different sources of structured, semi-structured, or
unstructured data is challenging.

5. Data Mining and Society:

o This category highlights the social, legal, and ethical implications of data
mining.
▪ Privacy-Preserving Data Mining: the risk is that mining may expose personal or
sensitive data.
▪ Invisible Data Mining: users may not even know their data is being
mined.

---------------------------------------------------------------------------------------------------------------------

What is an Attribute? Also mention its types

- An attribute is a property or characteristic of a data object. It
stores information about the object, like name, age, or color.
Attributes are also called features, dimensions, or variables. A set of
attributes describes an object or entity.
- Attributes are divided into 2 types:

1) Quantitative Attributes
- Attributes that can be measured with numbers.

- Examples: Height, weight, temperature.

- Can be:

o Discrete: Finite/countable values (e.g., number of pets).

o Continuous: Measurable values (e.g., height, weight).

2) Qualitative Attributes

- Descriptive, non-numeric; observed subjectively.

- Examples: color, gender, etc.

- Types:

o Nominal:
▪ Named attributes whose values fall into discrete categories that do not overlap
▪ (e.g., gender, hair color)

o Ordinal:
▪ Like nominal attributes, but the order of the values is meaningful
▪ (e.g., rankings: 1st, 2nd, 3rd)

o Binary:
▪ A categorical attribute with only two possible values
▪ (e.g., yes/no, true/false)
▪ Symmetric: both values are equally important (e.g., male/female)
▪ Asymmetric: one value is more significant than the other (e.g., medical test results)

What are Quantiles?

- Quantiles are values that divide a sorted dataset into equal-sized groups.
They help us understand how data is spread or distributed.
- 2-Quantile: Splits data into two halves.
→ This is the median (50th percentile).
- 4-Quantiles: Split data into four equal parts. → These are called
quartiles:
o Q1: 25th percentile
o Q2: 50th percentile (median)
o Q3: 75th percentile
- Here, Q3 - Q1 = IQR (Interquartile Range), which gives the spread of the
middle 50% of the data.
- 100-Quantiles: Split data into 100 parts. → These are called
percentiles.
- Reading the quartiles:
o Q1: 25% of the data points are less than or equal to this value
o Q2: Median; 50% of the data points are less than or equal to it (it separates the dataset into halves)
o Q3: 75% of the data points are less than or equal to this value
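
Quartiles, the IQR, and percentiles can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up data; note that different textbooks and tools use slightly different interpolation conventions, so Q1 and Q3 may differ a little from a hand calculation.

```python
from statistics import median, quantiles

# Hypothetical data values.
data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

# 4-quantiles: three cut points that split the sorted data into four parts.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50% of the data

# 100-quantiles: 99 cut points, i.e. the 1st to 99th percentiles.
percentiles = quantiles(data, n=100)

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print("Q2 equals the median:", q2 == median(data))
```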

Explain the five-number summary with a boxplot diagram.

- The five-number summary describes the distribution of a
dataset using five key values:
- Minimum: Smallest value in the dataset
- Q1 (First Quartile): 25% of the data lies below this
- Q2 (Median): Middle value (50% of the data lies below this)
- Q3 (Third Quartile): 75% of the data lies below this
- Maximum: Largest value in the dataset
- A boxplot draws a box from Q1 to Q3 with a line at the median, and whiskers
extending toward the minimum and maximum.
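
The five values can be collected in a few lines. The sketch below uses hypothetical exam marks and the default quartile convention of `statistics.quantiles`; a plotting library would turn the same five numbers into a boxplot.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

# Hypothetical exam marks.
marks = [45, 52, 58, 61, 67, 70, 74, 79, 88]

summary = five_number_summary(marks)
for label, value in zip(["Min", "Q1", "Median", "Q3", "Max"], summary):
    print(f"{label:<6} {value}")
```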

Explain Data Matrix vs Dissimilarity Matrix


Explain methods to find dissimilarity of Numeric Data with example

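Let $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ be two objects described by $n$ numeric attributes. Methods to find the dissimilarity $d(x, y)$ are:

1. Euclidean Distance:

- It measures the straight-line distance between two points in space.
- Formula: $d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$

2. Manhattan Distance:

- It measures the sum of the absolute differences between each pair of
corresponding values.
- Formula: $d(x, y) = \sum_{k=1}^{n} |x_k - y_k|$

3. Minkowski Distance:

- It is a general form of both Euclidean and Manhattan distances.
It depends on a parameter p:
o If p = 1, it becomes Manhattan Distance
o If p = 2, it becomes Euclidean Distance
- Formula: $d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^p \right)^{1/p}$

4. Supremum Distance:

- It measures the maximum absolute difference between the
corresponding values of two vectors. It focuses on the largest single
difference.
- Formula: $d(x, y) = \max_{k} |x_k - y_k|$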
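
As a worked example, the four distances can be written as short functions and applied to two points. The points (1, 2) and (3, 5) and the function names are chosen only for illustration. The last lines also build a small dissimilarity matrix, where entry [i][j] holds the distance between object i and object j.

```python
def minkowski(a, b, p):
    """Minkowski distance with parameter p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    return minkowski(a, b, 1)

def euclidean(a, b):
    return minkowski(a, b, 2)

def supremum(a, b):
    """Largest absolute difference over all attributes."""
    return max(abs(x - y) for x, y in zip(a, b))

x1, x2 = (1, 2), (3, 5)  # differences per attribute: 2 and 3

print("Euclidean:", round(euclidean(x1, x2), 2))  # sqrt(2^2 + 3^2) = sqrt(13)
print("Manhattan:", manhattan(x1, x2))            # 2 + 3
print("Supremum: ", supremum(x1, x2))             # max(2, 3)

# Dissimilarity matrix for a few hypothetical objects: symmetric, zeros on the diagonal.
points = [(1, 2), (3, 5), (2, 0), (4, 5)]
matrix = [[round(euclidean(a, b), 2) for b in points] for a in points]
for row in matrix:
    print(row)
```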
