
Data Mining Techniques for Business Insights

Data mining unit 1 notes

Uploaded by

hariharan21m21
Copyright
© All Rights Reserved

BA4027 - DATAMINING FOR BUSINESS INTELLIGENCE

SYLLABUS

UNIT I INTRODUCTION
Data mining, Text mining, Web mining, Spatial mining, Process mining, Data
warehouse and data marts.

UNIT II DATA MINING PROCESS


Data mining process – KDD, CRISP-DM, SEMMA and Domain-Specific,
Classification and Prediction performance measures - RMSE, MAD, MAP, MAPE,
Confusion matrix, Receiver Operating Characteristic curve & AUC; Validation
Techniques - hold-out, k-fold cross-validation, LOOCV, random subsampling, and
bootstrapping.

UNIT III PREDICTION TECHNIQUES


Data visualization, Time series – ARIMA, Holt-Winters, Vector
Autoregressive analysis, Multivariate regression analysis.

UNIT IV CLASSIFICATION AND CLUSTERING TECHNIQUES


Classification- Decision trees, k nearest neighbour, Logistic regression,
Discriminant analysis; Clustering; Market basket analysis;

UNIT V MACHINE LEARNING AND AI


Genetic algorithms, Neural network, Fuzzy logic, Support Vector Machine,
Optimization techniques –Ant Colony, Particle Swarm, DEA

UNIT - 1
UNIT I INTRODUCTION
Data mining, Text mining, Web mining, Spatial mining, Process mining, Data
warehouse and data marts.
********************************************************************
What is Data Mining?
 Data mining, also known as knowledge discovery in data (KDD), is the
process of uncovering patterns and other valuable information from large
data sets.
 The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions or
predictions. This involves exploring the data using various techniques such
as clustering, classification, regression analysis, association rule mining,
and anomaly detection.

3 Types of Data Mining


There are many different types of data mining, but they can generally be grouped
into three broad categories: descriptive, predictive, and prescriptive.
 Descriptive data mining:
Descriptive data mining involves summarizing and describing the
characteristics of a data set. This type of data mining is often used to explore
and understand the data, identify patterns and trends, and summarize the data
in a meaningful way.

 Predictive data mining


Predictive data mining involves using data to build models that can
make predictions or forecasts about future events or outcomes. This type of
data mining is often used to identify and model relationships between different
variables, and to make predictions about future events or outcomes based on
those relationships.

 Prescriptive data mining
Prescriptive data mining involves using data and models to make
recommendations or suggestions about actions or decisions. This type of data
mining is often used to optimize processes, allocate resources, or make other
decisions that can help organizations achieve their goals.

Benefits of Data Mining


Data mining is the process of extracting useful information and insights from large
data sets. It is a powerful and flexible tool that has many benefits, including:
1. Improved decision-making
 One of the main benefits of data mining is that it can help organizations make
better decisions.
 By analyzing data and uncovering hidden patterns and trends, data mining can
provide valuable insights and information that can be used to inform and
improve decision-making
2. Increased efficiency and productivity
 Data mining can also help organizations increase their efficiency and
productivity.
 By automating and streamlining the data analysis process, data mining can
save time and resources, and help organizations work more effectively and
efficiently.
3. Reduced costs
 Data mining can also help organizations reduce their costs. By identifying and
addressing inefficiencies and waste, data mining can help organizations save
money and improve their bottom line.
4. Increased customer satisfaction
 Data mining can also be used to improve customer satisfaction.
 By analyzing data on customer behavior and preferences, data mining can
help organizations understand their customers better, and provide more
personalized and relevant products and services.

5. Improved risk management
 Data mining can also be used to improve risk management.
 By analyzing data on potential risks and vulnerabilities, data mining can help
organizations identify and mitigate potential risks, and make more informed
and strategic decisions.

Limitations of Data Mining


Some of the main limitations of data mining include:
1. Data quality
One of the main limitations of data mining is the quality of the data.
Data mining can only be as accurate and reliable as the data that it is based
on, and poor-quality data can lead to inaccurate or misleading results.
2. Technical challenges
Data mining can also be technically challenging, especially when
dealing with large and complex data sets. Extracting useful information and
insights from data can require specialized skills and expertise, and can be
time-consuming and resource-intensive.
3. Data Mining tools are complex and require training to use
Data analytics is a complicated process and often requires people with
training to use the tools. The barrier to entry for data analytics can discourage
small businesses from using this technology. It can also be difficult to find
adequate data that isn’t already private or proprietary in nature.
4. Rising privacy concerns
One of the major disadvantages of data mining is the privacy concern it
raises. Traditionally, companies would only share personal data with other
companies in order to provide a service. Nowadays, many people worry
that their personal information is being sold to third parties without their
knowledge. Some people are also uncomfortable knowing that the
government can track certain information about them and how they use their
devices.

5. Data mining requires large databases
 Data mining is one of the most powerful tools in a marketer’s toolbox,
but it does have its drawbacks. One such drawback is that data mining
requires large databases to be effective.
 For example, if an email list has only 100 people, then the data from
those emails will not provide enough information for data mining. On
the other hand, if the list contains 100,000 people, then there will be
more information available and data mining will be more successful.
6. Expensive
 Data mining can be a very expensive process. For example, companies
have to hire additional employees and technology specialists to ensure
that the data mining is done correctly.
 Many businesses have to invest in advanced data mining software,
which can also be expensive.
 For many small businesses, the costs of data mining can outweigh the
benefits because their data may be too limited to produce enough valuable
insights.

Data Mining Applications


Data mining is widely used in the following areas:
1. Data Mining in Healthcare:
 Data mining in healthcare has excellent potential to improve the health
system. It uses data and analytics for better insights and to identify best
practices that will enhance health care services and reduce costs.
 Analysts use data mining approaches such as Machine learning, Multi-
dimensional database, Data visualization, Soft computing, and statistics.
 Data mining can be used to forecast the number of patients in each
category. Such forecasts help ensure that patients get intensive care at the
right place and at the right time. Data mining also enables healthcare
insurers to detect fraud and abuse.

2. Data Mining in Market Basket Analysis:
 Market basket analysis is a modeling method based on a hypothesis.
 If you buy a specific group of products, then you are more likely to buy
another group of products.
 This technique may enable the retailer to understand the purchase behavior
of a buyer. This data may assist the retailer in understanding the
requirements of the buyer and altering the store's layout accordingly.
 Differential analysis can also be used to compare results between
different stores or between customers in different demographic groups.

3. Data mining in Education:


 Education data mining is a newly emerging field, concerned with
developing techniques that explore knowledge from the data generated
from educational Environments.
 The objectives of educational data mining (EDM) include predicting
students' future learning behavior, studying the impact of educational
support, and advancing the science of learning.
 An organization can use data mining to make precise decisions and also to
predict the results of the student. With the results, the institution can
concentrate on what to teach and how to teach.

4. Data Mining in Manufacturing Engineering:


 Knowledge is the best asset possessed by a manufacturing company. Data
mining tools can be beneficial to find patterns in a complex manufacturing
process.
 Data mining can be used in system-level designing to obtain the
relationships between product architecture, product portfolio, and data
needs of the customers.
 It can also be used to forecast the product development period, cost, and
expectations among the other tasks.

5. Data Mining in CRM (Customer Relationship Management):
 Customer Relationship Management (CRM) is all about obtaining and
holding Customers, also enhancing customer loyalty and implementing
customer-oriented strategies.
 To get a decent relationship with the customer, a business organization
needs to collect data and analyze the data. With data mining technologies,
the collected data can be used for analytics.

6. Data Mining in Fraud detection:


 Billions of dollars are lost to fraud. Traditional methods of fraud
detection are time-consuming and complex.
 Data mining finds meaningful patterns and turns data into information. An
ideal fraud detection system should protect the data of all users.
Supervised methods start from a collection of sample records that are
labeled as fraudulent or non-fraudulent.
 A model is constructed from this data, and it is then used to identify
whether a new record is fraudulent or not.

7. Data Mining in Lie Detection:


 Apprehending a criminal is not a big deal, but bringing out the truth from
him is a very challenging task. Law enforcement may use data mining
techniques to investigate offenses, monitor suspected terrorist
communications, etc.
 This technique includes text mining also, and it seeks meaningful patterns
in data, which is usually unstructured text. The information collected from
the previous investigations is compared, and a model for lie detection is
constructed.

8. Data Mining in Financial Banking:


 The Digitalization of the banking system is supposed to generate an
enormous amount of data with every new transaction.

 Data mining techniques can help bankers solve business-related
problems in banking and finance by identifying trends, causalities, and
correlations in business information and market costs that are not
immediately evident to managers or executives because the data volume is
too large or is generated too quickly for experts to screen.
 Managers may use this information for better targeting, acquiring,
retaining, segmenting, and maintaining profitable customers.

Challenges of Implementation in Data mining


1. Incomplete and noisy data:
 Data mining is the process of extracting useful information from large
volumes of data. Real-world data is heterogeneous, incomplete, and noisy,
and large data sets often contain inaccurate or unreliable values.
 These problems may be caused by faulty measuring instruments or by
human error. Suppose a retail chain collects the phone numbers of customers
who spend more than $500, and the accounting employees enter the
information into their system.
 An employee may mistype a digit when entering a phone number,
which results in incorrect data. Some customers may also be unwilling to
disclose their phone numbers, which results in incomplete data. The data
could also be changed by human or system error. All these problems
(noisy and incomplete data) make data mining challenging.

2. Data Distribution:
 Real-world data is usually stored on various platforms in a distributed
computing environment.
 It might be in a database, on individual systems, or even on the
internet. It is quite difficult to bring all the data into a
centralized data repository, mainly due to organizational and technical
concerns.

 For example, various regional offices may have their own servers to store
their data. It may not be feasible to store all the data from all the offices on
a central server. Therefore, data mining requires the development of tools
and algorithms that allow the mining of distributed data.
3. Complex Data:
 Real-world data is heterogeneous, and it could be multimedia data,
including audio and video, images, complex data, spatial data, time series,
and so on.
 Managing these various types of data and extracting useful information is a
tough task. Most of the time, new technologies, new tools, and
methodologies would have to be refined to obtain specific information.
4. Performance:
 The data mining system's performance relies primarily on the efficiency of
algorithms and techniques used.
 If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
5. Data Privacy and Security:
 Data mining usually leads to serious issues in terms of data security,
governance, and privacy.
 For example, if a retailer analyzes the details of the purchased items, then it
reveals data about buying habits and preferences of the customers without
their permission.
6. Data Visualization:
 In data mining, data visualization is a very important process because it is
the primary method that shows the output to the user in a presentable way.
The extracted data should convey the exact meaning of what it intends to
express.
 However, presenting the information to the end user in a precise and
easy way is often difficult. Because both the input data and the output
information can be complicated, efficient and effective data visualization
processes need to be implemented.

Data Mining Process
The data mining process typically involves the following steps:
1. Business Understanding:
 This step involves understanding the problem that needs to be solved and
defining the objectives of the data mining project. This includes
identifying the business problem, understanding the goals and objectives of
the project, and defining the KPIs that will be used to measure success.
This step is important because it helps ensure that the data mining project is
aligned with business goals and objectives.
2. Data Understanding:
 This step involves collecting and exploring the data to gain a better
understanding of its structure, quality, and content.
 This includes understanding the sources of the data, identifying any data
quality issues, and exploring the data to identify patterns and relationships.
This step is important because it helps ensure that the data is suitable for
analysis.
3. Data Preparation:
 This step involves preparing the data for analysis. This includes cleaning
the data to remove any errors or inconsistencies, transforming the data to
make it suitable for analysis, and integrating the data from different sources
to create a single dataset.
 This step is important because it ensures that the data is in a format that can
be used for modeling.
4. Modeling:
 This step involves building a predictive model using machine learning
algorithms.
 This includes selecting an appropriate algorithm, training the model on the
data, and evaluating its performance.
 This step is important because it is the heart of the data mining process and
involves developing a model that can accurately predict outcomes on new
data.

5. Evaluation:
 This step involves evaluating the performance of the model. This includes
using statistical measures to assess how well the model is able to predict
outcomes on new data.
 This step is important because it helps ensure that the model is accurate and
can be used in the real world.
6. Deployment:
 This step involves deploying the model into the production environment.
This includes integrating the model into existing systems and processes to
make predictions in real-time.
 This step is important because it allows the model to be used in a practical
setting and to generate value for the organization.
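
The preparation, modeling, and evaluation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production workflow: the records, the single `monthly_spend` feature, and the threshold "model" are all hypothetical, chosen only to show how a hold-out split keeps the evaluation data separate from the training data.

```python
import random

# Hypothetical records: (monthly_spend, responded_to_offer); None marks a missing value.
RAW = [
    (10, 0), (15, 0), (20, 0), (25, 0), (30, 0), (12, 0),
    (70, 1), (75, 1), (80, 1), (85, 1), (90, 1), (72, 1),
    (None, 1),
]

def prepare(rows):
    """Data preparation: drop records with a missing feature value."""
    return [(x, y) for x, y in rows if x is not None]

def hold_out_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def train(rows):
    """Modeling: learn a spend threshold halfway between the two class means."""
    class0 = [x for x, y in rows if y == 0]
    class1 = [x for x, y in rows if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold, x):
    """Predict class 1 when the spend is above the learned threshold."""
    return 1 if x > threshold else 0

def accuracy(threshold, rows):
    """Evaluation: share of held-out records predicted correctly."""
    hits = sum(predict(threshold, x) == y for x, y in rows)
    return hits / len(rows)

clean = prepare(RAW)
train_rows, test_rows = hold_out_split(clean)
threshold = train(train_rows)
print(len(train_rows), len(test_rows), accuracy(threshold, test_rows))
```

The toy classes are deliberately well separated, so the accuracy here says nothing about real data; the point is only the order of the steps.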

Data mining techniques


1. Classification
2. Clustering
3. Tracking patterns
4. Regression
5. Outlier detection or anomaly detection
6. Sequential Patterns
7. Prediction
8. Association Rules
9. Visualisation
10. Neural networks
11. Long-term memory processing

1. Classification
 In general, classification means categorizing the available entities with respect
to some target variable. Open your messy wardrobe; to clean and organize it,
you would start separating the clothes based on ethnic, casual, formal, and
loungewear.

 Furthermore, the ones that no longer fit you fall into the discarded category.
In data mining terms, the clothes are the data received from multiple
sources, the clothing categories are analogous to the target class labels, and
the discarded clothes are the outliers.
 In order to provide a definition, we can term Classification as a predictive
modeling technique within supervised learning. It involves predicting class
labels based on a set of labeled observations. Businesses use this process of
compartmentalization to draw essential inferences.
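
As a rough illustration of predicting class labels from labeled observations, here is a small k-nearest-neighbour classifier. The two numeric features (a formality score and a comfort score for each garment) and all the data points are invented for the wardrobe analogy; k-nearest-neighbour is only one of several classification methods.

```python
import math
from collections import Counter

# Hypothetical labeled observations: ((formality, comfort), class label)
TRAINING = [
    ((9, 3), "formal"), ((8, 4), "formal"), ((9, 2), "formal"),
    ((2, 8), "casual"), ((3, 9), "casual"), ((1, 7), "casual"),
]

def knn_predict(training, query, k=3):
    """Return the majority label among the k training points closest to query."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(TRAINING, (8, 3)))
print(knn_predict(TRAINING, (2, 9)))
```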

2. Clustering
 Clustering is a common technique used for grouping similar data. The key
difference between classification and clustering is that the latter has no target
variable. Clustering is the process of separating the data set into subgroups.
 Recalling the previous wardrobe example, the clothes can be sub-grouped as
tops and bottoms. The process of clustering is essential as it prepares the data
for analysis. A major example of clustering is finding customers with similar
purchasing behavior to generate interesting recommendations.
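
To make the idea of grouping without a target variable concrete, the sketch below runs a simple k-means loop on one-dimensional data. The spend values and the starting centroids are assumptions made for illustration; real implementations usually choose the starting centroids at random and work in many dimensions.

```python
def k_means_1d(values, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical annual spend of six customers, with no labels attached.
spend = [10, 12, 11, 50, 52, 49]
centroids, clusters = k_means_1d(spend, centroids=[0, 100])
print(centroids, clusters)
```

No label is supplied anywhere: the two groups emerge purely from how close the values are to each other.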

3. Tracking patterns
 A fundamental data mining technique, tracking patterns helps find hidden
patterns and monitor trends in the data to build valuable insights. Pattern
recognition is one of the most widely used data mining techniques.
 It helps understand customer behavior, buying patterns, and groups of
people with similar interests. This knowledge discovery helps find potential
customers, predict sales, and much more to support business growth.

4. Regression
 Regression analysis is a technique within supervised learning. It involves
training algorithms with input features and output labels to analyze the relation
between independent and dependent variables.

 Plotting a best-fit line or curve through the data is the ultimate aim of
applying a regression algorithm.
 Trained models, primarily a predictive modeling component, forecast the
outputs for new input data or fill gaps left by missing data. We evaluate
regression models using three metrics: variance, bias, and error.
 The types of regression include Linear Regression, Multiple Linear
Regression, Multivariate Linear Regression, Polynomial Regression, and
Ridge and Lasso Regression.
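
A best-fit line for simple linear regression can be computed directly with the least-squares formulas. The sketch below assumes one independent variable and uses made-up data points; the mean squared error at the end is one common way to express the "error" metric mentioned above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, mean_squared_error(xs, ys, slope, intercept))
```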
5. Outlier detection or anomaly detection
 Outlier detection, also called outlier mining or anomaly detection, is a data
mining technique used to identify data items that do not match the
expected behavior or predefined patterns.
 Discovering the outliers helps find the reasons behind their occurrence and
prepare for future occurrences. Outlier detection is used to detect credit card
fraud, network intrusions, and interruptions.
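
One simple way to flag values that do not fit the expected pattern is the interquartile range (IQR) rule, sketched below. The transaction amounts are invented, and the multiplier 1.5 is a common convention rather than a fixed rule; real fraud detection relies on far richer models.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the middle half of the data."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 20, 23, 500]
print(iqr_outliers(amounts))
```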

6. Sequential Patterns
 It is a data mining technique that discovers meaningful associations between
the data occurrences. It helps find a time-ordered series of events occurring
with a precise frequency to associate the dependency between them.
 Sequential pattern mining is particularly useful in applying mining to
transactional data for a specific period of time. Sequential Patterns has its use
in stock market analysis, forecasting natural disasters, DNA sequencing
research, and predicting possible attacks in cyber security.
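
As a much simplified sketch of sequential pattern mining, the code below counts how often one item appears before another across a set of time-ordered sequences. The purchase histories are hypothetical, and only pairs are considered; full algorithms also handle longer patterns and minimum-support pruning.

```python
from collections import Counter
from itertools import combinations

def ordered_pair_support(sequences):
    """Fraction of sequences in which item a occurs somewhere before item b."""
    counts = Counter()
    for seq in sequences:
        # combinations() yields index pairs with i < j, so order is preserved.
        pairs = {(seq[i], seq[j])
                 for i, j in combinations(range(len(seq)), 2)
                 if seq[i] != seq[j]}
        counts.update(pairs)
    return {pair: n / len(sequences) for pair, n in counts.items()}

# Hypothetical purchase histories, each listed in time order.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
]
pair_support = ordered_pair_support(histories)
print(pair_support[("phone", "charger")])
```

Unlike an association rule, the pair ("phone", "charger") is counted separately from ("charger", "phone"), because the order of events matters.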

7. Prediction
 Prediction is a valuable data mining technique that combines different data
mining techniques, including sequential patterns, classification, clustering,
trends, etc.

 This data mining technique involves the use of historical data and events in
sequence to understand the behavior and predict the occurrences of future
events.
 One of the most common applications of prediction is estimating the
profit or loss of a business by understanding its sales.

8. Association Rules
 Finding its grounds in statistics, the association rule is a data mining technique
that finds relational patterns between the variables.
 This data mining technique follows the law of association, indicating the
likelihood of occurrence of an event is dependent on the other data-driven
events. For example, one is likely to buy car accessories from the market after
buying a car.
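
The strength of a rule such as "car implies accessories" is usually expressed with support, confidence, and lift. The sketch below computes all three on a handful of invented customer baskets; the item names and data are assumptions made for illustration.

```python
def support(transactions, items):
    """Share of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is overall; above 1 suggests a positive association."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical purchases, one set per customer.
baskets = [
    {"car", "accessories"},
    {"car", "accessories", "insurance"},
    {"car", "insurance"},
    {"bike"},
    {"car", "accessories"},
    {"bike", "helmet"},
]
print(support(baskets, {"car", "accessories"}))
print(confidence(baskets, {"car"}, {"accessories"}))
print(lift(baskets, {"car"}, {"accessories"}))
```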

9. Visualisation
 Data visualization is a data mining technique granting people access to
insights based on visual sensory perceptions.
 The dynamic visual patterns allow users to unveil the trends in the data to
understand their business information.

10. Neural networks


 Neural networks are artificial intelligence models, and the basis of deep
learning, that tackle complex problems by functioning loosely like the
human brain.
 Neural networks learn by example and build their own conclusions for certain
sets of inputs based on previous data.

11. Long-term memory processing


 A data mining technique that drills into the large volumes of historical data
stored in data warehouses to analyze it over a longer period of time.
 It is majorly used in analyzing time-based information trends such as weather
data.

Data Mining Architecture
The significant components of data mining systems are a data source, data mining
engine, data warehouse server, the pattern evaluation module, graphical user
interface, and knowledge base.

Data Source:
 The actual source of data is the database, data warehouse, World Wide
Web (WWW), text files, and other documents. You need a huge amount of
historical data for data mining to be successful. Organizations typically
store data in databases or data warehouses.
 Data warehouses may comprise one or more databases, text files,
spreadsheets, or other repositories of data. Sometimes, even plain text files
or spreadsheets may contain information. Another primary source of data is
the World Wide Web or the internet.

Different processes:
 Before passing the data to the database or data warehouse server, the data
must be cleaned, integrated, and selected.
 As the information comes from various sources and in different formats,
it can't be used directly for the data mining procedure because the data
may not be complete and accurate.
 So, the data first needs to be cleaned and unified. More information
than needed will be collected from various data sources, and only the data
of interest will have to be selected and passed to the server.
 These procedures are not as easy as we think. Several methods may be
performed on the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server:


 The database or data warehouse server consists of the original data that
is ready to be processed.
 Hence, the server is responsible for retrieving the relevant data based on
the user's data mining request.

Data Mining Engine:


 The data mining engine is a major component of any data mining
system. It contains several modules for operating data mining tasks,
including association, characterization, classification, clustering,
prediction, time-series analysis, etc.
 In other words, the data mining engine is the core of the data mining
architecture.
 It comprises instruments and software used to obtain insights and
knowledge from data collected from various data sources and stored
within the data warehouse.

Pattern Evaluation Module:


 The pattern evaluation module is primarily responsible for measuring how
interesting a discovered pattern is, using a threshold value. It collaborates
with the data mining engine to focus the search on interesting patterns.
 This module commonly employs interestingness measures that cooperate
with the data mining modules to focus the search on interesting patterns.
 It might use an interestingness threshold to filter out discovered patterns.
 Alternatively, the pattern evaluation module might be integrated
with the mining module, depending on the implementation of the data
mining techniques used.
 For efficient data mining, it is highly recommended to push the
evaluation of pattern interestingness as deep as possible into the mining
procedure, to confine the search to only the interesting patterns.

Graphical User Interface:


 The graphical user interface (GUI) module communicates between the
data mining system and the user.
 This module helps the user to easily and efficiently use the system
without knowing the complexity of the process.
 This module cooperates with the data mining system when the user
specifies a query or a task and displays the results.

Knowledge Base:
 The knowledge base is helpful throughout the data mining process. It
can be used to guide the search or to evaluate the interestingness of the
result patterns.
 The knowledge base may even contain user views and data from user
experiences that might be helpful in the data mining process.
 The data mining engine may receive inputs from the knowledge base to
make the result more accurate and reliable.
 The pattern evaluation module regularly interacts with the knowledge
base to get inputs and also to update it.

TEXT MINING
MEANING:
 Text mining (also known as text analysis), is the process of transforming
unstructured text into structured data for easy analysis.
 Text mining uses natural language processing (NLP), allowing machines to
understand the human language and process it automatically.
 Text mining is a process of extracting useful information and nontrivial
patterns from a large volume of text databases.

Text Mining Techniques


1. Information Retrieval
 In information retrieval, we process the available documents and text data
into a structured form so that we can apply different pattern recognition
and analytical processes.
 It is a process of extracting relevant and associated patterns according to a
given set of words or text documents. For this, we use processes such as
tokenization of the document, or stemming, in which we extract the base
(root) form of each word.

2. Information Extraction
It is a process of extracting meaningful words from documents.
 Feature Extraction
In this process, we try to develop some new features from existing ones.
This objective can be achieved by parsing an existing feature or
combining two or more features based on some mathematical operation.
 Feature Selection
In this process, we try to reduce the dimensionality of the dataset which
is generally a common issue while dealing with the text data by selecting a
subset of features from the whole dataset.

3. Natural Language Processing
Natural Language Processing includes tasks that are accomplished by using
Machine Learning and Deep Learning methodologies. It concerns the automatic
processing and analysis of unstructured text information.
 Named Entity Recognition (NER): Identifying and classifying named
entities such as people, organizations, and locations in text data.
 Sentiment Analysis: Identifying and extracting the sentiment (e.g. positive,
negative, neutral) of text data.
 Text Summarization: Creating a condensed version of a text document that
captures the main points.

The Process of Text Mining


The process of text mining mainly involves five steps:
1. Text Pre-processing:
The raw text data obtained will be unstructured in nature. First, it needs to be
cleaned. There are a few steps in this pre-processing.
 Text Normalization: This process involves the conversion of the data
into a standard format. Here, the whole text is converted into upper or lower
case, and numbers, punctuation, accent marks and other diacritics, extra
white spaces, and stop words are removed. Python can be used to implement this.
 Tokenization: In this process, the whole text is split into smaller parts
called tokens. The numbers, punctuation marks, words, etc. can be
considered as tokens. Natural Language Toolkit (NLTK), Spacy and
Gensim are a few tools that can be used for tokenization.
 Stemming: It is the process of reducing words to their stem, base or
root form. The two main algorithms used for this process are the Porter
stemming algorithm and the Lancaster stemming algorithm. NLTK as well as
Snowball can be used for this.
 Lemmatization: The aim of lemmatization, like stemming, is to reduce
inflectional forms to a common base form. But, as compared to stemming,
lemmatization does not simply remove the inflections. Instead, it uses
lexical knowledge bases to get the correct base forms of words.
 Part-of-speech Tagging: It aims to assign a part of speech to each word
of a given text based on its meaning and context. NLTK, spaCy, and Pattern
are a few tools that can be used for this.
 Chunking: It is a natural language process that identifies constituent parts
of sentences and links them to higher order units that have discrete
grammatical meanings. NLTK is a good tool for this.
 Named Entity Recognition (NER): It aims to find named entities in
text and classify them into pre-defined categories. NLTK, spaCy can be
used for this.
 Relationship Extraction: This helps in identifying relations among
named entities like people, organizations, etc. It allows structured
information to be extracted from unstructured sources such as raw text.
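
The first pre-processing steps (normalization, tokenization, and stop-word removal) can be sketched with the Python standard library alone. The stop-word list below is a tiny illustrative sample, and splitting on whitespace is a deliberately naive tokenizer; in practice a library such as NLTK or spaCy would normally handle these steps, along with stemming and lemmatization, which are not shown here.

```python
import re
import string

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def normalize(text):
    """Lower-case the text and strip digits, punctuation, and extra white space."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Naive tokenizer: split on white space."""
    return text.split()

def preprocess(text):
    """Normalize, tokenize, then drop stop words."""
    return [token for token in tokenize(normalize(text)) if token not in STOP_WORDS]

print(preprocess("The price of Data Mining tools is rising, in 2023!"))
```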

2. Text Transformation:
This process mainly involves the document representation by the text it contains
and the number of occurrences. There are mainly two approaches for this step.
 Bag of Words: A text is represented as a bag of its words (multi set),
disregarding grammar and even word order, but keeping multiplicity.
 Vector Space: In this model, a document is converted into a vector of index
terms derived from words. Each dimension of the vector corresponds to a term
that appears in their text. Its weight records the importance of the term to the
text.
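
Both representations can be illustrated together: each document is first reduced to a bag of words, and the bags are then laid out as vectors over a shared vocabulary. The two example sentences are made up, and raw term counts are used as weights here; weighting schemes such as TF-IDF are a common alternative.

```python
from collections import Counter

def bag_of_words(text):
    """Count each word, ignoring grammar and word order but keeping multiplicity."""
    return Counter(text.lower().split())

def to_vectors(texts):
    """Build a shared vocabulary and one count vector per document."""
    bags = [bag_of_words(t) for t in texts]
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, vectors

docs = ["data mining finds patterns", "text mining finds text patterns"]
vocabulary, vectors = to_vectors(docs)
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary term, so documents of different lengths become comparable rows of equal size.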

3. Feature Selection:
 Feature selection is also known as attribute selection or variable selection. It
is the selection, from the available variables, of the most relevant features:
those that give the most information about the variable you want to predict.

 The irrelevant features can increase the complexity and decrease the
accuracy of the analysis.
 Pearson correlation coefficient, chi-squared test, recursive feature
elimination, lasso regression, and tree-based algorithms are a few methods
that can be used for this. Python can be used for all the analysis.
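
As one example of these methods, features can be ranked by the absolute value of their Pearson correlation with the target. The feature names and numbers below are hypothetical (per-document counts of two terms and a complaint score), and correlation only captures linear relationships, so this is a first filter rather than a complete selection procedure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rank_features(features, target):
    """Sort feature names by |correlation with the target|, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical term counts per document and a complaint score to predict.
features = {
    "term_refund": [1, 2, 3, 4, 5],
    "term_the": [2, 1, 2, 1, 2],
}
complaint_score = [2, 4, 6, 8, 10]
ranked = rank_features(features, complaint_score)
print(ranked)
```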

4. Data Mining:
 Here we combine the process of text mining with the traditional data mining
techniques. Once the data is structured after the above processes, the classic
data mining techniques are applied on the data to retrieve the information.
 These techniques include classification, clustering, regression, outlier
detection, sequential patterns, prediction, and association rules.

5. Evaluation:
 After the data mining techniques are applied, we get an end result. That
result is to be evaluated and checked for the accuracy in the prediction.
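
A minimal sketch of such a check is shown below. The labels are made up, and precision and recall are included alongside accuracy as an assumption about what a typical evaluation would report.

```python
def evaluate(actual, predicted, positive):
    """Compare predicted labels with actual labels for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical results of a text classifier on six e-mails.
actual    = ["spam", "ham", "spam", "ham", "ham",  "spam"]
predicted = ["spam", "ham", "ham",  "ham", "spam", "spam"]
metrics = evaluate(actual, predicted, positive="spam")
```

Four of the six predictions match, so the accuracy is 4/6; two of the three predicted spam messages are really spam, and two of the three real spam messages are caught.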

Applications of Text Mining


1. Risk Management:
 The vast amount of textual data that is available helps
companies take a deeper look at their health and performance. Risk
analysis is an important factor in the development of every company.
Insufficient risk analysis can result in major failures for the company.
 Text mining can enable the company to mitigate the risk factors and also
can help in deciding which firms to invest in, which people to give loans to
and so much more by analyzing the documents and profiles of various
clients.
2. Customer Care Services:
 Text mining as well as natural language processing has been extensively
used in order to enhance the customer experience.

 Nowadays, chatbots that mimic human customer care officers are
used on many websites in order to make the user experience more
customized.
 Text mining has been used in order to provide a rapid, automated response
to the customers, which has reduced their reliance on the call center
operators to solve the problems.
3. Personalized Advertising:
 The field of digital advertising has been revolutionized by the development
of text and web mining and this is one of the latest applications of text
mining.
 The text data related to everything a person types or searches online is shared
with other companies, which in turn show ads that have a higher
probability of being clicked and converted into a sale.
4. Spam Filtering:
 E-mail is one of the most widely used means of official communication. It has
a very wide range of uses, but its darker side is the spam mail that
infests users' inboxes.
 Spam mails use up a lot of storage and can also be a source
through which viruses or scams enter. Various companies use
intelligent text mining software as well as traditional keyword
matching techniques to identify and filter spam mails.
5. Social Media Analysis and Crime Prevention:
 Social media has been popular for a long time, and millions of ordinary
users use it as a means of communication.
 The anonymous nature of the internet has made it easy for many criminals to
plan their strategies online. Advanced text mining software has made it
possible to distinguish potentially threatening messages from normal
ones.
 Also, online text analysis can be a good method to analyse what is ‘hot’ or
trending in a particular time. This can be highly beneficial for various
commercial companies.

6. Digital Library
 Various text mining strategies and tools are used to extract patterns
and trends from journals and proceedings that are stored in text database
repositories.
 These information resources support research. Libraries
are a good source of text data in digital form. Text mining offers a novel way
of retrieving useful data and makes it possible to access
millions of records online.
7. Academic and Research Field
 In the education field, different text-mining tools and strategies are
utilized to examine the instructive patterns in a specific region/research
field.
 The main purpose of text mining in the research field is to help
discover and arrange research papers and relevant material from various
fields on one platform.
8. Life Science
 Life science and healthcare industries are producing an enormous volume
of textual and mathematical data regarding patient records, sicknesses,
medicines, symptoms, and treatments of diseases, etc.
 Filtering relevant data and text from a biological data repository to
support decisions is a major challenge. Clinical records contain variable data
that is unpredictable and lengthy. Text mining can help to manage such kinds
of data.
9. Business Intelligence
 Text mining plays an important role in business intelligence, helping
organizations and enterprises analyze their customers and
competitors to make better decisions.
 It gives an accurate understanding of the business and provides data on how to
improve consumer satisfaction and gain competitive benefits. Text
mining tools such as IBM text analytics are used for this purpose.

Text Mining Approaches in Data Mining
The following text-mining techniques are applied in data mining:
1. Automatic Document Classification Analysis
2. Keyword-based Association Analysis
1. Automatic Document Classification Analysis
This technique is used to automatically classify large collections of online text
documents, such as emails and web pages. As document databases are not arranged
according to attribute-value pairs, the categorization of text documents differs from
the classification of relational data.
2. Keyword-Based Association Analysis
It gathers groups of terms or keywords that frequently appear together and then
determines the correlation between them. The text data is first preprocessed by
parsing, stemming, removing stop words, etc. After the data is preprocessed,
association mining methods are applied. Since no human effort is necessary in this
case, fewer undesirable results are obtained, and the execution time is shorter.
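
The idea can be sketched as follows, assuming a tiny illustrative stop-word list and four made-up documents. Stemming is skipped for brevity, and the 50% support threshold is an arbitrary choice.

```python
from collections import Counter
from itertools import combinations

STOP_WORDS = {"the", "a", "is", "on", "my", "was", "and", "for"}  # illustrative only

docs = [
    "the loan interest rate is high",
    "interest rate on my loan was reduced",
    "the credit card fee is high",
    "loan and credit card interest",
]

def keywords(text):
    """Preprocessing: lower-case, split on spaces, drop stop words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}

keyword_sets = [keywords(d) for d in docs]

# Count in how many documents each pair of keywords appears together.
pair_counts = Counter()
for keyword_set in keyword_sets:
    pair_counts.update(combinations(sorted(keyword_set), 2))

def support(pair):
    """Share of documents that contain both keywords of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(docs)

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n / len(docs) >= 0.5}
```

With this data, "interest" and "loan" appear together in three of the four documents, so the pair has a support of 0.75.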

Issues in Text Mining


Several issues arise during the text mining process:
1. The efficiency and effectiveness of decision-making can be affected.
2. Uncertainty can arise at an intermediate stage of text mining. In
the pre-processing stage, rules and guidelines are defined to
normalize the text, which makes the text mining process efficient. Before
pattern analysis is applied to a document, the unstructured data must be
converted into an intermediate structure.
3. Sometimes the original message or meaning is changed by this alteration.
4. Many algorithms and techniques support multi-language text, which may
create ambiguity in the meaning of the text. This problem can lead
to false-positive results.
5. The use of synonyms, polysemy, and antonyms in the document text
creates problems for text mining tools, which treat such words as belonging
to the same context. It is difficult to categorize such kinds of text/words.

Advantages of Text Mining
1. Large Amounts of Data:
Text mining allows organizations to extract insights from large amounts
of unstructured text data. This can include customer feedback, social media
posts, and news articles.
2. Variety of Applications:
Text mining has a wide range of applications, including sentiment
analysis, named entity recognition, and topic modeling. This makes it a
versatile tool for organizations to gain insights from unstructured text data.
3. Improved Decision Making:
Text mining can be used to extract insights from unstructured text data,
which can be used to make data-driven decisions.
4. Cost-effective:
Text mining can be a cost-effective way to extract insights from
unstructured text data, as it eliminates the need for manual data entry.
5. Broader benefits:
Cost reductions, productivity increases, the creation of new
services, and new business models are among the broader economic
advantages attributed to text mining.

Disadvantages of Text Mining


1. Complexity:
Text mining can be a complex process that requires advanced skills in
natural language processing and machine learning.
2. Quality of Data:
The quality of text data can vary, which can affect the accuracy of the
insights extracted from text mining.
3. High Computational Cost:
Text mining requires high computational resources, and it may be
difficult for smaller organizations to afford the technology.

4. Limited to Text Data:
Text mining is limited to extracting insights from unstructured text data
and cannot be used with other data types.
5. Noise in text mining results:
Text mining of documents may result in mistakes. It is possible to find
false links or to miss real ones. In most situations, if the noise (error rate) is
sufficiently low, the benefits of automation outweigh the risk of making
more mistakes than a human reader would.
6. Lack of transparency:
Text mining is frequently viewed as a mysterious process where large
corpora of text documents are input and new information is produced. Text
mining is in fact opaque when researchers lack the technical know-how or
expertise to comprehend how it operates, or when they lack access to corpora
or text mining tools.

Types of Data in Text Mining


 Unstructured data:
This data lacks a predetermined data structure. It may contain text taken
from reviews of products or social media platforms, as well as rich media
formats, including audio and video files.
 Structured Data:
Data that is organized into a tabular format with rows and
columns is said to be structured. This arrangement makes it simpler to
store and handle the data for analysis and for machine learning algorithms.
Input data such as phone numbers, addresses, and names are examples of
structured data.
 Semi-structured:
As the name implies, this data is a combination of structured and
unstructured information. It has some organization but not
enough structure to satisfy the requirements of a relational database. XML, JSON,
and HTML files are examples of semi-structured data.
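
The difference between the three types can be seen in one small example. The JSON record below is invented: its named fields can be flattened into a structured row, while the free-text comment remains unstructured and needs text mining.

```python
import json

# Hypothetical semi-structured record: named fields plus free text.
raw = """
{
  "review_id": 101,
  "customer": {"name": "R. Kumar", "city": "Chennai"},
  "rating": 4,
  "comment": "Delivery was quick, but the packaging was damaged."
}
"""

record = json.loads(raw)

# Structured part: fixed fields flattened into one table row.
row = {
    "review_id": record["review_id"],
    "customer_name": record["customer"]["name"],
    "customer_city": record["customer"]["city"],
    "rating": record["rating"],
}

# Unstructured part: free text that must be tokenized before it can be analyzed.
comment_tokens = record["comment"].lower().replace(",", "").replace(".", "").split()
```

The `row` dictionary could be stored directly in a relational table, whereas `comment_tokens` is only the starting point for the text mining steps described earlier.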

WEB MINING
Meaning:
 Web mining is the process of extracting valuable information from the vast
data available on the World Wide Web.
 Web mining is the process of discovering patterns, structures, and
relationships in web data. It involves using data mining techniques to analyze
web data and extract valuable insights.

What is Web Mining?


 Web mining refers to the process of discovering and extracting useful
information from a large amount of data available on the World Wide Web. It
involves applying various data mining techniques to web data to identify
patterns, trends, and relationships. Web mining is a multidisciplinary field that
combines techniques from data mining, machine learning, artificial
intelligence, statistics, and information retrieval.
 One example of web mining is to analyze website traffic and user behavior.
By analyzing clickstream data and other user interactions with a website,
organizations can gain insights into how users navigate their site, what content
is most popular, and where users are dropping off. This information can be
used to optimize website design and improve user experience.
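
A minimal sketch of this kind of clickstream analysis is given below. The click records are invented, and the drop-off rate is defined here (as an assumption) as the share of a page's views that were the last click of a session.

```python
from collections import Counter

# Hypothetical clickstream: (session id, page), listed in visit order.
clicks = [
    ("s1", "/home"), ("s1", "/products"), ("s1", "/cart"), ("s1", "/checkout"),
    ("s2", "/home"), ("s2", "/products"),
    ("s3", "/home"), ("s3", "/products"), ("s3", "/cart"),
    ("s4", "/products"), ("s4", "/cart"),
]

# Rebuild each user's path through the site.
sessions = {}
for session_id, page in clicks:
    sessions.setdefault(session_id, []).append(page)

page_views = Counter(page for _, page in clicks)              # most popular content
exit_pages = Counter(path[-1] for path in sessions.values())  # where users leave

drop_off = {page: exit_pages[page] / views for page, views in page_views.items()}
```

In this data the product page is viewed most often, and two of the three cart views end the session, which would point to the cart page as a place to investigate.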

Applications of Web Mining:


The applications of web mining are wide-ranging and include:
1. Personalized marketing:
Web mining can be used to analyze customer behavior on websites and social
media platforms. This information can be used to create personalized marketing
campaigns that target customers based on their interests and preferences.
2. E-commerce
Web mining can be used to analyze customer behavior on e-commerce
websites. This information can be used to improve the user experience and
increase sales by recommending products based on customer preferences.

3. Search engine optimization:
Web mining can be used to analyze search engine queries and search engine
results pages (SERPs). This information can be used to improve the visibility of
websites in search engine results and increase traffic to the website.
4. Fraud detection:
Web mining can be used to detect fraudulent activity on websites. This
information can be used to prevent financial fraud, identity theft, and other types
of online fraud.
5. Sentiment analysis:
Web mining can be used to analyze social media data and extract sentiment
from posts, comments, and reviews. This information can be used to understand
customer sentiment towards products and services and make informed business
decisions.
6. Web content analysis:
Web mining can be used to analyze web content and extract valuable
information such as keywords, topics, and themes. This information can be used
to improve the relevance of web content and optimize search engine rankings.
7. Customer service:
Web mining can be used to analyze customer service interactions on websites
and social media platforms. This information can be used to improve the quality
of customer service and identify areas for improvement.
8. Healthcare:
Web mining can be used to analyze health-related websites and extract
valuable information about diseases, treatments, and medications. This
information can be used to improve the quality of healthcare and inform medical
research.

Types of Web Mining
1. Web Content Mining
2. Web Structure Mining
3. Web Usage Mining

1. Web Content Mining


 Web content mining is the process of extracting useful information from
web pages, including text, images, and multimedia content. This
involves techniques such as text mining, natural language processing,
and image analysis.
 Web content mining can be used to extract structured and unstructured
data from web pages, including product descriptions, reviews, and user-
generated content. The extracted information can be used for various
purposes, such as sentiment analysis, product recommendation, and
opinion mining.
2. Web Structure Mining
 Web structure mining focuses on analyzing the web structure and the
relationships between web pages. This includes analyzing links between
pages, identifying communities of pages, and detecting patterns in
website design.
 Web structure mining techniques are used to improve search engine
results, identify authoritative pages, and detect web spam.
3. Web Usage Mining
 Web usage mining involves analyzing user behavior on the web,
including clickstream data, search queries, and other interactions with
web pages.
 Web usage mining can help identify user preferences, behavior patterns,
and trends. This information can be used to personalize content, improve
website design, and target advertising. Web usage mining can also be
used for security purposes, such as detecting fraud and identifying
potential security threats.
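
Web structure mining can be illustrated with a simplified version of PageRank, a link-analysis method that scores a page by the links pointing to it. The four-page link graph is invented, and the damping factor of 0.85 and the fixed number of iterations are conventional but assumed values.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict: page -> list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical site structure; every link target is also a key of the dict.
links = {
    "home": ["products", "about"],
    "products": ["home", "checkout"],
    "about": ["home"],
    "checkout": ["home"],
}
ranks = pagerank(links)
```

The home page receives links from every other page and therefore ends up with the highest score, which matches the idea of an authoritative page.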

Process of Web Mining
The process of web mining typically involves the following steps -
1. Data collection –
Web data is collected from various sources, including web pages,
databases, and APIs.
2. Data pre-processing –
The collected data is pre-processed to remove irrelevant information,
such as advertisements and duplicate content.
3. Data integration –
The pre-processed data is integrated and transformed into a structured
format for analysis.
4. Pattern discovery –
Web mining techniques are applied to identify patterns, trends, and
relationships.
5. Evaluation –
The discovered patterns are evaluated to determine their significance
and usefulness.
6. Visualization –
The analysis results are visualized through graphs, charts, and other
visualizations.
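
The first four steps can be sketched on a very small scale. The HTML strings below stand in for pages that would normally be downloaded, and the pattern discovered is a plain term frequency count; both are simplifying assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Collect visible words and link targets, skipping script and style blocks."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.lower().split())

# Step 1, data collection: hypothetical pages in place of fetched HTML.
laptop_page = ("<html><body><h1>Laptop deals</h1><script>track()</script>"
               "<a href='/laptops'>Cheap laptop offers</a></body></html>")
phone_page = "<html><body><p>Phone deals this week</p><a href='/phones'>Phones</a></body></html>"
pages = [laptop_page, laptop_page, phone_page]

# Step 2, pre-processing: drop duplicate pages (scripts are skipped by the parser).
unique_pages = list(dict.fromkeys(pages))

# Step 3, integration: one structured record per page.
records = []
for html in unique_pages:
    parser = TextAndLinks()
    parser.feed(html)
    records.append({"words": parser.words, "links": parser.links})

# Step 4, pattern discovery: which terms occur most often across pages.
term_counts = Counter(word for record in records for word in record["words"])
```

Evaluation and visualization would follow from `term_counts`, for example by charting the most frequent terms.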

Difference between data mining and web mining

Data Mining                               Web Mining

Data mining refers to the process of      Web mining refers to the process of
extracting useful information,            extracting information from web
patterns, and trends from huge data       documents and services, hyperlinks, and
sets.                                     server logs.

Data engineers and data scientists can    Data scientists, data engineers, and
do data mining.                           data analysts can do web mining.

Data mining is based on pattern           Web mining is based on pattern
identification from data available in     identification from web data.
any system.

Tools used by data mining are machine     Tools used by web mining are PageRank,
learning algorithms.                      Scrapy, and Apache logs.

Applications of data mining are           It uses the same process but on the web
weather forecast, market analysis,        using the web documents.
fraud detection, etc.

Skills needed for data mining are         Skills needed for web mining are
machine learning algorithms,              application-level knowledge, probability,
probability, statistics.                  statistics.

Advantages of Web Mining


1. Improves customer experience
Web mining helps understand customer behavior, which allows
businesses to tailor their services or products to better meet customer needs,
leading to happier users.
2. Enhances website personalization
By analyzing user interactions, businesses can customize website
content for individual visitors, making their online journey more relevant and
engaging.
3. Increases sales and conversions
Through understanding what customers are looking for, companies can
suggest products more effectively, boosting the likelihood of purchases and
improving their bottom line.

4. Detects fraudulent activities
By monitoring user behavior, web mining can identify unusual patterns
that may indicate fraudulent activities, helping to protect both the business and
its customers.
5. Uncovers data patterns and trends
Web mining reveals insights into how users interact with a site,
enabling businesses to spot emerging trends and make data-driven decisions to
stay ahead of the market.

Disadvantages of Web Mining


1. Privacy concerns for users
Web mining can invade personal privacy as it often involves collecting
information about individuals without their consent. This can include tracking
online behavior and preferences.
2. High computational costs
It requires significant computing power to process and analyze the vast
amounts of data involved, which can be expensive and energy-intensive.
3. Data quality and relevance issues
There’s a risk of gathering inaccurate or irrelevant data, which can lead
to incorrect conclusions and decisions being made by businesses or other
users.
4. Potential for misuse
The information obtained through web mining can be used in harmful
ways, such as for identity theft, discrimination, or spreading false information.
5. Difficulty in data interpretation
Interpreting the data correctly is challenging because the context and
nuances of online information can be complex, leading to potential
misunderstandings or oversimplifications.

Some Techniques in Web Usage Mining
1. Association Rules:
 Association rule mining is the most widely used technique in web usage
mining. It focuses on relations among the web pages that
frequently appear together in users’ sessions.
 Pages that are accessed together are grouped into a single server
session.
 Association rules help in restructuring websites using the access
logs.
 Access logs generally contain information about the requests
arriving at the web server.
 The major drawback of this technique is that it can produce a very large
number of rules, some of which may be completely
inconsequential and of no future use.
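
A small sketch of rule discovery on session data follows. The sessions are invented, only rules between single pages are considered, and the support and confidence thresholds are arbitrary; raising them is the usual way to limit the number of inconsequential rules.

```python
from itertools import permutations

# Hypothetical sessions: the set of pages visited in each one.
sessions = [
    {"/home", "/laptops", "/cart"},
    {"/home", "/laptops"},
    {"/home", "/phones"},
    {"/laptops", "/cart"},
    {"/home", "/laptops", "/cart"},
]

def support(pages):
    """Share of sessions that contain every page in the given set."""
    return sum(1 for session in sessions if pages <= session) / len(sessions)

def rules(min_support=0.4, min_confidence=0.7):
    """Find rules 'visitors of page a also visit page b' above both thresholds."""
    all_pages = sorted(set().union(*sessions))
    found = []
    for a, b in permutations(all_pages, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        confidence = pair_support / support({a})  # estimate of P(b | a)
        if confidence >= min_confidence:
            found.append((a, b, pair_support, confidence))
    return found

discovered = rules()
```

With this data, every session that reaches the cart also contains the laptops page, so the rule from "/cart" to "/laptops" has a confidence of 1.0.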

2. Classification:
 Classification maps a particular record to one of several predefined
classes.
 The main goal in web usage mining is to develop profiles
of users/customers that belong to a particular class/category. This
requires extracting the features that are best suited to the
associated class.
 Classification can be implemented with various algorithms, including
Support Vector Machines, K-Nearest Neighbors, Logistic
Regression, and Decision Trees.
 For example, given customers' purchase history over the last 6 months,
each customer can be classified into a
frequent or non-frequent class/category. Other cases may involve more
than two classes.
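
The frequent/non-frequent example can be sketched with K-Nearest Neighbors, one of the algorithms listed above. The two features (orders in the last 6 months and visits per month) and all values are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical labelled customers: (orders in last 6 months, visits per month).
training = [
    ((12, 20), "frequent"),
    ((10, 15), "frequent"),
    ((9, 18), "frequent"),
    ((1, 2), "non-frequent"),
    ((2, 4), "non-frequent"),
    ((0, 1), "non-frequent"),
]

def classify(customer, k=3):
    """Assign the majority class among the k closest training customers."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], customer))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

new_customer_class = classify((11, 17))
```

A customer with 11 orders and 17 visits lies closest to the frequent group and is labelled accordingly. In practice the features would usually be scaled first so that no single feature dominates the distance.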

3. Clustering:
 Clustering is a technique for grouping together a set of items that have
similar features/traits. There are mainly 2 types of clusters: the
usage cluster and the page cluster.
 The clustering of pages can be readily performed based on the usage data.
In usage-based clustering, items that are commonly accessed/purchased
together can be automatically organized into groups.
 The clustering of users establishes groups of users who exhibit similar
browsing patterns. In page clustering, the basic idea is to group pages so
that information can be found quickly across the web pages.
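
User clustering can be sketched with k-means, used here as one possible clustering algorithm. Each user is described by two assumed features (pages per session and minutes per session), and the first k points are taken as starting centroids so that the result is repeatable.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means; starts from the first k points for a repeatable result."""
    centroids = list(points[:k])
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(values) / len(members) for values in zip(*members))
    return assignment, centroids

# Hypothetical users: (pages per session, minutes per session).
users = [(2, 1.0), (20, 15.0), (3, 1.5), (18, 12.0), (1, 0.5), (22, 14.0)]
labels, centers = kmeans(users, k=2)
```

The users fall into a group of casual visitors and a group of heavy browsers, which could then be served different content.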

SPATIAL MINING
 Spatial data mining refers to the process of discovering hidden patterns and
relationships in geospatial data. Geospatial data includes information that is
associated with specific locations or geographic coordinates.

What is Spatial Data Mining?


 Spatial data mining is a specialized subfield of data mining that deals with
extracting knowledge from spatial data. Spatial data refers to data that is
associated with a particular location or geography.
 Examples of spatial data include maps, satellite images, GPS data, and other
geospatial information.
 Spatial data mining involves analyzing and discovering patterns,
relationships, and trends in this data to gain insights and make informed
decisions.

Spatial data mining tasks


1. Classification:
Classification determines a set of rules that assign an object to a class
based on its attributes.

2. Association rules:
Association rules are derived from the data sets and describe
patterns that occur frequently in the database.
3. Characteristic rules:
Characteristic rules describe some parts of the data set.
4. Discriminant rules:
As the name suggests, discriminant rules describe the differences
between two parts of the database, such as comparing two cities by their
employment rates.

What is Spatial Data Mining?


 Spatial Data Mining is a data mining technique that is used to extract
information from the data that belongs to a particular location. This type of
data is also known as Spatial Data.
 Spatial Data contains information about the boundaries of objects or
geographic coordinates. Examples are satellite data, maps, GPS coordinates,
etc.

Types of Spatial Data


Spatial Data is divided into multiple categories on the basis of their characteristics
and representation. Some of the types of spatial data are:
1. Point Data
The location of any object can be represented by its coordinates in
space. Point Data consists of individual points in the form of X, Y
coordinates (longitude and latitude). Examples of point data are
the location of a landmark, specific addresses, or GPS coordinates.
2. Line Data
The Line Data contains a sequence of points in the form of lines or
curves. It is used to represent the linear entities in space. Examples of line data
are rivers, pipelines, roads, etc.

3. Polygon Data
Polygon Data represents areas or regions in space. It consists of
closed shapes, which are formed by connecting multiple points, with the last
point connected to the first to close the shape. Examples of polygon data are
traffic zones, land parcels, boundaries of lakes or forests, etc.
4. Raster Data
Raster Data represents space as a grid of cells (pixels), where
each cell of the grid stores an attribute value. It is used to represent
data such as satellite imagery, remote sensing output, etc.
5. Image Data
Image Data, as the name suggests, consists of spatial data in the form of
images. It is most often used for object detection, capturing visual information
about Earth, land cover classification, etc.
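
The point, line, polygon, and raster types map naturally onto simple data structures. The sketch below assumes planar X, Y coordinates in arbitrary units (not latitude and longitude on a sphere) and uses made-up values; the helper function names are illustrative.

```python
import math

def line_length(points):
    """Length of a line given as a sequence of (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def polygon_area(vertices):
    """Shoelace formula; the last vertex is implicitly joined to the first."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def point_in_polygon(point, vertices):
    """Ray casting: an odd number of edge crossings means the point is inside."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

landmark = (2.0, 1.0)                                       # point data
road = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]                # line data
parcel = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]   # polygon data
elevation = [[1, 1, 2], [1, 3, 3], [2, 3, 4]]               # raster data: one value per cell

road_length = line_length(road)
parcel_area = polygon_area(parcel)
landmark_on_parcel = point_in_polygon(landmark, parcel)
```

Questions such as "which parcel contains this landmark" are the building blocks of the spatial mining tasks described earlier.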

Applications of Spatial Data Mining


1. Urban Planning
Spatial Data Mining is used by urban planners to analyze and improve
urban dynamics. It can be used to guide urban growth, improve
transportation systems, and refine decisions about land use.
2. Public Health
Spatial Data Mining plays an important role in public health research. It
is used to develop strategies to identify diseases, track the spread of
infections, and optimize healthcare resources.
3. Transportation
Spatial Data Mining can be used to identify traffic patterns, prevent
congestion, manage the transportation network, and optimize transportation
routes.
4. Environmental Management
Spatial Data Mining also contributes to environmental management by
detecting changes in the environment, identifying the land at risk, conserving
water and biodiversity, and monitoring natural resources.

5. Crime Analysis
Spatial Data Mining can be used to identify crime hotspots, understand
crime patterns and develop proper strategies to prevent crimes and hence
improve public safety.

Benefits of Spatial Data Mining


 The use of spatial data mining offers several benefits to organizations.
 Firstly, it enables businesses to better understand their customers by analyzing
their location-based preferences, behaviors, and demographics. This
information can be used to develop targeted marketing campaigns and
improve customer engagement.
 Secondly, spatial data mining helps organizations optimize their operations by
identifying patterns and trends in the movement of goods, vehicles, or people.
This can lead to more efficient routing, reduced costs, and improved resource
allocation.
 Finally, spatial data mining enables organizations to make data-driven
decisions based on the geographical context, allowing them to mitigate
risks, identify opportunities, and plan for future growth.

Challenges in Spatial Data Mining


 While spatial data mining offers excellent potential, it also comes with its own
set of challenges. One of the main challenges is the complexity and size of
geospatial data. Spatial datasets can be massive and heterogeneous, making
them difficult to process and analyze. Additionally, spatial data often comes
with inherent uncertainties and errors, which can impact the accuracy of the
analysis.
 Another challenge is the need for specialized skills and expertise in data
mining and GIS. Organizations need professionals who can effectively
integrate these two domains and apply the appropriate techniques and
algorithms. Lastly, privacy and ethical concerns also arise when dealing with
location-based data. Organizations must ensure that they handle and analyze
geospatial data in a responsible and secure manner.

Tools and Software for Spatial Data Mining
 There are several tools and software packages available to support spatial
data mining. One popular tool is ArcGIS, a comprehensive GIS platform that
offers data management, powerful visualization, and spatial online analytical
processing capabilities.
 Another widely used tool is QGIS, an open-source GIS software that provides
similar functionalities to ArcGIS. For data mining tasks, tools like RapidMiner
and KNIME offer spatial extensions that enable integrating spatial data into
the data mining workflow. These tools provide a user-friendly interface, a
wide range of algorithms, and advanced visualization options to facilitate
geospatial data analysis.

Techniques and Algorithms Used in Spatial Data Mining


 There are several techniques and algorithms that are commonly used in spatial
data mining.
 One such technique is clustering, which groups similar spatial objects together
based on their proximity or similarity. This can be useful for identifying
hotspots, detecting anomalies, or segmenting customers based on their
location-based preferences.
 Another technique is spatial association rule mining, which identifies
relationships between spatial objects and attributes. This can help
organizations understand the co-occurrence of events, such as the correlation
between the location of retail stores and customer purchasing behavior. Other
commonly used techniques include spatial regression analysis, spatial
interpolation, and spatial outlier detection.
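
Hotspot identification can be sketched with a very simple density method: divide the area into grid cells and flag cells that contain many points. This grid count is a simplified stand-in for a full clustering algorithm, and the incident coordinates, cell size, and threshold are all assumed values.

```python
from collections import Counter

def grid_cell(point, cell_size):
    """Map an (x, y) point to the grid cell that contains it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))

def hotspots(points, cell_size, min_count):
    """Return the grid cells that contain at least min_count points."""
    counts = Counter(grid_cell(point, cell_size) for point in points)
    return {cell: n for cell, n in counts.items() if n >= min_count}

# Hypothetical incident locations in planar coordinates.
incidents = [
    (1.2, 1.1), (1.4, 1.9), (1.8, 1.3), (1.1, 1.6),   # dense group
    (5.5, 7.2), (8.1, 2.4), (3.3, 9.0),               # scattered points
]
result = hotspots(incidents, cell_size=1.0, min_count=3)
```

Only the cell holding the dense group is reported. A drawback of this approach is that a cluster split across a cell boundary can be missed, which is one reason distance-based clustering is often preferred.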

PROCESS MINING
 Process mining is a technique designed to discover, monitor, and improve
real business processes by extracting readily available knowledge from the
event logs of information systems.
 Process mining helps organizations gain a full understanding of the processes
that support their customers through the examination of the actual processes,
which often differ from the documented processes that they currently use.

What is process mining?


 Process mining is a method of applying specialized algorithms to event log
data to identify trends, patterns and details of how a process unfolds.
 Process mining applies data science to discover, validate and
improve workflows.

Process mining techniques / Types


Following are the three main types or techniques of process mining:
1. Discovery
 The discovery type of process mining helps organizations create an entirely
new process model based on current procedure data. With the discovery
technique, the process mining software automatically generates a process
map that provides different information about the procedure.
 Many companies use the discovery technique, as process maps allow
executives to visualize data, helping them develop or improve workflows.
2. Conformance
 With the conformance technique, process mining software compares an
organization's workflow to a pre-existing process model.
 This pre-existing model uses data from your event logs to show how each
aspect of your workflow could operate in an optimized state.
 The process mining software then compares the pre-existing model to your
actual workflow data and identifies any areas for improvement.

3. Enhancement
 The enhancement technique uses additional information to change each
workflow’s pre-existing models. Also called organizational mining,
performance mining or extension, the enhancement strategy aims to
continually optimize these pre-existing models based on your organizational
logs. Enhancement can help make both your pre-existing workflow models
and your actual workflows more detailed, which may improve accuracy.
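
Discovery and conformance can be sketched on a tiny event log. The log and the reference model below are invented. Discovery is reduced to counting which activity directly follows which (a directly-follows graph), and conformance to flagging steps that the reference model does not allow; commercial tools use far richer models.

```python
from collections import Counter, defaultdict

# Hypothetical event log: (case id, activity), already ordered by time.
event_log = [
    ("c1", "receive order"), ("c1", "check credit"), ("c1", "ship"), ("c1", "invoice"),
    ("c2", "receive order"), ("c2", "check credit"), ("c2", "ship"), ("c2", "invoice"),
    ("c3", "receive order"), ("c3", "ship"), ("c3", "invoice"),
]

def traces(log):
    """Group the activities of each case into one ordered trace."""
    grouped = defaultdict(list)
    for case_id, activity in log:
        grouped[case_id].append(activity)
    return dict(grouped)

def directly_follows(log):
    """Discovery: count how often one activity is immediately followed by another."""
    counts = Counter()
    for trace in traces(log).values():
        counts.update(zip(trace, trace[1:]))
    return counts

def deviations(log, allowed):
    """Conformance: list the steps of each case that the reference model forbids."""
    found = {}
    for case_id, trace in traces(log).items():
        bad = [step for step in zip(trace, trace[1:]) if step not in allowed]
        if bad:
            found[case_id] = bad
    return found

dfg = directly_follows(event_log)
reference_model = {
    ("receive order", "check credit"),
    ("check credit", "ship"),
    ("ship", "invoice"),
}
issues = deviations(event_log, reference_model)
```

Case c3 ships without a credit check, so it is the only case reported as a deviation. Enhancement would go one step further, for example by adding timing information to each step.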

Advantages of process mining


1. Enhanced transparency:
Process mining offers a data-driven view of operational processes, surpassing
traditional business process mapping. This deep visibility is crucial for
identifying inefficiencies and compliance issues and understanding the actual
process flow.
2. Simplified process analysis and enhanced efficiency:
Process mining utilizes event-log data to quickly analyze business processes,
enabling the visualization of multiple variants and streamlining operations to
reduce cycle times and costs. This approach simplifies management and
facilitates the automation of routine tasks.
3. Data-driven decision making:
Process mining facilitates objective decisions using IT systems data. This
approach is key in precisely identifying and resolving issues such as bottlenecks
and deviations.
4. Process optimization:
By continuously monitoring process performance metrics, such as KPIs and
SLAs, process mining identifies opportunities for optimization and automation
across various operations.
5. Customer-centric process view:
It offers detailed insights into customer journeys by aligning external customer
interactions with internal operations, highlighting areas for improvement in
customer experience.

6. Process standardization:
It supports standardizing processes across an organization by identifying
variations and aligning them with the optimal process model. This helps ensure
consistent performance and quality.
7. Better customer experience:
Streamlining processes and enhancing efficiency leads to improved service
delivery, fostering greater customer satisfaction and loyalty.

Limitations of process mining


1. Data quality and availability:
Effective process mining relies on high-quality, complete data. Inaccuracies
can distort process models and lead to incorrect insights. Engaging data analysts
in the initial stages can ensure the integrity and completeness of data used for
process mining.
2. Inability to capture tasks:
Process mining may miss manual tasks outside IT systems that are not
recorded in event logs, limiting its scope in workflow optimizations. By
integrating task mining with process mining, organizations can address this gap
and enhance the analysis of workflows and task-level optimizations.
3. Integration hurdles:
Some IT systems pose integration challenges with process mining due to the
lack of connectors or data format issues. Pre-packaged solutions designed for
specific systems or processes can simplify integration, making the process more
seamless.
4. Concept drift:
As processes evolve, it can be difficult to keep process mining models
updated. With outdated models there is a greater risk of outdated analyses.
Advanced process mining solutions analyze processes in near real-time, which
helps keep models current and relevant.

5. Complexity in large organizations:
In larger organizations, the volume and complexity of processes can amplify
the challenges of process mining, affecting insight extraction. By adopting object-
centric or multi-level process mining techniques, organizations can better manage
and analyze complex processes.
6. Potential resistance to change:
Significant changes in process management due to process mining can meet
resistance from employees accustomed to existing workflows. Effective change
management, including staff training and engagement, is critical for a smooth
transition and successful adoption.

Process mining use cases


 Education:
Process mining can help identify effective course curriculums by monitoring
and evaluating student performance and behaviors, such as how much time a
student spends viewing class materials.
 Finance:
Financial services institutions and procurement operations use process
mining software to improve inter-organizational processes and account auditing,
increase income and expand their customer base.
 Public works:
Process mining is used to streamline the invoice process for public works
projects that involve various stakeholders, such as construction companies,
cleaning businesses and environmental bureaus.
 Software development:
Engineering processes can be disorganized, and process mining can help
identify a clear, documented process. It can also help IT administrators monitor
the process, allowing them to verify that the system is running as expected.

 Healthcare:
Process mining provides recommendations for reducing patients' treatment
processing time.
 E-commerce:
Process mining can provide insight into buyer behaviors and accurate
recommendations to increase sales.
 Manufacturing:
Process mining enhances supply chain and manufacturing business operations
by assigning appropriate resources based on product attributes. Insights into
production times and resource allocation, such as storage space, machines or
workers, allow for more efficient management and operational transformation.
 IT service management (ITSM):
Process mining can optimize service delivery and incident management
processes. It enables IT teams to analyze service workflows, identify
inefficiencies and improve response times. This helps enhance overall IT support
and customer satisfaction.

5 Process Mining Steps


We recommend following these five valuable steps to increase the productivity and
efficacy of your process discovery tool usage.

1. Process discovery
 Process discovery is now a simple and speedy part of rapid workflows, used to
analyze desktop user interaction data and link it with process details mined
from system event data.

 Using the timestamps in this event data, the software can then create visual
models to help identify the variations and bottlenecks impacting efficiency or
customer experience.
2. Process analysis
 Process analysis is where you develop an understanding of process
performance based on actual data. Insights gained from process analytics may
include optimal process paths, paths for the greatest return on investment
(ROI) and causes of variations.
 This is also the stage where you can better understand your processes by
visualizing them through process mapping. Some platforms include process
simulations, and with these, you can also visualize target business outcomes.
 While process mining gives you insight into how your processes actually run,
business process management (BPM) gives you a map of your business’s ideal
processes. You can use them together to improve processes for better workflow
management and business outcomes. Knowing how processes currently work is
invaluable for BPM. Process mining helps across almost every step of the BPM
lifecycle.
3. Process optimization
 Using the data you’ve gathered in the discovery and analysis phases, you can
now make changes to your processes to prepare them for automation.
 This is also where you select the most important processes on which to place
your focus.
4. Process automation
 At this stage, your process data can be exported for robotic process
automation (RPA). With insights gained in previous steps, you can choose the
‘happy path’ with the least resistance and highest ROI for automation.
5. Process monitoring & prediction
 Some process mining solutions will monitor every process instance. This is
usually only available with full process intelligence.
 The platform will issue alerts or automatically act when unusual process
behavior occurs. Some platforms can even offer predictive analytics to further
enhance your process planning strategy.

How does process mining work?
Here are the four stages of process mining:
1. Read and transform organizational data
 Many companies already track and store data related to their workflows.
Process mining technology transforms these data sets into event logs for each
of your organizational processes.
 The process mining software then assigns every piece of data a time stamp, a
case ID and an activity. These three elements help the process mining program
analyze and prioritize your workflows more effectively.
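
The three elements above (case ID, activity, time stamp) are enough to rebuild what happened in each case. The following is a minimal sketch, not a vendor implementation: the invoice events, activity names, and the helper names `build_traces` and `count_variants` are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical event log: one (case_id, activity, timestamp) row per event.
EVENT_LOG = [
    ("inv-1", "Receive invoice", "2024-01-02 09:00"),
    ("inv-1", "Process invoice", "2024-01-02 11:30"),
    ("inv-1", "Pay invoice", "2024-01-04 10:00"),
    ("inv-2", "Receive invoice", "2024-01-03 08:15"),
    ("inv-2", "Process invoice", "2024-01-03 09:00"),
    ("inv-2", "Request correction", "2024-01-05 14:00"),
    ("inv-2", "Process invoice", "2024-01-08 10:00"),
    ("inv-2", "Pay invoice", "2024-01-09 16:45"),
    ("inv-3", "Receive invoice", "2024-01-04 13:00"),
    ("inv-3", "Process invoice", "2024-01-04 15:00"),
    ("inv-3", "Pay invoice", "2024-01-05 09:30"),
]


def build_traces(event_log):
    """Group events by case ID and order each case by its timestamps."""
    cases = defaultdict(list)
    for case_id, activity, stamp in event_log:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        cases[case_id].append((when, activity))
    return {
        case_id: tuple(activity for _, activity in sorted(events))
        for case_id, events in cases.items()
    }


def count_variants(traces):
    """Count how many cases follow each distinct sequence of activities."""
    return Counter(traces.values())


traces = build_traces(EVENT_LOG)
variants = count_variants(traces)
for path, cases in variants.most_common():
    print(cases, "case(s):", " -> ".join(path))
```

Each distinct sequence of activities is one process variant; the most frequent variant is often a reasonable first candidate for the standard path.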

2. Evaluate details of each process


 After creating event logs, process mining software often designs process
graphs that reveal each workflow step, including which alternative routes may
occur. The process mining program compares this more detailed graph with
another chart of an organization’s ideal workflow path so you can easily see
inefficiencies and devise ways to optimize them.
 For example, an ideal process graph for a workflow related to invoicing might
contain only a few steps, such as receiving, processing and paying the invoice.
With process mining, however, you might discover many smaller steps within
a process that could increase the duration of this workflow and others with
similar requirements.
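
One simple way to picture the comparison between the discovered graph and the ideal path is a "directly-follows" count: how often one activity is immediately followed by another. The sketch below assumes hypothetical invoice traces and an assumed ideal path of three steps; real tools use richer process models, so treat this only as an illustration of the idea.

```python
from collections import Counter

# Hypothetical traces: the ordered activities observed for each case.
TRACES = {
    "inv-1": ["Receive", "Process", "Pay"],
    "inv-2": ["Receive", "Process", "Request correction", "Process", "Pay"],
    "inv-3": ["Receive", "Process", "Pay"],
}

# Assumed ideal workflow for invoicing, as in the example above.
IDEAL_PATH = ["Receive", "Process", "Pay"]


def directly_follows(traces):
    """Count every (activity, next_activity) pair across all cases."""
    edges = Counter()
    for activities in traces.values():
        edges.update(zip(activities, activities[1:]))
    return edges


def deviations(edges, ideal_path):
    """Return the observed edges that do not appear in the ideal path."""
    ideal_edges = set(zip(ideal_path, ideal_path[1:]))
    return {edge: count for edge, count in edges.items() if edge not in ideal_edges}


edges = directly_follows(TRACES)
extra_steps = deviations(edges, IDEAL_PATH)
for (source, target), count in extra_steps.items():
    print(f"{source} -> {target}: seen {count} time(s), not in the ideal path")
```

Here the rework loop through "Request correction" shows up as two edges that the ideal path does not contain, which is the kind of hidden step the invoicing example describes.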

3. Prioritize your fixes


 Process mining software allows you to view advanced data and visualizations
related to your processes, such as the exact time it takes to accomplish various
subtasks. With this knowledge, you can more easily determine which aspects
of your workflows to fix first and how to accomplish these optimizations.
 The process mining software can also help you identify whether your
proposed changes may save the company time and money before
implementing them officially.
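
To illustrate how step durations can guide prioritization, the sketch below computes the average elapsed time between consecutive activities and picks the slowest hand-over. The events, the timestamp format `FMT`, and the function name `mean_hours_between_steps` are assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Hypothetical events: (case_id, activity, timestamp).
EVENTS = [
    ("inv-1", "Receive", "2024-01-02 09:00"),
    ("inv-1", "Process", "2024-01-02 11:00"),
    ("inv-1", "Pay", "2024-01-04 11:00"),
    ("inv-2", "Receive", "2024-01-03 08:00"),
    ("inv-2", "Process", "2024-01-03 12:00"),
    ("inv-2", "Pay", "2024-01-06 12:00"),
]


def mean_hours_between_steps(events):
    """Average elapsed hours for each pair of consecutive activities."""
    by_case = defaultdict(list)
    for case_id, activity, stamp in events:
        by_case[case_id].append((datetime.strptime(stamp, FMT), activity))

    totals = defaultdict(lambda: [0.0, 0])  # edge -> [sum of hours, count]
    for steps in by_case.values():
        steps.sort()
        for (start, first), (end, second) in zip(steps, steps[1:]):
            totals[(first, second)][0] += (end - start).total_seconds() / 3600
            totals[(first, second)][1] += 1
    return {edge: hours / count for edge, (hours, count) in totals.items()}


durations = mean_hours_between_steps(EVENTS)
bottleneck = max(durations, key=durations.get)
print("Slowest hand-over:", bottleneck, f"({durations[bottleneck]:.1f} h on average)")
```

The hand-over with the largest average delay is a natural first candidate for a fix, although in practice case volume and business impact would also be weighed.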

4. Continually track your workflows
 Process mining software continues to monitor and evaluate your workflows,
even after you've made improvements.
 This can help you gather and analyze real-time data that shows you if you've
designed effective improvements. Process mining programs may also evaluate
whether your workflows adhere to compliance regulations for your industry.

Process mining is important for:


 Reducing expenses:
Process mining software can show executives whether a company
spends money inefficiently. By reviewing this information, they can
better manage resources and streamline production steps.
 Increasing ROI:
Reducing costs associated with business processes and removing
ineffective steps can allow companies to improve their return on investment
(ROI). This allows companies to improve their overall revenue and profits.
 Enhancing quality:
This method can help companies improve the overall efficiency of their
internal processes, which may encourage staff members to think more
creatively when developing new project ideas. Companies may also receive
more high-quality data and better customer engagement.

Process mining software


 Process mining software helps organizations analyze and visualize their
business processes based on data extracted from various sources, such as
transaction logs or event data.
 This software can identify patterns, bottlenecks, and inefficiencies within a
process, enabling organizations to improve their operational efficiency, reduce
costs, and enhance their customer experience.

 In March 2023, Analytics Insight magazine identified the top 5 process
mining software companies for 2023:
1. Celonis
2. UiPath Process Mining
3. SAP Signavio Process Intelligence
4. Software AG ARIS Process Mining
5. ABBYY Timeline

DATA WAREHOUSE:
 A data warehouse is where data can be collected for mining purposes, usually
with large storage capacity.
 Data warehousing is a method of organizing and compiling data into one
database, whereas data mining deals with fetching important data from
databases.
 Data Warehousing integrates data and information collected from various
sources into one comprehensive database. For example, a data warehouse
might combine customer information from an organization's point-of-sale
systems, its mailing lists, website, and comment cards.
Key Characteristics of Data Warehouse
The main characteristics of a data warehouse are as follows:
 Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information
rather than the overall processes of a business. Such subjects may be sales,
promotion, inventory, etc. For example, if you want to analyze your company’s
sales data, you need to build a data warehouse that concentrates on sales. Such a
warehouse would provide valuable information like ‘who was your best customer
last year?’ or ‘who is likely to be your best customer in the coming year?’
 Integrated
A data warehouse is developed by integrating data from varied sources into a
consistent format. The data must be stored in the warehouse in a consistent and
universally acceptable manner in terms of naming, format, and coding. This
facilitates effective data analysis.
 Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is
read-only. Previous data is not erased when current data is entered. This helps
you to analyze what has happened and when.
 Time-Variant
The data stored in a data warehouse is documented with an element of time,
either explicitly or implicitly. An example of time variance in Data Warehouse is
exhibited in the Primary Key, which must have an element of time like the day,
week, or month.
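
The non-volatile and time-variant characteristics can be pictured with a small append-only table whose primary key includes a date. This is a minimal sketch using Python's built-in `sqlite3` module; the table `price_history`, its columns, and the helper `record_snapshot` are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE price_history (
        product_id    INTEGER NOT NULL,
        snapshot_date TEXT    NOT NULL,  -- the element of time in the key
        unit_price    REAL    NOT NULL,
        PRIMARY KEY (product_id, snapshot_date)
    )
    """
)


def record_snapshot(conn, product_id, snapshot_date, unit_price):
    """Append a new dated row; earlier rows are never updated or deleted."""
    conn.execute(
        "INSERT INTO price_history VALUES (?, ?, ?)",
        (product_id, snapshot_date, unit_price),
    )


record_snapshot(conn, 1, "2024-01-31", 9.99)
record_snapshot(conn, 1, "2024-02-29", 10.49)  # new price, old row is kept

history = conn.execute(
    "SELECT snapshot_date, unit_price FROM price_history "
    "WHERE product_id = ? ORDER BY snapshot_date",
    (1,),
).fetchall()
print(history)
```

Because every row is keyed by date, the January price stays available after the February price is loaded, which is what makes "what happened and when" analysis possible.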

Types of Data Warehouse


There are three main types of data warehouse.
1. Enterprise Data Warehouse (EDW)
2. Operational Data Store (ODS)
3. Data Mart

1. Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates
decision-support services throughout the enterprise. The advantage to this type of
warehouse is that it provides access to cross-organizational information, offers a
unified approach to data representation, and allows running complex queries.
2. Operational Data Store (ODS)
This type of data warehouse refreshes in real-time. It is often preferred for
routine activities like storing employee records. It is required when data warehouse
systems do not support reporting needs of the business.
3. Data Mart
A data mart is a subset of a data warehouse built to serve a particular
department, region, or business unit. Every department of a business has a central
repository or data mart to store data. The data from the data mart is stored in the
ODS periodically. The ODS then sends the data to the EDW, where it is stored and
used.

Benefits of Data Warehouse


 Better business analytics: A data warehouse stores all of a company's
past data and records in one place for analysis, which deepens the company's
understanding of its data.

 Faster Queries: The data warehouse is designed to handle large analytical
queries, which is why it typically runs them faster than an operational database.

 Improved data Quality: The data warehouse stores and analyzes the data
gathered from different sources without altering or adding to it, so the
quality of the data is maintained. If a data quality issue does arise, the
data warehouse team resolves it.

 Historical Insight: The warehouse stores all your historical data which
contains details about the business so that one can analyze it at any time and
extract insights from it.

Features of Data Warehousing
 Centralized Data Repository: Data warehousing provides a centralized
repository for all enterprise data from various sources, such as transactional
databases, operational systems, and external sources. This enables
organizations to have a comprehensive view of their data, which can help in
making informed business decisions.
 Data Integration: Data warehousing integrates data from different sources
into a single, unified view, which can help in eliminating data silos and
reducing data inconsistencies.
 Historical Data Storage: Data warehousing stores historical data, which
enables organizations to analyze data trends over time. This can help in
identifying patterns and anomalies in the data, which can be used to improve
business performance.
 Query and Analysis: Data warehousing provides powerful query and
analysis capabilities that enable users to explore and analyze data in different
ways. This can help in identifying patterns and trends, and can also help in
making informed business decisions.
 Data Transformation: Data warehousing includes a process of data
transformation, which involves cleaning, filtering, and formatting data from
various sources to make it consistent and usable. This can help in improving
data quality and reducing data inconsistencies.
 Data Mining: Data warehousing provides data mining capabilities, which
enable organizations to discover hidden patterns and relationships in their
data. This can help in identifying new opportunities, predicting future trends,
and mitigating risks.
 Data Security: Data warehousing provides robust data security features,
such as access controls, data encryption, and data backups, which ensure that
the data is secure and protected from unauthorized access.

Advantages of Data Warehousing
 Intelligent Decision-Making: With centralized data in warehouses,
decisions may be made more quickly and intelligently.
 Business Intelligence: Provides strong operational insights through
business intelligence.
 Historical Analysis: Predictions and trend analysis are made easier by
storing past data.
 Data Quality: Guarantees data quality and consistency for trustworthy
reporting.
 Scalability: Capable of managing massive data volumes and expanding to
meet changing requirements.
 Effective Queries: Fast and effective data retrieval is made possible by an
optimized structure.
 Cost reductions:
Data warehousing can result in cost savings over time by reducing data
management procedures and increasing overall efficiency, even when there are
setup costs initially.
 Data security:
Data warehouses employ security protocols to safeguard confidential
information, guaranteeing that only authorized personnel are granted access to
certain data.

Disadvantages of Data Warehousing


 Cost: Building a data warehouse can be expensive, requiring significant
investments in hardware, software, and personnel.
 Complexity: Data warehousing can be complex, and businesses may need
to hire specialized personnel to manage the system.
 Time-consuming: Building a data warehouse can take a significant amount
of time, requiring businesses to be patient and committed to the process.

 Data integration challenges:
Data from different sources can be challenging to integrate, requiring
significant effort to ensure consistency and accuracy.
 Data security:
Data warehousing can pose data security risks, and businesses must take
measures to protect sensitive data from unauthorized access or breaches.

Components of Data Warehouse


There are various components of data warehouse that are specifically designed to
enhance the system's speed so that you can get the result faster and precisely analyze
the data in one go.
1. Warehouse Databases
It is one of the first components of a data warehouse. Some common types of
warehouse databases are:
a) Analytics Database:-
These databases store and manage data specifically for analytics.
b) Cloud-based Database:-
Here the database is hosted in the cloud, so you don't have to acquire
any hardware to set up your data warehouse.
c) Central Database:-
It keeps all the data related to the business organization and makes it
easier for analysts to build reports from it.
d) Typical relational databases:-
These databases consist of data in the form of rows and columns, which
collectively form a table.

2. ETL(Extraction, Transformation, and Loading)


 ETL is another important component of the data warehouse. ETL, which
stands for Extraction, Transformation, and Loading, is a data integration
process in which data is extracted from various sources, transformed into
a suitable format, and then loaded into a data warehouse.

 This component allows us to extract data, fill in missing or disordered
data, distribute data from the central repository to the business intelligence
applications, and much more.
How does ETL Work?
To understand how ETL works, we should go through each step of the ETL process.
 Extract
Copy raw data from the source locations to a staging area. This data is
collected by the data management team and can be structured or unstructured.
 Transform
In the staging area, this data is transformed by filtering, cleansing, de-
duplicating, validating, and authenticating the data, etc.
 Load
In this step, transformed data is moved from the staging area to the
target data warehouse. An initial load moves all the data; later loads are
incremental, applied as changes occur in the data.
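
The three ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw rows, the `sales` table, and the cleaning rules (trim names, drop rows without a customer, remove exact duplicates) are hypothetical, and an in-memory SQLite database stands in for the target warehouse.

```python
import sqlite3

# Hypothetical raw records as they might arrive from source systems.
RAW_ROWS = [
    {"customer": " alice ", "amount": "120.50", "source": "pos"},
    {"customer": "bob", "amount": "80", "source": "website"},
    {"customer": "bob", "amount": "80", "source": "website"},  # duplicate
    {"customer": "", "amount": "15", "source": "pos"},  # invalid: no customer
]


def extract(rows):
    """Extract: copy raw data into a staging area (here, a plain list)."""
    return [dict(row) for row in rows]


def transform(staged):
    """Transform: cleanse, validate, and de-duplicate the staged rows."""
    seen, clean = set(), []
    for row in staged:
        name = row["customer"].strip().title()
        if not name:
            continue  # validation: a sale must have a customer
        record = (name, float(row["amount"]), row["source"])
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean


def load(conn, records):
    """Load: move the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(RAW_ROWS)))
loaded = warehouse.execute(
    "SELECT customer, amount, source FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions mirrors the staging-area idea: each step can be rerun or tested on its own before anything reaches the warehouse.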

3. Metadata
 Metadata is a component that can be used in a variety of conditions to build,
manage and maintain the system. The simplest definition of metadata is “it is
data about the data.”
 It helps us to understand the context, nature, and structure of the data.
 It enables the user to have an easy search and retrieval of data.
 It is a key to unlocking the hidden content of the data and getting a proper
understanding of it.
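
As a small illustration of "data about the data", most database engines can describe their own tables. The sketch below uses SQLite's `PRAGMA table_info` through Python's `sqlite3` module; the `sales` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sale_id INTEGER PRIMARY KEY, sale_date TEXT NOT NULL, amount REAL)"
)

# Each row describes one column: (cid, name, type, notnull, default, pk).
columns = conn.execute("PRAGMA table_info(sales)").fetchall()
for _, name, col_type, not_null, _, is_pk in columns:
    print(f"{name}: type={col_type}, required={bool(not_null)}, key={bool(is_pk)}")
```

The output lists names, types, and constraints rather than any sales figures, which is exactly the kind of context metadata provides for search and retrieval.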

4. Query Tools
Query tools are the components through which we interact with the data warehouse
and get relevant data out of it.
Some of the tools used for interaction purposes are query and reporting tools,
application development tools, data mining tools, and online analytical tools.
 Firstly, query and reporting tools are categorized into managed query tools
and reporting tools. Reporting tools are used for developing business reports,
and the end-users can use them at an affordable cost.
 Managed query tools shield the end user from SQL query-related
complexities by adding an intermediate layer between the database and users.
 Online analytical processing (OLAP) tools are generally used to extract or
retrieve data selectively so that it can be analyzed from different
standpoints. These tools assume that data is organized in a multidimensional
model.
 Data mining tools are the set of tools that are used to analyze large amounts of
data and the relationship in that data.
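
The multidimensional view used by online analytical tools can be sketched as summing one measure over different combinations of dimensions. The sales facts and the function name `roll_up` below are hypothetical; real OLAP engines precompute and index such aggregates, so this is only a toy illustration.

```python
from collections import defaultdict

# Hypothetical sales facts: three dimensions and one measure per row.
FACTS = [
    {"region": "North", "product": "Laptop", "quarter": "Q1", "sales": 100},
    {"region": "North", "product": "Desk", "quarter": "Q1", "sales": 40},
    {"region": "South", "product": "Laptop", "quarter": "Q1", "sales": 70},
    {"region": "North", "product": "Laptop", "quarter": "Q2", "sales": 120},
]


def roll_up(facts, dimensions, measure="sales"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(totals)


by_region = roll_up(FACTS, ["region"])
by_product_quarter = roll_up(FACTS, ["product", "quarter"])
print(by_region)
print(by_product_quarter)
```

The same facts answer different questions depending on which dimensions are kept, which is what "analyzing from a different standpoint" means in practice.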

5. Data Marts
Data marts are components of a data warehouse. Let’s discuss them in detail:
 A data mart is a data store designed for a particular department of an
organization; equivalently, it is a subset of the data warehouse that is
usually oriented to a specific purpose.
 It helps stakeholders make quick, well-informed decisions from summarized
data.
 In a data mart, companies can retrieve information more efficiently as it
contains only the most relevant information.
 It streamlines decision-making and gives users access to fine-grained data.
 As it has very few data tables, data engineers can manage and change
information without causing significant database changes.

6. Data Warehouse Management and Administration
 Unlike traditional relational databases, a data warehouse collects vast amounts
of past and current data. To manage this, we require an administrator with a
skillset different from the traditional data administrator.
 These control components manage the system within the data warehouse and
also govern how data is transformed and moved into the data warehouse.

7. Information Delivery System


 Business executives with no training in data warehousing still need
information from the data warehouse; the information delivery component
conveys it to them.
 The idea behind the information delivery system is that once the warehouse
is completely functional, its users need not be aware of its maintenance. All
they need is the required data at the required time.
 With the expansion of the World Wide Web and the internet, this delivery
system has become more usable, as any user can get information over the
worldwide network.

DATA MART
 A Data Mart is a smaller version of a data warehouse and it is meant to be
used by a particular department or a group of individuals in the company.
 It focuses on a single functional unit of an organization and keeps a subset of
data stored in the data warehouse. It is normally controlled by a single
department in the organization. Whereas a data warehouse draws data from
many sources, a Data Mart draws data from only a few sources.
 A data mart is a subset of a database —usually a data warehouse—
where data is stored for a specific business area. That is, a data mart stores
concise and specific data sets used for analysis for a specific department or
line of business, such as the sales department.

What is a Data Mart?
 A data mart is an access layer of a data warehouse focused on a specific line
of business, function, or department. It is used for retrieving client-facing data.
A data mart contains a subset of the data that is stored in a data warehouse.

Data Mart Structures


 A data mart structure is a subject-oriented relational database that stores data
in tables, i.e., rows and columns that are easier to access, organize and
comprehend. Data fields can refer to one or multiple objects.
 Data marts are structured in a multidimensional schema that works as a
blueprint for data analysis by users of the database. The three main structures
or schema for data marts are star, snowflake, and vault.

1. Star
 The star schema is a blueprint that resembles a star shape and consists of fact
tables that reference dimension tables in a relational database. The fact table is
placed at the center of the star and holds a set of metrics that relates to a
specific process.
 The star schema requires fewer joins when writing queries as there is no
dependency between dimension tables. This also keeps the ETL request process
simple and makes it efficient to access and navigate large data sets. These
benefits make star schemas widely used in information technology systems.
2. Snowflake
 A snowflake schema extends the star schema blueprint with additional
dimension tables that are normalized to protect data integrity and minimize
data redundancy.

 The snowflake schema’s main benefit is that it requires less storage space for
dimension tables.
 However, a snowflake structure is harder to maintain because multiple tables
need to be populated and synchronized. It can also slow queries, since the
additional normalized dimension tables require extra joins.
3. Vault
 The vault schema enables users to design agile enterprise data warehouses. It
is a fairly modern database modeling technique. The vault schema is a layered
structure that focuses on agility and scalability.
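
To make the star structure concrete, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and answers a question with one join per dimension. All table names, column names, and figures are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

    -- Fact table at the center of the star: metrics plus dimension keys.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        units      INTEGER,
        revenue    REAL);

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_date VALUES (10, 2024, 1), (11, 2024, 2);
    INSERT INTO fact_sales VALUES (1, 10, 2, 2000.0), (1, 11, 1, 1000.0), (2, 10, 3, 600.0);
    """
)

# Revenue by category and month: one join from the fact table to each dimension.
revenue = conn.execute(
    """
    SELECT p.category, d.month, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
    """
).fetchall()
print(revenue)
```

In a snowflake variant of the same design, `category` would move out of `dim_product` into its own table, which saves storage in the dimension but adds one more join to this query.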

Uses of Data Marts


 Efficient access to information: It is more efficient to access specific
data in a data mart that is relevant to real-time needs. Data marts hold a subset
of data warehouse information which makes it quick and easy to retrieve
information.
 Cost-effective alternative to data warehouse: It is more cost-effective
to create and design an independent data mart than creating a data warehouse,
especially when it comes to small businesses or projects with smaller data sets.
Setting up a data mart incurs a small fraction of costs compared to data
warehouse setup costs.
 Increased processing efficiency: Using dependent and hybrid data marts
reduces the burden for processing by data warehouses, thereby improving
performance. A separate processing facility for the two data marts will help
reduce analytics processing costs.
 Efficient data maintenance: It is easier to maintain a data mart
because it accesses leaner and less cluttered information. Also, a data
mart requires less storage, which is easier to maintain. Different business units
are able to maintain and own their own data.
 Faster implementation: A data mart requires a small subset of data to set
up instead of significant setup costs required for a data warehouse, which
contains a large collection of external and internal data.

 Business intelligence: Data marts enable quicker insights into strategic
information contained in a data warehouse. Business intelligence benefits the
organization through accelerated information access and potentially higher
productivity.
 Analytics: It is easier to track key performance indicators through a data
mart.

Types of Data Marts


There are mainly three approaches to designing data marts. These approaches are:
1. Dependent Data Marts
2. Independent Data Marts
3. Hybrid Data Marts

1. Dependent Data Marts


 A dependent data mart is a logical or physical subset of a larger data
warehouse. In this technique, the data marts are treated as subsets of a data
warehouse: first a data warehouse is created, and then various data marts are
created from it.
 These data marts depend on the data warehouse and extract the essential
records from it. Because the data marts are created from the data warehouse,
there is no need for data mart integration. It is also known as a top-down
approach.

2. Independent Data Marts
 The second approach is independent data marts (IDM). Here, independent
data marts are created first, and then a data warehouse is designed
using these multiple independent data marts.
 In this approach, as all the data marts are designed independently,
the integration of data marts is required. It is also termed a bottom-up
approach as the data marts are integrated to develop a data warehouse.

3. Hybrid Data Marts


 A hybrid data mart combines input from a data warehouse with input from
other sources. This can be helpful in many situations, especially when ad hoc
integrations are needed, such as after a new group or product is added to
the organization.
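
The dependent (top-down) approach described above can be sketched as deriving a departmental table from an existing warehouse table. The example uses an in-memory SQLite database; the tables `warehouse_orders` and `sales_mart` and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in for the enterprise data warehouse.
    CREATE TABLE warehouse_orders (
        order_id INTEGER PRIMARY KEY, department TEXT, region TEXT, amount REAL);
    INSERT INTO warehouse_orders VALUES
        (1, 'sales', 'North', 250.0),
        (2, 'hr',    'North',  40.0),
        (3, 'sales', 'South', 125.0);

    -- Dependent data mart: only the rows and columns the sales team needs.
    CREATE TABLE sales_mart AS
        SELECT order_id, region, amount
        FROM warehouse_orders
        WHERE department = 'sales';
    """
)

mart_rows = conn.execute(
    "SELECT order_id, region, amount FROM sales_mart ORDER BY order_id"
).fetchall()
print(mart_rows)
```

Because the mart is built from the warehouse, its data is already consistent with the rest of the organization, which is why no separate integration step is needed in this approach.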

Characteristics of Data Mart


Data Marts possess characteristics that reflect their design and how they
work. The following are four characteristics that define Data Marts.
1. Subject-oriented
 One of a Data Mart’s defining features is its subject orientation.
Unlike a traditional data warehouse, a Data Mart is created specifically to
address the requirements of a given business unit or function.

 Instead of being organized around the needs of the enterprise as a whole,
the data is usually organized around a particular subject area, such
as sales or inventory.
 Business users can now more quickly and easily access the data they
require without having to wade through extraneous information.
 The information in a Data Mart is typically obtained from a central
warehouse or other data sources, but it is arranged and displayed
in a manner that is specific to the particular business unit or function.
In addition to lowering the risk of data redundancy and enhancing data
governance, this helps to ensure that the data is accurate, consistent, and
up-to-date.
2. Integrated
 A Data Mart’s ability to integrate data from various sources to produce
a single, comprehensive view of information is referred to as its integrated
characteristic. As a result, the information kept in it is not only
subject-oriented but also combined from a variety of sources to give users
a thorough understanding of a particular business function or process. Data
from both internal and external sources, such as data lakes, relational
databases, and business intelligence tools, are incorporated into the process
of integration.
 Data Marts give business users a centralized location to access data that
is consistent, accurate, and up-to-date by combining data from various
sources. Integrating these systems helps keep the data accurate and usable
for data mining and querying, enabling businesses to base decisions on
specific data trends and insights.
3. Time-variant
 The 'Time-variant' characteristic refers to how data changes over time.
In a Data Mart, data is stored in a way that allows for analysis of trends and
patterns over time. This means that the data stored in a mart is
time-stamped, enabling business users to query and analyze historical data
to identify trends and patterns in the data. By capturing changes in data
over time, Data Marts provide insights into how business data has evolved,
which can help organizations make data-driven decisions.
 Additionally, the time-variant nature of Data Marts allows businesses
to track changes in specific data over time, which is particularly useful for
tracking trends in sales, customer behavior, and other business functions.
4. Non-volatile
 Data Marts are non-volatile, meaning that once data is stored, it cannot
be altered or updated. This characteristic ensures data integrity and
consistency, which is crucial for business intelligence and decision-
making. Unlike operational databases that are subject to frequent changes,
Data Marts store historical data that is relevant to specific business
functions or departments. This allows for simplified data access and
querying, as well as faster response times when retrieving information.
 Data Marts can be independent or part of an existing enterprise warehouse,
depending on the organization’s data management system and business
requirements. While Data Marts are focused on a specific subset of data,
they can still access and incorporate external data sources to support data
mining and provide a comprehensive view of business trends.

Data Mart Architecture


Understanding Data Mart’s Architecture will bring you closer to grasping the
orientation and potential it can offer.

1. Bottom-up approach
 In the bottom-up approach of Data Mart architecture, Data Marts are created
from the operational data sources that a business unit or department uses. This
approach starts with a department or business unit identifying the data they
need to access to support their specific business function, and then creating
a mart that stores that data. Data is loaded into the mart from internal
operational systems or external data sources and then structured into
dimension tables for simplified data access by business users.
 This approach allows for a more agile warehouse, as it is built incrementally
from specific data needs rather than attempting to design and implement
an entire data warehouse upfront. Additionally, the use of independent Data
Marts can provide faster querying of data and support data mining for specific
data trends.
2. Top-down approach
 The top-down approach to Data Mart architecture involves the creation
of a centralized data warehouse, which serves as the primary data source for
all Data Marts. In this approach, Data Marts are designed to serve the needs
of specific business units or functions and are created based on the
requirements of business users.
 The data warehouse is designed to store all types of data, both
structured and unstructured, from various sources, including internal
operational systems and external data sources. Dimension tables and fact
tables are used to store and organize data within the warehouse. The top-down
approach is warehouse-centric and is often used in larger organizations
with more complex data requirements.
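As a rough illustration of the top-down approach, the sketch below builds a tiny central warehouse with one dimension table and one fact table, then derives a departmental mart from it. It uses Python's built-in sqlite3 module; the table names, column names, and figures are hypothetical and chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Central warehouse: one dimension table and one fact table (hypothetical names).
    CREATE TABLE dim_department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE fact_expense (
        expense_id INTEGER PRIMARY KEY,
        dept_id INTEGER REFERENCES dim_department(dept_id),
        month TEXT,
        amount REAL
    );
    INSERT INTO dim_department VALUES (1, 'HR'), (2, 'Marketing');
    INSERT INTO fact_expense VALUES
        (1, 1, '2024-01', 100.0),
        (2, 1, '2024-01', 50.0),
        (3, 2, '2024-01', 300.0),
        (4, 2, '2024-02', 200.0);

    -- Dependent data mart: only Marketing rows, summarized by month.
    CREATE TABLE mart_marketing_expense AS
        SELECT f.month, SUM(f.amount) AS total_amount
        FROM fact_expense AS f
        JOIN dim_department AS d ON d.dept_id = f.dept_id
        WHERE d.dept_name = 'Marketing'
        GROUP BY f.month;
""")

mart_rows = conn.execute(
    "SELECT month, total_amount FROM mart_marketing_expense ORDER BY month"
).fetchall()
print(mart_rows)
```

The mart holds only the subset and level of detail that one department needs, while the warehouse remains the single source it was derived from.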
3. Federated approach
 A federated Data Mart architecture is a data management approach that
integrates multiple Data Marts while preserving the autonomy of each one.
This approach enables organizations to access and analyze data from
disparate sources without having to physically move the data into
a centralized repository. Usually, in marts that have a federated
architecture, data remains in its original location and is accessed on an
as-needed basis through a virtual layer.
 This type of architecture provides increased agility and flexibility, as well
as a simplified data access process, making it easier for business users
to query data and gain insights. Additionally, a federated approach can support
a variety of data types, including structured, unstructured, and semi-structured
data, sourced from internal operational systems or external data sources,
making it a versatile solution for businesses looking to streamline their data
management processes.
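The idea of a virtual layer can be sketched with SQLite's ATTACH DATABASE statement, which lets one connection query several separate databases in place. This is only a small-scale analogy, not a federated product: the two marts, their tables, and the customer_overview helper are hypothetical names invented for the example.

```python
import sqlite3

# The main connection plays the role of the virtual layer; it stores no data itself.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales_mart")
conn.execute("ATTACH DATABASE ':memory:' AS support_mart")

# Each mart keeps its own tables in its own database (hypothetical schemas).
conn.executescript("""
    CREATE TABLE sales_mart.orders (customer_id INTEGER, amount REAL);
    CREATE TABLE support_mart.tickets (customer_id INTEGER, subject TEXT);
    INSERT INTO sales_mart.orders VALUES (1, 120.0), (1, 80.0), (2, 40.0);
    INSERT INTO support_mart.tickets VALUES
        (1, 'late delivery'), (2, 'refund'), (2, 'login');
""")

def customer_overview(connection):
    """Combine both marts at query time without copying rows between them."""
    return connection.execute("""
        SELECT o.customer_id, o.total_spent, t.ticket_count
        FROM (SELECT customer_id, SUM(amount) AS total_spent
              FROM sales_mart.orders GROUP BY customer_id) AS o
        JOIN (SELECT customer_id, COUNT(*) AS ticket_count
              FROM support_mart.tickets GROUP BY customer_id) AS t
          ON t.customer_id = o.customer_id
        ORDER BY o.customer_id
    """).fetchall()

overview = customer_overview(conn)
print(overview)
```

Each mart can still be loaded and maintained independently; only the query spans both.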

What are the Data Mart Use Cases?


Here are a few pivotal use cases where Data Marts can come in handy:
 Improved Resource Management: You can provide each department
with a separate Data Mart to even out imbalances in resource use across
organizational units. For instance, if the department running logistics
operations queries a shared database heavily every day, this can degrade
performance for other departments that run fewer queries and, eventually,
reduce the effectiveness of the entire company. Separate Data Marts let
each department use resources more effectively and efficiently.

 Subject-focused Data Analytics: Data analytics plays a pivotal role in
any business lifecycle. Data Marts allow for more focused data analysis
since they only contain records that are organized around particular subjects
like sales, products, customers, etc. Since there is no extraneous information to
deal with, businesses can extract clearer and more accurate insights.
 Selective Data Access: You can leverage these repositories in situations
when an organization needs selective privileges for managing and accessing
data. Generally, this can be the case for big enterprises that can’t reveal the
entire Data Warehouse to all the users. By building multiple dependent
repositories, you can help protect sensitive data from accidental writes and
unauthorized access.
 Time-limited Data Projects: Unlike corporate data warehouses, which
take considerable effort and time to build, Data Marts are much easier and
faster to set up, because data developers and engineers work with smaller
amounts of data, simpler schemas, and fewer sources. So, if you are facing
a time crunch on a data project, a Data Mart may be the way to go.

Procedure for Implementation of Data Mart


The process of building a Data Mart can be complex, but it generally involves the
following five steps:
 Step 1: Design
 Step 2: Build / Construct
 Step 3: Populate / Data Transfer
 Step 4: Data Access
 Step 5: Manage

Step 1: Design
 This is the first step when building a Data Mart.

 It includes tasks such as initiating a request for the Data Mart and collecting
information about the requirements. Other tasks involved in this step include
identifying the data sources and selecting the right data subset.
 The output of this step is the logical and physical design of the Data Mart.
Step 2: Build / Construct
 This is the step during which both the physical and the logical structures for
the Data Mart are created.
 In this step, you create the tables, indexes, fields, and access controls.
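A minimal sketch of the build step, assuming a SQLite-backed mart and hypothetical table and index names, might look like the following. SQLite has no user accounts or GRANT statements, so access controls are left out here; in a server database they would be set up in this same step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical structures for a small sales mart (hypothetical schema).
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        region TEXT NOT NULL,
        amount REAL NOT NULL
    );
    -- Index on the column that business users are expected to filter by.
    CREATE INDEX idx_fact_sales_region ON fact_sales(region);
""")

# Ask SQLite how it would run a typical query; the detail text is in column 3.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT amount FROM fact_sales WHERE region = ?",
    ("North",),
).fetchall()
plan_text = " ".join(row[3] for row in plan)
print(plan_text)
```

Checking the query plan is a quick way to confirm that the index created in this step is actually used by the queries the mart is meant to serve.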
Step 3: Populate / Data Transfer
 This is the step in which you populate the Data Mart by transferring data into
it. You can also set the frequency with which data transfer will be done,
whether daily or weekly.
 To keep the stored information clean, the existing contents are
overwritten each time the Data Mart is populated. In this step, the source
information is extracted, cleaned, transformed, and loaded into the Data Mart.
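The populate step can be sketched as a small extract, clean, transform, and load routine. The source rows, the mart_orders table, and both helper functions below are hypothetical; the full overwrite on each run mirrors the description above rather than any particular tool's behavior.

```python
import sqlite3

# Extract: rows as they might arrive from an operational system (made-up data).
source_rows = [
    {"order_id": "1", "region": " north ", "amount": "100.50"},
    {"order_id": "2", "region": "South", "amount": "n/a"},  # unusable amount
    {"order_id": "3", "region": "NORTH", "amount": "20"},
]

def clean_and_transform(rows):
    """Drop rows with unparseable amounts and normalize types and region names."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # cleaning: skip rows that cannot be converted
        cleaned.append((int(row["order_id"]), row["region"].strip().title(), amount))
    return cleaned

def populate_mart(conn, rows):
    """Load: overwrite the mart table in a single transaction."""
    with conn:
        conn.execute("DELETE FROM mart_orders")
        conn.executemany("INSERT INTO mart_orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mart_orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
populate_mart(conn, clean_and_transform(source_rows))
populate_mart(conn, clean_and_transform(source_rows))  # a second scheduled run
loaded = conn.execute("SELECT * FROM mart_orders ORDER BY order_id").fetchall()
print(loaded)
```

Because each run replaces the previous contents, repeating the load on a daily or weekly schedule does not create duplicate rows.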
Step 4: Data Access
 In this step, the data that has been loaded into the Data Mart is put into active
use. Activities involved here include querying, generating graphs and reports,
and publishing.
 To make it easy for non-technical users to use the Data Mart, a meta-layer
should be set up that translates item names and database structures into
familiar business terms.
 If possible, interfaces and APIs should be set up to ease the process of data
access.
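One simple way to approximate such a meta-layer is a database view that exposes business-friendly names over the physical columns. The sketch below assumes SQLite and uses hypothetical table, view, and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical table with terse technical names (hypothetical).
    CREATE TABLE t_cust_ord (cust_no INTEGER, ord_amt REAL, ord_dt TEXT);
    INSERT INTO t_cust_ord VALUES (7, 99.0, '2024-03-01'), (8, 15.5, '2024-03-02');

    -- Meta-layer: a view that renames columns into business terms.
    CREATE VIEW customer_orders AS
        SELECT cust_no AS customer_number,
               ord_amt AS order_amount,
               ord_dt  AS order_date
        FROM t_cust_ord;
""")

cursor = conn.execute("SELECT * FROM customer_orders ORDER BY customer_number")
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(column_names)
print(rows)
```

Reports and dashboards can then be built against the view, so the physical table can be renamed or restructured later without changing what business users see.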
Step 5: Manage
This is the last step when building a Data Mart and it involves the following tasks:
 Controlling user access.
 Refining and optimizing the target system to improve its performance.
 Adding new data into the Data Mart and managing it.
 Configuring recovery settings and ensuring that the system remains
available even after a disaster.
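As one small example of the recovery task, the sketch below copies a SQLite-backed mart into a second database with the sqlite3 module's Connection.backup method and then reads from the copy after the primary is closed. The table name and data are hypothetical, and a real deployment would back up to durable storage rather than to memory.

```python
import sqlite3

# Primary mart with a little data (hypothetical table).
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE mart_sales (region TEXT, amount REAL)")
mart.execute("INSERT INTO mart_sales VALUES ('North', 10.0), ('South', 5.0)")
mart.commit()

# Copy the whole database into a separate connection (Python 3.7+).
backup_copy = sqlite3.connect(":memory:")
mart.backup(backup_copy)

# Simulate losing the primary, then serve reads from the backup.
mart.close()
recovered = backup_copy.execute(
    "SELECT region, amount FROM mart_sales ORDER BY region"
).fetchall()
print(recovered)
```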

Data Mart Tools
There are several tools available for managing and building Data Marts, including:
 IBM InfoSphere Warehouse: This program offers data integration, data
transformation, and data management functions, as well as data warehousing
capabilities.
 Oracle Database: Oracle Database includes features for creating and
managing Data Marts, among them the Oracle Data Warehousing option, which
offers advanced data management, integration, and analysis capabilities.
 Microsoft SQL Server: SQL Server Analysis Services, which offers
multidimensional data analysis and reporting capabilities, is just one of the
tools available through Microsoft SQL Server for creating and managing Data
Marts.
 Teradata: Teradata is a powerful analytics and data warehousing platform
that supports creating and maintaining Data Marts.
 SAP BusinessObjects: This program can be used to create and maintain
Data Marts because it has data integration, data quality, and data analysis
features.
 Tableau: To support business intelligence and decision-making, Tableau
offers robust data visualization and analysis capabilities that can be used with
Data Marts.
 Amazon Redshift: An affordable and highly scalable cloud-based data
warehousing solution, Amazon Redshift supports the creation and
management of cloud Data Marts.
Each of these tools offers a different mix of capabilities, strengths, and
weaknesses for creating and managing Data Marts. To choose the right tool for
a specific project or use case, assess each one carefully against the
organization's requirements.

Benefits of data marts
Data marts can offer benefits to every industry. Here are a few benefits to consider:
 Centralized data: Data marts help centralize specific data sets so everyone
is drawing information from a single source. This helps prevent data
discrepancies and reduces errors.
 Scalable data management: Data marts make data sets more scalable,
meaning they can grow as a company's needs change. Teams can scale the
data in a data mart so that the business continues to meet its data needs.
 Fast implementation: Data marts are more specific than larger data
warehouses, which can make them faster and easier to implement. This can
save time and money for a company.
 Quick data access: Data marts make it easier for teams to review data sets
and access specific data more quickly. This can make the data acquisition
process much quicker and save time and money.
 Better decision-making: With faster access to more accurate data
sets, teams can make better decisions based on tangible information. This can
improve overall efficiency and reduce costs.
 Low cost: Setting up a data mart can cost significantly less than a full-size
data warehouse for the business. The company can then reinvest those savings
into other parts of the business.

Features of data marts:


 Subset of Data: Data marts are designed to store a subset of data from a
larger data warehouse or data lake. This allows for faster query performance
since the data in the data mart is focused on a specific business unit or
department.
 Optimized for Query Performance: Data marts are optimized for query
performance, which means that they are designed to support fast queries and
analysis of the data stored in the data mart.

 Customizable: Data marts are customizable, which means that they can be
designed to meet the specific needs of a business unit or department.
 Self-Contained: Data marts are self-contained, which means that they have
their own set of tables, indexes, and data models. This allows for easier
management and maintenance of the data mart.
 Security: Data marts can be secured, which means that access to the data in
the data mart can be controlled and restricted to specific users or groups.
 Scalability: Data marts can be scaled horizontally or vertically to
accommodate larger volumes of data or to support more users.
 Integration with Business Intelligence Tools: Data marts can be
integrated with business intelligence tools, such as Tableau, Power BI, or
QlikView, which allows users to analyze and visualize the data stored in the
data mart.
 ETL Process: Data marts are typically populated using an Extract,
Transform, Load (ETL) process, which means that data is extracted from the
larger data warehouse or data lake, transformed to meet the requirements of
the data mart, and loaded into the data mart.

Disadvantages of Data Mart


1. Data redundancy across multiple data marts can cause data inconsistencies
and consumes more storage space.
2. The limited scope of the data makes cross-departmental or
cross-functional decisions hard.
3. Integration is challenging because data marts might be built with
different data models.
4. Scalability is limited, since data marts are not designed to handle large
and complex data, for example when a company grows.

Data Mart Vs Data Warehouse:

Data Warehouse | Data Mart

Stores data from numerous subject areas. | Carries data related to a single department, such as HR, marketing, or finance.

Acts as a central data repository for a company. | Is a logical subsection of a data warehouse for particular departmental applications.

A star schema is the most widely used design. | Uses a star schema for designing tables.

Tricky to design and use due to its large size (more than 100GB). | Comparatively more manageable due to its small size (less than 100GB).

Designed to support the decision-making process in a company. | Designed for particular user groups or corporate departments.

Stores detailed information in denormalized or normalized form. | Holds highly denormalized data in a summarized form.

Has large dimensions and integrates data from many sources. | Has smaller dimensions and integrates data sets from a smaller number of sources.

Subject-oriented and time-variant, with data retained for a longer duration. | Used for particular areas of a business, with data retained for a shorter duration.

