Data Mining Techniques for Business Insights
SYLLABUS
UNIT I INTRODUCTION
Data mining, Text mining, Web mining, Spatial mining, Process mining, Data
warehouse and data marts.
UNIT - 1
********************************************************************
What is Data Mining?
Data mining, also known as knowledge discovery in data (KDD), is the
process of uncovering patterns and other valuable information from large
data sets.
The primary goal of data mining is to discover hidden patterns and
relationships in the data that can be used to make informed decisions or
predictions. This involves exploring the data using various techniques such
as clustering, classification, regression analysis, association rule mining,
and anomaly detection.
Prescriptive data mining
Prescriptive data mining involves using data and models to make
recommendations or suggestions about actions or decisions. This type of data
mining is often used to optimize processes, allocate resources, or make other
decisions that can help organizations achieve their goals.
5. Improved risk management
Data mining can also be used to improve risk management.
By analyzing data on potential risks and vulnerabilities, data mining can help
organizations identify and mitigate potential risks, and make more informed
and strategic decisions.
5. Data mining requires large databases
Data mining is one of the most powerful tools in a marketer’s toolbox,
but it does have its drawbacks. One such drawback is that data mining
requires large databases to be effective.
For example, if an email list has only 100 people, then the data from
those emails will not provide enough information for data mining. On
the other hand, if the list contains 100,000 people, then there will be
more information available and data mining will be more successful.
6. Expensive
Data mining can be a very expensive process. For example, companies
have to hire additional employees and technology specialists to ensure
that the data mining is done correctly.
Many businesses have to invest in advanced data mining software,
which can also be expensive.
For many small businesses, the costs of data mining can outweigh the
benefits when the available data does not yield enough valuable insights.
2. Data Mining in Market Basket Analysis:
Market basket analysis is a modeling method based on the hypothesis that if
you buy a specific group of products, you are more likely to buy another
group of products.
This technique may enable the retailer to understand the purchase behavior
of a buyer. This data may assist the retailer in understanding the
requirements of the buyer and altering the store's layout accordingly.
Differential analysis can also be used to compare results between various
stores or between customers in different demographic groups.
5. Data Mining in CRM (Customer Relationship Management):
Customer Relationship Management (CRM) is all about acquiring and
retaining customers, enhancing customer loyalty, and implementing
customer-oriented strategies.
To build a good relationship with its customers, a business organization
needs to collect data and analyze it. With data mining technologies,
the collected data can be used for analytics.
Data mining can help bankers solve business-related problems in banking
and finance by identifying trends, causalities, and correlations in business
information and market costs that are not immediately evident to managers or
executives, because the data volume is too large or the data is generated
too rapidly for experts to screen.
Managers may use this information for better targeting, acquiring, retaining,
and segmenting of customers, and for maintaining a profitable customer base.
2. Data Distribution:
Real-world data is usually stored on various platforms in a distributed
computing environment.
It might be in a database, on individual systems, or even on the
internet. In practice, it is quite difficult to bring all the data into a
centralized data repository, mainly due to organizational and technical
concerns.
For example, various regional offices may have their own servers to store
their data. It may not be feasible to store all the data from all the offices
on a central server. Therefore, data mining requires the development of tools
and algorithms that allow the mining of distributed data.
3. Complex Data:
Real-world data is heterogeneous, and it could be multimedia data,
including audio and video, images, complex data, spatial data, time series,
and so on.
Managing these various types of data and extracting useful information is a
tough task. Most of the time, new technologies, new tools, and
methodologies would have to be refined to obtain specific information.
4. Performance:
The data mining system's performance relies primarily on the efficiency of
algorithms and techniques used.
If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
5. Data Privacy and Security:
Data mining can lead to serious issues in terms of data security,
governance, and privacy.
For example, if a retailer analyzes the details of purchased items, the
analysis reveals the buying habits and preferences of customers, possibly
without their permission.
6. Data Visualization:
In data mining, data visualization is a very important process because it is
the primary method that shows the output to the user in a presentable way.
The extracted data should convey the exact meaning of what it intends to
express.
However, presenting the information to the end user precisely and simply is
often difficult. Because both the input data and the output information are
complicated, efficient and effective data visualization processes need to be
implemented.
Data Mining Process
The data mining process typically involves the following steps:
1. Business Understanding:
This step involves understanding the problem that needs to be solved and
defining the objectives of the data mining project. This includes
identifying the business problem, understanding the goals and objectives of
the project, and defining the KPIs that will be used to measure success.
This step is important because it helps ensure that the data mining project is
aligned with business goals and objectives.
2. Data Understanding:
This step involves collecting and exploring the data to gain a better
understanding of its structure, quality, and content.
This includes understanding the sources of the data, identifying any data
quality issues, and exploring the data to identify patterns and relationships.
This step is important because it helps ensure that the data is suitable for
analysis.
3. Data Preparation:
This step involves preparing the data for analysis. This includes cleaning
the data to remove any errors or inconsistencies, transforming the data to
make it suitable for analysis, and integrating the data from different sources
to create a single dataset.
This step is important because it ensures that the data is in a format that can
be used for modeling.
4. Modeling:
This step involves building a predictive model using machine learning
algorithms.
This includes selecting an appropriate algorithm, training the model on the
data, and evaluating its performance.
This step is important because it is the heart of the data mining process and
involves developing a model that can accurately predict outcomes on new
data.
5. Evaluation:
This step involves evaluating the performance of the model. This includes
using statistical measures to assess how well the model is able to predict
outcomes on new data.
This step is important because it helps ensure that the model is accurate and
can be used in the real world.
6. Deployment:
This step involves deploying the model into the production environment.
This includes integrating the model into existing systems and processes to
make predictions in real-time.
This step is important because it allows the model to be used in a practical
setting and to generate value for the organization.
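The modeling and evaluation steps above can be illustrated with a minimal
Python sketch. It is not a production workflow: the one-feature threshold
model, the function names fit_threshold and accuracy, and the toy data are all
assumptions made for illustration.

```python
def fit_threshold(samples):
    """Modeling: pick the cut-off that classifies the training data best.

    samples is a list of (feature_value, label) pairs with label 0 or 1.
    The model predicts 1 whenever feature_value >= threshold.
    """
    best_threshold, best_correct = None, -1
    for threshold, _ in samples:
        correct = sum((value >= threshold) == bool(label)
                      for value, label in samples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, samples):
    """Evaluation: fraction of samples the threshold model gets right."""
    correct = sum((value >= threshold) == bool(label)
                  for value, label in samples)
    return correct / len(samples)


# Hypothetical prepared data, already split into training and held-out sets.
train = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
test = [(2.5, 0), (7.5, 1), (7.2, 0), (10, 1)]

threshold = fit_threshold(train)
print("learned threshold:", threshold)
print("accuracy on held-out data:", accuracy(threshold, test))
```

Measuring accuracy on data the model has not seen during training is what
gives the evaluation step its value.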
1. Classification
In general, classification means categorizing the available entities with respect
to some target variable. Think of a messy wardrobe: to clean and organize it,
you would start by separating the clothes into ethnic, casual, formal, and
loungewear.
Furthermore, the ones that no longer fit you fall into the discarded category.
In data mining terms, the clothes are the data received from multiple
sources, the clothing categories are analogous to the target class labels, and
the discarded clothes are the outliers.
More formally, classification is a predictive modeling technique within
supervised learning. It involves predicting class labels for new observations
based on a set of labeled observations. Businesses use this process of
compartmentalization to draw essential inferences.
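As a rough illustration of supervised classification, the sketch below
implements a k-nearest-neighbours classifier in plain Python. k-NN is only one
of many classification algorithms, and the features (hypothetical formality
and comfort scores for garments) are invented for this example.

```python
import math
from collections import Counter


def knn_predict(train, point, k=3):
    """Predict the label of point by majority vote among its k nearest
    labeled neighbours (Euclidean distance)."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Labeled observations: ((formality, comfort), class label). Values are made up.
wardrobe = [
    ((9, 2), "formal"), ((8, 3), "formal"), ((7, 2), "formal"),
    ((2, 9), "loungewear"), ((3, 8), "loungewear"), ((1, 9), "loungewear"),
]

print(knn_predict(wardrobe, (8, 2)))
print(knn_predict(wardrobe, (2, 8)))
```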
2. Clustering
Clustering is a common technique used for grouping similar data. The key
difference between classification and clustering is that the latter has no
target variable. Clustering is the process of separating the data set into
subgroups of similar items.
Recalling the previous wardrobe example, the clothes can be sub-grouped as
tops and bottoms. Clustering is often used as an exploratory step that prepares
the data for further analysis. A major example of clustering is finding
customers with similar purchasing behavior to generate interesting
recommendations.
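One way to see clustering at work is a small k-means sketch on a single
numeric feature. k-means is just one possible clustering algorithm; the
function name kmeans_1d, the starting centroids, and the monthly spend figures
are assumptions for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around the nearest centroid, then move each centroid
    to the mean of its group; repeat a fixed number of times."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical monthly spend per customer; no labels are given.
spend = [10, 12, 11, 50, 52, 48]
centroids, clusters = kmeans_1d(spend, centroids=[10, 52])
print(centroids)
print(clusters)
```

Note that no target variable appears anywhere: the groups emerge from the
similarity of the values alone.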
3. Tracking patterns
A fundamental data mining technique, tracking patterns helps find hidden
patterns and monitor trends in the data to build valuable insights. Pattern
recognition is one of the most important data mining techniques.
It helps identify customer behavior, buying patterns, and people with
similar interests. This knowledge discovery helps find potential customers,
predict sales, and much more for business growth.
4. Regression
Regression analysis is a technique within supervised learning. It involves
training algorithms with input features and output values to analyze the
relation between independent and dependent variables.
Fitting a best-fit line or curve to the data is the ultimate aim of
applying a regression algorithm.
Trained models, primarily a predictive modeling component, forecast the
outputs for new input data or fill gaps in missing data. We evaluate
regression models using three metrics: variance, bias, and error.
The types of regression include Linear Regression, Multiple Linear Regression,
Multivariate Linear Regression, Polynomial Regression, and Ridge and Lasso
Regression.
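The idea of a best-fit line can be made concrete with ordinary least squares
for simple linear regression. The sketch below is a minimal version in plain
Python; the function names and the data points are assumptions for
illustration, and it assumes the x values are not all identical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between the observed and the predicted values."""
    return sum((y - (slope * x + intercept)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: advertising spend (x) against units sold (y).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(mean_squared_error(xs, ys, slope, intercept))
```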
5. Outlier Detection or Anomaly Detection
Outlier mining, also called outlier detection or anomaly detection, is a data
mining technique used to identify the data items that do not match or fit the
expected behavior or predefined patterns.
Discovering the outliers helps find the reasons behind their occurrence and
prepare for future occurrences. Outlier detection is used to detect credit
card fraud, network intrusions, and interruptions.
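A simple way to flag items that do not fit the expected behavior is the
z-score rule: mark any value that lies more than a chosen number of standard
deviations from the mean. This is only one of many outlier detection methods;
the threshold of 2.0 and the transaction amounts below are assumptions for
illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical card transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 500]
print(find_outliers(amounts))
```

In small samples a single extreme value inflates the standard deviation
itself, so robust alternatives (for example, rules based on the median) are
often preferred in practice.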
6. Sequential Patterns
Sequential pattern mining is a data mining technique that discovers
meaningful associations between data occurrences. It helps find time-ordered
series of events that occur frequently, in order to establish the dependency
between them.
Sequential pattern mining is particularly useful when mining transactional
data over a specific period of time. It is used in stock market analysis,
forecasting natural disasters, DNA sequencing research, and predicting
possible attacks in cyber security.
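To make the notion of a time-ordered pattern concrete, the sketch below counts
how often one event is followed (not necessarily immediately) by another
across a set of sequences. It only handles patterns of length two, which is a
deliberate simplification, and the purchase sequences are invented.

```python
from collections import Counter
from itertools import combinations


def sequential_pair_support(sequences):
    """Return {(a, b): fraction of sequences in which a occurs before b}."""
    counts = Counter()
    for sequence in sequences:
        seen = set()
        # combinations() keeps positions in order, so i always precedes j.
        for i, j in combinations(range(len(sequence)), 2):
            if sequence[i] != sequence[j]:
                seen.add((sequence[i], sequence[j]))
        counts.update(seen)  # each pattern counts once per sequence
    return {pair: count / len(sequences) for pair, count in counts.items()}


# Hypothetical purchase histories, each ordered by time.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone"],
]
support = sequential_pair_support(histories)
print(support[("phone", "charger")])
```

Unlike plain co-occurrence, the order matters here: ("phone", "case") and
("case", "phone") are different patterns with separate counts.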
7. Prediction
Prediction is a valuable data mining technique that combines different data
mining techniques, including sequential patterns, classification, clustering,
trends, etc.
This data mining technique involves the use of historical data and events in
sequence to understand the behavior and predict the occurrences of future
events.
One of the most common applications of prediction is estimating the profit
or loss of a business from its sales data.
8. Association Rules
Grounded in statistics, association rule mining is a data mining technique
that finds relational patterns between variables.
This technique follows the law of association, which states that the
likelihood of one event occurring depends on the occurrence of other events
in the data. For example, one is likely to buy car accessories from the market
after buying a car.
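The strength of a rule such as "car => car accessories" is commonly quantified
with two measures: support (how often the items appear together) and
confidence (how often the consequent appears when the antecedent does). The
sketch below computes both; the transactions are invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(items <= transaction for transaction in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical purchase records, one set of items per customer.
purchases = [
    {"car", "floor mats"},
    {"car", "floor mats", "seat covers"},
    {"car"},
    {"seat covers"},
]
print(support(purchases, {"car", "floor mats"}))
print(confidence(purchases, {"car"}, {"floor mats"}))
```

Here the rule "car => floor mats" holds in half of all transactions and in
two out of the three transactions that contain a car.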
9. Visualisation
Data visualization is a data mining technique granting people access to
insights based on visual sensory perceptions.
The dynamic visual patterns allow users to unveil the trends in the data to
understand their business information.
Data Mining Architecture
The significant components of data mining systems are a data source, data mining
engine, data warehouse server, the pattern evaluation module, graphical user
interface, and knowledge base.
Data Source:
The actual sources of data are databases, data warehouses, the World Wide
Web (WWW), text files, and other documents. You need a huge amount of
historical data for data mining to be successful. Organizations typically
store data in databases or data warehouses.
Data warehouses may comprise one or more databases, text files,
spreadsheets, or other repositories of data. Sometimes, even plain text files
or spreadsheets may contain information. Another primary source of data is
the World Wide Web or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data
must be cleaned, integrated, and selected.
As the information comes from various sources and in different formats,
it can't be used directly for the data mining procedure because the data
may not be complete and accurate.
So the data first needs to be cleaned and unified. More information
than needed will be collected from various data sources, and only the data
of interest will have to be selected and passed to the server.
These procedures are not as simple as they may seem. Several methods may be
performed on the data as part of selection, integration, and cleaning.
The pattern evaluation module commonly employs interestingness measures that
cooperate with the data mining modules to focus the search towards
interesting patterns.
It might utilize an interestingness threshold to filter out discovered patterns.
Alternatively, the pattern evaluation module might be integrated
with the mining module, depending on the implementation of the data
mining techniques used.
For efficient data mining, it is highly recommended to push the
evaluation of pattern interestingness as deep as possible into the mining
procedure, to confine the search to only the interesting patterns.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It
might be used to guide the search or to evaluate the interestingness of the
result patterns.
The knowledge base may even contain user views and data from user
experiences that might be helpful in the data mining process.
The data mining engine may receive inputs from the knowledge base to
make the result more accurate and reliable.
The pattern evaluation module regularly interacts with the knowledge
base to get inputs and to update it.
TEXT MINING
MEANING:
Text mining (also known as text analysis) is the process of transforming
unstructured text into structured data for easy analysis.
Text mining uses natural language processing (NLP), allowing machines to
understand the human language and process it automatically.
Text mining is a process of extracting useful information and nontrivial
patterns from a large volume of text databases.
2. Information Extraction
It is a process of extracting meaningful words from documents.
Feature Extraction
In this process, we try to develop some new features from existing ones.
This objective can be achieved by parsing an existing feature or
combining two or more features based on some mathematical operation.
Feature Selection
In this process, we try to reduce the dimensionality of the dataset which
is generally a common issue while dealing with the text data by selecting a
subset of features from the whole dataset.
3. Natural Language Processing
Natural Language Processing includes tasks that are accomplished by using
Machine Learning and Deep Learning methodologies. It concerns the automatic
processing and analysis of unstructured text information.
Named Entity Recognition (NER): Identifying and classifying named
entities such as people, organizations, and locations in text data.
Sentiment Analysis: Identifying and extracting the sentiment (e.g. positive,
negative, neutral) of text data.
Text Summarization: Creating a condensed version of a text document that
captures the main points.
Lemmatization does not simply remove the inflections. Instead, it uses
information from lexical resources such as dictionaries to get the correct
base forms of words.
Part-of-speech Tagging: It aims to assign a part of speech to each word
of a given text based on its meaning and context. NLTK, spaCy, and Pattern
are a few of the tools that can be used for this.
Chunking: It is a natural language processing step that identifies the
constituent parts of sentences and links them to higher-order units that have
discrete grammatical meanings. NLTK is a good tool for this.
Named Entity Recognition (NER): It aims to find named entities in
text and classify them into pre-defined categories. NLTK and spaCy can be
used for this.
Relationship Extraction: This helps in identifying relations among
named entities like people, organizations, etc. It makes it possible to
obtain structured information from unstructured sources such as raw text.
2. Text Transformation:
This step represents a document by the words it contains and the number of
times each one occurs. There are mainly two approaches for this step.
Bag of Words: A text is represented as a bag of its words (a multiset),
disregarding grammar and even word order, but keeping multiplicity.
Vector Space: In this model, a document is converted into a vector of index
terms derived from its words. Each dimension of the vector corresponds to a
term that appears in the text. Its weight records the importance of the term
to the text.
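Both representations can be sketched in a few lines of Python. The bag of
words is a word-count multiset; for the vector space weights, this sketch
assumes TF-IDF, which is one common weighting scheme and not necessarily the
one intended above. Tokenizing by whitespace and the sample documents are
further simplifying assumptions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count each lower-cased word, ignoring grammar and word order."""
    return Counter(text.lower().split())


def tf_idf(documents):
    """Return one {term: weight} dict per document, where
    weight = term count * log(number of documents / documents containing the term)."""
    bags = [bag_of_words(document) for document in documents]
    doc_freq = Counter(term for bag in bags for term in bag)
    total = len(bags)
    return [{term: count * math.log(total / doc_freq[term])
             for term, count in bag.items()}
            for bag in bags]


documents = ["the cat sat", "the dog sat down", "the cat ran"]
print(bag_of_words("The cat and the hat"))
weights = tf_idf(documents)
print(weights[1])
```

A term such as "the" that occurs in every document receives weight zero,
while a term unique to one document, such as "dog", receives the largest
weight.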
3. Feature Selection:
Feature selection is also known as attribute selection or variable selection. It
is the selection of the most relevant features from the available variables,
that is, those that give the most information about the variable you want to
predict.
Irrelevant features can increase the complexity and decrease the
accuracy of the analysis.
The Pearson correlation coefficient, the chi-squared test, recursive feature
elimination, lasso regression, and tree-based algorithms are a few methods that
can be used for this. Python can be used for all of this analysis.
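As an example of the first method listed, the sketch below ranks features by
the absolute value of their Pearson correlation with the target. The feature
names and values are hypothetical, and the code assumes no feature is
constant (a constant feature would cause a division by zero).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rank_features(features, target):
    """Order feature names from most to least correlated with the target."""
    scores = {name: abs(pearson(values, target))
              for name, values in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical document features and the value we want to predict.
target = [1, 2, 3, 4, 5]
features = {
    "noise": [3, 1, 4, 1, 5],
    "doc_length": [2, 4, 6, 8, 10],
}
print(rank_features(features, target))
```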
4. Data Mining:
Here we combine the process of text mining with the traditional data mining
techniques. Once the data is structured after the above processes, the classic
data mining techniques are applied to the data to retrieve the information.
These techniques include classification, clustering, regression, outlier
detection, sequential patterns, prediction, and association rules.
5. Evaluation:
After the data mining techniques are applied, we get an end result. That
result is to be evaluated and checked for the accuracy in the prediction.
Nowadays, chatbots that mimic human customer care officers are
used on many websites in order to make the user experience more
customized.
Text mining is used to provide a rapid, automated response
to customers, which reduces their reliance on call center
operators to solve problems.
3. Personalized Advertising:
The field of digital advertising has been revolutionized by the development
of text and web mining, and this is one of the more recent applications of
text mining.
The text data related to what a person types or searches for online may be
shared with other companies, which in turn show ads that have a higher
probability of being clicked and converted into a sale.
4. Spam Filtering:
E-mail is one of the most widely used means of official communication. It has
a very wide range of applications, but its darker side is the spam mail that
infests users' inboxes.
Spam mails use up a lot of storage, and they can also be a route
through which viruses or scams arrive. Various companies use
intelligent text mining software as well as traditional keyword
matching techniques to identify and filter spam mails.
5. Social Media Analysis and Crime Prevention:
Social media has been popular for a long time, and millions of ordinary
users use this medium as a means of communication.
The anonymous nature of the internet has made it easy for criminals to
plan their activities online. Distinguishing potentially
threatening messages from normal ones is a task that has been made
possible by the use of advanced text mining software.
Also, online text analysis can be a good method to analyse what is ‘hot’ or
trending at a particular time. This can be highly beneficial for various
commercial companies.
6. Digital Library
Various text mining strategies and tools are used to extract patterns
and trends from journals and proceedings, which are stored in text database
repositories.
These information resources help in the field of research. Libraries
are a good resource for text data in digital form. Text mining offers a novel
technique for getting useful data in a way that makes it possible to access
millions of records online.
7. Academic and Research Field
In the education field, different text mining tools and strategies are
used to examine educational trends in a specific region or research
field.
The main purpose of using text mining in the research field is to help
discover and arrange research papers and relevant material from various
fields on one platform.
8. Life Science
Life science and healthcare industries are producing an enormous volume
of textual and mathematical data regarding patient records, sicknesses,
medicines, symptoms, and treatments of diseases, etc.
Filtering the data and relevant text needed to make decisions from a
biological data repository is a major challenge. Clinical records contain
variable data that is unpredictable and lengthy. Text mining can help to
manage such kinds of data.
9. Business Intelligence
Text mining plays an important role in business intelligence, helping
different organizations and enterprises to analyze their customers and
competitors to make better decisions.
It gives a clearer understanding of the business and provides data on how to
improve consumer satisfaction and gain competitive benefits. Text mining
tools such as IBM text analytics are used for this purpose.
Text Mining Approaches in Data Mining
The following text-mining techniques are applied in data mining:
1. Automatic Document Classification Analysis
2. Keyword-based Association Analysis
1. Automatic Document Classification Analysis
This technique is used to automatically classify the vast number of online text
documents, such as emails and web pages. As document databases are not arranged
according to attribute-value pairs, the categorization of text documents differs from
the classification of relational data.
2. Keyword-Based Association Analysis
It gathers groups of terms or keywords that frequently appear together and then
determines the correlation between them. The text data is first preprocessed by
parsing, stemming, removing stop words, etc. After preprocessing, association
mining methods are applied. Since no human effort is necessary in this case,
fewer undesirable results are obtained, and the execution time is shorter.
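The preprocessing and counting stages of keyword-based association analysis
can be sketched as follows. Stemming is left out for brevity, and the stop
word list and the sample documents are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# A tiny, hand-picked stop word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}


def keywords(text):
    """Lower-case the text, drop stop words, and return the unique keywords
    in sorted order so that each pair has one canonical form."""
    return sorted({word for word in text.lower().split()
                   if word not in STOP_WORDS})


def cooccurrence(documents):
    """Count in how many documents each pair of keywords appears together."""
    counts = Counter()
    for document in documents:
        counts.update(combinations(keywords(document), 2))
    return counts


documents = ["data mining of the web", "mining the web data", "text mining"]
pairs = cooccurrence(documents)
print(pairs.most_common(3))
```

Pairs with high counts are the candidates from which association rules
between keywords would then be derived.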
Advantages of Text Mining
1. Large Amounts of Data:
Text mining allows organizations to extract insights from large amounts
of unstructured text data. This can include customer feedback, social media
posts, and news articles.
2. Variety of Applications:
Text mining has a wide range of applications, including sentiment
analysis, named entity recognition, and topic modeling. This makes it a
versatile tool for organizations to gain insights from unstructured text data.
3. Improved Decision Making:
Text mining can be used to extract insights from unstructured text data,
which can be used to make data-driven decisions.
4. Cost-effective:
Text mining can be a cost-effective way to extract insights from
unstructured text data, as it eliminates the need for manual data entry.
5. Broader benefits:
Cost reductions, productivity increases, the creation of new
services, and new business models are just a few of the larger economic
advantages mentioned by those consulted.
4. Limited to Text Data:
Text mining is limited to extracting insights from unstructured text data
and cannot be used with other data types.
5. Noise in text mining results:
Text mining of documents may result in mistakes. It’s possible to find
false links or to miss others. In most situations, if the noise (error rate) is
sufficiently low, the benefits of automation exceed the chance of a larger
mistake than that produced by a human reader.
6. Lack of transparency:
Text mining is frequently viewed as a mysterious process where large
corpora of text documents are input and new information is produced. Text
mining is in fact opaque when researchers lack the technical know-how or
expertise to comprehend how it operates, or when they lack access to corpora
or text mining tools.
WEB MINING
Meaning:
Web mining is the process of extracting valuable information from the vast
data available on the World Wide Web.
Web mining is the process of discovering patterns, structures, and
relationships in web data. It involves using data mining techniques to analyze
web data and extract valuable insights.
3. Search engine optimization:
Web mining can be used to analyze search engine queries and search engine
results pages (SERPs). This information can be used to improve the visibility of
websites in search engine results and increase traffic to the website.
4. Fraud detection:
Web mining can be used to detect fraudulent activity on websites. This
information can be used to prevent financial fraud, identity theft, and other types
of online fraud.
5. Sentiment analysis:
Web mining can be used to analyze social media data and extract sentiment
from posts, comments, and reviews. This information can be used to understand
customer sentiment towards products and services and make informed business
decisions.
6. Web content analysis:
Web mining can be used to analyze web content and extract valuable
information such as keywords, topics, and themes. This information can be used
to improve the relevance of web content and optimize search engine rankings.
7. Customer service:
Web mining can be used to analyze customer service interactions on websites
and social media platforms. This information can be used to improve the quality
of customer service and identify areas for improvement.
8. Healthcare:
Web mining can be used to analyze health-related websites and extract
valuable information about diseases, treatments, and medications. This
information can be used to improve the quality of healthcare and inform medical
research.
Types of web mining
1. Web Content Mining
2. Web Structure Mining
3. Web Usage Mining
Process of Web Mining
The process of web mining typically involves the following steps -
1. Data collection –
Web data is collected from various sources, including web pages,
databases, and APIs.
2. Data pre-processing –
The collected data is pre-processed to remove irrelevant information,
such as advertisements and duplicate content.
3. Data integration –
The pre-processed data is integrated and transformed into a structured
format for analysis.
4. Pattern discovery –
Web mining techniques are applied to identify patterns, trends, and
relationships.
5. Evaluation –
The discovered patterns are evaluated to determine their significance
and usefulness.
6. Visualization –
The analysis results are visualized through graphs, charts, and other
visualizations.
Data Mining                              Web Mining
---------------------------------------  -----------------------------------------
Data engineers and data scientists can   Data scientists, data engineers, and data
do data mining.                          analysts can do web mining.
Tools used by data mining are machine    Tools used by web mining are PageRank,
learning algorithms.                     Scrapy, Apache logs.
Skills needed for data mining are        Skills needed for web mining are
machine learning algorithms,             application-level knowledge, probability,
probability, statistics.                 statistics.
4. Detects fraudulent activities
By monitoring user behavior, web mining can identify unusual patterns
that may indicate fraudulent activities, helping to protect both the business and
its customers.
5. Uncovers data patterns and trends
Web mining reveals insights into how users interact with a site,
enabling businesses to spot emerging trends and make data-driven decisions to
stay ahead of the market.
Some Techniques in Web Usage Mining
1. Association Rules:
The most used technique in web usage mining is association rules.
Basically, this technique focuses on relations among the web pages that
frequently appear together in users’ sessions.
The pages accessed together are grouped into a single server
session.
Association rules help in restructuring websites using the access
logs.
Access logs generally contain information about the requests that
arrive at the web server.
The major drawback of this technique is that so many rules are
produced together that some of them may be completely
inconsequential and may never be of use.
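Before association rules can be mined, the raw access log has to be grouped
into sessions. The sketch below shows one simple way to do that; the
(user, time, page) record layout and the 30-minute inactivity timeout are
assumptions for illustration, not a property of any particular web server.

```python
def sessionize(entries, timeout=1800):
    """Group log entries into per-user sessions.

    entries is a list of (user, unix_time, page) tuples sorted by time.
    A gap longer than timeout seconds starts a new session for that user.
    """
    sessions = []
    last_seen = {}  # user -> (time of last request, index of open session)
    for user, timestamp, page in entries:
        previous = last_seen.get(user)
        if previous is None or timestamp - previous[0] > timeout:
            sessions.append([page])
            index = len(sessions) - 1
        else:
            index = previous[1]
            sessions[index].append(page)
        last_seen[user] = (timestamp, index)
    return sessions


# Hypothetical, already parsed access log records.
log = [
    ("u1", 0, "/home"),
    ("u2", 10, "/home"),
    ("u1", 60, "/cart"),
    ("u2", 100, "/help"),
    ("u1", 5000, "/home"),
]
print(sessionize(log))
```

Each resulting session can then be treated as one transaction when looking
for pages that are frequently visited together.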
2. Classification:
Classification maps a particular record to one of several predefined
classes.
The main aim in web usage mining is to develop profiles
of users/customers that belong to a particular class/category. To do this,
one needs to extract the features that best characterize
the associated class.
Classification can be implemented by various algorithms, including
Support Vector Machines, K-Nearest Neighbors, Logistic
Regression, Decision Trees, etc.
For example, given customers' purchase history over the last 6 months,
each customer can be classified into frequent or non-frequent
classes/categories. In other cases, there can be more than two classes.
3. Clustering:
Clustering is a technique to group together a set of things having similar
features/traits. There are mainly 2 types of clusters: the first one is the
usage cluster and the second one is the page cluster.
The clustering of pages can be readily performed based on the usage data.
In usage-based clustering, items that are commonly accessed/purchased
together can be automatically organized into groups.
The clustering of users tends to establish groups of users exhibiting similar
browsing patterns. In page clustering, the basic idea is to group related
web pages so that information can be found quickly.
SPATIAL MINING
Spatial data mining refers to the process of discovering hidden patterns and
relationships in geospatial data. Geospatial data includes information that is
associated with specific locations or geographic coordinates.
2. Association rules:
Association rules are derived from the data sets, and they describe
patterns that occur frequently in the database.
3. Characteristic rules:
Characteristic rules describe some parts of the data set.
4. Discriminant rules:
As the name suggests, discriminant rules describe the differences
between two parts of the database, such as comparing two cities by their
employment rate.
3. Polygon Data
Polygon data represents areas or regions in space. It consists of
closed shapes, which are formed by connecting multiple points, with the last
point connected to the first to close the shape. Examples of polygon data are
traffic zones, land parcels, boundaries of lakes or forests, etc.
4. Raster Data
Raster data represents space as a grid of pixels, where
each cell of the grid holds some attribute values. It is used to represent
data such as satellite imagery and remote sensing output.
5. Image Data
Image Data, as the name suggests, consists of spatial data in the form of
images. It is most often used for object detection, capturing visual information
about Earth, land cover classification, etc.
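The polygon and raster types above can be sketched in a few lines. The polygon
is stored as a list of points and closed by repeating the first point, and the
raster is a small grid of cells. The coordinates and land-cover labels are
made-up examples.

```python
from collections import Counter


def close_ring(points):
    """Return the points with the first point repeated at the end."""
    return points if points[0] == points[-1] else points + [points[0]]


def polygon_area(points):
    """Area of a simple polygon using the shoelace formula."""
    ring = close_ring(points)
    twice_area = sum(
        x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:])
    )
    return abs(twice_area) / 2


# Hypothetical land parcel given as corner coordinates.
land_parcel = [(0, 0), (4, 0), (4, 3), (0, 3)]
print(polygon_area(land_parcel))

# Hypothetical raster: each grid cell stores one land-cover attribute.
land_cover_raster = [
    ["water", "forest", "forest"],
    ["water", "urban", "forest"],
]
cells_per_class = Counter(cell for row in land_cover_raster for cell in row)
print(cells_per_class)
```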
5. Crime Analysis
Spatial Data Mining can be used to identify crime hotspots, understand
crime patterns and develop proper strategies to prevent crimes and hence
improve public safety.
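One simple way to look for hotspots is to lay a grid over the area and count
incidents per cell. The sketch below assumes invented coordinates in kilometres
and a 1 km cell size. Real analyses usually use more robust methods such as
kernel density estimation.

```python
from collections import Counter

# Hypothetical incident locations (x, y) in kilometres on a local grid.
incidents = [
    (0.2, 0.3), (0.8, 0.1), (0.5, 0.9),
    (3.1, 2.2), (0.4, 0.6), (3.7, 2.9),
]

CELL_SIZE_KM = 1.0  # assumed grid resolution


def cell_of(x, y, size=CELL_SIZE_KM):
    """Map a coordinate to the (column, row) index of its grid cell."""
    return (int(x // size), int(y // size))


counts = Counter(cell_of(x, y) for x, y in incidents)
hotspot_cell, incident_count = counts.most_common(1)[0]
print(hotspot_cell, incident_count)
```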
Tools and Software for Spatial Data Mining
There are several tools and software available to support spatial data mining.
One popular tool is ArcGIS, a comprehensive GIS platform that
offers data management, powerful visualization, and spatial online analytical
processing capabilities.
Another widely used tool is QGIS, an open-source GIS software that provides
similar functionalities to ArcGIS. For data mining tasks, tools like RapidMiner
and KNIME offer spatial extensions that enable integrating spatial data into
the data mining workflow. These tools provide a user-friendly interface, a
wide range of algorithms, and advanced visualization options to facilitate
geospatial data analysis.
PROCESS MINING
Process mining is a technique designed to discover, monitor, and improve
real business processes by extracting readily available knowledge from the
event logs of information systems.
Process mining helps organizations gain a full understanding of the processes
that support their customers through the examination of the actual processes,
which often differ from the documented processes that they currently use.
3. Enhancement
The enhancement technique uses additional information to change each
workflow’s pre-existing models. Also called organizational mining,
performance mining or extension, the enhancement strategy aims to
continually optimize these pre-existing models based on your organizational
logs. Enhancement can help make both your pre-existing workflow models
and your actual workflows more detailed, which may improve accuracy.
6. Process standardization:
It supports standardizing processes across an organization by identifying
variations and aligning them with the optimal process model. This helps ensure
consistent performance and quality.
7. Better customer experience:
Streamlining processes and enhancing efficiency leads to improved service
delivery, fostering greater customer satisfaction and loyalty.
5. Complexity in large organizations:
In larger organizations, the volume and complexity of processes can amplify
the challenges of process mining, affecting insight extraction. By adopting object-
centric or multi-level process mining techniques, organizations can better manage
and analyze complex processes.
6. Potential resistance to change:
Significant changes in process management due to process mining can meet
resistance from employees accustomed to existing workflows. Effective change
management is critical for successful implementation and adoption.
Implementing effective change management strategies, including staff training
and engagement, can facilitate smoother transition and adoption.
Healthcare:
Process mining provides recommendations for reducing patients' treatment
processing time.
E-commerce:
Process mining can provide insight into buyer behaviors and accurate
recommendations to increase sales.
Manufacturing:
Process mining enhances supply chain and manufacturing business operations
by assigning appropriate resources based on product attributes. Insights into
production times and resource allocation, such as storage space, machines or
workers, allow for more efficient management and operational transformation.
IT service management (ITSM):
Process mining can optimize service delivery and incident management
processes. It enables IT teams to analyze service workflows, identify
inefficiencies and improve response times. This helps enhance overall IT support
and customer satisfaction.
1. Process discovery
Process discovery is now a simple and speedy part of rapid workflows, used to
analyze desktop user interaction data and link it with process details mined
from system event data.
Using these timestamps, the software can then create visual models to help
identify the variations and bottlenecks impacting efficiency or customer
experience.
2. Process analysis
Process analysis is where you develop an understanding of process
performance based on actual data. Insights gained from process analytics may
include optimal process paths, paths for the greatest return on investment
(ROI) and causes of variations.
This is also the stage where you can better understand your processes by
visualizing them through process mapping. Some platforms include process
simulations, and with these, you can also visualize target business outcomes.
While process mining gives you insight into how your processes actually run,
business process management (BPM) gives you a map of your business's ideal
processes. You can use them together to improve processes for better workflow
management and business outcomes. Understanding how processes currently work
is invaluable for BPM.
Process mining helps across almost every step of the BPM lifecycle.
3. Process optimization
Using the data you’ve gathered in the discovery and analysis phases, you can
now make changes to your processes to prepare them for automation.
This is also where you select the most important processes on which to place
your focus.
4. Process automation
At this stage, your process data can be exported for robotic process
automation (RPA). With insights gained in previous steps, you can choose the
'happy path' with the least resistance and highest ROI for automation.
5. Process monitoring & prediction
Some process mining solutions will monitor every process instance. This is
usually only available with full process intelligence.
The platform will issue alerts or automatically act when unusual process
behavior occurs. Some platforms can even offer predictive analytics to further
enhance your process planning strategy.
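As a small illustration of the later stages, the sketch below picks a 'happy
path' and flags cases that deviate from it. Treating the most frequent variant
as the happy path is an assumption made here for simplicity, and the traces are
invented. Commercial platforms use richer criteria such as ROI.

```python
from collections import Counter

# Hypothetical traces: the ordered activities of each completed case.
traces = {
    "case-1": ("Receive", "Approve", "Pay"),
    "case-2": ("Receive", "Approve", "Pay"),
    "case-3": ("Receive", "Rework", "Approve", "Pay"),
}

# Assumption: the most frequent variant is treated as the happy path.
variant_counts = Counter(traces.values())
happy_path, _ = variant_counts.most_common(1)[0]

# Monitoring: flag cases whose trace differs from the happy path.
deviating_cases = sorted(c for c, t in traces.items() if t != happy_path)
print(happy_path, deviating_cases)
```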
How does process mining work?
Here are the four stages of process mining:
1. Read and transform organizational data
Many companies already track and store data related to their workflows.
Process mining technology transforms these data sets into event logs for each
of your organizational processes.
The process mining software then assigns every piece of data a time stamp, a
case ID and an activity. These three elements help the process mining program
analyze and prioritize your workflows more effectively.
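The three elements (time stamp, case ID, activity) are enough to rebuild how
each case flowed. The sketch below uses an invented event log, groups events
into one trace per case, and counts which activity directly follows which. This
count is a common starting point for drawing a process map.

```python
from collections import Counter, defaultdict

# Hypothetical event log rows: (case ID, activity, timestamp).
event_log = [
    ("order-1", "Receive order", "2024-01-02T09:00"),
    ("order-2", "Receive order", "2024-01-02T09:05"),
    ("order-1", "Check credit", "2024-01-02T09:30"),
    ("order-1", "Ship goods", "2024-01-03T14:00"),
    ("order-2", "Ship goods", "2024-01-02T16:00"),
]

# Group events by case, ordered by timestamp, to get one trace per case.
# ISO-8601 strings of the same format sort chronologically.
traces = defaultdict(list)
for case_id, activity, _ in sorted(event_log, key=lambda e: (e[0], e[2])):
    traces[case_id].append(activity)

# Count how often one activity is directly followed by another.
directly_follows = Counter()
for activities in traces.values():
    directly_follows.update(zip(activities, activities[1:]))

print(dict(traces))
print(directly_follows)
```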
4. Continually track your workflows
Process mining software continues to monitor and evaluate your workflows,
even after you've made improvements.
This can help you gather and analyze real-time data that shows you if you've
designed effective improvements. Process mining programs may also evaluate
whether your workflows adhere to compliance regulations for your industry.
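One way to check whether an improvement worked is to compare average cycle
time before and after the change. The start and end timestamps below are
invented, and cycle time is just one possible measure.

```python
from datetime import datetime
from statistics import mean

TIME_FORMAT = "%Y-%m-%dT%H:%M"


def cycle_time_hours(start, end):
    """Hours elapsed between the first and last event of a case."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds() / 3600


# Hypothetical (start, end) timestamps of cases before and after a change.
before = [("2024-01-02T09:00", "2024-01-03T09:00"),
          ("2024-01-04T10:00", "2024-01-05T16:00")]
after = [("2024-03-01T09:00", "2024-03-01T21:00"),
         ("2024-03-02T08:00", "2024-03-02T20:00")]

avg_before = mean(cycle_time_hours(s, e) for s, e in before)
avg_after = mean(cycle_time_hours(s, e) for s, e in after)
print(avg_before, avg_after)
```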
In March 2023, Analytics Insight magazine identified the top 5 process
mining software companies for 2023:
1. Celonis
2. UiPath Process Mining
3. SAP Signavio Process Intelligence
4. Software AG ARIS Process Mining
5. ABBYY Timeline
DATA WAREHOUSE:
A data warehouse is where data can be collected for mining purposes, usually
with large storage capacity.
Data warehousing is a method of organizing and compiling data into one
database, whereas data mining deals with fetching important data from
databases.
Data Warehousing integrates data and information collected from various
sources into one comprehensive database. For example, a data warehouse
might combine customer information from an organization's point-of-sale
systems, its mailing lists, website, and comment cards.
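The integration idea can be sketched as merging records from several sources
into one customer view. The source dictionaries, the field names, and the use
of an email address as the shared key are assumptions for illustration.

```python
# Hypothetical extracts from three source systems, keyed by customer email.
point_of_sale = {"ana@example.com": {"total_spent": 120.0}}
mailing_list = {
    "ana@example.com": {"subscribed": True},
    "raj@example.com": {"subscribed": False},
}
website = {"raj@example.com": {"last_visit": "2024-05-01"}}


def integrate(*sources):
    """Merge per-source records into one record per customer."""
    customers = {}
    for source in sources:
        for email, fields in source.items():
            customers.setdefault(email, {}).update(fields)
    return customers


warehouse_customers = integrate(point_of_sale, mailing_list, website)
print(warehouse_customers)
```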
Key Characteristics of Data Warehouse
The main characteristics of a data warehouse are as follows:
Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise information
rather than the overall processes of a business. Such subjects may be sales,
promotion, inventory, etc. For example, if you want to analyze your company’s
sales data, you need to build a data warehouse that concentrates on sales. Such a
warehouse would provide valuable information like ‘who was your best customer
last year?’ or ‘who is likely to be your best customer in the coming year?’
Integrated
A data warehouse is developed by integrating data from varied sources into a
consistent format. The data must be stored in the warehouse in a consistent and
universally acceptable manner in terms of naming, format, and coding. This
facilitates effective data analysis.
Non-Volatile
Data once entered into a data warehouse must remain unchanged. All data is
read-only. Previous data is not erased when current data is entered. This helps
you to analyze what has happened and when.
Time-Variant
The data stored in a data warehouse is documented with an element of time,
either explicitly or implicitly. An example of time variance in a data
warehouse is the primary key, which must include an element of time such as
the day, week, or month.
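The non-volatile and time-variant properties can be illustrated with a table
whose primary key includes a date and that only ever receives new rows. The
table and column names are hypothetical, and SQLite is used only because it
ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_sales (
        sale_date  TEXT    NOT NULL,   -- time element of the key
        product_id TEXT    NOT NULL,
        units_sold INTEGER NOT NULL,
        PRIMARY KEY (sale_date, product_id)
    )
""")

# New days are appended; earlier rows are never updated or erased.
rows = [("2024-01-01", "P1", 10), ("2024-01-02", "P1", 7)]
conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", rows)

history = conn.execute(
    "SELECT sale_date, units_sold FROM daily_sales "
    "WHERE product_id = 'P1' ORDER BY sale_date"
).fetchall()
print(history)
```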
1. Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates
decision-support services throughout the enterprise. The advantage to this type of
warehouse is that it provides access to cross-organizational information, offers a
unified approach to data representation, and allows running complex queries.
2. Operational Data Store (ODS)
This type of data warehouse refreshes in real-time. It is often preferred for
routine activities like storing employee records. It is required when data warehouse
systems do not support reporting needs of the business.
3. Data Mart
A data mart is a subset of a data warehouse built to maintain a particular
department, region, or business unit. Every department of a business has a central
repository or data mart to store data. The data from the data mart is stored in the
ODS periodically. The ODS then sends the data to the EDW, where it is stored and
used.
Improved Data Quality: A data warehouse stores and analyzes the data gathered
from different sources without altering it or adding data of its own, so the
quality of the data is maintained. If any data quality issue does arise, the
data warehouse team resolves it.
Historical Insight: The warehouse stores all your historical data which
contains details about the business so that one can analyze it at any time and
extract insights from it.
Features of Data Warehousing
Centralized Data Repository: Data warehousing provides a centralized
repository for all enterprise data from various sources, such as transactional
databases, operational systems, and external sources. This enables
organizations to have a comprehensive view of their data, which can help in
making informed business decisions.
Data Integration: Data warehousing integrates data from different sources
into a single, unified view, which can help in eliminating data silos and
reducing data inconsistencies.
Historical Data Storage: Data warehousing stores historical data, which
enables organizations to analyze data trends over time. This can help in
identifying patterns and anomalies in the data, which can be used to improve
business performance.
Query and Analysis: Data warehousing provides powerful query and
analysis capabilities that enable users to explore and analyze data in different
ways. This can help in identifying patterns and trends, and can also help in
making informed business decisions.
Data Transformation: Data warehousing includes a process of data
transformation, which involves cleaning, filtering, and formatting data from
various sources to make it consistent and usable. This can help in improving
data quality and reducing data inconsistencies.
Data Mining: Data warehousing provides data mining capabilities, which
enable organizations to discover hidden patterns and relationships in their
data. This can help in identifying new opportunities, predicting future trends,
and mitigating risks.
Data Security: Data warehousing provides robust data security features,
such as access controls, data encryption, and data backups, which ensure that
the data is secure and protected from unauthorized access.
Advantages of Data Warehousing
Intelligent Decision-Making: With centralized data in warehouses,
decisions may be made more quickly and intelligently.
Business Intelligence: Provides strong operational insights through
business intelligence.
Historical Analysis: Predictions and trend analysis are made easier by
storing past data.
Data Quality: Guarantees data quality and consistency for trustworthy
reporting.
Scalability: Capable of managing massive data volumes and expanding to
meet changing requirements.
Effective Queries: Fast and effective data retrieval is made possible by an
optimized structure.
Cost reductions:
Data warehousing can result in cost savings over time by reducing data
management procedures and increasing overall efficiency, even when there are
setup costs initially.
Data security:
Data warehouses employ security protocols to safeguard confidential
information, guaranteeing that only authorized personnel are granted access to
certain data.
Data integration challenges:
Data from different sources can be challenging to integrate, requiring
significant effort to ensure consistency and accuracy.
Data security:
Data warehousing can pose data security risks, and businesses must take
measures to protect sensitive data from unauthorized access or breaches.
This component allows us to extract data, fill disarranged data, highlight data
distribution from the central repository to the business intelligence
applications, and much more.
How does ETL Work?
To understand how ETL works, we should go through each step of the ETL process.
Extract
Copy raw data from the source locations to a staging area. This data is
collected by the data management team and can be structured or unstructured.
Transform
In the staging area, this data is transformed by filtering, cleansing, de-
duplicating, validating, and authenticating the data, etc.
Loading
In this step, transformed data is moved from the staging area to the
target data warehouse. Typically, all the data is loaded initially, and then
incremental loads are applied as changes occur in the data.
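The three ETL steps can be sketched end to end as follows. The source rows,
the `sales` table, and the cleaning rules (trim names, drop invalid amounts,
drop duplicate IDs) are assumptions for illustration. SQLite stands in for the
target warehouse.

```python
import sqlite3


def extract():
    """Extract: copy raw rows from a (hypothetical) source into staging."""
    return [
        {"id": "1", "name": " Ana ", "amount": "120.50"},
        {"id": "1", "name": " Ana ", "amount": "120.50"},      # duplicate
        {"id": "2", "name": "Raj", "amount": "not-a-number"},  # invalid
        {"id": "3", "name": "Mei", "amount": "80"},
    ]


def transform(staged):
    """Transform: validate, cleanse, and de-duplicate the staged rows."""
    seen, clean = set(), []
    for row in staged:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # validation: skip rows with a bad amount
        if row["id"] in seen:
            continue  # de-duplication
        seen.add(row["id"])
        clean.append((int(row["id"]), row["name"].strip(), amount))
    return clean


def load(rows, conn):
    """Load: move the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
loaded = conn.execute("SELECT id, name, amount FROM sales ORDER BY id").fetchall()
print(loaded)
```

Running `load` again with only the changed rows would act as a simple
incremental load, because `INSERT OR REPLACE` overwrites rows with the same ID.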
3. Metadata
Metadata is a component that can be used in a variety of conditions to build,
manage and maintain the system. The simplest definition of metadata is “it is
data about the data.”
It helps us to understand the context, nature, and structure of the data.
It enables the user to have an easy search and retrieval of data.
It is a key to unlocking the hidden content of the data and getting a proper
understanding of it.
4. Query Tools
Tools are the components using which we interact with the data warehouse and get
relevant data out of it.
Some of the tools used for interaction purposes are query and reporting tools,
application development tools, data mining tools, and online analytical tools.
Firstly, query and reporting tools are categorized into managed query tools
and reporting tools. Reporting tools are used for developing business reports,
and the end-users can use them at an affordable cost.
Managed query tools protect the end user from SQL query-related
complexities by adding a security layer between the database and users.
Online analytical processing tools are generally used to extract or retrieve data
selectively so that it can be analyzed from different standpoints. These
tools assume that data is organized in a multidimensional model.
Data mining tools are the set of tools that are used to analyze large amounts of
data and the relationships in that data.
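To illustrate the multidimensional view that analytical tools rely on, the
sketch below rolls up one measure along different dimensions. The fact rows and
the dimension names (`region`, `quarter`) are invented.

```python
from collections import defaultdict

# Hypothetical fact rows with two dimensions and one measure.
facts = [
    {"region": "North", "quarter": "Q1", "sales": 100},
    {"region": "North", "quarter": "Q2", "sales": 150},
    {"region": "South", "quarter": "Q1", "sales": 80},
]


def roll_up(rows, *dimensions):
    """Total the sales measure for each combination of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["sales"]
    return dict(totals)


print(roll_up(facts, "region"))
print(roll_up(facts, "quarter"))
print(roll_up(facts, "region", "quarter"))
```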
5. Data Marts
Data marts are components of a data warehouse. Let's discuss them in detail:
A data mart is a data store designed for a particular department of an
organization. Equivalently, it is a subset of the data warehouse that is
usually oriented to a specific purpose.
It helps stakeholders make quick, well-informed decisions from summarized
data.
In a data mart, companies can retrieve information more efficiently as it
contains the most relevant information.
It is used for making streamlined decisions and gives users privileges to
access fine-grained data.
As it has very few data tables, data engineers can manage and change
information without causing significant database changes.
6. Data Warehouse Management and Administration
Unlike traditional relational databases, a data warehouse collects vast amounts
of past and current data. To manage this, we require an administrator with a
skillset different from the traditional data administrator.
All the control elements manage the system within the data warehouse, and
these components also control the transformation of data into the data
warehouse.
DATA MART
A Data Mart is a smaller version of a data warehouse and it is meant to be
used by a particular department or a group of individuals in the company.
It focuses on a single functional unit of an organization and keeps a subset of
data stored in the data warehouse. It is normally controlled by a unit
department in the organization. Whereas a data warehouse draws data from
many sources, a Data Mart draws data from only a few sources.
A data mart is a subset of a database —usually a data warehouse—
where data is stored for a specific business area. That is, a data mart stores
concise and specific data sets used for analysis for a specific department or
line of business, such as the sales department.
What is a Data Mart?
A data mart is an access layer of a data warehouse focused on a specific line
of business, function, or department. It is used for retrieving client-facing data.
A data mart contains a subset of the data that is stored in a data warehouse.
1. Star
The star schema is a blueprint that resembles a star shape and consists of fact
tables that reference dimension tables in a relational database. The fact table is
placed at the center of the star and holds a set of metrics that relate to a
specific process.
The star schema requires fewer joins when writing queries as there is no
dependency between dimension tables. It simplifies the ETL request process,
making it efficient to access and navigate large data sets. These benefits
make star schemas widely used in most information technology systems.
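A minimal star schema might look like the sketch below: one fact table in the
center that references two dimension tables. All table names, column names, and
rows are hypothetical. SQLite is used because it is part of the Python
standard library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    -- Fact table: one row per sale, pointing at each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        units_sold INTEGER
    );
    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO dim_store   VALUES (10, 'Chennai'), (20, 'Mumbai');
    INSERT INTO fact_sales  VALUES (1, 10, 3), (2, 10, 5), (1, 20, 4);
""")

# Each dimension is one join away from the fact table.
units_by_city = conn.execute("""
    SELECT s.city, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_store s ON s.store_id = f.store_id
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
print(units_by_city)
```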
2. Snowflake
A snowflake schema extends the star schema blueprint with additional
dimension tables that are normalized to protect data integrity and minimize
data redundancy.
The snowflake schema’s main benefit is that it requires less storage space for
dimension tables.
However, a snowflake structure is difficult to maintain because multiple tables
need to be populated and synchronized. It can also hurt query performance,
because the additional dimension tables require additional joins.
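The trade-off can be seen in a small sketch where the product dimension is
normalized into a separate category table. Category names are stored once,
which saves space, but reaching them from the fact table takes two joins. All
names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized dimension: category names live in their own table.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_id   INTEGER PRIMARY KEY,
        product_name TEXT,
        category_id  INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        units_sold INTEGER
    );
    INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
    INSERT INTO dim_product  VALUES (1, 'Laptop', 1), (2, 'Phone', 1), (3, 'Desk', 2);
    INSERT INTO fact_sales   VALUES (1, 3), (2, 5), (3, 2);
""")

# Two joins are needed to get from the facts to the category name.
units_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.units_sold)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(units_by_category)
```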
3. Vault
The vault schema enables users to design agile enterprise data warehouses. It
is a fairly modern database modeling technique. The vault schema is a layered
structure that focuses on agility and scalability.
Business intelligence: Data marts enable quicker insights into strategic
information contained in a data warehouse. Business intelligence benefits the
organization through accelerated information access and potentially higher
productivity.
Analytics: It is easier to track key performance indicators through a data
mart.
2. Independent Data Marts
The second approach is independent data marts (IDM). Here, independent data
marts are created first, and then a data warehouse is designed using these
multiple independent data marts.
In this approach, all the data marts are designed independently, so the data
marts must be integrated. It is also termed a bottom-up approach, as the data
marts are integrated to develop a data warehouse.
Instead of being organized around the needs of the enterprise as a whole,
the data is usually organized around a particular subject area, such
as sales or inventory.
Business users can now more quickly and easily access the data they
require without having to wade through extraneous information.
The information in a Data Mart is typically obtained from a central
warehouse or other data sources, but it is arranged and displayed
in a manner that is specific to the particular business unit or function.
In addition to lowering the risk of data redundancy and enhancing data
governance, this helps to ensure that the data is accurate, consistent, and
up-to-date.
2. Integrated
A Data Mart’s ability to integrate data from various sources to produce
a single, comprehensive view of information is referred to as its integrated
characteristic. As a result, the information kept in it is not only
subject-oriented but also combined from a variety of sources to give users
a thorough understanding of a particular business function or process. Data
from both internal and external sources, such as data lakes, relational
databases, and business intelligence tools, are incorporated into the process
of integration.
Data Marts give business users a centralized location to access data that
is consistent, accurate, and up-to-date by combining data from various
sources. By integrating these systems, data provided is guaranteed
to be accurate and usable for data mining and querying, enabling
businesses to base decisions on specific data trends and insights.
3. Time-variant
The 'Time-variant' characteristic refers to how data changes over time.
In a Data Mart, data is stored in a way that allows for analysis of trends and
patterns over time. This means that the data stored in a mart is time-stamped,
enabling business users to query and analyze historical data
to identify trends and patterns in the data. By capturing changes in data
over time, Data Marts provide insights into how business data has evolved,
which can help organizations make data-driven decisions.
Additionally, the time-variant nature of Data Marts allows businesses
to track changes in specific data over time, which is particularly useful for
tracking trends in sales, customer behavior, and other business functions.
4. Non-volatile
Data Marts are non-volatile, meaning that once data is stored, it cannot
be altered or updated. This characteristic ensures data integrity and
consistency, which is crucial for business intelligence and decision-
making. Unlike operational databases that are subject to frequent changes,
Data Marts store historical data that is relevant to specific business
functions or departments. This allows for simplified data access and
querying, as well as faster response times when retrieving information.
Data Marts can be independent or part of an existing enterprise warehouse,
depending on the organization’s data management system and business
requirements. While Data Marts are focused on a specific subset of data,
they can still access and incorporate external data sources to support data
mining and provide a comprehensive view of business trends.
1. Bottom-up approach
In the bottom-up approach of Data Mart architecture, Data Marts are created
from the operational data sources that a business unit or department uses. This
approach starts with a department or business unit identifying the data they
need to access to support their specific business function, and then creating
a mart that stores that data. Data is loaded into the mart from internal
operational systems or external data sources and then structured into
dimension tables for simplified data access by business users.
This approach allows for a more agile warehouse, as it is built incrementally
from specific data needs rather than attempting to design and implement
an entire data warehouse upfront. Additionally, the use of independent Data
Marts can provide faster querying of data and support data mining for specific
data trends.
2. Top-down approach
The top-down approach to Data Mart architecture involves the creation
of a centralized data warehouse, which serves as the primary data source for
all Data Marts. In this approach, Data Marts are designed to serve the needs
of specific business units or functions and are created based on the
requirements of business users.
The data warehouses are designed to store all types of data, including
structured and unstructured data, from various data sources, including internal
operational systems and external data sources. Dimension tables and fact
tables are used to store and organize data within the warehouse. The top-down
approach to Data Mart architecture is a more warehouse-focused approach and
is often used in larger organizations with more complex data requirements.
3. Federated approach
A federated Data Mart architecture is a data management approach that allows
for the autonomous integration of multiple Data Marts while maintaining their
independence. This approach enables organizations to access and analyze data
from disparate sources without having to physically move the data
into a centralized repository. Usually, in marts that have a federated architecture,
data remains in its original location and is accessed on an as-needed basis
through a virtual layer.
This type of architecture provides increased agility and flexibility, as well
as a simplified data access process, making it easier for business users
to query data and gain insights. Additionally, a federated approach can support
a variety of data types, including structured, unstructured, and semi-structured
data, sourced from internal operational systems or external data sources,
making it a versatile solution for businesses looking to streamline their data
management processes.
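A very small sketch of the virtual-layer idea follows. Each mart stays where it
is and is read only when a query arrives. The class name `FederatedView`, the
mart functions, and the rows are all hypothetical. Real federated systems add
query planning, security, and caching.

```python
class FederatedView:
    """Virtual layer that reads each mart on demand instead of copying data."""

    def __init__(self, marts):
        self.marts = marts  # maps a mart name to a function returning its rows

    def query(self, predicate):
        """Yield matching rows from every mart, tagged with their source."""
        for name, fetch_rows in self.marts.items():
            for row in fetch_rows():
                if predicate(row):
                    yield {"source": name, **row}


def sales_mart():
    # Stand-in for a query against the sales department's own mart.
    return [{"customer": "Ana", "amount": 120}]


def support_mart():
    # Stand-in for a query against the support department's own mart.
    return [{"customer": "Ana", "tickets": 2}, {"customer": "Raj", "tickets": 1}]


view = FederatedView({"sales": sales_mart, "support": support_mart})
ana_rows = list(view.query(lambda row: row["customer"] == "Ana"))
print(ana_rows)
```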
Subject-focused Data Analytics: Data Analytics plays a pivotal role in
any business lifecycle. These repositories allow for more focused data analysis
since they only contain records that are organized around particular subjects
like sales, products, customers, etc. Since there is no extraneous information to
deal with, businesses can filter more accurate and clearer insights.
Selective Data Access: You can leverage these repositories in situations
when an organization needs selective privileges for managing and accessing
data. Generally, this can be the case for big enterprises that can’t reveal the
entire Data Warehouse to all the users. By building multiple dependent
repositories, you can help protect sensitive data from accidental writes and
unauthorized access.
Time-limited Data Projects: As opposed to corporate data warehouses
that need considerable effort and time, these are much easier and faster to set
up, since data developers and engineers work with smaller amounts of data,
simpler schemas, and fewer sources. So, if you are facing a time crunch in
completing a data project, these repositories may be the way to go.
Step 1: Design
This is the first step when building a Data Mart.
It includes tasks such as initiating a request for the Data Mart and collecting
information about the requirements. Other tasks involved in this step include
identifying the data sources and selecting the right data subset.
The output of this step is the logical and physical design of the Data Mart.
Step 2: Build / Construct
This is the step during which both the physical and the logical structures for
the Data Mart are created.
In this step, you create the tables, indexes, fields, and access controls.
Step 3: Populate / Data Transfer
This is the step in which you populate the Data Mart by transferring data into
it. You can also set the frequency with which data transfer will be done,
whether daily or weekly.
To ensure that information stored in the structure is clean, it is always
overwritten during the population of the Data Mart. In this step, the source
information is extracted, cleaned, transformed, and loaded into the Data Mart.
Step 4: Data Access
In this step, the data that has been loaded into the Data Mart is put into active
use. Activities involved here include querying, generating graphs and reports,
and publishing.
To make it easy for non-technical users to use the Data Mart, a meta-layer
should be set up and item names and database structures translated into
corporate expressions.
If possible, interfaces and APIs should be set up to ease the process of data
access.
Step 5: Manage
This is the last step when building a Data Mart and it involves the following tasks:
Controlling user access.
Refining and optimizing the target system to improve its performance.
Adding new data into the Data Mart and managing it.
Configuring recovery settings and ensuring that the system remains available
even after a disaster.
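Steps 2 to 4 can be sketched with SQLite as follows. The `warehouse_orders`
table stands in for the source warehouse, and every table name, column name,
and row is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the source warehouse (hypothetical rows).
    CREATE TABLE warehouse_orders (
        order_id INTEGER, department TEXT, region TEXT, amount REAL
    );
    INSERT INTO warehouse_orders VALUES
        (1, 'sales', 'North', 100.0), (2, 'hr', 'North', 0.0),
        (3, 'sales', 'South', 60.0),  (4, 'sales', 'North', 40.0);

    -- Step 2 (Build): create the mart's table and index.
    CREATE TABLE sales_mart (region TEXT, total_amount REAL);
    CREATE INDEX idx_sales_mart_region ON sales_mart(region);
""")


def populate_sales_mart(connection):
    """Step 3 (Populate): overwrite the mart with a fresh sales subset."""
    connection.execute("DELETE FROM sales_mart")
    connection.execute("""
        INSERT INTO sales_mart
        SELECT region, SUM(amount)
        FROM warehouse_orders
        WHERE department = 'sales'
        GROUP BY region
    """)


populate_sales_mart(conn)

# Step 4 (Data Access): query the mart for a report.
report = conn.execute(
    "SELECT region, total_amount FROM sales_mart ORDER BY region"
).fetchall()
print(report)
```

Calling `populate_sales_mart` on a schedule (for example daily or weekly)
would refresh the mart by overwriting its contents, as described in Step 3.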
Data Mart Tools
There are several tools available for managing and building Data Marts, including:
IBM InfoSphere Warehouse: This program offers data integration, data
transformation, and data management functions, as well as data warehousing
capabilities.
Oracle Database: The Oracle Data Warehousing option, which offers
advanced data management, integration, and analysis capabilities, is one of the
features for creating and managing Data Marts that are included in the Oracle
Database.
Microsoft SQL Server: SQL Server Analysis Services, which offers
multidimensional data analysis and reporting capabilities, is just one of the
tools available through Microsoft SQL Server for creating and managing Data
Marts.
Teradata: Teradata is a potent analytics and data warehousing platform that
offers assistance with creating and maintaining Data Marts.
SAP BusinessObjects: This program can be used to create and maintain
Data Marts because it has data integration, data quality, and data analysis
features.
Tableau: To support business intelligence and decision-making, Tableau
offers robust data visualization and analysis capabilities that can be used with
Data Marts.
Amazon Redshift: An affordable and highly scalable cloud-based data
warehousing solution, Amazon Redshift supports the creation and
management of cloud Data Marts.
Depending on the unique requirements of the organization, each of these tools offers
a variety of capabilities for creating and managing Data Marts, with varying
strengths and weaknesses. To choose the right tool for a specific project or use case,
it’s critical to carefully assess each one.
Benefits of data marts
Data marts can offer benefits to every industry. Here are a few benefits to consider:
Centralized data: Data marts help centralize specific data sets so everyone
is drawing information from a single source. This helps prevent data
discrepancies and reduces errors.
Scalable data management: Data marts allow for more scalability for
data sets, which is the ability to grow as a company's needs change. Teams can
scale the data in a data mart to ensure a business meets its data needs.
Fast implementation: Data marts are more specific than larger data
warehouses, which can make them faster and easier to implement. This can
save time and money for a company.
Quick data access: Data marts make it easier for teams to review data sets
and access specific data more quickly. This can make the data acquisition
process much quicker and save time and money.
Better decision-making: With access to faster and more accurate data
sets, teams may make better decisions based on tangible information. This can
improve overall efficiency and reduce costs.
Low cost: Setting up a data mart can cost significantly less than a full-size
data warehouse for the business. The company can then reinvest those savings
into other parts of the business.
Customizable: Data marts are customizable, which means that they can be
designed to meet the specific needs of a business unit or department.
Self-Contained: Data marts are self-contained, which means that they have
their own set of tables, indexes, and data models. This allows for easier
management and maintenance of the data mart.
Security: Data marts can be secured, which means that access to the data in
the data mart can be controlled and restricted to specific users or groups.
Scalability: Data marts can be scaled horizontally or vertically to
accommodate larger volumes of data or to support more users.
Integration with Business Intelligence Tools: Data marts can be
integrated with business intelligence tools, such as Tableau, Power BI, or
QlikView, which allows users to analyze and visualize the data stored in the
data mart.
ETL Process: Data marts are typically populated using an Extract,
Transform, Load (ETL) process, which means that data is extracted from the
larger data warehouse or data lake, transformed to meet the requirements of
the data mart, and loaded into the data mart.
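The ETL flow described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard sqlite3 module with in-memory databases; the table names (sales, sales_by_region), the column names, the sample rows, and the function build_marketing_mart are all hypothetical and not taken from any particular product.

```python
import sqlite3


def build_marketing_mart(warehouse, mart):
    """Copy a summarized, marketing-only slice of the warehouse into the mart."""
    # Extract: pull only the rows that belong to the marketing department.
    rows = warehouse.execute(
        "SELECT region, amount FROM sales WHERE department = ?", ("marketing",)
    ).fetchall()

    # Transform: summarize the detailed rows into one total per region.
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount

    # Load: write the summarized rows into the data mart table.
    mart.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region (region TEXT PRIMARY KEY, total REAL)"
    )
    mart.executemany(
        "INSERT OR REPLACE INTO sales_by_region VALUES (?, ?)", sorted(totals.items())
    )
    mart.commit()
    return len(totals)


# Hypothetical warehouse holding data from several departments.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (department TEXT, region TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("marketing", "north", 120.0),
        ("marketing", "north", 80.0),
        ("marketing", "south", 50.0),
        ("finance", "north", 999.0),
    ],
)

mart = sqlite3.connect(":memory:")
loaded = build_marketing_mart(warehouse, mart)
mart_rows = mart.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(loaded, mart_rows)
```

In this sketch the mart ends up holding only summarized marketing data, while the finance row stays in the warehouse. A real ETL job would normally be scheduled and would use a dedicated tool rather than hand-written loops.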
Data Mart Vs Data Warehouse:
Data Warehouse                            Data Mart
----------------------------------------  -----------------------------------------
A data warehouse is used to store data    A data mart carries data related to a
from numerous subject areas.              department, such as HR, marketing,
                                          finance data mart, etc.

However, a star schema is used most       Data marts use a star schema for
widely.                                   designing tables.

Tricky to design and use due to its       Comparatively more manageable due to
large size (more than 100GB).             its small size (less than 100GB).

Designed to support the decision-making   Data marts are designed for particular
process in a company.                     user groups or corporate departments.

Data warehouses are used to store         Data marts hold highly denormalized data
detailed information in denormalized      in a summarized form.
or normalized form.

Has large dimensions and integrates       Smaller dimensions to integrate data sets
data from many sources.                   from a smaller number of sources.

Data warehouses are subject-oriented      Data marts are used for particular areas
and time-variant with data existing for   related to a business, retains data for a
a longer duration.                        shorter duration.
Key benefits of data marts include centralized data, scalable data management, fast implementation, quick data access, enhanced decision-making capabilities, and low setup costs compared to data warehouses. However, they also have limitations such as data redundancy leading to inconsistencies, limited scope which poses challenges for cross-departmental decisions, integration issues due to differing data models, and limited scalability for handling large, complex datasets.
Data warehouses are large centralized systems used to store data from multiple subject areas, serving as a central repository for companies and designed to support decision-making processes. They typically use sophisticated architectures and tend to integrate significant volumes of detailed and historical data. Data marts, in contrast, are specific to departments like HR, marketing, or finance, and function as logical subsections of data warehouses. They feature smaller dimensions and are often designed using a star schema, with a focus on quick implementation and optimized query performance for particular applications or user groups.
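To make the star schema mentioned above more concrete, the following is a rough sketch of a tiny sales data mart built with Python's standard sqlite3 module. The table names (fact_sales, dim_product, dim_date), their columns, and the sample rows are assumptions made purely for illustration: one central fact table holds the measures, and each dimension table describes one aspect of a sale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes used to slice the facts.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

    -- Fact table: numeric measures plus one foreign key per dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        units INTEGER,
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Hardware'), (2, 'Antivirus', 'Software');
    INSERT INTO dim_date VALUES (10, 2023, 1), (11, 2023, 2);
    INSERT INTO fact_sales VALUES (1, 10, 3, 3000.0), (2, 10, 5, 250.0), (1, 11, 2, 2000.0);
    """
)


def revenue_by_category(connection):
    """Join the fact table to one dimension and aggregate a measure."""
    query = """
        SELECT p.category, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """
    return connection.execute(query).fetchall()


result = revenue_by_category(conn)
print(result)
```

Each report query needs only one join per dimension, which is one reason the star layout suits the quick, department-specific queries that data marts are meant to serve.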
Web mining leverages data mining approaches like clustering, classification, and association rule mining to analyze web data and extract valuable insights. By examining patterns like website traffic and user behavior via clickstream analysis, organizations can understand user navigation, content popularity, and other dynamics crucial for strategic web management. The multidisciplinary application of data mining techniques in web contexts facilitates the discovery of valuable data-driven insights.
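As a small illustration of clickstream analysis, the sketch below counts page popularity and page-to-page transitions from a handful of sessions. The session data and the function name summarize_clickstream are invented for the example; real clickstream logs would first have to be parsed and grouped into sessions.

```python
from collections import Counter


def summarize_clickstream(sessions):
    """Count page views and consecutive page-to-page transitions."""
    page_views = Counter()
    transitions = Counter()
    for pages in sessions:
        # Content popularity: how often each page was viewed.
        page_views.update(pages)
        # User navigation: pairs of pages visited one after the other.
        transitions.update(zip(pages, pages[1:]))
    return page_views, transitions


# Hypothetical sessions, each an ordered list of visited pages.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "products/42"],
    ["home", "blog"],
]

page_views, transitions = summarize_clickstream(sessions)
print(page_views.most_common(2))
print(transitions.most_common(1))
```

Counts like these are only a starting point; they could feed the clustering or association rule techniques named above.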
Text mining applications enhance customer care services by enabling rapid, automated responses to customer queries through chatbots, which simulate human interaction, thus enhancing customization and reducing reliance on human operators. In personalized advertising, text mining analyzes user-generated data to tailor advertisements that are more likely to engage the user and lead to sales, thereby increasing advertising effectiveness and user engagement.
The ETL (Extract, Transform, Load) process is crucial in sustaining data marts by systematically channeling data from larger warehouses or lakes, transforming it to align with specific departmental needs, and loading it into the mart for optimized analyses. This structured flow ensures that the data within data marts is relevant, fresh, and ready for precise querying and business intelligence reporting. Integration with tools like Tableau or Power BI further enhances decision-making and strategic planning capabilities within business units.
Widely used methodologies for the data mining process include KDD, CRISP-DM, and SEMMA. The KDD process involves selection, preprocessing, transformation, data mining, and interpretation/evaluation. CRISP-DM includes business understanding, data understanding, data preparation, modeling, evaluation, and deployment stages. SEMMA consists of sample, explore, modify, model, and assess phases. Techniques used in these processes include classification, prediction, clustering, and association rule mining.
Text mining is complex due to the advanced skills required in natural language processing and machine learning, compounded by high computational costs. For smaller organizations, these demands can be prohibitive, making it challenging to afford the necessary technology. Limitations include variability in data quality, computational expense, and a scope restricted to unstructured text data. These factors often restrict smaller organizations' ability to fully leverage text mining for competitive advantage.
Within business intelligence frameworks, genetic algorithms are used for optimization problems by simulating natural evolution, providing solutions through mechanisms such as selection, crossover, and mutation. Neural networks, with their ability to learn complex patterns through layer-based structures, are instrumental in tasks requiring classification and predictive analytics. Both technologies contribute to extracting deeper insights and enhancing decision-making processes within machine learning and AI environments.
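The three mechanisms named above (selection, crossover, mutation) can be seen in a toy genetic algorithm. The sketch below is an assumption-laden illustration, not a production optimizer: it maximizes the number of 1s in a bit string (a standard practice problem), and the parameter values, tournament size, and function names are arbitrary choices made for the example.

```python
import random


def fitness(bits):
    """Toy objective: the more 1s in the bit string, the fitter it is."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)

    for _ in range(generations):
        # Elitism: keep the current best individual unchanged.
        next_gen = [max(population, key=fitness)[:]]
        while len(next_gen) < pop_size:
            # Selection: each parent is the fittest of 3 random individuals.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # Crossover: splice the parents at a single random cut point.
            cut = rng.randint(1, n_bits - 1)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen

    return initial_best, max(population, key=fitness)


initial_best, best = evolve()
print(initial_best, fitness(best), best)
```

Because the best individual is always carried over, the final fitness can never be lower than the best fitness in the starting population. In a business setting the bit string would instead encode a candidate decision, such as a schedule or a resource allocation.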
Dependent data marts benefit from integration with a central data warehouse, ensuring consistency and eliminating the need for data mart integration. They are advantageous for organizations seeking structured, unified data environments. Conversely, independent data marts allow for quicker, department-specific setup but require integration to form a cohesive data warehouse, which can lead to inconsistencies and redundant data if not managed properly. The choice often depends on the organization's size, investment capability, and integration needs.
Feature selection improves the accuracy and efficiency of data mining by identifying the most relevant features that provide significant predictive power while excluding irrelevant ones, which often lead to increased complexity and decreased accuracy. Methods like Pearson correlation, Chi-squared, lasso regression, and tree-based algorithms effectively reduce data dimensionality, leading to simpler and more robust models. This results in more efficient data processing and improved model performance.
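Of the methods listed above, Pearson correlation is the simplest to show. The sketch below keeps only the features whose absolute correlation with the target reaches a threshold. The feature names, the sample values, and the 0.5 threshold are made up for illustration, and the correlation is computed by hand so that the example needs nothing beyond the standard library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant column carries no information; treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, threshold=0.5):
    """Keep features whose |correlation| with the target meets the threshold."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    selected = [name for name, r in scores.items() if abs(r) >= threshold]
    return selected, scores


# Hypothetical data: only ad_spend moves together with the target (sales).
features = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "store_id": [3.0, 1.0, 2.0, 1.0, 3.0],
    "constant": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]

selected, scores = select_features(features, target)
print(selected)
```

Pearson correlation only detects linear relationships, so a filter like this can discard features that matter in a non-linear way; that is one reason the other listed methods exist.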