
BAunit-2 Notes

Data Warehousing is a process that involves creating a centralized repository for business data, enabling organizations to analyze and make informed decisions based on historical data. It consists of various components, including data extraction, transformation, and loading (ETL), and utilizes tools for efficient data management and analysis. The architecture of a data warehouse typically includes source data components, data staging, storage, and data marts, all aimed at improving business intelligence and decision-making capabilities.

Uploaded by

dimpy

Unit- II

Data Warehousing and Data Mining

What is a Data Warehouse?

A Data Warehouse is a collection of software tools that facilitates analysis of a
large set of business data used to help an organization make decisions. Much of
the data in a data warehouse comes from numerous sources, such as internal
applications like marketing, sales, and finance; customer-facing apps; and
external partner systems, among others. It is a centralized data repository
that analysts can query whenever required for business benefit. A data
warehouse is mainly a data management system designed to enable and support
business intelligence (BI) activities, particularly analytics. Data warehouses
are meant to support querying, cleaning, manipulating, transforming, and
analyzing data, and they also contain large amounts of historical data.
What is Data Warehousing?

The process of creating data warehouses to store a large amount of data is
called Data Warehousing. Data Warehousing helps to improve the speed and
efficiency of accessing different data sets and makes it easier for company
decision-makers to obtain insights that help the business and to promote
marketing tactics that set them apart from their competitors. We can say that it
is a blend of technologies and components that aids the strategic use of data
and information. The main goal of data warehousing is to build a store of
historical data that can be retrieved and analyzed to provide helpful
insight into the organization’s operations.

Need for Data Warehousing

Data Warehousing is an increasingly essential tool for business intelligence. It
allows organizations to make quality business decisions. A data warehouse
improves data analytics, and it also helps an organization gain considerable
revenue and the strength to compete more strategically in the market. By
efficiently providing systematic, contextual data to an organization’s business
intelligence tools, data warehouses can help uncover more practical business
strategies.

Business User: Business users or customers need a data warehouse to look
at summarized data from the past. Since these people often come from a
non-technical background, the data should be presented to them in an
uncomplicated way.

1. Maintains consistency: Data warehouses are designed so that a uniform
format is applied to all data collected from different sources, which makes
it easier for company decision-makers to analyze and share data insights
with their colleagues around the globe. Standardizing the data also
reduces the risk of errors in interpretation and improves overall accuracy.
2. Store historical data: Data warehouses are also used to store historical
data, that is, time-variant data from the past, and this input can be used
for various purposes.
3. Make strategic decisions: Data warehouses contribute to making
better strategic decisions. Some business strategies may depend on the
data stored within the data warehouses.
4. Fast response time: A data warehouse has to be prepared for somewhat
sudden loads and types of queries, which demands a significant degree of
flexibility and low latency.
Characteristics of Data warehouse:

1. Subject Oriented: A data warehouse is subject-oriented because it
delivers information about a particular theme. This means the data
warehousing process is designed to handle a specific, well-defined theme.
These themes can be sales, distribution, marketing, etc.
2. Time-Variant: Data is maintained over different intervals of time, such as
weekly, monthly, or annually. The time horizon of a data warehouse is
longer than that of operational systems such as online transaction
processing (OLTP) systems. The data residing in the data warehouse is
associated with a particular interval of time and delivers information
from a historical perspective. It contains an element of time, either
directly or indirectly.
3. Non-volatile: The data residing in the data warehouse is permanent.
This means that data in the data warehouse is not erased or deleted
when new data is inserted into it. In the data warehouse, data is
read-only and is refreshed only at particular intervals of time.
Operations such as delete, update, and insert that an application
performs on operational data are absent in the data warehouse
environment. There are only two types of data operations that can be done
in the data warehouse:

o Data Loading
o Data Access
4. Integrated: A data warehouse is created by integrating data from
numerous different sources, such as mainframe computers and
relational databases. It should also have reliable naming
conventions, formats, and codes. Integration benefits the successful
analysis of data. Consistency in naming conventions, column scaling,
encoding structure, etc. needs to be ensured. An integrated data
warehouse handles numerous subject-oriented warehouses.
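
The "Integrated" characteristic can be illustrated with a minimal Python sketch. The two source systems, their column names (CUST_NM, SEX, customer_name, gender), and the code mapping are all hypothetical and chosen only to show how differing naming conventions and encodings are brought to one convention before the data reaches the warehouse.

```python
# Two hypothetical source systems that describe customers with different
# column names and different gender encodings.
mainframe_rows = [{"CUST_NM": "Asha Rao", "SEX": "F"}]
crm_rows = [{"customer_name": "Ravi Kumar", "gender": "male"}]

# Assumed mapping from every source-specific code to one warehouse code.
GENDER_CODES = {"F": "female", "M": "male", "female": "female", "male": "male"}


def integrate(mainframe, crm):
    """Return rows that share one naming convention and one encoding."""
    unified = []
    for row in mainframe:
        unified.append({"customer_name": row["CUST_NM"],
                        "gender": GENDER_CODES[row["SEX"]]})
    for row in crm:
        unified.append({"customer_name": row["customer_name"],
                        "gender": GENDER_CODES[row["gender"]]})
    return unified


warehouse_rows = integrate(mainframe_rows, crm_rows)
```

After integration, every row uses the same field names and the same codes, whichever system it came from.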

Architecture & Components of Data Warehouse:

Data warehouse architecture defines the overall arrangement of data
processing and presentation that is useful for data analysis and decision
making within the enterprise. Each organization has a different data
warehouse depending on its needs, but all of them are characterized
by some standard components.

Data warehouse applications are designed to support the user’s data
requirements; an example is online analytical processing (OLAP). These
applications include functions such as forecasting, profiling, summary
reporting, and trend analysis.

The architecture of the data warehouse mainly consists of the proper
arrangement of its elements to build an efficient data warehouse from software
and hardware components. The elements and components may vary based on
the organization’s requirements and circumstances.
1. Source Data Component:
In the Data Warehouse, the source data comes from different places. The sources
are grouped into four categories:

o External Data: Executives and data analysts rely on external sources for
a large share of the information they use. They use statistics relating
to their organization that are produced by external sources and
departments.
o Internal Data: In every organization, users keep their “private”
spreadsheets, reports, client profiles, and sometimes even departmental
databases. This is the internal data, part of which may be
useful in a data warehouse.
o Operational System Data: Operational systems are principally meant to
run the business. In each operational system, old data is periodically
moved and stored in archived files.
o Flat Files: A flat file is a text database that stores data in a
plain text format. Flat files generally are text files that have all data
processing and structure markup removed. A flat file contains a table with
a single record per line.

2. Data Staging:
After the data is extracted from various sources, it must be prepared for
storage in the data warehouse. The extracted data must be transformed into a
format that is suitable to be saved in the data warehouse for querying and
analysis. Data staging performs three primary functions:

o Data Extraction: This stage handles various data sources. Data analysts
should employ suitable techniques for every data source.
o Data Transformation: Data for a data warehouse comes from many
different sources. If data extraction for a data warehouse poses a big
challenge, data transformation presents even greater challenges. Many
individual tasks are performed as part of data transformation.
o Data Loading: When we complete the structure and construction of the
data warehouse and go live for the first time, we do the initial loading of
the data into the data warehouse storage. The initial load moves high
volumes of data and consumes a considerable amount of time.

3. Data Storage in Warehouse:


Data storage for data warehousing is split into multiple repositories. These
data repositories contain structured data in a highly normalized form for
fast and efficient processing.

o Metadata: Metadata means data about data, i.e., it summarizes basic
details about data, which makes it easier to find and work with specific
instances of data. Metadata may be created manually or generated
automatically and can contain basic information about data.
o Raw Data: Raw data is a set of data and information that was delivered
from a particular data entity to the data supplier and has not yet been
processed by machine or human. This data is gathered from online sources
to deliver deep insight into users’ online behavior.
o Summary Data or Data Summary: A data summary is a brief, condensed
statement of a larger body of data. Analysts typically write code that
processes the data and then report the final result in summarized form.
Data summary is an essential part of data mining and processing.
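
The relationship between raw data and summary data can be sketched in a few lines of Python. The sales rows and the field names (month, amount) are made up for illustration; the point is only that a summary condenses many raw records into a few aggregated values.

```python
from collections import defaultdict

# Hypothetical raw records, one per individual sale.
raw_sales = [
    {"month": "2024-01", "amount": 120.0},
    {"month": "2024-01", "amount": 80.0},
    {"month": "2024-02", "amount": 50.0},
]


def summarize_by_month(rows):
    """Condense raw rows into one total per month (the summary data)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["month"]] += row["amount"]
    return dict(totals)


summary = summarize_by_month(raw_sales)
```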

4. Data Marts:
Data marts are also part of the storage component in a data warehouse. A data
mart stores the information of a specific function of an organization that is
handled by a single authority. There may be any number of data marts in a
particular organization depending on its functions. In short, data marts
contain subsets of the data stored in data warehouses.
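
As a rough illustration of a data mart being a subset of the warehouse, the sketch below filters warehouse rows by department. The rows and the department field are hypothetical; a real data mart would normally be a separate set of tables rather than an in-memory list.

```python
# Hypothetical warehouse rows covering several business functions.
warehouse = [
    {"department": "sales", "metric": "orders", "value": 310},
    {"department": "finance", "metric": "invoices", "value": 95},
    {"department": "sales", "metric": "returns", "value": 12},
]


def build_data_mart(rows, department):
    """Return only the rows that belong to one business function."""
    return [row for row in rows if row["department"] == department]


sales_mart = build_data_mart(warehouse, "sales")
```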

Now, the users and analysts can use data for various applications like reporting,
analyzing, mining, etc. The data is made available to them whenever required.

Data Warehousing Life Cycle:

As we know, a data warehouse is built by combining data from multiple
diverse sources with the tools that support analytical reporting, structured and
unstructured queries, and decision making for the organization. We need to
follow a step-by-step approach for building and successfully implementing the
Data Warehouse.
How does a Data Warehouse work?

A Data Warehouse is like a central repository where data arrives from different
data sources. Data flows into the data warehouse from transactional systems
and relational databases. A data warehouse periodically pulls data from
various apps and systems; the data then goes through processing
and formatting so that it matches the format of the data already in
the warehouse. This processed data is stored in the data warehouse, ready
for further analysis for decision making. The data formatting and
processing depend on the needs of the organization.

The Data could be in one of the following formats:

1. Structured
2. Semi-structured
3. Unstructured data

The data is processed and transformed so that users and analysts can access
the processed data in the Data Warehouse through Business Intelligence tools,
SQL clients, and spreadsheets. A data warehouse merges all information
coming from various sources into one global and complete database. By
merging all this information in one place, it becomes easier for an organization
to analyze its customers more comprehensively.

Latest Tools and Technologies for Data Warehousing:

Data warehousing has improved access to information, reduced query
response time, and allowed businesses to get deep insights from huge data sets.
Earlier, companies had to build lots of infrastructure for data warehousing. But
today cloud technology has remarkably reduced the cost and effort of data
warehousing for businesses.

The field of data warehousing is evolving quickly, and various cloud data
warehousing tools and technologies have been developed for better decision making.
The cloud-based data warehousing tools are fast, highly scalable, and available
on a pay-per-use basis. Following are some data warehousing tools:

1. Amazon Redshift
2. Microsoft Azure
3. Google BigQuery
4. Snowflake
5. Micro Focus Vertica
6. Teradata
7. Amazon DynamoDB
8. PostgreSQL

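These are among the commonly cited tools for data warehousing. Note that
Microsoft Azure is a cloud platform, Amazon DynamoDB is a NoSQL key-value
database, and PostgreSQL is a general-purpose relational database, so these
three are not dedicated data warehouse products in the way the others are.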


ETL (Extract, Transform, and Load) Process
The mechanism of extracting information from source systems and bringing it into the
data warehouse is commonly called ETL, which stands for Extraction, Transformation
and Loading.

The ETL process requires active inputs from various stakeholders, including developers,
analysts, testers, top executives and is technically challenging.

To maintain its value as a tool for decision-makers, a data warehouse system needs to
change as the business changes. ETL is a recurring activity (daily, weekly, monthly) of a
data warehouse system and needs to be agile, automated, and well documented.

How Does ETL Work?


ETL consists of three separate phases:
Extraction
o Extraction is the operation of extracting information from a source system for further use
in a data warehouse environment. This is the first stage of the ETL process.
o Extraction process is often one of the most time-consuming tasks in the ETL.
o The source systems might be complicated and poorly documented, and thus determining
which data needs to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all changed data
to the warehouse and keep it up-to-date.
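
The periodic extraction of changed data described above is often sketched with a "watermark", i.e. the timestamp of the last successful run. The example below is a minimal illustration under that assumption; the updated_at field and the extract_changed helper are hypothetical names, not part of any particular ETL tool.

```python
from datetime import datetime

# Hypothetical source rows, each carrying a last-modified timestamp.
source_rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "updated_at": datetime(2024, 1, 3, 9, 0)},
]


def extract_changed(rows, last_run):
    """Return rows modified after last_run, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > last_run]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in changed), default=last_run)
    return changed, new_watermark


changed, watermark = extract_changed(source_rows, datetime(2024, 1, 1, 12, 0))
```

On the next run, the returned watermark would be passed back in as last_run so that only newer changes are picked up.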

Cleansing
The cleansing stage is crucial in a data warehouse system because it is supposed to
improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing mistakes
and to recognize synonyms, as well as rule-based cleansing to enforce domain-specific
rules and define appropriate associations between values.

The following examples show why data cleansing is essential:

If an enterprise wishes to contact its users or its suppliers, a complete, accurate, and
up-to-date list of contact addresses, email addresses, and telephone numbers must be
available.

If a client or supplier calls, the staff responding should be able to quickly find the person
in the enterprise database, but this requires that the caller's name or his/her company name
is listed in the database.

If a user appears in the databases with two or more slightly different names or different
account numbers, it becomes difficult to update the customer's information.
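
Dictionary-based rectification, synonym recognition, and rule-based cleansing can be sketched as follows. The dictionaries and the "exactly 10 digits" phone rule are assumptions made up for this example; real ETL tools ship with much larger, domain-specific dictionaries and rules.

```python
# Hypothetical dictionaries: one for typing mistakes, one for synonyms.
TYPO_FIXES = {"Bangalroe": "Bangalore", "Mumbay": "Mumbai"}
SYNONYMS = {"Bombay": "Mumbai", "Bengaluru": "Bangalore"}


def clean_city(value):
    """Rectify typos, then map synonyms to a single canonical name."""
    city = value.strip().title()
    city = TYPO_FIXES.get(city, city)
    return SYNONYMS.get(city, city)


def clean_phone(value):
    """Rule-based check (assumed rule): keep digits, require exactly 10."""
    digits = "".join(ch for ch in value if ch.isdigit())
    return digits if len(digits) == 10 else None


cleaned = [clean_city(c) for c in [" bangalroe ", "BOMBAY", "Pune"]]
```

Values that fail a rule, such as a phone number with too few digits, come back as None so that they can be flagged for review instead of being loaded silently.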

Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.

The following points must be rectified in this phase:

o Loose text may hide valuable information. For example, XYZ PVT Ltd does not explicitly
show that this is a private limited company.
o Different formats can be used for individual data. For example, a date can be saved as a
string or as three integers.

Following are the main transformation processes aimed at populating the reconciled data
layer:

o Conversion and normalization that operate on both storage formats and units of measure
to make data uniform.
o Matching that associates equivalent fields in different sources.
o Selection that reduces the number of source fields and records.

Cleansing and Transformation processes are often closely linked in ETL tools.
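
The transformation processes listed above (conversion, normalization, matching, and selection) can be sketched with two hypothetical sources. One stores the order date as a string and the weight in kilograms; the other stores the date as three integers and the weight in grams. All field names are assumptions for illustration.

```python
from datetime import date

# Two hypothetical sources describing orders in different formats.
source_a = [{"order_date": "2024-03-15", "weight_kg": 2.0, "notes": "gift"}]
source_b = [{"day": 16, "month": 3, "year": 2024, "weight_g": 500, "clerk": "R"}]


def transform_a(row):
    # Conversion: parse the date string into a date object.
    # Selection: the "notes" field is not carried over.
    return {"order_date": date.fromisoformat(row["order_date"]),
            "weight_kg": row["weight_kg"]}


def transform_b(row):
    # Matching: day/month/year correspond to order_date in source A.
    # Normalization: grams are converted to kilograms.
    # Selection: the "clerk" field is not carried over.
    return {"order_date": date(row["year"], row["month"], row["day"]),
            "weight_kg": row["weight_g"] / 1000}


reconciled = ([transform_a(r) for r in source_a]
              + [transform_b(r) for r in source_b])
```

Both sources end up with the same fields, types, and units, which is what the reconciled data layer is meant to provide.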
Loading
Loading is the process of writing the data into the target database. During the load step,
it is necessary to ensure that the load is performed correctly and with as few resources
as possible.

Loading can be carried out in two ways:

1. Refresh: Data warehouse data is completely rewritten. This means that the older data is
replaced. Refresh is usually used in combination with static extraction to populate a data
warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying preexisting
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
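
The difference between the two loading modes can be shown with a small sketch in which the warehouse table is modeled as a plain Python list. The refresh and update helpers are hypothetical names used only for this illustration.

```python
def refresh(warehouse, full_extract):
    """Refresh: discard the existing rows and rewrite the table completely."""
    return list(full_extract)


def update(warehouse, changed_rows):
    """Update: append changes without deleting or modifying existing rows."""
    return warehouse + list(changed_rows)


# Initial population with a full (static) extract.
warehouse = refresh([], [{"id": 1, "qty": 5}, {"id": 2, "qty": 3}])
# Later, a regular load with only the changes from an incremental extract.
warehouse = update(warehouse, [{"id": 3, "qty": 7}])
```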

Selecting an ETL Tool


Selecting an appropriate ETL tool is an important decision that has to be made when
building an operational data store (ODS) or data warehousing application. ETL tools are
required to provide coordinated access to multiple data sources so that relevant data may
be extracted from them. An ETL tool generally contains tools for data cleansing,
reorganization, transformation, aggregation, calculation, and automatic loading of
information into the target database.
What is Star Schema?
A star schema is the elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured, such as a sale
or log in. A dimension includes reference data about the fact, such as date, item, or
customer.

A star schema is a relational schema whose design represents
a multidimensional data model. The star schema is the simplest data warehouse schema. It
is known as a star schema because the entity-relationship diagram of this schema
resembles a star, with points diverging from a central table. The center of the schema
consists of a large fact table, and the points of the star are the dimension tables.

Fact Tables
A fact table is a table in a star schema that contains facts and is connected to dimensions.
A fact table has two types of columns: those that contain facts and those that are foreign
keys to the dimension tables. The primary key of a fact table is generally a composite key
that is made up of all of its foreign keys.

A fact table might contain either detail-level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables). A fact table
generally contains facts with the same level of aggregation.

Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that
categorize data. If a dimension has no hierarchies and levels, it is called a flat
dimension or list. The primary key of each dimension table is part of the
composite primary key of the fact table. Dimensional attributes help to describe the
dimensional value. They are generally descriptive, textual values. Dimension tables are
usually smaller in size than fact tables.

For example, fact tables store data about sales, while dimension tables store data about
the geographic region (markets, cities), clients, products, times, and channels.

Characteristics of Star Schema


The star schema is well suited to data warehouse database design because of the
following features:

o It creates a denormalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the
development cycle, and as the database grows.
o It provides a parallel in design to how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema


Star schemas are easy for end-users and applications to understand and navigate. With a
well-designed schema, the user can quickly analyze large, multidimensional data
sets.

The main advantages of star schemas in a decision-support environment are:

Query Performance
Because a star schema database has a limited number of tables and clear join paths, queries
run faster than they do against OLTP systems. Small single-table queries, frequently of a
dimension table, are almost instantaneous. Large join queries that involve multiple tables
take only seconds or minutes to run.

In a star schema database design, the dimensions are connected only through the central
fact table. When two dimension tables are used in a query, only one join path,
intersecting the fact table, exists between those two tables. This design feature enforces
accurate and consistent query results.

Load performance and administration

Structural simplicity also decreases the time required to load large batches of records into
a star schema database. By defining facts and dimensions and separating them into
different tables, the impact of a load operation is reduced. Dimension tables can be populated
once and occasionally refreshed. We can add new facts regularly and selectively by
appending records to a fact table.

Built-in referential integrity

A star schema has referential integrity built in when information is loaded. Referential
integrity is enforced because each record in a dimension table has a unique primary key,
and all keys in the fact table are legitimate foreign keys drawn from the dimension tables.
A record in the fact table that is not related correctly to a dimension cannot be given
the correct key value to be retrieved.

Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through
the fact table. These joins are more significant to the end-user because they represent the
fundamental relationships between parts of the underlying business. Users can also
browse dimension table attributes before constructing a query.

Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.

The TIME table has a column for each day, month, quarter, and year. The ITEM table has
columns for each item_key, item_name, brand, type, supplier_type. The BRANCH table has
columns for each branch_key, branch_name, branch_type. The LOCATION table has
columns of geographic data, including street, city, state, and country.

In this scenario, the SALES table contains only four columns with IDs from the dimension
tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four
columns for ITEM data, three columns for BRANCH data, and four columns for LOCATION
data. Thus, the size of the fact table is significantly reduced. When we need to change an
item, we need only make a single change in the dimension table, instead of making many
changes in the fact table.
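
The SALES example can be tried out with Python's built-in sqlite3 module. This is only a sketch under a few assumptions: the time dimension is named time_dim here, each dimension gets a surrogate key column, the fact table carries one assumed measure (units_sold), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
                       month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                   type TEXT, supplier_type TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                     branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT,
                       city TEXT, state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus an assumed measure.
-- The composite primary key is made up of all the foreign keys.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
INSERT INTO time_dim VALUES (1, 15, 3, 1, 2024), (2, 16, 3, 1, 2024);
INSERT INTO item VALUES (1, 'Phone', 'Acme', 'electronics', 'wholesale'),
                        (2, 'Cover', 'Acme', 'accessory', 'retail');
INSERT INTO branch VALUES (1, 'Central', 'flagship');
INSERT INTO location VALUES (1, '1 Main St', 'Pune', 'MH', 'India');
INSERT INTO sales VALUES (1, 1, 1, 1, 10), (1, 2, 1, 1, 4), (2, 1, 1, 1, 6);
""")

# Every join goes through the fact table, so the join path is unambiguous.
rows = conn.execute("""
    SELECT i.item_name, t.year, SUM(s.units_sold)
    FROM sales s
    JOIN item i ON i.item_key = s.item_key
    JOIN time_dim t ON t.time_key = s.time_key
    GROUP BY i.item_name, t.year
    ORDER BY i.item_name
""").fetchall()
```

Renaming an item requires a single UPDATE on the item table; the rows in sales are untouched because they store only item_key.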
Data Mining

Data mining is one of the most useful techniques that help entrepreneurs, researchers,
and individuals to extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Databases (KDD). The knowledge discovery process
includes data cleaning, data integration, data selection, data transformation, data
mining, pattern evaluation, and knowledge presentation.

What is Data Mining?


The process of extracting information from huge sets of data to identify patterns, trends,
and useful data that allow the business to make data-driven decisions is called
Data Mining.

In other words, Data Mining is the process of investigating hidden patterns
of information from various perspectives and categorizing them into useful data. This data is
collected and assembled in particular areas such as data warehouses, where it supports
efficient analysis, data mining algorithms, and decision making, and ultimately helps with
cost-cutting and generating revenue.

Data mining is the act of automatically searching large stores of information to find
trends and patterns that go beyond simple analysis procedures. Data mining utilizes
complex mathematical algorithms to segment data and evaluate the probability of
future events.

Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science: it is carried out by a person, in a specific situation,
on a particular data set, with an objective. The process includes various types of services
such as text mining, web mining, audio and video mining, pictorial data mining, and social
media mining. It is done through software that may be simple or highly specialized. By
outsourcing data mining, the work can be done faster and with lower operating costs.
Specialized firms can also use new technologies to collect data that is impossible to locate
manually. There is a huge amount of information available on various platforms, but very
little knowledge is accessible. The biggest challenge is to analyze the data to extract
important information that can be used to solve a problem or for company development.
There are many powerful tools and techniques available to mine data and gain better
insight from it.
Types of Data Mining
Data mining can be performed on the following types of data:

Relational Database:

A relational database is a collection of multiple data sets formally organized by tables,
records, and columns from which data can be accessed in various ways without having to
reorganize the database tables. Tables convey and share information, which facilitates data
searchability, reporting, and organization.

Data warehouses:

A Data Warehouse is the technology that collects data from various sources within
the organization to provide meaningful business insights. The huge amount of data
comes from multiple places such as Marketing and Finance. The extracted data is utilized
for analytical purposes and helps in decision-making for a business organization. The
data warehouse is designed for the analysis of data rather than transaction processing.

Data Repositories:

A data repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an
IT structure, for example, a group of databases where an organization keeps various
kinds of information.

Object-Relational Database:

A combination of an object-oriented database model and a relational database model is
called an object-relational model. It supports classes, objects, inheritance, etc.

One of the primary objectives of the object-relational data model is to close the gap
between the relational database and the object-oriented practices frequently
used in many programming languages, for example, C++, Java, C#, and so on.

Transactional Database:

A transactional database refers to a database management system (DBMS) that has the
ability to undo (roll back) a database transaction if it is not completed properly. Although
this was a unique capability long ago, today most relational database
systems support transactional database activities.
Tasks and Functionalities of Data Mining

Data mining tasks are designed to be semi-automatic or fully automatic and to run on large
data sets to uncover patterns such as groups or clusters, unusual or extreme data (anomaly
detection), and dependencies such as associations and sequential patterns. Once patterns are
uncovered, they can be thought of as a summary of the input data, and further analysis may be
carried out using Machine Learning and Predictive Analytics. For example, the data mining step
might help identify multiple groups in the data that a decision support system can use. Note that
data collection, preparation, and reporting are not part of data mining.

There is a lot of confusion between data mining and data analysis. Data mining functions are
used to define the trends or correlations found in the data. While data analysis
is used to test statistical models that fit the dataset, for example, analysis of a marketing
campaign, data mining uses Machine Learning and mathematical and statistical models to
discover patterns hidden in the data. Data mining activities can be divided into
two categories:

o Descriptive Data Mining: It provides knowledge to understand what is happening
within the data without a previous idea. The common data features are highlighted in the
data set. For example, count, average, etc.
o Predictive Data Mining: It helps developers estimate values for attributes that are not labeled.
With previously available or historical data, data mining can be used to make predictions
about critical business metrics based on linear trends in the data. For example, predicting the
volume of business next quarter based on performance in the previous quarters over several
years, or judging from the findings of a patient's medical examinations whether he or she is
suffering from any particular disease.
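
The contrast between the two categories can be made concrete with a short sketch. The quarterly volumes are invented, and the predictive part uses an ordinary least squares straight-line fit as one simple stand-in for a predictive model; it is not the only way to predict the next quarter.

```python
from statistics import mean

# Hypothetical quarterly business volumes, oldest first.
volumes = [100.0, 110.0, 120.0, 130.0]

# Descriptive: summarize what is already in the data.
description = {"count": len(volumes), "average": mean(volumes)}


def fit_line(ys):
    """Least squares fit of y = a + b*x for x = 0, 1, 2, ..."""
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b


# Predictive: extrapolate the fitted trend to the next quarter.
a, b = fit_line(volumes)
next_quarter = a + b * len(volumes)
```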

Functionalities of Data Mining


Data mining functionalities are used to represent the type of patterns that have to be discovered
in data mining tasks. Data mining tasks can be classified into two types: descriptive and
predictive. Descriptive mining tasks define the common features of the data in the database, and
predictive mining tasks perform inference on the current information to develop predictions.

Data mining is extensively used in many areas or sectors. It is used to predict and characterize
data. The ultimate objective of data mining functionalities is to observe the various
trends in the data. There are several data mining functionalities that organized and
scientific methods offer, such as:
1. Class/Concept Descriptions

A class or concept implies there is a data set or set of features that define the class or concept.
A class can be a category of items on a shop floor, and a concept could be the abstract idea by
which data may be categorized, such as products to be put on clearance sale versus non-sale
products. There are two approaches here, one that helps with grouping and the other that helps
in differentiating.

o Data Characterization: This refers to the summary of general characteristics or features
of the class, resulting in specific rules that define a target class. A data analysis technique
called Attribute-oriented Induction is employed on the data set for achieving
characterization.
o Data Discrimination: Discrimination is used to separate distinct data sets based on the
disparity in attribute values. It compares features of a class with features of one or more
contrasting classes. The results can be presented as, e.g., bar charts, curves, and pie charts.

2. Mining Frequent Patterns

One of the functions of data mining is finding data patterns. Frequent patterns are things that are
discovered to be most common in data. Various types of frequency can be found in the dataset.

o Frequent item set: This term refers to a group of items that are commonly found together,
such as milk and sugar.
o Frequent substructure: It refers to the various types of data structures that can be
combined with an item set or subsequences, such as trees and graphs.
o Frequent Subsequence: A regular pattern series, such as buying a phone followed by a
cover.

3. Association Analysis

It analyses the set of items that generally occur together in a transactional dataset. It is also
known as Market Basket Analysis for its wide use in retail sales. Two parameters are used for
determining the association rules:

o Support, which identifies how often the common item set occurs in the database.
o Confidence, which is the conditional probability that an item occurs when another item
occurs in a transaction.
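The two parameters can be illustrated with a minimal sketch. The transactions and item names below are made up for illustration; support is taken as the fraction of transactions containing an item set, and confidence as the share of transactions containing the first item set that also contain the second.

```python
# Hypothetical transactions; item names are illustrative only.
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "tea"},
    {"milk", "sugar", "tea"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from transaction counts."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = count(set(antecedent) | set(consequent), transactions)
    return both / base

rule_support = support({"milk", "sugar"}, transactions)
rule_confidence = confidence({"milk"}, {"sugar"}, transactions)
```

Here milk and sugar appear together in 3 of 5 transactions, and 3 of the 4 transactions containing milk also contain sugar. A rule is usually kept only when both values exceed thresholds chosen by the analyst.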

4. Classification

Classification is a data mining technique that categorizes items in a collection based on some
predefined properties. It uses methods like if-then rules, decision trees or neural networks to predict a
class or essentially classify a collection of items. A training set containing items whose
properties are known is used to train the system to predict the category of items from an
unknown collection of items.
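As a minimal sketch of the idea, the simplest if-then classifier is a single threshold rule learned from a training set. The attribute (monthly spend), the labels "low" and "high", and all values are assumptions made for illustration; real classifiers such as decision trees combine many such rules.

```python
# Hypothetical training set: (monthly_spend, known_label).
training = [(120, "low"), (150, "low"), (200, "low"),
            (480, "high"), (520, "high"), (610, "high")]

def classify(value, threshold):
    """The learned rule: IF value > threshold THEN 'high' ELSE 'low'."""
    return "high" if value > threshold else "low"

def train_stump(samples):
    """Choose the threshold that misclassifies the fewest training items."""
    values = sorted(v for v, _ in samples)
    # Candidate thresholds are the midpoints between consecutive values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(1 for v, label in samples
                   if classify(v, threshold) != label)

    return min(candidates, key=errors)

threshold = train_stump(training)
# Classify items from an "unknown collection".
predictions = [classify(v, threshold) for v in (90, 300, 700)]
```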

5. Prediction

Prediction estimates unavailable data values or trends such as spending trends. A value for an
object can be anticipated based on the attribute values of the object and the attribute values of
the classes. It can be a prediction of missing numerical values or of increasing or decreasing
trends in time-related information. There are primarily two types of predictions in data mining:
numeric and class predictions.

o Numeric predictions are made by creating a linear regression model that is based
on historical data. Prediction of numeric values helps businesses ramp up for a
future event that might impact the business positively or negatively.
o Class predictions are used to fill in missing class information for products using a
training data set where the class for products is known.
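A numeric prediction of this kind can be sketched with ordinary least squares on a single variable. The quarterly sales figures below are invented for illustration, and a straight-line trend is an assumption that should be checked against real data before it is used for forecasting.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales volume for quarters 1..6.
quarters = [1, 2, 3, 4, 5, 6]
sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]

slope, intercept = fit_line(quarters, sales)
forecast_q7 = slope * 7 + intercept  # predicted volume for the next quarter
```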

6. Cluster Analysis

Clustering is a popular data mining functionality in image processing, pattern recognition and
bioinformatics. It is similar to classification, but the classes are not predefined; the groups
are derived from the data attributes themselves. Similar data are grouped together, with the
difference being that a class label is not known. Clustering algorithms group data based on
similarities and dissimilarities in their features.
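One common clustering algorithm is k-means, which the notes do not name; the sketch below uses it on one-dimensional data purely as an illustration. The spending values and the two starting centroids are assumptions.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: assign each point to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical customer spend; no class labels are given.
spend = [10, 12, 14, 50, 52, 54]
centroids, clusters = kmeans_1d(spend, centroids=[10, 54])
```

The algorithm finds the two groups without being told which customer belongs where, which is the difference from classification.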
7. Outlier Analysis

Outlier analysis is important to understand the quality of data. If there are too many outliers, you
cannot trust the data or draw patterns from it. An outlier analysis determines if there is something
unusual in the data and whether it indicates a situation that a business needs to consider and take
measures to mitigate. Data that the algorithms cannot group into any class are pulled up for
outlier analysis.
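A simple way to flag candidates for outlier analysis is the z-score: how many standard deviations a value lies from the mean. This is only one of several possible methods, and both the threshold of 2 and the order counts below are assumptions for illustration.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    away from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical daily order counts with one suspicious reading.
daily_orders = [10, 11, 9, 10, 12, 10, 11, 95]
outliers = zscore_outliers(daily_orders)
```

Whether a flagged value is a data-entry error or a real event the business should react to still has to be decided by a person.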

8. Evolution and Deviation Analysis

Evolution Analysis pertains to the study of data sets that change over time. Evolution analysis
models are designed to capture evolutionary trends in data helping to characterize, classify,
cluster or discriminate time-related data.

9. Correlation Analysis

Correlation is a mathematical technique for determining whether and how strongly two attributes
are related to one another. It determines how well two numerically measured continuous variables
are linked. Researchers can use this type of analysis to see if there are any possible
correlations between variables in their study.
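The strength of a linear relationship is commonly measured with the Pearson correlation coefficient, which ranges from -1 to +1. The notes do not name a specific coefficient, so Pearson is an assumption here, as are the sample values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.
    Raises ZeroDivisionError if either sequence has no variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical measurements.
ad_spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 6, 8, 10]
r_positive = pearson(ad_spend, revenue)           # rises together
r_negative = pearson(ad_spend, [10, 8, 6, 4, 2])  # one falls as the other rises
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none. Correlation alone does not show that one attribute causes the other.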

Advantages of Data Mining


o The Data Mining technique enables organizations to obtain knowledge-based
data.
o Data mining enables organizations to make lucrative modifications in operation
and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction
of trends and behaviors.
o It can be introduced into new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts
of data in a short time.
Disadvantages of Data Mining
o There is a possibility that organizations may sell useful customer data to
other organizations for money. According to one report, American Express has sold
credit card purchase data of its customers to other organizations.
o Many data mining analytics tools are difficult to operate and require advanced
training.
o Different data mining tools operate in distinct ways due to the different
algorithms used in their design. Therefore, selecting the right data mining
tool is a very challenging task.
o Data mining techniques are not perfectly precise, which may lead to severe
consequences in certain conditions.

Data Mining Applications


Data Mining is primarily used by organizations with intense consumer demands, such as
retail, communication, financial and marketing companies, to determine prices, consumer
preferences, product positioning, and the impact on sales, customer satisfaction, and
corporate profits. Data mining enables a retailer to use point-of-sale records of customer
purchases to develop products and promotions that help the organization attract customers.
The following are areas where data mining is widely used:

Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health system. It uses
data and analytics for better insights and to identify best practices that will enhance health
care services and reduce costs. Analysts use data mining approaches such as Machine
learning, Multi-dimensional database, Data visualization, Soft computing, and statistics.
Data Mining can be used to forecast the number of patients in each category. Such
procedures help ensure that patients get intensive care at the right place and at the right
time. Data mining also enables healthcare insurers to recognize fraud and abuse.

Data Mining in Market Basket Analysis:

Market basket analysis is a modeling method based on the hypothesis that if you buy a
specific group of products, then you are more likely to buy another group of products. This
technique may enable the retailer to understand the purchase behavior of a buyer. This
data may assist the retailer in understanding the requirements of the buyer and altering
the store's layout accordingly. Analytical comparisons of results can also be made between
various stores and between customers in different demographic groups.

Data mining in Education:

Educational data mining (EDM) is a newly emerging field, concerned with developing
techniques that extract knowledge from the data generated in educational environments.
EDM objectives are recognized as predicting students' future learning behavior, studying
the impact of educational support, and promoting learning science. An institution can use
data mining to make precise decisions and also to predict students' results. With
the results, the institution can concentrate on what to teach and how to teach.

Data Mining in Manufacturing Engineering:

Knowledge is the best asset possessed by a manufacturing company. Data mining tools
can be beneficial to find patterns in a complex manufacturing process. Data mining can
be used in system-level designing to obtain the relationships between product
architecture, product portfolio, and data needs of the customers. It can also be used to
forecast the product development period, cost, and expectations among the other tasks.

Data Mining in CRM (Customer Relationship Management):

Customer Relationship Management (CRM) is all about obtaining and retaining customers,
enhancing customer loyalty, and implementing customer-oriented strategies. To build
a good relationship with its customers, a business organization needs to collect data
and analyze it. With data mining technologies, the collected data can be used for
analytics.

Data Mining in Fraud detection:

Billions of dollars are lost to fraud. Traditional methods of fraud detection
are time-consuming and complicated. Data mining provides meaningful
patterns and turns data into information. An ideal fraud detection system should
protect the data of all the users. Supervised methods start from a collection of sample
records, each classified as fraudulent or non-fraudulent. A model is
constructed using this data, and the model is then used to identify whether a new record
is fraudulent or not.

Data Mining in Lie Detection:

Apprehending a criminal is often easier than bringing out the truth from him, which is a
very challenging task. Law enforcement may use data mining techniques to investigate
offenses, monitor suspected terrorist communications, etc. This technique also includes
text mining, which seeks meaningful patterns in data that is usually unstructured text.
The information collected from previous investigations is compared, and a model for
lie detection is constructed.

Data Mining in Financial Banking:

The digitalization of the banking system generates an enormous amount of data with every
new transaction. Data mining techniques can help bankers solve business-related problems
in banking and finance by identifying trends, causalities, and correlations in business
information and market costs that are not instantly evident to managers or executives,
because the data volume is too large or the data is produced too rapidly for experts to
screen. Managers may use these findings for better targeting, acquiring, retaining,
segmenting, and maintaining profitable customers.

Challenges of Implementation in Data mining


Although data mining is very powerful, it faces many challenges during its execution.
Various challenges could be related to performance, data, methods, and techniques, etc.
The process of data mining becomes effective when the challenges or problems are
correctly recognized and adequately resolved.
Incomplete and noisy data:

Data mining is the process of extracting useful data from large volumes of data. Data
in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will
often be inaccurate or unreliable. These problems may occur due to faulty measuring
instruments or because of human error. Suppose a retail chain collects phone numbers of
customers who spend more than $500, and the accounting employees enter the
information into their system. An employee may mistype a digit when entering a
phone number, which results in incorrect data. Some customers may not be willing
to disclose their phone numbers, which results in incomplete data. The data could also get
changed due to human or system error. All these consequences (noisy and incomplete
data) make data mining challenging.
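The phone-number example can be turned into a small data-quality check. This is a sketch under stated assumptions: the records are invented, and a valid number is assumed to contain exactly 10 digits, which varies by country.

```python
import re

# Hypothetical customer records; None marks a number the customer withheld.
customers = [
    {"id": 1, "phone": "5551234567"},
    {"id": 2, "phone": None},
    {"id": 3, "phone": "555123456"},     # one digit short (entry mistake)
    {"id": 4, "phone": "555-987-6543"},  # valid once separators are removed
]

def check_phone(phone, digits=10):
    """Label a phone field as 'missing', 'invalid' or 'ok'."""
    if not phone:
        return "missing"  # incomplete data
    cleaned = re.sub(r"\D", "", phone)  # keep digits only
    return "ok" if len(cleaned) == digits else "invalid"  # noisy data

report = {c["id"]: check_phone(c["phone"]) for c in customers}
```

A check like this catches missing values and wrong lengths, but not a mistyped digit that still yields a well-formed number, which is why noisy data remains a challenge.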

Data Distribution:

Real-world data is usually stored on various platforms in a distributed computing
environment. It might be in a database, on individual systems, or even on the internet.
In practice, it is quite a tough task to bring all the data into a centralized data
repository, mainly due to organizational and technical concerns. For example, various
regional offices may have their own servers to store their data. It may not be feasible to
store all the data from all the offices on a central server. Therefore, data mining requires
the development of tools and algorithms that allow the mining of distributed data.

Complex Data:

Real-world data is heterogeneous, and it could be multimedia data, including audio and
video, images, complex data, spatial data, time series, and so on. Managing these various
types of data and extracting useful information is a tough task. Most of the time, new
technologies, new tools, and methodologies would have to be refined to obtain specific
information.

Performance:

The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then
the efficiency of the data mining process will be affected adversely.

Data Privacy and Security:

Data mining usually leads to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyzes the details of the purchased items, then it
reveals data about buying habits and preferences of the customers without their
permission.

Data Visualization:

In data mining, data visualization is a very important process because it is the primary
method that shows the output to the user in a presentable way. The extracted data should
convey the exact meaning of what it intends to express. But many times, representing the
information to the end-user in a precise and easy way is difficult. Because the input data
and the output information are complicated, very efficient data visualization
processes need to be implemented to present the results successfully.

There are many more challenges in data mining in addition to the problems mentioned above.
More problems are disclosed as the actual data mining process begins, and the success of data
mining relies on overcoming all these difficulties.



Role of data mining in retail industries

In the dynamic and fast-growing retail industry, the consumption of goods
increases day by day, which in turn increases the data collected and used. The
retail industry covers the sale of goods to customers through retailers. It
ranges from a local booth in the street to the big malls in cities. For example, a
grocery shop owner in a given area would know their customers' details
after a few months of sales. Once the owner notes the needs of the customers, it is
easier to enhance sales. The same happens in the big retail industries. They
collect customers' responses to a product, the time zone, their location, shopping
cart history, etc. Knowing the preferred brands and products helps the company create
targeted ads to increase sales and profit.

Knowing the customers:

What is the purpose of sales if the retailer doesn't know who their customers are?
Retailers definitely need to understand their customers. It starts by analyzing
them on various factors. Finding the source through which customers got to know
about the retailing platform helps in improving the retailer's advertising
to attract a completely new set of people. Finding the days on which customers
purchase most frequently can help in planning discount sales or special boosts on
festival days. The time they spend buying per order can give useful statistical data to
enhance growth.

RFM Value:

RFM stands for Recency, Frequency, Monetary value. Recency is
the most recent time when the customer made a purchase. Frequency is
how often purchases take place, and Monetary value is the amount spent
by the customer on purchases. RFM can increase monetization by holding on to
regular and potential customers and keeping them satisfied.
It can also help in winning back lapsing customers who tend to reduce
their purchases. The higher the RFM score, the higher the growth of sales. RFM
also helps avoid sending too many requests to already engaged customers, and it helps to
apply new marketing techniques to customers who order little. RFM helps in
identifying innovative solutions.
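The three RFM values can be derived directly from a purchase log, as in the minimal sketch below. The customers, dates and amounts are hypothetical, and the notes do not specify how the three values are combined into a single score, so only the raw values are computed.

```python
from datetime import date

# Hypothetical purchase log: (customer_id, purchase_date, amount).
purchases = [
    ("C1", date(2024, 1, 5), 40.0),
    ("C1", date(2024, 3, 20), 60.0),
    ("C2", date(2023, 11, 2), 200.0),
    ("C1", date(2024, 3, 28), 25.0),
]

def rfm(purchases, today):
    """Per customer: days since last purchase, purchase count, total spend."""
    table = {}
    for customer, day, amount in purchases:
        row = table.setdefault(
            customer, {"last": day, "frequency": 0, "monetary": 0.0})
        row["last"] = max(row["last"], day)
        row["frequency"] += 1
        row["monetary"] += amount
    return {
        customer: {
            "recency_days": (today - row["last"]).days,
            "frequency": row["frequency"],
            "monetary": row["monetary"],
        }
        for customer, row in table.items()
    }

scores = rfm(purchases, today=date(2024, 4, 1))
```

In this data C1 bought recently and often, while C2 spent more but has not returned for months, which makes C2 a candidate for a win-back offer.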
Market basket analysis:

Market basket analysis is a technique used to study and analyze the
shopping sequence of a customer to increase revenue/sales. This is done by
analyzing the datasets of a particular customer to learn their shopping history,
frequently bought items, and items that are bought together as a combination.
A very good example is the loyalty card issued by the retailer to customers. From
the customer's point of view, the card is needed to keep track of future
discounts, incentive criteria, and the history of transactions.
This analysis can be achieved with data science techniques or various
algorithms. It can even be achieved without technical skills: a spreadsheet tool such
as Microsoft Excel can be used to analyze customer purchases and frequently bought or
frequently grouped items. The spreadsheets can be organized by using an ID
for each transaction. This analysis helps in suggesting products to
the customer which may pair well with their current purchase, which leads
to cross-selling and improved profits. It also helps to track the purchase rate per
month or year. It shows the right time for the retailer to make the desired
offers to attract the right customers for the targeted products.
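The idea of suggesting products that pair well with a current purchase can be sketched by counting how often two items share a transaction ID. The baskets and item names below are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets keyed by transaction ID.
baskets = {
    "T1": ["phone", "cover", "charger"],
    "T2": ["phone", "cover"],
    "T3": ["phone", "charger"],
    "T4": ["cover", "screen guard"],
    "T5": ["phone", "cover"],
}

# Count every unordered pair of distinct items bought in the same transaction.
pair_counts = Counter()
for items in baskets.values():
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

def suggest(item, pair_counts, top_n=2):
    """Items most often bought together with `item`, most frequent first."""
    companions = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            companions[b] += n
        elif b == item:
            companions[a] += n
    return [name for name, _ in companions.most_common(top_n)]

suggestions = suggest("phone", pair_counts)
```

The same counts could be produced in a spreadsheet with a pivot on the transaction ID; the code only makes the grouping step explicit.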

Potent sales campaign:

Everything nowadays needs advertising, because advertising a product helps
people know about its existence, use, and features. It takes the product from
the warehouse to the real world. To attract the right customers, data
must be analyzed. This is where the sales or marketing campaigns run
by retailers come in. Marketing campaigns must be initiated with the right plans,
or they may cause losses to the company through over-investment in untargeted
advertisements. A sales campaign depends on the time, location, and
preferences of the customer. The platform on which the campaign takes place
also plays a major role in pulling the right customers in. It requires regular
analysis of the sales and associated data on a particular platform
at a certain time. The traffic on social or network platforms indicates whether
the campaigned product is favored or not. The retailer can adjust the
campaign using the previous statistics, which can increase sales profit
and prevent overspending.
Role of data mining in telecommunication industries

In a highly evolving and competitive environment, the telecommunication
industry plays a major role in handling huge data sets of customer, network and call
data. To thrive in such an environment, the telecommunication industry must
find a way to handle data easily. Data mining is used to enhance the business
and to solve problems in this industry. Its major functions include identifying
fraudulent calls and spotting defects in a network to isolate faults. Data
mining can also make marketing techniques more effective. However, this industry
faces challenges in dealing with the logical and time aspects of data mining,
which call for detecting rare events in telecommunication data in order to find
network faults or customer fraud in real time.

Call detail data:

Whenever a call starts in the telecommunication network, the details of the call
are recorded: the date and time at which it starts, the duration of the
call, and the time when it ends. Since all the data of a call is collected in
real time, it is ready to be processed with data mining techniques. However, the data
should be aggregated at the customer level rather than analyzed as isolated single
phone calls. Thus, by efficient extraction of data, one can find the customer's calling
pattern. Some of the data that help to find the pattern are:
o Average duration of calls
o Time at which the call took place (daytime/night-time)
o The average number of calls on weekdays
o Calls made to varied area codes
o Calls made per day, etc.
By studying the right customer call details, one can advance business
growth. If a customer makes more calls during daytime working hours, that suggests
the line belongs to a business. If the night-time call rate is high,
the line is probably used only for residential or domestic purposes.
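Aggregating call records to the customer level might look like the sketch below. The record layout, the sample calls, and the definition of daytime as 08:00 to 18:00 are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical call detail records: (customer_id, start_time, duration_seconds).
records = [
    ("A", datetime(2024, 3, 4, 10, 15), 300),
    ("A", datetime(2024, 3, 5, 14, 0), 120),
    ("A", datetime(2024, 3, 6, 21, 30), 60),
    ("B", datetime(2024, 3, 4, 22, 5), 900),
    ("B", datetime(2024, 3, 9, 23, 45), 600),
]

def is_daytime(start, start_hour=8, end_hour=18):
    """Assumed working hours; adjust to the operator's own definition."""
    return start_hour <= start.hour < end_hour

def summarize(records):
    """Turn single-call records into one feature row per customer."""
    grouped = defaultdict(list)
    for customer, start, duration in records:
        grouped[customer].append((start, duration))
    summary = {}
    for customer, calls in grouped.items():
        day_calls = sum(1 for start, _ in calls if is_daytime(start))
        summary[customer] = {
            "calls": len(calls),
            "avg_duration": sum(d for _, d in calls) / len(calls),
            "daytime_share": day_calls / len(calls),
            # weekday() returns 0-4 for Monday to Friday
            "weekday_calls": sum(1 for s, _ in calls if s.weekday() < 5),
        }
    return summary

profiles = summarize(records)
```

Customer A calls mostly in the daytime and B only at night, so under the reasoning above A looks like a business line and B like a residential one.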

Data of customers:

The telecommunication industry has an enormous
number of customers. The customer database is maintained for any further
queries in the data mining process. For example, when a customer fraud case is
encountered, details in the customer database such as the person's name and address
help in identifying the person, making it
easier to trace them and resolve the issue. This dataset can also be
supplemented from external sources, because much of this information is
commonly available. It also includes the subscription plan chosen and the payment
history. Using this dataset can accelerate growth in the telecommunication
industry.

Network Data:

Because telecommunication networks use highly developed, complex equipment,
every part of the system may generate errors
and status messages. This leads to a large amount of network data being processed.
This data must be separated, grouped, and stored in order so that network faults
can be isolated. This ensures that the error or status messages of
any part of the network system reach the technical specialists, so that they
can rectify the problem. Since the database is enormous, when a large number of status
or error messages are generated, it becomes difficult to solve the problems
manually. So, the handling of some sets of errors and messages can be automated to
reduce the strain. A methodical data mining approach can manage the network system
efficiently and enhance its functions.

Preparing and clustering data:

Even though raw data is what data mining processes, it must be in a well-understood
and properly arranged format to be processed. In the telecommunication
industry, which deals with giant databases, this is an important need. First, clashing
and contradictory data must be identified to avoid inconsistency. Undesired data
fields that take up space must be removed. The data must be organized and
mapped by finding the relationships between datasets to avoid redundancy.
Clustering or grouping similar data can be done by algorithms in the data mining
field. It can help in analyzing patterns like calling patterns or customer
behavior patterns. Groups are formed by analyzing the similarities
between records. By doing this, data can easily be understood, which leads to easy
manipulation and use.

Customer profiling:

The telecommunication industry deals with customer details on a large scale. It
starts by observing customer patterns in call data in order to profile the customers
and predict future trends. By knowing the customer pattern, the company can
decide which promotions to offer to the customer. If a customer's calls stay within
one area code, a promotion targeted at that area can win a group of
customers. This makes promotion more cost-effective: instead of
investing in a single subscriber, the company can attract a group of people
with the right plan. Privacy issues arise when the customer's call history or details
are monitored.
One of the significant problems that the telecommunication industry faces is
customer churn. This can also be described as customer turnover, in which the
company loses its clients. In this case, the client leaves and switches to another
telecommunication company. If the customer churn rate is high in a company, the
company will experience a severe loss of revenue and profit, which will
slow its growth. This issue can be addressed by using data mining techniques
to collect customer patterns and profile customers. Incentive offers provided by
companies attract the regular users of other companies. By profiling the data,
customer churn can be effectively forecasted from behaviors like
subscription history, the plan chosen, and so on. While collecting data from
paying customers, it is also possible to collect data on the receiver or
non-customer, but with a set of restrictions.

Fraud detection:

Fraud is a critical problem for telecommunication industries, causing loss of
revenue and a deterioration in customer relations. The two major
fraud activities involved are subscription fraud and super-imposed fraud.
Subscription fraud involves collecting the details of customers, mostly from
KYC (Know Your Customer) documents, such as name, address, and ID proof details.
These details are used to sign up for telecom services with authenticated
approval but without any intention of paying for the services used through the
account. Some offenders do not stop at the illegitimate use of services but
also perform bypass fraud by diverting voice traffic from local to international protocols,
which causes heavy losses to the telecommunication company. Super-imposed
fraud starts with a legitimate account and legal activity, but
then leads to overlapping or imposed activity by some other person illegally
using the services instead of the account holder. By collecting the behavioral
pattern of the account holder, suspected super-imposed fraudulent
activity can be detected, leading to immediate action such as blocking or deactivating
the account. This prevents further damage to the company.
Role of data mining in Health industries
Data mining has been used intensively and widely by numerous industries. In healthcare, data
mining is becoming more popular nowadays. Data mining applications can greatly benefit all
parties who are involved in the healthcare industry. For example, data mining can help the
healthcare industry in detecting fraud and abuse, customer relationship management, effective
patient care and best practices, and affordable healthcare services. The large amounts of data
generated by healthcare transactions are too complex and huge to be processed and analyzed by
conventional methods.

Data mining provides the framework and techniques to transform these data into useful
information for data-driven decision purposes.

Treatment effectiveness:

Data mining applications can be used to assess the effectiveness of medical treatments. Data
mining can show which course of action proves effective by comparing and
contrasting causes, symptoms, and courses of treatment.

Healthcare management:

Data mining applications can be used to identify and track chronic illness states and intensive
care unit patients, decrease the number of hospital admissions, and support healthcare
management. Data mining is also used to analyze massive data sets and statistics to search for
patterns that may indicate an attack by bio-terrorists.

Customer relationship management:

Customer and management interactions are crucial for any organization to achieve its business
goals. Customer relationship management is the primary approach to managing interactions
between commercial organizations, normally in the retail and banking sectors, and their
customers. It is similarly important in the healthcare context. Customer interactions may happen
through call centers, billing departments, and ambulatory care settings.

Fraud and abuse:

Data mining applications for fraud and abuse can focus on inappropriate or wrong prescriptions
and on fraudulent insurance and medical claims.
