BAunit-2 Notes
o Data Loading
o Data Access
4. Integrated: A data warehouse is created by integrating data from
numerous different sources, such as mainframe computers and relational
databases. It must also use consistent naming conventions, formats, and
codes. Integration makes successful analysis of the data possible, so
consistency in naming conventions, column scaling, encoding structure,
etc. needs to be confirmed. An integrated data warehouse can serve
numerous subject-oriented warehouses.
External Data: For data gathering, most executives and data analysts
rely on external sources for a large share of the information they use.
They use statistics relating to their organization that are published by
external sources and departments.
Internal Data: In every organization, users keep their “private”
spreadsheets, reports, client profiles, and sometimes even departmental
databases. This is internal data, part of which might be useful in a
data warehouse.
Operational System data: Operational systems are principally meant to
run the business. In each operational system, we periodically take the old
data and store it in archived files.
Flat files: A flat file is nothing but a text database that stores data in a
plain text format. Flat files generally are text files that have all data
processing and structure markup removed. A flat file contains a table with
a single record per line.
2. Data Staging:
After the data is extracted from various sources, the data files must be
prepared for storage in the data warehouse. The extracted data collected from
various sources must be transformed and made ready in a format that is
suitable to be saved in the data warehouse for querying and analysis. Data
staging performs three primary functions:
Data Extraction: This stage handles various data sources. Data analysts
should employ suitable techniques for every data source.
Data Transformation: As we all know, data for a data warehouse comes
from many different sources. If data extraction for a data warehouse
poses a big challenge, data transformation presents even greater
challenges. We perform many individual tasks as part of data
transformation.
Data Loading: When we complete the structure and construction of the
data warehouse and go live for the first time, we do the initial loading of
the data into the data warehouse storage. The initial load moves high
volumes of data consuming a considerable amount of time.
4. Data Marts:
Data marts are also part of the storage component in a data warehouse. A data
mart stores the information of a specific function of an organization that is
handled by a single authority. There may be any number of data marts in a
particular organization, depending upon its functions. In short, data marts
contain subsets of the data stored in data warehouses.
Now, the users and analysts can use data for various applications like reporting,
analyzing, mining, etc. The data is made available to them whenever required.
A Data Warehouse is like a central repository where data comes from different
data sources. In a data warehouse, the data flows from transactional systems
and relational databases. A data warehouse periodically pulls data from
various apps and systems. The data then goes through processing and
formatting so that it matches the format of the data already in the
warehouse. This processed data is stored in the data warehouse, ready for
further analysis for decision making. The data formatting and processing
depend upon the needs of the organization.
1. Structured
2. Semi-structured
3. Unstructured data
The data is processed and transformed so that users and analysts can access
the processed data in the Data Warehouse through Business Intelligence tools,
SQL clients, and spreadsheets. A data warehouse merges all information
coming from various sources into one global and complete database. By
merging all this information in one place, it becomes easier for an organization
to analyze its customers more comprehensively.
Data warehousing is a fast-growing field, and various cloud data
warehousing tools and technologies have been developed for better decision making.
The cloud-based data warehousing tools are fast, highly scalable, and available
on a pay-per-use basis. Following are some data warehousing tools:
1. Amazon Redshift
2. Microsoft Azure
3. Google BigQuery
4. Snowflake
5. Micro Focus Vertica
6. Teradata
7. Amazon DynamoDB
8. PostgreSQL
The ETL process is technically challenging and requires active input from various
stakeholders, including developers, analysts, testers, and top executives.
To maintain its value as a tool for decision-makers, the data warehouse system needs to
change with business changes. ETL is a recurring activity (daily, weekly, monthly) of a
data warehouse system and needs to be agile, automated, and well documented.
Cleansing
The cleansing stage is crucial in a data warehouse system because it is supposed to
improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing mistakes
and to recognize synonyms, as well as rule-based cleansing to enforce domain-specific
rules and define appropriate associations between values.
If an enterprise wishes to contact its users or its suppliers, a complete, accurate, and
up-to-date list of contact addresses, email addresses, and telephone numbers must be
available.
If a client or supplier calls, the staff responding should be able to find the person quickly
in the enterprise database, but this requires that the caller's name or company name
is listed in the database.
If a user appears in the database with two or more slightly different names or different
account numbers, it becomes difficult to update the customer's information.
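The rectification, rule-based cleansing, and duplicate problems described above can be
sketched in a few lines of Python. This is only a minimal illustration and not how any
particular ETL tool works; the dictionary CITY_DICTIONARY, the 10-digit phone rule, and
the sample records are all assumptions made up for the example.

```python
import re

# Hypothetical correction dictionary: maps typos and synonyms to one canonical value.
CITY_DICTIONARY = {
    "bombay": "Mumbai",
    "mumbay": "Mumbai",
    "mumbai": "Mumbai",
    "new dehli": "New Delhi",
    "new delhi": "New Delhi",
}

def rectify_city(raw):
    """Dictionary-based rectification: fix typos and map synonyms."""
    key = raw.strip().lower()
    return CITY_DICTIONARY.get(key, raw.strip())

def clean_phone(raw):
    """Rule-based cleansing (assumed rule): keep digits, accept only 10-digit numbers."""
    digits = re.sub(r"\D", "", raw)
    return digits if len(digits) == 10 else None

def normalize_name(raw):
    """Homogenization: collapse whitespace and use one letter case."""
    return " ".join(raw.split()).title()

def deduplicate(records):
    """Clean every record, then keep one record per (name, phone) pair."""
    seen = {}
    for rec in records:
        cleaned = {
            "name": normalize_name(rec["name"]),
            "city": rectify_city(rec["city"]),
            "phone": clean_phone(rec["phone"]),
        }
        seen.setdefault((cleaned["name"], cleaned["phone"]), cleaned)
    return list(seen.values())

records = [
    {"name": "asha  rao", "city": "Bombay", "phone": "98765-43210"},
    {"name": "Asha Rao", "city": "mumbay", "phone": "9876543210"},
    {"name": "R. Mehta", "city": "New Dehli", "phone": "12345"},
]
cleaned_records = deduplicate(records)
```

Here the two spellings of the same customer collapse into one record, and the invalid
phone number is stored as missing instead of being kept as wrong data.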
Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.
Following are the main transformation processes aimed at populating the reconciled data
layer:
o Conversion and normalization that operate on both storage formats and units of measure
to make data uniform.
o Matching that associates equivalent fields in different sources.
o Selection that reduces the number of source fields and records.
Cleansing and Transformation processes are often closely linked in ETL tools.
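The three transformation processes listed above can be illustrated with a small Python
sketch. The source names ("crm", "shop"), the field names, and the FIELD_MAP table are
hypothetical; a real reconciled layer would define its own mappings.

```python
# Hypothetical mapping of equivalent fields in two sources (matching).
FIELD_MAP = {
    "crm": {"cust_name": "customer", "amt_usd": "amount_usd", "dt": "date"},
    "shop": {"client": "customer", "amount_cents": "amount_usd", "sold_on": "date"},
}

def convert(source, field, value):
    """Conversion and normalization: one unit of measure and one date format."""
    if source == "shop" and field == "amount_cents":
        return value / 100              # cents -> dollars
    if field in ("dt", "sold_on"):
        return value.replace("/", "-")  # 2024/01/31 -> 2024-01-31
    return value

def transform(source, record):
    """Keep only mapped fields (selection) and rename them (matching)."""
    out = {}
    for field, value in record.items():
        target = FIELD_MAP[source].get(field)
        if target is not None:          # unmapped source fields are dropped
            out[target] = convert(source, field, value)
    return out

crm_row = {"cust_name": "Asha Rao", "amt_usd": 12.5, "dt": "2024/01/31", "notes": "vip"}
shop_row = {"client": "Asha Rao", "amount_cents": 1250, "sold_on": "2024-01-31", "till": 4}

reconciled = [transform("crm", crm_row), transform("shop", shop_row)]
```

After transformation both rows have the same fields, units, and date format, so they
can be stored side by side in the reconciled layer.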
Loading
The Load is the process of writing the data into the target database. During the load step,
it is necessary to ensure that the load is performed correctly and with as few resources
as possible.
Loading can be carried out in two ways:
1. Refresh: Data Warehouse data is completely rewritten. This means that the older data is
replaced. Refresh is usually used in combination with static extraction to populate a data
warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying preexisting
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
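The difference between the two loading modes can be sketched with Python's built-in
sqlite3 module. The table name dw_sales and its columns are assumptions for the
example; an in-memory database stands in for the real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

def refresh(conn, full_extract):
    """Refresh: rewrite the warehouse table from a static (full) extract."""
    with conn:  # one transaction: old rows are replaced atomically
        conn.execute("DELETE FROM dw_sales")
        conn.executemany("INSERT INTO dw_sales VALUES (?, ?)", full_extract)

def update(conn, new_rows):
    """Update: append only the rows from an incremental extract."""
    with conn:
        conn.executemany("INSERT INTO dw_sales VALUES (?, ?)", new_rows)

refresh(conn, [(1, 10.0), (2, 20.0)])   # initial population
update(conn, [(3, 30.0)])               # regular incremental load

row_count = conn.execute("SELECT COUNT(*) FROM dw_sales").fetchone()[0]
total = conn.execute("SELECT SUM(amount) FROM dw_sales").fetchone()[0]
```

The update leaves the two existing rows untouched and only adds the new one, while a
second call to refresh would discard everything and reload from scratch.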
A star schema is a relational schema whose design represents
a multidimensional data model. The star schema is the simplest data warehouse schema. It
is known as a star schema because the entity-relationship diagram of this schema
resembles a star, with points diverging from a central table. The center of the schema
consists of a large fact table, and the points of the star are the dimension tables.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to dimensions.
A fact table has two types of columns: those that contain facts and those that are foreign
keys to the dimension tables. The primary key of a fact table is generally a composite key
that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often called summary tables instead). A fact table
generally contains facts with the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that
categorize data. If a dimension has no hierarchies and levels, it is called a flat
dimension or list. The primary key of each dimension table is part of the
composite primary key of the fact table. Dimensional attributes help to describe the
dimensional value. They are generally descriptive, textual values. Dimension tables are
usually smaller than fact tables.
For example, fact tables store data about sales, while dimension tables store data about
the geographic regions (markets, cities), clients, products, times, and channels.
In a star schema database design, the dimensions are connected only through the central
fact table. When two dimension tables are used in a query, only one join path,
passing through the fact table, exists between those two tables. This design feature
enforces accurate and consistent query results.
Structural simplicity also decreases the time required to load large batches of records into
a star schema database. By defining facts and dimensions and separating them into
different tables, the impact of a load operation is reduced. Dimension tables can be populated
once and occasionally refreshed. We can add new facts regularly and selectively by
appending records to a fact table.
Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.
The TIME table has a column for each day, month, quarter, and year. The ITEM table has
columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table has
columns for branch_key, branch_name, and branch_type. The LOCATION table has
columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension
tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four
columns for ITEM data, three columns for BRANCH data, and four columns for LOCATION
data. Thus, the size of the fact table is significantly reduced. When we need to change an
item, we need only make a single change in the dimension table, instead of making many
changes in the fact table.
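The SALES example can be tried out with Python's built-in sqlite3 module. This is a
sketch under assumptions: the table names carry a _dim suffix to avoid clashing with
SQL keywords, the surrogate keys time_key and location_key and the measure column
units_sold are not named in the notes, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                           quarter INTEGER, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                           type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           state TEXT, country TEXT);
CREATE TABLE sales (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold   INTEGER,   -- assumed measure, not named in the notes
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
INSERT INTO time_dim VALUES (1, 15, 1, 1, 2024), (2, 20, 4, 2, 2024);
INSERT INTO item_dim VALUES (1, 'Laptop', 'Acme', 'Electronics', 'Wholesale');
INSERT INTO branch_dim VALUES (1, 'Central', 'Retail');
INSERT INTO location_dim VALUES (1, 'MG Road', 'Pune', 'Maharashtra', 'India');
INSERT INTO sales VALUES (1, 1, 1, 1, 5), (2, 1, 1, 1, 7);
""")

# Units sold per quarter and city: each dimension is reached through the fact table.
rows = conn.execute("""
    SELECT t.quarter, l.city, SUM(s.units_sold)
    FROM sales s
    JOIN time_dim t     ON s.time_key = t.time_key
    JOIN location_dim l ON s.location_key = l.location_key
    GROUP BY t.quarter, l.city
    ORDER BY t.quarter
""").fetchall()

# Renaming an item touches one dimension row; the fact rows stay unchanged.
conn.execute("UPDATE item_dim SET item_name = 'Notebook' WHERE item_key = 1")
fact_rows_after_rename = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

The composite primary key of sales is made up of its four foreign keys, matching the
description of fact tables given earlier.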
Data Mining
Data mining is one of the most useful techniques that help entrepreneurs, researchers,
and individuals to extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Databases (KDD). The knowledge discovery process
includes Data cleaning, Data integration, Data selection, Data transformation, Data
mining, Pattern evaluation, and Knowledge presentation.
In other words, Data Mining is the process of investigating hidden patterns
in information from various perspectives and categorizing them into useful data. This data
is collected and assembled in particular areas such as data warehouses, where efficient
analysis and data mining algorithms help decision making and other data requirements,
eventually cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find
trends and patterns that go beyond simple analysis procedures. Data mining utilizes
complex mathematical algorithms to segment data and evaluate the probability of
future events.
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science carried out by a person, in a specific situation, on a
particular data set, with an objective. This process includes various types of services such
as text mining, web mining, audio and video mining, pictorial data mining, and social
media mining. It is done through software that is simple or highly specific. By outsourcing
data mining, all the work can be done faster with low operation costs. Specialized firms
can also use new technologies to collect data that is impossible to locate manually. There
are tonnes of information available on various platforms, but very little knowledge is
accessible. The biggest challenge is to analyze the data to extract important information
that can be used to solve a problem or for company development. There are many
powerful instruments and techniques available to mine data and find better insight from
it.
Types of Data Mining
Data mining can be performed on the following types of data:
Relational Database:
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within
the organization to provide meaningful business insights. The huge amount of data
comes from multiple places such as Marketing and Finance. The extracted data is utilized
for analytical purposes and helps in decision- making for a business organization. The
data warehouse is designed for the analysis of data rather than transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within
an IT structure, for example, a group of databases where an organization has kept various
kinds of information.
Object-Relational Database:
One of the primary objectives of the Object-relational data model is to close the gap
between the Relational database and the object-oriented model practices frequently
utilized in many programming languages, for example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that has the
potential to undo a database transaction if it is not performed appropriately. Even though
this was a unique capability a very long while back, today, most of the relational database
systems support transactional database activities.
Tasks and Functionalities of Data Mining
Data mining tasks are designed to be semi-automatic or fully automatic and to run on large data
sets to uncover patterns such as groups or clusters, unusual records (anomaly
detection), and dependencies such as associations and sequential patterns. Once patterns are
uncovered, they can be thought of as a summary of the input data, and further analysis may be
carried out using Machine Learning and Predictive analytics. For example, the data mining step
might help identify multiple groups in the data that a decision support system can use. Note that
data collection, preparation, and reporting are not part of data mining.
There is a lot of confusion between data mining and data analysis. Data mining functions are
used to define the trends or correlations found during data mining activities. Data analysis
is used to test statistical models that fit the dataset, for example, analysis of a marketing
campaign, whereas data mining uses Machine Learning and mathematical and statistical models to
discover patterns hidden in the data. Data mining activities can be divided into
two categories:
Data mining is extensively used in many areas or sectors. It is used to predict and characterize
data. But the ultimate objective in Data Mining Functionalities is to observe the various
trends in data mining. There are several data mining functionalities that the organized and
scientific methods offer, such as:
1. Class/Concept Descriptions
A class or concept implies there is a data set or set of features that define the class or a concept.
A class can be a category of items on a shop floor, and a concept could be the abstract idea on
which data may be categorized like products to be put on clearance sale and non-sale products.
There are two concepts here, one that helps with grouping and the other that helps in
differentiating.
2. Mining Frequent Patterns
One of the functions of data mining is finding data patterns. Frequent patterns are things that are
discovered to be most common in data. Various types of frequent patterns can be found in the dataset.
o Frequent item set: This term refers to a group of items that are commonly found together,
such as milk and sugar.
o Frequent substructure: It refers to the various types of data structures that can be
combined with an item set or subsequences, such as trees and graphs.
o Frequent Subsequence: A regular pattern series, such as buying a phone followed by a
cover.
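A frequent item set can be found by simple counting. The sketch below is a brute-force
count, not an optimized algorithm such as Apriori; the baskets and the threshold
min_count are made up for illustration.

```python
from collections import Counter
from itertools import combinations

transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "tea"},
    {"milk", "sugar", "tea"},
]

def frequent_itemsets(transactions, size, min_count):
    """Count every item set of the given size and keep the frequent ones."""
    counts = Counter()
    for basket in transactions:
        # sorted() gives each item set one canonical tuple form
        for itemset in combinations(sorted(basket), size):
            counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

frequent_pairs = frequent_itemsets(transactions, size=2, min_count=3)
```

With these baskets only milk and sugar appear together in at least three transactions.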
3. Association Analysis
It analyses the set of items that generally occur together in a transactional dataset. It is also
known as Market Basket Analysis for its wide use in retail sales. Two parameters are used for
determining the association rules:
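The two parameters are not listed in these notes. In most textbooks the pair is support
and confidence, and the sketch below assumes that this is what was meant. The baskets
are made-up sample data.

```python
transactions = [
    {"milk", "sugar", "bread"},
    {"milk", "sugar"},
    {"milk", "bread"},
    {"sugar", "tea"},
    {"milk", "sugar", "tea"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in the item set."""
    return sum(1 for basket in transactions if itemset <= basket)

def support(itemset, transactions):
    """Fraction of all transactions that contain the item set."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions with the antecedent, the fraction that also have the consequent."""
    return count(antecedent | consequent, transactions) / count(antecedent, transactions)

# Rule "milk -> sugar"
rule_support = support({"milk", "sugar"}, transactions)
rule_confidence = confidence({"milk"}, {"sugar"}, transactions)
```

Here milk and sugar occur together in 3 of 5 baskets, and 3 of the 4 baskets that
contain milk also contain sugar.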
4. Classification
Classification is a data mining technique that categorizes items in a collection based on some
predefined properties. It uses methods like if-then, decision trees or neural networks to predict a
class or essentially classify a collection of items. A training set containing items whose
properties are known is used to train the system to predict the category of items from an
unknown collection of items.
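As a small illustration of training on known items and then predicting unknown ones,
the sketch below learns a one-level decision tree (a single if-then rule on one numeric
feature). The feature (monthly spend), the labels, and the training data are all
hypothetical, and real classifiers use many features and deeper trees.

```python
def train_stump(training_set):
    """Learn one threshold on one numeric feature from (value, label) pairs.

    Assumes exactly two distinct labels and at least two distinct values.
    """
    values = sorted({value for value, _ in training_set})
    labels = sorted({label for _, label in training_set})
    best = None
    # Candidate thresholds are midpoints between neighbouring values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        # Try both ways of assigning the two labels to the two sides.
        for below, above in (labels, labels[::-1]):
            errors = sum(
                1 for value, label in training_set
                if (below if value <= threshold else above) != label
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return lambda value: below if value <= threshold else above

# Training set: monthly spend with a known customer class.
training_set = [
    (120, "regular"), (150, "regular"), (200, "regular"),
    (480, "premium"), (520, "premium"), (610, "premium"),
]
classify = train_stump(training_set)
predicted = [classify(130), classify(550)]
```

The learned rule amounts to "if spend is at most 340 then regular, else premium", which
is then applied to customers whose class is not known.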
5. Prediction
Prediction finds some unavailable data values or spending trends. The value of an object can be
anticipated based on the attribute values of the object and attribute values of the classes. It can be a
prediction of missing numerical values or of increasing or decreasing trends in time-related information.
There are primarily two types of predictions in data mining: numeric and class predictions.
o Numeric predictions are made by creating a linear regression model that is based
on historical data. Prediction of numeric values helps businesses ramp up for a
future event that might impact the business positively or negatively.
o Class predictions are used to fill in missing class information for products using a
training data set where the class for products is known.
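A numeric prediction with a linear regression model can be worked through by hand.
The sketch below fits a straight line by ordinary least squares; the monthly sales
figures are made-up historical data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

months = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]   # made-up historical data
slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept
```

Because the sample data rises by 10 each month, the fitted line has slope 10 and
intercept 100, so the forecast for month 6 is 160.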
6. Cluster Analysis
In image processing, pattern recognition and bioinformatics, clustering is a popular data mining
functionality. It is similar to classification, but the classes are not predefined. Data attributes
represent the classes. Similar data are grouped together, with the difference being that a class label
is not known. Clustering algorithms group data based on similar features and dissimilarities.
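Grouping without predefined classes can be illustrated with k-means, one common
clustering algorithm (the notes do not name a specific one). The sketch works on a
single numeric attribute; the data and the starting centroids are assumptions.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on one numeric attribute with given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

call_minutes = [2, 3, 4, 40, 42, 45]
centroids, clusters = k_means_1d(call_minutes, centroids=[0, 50])
```

No labels are supplied; the short calls and the long calls end up in separate groups
purely because of their similarity.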
7. Outlier Analysis
Outlier analysis is important to understand the quality of data. If there are too many outliers, you
cannot trust the data or draw patterns. An outlier analysis determines if there is something out of
turn in the data and whether it indicates a situation that a business needs to consider and take
measures to mitigate. Data that the algorithms cannot group into any class is pulled up for
outlier analysis.
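One simple way to flag outliers is the z-score, which measures how many standard
deviations a value lies from the mean. This is only one of several possible methods;
the threshold of 2.0 and the sample data are assumptions for the example.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:          # all values identical: nothing can be an outlier
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

daily_orders = [10, 12, 11, 13, 12, 11, 95]
outliers = z_score_outliers(daily_orders)
```

The value 95 is flagged, and it is then up to the business to decide whether it is a
data entry error or a real event that needs attention.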
8. Evolution Analysis
Evolution Analysis pertains to the study of data sets that change over time. Evolution analysis
models are designed to capture evolutionary trends in data, helping to characterize, classify,
cluster, or discriminate time-related data.
9. Correlation Analysis
Correlation is a mathematical technique for determining whether and how strongly two attributes
are related to one another. It determines how well two numerically
measured continuous variables are linked. Researchers can use this type of analysis to see if there
are any possible correlations between variables in their study.
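The strength of a linear relationship between two numeric attributes is commonly
measured with the Pearson correlation coefficient, which ranges from -1 to +1. The
notes do not name a specific coefficient, so using Pearson here is an assumption, and
the attribute values are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

ad_spend = [1, 2, 3, 4, 5]
revenue = [2, 4, 6, 8, 10]    # rises in step with ad_spend
discount = [5, 4, 3, 2, 1]    # falls as ad_spend rises

r_positive = pearson(ad_spend, revenue)
r_negative = pearson(ad_spend, discount)
```

A value near +1 means the attributes rise together, a value near -1 means one falls as
the other rises, and a value near 0 means no linear relationship.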
Data mining in healthcare has excellent potential to improve the health system. It uses
data and analytics for better insights and to identify best practices that will enhance health
care services and reduce costs. Analysts use data mining approaches such as Machine
learning, Multi-dimensional databases, Data visualization, Soft computing, and statistics.
Data Mining can be used to forecast the number of patients in each category. The procedures
ensure that the patients get intensive care at the right place and at the right time. Data
mining also enables healthcare insurers to recognize fraud and abuse.
Market basket analysis is a modeling method based on the hypothesis that if you buy a specific
group of products, then you are more likely to buy another group of products. This
technique may enable the retailer to understand the purchase behavior of a buyer. This
data may assist the retailer in understanding the requirements of the buyer and altering
the store's layout accordingly. Results can also be compared analytically between
various stores and between customers in different demographic groups.
Educational data mining (EDM) is a newly emerging field concerned with developing techniques
that explore knowledge from the data generated in educational environments. EDM
objectives are recognized as predicting students' future learning behavior, studying the
impact of educational support, and promoting learning science. An organization can use
data mining to make precise decisions and also to predict the results of students. With
the results, the institution can concentrate on what to teach and how to teach.
Knowledge is the best asset possessed by a manufacturing company. Data mining tools
can be beneficial to find patterns in a complex manufacturing process. Data mining can
be used in system-level designing to obtain the relationships between product
architecture, product portfolio, and data needs of the customers. It can also be used to
forecast the product development period, cost, and expectations among the other tasks.
Customer Relationship Management (CRM) is all about obtaining and retaining customers,
enhancing customer loyalty, and implementing customer-oriented strategies. To build
a good relationship with the customer, a business organization needs to collect data
and analyze it. With data mining technologies, the collected data can be used for
analytics.
Billions of dollars are lost to fraud. Traditional methods of fraud detection
are time consuming and complex. Data mining finds meaningful
patterns and turns data into information. An ideal fraud detection system should
protect the data of all the users. Supervised methods start from a collection of sample
records that are classified as fraudulent or non-fraudulent. A model is
constructed using this data, and the model is then used to identify whether a new record
is fraudulent or not.
Apprehending a criminal is not a big deal, but bringing out the truth from him is a very
challenging task. Law enforcement may use data mining techniques to investigate
offenses, monitor suspected terrorist communications, etc. This technique includes text
mining also, and it seeks meaningful patterns in data, which is usually unstructured text.
The information collected from the previous investigations is compared, and a model for
lie detection is constructed.
The process of extracting useful data from large volumes of data is data mining. The data
in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities will
usually be inaccurate or unreliable. These problems may occur due to faulty data measuring
instruments or because of human errors. Suppose a retail chain collects phone numbers of
customers who spend more than $500, and the accounting employees put the
information into their system. The person may make a digit mistake when entering the
phone number, which results in incorrect data. Some customers may not be willing
to disclose their phone numbers, which results in incomplete data. The data could also get
changed due to human or system error. All these consequences (noisy and incomplete
data) make data mining challenging.
Data Distribution:
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and
video, images, complex data, spatial data, time series, and so on. Managing these various
types of data and extracting useful information is a tough task. Most of the time, new
technologies, new tools, and methodologies would have to be refined to obtain specific
information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then
the efficiency of the data mining process will be affected adversely.
Data mining usually leads to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyzes the details of the purchased items, then it
reveals data about buying habits and preferences of the customers without their
permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary
method that shows the output to the user in a presentable way. The extracted data should
convey the exact meaning of what it intends to express. But many times, representing the
information to the end-user in a precise and easy way is difficult. Because the input data
and the output information are complicated, very efficient data visualization
processes need to be implemented to make it successful.
There are many more challenges in data mining in addition to the problems mentioned above.
More problems are disclosed as the actual data mining process begins, and the success of data
mining relies on overcoming all these difficulties.
What is the purpose of sales if the retailer doesn’t know who their customers are?
Retailers definitely need to understand their customers. This starts by analyzing
them on various factors. Finding the source through which the customer got to know
about the retailing platform helps in improving the retailer's advertising
to attract a completely new set of people. Finding the days on which customers
purchase frequently can help in planning discount sales or special promotions on festival
days. The time they spend buying per order can give us useful statistical data to
enhance growth.
RFM Value:
RFM stands for Recency, Frequency, and Monetary value. Recency is how
recently the customer made a purchase. Frequency is
how often purchases take place, and Monetary value is the amount spent
by the customer on purchases. RFM can increase revenue by holding on to
regular and potential customers and keeping them happy with satisfying
results. It can also help in winning back lapsing customers who tend to reduce
their purchases. The higher the RFM score, the greater the growth of sales. RFM
also helps avoid sending too many requests to already engaged customers, and it helps to
implement new marketing techniques for customers who order little. RFM helps in
identifying innovative solutions.
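The three RFM values can be computed directly from a purchase log. In the sketch below
the purchase data, the reference date, and the cut-offs used in rfm_score are all
hypothetical; real projects usually derive the cut-offs from the data itself (for
example by ranking customers into equal-sized groups).

```python
from datetime import date

# Made-up purchase log: (customer, purchase date, amount spent).
purchases = [
    ("alice", date(2024, 3, 28), 120.0),
    ("alice", date(2024, 3, 10), 80.0),
    ("alice", date(2024, 2, 1), 50.0),
    ("bob", date(2023, 11, 5), 300.0),
    ("carol", date(2024, 3, 1), 20.0),
    ("carol", date(2024, 1, 15), 25.0),
]
today = date(2024, 4, 1)

def rfm_table(purchases, today):
    """Return {customer: (recency_days, frequency, monetary)}."""
    table = {}
    for customer, day, amount in purchases:
        recency, frequency, monetary = table.get(customer, (None, 0, 0.0))
        days_ago = (today - day).days
        recency = days_ago if recency is None else min(recency, days_ago)
        table[customer] = (recency, frequency + 1, monetary + amount)
    return table

def rfm_score(recency, frequency, monetary):
    """Score each dimension from 1 to 3 using hypothetical cut-offs, then add them."""
    r = 3 if recency <= 30 else 2 if recency <= 90 else 1
    f = 3 if frequency >= 3 else 2 if frequency == 2 else 1
    m = 3 if monetary >= 200 else 2 if monetary >= 50 else 1
    return r + f + m

rfm = rfm_table(purchases, today)
scores = {customer: rfm_score(*values) for customer, values in rfm.items()}
```

A customer who bought recently, often, and for large amounts gets the highest score,
while a low score points to a customer who may need a win-back campaign.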
Market-based analysis:
Whenever a call starts in the telecommunication network, the details of the call
are recorded: the date and time at which it starts, its duration, and the time
when it ends. Since all the data of a call is collected in
real time, it is ready to be processed with data mining techniques. But the data should
be aggregated at the customer level, not analyzed at the level of isolated single phone calls.
Thus, by efficient extraction of data, one can find the customer calling pattern.
Some of the data that help to find the pattern are:
o Average duration of calls
o Time at which the call took place (daytime/night-time)
o Average number of calls on weekdays
o Calls made to varied area codes
o Calls made per day, etc.
By studying customer call details properly, one can grow the business.
If a customer makes more calls during daytime working hours, the customer
is likely to be part of a business firm. If the night-time call rate is high,
the line is probably used only for residential or domestic purposes.
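Moving from single call records to a customer-level calling pattern can be sketched as
follows. The record layout, the working hours (9 to 18), and the 50% cut-off for
"likely business" are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

# Made-up call detail records: (customer, start time, duration in seconds).
calls = [
    ("C1", datetime(2024, 3, 4, 10, 15), 300),
    ("C1", datetime(2024, 3, 4, 14, 40), 120),
    ("C1", datetime(2024, 3, 5, 11, 5), 180),
    ("C2", datetime(2024, 3, 4, 21, 30), 900),
    ("C2", datetime(2024, 3, 6, 22, 10), 600),
]

def customer_profiles(calls, day_start=9, day_end=18):
    """Aggregate single calls into one summary per customer."""
    grouped = defaultdict(list)
    for customer, start, duration in calls:
        grouped[customer].append((start, duration))
    profiles = {}
    for customer, records in grouped.items():
        daytime = sum(1 for start, _ in records if day_start <= start.hour < day_end)
        profiles[customer] = {
            "calls": len(records),
            "avg_duration": sum(d for _, d in records) / len(records),
            "daytime_share": daytime / len(records),
        }
    return profiles

profiles = customer_profiles(calls)
likely_business = [c for c, p in profiles.items() if p["daytime_share"] > 0.5]
```

Customer C1 calls only during working hours and is flagged as a likely business user,
while C2 calls only at night and is treated as a likely residential user.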
Data of customers:
Network Data:
Even though raw data is processed in data mining, it must be in a well-understood
and properly arranged format to be processed. In the telecommunication
industry, which deals with giant databases, this is an important need. First, conflicting
and contradictory data must be identified to avoid inconsistency. Undesired data
fields that take up space must be removed. The data must be organized and
mapped by finding the relationships between datasets to avoid redundancy.
Clustering or grouping similar data can be done by algorithms in the data mining
field. It can help in analyzing patterns such as calling patterns or customer
behavior patterns. Groups are formed by analyzing the similarities
between records. By doing this, the data becomes easier to understand, which leads to easy
manipulation and use.
Customer profiling:
Fraud detection:
Data mining provides the framework and techniques to transform these data into useful
information for data-driven decision purposes.
Treatment effectiveness:
Data mining applications can be used to assess the effectiveness of medical treatments. Data
mining can show which course of action proves effective by comparing and
contrasting causes, symptoms, and courses of treatment.
Healthcare management:
Data mining applications can be used to identify and track chronic illness states and intensive
care unit patients, decrease the number of hospital admissions, and support healthcare
management. Data mining is also used to analyze massive data sets and statistics to search for
patterns that may indicate an attack by bio-terrorists.
Customer and management interactions are crucial for any organization to achieve business
goals. Customer relationship management is the primary approach to managing interactions
between commercial organizations, normally retail businesses and banks, and their customers.
It is equally important in the healthcare context. Customer interactions may happen through call
centers, billing departments, and ambulatory care settings.
Data mining fraud and abuse applications can focus on inappropriate or wrong prescriptions and
fraudulent insurance and medical claims.