Data
Warehouse
D R . Q O TA D E H A L J AW A Z N E H
Table of Contents
Introduction
What is Data Warehouse?
What is Data Warehousing?
Need of Data Warehousing
Characteristics of Data warehouse
Architecture & Components of Data Warehouse
Data Warehousing life Cycle
How does Data Warehouse work?
Latest Tools and Technologies for Data Warehousing
A Data Warehouse is Built by combining data from multiple
diverse sources that support analytical reporting, structured
and unstructured queries, and decision making for the
organization, and Data Warehousing is a step-by-step
approach for constructing and using a Data Warehouse. Many
data scientists get their data in raw formats from various
Introduction sources of data and information. But, for many data scientists
also as business decision-makers, particularly in big enterprises,
the main sources of data and information are corporate data
warehouses. A data warehouse holds data from multiple
sources, including internal databases and Software (SaaS)
platforms. After the data is loaded, it often cleansed,
transformed, and checked for quality before it is used for
analytics reporting, data science, machine learning, or anything.
A Data Warehouse is a collection of software tools that
facilitates analysis of a large set of business data used to help
an organization make decisions. A large amount of data in data
warehouses comes from numerous sources such that internal
applications like marketing, sales, and finance; customer-facing
What is Data apps; and external partner systems, among others. It is a
centralized data repository for analysts that can be queried
Warehouse? whenever required for business benefits. A data warehouse is
mainly a data management system that’s designed to enable
and support business intelligence (BI) activities, particularly
analytics. Data warehouses are alleged to perform queries,
cleaning, manipulating, transforming and analyzing the data
and they also contain large amounts of historical data.
The process of creating data warehouses to store a large
amount of data is named Data Warehousing. Data
Warehousing helps to improve the speed and efficiency of
accessing different data sets and makes it easier for company
decision-makers to obtain insights that will help the business
What is Data and promoting marketing tactics that set them aside from their
Warehousing? competitors. We can say that it is a blend of technologies and
components which aids the strategic use of data and
information. The main goal of data warehousing is to create a
hoarded wealth of historical data that can be retrieved and
analyzed to supply helpful insight into the organization’s
operations.
Need of Data
Warehousing.
Data Warehousing is a progressively essential
tool for business intelligence. It allows
organizations to make quality business
decisions. The data warehouse benefits by
improving data analytics, it also helps to gain
considerable revenue and the strength to
compete more strategically in the market. By
efficiently providing systematic, contextual
data to the business intelligence tool of an
organization, the data warehouses can find out
more practical business strategies.
[Link] User: Business users or customers need a data warehouse to look at
summarized data from the past.
Since these people are coming from a non-technical background also, the data
may be represented to them in an uncomplicated way.
[Link] consistency: Data warehouses are programmed in such a way that
they can be applied in a regular format
to all collected data from different sources, which makes it effortless
for company decision-makers to analyze and share data insights with their
colleagues around the globe. By standardizing the data, the risk of error
Need of Data
in interpretation is also reduced and improves overall accuracy.
[Link] historical data: Data
Warehousing
Warehouses are also used to store historical data that means, the time variable
data from the past and this input can be used for various purposes.
[Link] strategic decisions: Data warehouses contribute to making better
strategic decisions. Some business
strategies may be depending upon the data stored within the data
warehouses.
[Link] response time: Data warehouse has got to be prepared for somewhat
sudden masses and type of queries that
demands a major degree of flexibility and fast latency.
1. Subject Oriented: A data warehouse is often subject-
oriented because it delivers may be achieved on a
particular theme which means the data warehousing
process is proposed to handle a particular theme that is
more defined. These themes are often sales, distribution,
selling. etc.
Characteristics 2. Time-Variant: When the data is maintained via totally
different intervals of time like weekly, monthly,
of Data or annually, etc. It founds numerous time limits that are
unit structured between the big datasets and are
warehouse: command within the online transaction method (OLTP).
The time limits for the data warehouse are extended than
that of operational systems. The data resided within the
data warehouse is predetermined with a particular
interval of time and delivers information from the
historical perspective. It contains parts of time directly or
indirectly.
3. Non-volatile: The data residing in the data warehouse is permanent
and defined by its names. It additionally means that the data in the
data warehouse is cannot be erased or deleted or also when new data
is inserted into it. In the data warehouse, data is read-only and can
only be refreshed at a particular interval of time. Operations such as
delete, update and insert that is done in a software application over
data is lost in the data warehouse environment. There are only two
Characteristics
types of data operations that can be done in the data
warehouse:
• Data Loading
of Data • Data Access
warehouse:
4. Integrated: A data warehouse is created by integrating data from
numerous different sources such that from mainframe computers and
a relational database. Additionally, it should also have reliable naming
conventions, formats, and codes. Integration of data warehouse
benefits in the successful analysis of data. Dependability in naming
conventions, column scaling, encoding structure, etc. needs to be
confirmed. Integration of data warehouse handles numerous subject-
oriented warehouses.
Data warehouse architecture defines the comprehensive
architecture of data processing and presentation that will be
useful for data analysis and decision making within the
enterprise and organization. Each organization has different
Architecture data warehouses depending upon their need, but all of them
are characterized by some standard components.
& Data Warehouse applications are designed to support the
user’s data requirements, an example of this is online
Components analytical processing (OLAP). These include functions such as
forecasting, profiling, summary reporting, and trend analysis.
of Data The architecture of the data warehouse mainly consists of
the proper arrangement of its elements, to build an efficient
Warehouse: data warehouse with software and hardware components.
The elements and components may vary based on the
requirement of organizations. All of these depend on the
organization’s circumstances.
In the Data Warehouse, the source data comes from different places. They are
group into four categories:
• External Data: For data gathering, most of the executives and data analysts rely
on information coming from external sources for a numerous amount of the
information they use. They use statistical features associated with their
organization that is brought out by some external sources and department.
1. Source • Internal Data: In every organization, the consumer keeps their “private”
spreadsheets, reports, client profiles, and generally even department databases.
Data
This is often the interior information, a part that might be helpful in every data
warehouse.
• Operational System data: Operational systems are principally meant to run the
Component: business. In each operation system, we periodically take the old data and store it
in achieved files.
• Flat files: A flat file is nothing but a text database that stores data in a plain text
format. Flat files generally are text files that have all data processing and
structure markup removed. A flat file contains a table with a single record per
line.
After the data is extracted from various sources, now
it’s time to prepare the data files for storing in the data
warehouse. The extracted data collected from various
2. Data sources must be transformed and made ready in a
format that is suitable to be saved in the data
Staging: warehouse for querying and analysis. The data staging
contains three primary functions
that take place in this part:
• Data Extraction: This stage handles various data sources. Data analysts should
employ suitable techniques for every data source.
• Data Transformation: As we all know, information for a knowledge warehouse
comes from many alternative sources. If information extraction for a data
warehouse posture huge challenges, information transformation gifts even
important challenges. We tend to perform many individual tasks as a part of
information transformation. First, we tend to clean the info extracted from every
2. Data source of data. Standardization of information elements forms an outsized part
of data transformation. Data transformation contains several kinds of combining
items of information from totally different sources. Information transformation
Staging:
additionally contains purging supply information that’s not helpful and separating
outsourced records into new mixtures. Once the data transformation performs
ends, we’ve got a set of integrated information that’s clean, standardized, and
summarized.
• Data Loading: When we complete the structure and construction of the data
warehouse and go live for the first time, we do the initial loading of the data into
the data warehouse storage. The initial load moves high volumes of data
consuming a considerable amount of time.
Data storage for data warehousing is split into multiple repositories.
These data repositories contain structured data in a very highly
normalized form for fast and efficient processing.
• Metadata: Metadata means data about data i.e. it summarizes basic
details regarding data, creating findings & operating with explicit
instances of data. Metadata is generated by an additional correction
3. Data or automatically and can contain basic information about data.
• Raw Data: Raw data is a set of data and information that has not yet
Storage in been processed and was delivered from a particular data entity to the
data supplier and hasn’t been processed nonetheless by machine or
human. This data is gathered out from online sources to deliver deep
Warehouse: insight into users’ online behavior.
• Summary Data or Data summary: Data summary is an easy term for a
brief conclusion of an enormous theory or a paragraph. This is often
one thing where analysts write the code and in the end, they declare
the ultimate end in the form of summarizing data. Data summary is
the most essential thing in data mining and processing.
Data marts are also the part of storage component in a
data warehouse. It can store the information of a
4. Data specific function of an organization that is handled by a
single authority. There may be any number of data
Marts: marts in a particular organization depending upon the
functions. In short, data marts contain subsets of the
data stored in data warehouses.
Data
Warehousing
life Cycle
As we know the data warehouse is
made by combining data from
multiple diverse sources and the
tools that support analytical
reporting, structured and
unstructured queries, and decision
making for the organization. We
need to follow the step by step
approach for building and
successfully implementing the
Data Warehouse
A Data Warehouse is like a central depository where data comes
from different data sources. In a data warehouse, the data flows
from the transactional system and relational databases. A data
warehouse timely pulls out the data from various apps and
systems, after then, the data goes through various processing and
How does formatting and makes the data in a format that matches the data
already in the warehouse. This processed data is stored in the data
Data warehouses that ready for further analysis for decision making.
The data formatting and processing depends upon the need of the
Warehouse organization.
the data is processed and transformed so that users and analysts
work? can access the processed data in the Data Warehouse through
Business Intelligence tools, SQL clients, and spreadsheets. A data
warehouse merges all information coming from various sources
into one global and complete database. By merging all of this
information in one place, it becomes easier for an organization to
analyze its customers more comprehensively.
Data warehousing had improved the access to information,
reduced query-response time, and also allows businesses to
get deep insights from huge big data. Earlier, companies had to
build lots of infrastructure for data warehousing. But today the
cloud technology has remarkably reduced the cost and effort
Latest Tools of data warehousing for businesses.
and The field of data warehousing is most emerging and there
various cloud data warehousing tools and technologies are
Technologies developed for better decision making. The cloud-based data
warehousing tools are fast, highly scalable, and available on a
for Data pay-per-use basis. Following are some data warehousing tools:
Warehousing: Amazon Redshift ,Microsoft Azure , Google BigQuery
,Snowflake, Micro Focus Vertica, Teradata, Amazon DynamoDB
,PostgreSQL, Amazon RD and Amazon S3
All these are the top 10 Data Warehousing Tools
Subject-
Oriented
A data warehouse target on
the modeling and analysis of
data for decision-makers.
Therefore, data warehouses
typically provide a concise and
straightforward view around a
particular subject, such as
customer, product, or sales,
instead of the global
organization's ongoing
operations
Integrated
A data warehouse integrates
various heterogeneous data
sources like RDBMS, flat files,
and online transaction records.
Time-Variant
Historical information is kept in
a data warehouse. For
example, one can retrieve files
from 3 months, 6 months, 12
months, or even previous data
from a data warehouse. These
variations with a transactions
system, where often only the
most current file is kept.
Non-Volatile
The data warehouse is a physically
separate data storage, which is
transformed from the source operational
RDBMS. The operational updates of data
do not occur in the data warehouse, i.e.,
update, insert, and delete operations are
not performed. It usually requires only two
procedures in data accessing: Initial
loading of data and access to data.
Therefore, the DW does not require
transaction processing, recovery, and
concurrency capabilities, which allows for
substantial speedup of data retrieval. Non-
Volatile defines that once entered into the
warehouse, and data should not change.
Difference between OLTP and OLAP
OLTP (On-Line Transaction Processing) is featured by a large number of short on-line transactions
(INSERT, UPDATE, and DELETE). The primary significance of OLTP operations is put on very rapid query
processing, maintaining record integrity in multi-access environments, and effectiveness consistent by
the number of transactions per second.
OLAP (On-line Analytical Processing) is represented by a relatively low volume of transactions.
Queries are very difficult and involve aggregations. For OLAP operations, response time is an
effectiveness measure. OLAP applications are generally used by Data Mining techniques. In OLAP
database there is aggregated, historical information, stored in multi-dimensional schemas (generally
star schema).
OLAP Operations in the Multidimensional
Data Model
In the multidimensional model, the records are organized into various dimensions, and each
dimension includes multiple levels of abstraction described by concept hierarchies. This organization
support users with the flexibility to view data from various perspectives. A number of OLAP data cube
operation exist to demonstrate these different views, allowing interactive queries and search of the
record at hand. Hence, OLAP supports a user-friendly environment for interactive data analysis.
Example
CO NSIDER THE FOL LOWI NG CUBES IL LUST RATI NG T EM PE RAT URE O F
CE RTAI N DAYS RE CORDE D WE EK LY:
Consider that we want to set up
levels (hot (80-85), mild (70-75), cool
(64-69)) in temperature from the
above cubes.
To do this, we have to group column The roll-up operation groups the information by levels of
and add up the value according to
temperature.
the concept hierarchies. This
operation is known as a roll-up.
By doing this, we contain the
following cube:
The following
diagram
illustrates how
roll-up works
What is Data
Cube?
When data is grouped or
combined in multidimensional
matrices called Data Cubes. The
data cube method has a few
alternative names or a few
variants, such as
"Multidimensional databases,"
"materialized views," and
"OLAP (On-Line Analytical
Processing)."
Example
In the 2-D representation, we
will look at the All Electronics
sales data for items sold per
quarter in the city of Vancouver.
The measured display in dollars
sold (in thousands).
3-Dimensional
Cuboids
Let suppose we would like to view
the sales data with a third dimension.
For example, suppose we would like
to view the data according to time,
item as well as the location for the
cities Chicago, New York, Toronto,
and Vancouver. The measured display
in dollars sold (in thousands). These
3-D data are shown in the table. The
3-D data of the table are represented
as a series of 2-D tables.
3-D data cubes
Conceptually, we may
represent the same data in the
form of 3-D data cubes, as
shown in fig:
Let us suppose that we would like to
view our sales data with an additional
fourth dimension, such as a supplier.
In data warehousing, the data cubes are
n-dimensional. The cuboid which holds
the lowest level of summarization is
called a base cuboid.
For example, the 4-D cuboid in the
figure is the base cuboid for the given
time, item, location, and supplier
dimensions.
The lattice of cuboid forms a
data cube. The figure shows
the lattice of cuboids creating
4-D data cubes for the
dimension time, item, location,
and supplier. Each cuboid
represents a different degree of
summarization.