UNIT II DATA WAREHOUSING
Definition of Data ware house:
It may be understood as Repository of all of the organization’s data (or simply data) in a pattern that can
be analysed so that it can be arranged for administration and other information workers.
A data warehouse is a central repository of information that can be analyzed to make more informed
decisions. Data flows into a data warehouse from transactional systems, relational databases, and other
sources.
It can be described as any centralized data repository which can be queried for business benefits. It is a
database that stores information oriented to satisfy decision-making requests. It is a group of decision
support technologies, targets to enabling the knowledge worker (executive, manager, and analyst) to
make superior and higher decisions.
It is designed for query and analysis rather than transaction processing. It includes historical data derived
from transaction data from single and multiple sources. It provides integrated, enterprise-wide, historical
data and focuses on providing support for decision-makers for data modeling and analysis.
A data warehouse is the foremost repository for the data available for developing business intelligence
architectures and decision support systems. The term data warehousing indicates the whole set of
interrelated activities involved in designing, implementing and using a data warehouse.
However, some challenges faced are:
Data should be acquired from a variety of incompatible systems.
The identical piece of data might reside in the databases of distinct systems in distinct types. A
specific data item might not only be represented in distinct formats, but the values of this Data
piece might be distinct in distinct databases. Which value is the correct one?
Data is continually altering. How often should the Data warehouse be revised to contemplate a
sensibly current view?
The amount of Data is massive. How is it analysed and presented easily so that it is useful?
To meet these needs, a broad range of powerful tools were developed over the years and became
productized. They included:
_ Extract, Transform, and Load (ETL) utilities for the moving of data from the diverse data sources to the
common data warehouse.
_ Data-mining pushes for complex predetermined analysis and ad hoc queries of the Data retained in the
Data warehouse.
_ Reporting tools to provide management employees with the outcomes of the analysis in very simple to
absorb formats.
It is possible to identify three main categories of data feeding into a data warehouse: internal data,
external data and personal data.
Internal data: Internal data are stored for the most part in the databases, referred to as transactional
systems or operational systems, that are the backbone of an enterprise information system.
Internal data are gathered through transactional applications that routinely preside over the operations
of a company, such as administration, accounting, production and logistics. This collection of trans-
actional software applications is termed enterprise resource planning (ERP). The data stored in the
operational systems usually deal with the main entities involved in a company processes, namely
customers, products, sales, employees and suppliers.
These data usually come from different components of the information system:
• back-office systems, that collect basic transactional records such as orders, invoices, inventories,
production and logistics data;
• front-office systems, that contain data originating from call-center activities, customer assistance,
execution of marketing campaigns;
• web-based systems, that gather sales transactions on e-commerce websites, visits to websites, data
available on forms filled out by existing and prospective customers.
External data: There are several sources of external data that may be used to extend information. For
example, some agencies gather and make available data relative to sales, market share and future trend
predictions for specific business industries, as well as economic and financial indicators.
Other agencies provide data market surveys and consumer opinions collected through questionnaires.
Geographic Information Systems (GIS), which represent a set of applications for acquiring, organizing,
storing and presenting territorial data. These contain information relative to entities having a specific
geographic position.
Personal data: In most cases, decision makers performing a business intelligence analysis also rely on
information and personal assessments stored inside worksheets or local databases located in their
computers.
Definition of Data Mart:
Data marts are systems that gather all the data required by a specific company department, such as
marketing or logistics, for the purpose of performing business intelligence analyses and executing decision
support applications specific to the function itself. Therefore, a data mart can be considered as a
functional or departmental data warehouse of a smaller size and a more specific type than the overall
company data warehouse.
A data mart is a special type of a data warehouse. It is focused on a single subject (or functional area),
such as Sales, Finance, or Marketing. Whereas data warehouses have an enterprise-wide depth, the
information in data marts pertains to a single department. The primary use for a data mart is Business
Intelligence (BI) applications.
A data mart therefore contains a subset of the data stored in the company data warehouse, which are
usually integrated with other data that the company department responsible for the data mart owns and
deems of interest. For example, a marketing data mart will contain data extracted from the central data
warehouse, such as information on customers and sales transactions, but also additional data pertaining
to the marketing function, such as the results of marketing campaigns run in the past. Data warehouses
and data marts thus share the same technological frame-work.
Data quality
The need to verify, preserve and improve the quality of data is a constant concern of those responsible
for the design and updating of a data warehouse. Data in a warehouse should have the following qualities:
Accuracy. To be useful for subsequent analyses, data must be highly accurate. For instance, it is necessary
to verify that names and encodings are correctly represented and values are within admissible ranges.
Completeness. In order to avoid compromising the accuracy of business intelligence analyses, data should
not include a large number of missing values.
Consistency. The form and content of the data must be consistent across the different data sources after
the integration procedures, with respect to currency and measurement units.
Timeliness. Data must be frequently updated, based on the objectives of the analysis. It is customary to
arrange an update of the data warehouse regularly on a daily or at most weekly basis.
Non-redundancy. Data repetition and redundancy should be avoided in order to prevent waste of
memory and possible inconsistencies.
Relevance. Data must be relevant to the needs of the business intelligence system in order to add real
value to the analyses that will be subsequently performed.
Interpretability. The meaning of the data should be well understood and correctly interpreted by the
analysts.
Accessibility. Data must be easily accessible by analysts and decision support applications
Data Warehouse Architecture:
Basic architecture and function of a data ware house is as follows:
Major components includes:
The data warehouse itself, together with additional data marts, that contains the data and the functions
that allow the data to be accessed, visualized and modified.
Data acquisition applications, also known as extract, transform and load (ETL) or back-end tools, which
allow the data to be extracted, transformed and loaded into the data warehouse.
Business intelligence and decision support applications, which represent the front-end and allow the
knowledge workers to carry out the analyses and visualize the results.
The three-level distinction applies to the architecture shown in above figure even from a technological
perspective.
• The level of the data sources and the related ETL tools that are usually installed on one or more servers.
• The level of the data warehouse and any data mart, possibly available on one or more servers as well,
and separated from those containing the data sources. This second level also includes the metadata
documenting the origin and meaning of the records stored in the data warehouse.
• The level of the analyses that increase the value of the information contained in a data warehouse
through query, reporting and possibly sophisticated decision support tools. The applications for business
intelligence and decision support analysis are usually found on separate servers or directly on the client
PC used by analysts and knowledge workers.
A data warehouse may be implemented according to different design approaches: top-down, bottom-up
and mixed.
Top-down. The top-down methodology is based on the overall design of the data warehouse, and is
therefore more systematic. However, it implies longer development times and higher risks of not being
completed within schedule since the whole data warehouse is actually being developed.
Bottom-up. The bottom-up method is based on the use of prototypes and therefore system extensions
are made according to a step-by-step scheme. This approach is usually quicker, provides more tangible
results but lacks an overall vision of the entire system to be developed.
Mixed.
The mixed methodology is based on the overall design of the data warehouse, but then proceeds with a
prototyping approach, by sequentially implementing different parts of the entire system. This approach
is highly practical and usually preferable, since it allows small and controlled steps to be taken while
bearing in mind the whole picture.
Extract, Transform and Load (ETL)
these utilities extract data from source databases, change/transform them into the widespread data
warehouse format, and load them into the data warehouse
The transform function is the key to the achievement of this approach. Its job is to request a series of rules
to extracted data so that it is properly formatted for loading into the data warehouse.
Examples:
The selection of data to load.
The translation of encoded items (for example, 1 for male, 2 for female to M, F).
Deriving new calculated values (sale price = price - discount).
Merging data from multiple sources.
Summarizing (aggregating) certain rows and columns.
Splitting a column into multiple columns.
Resolving discrepancies between similar data items.
Validating the data.
ETL model :
The ETL process is performed occasionally, such as daily, weekly, or monthly, depending upon the
enterprise needs. This method is called offline ETL because the key database is not relentlessly updated.
It is revised on a periodic batch basis.
The ETL utilities make data collection from numerous diverse systems practical. Then, the data needs to
be converted into useful information. Some key points to remember:
_ Data are easily facts, figures, and text that can be processed by a computer.
Example: A transaction at retail point-of-sale is data.
_ Information is processed data.
For example, analysis of point-of-sale transactions yields information of consumer buying behaviour.
_ Knowledge represents a pattern that connects information and usually presents a high grade of
predictability as to what is recounted or what will happen next.
Example: An example of knowledge is the prediction of promotional efforts on sales of particular items
based on buyers’ buying behaviour.
Data-Mining Engines
Useful data-mining engines were evolved to support complex analysis and ad hoc queries on a data
warehouse’s database. Data mining looks for patterns among hundreds of seemingly unrelated fields in a
large database, patterns that recognize earlier unknown trends. These trends play a key role in strategic
decision making because they disclose localities for process enhancement(ORACLE)
Reporting Tools
The knowledge created by a data-mining engine is not very useful unless it is presented easily and clearly
to those who need it. There are many formats for reporting information and knowledge results. One of
the common techniques for displaying information is the digital dashboard.
Metadata
In order to document the meaning of the data contained in a data warehouse, it is recommended to set
up a specific information structure, known as metadata, i.e. data describing data. The metadata indicate
for each attribute of a data warehouse the original source of the data, their meaning and the
transformations to which they have been subjected. The documentation provided by metadata should be
constantly kept up to date, in order to reflect any modification in the data warehouse structure. The
documentation should be directly accessible to the data warehouse users.
OLAP and OLTP:
Online Analytical Processing consists of a type of software tools that are used for data analysis for business
decisions. OLAP provides an environment to get insights from the database retrieved from multiple
database systems at one time. Examples – Any type of Data warehouse system is an OLAP system.
An OLTP(Online Transactional Processing) system captures and maintains transaction data in a database.
Each transaction involves individual database records made up of multiple fields or columns. Examples
include banking and credit card activity or retail checkout scanning.
The data from one or more OLTP databases is ingested into OLAP systems through extract, transform,
load (ETL)
Cubes and multidimensional analysis
The multidimensional representation is based on a star schema which contains two types of data tables:
dimension tables and fact tables.
Dimension Tables: In general, dimensions are associated with the entities around which the processes of
an organization revolve. Dimension tables then correspond to primary entities contained in the data
warehouse, and in most cases, they directly derive from master tables stored in OLTP systems, such as
customers, products, sales, locations and time.
Each dimension table is often internally structured according to hierarchical relationships.
For example, the temporal dimension is usually based upon two major hierarchies: {day, week,
year} and {day, month, quarter, year}
Similarly, the location dimension may be hierarchically organized as {street, zip code, city, province,
region, country, area}
Products in their turn have hierarchical structures such as {item, family, type} in the manufacturing
industry and {item, category, department} in the retail industry.
Fact Tables: Fact tables usually refer to transactions and contain two types of
data:
• links to dimension tables, that are required to properly reference the information contained in each fact
table;
• numerical values of the attributes that characterize the corresponding transactions and that represent
the actual target of the subsequent OLAP analyses.
For example, a fact table may contain sales transactions and make reference to several dimension tables,
such as customers, points of sale, products, suppliers, time. The corresponding measures of interest are
attributes such as quantity of items sold, unit price and discount.
In this example the fact table allows analysts to evaluate the trends of sales over time, either total, or
referred to a single customer, or referred to a group of customers, that can be identified through
any hierarchy induced by the dimension table associated with the customers.
The analyst may also evaluate the trend over time of sales percentages relative to customers located in a
specific region.
STAR SCHEMA-
The fact table is placed in the middle of the schema and is linked to the dimension tables through
appropriate references. It uses a single large fact table to store transactional or measured data, and one
or more smaller dimensional tables that store attributes about the data.
Example –
SNOWFLAKE SCHEMA-
Sometimes dimension tables are connected in their turn to other dimension tables, as shown
In the given example the dimension table referring to the location is in turn hierarchically connected with
the dimension table containing geographical information. This brings about a snowflake
schema.
GALAXY SCHEMA-
A data warehouse includes several fact tables, interconnected with dimension tables, linked in their turn
with other dimension tables. This type of schema is termed a galaxy schema as shown-
(In the above example tables that have name written above it are fact tables others are dimension tables)
(time in shipment fact table may also point to time dimension table)
A fact table connected with n dimension tables may be represented by a n-dimensional data cube where
each axis corresponds to a dimension. Multidimensional cubes are a natural extension of the popular two-
dimensional spreadsheets, which can be interpreted as two-dimensional cubes.
Cube Example:
For instance, consider a sales fact table developed along the three dimensions of {time, product, region}.
Suppose we select only two dimensions for the analysis, such as{time, product}, having preset the region
attribute along the three values {USA, Asia, Europa}.
In this way we obtain the three two-dimensional tables in which the rows correspond to quarters of a
year and the columns to products as follows:
Table1-
Table 2-
Table 3-
The cube shown in Figure below is a three-dimensional illustration of the same sales fact table
Atomic data are represented by 36 cells that can be obtained by crossing all possible values along the
three dimensions: time{Q1, Q2, Q3, Q4},region{USA, Asia, Europa} and product {TV, PC, DVD}
These atomic cells can be supplemented by 44 cells corresponding to the summary values obtained
through consolidation along one or more dimensions
four-dimensional space cannot be represented graphically. However, we can obtain four logical views
composed of three-dimensional cubes, called cuboids
Starting from a fact table linked to n dimension tables, it is possible to obtain a lattice of cuboids, each of
them corresponding to a different consolidation along one or more dimensions.
Figure below illustrates the lattice composed by the cuboids obtained from the data cube defined along
the four dimensions {time, product, region, supplier}
The cuboid associated with the atomic data, which therefore does not imply any type of consolidation, is
called a base cuboid. At the other extreme, the apex cuboid is defined as the cuboid corresponding to the
consolidation along all dimensions, therefore associated with the grand total of the measure of
interest.