Data Warehouse
Chapter IV
Chapter Objectives:
• Study each component or building block that makes up a data warehouse
Major components
• Source data component
• Data staging component
• Information delivery component
• Metadata component
• Management and control component
Components of Data warehouse
Reporting and Analysis
Master Data
Management
Data Data
Cleansing Models
Data Sources
Data
Warehouse
ETL
Source Data Components
Source data can be grouped into 4 components
• Production data
Comes from operational systems of enterprise some segments are selected from its narrow
scope, e.g. order details
• Internal data
Private datasheet, documents, customer profiles etc. e.g. Customer profiles for specific
offering Special strategies to transform ‘it’ to DW (text document)
• Archived data
Old data is archived DW have snapshots of historical data
• External data
Executives depend upon external sources E.g. market data of competitors, car rental
require new manufacturing.
Data Staging Component
After data is extracted, data is to be prepared, Data extracted from sources needs to be changed, converted and made
ready in suitable format
Three major functions to make data ready
• Extract
• Transform
• Load
Staging area provides a place and area with a set of functions to
• Clean
• Change
• Combine
• Convert
ETL Cycle
The process of reading The process of transforming the
data from different extracted data from its original The process of writing
sources. state into a consistent state so that the data into the target
it can be placed into another source.
database.
www data MIS Systems
Data Warehouse
(Acct, HR) TRANSFORM CLEANSE
Legacy
Systems
Archived data EXTRACT LOAD
Other indigenous applications
(COBOL, VB, C++, Java)
OLAP
Data Extraction
• This function must deal with numerous data sources.
• You must employ the appropriate technique for each data source.
• Source data may be from different source machines in diverse data formats.
• Part of the source data may be in relational database systems; some data may be on other
legacy network and hierarchical data models. Many data sources may still be in flat files.
• Data extraction may become quite complex. Tools are available on the market for data
extraction.
Data Extraction
• You may want to consider using outside tools suitable for certain data sources. For the
other data sources, you may want to develop in-house programs to do the data
extraction. Purchasing outside tools may entail high initial costs. In-house programs,
on the other hand, may mean ongoing costs for development and maintenance.
• Teams extract the source into a separate physical environment from which moving the
data into the data warehouse would be easier.
• In the separate environment, you may extract the source data into a group of flat files,
or a data-staging relational database, or a combination of both.
Data Extraction
A very complex task due to number of reasons:
• Very complex and poorly documented source system.
• Data must be extracted not once, but number of times.
The process design is dependent on:
• Which extraction method to choose?
• How to make available extracted data for further processing?
Types of Data Extraction
Logical Extraction
• Full Extraction
• Incremental Extraction
Physical Extraction
• Online Extraction
• Offline Extraction
Logical Data Extraction
Full Extraction
• The data extracted completely from the source system.
• No need to keep track of changes.
• Source data made available as-is with any additional information.
Incremental Extraction
• Data extracted after a well-defined point/event in time.
• Mechanism used to reflect/record the temporal changes in data (column or table).
• Can have significant performance impacts on the data warehouse server.
Physical Data Extraction
Online Extraction
• Data extracted directly from the source system.
• May access source tables through an intermediate system.
Offline Extraction
• Data NOT extracted directly from the source system, instead staged explicitly outside the original source
system.
• Data is either already structured or was created by an extraction routine.
• Some of the prevalent structures are:
- Flat files - Redo and archive logs
- Dump files - Transportable table-spaces
Data Transformation
• Data for a data warehouse comes from many disparate sources.
• The data feed is not just an initial load. You will have to continue to pick up the ongoing changes from
the source systems.
• Any transformation tasks you set up for the initial load will be adapted for the ongoing revisions as
well.
You perform several individual tasks as part of data transformation.:
• First, you clean the data extracted from each source.
• Cleaning may just be correction of misspellings,
• or may include resolution of conflicts between state codes and zip codes in the source data,
• or may deal with providing default values for missing data elements,
• or elimination of duplicates when you bring in the same data from multiple source systems.
Data Transformation
Semantic standardization is another major task.
• Standardization of data elements forms a large part of data transformation.
• You standardize the data types and field lengths for same data elements retrieved from the various
sources.
• You resolve synonyms and homonyms. When two or more terms from different source systems
mean the same thing, you resolve the synonyms. When a single term means many different things in
different source systems, you resolve the homonym.
Data transformation involves many forms of combining pieces of data
• You combine data from a single source record or related data elements from many source records.
• On the other hand, data transformation also involves purging source data that is not useful and
separating outsource records into new combinations.
Data Transformation
• Sorting and merging of data takes place on a large scale in the data staging area.
• Creating Keys:
• In many cases, the keys chosen for the operational systems are field values with built-in meanings.
For example, the product key value may be a combination of characters indicating the product
category, the code of the warehouse where the product is stored, and some code to show the
production batch. Primary keys in the data warehouse cannot have built-in meanings.
• Data transformation also includes the assignment of surrogate keys derived from the source
system primary keys.
Data Transformation Examples
• Convert common data elements into a consistent form i.e.
name and address.
First-Family-title Muhammad Ibrahim Contractor
Family-title-comma-first Ibrahim Contractor, Muhammad
Family-comma-first-title Ibrahim, Muhammad Contractor
• Translation of dissimilar codes into a standard code.
F/NO-2
F-2
[Link].2
Natl. ID NID FL.2 FLAT No. 2
National ID NID FL/NO.2
FL-2
FLAT-2
FLAT#
FLAT,2
FLAT-NO-2
FL-NO.2
Data Transformation Examples
• Data elements are mapped from source tables and files to
destination fact and dimension tables.
First Name: HAJI MUHAMMAD
HAJI MUHAMMAD IBRAHIM, GOVT. CONT. Family Name: IBRAHIM
K. S. ABDULLAH & BROTHERS, MAMOOJI Title: GOVT. CONT.
ROAD, ABDULLAH MANZIL RAWALPINDI, Ph Firm: K. S. ABDULLAH & BROTHERS
67855 Location: ABDULLAH MANZIL MAMOOJI
Road: ROAD
Phone: 051-67855
City: RAWALPINDI
Code: 46200
• Default values are used in the absence of source data.
• Fields are added for unique keys and time elements.
Data Transformation Examples
• Data representation change
• EBCIDIC to ASCII
• Operating System Change
• Mainframe (MVS) to UNIX
• UNIX to NT or XP
• Data type change
• Program (Excel to Access), database format (FoxPro to Access).
• Character, numeric and date type.
• Fixed and variable length.
Data Loading
• Two distinct groups of tasks form the data loading function.
• The initial loading of the data into the data warehouse storage.
• The initial load moves large volumes of data using up substantial amounts of time.
• As the data warehouse starts functioning:
• Extract the changes to the source data,
• Transform the data revisions
• Feed the incremental data revisions on an ongoing basis.
Data Storage Components
• Separate repository
• Data structured for efficient processing
• Redundancy is increased
• Updated after specific periods
• Only read-only
Information Delivery Components
• Authentication issues
• Active monitoring services
• Performance, DBA note selected aggregates to change storage
• User performance
• Aggregate awareness
• E.g. mining, OLAP etc