CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
Table of Contents
1. NEED FOR DATA WAREHOUSE ................................................................................................................. 2
• When to use a Data Warehouse? ..................................................................................................................... 2
o Lack of Information Sharing. ................................................................................................................... 2
o Lack of Information Credibility. .............................................................................................................. 2
o Reports Take a Longer Time to be Prepared. ........................................................................................ 2
o Little or no Scope for ad hoc querying/queries that require Historical Data. ..................................... 2
2. DEFINITION OF DATA WAREHOUSE ....................................................................................................... 2
• Subject Oriented. ........................................................................................................................................... 2
• Integrated ....................................................................................................................................................... 2
• Time-Variant.................................................................................................................................................. 3
• Non-Volatile ................................................................................................................................................... 3
3. WHAT IS DATA MART? ................................................................................................................................ 3
4. WHAT IS THEN AN ODS? (Operational Data Source) ................................................................................. 4
5. RALPH KIMBAL’S APPROACH v/s WH INMON’S APPROACH .......................................................... 4
• According to Ralph Kimball (Botton-Up Approach) ................................................................................... 4
• According to WH BILL Inmon (Top-Down Approach) .............................................................................. 5
6. GOALS OF A DATA WAREHOUSE ............................................................................................................. 5
• Information Accessibility. ............................................................................................................................. 5
• Information Credibility................................................................................................................................. 6
• Flexible to Changes........................................................................................................................................ 6
• Support for More fact-based Decision Making........................................................................................... 6
• Support for the Data Security ...................................................................................................................... 6
• Information Consistency ............................................................................................................................... 6
7. WHAT CONSTITUES A DATA WAREHOUSE (Four Parts) ..................................................................... 6
• Operational Source System........................................................................................................................... 7
• Data Staging Area.......................................................................................................................................... 7
• Data Presentation Area ................................................................................................................................. 7
• Data Access Tools .......................................................................................................................................... 8
• Data Sources................................................................................................................................................... 8
8. ETL (Extract-Transform-Load) .......................................................................................................................... 8
• Stages of ETL ................................................................................................................................................. 9
• Data Mapping ................................................................................................................................................ 9
• Data Staging ................................................................................................................................................... 9
• Data Extraction............................................................................................................................................ 10
• Data Transformation................................................................................................................................... 10
• Data Loading................................................................................................................................................ 10
1|Page
CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
1. NEED FOR DATA WAREHOUSE
• When to use a Data Warehouse?
If the organization is facing any of the following issues, then we can use Data
Warehouse.
o Lack of Information Sharing.
o Lack of Information Credibility.
o Reports Take a Longer Time to be Prepared.
o Little or no Scope for ad hoc querying/queries that require Historical
Data.
2. DEFINITION OF DATA WAREHOUSE
• According to William H Inmon “A Data Warehouse is a subject-oriented,
integrated, time-variant, non-volatile collection of data in support of
management’s decision-making process.”
• Subject Oriented.
o A data warehouse collects data of subjects such as
▪ "customers",
▪ "suppliers"
▪ "partners",
▪ "sales",
▪ "products", etc.
spread across the enterprise or organization.
o A data mart on the other hand deals with the analysis of a particular
subject such as “sale”.
• Integrated
o A typical enterprise will have a multitude of enterprise applications.
o It is unlikely that these applications are on heterogeneous technology
platforms.
2|Page
CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
o It is also not-likely that these applications use varied databases to house
their data.
o Few of the applications may exist in silos.
Few others may be sharing a little information between them.
o A data warehouse will serve to bring together the data from these multiple
disparate (meaning differing in the format and content of data) sources
after careful cleansing and transformation into a unified format to serve
the information needs of the enterprise.
• Time-Variant
o A data warehouse keeps historical data while an OLTP (On-Line
Transaction Processing) system will usually have the most up-to-date data.
o From a data warehouse, one can retrieve data that is 3 months, 6 months,
12 months, or even older.
o For example, a transactional system may hold the most recent hold address
of a customer, whereas a data warehouse can addresses associated with a
customer recorded, say, over the last five years.
• Non-Volatile
o We have learnt earlier that transaction processing, recovery, and
concurrency control mechanisms are usually associated with OLTP
systems.
o A data warehouse is a separate physical store of data transformed from the
application data found in the operational environment.
3. WHAT IS DATA MART?
• A data mart is meant to provide single domain data aggregation that can then
be used for analysis, reporting, and/or decision support.
• Data marts can be sourced from the enterprise-wide data warehouse or can also
be sourced directly from the operational/transactional systems.
• These data marts can also perform transformations and calculations on the data
housed within.
• When compared to the data warehouse,
o data marts are restricted in their scope and business purpose.
• Is it a good idea to go for a data mart for virtually every business process/event?
o The answer is "No".
o This could result in several disparate and independent data marts.
o Chances are that it will become a challenge to ensure the single version of
truth.
• Furthermore, data marts can be either independent or dependent.
o Independent data marts are sourced directly from one or more operational
systems, or can be sourced from external information providers, or can be
3|Page
CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
sourced from data generated locally from within a department or unit or
function.
o Dependent data marts, on the other hand, are sourced from enterprise data
warehouses.
4. WHAT IS THEN AN ODS? (Operational Data Source)
• It is similar to a data warehouse in that several systems around the enterprise
feed operational information to it.
• The ODS processes this operational data to provide a homogeneous, unified
view which can then be utilized by analysts and report-writers alike for analysis
and reporting.
• An ODS differs from an enterprise data warehouse in that it is not meant to
store and maintain vast amounts of historical information.
• An ODS is meant to hold current or very recent operational data.
• Why is this required? Sometimes it is required to perform an instant analysis on
the more recent data to allow one to respond immediately to a given situation.
• There are cases where some enterprises use the ODS as a staging area for the
data warehouse.
• This would mean that the integration logic and processes are built into the ODS.
• On a regular basis, the data warehouse takes the current processed data from
the ODS and adds it to its own historical data.
5. RALPH KIMBAL’S APPROACH v/s WH INMON’S APPROACH
• There are two schools of thought when it comes to building a data warehouse:
• According to Ralph Kimball (Botton-Up Approach)
o “A data warehouse is made up of all the data marts in an enterprise.”
o It essentially means that an enterprise-wide data warehouse is a confluence
of all the data marts of the organization.
4|Page
CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
o Small organizations will benefit by building the data warehouse following
the Kimball approach.
▪ Kimball's approach is faster, cheaper, and less complex.
o The single version of truth might be compromised in Kimball's approach,
and the reason is obvious.
▪ If you have a large organization, you will have several independent
data marts, with each data mart proclaiming to have the genuine
corporate data.
▪ The confused entity here is the end-user!!!
▪ He has absolutely no idea which data mart to turn to.
• According to WH BILL Inmon (Top-Down Approach)
o A data warehouse is a subject-oriented, integrated, non-volatile, time-
variant collection of data in support of management's decision.
o Large organizations will find Inmon's approach extremely lucrative.
▪ Inmon's approach is more expensive and is a time-consuming slower
process involving several complexities.
▪ However, it can achieve the "single version of truth" for large
organizations.
▪ Therefore, it is worth investment of time and efforts.
6. GOALS OF A DATA WAREHOUSE
• Information Accessibility.
o Data in a data warehouse must be easy to comprehend, both by the business
users and developers alike.
o It should be properly labelled to facilitate easy access.
5|Page
CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
o The business users should be allowed to slice and dice the data in every
possible way (slicing and dicing refers to the separation and combination
of data in infinite combinations).
• Information Credibility.
o The data in the data warehouse should be credible, complete, and of desired
quality.
o Let us go back to the board meeting of "All Financcs" mentioned earlier.
o Suppose in this meeting Alfred presents a business metric X. If Richard
also presents the same business metric X, then the information provided by
both Alfred and Richard should be consistent.
• Flexible to Changes
o Business situations change, users' requirements change, technology
changes, and tools to access data may also change.
o The data warehouse must be adaptable to change.
o Addition of new data from disparate sources or new queries against the
data warehouse should not invalidate the existing information in the data
warehouse.
• Support for More fact-based Decision Making
o "Manage by fact" seems to be the buzzword these days.
o The data warehouse should have enough pertinent data to support more
precise decision making.
o What is also required is that the business users should be able to access the
data easily.
• Support for the Data Security
o The data warehouse maintains the company's confidential information.
o This information falling into wrong hands will do more damage than not
having a data warehouse at all.
o There should be mechanisms in place to enable the provision of
information in the required format to only those who are supposed to
receive it.
• Information Consistency
o Information consistency is about a single/consistent version of truth.
o A data warehouse brings data from disparate data sources into a centralized
repository.
o Users from across the organization make use of the data warehouse to view
s
a single and consistent version of truth.
7. WHAT CONSTITUES A DATA WAREHOUSE (Four Parts)
• Data from operational system flow into the Staging Area
where
• It undergoes Transformation and is then placed in the presentation Area.
6|Page
CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
• From there it can be accessed using data access tools.
Operational Source System Extract
Data Staging Area
Load Data Access Tools
Data Presentation Area
• Operational Source System
o These systems maintain transactional or operational data.
o They are outside the data warehouse.
o There could be any number of such systems (similar or disparate) feeding
data to the data warehouse.
o They may maintain little historical data.
o The queries against such systems generally return an answer set (also called
record set or result set) of one or few records.
• Data Staging Area
o Data staging area comprises storage space for the data that has been
extracted from various disparate operational sources.
o It also consists of a set of processes related to data quality.
o There are three major processes popularly referred to as extraction,
transformation, and loading.
o The data staging area is off-limits from the business users and is not
designed to answer queries however simple they may be, or to offer
presentation services.
• Data Presentation Area
o Data staging area is off-limits to the business users.
o But data presentation area is the interface or the front face of the data
warehouse with which the business community interacts via the data access
tools.
o It is just a collection of integrated data marts. What does the term
"integrated" imply?
7|Page
CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
• Data Access Tools
o Data access tools can be ad hoc query tools used to query the data
presentation Area.
o A data access tool can also be an reporting tool or a data modelling/mining
application (for trend analysis or prediction, etc.).
• Data Sources
o In data warehousing, we extract data from different disparate sources
(heterogeneous sources such as text files, .CSV files, .XLS files, .MDB
files, etc.),
transform this data into a certain format (unified/data warehouse
format), and then load the data in data warehouse.
o The raw material for any data integration is provided by data sources.
o Data sources refer to any of the following types of source:
▪ Data storage Media (flat file, DBMS, etc.).
▪ Data organization (linked data in COBOL mainframes, normalized
forms in RDBMS).
o The data in these data sources can be present in any format.
8. ETL (Extract-Transform-Load)
• It is a three-stage process in database usage, especially in data warehousing.
• It allows integration and analysis of data stored in different sources. After
collecting the data from multiple varied sources (extraction), the data is
reformatted (from host format to warehouse format) and cleansed (to detect and
rectify errors) to meet the information needs (transformation) and then sorted,
summarized, consolidated, and loaded into desired end target (loading).
• Put simply, ETL allows creation of efficient and consistent databases.
• ETL is
o Extracting Data from Different Sources.
8|Page
CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
o Transforming the extracted data into a relevant format to fit information
needs.
o Loading data into the final target database, usually a data warehouse.
• Stages of ETL
• Data Mapping
o It is a process of generating data element mapping between two distinct data
models.
o It is thc first process that is performed for a variety of data integration tasks
which include-
▪ Data transformation between data source and data destination.
▪ Identification of data relationships.
▪ Discovery of hidden sensitive data.
▪ Consolidation of databases into a single database.
• Data Staging
o A data staging area can be defined as an intermediate storage area that falls
between the operational/transactional sources of data and the data
warehouse (DW) or data mart (DM).
o A staging area can be used, among others, for the following purposes:
▪ To gather data from different sources ready to be processed at different
times.
▪ To quickly load information from the operational database.
▪ To find changes against current DW/DM values.
9|Page
CS8107 – BUSINESS INTELLIGENCE 2 - BASICS OF DATA INTEGRATION- Part (a)
▪ To cleanse data.
▪ To pre-calculate aggregates.
• Data Extraction
o It is a process of collecting data from different data sources. In other words,
it is the consolidation of data from different sources having different
formats.
o Flat files and relational databases are most common data sources.
Depending upon the type of source data, the complexity of extraction may
vary.
o The storage of intermediate version of data is very necessary. This data is
required to be backed up and archived.
o The area where the extracted data is stored is called staging area.
• Data Transformation
o A series of rules or functions is applied to the data extracted from the source
data that is loaded into the end to obtain derived target.
o Depending upon the data source, manipulation of data may be required.
o If the data source is good, its data may require very less transformation and
validation.
o But data from some sources might require one or more transformation types
to meet the operational needs and make data fit in the end target.
o Some transformation types are-
▪ Selecting only certain columns to load.
▪ Translating a few coded values.
▪ Encoding some free-form values.
▪ Deriving a new calculated value.
▪ Joining together data derived from multiple sources.
▪ Summarizing multiple rows of data.
▪ Splitting a column into multiple columns.
o Data transformation is the most complex and, in terms of production, the
costliest part of the ETL process.
• Data Loading
o The last stage of the ETL process is loading which loads the extracted and
transformed data into the end target, usually the data warehouse.
o Data can also be loaded by using SQL queries.
10 | P a g e