Data Warehousing Architecture - Designing the Data Staging Area
By Denise Rogers
The staging area tends to be one of the more overlooked components of a data warehouse
architecture, and yet it is an integral part of the ETL component design. Learn why it is best
to design the staging layer right the first time, enabling support of various ETL processes
and related methodology, recoverability and scalability.
In any data warehousing initiative, there are several common components to the
architecture. There are the data sources and targets, ETL framework, infrastructure,
application layer and the data staging area.
The staging area, in my experience has to be one of the more overlooked and
underestimated components of a data warehouse architecture. I think mostly this is due to
a lack of understanding as to what exactly it is.
If a quick search is made through a number of websites, many definitions will include the
fact the data staging area is simply a temporary workspace used to transform and enrich
data before it flows into the operational data store (ODS) and the data warehouse.
This is a good fundamental definition of the data staging area. However, it is so much more.
How much more do you ask? Well in reality, the data staging area is an information hub
that facilitates the enriching stages that data goes through in order to populate an ODS
and/or data warehouse. It is the essential ingredient in the development of an approach
and/or methodology for creating a comprehensive data-centric solution for any data
warehousing project.
If we really think about this, the data staging area is an integral part of the ETL component
design and is the foundation for the ETL architecture.
The Design of the Information Hub
The data staging area has been labeled appropriately and with good reason. With any data
warehousing effort, we all know that data will be transformed and consolidated from any
number of disparate and heterogeneous sources.
However, the design of a robust and scalable information hub is framed and scoped out by
functional and non-functional requirements. Examples of some of these requirements
include items such as the following:
The amount of raw source data to retain after it has been processed through the ETL
data lifecycle
The type of server(s) to house the staging area will be dedicated or shared with
other applications and environments (dedicated servers are a proven way to go)
The acceptable levels of data quality, related baselines and metrics as stated by the
Data Governance Board
Decisions on the data sources that will be federated in and the ones that will be a
copy of the sources
The management of metadata as data sources are brought into the landing zone of
the staging area
The level of security and roles defined for each of the areas with the staging
environment
The masking/scrambling of sensitive data within staging areas
The identification of recoverable artifacts in the event of disasters, etc.
With these types of requirements, rules and decisions, a scalable and secured framework is
firmly in place to facilitate the defined ETL methodology. These data sources go through a
number of evolutionary stages in order to build a robust and comprehensive data
warehouse and/or ODS. Moreover, as great data architects that we are, we know that these
stages must include the following.
Data Acquisition
This process includes landing the data physically or logically in order to initiate the ETL
processing lifecycle. The staging area here could include a series of sequential files,
relational or federated data objects. However, the design of intake area or landing zone
must enable the subsequent ETL processes, as well as provide direct links and/or
integrating points to the metadata repository so that appropriate entries can be made for all
data sources landing in the intake area.
Data Profiling
Data profiling is the surveying of the source data landscape to gain an understanding of the
condition of the data sources. In most profiling efforts, this means generating various
reports with any number of metrics, statistics, and counts that reflect the quality of the
source data coming in.
Data Cleansing
Data cleansing is an iterative set of processes that starts and ends with the business rules
and standards around acceptable data quality levels from the Data Governance Board (e.g.
95% of the data meets the quality standards). ). This includes investigative jobs to
provide additional detail in detecting data patterns and design alternatives for quality
enforcement at the attribute, record and aggregate levels and data correction jobs to fill
in missing or incomplete data and correct data values. There is also the analysis of reports
based on the findings and results of the investigation and data correction jobs to determine
if further refinements and/or modifications are to be made.
Data Standardization and Matching
Data standardization and matching is a set of processes that primarily consists of the design
and execution, standardizing jobs to create uniformity around specific mandatory data
elements. This includes the design and execution of matching and de-duplicating jobs to
eliminate duplicate data and create a single version of the truth. It also includes the analysis
of reports related to errors and/or exceptions and determines if further refinements or
modifications are to be made (if required) and to assess the readiness for data delivery to
the data warehouse and ODS.
Data Transformation
Transforming data essentially means converting data to conform to a standard established
by the Data Governance Board. Examples of data transformations include converting nulls
to specific values, gender codes that are disparate to a common set of values or even
merging multiple source fields to one data element.
Data Loading
Depending on business requirements, the loading phase can include a total data refresh of
the target component or adding new data to the data component in a historical manner.
Loading to a staged copy of the target component enables a series of validation exercises.
This includes verification of referential integrity, data quality and transformation rules prior
to the actual data population of the DW and/or ODS.
Design and Construction
The creation of a staging area will usually start with the typical activities of the design of
any data environment. Tasks such as server configuration, alignment of file systems,
creating the database instances and related database objects are common elements in the
design of any infrastructure dedicate to a data environment.
However, there a number of unique tasks that need to be completed to align the staging
area to the ETL methodology discussed in prior sections of this article.
For starters, the data architect and the DBA will need to create separate environments for
each stage that the data goes through. This means separate database and file systems that
are dedicated to the stage that the ETL lifecycle is in.
For example, a dedicated database instance and related file systems should be created for
the data acquisition and profiling stages. The tasks included in these stages are the reading
of every data element and record in order to generate detailed statistical information on the
source data. This means that processes involved in the profiling effort will be using
tremendous amounts of resources related to memory and CPU and should be segregated so
that other workloads are not adversely impacted. The design of the database instance must
take into consideration the fact that with the use of federated data, there may be
implications at the database level that will cause ripple effects on the other data objects
within the database instance. Also the file systems allocated to the containers that the
database uses should be separate from the file systems used in the data acquisition process
so that there are no I/O bottleneck issues.
Then there is the SECURITY component! This is live production data that has highly
sensitive information. This data cannot be masked and/or scrambled as this defeats the
whole purpose of the ETL process to stage data into the data warehouse or ODS. The raw
data must be exposed in order for the ETL to be as effective in integrating, cleansing and
standardizing all data from all sources. Therefore, having a robust security framework is an
essential ingredient in this configuration. Typically, the data steward and an appointed
business analyst should be among the chosen few that have access to some of the sensitive
data elements. The ETL developer, DBA and system administrator does not need to see any
of it. There is also the prevention of copying data. No one should be allowed to make copies
of anything for any purpose. The information hub should be able to satisfy all requests for
data access for analysis in a robustly secured environment.
The Information Hub Experience - Tales from
the Data Layer
I was assigned to the first data warehouse project at a major healthcare company. It was
our first time working with an ETL solution and all that comes with it. We successfully
installed the toolset, created the protocols to pull in the data sources and target data
warehouse components. However, it was an extremely painfully project. Why? Because
whenever the ETL processes aborted or there were hardware failures, there were no clean
ways to restart anything! The staging layer was the sum total of several file systems
allocated for ETL usage and not much else was in place at the staging area level. In other
words, we built a flimsy foundation for the ETL component and we paid dearly for it!
At another time, having grown from that experience, I worked at another client site as part
of a team to design and construct a data warehouse environment complete with an ETL
solution, etc. This time, I knew I would get it right! I created an information hub that had
file systems and a database, tables and views. This database had federated objects and
every kind of bell and whistle you could think of. Except that during the data profiling
process of the federated objects, the process ran out of temporary space at the source
application and aborted. The error message generated was that the database is corrupt and
all is lost. Talk about the panic! I had that look in my eyes! Everything ground to a
screeching halt while I completed the database recovery.
The lessons here are to design the staging layer to enable support of various ETL processing
and related methodology, recoverability and scalability.
A well-designed staging area should enable the ETL approach, processes and services and
the facilitation of the data management activities with business analysts, data stewards,
(validation of business rules) profiling reports, quality reports and successfully stage the
data required to populate the data warehouse and the operational data store. Failure to do
that will lead to many sleepless nights, days spent in war rooms and putting the data
warehouse project in jeopardy of not meeting milestones and deadlines. I have been on
both sides and not being a big fan of the war rooms, I now know better. You should too!