Designing a Data Staging Area for ETL

The staging area is an important but often overlooked component of a data warehouse architecture. It acts as an information hub that facilitates the transformation of data from various sources into a data warehouse or operational data store. The staging area should be designed as a scalable and secure foundation that supports the ETL methodology and processes through which data passes. This includes creating separate dedicated environments for each stage of the ETL lifecycle and implementing robust security measures to protect sensitive data in the staging area.


Data Warehousing Architecture - Designing the Data Staging Area

By Denise Rogers

The staging area tends to be one of the more overlooked components of a data warehouse
architecture, and yet it is an integral part of the ETL component design. Learn why it is best
to design the staging layer right the first time, enabling support of various ETL processes
and related methodology, recoverability and scalability.

In any data warehousing initiative, there are several common components to the
architecture. There are the data sources and targets, ETL framework, infrastructure,
application layer and the data staging area.

The staging area, in my experience, is one of the more overlooked and
underestimated components of a data warehouse architecture. I think this is mostly due to
a lack of understanding of what exactly it is.

A quick search through a number of websites turns up many definitions that describe the
data staging area as simply a temporary workspace used to transform and enrich
data before it flows into the operational data store (ODS) and the data warehouse.

This is a good fundamental definition of the data staging area. However, it is so much more.
How much more, you ask? Well, in reality, the data staging area is an information hub
that facilitates the enriching stages that data goes through in order to populate an ODS
and/or data warehouse. It is the essential ingredient in the development of an approach
and/or methodology for creating a comprehensive data-centric solution for any data
warehousing project.

If we really think about this, the data staging area is an integral part of the ETL component
design and is the foundation for the ETL architecture.

The Design of the Information Hub

The data staging area has been labeled appropriately and with good reason. With any data
warehousing effort, we all know that data will be transformed and consolidated from any
number of disparate and heterogeneous sources.

However, the design of a robust and scalable information hub is framed and scoped out by
functional and non-functional requirements. Examples of some of these requirements
include items such as the following:

• The amount of raw source data to retain after it has been processed through the ETL
data lifecycle
• Whether the server(s) housing the staging area will be dedicated or shared with
other applications and environments (dedicated servers are a proven way to go)
• The acceptable levels of data quality, related baselines and metrics as stated by the
Data Governance Board
• Decisions on which data sources will be federated in and which will be a
copy of the sources
• The management of metadata as data sources are brought into the landing zone of
the staging area
• The level of security and the roles defined for each of the areas within the staging
environment
• The masking/scrambling of sensitive data within staging areas
• The identification of recoverable artifacts in the event of disasters, etc.

With these types of requirements, rules and decisions, a scalable and secured framework is
firmly in place to facilitate the defined ETL methodology. These data sources go through a
number of evolutionary stages in order to build a robust and comprehensive data
warehouse and/or ODS. Moreover, as the great data architects that we are, we know that these
stages must include the following.

Data Acquisition
This process includes landing the data physically or logically in order to initiate the ETL
processing lifecycle. The staging area here could include a series of sequential files, or
relational or federated data objects. However, the design of the intake area or landing zone
must enable the subsequent ETL processes, as well as provide direct links and/or
integration points to the metadata repository so that appropriate entries can be made for all
data sources landing in the intake area.
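
As a rough illustration of making a metadata entry when a source lands, the sketch below records a row count, checksum and timestamp for a delimited file. The `LandingEntry` fields, the `register_landing` helper and the in-memory list standing in for the metadata repository are all hypothetical names, not something the article prescribes.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LandingEntry:
    """One metadata record per source landed (hypothetical structure)."""
    source_name: str
    row_count: int
    sha256: str
    landed_at: str


def register_landing(source_name: str, payload: bytes, registry: list) -> LandingEntry:
    """Describe a landed delimited file and append the entry to the registry."""
    lines = payload.decode("utf-8").splitlines()
    entry = LandingEntry(
        source_name=source_name,
        row_count=max(len(lines) - 1, 0),  # assumes the first line is a header
        sha256=hashlib.sha256(payload).hexdigest(),
        landed_at=datetime.now(timezone.utc).isoformat(),
    )
    registry.append(entry)  # stand-in for a write to the metadata repository
    return entry
```

The checksum gives later stages a cheap way to confirm they are reading the same file that was landed.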

Data Profiling
Data profiling is the surveying of the source data landscape to gain an understanding of the
condition of the data sources. In most profiling efforts, this means generating various
reports with any number of metrics, statistics, and counts that reflect the quality of the
source data coming in.
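
To make the idea of profiling metrics concrete, here is a minimal sketch that computes a few counts for one column. It assumes rows are plain dictionaries and treats `None` and empty strings as missing; the `profile_column` name and the choice of metrics are illustrative assumptions only.

```python
from collections import Counter


def profile_column(rows, column):
    """Return simple quality counts for one column across all rows."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }
```

A real profiling job reads every element of every record, which is why the article later argues for giving this stage its own resources.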

Data Cleansing
Data cleansing is an iterative set of processes that starts and ends with the business rules
and standards around acceptable data quality levels from the Data Governance Board (e.g.
95% of the data meets the quality standards). This includes investigative jobs that
provide additional detail for detecting data patterns and designing alternatives for quality
enforcement at the attribute, record and aggregate levels, as well as data correction jobs
that fill in missing or incomplete data and correct data values. There is also the analysis of
reports based on the findings and results of the investigation and data correction jobs to
determine if further refinements and/or modifications are to be made.
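
The quality threshold mentioned above can be expressed as a simple gate. The sketch below assumes each business rule is a function that takes a row and returns True or False; the function names and the 0.95 default merely mirror the article's 95% example.

```python
def quality_ratio(rows, rules):
    """Fraction of rows that satisfy every rule (1.0 for an empty batch)."""
    if not rows:
        return 1.0
    passed = sum(1 for row in rows if all(rule(row) for rule in rules))
    return passed / len(rows)


def meets_threshold(rows, rules, threshold=0.95):
    """True when the batch meets the agreed data quality level."""
    return quality_ratio(rows, rules) >= threshold
```

Because cleansing is iterative, a batch that fails the gate would go back through investigation and correction jobs before being checked again.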

Data Standardization and Matching

Data standardization and matching is a set of processes that primarily consists of the design
and execution of standardizing jobs to create uniformity around specific mandatory data
elements. This includes the design and execution of matching and de-duplicating jobs to
eliminate duplicate data and create a single version of the truth. It also includes the analysis
of reports related to errors and/or exceptions to determine whether further refinements or
modifications are required and to assess the readiness for data delivery to
the data warehouse and ODS.
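
A small sketch of standardizing and then de-duplicating records follows. It assumes records carry hypothetical `name` and `email` fields and matches on an exact standardized key; production matching jobs usually rely on fuzzier comparison, so treat this as the simplest possible case.

```python
def standardize(record):
    """Normalize whitespace and casing so equivalent values compare equal."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "email": record["email"].strip().lower(),
    }


def deduplicate(records, key="email"):
    """Standardize every record, then keep the first record seen per key."""
    survivors = {}
    for record in map(standardize, records):
        survivors.setdefault(record[key], record)
    return list(survivors.values())
```

Standardizing before matching is what lets `Ann@Example.com ` and `ann@example.com` collapse into one surviving record.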

Data Transformation
Transforming data essentially means converting data to conform to a standard established
by the Data Governance Board. Examples of data transformations include converting nulls
to specific values, mapping disparate gender codes to a common set of values or even
merging multiple source fields into one data element.
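
The three example transformations can be sketched in a few lines. The field names, the `GENDER_MAP` codes and the `"UNKNOWN"` and `"U"` defaults are assumptions made for illustration; the actual standard would come from the Data Governance Board.

```python
# Hypothetical mapping from disparate source codes to a common set of values.
GENDER_MAP = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "2": "F"}


def transform(row):
    """Apply null defaults, gender code mapping and a field merge to one row."""
    raw_gender = str(row.get("gender") or "").strip().lower()
    first = (row.get("first_name") or "").strip()
    last = (row.get("last_name") or "").strip()
    return {
        "gender": GENDER_MAP.get(raw_gender, "U"),  # unmapped codes become "U"
        "full_name": " ".join(part for part in (first, last) if part),
        "region": row.get("region") or "UNKNOWN",  # null converted to a value
    }
```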

Data Loading
Depending on business requirements, the loading phase can include a total data refresh of
the target component or adding new data to the data component in a historical manner.
Loading to a staged copy of the target component enables a series of validation exercises.
This includes verification of referential integrity, data quality and transformation rules prior
to the actual data population of the DW and/or ODS.
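
One of the validation exercises, the referential integrity check, might look like the sketch below when run against a staged copy. It assumes fact and dimension rows are dictionaries; `orphaned_keys` and `ready_to_load` are hypothetical helper names.

```python
def orphaned_keys(fact_rows, dimension_rows, fact_fk, dimension_pk):
    """Return foreign key values in the facts that have no matching dimension row."""
    known = {row[dimension_pk] for row in dimension_rows}
    referenced = {row[fact_fk] for row in fact_rows}
    return sorted(referenced - known)


def ready_to_load(fact_rows, dimension_rows, fact_fk, dimension_pk):
    """True when every staged fact row points at an existing dimension row."""
    return not orphaned_keys(fact_rows, dimension_rows, fact_fk, dimension_pk)
```

Running this against the staged copy means a failure blocks the load instead of surfacing later in the DW or ODS.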

Design and Construction

The creation of a staging area will usually start with the typical activities of the design of
any data environment. Tasks such as server configuration, alignment of file systems, and
creating the database instances and related database objects are common elements in the
design of any infrastructure dedicated to a data environment.

However, there are a number of unique tasks that need to be completed to align the staging
area to the ETL methodology discussed in prior sections of this article.

For starters, the data architect and the DBA will need to create separate environments for
each stage that the data goes through. This means separate database and file systems that
are dedicated to the stage that the ETL lifecycle is in.

For example, a dedicated database instance and related file systems should be created for
the data acquisition and profiling stages. The tasks included in these stages are the reading
of every data element and record in order to generate detailed statistical information on the
source data. This means that processes involved in the profiling effort will be using
tremendous amounts of resources related to memory and CPU and should be segregated so
that other workloads are not adversely impacted. The design of the database instance must
take into consideration the fact that with the use of federated data, there may be
implications at the database level that will cause ripple effects on the other data objects
within the database instance. Also the file systems allocated to the containers that the
database uses should be separate from the file systems used in the data acquisition process
so that there are no I/O bottleneck issues.
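
The separation described above can be checked mechanically. The following sketch assumes each stage's environment is described by a small hypothetical `StageEnvironment` record and reports any file system claimed by more than one stage; the paths and instance names used with it are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageEnvironment:
    """Hypothetical description of the resources dedicated to one ETL stage."""
    stage: str
    database: str
    file_systems: frozenset


def shared_file_systems(environments):
    """Return {path: [stages]} for every file system used by more than one stage."""
    owners = {}
    for env in environments:
        for path in env.file_systems:
            owners.setdefault(path, []).append(env.stage)
    return {path: stages for path, stages in owners.items() if len(stages) > 1}
```

An empty result suggests the stages will not compete for I/O on the same file systems, at least as far as the declared layout goes.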

Then there is the SECURITY component! This is live production data that contains highly
sensitive information. This data cannot be masked and/or scrambled as this defeats the
whole purpose of the ETL process to stage data into the data warehouse or ODS. The raw
data must be exposed in order for the ETL to be as effective as possible in integrating,
cleansing and standardizing all data from all sources. Therefore, having a robust security
framework is an essential ingredient in this configuration. Typically, the data steward and an
appointed business analyst should be among the chosen few that have access to some of the
sensitive data elements. The ETL developer, DBA and system administrator do not need to see
any of it. There is also the prevention of copying data. No one should be allowed to make
copies of anything for any purpose. The information hub should be able to satisfy all requests
for data access for analysis in a robustly secured environment.
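
A role-based filter is one simple way to picture this rule. In the sketch below the role names follow the paragraph above, while the sensitive column names and the `visible_columns` helper are purely hypothetical; a real deployment would enforce this in the database's own security layer, not in application code.

```python
# Hypothetical sensitive columns; a real list would come from data governance.
SENSITIVE_COLUMNS = {"ssn", "diagnosis"}
ROLES_WITH_SENSITIVE_ACCESS = {"data_steward", "business_analyst"}


def visible_columns(role, columns):
    """Return the columns a role may see, hiding sensitive ones by default."""
    if role in ROLES_WITH_SENSITIVE_ACCESS:
        return list(columns)
    return [column for column in columns if column not in SENSITIVE_COLUMNS]
```

Unknown roles fall through to the restricted view, so access has to be granted explicitly.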

The Information Hub Experience - Tales from the Data Layer
I was assigned to the first data warehouse project at a major healthcare company. It was
our first time working with an ETL solution and all that comes with it. We successfully
installed the toolset, created the protocols to pull in the data sources and target data
warehouse components. However, it was an extremely painful project. Why? Because
whenever the ETL processes aborted or there were hardware failures, there were no clean
ways to restart anything! The staging layer was the sum total of several file systems
allocated for ETL usage and not much else was in place at the staging area level. In other
words, we built a flimsy foundation for the ETL component and we paid dearly for it!

At another time, having grown from that experience, I worked at another client site as part
of a team to design and construct a data warehouse environment complete with an ETL
solution, etc. This time, I knew I would get it right! I created an information hub that had
file systems and a database, tables and views. This database had federated objects and
every kind of bell and whistle you could think of. Except that during the data profiling
process of the federated objects, the process ran out of temporary space at the source
application and aborted. The error message generated was that the database is corrupt and
all is lost. Talk about the panic! I had that look in my eyes! Everything ground to a
screeching halt while I completed the database recovery.

The lessons here are to design the staging layer to enable support of various ETL processing
and related methodology, recoverability and scalability.
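
Recoverability can be as simple as remembering which steps have finished. The sketch below is one possible approach, not the design used on either project: `run_with_checkpoints` is a hypothetical helper, and the in-memory set stands in for a checkpoint that would normally be persisted in the staging database or on disk.

```python
def run_with_checkpoints(steps, completed):
    """Run (name, callable) steps in order, skipping names already completed.

    If a step raises, it is not marked complete, so a later call with the
    same `completed` set restarts cleanly from the failed step.
    """
    for name, action in steps:
        if name in completed:
            continue
        action()
        completed.add(name)
    return completed
```

After an abort, calling the function again with the same `completed` set resumes at the failed step and does not repeat the work that already succeeded.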

A well-designed staging area should enable the ETL approach, processes and services;
facilitate the data management activities carried out with business analysts and data
stewards (validation of business rules, profiling reports, quality reports); and successfully
stage the data required to populate the data warehouse and the operational data store. Failure
to do that will lead to many sleepless nights, days spent in war rooms and a data
warehouse project in jeopardy of missing milestones and deadlines. I have been on
both sides and, not being a big fan of war rooms, I now know better. You should too!

Data Warehousing Architecture - Designing the Data Staging Area
By Denise Rogers (http://www.databasejournal.com/feedback.php


