0% found this document useful (0 votes)
7 views10 pages

Data Warehousing Fundamentals Explained

The document provides an overview of data warehousing, including its definition, components, and the ETL process. It highlights the differences between operational databases and data warehouses, as well as various data warehouse models such as Enterprise Data Warehouse, Data Mart, and Virtual Data Warehouse. Additionally, it discusses the architecture of data lakes and the concept of a Data Ocean as a comprehensive storage solution for organizational data.

Uploaded by

kapsetanmay112
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views10 pages

Data Warehousing Fundamentals Explained

The document provides an overview of data warehousing, including its definition, components, and the ETL process. It highlights the differences between operational databases and data warehouses, as well as various data warehouse models such as Enterprise Data Warehouse, Data Mart, and Virtual Data Warehouse. Additionally, it discusses the architecture of data lakes and the concept of a Data Ocean as a comprehensive storage solution for organizational data.

Uploaded by

kapsetanmay112
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Unit 1 : Basics of Data Warehousing

Data mining
Data mining is the process of discovering hidden patterns, trends, and valuable insights from
massive datasets using advanced techniques like statistics, machine learning, and AI, to support
better decision-making, predict outcomes, and solve complex business problems

Introduction to Data Warehouse


A Data Warehouse is a centralized repository used to store large volumes of historical data
collected from multiple sources. It is designed to support decision-making, analysis, and reporting,
not daily transactions.

In simple words:
Database = A database is an organized collection of current data used to store,delete,
update, and retrieve information for daily operations.

Data Warehouse = A data warehouse is a centralized repository that stores large amounts
of historical data collected from multiple sources for analysis and
decision-making.

Need for Data Warehousing


Handling Large Data Volumes: Traditional databases store limited data (MBs to GBs), while data
warehouses are built to handle huge datasets (up to TBs), making it easier to store and analyze
long-term historical data.

Enhanced Analytics: Databases handle transactions; data warehouses are optimized for complex
analysis and historical insights.

Centralized Data Storage: A data warehouse combines data from multiple sources, giving a
single, unified view for better decision-making.

Trend Analysis: By storing historical data, a data warehouse allows businesses to analyze trends
over time, enabling them to make strategic decisions based on past performance and predict
future outcomes.

Business Intelligence Support: Data warehouses work with BI tools to give quick access to
insights, helping in data-driven decisions and improving efficiency.

Components of Data Warehouse


a) Data Sources: These are the various operational systems, databases and external data feeds that
provide raw data to be stored in the warehouse.

b) ETL (Extract, Transform, Load) Process: The ETL process is responsible for extracting data from
different sources, transforming it into a suitable format and loading it into the data warehouse.
c) Data Warehouse Database: This is the central repository where cleaned and transformed data is
stored. It is typically organized in a multidimensional format for efficient querying and reporting.

d) Metadata: Metadata describes the structure, source and usage of data within the warehouse, making
it easier for users and systems to understand and work with the data.

e) Data Marts: These are smaller, more focused data repositories derived from the data warehouse,
designed to meet the needs of specific business departments or functions.

f) OLAP (Online Analytical Processing) Tools: OLAP tools allow users to analyze data in multiple
dimensions, providing deeper insights and supporting complex analytical queries.

g) End-User Access Tools: These are reporting and analysis tools, such as dashboards or Business
Intelligence (BI) tools, that enable business users to query the data warehouse and generate reports.

Differences between Operational Database Systems and Data Warehouses

Sr.
no.
Operational Database Data Warehouse
1) Operational database data is volatile (changes data warehouse data is non-volatile (stable)
often)
2) Relational databases are made for on-line Data Warehouse planned for on-line Analytical
value-based Preparing (OLTP) Processing (OLAP)
3) Operational database systems are generally While data warehouses are generally subject-
application-oriented. oriented.
4) Stores current data Stores historical data
5) Data comes from one application Data comes from multiple sources
6) Focus on speed & accuracy Focus on analysis & trends
7) Smaller data size Very large data size
8) Operational databases support insert, update, data warehouses mainly support read and
delete analysis.
Multi-Tiered Architecture of Data Warehouse(Three-Tier Architecture)

A multi-tiered architecture of a data warehouse means the system is divided into different layers
(tiers). Each tier has a specific role, which makes data storage, processing, and analysis efficient and
easy to manage.

 Bottom Tier (Data Sources and Data Storage)


 Middle Tier (OLAP Engine)
 Top Tier (Front-End Tools)

Bottom Tier Middle Tier Top Tier

Bottom Tier
Bottom Tier is the foundation of the data warehouse.
It is responsible for collecting, processing and storing data from multiple sources.
It plays a critical role in preparing data for analysis.

Key Components:

 Data Sources: These are the places where data is created.


Examples include operational databases, Excel files, spreadsheets, websites, apps, CRM/ERP systems,
and web logs.
This data is raw and used as input for the data warehouse.

 Data Storage: After cleaning and processing, the data is stored safely in special databases like RDBMS
or multidimensional databases.
This stored data is used for easy searching, analysis, and reporting.
ETL Process(Extract,Transform,Load)
 Extract: Gathers raw data from different, often incompatible sources.
 Transform: Converts data into a consistent format, applying business rules, cleansing
errors, handling missing values and resolving duplicates.
 Load: Loads the transformed data into the warehouse, organizing it for fast access and
analysis.
Metadata = “Data about Data”
Monitor and Integrator : It is like a security guard + manager for data flow.

Middle Tier(OLAP Engine)


The Middle Tier hosts the OLAP server, which processes complex analytical queries.
It acts as a bridge between the data storage layer (bottom tier) and the user interface (top tier),
ensuring data is quickly retrieved, aggregated and ready for reporting and analysis.
It is designed for high-speed analytical processing.

OLAP server models come in three different categories, including:


ROLAP (Relational OLAP):This model uses a relational database to store and manage warehouse
data. It is ideal for handling large data volumes as it operates directly on relational databases.

MOLAP (Multidimensional OLAP): This model stores data in a multidimensional cube. The storage
and retrieval processes are highly efficient, making MOLAP suitable for complex analytical queries
that require aggregation.

HOLAP (Hybrid OLAP): It is combination of relational and multidimensional online analytical


processing paradigms. HOLAP is the ideal option for a seamless functional flow across the database
systems when the repository houses both the relational database management system and the
multidimensional database management system.

Top Tier(Front-end tools)


The Top Tier in the Three-Tier Data Warehouse Architecture comprises the front-end client layer,
which is essential for interacting with the data stored and processed in the lower tiers.
This layer includes a variety of business intelligence (BI) tools and techniques designed to facilitate
easy access and manipulation of data for reporting, analysis, and decision-making.
Data Warehouse Models: Enterprise Warehouse, Data Mart, And virtual
Warehouse ---> (A Data Warehouse model defines how data is collected, stored, organized, and
accessed for decision-making.)

Enterprise Data Warehouse (EDW)

An Enterprise Data Warehouse is a central, large warehouse that stores all data of the entire
organization.
An Enterprise Data Warehouse (EDW) refers to a comprehensive data repository that integrates
data drawn from different areas of an organisation.
It holds all information for all business units required giving a consolidated and unified view of the
organization.

Advantages

✅ Single source of truth


✅ Better decision-making
✅ Consistent data across departments

Disadvantages

❌ High cost
❌ Long time to build
❌ Needs skilled professionals

Data Mart

A Data Mart is a smaller, subject-oriented data warehouse designed to serve the needs of a specific
department or business function.

It contains only the data that department needs, not everything from the whole organization.

Data Warehouse = big store


Data Mart = one section of that store
Advantages

✔ Faster response time


✔ Lower cost than full warehouse
✔ Easy for users to understand
✔ Quick implementation

Disadvantages

✖ Data duplication possible


✖ Data inconsistency (independent type)
✖ Limited analysis scope

Virtual Data Warehouse

A Virtual Data Warehouse is a logical (virtual) system that allows users to see and analyze data from
different databases together, without actually storing the data in one place.

Advantages (Explainable)

✅ Very low cost (no storage)

✅ Fast to implement

✅ Real-time or near real-time data

✅ No data duplication

Disadvantages (Exam Focus)

❌ Slow for complex queries

❌ Heavy load on source databases

❌ Not suitable for historical analysis

❌ Limited performance optimization


Extraction, Transformation and Loading (ETL)

The ETL process, which stands for Extract, Transform and Load.
It is a critical methodology used to prepare data for storage, analysis and reporting in a data
warehouse.
It involves three distinct stages that help to streamline raw data from multiple sources into a
clean, structured and usable form.

Extract

The Extract phase is the first step in the ETL process, where raw data is collected from various data
sources.
Data is collected from tables, files, and text, not just one type of source, Data can come from tables
(databases), formatted files (JSON/XML), or unorganized data like emails and text files.
The main goal of extraction is to gather data without altering its format, enabling it to be further
processed in the next stage.

Types of data sources can include:


 Structured: SQL databases, ERPs, CRMs
 Semi-structured: JSON, XML
 Unstructured: Emails, web pages, flat files

Transformation

Data extracted in the previous phase is often raw and inconsistent.


During transformation, the data is cleaned, aggregated and formatted according to business rules.
This is a crucial step because it ensures that the data meets the quality standards required for
accurate analysis.

Common transformations include:


 Data Filtering: Removing irrelevant or incorrect data.
 Data Sorting: Organizing data into a required order for easier analysis.
 Data Aggregating: Summarizing data to provide meaningful insights (e.g., averaging sales
data).
Loading

Once data has been cleaned and transformed, it is ready for the final step: Loading.
This phase involves transferring the transformed data into a data warehouse, data lake or
another target system for storage.

Depending on the use case, there are two types of loading methods:
 Full Load: loads all data at once (usually first time)
 Incremental Load: Only new or updated data is loaded, making this method more
efficient for ongoing data updates.

Metadata Repository
A Metadata Repository is a central storage system that keeps information about data, not the
actual business data itself.
It tells us what the data means, where it comes from, how it is stored, and how it is used.
If data is a book, metadata is the title, author, index, and instructions about that book.
It helps users understand the data, supports ETL processes, ensures data consistency, and improves
data warehouse management.

Data pond

A Data Pond is a temporary storage area used to collect and store raw, unprocessed data from
multiple sources before it is cleaned, transformed, and loaded into a Data Warehouse or Data Mart.
Data Pond also known as Data Puddles

Data lake

A Data Lake is a centralized repository that stores large volumes of raw data in its original format
(structured, semi-structured, and unstructured) for long-term storage and future analysis.
Unlike a data warehouse-which stores processed and modeled data-data lakes allow organizations
to store everything first and analyze later.
This flexibility makes data lakes ideal for big data, advanced analytics, machine learning, and real-
time processing scenarios.

Architecture of data lake


1. Data Sources

Data is generated from different sources such as structured data (relational databases, ERP, CRM), semi-
structured data (JSON, XML, CSV), and unstructured data (images, videos, audio files, social media, and
sensor data).

2. Data Ingestion Layer


This layer is responsible for extracting and loading data into the data lake. Data can be ingested using
batch/scheduled ingestion or real-time streaming, and it is loaded without any transformation.

3. Raw / Landing Zone

In this layer, data is stored in its original raw form. No schema or cleaning is applied at this stage. It
supports the schema-on-read approach and allows data to be reprocessed if needed.

4. Data Processing and Transformation Layer

This layer performs batch processing and real-time processing to clean, filter, and transform raw data into
a usable format.

5. Processed Data Layer

After transformation, clean and structured data is stored here. This data can be directly used for analytics
or moved to a data warehouse for reporting.

6. Analytical Sandboxes

This layer is used by analysts and data scientists for data discovery, exploratory analysis, and predictive
modeling, supporting advanced analytics and machine learning.

7. Data Consumption Layer

Processed data is used for BI analytics, reporting, querying, real-time alerting, and search operations by
business users and applications.

8. Security, Governance, and Monitoring

This layer ensures data security, access control, data governance policies, monitoring, and compliance
across all stages of the data lake.

Data ocean
Data Ocean is the largest form of data storage that combines many data lakes and warehouses
across an organization.

You might also like