0% found this document useful (0 votes)
7 views30 pages

Data Minning Work

The document outlines the course details for a Bachelor of Information Technology assignment on Data Mining and Data Warehousing, including definitions, features, types, benefits, and applications of data warehouses. It explains the processes involved in data warehousing such as data extraction, transformation, and loading, as well as the client/server computing model's role in facilitating data access and analysis. Additionally, it discusses the advantages and disadvantages of data warehouses, highlighting their importance in decision-making across various industries.

Uploaded by

kaingunashon6
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views30 pages

Data Minning Work

The document outlines the course details for a Bachelor of Information Technology assignment on Data Mining and Data Warehousing, including definitions, features, types, benefits, and applications of data warehouses. It explains the processes involved in data warehousing such as data extraction, transformation, and loading, as well as the client/server computing model's role in facilitating data access and analysis. Additionally, it discusses the advantages and disadvantages of data warehouses, highlighting their importance in decision-making across various industries.

Uploaded by

kaingunashon6
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

FACULTY: COMPUTER MANAGEMENT AND SOCIAL SCIENCES

DEPARTMENT: COMPUTING AND INFORMATION TECHNOLOGY

COURS PROGRAM: BACHELOR OF INFORMATION TECHNOLOGY.

COURSE TITLE: DATA MINING AND DATAWAREHOUSE.

COURSE CODE: IT 3101

LECTURE NAME: MR. ARAKA

ASSIGNMENT : GROUP NUMBER 13.

YEAR: 2025/26.

SEMESTER: 3.1

DATE OF SUBMISSION: December 15, 2025

S/N STUDENT NAME REG NUMBER SIGNATURE

1 NASHON BITA KAINGU BIT/30416/2301/DT

2 MARTINO JOSEPH MISANGA BIT/28657/2301/DT

3 PASCHALIUS GIDION MSHOBOZI BIT/28679/2301/DT

4 VALENTINO MABULA DUBA BIT/30720/2301/DT

5 ANDREW HENRY CHATANDA BIT/28428/2301/DT


INTRODUCTIO TO DATA WAREHOUSE.
What are Data Warehouses?
A data warehouse is a centralized repository designed to store large volumes
of structured and unstructured data from multiple sources. It supports data
analysis, business intelligence, and reporting by consolidating data into a single,
comprehensive system. The process typically involves extracting, transforming,
and loading (ETL) data into the warehouse, where it can be organized and queried
efficiently.
Data warehousing is the process of collecting, integrating, storing and managing data
from multiple sources in a central repository. It enables organizations to organize large
volumes of current and historical data for efficient querying, analysis and reporting.
The main goal of data warehousing is to support decision making by providing clean,
consistent and timely access to data. It ensures fast data retrieval even when working
with massive datasets.

Why we need for Data Warehousing?


[Link] Large Data Volumes: Traditional databases store limited data (MBs to
GBs), while data warehouses are built to handle huge datasets (up to TBs), making it
easier to store and analyze long-term historical data.
[Link] Analytics: Databases handle transactions; data warehouses are optimized
for complex analysis and historical insights.
[Link] Data Storage: A data warehouse combines data from multiple sources,
giving a single, unified view for better decision-making.
[Link] Analysis: By storing historical data, a data warehouse allows businesses to
analyze trends over time, enabling them to make strategic decisions based on past
performance and predict future outcomes.
[Link] Intelligence Support: Data warehouses work with BI tools to give quick
access to insights, helping in data-driven decisions and improving efficiency.
Features of Data Warehouses
a)Subject Oriented: Information being stored and processed is related to important user
topics including customers, sales, and products among others giving a clear and overall
view of certain business areas of interest.
b)Integrated: Access to the collected data takes place and it is compiled to have
standard names, measurements and encoding structures.
c)Non Volatile: The data being stored, is not changed or erased in the warehouse so that
history and trends can be checked reliably over time.
d)Time Variant: Data warehouses contain past information, and, as a rule, information
for several years is contained in the data warehouse that allows realizing analyses of
tendencies, and chronological changes, etc.
e)Optimized for Query Performance: Integrated in such a way that it allows easy search
and query method through indexing and partitioning for large Data sets.
How do Data Warehouses work?
[Link] Extraction: It pulls data from several transactional sources including but not
limited to transactional databases, customer relationship management systems,
enterprise resource planning systems, and from external sources. In this step, focus and
collect information that is pertinent to the warehouse.
[Link] Transformation: The extracted data goes through integration processes to
make them uniform and accurate. This concerns data pre-processing that involves
deleting duplicate records or observations, actual error checking and rectification, data
sources consolidation, adding value to the data, and format and unit harmonization.
[Link] Loading (ETL): Implemented data is transferred to data warehouses. This
process can be executed and completed at a large scale in a predetermined period
(Batch) or incremental from the source to update the data warehouse constantly (Real-
time).
[Link] Storage: Encapsulated data is integrated into an architecture known as a schema
such as a star schema and snowflake schema, because of query and designing
optimization for analysis. Information is usually maintained in dimensional models
containing facts which are numerical and dimensions which are attributes.
[Link] Indexing and Partitioning: To improve the efficiency of queries that are
submitted to it, data in the warehouse is segmented and indexed. Indexing optimizes the
speed at which data is accessed because pointers to the location of data are created;
partitioning, on the other hand, breaks a large table into smaller tables thus making
them easier to manage.
[Link] Management: About data, there is metadata to offer details on its source,
organization, manipulation and utilization. This is useful for a user in figuring out the
history and origin of the data.
[Link] Access and Querying: BI( Business intelligence) tools, query engines and
reporting tools are integrated into the data warehouse where; data analysis is done,
report generation and even dashboard creation is conducted. Employees and customers
can apply versatile queries, execute some calculations for solving unpredictable tasks,
and obtain some valuable conclusions based on the united data from the sources.
[Link] Security and Governance: Data warehouses therefore apply security functions
for regulating access to the data as well as other related data. Policies focus on data
quality, compliance as well as utilization to eliminate problems related to data quality,
data compliance and data utilization during its lifetime.

TYPES OF DATA WARE HOUSE.

1. Enterprise Data Warehouse (EDW)

An Enterprise Data Warehouse (EDW) refers to a comprehensive data repository that


integrates data drawn from different areas of an organisation. It holds all information
for all business units required giving a consolidated and unified view of the
organization.
Features of Enterprise Data Warehouse (EDW)
[Link] and Scale: The EDWs are deployed for managing data generated from diverse
business areas including finance, marketing, selling, and operations.
[Link]: It gathers information from transactional databases, ERP systems, CRM
systems and other data feeds from outside the business environment.
[Link]: Usually, the structure of an EDW complies with a star/snowflake
scheme containing facts (numeric, measurable information) and dimensions (qualitative
and contextual data).
[Link]: They facilitate complex queries, high-level analysis, and business insight
information within the business.
[Link]: Being central repositories of an organization’s data, EDWs demand
investments in infrastructure, upkeep, as well as stewardship for the quality of data kept
in them.
2. Operational Data Store (ODS)
An ODS is another form of data warehouse data layer that holds data from operational
systems in a consolidated and integrated format for near real-time reporting and
operational analysis.
Features of Operational Data Store (ODS)
[Link]-Time Data: ODSs are intended for the incorporation of updated information as
close as possible to the operational time, so they offer the current look at the
operational data.
[Link]: While EDWs are used for strategic decision-making, ODSs deal with the
organization’s operational activities and transactional requirements.
[Link]: They collect data from different operational sources like transactional,
ERP, CRM and so on for which the data is denormalized mostly.
[Link]: Other Decision Support System is utilized for short-term decisions and
comprises inventory control, customers’ orders and control of employee scheduling.
[Link]: Generally, ODSs do not retain data history; they provide real-time data
for fast read and write operations and are data marts for loading data into the EDW.

3. Data Mart
A Data Mart can be defined as an element of a Data Warehouse system designed to
hold data from a particular business division, department or user type. It is created to
serve the specific interests of a specific class of people.
Features of Data Mart
[Link]: Data marts are less comprehensive than EDWs and relevant to the
organization’s specific departments, such as sales, finance, or marketing.
[Link]: They routinely pull or gather data from the EDW and other operation
sources to develop a specific data set.
[Link]: Data marts are used to supply departmental reporting and analysis as the users
of this type of data get real-time access to the data that can be useful for their
organizational tasks.
[Link]: Data marts can be deployed based on the star schema or any other
simplification of a star-war structure. They can be operations-based or derived
(extracted from the operational system directly) or they can be dependent, that is,
retrieved from the EDW.
[Link]: They also introduce faster implementation and lower cost than building an
entire EDW to satisfy all the needs of an organization because they can address
departmental requirements instead.

4. Cloud Data Warehouses

Cloud Data Warehouses are Data Warehousing solutions that are located on the cloud
platform that offer a scalable platform for effective usage of data storage and analysis.
Features of Cloud Data Warehouses
[Link]: Cloud data warehouses mean that you can adjust the storage or
computational power as required to fit various loads and data amounts.
[Link]-Effectiveness: It is a logical model because it entails organizations making
payments only to the extent of the services they want, therefore departing from huge
initial capital investments.
[Link]: Cloud service providers take care of the infrastructure, and manage
their upgrades and security thus freeing managers to give their time to analysis on data
patterns.
[Link]: They enable the easy integration of different forms of data, the cloud, and
other applications to help with data ingestion.
[Link]: Cloud data warehouses facilitate several analytics, including operational
processing, near real-time, machine learning, and complex business analytics. Some
examples include Amazon Redshift, Google BigQuery, and most recently, Snowflake.

5. Big Data Warehouses


Big Data Warehouses are advanced preparation instruments for data warehousing to
address gigantic volumes of structured and unstructured data that are created with
velocity.
Features of Big Data Warehouses
[Link], Variety, and Velocity: They can analyze and retain wide arrays of data
collected from social media platforms, IoT devices, logs, and many others.
[Link]: Large buckets usually take advantage of distributed computing software
like Hadoop and Apache Spark in technologies like HDFS, No SQL database, and
columnar storage.
[Link]: They enable data mining, machine learning and real-time analysis
techniques that are used in decision-making from the big data.
[Link]: Most big data platforms and tools interconnect with various big data
warehouses that facilitate data ingestion, transformation, as well as analysis.
[Link] Cases: They are applied in areas like finance, healthcare, retail/wholesale,
telecom, and oil & gas, to name a few: consumer behaviour analysis, fraud detection,
predictive maintenance, and the like.
6. Virtual Data Warehouse

A Virtual Data Warehouse provides a logical view of data from multiple sources
without physically storing the data in a central location.
Features of Virtual Data Warehouses:
[Link] Integration: Combines data from various sources without physical movement.
[Link]-time Access: Provides up-to-date information directly from source systems.
[Link] Storage Costs: Minimizes the need for additional storage infrastructure.
4Flexibility: Easily adapts to changes in source systems.

7. Hybrid Data Warehouse

A Hybrid Data Warehouse combines on-premises and cloud-based data storage and
processing capabilities.
Features of Hybrid Data Warehouses:
[Link]: Allows organizations to keep sensitive data on-premises while leveraging
cloud benefits.
[Link]: Can scale resources up or down based on needs.
[Link]-Effective: Balances the benefits of cloud and on-premises solutions.
[Link] Sovereignty: Helps comply with data residency requirements.

8. Real-time Data Warehouse

A Real-time Data Warehouse is designed to process and analyze data as it's generated,
providing immediate insights.
Features of Real-time Data Warehouses:
[Link] Data Processing: Analyzes data in real-time or near-real-time.
[Link] Latency: Minimizes delay between data generation and availability for analysis.
[Link] Data Support: Handles continuous data streams effectively.
[Link]-Sensitive Decision Support: Enables quick responses to changing conditions.

Benefits of Data Warehouses

[Link] Decision Making


Provides accurate and consolidated data for better strategic decisions.

[Link] Data Storage


Integrates data from different sources into one central repository.

[Link] Data Analysis


Stores large amounts of historical data for trend and pattern analysis.
[Link] Query Performance
Optimized for reporting and analytical queries.

[Link] Data Quality


Data is cleaned, transformed, and standardized before storage.

[Link] Business Intelligence (BI)


Enables reporting, dashboards, and data mining.

[Link] of Information
Ensures the same data definitions are used across the organization.

[Link] Saving for Users


Reduces time spent searching and preparing data from multiple systems.

Disadvantages of data warehouse

[Link] Initial Cost


Requires expensive hardware, software, and skilled personnel to design and implement.

[Link] Design and Maintenance


Building and maintaining a data warehouse is technically complex and time-consuming.

[Link] Implementation Time


It may take months or years before the data warehouse becomes fully operational.

[Link] Updating Delays


Data is not always real-time; updates are often done periodically, causing delays.

[Link] Skilled Staff


Needs experts like data engineers, DBAs, and analysts, which increases operational cost.

[Link] Integration Challenges


Combining data from different sources can cause data inconsistency and quality issues.

[Link] Overhead
Large volumes of historical data require significant storage space.

[Link] Risks
Centralized data storage can be a target for cyberattacks if not well secured.

APPLICATION OF DATA WAREHOUSE

[Link] Decision Support


Helps managers analyze sales, costs, and profits to make strategic decisions.
[Link] and Marketing Analysis
Used to study customer behavior, sales trends, and campaign performance.

[Link] Reporting and Analysis


Supports budgeting, forecasting, auditing, and financial performance analysis.

[Link] Management
Used to analyze patient records, treatment outcomes, and hospital performance.

[Link] and Finance


Helps in risk analysis, fraud detection, and credit evaluation.

[Link] Chain and Inventory Management


Analyzes stock levels, supplier performance, and logistics efficiency.

[Link] and Research


Used to analyze student performance, enrollment trends, and academic research data.

[Link] Industry
Analyzes call records, network usage, and customer churn.

[Link] and Public Sector


Used for policy planning, census analysis, and public service improvement.

Client/Server Computing Model and Data Warehousing


In the context of data warehousing, the client/server model facilitates the interaction
between the end-user and the warehouse. Here's how it fits together:
 Client: The client is typically a front-end application or tool that users interact
with. This could be a business intelligence (BI) tool, a web interface, or any application
used to query and analyze data from the warehouse.
 Server: The server is the data warehouse itself, or more precisely, the database
management system (DBMS) that hosts the data warehouse. The server processes the
requests made by the client, performs the necessary computations or data retrievals, and
sends back the results.
How it Works:
1. User Query: A user (client) wants to analyze some data, so they use their BI tool or
application to submit a query.
2. Request Transmission: The query is sent from the client to the server over a network.
This request might include SQL queries or other data retrieval commands.
3. Processing: The server receives the request and processes it. This could involve
searching through large volumes of data, aggregating results, or performing complex
calculations.
4. Result Transmission: Once the server has processed the request, it sends the results
back to the client over the network.
5. Display: The client receives the data and presents it to the user, often in the form of
reports, charts, or dashboards.
Advantages:
 Efficiency: The server handles the heavy lifting of data processing, allowing
clients to be lightweight and fast.
 Scalability: Additional servers can be added to handle more requests or larger
datasets.
 Security: Centralized control over data access and management enhances
security.
Example in Practice:
Imagine a retail company with a data warehouse that stores sales data. Analysts at the
company use BI tools on their laptops (clients) to query the data warehouse (server).
They might ask questions like, "What were the total sales last month?" The query is sent
to the server, processed, and the result is sent back to the client's BI tool, whi
ch then displays a report showing the sales figures.
Relatioship between client/server computing model and data warehousing

Data warehousing heavily relies on the client/server computing model to facilitate


efficient storage and analysis of data. In this model, the server stores all data collected
from various sources, including both historical and current information, ensuring that the
data is secure, well-maintained, and easily accessible. Meanwhile, clients, such as users
accessing business intelligence (BI) tools, dashboards, or analytical applications, send
requests to the server to retrieve information, generate reports, or perform data analysis.

Each request from a client is processed by the server, which handles the heavy
computation tasks, such as aggregating data, analyzing trends, or generating detailed
reports. This setup allows multiple users to access the data warehouse simultaneously
without overloading individual client machines, while the server ensures data integrity,
security, and proper management.
Therefore, the client/server model supports data warehousing by providing centralized
control, secure storage, efficient processing, and scalable access. This architecture makes
it easier for organizations to make data-driven decisions accurately and in a timely
manner, as users can access and analyze large volumes of data without compromising
performance or security.

Benefits of Client/Server in Data Warehousing

Scalability: Server can handle thousands of simultaneous analytical queries from


multiple clients.

Performance: Servers optimized for query processing (OLAP engines) reduce


client-side load.

Centralized Maintenance: Updates, backups, and ETL processes are done at the
server level.

Accessibility: Users (clients) can access warehouse data from desktops, web
apps, or mobile devices.

Security: Sensitive warehouse data is protected on the server, and clients only see
authorized views.

PARALLEL PROCESSORS
parallel processor system is a computing setup where multiple processors work
simultaneously to perform computations on different parts of a task. Instead of executing
one instruction at a time (like in serial processing), tasks are divided and executed
concurrently.

How it works:

 The dataset is split into smaller chunks.


 Each processor handles a portion of the data simultaneously.
 After computation, results from all processors are combined to produce the final
output.

Architecture Types:

a) Shared Memory Systems:

 All processors share the same memory space.


 Easy to program but can have memory bottlenecks if many processors try to access
memory simultaneously.
 Example: Multi-core CPUs used for running parallel algorithms like parallel decision
tree construction.

b) Distributed Memory Systems (Message Passing):

 Each processor has its own memory.


 Processors communicate by sending messages.
 Suitable for large-scale datasets that cannot fit in a single memory.
 Example: MPI (Message Passing Interface) used in scientific data mining
applications.

Advantages in Data Mining:

 Faster processing of large datasets.


 Efficient for computationally heavy algorithms, e.g., clustering (K-means), frequent
itemset mining, classification (random forests).
 Can handle high-dimensional data efficiently.

Example in Data Mining:

Suppose we want to run K-means clustering on a dataset of 1 million records:

I. Divide dataset into 4 parts for 4 processors.


II. Each processor computes distances to cluster centroids for its portion.
III. Results are combined to update centroids globally.
IV. Repeat until clusters stabilize.

CLUSTER SYSTEMS

A cluster system (or computing cluster) is a set of independent computers (nodes)


connected via a network, working together as a single powerful system for processing
tasks.

How it works:

 Data is distributed across nodes.


 Each node performs local computation on its part of the data.
 Results from nodes are aggregated to produce the final data mining output.

Types of Clusters:

a) High-Performance Computing (HPC) Cluster:

 Optimized for speed and computation-heavy tasks.


 Used in scientific data mining and simulation.
b) High-Availability Cluster:

 Focuses on reliability and fault tolerance rather than speed.


 Useful when continuous data mining tasks must run without interruption.

Advantages in Data Mining:

 Can process massive datasets that exceed a single machine’s capacity.


 Offers fault tolerance: if one node fails, others continue processing.
 Scalable: nodes can be added to handle more data.

Example in Data Mining:

Imagine mining social media data with billions of records:

 Data is stored on multiple nodes in a cluster system.


 Each node runs parallel mining tasks (e.g., sentiment analysis).
 Results are combined to generate overall insights.

Popular frameworks like Apache Hadoop and Apache Spark implement cluster-
based data mining with distributed computation.

Key Differences Between Parallel Processors and Cluster Systems

Feature Parallel Processor Cluster System

Definition Multiple processors in one Multiple independent


system sharing memory or computers working
network together

Memory Shared or distributed Distributed per node

Communication Fast, via shared memory or Slower, via network


network messages

Scalability Limited by physical


Highly scalable
system

Cost Usually expensive for Can use commodity


large systems hardware

Use Case in Data Mining Speeding up computation- Handling massive datasets


heavy tasks beyond single system
DISTRIBUTED DBMS (DDBMS) IMPLEMENTATION

A Distributed Database Management System (DDBMS) is a database system in which


the data is distributed across multiple physical locations (computers, servers, or sites)
but appears to the user as a single logical database.

 Users can query the database without worrying where the data is physically stored.
 The system handles data distribution, replication, and communication between
sites.

Key Features of DDBMS:

1. Data Distribution:

 Data can be fragmented (horizontal or vertical) across multiple sites.


 Example: Customer data for a retail chain could be horizontally partitioned by
region (East, West, North, South).

2. Transparency:

 Location transparency: Users don’t need to know where the data resides.
 Replication transparency: Users don’t need to know if data is duplicated across
sites.
 Concurrency transparency: Multiple users can access the same data without
conflict.

3. Autonomy:

 Each site can operate independently, even if other sites fail.

4. Replication:

 Frequently accessed data can be replicated across sites to reduce query time.

Types of DDBMS Implementations:

1. Homogeneous DDBMS:

 All sites use same DBMS software.


 Example: All nodes run Oracle Database with distributed tables.

2. Heterogeneous DDBMS:

 Sites may use different DBMS software (Oracle, MySQL, SQL Server).
 Requires middleware to translate queries between systems.
 Example in Data Mining/Data Warehousing:

 A company wants to analyze sales data across multiple regions.


 Each regional office stores data locally.
 A distributed DBMS allows the central office to query all regions as if it’s a single
database.
 Mining algorithms (like association rule mining) can run on distributed datasets
simultaneously.

CLIENT/SERVER RDBMS SOLUTION

A Client/Server RDBMS is a database system architecture in which:

 Client: The application or user interface that requests data.


 Server: The database engine that processes queries, manages data, and returns
results.

This architecture separates processing between clients and server, improving efficiency
and scalability.

Components:

 Client:

 Sends queries to the server.


 Can perform presentation and some processing (like filtering results).
 Example: A web application interface displaying customer orders.

 Server:

 Handles data storage, query processing, and transaction management.


 Runs the DBMS software (Oracle, MySQL, SQL Server).

How Client/Server RDBMS Works:

 Client sends SQL query to server.


 Server executes query on the database.
 Server returns results to client.
 Client displays results or performs additional computations.

Advantages in Data Mining/Data Warehousing:

I. Centralized control: Data is managed centrally, improving security.


II. Efficiency: Heavy computation (like aggregation, sorting, and joins) is done on the
server.
III. Scalability: Multiple clients can connect to the server simultaneously.
IV. Integration with distributed systems: Can be part of a distributed RDBMS.

Example in Data Mining:

Suppose we have a sales data warehouse:

 Clients: Sales analysts using BI tools or data mining software.


 Server: Central warehouse database storing all transactional data.
 Analysts run clustering or predictive analytics queries from client tools.
 Server processes queries efficiently and returns results.

Combining DDBMS with Client/Server Architecture

Modern distributed databases often use client/server architecture:

 Clients access the distributed database from anywhere.


 Servers at different locations manage local data and coordinate with other servers.

Frameworks like Oracle RAC (Real Application Clusters), MySQL Cluster,


and PostgreSQL with replication are real examples.

Summary of Key Points

Concept Key idea

DDBMS implementation Database is distributed across multple


sites; provide transparency, replication,
nad autonomy.

Homogeneous vs heterogeneous Same DBMS at all sites vs different


DBMS at site.

Client/Server RDBMS solution Clients request data; server processes


queries and return results.

Advantages Scalability, efficiency, centralized contol,


fault tolerance (DDBMS), easy query
access.

Data mining application Parallel mining of distributed datasets,


centralized analysis using client tools
COMPONENTS OF DATA WAREHOUSE
a) Data Sources: These are the various operational systems, databases and external data
feeds that provide raw data to be stored in the warehouse.
b) ETL (Extract, Transform, Load) Process: The ETL process is responsible for
extracting data from different sources, transforming it into a suitable format and
loading it into the data warehouse.
c) Data Warehouse Database: This is the central repository where cleaned and
transformed data is stored. It is typically organized in a multidimensional format for
efficient querying and reporting.
d) Metadata: Metadata describes the structure, source and usage of data within the
warehouse, making it easier for users and systems to understand and work with the
data.
e) Data Marts: These are smaller, more focused data repositories derived from the data
warehouse, designed to meet the needs of specific business departments or functions.
f) OLAP (Online Analytical Processing) Tools: OLAP tools allow users to analyze data
in multiple dimensions, providing deeper insights and supporting complex analytical
queries.
g) End-User Access Tools: These are reporting and analysis tools, such as dashboards
or Business Intelligence (BI) tools, that enable business users to query the data
warehouse and generate reports.
HOW TO BUILD A DATA WAREHOUSE
Building a data warehouse is a multi-step project that transforms raw data from various
source systems into a structured, unified repository optimized for business intelligence
(BI) and reporting.
Here is a step-by-step guide to the process:
1. Planning and Requirements Gathering (The "Why")
Define Business Objectives
The Goal: Clearly define what the data warehouse needs to achieve. Consult with
business users, managers, and stakeholders to identify key business questions, required
reports, and pain points.
Key Questions: What metrics (KPIs) are crucial? What decisions will be driven by this
data? (e.g., “We need to analyze sales trends by region and product category.”)
Deliverables: A formal document outlining the business requirements and a clear scope
for the initial phase.
Evaluate Data Sources
The Goal: Identify all relevant source systems (e.g., CRM, ERP, transactional databases,
external files, APIs).
Assessment: Evaluate the quality, volume, velocity, and structure of the data in each
source. This helps determine the complexity of the data integration process.
2. Architectural Design (The "Blueprint")
Choose Data Warehouse Architecture
Decide on the overall structure, typically a variation of the three-tier architecture:
Bottom Tier: Data Sources and the physical data warehouse storage (often a relational or
cloud-native database like Snowflake, Big Query, or Redshift).
Middle Tier: The OLAP (Online Analytical Processing) server, which presents a
dimensional view of the data.
Top Tier: Client/Front-end layer (BI tools like Tableau, Power BI, Looker).
Design the Data Model
This is one of the most critical steps, as it structures the data for analytical performance.
Star Schema: The most popular model, featuring a central Fact Table (containing
metrics/measures) connected to multiple Dimension Tables (containing context like
Customer, Date, Product). This design prioritizes simple, fast querying.
Snowflake Schema: A more complex version where dimension tables are further
normalized.
Granularity: Define the lowest level of detail you will store (e.g., storing data at the
individual transaction level vs. the daily summary level).
3. Technology and Infrastructure
Select the Technology Stack
Choose the core components based on your budget, scalability needs, and existing
infrastructure:
Data Warehouse Platform: Cloud (Snowflake, BigQuery, Redshift) vs. On-Premise
(Teradata, Oracle). Cloud solutions offer elasticity and managed services.
Data Integration Tool: ETL (Extract, Transform, Load) or ELT tools (e.g., Fivetran,
Informatica, custom Python/Spark).
BI/Analytics Tools: Front-end tools for end-users.
4. Data Integration (The "Pipeline")
Implement ETL/ELT Processes
This is where you build the pipelines that move data from sources into the warehouse.
Extract: Pull raw data from the source systems.
Transform: Clean, standardize, aggregate, and apply business logic to the data. This is
crucial for ensuring data consistency across all sources.
Load: Insert the transformed data into the data warehouse tables, often using incremental
loading (only loading new/changed data) for efficiency.
Maintain Data Quality and Governance
Quality: Implement checks and balances during the transformation phase to validate
data, handle missing values, and resolve inconsistencies.
Governance: Define security policies, access controls (Role-Based Access Control), data
retention policies, and compliance measures (like GDPR or HIPAA).
5. Deployment, Testing, and Monitoring
Deploy and Test
Testing: Conduct unit, integration, and User Acceptance Testing (UAT). UAT is essential
for ensuring the data warehouse answers the business questions defined in Step 1 and that
end-users trust the data.
Performance Tuning: Optimize tables, indexing, and queries to ensure fast reporting.
Partitioning large tables by date is a common technique to improve query speed.
Launch and Monitor
Launch: Deploy the data warehouse to production. Often this is done in an iterative
(agile) approach, starting with a single Data Mart (a subset of the warehouse focused on a
specific department, like Sales).
Monitor: Implement automated monitoring and alerting for the data pipelines, warehouse
performance, and data quality to ensure high data freshness and system availability.
MAPPING THE DATA WAREHOUSING TO A MULTIPROCESSOR
ARCHITECTURE.
Mapping data warehousing to a multiprocessor architecture involves designing the data
warehouse system in such a way that it utilizes multiple processors to perform parallel
processing. This significantly improves performance, especially when handling large
volumes of data, complex queries, and intensive analysis typical in data mining and
decision support systems.
Data warehouses store and process huge datasets. Running queries or mining such large
volumes on a single processor would be slow and inefficient. Multiprocessor systems
enable parallelism, where tasks are divided across several CPUs, speeding up operations
like ETL (Extract, Transform, Load), indexing, query processing, and analytics.
There are three main types of multiprocessor architectures commonly used:
a) Shared Memory Architecture:
b) Shared Disk Architecture:
c) Shared Nothing Architecture:

a. Shared Memory Architecture

Shared Memory Architecture- In Shared Memory Architecture, there are multiple CPUs
that are attached to an interconnection network. They are able to share a single or global
main memory and common disk arrays. It is to be noted that, In this architecture, a single
copy of a multi-threaded operating system and multithreaded DBMS can support these
multiple CPUs. Also, the shared memory is a solid coupled architecture in which multiple
CPUs share their memory. It is also known as Symmetric multiprocessing (SMP). This
architecture has a very wide range which starts from personal workstations that support a
few microprocessors in parallel via RISC.
Shared Memory Architecture.

Advantages :
 It has high-speed data access for a limited number of processors.
 The communication is efficient.
Disadvantages :
 It cannot use beyond 80 or 100 CPUs in parallel.
 The bus or the interconnection network gets block due to the increment of the large
number of CPUs.
b. Shared Disk Architectures :
In Shared Disk Architecture, various CPUs are attached to an interconnection network. In
this, each CPU has its own memory and all of them have access to the same disk. Also,
note that here the memory is not shared among CPUs therefore each node has its own
copy of the operating system and DBMS. Shared disk architecture is a loosely coupled
architecture optimized for applications that are inherently centralized. They are also
known as clusters.

Shared Disk Architecture


Advantages :
 The interconnection network is no longer a bottleneck each CPU has its own
memory.
 Load-balancing is easier in shared disk architecture.
 There is better fault tolerance.
Disadvantages :
 If the number of CPUs increases, the problems of interference and memory
contentions also increase.
 There's also exists a scalability problem.
c. Shared Nothing Architecture :
Shared Nothing Architecture is multiple processor architecture in which each processor
has its own memory and disk storage. In this, multiple CPUs are attached to an
interconnection network through a node. Also, note that no two CPUs can access the
same disk area. In this architecture, no sharing of memory or disk resources is done. It is
also known as Massively parallel processing (MPP).

Shared Nothing Architecture


Advantages :
 It has better scalability as no sharing of resources is done
 Multiple CPUs can be added

Disadvantages:
 The cost of communications is higher as it involves sending of data and software
interaction at both ends
 The cost of non-local disk access is higher than the cost of shared disk architectures.
Note that this technology is typically used for very large databases that have the size of
10powe12 bytes or TB or for the system that has the process of thousands of transactions
per second.
d. Hierarchical Architecture :
This architecture is a combination of shared disk, shared memory and shared nothing
architectures. This architecture is scalable due to availability of more memory and many
processor. But is costly to other architecture.

Example Used in Data Mining:


Imagine a retail company with millions of transactions. When mining data to find
patterns (e.g., association rules like "people who buy X also buy Y"), the dataset is split
across multiple nodes in a shared-nothing architecture. Each node processes its portion
independently and results are merged.
This makes the mining process faster and scalable — enabling the discovery of customer
behavior patterns in near real-time.
Multiprocessor architecture emphasizes the hardware and system-level design that allows
a data warehouse to perform parallel processing, improving speed and scalability when
handling large data volumes. It deals with how the system executes queries and data
operations efficiently across multiple CPUs, using architectures like shared memory,
shared disk, or shared nothing
DBMS SCHEMAS FOR DECISION SUPPORT.
A DBMS (Database Management System) schema for decision support is a structured
organization of data optimized for complex queries and analytical processing, not for
transactional operations. These schemas form the backbone of data warehouses,
supporting decision-making through OLAP (Online Analytical Processing) and data
mining.
In decision support systems, the goal is not just to store data but to analyze and retrieve
insights. Therefore, the schema design must support efficient querying, aggregation, and
reporting.
There are three common schema types:
a) Star Schema:
 Central fact table connected to multiple dimension tables.
 Fact table contains measurable data (e.g., sales).
 Dimension tables contain descriptive attributes (e.g., product, time, region).
 Simplifies queries and is widely used in OLAP tools.
Example: In a sales data warehouse:
 Fact table: Sales (ProductID, StoreID, DateID, SalesAmount)
 Dimension tables: Product, Store, Time

b) Snowflake Schema:
 An extension of the star schema where dimension tables are normalized.
 Reduces redundancy but increases complexity.
 Used when data integrity and storage efficiency are priorities.
 Example:
The Product dimension in the star schema may be split into Product, Category, and
Supplier tables in a snowflake schema.
c) Fact Constellation (Galaxy) Schema:
 Multiple fact tables share dimension tables.
 Suitable for complex business models.
Example: A company tracks both Sales and Shipments as separate fact tables, but both
share Product and Time dimensions.
Use in Data Mining and Decision Support:
 A business analyst runs a query to compare quarterly sales performance by region.
 Thanks to the star schema, the system performs fast aggregations across the Time
and Region dimensions.
 Data mining algorithms then find trends or anomalies like a drop in sales in a specific
region, helping guide business strategy.
DBMS schemas for decision support are focused on the logical design of the database
specifically how data is structured and organized to support complex queries, reporting,
and analysis. These schemas (such as star, snowflake, or fact constellation) ensure that
data is arranged in a way that makes it easy for users and analytical tools to extract
insights. While multiprocessor architecture addresses performance and execution, DBMS
schemas deal with data modeling and accessibility for informed decision-making.
For example, a star schema for a hospital might include a fact table for treatments and
dimensions for patients, doctors, diseases, and time.
DATA EXTRACTION, CLEAN-UP, & TRANSFORMATION
These three steps are the fundamental stages for moving data from a source system into a
destination system (like a data warehouse or data lake) in a usable format.
1. Data Extraction
This is the process of retrieving or 'sourcing' data from its original location (the source
system).
 Goal: Read the required data and prepare it for the next stage.
 Sources: Can be diverse, including relational databases (e.g., Oracle, SQL
Server), non-relational databases (NoSQL), flat files (CSV, JSON, XML), SaaS
applications (Salesforce, SAP), and streaming data (e.g., IoT sensors).
 Techniques:
 Full Extraction: Extracting all data from the source, typically used for initial
setup.
 Incremental Extraction: Extracting only the data that has changed since the
last extraction (e.g., using timestamps or change data capture/CDC).
2. Data Clean-up (or Cleansing)
This is a key part of the Transformation step, focusing on ensuring the data is accurate,
consistent, and reliable.
 Goal: Identify and fix errors, inconsistencies, and missing values to improve data
quality.
 Tasks include:
 Handling Missing Values: Filling them in with defaults, calculating averages,
or removing the entire record.
 Standardization: Ensuring data is in a common format (e.g., phone number
formats, date formats).
 De-duplication: Identifying and merging identical records.
 Identifying Outliers/Errors: Correcting or flagging data that falls outside
acceptable business rules (e.g., a person's age is 200).
3. Data Transformation
This is the heart of the process, where extracted and cleaned data is converted into the
format and structure required for the target system.
 Goal: Apply business rules and prepare data for analysis.
 Tasks include:
 Data Aggregation: Summarizing data (e.g., calculating total daily sales from
individual transactions).
 Data Derivation: Creating new values from existing data (e.g., calculating Age
from Date of Birth).
 Key Construction: Creating consistent primary/foreign keys for the target
system (especially in data warehousing).
 Data Integration: Merging data from multiple sources into a single, cohesive
dataset.
Tools (Etl/Elt Platforms)
Tools designed to manage this process are called ETL (Extract, Transform, Load) or ELT
(Extract, Load, Transform) tools.

Tool Category Example Tools Description

Enterprise/Commercial Informatica Power Center, Comprehensive, high-cost,


Talend, IBM Info Sphere high-governance platforms
Data Stage, Microsoft often used by large
SSIS, Oracle Data enterprises.
Integrator (ODI)

Cloud-Native / Modern Azure Data Factory Tools built for cloud


(ADF), AWS Glue, environments, often
Google Cloud Dataflow, supporting ELT (loading
Five tran, Matillion first, then transforming
within the data warehouse).

Open Source / Code-Centric Apache Airflow, Apache Offer flexibility and


NiFi, Airbyte, dbt (Data customizability. dbt is
Build Tool) popular for SQL-based
transformation in the ELT
paradigm.

METADATA
What is Metadata?
Metadata refers to data that provides information about other data. It describes the
structure, content and characteristics of a file, document, image or dataset, helping users
and systems understand and manage data more effectively.
 It gives context and meaning to raw data.
 It can include details like author name, creation date, file size and format.
 It improves data organization, searchability and accessibility.
 It’s widely used in databases, websites, digital media and data analytics.
Importance
Metadata is crucial because it adds context to raw data. Without metadata, understanding
the meaning, source, or reliability of data would be difficult. It is especially important in:
 Data analysis for ensuring accurate interpretation.
 Web development for improving search engine optimization (SEO).
 Database systems for efficient querying and data management.
 Digital libraries for cataloging and indexing information resources.
Example
Suppose you have an image file named [Link]. The metadata for this image might
include:
 File size: 3 MB
 Resolution: 1920 × 1080
 Date created: 25-Oct-2025
 Camera model: Canon EOS 80D
Here, metadata tells you details about the file, not the content of the image itself.
Types of Metadata
Metadata is generally divided into five main categories:
Types of Metadata
a) Preservation Metadata: Ensures long-term access to digital assets by storing details
about file formats, preservation methods and migration strategies.
b) Descriptive Metadata: Describes content details like title, author, keywords and date
to help users identify and locate resources.
c) Technical Metadata: Covers technical details such as file type, size, resolution and
software used for creation or editing.
d) Structural Metadata: Shows how data is organized and related helping users navigate
chapters, sections or multimedia parts.
e) Administrative Metadata: Includes management details like ownership, creation date,
access rights and preservation policies
Functions_Of_Metadata
 Identification: Assigns unique identifiers or tags to each resource for easy
recognition and retrieval.
 Preservation: Ensures the long-term integrity and usability of digital content through
preservation records.
 Discovery: Uses descriptive attributes like keywords or subjects to make content
easier to find.
 Navigation: Helps organize complex datasets and allows users to move easily
through structured content.
 Interoperability: Enables seamless data exchange between systems using standard
metadata formats and vocabularies.
Challenges
Some of the challenges with metadata are:
 Inconsistent Labeling: Different systems or users may describe the same data in
varying ways making it hard to standardize.
 Missing Metadata: If metadata is incomplete or absent, finding, managing and
interpreting data becomes difficult.
 Privacy Concerns: Metadata can reveal sensitive information such as location,
creation details or ownership, raising privacy and security risks.
 Maintenance Complexity: Keeping metadata accurate and latest across large datasets
requires ongoing effort and resources.
Meta Data in Data Warehousing.
Metadata in data warehousing: ls data that describes other data. It provides information
about the structure, meaning, origin, and usage of data stored in a data warehouse. Instead
of containing actual business data, metadata acts as a guidebook that helps users and
systems understand how data is organized and how it should be used.
For example: if a customer's purchase amount is the data, then the information about
when the purchase was made, the data type, and the source system is its metadata.
Examples of Metadata
a. File metadata: This includes information about a file, such as its name, size, type,
and creation date.
b. Image metadata: This includes information about an image, such as its resolution,
color depth, and camera settings.
c. Music metadata: This includes information about a piece of music, such as its title,
artist, album, and genre.
d. Video metadata: This includes information about a video, such as its length,
resolution, and frame rate.
e. Document metadata: This includes information about a document, such as its author,
title, and creation date.
f. Database metadata: This includes information about a database, such as its structure,
tables, and fields.
g. Web metadata: This includes information about a web page, such as its title,
keywords, and description.
Consider a table named Sales_Fact:

Metadata Element Description

Table Name Sales_Fact

Column Name Total_Amount

Data Type DECIMAL(10,2)

Source System CRM System

Load Frequency Daily

This metadata explains how the actual data should be interpreted and maintained.

Types of Metadata in Data Warehousing


Metadata in a data warehouse is commonly divided into three main categories:
1. Business Metadata
Business metadata describes data in business-friendly terms so that non-technical users
can understand it easily.
Examples:
 Meaning of a column like Customer_ID
 Business rules for calculating "Total Sales"
 Report definitions and KPIs
2. Technical Metadata
Technical metadata focuses on the physical and structural aspects of data.
Examples:
 Table and column names
 Data types (VARCHAR, INT, DATE, etc.)
 Indexes, partitions, and storage locations
3. Operational Metadata
Operational metadata tracks process-related information about how data moves and
changes.
Examples:
 Data load timestamps
 ETL job status (success/failure)
 Error logs and refresh cycles
How Metadata is Used in Data Warehousing
Metadata supports various warehouse operations, such as:
 ETL processes: mapping source fields to target warehouse fields
 Query optimization: helping database engines choose efficient execution plans
 Data lineage tracking: identifying source systems and transformations
 Impact analysis: understanding how changes affect downstream reports
This makes system maintenance and troubleshooting much easier.
Metadata Repository
A metadata repository is a centralized storage where all metadata is stored and managed.
It acts as a reference point for:
 ETL tools
 BI tools
 Data analysts and engineers
 Data governance teams
This repository ensures consistency and standardization across the data warehouse.
REFERENCES

1. Inmon, W. H. (2005). Building the Data Warehouse (4th ed.). John Wiley
& Sons.
2. Kimball, R., & Ross, M. (2013). The Data Warehouse Toolkit: The
Definitive Guide to Dimensional Modeling (3rd ed.). Wiley.
3. Golfarelli, M., & Rizzi, S. (2009). Data Warehouse Design: Modern
Principles and Methodologies. McGraw-Hill.
4. Han, J., Kamber, M., & Pei, J. (2012). Data Mining: Concepts and
Techniques (3rd ed.). Morgan Kaufmann.
5. Elmasri, R., & Navathe, S. B. (2016). Fundamentals of Database Systems
(7th ed.). Pearson.
6. Chaudhuri, S., & Dayal, U. (1997). An Overview of Data Warehousing
and OLAP Technology. ACM SIGMOD Record.
7. Ponniah, P. (2010). Data Warehousing Fundamentals for IT Professionals
(2nd ed.). Wiley.

You might also like