0% found this document useful (0 votes)
70 views79 pages

CS614 Data Warehousing Lecture Notes

The document provides comprehensive notes on data warehousing, detailing its evolution from the industrial age to the information age, emphasizing the importance of data warehousing in managing and analyzing vast amounts of data for informed decision-making. It outlines the key features of data warehouses, such as their ability to integrate historical data from heterogeneous sources, support complex queries, and provide a complete organizational view. Additionally, the document includes multiple-choice questions and short answers to reinforce understanding of the concepts presented.

Uploaded by

Noble Ch
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
70 views79 pages

CS614 Data Warehousing Lecture Notes

The document provides comprehensive notes on data warehousing, detailing its evolution from the industrial age to the information age, emphasizing the importance of data warehousing in managing and analyzing vast amounts of data for informed decision-making. It outlines the key features of data warehouses, such as their ability to integrate historical data from heterogeneous sources, support complex queries, and provide a complete organizational view. Additionally, the document includes multiple-choice questions and short answers to reinforce understanding of the concepts presented.

Uploaded by

Noble Ch
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

CS614 Midterm Simplified Lectures Notes

Lecture 1

Simplified Explanation (With All Important Concepts)


🌍 The Changing World and the Right Direction

 The world has shifted from industrial age to the information age.
 Now, countries progress using information and knowledge, not just physical resources.
 Many countries (like India) succeeded by using modern technologies. If we follow
outdated tools, we’ll fall behind.
 Right direction = Using the data we have collected to generate insights and decisions.

❓ Why Do We Need Data Warehousing?

 We are surrounded by too much data, but we often lack useful information.
 There's a saying: "Drowning in data, starving for information."
 To solve this, we need systems that can store, manage, and analyze data efficiently —
that’s why Data Warehousing comes in.

🔄 From Data to Intelligence (Figure 1.1)

 Data: Raw facts (e.g., transaction records).


 Information: Processed data (e.g., total sales).
 Knowledge: Application of information (e.g., which products are best sellers).
 Intelligence: Understanding the reasons behind knowledge (e.g., why those products are
top sellers).
 Wisdom: Using intelligence for goals (e.g., plan better marketing).

Historical Background

1960s – COBOL & Master Files:

 Data stored in magnetic tapes – cheap but slow and unreliable.


 Problems: Redundancy, slow access, and complexity.

1970s – DASD & DBMS:

 DASD (Direct Access Storage Devices) introduced.


 Data could be accessed directly and quickly.
 DBMS (Database Management Systems) helped manage data efficiently.
 Data was now treated as a "single source of truth" – essential for building a data
warehouse.

1980s – PCs & 4GL:

 PCs and 4th Generation Languages allowed managers to control data themselves.
 Introduced Decision Support Systems (DSS).
 Operational + Analytical Processing became possible.

1990s – Extract Programs & Spider Web:

 Simple Extract programs moved selected data from one place to another.
 Over time, too many extracts were made — extracts of extracts — causing data chaos.
 This uncontrolled setup became the "spider web" or legacy systems.

Crisis of Credibility (Figure 1.3)

 Due to different systems and data sources, departments gave conflicting reports (e.g.,
one shows +10% profit, another -10%).
 The CEO is confused. Which report to trust?
 This shows the danger of unreliable and scattered data sources.
 Solution: A centralized data warehouse that offers consistent, reliable, and
integrated information.

✅ 10 MCQs with Answers


1. What has the world economy shifted to?
A. Agricultural economy
B. Industrial economy
C. Knowledge economy
D. Mechanical economy
Answer: C. Knowledge economy
2. Why is data warehousing important?
A. To store paper records
B. To organize books
C. To process and convert data into useful information
D. To print reports
Answer: C. To process and convert data into useful information
3. What is data?
A. Final results
B. Raw facts and figures
C. Processed knowledge
D. Meaningful insights
Answer: B. Raw facts and figures
4. Which one comes after Information in the data hierarchy?
A. Intelligence
B. Wisdom
C. Data
D. Knowledge
Answer: D. Knowledge
5. What does DASD stand for?
A. Direct Automated Storage Device
B. Direct Access Storage Device
C. Digital Archive Storage Device
D. Data And System Disk
Answer: B. Direct Access Storage Device
6. Which of the following was a drawback of magnetic tapes?
A. High cost
B. No storage capacity
C. Sequential access only
D. Not available in 1960s
Answer: C. Sequential access only
7. What caused data redundancy in the 1960s?
A. Use of DASD
B. Use of a single database
C. Multiple master files
D. Decision Support Systems
Answer: C. Multiple master files
8. What does DBMS help with?
A. Printing reports
B. Controlling operating systems
C. Storing and managing data on DASD
D. Formatting Excel sheets
Answer: C. Storing and managing data on DASD
9. What was the extract program used for?
A. Deleting files
B. Transferring and filtering data
C. Sending emails
D. Compressing files
Answer: B. Transferring and filtering data
10. What is the main problem with legacy systems?
A. Easy to manage
B. Fast data access
C. Inconsistent and conflicting data
D. Based on cloud storage
Answer: C. Inconsistent and conflicting data
✍️5 Short Questions with Answers
1. What is the main reason for using a data warehouse?
➤ To convert large amounts of raw data into meaningful, consistent information for
decision-making.
2. What is meant by “Drowning in data, starving for information”?
➤ It means organizations collect too much raw data but fail to process it into useful
information.
3. What replaced magnetic tapes in the 1970s?
➤ DASD (Direct Access Storage Device) replaced magnetic tapes due to faster, direct
access.
4. What is the crisis of credibility in legacy systems?
➤ It occurs when different departments present conflicting reports due to using different
data sources, causing confusion and poor decision-making.
5. How does a DBMS help in data storage?
➤ It manages, indexes, and stores data efficiently on DASD, allowing better control and
less redundancy.

Lecture 2

Here is a simplified explanation of Lecture 2: Introduction to Data Warehousing – Part II,


along with 10 MCQs, 5 short questions, and their answers.

🎯 Simplified Explanation (Without Skipping Concepts)


🌐 Why Data Warehouse (DWH)?

1. Data is Growing Fast


Due to modern tech (Moore’s Law), CPU performance is increasing and cost is
decreasing—but storage is growing faster and getting cheaper even more rapidly.
Example: A company’s data may double each year.
2. Data Size Examples
o 1MB: Small novel
o 1GB: A pickup van filled with printed paper
o 1TB: Paper from 50,000 trees
o 2PB: All academic libraries in the U.S.
o 5EB: All words ever spoken
3. Data Warehouse ≠ Large Data Storage
Just storing huge data doesn’t mean you have a DWH. A true DWH is a system that
helps in decision-making, not just data storage.
📈 Importance of Historical Data

 “History is the best predictor of the future.”


By analyzing past data, organizations can predict trends and make smarter decisions.
But insight and experience are needed alongside data.

🏢 Total View of the Organization

 Operational systems (like savings account, ATM, credit card systems) are usually
separate.
 A DWH combines all this data to give a complete picture of the customer or business.

🧠 Intelligent Decision Support

 DWH helps in finding why something happened, what is happening, and what may
happen.
 Using data mining techniques (like clustering or classification), we can even predict
which customers may leave the business before they actually do.

🤖 Intelligent Enterprise vs Traditional DBMS

DBMS Questions Intelligent DWH Questions


List all items sold last month Which items sell together?
How many sales in January? What offers/discounts work best?
Items purchased by Ali? Which customers will respond to a campaign?

📊 Stages of Data Warehousing

1. What happened? (Basic reporting)


2. Why it happened?
3. What is happening now?
4. What will happen?
5. What do we want to happen? (Active DWH, real-time decisions)
🧾 Definition of a Data Warehouse
A complete repository of historical corporate data extracted from transaction systems that
is available for ad-hoc access by knowledge workers.

Key Components:

 Complete Repository: Covers all branches and possibly even archived or old system
data.
 Transaction System: Like MIS or OLTP systems.
 Ad-Hoc Access: Flexible, not fixed. Users generate queries as needed.
 Knowledge Workers: Executives, decision-makers (not clerks or programmers).

🧬 Another Definition View:

A DWH is:

 Subject-Oriented: Focus on areas like customer, product, not workflow.


 Integrated: Combines data from different sources/formats into one format.
 Time-Variant: Keeps history—every record has a time reference.
 Non-Volatile: Once data is entered, it is not deleted or changed.

✅ 10 Multiple Choice Questions (MCQs)


1. Which of the following is growing faster than CPU performance according to
Moore’s Law?
a) Network speed
b) Storage size and availability ✅
c) Internet bandwidth
d) Processor heat
2. Which of the following statements is true?
a) Larger data storage means it is a data warehouse
b) Historical data is not needed in DWH
c) A 500GB system can still be a DWH ✅
d) Only real-time data is stored in DWH
3. Why is historical data important in DWH?
a) To delete old records
b) To predict future events ✅
c) To increase performance
d) To reduce system size
4. Which user is likely to use ad-hoc access in a DWH?
a) IT support staff
b) Data entry clerks
c) CEOs and Marketing Managers ✅
d) Software testers
5. Data in DWH is NOT typically:
a) Time-variant
b) Non-volatile
c) Integrated
d) Frequently updated ✅
6. Which data mining technique helps predict customer behavior?
a) Formatting
b) Clustering ✅
c) Indexing
d) Scripting
7. The ability to answer ‘What will happen?’ is a sign of:
a) OLTP System
b) Active Data Warehouse ✅
c) Manual Report
d) Network analysis
8. Which of these is a core benefit of a DWH?
a) Data fragmentation
b) Partial business view
c) Total organizational view ✅
d) Static reports only
9. What is the role of knowledge workers in a DWH system?
a) Data entry
b) Programming
c) Decision-making ✅
d) Hardware repair
10. Which one is NOT a characteristic of a data warehouse?
a) Subject-oriented
b) Integrated
c) Volatile ✅
d) Time-variant

✍️5 Short Questions with Answers


1. Q: What makes a system a Data Warehouse, size or structure?
A: Structure and purpose. It must support decision-making using integrated historical
data.
2. Q: Why is history considered important in data warehousing?
A: Because it helps to analyze trends and predict future behavior.
3. Q: Who are the typical users of a Data Warehouse?
A: Knowledge workers like executives, managers, and analysts—not IT staff.
4. Q: What does ad-hoc access mean in a data warehouse?
A: Queries are not fixed and are generated as needed, without predefined paths.
5. Q: List any two characteristics of a data warehouse.
A: Time-variant and Non-volatile.

Lecture 3

Here’s a simplified explanation of Lecture 3: Introduction to Data Warehousing – Part III,


covering all key points and concepts, followed by 10 MCQs and 10 short questions with
answers.

✅ Simplified Explanation of Lecture 3


🔸 What is a Data Warehouse?

A Data Warehouse (DWH) is not a software you can install and start using. It is a system that
evolves over time by collecting, cleaning, and organizing data from different sources (called
heterogeneous sources), like:

 Different software
 Different databases
 Even external data (like industry price lists, customer contacts)

Since this data comes in different formats, it is cleaned, standardized, and integrated into one
uniform format before storing in the data warehouse.

🔸 Main Goal of a Data Warehouse:

To help decision-makers by providing big-picture analysis using huge amounts of historical and
current data. Queries here are complex and span across many tables – not just simple, single-
record lookups.

🔸 Key Features of Data Warehousing:

1. Collect all operational and external data.


2. Transform it into a uniform format.
3. Integrate it into one structure.
4. Store it for easy access and analysis.
5. Use special indexing (like bitmap indexes) for fast access.
6. Use efficient joins like hash joins or sort-merge joins (not nested-loop joins).
7. Run ad-hoc queries (random questions) with low selectivity (they fetch many records,
not just one or two).

🔸 Indexing in DWH:

Traditional indexes (like B-Trees) are not efficient for very large data because they cause too
many memory page faults. DWH uses smarter indexes that can sometimes answer queries
without accessing actual data tables.

🔸 Joins in DWH:

Unlike small OLTP systems, where few rows are accessed and nested loops work fine, DWH
handles millions of rows, so it needs faster join techniques like:

 Hash Join
 Sort-Merge Join

🔸 Selectivity Explained:

 Selectivity = (Unique Values) / (Total Values)


 For example: In a "Gender" column with 2 values out of 1000 rows → Selectivity =
2/1000 = 0.2
 In DWH, selectivity is low (many rows retrieved).
 In OLTP, selectivity is high (few rows retrieved).

🔸 Ad-hoc Queries:

In DWH, decision-makers ask random, unplanned questions, not fixed ones. These questions
change with time and can’t be answered using fixed reports.

🔸 Different Hardware Usage Patterns:

 Operational systems (OLTP): Constant and steady usage.


 Data Warehouses (DWH): Binary usage (either fully active or completely idle).

🔸 Bus vs Train Analogy:

 Bus = OLTP → Runs frequently, handles few users.


 Train = DWH → Runs rarely, carries many users/data.
 You can’t use the same system for both purposes. They serve different needs.

🔸 Historical Data in DWH:

DWH keeps historical data, which OLTP systems don’t keep for long. This data helps:

 Understand customer behavior


 Predict future trends
 Analyze seasonal changes

🔸 How much history to keep?

Depends on:

 Industry type
 Cost of storage
 Value of old data

Examples:

 Telecom: ~18 months (calls are frequent)


 Retail: 65 weeks (year + season for comparison)
 Insurance: 7+ years (for risk analysis)
 Banks: 3+ years

Older data is less valuable but still useful for trends.

📘 10 Multiple Choice Questions (MCQs)


1. What is the first step in building a data warehouse?
A. Data analysis
B. Collecting data from operational systems ✅
C. Creating reports
D. Data mining
2. What does “heterogeneous sources” mean in DWH?
A. Same format data
B. Different formats and systems ✅
C. Only external data
D. Standardized data
3. Which of the following is NOT a feature of DWH?
A. OLTP-style data entry ✅
B. Storing historical data
C. Uniform data format
D. Supporting decision making
4. Which join is NOT suitable for very large databases?
A. Hash Join
B. Sort-Merge Join
C. Nested-loop Join ✅
D. Index Join
5. What is the purpose of indexing in DWH?
A. To slow down access
B. To avoid joins
C. To quickly access data ✅
D. To delete duplicates
6. Which indexing method is more common in DWH?
A. B-Tree
B. Hash Index
C. Bitmap Index ✅
D. Linked Index
7. Selectivity in DWH is usually:
A. High
B. Low ✅
C. Fixed
D. Not applicable
8. In DWH, decision queries are mostly:
A. Specific and repeated
B. Based on primary keys
C. Ad-hoc and wide-ranging ✅
D. Rare
9. The 'Train vs Bus' analogy explains:
A. The difference in query type
B. Difference in storage size
C. Difference in hardware usage ✅
D. Difference in users
10. Why does DWH store old customer data?
A. For billing
B. To fill space
C. For trend analysis and decision support ✅
D. To avoid deletion

🧠 10 Short Questions with Answers


1. Q: What is a data warehouse?
A: A system that collects, cleans, integrates, and stores data from various sources to
support decision-making.
2. Q: Why do we need to transform data before storing it in DWH?
A: Because it comes from different formats and systems and needs to be standardized.
3. Q: What kind of queries are common in DWH?
A: Ad-hoc queries with low selectivity covering large data sets.
4. Q: What is selectivity in data warehousing?
A: It is the ratio of unique values in a column to total values, indicating how specific a
query is.
5. Q: Why is B-Tree indexing not preferred in DWH?
A: Because it causes high page faults due to the large size of the tree.
6. Q: What type of indexing is efficient in DWH?
A: Bitmap indexing and other advanced techniques.
7. Q: How is DWH hardware usage different from OLTP?
A: It is binary – either fully used or idle, unlike steady OLTP usage.
8. Q: Why do retailers store 65 weeks of data?
A: To compare current season sales with the same season last year.
9. Q: What is meant by “decision making is ad-hoc”?
A: Questions and requirements keep changing and are not fixed.
10. Q: Why isn’t DWH used for data entry?
A: Because it only stores and analyzes data collected from OLTP systems; it’s read-only.

Lecture 4

Here's a simplified explanation of Lecture 4: Introduction to Data Warehousing – Part IV,


along with 10 MCQs and short questions with answers, while preserving all key concepts:

🔍 Simplified Explanation
📌 1. Availability of Data Warehouse (DWH)

 In the beginning, DWH may be required only 6 days a week, 12 hours a day (6x12)
because decision-makers don’t work 24/7.
 Over time, they start using DWH more often and expect it to be available 24/7 (7x24).
 This demands 100% availability, which is not easy. We must carefully manage data
loading, refresh rates, and query response times.

📌 2. DWH vs Traditional Systems (SDLC vs CLDS)

Traditional SDLC (used in software


CLDS (used in Data Warehousing)
development)
Starts with requirements Starts with data
Then integrate, analyze, program, understand
Then design, develop, test
requirements
Requirement-driven Data-driven

 Using traditional development tools on DWH can waste time and cause confusion.

📌 3. OLTP vs DWH Queries

Feature OLTP DWH


Data Operational Historical
Structure Fully normalized Lightly de-normalized
Use of Primary Key Yes Rarely
Query result size Small Large
Tables used One Many
Time taken Seconds Minutes to Hours
Availability 24x7 Starts as 6x12, evolves to 24x7

📌 4. OLAP and Query Optimization

 OLAP (Online Analytical Processing) allows fast, interactive queries in seconds, even
on billions of rows.
 Techniques used:
o Sampling
o Denormalization
o Special indexing
 Some tasks like data mining may take hours.

📌 5. Data Warehouse Components


1. Extract data from multiple sources.
2. Transform & clean the data.
3. Load into data warehouse.
4. Create schemas for fast querying.
5. Build data marts for departments (e.g., Sales).
6. Generate data cubes for OLAP.
7. Use data mining tools for insights.

📌 6. Challenges in Data Warehousing

 Data comes from unstructured & different sources.


 Business needs change often.
 Traditional OLTP methods don't work well for very large databases (VLDBs).
 Performance problems arise at scale (millions/billions of records).
 Complex computer architecture and products change quickly.

📌 7. High-Level Implementation Steps

Phase I – Planning & Design

1. Know users' needs.


2. Choose DBMS and hardware.
3. Model data and metadata.

🔄 Phase II – Building

4. Extract, clean, transform data.


5. Set up connections (middleware).
6. Test with prototypes, reports, OLAP, and data mining.

🚀 Phase III – Deployment

7. Final deployment and system management.

✅ 10 MCQs with Answers


1. Initially, how many hours a day is a data warehouse expected to be available?
A) 24 hours
B) 12 hours ✅
C) 18 hours
D) 6 hours
2. Which development model does DWH follow?
A) SDLC
B) CLDS ✅
C) Agile
D) Waterfall
3. In DWH, development starts with:
A) Requirements
B) Testing
C) Data ✅
D) Design
4. Which type of query system is used in OLTP?
A) Complex
B) Transactional ✅
C) Analytical
D) None
5. Data in a DWH is usually:
A) Fully normalized
B) Unstructured
C) Lightly de-normalized ✅
D) Repetitive
6. OLAP queries should execute in:
A) Hours
B) Days
C) Seconds ✅
D) Weeks
7. Which is NOT a step in Phase I of DWH implementation?
A) Metadata repository
B) Data mining ✅
C) Hardware selection
D) Information modeling
8. Which of the following is true for OLTP?
A) Based on historical data
B) Designed for analytics
C) Returns thousands of rows
D) Has high selectivity and uses primary keys ✅
9. Why is scaling difficult in DWH?
A) Lack of data
B) Business rules change
C) Algorithms perform poorly at large scale ✅
D) Easy to manage
10. DWH is often designed for:
A) Real-time transaction
B) Data entry
C) Long-term decision support ✅
D) Gaming

✍️Short Questions with Answers


1. Q: What does 6x12 and 7x24 availability mean?
A: 6x12 means available 6 days/week for 12 hours/day; 7x24 means 24 hours/day, 7
days/week availability.
2. Q: What is the CLDS model in DWH?
A: CLDS (Classical Life Data Cycle) starts from data, integrates and analyzes it, and
understands user needs at the end.
3. Q: What is OLAP used for?
A: OLAP allows interactive decision-making by running fast queries on large data sets.
4. Q: Why can't traditional SDLC work for DWH?
A: Because business requirements change rapidly, making pre-collected requirements
outdated.
5. Q: Name two performance optimization techniques in DWH.
A: Sampling and denormalization.
6. Q: What are data marts?
A: Smaller sections of the data warehouse focused on specific departments like sales or
marketing.
7. Q: What is the main difference between OLTP and DWH queries?
A: OLTP queries are fast, use primary keys, and return fewer rows; DWH queries are
complex, span many tables, and return thousands of rows.
8. Q: Why is scaling a problem in DWH?
A: Because algorithms that work well on small data fail or slow down significantly with
billions of records.
9. Q: What happens in Phase II of DWH implementation?
A: Data is extracted, cleaned, transformed, and OLAP/data mining tools are
implemented.
10. Q: What is metadata in DWH?
A: Data that describes other data, helping in organizing and retrieving data efficiently.

Lecture 5

Here's a simple explanation of Lecture 5: Types of Data Warehouses and Their


Applications, with all key points covered and then 10 MCQs and short questions with
answers.

🧠 Simple Explanation of Lecture 5


📌 What is a Data Warehouse (DWH)?

A Data Warehouse (DWH) is a system that collects and stores large amounts of data from
different sources in one place, so it can be analyzed for decision-making.

🏢 Types of Data Warehouses (DWHs)


1. Financial DWH

 Usually the first type a company builds.


 Deals with money, so it touches all parts of the organization.
 Smaller in data size compared to others.
 Highly structured because finance follows strict rules.
 But it doesn’t match exactly with real-time operational systems because:
o Different accounting periods.
o Different geographical definitions.
o Different unit measures (e.g., dozen vs kg).
 It’s normal if the DWH doesn’t balance to the last rupee.

2. Telecommunication DWH

 Huge amount of data, especially from call-level details.


 Storing full detail is hard, so options include:
o Keeping only recent data.
o Summarizing or selecting key data.
 Some tasks require full call-level data, so aggregation is not possible.

3. Insurance DWH

 Very old data is stored, used for actuarial processing.


 Business model has not changed much in 40-50 years.
 Many types of dates (e.g., policy start date, claim date, etc.).
 Slow business cycle (claims take years).
 Data is often “frozen”, not changing like in banks or retail.
 Requires a unique DWH design.

4. Human Resource (HR) DWH


 Very specific: focuses only on one subject — people (employees).
 Smaller but important in analyzing employee data like salary, department, promotions,
etc.

🔄 Comparison Summary
Type Key Feature
Financial Easy to start, touches all departments, small size, can't match operational data exactly
Telecom Data-heavy (calls), hard to summarize, needs detailed storage
Insurance Very old data, slow processes, many date types
HR Single focus area (employees)

🎯 Typical Applications of Data Warehousing


1. Fraud Detection

 Analyze customer behavior patterns.


 If behavior suddenly changes, possible fraud.
 Example: A credit card used for unusual purchases in another city.

2. Profitability Analysis

 Know which customers are profitable or not.


 Banks may lose money on 50% of customers.
 Helps in restructuring products and pricing.
 Predict future profits using lifetime value models.

3. Direct Mail/Database Marketing

 Targeted marketing to save costs.


 Example: Offer internet services only to users who browse the web.
 Uses call detail records to know customer preferences.

✅ MCQs (Multiple Choice Questions)


1. Which data warehouse is usually the first built in an organization?
A) Telecom
B) Insurance
C) Financial
D) HR
✅ Answer: C) Financial
2. Why is financial data warehouse appealing to start with?
A) It has unstructured data
B) It's large and complex
C) It touches all departments and is small
D) It doesn't deal with money
✅ Answer: C) It touches all departments and is small
3. What is the biggest challenge in Telecom DWH?
A) Too little data
B) Data duplication
C) Sheer volume of call-level data
D) Lack of customers
✅ Answer: C) Sheer volume of call-level data
4. Why does Financial DWH often not reconcile with operational systems?
A) Bugs in software
B) Use of fake data
C) Different calendars and region classifications
D) Data is encrypted
✅ Answer: C) Different calendars and region classifications
5. Insurance DWH stores very old data mainly for:
A) Fraud detection
B) Marketing
C) Actuarial processing
D) Internet surfing
✅ Answer: C) Actuarial processing
6. Which type of DWH is likely to have “frozen” transactions?
A) Financial
B) Telecom
C) Insurance
D) HR
✅ Answer: C) Insurance
7. Which DWH is unique due to only one major subject area?
A) Financial
B) HR
C) Telecom
D) Insurance
✅ Answer: B) HR
8. Which is a typical application of data warehousing?
A) SMS forwarding
B) Email spamming
C) Credit risk prediction
D) Making video calls
✅ Answer: C) Credit risk prediction
9. Why do banks need profitability analysis?
A) To open more branches
B) To shut down ATMs
C) To find out which customers are profitable
D) To catch fraudsters
✅ Answer: C) To find out which customers are profitable
10. Direct mail marketing using DWH helps by:
A) Increasing printing cost
B) Sending messages to all
C) Targeting interested customers
D) Avoiding internet
✅ Answer: C) Targeting interested customers

✍️Short Questions with Answers


1. Why is the Financial DWH usually the first to be built?

Answer: Because it deals with money (a central focus), is smaller in size, and touches all
departments.

2. What makes Telecom DWH challenging?

Answer: It is dominated by massive volumes of call-level data which cannot be summarized


easily.

3. What is unique about Insurance DWH?

Answer: It stores very old data, has many types of dates, slow operational cycles, and frozen
transactions.

4. Why is it hard to reconcile a Financial DWH to the last rupee?

Answer: Due to differences in calendars, region definitions, and data measurement units.

5. How can DWH help in detecting fraud?

Answer: By analyzing changes in customer behavior patterns and identifying unusual activity.

Lecture 6

Here is a simple explanation of Lecture 6 (Normalization) from CS614 – Data Warehousing,


along with 10 MCQs and short questions with answers at the end. The explanation includes all
important concepts without skipping any points.
🔍 What is Normalization?
Normalization is the process of organizing data in a database to remove redundancy and
ensure logical data storage. It involves splitting larger tables into smaller ones to avoid
repeating data and ensure that every piece of data is stored in only one place.

🎯 Goals of Normalization:

1. Eliminate redundant data (e.g., don't store the same data in multiple places).
2. Ensure that data dependencies make sense (related data should be stored together).

⚠️What are Anomalies?


Anomalies are data problems that occur if a database is not properly normalized:

 INSERT Anomaly: Can’t add data because other related data isn’t available.
 DELETE Anomaly: Deleting one data item causes loss of other useful data.
 UPDATE Anomaly: One change requires updating many rows.

📘 Normalization Forms
🥇 First Normal Form (1NF)

 A table is in 1NF if:


o All fields contain atomic (indivisible) values.
o There are no repeating groups or arrays.

🧾 Example Problem in 1NF:

 In a student table, the degree and campus are repeated for each course.
 Leads to redundancy and update anomalies.

🥈 Second Normal Form (2NF)

 A table is in 2NF if:


o It is already in 1NF, and
o Every non-key attribute is fully dependent on the entire primary key (not just
part of it).
🧾 Example:

 The original table used a composite key (SID, Course).


 But degree and campus only depended on SID, not on Course.
 So, a new table REGISTRATION is created with SID as primary key to store degree
and campus.

🥉 Third Normal Form (3NF)

 A table is in 3NF if:


o It is already in 2NF, and
o No non-key column is transitively dependent on the primary key.

🧾 What is Transitive Dependency?

 If A → B and B → C, then A → C is transitive dependency.


 Example: SID → Campus, Campus → Degree, so SID → Degree is transitive.

To fix this:

 Create a new table CAMPUS_DEGREE for campus and degree.


 Leave campus in REGISTRATION as a foreign key.
 Rename REGISTRATION to STUDENT_CAMPUS.

❗ Do we always normalize to 3NF or beyond?


No!

 In real business systems, sometimes we intentionally de-normalize to improve


performance.
 This is acceptable if we understand the consequences, like redundancy or anomalies.

✅ Benefits of 3NF:
 Less redundant data
 Fewer anomalies
 Easier data updates
 More flexible for inserts and deletes
📌 Conclusion:
Normalization helps to make databases clean, efficient, and logical by:

 Eliminating repeated data


 Avoiding update problems
 Making data easier to maintain and query

📚 10 Multiple Choice Questions (MCQs)


1. What is the main goal of normalization?
A. Create more tables
B. Improve speed
C. Eliminate redundant data
D. Store less data
✅ Answer: C
2. What causes an update anomaly?
A. Missing tables
B. Incorrect query
C. Repeating groups
D. Redundant data
✅ Answer: D
3. Which normal form removes partial dependency?
A. 1NF
B. 2NF
C. 3NF
D. 5NF
✅ Answer: B
4. Which normal form removes transitive dependency?
A. 1NF
B. 2NF
C. 3NF
D. BCNF
✅ Answer: C
5. A table with atomic values and no repeating groups is in:
A. 1NF
B. 2NF
C. 3NF
D. 4NF
✅ Answer: A
6. What is a transitive dependency?
A. A column depends on multiple columns
B. A column depends on a column which depends on the key
C. A key depends on a column
D. Columns have repeating values
✅ Answer: B
7. Which of these is NOT a reason for normalization?
A. Avoid anomalies
B. Improve storage
C. Increase redundancy
D. Ensure logical dependencies
✅ Answer: C
8. In which case would de-normalization be useful?
A. When redundancy is needed
B. When performance is more important
C. When too many users
D. When using NoSQL
✅ Answer: B
9. Which table in the example is already in 3NF?
A. FIRST
B. REGISTRATION
C. PERFORMANCE
D. CAMPUS_DEGREE
✅ Answer: C
10. After converting to 3NF, campus becomes a:
A. Primary Key
B. Foreign Key
C. Composite Key
D. Secondary Index
✅ Answer: B

✍️Short Questions with Answers


1. Q: What is normalization?
A: It is the process of organizing database tables to remove redundancy and ensure data
dependencies make sense.
2. Q: What are the main goals of normalization?
A: To eliminate redundant data and ensure logical data dependencies.
3. Q: What is a composite key?
A: A primary key that consists of more than one column (e.g., SID + Course).
4. Q: What is an update anomaly?
A: A problem that occurs when the same data needs to be changed in multiple places.
5. Q: Why is 1NF not enough?
A: Because it only ensures atomicity but still allows partial dependencies and
redundancy.
6. Q: What is partial dependency?
A: When a non-key column depends on part of a composite key instead of the full key.
7. Q: What is transitive dependency?
A: When a non-key column is indirectly dependent on the primary key through another
column.
8. Q: Why do we move to 3NF?
A: To remove transitive dependencies and avoid anomalies.
9. Q: What is de-normalization?
A: The process of combining tables to improve performance, sometimes at the cost of
redundancy.
10. Q: How does normalization help queries?
A: It makes queries faster and more accurate by avoiding repeated and inconsistent data.

Lecture 7

Here’s a simple explanation of Lecture 07: De-Normalization from CS614 (Data


Warehousing), covering all key points, followed by 10 MCQs and 5 short questions with
answers:

✅ Simple Explanation of De-Normalization (Lecture 07)


🔍 What is De-Normalization?

 De-normalization means reversing some parts of normalization to improve


performance in data queries.
 It’s not chaos—it's done carefully and purposefully.
 The goal is to make data easier and faster to access, not to break rules.
 Examples include combining tables, adding summary data, or redundant columns.
 It's like making shortcuts to speed up reading data but without losing any information.

💡 Why is De-Normalization Used in DSS (Decision Support Systems)?

 DSS needs to analyze and summarize large amounts of data quickly.


 In a fully normalized database, data is split into many small tables, which need joins to
answer queries.
 Joins are slow, especially when data size is large.
 De-normalization brings related data closer together in fewer tables, reducing joins and
making queries faster.
 Studies (e.g., by Inmon) showed performance can improve significantly when de-
normalized carefully.

⚡ How Does De-Normalization Improve Performance?

 Reduces number of tables → fewer joins needed.


 Reduces number of joins → faster queries.
 Reduces number of rows accessed → less processing time.
 Helpful for large datasets where joins are expensive in terms of speed.

📏 Guidelines for De-Normalization

1. Cost-benefit analysis: Is the performance gain worth the extra storage?


2. Analyze data requirements: Will this structure support your typical queries?
3. Watch for maintenance issues: More redundancy = harder updates.
4. Golden Rule: When in doubt, don’t de-normalize.

🌍 Where to Apply De-Normalization?

 Data warehouses with star schemas.


 When performing time-series analysis.
 When needing aggregated (summary) data.
 In multi-dimensional analysis (like location: Province → District → City).
 When there are few updates but many read queries.

🧰 Common De-Normalization Techniques

1. Collapsing Tables: Combine tables in a One-to-One or Many-to-Many relationship.


2. Pre-Joining: Join tables in advance and store the result.
3. Splitting Tables: Break large tables into smaller ones by rows (horizontal) or columns
(vertical).
4. Adding Redundant Columns: Store same data in more than one place (for quick
access).
5. Derived Attributes: Add columns like totals, balances, etc., that are calculated from
other data.
📝 10 MCQs (with Correct Answers)
1. What is the main goal of de-normalization in DSS?
A. Save disk space
B. Remove redundant data
C. Improve query performance ✅
D. Apply normalization rules
2. Which technique involves combining two related tables into one?
A. Derived attributes
B. Splitting tables
C. Collapsing tables ✅
D. Adding triggers
3. Which of the following is NOT a benefit of de-normalization?
A. Fewer joins
B. Faster queries
C. Easier updates ✅
D. Less overhead
4. In DSS, de-normalization helps bring together ___________.
A. unrelated data
B. duplicated records
C. dispersed but related data ✅
D. complex schema design
5. Which is a golden rule of de-normalization?
A. Always denormalize
B. Normalize after every update
C. When in doubt, don’t de-normalize ✅
D. Add indexes on all columns
6. Which of the following is a technique of de-normalization?
A. Creating more joins
B. Normalization
C. Adding redundant columns ✅
D. Removing all indexes
7. Which type of system typically benefits most from de-normalization?
A. OLTP
B. DSS ✅
C. Compiler
D. ERP
8. What is the effect of too much de-normalization?
A. More normalization
B. One large table and loss of performance ✅
C. No change
D. Data loss
9. Which one is NOT a guideline for de-normalization?
A. Perform cost-benefit analysis
B. Add triggers for consistency
C. Always de-normalize first ✅
D. Analyze data requirements
10. Derived attributes in de-normalization are used to store:
A. Only IDs
B. Business rules
C. Calculated values ✅
D. NULLs only

✍️5 Short Questions with Answers


1. Q: What is de-normalization?
A: It is the process of carefully converting a normalized database into a less normalized
form to improve query performance without losing data.
2. Q: Why is de-normalization important in DSS?
A: Because it reduces joins and access times, which helps retrieve large data quickly, a
common need in Decision Support Systems.
3. Q: How does de-normalization improve performance?
A: By reducing tables, joins, and rows accessed during queries, which speeds up data
retrieval.
4. Q: Name any two de-normalization techniques.
A: Collapsing Tables and Adding Redundant Columns.
5. Q: What is a drawback of too much de-normalization?
A: It can cause update problems and lead to a large flat file, reducing performance and
increasing maintenance.

Lecture 8

Certainly! Here's a simple but complete explanation of the lecture on De-Normalization


Techniques, covering all important concepts, followed by 10 MCQs and short questions with
answers.

Lecture 08: De-Normalization Techniques — Simple Explanation

De-normalization is a database design technique where you intentionally add redundancy or


change table structures to improve query performance. Unlike normalization, which reduces
redundancy to avoid inconsistencies, de-normalization accepts some redundancy for faster data
retrieval, especially useful in decision support systems (DSS) and data warehouses.
1. Splitting Tables

 Splitting tables means breaking one big table into multiple smaller tables.
 Two main types of splitting:

a) Horizontal Splitting (Horizontal Partitioning):

o Divide a table’s rows based on a condition, e.g., students of different campuses in


separate tables.
o All tables have the same columns but contain different rows.
o Useful for:
 Parallel processing (distributing data across hardware).
 Reducing query load by limiting WHERE clause to a smaller table.
 Improving security by restricting access to certain rows.
 Faster data access due to smaller table size and flatter index structures.
 Graceful degradation if a partition/table gets damaged (only part of data
lost).

b) Vertical Splitting (Vertical Partitioning):

o Divide a table’s columns into separate tables.


o Each table repeats the primary key so they can be joined.
o Used when some columns are rarely accessed or are very large (e.g., big text
fields).
o Reduces the size of table headers, allowing more rows per data block and better
I/O performance.
o To users, the split looks like a single table via database views.

2. Pre-Joining

 In normalized databases, related tables are joined at query time.


 Pre-joining means physically combining (joining) frequently joined tables into one large
table before queries.
 Typical in one-to-many relationships like master-detail tables (e.g., sales header and
sales details).
 Avoids costly joins at run-time by duplicating master data into detail records.
 Trade-off:
o Increased storage due to data duplication.
o Much faster query performance as no join needed.

3. Adding Redundant Columns


 When a column from one table is often needed alongside another table, add that column
redundantly to avoid join.
 Example: Adding a product description directly to the sales table instead of always
joining with the product table.
 Similar to pre-joining but only selected columns are duplicated or moved.
 Pros and cons:
o Faster query performance by avoiding joins.
o Increased storage and update complexity.
o May affect referential integrity constraints.

4. Derived Attributes

 Derived attributes are columns calculated from other data and stored for quick access.
 Examples:
o Grade Point (Grade * Credits)
o Age (Current Date - Date of Birth)
 Useful when:
o The derived value is used frequently.
o Calculation is done once and is stable.
 Helps in faster query response and reduces runtime calculation.
 Storage overhead is justified if ratio of detail data to derived data is high (e.g., 10:1).

Summary

 De-normalization techniques improve query speed by reducing joins or data size per
query.
 Splitting tables horizontally or vertically helps manage big datasets better.
 Pre-joining and adding redundant columns trade storage space for speed.
 Derived attributes store calculated data to speed up queries.
 Each technique involves trade-offs between storage, performance, and complexity.

MCQs (Multiple Choice Questions)

1. What is horizontal splitting in de-normalization?


a) Splitting table columns into multiple tables
b) Splitting table rows into multiple tables based on some criteria
c) Joining two tables into one
d) Adding redundant columns to a table
Answer: b
2. Which of the following is NOT an advantage of horizontal splitting?
a) Improved query performance by reducing data scanned
b) Enhanced data security by restricting row access
c) Eliminates the need for any join operations
d) Enables parallel processing by distributing data
Answer: c
3. Vertical splitting is most useful when:
a) Some columns are rarely accessed or are large in size
b) Rows need to be split by campus
c) Data needs to be duplicated for faster queries
d) Tables need to be combined for queries
Answer: a
4. Pre-joining tables is mainly used to avoid:
a) Data duplication
b) Frequent run-time join operations
c) Data security problems
d) Horizontal splitting
Answer: b
5. Adding redundant columns in a table:
a) Always decreases storage space
b) Helps avoid joins but increases storage and update cost
c) Removes the need for primary keys
d) Is never used in data warehousing
Answer: b
6. Derived attributes are:
a) Columns that are copied from another table
b) Columns calculated from other columns and stored for quick access
c) Columns rarely accessed
d) The same as redundant columns
Answer: b
7. Which technique repeats the primary key in split tables?
a) Horizontal splitting
b) Vertical splitting
c) Pre-joining
d) Adding redundant columns
Answer: b
8. One disadvantage of pre-joining is:
a) Degraded query performance
b) Increased storage due to data repetition
c) Loss of data integrity
d) Reduced security
Answer: b
9. Splitting tables based on campus or year is an example of:
a) Vertical splitting
b) Horizontal splitting
c) Pre-joining
d) Adding redundant columns
Answer: b
10. In data warehousing, derived attributes help because:
a) They reduce storage requirements
b) They eliminate the need for primary keys
c) They provide fast query responses by avoiding runtime calculations
d) They split tables horizontally
Answer: c

Short Questions with Answers

Q1. What is the goal of horizontal splitting?


A: To spread rows across different tables for parallelism and to limit queries to relevant
partitions, improving performance and security.

Q2. How does vertical splitting improve database performance?


A: By moving rarely accessed or large columns into separate tables, it reduces the row size,
allowing more rows per block and less I/O.

Q3. What problem does pre-joining solve in data warehouses?


A: It avoids costly runtime joins by combining master and detail tables physically, trading
storage space for faster queries.

Q4. Why might redundant columns be added to a table?


A: To avoid frequent joins by duplicating necessary columns, improving query speed at the cost
of additional storage.

Q5. Define derived attributes and give an example.


A: Derived attributes are columns calculated from other data and stored for fast access. Example:
Age calculated from Date of Birth.

Q6. What is a disadvantage of adding redundant columns?


A: It increases storage and update overhead and may complicate maintaining data integrity.

Q7. How does horizontal splitting enhance security?


A: By restricting access to specific partitions or tables, users see only their permitted data.

Q8. Why is pre-joining acceptable despite violating normalization rules?


A: Because the performance gain from eliminating joins outweighs the cost of data redundancy
in data warehouses.
Q9. What is meant by ‘graceful degradation’ in the context of splitting tables?
A: If one partition is damaged, only part of the data is lost, and the system continues to operate
rather than failing completely.

Q10. When should derived attributes be added to a data warehouse?


A: When the derived value is frequently accessed, calculated once, and remains stable over time,
ensuring fast query responses.

Lecture 9

Certainly! Here's a simple but complete explanation of the lecture on Issues of De-
Normalization and then 10 MCQs with answers plus some short questions to help you grasp
the key concepts.

Lecture 09: Issues of De-Normalization — Simplified Explanation

What is Denormalization?

Denormalization means intentionally adding redundancy (duplicate data) to a database to


improve performance for certain queries, even though this goes against the rules of normalization
(which reduce redundancy to avoid problems).

Why Denormalize?

 To improve performance (make queries faster).


 Sometimes to improve ease of use (simpler queries).

Learning Goals

When thinking about denormalization, keep these in mind:

 Storage: How much extra space will denormalization take?


 Performance: Will queries run faster or slower?
 Maintenance: How hard is it to keep data correct?
 Ease of use: How easy is it to use and manage the database?
Important Notes:

 Denormalization can improve performance for some queries but hurt performance for
others.
 Always fully normalize your database to 3NF before denormalizing.
 Document both your logical model (normalized) and physical model (denormalized)
carefully.

1. Storage Issues (Pre-joining Example)

 Pre-joining means combining two tables (like a master and detail) into one by copying
some master data into the detail table.
 Example:
o Master table: 10 million records, 40 bytes each.
o Detail table: 20 million records, 60 bytes each.
 After pre-joining, storage increases by about 12.5% because master data is duplicated in
detail.
 This uses more storage but can speed up some joins.

2. Performance Issues (Pre-joining)

 Some queries run faster with denormalization.


 But for queries like “How many members claimed last year?” denormalization can slow
things down because:
o You have to do a count distinct (expensive sort operation) on repeated member
IDs.
o The table size is larger and scanning more data is slower.
 Sometimes, performance can degrade 5 times worse after denormalization for certain
queries.
 A smart system keeps both normalized and denormalized tables and chooses which one
to query depending on the need.

3. Adding Redundant Columns

 Copying columns (like salesperson ID) into detail tables adds size and slows scans (by
16% here).
 Useful only if many queries benefit.
 If uncontrolled, this can lead to a huge table with many duplicated columns causing
performance to worsen.
 More columns → fewer rows per block → more I/O → slower access.

4. Maintenance Issues

 Updating redundant columns is difficult. If a key changes, it must be updated in all


copies, which is expensive and risky.
 This is especially hard in large detail tables or historical archives.

5. Ease-of-use Issues (Horizontal Splitting/Partitioning)

 Horizontal splitting divides data rows into partitions based on hash, range, or other
methods.
 Hash partitioning spreads data evenly but makes reversing partitions or merging data
difficult.
 Range partitioning groups data by ranges (like dates) but can cause “hot spots” where
some partitions are heavily accessed.
 Round robin splitting evenly distributes data but is not easy to reverse and does not help
query optimization.

6. Performance Issues (Vertical Splitting)

 Vertical splitting splits a table into frequently used columns and infrequently used
columns.
 Queries needing frequent columns get faster (e.g., 5 times faster).
 But queries needing infrequent columns slow down because a join is needed between
split parts.

Summary

Denormalization is a trade-off. You gain performance for some queries but might lose
performance, increase storage use, and raise maintenance cost in others. Careful analysis and
design are required before denormalizing.
10 MCQs with Answers

1. What is the main purpose of denormalization?


a) To reduce storage usage
b) To improve query performance
c) To enforce data integrity
d) To eliminate redundancy
Answer: b
2. Before denormalizing, the database model should be normalized to which form?
a) 1NF
b) 2NF
c) 3NF
d) BCNF
Answer: c
3. Denormalization typically results in:
a) Less storage use
b) Increased storage use
c) No change in storage use
d) Reduced query complexity
Answer: b
4. Which operation tends to degrade performance after denormalization when
counting distinct values?
a) Join
b) Sort
c) Insert
d) Delete
Answer: b
5. What is a major maintenance challenge caused by denormalization?
a) Query optimization
b) Updating redundant data consistently
c) Index creation
d) Backup scheduling
Answer: b
6. Horizontal splitting based on hashing:
a) Guarantees easy reversal of partitions
b) Creates uneven data distribution
c) Distributes data evenly but is hard to reverse
d) Groups data by ranges
Answer: c
7. Round robin splitting is mostly used for:
a) Permanent large tables
b) Temporary tables where even load is important
c) Partition elimination optimization
d) Historical data grouping
Answer: b
8. Vertical splitting improves performance for:
a) Queries accessing infrequently used columns
b) Queries accessing frequently used columns
c) All queries equally
d) None of the above
Answer: b
9. What happens if redundant columns keep increasing in a fact table?
a) Query performance improves
b) Number of rows per block decreases, causing performance degradation
c) Storage use decreases
d) Table becomes easier to maintain
Answer: b
10. Which of the following is NOT a trade-off of denormalization?
a) Storage
b) Ease-of-use
c) Data integrity (strictly enforced)
d) Maintenance
Answer: c

Short Questions with Answers

1. Q: What is pre-joining in denormalization?


A: Pre-joining is copying data from a master table into a detail table, combining them to
avoid joins and improve query speed.
2. Q: Why can denormalization degrade performance for some queries?
A: Because queries might need to do expensive operations like sorting for distinct counts
on larger tables with duplicated data.
3. Q: How does denormalization affect storage?
A: It increases storage requirements because data is duplicated.
4. Q: What is a "hot spot" in horizontal splitting?
A: A partition that receives more data access than others, causing uneven workload and
bottlenecks.
5. Q: Why is maintenance harder after denormalization?
A: Because updating a piece of data requires changing multiple copies, increasing
complexity and risk of inconsistency.
6. Q: What is the benefit of vertical splitting?
A: It speeds up queries that use frequently accessed columns by splitting them from
rarely used columns.
7. Q: Why is round robin splitting not good for partition elimination?
A: Because it distributes data randomly and evenly, making it hard to exclude partitions
during queries.
8. Q: What does a "count distinct" operation do and why is it expensive after
denormalization?
A: It counts unique values by sorting data to remove duplicates, which is slow on large,
denormalized tables.
9. Q: What should be documented when denormalizing a database?
A: Both the original normalized logical model and the physical denormalized model.
10. Q: When is adding redundant columns justified?
A: When a significant number of queries benefit enough to outweigh the increased
storage and maintenance costs.

Lecture 10

Certainly! Here's a simplified explanation of Lecture 10: Online Analytical Processing


(OLAP) covering all important points, followed by 10 MCQs and short questions with answers.

Simple Explanation of Lecture 10: OLAP

What is OLAP?

 OLAP stands for On-Line Analytical Processing.


 It is focused on analyzing data rather than processing transactions (which is done by
OLTP).
 OLAP is a framework or a set of tools for analysis, not just a physical database design or
implementation.
 OLAP databases are often highly denormalized or fully denormalized, while OLTP
databases are normalized.

Relationship between Data Warehouse (DWH) and OLAP:

 A Data Warehouse is a special type of database that collects data from many sources, is
subject-oriented, integrated, time-variant, and non-volatile (data doesn’t change once
entered).
 Data Warehouses store large amounts of historical data on powerful servers separate from
day-to-day operational databases.
 OLAP is tightly connected to Data Warehouses because OLAP tools analyze the data
stored in the warehouse to support decision-making.
 Without OLAP, Data Warehousing would be incomplete.

Difference between OLAP and OLTP:

 OLTP (Online Transaction Processing) is used for daily operations and routine tasks (like
banking transactions).
 OLAP helps in decision-making, dealing with unexpected and complex queries that
cannot be predefined easily.

How OLAP Supports Analysis:


 OLAP supports ad-hoc, interactive, and iterative analysis — meaning users ask
questions on the fly, and each answer can lead to new questions.
 Analysis can start from different perspectives: time, geography, products, etc.
 OLAP allows drilling down (going into more detailed data), rolling up (summarizing
data), and drilling across (looking at data from different dimensions).

Example Scenario:

 Imagine a CEO sees a profit drop and asks “What happened last year’s quarterly sales?”
 The CEO may find one quarter with a drop and then explore sales by region or product to
find the root cause, such as high costs in a particular region during that quarter.
 This iterative exploration is fast and easy with OLAP.

Challenges of OLAP and Decision Support:

 It's impossible to write all queries in advance because decision makers don’t know all
questions beforehand.
 Business users are usually not programmers and don’t know SQL, so OLAP tools must
be intuitive and fast.
 Generating SQL queries on the fly can be too slow due to the large data sizes.

Solution to Challenges:

 Instead of writing all queries, precompute aggregates at multiple levels (e.g., sales by
year, quarter, region, store) so answers to many questions are ready quickly.
 This is done by organizing data into facts and dimensions:
o Facts: Numeric measurements (sales amount, units sold).
o Dimensions: Categories to analyze facts by (time, geography, product), often
organized in hierarchies (e.g., city → district → division → province → country).

10 MCQs with Answers

1. What does OLAP stand for?


A) Online Logical Analysis Processing
B) On-Line Analytical Processing
C) Offline Analytical Processing
D) On-Line Logical Processing
Answer: B
2. Which of the following best describes the purpose of OLAP?
A) Processing daily transactions
B) Analyzing large volumes of data for decision making
C) Managing user accounts
D) Handling inventory management
Answer: B
3. How are OLAP databases typically designed compared to OLTP databases?
A) Highly normalized
B) Fully normalized
C) Highly denormalized or completely denormalized
D) Same design as OLTP
Answer: C
4. What is the relationship between Data Warehouse and OLAP?
A) Data Warehouse is used only for OLTP
B) OLAP is used to analyze data stored in the Data Warehouse
C) They are unrelated
D) OLAP stores data for the Data Warehouse
Answer: B
5. Which of these is NOT true about OLAP analysis?
A) It is ad-hoc
B) It is iterative
C) It only supports fixed queries
D) It is user-driven
Answer: C
6. Which process in OLAP means viewing data at a more detailed level?
A) Roll-up
B) Drill-down
C) Drill-across
D) Roll-down
Answer: B
7. Why can't all queries be predefined in OLAP systems?
A) Queries are always the same
B) Users don’t know all questions in advance
C) OLAP doesn’t support queries
D) Predefined queries are faster
Answer: B
8. In OLAP, what are 'facts'?
A) Qualitative descriptions
B) Numeric measurements to analyze (e.g., sales)
C) Types of databases
D) User queries
Answer: B
9. What are 'dimensions' in OLAP?
A) Numeric measures
B) Data warehouses
C) Categories for filtering facts (e.g., time, geography)
D) Reports
Answer: C
10. Why is OLAP important for decision support systems?
A) It processes transactions faster
B) It allows complex, flexible, and interactive analysis of data
C) It stores data permanently
D) It replaces databases
Answer: B

Short Questions with Answers

1. Q: What is the key difference between OLAP and OLTP?


A: OLAP is for analyzing data and decision making, while OLTP is for processing
routine transactions.
2. Q: Why is OLAP not considered a physical database design?
A: Because OLAP is a framework or approach for analysis, not a method of physically
designing a database.
3. Q: What are the four key characteristics of a Data Warehouse?
A: Subject-oriented, integrated, time-variant, and non-volatile.
4. Q: What does "drill down" mean in OLAP?
A: It means exploring data at a more detailed level.
5. Q: What challenge does OLAP face related to query generation?
A: It's impossible to know and write all queries in advance because user questions are
unpredictable.
6. Q: How do OLAP systems support non-programmer business users?
A: By providing intuitive tools that don't require knowledge of SQL and precomputed
aggregates for fast responses.
7. Q: What is meant by the 'iterative' nature of OLAP analysis?
A: Users repeatedly ask questions, analyze answers, and refine their queries.
8. Q: What is an example of a 'fact' in OLAP?
A: Sales amount or number of units sold.
9. Q: How are dimensions organized in OLAP?
A: They are often organized in hierarchies (e.g., day → month → quarter → year).
10. Q: What is the 'field of dreams' scenario in data warehousing?
A: The mistaken belief that if you build a data warehouse, users will automatically start
using it without considering usability.

Lecture 11

Sure! Here's a simple, clear explanation of the key concepts from Lecture 11 on
Multidimensional OLAP (MOLAP), followed by 10 MCQs and short questions with
answers for review.

Lecture 11: Multidimensional OLAP (MOLAP) — Simplified Explanation

What is OLAP?
OLAP stands for Online Analytical Processing. It helps decision-makers analyze large volumes
of data quickly from different perspectives.

Types of OLAP Implementations:

1. MOLAP (Multidimensional OLAP):


o Uses a special multidimensional data structure called a cube.
o Data is pre-aggregated and stored in this cube for fast access.
o Does not use standard SQL queries because the data is not in tables.
o Supports fast data retrieval with direct access (O(1) time complexity).
o Example: Essbase by Hyperion, Microsoft OLAP Server.
2. ROLAP (Relational OLAP):
o Uses relational databases and standard SQL queries to access data.
o Better scalability but slower than MOLAP.
3. HOLAP (Hybrid OLAP):
o Combines MOLAP and ROLAP to get the best of both: scalability and
performance.
4. DOLAP (Desktop OLAP):
o Allows OLAP cube data to be downloaded and used on a desktop without servers.
o Useful for mobile or offline decision support.

MOLAP Cubes:

 Data is organized into dimensions (like Geography, Product, Time).


 Each dimension can have hierarchies (e.g., Country > Province > City).
 The cube stores pre-calculated aggregates like total sales by product and time.
 Fast access is possible because data is stored in a multidimensional array (like a 3D or 4D
matrix).
 No standard SQL is used; vendors provide proprietary languages or graphical tools (e.g.,
MDX by Microsoft).

Cube Operations (How we analyze data in cubes):

1. Roll-up: Summarize data to a higher level (e.g., from monthly sales to yearly sales).
2. Drill-down: Go into more detail (e.g., from yearly sales to monthly or daily sales).
3. Slice and dice: Look at specific slices of data by selecting subsets across dimensions
(e.g., sales of soft drinks in Karachi last quarter).
4. Pivot: Change the way data is displayed by swapping dimensions (e.g., swap rows and
columns in a report).
Advantages of MOLAP:

 Instant response because aggregates are pre-calculated.


 Useful for complex queries without missing answers.
 Supports advanced functions like ranking and percentage changes.

Drawbacks of MOLAP:

 Long time to pre-calculate the cubes (may take days).


 Cube size grows exponentially with dimensions — called the curse of dimensionality.
 Cubes are often very sparse (most data entries are empty), wasting storage.
 Limited scalability when dimension cardinality is very large (usually max ~64,000).
 Cube creation is a batch process, so data can be stale if not updated regularly.

MOLAP Maintenance and Storage Issues:

 Every new data item must update all relevant aggregates in the cube — this is costly.
 Building many cubes for various dimension combinations requires a lot of storage space.
 Cube updates must be synchronized with data changes to avoid inconsistent results.

10 MCQs and Short Questions with Answers

1. What does MOLAP stand for?


a) Multidimensional OLAP
b) Multiple OLAP
c) Modular OLAP
d) Managed OLAP
Answer: a) Multidimensional OLAP

2. Which OLAP implementation uses relational databases and SQL queries?


Answer: ROLAP
3. True or False: MOLAP supports standard ANSI SQL queries.
Answer: False — MOLAP uses proprietary query languages, not ANSI SQL.

4. What are the four fundamental cube operations in MOLAP?


Answer: Roll-up, Drill-down, Slice and Dice, Pivot

5. Explain the difference between roll-up and drill-down operations.


Answer:

 Roll-up summarizes data to a higher level (less detail).


 Drill-down breaks data into more detail (more granularity).

6. Why is MOLAP faster in retrieving data than ROLAP?


Answer: Because MOLAP stores pre-aggregated data in multidimensional cubes with direct
O(1) lookup, avoiding slow table scans.

7. What is a major disadvantage of MOLAP regarding cube creation?


Answer: Cube pre-calculation takes a long time (may take days) and is resource-intensive.

8. What problem does the "curse of dimensionality" refer to in MOLAP?


Answer: As the number of dimensions increases, the number of possible aggregates grows
exponentially, making cubes very large and hard to manage.

9. What does DOLAP enable?


Answer: It allows OLAP cubes to be downloaded and used on desktop systems without needing
a server, useful for mobile or offline use.

10. Why are MOLAP cubes often very sparse?


Answer: Because many combinations of dimensions have no data (e.g., seasonal products not
sold year-round), resulting in many empty cells in the cube.
Lecture 12

Sure! Here's a simple and clear explanation of the key concepts from Lecture 12: Relational
OLAP (ROLAP), followed by 10 MCQs and short questions with answers.

Lecture 12: Relational OLAP (ROLAP) - Simple Explanation

What is ROLAP and Why Use It?

 ROLAP stands for Relational Online Analytical Processing.


 It uses relational databases (like SQL databases) to store and analyze multidimensional
data.
 MOLAP (Multidimensional OLAP) stores data in cubes but has a problem with
scalability when data or dimensions grow large—it needs a lot of memory.
 ROLAP solves this by storing data in relational tables, which can handle larger data
using disk storage.
 ROLAP queries use SQL on fact and dimension tables organized usually in a star
schema (a central fact table connected to dimension tables).

Creating a Cube in ROLAP

 In ROLAP, the "cube" is a logical concept, not a physical cube like in MOLAP.
 The fact table (which stores detailed data) can be visualized as a flattened or "unrolled"
cube.
 By running SQL queries, you can aggregate data at different dimension levels to get
cube-like summaries.
 Example: Summing sales by month and product to get a 2D table from which you can see
aggregates like row totals and column totals.

Problems with Simple ROLAP Queries

 As the number of dimensions increases, the number of queries needed to compute all
aggregates grows exponentially.
 Computing all aggregates separately is wasteful.
 A better approach is to reuse previous results to compute new aggregates ("work
smart, not hard").

SQL Cube Clause

 SQL:1999 introduced the CUBE clause.


 GROUP BY CUBE(v1, v2, ..., vn) generates all combinations of groupings
automatically.
 This reduces the need to write many SQL queries manually.

Space Requirements in ROLAP

 Pre-aggregating data for all combinations of dimensions and hierarchy levels creates
many summary tables.
 For example, just two dimensions (time and product) can create 24 summary tables,
adding geography can raise it to 120 tables.
 Storing all these tables takes a lot of space.
 Aggregate awareness: Smart tools can build less detailed aggregates on the fly from
more detailed aggregates to reduce storage.

Issues in ROLAP

1. Maintenance: Updating all summary tables when new data arrives is hard and costly.
2. Non-standard hierarchies: Dimensions can have complex hierarchies and overlap in
unexpected ways.
3. Non-standard conventions: Different departments or organizations may define time
periods (week, year) differently, causing confusion and requiring multiple summary
tables.
4. Storage explosion: Combining many hierarchies and conventions leads to a huge number
of summary tables.
5. Aggregation pitfalls: Some metrics are not additive (like averages or ratios), making
aggregation harder.

10 MCQs on ROLAP

1. Why is ROLAP preferred over MOLAP for large datasets?


a) ROLAP requires more memory
b) MOLAP is faster but can't handle large dimensions well
c) ROLAP stores data in relational databases, which handle large data better
d) MOLAP supports SQL better
Answer: c
2. What is the main storage structure used in ROLAP?
a) Multidimensional cubes
b) Relational tables (fact and dimension tables)
c) Flat files
d) In-memory arrays
Answer: b
3. Which SQL clause helps in generating all combinations of aggregates
automatically?
a) GROUP BY ALL
b) GROUP BY CUBE
c) GROUP BY ROLLUP
d) GROUP BY MULTI
Answer: b
4. What happens to the number of queries required to compute aggregates as the
number of dimensions increases?
a) It decreases
b) It stays constant
c) It increases exponentially
d) It increases linearly
Answer: c
5. In ROLAP, the "cube" is:
a) A physical cube stored in memory
b) A logical structure derived from relational tables
c) Stored only in MOLAP databases
d) A flat file
Answer: b
6. What is the main disadvantage of creating summary tables for all possible
combinations of dimensions?
a) Too slow to query
b) Too much storage space and maintenance cost
c) Too hard to create SQL queries
d) Results become inaccurate
Answer: b
7. Which of the following is NOT a common ROLAP issue?
a) Maintenance overhead
b) Explosion of storage space
c) Non-standard dimension hierarchies
d) Lack of SQL support
Answer: d
8. Why is aggregate awareness important in ROLAP?
a) To prevent query execution
b) To reduce the number of queries and save computation time
c) To store more detailed data
d) To eliminate the need for dimensions
Answer: b
9. Which of the following metrics is usually hard to aggregate in ROLAP?
a) Sum of sales
b) Count of transactions
c) Average price
d) Total quantity
Answer: c
10. What does a star schema consist of?
a) Only fact tables
b) A fact table connected to multiple dimension tables
c) Many unrelated tables
d) Only dimension tables
Answer: b

Short Questions with Answers

Q1: What is the main scalability problem with MOLAP?


A: MOLAP requires large main memory as the cube size grows due to more dimensions, more
detail data, or higher cardinality, limiting its use for large datasets.

Q2: How does ROLAP overcome MOLAP's scalability problem?


A: ROLAP stores data in relational databases on disk, allowing it to handle larger data sizes and
dimensions without the heavy memory requirement.

Q3: Explain how a cube is created logically in ROLAP.


A: In ROLAP, the fact table is visualized as a flattened cube, and by running SQL queries
aggregating data over dimensions, logical cubes or summary tables are created without
physically storing a cube.

Q4: What is the problem with using simple SQL queries for all aggregates?
A: The number of queries increases exponentially with more dimensions, leading to wasted
computation and inefficiency.

Q5: How does the SQL CUBE clause help?


A: It automatically generates all possible grouping combinations for the specified dimensions,
reducing the need to write multiple queries.

Q6: Why is storage space a concern in ROLAP?


A: Pre-aggregating and storing all combinations of dimension levels results in many summary
tables, which can use huge amounts of disk space.

Q7: What is meant by “aggregate awareness” in ROLAP?


A: It means building less detailed aggregates on-the-fly from more detailed aggregates to reduce
storage and computation.

Q8: What are the maintenance challenges in ROLAP?


A: Keeping all summary tables updated with new data is costly and complex, especially when
data arrives late or dimensions change.

Q9: How do non-standard hierarchies and conventions affect ROLAP?


A: Different ways of defining hierarchies or calendar periods across departments increase the
number of summary tables and complicate maintenance.
Q10: Why can't all aggregates be precomputed in ROLAP?
A: Because the storage and maintenance cost grow too large, and some metrics are not additive
or straightforward to aggregate.

If you want, I can also help you make a study summary or flashcards for these concepts! Would
you like that?

Lecture 13

Here is a simplified explanation of Lecture 13: Dimensional Modeling (DM) with all key
concepts covered followed by 10 MCQs and 10 short questions with answers.

💡 Simplified Explanation: Dimensional Modeling (DM)


📌 Why ER Modeling Was Needed

In the early days (COBOL era), data was stored in flat files with a lot of repetition
(redundancy). For example, WAPDA meter readings repeated customer names and addresses
every time. This caused inconsistency and wasted storage.

ER (Entity-Relationship) Modeling solved this by:

 Breaking data into related tables.


 Reducing redundancy using normalization.
 Using joins to combine data.
 Optimizing for OLTP (Online Transaction Processing) systems.

📌 Success of ER Modeling

 Removed data duplication.


 Made updates easy and fast.
 Supported indexing for fast searching.
 Helped OLTP systems become reliable and consistent.

📌 Limitations of ER Modeling for Decision Support Systems (DSS)

 Complex table structure (lots of tables).


 Difficult to understand and use for analysis.
 Hard to identify which tables store important business data.
 Querying becomes slow with multiple joins.
 Database performance drops when queries involve many tables.

📌 Why Dimensional Modeling (DM) is Needed

ER models are good for transactional tasks but not for answering business analysis questions
like:

 "What are monthly sales trends?"


 "Which product is most profitable?"

Due to this, Dimensional Modeling was introduced for Decision Support Systems (DSS).

📌 The Paradox

The goal was to make data accessible through ER models, but they ended up being too complex
to query efficiently. So, a system meant to help with decision-making actually made it harder.

📌 ER vs. DM

ER (Entity-Relationship) DM (Dimensional Modeling)


Best for OLTP Best for DSS
Focuses on micro details Focuses on business summary
Complex joins Simplified structure
High redundancy removal Easy to query and analyze

📌 What is Dimensional Modeling?

Dimensional Modeling is a simplified database design made for analysis. It focuses on making
data easy to understand and fast to query.

Key Features:

 Central Fact Table (contains measurable data like sales, profits).


 Dimension Tables (descriptive data like date, product, customer).
 Fact table uses multi-part keys, dimensions use single-part keys.
 Forms a Star Schema (central fact connected to surrounding dimensions).
 Helps perform ROLAP (Relational OLAP) operations.
📝 10 Multiple Choice Questions (MCQs) with Answers
1. Which system is optimized using ER modeling?
a) DSS
b) OLTP ✅
c) OLAP
d) Big Data
2. Main goal of ER modeling is to:
a) Make data redundant
b) Simplify queries
c) Remove redundancy ✅
d) Increase storage
3. Which model is ideal for decision support systems (DSS)?
a) ER Model
b) OLTP Model
c) Dimensional Model ✅
d) Hierarchical Model
4. What is a fact table?
a) Stores descriptions
b) Stores calculated data
c) Stores transactional data
d) Stores measurable data ✅
5. What kind of key does a dimension table usually have?
a) Multi-part key
b) Composite key
c) Single-part key ✅
d) Foreign key
6. What kind of join structure does dimensional modeling use?
a) Circular schema
b) Star schema ✅
c) Snowflake schema
d) ER schema
7. Which of the following is NOT true for ER modeling?
a) Supports normalization
b) Suitable for OLTP
c) Simplifies DSS queries ✅
d) Removes redundancy
8. Main problem with ER model in DSS is:
a) Slow writing speed
b) Too few tables
c) Difficult joins and queries ✅
d) No indexing
9. What does ROLAP stand for?
a) Real-time OLAP
b) Relational OLAP ✅
c) Read-only OLAP
d) Recursive OLAP
10. What is a major benefit of dimensional modeling?
a) Complex schema
b) Easy to design
c) Fast and intuitive queries ✅
d) Saves disk space

✍️10 Short Questions with Answers


1. Q: What does ER stand for in data modeling?
A: Entity-Relationship.
2. Q: What is the main purpose of ER modeling?
A: To remove redundancy and structure data for OLTP.
3. Q: What is the weakness of ER modeling in DSS?
A: Complex joins and slow query performance.
4. Q: Why was dimensional modeling introduced?
A: To simplify database design and improve performance in decision support systems.
5. Q: What is a fact table?
A: A table that stores business measurements or facts, like sales.
6. Q: What are dimension tables?
A: Tables that store descriptive information related to facts.
7. Q: What is the shape of a dimensional model?
A: Star-like, called a star schema.
8. Q: What is the key difference between ER and DM?
A: ER focuses on transaction processing; DM focuses on analytical querying.
9. Q: What is ROLAP?
A: Relational OLAP – performing analytical operations using relational databases.
10. Q: Why can’t users easily query ER models in DSS?
A: Because ER models are complex, with many interconnected tables and relationships.

Let me know if you want a diagram for star schema or a PDF file of this content.

Lecture 14

Here’s a simple and complete explanation of Lecture 14: Process of Dimensional Modeling,
along with 10 MCQs and short questions with answers at the end.
🌟 Simple Explanation of Dimensional Modeling Process
Dimensional Modeling (DM) is a method used in data warehousing to design user-friendly
and fast-performing databases for reporting and analysis. It is different from the traditional ER
(Entity-Relationship) model used for operational databases.

✅ The 4-Step Process of Dimensional Modeling:

🧩 1. Choose the Business Process

A business process is any important activity in an organization that you want to analyze.
Examples include:

 Taking orders
 Making shipments
 Managing inventory
 Processing invoices

We don’t model the whole company in one go. Instead, we choose one business process at a
time, which makes the system more manageable and accurate.

❗Note: Business processes are NOT departments. Don’t create separate models for each
department (like Sales or Marketing), instead, create one model for the process (like Orders),
which can be used by many departments.

🧱 2. Choose the Grain

Grain means the level of detail you want to store in your database. It answers:
What does one row in the fact table represent?

Examples of grain:

 One sale per transaction


 Daily stock levels
 Monthly sales summary

🎯 Grain = Unit of Analysis (like kg for weight)

Fine-grain = More details, but more data


Coarse-grain = Summary, less data, faster queries
There is a trade-off:

 Fine-grain = More flexibility


 Aggregated data = Better performance, less storage

📊 3. Choose the Facts

Facts are numeric values that the business wants to analyze.


They are stored in fact tables, which come from many-to-many relationships in ER diagrams.

Good facts are:

 Numeric
 Additive (can be summed)
 Meaningful

Examples:

 Quantity sold
 Sales amount
 Profit

🧭 4. Choose the Dimensions

Dimensions are the descriptive information that explain the facts.

Examples:

 Time (Date, Week, Month)


 Product (Name, Category)
 Geography (City, Country)
 Customer (Name, Gender)

Each fact is connected to dimensions so we can answer questions like:

 What are sales by region?


 Which products sold most last month?

Dimensions are stored in their own tables, and are often denormalized to speed up queries.
📉 Aggregation: Pros and Cons

Aggregation means summarizing data (e.g., monthly instead of daily).

✅ Pros:

 Faster queries
 Less storage
 Suitable for dashboards and repetitive reports

❌ Cons:

 Irreversible: Can’t go back to original data


 Less flexible: Can’t answer all questions (like “Why sales dropped in Week 2?”)
 Can hide facts: Averages may mask real problems

Example:
If Zone A sold 50, 100, 150, and 200 units in 4 weeks → average = 125
If Zone B sold 125, 125, 125, 125 → same average, but completely different sales behavior.

📝 10 MCQs with Answers


1. What is the first step in the dimensional modeling process?
a) Choose the Facts
b) Choose the Grain
c) Choose the Dimensions
d) Choose the Business Process
✅ Answer: d
2. What does 'grain' refer to in dimensional modeling?
a) Size of the database
b) Amount of storage
c) Level of detail in the fact table
d) Number of dimensions
✅ Answer: c
3. Which of the following is a good example of a fact?
a) Customer Name
b) Product Type
c) Quantity Sold
d) Transaction Type
✅ Answer: c
4. Which is NOT a characteristic of good facts?
a) Numeric
b) Additive
c) Descriptive
d) Continuously valued
✅ Answer: c
5. Which of the following is a typical dimension?
a) Price
b) Quantity
c) Sales
d) Time
✅ Answer: d
6. What is a disadvantage of data aggregation?
a) It speeds up queries
b) It saves storage
c) It hides detailed trends
d) It improves reporting
✅ Answer: c
7. Why is it better to model processes instead of departments?
a) Saves time
b) Avoids data duplication
c) Makes diagrams look simple
d) Works faster
✅ Answer: b
8. Which level of grain is more expressive but slower?
a) Weekly
b) Monthly
c) Daily
d) Fine-grain
✅ Answer: d
9. Facts are selected from which type of relationships in ER diagrams?
a) One-to-one
b) Many-to-many
c) One-to-many
d) None of these
✅ Answer: b
10. What does the star schema result from?
a) Combining processes
b) Collapsing hierarchies
c) Splitting ER diagram into business processes
d) Choosing multiple grains
✅ Answer: c

🧠 Short Questions with Answers


1. Q: What is meant by 'grain' in dimensional modeling?
A: Grain is the level of detail that each record in the fact table represents, such as one
transaction or one day’s summary.
2. Q: Why is aggregation not always the best option in data warehousing?
A: Because it is irreversible and may hide important trends or changes in the data,
leading to incorrect conclusions.
3. Q: What are dimension tables?
A: Dimension tables store descriptive information (like time, product, or location) that
help analyze facts.
4. Q: Give one example of a business process that can be modeled.
A: “Order Processing” is a common business process used for modeling in data
warehousing.
5. Q: Why are facts called additive?
A: Because they can be summed up across dimensions like time or geography to get
meaningful results.

Lecture 15

Here is a simple explanation of Lecture 15: Issues of Dimensional Modeling along with 10
MCQs and short questions with answers.

💡 Simple Explanation: Issues of Dimensional Modeling

1. Additive vs. Non-Additive Facts

 Additive Facts are facts that can be added correctly across any dimension like time,
region, product, etc.
✅ Example: Quantity sold, total revenue.
 Non-Additive Facts cannot be added across all dimensions because it gives incorrect
results.
❌ Example: Averages, percentages, ratios.

💬 Why not add averages or percentages?


Adding them directly gives the wrong result. You must calculate from the original numbers.

2. Classification of Aggregation Functions

Aggregation functions summarize data. They are divided into three types:

1. Distributive – Can be calculated from sub-totals.


✅ Examples: SUM, COUNT, MIN, MAX
Simple and fast to compute.
2. Algebraic – Require a combination of values (e.g., SUM + COUNT for AVERAGE).
🧠 Examples: AVERAGE, STDDEV
Need summary data to compute correctly.
3. Holistic – Need full data set; can't be computed from summaries.
❗ Examples: MEDIAN, COUNT DISTINCT
Not suitable for data warehouses (too slow and heavy).

3. Not Recording Facts

 Fact tables only store events that actually happen.


❗ So if a product doesn’t sell, no record is added.

✅ Advantage: Less storage space (efficient).


❌ Disadvantage: Can’t know what didn’t happen (e.g., unsold promoted products).

4. Fact-less Fact Tables

 These tables don’t have numeric values like sales or price.


 They are used to capture relationships between dimensions.
✅ Example: Student-Course enrollment.

📌 They solve the problem of recording non-events (like students not enrolling or unsold items).

They use a dummy value like “1” to show presence.

5. Multi-Valued Dimensions

Sometimes one fact is related to multiple dimension values (e.g., a car having several
maintenance tasks).

Ways to handle:

1. Drop the extra values (not recommended).


2. Pick one primary value (common method).
3. Add fixed extra columns (not flexible, avoid).
4. Use helper tables (best approach).

Helper tables connect facts to dimensions without breaking the schema.


6. OLTP vs. Slowly Changing Dimensions

 OLTP systems are live and changing, data is often overwritten.


 They are not good at keeping history (old data gets deleted).
 Data warehouses preserve history, and keep old values for analysis.

📌 Example: You want to know what a product’s description was last year — OLTP can’t help,
DW can.

📝 10 MCQs with Answers

1. Which of the following is an additive fact?


A. Discount %
B. Average sales
C. Quantity sold ✅
D. Gross margin
2. Which aggregate function is distributive?
A. COUNT ✅
B. AVERAGE
C. MEDIAN
D. COUNT DISTINCT
3. What is a fact-less fact table used for?
A. Store numbers
B. Show dimension relationships ✅
C. Store transactions
D. Keep aggregate data
4. What is the main disadvantage of not recording facts?
A. More storage is used
B. Facts become holistic
C. Loss of non-event data ✅
D. Additive facts become non-additive
5. Which aggregate type needs full data to compute?
A. Algebraic
B. Distributive
C. Holistic ✅
D. None
6. Helper tables are used to handle:
A. Additive facts
B. Fact-less facts
C. Multi-valued dimensions ✅
D. Slowly changing facts
7. Which of the following is a non-additive fact?
A. Total Sales
B. Number of Units
C. Average Unit Price ✅
D. Total Profit
8. What value is commonly used in fact-less fact tables?
A. NULL
B. 0
C. 1 ✅
D. -1
9. Why are holistic functions not suitable for data warehouses?
A. Too simple
B. Require complete scan ✅
C. Add only non-fact data
D. Always inaccurate
10. Which system does NOT keep past data for long periods?
A. Data warehouse
B. OLAP
C. OLTP ✅
D. Star schema

❓ 5 Short Questions with Answers

1. What is the difference between additive and non-additive facts?


Ans: Additive facts can be added across dimensions (like sales), non-additive facts like
averages or percentages cannot be directly summed.
2. Give an example of a fact-less fact table.
Ans: A table showing which students enrolled in which courses without storing numeric
facts.
3. What is a holistic aggregate function?
Ans: A function like median or count distinct that requires full data and cannot be
computed from subgroups.
4. What’s the purpose of a helper table in dimensional modeling?
Ans: To handle multi-valued dimensions by connecting facts to multiple dimension
entries without redundancy.
5. Why OLTP systems are not suitable for historical tracking?
Ans: They overwrite old data and purge records frequently, making it hard to retrieve
past information.

Here's a simple and complete explanation of the topic “Step-4: DWH Dilemma: Slowly
Changing Dimensions” followed by 10 MCQs and short questions with answers:

✅ Simple Explanation of Slowly Changing Dimensions


(SCDs)
In a Data Warehouse (DWH), data comes from many sources and changes over time. These
changes, especially in descriptive data (like product descriptions or customer addresses), are
known as Slowly Changing Dimensions (SCDs).

Unlike fact tables where new records (like sales) are constantly added, dimension tables (like
customer or product) change slowly. But when they change, we face a challenge — how to
handle and store these changes?

For example, a product may have a slight packaging change, but its product ID (SKU) remains
the same. In such a case, the DWH must track both the old and new descriptions, but it raises
two questions:

 What should be the key if SKU doesn’t change?


 Where do we store the changed values?

To manage this, there are three options:

🔹 Option 1: Overwrite History (Type 1 SCD)

 Replace the old data with new data in the same row.
 ✅ Easy and simple.
 ❌ History is lost.
 ✅ Useful when the old data is not important (e.g., correcting typos).

🔹 Option 2: Add New Row (Type 2 SCD)

 Create a new row in the dimension table with updated info.


 Use version numbers or new surrogate keys (e.g., SKU#01, SKU#02).
 ✅ Best for keeping full history.
 ❌ Increases table size over time.

🔹 Option 3: Add New Field (Type 3 SCD)

 Keep both old and current values in separate columns like previous_region,
current_region.
 ✅ Tracks only the last two changes.
 ✅ Keys don’t change.
 ❌ Can't track more than two changes.
📌 Real-World Examples:

 Customer: Single → Married → Divorced.


 Product: New packaging.
 Land Ownership: Changes by inheritance.
 Sales Region: Reassigned over time.

🔹 Pros and Cons Summary:

Option Pros Cons


Overwrite Simple, small table No history
Add Row Full history Table grows
Add Field Track last 2 changes Limited history

📌 Junk Dimension
Sometimes leftover attributes (e.g., Payment Method, Bag Type) don’t belong to any specific
dimension. Instead of making separate tables, we combine them into a "Junk Dimension",
reducing table width and complexity.

✅ 10 MCQs with Answers


1. Which of the following best defines a Slowly Changing Dimension?
A. A dimension that never changes
B. A dimension that changes rapidly
C. A dimension that changes slowly over time
D. A fact table
✔️Answer: C
2. Which option for handling SCDs preserves full historical data?
A. Overwrite History
B. Add New Row
C. Add New Field
D. Ignore the change
✔️Answer: B
3. In Type 1 SCD, what happens to the old value?
A. Archived
B. Kept in another table
C. Deleted
D. Overwritten
✔️Answer: D
4. What is the main drawback of using Type 2 SCD (Add New Row)?
A. Loses history
B. Difficult to implement
C. Table grows in size
D. Keys are affected
✔️Answer: C
5. In which scenario is Type 3 SCD most suitable?
A. When only the most recent change is important
B. When tracking all changes
C. When you need to change keys
D. When no changes are expected
✔️Answer: A
6. What is added in Type 3 SCD to track changes?
A. A new row
B. A junk table
C. A new column
D. A new key
✔️Answer: C
7. What is a “Junk Dimension”?
A. A dimension with corrupted data
B. A mix of unrelated attributes grouped into one dimension
C. A backup of old dimensions
D. A dimension not used in reports
✔️Answer: B
8. What is a common example of slowly changing dimension in sales organizations?
A. Product barcode
B. Customer phone number
C. Sales region names
D. Transaction amount
✔️Answer: C
9. Which SCD type allows unlimited tracking of changes?
A. Type 1
B. Type 2
C. Type 3
D. Type 0
✔️Answer: B
10. What key technique is used in Type 2 SCD to differentiate versions?
A. Time stamps only
B. A combination of key + version number
C. Random key generation
D. Removing keys
✔️Answer: B
✅ 5 Short Questions with Answers
1. What is the main purpose of handling Slowly Changing Dimensions in a data
warehouse?
✔️To accurately track and store changes in dimension data over time.
2. Why is Type 1 SCD not suitable for data warehousing?
✔️Because it overwrites the old data, losing historical information.
3. How does Type 2 SCD maintain historical accuracy?
✔️By creating a new row for every change, each with a unique key or version.
4. What is the key advantage of Type 3 SCD?
✔️It allows tracking of both current and previous values without changing keys.
5. What is the main disadvantage of Type 2 SCD?
✔️It increases the size of the dimension table as more versions are added.

Lecture 16

Here is a simple explanation, 10 MCQs, and short questions with answers based on Lecture
16: Extract Transform Load (ETL) from Data Warehousing (CS614).

📘 Simple Explanation (All Key Concepts Included)


🌐 What is ETL?

ETL stands for:

 Extract: Get data from different sources (like OLTP, web, or archived systems).
 Transform: Clean and change the data so it fits well together.
 Load: Store the cleaned data in the Data Warehouse (DWH).

👉 About 50% of a DWH project involves ETL processes.

🧱 ETL Architecture (Figure 16.1)

 Tier 0 (Source Systems): Operational databases, archived data, web data.


 Tier 1 (Data Warehouse Server): Where data is stored after processing.
 ETL box lies between Tier 0 and Tier 1 and acts as a bridge.
🔁 ETL Cycle (Figure 16.2)

1. Extract:
o Collect data from various sources.
o Sources may be complex and undocumented.
o Must be repeated periodically to keep DWH updated.
2. Transform:
o Change data into a consistent format.
o Data Cleansing: Remove errors, duplicates, and noise.
3. Load:
o Put the final, clean data into DWH.
o Includes data summary, indexing, backup, etc.

🔧 ETL Processing (Figure 16.3)

 ETL is more than just transform and load.


 It includes:
o Extraction planning.
o Network considerations.
o Backup strategies (robotic/manual tape libraries).
o Data marts (small), Data warehouse (large, needs full backup).

🔍 Data Extraction in ETL


1. Overview

 Extracting data is complex and time-consuming.


 Source systems are mostly OLTP (Online Transaction Processing).
 You can't modify these source systems easily.

2. Types of Extraction

a. Logical Extraction:

 Full Extraction: All data is taken each time. No need to track changes.
 Incremental Extraction: Only changed data is taken (e.g., using timestamps or change
tables).

b. Physical Extraction:
 Online Extraction: Directly from the live system or intermediate system.
 Offline Extraction: Data first saved in files/logs, then extracted (e.g., dump files, redo
logs).

✅ 10 MCQs with Answers


1. What does ETL stand for?
A. Extract Transfer Load
B. Extract Transform Load
C. Extract Translate Load
D. Execute Transform Load
✅ B. Extract Transform Load
2. Which activity includes data cleansing?
A. Extract
B. Load
C. Transform
D. None
✅ C. Transform
3. Which is NOT part of ETL?
A. Extract
B. Transport
C. Load
D. Transform
✅ B. Transport
4. In which tier does the ETL process act as a bridge?
A. Between Tier 1 and Tier 2
B. Between Tier 0 and Tier 1
C. Inside Tier 2
D. Inside Tier 0
✅ B. Between Tier 0 and Tier 1
5. Which is an example of a physical offline extraction method?
A. Change table
B. Redo logs
C. SQL join
D. Snapshot
✅ B. Redo logs
6. Which extraction method doesn’t require tracking changes?
A. Incremental
B. Full
C. Online
D. Offline
✅ B. Full
7. ETL is responsible for what percentage of the DWH project?
A. 25%
B. 50%
C. 75%
D. 100%
✅ B. 50%
8. Which of the following is true about data marts?
A. Larger than data warehouses
B. Harder to back up
C. Smaller and easier to back up
D. Use robotic tape libraries
✅ C. Smaller and easier to back up
9. Which of these is a logical extraction technique?
A. Full
B. Offline
C. Redo log
D. Dump file
✅ A. Full
10. What is the role of change tables in ETL?
A. Load data faster
B. Track data cleansing
C. Identify changed data
D. Create backups
✅ C. Identify changed data

✏️5 Short Questions with Answers


1. What is the purpose of ETL in data warehousing?
➤ ETL extracts data from various sources, transforms it into a consistent format
(including cleansing), and loads it into the data warehouse.
2. What is full data extraction?
➤ Full extraction pulls all the data from the source system each time, with no need to
track changes.
3. What is incremental data extraction?
➤ Incremental extraction pulls only the data that has changed since a defined point in
time using timestamps or change tables.
4. What is the difference between online and offline extraction?
➤ Online extraction connects directly to source systems; offline extraction uses staged
data from files/logs.
5. Why is data cleansing important in ETL?
➤ It removes incorrect, incomplete, or duplicate data to ensure quality and consistency in
the data warehouse.

Here is a simple explanation of all important concepts in your text, followed by 10 MCQs, 5
short questions with answers, and 1 long answer summary of the topic.
🌐 Simple Explanation of Key Concepts
1. Physical Data Extraction

 This is the first step in ETL (Extract, Transform, Load).


 Data is copied or moved from source systems (like OLTP or legacy systems) to a
staging area for processing.
 OLTP systems (Online Transaction Processing) are real-time systems like banking or
inventory apps.
 Legacy systems are old systems that are hard to change.
 Staging area: Temporary storage to clean, convert, and prepare data before final storage
in a data warehouse.
 Extraction can be done using SQL, vendor tools, or custom scripts.

2. Data Transformation

This is the T (Transform) in ETL. It prepares the extracted data to be used in reporting and
analytics.

Main Tasks:

1. Selection – Choose the needed part of the data.


2. Splitting/Joining – Split one field into many or join fields from different sources.
3. Conversion – Standardize data (e.g., different formats of name, date, ID).
o Examples: Changing Ali, Manager to Name: Ali, Title: Manager.
o Convert systems: EBCDIC to ASCII, Excel to Access.
4. Summarization – Combine data to a higher level (e.g., total daily sales instead of each
receipt).
5. Enrichment – Add more helpful data (e.g., adding postal code or unique ID if missing).
o Default values may be added.
o Mapping fields to the final structure of data warehouse (fact/dimension tables).

3. Data Loading Strategies

Once transformed, data is loaded into the data warehouse.

Key Factors to Consider:

 Data Freshness – How new the data needs to be.


 System Performance – Avoid slowdowns during loading.
 Data Volatility – How frequently data changes.

Loading Strategies:

1. Full Data Refresh – All data is replaced. Used during first load.
2. Incremental Refresh – Only changed data is added or updated.
3. Trickle Feed (Continuous Feed) – Real-time or near real-time updates; more complex
and costly.

✅ 10 MCQs with Answers


1. Which area is used to temporarily store data during the ETL process?
a) OLTP
b) Data Mart
c) Staging Area ✅
d) Dashboard
2. What is the main purpose of data transformation?
a) To delete old data
b) To clean and prepare data for analysis ✅
c) To store data permanently
d) To build dashboards
3. Which of the following is a basic task of data transformation?
a) Archiving
b) Encryption
c) Summarization ✅
d) Replication
4. What does data conversion help to achieve?
a) Compress data
b) Encrypt data
c) Standardize data ✅
d) Split tables
5. Which loading strategy updates the data warehouse continuously?
a) Full Refresh
b) Incremental
c) Trickle Feed ✅
d) Data Dump
6. Legacy systems are often:
a) Real-time and fast
b) Old and hard to change ✅
c) Web-based
d) Mobile apps
7. Summarization during data transformation helps in:
a) Encrypting data
b) Reducing data size ✅
c) Splitting files
d) Increasing details
8. Which transformation task combines fields from multiple sources?
a) Splitting
b) Joining ✅
c) Conversion
d) Enrichment
9. Which data type conversion might occur during transformation?
a) SQL to NoSQL
b) Text to Video
c) EBCDIC to ASCII ✅
d) JPEG to PNG
10. What can be used in the absence of source data during enrichment?
a) Null values
b) Random values
c) Default values ✅
d) Manual entry

✍️Short Questions with Answers


1. What is the purpose of the staging area in ETL?
To temporarily hold extracted data for cleaning and transformation before loading into
the data warehouse.
2. Name two basic tasks of data transformation.
Selection and Conversion.
3. What is trickle feed in data loading?
A strategy for continuous or real-time data loading using row-level inserts or updates.
4. Why is conversion important during transformation?
It standardizes data from different systems into a consistent and understandable format.
5. What are default values used for in data enrichment?
To fill in missing information such as postal codes or phone codes based on existing data.

📘 Long Answer Summary


Explain Data Extraction, Transformation, and Loading Strategies

The ETL process consists of three phases: Extraction, Transformation, and Loading. During
extraction, data is taken from source systems like OLTP (real-time systems) or legacy systems
(old systems). This data is moved or copied to a staging area for processing. Extraction can be
done through SQL, vendor tools, or custom scripts.
Transformation is the next phase, where the raw data is cleaned, converted, and organized. Key
transformation tasks include:

 Selection (choosing relevant parts),


 Splitting/Joining (dividing or combining fields),
 Conversion (standardizing format),
 Summarization (reducing detail), and
 Enrichment (adding useful info like default values or keys).

After transformation, the data is ready for the loading phase. There are three main loading
strategies:

1. Full Data Refresh (replace everything),


2. Incremental Refresh (only update changed data), and
3. Trickle Feed (real-time updates).

The choice of strategy depends on data freshness, system performance, and data volatility.
Balancing these factors helps ensure an efficient and responsive data warehouse.

Lecture 17
Here is a simple explanation of Lecture 17: Issues of ETL (Extract, Transform, Load), along
with 10 MCQs and short questions with answers. All concepts are covered clearly.

✅ Simple Explanation of Lecture 17: Issues of ETL

🔍 What is ETL?

ETL stands for Extract, Transform, Load. It is a process used to:

 Extract data from different sources,


 Transform it into a proper format, and
 Load it into a Data Warehouse (DWH) for analysis.

🧩 Why Are There ETL Issues?

Because data comes from many different sources, and:

 Each system stores data differently.


 Data is often dirty (wrong, incomplete, inconsistent).
 Documentation is missing or poor.
 Standardization tools are expensive and may not work across countries.
Example:

Different countries use different formats for names and addresses, so there's no single standard.
Automation becomes hard and costly.

Problems in ETL

1. Underestimated Complexity

People think ETL is simple: just move data. But it’s actually complex and:

 Needs careful planning.


 Requires handling huge volumes of inconsistent data.
 Breaks traditional DBMS rules in large systems.

2. Diversity of Platforms

Data can come from:

 Mainframes, Unix systems, Windows, Excel files, Text files, SAP, Oracle, etc.
Each system needs a different specialist or tool. Very complex!

3. Inconsistent Data Representations

Same data is stored in different ways:

 Date: 970314, 1997-03-14, March 14, 1997, 2450521.5 (Julian).


 Gender: M/F, Male/Female, 0/1, PM/PF (Probable Male/Female).
All must be converted to a single standard in the DWH.

4. Multiple Sources for Same Data

A single data point (e.g., gender or name) may come from more than one source.

 You must rank sources by data quality.


 If the best source is missing data, you can use the next best.

5. Complex Transformations

Three types:

1. Simple One-to-One (e.g., 0 → Male, 1 → Female)


2. One-to-Many (e.g., breaking one address field into city, street, etc.)
3. Many-to-Many (e.g., linking people to households using fuzzy matching)
6. Rigidity of Legacy Systems

Old systems (legacy) are:

 Difficult to modify.
 Slow and costly.
 Cannot be upgraded easily.

📝 Multiple Choice Questions (MCQs)


1. Why is ETL considered a difficult task?
A. It involves colorful graphics
B. It always has perfect data
C. It deals with multiple data sources and inconsistencies
D. It is only used for small databases
✅ Answer: C

2. What is a common problem with data in ETL?


A. Data is always clean
B. Data is well-documented
C. Data has inconsistent formats
D. Data comes from one source only
✅ Answer: C

3. What makes address data hard to process in ETL?


A. Everyone writes the same way
B. No standard format across countries
C. Only one way to write addresses
D. Addresses are not part of ETL
✅ Answer: B

4. What is the cost range of first-generation ETL tools?


A. $10–$50
B. $200K–$400K
C. Free
D. $10M
✅ Answer: B
5. What is a scalar transformation in ETL?
A. One-to-one value change
B. Converting many fields into one
C. Ignoring all data
D. Copying files manually
✅ Answer: A

6. Why must data from different sources be ranked?


A. For presentation
B. To ignore low-ranking sources
C. To choose the most reliable source per data element
D. To remove all data
✅ Answer: C

7. What is an example of inconsistent data representation?


A. All names in capital letters
B. Date formats like 1997-03-14 and March 14, 1997
C. Data from only one system
D. Same field size everywhere
✅ Answer: B

8. What makes ETL from legacy systems difficult?


A. Easy to change
B. Very fast
C. Cannot be modified or upgraded easily
D. Built for DWH
✅ Answer: C

9. How many ways can AT&T be spelled in data?


A. One
B. Two
C. 127
D. 10
✅ Answer: C
10. What is a "many-to-many" transformation?
A. Simple mapping
B. One address split into many parts
C. Grouping individuals into households using fuzzy logic
D. Copying one field
✅ Answer: C

✍️Short Questions with Answers


Q1: Why is ETL a challenging process?
Ans: ETL is challenging due to different data formats, multiple source systems, missing or dirty
data, and high complexity of transformations.

Q2: What does “inconsistent data representation” mean?


Ans: It means the same data is represented in different ways, like dates written as 1997-03-14,
14-Mar-1997, or Julian format.

Q3: What is data transformation in ETL?


Ans: It is the process of converting data into a consistent format. It includes simple (1-to-1), one-
to-many, and many-to-many transformations.

Q4: Why is ranking source systems important?


Ans: Ranking helps select the most reliable and clean data from multiple sources for the same
data element.

Q5: What makes legacy systems a problem in ETL?


Ans: Legacy systems are rigid, slow, and hard to modify, making ETL integration and
performance improvement difficult.

Would you like this in PDF format or Canva presentation too?

Lecture 18
Here’s a simple explanation of Lecture 18: ETL Detail – Data Extraction & Transformation
with all important points, followed by 10 MCQs, 5 short questions with answers:

💡 Simple Explanation (Without Leaving Any Concept)


🔷 What is ETL?

ETL stands for:

 Extract: Pulling data from different sources.


 Transform: Changing data into a proper format.
 Load: Putting it into a data warehouse for analysis.

🔷 What is Data Extraction?

This is the first step in ETL. It means pulling data from the source systems into a staging area.

🔹 Types:

 Full Extraction: Get the entire data every time (very slow and heavy).
 Incremental Extraction: Only get changed data (fast and efficient).

🔷 What is Change Data Capture (CDC)?

CDC is a technique that captures only the data that has changed (insert, update, delete), not
the full table.

Without CDC:

 We copy the whole table into files (slow and wasteful).


With CDC:
 We only extract new or changed data (efficient).

🔷 Two Types of Systems for CDC:

✅ 1. Modern Systems
Techniques:

 Timestamps: If a column shows the last update time, we use it to find changes.
 Triggers: Programs that run automatically on INSERT/UPDATE/DELETE and save
changes in logs.
 Partitioning: Dividing a table by date (e.g., by week) to easily access recent data.

✅ 2. Legacy Systems

 These are old systems using tapes for logs (like old journals).
 Problems:
o Tapes have a lot of useless data.
o Format is old and complex.
o Contains addresses, not real data.
o Hard to read and understand.

🔷 Advantages of CDC:

1. Immediate data capture – no delay.


2. Keeps full change history.
3. No flat files needed.
4. No extra online I/O needed.
5. Captures all changes from log tape.
6. Processing can be done offline.
7. Allows transformation before loading (better control).

🔷 Major Transformation Types (in ETL):

These are ways to clean or reshape data before loading into the warehouse.

1. Format Revision: Change format (e.g., numeric to text).


2. Decoding Fields: Convert codes into real words (e.g., “M” to “Male”).
3. Calculated/Derived Values: Create new values like profit, age, GPA.
4. Splitting Fields: Separate full name into first, middle, last.
5. Merging Information: Combine multiple fields into one.
6. Character Set Conversion: Change text encoding.
7. Unit Conversion: e.g., inches to cm.
8. Date/Time Conversion: Standardize dates.
9. Summarization: e.g., total sales by month.
10. Key Restructuring: Modify primary keys.
11. Duplication: Create extra copies if needed.
✅ MCQs (with Correct Answers)
1. Which ETL step focuses on pulling data from sources?
a) Transform
b) Load
✅ c) Extract
d) Filter
2. Change Data Capture (CDC) is used to:
a) Delete duplicate records
✅ b) Capture only modified data
c) Convert units
d) Split fields
3. Which technique is NOT used in modern CDC systems?
a) Timestamps
b) Triggers
✅ c) Journal tapes
d) Partitioning
4. Why is incremental extraction better than full extraction?
a) It copies all records
✅ b) It copies only changed data
c) It ignores delete operations
d) It uses more storage
5. Legacy systems store changes in:
✅ a) Journal tapes
b) Cloud logs
c) Modern triggers
d) Timestamps
6. One advantage of CDC is:
✅ a) No need for flat files
b) It avoids insert operations
c) All data is duplicated
d) Deletes are ignored
7. Which transformation type turns “1” and “2” into “Male” and “Female”?
a) Format Revision
✅ b) Decoding Fields
c) Merging
d) Key restructuring
8. Which transformation calculates profit from sales and cost?
a) Merging
b) Duplication
✅ c) Calculated/Derived Values
d) Summarization
9. Which CDC method uses triggers for every DML operation?
a) Partitioning
✅ b) Triggers
c) Flat files
d) Format revision
10. Which transformation changes date format from MM-DD-YYYY to DD-MM-
YYYY?
a) Unit Conversion
✅ b) Date/Time Conversion
c) Splitting
d) Duplication

📝 Short Questions with Answers


1. What is CDC and why is it used?
Answer: Change Data Capture captures only the changed data (insert, update, delete) to
make ETL efficient and reduce load on systems.
2. What are the main techniques of CDC in modern systems?
Answer: Timestamps, triggers, and partitioning.
3. What are the problems of CDC in legacy systems?
Answer: Log tapes have extra data, complex formats, contain addresses not values, and
vary by DBMS.
4. Name three common transformation types in ETL.
Answer: Format revision, decoding fields, calculated/derived values.
5. Why is CDC preferred over full extraction?
Answer: It’s faster, avoids processing full tables, and allows better transformation before
loading.

Let me know if you want this in PDF, PowerPoint, or Canva format, or want questions in
Urdu as well.

Common questions

Powered by AI

Historical data is important in a Data Warehouse because it helps in analyzing trends and predicting future behavior, thereby aiding in decision-making processes. This data is used to provide a big-picture analysis which is crucial for predicting future events and making informed decisions .

Vertical and horizontal splitting differ significantly in their application: vertical splitting involves dividing a table's columns into separate tables to manage large or rarely accessed columns better, thereby reducing row size and improving I/O performance. On the other hand, horizontal splitting divides a table's rows into separate tables based on specific criteria, facilitating parallel processing and reducing the data load on each query to smaller, more manageable partitions .

Scaling is particularly challenging in Data Warehouses because algorithms that perform adequately on smaller datasets can become inefficient at larger scales. Managing billions of records requires optimization techniques that can handle significant amounts of data efficiently, which often makes traditional approaches unsuitable .

De-normalization improves performance in Decision Support Systems by reducing the need for joins during query execution. This reduction decreases access times and allows for quicker retrieval of large datasets, which is common in DSS environments that require rapid data access .

CEOs and marketing managers use ad-hoc access in a Data Warehouse because it allows them to generate queries as needed without predefined paths. This capability is crucial for exploring different data sets to make strategic decisions based on current and actionable insights, which might not be anticipated beforehand .

A Data Warehouse system might add redundant columns to avoid frequent joins between tables, thus improving query performance by accessing data directly rather than through joins. However, potential drawbacks include increased storage requirements and additional complexity in maintaining data integrity and consistency, as any updated data must be manually propagated to ensure all redundant fields are synchronized .

Pre-computing aggregates at multiple levels benefits OLAP systems by enabling quick responses to queries about different data dimensions. By having pre-aggregated data, OLAP systems can readily provide insights across various categories, such as time, geography, or product lines, without recalculating data each time a query is run. This improves the system’s efficiency and responsiveness to user inquiries .

A Data Warehouse is not characterized by being 'volatile' or 'frequently updated'. Unlike transactional databases, data in a Data Warehouse is non-volatile, meaning once data is entered it doesn't change. Data Warehouses are structured to handle historical data and support decision-making rather than real-time updates .

Derived attributes play a critical role in a Data Warehouse by storing pre-computed values, which are used to quickly answer queries that would otherwise require real-time calculations. This approach enhances performance by providing fast access to frequently needed computed data, thereby reducing query runtime .

Decision support systems using OLAP face several challenges, including the inability to predetermine all relevant queries because decision-makers may not know all necessary questions in advance. Additionally, users often lack programming skills for complex SQL queries, necessitating intuitive OLAP tools. Furthermore, generating SQL queries on-the-fly can be slow when working with large data volumes; thus, precomputed aggregates are used to swiftly answer common queries .

You might also like