Lecture No 04
Data Warehousing
By: Dr. Syed Aun Irtaza
De-normalization
Data warehousing
Normalization
Normalization
Normalization is a process that “improves” a
database design for OLTP systems by generating
relations that are simple and stable in structure.
The objective of normalization:
“to create relations where every dependency is
on the key, the whole key, and nothing but the
key”.
3
Normalization
What are the goals of normalization?
Eliminate redundant data.
Ensure data dependencies make sense.
What is the result of normalization?
What are the levels of normalization?
Always follow purists approach of
normalization?
NO
4
Normalization
Consider a student database system to be developed for a multi-campus
university, such that it specializes in one degree program at a campus i.e. BS,
MS or PhD.
SID Degree Campus Course Marks SID: Student ID
1 BS Islamabad CS-101 30
1 BS Islamabad CS-102 20 Degree: Registered as BS or MS student
1 BS Islamabad CS-103 40
Campus: City where campus is located
1 BS Islamabad CS-104 20
1 BS Islamabad CS-105 10 Course: Course taken
1 BS Islamabad CS-106 10
Marks: Score out of max of 50
2 MS Lahore CS-101 30
2 MS Lahore CS-102 40
3 MS Lahore CS-102 20
4 BS Islamabad CS-102 20
4 BS Islamabad CS-104 30
4 BS Islamabad CS-105 40
5
Normalization: 1NF
Only contains atomic values, BUT also contains redundant
FIRST data.
SID Degree Campus Course Marks
1 BS Islamabad CS-101 30
1 BS Islamabad CS-102 20
1 BS Islamabad CS-103 40
1 BS Islamabad CS-104 20
1 BS Islamabad CS-105 10
1 BS Islamabad CS-106 10
2 MS Lahore CS-101 30
2 MS Lahore CS-102 40
3 MS Lahore CS-102 20
4 BS Islamabad CS-102 20
4 BS Islamabad CS-104 30
4 BS Islamabad CS-105 40
6
Normalization: 1NF
Anomalies
INSERT. Certain student with SID 5 got admission in a
different campus (say) Karachi cannot be added until
the student registers for a course.
DELETE. If student graduates and his/her corresponding
record is deleted, then all information about that
student is lost.
UPDATE. If student migrates from Islamabad campus to
Lahore campus (say) SID = 1, then six rows would have
to be updated with this new information.
7
Normalization: 2NF
Every non-key column is fully dependent on the PK
FIRST is in 1NF but not in 2NF because degree and campus are
functionally dependent upon only on the column SID of the
composite key (SID, course). This can be illustrated by listing the
functional dependencies in the table:
SID —> campus, degree
SID & Campus are NOT unique
campus —> degree
(SID, Course) —> Marks
To transform the table FIRST into 2NF we move the columns SID, Degree
and Campus to a new table called REGISTRATION. The column SID
becomes the primary key of this new table.
8
Normalization: 2NF
SID Degree Campus
REGISTRATION
SID Course Marks
PERFORMANCE
1 BS Islamabad 1 CS-101 30
2 MS Lahore 1 CS-102 20
3 MS Lahore 1 CS-103 40
4 BS Islamabad 1 CS-104 20
5 PhD Peshawar 1 CS-105 10
1 CS-106 10
SID is now a PK 2 CS-101 30
2 CS-102 40
3 CS-102 20
4 CS-102 20
4 CS-104 30
4 CS-105 40
PERFORMANCE in 2NF as (SID, Course) uniquely identify
Marks
9
Normalization: 2NF
Presence of modification anomalies for tables in
2NF. For the table REGISTRATION, they are:
INSERT: Until a student gets registered in a degree
program, that program cannot be offered!
DELETE: Deleting any row from REGISTRATION destroys
all other facts in the table.
Why there are anomalies?
The table is in 2NF but NOT in 3NF
10
Normalization: 3NF
All columns must be dependent only on the primary key.
Table PERFORMANCE is already in 3NF. The non-key column, marks, is
fully dependent upon the primary key (SID, degree).
REGISTRATION is in 2NF but not in 3NF because it contains a transitive
dependency.
A transitive dependency occurs when a non-key column that is a
determinant of the primary key is the determinate of other columns.
The concept of a transitive dependency can be illustrated by showing the
functional dependencies in REGISTRATION:
[Link] —> [Link]
[Link] —> [Link]
[Link] —> [Link]
Note that [Link] is determined both by the primary key SID
and the non-key column campus.
11
Normalization: 3NF
To transform REGISTRATION into 3NF, we create
a new table called CAMPUS_DEGREE and move
the columns campus and degree into it.
Degree is deleted from the original table,
campus is left behind to serve as a foreign key
to CAMPUS_DEGREE, and the original table is
renamed to STUDENT_CAMPUS to reflect its
semantic meaning.
12
Normalization: 3NF
STUDENT_CAMPUS
SID Campus
1 Islamabad
REGISTRATION 2 Lahore
SID Degree Campus 3 Lahore
1 BS Islamabad 4 Islamabad
2 MS Lahore 5 Peshawar
3 MS Lahore
4 BS Islamabad
CAMPUS_DEGREE
5 PhD Peshawar
Campus Degree
Islamabad BS
Lahore MS
Peshawar PhD
13
Normalization: 3NF
Removal of anomalies and improvement
in queries as follows:
INSERT: Able to first offer a degree program,
and then students registering in it.
UPDATE: Migrating students between
campuses by changing a single row.
DELETE: Deleting information about a
course, without deleting facts about all
columns in the record.
14
Normalization
Conclusions:
Normalization guidelines are cumulative.
Generally a good idea to only ensure 2NF.
3NF is at the cost of simplicity and
performance.
There is a 4NF with no multi-valued
dependencies.
There is also a 5NF.
15
Striking a balance between “good” & “evil”
De-normalization Normalization
Too many tables
4+ Normal Forms
3rd Normal Form
2nd Normal Form
Data Cubes 1 st
Normal Form
Data Lists
Flat Table One big flat file
16
What is De-normalization?
It is not chaos, more like a “controlled crash”
with the aim of performance enhancement
without loss of information.
Normalization is a rule of thumb in DBMS, but
in DSS ease of use is achieved by way of de-
normalization.
De-normalization comes in many flavors, such
as combining tables, splitting tables, adding
data etc., but all done very carefully.
17
Why De-normalization In DSS?
Bringing “close” dispersed but related data items.
Query performance in DSS significantly dependent
on physical data model.
Very early studies showed performance difference
in orders of magnitude for different number de-
normalized tables and rows per table.
The level of de-normalization should be carefully
considered.
18
How De-normalization improves performance?
De-normalization specifically improves
performance by either:
Reducing the number of tables and hence
the reliance on joins, which consequently
speeds up performance.
Reducing the number of joins required
during query execution, or
Reduces the amount of I/O operations.
19
4 Guidelines for De-normalization
1. Carefully do a cost-benefit analysis
(frequency of use, additional storage,
join time).
2. Do a data requirement and storage
analysis.
3. Weigh against the maintenance issue of
the redundant data (triggers used).
4. When in doubt, don’t denormalize.
20
Areas for Applying De-Normalization Techniques
Dealing with the abundance of star schemas.
Fast access of time series data for analysis.
Fast aggregate (sum, average etc.) results and
complicated calculations.
Multidimensional analysis (e.g. geography) in a
complex hierarchy.
Dealing with few updates but many join queries.
De-normalization will ultimately affect the database size
and query performance.
21
Five principal De-normalization techniques
1. Collapsing Tables.
- Two entities with a One-to-One relationship.
- Two entities with a Many-to-Many relationship.
2. Splitting Tables (Horizontal/Vertical Splitting).
3. Pre-Joining.
4. Adding Redundant Columns (Reference Data).
5. Derived Attributes (Summary, Total, Balance etc).
22
Collapsing Tables
ColA ColB
denormalized
ColA ColB ColC
normalized
ColA ColC
Reduced storage space.
Reduced update time.
Does not changes business view.
Reduced foreign keys.
Reduced indexing.
23
Splitting Tables
Table Table_v1 Table_v2
ColA ColB ColC ColA ColB ColA ColC
Vertical Split
Table_h1 Table_h2
ColA ColB ColC ColA ColB ColC
Horizontal split
24
Splitting Tables: Horizontal splitting…
Breaks a table into multiple tables based upon
common column values. Example: Campus
specific queries.
GOAL
Spreading rows for exploiting parallelism.
Grouping data to avoid unnecessary query load
in WHERE clause.
25
Splitting Tables: Horizontal splitting
ADVANTAGE
Enhance security of data.
Organizing tables differently for different
queries.
Graceful degradation of database in case of
table damage.
Fewer rows result in fast indexing.
26
Splitting Tables: Vertical Splitting
Infrequently accessed columns become extra
“baggage” thus degrading performance.
Very useful for rarely accessed large text
columns with large headers.
Header size is reduced, allowing more rows per
block, thus reducing I/O.
Splitting and distributing into separate files
with repeating primary key.
27
Pre-joining …
Identify frequent joins and append the tables
together in the physical data model.
Generally used for 1:M such as master-detail.
RI is assumed to exist.
Additional space is required as the master
information is repeated in the new header
table.
28
Pre-Joining…
Master
Sale_ID Sale_date Sale_person
normalized
1 M
Tx_ID Sale_ID Item_ID Item_QtySale_Rs Detail
denormalized
Tx_ID Sale_ID Sale_date Sale_personItem_ID Item_QtySale_Rs
29
Pre-Joining: Typical Scenario
Typical of Market basket query
Join ALWAYS required
Tables could be millions of rows
Squeeze Master into Detail
Repetition of facts. How much?
Detail 3-4 times of master
30
Adding Redundant Columns…
Table_1 Table_1’
ColA ColB ColA ColB ColC
Table_2 Table_2
ColA ColC ColD … ColZ ColA ColC ColD … ColZ
31
Adding Redundant Columns…
Columns can also be moved, instead of making
them redundant. Very similar to pre-joining as
discussed earlier.
EXAMPLE
Frequent referencing of code in one table and
corresponding description in another table.
A join is required.
To eliminate the join, a redundant attribute
added in the target entity which is functionally
independent of the primary key.
32
Derived Attributes: Example
DWH Data Model
Business Data Model
#SID #SID
DoB DoB
Degree Degree
Course Course
Grade Grade
Credits Credits Derived attribute
GP Calculated once
DoB: Date of Birth Age Used Frequently
Age is also a derived attribute, calculated as
Current_Date – DoB (calculated periodically).
GP (Grade Point) column in the data warehouse data
model is included as a derived value. The formula for
calculating this field is Grade*Credits.
33
Issues of Denormalization
Storage
Performance
Ease-of-use
Maintenance
34
Industry Characteristics
Master:Detail Ratios
Health care 1:2 ratio
Video Rental 1:3 ratio
Retail 1:30 ratio
35
Storage Issues: Pre-joining Facts
Assume 1:2 record count ratio between claim
master and detail for health-care application.
Assume 10 million members (20 million records
in claim detail).
Assume 10 byte member_ID.
Assume 40 byte header for master and 60 byte
header for detail tables.
36
Storage Issues: Pre-joining (Calculations)
With normalization:
Total space used = 10 x 40 + 20 x 60 = 1.6 GB
After denormalization:
Total space used = (60 + 40 – 10) x 20 = 1.8 GB
Net result is 12.5% additional space required in
raw data table size for the database.
37
Performance Issues: Pre-joining
Consider the query “How many members were
paid claims during last year?”
With normalization:
Simply count the number of records in the
master table.
After denormalization:
The member_ID would be repeated, hence
need a count distinct. This will cause sorting
on a larger table and degraded performance.
38
Performance Issues: Adding redundant columns
Continuing with the previous Health-Care
example, assuming a 60 byte detail table and 10
byte Sale_Person.
◦ Copying the Sale_Person to the detail table results in
all scans taking 16% longer than previously.
◦ Justifiable only if significant portion of queries get
benefit by accessing the denormalized detail table.
◦ Need to look at the cost-benefit trade-off for each
denormalization decision.
39
Other Issues: Adding redundant columns
Other issues include, increase in table size,
maintenance and loss of information:
The size of the (largest table i.e.) transaction table
increases by the size of the Sale_Person key.
For the example being considered, the detail table size
increases from 1.2 GB to 1.32 GB.
If the Sale_Person key changes (e.g. new 12 digit
NID), then updates to be reflected all the way to
transaction table.
In the absence of 1:M relationship, column movement
will actually result in loss of data.
40
Ease of use Issues: Horizontal Splitting
Horizontal splitting is a Divide&Conquer technique that exploits
parallelism. The conquer part of the technique is about combining the
results.
Lets see how it works for hash based splitting/partitioning.
Assuming uniform hashing, hash splitting supports even data
distribution across all partitions in a pre-defined manner.
However, hash based splitting is not easily reversible to eliminate the
split.
41
Ease of use Issues: Horizontal Splitting
42
Ease of use Issues: Horizontal Splitting
Round robin and random splitting:
◦ Guarantee good data distribution.
◦ Almost impossible to reverse (or
undo).
◦ Not pre-defined.
43
Ease of use Issues: Horizontal Splitting
Range and expression splitting:
◦ Can facilitate partition elimination
with a smart optimizer.
◦ Generally lead to "hot spots” (uneven
distribution of data).
44
Performance Issues: Horizontal Splitting
Dramatic
cancellation of airline
reservations after
Processors 9/11, resulting in
“hot spot”
P1 P2 P3 P4
1998 1999 2000 2001
Splitting based on year
45
Performance issues: Vertical Splitting Facts
Example: Consider a 100 byte header for the
member table such that 20 bytes provide
complete coverage for 90% of the queries.
Split the member table into two parts as
follows:
1. Frequently accessed portion of table (20 bytes), and
2. Infrequently accessed portion of table (80+ bytes).
Why 80+?
Note that primary key (member_id) must be
present in both tables for eliminating the
split.
46
Performance issues: Vertical Splitting Good vs. Bad
Scanning the claim table for most frequently used
queries will be 500% faster with vertical splitting
Ironically, for the “infrequently” accessed queries
the performance will be inferior as compared to
the un-split table because of the join overhead.
47