0% found this document useful (0 votes)

14 views47 pages

Understanding Data Normalization Techniques

The document discusses normalization and denormalization in data warehousing. It explains the goals and levels of normalization from 1NF to 3NF and provides examples to illustrate how normalization removes anomalies and improves database design. While normalization is important, the document notes that a balance needs to be struck between normalization and denormalization. Denormalization is discussed as a controlled process done to enhance query performance in data warehousing without losing information.

Uploaded by

karl evans

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views47 pages

Understanding Data Normalization Techniques

Uploaded by

karl evans

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

Lecture No 04

Data Warehousing
By: Dr. Syed Aun Irtaza
De-normalization
Data warehousing
Normalization
Normalization
Normalization is a process that “improves” a
database design for OLTP systems by generating
relations that are simple and stable in structure.

The objective of normalization:

“to create relations where every dependency is
on the key, the whole key, and nothing but the
key”.

3
Normalization
What are the goals of normalization?
 Eliminate redundant data.
 Ensure data dependencies make sense.

What is the result of normalization?

What are the levels of normalization?

Always follow purists approach of

normalization?
NO
4
Normalization
Consider a student database system to be developed for a multi-campus
university, such that it specializes in one degree program at a campus i.e. BS,
MS or PhD.
SID Degree Campus Course Marks SID: Student ID
1 BS Islamabad CS-101 30
1 BS Islamabad CS-102 20 Degree: Registered as BS or MS student
1 BS Islamabad CS-103 40
Campus: City where campus is located
1 BS Islamabad CS-104 20

1 BS Islamabad CS-105 10 Course: Course taken

1 BS Islamabad CS-106 10
Marks: Score out of max of 50
2 MS Lahore CS-101 30
2 MS Lahore CS-102 40
3 MS Lahore CS-102 20
4 BS Islamabad CS-102 20
4 BS Islamabad CS-104 30

4 BS Islamabad CS-105 40

5
Normalization: 1NF
Only contains atomic values, BUT also contains redundant
FIRST data.
SID Degree Campus Course Marks
1 BS Islamabad CS-101 30

1 BS Islamabad CS-102 20

1 BS Islamabad CS-103 40

1 BS Islamabad CS-104 20

1 BS Islamabad CS-105 10

1 BS Islamabad CS-106 10

2 MS Lahore CS-101 30
2 MS Lahore CS-102 40
3 MS Lahore CS-102 20
4 BS Islamabad CS-102 20
4 BS Islamabad CS-104 30

4 BS Islamabad CS-105 40

6
Normalization: 1NF
Anomalies
INSERT. Certain student with SID 5 got admission in a
different campus (say) Karachi cannot be added until
the student registers for a course.

DELETE. If student graduates and his/her corresponding

record is deleted, then all information about that
student is lost.

UPDATE. If student migrates from Islamabad campus to

Lahore campus (say) SID = 1, then six rows would have
to be updated with this new information.
7
Normalization: 2NF
Every non-key column is fully dependent on the PK

FIRST is in 1NF but not in 2NF because degree and campus are
functionally dependent upon only on the column SID of the
composite key (SID, course). This can be illustrated by listing the
functional dependencies in the table:

SID —> campus, degree

SID & Campus are NOT unique
campus —> degree

(SID, Course) —> Marks

To transform the table FIRST into 2NF we move the columns SID, Degree
and Campus to a new table called REGISTRATION. The column SID
becomes the primary key of this new table.

8
Normalization: 2NF
SID Degree Campus
REGISTRATION

SID Course Marks

PERFORMANCE
1 BS Islamabad 1 CS-101 30
2 MS Lahore 1 CS-102 20
3 MS Lahore 1 CS-103 40
4 BS Islamabad 1 CS-104 20
5 PhD Peshawar 1 CS-105 10
1 CS-106 10
SID is now a PK 2 CS-101 30
2 CS-102 40
3 CS-102 20
4 CS-102 20
4 CS-104 30
4 CS-105 40

PERFORMANCE in 2NF as (SID, Course) uniquely identify

Marks
9
Normalization: 2NF
Presence of modification anomalies for tables in
2NF. For the table REGISTRATION, they are:

 INSERT: Until a student gets registered in a degree

program, that program cannot be offered!

 DELETE: Deleting any row from REGISTRATION destroys

all other facts in the table.

Why there are anomalies?

The table is in 2NF but NOT in 3NF

10
Normalization: 3NF
All columns must be dependent only on the primary key.
Table PERFORMANCE is already in 3NF. The non-key column, marks, is
fully dependent upon the primary key (SID, degree).

REGISTRATION is in 2NF but not in 3NF because it contains a transitive

dependency.

A transitive dependency occurs when a non-key column that is a

determinant of the primary key is the determinate of other columns.

The concept of a transitive dependency can be illustrated by showing the

functional dependencies in REGISTRATION:

[Link] —> [Link]

[Link] —> [Link]
[Link] —> [Link]

Note that [Link] is determined both by the primary key SID

and the non-key column campus.
11
Normalization: 3NF

To transform REGISTRATION into 3NF, we create

a new table called CAMPUS_DEGREE and move
the columns campus and degree into it.

Degree is deleted from the original table,

campus is left behind to serve as a foreign key
to CAMPUS_DEGREE, and the original table is
renamed to STUDENT_CAMPUS to reflect its
semantic meaning.

12
Normalization: 3NF

STUDENT_CAMPUS
SID Campus
1 Islamabad
REGISTRATION 2 Lahore
SID Degree Campus 3 Lahore

1 BS Islamabad 4 Islamabad

2 MS Lahore 5 Peshawar

3 MS Lahore
4 BS Islamabad
CAMPUS_DEGREE
5 PhD Peshawar
Campus Degree
Islamabad BS

Lahore MS
Peshawar PhD

13
Normalization: 3NF

Removal of anomalies and improvement

in queries as follows:

 INSERT: Able to first offer a degree program,

and then students registering in it.

 UPDATE: Migrating students between

campuses by changing a single row.

 DELETE: Deleting information about a

course, without deleting facts about all
columns in the record.
14
Normalization

Conclusions:

 Normalization guidelines are cumulative.

 Generally a good idea to only ensure 2NF.

 3NF is at the cost of simplicity and

performance.

 There is a 4NF with no multi-valued

dependencies.
 There is also a 5NF.
15
Striking a balance between “good” & “evil”
De-normalization Normalization
Too many tables
4+ Normal Forms

3rd Normal Form

2nd Normal Form

Data Cubes 1 st
Normal Form

Data Lists

Flat Table One big flat file

16
What is De-normalization?
 It is not chaos, more like a “controlled crash”
with the aim of performance enhancement
without loss of information.

 Normalization is a rule of thumb in DBMS, but

in DSS ease of use is achieved by way of de-
normalization.

 De-normalization comes in many flavors, such

as combining tables, splitting tables, adding
data etc., but all done very carefully.

17
Why De-normalization In DSS?
 Bringing “close” dispersed but related data items.

Query performance in DSS significantly dependent

on physical data model.

 Very early studies showed performance difference

in orders of magnitude for different number de-
normalized tables and rows per table.

The level of de-normalization should be carefully

considered.

18
How De-normalization improves performance?

De-normalization specifically improves

performance by either:

 Reducing the number of tables and hence

the reliance on joins, which consequently
speeds up performance.

 Reducing the number of joins required

during query execution, or

 Reduces the amount of I/O operations.

19
4 Guidelines for De-normalization
1. Carefully do a cost-benefit analysis
(frequency of use, additional storage,
join time).
2. Do a data requirement and storage
analysis.
3. Weigh against the maintenance issue of
the redundant data (triggers used).
4. When in doubt, don’t denormalize.

20
Areas for Applying De-Normalization Techniques

 Dealing with the abundance of star schemas.

 Fast access of time series data for analysis.

 Fast aggregate (sum, average etc.) results and

complicated calculations.

 Multidimensional analysis (e.g. geography) in a

complex hierarchy.

 Dealing with few updates but many join queries.

De-normalization will ultimately affect the database size

and query performance.

21
Five principal De-normalization techniques

1. Collapsing Tables.
- Two entities with a One-to-One relationship.
- Two entities with a Many-to-Many relationship.

2. Splitting Tables (Horizontal/Vertical Splitting).

3. Pre-Joining.

4. Adding Redundant Columns (Reference Data).

5. Derived Attributes (Summary, Total, Balance etc).

22
Collapsing Tables
ColA ColB
denormalized

ColA ColB ColC

normalized

ColA ColC

 Reduced storage space.

 Reduced update time.

 Does not changes business view.

 Reduced foreign keys.

 Reduced indexing.

23
Splitting Tables
Table Table_v1 Table_v2
ColA ColB ColC ColA ColB ColA ColC

Vertical Split
Table_h1 Table_h2

ColA ColB ColC ColA ColB ColC

Horizontal split

24
Splitting Tables: Horizontal splitting…
Breaks a table into multiple tables based upon
common column values. Example: Campus
specific queries.

GOAL

 Spreading rows for exploiting parallelism.

 Grouping data to avoid unnecessary query load

in WHERE clause.

25
Splitting Tables: Horizontal splitting
ADVANTAGE
 Enhance security of data.
 Organizing tables differently for different
queries.

 Graceful degradation of database in case of

table damage.

 Fewer rows result in fast indexing.

26
Splitting Tables: Vertical Splitting
 Infrequently accessed columns become extra
“baggage” thus degrading performance.

Very useful for rarely accessed large text

columns with large headers.

 Header size is reduced, allowing more rows per

block, thus reducing I/O.

Splitting and distributing into separate files

with repeating primary key.

27
Pre-joining …
 Identify frequent joins and append the tables
together in the physical data model.

 Generally used for 1:M such as master-detail.

RI is assumed to exist.

 Additional space is required as the master

information is repeated in the new header
table.

28
Pre-Joining…
Master
Sale_ID Sale_date Sale_person
normalized

1 M
Tx_ID Sale_ID Item_ID Item_QtySale_Rs Detail
denormalized

Tx_ID Sale_ID Sale_date Sale_personItem_ID Item_QtySale_Rs

29
Pre-Joining: Typical Scenario
Typical of Market basket query
Join ALWAYS required
Tables could be millions of rows

Squeeze Master into Detail

Repetition of facts. How much?

Detail 3-4 times of master

30
Adding Redundant Columns…
Table_1 Table_1’
ColA ColB ColA ColB ColC

Table_2 Table_2

ColA ColC ColD … ColZ ColA ColC ColD … ColZ

31
Adding Redundant Columns…
Columns can also be moved, instead of making
them redundant. Very similar to pre-joining as
discussed earlier.

EXAMPLE
Frequent referencing of code in one table and
corresponding description in another table.
 A join is required.

 To eliminate the join, a redundant attribute

added in the target entity which is functionally
independent of the primary key.
32
Derived Attributes: Example
DWH Data Model
Business Data Model
#SID #SID
DoB DoB
Degree Degree
Course Course
Grade Grade
Credits Credits Derived attribute
GP  Calculated once
DoB: Date of Birth Age  Used Frequently

Age is also a derived attribute, calculated as

Current_Date – DoB (calculated periodically).

GP (Grade Point) column in the data warehouse data

model is included as a derived value. The formula for
calculating this field is Grade*Credits.
33
Issues of Denormalization

 Storage
 Performance
 Ease-of-use

 Maintenance

34
Industry Characteristics
Master:Detail Ratios
 Health care 1:2 ratio

 Video Rental 1:3 ratio

 Retail 1:30 ratio

35
Storage Issues: Pre-joining Facts
 Assume 1:2 record count ratio between claim
master and detail for health-care application.
 Assume 10 million members (20 million records
in claim detail).
 Assume 10 byte member_ID.
 Assume 40 byte header for master and 60 byte
header for detail tables.

36
Storage Issues: Pre-joining (Calculations)
With normalization:
Total space used = 10 x 40 + 20 x 60 = 1.6 GB

After denormalization:
Total space used = (60 + 40 – 10) x 20 = 1.8 GB

Net result is 12.5% additional space required in

raw data table size for the database.

37
Performance Issues: Pre-joining
Consider the query “How many members were
paid claims during last year?”

With normalization:
Simply count the number of records in the
master table.

After denormalization:
The member_ID would be repeated, hence
need a count distinct. This will cause sorting
on a larger table and degraded performance.

38
Performance Issues: Adding redundant columns
Continuing with the previous Health-Care
example, assuming a 60 byte detail table and 10
byte Sale_Person.

◦ Copying the Sale_Person to the detail table results in

all scans taking 16% longer than previously.

◦ Justifiable only if significant portion of queries get

benefit by accessing the denormalized detail table.

◦ Need to look at the cost-benefit trade-off for each

denormalization decision.

39
Other Issues: Adding redundant columns
Other issues include, increase in table size,
maintenance and loss of information:

 The size of the (largest table i.e.) transaction table

increases by the size of the Sale_Person key.
 For the example being considered, the detail table size
increases from 1.2 GB to 1.32 GB.

 If the Sale_Person key changes (e.g. new 12 digit

NID), then updates to be reflected all the way to
transaction table.

 In the absence of 1:M relationship, column movement

will actually result in loss of data.

40
Ease of use Issues: Horizontal Splitting
Horizontal splitting is a Divide&Conquer technique that exploits
parallelism. The conquer part of the technique is about combining the
results.

Lets see how it works for hash based splitting/partitioning.

 Assuming uniform hashing, hash splitting supports even data

distribution across all partitions in a pre-defined manner.

 However, hash based splitting is not easily reversible to eliminate the

split.

41
Ease of use Issues: Horizontal Splitting

42
Ease of use Issues: Horizontal Splitting
 Round robin and random splitting:
◦ Guarantee good data distribution.
◦ Almost impossible to reverse (or
undo).
◦ Not pre-defined.

43
Ease of use Issues: Horizontal Splitting
 Range and expression splitting:
◦ Can facilitate partition elimination
with a smart optimizer.
◦ Generally lead to "hot spots” (uneven
distribution of data).

44
Performance Issues: Horizontal Splitting

Dramatic
cancellation of airline
reservations after
Processors 9/11, resulting in
“hot spot”
P1 P2 P3 P4

1998 1999 2000 2001

Splitting based on year

45
Performance issues: Vertical Splitting Facts
Example: Consider a 100 byte header for the
member table such that 20 bytes provide
complete coverage for 90% of the queries.

Split the member table into two parts as

follows:

1. Frequently accessed portion of table (20 bytes), and

2. Infrequently accessed portion of table (80+ bytes).

Why 80+?

Note that primary key (member_id) must be

present in both tables for eliminating the
split.
46
Performance issues: Vertical Splitting Good vs. Bad

Scanning the claim table for most frequently used

queries will be 500% faster with vertical splitting

Ironically, for the “infrequently” accessed queries

the performance will be inferior as compared to
the un-split table because of the join overhead.

Data Normalization and De-normalization Guide
No ratings yet
Data Normalization and De-normalization Guide
38 pages
Beza
No ratings yet
Beza
7 pages
Data Warehousing: Understanding Normalization
No ratings yet
Data Warehousing: Understanding Normalization
14 pages
De Normalization
No ratings yet
De Normalization
21 pages
Database Normalization vs Denormalization
No ratings yet
Database Normalization vs Denormalization
21 pages
Understanding Second Normal Form (2NF)
No ratings yet
Understanding Second Normal Form (2NF)
50 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
29 pages
Database Normalization and SQL Basics
No ratings yet
Database Normalization and SQL Basics
66 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
16 pages
De-normalization in Data Warehousing
No ratings yet
De-normalization in Data Warehousing
9 pages
Understanding Relational Databases and Normalization
No ratings yet
Understanding Relational Databases and Normalization
14 pages
Understanding Database Normalization Concepts
No ratings yet
Understanding Database Normalization Concepts
10 pages
Database Normalization Overview
No ratings yet
Database Normalization Overview
37 pages
Database Table Normalization Guide
No ratings yet
Database Table Normalization Guide
34 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
31 pages
Database Normalization Study Guide
No ratings yet
Database Normalization Study Guide
7 pages
Understanding Functional Dependency in DBMS
No ratings yet
Understanding Functional Dependency in DBMS
8 pages
Database Normalization and Anomalies
No ratings yet
Database Normalization and Anomalies
11 pages
Database Normalization Techniques Explained
No ratings yet
Database Normalization Techniques Explained
41 pages
Relational Database Design Essentials
No ratings yet
Relational Database Design Essentials
28 pages
Database Normalization Explained
No ratings yet
Database Normalization Explained
23 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
6 pages
Database Normalization Explained
No ratings yet
Database Normalization Explained
6 pages
Understanding Normalization in DBMS
No ratings yet
Understanding Normalization in DBMS
16 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
9 pages
De-normalization in Data Warehousing
No ratings yet
De-normalization in Data Warehousing
11 pages
Module 4
No ratings yet
Module 4
17 pages
Lec 5 Normalization
No ratings yet
Lec 5 Normalization
25 pages
Data Warehouse Design Techniques Explained
No ratings yet
Data Warehouse Design Techniques Explained
38 pages
Understanding Database Normalization
100% (1)
Understanding Database Normalization
21 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
79 pages
Database Normalization Techniques Guide
No ratings yet
Database Normalization Techniques Guide
15 pages
4th Unit Normalization and File Organization
No ratings yet
4th Unit Normalization and File Organization
52 pages
Database Design and Normalization Guide
No ratings yet
Database Design and Normalization Guide
23 pages
Normalization of Data in Data Analysis
No ratings yet
Normalization of Data in Data Analysis
18 pages
SQL Normalization: A Beginner's Guide
No ratings yet
SQL Normalization: A Beginner's Guide
10 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
159 pages
Functional Dependencies and Normalization Guide
No ratings yet
Functional Dependencies and Normalization Guide
36 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
26 pages
Data Redundancy in Database Design
No ratings yet
Data Redundancy in Database Design
36 pages
Normalization and Denormalization Explained
No ratings yet
Normalization and Denormalization Explained
4 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
33 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
51 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
6 pages
Topic 4 - Normalization
No ratings yet
Topic 4 - Normalization
5 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
192 pages
Database Design and Normalization Guide
No ratings yet
Database Design and Normalization Guide
10 pages
Understanding Database Normalization
No ratings yet
Understanding Database Normalization
22 pages
Understanding First Normal Form (1NF)
No ratings yet
Understanding First Normal Form (1NF)
15 pages
Database Normalization Techniques Explained
No ratings yet
Database Normalization Techniques Explained
20 pages
Denormalization Techniques in DBMS
No ratings yet
Denormalization Techniques in DBMS
10 pages
Database Normalization Explained
No ratings yet
Database Normalization Explained
14 pages
Fyds Dbms Module 2
No ratings yet
Fyds Dbms Module 2
22 pages
Denormalisation in DBMS: Guidelines & Methods
No ratings yet
Denormalisation in DBMS: Guidelines & Methods
20 pages
Database Table Normalization Guide
No ratings yet
Database Table Normalization Guide
3 pages
Database Normalization and Schema Refinement
No ratings yet
Database Normalization and Schema Refinement
31 pages
Understanding Data Normalization Types
No ratings yet
Understanding Data Normalization Types
25 pages
Project Report: Graphics in Engineering
No ratings yet
Project Report: Graphics in Engineering
30 pages
Testout It Fundamentals Pro Chapter 1 - 4 Fact Sheets
No ratings yet
Testout It Fundamentals Pro Chapter 1 - 4 Fact Sheets
172 pages
Lattice An ADC DAC-less ReRAM-based Processing-In-Memory Architecture For Accelerating Deep Convolution Neural Networks
No ratings yet
Lattice An ADC DAC-less ReRAM-based Processing-In-Memory Architecture For Accelerating Deep Convolution Neural Networks
6 pages
Bimal Kumar: Full-Stack Developer Resume
No ratings yet
Bimal Kumar: Full-Stack Developer Resume
1 page
Lexical Analyser Program Guide
No ratings yet
Lexical Analyser Program Guide
8 pages
Deadlock Detection and Prevention Methods
No ratings yet
Deadlock Detection and Prevention Methods
5 pages
Understanding Bytes in Data Representation
No ratings yet
Understanding Bytes in Data Representation
6 pages
Tech Riddles and Answers List
No ratings yet
Tech Riddles and Answers List
2 pages
Micro Programming
No ratings yet
Micro Programming
197 pages
Client-Server Architecture in Python
No ratings yet
Client-Server Architecture in Python
3 pages
Roll No.: Section: Lab # Date Lab Description Signature Student Name: Zainish Fatima
No ratings yet
Roll No.: Section: Lab # Date Lab Description Signature Student Name: Zainish Fatima
79 pages
Type Casting, Strings & Arrays in Java
No ratings yet
Type Casting, Strings & Arrays in Java
73 pages
ML-Based DDoS Detection System
No ratings yet
ML-Based DDoS Detection System
18 pages
IT Infrastructure Notes PDF
No ratings yet
IT Infrastructure Notes PDF
7 pages
Logical Operators in C Programming
No ratings yet
Logical Operators in C Programming
5 pages
Java String Manipulation Overview
No ratings yet
Java String Manipulation Overview
13 pages
CSE0611302 Structured Programming Lab Assignment
No ratings yet
CSE0611302 Structured Programming Lab Assignment
14 pages
Win-Pak Pro Configuration Guide
No ratings yet
Win-Pak Pro Configuration Guide
26 pages
FineTek PT-83 Series Controller Manual
No ratings yet
FineTek PT-83 Series Controller Manual
23 pages
Servo Position Control Training Guide
No ratings yet
Servo Position Control Training Guide
19 pages
Edge Analytics in Cloud IoT ML
No ratings yet
Edge Analytics in Cloud IoT ML
21 pages
FOG IMU Serial Communication Guide
No ratings yet
FOG IMU Serial Communication Guide
1 page
Aircat 500 Radar Display Project Report
No ratings yet
Aircat 500 Radar Display Project Report
20 pages
ABI Electronics: PCB Testing Solutions
No ratings yet
ABI Electronics: PCB Testing Solutions
20 pages
Software Requirements Analysis Guide
No ratings yet
Software Requirements Analysis Guide
47 pages
Ebook & Testbank Digital Systems Principles and Applications 12th Edition Ronald J Tocci
No ratings yet
Ebook & Testbank Digital Systems Principles and Applications 12th Edition Ronald J Tocci
218 pages
Understanding Wireless LAN (WLAN) Basics
No ratings yet
Understanding Wireless LAN (WLAN) Basics
83 pages
Solution Manual For Operating Systems Internals and Design Principles 9th Edition by William Stallings
No ratings yet
Solution Manual For Operating Systems Internals and Design Principles 9th Edition by William Stallings
61 pages
Q1: Instruction Fetch Cycle Quiz
No ratings yet
Q1: Instruction Fetch Cycle Quiz
3 pages
Hotel Reservation System Project Overview
No ratings yet
Hotel Reservation System Project Overview
7 pages