Normalization
Intro
Good database design must be matched with good table structures E-R diagrams help use with overall design, but a good conceptual design does not necessarily lead to a good table structures Tables are the basic building blocks of the database We wish to avoid data anomalies/redundancies by controlling the table structure logically The process of identifying and eliminating data anomalies and redundancies is called normalization
Redundancy & Anomalies
Data redundancy = data stored in several places
Too much data redundancy causes problems--which value is correct? Data integrity and consistency suffer
Data anomaly = abnormal data relationships
Insertion anomaly - Cant add data because dont know entire primary key value, e.g., primary key based on first, middle, and last name Deletion anomaly - Deletions result in too many fields being removed unintentionally, e.g., delete an employee but lose transaction data Update anomaly - Change requires many updates, e.g., if you store customer names in transaction tables
Normalization
Step-by-step process for eliminating data redundancies and anomalies Enables us to recognize bad table structures Enables us to create good table structures Stages are called normal forms each better than previous (less anomalies/redundancy)
First normal form (1NF) Second normal form (2NF) Third normal form (3NF)
Normalization
There are fourth and fifth normal forms that are seldom used Highest level not always the most desirable In general, the higher the level the slower the database response because of underlying pointer movement Most professionally designed databases reach third normal form
Need for Normalization
To recognize good design, first look at bad one Example, construction company manages several projects and whose charges are dependent on employees position
Desired Report
Proj No Proj Name 1 Hurricane Emp No 101 102 104 Emp Name John News David Senior Anne Ramoras Job Class Elec Eng Comm Tech Comm Tech Chg/Hr Hrs Billed 65 13 60 16 60 19 Sub Tot 65 55 15 17 Sub Tot 18 14 Sub Tot Total Tot Chg 845 960 1,140 2,945 975 935 1,910 1,080 840 1,920 6,775
2 Coast
101 John News 103 June Arbough
Elec Eng Biol Eng
3 Satellite
104 Anne Ramoras 102 David Senior
Comm Tech Comm Tech
60 60
Table
P_No P_Name 1 Hurricane E_No 101 102 104 101 103 104 102 E_Name John News David Senior Anne Ramoras John News June Arbough Anne Ramoras David Senior Job_Class Chg_Hr Elec Eng 65 Comm Tech 60 Comm Tech 60 Elec Eng 65 Biol Eng 55 Comm Tech 60 Comm Tech 60 Hrs 13 16 19 15 17 18 14
2 Coast 3 Satellite
Another View of Table
Group 1
P_No 1 2 3 P_Name Hurricane Coast Satellite E_No1 101 101 104 E_Name1 John News John News Anne Ramoras Job_Class1 Elec Eng Elec Eng Comm Tech
Group 2
Chg_Hr1 Hrs1 E_No2 E_Name2 65 13 102 David Senior 65 15 103 June Arbough 60 18 102 David Senior
Etc.
Job_Class2 Chg_Hr2 Hrs2 Comm Tech 60 16 Biol Eng 55 17 Comm Tech 60 14
Problems
P_No intended to be primary key but contains null values Data redundancies
Invites data inconsistencies (Elect Eng & EE)
Anomalies
Update anomaly modify Job_Class for E_No 101 requires many alterations Insert anomaly to add a project row we need an employee Deletion anomaly delete E_No 101, we delete other vital data too
Problems
Date redundancy
If add new employee to project 2 must type:
2 C oas t 104 A nne Ram oras Com m Tec h 60 19
Wastes data entry time Wastes storage space Leads to data inconsistency
Huricane or Hurricane or Hurracaine
Conversion to 1NF
Table above has repeating groups Each P_No has a group of entries
P_No P_Name 1 Hurricane E_No 101 102 104 E_Name John News David Senior Anne Ramoras Job_Class Chg_Hr Elec Eng 65 Comm Tech 60 Comm Tech 60 Hrs 13 16 19
1NF
Eliminate repeating groups By adding entries in primary key column (at least)
P_No 1 1 1 2 2 3 3 P_Name Hurricane Hurricane Hurricane Coast Coast Satellite Satellite E_No 101 102 104 101 103 104 102 E_Name John News David Senior Anne Ramoras John News June Arbough Anne Ramoras David Senior Job_Class Chg_Hr Elec Eng 65 Comm Tech 60 Comm Tech 60 Elec Eng 65 Biol Eng 55 Comm Tech 60 Comm Tech 60 Hrs 13 16 19 15 17 18 14
Problems
Primary key P_No does not uniquely identify all attributes in row Must create composite key made up of P_No & E_No
Dependency Diagram
Helps us to discover relationships between entity attributes Upper arrows implies dependency on P_No & E_No Lower arrows implies dependency on only one attribute
P_No
P_Name
E_No E_Name
Job_Class
Chg_Hr
Hrs
Dependencies
Upper arrows If you know P_No & E_No you can determine the other row values Lower arrows Partial dependencies based on only part of key P_Name only dependent on P_No E_Name, Job_Class, Chg_Hr only dependent on E_No Dependency diagram may be written: P_No, E_No P_Name, E_Name, Job_Class, Chg_Hr, Hrs P_No P_Name E_No E_Name, Job_Class, Chg_Hr
New Table
Composite primary key P_No & E_No
Charges Table
P_No 1 1 1 2 2 3 3 E_No 101 102 104 101 103 104 102 P_Name Hurricane Hurricane Hurricane Coast Coast Satellite Satellite E_Name John News David Senior Anne Ramoras John News June Arbough Anne Ramoras David Senior Job_Class Chg_Hr Elec Eng 65 Comm Tech 60 Comm Tech 60 Elec Eng 65 Biol Eng 55 Comm Tech 60 Comm Tech 60 Hrs 13 16 19 15 17 18 14
1NF Definition
1. All the key attributes are defined Any attribute that is part of the primary key 1. There are no repeating groups in the table Each cell can contain one and only one value, rather than set 1. All attributes are dependent on the primary key
Problems
Contains partial dependencies
Dependencies base on only part of the primary key
This makes table subject to data redundancies and hence to data anomalies Redundancy caused by fact that every row entry requires duplicate data
E.g., suppose E_No 105 is entered 20 times, must also enter E_Name, Job_Class, Chg_Hr
Anomalies caused by redundancy
E.g., employee name may be spelled Dave Senior or D. Senior or David Senior
Conversion to 2NF
1. Starting with 1NF write each of the key components on separate lines, then write the original key on the last line P_No E_No P_No E_No Each will become key in a new table Original table split into three tables
Conversion to 2NF
2. Write the dependent attributes after each of the new keys using the dependency diagram P_No P_Name E_No E_Name, Job_Class, Chg_Hr P_No E_No Hrs
Three New Tables
Project Table P_No P_Name 1 Hurricane 2 Coast 3 Satellite Employee Table
E_No 101 102 103 104 E_Name John News David Senior June Arbough Anne Ramoras Job_Class Elec Eng Comm Tech Biol Eng Comm Tech Chg_Hr 65 60 55 60
Assign Table
P_No 1 1 1 2 2 3 3 E_No 101 102 104 101 103 104 102 Hrs 13 16 19 15 17 18 14
2NF Definition
1. Table is in 1NF and 2. It includes no partial dependencies (no attribute is dependent on only a portion of the primary key) Note: Since partial dependencies can exist only if there is a composite key, a table with a single attribute as primary key is automatically in 2NF if it is in 1NF
Problem - Transitive Dependency
Note that Chg_Hr is dependent on Job_Class, but neither Chg_Hr nor Job_Class is part of the primary key This is called transitive dependency
A condition in which an attribute is functionally dependent on non-key attributes (another attribute that is not part of the primary key)
Transitive dependency yields data anomalies
Conversion to 3NF
Break off the pieces that are identified by the transitive dependency arrows (lower arrows) in the dependency diagram Store them in a separate table
P_No P_Name E_No E_Name, Job_Class P_No E_No Hrs Job_Class Chg_Hr
Note: Job_Class must be retained in Employee table to establish a link to the newly created Job table
New Tables
Project Table P_No P_Name 1 Hurricane 2 Coast 3 Satellite Assign Table
P_No 1 1 1 2 2 3 3 E_No 101 102 104 101 103 104 102 Hrs 13 16 19 15 17 18 14
Employee Table
E_No 101 102 103 104 E_Name John News David Senior June Arbough Anne Ramoras Job_Class Elec Eng Comm Tech Biol Eng Comm Tech
Job Table
Job_Class Biol Eng Comm Tech Elec Eng
Chg_Hr 55 60 65
3NF Definition
1. Table is in 2NF and 2. It contains no transitive dependencies
Problem
Although the four tables are in 3NF, we have a potential problem The Job_Class is entered for each new employee in the Employee table For example, too easy to enter Electrical Engr, or EE, or El Eng
Problem
Employee Table
E_No 101 102 103 104 104 105 106
E_Name John News David Senior June Arbough Anne Ramoras John Smith Alice White Bob Jones
Job_Class Elec Eng Comm Tech Biol Eng Comm Tech Comm Tech Biol Eng Elec Eng
New Attribute
Create a Job_Code attribute to serve as primary key in the Job table and as a foreign key in the Employee table
Changed Tables
Project Table P_No P_Name 1 Hurricane 2 Coast 3 Satellite Assign Table
P_No 1 1 1 2 2 3 3 E_No 101 102 104 101 103 104 102 Hrs 13 16 19 15 17 18 14
Employee Table
E_No 101 102 103 104 E_Name John News David Senior June Arbough Anne Ramoras Job_Code 502 501 500 501
Job Table
Job_Code 500 501 502
Job_Class Biol Eng Comm Tech Elec Eng
Chg_Hr 55 60 65
3NF Version
Vast improvement over original design No data anomalies
In the Job table each job code has single job class and charge per hour entry
No opportunities to use different values describing same object
Similarly for Employee & Project tables only one entry for each attribute Also, Assign table has only what is needed
Data redundancy has been minimized
Keys are redundant but these are small Assign table is very active but requires only the P_No, E_No, and hours
Other Normal Forms
3NF is the most appropriate form for most applications There are other normal forms 4NF Isolate independent multiple relationships 5NF Isolate semantically related multiple relationships These are advanced and go beyond the scope of the course
Summary
1NF Eliminate repeating groups 2NF Eliminate partial dependencies 3NF Eliminate transitive dependencies Tables are the critical building blocks for the database It is important to design their structure well Normalization is a formal process for doing this that reduces data redundancy and anomalies
End
References: New Perspectives on Microsoft Access 2000, Introductory, by Adamski, Hommel, and Finnegan, Course Technology, 1999. Access Database Design and Programming, Third Edition, by Roman, OReilly, 2002. Database Systems: Design, Implementation, and Management, by Rob & Coronel, Boyd & Fraser, 1995.