Relational Database Design and Normalization
Relational Database Design and Normalization
Systems
Lecture by:
Prathima M. G ,B.E., M.E ,
Assisstant professor
Dept. of Computer Science & Engineering
Bangalore Institute of Technology
Bangalore – 560 004
prathimamg@[Link]
Relational Database Design
Schema Refinement
RELATIONAL DATABASE
DESIGN
NORMALIZATION
BOTTOM - UP
APPROACH
(THROUGH
SYNTHESIS)
Phases of database
design:
Step-01: Requirement Gathering
Step-02: Conceptual database design
( ER_modeling)
Step-03: Logical database design
(ER- Relational Mapping)
Step-04: Schema refinement
(Normalization)
Step-05: Implementation
(using SQL)
Overview
Informal guidelines for database design
The concept of Functional Dependencies (FDs)
Trivial and Nontrivial Dependencies
Closure of set of Functional Dependencies
Minimal Set of FD
Finding the Candidate key
Normal Forms (1NF, 2NF,3 NF, and BCNF)
Examples on Normalization
Relational Database Design
Schema refinement :
The process of evaluating relational schemas
for design quality or
Measuring the appropriateness/ Goodness of
relational schema other than the intuition of
designer
Approaches to database
design
Analysis:
Top- Down approach
Identify ENTITIES and associate ATTRIBUTES.
Synthesis:
Bottom-Up approach
Consider Individual attributes of a TABLE and
associate appropriately with a TABLE.
Normalization
The formal process that can be followed
to achieve a good database design
Also used to check that an existing
design is of good quality
The different stages of normalization are
known as “normal forms”
To understand this, we need to
understand the concept of functional
dependency
Informal Guidelines for Good
Database Design
Four Informal Measures
SEMANTICS of the relation attributes must be
maintained.
Disallowing
the possibility of GENERATING
SPURIOUS TUPLES.
Drawbacks of an unnormalized
relation
Consider a WASE DB.
WASE needs to keep track of details regarding
its STUDENT (like USN, Name, DOB, Gen, Addr,..)
the COURSES/SUBJECTS offered like (Cno, Cname,
Sem)
and also keep track of details regarding Student
being enrolled for many Courses and a Course
having many students enrolled for it along with the
Marks_Range obtained by each student in each
course he/she is enrolled
and then award Grade based on Marks_range.
ER Diagram for WASE DB
USN
M_Range
WASE Student_Course Details
Grade
Cno Cname Sem
ER_Relational Mapping
WASE Student_Course Details Primary Key: USN,CNO
???? CS14 CN 3
Insertion Anomalies
Identifying Attribute: USN,CNO
If we fail to update in some tuples, then the same USN will show
different values for Sname, making the database inconsistent.
Modification anamoly
WASE Student_Course Details Primary Key: USN,CNO
USN Sname DOB Addr CNO Cname Sem M-range Grade
Another problem with null value column is HOW to account for them
when AGGREGATE operations like COUNT, SUM,AVERAGE are
applied.
Null Values
Interpretation of Null values
The attribute does not apply to this tuple
The attribute value for his tuple is unknown
The value is known but absent or not recorded.
Guideline-03:
As far as possible , avoid placing attributes in a base relation
whose values may frequently be null. If nulls are unavoidable ,
make sure that they apply in exceptional cases only and do not
apply to majority of tuples in the relation.
Generation of Spurious
Tuples
Consider 2 relations
Project Department
PNO Pname loc DNo DeptNO Dname loc
P1 PMS Blore D1 D1 Sales Blore
P2 BMS Blore D2 D2 R&D Mysore
P3 UMS Mglore D3 D3 Mktng Blore
P4 LMS Mysore D5 D4 Dev Mglore
P5 HMS Mglore D4 D5 Testing Mglore
P6 ERP Hubli D1
Report(USN,CNO,Sname,DOB,Addr,Cname,
Sem,Marks_Range, Grade)
CNOCname, Sem
USN Sname,DOB,Addr
Report(USN,CNO,Sname,DOB,Addr,Cname,
Sem,Marks_Range, Grade)
USN,CNO Marks_Range
Marks_Range Grade
AB and BC => AC
The attribute Room# is said to be transitively
dependent on the key C# since it is dependent on
LName which in turn is dependent on C#.
Inference Rules(IR) or Armstrong's Axioms
IR-1: Reflexivity: If X ﬤY AND XX, XY
IR-2: Augmentation: If X Y, then XZ YZ.
IR-3: Transitivity: If X Y and Y Z,
then X Z.
IR-4: Decomposition: If X YZ, then X Y
and X Z.
IR-5: Union: If X Y and X Z,
then X YZ.
IR-6: Pseudo transitivity: XY, WYZ then
WXZ
Note: IR1 through IR3 are called as Armstrong
Inference rules. They are Sound and Complete.
IR-1: Reflexivity: If X ﬤY AND XX,
XY
1.x->yz(Given)
[Link]->y{IR1 & knowing that yz⊇y}
3.x->y{IR3 ON 1 &2}
Eg: ssn->ename,add
ssn->ename,ssn->add
IR-5: Union: If X Y and X Z,
then X YZ.
1.x->y(given)
[Link]->z(given)
[Link]->wy(IR2 on 1 with w)
[Link]->z(IR3 on 3 &2)
Ex:ssn->dno ename,dno->dname
then ename,ssn->dname
Normal Forms Based on
Primary Keys
The normalization process, as first proposed by Codd
(1972a), takes a relation schema through a series of
tests to certify whether it satisfies a certain normal
form.
Normalization of data can be considered a process
of analyzing the given relation schemas based on
their FDs and primary keys to achieve the desirable
properties of
1) minimizing redundancy and
2) minimizing the insertion, deletion, and update
anomalies.
Definition. The normal form of a relation refers to
the highest normal form condition that it meets, and
hence indicates the degree to which it has been
normalized.
Normal forms, when considered in isolation from
other factors, do not guarantee a good database
design.
It is generally not sufficient to check separately that
each relation schema in the database is, say, in BCNF
or 3NF.
Rather, the process of normalization
(iv) UnitPriceClass
Second Normal Form :
(Cont..)
Key and Non Key Attributes of Retail Application Table
CustomerId CustomerName,AccountNo
ItemId ItemName,UnitPrice,Class
Second Normal Form :
(Cont..)
After removing the Partial dependencies on Key Attributes we get
the below tables which aree in 2NF:
Customer
CustomerId CustomerName Accountno
1001 John 1500012351
1002 Tom 1200354611
1003 Maria 2134724532
Item
ItemId ItemName UnitPrice Class
STN001 Pen 10 A
BAK003 Bread 10 A
GRO001 Potato 20 B
Second Normal Form :
(Cont..)
ItemPurchase
CustomerId ItemId QtyPurchased NetAmt
1001 STN001 5 50
1002 BAK003 1 10
1003 GRO001 1 20
Third Normal Form: 3 NF
A relation R is said to be in the Third Normal Form (3NF) if and only if
It is in 2NF and
No transitive dependency exists between non-key attributes and key
attributes through another non key attribute.
A B C
It should
be key It should be
Attribute It should be non key
non key attribute
attribute
ItemI UnitPrice
d
UnitPrice Class
Item
ItemId ItemName UnitPrice
STN001 Pen 10
BAK003 Bread 10
GRO001 Potato 20
ItemClass
UnitPrice Class
10 A
20 B
ename ssn bdate add dno dname dmgrss
n
Trivial Functional Dependency
A+ ={A}
AB+ ={A,B,D,C} B C
AC+ ={A,C,B,D}
Logically R1{A,C,D} is a better choice over R1{A,B,D}
as the join operation is will not generate SPURIOUS
tuples.
Example - 2
Consider R (City, Street, Zipcode) or R (C, S, Z) and F = {CS Z,
Z C}.
The candidate keys for R are CS and ZS (using dependency
graph).
The relation R is in 3NF (since each attribute is prime) but not
in BCNF, because in Z C, Z is not a superkey and also it is
not a trivial FD. In R, we cannot store the city to which a
zipcode belongs unless we know a street address with the
zipcode. This introduces insertion anomaly.
To convert this into BCNF, decompose R into:
R1 = {Z, C} and R2 = {S, Z}
If we have R2={C,S} as the other table, then
We can’t have a Foreign key reference to link both the tables.
Determinants of R2 (i.e.,C,S)does not determine any other attribute
BCNF
Also, from the FD: F = {FD1:CS Z, FD2:Z C}
we know that C is Dependent from FD2.
Vni={S}
Voi={ }
Candidate Key is S
Z
S ={S}
+
SZ+ ={S,Z,C}
C
SC+ ={C,S,Z}
Logically R1{S,Z} is a better choice over R1{C,S} as
the join operation is will not generate SPURIOUS
tuples.
Example - 3
Consider the relation GradeList (S, N, C, G}
FD-1: {Name, Course} GPA NC G
FD-2: {StudentNo, Course} GPA SC G
FD-3: Name StudentNo NS
FD-4: StudentNo Name SN
Candidate keys are:
{Name, Course} N G
{StudentNo, Course}
C S
The relation is in 3NF.
But redundancy of data.
The association between Name and the corresponding
StudentNo is repeated.
- insertion anomaly.
There exists deletion anomaly too.
(if a student fails in all subjects, looses the student
information!).
The relation Gradelist is not in BCNF, because
of FD-3 and FD-4 which are nontrivial and their
determinants (left-hand side) are not super
keys of GradeList.
Closure of a set of FD (F
+
)
Given some FDs, new FDs can often be
inferred.
The set of all FDs that are implied by a given
set F of FDs is called the closure of F and is
denoted by F +.
Example:
(1) A BC {Given}
(2) A C {Decomposition of (1)}
(3) AD CD {Augmentation of (2) by adding D}
(4) D EF {Given}
(5) CD CEF {Augmentation}
(6) CD->C,CD->E,CD->F {DECOMPOSITION}
FROM 3 & 5 APLLY TRANSITIVITY
AD->CD,CD->F
(7) AD F
Attribute closure, X+
To compute F + , start with FDs in F; repeatedly apply IR-1 to
IR-3 until no new FD can be derived
Armstrong's Axioms do not produce any incorrect FDs that are
added to F +. However, finding F + is too expensive; the
complexity grows exponentially
The solution is to find the attribute closure of X, denoted as X
+
.
Definition: For each such set of attributes X,
we determine the set X+ of attributes
that are functionally determined by X based on F;
X+ is called the closure of X under F.
Algorithm to find X+
Algorithm Attribute_Closure()
{
X + = X;
Repeat {
for each FD XY in F do X + = X + Y
for each FD Y Z in F do
if Y X + then X+ =X + Z
// i. e. if Y is in X +, the add Z to X +
until no change;
// until no more attributes are added to X +
}
}
Example: Let us consider
SSN ename pnumber pname plocation hours
F={ssn->ename,
pnumber->{pname,plocation},
{ssn,pnumber}->hours}
Closure sets w.r.t F
{ssn}+={ssn,ename}
{pnumber}+={pnumber,pname,plocation}
{ssn,pnumber}
+={ssn,ename,pnumber,pname,plocation,hours}
Example
Consider R (A, B, C) and a set of FDs
F = {AB C, C B}
Using the Algorithm, we calculate the following
closure sets with respect to F:
A+ = {A},
B+ = {B},
C+ = {C, B} because of FD-2
{AB}+ = {ABC} because of FD-1 add attribute C
{AC}+ = {ACB} because of AC AB (IR-2) add attribute
B
{BC}+ = {BC} nothing can be added
{ABC}+ = {ABC} nothing can be added
2.3 Equivalence of Sets
of FDs
Two sets of FDs F and G are equivalent if:
Every FD in F can be inferred from G, and
Every FD in G can be inferred from F
Hence, F and G are equivalent if F+ =G+
Definition (Covers):
F covers G if every FD in G can be inferred from F
(i.e., if G+ subset-of F+)
F and G are equivalent if F covers G and G covers
F
There is an algorithm for checking equivalence of
sets of FDs
Method:
F COVERS G G COVERS F
A->B C->B
A+=ACB C+=ABC
B->C B->A
B+=ABC B+=ABC
C->A A->C
C+=ABC A+=ABC
F=G,THEN THEY ARE EQUIVALENT
Example2:R(A,B,C,D,E)
F={A->B,AB->C,D->AC,D->E}
G={A->BC,D->AE}
G COVERS F F COVERS G
A->BC A->B
A+=ABC A+=ABC
D->AE AB+=ABC
D+=DACEB D+=DAEBC
F=G
THEN THEY ARE EQUIVALENT
Example3:
X={A->B,B->C}
Y={A->B,B->C,A->C}
Check if X covers y y x
Y covers x x y
then X=Y
First take Y
Y=A->B Y=A->C A+=ABC
A+=ABC
Y=B->C
B+=BC
Example4:
X={AB->CD,B->C,C->D}
Y={AB->C,AB->D,C->D}
X COVERS Y Y COVERS X
AB->C AB+=ABCD
AB+=ABCD C+=CD
B+=BCD
C+=CD
Example5:
F={A->B,A->C}
G={A->B,B->C}
F COVERS G G COVERS F
A->B A->B
A+=ABC A+=ABC
B->C
B+=B
Not equivalent
Minimal Cover (F I )
A set of FDs F is minimal and can be
represented as a set of FDs G if it satisfies the
following conditions:
a) Every FD in G has a single attribute on its right-
hand side, i.e. X A, where A is a single attribute.
b) No FD can be removed from G and still have a set of
FDs that is equivalent to F.
c) We can not replace any FD: X A in F with a
dependency Y A, where Y A and still have a
set of dependencies that is equivalent to F.
Algorithm to find the Minimal cover
Algorithm MinimalCover(F)
{
Step-1: G = F.
Step-2: Transform G into a set of FD's with right hand side
containing only one attribute (Canonical cover).
Step-3: Eliminate a redundant attribute from left-side.
For each dependency A1, A2, ..., Ak B in current set of G,
and each attribute Ai in its left-side,
if G - {A1, A2, ..., Ak B } {A1, A2, ..., Ai -1, Ai+1,..., Ak B}
is equivalent to G.
then delete Ai from the left side of A1, A2, ..., Ak B.
Step-4: Eliminate a redundant dependency
For each dependency X Y in the current set of dependencies G
if G - {X Y} is equivalent to G then delete X Y from G.
}
Example 1: Let the given set of FDs be
E: {B → A, D → A, AB → D}. We have to
find the minimal cover of E.
All above dependencies are in canonical form (that is, they
have only one attribute on the right-hand side), so we have
completed step 1 of Algorithm
and can proceed to step 2.
In step 2 we need to determine if AB → D has any redundant
(extraneous) attribute on the left-hand side; that is, can it be
replaced by
B → D or A → D?
Since B → A, by augmenting with B on both sides (IR2), we
have BB → AB, or B → AB (i).
However, AB → D as given (ii).
Hence by the transitive rule (IR3), we get from (i) and (ii), B
→ D. Thus
AB → D may be replaced by B → D.
A D B
E C G
H
Vni = {B,C,G,H}
Voi = {A, E}
Candidate keys = AE
Example - 2
Consider R (A, B, C, D, E, H), and
F = {A B, AB E, BH C, C D, D A}
A B C D
E H
Vni = {H}
Voi = {E}
Candidate keys = AH, BH, CH, and DH
BS={ABCD} So to find ck, w.r.t BS
AH+={AH} BH+={BH}
={AHB} ={BHC}
={AHBE} ={BHCD}
={AHBEC} ={BHCDA}
={AHBECD} ={BHCDAE}
CK1={AH} CK2={BH}
CH+={CH} DH+={DH}
={CHD}{CHDA}{CHDAB} ={DHA}
={CHADBE} ={DHAB}{DHABE}{DHABEC}
CK3={CH} CK4={DH}
Consider R (A, B, C, D, E, H),
and
1)F={A->C,C->D,D->B,E->F}
2)F={CH->G,A->BC,B->CFH,E->A,F->EG}
Find the candidate Keys
R={booktitle,authorname,booktype,list
price,affilitation,publication}
F={book-title->booktype,publication,
author name->affiliation
book type->list price}
Find the key and normalize.
1)Find vni={Btitle,Authorname}
2)Voi={publication,listprice,affilitation}
3){Btitle}+={Btitle,Btype,pub,listprice}#R
4){Authorname}+={Aname,Affilitation}#R
5){Btitle,Authorname}
+={btitle,btype,pub,listprice,authname,Aff}=R
{Btitle,Aname} is the key
Problems:
Must find a minimal cover G for F
No efficient algorithm for finding a minimal cover
Several minimal covers can exist for F; the result of the
algorithm can be different depending on which is chosen
Example1: F+=(F1UF2)+
{A->B,A->C,C->D} ={A->B,A->C,C->D}
Therefore it is dependency preserved.
1.3 The Lossless(Non-Additive)
Join Property
2. Set S(i, j): = bij for all matrix entries. (*Each bij is a
distinct symbol associated with indices (i, j)*)
Begin
Choose one Q in D that is not in BCNF;
Find a FD X Y in Q that violates BCNF;
Replace Q in D by two relation schemas(Q-Y) and (XY) end;
(cont.)
This is based on two properties of lossless join
decomposition:
The decomposition D = {R1, R2} of R has the lossless join
property w.r.t F if and only if either:
The FD (R1R2) (R1 R2) is in F+, or
The FD (R1R2) (R2 R1) is in F+
If D = {R1, R2, …, Rm} of R has the lossless join property w.r.t
a set F, and D1 = {Q1,Q2, …, Qk} of Ri has the lossless join
property w.r.t. F(Ri) then D = {R1, R2, …, Ri-1, Q1, Q2, …, Qk.
Ri+1,… Rm} has the lossless join property w.r.t F
(cont.)
x y/z
(cont.)
An MVD X Y is called a trivial MVD if either:
(a) Y X or,
(b) (X Y) = R
A trivial MVD always holds according to the formal
MVD definition
Figure 15.4
(cont.)
Given a set F of FDs and MVDs, we can infer
additional FDs and MVDs that hold whenever the
dependencies in F hold
Sound and complete set of inference rules for
FDs and MVDs:
(Reflexive for FDs) If Y X, then X Y
(Augmentation for FDs) If X Y, then XZ YZ
(Notation: XZ stands for X Z)
(Transitive for FDs) If X Y and Y Z, then X Z
(cont.)
(Complementation for MVDs) If X Y, then X
Z (where Z = R (X Y))
(Augmentation for MVDs) If X Y and Z W,
then WX YZ
(Transitive rule for MVDs) If X Y and Y Z,
then X (Z Y)
(Relication rule FD to MVD) If X Y, then X Y
(Coalescence rule for MVDs) If X Y and there
exists W such that (a) Z Y, (b) W Z, and (c)
W Y is empty, then X Z
(cont.)
Notes:
By rule I7, every FD is also an MVD
I1 to I8 can derive the closure F+ of a set of
dependencies F
Fourth Normal Form(4NF):
3NF and BCNF do not deal with multivalued
dependencies
A relation schema with some non-trivial MVDs
may not be a good design (see Figures 15.4 and
15.5)
4NF takes care of these problems, and implies
BCNF (every relation in 4NF is also in BCNF)
(cont.)
Formal definition of 4NF:
A relation schema R is in 4NF with respect to a set
of (functional and multivalued) dependencies F if
for every nontrivial multivalued dependency X
Y in F+, X is a superkey of R
Check Fig15.4(a) is not in 4NF; ENAME
PNAME/DNAME But ENAME is not the superkey
Notes:
Since every FD is an MVD, 4NF implies BCNF
If all dependencies in F are FDs, the definition for
4NF becomes the definition for BCNF
(cont.)
There is an algorithm for decomposing R into 4NF
relations such that the decomposition has the
lossless join property with respect to a set F of
FDs and MVDs on R
Algorithm 4.5:
Set D {R}
While there is a relation schema Q in D that is not in
4NF do
begin
Choose one Q in D that is not in 4NF;
Find a nontrivial MVD X Y in Q that violates 4NF;
Replace Q in D by two relation schemas (Q Y) and (X
Y)
end;
(cont.)
Algorithm 15.5 is based on property LJ1:
A decomposition D = {R1, R2} of R has the
lossless join property with respect to F if and
only if either:
(a) (R1 R2) (R1 R2) holds in F+, or
(b) (R1 R2) (R2 R1) holds in F+
Figure 15.5
2.2 Join Dependencies and
5NF
A join dependency JD(R1, R2, …, Rn) is a
constraint on R
Specifies that every legal instance r(R)
should have a lossless join decomposition
into R1, R2, …, Rn
An MVD is a special case of a JD where n= 2
A JD(R1, R2, …, Rn) is a trivial JD if some Ri =
R
Fifth normal form (5NF):
A relation schema R is in 5NF with respect to a set
F of FDs, MVDs, and JDs if for every nontrivial
JD(R1, R2, …, Rn), each Ri is a superkey of R
(cont.)
5NF is also called PJNF (project-join normal form)
Note: Unfortunately, the theory of MVD and
JD lossless join decomposition does not take
NULL values into account
2.3 Inclusion Dependencies
Used to specify referential integrity and
class/subclass constraints between two
relations R and S
An inclusion dependency R.X < S.Y
specifies that at any point in time, if r(R) and
s(S) are relation instances of R and S, then:
(r(R)) (s(S))
X Y
The attributes X of R and Y of S must be
compatible
(cont.)
Sound and complete inference rules for
inclusion dependencies:
(ID1) R.X < R.X
(ID2) If R.X < S.Y, where X = {A1, A2, …, An}, Y =
{B1, B2, …, Bn}, and Ai corresponds to Bi, then Ai
< Bi for 1 i n
(ID3) If R.X < S.Y and S.Y < T.Z, then R.X < T.Z
So far, there are no normal forms based on
inclusion dependencies
Example: [Link] < [Link]
WORKS_ON.SSN < [Link]
WORKS_ON.PNUMBER < [Link]
Merits of Normalization
Normalization is based on a mathematical
foundation.
extent.
Tables in 2 NF
Eliminate partial dependency Tables in 3 NF
Summary
While converting ERD into relational schema, each strong entity becomes a table.
There are three normal forms that were defined being commonly used.
1NF makes sure that all the attributes are atomic in nature.