NORMALIZATION
Informal Design Guidelines for Relation Schemas
Informal guidelines may be used as measures to determine the quality of relation schema design.
GUIDELINE 1: Informally, each tuple in a relation should represent one entity or relationship instance.
(Applies to individual relations and their attributes).
Attributes of different entities (EMPLOYEEs, DEPARTMENTs, PROJECTs) should not be
mixed in the same relation
Only foreign keys should be used to refer to other entities
Entity and relationship attributes should be kept apart as much as possible.
Bottom Line: Design a schema that can be explained easily relation by relation. The semantics of
attributes should be easy to interpret.
GUIDELINE 2: Design a schema that does not suffer from the insertion, deletion and update anomalies.
If there are any present, then note them so that applications can be made to take them into account
Mixing attributes of multiple entities may cause problems
Information is stored redundantly wasting storage
Problems with update anomalies
– Insertion anomalies
– Deletion anomalies
– Modification anomalies
GUIDELINE 3: Relations should be designed such that their tuples will have as few NULL values as
possible
Attributes that are NULL frequently could be placed in separate relations (with the primary key)
Reasons for nulls:
– attribute not applicable or invalid
– attribute value unknown (may exist)
– value known to exist, but unavailable
GUIDELINE 4: The relations should be designed to satisfy the lossless join condition. No spurious
tuples should be generated by doing a natural-join of any relations.
Bad designs for a relational database may result in erroneous results for certain JOIN operations. The
"lossless join" property is used to guarantee meaningful results for join operations
Update Anomalies with examples
Mixing attributes of multiple entities may cause several problems. As Information is stored redundantly
wasting storage, it can also lead to several update anomalies.
Consider the relation:
EMP_PROJ ( Emp#, Proj#, Ename, Pname, No_hours)
Update Anomaly: Changing the name of project number P1 from “Billing” to “Customer-
Accounting” may cause this update to be made for all 100 employees working on project P1.
Insert Anomaly: Cannot insert a project unless an employee is assigned to. Inversely - Cannot
insert an employee unless an he/she is assigned to a project.
Delete Anomaly: When a project is deleted, it will result in deleting all the employees who work
on that project. Alternately, if an employee is the sole employee on a project, deleting that
employee would result in deleting the corresponding project.
Functional Dependency (FDs)
FDs are constraints that are derived from the meaning and interrelationships of the data attributes.
A Functional Dependency indicated by XY where X is a set of attributes and Y is a set of Attributes
holds if whenever two tuples have the same value for X, they must have the same value for Y. i.e. For
any two tuples t1 and t2 in any relation instance r(R): If t1[X]=t2[X], then t1[Y]=t2[Y]
Examples: social security number determines employee name
SSN -> ENAME
project number determines project name and location
PNUMBER -> {PNAME, PLOCATION}
employee ssn and project number determines the hours per week that the employee works on the
project
{SSN, PNUMBER} -> HOURS
Inference Rules for FDs
Given a set of FDs F, we can infer additional FDs that hold whenever the FDs in F hold.
Armstrongs Inference Rules:
Armstrong's inference rules:
IR1. (Reflexive) If Y subset-of X, then X -> Y
IR2. (Augmentation) If X -> Y, then XZ -> YZ
(Notation: XZ stands for X U Z)
IR3. (Transitive) If X -> Y and Y -> Z, then X -> Z
additional inference rules:
(Decomposition) If X -> YZ, then X -> Y and X -> Z
(Union) If X -> Y and X -> Z, then X -> YZ
(Psuedotransitivity) If X -> Y and WY -> Z, then WX -> Z
Closure (2 Definitions)
Closure of a set F of FDs is the set F+ of all FDs that can be inferred from F
Closure of a set of attributes X with respect to F is the set X + of all attributes that are
functionally determined by X
Equivalence of Sets of FDs
Two sets of FDs F and G are equivalent if:
- every FD in F can be inferred from G, and
- every FD in G can be inferred from F
Hence, F and G are equivalent if F + =G +
Definition: F covers G if every FD in G can be inferred from F (i.e., if G + subset-of F +)
F and G are equivalent if F covers G and G covers F
Minimal Sets of FDs
A set of FDs is minimal if it satisfies the following conditions:
(1) Every dependency in F has a single attribute for its RHS.
(2) We cannot remove any dependency from F and have a set of dependencies that is equivalent to F.
(3) We cannot replace any dependency X -> A in F with a dependency Y -> A, where Y proper-
subset-of X ( Y subset-of X) and still have a set of dependencies that is equivalent to F.
NORMAL FORMS AND NORMALIZATION
Normalization: The process of decomposing unsatisfactory "bad" relations by breaking up their attributes
into smaller relations
Normal form: Condition using keys and FDs of a relation to certify whether a relation schema is
in a particular normal form
Denormalization: the process of storing the join of higher normal form relations as a base relation—
which is in a lower normal form
Properties to be satisfied during normalization process (Goals of Normalization)
These include two properties:
■ The nonadditive join or lossless join property, which guarantees that the spurious tuple generation
problem does not occur with respect to the relation schemas created after decomposition
■ The dependency preservation property, which ensures that each functional dependency is represented
in some individual relation resulting after decomposition.
The nonadditive join property is extremely critical and must be achieved at any cost, whereas the
dependency preservation property, although desirable, is sometimes sacrificed
Concepts of KEYS
A superkey of a relation schema R = {A1, A2, ...., An} is a set of attributes S subset-of R with the
property that no two tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S]
A key K is a superkey with the additional property that removal of any attribute from K will
cause K not to be a superkey any more. In other words a key is a minimal superkey.
Candidate keys and Primary Key: If a relation schema has more than one key, each is called a
candidate key One of the candidate keys is arbitrarily designated to be the primary key, and the
others are called secondary keys.
Prime and Non-Prime Attribute
A Prime attribute must be a member of some candidate key
A Non-prime attribute is not a prime attribute—that is, it is not a member of any candidate key
First Normal Form (1NF)
Definition: A Relation is in 1NF if every attribute of the relation is simple and atomic.
1NF Disallows composite attributes, multivalued attributes, and nested relations
Example:
In the above relation Dlocations is a multivalued attribute which can hold multiple values and
hence not in 1NF. To convert the relation into 1NF, the attribute values has to be made simple and
atomic. This can be done in 3 ways:
1st: Represent each values of the DLocations individually as shown:
The problem with this approach is however there is redundant data and it can lead to anomalies.
2nd: For each location of the department, create new attributes to represent the same.
DNAME DNUMBER DMGRSSN DLOCATION1 DLOCATION2 DLOCATION3
Research 5 333445555 Bellaire Sugarland Houston
Administration 4 987654321 Stafford - -
Headquarters 1 888665555 Houston - -
The problem with this approach is other tuples will have null values for Dlocation2, Dlocation3.
3rd: Remove the attribute Dlocations that violates 1NF and place it in a separate relation
DEPT_LOCATIONS along with the primary key Dnumber of DEPARTMENT.
For nested relations:
First normal form also disallows multivalued attributes that are themselves composite. These are called
nested relations because each tuple can have a relation within it. For example:
Has multiple-level nesting which must be unnest to make the relation into a set of 1NF relations as
follows:
Second Normal Form (2NF)
Uses the concepts of FDs, primary key
Prime attribute - attribute that is member of the primary key K
Full functional dependency - a FD Y -> Z where removal of any attribute from Y means the FD
does not hold any more
Examples: - {SSN, PNUMBER} -> HOURS is a full FD since neither SSN -> HOURS nor
PNUMBER -> HOURS hold
- {SSN, PNUMBER} -> ENAME is not a full FD (it is called a partial dependency ) since SSN -
> ENAME also holds
Definition: A relation schema R is in second normal form (2NF) if every non-prime attribute A in R is
fully functionally dependent on the primary key.
Example: consider the following relation shown below. Here the non-prime attribute Hours is Fully
functionally dependent on the PK of the relation where as other non-prime attributes are partially
dependent. Hence the relation is not in 2NF. The relation can be broken down into EP1, EP2 and EP3 to
ensure they satisfy 2NF.
Third Normal Form (3NF)
3NF is based on transitive dependency.
Definition:
Transitive functional dependency - a FD X -> Z that can be derived from two FDs X -> Y
and Y -> Z
Examples:
- SSN -> DMGRSSN is a transitive FD since
SSN -> DNUMBER and DNUMBER -> DMGRSSN hold
- SSN -> ENAME is non-transitive since there is no set of attributes X where SSN -> X and X ->
ENAME
Definition: A relation schema R is in third normal form (3NF) if it is in 2NF and no non-prime
attribute A in R is transitively dependent on the primary key
Alternate Definition: A relation schema R is in third normal form (3NF) if whenever a FD X -> A holds
in R, then either:
(a) X is a superkey of R, or
(b) A is a prime attribute of R
For example consider the following relation EMP_DEPT. The relation schema EMP_DEPT in Figure is
in 2NF, since no partial dependencies on a key exist. However, EMP_DEPT is not in 3NF because of the
transitive dependency of Dmgr_ssn (and also Dname) on Ssn via Dnumber. We can normalize
EMP_DEPT by decomposing it into the two 3NF relation schemas ED1 and ED2 shown
Boyce Codd Normal Form (BCNF)
BCNF is a Strict version of 3NF
Definition: A relation schema R is in Boyce-Codd Normal Form (BCNF) if whenever an FD X -> A
holds in R, then X is a superkey of R.
Consider the following example: R is in 3NF but not in BCNF as C is not a superkey. To convert the
relation into BCNF, decompose the relations such that X is always a superkey. This can be done as shown
below.
R1(A, C) and R2(A, B)
R1 (B, C) and R2(B, A)
R1 (C, B) and R2(C, A)
Fourth Normal Form (4NF)
4NF is based on Multi-Valued Dependency (MVDs)
Definition:
A multivalued dependency (MVD) X —>> Y specified on relation schema R, where X and Y are both
subsets of R, specifies the following constraint on any relation state r of R: If two tuples t1 and t2 exist in r
such that t1[X] = t2[X], then two tuples t3 and t4 should also exist in r with the following properties, where
we use Z to denote (R 2 (X υ Y)):
· t3[X] = t4[X] = t1[X] = t2[X].
· t3[Y] = t1[Y] and t4[Y] = t2[Y].
· t3[Z] = t2[Z] and t4[Z] = t1[Z].
Definition:
A relation schema R is in 4NF with respect to a set of dependencies F (that includes functional
dependencies and multivalued dependencies) if, for every nontrivial multivalued dependency X —>> Y in
F+, X is a superkey for R.
An all key relation such as EMP does not have any FD’s but has MVD’s. A relation not in 4NF can be
decomposed to remove the redundancy caused by the MVD as shown below.
Fifth Normal Form (5NF)
5NF is based on Join Dependancy
Definition:
A join dependency (JD), denoted by JD(R1, R2, ..., Rn), specified on relation schema R, specifies a
constraint on the states r of R. The constraint states that every legal state r of R should have a non-
additive join decomposition into R1, R2, ..., Rn; that is, for every such r we have
* (pR1(r), pR2(r), ..., pRn(r)) = r
Definition:
A relation schema R is in fifth normal form (5NF) (or Project-Join Normal Form (PJNF))
with respect to a set F of functional, multivalued, and join dependencies if, for every nontrivial
join dependency JD(R1, R2, ..., Rn) in F+ (that is, implied by F), every Ri is a superkey of R.
The relation SUPPLY is in 4NF because of no
MVD’s however it is no in 5NF, this can be
broken down into R1, R2, R3 to achieve 5NF as
follows
Algorithms:
For finding closure:
For finding key: