0% found this document useful (0 votes)
20 views144 pages

Understanding Database Anomalies

Chapter 4 discusses functional dependency and normalization in database design, emphasizing the importance of eliminating data redundancy and update anomalies. It outlines the process of normalization, which involves decomposing relations to produce well-structured tables that maintain data integrity and optimize performance. The chapter also defines functional dependencies and their types, illustrating how they relate to database design and the prevention of anomalies.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views144 pages

Understanding Database Anomalies

Chapter 4 discusses functional dependency and normalization in database design, emphasizing the importance of eliminating data redundancy and update anomalies. It outlines the process of normalization, which involves decomposing relations to produce well-structured tables that maintain data integrity and optimize performance. The chapter also defines functional dependencies and their types, illustrating how they relate to database design and the prevention of anomalies.
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Chapter 4: Functional Dependency and Normalization

1
Outline:

4.1. Data Redundancy and Update Anomalies


4.2 Functional Dependency
4.3. Normal Forms
4.4. Process of Normalization

2
Recall: Database design lifecycle
• Requirements analysis
• User needs; what must database do?
• Conceptual design
• High-level description; often using E/R model
• Logical design
• Translate E/R model into relational schema
• Schema refinement
• Check schema for redundancies and anomalies
• Physical design/tuning
• Consider typical workloads, and further optimise

3
• Why are some designs bad?
• What’s a functional dependency?
• What’s the theory of functional dependencies?
• How can we use this theory to classify redundancy in relation design

4
Not all designs are equally good
• Why is this design bad?
Data(sid,sname,address,cid,cname,grade)

• Why is this one preferable?

Student(sid,sname,address)
Course(cid,cname)
Enrolled(sid,cid,grade)

5
An instance of our bad design

sid sname address cid cname grade


124 Abebe Addis Ababa 206 Database A++
204 Ayele Bahirdar 202 OOP C
124 Hagos Jima 201 S/Eng A+
206 Aster Hawas 206 Database B-
124 Abebe Addis Ababa 202 OOP B+

6
Redundancy
• Redundancy is the root of many problems associated with relational schemas

• Redundant storage

• Update anomalies

• Insertion anomalies

• Deletion anomalies

• LOW TRANSACTION THROUGHPUT

• In general, with higher redundancy, if transactions are correct (no anomalies), then they
have to lock more objects thus causing greater contention and lower throughput

7
Anomalies

• An anomaly refers to an undesirable condition or side effect that occurs in a


database table, usually due to poor design and data redundancy. These
anomalies can lead to problems with data integrity, making it difficult to insert,
update, or delete data consistently and accurately.
• The root cause of most anomalies is often placing information about multiple
different "things" (entities) or relationships into a single table when they
should be separated. This leads to repeating data unnecessarily (redundancy).

8
Why are anomalies Bad?

• Data Inconsistency: The same piece of information might exist in multiple


places, and updates might only be applied to some, leading to conflicting data.

• Data Integrity Issues: It becomes hard to enforce rules that ensure the data is
accurate and reliable.

• Wasted Storage: Storing the same data multiple times consumes unnecessary
space.

• Maintenance Problems: Inserting, updating, and deleting data becomes


complex and error-prone

9
Types of Anomalies
The three main types of database anomalies are:
1. Insertion anomaly
2. Deletion anomaly
3. Update anomaly

10
Example: Employee_Department_Info(A poorly designed table)

EmpID EmpName DeptID DeptName DeptHOD


101 Alice D01 HR Mr. Smith
102 Bob D01 HR Mr. Smith
103 Charlie D02 IT Ms. Jones
104 David D02 IT Ms. Jones
105 Eve D01 HR Mr. Smith
106 Mohammed D03 Marketing [Link]

Employee_Department_Info a single table designed to store information about employees, their departments,
and the department's manager (Head of Department - HOD).

• Notice the redundancy: 'HR' and 'Mr. Smith' are repeated for every HR employee; 'IT' and
'Ms. Jones' are repeated for every IT employee. This redundancy is the source of the
anomalies.
11
Insertion Anomaly
• An insertion anomaly occurs when you cannot add certain information to the database without
adding some other, unrelated information first. It often prevents you from recording details
about a new entity until it's related to another existing entity in the table.
Example using Employee_Department_Info:
• Suppose the company wants to create a new department, 'Finance' (DeptID D03), with 'Mr. Brown' as
the HOD. However, no employees have been assigned to this department yet.
• Problem: Using the table structure above, how do we add the 'Finance' department? We cannot add a
row for DeptID 'D03', DeptName 'Finance', and DeptHOD 'Mr. Brown' without also providing an
EmpID and EmpName. The table structure forces us to associate a department with an employee. We
either have to leave EmpID/EmpName blank (which might violate database constraints if EmpID is a
primary key or required field) or create a dummy employee just to record the department's existence.
• Consequence: Difficulty in adding new entities (like a department) if they don't yet have associated
12
entities (like employees) required by the table structure.
Deletion Anomaly
• A deletion anomaly occurs when deleting a record (row) unintentionally removes other
essential information that was stored only in that record. You lose facts about one entity when
you delete information about another entity.
Example using Employee_Department_Info:
• Suppose employee ‘Mohammed' (EmpID 106) is the last remaining employee in the ‘Marketing'
department If Mohammed leaves the company and we delete his record (the row with EmpID 106) from
the table:
• Problem: We not only lose the information that Mohammed worked for the company, but we also lose
the fact that there is an ‘Marketing' department (D03) and that its HOD is 'Ms. Kedir'. This information
existed only in the rows associated with Marketing employees.
• Consequence: Accidental loss of valuable data about one entity (the department) when removing data
about another entity (the last employee in that department).
13
Update(or Modification) Anomaly
• A update anomaly occurs when updating a single piece of information requires updating
multiple rows. If the update is not performed consistently across all relevant rows, it leads to
data inconsistency.
Example using Employee_Department_Info:
• Suppose the HOD of the 'HR' department changes from 'Mr. Smith' to 'Ms. Davis'.
• Problem: To reflect this change accurately, we must find every single row where DeptID is 'D01' (the
HR department) and change 'Mr. Smith' to 'Ms. Davis'. In our example table, this means updating the
rows for Alice (101), Bob (102), and Eve (105). If we miss even one row, or if there's a typo in one
update, the database will become inconsistent, showing conflicting HODs for the same department.
• Consequence: Data inconsistency due to partial updates, increased effort and potential for errors during
data modification.

14
How to Avoid Anomalies: Normalization (to be seen next)

15
Functional dependencies(FDs)
• Basic tool for analyzing relational schemas
• Are used to specify formal measures of the "goodness" ofrelational designs
• Functional dependencies are relationships between attributes in a table.
• A functional dependency exists when the value of one attribute uniquely determines the value of another attribute.
• In DB, we often have the case where one field defines the other. The value of one attribute in a table is determined
entirely by the value of another.
Eg1: Social Security Number (SSN) defines a name.
• If I know someone's SSN, then I can find their name.W will say that we have defined name as being functionally
dependent on SSN.

16
How to Denote a Functional Dependency?
• A functional dependency is denoted by an arrow “→”. The functional dependency of A on B is
represented by A → B. It means each value of A is associated with exactly one value of B
• Consider a relation with four attributes A, B, C and D,
R (ABCD)
• A → BCD
• B → CD
• For the first functional dependency A → BCD, attributes B, C and D are functionally
dependent on attribute A.
• Function dependency B → CD has two attributes C and D functionally depending upon
attribute B.
• In the functional dependency A→B, A is the determinant, and B is functionally dependent
on A
17
How to Denote a Functional Dependency?
• Sometimes everything on the left side of functional dependency is also referred to as
determinant set, while everything on the right side is referred to as depending attributes

• Functional dependency can also be represented diagrammatically like this,

R(ABCDE)

1) A → BCD 2) B → CDE

18
Consider the table below:
sid sname address cid cname grade
124 Abebe Addis Ababa 206 Database A++
204 Ayele Bahirdar 202 OOP C
125 Hagos Jima 201 S/Eng A+
206 Aster Hawas 206 Database B-
124 Abebe Addis Ababa 202 OOP B+

19
Functional dependencies cont.

• We can say that sid determines address


• We’ll write this
sid → address

• This is called a functional dependency (FD)

• (Note: An FD is just another integrity constraint)

20
Functional dependencies cont.

• We would expect the following functional dependencies to hold in our Student


database
• sid → sname,address
• cid → cname
• sid,cid → grade
• A functional dependency X → Y is simply a pair of sets (of field names)

21
Example:
EmpNo → Name. This means that every
EmpNo Job Name time we find 104, we find the name, Fred.
101 President Herbert • Just because something is on the left-hand
104 Programmer Fred
103 Designer Beryl side of a FD, it does not imply that you
103 Programmer Beryl have a key or that it will be unique in the
database.
• i.e the FD X → Y only means that for
every occurrence of X you will get the
same value of Y.

22
Figure 14.1 Simplified version of the COMPANY relational database schema.

23
Examples of Functional Dependencies

24
Types of Functional Dependencies in DBMS

1. Trivial functional dependency


2. Non-Trivial functional dependency
3. Multivalued functional dependency
4. Transitive functional dependency

25
1. Trivial functional dependency

• In Trivial functional dependency, a dependent is always a subset of the


determinant.
• In other words, a functional dependency is called trivial if the attributes on the right
side are the subset of the attributes on the left side of the functional dependency.

X → Y is called a trivial functional dependency if Y is the subset of X.

Example:
{ Employee_Id, Name } → { Name } is a Trivial functional dependency, since the
dependent Name is the subset of determinant { Employee_Id, Name }.

26
2. Non-Trivial functional dependency

• It is the opposite of Trivial functional dependency.


• In Non-Trivial functional dependency, a dependent is not a subset of the
determinant.
• X → Y is called a Non-trivial functional dependency if Y is not a subset of X.
So, a functional dependency X → Y where X is a set of attributes and Y is also a
set of the attribute but not a subset of X, then it is called Non-trivial functional
dependency.

Example:
{ Employee_Id, } → { Name } is a Non-Trivial functional dependency.
27
3. Multivalued functional dependency

• In Multivalued functional dependency, attributes in the dependent set are not dependent
on each other.
• For example, X → { Y, Z }, if there exists is no functional dependency between Y and Z,
then it is called as Multivalued functional dependency.

Example:
{ Employee_Id } → { Name, Age } is a Multivalued functional dependency, since the dependent attributes
Name, Age are not functionally dependent(i.e. Name → Age or Age → Name doesn’t exist !).

28
3. Transitive functional dependency

• Consider two functional dependencies A → B and B → C then according to the


transitivity axiom A → C must also exist. This is called a transitive functional
dependency.
Employee_Id Name Department Street Number
1 Zayn CD 11
2 Phobe AB 24
3 Hikki CD 11
4 David PQ 71
5 Phobe LM 21

Here, { Employee_Id → Department } and { Department → Street Number } holds true. Hence, according to
the axiom of transitivity, { Employee_Id → Street Number } is a valid functional dependency.
29
Armstrong’s Axioms/Properties of Functional Dependency

• William W. Armstrong established a set of rules which can be used to Inference the
functional dependencies in a relational database:

[Link] Reflexive Rule If X is a composite, composed of A and B, then X→ A and X→ B.


(If A is a subset of X, then X→ A ) . This is called trivial FD.
Example:
In Emp_Proj
Let X=SSN,PNUMBER
X →SSN and X →PNUMBER

30
Functional dependencies(FDs (Con’t)

2. Augmentation: If a functional dependency A → B holds true, then appending any


number of the attribute to both sides of dependency doesn't affect the dependency. It
remains true.
For example, X → Y holds true then, ZX → ZY also holds true

For example, if { Employee_Id, Name } → { Name } holds true then, { Employee_Id,


Name, Age } → { Name, Age }

3. Transitivity: If two functional dependencies X → Y and Y → Z hold true, then X → Z


also holds true by the rule of Transitivity.

31
Normalization
• Is a formal process for deciding which attributes should be grouped together in a relation.

• It is a technique for analyzing tables based on primary keys or candidate keys.

• It is the process of decomposing relations with anomalies to produce smaller, well-


structured relations.

• Normalization is a technique for producing a set of relations with desirable properties, given
the data requirements of an enterprise.

• The process of decomposing unsatisfactory "bad" relations by breaking up their attributes


into smaller relations

32
Normalization (Cont’d)
• The process of normalization is a formal method that identifies relations based on their
primary key.

• Primarily a tool to validate and improve a logical design so that it satisfies certain constraints
that avoid unnecessary duplication of data

• A process used to organize data in a relational database to eliminate redundancy, ensure data
integrity, and optimize query performance.

• It involves decomposing a larger, complex table into smaller, well-structured tables while
maintaining relationships between them.

• The process of decomposing relations with anomalies to produce smaller, well- structured
relations 33
Well-Structured Relations

• A relation that contains minimal data redundancy and allows users to insert, delete, and update rows
without causing data inconsistencies

• The Goal of normalization is to avoid anomalies


• Insertion Anomaly – adding new rows forces user to create duplicate data
• Deletion Anomaly – deleting rows may cause a loss of data that would be needed for other future rows
• Modification Anomaly – changing data in a row forces changes to other rows because of duplication
• Minimize NULL values in tuples, as they can indicate: Inapplicable or invalid attributes.
• Unknown or unavailable values.
• Place frequently NULL attributes in separate relations with the primary key.

• General rule of thumb: a table should not pertain to more than one entity type
Problems caused by Unnormalized forms
• Redundant storage
• Potential inconsistency (eg. Spelling errors)
• Anomalies
o Insertion anomaly
o Deletion anomaly
o Update anomaly
5
• Unnormalized Form (UNF)
• A table that contains one or more repeating groups.
• A repeating group is a field or group of fields that hold multiple values for a single
occurrence of a field.

36
• Unnormalized Form (UNF)
• A table that contains one or more repeating groups.
• A repeating group is a field or group of fields that hold multiple values for a single
occurrence of a field.
studid name courseid coursename dept dhead
Repeating group =
(courseid,coursename,dept
123 Abebe M Inst 223 Organization Inf. sc. Bekele ,dhead)
Inst 224 Systems Inf. Sc. Bekele
Geog 211 Geography Geography Woldu
Mtpa 123 organization management Tsega
234 Worku K. Inst 224 Systems Inf. Sc. Bekele
Inst 334 database Inf. Sc. Bekele
345 Abebe W. Inst 223 Organization Inf. Sc. Bekele
Geog 211 Geography Geography Woldu
Mtpa 123 Organization management Tsega

37
Unnormalized Form (UNF) (Cont’d)

Question – Is this a relation? Answer –Yes: unique rows and no multivalued Attributes

Question – What’s the primary key? Answer – Composite: Emp_ID, Course_Title


Anomalies in the table on the previous slide:
• Insertion – can’t enter a new employee without having the employee take a class

• Deletion – if we remove employee 140, we lose information about the existence of a Tax Acc
class

• Modification – giving a salary increase to employee 100 forces us to update multiple records

• Why do these anomalies exist?

• Because we’ve combined two themes (entity types) into one relation. This results in
duplication, and an unnecessary dependency between the entities
• Normalization is guided by a series of rules called normal forms, each addressing specific

types of redundancy and dependency issues.


• Reduces data redundancies
• Helps eliminate data anomalies
• Produces controlled redundancies to link tables
The normal forms (NF) are :

1. First Normal Form (1NF)


2. Second Normal Form (2NF)
3. Third Normal Form (3NF)
4. Boyce-Codd Normal Form (BCNF)
5. Fourth Normal Form (4NF)
40
6. Fifth Normal Form (5NF)
Normal Forms (cont’d)
• Each normal form involves a set of rules that can be tested against each table in the
system.

• If a table violates the rules of some normal form, then we decompose the table in to
tables that individually meet the requirements of normalization.

• Each higher form of normalization is based on the form prior to it.

41
First Normal Form (1NF)
• A relation is in 1NF if it contains no multi-valued attributes and the intersections of each
row and column contains one and only one value.

• It was defined to disallow multivalued attributes, composite attributes, and their combinations.

• It states that the domain of an attribute must include only atomic (simple, indivisible) values
and that the value of any attribute in a tuple must be a single value from the domain of that
attribute.

• Hence, 1NF disallows having a set of values, a tuple of values, or a combination of both as an
attribute value for a single tuple. In other words, 1NF disallows "relations within relations" or
"relations as attribute values within tuples."
42
First Normal Form (1NF) (Cont’d)
• The only attribute values permitted by lNF are single atomic (or indivisible) values.

• Each cell to be single valued, No multi valued attributes

• Rows uniquely identified, add unique id, or add more columns to make unique

• Entries in a column are the same type

In Summary:
• A table is in 1NF if:
• All attributes (columns) contain only atomic (indivisible) values.

• Each attribute contains values of a single type.

• There are no repeating groups or arrays in any column.

43
Example: Unnormalized Table (Not in 1NF)

StudentID StudentName CoursesTaken


1 Alice Math, Physics
2 Bob Chemistry, Biology

Problem: The CoursesTaken column contains multiple values (a repeating group),


violating 1NF.
Normalized to 1NF:

44
Example: Unnormalized Table (Not in 1NF)

studid name courseid coursename dept dhead


123 Abebe M Inst 223 Organization Inf. sc. Bekele
Inst 224 Systems Inf. Sc. Bekele
Geog 211 Geography Geography Woldu

234 Worku K. Inst 224 Systems Inf. Sc. Bekele


Inst 334 database Inf. Sc. Bekele
345 Abebe W. Inst 223 Organization Inf. Sc. Bekele
Geog 211 Geography Geography Woldu

• This is not in 1NF because coursed,coursename, dept, dhead are not an atomic attributes.

45
Conversion to 1NF

• Eliminate repeating groups (ensure each column contains atomic values).

• Proper primary key must be developed (Uniquely identifies attribute values or rows)

• Expand the key so that there will be a separate tuple in the original relation.

46
Examples:
1. The student table shown below is not in 1NF.

StudentID StudentName CoursesTaken


1 Alice Math, Physics
2 Bob Chemistry, Biology

Table in 1NF is :
• Expand the table
• Set new Primary key

StudentID StudentName CourseTaken


1 Alice Math
1 Alice Physics
2 Bob Chemistry
2 Bob Biology
Each row now contains a single course, and the CourseTaken column holds
47
Choose PK={StudentID,CourseTake} atomic values. This eliminates the repeating group.
Example 2: The Table below is Not in 1NF)

studid name courseid coursename dept dhead


123 Abebe M Inst 223 Organization Inf. sc. Bekele
Inst 224 Systems Inf. Sc. Bekele
Geog 211 Geography Geography Woldu

234 Worku K. Inst 224 Systems Inf. Sc. Bekele


Inst 334 database Inf. Sc. Bekele
345 Abebe W. Inst 223 Organization Inf. Sc. Bekele
Geog 211 Geography Geography Woldu

• This is not in 1NF because coursed,coursename, dept, dhead are not an atomic attributes.

48
1NF of the table in the previous slide:
• Expand the table
• Set new Primary key

Pk = {studid,Courseno}

49
Example 3: The car-owners Table show below is Not in 1NF)

Below is the car_owners table in 1NF

Pk={plateNo}

50
Example 4: In the company database, the department table shown below is Not
in 1NF

51
Second Normal Form
Definitions:
• A prime attribute is an attribute that is part of any candidate key of a relation

• A non-prime attribute is an attribute that is not part of any candidate key in the relation.

• Partial Dependency-When a non-prime attribute is dependent on only part of a composite


candidate key.

Example:
{SSN, PNUMBER} -> ENAME is a partial dependency) since
SSN -> ENAME also holds.
Full functional dependency
A FD Y -> Z where removal of any attribute from Y means the FD does not hold any more

Examples:
• {SSN, PNUMBER} -> HOURS is a full FD since neither SSN -> HOURS nor PNUMBER ->
HOURS hold

• {SSN, PNUMBER} -> ENAME is not a full FD (it is called a partial dependency)
since SSN -> ENAME also holds
• Emp_course (empid, course_title, Date_completed)
Empid, Course_title → Date_Completed is a full FD

53
Second Normal Form

2NF Uses the concepts of FDs, primary key


• Must be in 1NF.
• Every non-prime attribute must be fully functionally dependent on every
candidate key (no partial dependencies).

• 2NF removes partial dependencies, ensuring that non-prime attributes depend


on the whole primary key, not just part of it. This reduces redundancy in tables
with composite keys.
• A relation is in second normal form (2NF) if it is in 1NF & every non key attribute
is fully functionally dependent on the primary key.

• R can be decomposed into 2NF relations via the process of 2NF normalization

Note:

• No non-key attribute is functionally dependent on part of the primary key.

• A relation that is in 1st normal form will be in 2NF if any one of the following
conditions applies:
1. The primary key consists of only one attribute.
2. No non key attributes exist in the relation. (That is, all of the attributes in the relation are
components of the Pk.
55
3. Every non key attribute is functionally dependent on the full set of primary key attributes.
Steps to Convert to 2NF:
• To achieve 2NF, we eliminate partial dependencies by splitting the table based on attributes
that depend on subsets of the composite primary key.
• Identify Partial Dependencies
• Create New Tables for Partial Dependencies:
• Update the Original Table
• Define Relationships

56
Example: - Student _ Relative

57
Student_relative(Student_id,Relative_id,R_P,Stud_tele)

• The PK of the above Table is (Stud_id , Relative_id )


• The stud_ tele field is not fully functionally dependent on the PK ,i.e,
Partial Dependency:
(Stud_id, Relative id)→ Stud _tele is not a full Functional
dependency because:
Stud _id → stud _tele is a correct functional dependency.
• So, to normalize the above table, we need to split it in to two using the
functional dependency: Stud_id →stud-tele as a guide.
• The resulting tables are:
Student_relative(Student_id,Relative_id,R_P,) Student_tele( Stud _id, stud _tele)
58
Example: Consider the Employee table shown below

When the table is decomposed and becomes


2NF
EmpID, CourseTitle ➔ DateCompleted
Course(EmpID, CourseTitle , DateCompleted)
EmpID ➔ Name, DeptName, Salary
Employee(EmpID, Name, DeptName, Salary )
Therefore, NOT in 2nd Normal Form!!
59
Convert the following table into 2NF

60
Solution

PK={studid, courseno}

Consider the partial Dependencies:

Studid ➔ name

courseno ➔ course_name,Dept,Dhead

2nd normal form conversion results


Student (studid, name)
Course (courseno,course_name)
ASSIGN (studid,courseno)
61
Third Normal Form
Recall transitive functional dependency:
If we have the functional dependencies :
A→ B and
B→ C then A→ C
Then we say that C is transitively dependent on A.
Examples:
SSN -> DMGRSSN is a transitive FD since
SSN → DNUMBER and
DNUMBER → DMGRSSN hold
SSN → ENAME is non-transitive since there is no set of attributes X where SSN → X and
62
Third Normal Form…
• A relation schema R is in third normal form (3NF) if it is in 2NF and no
non-prime attribute A in R is transitively dependent on the primary key

• A table is in 3NF if:


1. It is in 2NF (no partial dependencies on the primary key).
2. There are no transitive dependencies, meaning non-key attributes do not depend on
other non-key attributes (all non-key attributes depend only on the primary key or
candidate keys).

• R can be decomposed into 3NF relations via the process of 3NF


normalization
63
Note:
• In X -> Y and Y -> Z, with X as the primary key, we consider this a problem
only if Y is not a candidate key. When Y is a candidate key, there is no
problem with the transitive dependency.

E.g.

Consider EMP (SSN, Emp#, Salary ).

• Here, SSN -> Emp# -> Salary and Emp# is a candidate key.

64
Conversion to 3NF
• A relation that is in 1NF and 2NF and in which no non-primary-key
attribute is transitively dependent on the primary key.

• Identify the primary key in the 2NF relation.

• Identify functional dependencies in the relation.

• If transitive dependencies exist on the primary key remove them by


placing them in a new relation along with a copy of their dominant.

65
• Consider the following table:

Now, PK : empid
We have functional dependencies:
Empid → depid
Depid → depname
Or Depid → depbudjet
Therefore, the above table is not is 3NF. To normalize it, we can use the functional dependencies:
Depid → depname
Depid → depbudjet
And
Empid → depid
66
So that the resulting tables are the following:

67
Example: SALES relation with simple data

Relation with transitive dependency


Removing a transitive dependency

(a) Decomposing the SALES relation


Salesperson Region

CustID → Name
CustID → Salesperson

Now, there are no transitive dependencies… Both relations are in 3rd NF


Boyce-Codd Normal Form (BCNF)

• While Third Normal Form (3NF) is generally sufficient for organizing relational
databases, it may not completely eliminate redundancy. Redundancy can still
occur if there’s a dependency X→Y where X is not a candidate key. This issue is
addressed by a stronger normal form known as Boyce-Codd Normal Form
(BCNF).

• Based on functional a dependency that takes into account all candidate keys in a relation.

• For a relation with only one candidate key, 3NF and BCNF are equivalent.

• A relation is in BCNF, if and only if every determinant is a candidate key.


• All determinants are candidate keys…there is no determinant that is not a unique
identifier 71
Boyce-Codd Normal Form (BCNF)

• Boyce-Codd Normal Form (BCNF) is a stricter version of Third Normal Form (3NF) that
ensures a more simplified and efficient database design. It enforces that every non-trivial
functional dependency must have a superkey on its left-hand side.
• This approach addresses potential issues with candidate keys and ensures the database is
free from redundancy.
• BCNF eliminates redundancy more effectively than 3NF by strictly requiring that all
functional dependencies originate from super-keys.

72
3NF to BCNF
• Identify all candidate keys in the relation.

• Identify all functional dependencies in the relation.

• If functional dependencies exist in the relation where their


determinants are not super key(or candidate keys) for the relation,
remove the functional dependencies by placing them in a new relation
along with a copy of their determinant.

73
Rules for BCNF

Rule 1: The table should be in the 3rd Normal Form.

Rule 2: X should be a super-key for every functional dependency (FD) X−>Y in a


given relation.

Note: To test whether a relation is in BCNF, we identify all the determinants and make
sure that they are candidate keys.

74
Example:
Assumptions:
Student Instructor Course
• One instructor teaches one course.
Ayele Dagne Database
• One course can be taught by many
Ayele Hagos C++ instructors
Kedir Dagne Database • One student can take many courses
Kedir Asnake C++

FD: {
(student, Instructor) -> Course
(student, Course) -> Instructor,
(Instructor) -> Course
}

75
• Candidate keys are (student, Instructor) and (student, course).
• The above relation is in 3NF (since there is no transitive dependency). A relation R is in
BCNF if for every non-trivial FD X->Y, X must be a key.
• The above relation is not in BCNF, because in the FD (teacher->subject), teacher is not a key.

• This issue occurs because the instructor is a determinant but not a candidate key.
• R is divided into two relations R1(Instructor,Course) and R2(Student, Instructor).

76
Example:

Instructor Course Student Instructor


Dagne Database Ayele Dagne
Hagos C++ Ayele Hagos
Kedir Dagne
Kedir Asnake

77
Other Normal Forms
• 4th NF
• No multivalued dependencies
• 5th NF
• No “lossless joins”
• Domain-key NF
• The “ultimate” NF…perfect elimination of all possible anomalies
Physical Database Design

79
✓ Introduction to Physical Design
✓ Steps in physical Database Design
• Designing Base Table
• Designing Constraints and Business Rules
• Selecting appropriate file organization based on the analysis of the transactions
• Selecting index to improve index
• Designing User View
• Designing Security Measures
✓ Introduction to File Organization
• Types of File Organization
✓ Introduction to Indexes

80
Physical Design
• The aim of physical database design is to decide how the logical database
design will be implemented.

• For the relational database, this involves:


• Defining a set of the table structures, data types for fields, and constraints on these tables such
as primary key, foreign key, unique key, not null and domain definitions to check if data are out
of the range.

81
Physical Design(Cont..)

• Identifying the specific storage structures and access methods to retrieve data efficiently.
For example, adding a secondary index to a relation.

• Designing security features for the database system including account creation, privilege
granting/revocation, access protection, and security level assignment.

82
Physical Design(Cont..)

• Physical database design is the process of producing a description of the implementation of


the database on secondary storage; it describes the base relations, file organizations, and
indexes used to achieve efficient access to the data, and any associated integrity constraints
and security measures.

83
Physical Design(Cont..)

• Sources of information for the physical design process includes global logical data model
and documentation that describes model.

• Logical database design is concerned with the what, physical database design is concerned
with the how.
Physical Design(Cont..)

• Physical design is DBMS-specific whereas logical design by contrast is DBMS-independent.


Logical design is concerned with the what; physical database design is concerned with the
how. In short, physical design is a process of implementing a database on secondary storage

with a specific DBMS.

85
• Design physical representations
• Analyze transactions
• To understand the functionality of the transactions that will run on the database
and to analyze the important transactions
• Design security mechanisms
• Design user views
• to design the user views that were identified in the conceptual database design
methodology
• Design access rules
• to design the access rules to the base relations and user views
• Choose file organizations

• to determine an efficient file organization for each base relation.

• Consider the introduction of controlled redundancy

• to determine whether introducing redundancy in a controlled manner by relaxing the


normalization rules will improve the performance of the system.

• Estimate disk space requirements

• to estimate the amount of disk space that will be required by the database
Physical Design(Cont..)

• How to translate entities into physical tables

• What attributes to use for columns of the physical tables

• Which columns of the tables to define as keys

• What indexes to define on the tables

• What views to define on the tables

88
Step 1: Designing Base Table

• For each identified table in the logical data model, identify:

• The name of the table

• A list of simple columns

• The primary key and Foreign keys

• Referential integrity constraints for each identified foreign key.


• For each column,
• Describe its domain, data type, length, and any constraint on the domain
• An optional default value for the column
• Where the column can hold nulls
• Whether the column is derived and, if so , how it should be computed.

89
Example:

For the Branch relation schema shown below, design its corresponding base table in a
relational data model using SQL Server DBMS.

Branch relation schema:

Branch(branchNo,street,city,state,zipCode,mgrstaffNo)
• Pk: branchNo

• Candidate key: zipCode

• Foreign key: mgrstaffNo

90
Solution

• Domain Branch_Numbers fixed length character string length 4

• Domain Street_Names variable length character string maximum length 30

• Domain city_Names variable length character string maximum length 20

• domain state_Codes fixed length character string length 2

• domain zip_Codes fixed length character string length 5

• domain staff_Numbers fixed length character string length 5

91
Solution(Cont..)
Branch(
Branch_Numbers NOT NULL,

street street_Names NOT NULL,

city city_Names NOT NULL,

State stae_Names NOT NULL,

zipCode zip_codes NOT NULL,

mgrStaffNo staff_Numbers NOT NULL,

Primary key branchNo

Alternate key zipCode

Foreign key mgrStaff References staff(StaffNo


92
)
Step 2: Designing Constraints and Business Rules
Design Constraints

• Design any other business rules that have to be imposed on the data

Example:

• Preventing a member from renting more than 10 videos at any one time.

• Maximum rental period for videos is five days

93
Designing Constraints (cont..)

• Default Value: A value a field will assume unless an explicit value is entered for that field

• Values Range control: Limits range of values that can be entered into field

• Referential Integrity: An integrity constraint specifying that the value (or existence) of an
attribute in one relation depends on the value (or existence) of the same attribute in another
relation

• Null Value: A special field value , distinct from zero, blank, or any other value , that
indicates that the value for the field is missing or otherwise unknown.

94
Step 3: Analyze users’ transactions to determine characteristics that may impact
performance

• Select appropriate file organization based on the analysis of the transactions

• Select index to improve index

95
Introduction to File Organization
• The database on secondary storage is organized into one or more files, where each file consists

of one or more records and each record consists of one or more fields.

• A record corresponds to entity occurrence

• A field corresponds to attribute/column.

96
Consider the EMPLOYEE table shown below

ID Name Age Salary City Country


1 Ramesh 32 2000.00 Hyderabad India
1 Ramesh 32 2000.00 Delhi India
2 Mukesh 40 5000.00 New York USA
3 Sumit 45 4500.00 Muscat Oman
4 Kaushik 25 2500.00 Kolkata India

• Each record in this table maps to a record in the operating system file that holds the
EMPLYEE table.
• The physical record is the unit of transfer between disk and primary storage and vice

versa.
97
• The terms “Block” and “Page” are generally used in place of physical record.
• A single record maybe stored in one or more pages.

File Organization
• A way of arranging the records in a file when the file is stored on disk.

• File organization refers to how records are physically stored and retrieved in a database.

The main Types of File Organization:


• Heap(unsorted)

• Sequential(Sorted)

• Hash Files

98
• Along with a file organization, there is a set of access methods.

Access Method:

• The steps involved in storing and an retrieving records from a file.


Heap Files
• Unordered file, sometimes called heap file
• Records are placed in the file in the same order as they are inserted.
• A new record is inserted in the last page of the file, if there is insufficient space in the last page , a new
page is added to the file.
• This makes insertion very efficient
• A heap file no particular ordering with respect to field values
• The records are not sorted or ordered in any way
• A linear search must be performed to access a record.
• A linear search involves reading pages from the file until the required record is found. 99
How Data is Stored in a Heap File?
• Records are inserted in the order they arrive, without sorting or hashing.
• New records are added at the end of the file (or in any available free space).
• No enforced relationship between logical order (e.g., primary key) and
physical order on disk.
• Deleted records may leave gaps, which can be reused later

100
Example of Heap File Storage
Suppose we have a Students table:

StudentID Name Age Major


101 Alice 20 CS
102 Bob 22 Physics
103 Carol 21 Math
Heap File Representation on Disk:

Block 1: [101, Alice, 20, CS] → [102, Bob, 22, Physics]


Block 2: [103, Carol, 21, Math] → [Empty space]

• New records are appended to the next available slot (e.g., if a new student 104, Dave, 23,
Biology is inserted, it goes into Block 2's empty space).
101
How Are Heap Files Accessed?

Since heap files are unordered, accessing data requires:

Sequential Scan (Full Table Scan)

The DBMS reads every block in the file to find matching records.

Example: SELECT * FROM Students WHERE Major = 'CS' requires scanning


all blocks.

102
Heap Files(Cont..)

• In a heap file organization, if we wish to search, update, or remove data, we must


traverse the data from the beginning of the file until we find the desired record.

• Because there is no ordering or sorting of records, searching, updating, or removing


records will take a long time if the database is huge.

• We must check all of the data in the heap file organization until we find the
necessary record.

103
Heap Files(Cont..)

Advantages of Heap File Organization

• Fast inserts (just append at the end).

• No overhead for maintaining indexes or sorting.

• Good for bulk-loading data (e.g., initial database population).

• It’s a great way to organize your files for mass inclusion. This method is best suited when a
significant amount of data needs to be loaded into the database at once.

104
Heap Files(Cont..)

Disadvantages of Heap File Organization

• For huge or complicated databases, this type of organization could not be used.

• Slow searches (must scan entire file for queries).


• Because it takes time to find or alter a record in a large database, this method is comparatively inefficient.

• Inefficient for large tables (full scans are expensive).

• Fragmentation over time (deletions create gaps)

105
When to Use Heap Files?
• Temporary tables (no need for long-term optimization).
• Logging/staging data (insert-heavy, rarely queried).
• Small lookup tables (where scans are still fast).

106
Sequential File (Sorted file) Organization

• The new record is always added at the end of the file in this approach, and the sequence is
then sorted in descending or ascending order.

• Records are sorted using any primary key or other keys.

• If a record is modified, it will first update the record, then sort the file, and then store the
revised record in the correct location.

107
Storage Structure

• Records are stored in sorted blocks on disk.

• New records are inserted in the correct sorted position, which may require shifting
existing records.

• Deletions create gaps that may be filled during reorganization.

108
Sequential File (Sorted file) Organization

• Insertion of the New Record:


• Assume there is a pre-existing sorted sequence of four records, R1, R3 … R9, and
R8. If a new record R2 needs to be added to the sequence, it will be added at the end
of the file, and then the series will be sorted.

109
Sequential File (Sorted file) Organization(Cont..)

110
Example: Employee Table (Sequential File)

Suppose we store employees sorted by EmployeeID:


Block 1 Block 2 Block 3
E001 E003 E005
E002 E004 E006

Inserting E003.5 requires shifting E004 and E005 to make space.

Searching for E004 uses binary search (faster than heap scan).

111
Access Methods in Sequential Files

Binary Search

For equality searches (WHERE EmployeeID = E004), the DBMS jumps to the middle of
the file and narrows down the range.

Time Complexity: O(log n) (vs. O(n) for heaps).

112
Advantages of sequential file organization

• Faster searches (binary search possible on sorted data).


• Efficient range queries (e.g., date ranges, ID ranges).
• No extra sorting needed for queries matching the file order.
• Better for read-heavy workloads where data is frequently accessed in order.

113
Disadvantages of Sequential Files

• Slower inserts/deletes (Sorted file method takes more time and space for sorting the
records. Must shift records to maintain order).

• Fragmentation (deletions create gaps; periodic reorganization needed).

• Not optimal for random writes (e.g., OLTP with frequent small inserts).

114
Sequential File vs. Heap File
Scenario: Employee Database
Operation Heap File Sequential File
Insert Fast (append) Slow (shift records)
Search Full scan(O(n) Binary search (O(log n))
Range Query Full scan Fast (sequential read)

115
Hash File Organization

• Hash file organization is a storage method that uses a hash function to directly map record
keys to physical storage locations. Unlike sequential or indexed files, it provides O(1)
average-case access time for exact-match queries.

• Hashing refers to the process of generating a fixed-size output from an input of variable size
using the mathematical formulas known as hash functions. This technique determines an
index or location for the storage of an item in a data structure.

• Hash File Organization uses the computation of the hash function on some fields of a record.
The output of the hash function defines the position of the disk block where the records will

be stored. 116
Hash File Organization(Cont..)

• When a record is requested using the hash key columns, an address is generated, and the
entire record is fetched using that address.

• When a new record needs to be inserted, the hash key is used to generate the address, and
the record is then directly placed. In the case of removing and updating, the same
procedure is followed.

117
How Hash Files Work
1. Hash Function Calculation

• A hash function converts a record's key into a bucket address:

Eg: h(key) = key % N (where N = number of buckets)

• a bucket is a fundamental storage unit that holds records mapped to the same hash value. Think of it as a
"slot" or "container" where the database stores one or more records that share the same hash key.
Example:
For EmployeeID as key and N=3 buckets:
•h("E101") = 101 % 3 = 2 → Store in Bucket 2
•h("E102") = 102 % 3 = 0 → Store in Bucket 0
Physical Layout Example:

2. Data Storage Structure Bucket 0: [E102, Bob]


Bucket 1: [E103, Carol] → [E106, Alice]
Buckets (blocks) store records at computed addresses. (collision chain)
Bucket 2: [E101, Dave] 118
3. Record Retrieval

To find EmployeeID = E106:

Compute h("E106") = 106 % 3 = 1.

Traverse Bucket 1’s chain:

Check E103 → Not a match.

Check E106 → Found!

119
When to Use Hash Files?

1. Exact-Match Queries (No Range Scans)

Example: User login systems.


sql

SELECT * FROM Users WHERE UserID = 'U789'; -- O(1) lookup

Why Hash?

Faster than B-trees (O(1) vs. O(log n)).

No wasted overhead for sorting or tree balancing.

120
Advantages of Hash Files

• Very-fast lookups (O(1) average time).


• Efficient inserts/deletes (no rebalancing).
• Optimal for key-value workloads.

Disadvantages of Hash Files

• Useless for range queries (e.g., WHERE Salary > 5000).


• Hash collisions degrade performance (long chains → O(n) worst-case).
• Static sizing issues (poor hash function → uneven bucket distribution).

121
Hash vs. Other File Organizations
Operation Hash File B-Tree Sequential File
Exact-key search O(1) O(log n) O(log n)
Range query Not supported O(log n + k) O(log n + k)
Insert/Delete O(1) O(log n) O(n)
Storage overhead Low (buckets) High (tree nodes) Medium (gaps)

122
Index

• Indexing is a way to optimize the performance of a database by minimizing the


number of disk accesses required when a query is processed.

• It is a data structure technique which is used to quickly locate and access the data
in a database.
Index (Cont..)

• Indexes are created using a few database columns.

• The first column is the Search key that contains a copy of the primary key or
candidate key of the table. These values are stored in sorted order so that the
corresponding data can be accessed quickly.
Note: The data may or may not be stored in sorted order.

• The second column is the Data Reference or Pointer which contains a set of
pointers holding the address of the disk block where that particular key value can be
found.
Index (Cont..)

• Think of an index in a book – you can look up a word in the index and it gives
you a page number or numbers to look at. In terms of file structures, an index
is similar – it is a list of key value and addresses for records.

• More formally:
• an index is a structured collection of key value & address pairs. The purpose of an index
is to facilitate access to a collection of records.
Index (Cont..)

Structure of an Index in Database

Search Key Data Reference

126
Types of indexes:
• Primary Index

• Secondary Index

• Clustered Index

• B-Tree Index

• B+ Tree Index

127
Primary Index in DBMS
• A primary index is built on the primary key or a unique, ordered field of the file.

• The file is physically sorted based on this key.

• Primary Index is an ordered file which is fixed length size with two fields. The
first field is the same a primary key and second, field is pointed to that specific
data block.

• In the primary Index, there is always one to one relationship between the
index entry and data record.
• The data is sorted according to the search key and the primary key of the
database table is used to create the index. 128
Sample Data File

Assume the data file is stored on disk in blocks, with the following records (sorted by Student_ID):

Block Student_ID Name Department


Block 1 101 Alice CS
102 Bob EE
103 Charlie CS
Block 2 104 David ME
105 Emma CS Primary Index
106 Frank EE
Block 3 107 Grace ME
108 Hannah CS The primary index contains entries for the first record
109 Ian EE of each block, along with a pointer to the block's
location on disk. The index looks like this:
Index Entry Student_ID Pointer to Block
1 101 Block 1
2 104 Block 2
3 107 Block 129
• Indexing in DBMS is also further divided into two types.

• Dense Index

• Sparse Index

130
Dense Index
• In a dense index, a record is created for every search key valued in the database.

• the number of records in the index table is similar to the number of records in the main table.

• This helps you to search faster but needs more space to store index records.

• In this Indexing method, records contain search key value and points to the real record on the disk.

131
Dense Index Files

• E.g. index on ID attribute of instructor relation


Sparse Index

• The index record appears only for a few items in the data file. Each item points to a block as
shown.

• To locate a record, we find the index record with the largest search key value less than or
equal to the search key value we are looking for.

• We start at that record pointed to by the index record, and proceed along with the pointers in
the file (that is, sequentially) until we find the desired record.

Sparse Index: The index only stores the first Student_ID of each block (101, 104, 107) rather than every Student_ID,
reducing the index size.
Pointer: The pointer indicates the starting address of the block on disk.
Sparse Index(cont..)
Sparse Index(cont..)

• To locate a record with search-key value K we:


• Find index record with largest search-key value < K
• Search file sequentially starting at the record to which the index record points
Sparse Index(cont..)

• Compared to dense indices:

• Less space and less maintenance overhead for insertions and deletions.

• Generally slower than dense index for locating records.


Secondary index

• A secondary index is created on a non-key or non-ordered field to facilitate faster


searches on attributes other than the primary key.
Characteristics:
• Dense index: One index entry per record.
• Data file is not sorted by the indexed field.
• Can be created on non-unique fields, leading to multiple records per index entry.
• Use Case: Useful for queries on non-primary attributes (e.g., searching students by
name instead of Student_ID).
• Example: An index on a "City" field in an employee database, allowing quick lookup
of employees by city. 137
Data file :Unsorted
Index file :Sorted

138
Clustered Indexing

• When more than two records are stored in the same file these types of storing known
as cluster indexing.

• By using the cluster indexing we can reduce the cost of searching the reason being
multiple records related to the same thing are stored at one place.
Clustered Index

• A clustered index is an index-based file organization method in a database management


system (DBMS) where the data file is physically sorted based on the indexed field, and the
index determines the physical order of the records.
• Unlike a primary index, which is specifically built on the primary key, a clustered index can
be created on any field (key or non-key), but there can only be one clustered index per table
because the data can only be physically sorted in one order.
• The clustered index is typically a sparse index, meaning it contains one entry per data block
rather than per record, making it efficient for storage and retrieval.

140
Working of a Clustered Index

• Physical Ordering: The data records in the table are physically sorted based on the
clustered index key (the field chosen for the index). This ensures that records with
similar key values are stored together on disk, typically grouped into blocks.
• Index Structure: The clustered index is a sparse index, containing pairs of (key,
pointer) values, where each entry points to the starting record of a data block in the
sorted file. The key in the index is typically the first key value in each block.

141
Sample Data File

Assume the data file is stored on disk in blocks, with records sorted by
Department:
Block Employee_ID Department Name
Block 1 101 CS Alice Clustered Index
104 CS David
108 CS Hannah The clustered index is sparse, containing entries for the
Block 2 102 EE Bob first Department value in each block, along with a
106 EE Frank pointer to the block’s location on disk:
109 EE Ian Index Entry Department Pointer to Block
Block 3 103 ME Charlie 1 CS Block 1
105 ME Emma 2 EE Block 2
3 ME Block 3

• The data is physically sorted by Department (CS, EE, ME).


• Each block contains 3 records, and the clustered index is built on the Department
field.

142
Clustered Index Cont’d

• Sparse Index: The index only stores the first Department value of each block
(CS, EE, ME) rather than every record, reducing storage overhead.
• Pointer: Each pointer indicates the starting address of the block on disk.
• Non-Unique Key: Since Department is non-unique, multiple records may share
the same Department value (e.g., all records in Block 1 have Department = CS).

143
Step 4: Design User View

• Design the user views that were identified during the requirement collection and
analysis stage of the database application lifecycle
Step 5:Design Security Mechanisms

• Design the security measures for the database as specified during the requirement
collection and analysis stage of the database application lifecycle

144

You might also like