Understanding Database Anomalies
Understanding Database Anomalies
1
Outline:
2
Recall: Database design lifecycle
• Requirements analysis
• User needs; what must database do?
• Conceptual design
• High-level description; often using E/R model
• Logical design
• Translate E/R model into relational schema
• Schema refinement
• Check schema for redundancies and anomalies
• Physical design/tuning
• Consider typical workloads, and further optimise
3
• Why are some designs bad?
• What’s a functional dependency?
• What’s the theory of functional dependencies?
• How can we use this theory to classify redundancy in relation design
4
Not all designs are equally good
• Why is this design bad?
Data(sid,sname,address,cid,cname,grade)
Student(sid,sname,address)
Course(cid,cname)
Enrolled(sid,cid,grade)
5
An instance of our bad design
6
Redundancy
• Redundancy is the root of many problems associated with relational schemas
• Redundant storage
• Update anomalies
• Insertion anomalies
• Deletion anomalies
• In general, with higher redundancy, if transactions are correct (no anomalies), then they
have to lock more objects thus causing greater contention and lower throughput
7
Anomalies
8
Why are anomalies Bad?
• Data Integrity Issues: It becomes hard to enforce rules that ensure the data is
accurate and reliable.
• Wasted Storage: Storing the same data multiple times consumes unnecessary
space.
9
Types of Anomalies
The three main types of database anomalies are:
1. Insertion anomaly
2. Deletion anomaly
3. Update anomaly
10
Example: Employee_Department_Info(A poorly designed table)
Employee_Department_Info a single table designed to store information about employees, their departments,
and the department's manager (Head of Department - HOD).
• Notice the redundancy: 'HR' and 'Mr. Smith' are repeated for every HR employee; 'IT' and
'Ms. Jones' are repeated for every IT employee. This redundancy is the source of the
anomalies.
11
Insertion Anomaly
• An insertion anomaly occurs when you cannot add certain information to the database without
adding some other, unrelated information first. It often prevents you from recording details
about a new entity until it's related to another existing entity in the table.
Example using Employee_Department_Info:
• Suppose the company wants to create a new department, 'Finance' (DeptID D03), with 'Mr. Brown' as
the HOD. However, no employees have been assigned to this department yet.
• Problem: Using the table structure above, how do we add the 'Finance' department? We cannot add a
row for DeptID 'D03', DeptName 'Finance', and DeptHOD 'Mr. Brown' without also providing an
EmpID and EmpName. The table structure forces us to associate a department with an employee. We
either have to leave EmpID/EmpName blank (which might violate database constraints if EmpID is a
primary key or required field) or create a dummy employee just to record the department's existence.
• Consequence: Difficulty in adding new entities (like a department) if they don't yet have associated
12
entities (like employees) required by the table structure.
Deletion Anomaly
• A deletion anomaly occurs when deleting a record (row) unintentionally removes other
essential information that was stored only in that record. You lose facts about one entity when
you delete information about another entity.
Example using Employee_Department_Info:
• Suppose employee ‘Mohammed' (EmpID 106) is the last remaining employee in the ‘Marketing'
department If Mohammed leaves the company and we delete his record (the row with EmpID 106) from
the table:
• Problem: We not only lose the information that Mohammed worked for the company, but we also lose
the fact that there is an ‘Marketing' department (D03) and that its HOD is 'Ms. Kedir'. This information
existed only in the rows associated with Marketing employees.
• Consequence: Accidental loss of valuable data about one entity (the department) when removing data
about another entity (the last employee in that department).
13
Update(or Modification) Anomaly
• A update anomaly occurs when updating a single piece of information requires updating
multiple rows. If the update is not performed consistently across all relevant rows, it leads to
data inconsistency.
Example using Employee_Department_Info:
• Suppose the HOD of the 'HR' department changes from 'Mr. Smith' to 'Ms. Davis'.
• Problem: To reflect this change accurately, we must find every single row where DeptID is 'D01' (the
HR department) and change 'Mr. Smith' to 'Ms. Davis'. In our example table, this means updating the
rows for Alice (101), Bob (102), and Eve (105). If we miss even one row, or if there's a typo in one
update, the database will become inconsistent, showing conflicting HODs for the same department.
• Consequence: Data inconsistency due to partial updates, increased effort and potential for errors during
data modification.
14
How to Avoid Anomalies: Normalization (to be seen next)
15
Functional dependencies(FDs)
• Basic tool for analyzing relational schemas
• Are used to specify formal measures of the "goodness" ofrelational designs
• Functional dependencies are relationships between attributes in a table.
• A functional dependency exists when the value of one attribute uniquely determines the value of another attribute.
• In DB, we often have the case where one field defines the other. The value of one attribute in a table is determined
entirely by the value of another.
Eg1: Social Security Number (SSN) defines a name.
• If I know someone's SSN, then I can find their name.W will say that we have defined name as being functionally
dependent on SSN.
16
How to Denote a Functional Dependency?
• A functional dependency is denoted by an arrow “→”. The functional dependency of A on B is
represented by A → B. It means each value of A is associated with exactly one value of B
• Consider a relation with four attributes A, B, C and D,
R (ABCD)
• A → BCD
• B → CD
• For the first functional dependency A → BCD, attributes B, C and D are functionally
dependent on attribute A.
• Function dependency B → CD has two attributes C and D functionally depending upon
attribute B.
• In the functional dependency A→B, A is the determinant, and B is functionally dependent
on A
17
How to Denote a Functional Dependency?
• Sometimes everything on the left side of functional dependency is also referred to as
determinant set, while everything on the right side is referred to as depending attributes
R(ABCDE)
1) A → BCD 2) B → CDE
18
Consider the table below:
sid sname address cid cname grade
124 Abebe Addis Ababa 206 Database A++
204 Ayele Bahirdar 202 OOP C
125 Hagos Jima 201 S/Eng A+
206 Aster Hawas 206 Database B-
124 Abebe Addis Ababa 202 OOP B+
19
Functional dependencies cont.
20
Functional dependencies cont.
21
Example:
EmpNo → Name. This means that every
EmpNo Job Name time we find 104, we find the name, Fred.
101 President Herbert • Just because something is on the left-hand
104 Programmer Fred
103 Designer Beryl side of a FD, it does not imply that you
103 Programmer Beryl have a key or that it will be unique in the
database.
• i.e the FD X → Y only means that for
every occurrence of X you will get the
same value of Y.
22
Figure 14.1 Simplified version of the COMPANY relational database schema.
23
Examples of Functional Dependencies
24
Types of Functional Dependencies in DBMS
25
1. Trivial functional dependency
Example:
{ Employee_Id, Name } → { Name } is a Trivial functional dependency, since the
dependent Name is the subset of determinant { Employee_Id, Name }.
26
2. Non-Trivial functional dependency
Example:
{ Employee_Id, } → { Name } is a Non-Trivial functional dependency.
27
3. Multivalued functional dependency
• In Multivalued functional dependency, attributes in the dependent set are not dependent
on each other.
• For example, X → { Y, Z }, if there exists is no functional dependency between Y and Z,
then it is called as Multivalued functional dependency.
Example:
{ Employee_Id } → { Name, Age } is a Multivalued functional dependency, since the dependent attributes
Name, Age are not functionally dependent(i.e. Name → Age or Age → Name doesn’t exist !).
28
3. Transitive functional dependency
Here, { Employee_Id → Department } and { Department → Street Number } holds true. Hence, according to
the axiom of transitivity, { Employee_Id → Street Number } is a valid functional dependency.
29
Armstrong’s Axioms/Properties of Functional Dependency
• William W. Armstrong established a set of rules which can be used to Inference the
functional dependencies in a relational database:
30
Functional dependencies(FDs (Con’t)
31
Normalization
• Is a formal process for deciding which attributes should be grouped together in a relation.
• Normalization is a technique for producing a set of relations with desirable properties, given
the data requirements of an enterprise.
32
Normalization (Cont’d)
• The process of normalization is a formal method that identifies relations based on their
primary key.
• Primarily a tool to validate and improve a logical design so that it satisfies certain constraints
that avoid unnecessary duplication of data
• A process used to organize data in a relational database to eliminate redundancy, ensure data
integrity, and optimize query performance.
• It involves decomposing a larger, complex table into smaller, well-structured tables while
maintaining relationships between them.
• The process of decomposing relations with anomalies to produce smaller, well- structured
relations 33
Well-Structured Relations
• A relation that contains minimal data redundancy and allows users to insert, delete, and update rows
without causing data inconsistencies
• General rule of thumb: a table should not pertain to more than one entity type
Problems caused by Unnormalized forms
• Redundant storage
• Potential inconsistency (eg. Spelling errors)
• Anomalies
o Insertion anomaly
o Deletion anomaly
o Update anomaly
5
• Unnormalized Form (UNF)
• A table that contains one or more repeating groups.
• A repeating group is a field or group of fields that hold multiple values for a single
occurrence of a field.
36
• Unnormalized Form (UNF)
• A table that contains one or more repeating groups.
• A repeating group is a field or group of fields that hold multiple values for a single
occurrence of a field.
studid name courseid coursename dept dhead
Repeating group =
(courseid,coursename,dept
123 Abebe M Inst 223 Organization Inf. sc. Bekele ,dhead)
Inst 224 Systems Inf. Sc. Bekele
Geog 211 Geography Geography Woldu
Mtpa 123 organization management Tsega
234 Worku K. Inst 224 Systems Inf. Sc. Bekele
Inst 334 database Inf. Sc. Bekele
345 Abebe W. Inst 223 Organization Inf. Sc. Bekele
Geog 211 Geography Geography Woldu
Mtpa 123 Organization management Tsega
37
Unnormalized Form (UNF) (Cont’d)
Question – Is this a relation? Answer –Yes: unique rows and no multivalued Attributes
• Deletion – if we remove employee 140, we lose information about the existence of a Tax Acc
class
• Modification – giving a salary increase to employee 100 forces us to update multiple records
• Because we’ve combined two themes (entity types) into one relation. This results in
duplication, and an unnecessary dependency between the entities
• Normalization is guided by a series of rules called normal forms, each addressing specific
• If a table violates the rules of some normal form, then we decompose the table in to
tables that individually meet the requirements of normalization.
41
First Normal Form (1NF)
• A relation is in 1NF if it contains no multi-valued attributes and the intersections of each
row and column contains one and only one value.
• It was defined to disallow multivalued attributes, composite attributes, and their combinations.
• It states that the domain of an attribute must include only atomic (simple, indivisible) values
and that the value of any attribute in a tuple must be a single value from the domain of that
attribute.
• Hence, 1NF disallows having a set of values, a tuple of values, or a combination of both as an
attribute value for a single tuple. In other words, 1NF disallows "relations within relations" or
"relations as attribute values within tuples."
42
First Normal Form (1NF) (Cont’d)
• The only attribute values permitted by lNF are single atomic (or indivisible) values.
• Rows uniquely identified, add unique id, or add more columns to make unique
In Summary:
• A table is in 1NF if:
• All attributes (columns) contain only atomic (indivisible) values.
43
Example: Unnormalized Table (Not in 1NF)
44
Example: Unnormalized Table (Not in 1NF)
• This is not in 1NF because coursed,coursename, dept, dhead are not an atomic attributes.
45
Conversion to 1NF
• Proper primary key must be developed (Uniquely identifies attribute values or rows)
• Expand the key so that there will be a separate tuple in the original relation.
46
Examples:
1. The student table shown below is not in 1NF.
Table in 1NF is :
• Expand the table
• Set new Primary key
• This is not in 1NF because coursed,coursename, dept, dhead are not an atomic attributes.
48
1NF of the table in the previous slide:
• Expand the table
• Set new Primary key
Pk = {studid,Courseno}
49
Example 3: The car-owners Table show below is Not in 1NF)
Pk={plateNo}
50
Example 4: In the company database, the department table shown below is Not
in 1NF
51
Second Normal Form
Definitions:
• A prime attribute is an attribute that is part of any candidate key of a relation
• A non-prime attribute is an attribute that is not part of any candidate key in the relation.
Example:
{SSN, PNUMBER} -> ENAME is a partial dependency) since
SSN -> ENAME also holds.
Full functional dependency
A FD Y -> Z where removal of any attribute from Y means the FD does not hold any more
Examples:
• {SSN, PNUMBER} -> HOURS is a full FD since neither SSN -> HOURS nor PNUMBER ->
HOURS hold
• {SSN, PNUMBER} -> ENAME is not a full FD (it is called a partial dependency)
since SSN -> ENAME also holds
• Emp_course (empid, course_title, Date_completed)
Empid, Course_title → Date_Completed is a full FD
53
Second Normal Form
• R can be decomposed into 2NF relations via the process of 2NF normalization
Note:
• A relation that is in 1st normal form will be in 2NF if any one of the following
conditions applies:
1. The primary key consists of only one attribute.
2. No non key attributes exist in the relation. (That is, all of the attributes in the relation are
components of the Pk.
55
3. Every non key attribute is functionally dependent on the full set of primary key attributes.
Steps to Convert to 2NF:
• To achieve 2NF, we eliminate partial dependencies by splitting the table based on attributes
that depend on subsets of the composite primary key.
• Identify Partial Dependencies
• Create New Tables for Partial Dependencies:
• Update the Original Table
• Define Relationships
56
Example: - Student _ Relative
57
Student_relative(Student_id,Relative_id,R_P,Stud_tele)
60
Solution
PK={studid, courseno}
Studid ➔ name
courseno ➔ course_name,Dept,Dhead
E.g.
• Here, SSN -> Emp# -> Salary and Emp# is a candidate key.
64
Conversion to 3NF
• A relation that is in 1NF and 2NF and in which no non-primary-key
attribute is transitively dependent on the primary key.
65
• Consider the following table:
Now, PK : empid
We have functional dependencies:
Empid → depid
Depid → depname
Or Depid → depbudjet
Therefore, the above table is not is 3NF. To normalize it, we can use the functional dependencies:
Depid → depname
Depid → depbudjet
And
Empid → depid
66
So that the resulting tables are the following:
67
Example: SALES relation with simple data
CustID → Name
CustID → Salesperson
• While Third Normal Form (3NF) is generally sufficient for organizing relational
databases, it may not completely eliminate redundancy. Redundancy can still
occur if there’s a dependency X→Y where X is not a candidate key. This issue is
addressed by a stronger normal form known as Boyce-Codd Normal Form
(BCNF).
• Based on functional a dependency that takes into account all candidate keys in a relation.
• For a relation with only one candidate key, 3NF and BCNF are equivalent.
• Boyce-Codd Normal Form (BCNF) is a stricter version of Third Normal Form (3NF) that
ensures a more simplified and efficient database design. It enforces that every non-trivial
functional dependency must have a superkey on its left-hand side.
• This approach addresses potential issues with candidate keys and ensures the database is
free from redundancy.
• BCNF eliminates redundancy more effectively than 3NF by strictly requiring that all
functional dependencies originate from super-keys.
72
3NF to BCNF
• Identify all candidate keys in the relation.
73
Rules for BCNF
Note: To test whether a relation is in BCNF, we identify all the determinants and make
sure that they are candidate keys.
74
Example:
Assumptions:
Student Instructor Course
• One instructor teaches one course.
Ayele Dagne Database
• One course can be taught by many
Ayele Hagos C++ instructors
Kedir Dagne Database • One student can take many courses
Kedir Asnake C++
FD: {
(student, Instructor) -> Course
(student, Course) -> Instructor,
(Instructor) -> Course
}
75
• Candidate keys are (student, Instructor) and (student, course).
• The above relation is in 3NF (since there is no transitive dependency). A relation R is in
BCNF if for every non-trivial FD X->Y, X must be a key.
• The above relation is not in BCNF, because in the FD (teacher->subject), teacher is not a key.
• This issue occurs because the instructor is a determinant but not a candidate key.
• R is divided into two relations R1(Instructor,Course) and R2(Student, Instructor).
76
Example:
77
Other Normal Forms
• 4th NF
• No multivalued dependencies
• 5th NF
• No “lossless joins”
• Domain-key NF
• The “ultimate” NF…perfect elimination of all possible anomalies
Physical Database Design
79
✓ Introduction to Physical Design
✓ Steps in physical Database Design
• Designing Base Table
• Designing Constraints and Business Rules
• Selecting appropriate file organization based on the analysis of the transactions
• Selecting index to improve index
• Designing User View
• Designing Security Measures
✓ Introduction to File Organization
• Types of File Organization
✓ Introduction to Indexes
80
Physical Design
• The aim of physical database design is to decide how the logical database
design will be implemented.
81
Physical Design(Cont..)
• Identifying the specific storage structures and access methods to retrieve data efficiently.
For example, adding a secondary index to a relation.
• Designing security features for the database system including account creation, privilege
granting/revocation, access protection, and security level assignment.
82
Physical Design(Cont..)
83
Physical Design(Cont..)
• Sources of information for the physical design process includes global logical data model
and documentation that describes model.
• Logical database design is concerned with the what, physical database design is concerned
with the how.
Physical Design(Cont..)
85
• Design physical representations
• Analyze transactions
• To understand the functionality of the transactions that will run on the database
and to analyze the important transactions
• Design security mechanisms
• Design user views
• to design the user views that were identified in the conceptual database design
methodology
• Design access rules
• to design the access rules to the base relations and user views
• Choose file organizations
• to estimate the amount of disk space that will be required by the database
Physical Design(Cont..)
88
Step 1: Designing Base Table
89
Example:
For the Branch relation schema shown below, design its corresponding base table in a
relational data model using SQL Server DBMS.
Branch(branchNo,street,city,state,zipCode,mgrstaffNo)
• Pk: branchNo
90
Solution
91
Solution(Cont..)
Branch(
Branch_Numbers NOT NULL,
• Design any other business rules that have to be imposed on the data
Example:
• Preventing a member from renting more than 10 videos at any one time.
93
Designing Constraints (cont..)
• Default Value: A value a field will assume unless an explicit value is entered for that field
• Values Range control: Limits range of values that can be entered into field
• Referential Integrity: An integrity constraint specifying that the value (or existence) of an
attribute in one relation depends on the value (or existence) of the same attribute in another
relation
• Null Value: A special field value , distinct from zero, blank, or any other value , that
indicates that the value for the field is missing or otherwise unknown.
94
Step 3: Analyze users’ transactions to determine characteristics that may impact
performance
95
Introduction to File Organization
• The database on secondary storage is organized into one or more files, where each file consists
of one or more records and each record consists of one or more fields.
96
Consider the EMPLOYEE table shown below
• Each record in this table maps to a record in the operating system file that holds the
EMPLYEE table.
• The physical record is the unit of transfer between disk and primary storage and vice
versa.
97
• The terms “Block” and “Page” are generally used in place of physical record.
• A single record maybe stored in one or more pages.
File Organization
• A way of arranging the records in a file when the file is stored on disk.
• File organization refers to how records are physically stored and retrieved in a database.
• Sequential(Sorted)
• Hash Files
98
• Along with a file organization, there is a set of access methods.
Access Method:
100
Example of Heap File Storage
Suppose we have a Students table:
• New records are appended to the next available slot (e.g., if a new student 104, Dave, 23,
Biology is inserted, it goes into Block 2's empty space).
101
How Are Heap Files Accessed?
The DBMS reads every block in the file to find matching records.
102
Heap Files(Cont..)
• We must check all of the data in the heap file organization until we find the
necessary record.
103
Heap Files(Cont..)
• It’s a great way to organize your files for mass inclusion. This method is best suited when a
significant amount of data needs to be loaded into the database at once.
104
Heap Files(Cont..)
• For huge or complicated databases, this type of organization could not be used.
105
When to Use Heap Files?
• Temporary tables (no need for long-term optimization).
• Logging/staging data (insert-heavy, rarely queried).
• Small lookup tables (where scans are still fast).
106
Sequential File (Sorted file) Organization
• The new record is always added at the end of the file in this approach, and the sequence is
then sorted in descending or ascending order.
• If a record is modified, it will first update the record, then sort the file, and then store the
revised record in the correct location.
107
Storage Structure
• New records are inserted in the correct sorted position, which may require shifting
existing records.
108
Sequential File (Sorted file) Organization
109
Sequential File (Sorted file) Organization(Cont..)
110
Example: Employee Table (Sequential File)
Searching for E004 uses binary search (faster than heap scan).
111
Access Methods in Sequential Files
Binary Search
For equality searches (WHERE EmployeeID = E004), the DBMS jumps to the middle of
the file and narrows down the range.
112
Advantages of sequential file organization
113
Disadvantages of Sequential Files
• Slower inserts/deletes (Sorted file method takes more time and space for sorting the
records. Must shift records to maintain order).
• Not optimal for random writes (e.g., OLTP with frequent small inserts).
114
Sequential File vs. Heap File
Scenario: Employee Database
Operation Heap File Sequential File
Insert Fast (append) Slow (shift records)
Search Full scan(O(n) Binary search (O(log n))
Range Query Full scan Fast (sequential read)
115
Hash File Organization
• Hash file organization is a storage method that uses a hash function to directly map record
keys to physical storage locations. Unlike sequential or indexed files, it provides O(1)
average-case access time for exact-match queries.
• Hashing refers to the process of generating a fixed-size output from an input of variable size
using the mathematical formulas known as hash functions. This technique determines an
index or location for the storage of an item in a data structure.
• Hash File Organization uses the computation of the hash function on some fields of a record.
The output of the hash function defines the position of the disk block where the records will
be stored. 116
Hash File Organization(Cont..)
• When a record is requested using the hash key columns, an address is generated, and the
entire record is fetched using that address.
• When a new record needs to be inserted, the hash key is used to generate the address, and
the record is then directly placed. In the case of removing and updating, the same
procedure is followed.
117
How Hash Files Work
1. Hash Function Calculation
• a bucket is a fundamental storage unit that holds records mapped to the same hash value. Think of it as a
"slot" or "container" where the database stores one or more records that share the same hash key.
Example:
For EmployeeID as key and N=3 buckets:
•h("E101") = 101 % 3 = 2 → Store in Bucket 2
•h("E102") = 102 % 3 = 0 → Store in Bucket 0
Physical Layout Example:
119
When to Use Hash Files?
Why Hash?
120
Advantages of Hash Files
121
Hash vs. Other File Organizations
Operation Hash File B-Tree Sequential File
Exact-key search O(1) O(log n) O(log n)
Range query Not supported O(log n + k) O(log n + k)
Insert/Delete O(1) O(log n) O(n)
Storage overhead Low (buckets) High (tree nodes) Medium (gaps)
122
Index
• It is a data structure technique which is used to quickly locate and access the data
in a database.
Index (Cont..)
• The first column is the Search key that contains a copy of the primary key or
candidate key of the table. These values are stored in sorted order so that the
corresponding data can be accessed quickly.
Note: The data may or may not be stored in sorted order.
• The second column is the Data Reference or Pointer which contains a set of
pointers holding the address of the disk block where that particular key value can be
found.
Index (Cont..)
• Think of an index in a book – you can look up a word in the index and it gives
you a page number or numbers to look at. In terms of file structures, an index
is similar – it is a list of key value and addresses for records.
• More formally:
• an index is a structured collection of key value & address pairs. The purpose of an index
is to facilitate access to a collection of records.
Index (Cont..)
126
Types of indexes:
• Primary Index
• Secondary Index
• Clustered Index
• B-Tree Index
• B+ Tree Index
127
Primary Index in DBMS
• A primary index is built on the primary key or a unique, ordered field of the file.
• Primary Index is an ordered file which is fixed length size with two fields. The
first field is the same a primary key and second, field is pointed to that specific
data block.
• In the primary Index, there is always one to one relationship between the
index entry and data record.
• The data is sorted according to the search key and the primary key of the
database table is used to create the index. 128
Sample Data File
Assume the data file is stored on disk in blocks, with the following records (sorted by Student_ID):
• Dense Index
• Sparse Index
130
Dense Index
• In a dense index, a record is created for every search key valued in the database.
• the number of records in the index table is similar to the number of records in the main table.
• This helps you to search faster but needs more space to store index records.
• In this Indexing method, records contain search key value and points to the real record on the disk.
131
Dense Index Files
• The index record appears only for a few items in the data file. Each item points to a block as
shown.
• To locate a record, we find the index record with the largest search key value less than or
equal to the search key value we are looking for.
• We start at that record pointed to by the index record, and proceed along with the pointers in
the file (that is, sequentially) until we find the desired record.
Sparse Index: The index only stores the first Student_ID of each block (101, 104, 107) rather than every Student_ID,
reducing the index size.
Pointer: The pointer indicates the starting address of the block on disk.
Sparse Index(cont..)
Sparse Index(cont..)
• Less space and less maintenance overhead for insertions and deletions.
138
Clustered Indexing
• When more than two records are stored in the same file these types of storing known
as cluster indexing.
• By using the cluster indexing we can reduce the cost of searching the reason being
multiple records related to the same thing are stored at one place.
Clustered Index
140
Working of a Clustered Index
• Physical Ordering: The data records in the table are physically sorted based on the
clustered index key (the field chosen for the index). This ensures that records with
similar key values are stored together on disk, typically grouped into blocks.
• Index Structure: The clustered index is a sparse index, containing pairs of (key,
pointer) values, where each entry points to the starting record of a data block in the
sorted file. The key in the index is typically the first key value in each block.
141
Sample Data File
Assume the data file is stored on disk in blocks, with records sorted by
Department:
Block Employee_ID Department Name
Block 1 101 CS Alice Clustered Index
104 CS David
108 CS Hannah The clustered index is sparse, containing entries for the
Block 2 102 EE Bob first Department value in each block, along with a
106 EE Frank pointer to the block’s location on disk:
109 EE Ian Index Entry Department Pointer to Block
Block 3 103 ME Charlie 1 CS Block 1
105 ME Emma 2 EE Block 2
3 ME Block 3
142
Clustered Index Cont’d
• Sparse Index: The index only stores the first Department value of each block
(CS, EE, ME) rather than every record, reducing storage overhead.
• Pointer: Each pointer indicates the starting address of the block on disk.
• Non-Unique Key: Since Department is non-unique, multiple records may share
the same Department value (e.g., all records in Block 1 have Department = CS).
143
Step 4: Design User View
• Design the user views that were identified during the requirement collection and
analysis stage of the database application lifecycle
Step 5:Design Security Mechanisms
• Design the security measures for the database as specified during the requirement
collection and analysis stage of the database application lifecycle
144