1.
Database Design and SQL Query Processing
1.1 Introduction to Database, Data Models and Architecture of DBMS
(Views of data: Schemas and Instances, Data Independence)
1.2 Data Modelling using ER Diagram: Representation of Entities,
Attributes, Relationships and their Types, Cardinality, Generalization,
Specialization, Aggregation.
1.3 Relational Data Model: Structure of Relational Database Model,
Referential Integrity Constraints & its types, Codd’s rules
1.4 Database Design using E-R, E-R to Relational Tables Conversion,
Database design using Normalization – Normal forms - 1NF, 2NF, 3NF -
Case Studies
1.5 Introduction to SQL Query Processing (DDL, DML, Aggregate
Functions and Joins
HISTORY OF DATABASE SYSTEM
Manual System - 1950s:
Data was stored as paper records
Huge man power involved
Unnecessary time was wasted like when searching for a particular record
1950s and early 1960s:
Data processing using magnetic tapes for storage.
Tapes provided only sequential access.
Punched cards for input.
1960s and 1970s:
Hard disks allowed direct access to data.
1968: Data stored in files known as File Based System.
1968-1980 non-relational database
Hierarchical model: IBM's first hierarchical DBMS called IMS (Information
Management System).
Network model: Charles Bachmann developed first DBMS at Honeywell,
Integrated Data Store (IDS) then a group called CODASYL.
- CODASYL DBTG made Network model(CODASYL Conference on Data
System Languages, DBTG: Data Base Task Group)
- IDMS (Information Data Management System) most popular network
DBMS.
1980-present : Relational Database
Ted Codd defines the Relational Data Model.
P. Chenn proposd Entity-Relational Model for database design.
1980s:
Research relational prototypes evolve into commercial systems.
SQL becomes industrial standard
Parallel and distributed database systems
Object-oriented database systems.
1990s:
Large decision support and data-mining applications
Large multi-terabyte data warehouses
Emergence of Web commerce
Early 2000s:
Google, Amazon, Yahoo
File system has several disadvantages and the following problems are
associated with file system:
1. Data redundancy and inconsistency
The major problem with file processing system is that it maintains several
versions of same file i.e.; duplication of data is possible at multiple
places(file)
2. Difficulty in accessing the data
Consider the airline reservation system. If the senior management of
company wants to access the information of all its customers who are living
in the same postal code, it has to be done manually because current file
processing system does not allow the user to obtain this information.
So in the above case, there are two options. Either the application
programmer has to write a new application program to satisfy the unusual
request or could get this information manually.
In former case, it doesn’t guarantee that the same query will be asked and
same application program would be used in future.
If a query changes, a new application program should be written to get the
needed information.
3. Security problems
Every user in this system should be able to access the data which he is allowed
to access and not all the data.
For example, the salesperson in an organization should be allowed to access
the data related to him and should not be allowed to access data which is used
HR team or finance department in an organization.
If the new constraints are added to avoid this kind of unauthorized access,
these constraints is difficult because the existing application programs are
added to the system in an adhoc manner.
4. Integrity problems(File Name)
Data stored in the database should be allowed to satisfy certain constraint
checking.
For eg, before adding a new employee in the Employee table, if we check the
age of the employee and if we apply constraint such that only those employee
whose age is greater than 18 years should be allowed to enter in the table which
means that before the new data is inserted the age of the employee should be
calculated.
When a new constraint such as one which is discussed above is added, it
becomes difficult to change the existing programs to enforce the new
constraints.
5. Atomicity problems
Every application system is assumed to fail at some point in near future.
In many applications, if the system fails, the data should be rolled back
to the state before the failure occurs.
Consider the customer is withdrawing some cash from the ATM
machine from his own account and if the failure happens in the system,
it should not happen that the amount is deducted from customer
account but the customer is not getting any cash from the machine
In simple word the withdrawn should be atomic- it must be happen in its
entirely or not at all.
Another disadvantage with file processing system is that it becomes
difficult to ensure atomicity.
Difference between File System and DBMS
File System DBMS
The file system is a way of
DBMS is software for managing
arranging the files in a storage
medium within a computer. the database
In DBMS there is no redundant
data can be present in
Redundant(same data is present data.
in multiple places) It provides in house tools for
It doesn’t provide Inbuilt backup and recovery of data
mechanism for backup and even if it is lost.
recovery of data if it is lost. a file Efficient query processing is
system. there in DBMS.
There is no efficient query
processing in the file system.
There is less data consistency in There is more data consistency
the file system. because of the process of
It is less complex as compared normalization.
to DBMS. It has more complexity in
File systems provide less handling as compared to the file
security in comparison to DBMS. system.
It is less expensive than DBMS. DBMS has more security
mechanisms as compared to file
Only one user can access data
systems.
at a time.
It has a comparatively higher
Data is distributed in many files.
cost than file system
So, it is not easy to share data
Multiple users can access data
at a time.
Concept:
Definition
A Database Management System (DBMS) is a software system that is
designed to manage and organize data in a structured manner. It allows
users to create, modify, and query a database, as well as manage the
security and access controls for that database.
Data Modelling
A Data Model in Database Management System (DBMS) is the concept
of tools that are developed to summarize the description of the
database.
Data Models provide us with a transparent picture of data which helps
us in creating an actual database. It shows us from the design of the
data to its proper implementation of data.
DBMS Architecture
A Database stores a lot of critical information to access data quickly and
securely.
Hence it is important to select the correct architecture for efficient data
management.
DBMS Architecture helps users to get their requests done while
connecting to the database.
We choose database architecture depending on several factors like the
size of the database, number of users, and relationships between the
users.
There are two types of database models that we generally use, logical
model and physical model. Several types of architecture are there in the
database which we will deal with in the next section.
1-Tier Architecture
In 1-Tier Architecture the database is directly available to the user, the
user can directly sit on the DBMS and use it that is, the client, server,
and Database are all present on the same machine.
For Example: to learn SQL we set up an SQL server and the database on
the local system. This enables us to directly interact with the relational
database and execute operations. The industry won’t use this
architecture they logically go for 2-tier and 3-tier Architecture.
Advantages of 1-Tier Architecture
Simple Architecture: 1-Tier Architecture is the most simple
architecture to set up, as only a single machine is required to maintain
it.
Cost-Effective: No additional hardware is required for implementing 1-
Tier Architecture, which makes it cost-effective.
Easy to Implement: 1-Tier Architecture can be easily deployed, and
hence it is mostly used in small projects.
2-Tier Architecture
The 2-tier architecture is similar to a basic client-server model. The
application at the client end directly communicates with the database
on the server side.
APIs like ODBC(Open Database Connectivity) and JDBC(Java
Database Connectivity.) are used for this interaction. The server side
is responsible for providing query processing and transaction
management functionalities.
On the client side, the user interfaces and application programs are run.
The application on the client side establishes a connection with the
server side to communicate with the DBMS.
An advantage of this type is that maintenance and understanding are
easier, and compatible with existing systems. However, this model gives
poor performance when there are a large number of users.
2 Tier Architecture examples are:-
File-Server Architecture: A client accesses a shared file server for
data storage and retrieval.
Client-Server Architecture: A client communicates directly with a
database server for data storage and retrieval.
Terminal Services Architecture: A client communicates directly with
a terminal server, which then communicates with a database server for
data storage and retrieval.
Web-Based Architecture: A web browser communicates directly with
a web server, which communicates with a database server for data
storage and retrieval.
Advantages of 2-Tier Architecture
Easy to Access: 2-Tier Architecture makes easy access to the
database, which makes fast retrieval.
Scalable: We can scale the database easily, by adding clients or
upgrading hardware.
Low Cost: 2-Tier Architecture is cheaper than 3-Tier Architecture and
Multi-Tier Architecture.
Easy Deployment: 2-Tier Architecture is easier to deploy than 3-Tier
Architecture.
Simple: 2-Tier Architecture is easily understandable as well as simple
because of only two components.
3-Tier Architecture
In 1971, DBTG(DataBase Task Group) realized the requirement for a two-level
approach having views and schema.
in 1975, ANSI-SPARC realized the need for a three-level approach with the
three levels of abstraction external level , a conceptual level, and an internal
level.
The three schema architecture is also called ANSI/SPARC architecture or
three-level architecture.((American National Standards Institute,
Standards Planning And Requirements Committee)
This framework is used to describe the structure of a specific database
system.
The three schema architecture is also used to separate the user applications
and physical database.
The three schema architecture contains three-levels. It breaks the database
down into three different categories.
Objectives of Three schema Architecture
The main objective of three level architecture is to multiple users to
access the same data with a personalized view while storing the
underlying data only once. Thus it separates the user's view from the
physical structure of the database.
This separation is desirable for the following reasons:
Different users need different views of the same data.
The approach in which a particular user needs to see the data may
change over time.
All users should be able to access the same data according to their
requirements.
DBMS should be able to change the conceptual structure of the
database without affecting the user's
Internal structure of the database should be unaffected by changes to
physical aspects of the storage.
Presentation Tier: The presentation tier is the user interface or client
layer of the application. It is responsible for presenting data to the user
and receiving input from the user. This tier can be a web browser,
mobile app, or desktop application.
Application Tier: The application tier is the middle layer of the 3-tier
architecture. It is responsible for processing and managing the business
logic of the application. This tier communicates with the presentation
tier to receive user input and communicates with the data management
tier to retrieve or store data. This tier may include application servers,
web servers.
Data Management Tier: The data management tier is the bottom
layer of the 3-tier architecture. It is responsible for managing and storing
data. This tier can include databases, data warehouses, or data lakes.
The data management tier communicates with the application tier to
receive or store data.
1. Internal Level
The internal level has an internal schema which describes the physical
storage structure of the database.
The internal schema is also known as a physical schema.
It uses the physical data model. It is used to define that how the data
will be stored in a block.
The physical level is used to describe complex low-level data structures
in detail.
2. Conceptual Level
The conceptual schema describes the design of a database at the
conceptual level. Conceptual level is also known as logical level.
The conceptual schema describes the structure of the whole database.
The conceptual level describes what data are to be stored in the database
and also describes what relationship exists among those data.
In the conceptual level, internal details such as an implementation of the
data structure are hidden.
Programmers and database administrators work at this level.
3. External Level
At the external level, a database contains several schemas that
sometimes called as subschema. The subschema is used to describe the
different view of the database.
An external schema is also known as view schema.
Each view schema describes the database part that a particular user
group is interested and hides the remaining database from that user
group.
The view schema describes the end user interaction with database
systems.
Advantages of 3-Tier Architecture
Enhanced scalability: Scalability is enhanced due to the distributed
deployment of application servers. Now, individual connections need not
be made between the client and server.
Data Integrity: 3-Tier Architecture maintains Data Integrity. Since there
is a middle layer between the client and the server, data corruption can
be avoided/removed.
Security: 3-Tier Architecture Improves Security. This type of model
prevents direct interaction of the client with the server thereby reducing
access to unauthorized data.
Views of data- Schemas and instances
What is Schema?
The Skeleton of the database is created by the attributes and this
skeleton is named Schema.
Schema mentions the logical constraints like table, primary key, etc.
The schema does not represent the data type of the attributes.
Database Schema
A database schema is the skeleton structure that represents the logical
view of the entire database. It defines how the data is organized and
how the relations among them are associated.
A database schema is a logical representation of data that shows
how the data in a database should be stored logically. It shows how the
data is organized and the relationship between the tables.
Database schema contains table, field, views and relation between
different keys like primary key, foreign key.
Data are stored in the form of files which is unstructured in nature which
makes accessing the data difficult. Thus to resolve the issue the data
are organized in structured way with the help of database schema.
Database schema defines a set of guidelines that control the database
along with that it provides information about the way of accessing and
modifying the data.
Types of Database Schemas
Physical Database Schema
A Physical schema defines, how the data or information is stored
physically in the storage systems in the form of files & indices. This is
the actual code or syntax needed to create the structure of a database,
we can say that when we design a database at a physical level, it’s
called physical schema.
The Database administrator chooses where and how to store the data in
the different blocks of storage.
Logical Database Schema
A logical database schema defines all the logical constraints that need
to be applied to the stored data, and also describes tables, views, entity
relationships, and integrity constraints.
The Logical schema describes how the data is stored in the form of
tables & how the attributes of a table are connected.
View Database Schema
It is a view level design which is able to define the interaction between
end-user and database.
User is able to interact with the database with the help of the interface
without knowing much about the stored mechanism of data in database.
Creating Database Schema
For creating a schema, the statement “CREATE SCHEMA” is used in
every database. But different databases have different meanings for
this.
1. MySQL: In MySQL, we use the “CREATE SCHEMA” statement for
creating the database, because, in MySQL CREATE SCHEMA and CREATE
DATABASE, both statements are similar.
2. SQL Server: In SQL Server, we use the “CREATE SCHEMA” statement
for creating a new schema.
3. Oracle Database: In Oracle Database, we use “CREATE USER” for
creating a new schema, because in the Oracle database, a schema is
already created with each database user. The statement “CREATE
SCHEMA” does not create a schema, instead, it populates the schema
with tables & views and also allows one to access those objects without
needing multiple SQL statements for multiple transactions.
Database Schema Designs
Flat Model
A flat model schema is a 2-D array in which every column contains the
same type of data/information and the elements with rows are related to
each other. It is just like a table or a spreadsheet. This schema is better
for small applications that do not contain complex data.
Hierarchical Model
Data is arranged using parent-child relationships and a tree-like
structure in the Hierarchical Database Model. Because each record
consists of several children and one parent, it can be used to illustrate
one-to-many relationships in diagrams such as organizational charts.
Although obvious, it might not be as adaptable in complicated
partnerships.
Network Model
he network model and the hierarchical model are quite similar with an
important difference that is related to data relationships. The network
model allows many-to-many relationships whereas hierarchical models
allow one-to-many relationships.
Relational Model
The relational model is mainly used for relational databases, where the
data is stored as relations of the table. This relational model schema is
better for object-oriented programming.
Star Schema
Star schema is better for storing and analyzing large amounts of data. It
has a fact table at its center & multiple dimension tables connected to it
just like a star, where the fact table contains the numerical data that run
business processes and the dimension table contains data related to
dimensions such as product, time, people, etc. or we can say, this table
contains the description of the fact table. The star schema allows us to
structure the data of
Snowflake Schema
Just like star schema, the snowflake schema also has a fact table at its
center and multiple dimension tables connected to it, but the main
difference in both models is that in snowflake schema – dimension
tables are further normalized into multiple related tables. The snowflake
schema is used for analyzing large amounts of data.
Database Instance
The data stored in database at a particular moment of time is called
instance of database.
The situation where a data or information is stored in the database at a
particular moment of time is called an instance. An instance is also
called a current state or database state.
The database schema that defines variables in tables which belong to a
specific database, the records of these variables at a particular moment
are called the instance of the database.
Consider a table given below which has the Student (Schema)
Std ID Name City
100 Lucky Hyderabad
101 Pinky Delhi
102 Bob Hyderabad
In the above table, rows are called instances.
Finally, we can say that the content of database at a point of time is
called instance or database state
Difference Between Schema and Instance
Schema Instance
It is the overall description of the It is the collection of information
database. stored in a database at a
particular moment.
The schema is same for the
whole database Data in instances can be
changed using addition,
Does not change Frequently.
deletion, and updating.
Defines the basic structure of Changes Frequently.
the database i.e. how the data
will be stored in the database. It is the set of Information stored
at a particular time.
Data Independence/Data
Abstraction
Data Independence is mainly defined as a property of DBMS that
helps you to change the database schema at one level of a system
without requiring to change the schema at the next level. it helps to
keep the data separated from all program that makes use of it.
We have namely two levels of data independence arising from these
levels of abstraction:
Physical level data independence
Logical level data independence
Physical Level Data
Independence
It refers to the characteristic of being able to modify the physical
schema without any alterations to the conceptual or logical schema,
done for optimization purposes, e.g., the Conceptual structure of the
database would not be affected by any change in storage size of the
database system server. Changing from sequential to random access
files is one such example. These alterations or modifications to the
physical structure may include:
Utilizing new storage devices.
Modifying data structures used for storage.
Altering indexes or using alternative file organization techniques etc.
Logical Level Data Independence
It refers characteristic of being able to modify the logical schema
without affecting the external schema or application program. The user
view of the data would not be affected by any changes to the
conceptual view of the data. These changes may include insertion or
deletion of attributes, altering table structures entities or relationships
to the logical schema, etc.
Data Modelling using ER Diagram
ER model stands for an Entity-Relationship model. It is a high-level data
model. This model is used to define the data elements and relationship
for a specified system.
It develops a conceptual design for the database. It also develops a very
simple and easy to design view of data.
Suppose we design a school database. In this database, the student will
be an entity with attributes like address, name, id, age, etc. The address
can be another entity with attributes like city, street name, pin code, etc
and there will be a relationship between them.
Symbols Used in ER Model
Rectangles: Rectangles represent Entities in the ER Model.
Ellipses: Ellipses represent Attributes in the ER Model.
Diamond: Diamonds represent Relationships among Entities.
Lines: Lines represent attributes to entities and entity sets with other
relationship types.
Double Ellipse: Double Ellipses represent Multi-Valued Attributes.
Double Rectangle: Double Rectangle represents a Weak Entity.
Components of ER Diagram
Entity
An entity may be any object, class, person or place. In the ER diagram,
an entity can be represented as rectangles.
Consider an organization as an example- manager, product, employee,
department etc. can be taken as an entity.
A. Weak Entity
An entity that depends on another entity called a weak entity. The weak
entity doesn't contain any key attribute of its own. The weak entity is
represented by a double rectangle.
B. Strong Entity
A Strong Entity is a type of entity that has a key Attribute. Strong Entity
does not depend on other Entity in the Schema. It has a primary key,
that helps in identifying it uniquely, and it is represented by a rectangle.
These are called Strong Entity Types.
Difference between Strong and
Weak Entity:
While a weak entity has a partial
Strong entity always has a discriminator key.
primary key.
Weak entity depends on strong
Strong entity is not dependent entity.
on any other entity.
Weak entity is represented by a
Strong entity is represented by a double rectangle.
single rectangle.
While the relation between one
Two strong entity’s relationship strong and one weak entity is
is represented by a single represented by a double
diamond diamond.
Strong entities have either total While weak entity always has
participation or not. total participation.
Attributes
The attribute is used to describe the property of an entity.
For example, Roll_No, Name, DOB, Age, Address, and Mobile_No are the
attributes that define entity type Student. In ER diagram, the attribute is
represented by an oval.
1. Key Attribute
2. Composite Attribute
3. Multivalued Attribute
4. Derived Attribute
1. Key Attribute
The attribute which uniquely identifies each entity in the entity set is
called the key attribute. For example, Roll_No will be unique for each
student. In ER diagram, the key attribute is represented by an oval with
underlying lines.
2. Composite Attribute
An attribute composed of many other attributes is called a
composite attribute. For example, the Address attribute of the student
Entity type consists of Street, City, State, and Country. In ER diagram,
the composite attribute is represented by an oval comprising of ovals.
3. Multivalued Attribute
An attribute consisting of more than one value for a given entity. For
example, Phone_No (can be more than one for a given student). In ER
diagram, a multivalued attribute is represented by a double oval.
4. Derived Attribute
An attribute that can be derived from other attributes of the entity type
is known as a derived attribute. e.g.; Age (can be derived from DOB). In
ER diagram, the derived attribute is represented by a dashed oval.
The Complete Entity Type
Student with its Attributes can
be represented as:
Relationship and their type
A Relationship Type represents the association between entity types. For
example, ‘Enrolled in’ is a relationship type that exists between entity
type Student and Course. In ER diagram, the relationship type is
represented by a diamond and connecting the entities with lines.
Types of relationship are as
follows
Cardinality
cardinality represents the number of times an entity of an entity set
participates in a relationship set. Or we can say that the cardinality of a
relationship is the number of tuples (rows) in a relationship.
The number of times an entity of an entity set participates in a
relationship set is known as cardinality. Cardinality can be of different
types:
a. One-to-One Relationship
b. One-to-many relationship
c. Many-to-one relationship
d. Many-to-many relationship
a. One-to-One Relationship
When each entity in each entity set can take part only once in the
relationship, the cardinality is one-to-one. Let us assume that a male
can marry one female and a female can marry one male. So the
relationship will be one-to-one.
One-to-Many: In one-to-many mapping as well where each entity can
be related to more than one relationship and the total number of tables
that can be used in this is 2. Let us assume that one surgeon deparment
can accomodate many doctors. So the Cardinality will be 1 to M. It
means one deparment has many Doctors.
Many-to-One: When entities in one entity set can take part only once
in the relationship set and entities in other entity sets can take part
more than once in the relationship set, cardinality is many to one. Let us
assume that a student can take only one course but one course can be
taken by many students. So the cardinality will be n to 1. It means that
for one course there can be n students but for one student, there will be
only one course.
Many-to-Many: When entities in all entity sets can take part more than
once in the relationship cardinality is many to many. Let us assume that
a student can take more than one course and one course can be taken
by many students. So the relationship will be many to many.
Generalization(Bottom-up approach)
Generalization in a Database Management System is a bottom-up
approach where entities with common attributes are combined to form a
higher-level entity.
This process simplifies ER diagrams by merging lower-level entities, such
as cars and bikes, into a more generalized entity, enhancing the
management of growing data complexities.
Unlike the subclass-superclass system, generalization uniquely groups
subclasses into a superclass, thereby streamlining the ER model.
This approach simplifies entity relationships and evolves the traditional
ER model to efficiently handle increased data complexities.
Let's take the above example which has the entity as car and bike and
we need to generalize it based on the common entities between them
so before generalizing we will have two separate entities for cars and
bikes.
Entities Before Generalization
Car Entity: Car entity can have attributes
like car_ID, DateManufactured, Cost, Model, Manufacturer which will be
generalized further.
Bike Entity: Bike entity can have attributes
like bike_ID, DateManufactured, Cost, Model, Manufacturer which will be
generalized further.
ER Diagram after Generalization
We can see that for cars and bikes some attributes
like DateManufactured, Cost, Model, Manufacturer are the same so we
can generalize them into an entity called Vehicle.
The vehicle entity will be having these common attributes and the bike
and car entity will be related to them using "Is A" relationship table, So
ER diagram can be generalized by:
Specialization(top-down approach)
In specialization, an entity is divided into sub-entities based on its
characteristics. It is a top-down approach where the higher-level entity is
specialized into two or more lower-level entities.
Specialization is a top-down approach where a higher-level
entity is divided into two or more lower-level, more specific sub-
entities (e.g., an Employee entity is specialized into Hourly and
Salaried employees)
For Example, an EMPLOYEE entity in an Employee management system
can be specialized into DEVELOPER, TESTER, etc. as shown in Figure . In
this case, common attributes like E_NAME, E_SAL, etc. become part of a
higher entity (EMPLOYEE), and specialized attributes like TES_TYPE
become part of a specialized entity (TESTER).
Aggregation
Aggregation
An ER diagram is not capable of representing the relationship between
an entity and a relationship which may be required in some scenarios. In
those cases, a relationship with its corresponding entities is aggregated
into a higher-level entity. Aggregation is an abstraction through which
we can represent relationships as higher-level entity sets.
Aggregation is an abstraction in which a relationship itself is
treated as an entity.
It is used when we need to represent a relationship between a
relationship and another entity.
Aggregation is a technique to express relationship among
relationship.
Through E-R modeling we cannot Express relationship among
relationships. Thus, we use the concept of aggregation for this
purpose.
Aggregation is an abstraction through which relationships are
treated as Entities.
In aggregation, the relation between two Entities is treated as a
single entity In aggregation, relationship with its Corresponding
entities is aggregated into a higher Level Entity.
For Example, an Employee working on a project may require some
machinery. So, REQUIRE relationship is needed between the relationship
WORKS_FOR and entity MACHINERY. Using aggregation, WORKS_FOR
relationship with its entities EMPLOYEE and PROJECT is aggregated into a
single entity and relationship REQUIRE is created between the aggregated
entity and MACHINERY.
Difference between
Generalization and Specialization
:
Generalization works in Bottom- Specialization works in top-down
Up approach. approach.
In Generalization, size of In Specialization, size of schema
schema gets reduced. gets increased.
Generalization is normally We can apply Specialization to a
applied to group of entities. single entity.
Generalization can be defined as Specialization can be defined as
a process of creating groupings process of creating subgrouping
from various entity sets within an entity set
There is no inheritance in There is inheritance in
Generalization. Specialization.
Relational data model:
In 1970, E.F. Codd developed the relational model. He proposed this
model as well as a non-procedural approach for modeling data in the
form of relations or tables.
If we model the database using ER diagrams, we must convert them
into the relational model, which can be implemented by one of the
RDBMS languages such as SQL and MySQL.
The relational model represents how data is stored in Relational
Databases
Relational model can represent as a table with columns and rows. Each
row is known as a tuple. Each table of the column has a name or
attribute.
Consider a relation STUDENT with attributes ROLL_NO, NAME, ADDRESS,
PHONE, and AGE shown in the table.
Table Student
ROLL_NO NAME ADDRESS PHONE AGE
1 RAM DELHI 9455123451 18
2 RAMESH GURGAON 9652431543 18
3 SUJIT ROHTAK 9156253131 20
4 SURESH DELHI 18
Attribute: Attributes are the properties that define an entity.
e.g.; ROLL_NO, NAME, ADDRESS
Relation Schema: A relation schema defines the structure of the
relation and represents the name of the relation with its attributes. e.g.;
STUDENT (ROLL_NO, NAME, ADDRESS, PHONE, and AGE) is the relation
schema for STUDENT.
Tuple: Each row in the relation is known as a tuple. The above relation
contains 4 tuples, one of which is shown as:
1 RAM DELHI 9455123451 18
Differences between ER model and Reational
model
ER model Relational model
Developed by Peter Chen in 1976. Developed by E.F. Codd in 1970.
ER model is the high level or It is the representational or
conceptual model. implementation model.
It is used by people who don’t It is used by programmers.
know how database is
implemented. It represent data in the form of
tables and describes relationship
It represents collection of entities
between them.
and describes relationship
between them It consists of components like
it consists of components like domain, attributes, tuples.
Entity, Entity Type, Entity Set.
Relation Instance: The set of tuples of a relation at a particular
instance of time is called a relation instance. Table 1 shows the relation
instance of STUDENT at a particular time. It can change whenever there
is an insertion, deletion, or update in the database.
Degree: The number of attributes in the relation is known as the degree
of the relation. The STUDENT relation defined above has degree 5.
Cardinality: The number of tuples in a relation is known as cardinality.
The STUDENT relation defined above has cardinality 4.
Column: The column represents the set of values for a particular
attribute. The column ROLL_NO is extracted from the relation STUDENT.
ROLL_NO
4
NULL Values: The value which is not known or unavailable is called a
NULL value. It is represented by blank space. e.g.; PHONE of STUDENT
having ROLL_NO 4 is NULL.
Relation Key: These are basically the keys that are used to identify the
rows uniquely or also help in identifying tables. These are of the
following types.
Primary Key
Candidate Key
Super Key
Foreign Key
Alternate Key
Composite Key
Primary Key:
A primary key in a table that uniquely identifies each row and column or set of
columns in the table. The primary key is an attribute or a set of attributes that
help to uniquely identify the tuples(records) in the relational table.
Candidate Key
A candidate key is any attribute or combination of attributes that
uniquely identifies rows in the table and the attribute that forms the key
can not be further reduced.
Super Key
Super Key is an attribute (or set of attributes) that is used to uniquely
identifies all attributes in a relation.
Foreign Key
A FOREIGN KEY is a field (or collection of fields) in one table, that
refers to the PRIMARY KEY in another table.
The table with the foreign key is called the child table, and the
table with the primary key is called the referenced or parent table.
Alternate Key:
Example 1:
Consider an EMPLOYEE_DETAIL table where we have the following
attributes:
Emp_SSN: The SSN number of employees is stored in this field.
Emp_Id: An attribute that stores the value of the employee
identification number.
Emp_name: An attribute that stores the name of the employee holding
the specified employee id.
Emp_email: An attribute that stores the email id of the specified
employees.
The EMPLOYEE_DETAIL table is given below that will help you
understand better:
So from the above table, there are the following Candidate keys:
These are the candidate keys we concluded from the above attributes.
Now, we have to choose one primary key, which is the most appropriate
out of the three, and it is Emp_Id. So, the primary key is Emp_Id. Now,
the remaining two candidate keys are Emp_SSN and Emp_email.
Therefore, Emp_SSN and Emp_Email are the alternate keys.
Composite Key
A composite key is made by the combination of two or more columns in
a table that can be used to uniquely identify each row in the table when
the columns are combined uniqueness of a row is guaranteed, but when
it is taken individually it does not guarantee uniqueness, or it can also
be understood as a primary key made by the combination of two or
more attributes to uniquely identify every row in a table.
A primary key having two or more attributes is called composite key. It is
a combination of two or more columns.
Let us see another example −
StudentI StudentE Student Student
D nrollNo Marks Percenta
ge
S001 0721722 570 90
S002 0721790 490 80
S003 0721766 440 86
Above, our composite keys are StudentID and StudentEnrollNo. The
table has two attributes as primary key.
Therefore, the Primary Key consisting of two or more attribute is called
Composite Key.
Integrity Constraints
Integrity constraints are a set of rules. It is used to maintain the quality
of information.
Integrity constraints ensure that the data insertion, updating, and other
processes have to be performed in such a way that data integrity is not
affected.
Thus, integrity constraint is used to guard against accidental damage to
the database.
1. Domain constraints
Domain constraints can be defined as the definition of a valid set of
values for an attribute.
The data type of domain includes string, character, integer, time, date,
currency, etc. The value of the attribute must be available in the
corresponding domain.
2. Entity integrity constraints
The entity integrity constraint states that primary key value can't be
null.
This is because the primary key value is used to identify individual rows
in relation and if the primary key has a null value, then we can't identify
those rows.
A table can contain a null value other than the primary key field.
3. Referential Integrity Constraints
A referential integrity constraint is specified between two tables.
In the Referential integrity constraints, if a foreign key in Table 1 refers
to the Primary Key of Table 2, then every value of the Foreign Key in
Table 1 must be null or be available in Table 2.
4. Key constraints
Keys are the entity set that is used to identify an entity within its entity
set uniquely.
An entity set can have multiple keys, but out of which one key will be
the primary key. A primary key can contain a unique and null value in
the relational table.(CK)
Codd's Rules
Every database has tables, and constraints cannot be referred to as a
rational database system. And if any database has only relational data
model, it cannot be a Relational Database System (RDBMS). So,
some rules define a database to be the correct RDBMS. These rules were
developed by Dr. Edgar F. Codd (E.F. Codd) in 1985, who has vast
research knowledge on the Relational Model of database Systems. Codd
presents his 13 rules for a database to test the concept of DBMS against
his relational model, and if a database follows the rule, it is called a true
relational database (RDBMS). These 13 rules are popular in RDBMS,
known as Codd's 12 rules.
Rule 1: The Information Rule
All information, whether it is user information or metadata, that is stored
in a database must be entered as a value in a cell of a table. It is said
that everything within the database is organized in a table layout.
(TABLE)
Rule 2: The Guaranteed Access Rule
Each data element is guaranteed to be accessible logically with a
combination of the table name, primary key (row value), and attribute
name (column value).
Rule 3: Systematic Treatment of NULL Values
Every Null value in a database must be given a systematic and uniform
treatment. (enter dummy data)
Rule 4: Active Online Catalog Rule
The database catalog, which contains metadata about the database,
must be stored and accessed using the same relational database
management system.
Rule 5: The Comprehensive Data Sublanguage Rule
A crucial component of any efficient database system is its ability to
offer an easily understandable data manipulation language (DML) that
facilitates defining, querying, and modifying information within the
database.(data definition, view definition, data manipulation, integrity
constraints, authorization, transaction boundaries(begin commit and
rollback)
Rule 6: The View Updating Rule
All views that are theoretically updatable must also be updatable by the
system.
Rule 7: High-level Insert, Update, and Delete(DML)
A successful database system must possess the feature of facilitating
high-level insertions, updates, and deletions that can grant users the
ability to conduct these operations with ease through a single query.
Rule 8: Physical Data Independence
Application programs and activities should remain unaffected when
changes are made to the physical storage structures or methods.
Rule 9: Logical Data Independence
Application programs and activities should remain unaffected when
changes are made to the logical structure of the data, such as adding or
modifying tables
Rule 10: Integrity Independence
Integrity constraints should be specified separately from application
programs and stored in the catalog. They should be automatically
enforced by the database system.(Backend)
Rule 11: Distribution Independence(centralized data)
The distribution of data across multiple locations should be invisible to
users, and the database system should handle the distribution
transparently.
Rule 12: Non-Subversion Rule
If the interface of the system is providing access to low-level records,
then the interface must not be able to damage the system and bypass
security and integrity constraints.
Normalization
Normalization is the process of organizing the data and the attributes of
a database.
It is performed to reduce the data redundancy in a database and to
make sure that data is stored logically.
Data redundancy means having the same data but at different
places. It is necessary to remove data redundancy because it causes
anomalies in a database which makes it very hard for a database
administrator to maintain it.
Normalization is used to minimize the redundancy from a relation or set
of relations. It is also used to eliminate undesirable characteristics like
Insertion, Update, and Deletion Anomalies.
Normalization divides the larger table into smaller and links them using
relationships.
The normal form is used to reduce redundancy from the database table.
Advantages of Normalization
Normalization helps to minimize data redundancy.
Greater overall database organization.
Data consistency within the database.
Much more flexible database design.
Enforces the concept of relational integrity.
Disadvantages of Normalization
You cannot start building the database before knowing what the user
needs.
The performance degrades when normalizing the relations to higher
normal forms, i.e., 4NF, 5NF.
It is very time-consuming and difficult to normalize relations of a higher
degree.
First Normal Form (1NF)
A relation will be 1NF if it contains an atomic value.
It states that an attribute of a table cannot hold multiple values. It must
hold only single-valued attribute.
First normal form disallows the multi-valued attribute, composite attribute,
and their combinations.
Example: Relation EMPLOYEE is not in 1NF because of multi-valued attribute
EMP_PHONE.
EMPLOYEE table:
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385, UP
9064738238
20 Harry 8574783832 Bihar
12 Sam 7390372389, Punjab
8589830302
The decomposition of the EMPLOYEE table into 1NF has been shown
below:
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385 UP
14 John 9064738238 UP
20 Harry 8574783832 Bihar
12 Sam 7390372389 Punjab
12 Sam 8589830302 Punjab
Second Normal Form (2NF)
In the 2NF, relational must be in 1NF.
In the second normal form, all non-key attributes are fully functional
dependent on the primary key
Example: Let's assume, a school can store the data of teachers and the
subjects they teach. In a school, a teacher can teach more than one
subject.
TEACHER table
TEACHER_ID SUBJECT TEACHER_AGE
25 Chemistry 30
25 Biology 30
47 English 35
83 Math 38
83 Computer 38
In the given table, non-prime attribute TEACHER_AGE is dependent on
TEACHER_ID which is a proper subset of a candidate key. That's why it
violates the rule for 2NF.
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE
25 30
47 35
83 38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
3NF
A relation will be in 3NF if it is in 2NF and not contain any transitive
partial dependency.
3NF is used to reduce the data duplication. It is also used to achieve the
data integrity.
If there is no transitive dependency for non-prime attributes, then the
relation must be in third normal form.
Example:
EMPLOYEE_DETAIL table:
EMP_ID EMP_NA EMP_ZIP EMP_STA EMP_CIT
ME TE Y
222 Harry 201010 UP Noida
333 Stephan 02228 US Boston
444 Lan 60007 US Chicago
555 Katharine 06389 UK Norwich
666 John 462007 MP Bhopal
Super key in the table above:
{EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_ZIP}
Candidate key: {EMP_ID}
Non-prime attributes: In the given table, all attributes except EMP_ID
are non-prime.
Here, EMP_STATE & EMP_CITY dependent on EMP_ZIP and EMP_ZIP
dependent on EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY)
transitively dependent on super key(EMP_ID). It violates the rule of third
normal form.
That's why we need to move the EMP_CITY and EMP_STATE to the new
<EMPLOYEE_ZIP> table, with EMP_ZIP as a Primary key.
EMPLOYEE table:
EMP_ID EMP_NAME EMP_ZIP
222 Harry 201010
333 Stephan 02228
444 Lan 60007
555 Katharine 06389
666 John 462007
EMPLOYEE_ZIP table:
EMP_ZIP EMP_STATE EMP_CITY
201010 UP Noida
02228 US Boston
60007 US Chicago
06389 UK Norwich
462007 MP Bhopal
Boyce Codd normal form (BCNF)
BCNF is the advance version of 3NF. It is stricter than 3NF.
A table is in BCNF if every functional dependency X → Y, X is the
super key of the table.
For BCNF, the table should be in 3NF, and for every FD, LHS is
super key.
Example: Let's assume there is a company where employees
work in more than one department.
EMPLOYEE table:
EMP_ID EMP_COU EMP_DEPT DEPT_TYP EMP_DEPT
NTRY E _NO
264 India Designing D394 283
264 India Testing D394 300
364 UK Stores D283 232
364 UK Developin D283 549
g
In the above table Functional dependencies are as follows:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate key: {EMP-ID, EMP-DEPT}
The table is not in BCNF because neither EMP_DEPT nor EMP_ID alone
are keys.
To convert the given table into BCNF, we decompose it into three tables:
EMP_COUNTRY table:
EMP_ID EMP_COUNTRY
264 India
264 India
EMP_DEPT table:
EMP_DEPT DEPT_TYPE EMP_DEPT_NO
Designing D394 283
Testing D394 300
Stores D283 232
Developing D283 549
EMP_DEPT_MAPPING table:
EMP_ID EMP_DEPT
D394 283
D394 300
D283 232
D283 549
Functional dependencies:
EMP_ID → EMP_COUNTRY
EMP_DEPT → {DEPT_TYPE, EMP_DEPT_NO}
Candidate keys:
For the first table: EMP_ID
For the second table: EMP_DEPT
For the third table: {EMP_ID, EMP_DEPT}
Now, this is in BCNF because left side part of both the functional
dependencies is a key.
Draw ERD and Design normalized database design. The
company database keeps track of employee, department and
project. We store employee name, SSN, address, salary gender,
date of birth, age. An Employee assigned to one department, but
may work on several projects which are not necessary
controlled by the same department. A particular employee
manages the department. Each department has unique name,
unique number and several locations. The department controls
no. of projects each of which has a unique name and number.
We want to keep track of the dependents of each employee for
insurance purpose. We keep each dependent first name, sex,
birth date and relationship of employee. [