0% found this document useful (0 votes)
11 views92 pages

Understanding DBMS Architecture Types

The document provides an overview of Database Management System (DBMS) architecture, detailing the different types: 1-Tier, 2-Tier, and 3-Tier architectures, along with their advantages and disadvantages. It also discusses data abstraction levels, data independence, and the roles of Data Definition Language (DDL) and Data Manipulation Language (DML) in managing databases. Additionally, it highlights the significance of data models in summarizing database descriptions and facilitating the creation of actual databases.

Uploaded by

sleepyshaw7
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views92 pages

Understanding DBMS Architecture Types

The document provides an overview of Database Management System (DBMS) architecture, detailing the different types: 1-Tier, 2-Tier, and 3-Tier architectures, along with their advantages and disadvantages. It also discusses data abstraction levels, data independence, and the roles of Data Definition Language (DDL) and Data Manipulation Language (DML) in managing databases. Additionally, it highlights the significance of data models in summarizing database descriptions and facilitating the creation of actual databases.

Uploaded by

sleepyshaw7
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

DBMS

Unit-1
A Database stores a lot of critical information to access data quickly and securely. Hence it is
important to select the correct architecture for efficient data management. Database
Management System (DBMS) architecture is crucial for efficient data management and
system performance. It helps users to get their requests done while connecting to the
database. It focuses on how the database is designed, built and maintained, shaping how
users access and interact with it.
Types of DBMS Architecture
There are several types of DBMS Architecture that we use according to the usage
requirements. Types of DBMS Architecture are discussed here.
 1-Tier Architecture
 2-Tier Architecture
 3-Tier Architecture
1-Tier Architecture
In 1-Tier Architecture the database is directly available to the user, the user can directly sit
on the DBMS and use it that is, the client, server, and Database are all present on the same
machine. This setup is simple and is often used in personal or standalone applications where
the user interacts directly with the database.
For Example: A Microsoft Excel spreadsheet is a great example of one-tier architecture.
 Everything—the user interface, application logic and data is handled on a single
system.
 The user directly interacts with the application, performs operations like
calculations or data entry and stores data locally on the same machine.
This architecture is simple and works well for personal, standalone applications where no
external server or network connection is needed.
Advantages of 1-Tier Architecture
Below mentioned are the advantages of 1-Tier Architecture.
 Simple Architecture: 1-Tier Architecture is the most simple architecture to set up,
as only a single machine is required to maintain it.
 Cost-Effective: No additional hardware is required for implementing 1-Tier
Architecture, which makes it cost-effective.
 Easy to Implement: 1-Tier Architecture can be easily deployed, and hence it is
mostly used in small projects.
2-Tier Architecture
The 2-tier architecture is similar to a basic client-server model . The application at the client
end directly communicates with the database on the server side. APIs like ODBC and JDBC
are used for this interaction. The server side is responsible for providing query processing
and transaction management functionalities. On the client side, the user interfaces and
application programs are run. The application on the client side establishes a connection with
the server side to communicate with the DBMS.
For Example: A Library Management System used in schools or small organizations is a
classic example of two-tier architecture.
1. Client Layer (Tier 1): This is the user interface that library staff or users interact
with. For example they might use a desktop application to search for books, issue
them, or check due dates.
2. Database Layer (Tier 2): The database server stores all the library records such as
book details, user information, and transaction logs.
The client layer sends a request (like searching for a book) to the database layer which
processes it and sends back the result. This separation allows the client to focus on the user
interface, while the server handles data storage and retrieval.
Advantages of 2-Tier Architecture
 Easy to Access: 2-Tier Architecture makes easy access to the database, which
makes fast retrieval.
 Scalable: We can scale the database easily, by adding clients or upgrading
hardware.
 Low Cost: 2-Tier Architecture is cheaper than 3-Tier Architecture and Multi-Tier
Architecture.
 Easy Deployment: 2-Tier Architecture is easier to deploy than 3-Tier Architecture.
 Simple: 2-Tier Architecture is easily understandable as well as simple because of
only two components.
3-Tier Architecture
In 3-Tier Architecture, there is another layer between the client and the server. The client
does not directly communicate with the server. Instead, it interacts with an application server
which further communicates with the database system and then the query processing and
transaction management takes place. This intermediate layer acts as a medium for the
exchange of partially processed data between the server and the client. This type of
architecture is used in the case of large web applications.
For Example: E-commerce Store
User: You visit an online store, search for a product and add it to your cart.
Processing: The system checks if the product is in stock, calculates the total price and
applies any discounts.
Database: The product details, your cart and order history are stored in the database for
future reference.
Advantages of 3-Tier Architecture
 Enhanced scalability: Scalability is enhanced due to the distributed deployment of
application servers. Now, individual connections need not be made between the
client and server.
 Data Integrity: 3-Tier Architecture maintains Data Integrity. Since there is a
middle layer between the client and the server, data corruption can be
avoided/removed.
 Security: 3-Tier Architecture Improves Security. This type of model prevents
direct interaction of the client with the server thereby reducing access to
unauthorized data.
Disadvantages of 3-Tier Architecture
 More Complex: 3-Tier Architecture is more complex in comparison to 2-Tier
Architecture. Communication Points are also doubled in 3-Tier Architecture.
 Difficult to Interact: It becomes difficult for this sort of interaction to take place
due to the presence of middle layers.
Data abstraction is the process of hiding unwanted and irrelevant details from the end user.
It helps to store information in such a way that the end user can access data which is
necessary, the user will not be able to see what data is stored or how it is stored in a
database. Data abstraction helps to keep data secure from unauthorized access and it hides all
the implementation details.
Levels of Abstraction in DBMS
There are three levels of data abstraction in DBMS that are mentioned below.

 Physical or internal level


 logical or conceptual level
 view or external level
Physical or Internal Level
It is the lowest level of data abstraction which defines how data is stored in database . It
defines data structures used to store data and methods to access data in database. It is very
complex to understand and hence kept hidden from user. Database administrator decides
how and where to store the data in database.
Physical level deals with actual storage details like data organization, disk space allocation
and data access methods.
Logical or Conceptual Level
It is intermediate level present next to physical level. It defines what data is present in
database and their relationships between them . It is less complex as compared to physical
level. Programmers generally work at this level and depending on data, structure of tables,
relationships and their constraints is decided at this level.
View or External Level
It is the highest level in abstraction. There are different levels of views and each view defines
only a part of whole data required to user. This level defines many views of same database
for simplification of view to user. This is the highest level and easiest to understand for user.
Data Independence is mainly defined as a property of DBMS that helps you to change the
database schema at one level of a system without requiring to change the schema at the next
level. it helps to keep the data separated from all program that makes use of it.
We have namely two levels of data independence arising from these levels of abstraction:
 Physical level data independence
 Logical level data independence
Physical Level Data Independence
It refers to the characteristic of being able to modify the physical schema without any
alterations to the conceptual or logical schema, done for optimization purposes, e.g., the
Conceptual structure of the database would not be affected by any change in storage size of
the database system server. Changing from sequential to random access files is one such
example. These alterations or modifications to the physical structure may include:
 Utilizing new storage devices.
 Modifying data structures used for storage.
 Altering indexes or using alternative file organization techniques etc.
Logical Level Data Independence
It refers characteristic of being able to modify the logical schema without affecting the
external schema or application program. The user view of the data would not be affected by
any changes to the conceptual view of the data. These changes may include insertion or
deletion of attributes, altering table structures entities or relationships to the logical schema,
etc.
DDL actually represents Data Definition Language, which is actually a set of commands
used to create a structure and maintain databases. Those would include CREATE, ALTER,
DROP, TRUNCATE, and RENAME statements for creating, changing the structure of, and
dropping structures in the database, such as tables. DDL basically deals with the storage of
the data and not the data itself.
Example: The ‘CREATE TABLE’ command defines a new table called “Employees” with
columns including EmployeeID, FirstName, LastName, and HireDate along with their
datatypes.
Types of DDL Commands
DDL includes the following commands:
1. CREATE
This command is used to create table in the relational database.
This can be done by specifying the names and datatypes of various columns.
Example:
CREATE TABLE
Student
(Student_id INT,
Name VARCHAR(100),
Marks INT);

2. ALTER
Alter command is used for altering the table in many forms like:
1. Add a column
2. Rename existing column
3. Drop a column
4. Modify the size of the column or change datatype of the column
 ADD using ALTER:
Example:
ALTER TABLE Student
ADD
(Address VARCHAR(200));
 RENAME using ALTER
Example:
ALTER TABLE
Student
RENAME
Marks TO Age;
 DROP using ALTER
Example:
ALTER TABLE Student
DROP
(Age);
 MODIFY using ALTER
Example:
ALTER TABLE
Student
MODIFY
(name varchar(300));

3. TRUNCATE
This command removes all the records from a table. But this command will not destroy the
table’s structure.
Example:
TRUNCATE TABLE Student;

4. DROP
This command completely removes the table from the database along with the destruction of
the table structure.
Example:
DROP TABLE Student;

Advantages of DDL
 Defines Structure: DDL commands like CREATE and ALTER allow you to set up
and modify the structure of database objects (tables, indexes, views), organizing data
effectively.
 Manages Schema: You can easily create, update, or remove entire database schemas,
making it simpler to maintain and update the database structure over time.
 Enforces Data Integrity: DDL allows the implementation of rules and constraints
(PRIMARY KEY, FOREIGN KEY, UNIQUE, NOT NULL) that ensure data remains
accurate and consistent.
 Boosts Performance: By creating indexes and partitions using DDL commands, you
can significantly improve query performance and speed up data retrieval.
 Standardized: DDL commands are standardized across most SQL-based databases,
making it easier to apply knowledge across different database systems.
Disadvantages of DDL
 Irreversible Changes: DDL commands, especially DROP and TRUNCATE, are
auto-committed, meaning changes can’t be undone once executed. This can result in
permanent data loss if not used carefully.
 Risk of Data Loss: Misuse of commands like DROP can delete entire tables along
with their data, leading to significant data loss, particularly if no backup is available.
 Complex for Large Databases: Altering the structure of large databases can be
complicated and may require downtime or additional planning to avoid disrupting
operations.
 Causes Locking: DDL operations can lock database objects during execution, which
might slow down or block other operations, impacting overall database performance.
 Compatibility Issues: While DDL is standardized, different databases (like MySQL,
Oracle, or PostgreSQL) may have variations in syntax and features, leading to
compatibility issues during migration.
Applications of DDL
 Creating Database Objects: DDL statements can be used to create various database
objects such as tables, views, indexes, and stored procedures.
 Modifying Database Objects: DDL statements can be used to modify the structure of
existing database objects such as adding or dropping columns from tables, modifying
the data type of columns, renaming tables or columns, etc.
 Managing Database Constraints: DDL statements can be used to create or alter
database constraints such as primary keys, foreign keys, unique constraints, and check
constraints.
 Granting or Revoking Permissions: DDL statements can be used to grant or revoke
permissions to various database objects such as tables, views, stored procedures, and
indexes.
 Indexing: DDL statements can be used to create or modify indexes on database tables,
which can improve the performance of SQL queries.
 Partitioning: DDL statements can be used to create or modify partitioned tables, which
can improve the performance of queries that access large amounts of data.
DML stands for Data Manipulation Language. Tables and formulas are helpful when
communicating with data stored up to a point in a database through SQL (Structured
Query Language), but a time comes when we actually want to execute some fairly
complicated data interactions. We will also need the Data Manipulation Language in that
situation. DML is a way to inform a database precisely what we want it to do by conversing
in a manner that it has been built to comprehend from scratch. When it comes to interacting
with existing data, whether adding, moving, or deleting data, it provides a convenient way to
do so.
Generally, DML commands fall into one of four primary categories:
 INSERT adds fresh data to a table.
 UPDATE Change the data that is already in a table.
 DELETE takes a record out of a table.
 SELECT Get information out of one or more tables.
Types of Data Manipulation Language
There are basically two types of Data Manipulation Language. These are mentioned below.
We have described them in the difference between format.
High-Level or Non-Procedural DML:
 It is also labelled as set-at-a-time or series oriented DML.
 It can be used on its own for precisely specifying complex operations in the database.
 It is prescriptive in nature.
 It demands that a user must clearly state which data is needed without clarifying how
and when to obtain those data.
 For Example: Every SQL statement is a prescriptive command.
Low-level or Procedural DML
 It is also labelled as track-at-a-time DML.
 It must be integrated to a general-purpose programming language.
 It is indispensable in nature.
 It demands that a user must clearly state which data is needed and how to obtain those
data.
 For Example: DB2’s SQL PL, Oracle’s PL/SQL.
Characteristics of DML
It performs interpret-only data queries. It is used in a database schema to recall and
manipulate the information. DML It is a dialect which is used to select, insert, delete and
update data in a database. Data Manipulation Language (DML) commands are as follows:
SELECT Command
This command is used to get data out of the database. It helps users of the database to access
from an operating system, the significant data they need. It sends a track result set from one
tables or more.
Example:
SELECT *
FROM students
where due_fees <=20000;
INSERT Command
This command is used to enter the information or values into a row. We can connect one or
more records to a single table within a repository using this instruction. This is often used to
connect an unused tag to the documents.
Example :
INSERT INTO students ('stu_id' int, 'stu_name' varchar(20), 'city' varchar(20))
VALUES ('1', 'Nirmit', 'Gorakhpur');
UPDATE Command
This command is used to alter existing table records. Within a table, it modifies data from
one or more records. This command is used to alter the data which is already present in a
table.
Example:
UPDATE students
SET due_fees = 20000
WHERE stu_name = 'Mini';
DELETE Command
It deletes all archives from a table. This command is used to erase some or all of the previous
table’s records. If we do not specify the ‘WHERE’ condition then all the rows would be
erased or deleted.
Example:
DELETE FROM students
WHERE stu_id = '001';
Advantages of DML
 DML statements could alter the data that is contained or stored in the database.
 It delivers effective human contact with the machine.
 User could specify what data is required.
 DML aims to have many different varieties and functionalities between vendors
providing databases.
Disadvantages of DML
 We cannot use DML to change the structure of the database.
 Limit table view i.e., it could conceal some columns in tables.
 Access the data without having the data stored in the object.
 Unable to build or erase lists or sections using DML.

A Data Model in Database Management System (DBMS) is the concept of tools that are
developed to summarize the description of the database. Data Models provide us with a
transparent picture of data which helps us in creating an actual database. It shows us from the
design of the data to its proper implementation of data.
Types of Relational Models
1. Conceptual Data Model
The conceptual data model describes the database at a very high level and is useful to
understand the needs or requirements of the database. It is this model, that is used in the
requirement-gathering process i.e. before the Database Designers start making a particular
database. One such popular model is the entity/relationship model (ER model). The E/R
model specializes in entities, relationships, and even attributes that are used by database
designers. In terms of this concept, a discussion can be made even with non-computer
science(non-technical) users and stakeholders, and their requirements can be understood.
Entity-Relationship Model( ER Model): It is a high-level data model which is used to
define the data and the relationships between them. It is basically a conceptual design of any
database which is easy to design the view of data.
Components of ER Model:
1. Entity: An entity is referred to as a real-world object. It can be a name, place,
object, class, etc. These are represented by a rectangle in an ER Diagram.
2. Attributes: An attribute can be defined as the description of the entity. These are
represented by Ellipse in an ER Diagram. It can be Age, Roll Number, or Marks
for a Student.
3. Relationship: Relationships are used to define relations among different entities.
Diamonds and Rhombus are used to show Relationships.
Characteristics of a conceptual data model
 Offers Organization-wide coverage of the business concepts.
 This type of Data Models are designed and developed for a business audience.
 The conceptual model is developed independently of hardware specifications like
data storage capacity, location or software specifications like DBMS vendor and
technology. The focus is to represent data as a user will see it in the “real world.”
Conceptual data models known as Domain models create a common vocabulary for all
stakeholders by establishing basic concepts and scope
2. Representational Data Model
This type of data model is used to represent only the logical part of the database and does not
represent the physical structure of the database. The representational data model allows us to
focus primarily, on the design part of the database. A popular representational model is
a Relational model. The relational Model consists of Relational Algebra and Relational
Calculus. In the Relational Model, we basically use tables to represent our data and the
relationships between them. It is a theoretical concept whose practical implementation is
done in Physical Data Model.
The advantage of using a Representational data model is to provide a foundation to form the
base for the Physical model.
Characteristics of Representational Data Model
 Represents the logical structure of the database.
 Relational models like Relational Algebra and Relational Calculus are commonly
used.
 Uses tables to represent data and relationships.
 Provides a foundation for building the physical data model.
3. Physical Data Model
The physical Data Model is used to practically implement Relational Data Model.
Ultimately, all data in a database is stored physically on a secondary storage device such as
discs and tapes. This is stored in the form of files, records, and certain other data structures.
It has all the information on the format in which the files are present and the structure of the
databases, the presence of external data structures, and their relation to each other. Here, we
basically save tables in memory so they can be accessed efficiently. In order to come up with
a good physical model, we have to work on the relational model in a better way. Structured
Query Language (SQL) is used to practically implement Relational Algebra.
This Data Model describes HOW the system will be implemented using a specific DBMS
system. This model is typically created by DBA and developers. The purpose is actual
implementation of the database.
Characteristics of a physical data model:
 The physical data model describes data need for a single project or application
though it maybe integrated with other physical data models based on project scope.
 Data Model contains relationships between tables that which addresses cardinality
and nullability of the relationships.
 Developed for a specific version of a DBMS, location, data storage or technology
to be used in the project.
 Columns should have exact datatypes, lengths assigned and default values.
 Primary and Foreign keys, views, indexes, access profiles, and authorizations, etc.
are defined

The Entity Relationship Model is a model for identifying entities (like student, car or
company) to be represented in the database and representation of how those entities are
related. The ER data model specifies enterprise schema that represents the overall logical
structure of a database graphically.
Why Use ER Diagrams In DBMS?
 ER diagrams represent the E-R model in a database, making them easy to convert
into relations (tables).
 ER diagrams provide the purpose of real-world modeling of objects which makes
them intently useful.
 ER diagrams require no technical knowledge of the underlying DBMS used.
 It gives a standard solution for visualizing the data logically.
Components of ER Diagram
ER Model consists of Entities, Attributes, and Relationships among Entities in a Database
System.
What is Entity?
An Entity may be an object with a physical existence – a particular person, car, house, or
employee – or it may be an object with a conceptual existence – a company, a job, or a
university course.
What is Entity Set?
An Entity is an object of Entity Type and a set of all entities is called an entity set. For
Example, E1 is an entity having Entity Type Student and the set of all students is called
Entity Set. In ER diagram, Entity Type is represented as:
Types of Entity
There are two types of entity:
1. Strong Entity
A Strong Entity is a type of entity that has a key Attribute. Strong Entity does not depend on
other Entity in the Schema. It has a primary key, that helps in identifying it uniquely, and it
is represented by a rectangle. These are called Strong Entity Types.
2. Weak Entity
An Entity type has a key attribute that uniquely identifies each entity in the entity set. But
some entity type exists for which key attributes can’t be defined. These are called Weak
Entity types .
What is Attributes?
Attributes are the properties that define the entity type. For example, Roll_No, Name, DOB,
Age, Address, and Mobile_No are the attributes that define entity type Student. In ER
diagram, the attribute is represented by an oval.
Types of Attributes
1. Key Attribute
The attribute which uniquely identifies each entity in the entity set is called the key attribute.
For example, Roll_No will be unique for each student. In ER diagram, the key attribute is
represented by an oval with underlying lines.
2. Composite Attribute
An attribute composed of many other attributes is called a composite attribute. For example,
the Address attribute of the student Entity type consists of Street, City, State, and Country. In
ER diagram, the composite attribute is represented by an oval comprising of ovals.
3. Multivalued Attribute
An attribute consisting of more than one value for a given entity. For example, Phone_No
(can be more than one for a given student). In ER diagram, a multivalued attribute is
represented by a double oval.
4. Derived Attribute
An attribute that can be derived from other attributes of the entity type is known as a derived
attribute. e.g.; Age (can be derived from DOB). In ER diagram, the derived attribute is
represented by a dashed oval.
Relationship Type and Relationship Set
A Relationship Type represents the association between entity types. For example, ‘Enrolled
in’ is a relationship type that exists between entity type Student and Course. In ER diagram,
the relationship type is represented by a diamond and connecting the entities with lines.
A set of relationships of the same type is known as a relationship set. The following
relationship set depicts S1 as enrolled in C2, S2 as enrolled in C1, and S3 as registered in C3.

Degree of a Relationship Set


The number of different entity sets participating in a relationship set is called the degree of a
relationship set.
1. Unary Relationship: When there is only ONE entity set participating in a relation, the
relationship is called a unary relationship. For example, one person is married to only one
person.
2. Binary Relationship: When there are TWO entities set participating in a relationship, the
relationship is called a binary relationship. For example, a Student is enrolled in a Course.
3. Ternary Relationship: When there are three entity sets participating in a relationship, the
relationship is called a ternary relationship.
4. N-ary Relationship: When there are n entities set participating in a relationship, the
relationship is called an n-ary relationship.
What is Cardinality?
The number of times an entity of an entity set participates in a relationship set is known
as cardinality . Cardinality can be of different types:
1. One-to-One: When each entity in each entity set can take part only once in the
relationship, the cardinality is one-to-one. Let us assume that a male can marry one female
and a female can marry one male. So the relationship will be one-to-one.
the total number of tables that can be used in this is 2.

one to one cardinality

Using Sets, it can be represented as:


Set Representation of One-to-One

2. One-to-Many: In one-to-many mapping as well where each entity can be related to more
than one entity and the total number of tables that can be used in this is 2. Let us assume that
one surgeon department can accommodate many doctors. So the Cardinality will be 1 to M.
It means one department has many Doctors.
total number of tables that can used is 3.

one to many cardinality

Using sets, one-to-many cardinality can be represented as:

Set Representation of One-to-Many

3. Many-to-One: When entities in one entity set can take part only once in the relationship
set and entities in other entity sets can take part more than once in the relationship set,
cardinality is many to one. Let us assume that a student can take only one course but one
course can be taken by many students. So the cardinality will be n to 1. It means that for one
course there can be n students but for one student, there will be only one course.
The total number of tables that can be used in this is 3.
Using Sets, it can be represented as:

In this case, each student is taking only 1 course but 1 course has been taken by many
students.
4. Many-to-Many: When entities in all entity sets can take part more than once in the
relationship cardinality is many to many. Let us assume that a student can take more than
one course and one course can be taken by many students. So the relationship will be many
to many.
the total number of tables that can be used in this is 3.

Using Sets, it can be represented as:

In this example, student S1 is enrolled in C1 and C3 and Course C3 is enrolled by S1, S3,
and S4. So it is many-to-many relationships.
Participation Constraint
Participation Constraint is applied to the entity participating in the relationship set.
1. Total Participation – Each entity in the entity set must participate in the relationship. If
each student must enroll in a course, the participation of students will be total. Total
participation is shown by a double line in the ER diagram.
2. Partial Participation – The entity in the entity set may or may NOT participate in the
relationship. If some courses are not enrolled by any of the students, the participation in the
course will be partial.
Network Model
This model was formalized by the Database Task group in the 1960s. This model is the
generalization of the hierarchical model. This model can consist of multiple parent segments
and these segments are grouped as levels but there exists a logical association between the
segments belonging to any level. Mostly, there exists a many-to-many logical association
between any of the two segments. We called graphs the logical associations between the
segments. Therefore, this model replaces the hierarchical tree with a graph-like structure, and
with that, there can more general connections among different nodes. It can have M: N
relations i.e, many-to-many which allows a record to have more than one parent segment.
Here, a relationship is called a set, and each set is made up of at least 2 types of record which
are given below:
 An owner record that is the same as of parent in the hierarchical model.
 A member record that is the same as of child in the hierarchical model.
Structure of a Network Model

In the above figure, member TWO has only one owner ‘ONE’ whereas member FIVE has
two owners i.e, TWO and THREE. Here, each link between the two record types represents
1 : M relationship between them. This model consists of both lateral and top-down
connections between the nodes. Therefore, it allows 1: 1, 1 : M, M : N relationships among
the given entities which helps in avoiding data redundancy problems as it supports multiple
paths to the same record. There are various examples such as TOTAL by Cincom Systems
Inc., EDMS by Xerox Corp., etc.
Advantages of Network Model
 This model is very simple and easy to design like the hierarchical data model.
 This model is capable of handling multiple types of relationships which can help in
modeling real-life applications, for example, 1: 1, 1: M, M: N relationships.
 In this model, we can access the data easily, and also there is a chance that the
application can access the owner’s and the member’s records within a set.
 This network does not allow a member to exist without an owner which leads to
the concept of Data integrity.
 Like a hierarchical model, this model also does not have any database standard,
 This model allows to represent multi parent relationships.
Disadvantages of Network Model
 The schema or the structure of this database is very complex in nature as all the
records are maintained by the use of pointers.
 There’s an existence of operational anomalies as there is a use of pointers for
navigation which further leads to complex implementation.
 The design or the structure of this model is not user-friendly.
 This model does not have any scope of automated query optimization.
 This model fails in achieving structural independence even though the network
database model is capable of achieving data independence.
Object Oriented Data Model :
In Object Oriented Data Model, data and their relationships are contained in a single
structure which is referred as object in this data model. In this, real world problems are
represented as objects with different attributes. All objects have multiple relationships
between them. Basically, it is combination of Object Oriented programming and Relational
Database Model as it is clear from the following figure :
Object Oriented Data Model
= Combination of Object Oriented Programming + Relational database model
 Inheritance –
By using inheritance, new class can inherit the attributes and methods of the old class
i.e. base class. For example: as classes Student, Doctor and Engineer are inherited
from the base class Person.
Advantages of Object Oriented Data Model :
 Codes can be reused due to inheritance.
 Easily understandable.
 Cost of maintenance can reduced due to reusability of attributes and functions
because of inheritance.
Disadvantages of Object Oriented Data Model :
 It is not properly developed so not accepted by users easily.
Relational Model
The relational model represents how data is stored in Relational Databases. A relational
database consists of a collection of tables each of which is assigned a unique name. Consider
a relation STUDENT with attributes ROLL_NO, NAME, ADDRESS, PHONE, and AGE
shown in the table.
Table STUDENT

Characteristics of the Relational Model


 Data Representation: Data is organized in tables (relations), with rows (tuples)
representing records and columns (attributes) representing data fields.
 Atomic Values: Each attribute in a table contains atomic values, meaning no multi-
valued or nested data is allowed in a single cell.
 Unique Keys: Every table has a primary key to uniquely identify each record,
ensuring no duplicate rows.
 Attribute Domain: Each attribute has a defined domain, specifying the valid data
types and constraints for the values it can hold.
 Tuples as Rows: Rows in a table, called tuples, represent individual records or
instances of real-world entities or relationships.
 Relation Schema: A table’s structure is defined by its schema, which specifies the
table name, attributes, and their domains.
 Data Independence: The model ensures logical and physical data independence,
allowing changes in the database schema without affecting the application layer.
 Integrity Constraints: The model enforces rules like:
o Domain constraints: Attribute values must match the specified domain.
o Entity integrity: No primary key can have NULL values.
o Referential integrity: Foreign keys must match primary keys in the referenced
table or be NULL.
 Relational Operations: Supports operations like selection, projection, join, union,
and intersection, enabling powerful data retrieval manipulation.
 Data Consistency: Ensures data consistency through constraints, reducing redundancy
and anomalies.
 Set-Based Representation: Tables in the relational model are treated as sets, and
operations follow mathematical set theory principles.
Constraints in Relational Model
While designing the Relational Model, we define some conditions which must hold for data
present in the database are called Constraints. These constraints are checked before
performing any operation (insertion, deletion, and updation ) in the database. If there is a
violation of any of the constraints, the operation will fail.
Domain Constraints
Domain Constraints ensure that the value of each attribute A in a tuple must be an atomic
value derived from its specified domain, dom(A). Domains are defined by the data types
associated with the attributes. Common data types include:
 Numeric types: Includes integers (short, regular, and long) for whole numbers
and real numbers (float, double-precision) for decimal values, allowing precise
calculations.
 Character types: Consists of fixed-length (CHAR) and variable-
length (VARCHAR, TEXT) strings for storing text data of various sizes.
 Boolean values: Stores true or false values, often used for flags or conditional
checks in databases.
 Specialized types: Includes types
for date (DATE), time (TIME), timestamp (TIMESTAMP),
and money (MONEY), used for precise handling of time-related and financial data.
Key Integrity
Every relation in the database should have at least one set of attributes that defines a tuple
uniquely. Those set of attributes is called keys. e.g.; ROLL_NO in STUDENT is key. No
two students can have the same roll number. So a key has two properties:
 It should be unique for all tuples.
 It can’t have NULL values.
Referential Integrity Constraints
When one attribute of a relation can only take values from another attribute of the same
relation or any other relation, it is called referential integrity.
Anomalies in the Relational Model
An anomaly is an irregularity or something which deviates from the expected or normal
state. When designing databases, we identify three types of
anomalies: Insert, Update, and Delete.
Insertion Anomaly in Referencing Relation
We can’t insert a row in REFERENCING RELATION if referencing attribute’s value is not
present in the referenced attribute value. e.g.; Insertion of a student with BRANCH_CODE
‘ME’ in STUDENT relation will result in an error because ‘ME’ is not present in
BRANCH_CODE of BRANCH.
Deletion/ Updation Anomaly in Referenced Relation:
We can’t delete or update a row from REFERENCED RELATION if the value of
REFERENCED ATTRIBUTE is used in the value of REFERENCING ATTRIBUTE. e.g. if
we try to delete a tuple from BRANCH having BRANCH_CODE ‘CS’, it will result in an
error because ‘CS’ is referenced by BRANCH_CODE of STUDENT, but if we try to delete
the row from BRANCH with BRANCH_CODE CV, it will be deleted as the value is not
been used by referencing relation. It can be handled by the following method:
On Delete Cascade
It will delete the tuples from REFERENCING RELATION if the value used by
REFERENCING ATTRIBUTE is deleted from REFERENCED RELATION. e.g.; if we
delete a row from BRANCH with BRANCH_CODE ‘CS’, the rows in STUDENT relation
with BRANCH_CODE CS (ROLL_NO 1 and 2 in this case) will be deleted.
On Update Cascade
It will update the REFERENCING ATTRIBUTE in REFERENCING RELATION if the
attribute value used by REFERENCING ATTRIBUTE is updated in REFERENCED
RELATION. e.g., if we update a row from BRANCH with BRANCH_CODE ‘CS’ to ‘CSE’,
the rows in STUDENT relation with BRANCH_CODE CS (ROLL_NO 1 and 2 in this case)
will be updated with BRANCH_CODE ‘CSE’.
Super Keys
Any set of attributes that allows us to identify unique rows (tuples) in a given relationship is
known as super keys. Out of these super keys, we can always choose a proper subset among
these that can be used as a primary key. Such keys are known as Candidate keys. If there is a
combination of two or more attributes that are being used as the primary key then we call it a
Composite key.
Codd Rules in Relational Model
Edgar F Codd proposed the relational database model where he stated rules. Now these are
known as Codd’s Rules. For any database to be the perfect one, it has to follow the rules.
Advantages of the Relational Model
 Simple model: Relational Model is simple and easy to use in comparison to other
languages.
 Flexible: Relational Model is more flexible than any other relational model present.
 Secure: Relational Model is more secure than any other relational model.
 Data Accuracy: Data is more accurate in the relational data model.
 Data Integrity: The integrity of the data is maintained in the relational model.
 Operations can be Applied Easily: It is better to perform operations in the relational
model.
Disadvantages of the Relational Model
 Relational Database Model is not very good for large databases.
 Sometimes, it becomes difficult to find the relation between tables.
 Because of the complex structure, the response time for queries is high.
Data Manipulation
Data Manipulation is the process of manipulating (creating, arranging, deleting) data
points in a given data to get insights much easier. We know that about 90% of the data we
have are unstructured. Data manipulation is a fundamental step in data analysis, data
mining, and data preparation for machine learning and is essential for making informed
decisions and drawing conclusions from raw data.
Operations of Data Manipulation
Data Manipulation follows the 4 main operations, CRUD (Create, Read, Update and
Delete). It is used in many industries to improve the overall output.
In most DML, there is some version of the CRUD operations where:
 Create: To create a new data point or database.
 Read: Read the data to understand where we need to perform data manipulation.
 Update: Update missing/wrong data points with the correct ones to encourage data to
be streamlined.
 Delete: Deletes the rows with missing data points/ erroneous/ misclassified data.
These 4 main operations are performed in different ways seen below:
 Data Preprocessing: Most of the raw data that is mined may contain errors, missing
values and mislabeled data. This will hamper the final output if it is not dealt with in
the initial stages.
 Structuring data (if it is unstructured): If there’s any sort of data available in the
database which can be structured into a table to query them effectively, we sort those
data into tables for greater efficiency.
 Reduce the number of features: As we know, data analysis is inherently
computationally intensive. As a result, one of the reasons to perform data
manipulation is to find out the optimum number of features needed for getting the
result, while discarding the other features. Some techniques used here are, Principal
Component Analysis (PCA), Discrete Wavelet Transform and so on.
 Clean the data: Delete unnecessary data points or outliers which may affect the final
output. This is done to streamline the output.
 Transforming data: Some insights into data can be improved by transforming the
data. This may involve transposing data, and arranging/rearranging them.

Unit-2

Relational Database systems are expected to be equipped with a query language that assists
users to query the database. Relational Query Language is used by the user to
communicate with the database user requests for the information from the database.
Relational algebra breaks the user requests and instructs the DBMS to execute the requests.
It is the language by which the user communicates with the database. They are generally on a
higher level than any other programming language. These relational query languages can be
Procedural and Non-Procedural.

Types of Relational Query Language


There are two types of relational query language:
 Procedural Query Language
 Non-Procedural Language
Procedural Query Language
In Procedural Language, the user instructs the system to perform a series of operations on the
database to produce the desired results. Users tell what data to be retrieved from the database
and how to retrieve it. Procedural Query Language performs a set of queries instructing the
DBMS to perform various transactions in sequence to meet user requests.
Relational Algebra is a Procedural Query Language
Relational Algebra could be defined as the set of operations on relations.
There are a few operators that are used in relational algebra -
1. Select (sigma): Returns rows of the input relation that satisfy the provided predicate. It is
unary Operator means requires only one operand.
2. Projection (ℼ): Show the list of those attribute which we desire to appear and rest other
attributes are eliminated from the table. It seperates the table vertically.
3. Set Difference (-): It returns the difference between two relations . If we have two
relations R and S them R-S will return all the tuples (row) which are in relation R but not
in Relation S , It is binary operator.
4. Cartesian Product (X): Combines every tuple (row) of one table with every tuple (row)
in other table ,also referred as cross Product . It is a binary operator.
5. Union (U): Outputs the union of tuples from both the relations. Duplicate tuples are
eliminated automatically. It is a binary operator means it require two operands.
Non-Procedural Language
In Non Procedural Language user outlines the desired information without giving a specific
procedure or without telling the steps by step process for attaining the information. It only
gives a single Query on one or more tables to get .The user tells what is to be retrieved from
the database but does not tell how to accomplish it.
For Example: get the name and the contact number of the student with a Particular ID will
have a single query on STUDENT table.
Relational Calculus is a Non Procedural Language .
Relational Calculus exists in two forms:
1. Tuple Relational Calculus (TRC): Tuple Relational Calculus is a non procedural query
language, It is used for selecting the tuples that satisfy the given condition or predicate .
The result of the relation can have one or more tuples (row).
2. Domain Relational Calculus (DRC): Domain Relational Calculus is a Non Procedural
Query Language , the records are filtered based on the domains , DRC uses the list of
attributes to be selected from relational based on the condition.
SQL3, also known as SQL:1999, represents a significant advancement in the evolution of
the Structured Query Language (SQL). Published in December 1999 by the International
Organization for Standardization (ISO) and the International Electrotechnical Commission
(IEC), SQL3 introduced numerous features and extensions to address the growing demands
of database applications.

Key Features of SQL3:

1. User-Defined Data Types (UDTs): SQL3 allows users to define complex data types
tailored to specific application needs. For instance, a company can create a custom
data type to represent a full address, encapsulating fields like street, city, and postal
code. This enhances data modeling flexibility and accuracy.
2. Object-Oriented Support: Incorporating object-oriented programming concepts,
SQL3 introduces features such as inheritance, methods, and constructors. This enables
more natural and intuitive modeling of complex data structures, allowing objects to
encapsulate both data and behavior, thereby simplifying data management in intricate
applications.
3. New Data Types:
o BLOB (Binary Large Object): Designed to store large amounts of binary data,
such as images or multimedia files.
o CLOB (Character Large Object): Used for storing extensive text data.
o ARRAY: Enables the storage of arrays as column values, allowing for the
representation of multi-valued attributes within a single table.
o Structured Types: Facilitates the creation of complex data structures with
multiple attributes, akin to classes in object-oriented languages.
4. Collections of Objects: SQL3 generalizes the relational model into an object model,
offering abstract data types, multiple inheritance, and dynamic polymorphism. Tables
can contain collections (multisets) of objects, with sets and lists defined as related
collection types. This allows for more flexible and hierarchical data modeling.
5. Enhanced Procedural Capabilities: SQL3 introduces procedural language
extensions, making SQL a computationally complete programming language. This
facilitates the definition of functions and procedures within the database, allowing for
centralized processing logic and reduced network traffic.

These enhancements make SQL3 a powerful tool for developers and database administrators,
providing greater flexibility, improved performance, and more intuitive data modeling
capabilities in relational database management systems.

Data Definition Language


 DDL is used to specify a database’s structure, which includes its tables, views,
indexes, and constraints.
 DDL commands come in the following types: CREATE, ALTER, DROP,
RENAME, and TRUNCATE.
 DDL statements only modify the database’s schema; they have no direct effect on
the data within the database.
 DDL declarations are irreversible and difficult to undo.
Data Manipulation Language
 Inserting, updating, removing, and retrieving data from a database are all possible
with DML.
 DML commands come in the following types: SELECT, INSERT, UPDATE,
DELETE, and MERGE.
 DML statements have a direct impact on the database’s data.
 In the event of an error, data can be recovered thanks to the reversibility of DML
statements.
Aspect DDL (Data Definition Language) DML (Data Manipulation Language)
Defines and manages the database schema and Manipulates and manages data within existing
Purpose
structure. database structures.
Commands CREATE, ALTER, DROP, TRUNCATE, RENAME SELECT, INSERT, UPDATE, DELETE
Creates, modifies, or deletes database objects Retrieves, adds, modifies, or deletes data
Functionality
like tables, indexes, and views. within tables.
Transaction Commands are auto-committed; changes are Commands are not auto-committed; changes
Behavior immediately saved and cannot be rolled back. can be rolled back if not explicitly committed.
- CREATE TABLE Students (ID INT, Name - INSERT INTO Students (ID, Name)
Examples VARCHAR(50));- ALTER TABLE Students VALUES (1, 'Alice');- UPDATE Students
ADD COLUMN Age INT; SET Age = 20 WHERE ID = 1;

A database is a systematically organized set of structured data, usually stored electronically


on a computer. It is typically managed by a database management system (DBMS). The
combination of the data, the DBMS, and the related applications is known as a database
system, or simply a database.
What is Open Source Database?
An open-source database is a database where anyone can easily view the source code and
this is open and free to download. Also for the community version, some small additional
and affordable costs are imposed. Open Source Database provides Limited technical support
to end-users. Here Installation and updates are administered by the user.
Advantages of Open Source Databases
 Cost: Open-source databases are generally free, which means they can be used
without any licensing fees.
 Customization: Since the source code is available, developers can modify and
customize the database to meet specific requirements.
 Community Support: Open-source databases have a large community of users
who contribute to documentation, bug fixes, and improvements.
 Security: With open-source databases, security vulnerabilities can be detected and
fixed quickly by the community.
 Scalability: Open-source databases are typically designed to be scalable, which
means they can handle large amounts of data and traffic.
Disadvantages of Open Source Databases
 Limited Technical Support: While there is a large community of users who can
help troubleshoot issues, there is no guarantee of professional technical support.
 Complexity: Open source databases can be more difficult to set up and configure
than commercial databases, especially for users who are not experienced in
database administration.
 Lack of Features: Open source databases may not have all the features that are
available in commercial databases, such as advanced analytics and reporting tools.
What is Commercial Database?
Commercial databases are those that have been created for Commercial Purposes only. They
are premium and are not free like Open Source Database. In Commercial Database it is
guaranteed that technical support is provided. In this Installation, updates are Administered
by the Software Vendor. For example: Oracle, IBM DB2, etc.
Advantages of Commercial Databases
 Technical Support: Commercial databases usually come with professional
technical support, which can be helpful for organizations that need assistance with
setup, configuration, or troubleshooting.
 Features: Commercial databases typically have more features than open-source
databases, including advanced analytics, reporting, and data visualization tools.
 Security: Commercial databases often have built-in security features and can
provide better protection against cyber threats.
 Integration: Commercial databases are often designed to work seamlessly with
other enterprise software, making integration with existing systems easier.
Disadvantages of Commercial Databases
 Cost: Commercial databases can be expensive, with licensing fees and
maintenance costs that can add up over time.
 Vendor Lock-In: Organizations that use commercial databases may become
dependent on the vendor and find it difficult to switch to another database.
 Limited Customization: Commercial databases may not be as customizable as
open source databases, which can be a disadvantage for organizations with specific
requirements.
Similarities between Open Source Database and Commercial Database
 Both can handle large amounts of data and support complex data structures.
 Both can be used to store and retrieve data in a structured manner.
 Both can be used to support mission-critical applications and services.
 Both use SQL (Structured Query Language) to perform queries and manipulate
data.
 Both can be accessed and managed remotely using a variety of tools and interfaces.
 Both can be optimized for performance, scalability, and security.
Difference Between Open Source Database and Commercial Database
Aspect Open Source DBMS Commercial DBMS
A DBMS developed and distributed by a
A DBMS whose source code is publicly
company, with source code kept proprietary
Definition available for use, modification, and
and usage typically requiring the purchase of
distribution.
licenses.
Emphasizes community-driven development, Prioritizes comprehensive features, robust
Focus
flexibility, and cost-effectiveness. support, and enterprise-grade performance.
Source Code Fully accessible; users can view, modify, and Restricted; source code is not available for
Accessibility distribute the code. public access or modification.
Generally free to use; some may offer paid Requires upfront and/or recurring licensing
Cost
support or enterprise editions. fees, which can be substantial.
Relies on community forums, documentation, Provides dedicated, professional support with
Support and
and voluntary contributions; some vendors service-level agreements (SLAs) and regular
Maintenance
offer paid support. updates.
Highly customizable; feature development is Offers advanced, enterprise-grade features
Features and
driven by community contributions and can be with consistent updates; customization is
Customization
tailored to specific needs. limited to vendor-provided options.
Security updates depend on community Regular, vendor-provided security updates
Security responsiveness; transparency allows for rapid and patches; may include proprietary security
identification and patching of vulnerabilities. features.
Suitable for small to medium-scale
Designed for high scalability and
Scalability and applications; scalability and performance can
performance, catering to large enterprises
Performance vary depending on the system and community
with demanding workloads.
support.
Minimal; users have the freedom to modify High; transitioning to another system can be
Vendor Lock-In and migrate systems without significant challenging and costly due to proprietary
restrictions. formats and dependencies.
Strong community engagement with Development and innovation are managed by
Community collaborative development and shared the vendor's internal teams, focusing on
Involvement knowledge; innovation is driven by a diverse market demands and proprietary
group of contributors. advancements.
Oracle Database, Microsoft SQL Server,
Examples MySQL, PostgreSQL, MongoDB.
IBM Db2.

Which Database is Better?


In conclusion, it is important to remember that both Commercial and Open Source database
have their Advantages and Disadvantages. If we consider which Database is better, then in
most cases it makes sense to choose Open Source as compared to Commercial Database
because :
 Open Source database is Cost-effective.
 Better quality source code.
 More secure.
 More preferred.
Relational database design (RDD) models’ information and data into a set of tables with
rows and columns. Each row of a relation/table represents a record, and each column
represents an attribute of data. The Structured Query Language (SQL) is used to manipulate
relational databases. The design of a relational database is composed of four stages, where
the data are modeled into a set of related tables. The stages are −
 Define relations/attributes
 Define primary keys
 Define relationships
 Normalization
Relational databases differ from other databases in their approach to organizing data and
performing transactions. In an RDD, the data are organized into tables and all types of data
access are carried out via controlled transactions. Relational database design satisfies the
ACID (atomicity, consistency, integrity, and durability) properties required from a database
design. Relational database design mandates the use of a database server in applications for
dealing with data management problems.
Relational Database Design Process
Database design is more art than science, as you have to make many decisions. Databases are
usually customized to suit a particular application. No two customized applications are alike,
and hence, no two databases are alike. Guidelines (usually in terms of what not to do instead
of what to do) are provided in making these design decision, but the choices ultimately rest
on the designer.
Step 1 − Define the Purpose of the Database (Requirement Analysis)
 Gather the requirements and define the objective of your database.
 Drafting out the sample input forms, queries and reports often help.
Step 2 − Gather Data, Organize in tables and Specify the Primary Keys
 Once you have decided on the purpose of the database, gather the data that are
needed to be stored in the database. Divide the data into subject-based tables.
 Choose one column (or a few columns) as the so-called primary key, which
uniquely identifies the each of the rows.
Step 3 − Create Relationships among Tables
A database consisting of independent and unrelated tables serves little purpose (you may
consider using a spreadsheet instead). The power of a relational database lies in the
relationship that can be defined between tables. The most crucial aspect in designing a
relational database is to identify the relationships among tables. The types of relationship
include:
 A one-to-one relationship exists when each row in one table has only one related
row in a second table. For example, a university may decide to assign one faculty
member to one room. Thus, one room can only have one instructor assigned to it at
a given time. The university may also decide that a department can only have one
Dean. Thus, only one individual can be the head of a department.
 A one-to-many relationship exists when each row in one table has many related
rows in another table. For example, one instructor can teach many classes.
 A many-to-many relationship exists when a row in one table has many related rows
in a second table. Likewise, those related rows have many rows in the first table. A
student can enroll in many courses, and courses can contain many students.
Column Data Types
You need to choose an appropriate data type for each column. Commonly data types include
integers, floating-point numbers, string (or text), date/time, binary, collection (such as
enumeration and set).
Step 4 − Refine & Normalize the Design
For example,
 adding more columns,
 create a new table for optional data using one-to-one relationship,
 split a large table into two smaller tables,
 Other methods.
Normalization
Normal Forms in DBMS

Normal
Forms Description of Normal Forms

First Normal A relation is in first normal form if every attribute in that relation is single-
Form (1NF) valued attribute.

Second A relation that is in First Normal Form and every non-primary-key attribute
Normal Form is fully functionally dependent on the primary key, then the relation is
(2NF) in Second Normal Form (2NF).

A relation is in the third normal form, if there is no transitive dependency for


non-prime attributes as well as it is in the second normal form. A relation is
in 3NF if at least one of the following conditions holds in every non-trivial
function dependency X –> Y.
 X is a super key.
Third Normal  Y is a prime attribute (each element of Y is part of some candidate
Form (3NF) key).

Boyce-Codd For BCNF the relation should satisfy the below conditions
Normal Form
 The relation should be in the 3rd Normal Form.
(BCNF)
Normal
Forms Description of Normal Forms

 X should be a super-key for every functional dependency (FD) X−>Y


in a given relation.

A relation R is in 4NF if and only if the following conditions are satisfied:


Fourth
 It should be in the Boyce-Codd Normal Form (BCNF).
Normal Form
(4NF)  The table should not have any Multi-valued Dependency.

A relation R is in 5NF if and only if it satisfies the following conditions:


 R should be already in 4NF.
Fifth Normal
Form (5NF)  It cannot be further non loss decomposed (join dependency).

Domain:
A domain defines the permissible set of values that an attribute (or column) in a database can
hold. It establishes constraints on the data type and format for a given attribute, ensuring that
only valid and consistent data is stored. For instance, if we have an attribute "BirthMonth,"
its domain might be the set of all twelve month names (e.g., January, February, etc.). This
means that any value entered into the "BirthMonth" column must be one of these specified
month names
Data Dependency
 A dependency is a constraint that governs or defines the relationship between two
or more attributes.
 In a database, it happens when information recorded in the same table uniquely
determines other information stored in the same table.
 This may also be described as a relationship in which knowing the value of one
attribute (or collection of attributes) in the same table tells you the value of another
attribute (or set of attributes).
 It's critical to understand database dependencies since they serve as the foundation
for database normalization.
Armstrong’s Axioms
Armstrong’s axioms, introduced by William W. Armstrong in 1974, are a set of inference
rules utilized to deduce all the functional dependencies within a relational database. These
rules, when applied to a set of functional dependencies denoted as F+, are both sound,
ensuring that only dependencies within the closure set F are generated, and complete,
meaning that applying these rules iteratively will derive all functional dependencies within
the closure F+.
Types of Axioms
There are 3 types of Axioms:
1. Reflexivity Axiom:
For a set of attributes A and a subset B of A, if B is a subset of A (B⊆A), then A→B.
This axiom represents a trivial property where a set of attributes implies itself.
2. Augmentation Axiom:
If A→B is true and Y is a set of attributes, then AY→BY is also true. This axiom
demonstrates that adding attributes to dependencies does not alter the fundamental
dependencies. If A→B holds, AC→BC holds for any set of attributes C.
3. Transitivity Axiom:
If A→B holds and B→C holds, then A→C also holds, similar to the transitive rule in
algebra. Functionally, it means that if A determines B and B determines C, then A
determines C. If X→Y and Y→Z are true, then X→Z is also true.
Rules of Axioms
Below are the rules of Axioms in DBMS:
1. Decomposition
If A→BC, then A→B and A→C
Proof:
A→BC (given)_______________ (i)
BC→B (reflexivity)____________ (ii)
A→B (transitivity from i and ii)
2. Composition
If A→B and C→D then AC→BD
Proof
A→B_________(i)
C→D_________(ii)
AC→BC________(iii) (Augmentation of i and C)
AC→B________(iv) Decomposition of iii)
AC→AD_______(v) (Augmentation of ii and A)
AC→D___________(vi) (Decomposition of v)
AC→BD________ (Union iv and vi)
3. Union (Notation)
If A→B and A→C then A→BC
Proof
A→B________(i) (given)
A→C________(ii) (given)
A→AC_______(iii) (Augmentation of ii and A)
AC→BC______(iv) (Augmentation of i and C)
A→BC________(transitivity of iii and ii)
4. Pseudo transitivity
If A→B and BC→D then AC→D
Proof
A→B__________(i) (Given)
BC→D________(ii) (Given)
AC→BC_______(iii) (Augmentation of i and C)
AC →D_________(Transitivity of iii and ii)
5. Self-determination
A→A for any given A.
This rule directly follows the Axiom of Reflexivity.
6. Extensivity
Extensivity is a particular case of augmentation where C=A
If A→B, then A→AB
In the sense that augmentation can be proven from extensivity and other axioms, extensivity
can replace augmentation as an axiom.
Proof
AC→A____(i)
A→B________(ii)
AC→B________(iii) (Transitivity of i and ii)
AC→ABC_______(iv) (Extensivity of iii)
ABC→BC______(v) (Reflexivity)
AC→BC (Transitivity of iv and v)
What is Armstrong Relation in DBMS?
An Armstrong relation refers to a relation that satisfies all the functional dependencies
present in the closure F+ and no other dependencies, based on a given set of functional
dependencies F. However, it is worth noting that for a given set of dependencies, the
smallest possible Armstrong relation can have a size that grows exponentially with the
number of attributes in the dependencies being considered.
Normalization
 Normalization is the process of organizing database data by minimizing redundancy
and eliminating anomalies such as update, insertion, and deletion issues.
 It involves breaking a single table into smaller tables and linking them through
relationships, following different normal forms to reduce redundancy.
 Normalization continues until the database follows the Single Responsibility
Principle (SRP), ensuring that each table has a specific role for better efficiency and
consistency.
Why do we need normalization in databases?
Redundancy in data occurs when the same piece of information exists in a single database.
Database redundancy can lead to many drawbacks and introduces three anomalies (or
abnormalities). These anomalies are-
Insertion Anomaly
This anomaly occurs when specific data cannot be inserted into the table or database due to
the absence of some other data where both of these are independent of each other.
Deletion Anomaly
While deleting some data, when some critical information is lost that was necessary to
maintain the integrity of data, it is known as a deletion anomaly.
Updation / Modified Anomaly
 An update anomaly occurs when modifying a single piece of data requires changes in
multiple rows, leading to potential data inconsistency.
 It increases storage costs and database complexity due to redundant data, making
maintenance more challenging.
 Normalization helps eliminate redundancy, optimizing the database and ensuring data
consistency.
Types of DBMS Normal Form
Normalization in a database is done through a series of normal forms.

Normal Description
Form
1NF If a table has no repeated groups, it is in 1NF.

2NF If a table is in 1NF and every non-key attribute is fully dependent on the primary
key, then it is in 2NF.

3NF If a table is in 2NF and has no transitive dependencies, it is in 3NF.

BCNF If a table is in 3NF and every non-prime attribute fully dependent on the candidate
keys, then it is in BCNF.

4NF If a table is in BCNF and has no multi-valued dependencies, it is in 4NF.

First Normal Form (1NF)


In 1NF, every database cell or relation contains an atomic value that can’t be further divided,
i.e., the relation shouldn’t have multivalued attributes.
Example:
The following table contains two phone number values for a single attribute.

Emp_ID Student Name Phone Number

1 John 12345767890

2 Claire 9242314321, 7689025341

So to convert it into 1NF, we decompose the table as the following -

Emp_ID Student Name Phone Number

1 John 12345767890

2 Claire 9242314321

2 Claire 7689025341

Here, we can notice data repetition, but 1NF doesn’t care about it.
Second Normal Form (2NF)
In 2NF, the relation present should be 1NF, and no partial dependency should exist. Partial
dependency is when the non-prime attributes depend entirely on the candidate or primary
key, even if the primary key is composite.
Example 1: (depicting partial dependency issues)
If given with a relation R(A, B, C, D) where we have {A, B} as the primary key where A
and B can’t be NULL simultaneously, but both can be NULL independently and C, D are
non-prime attributes. If B is NULL, and we are given the functional dependency, say,
B → C. So can this ever hold?
As B contains NULL, it can never determine the value of C. So, as B → C is a partial
dependency, it creates a problem. Therefore, the non-prime attributes cannot be determined
by a part of the primary key. We can remove the partial dependency present by creating two
relations ( the 2NF conversion)-
Relation 1 = R1(ABD), where {A, B} is the primary key. AB determines D.
Relation 2 = R1(BC), where B is the primary key. And from this, B determines C.
Example 2:
Consider the following table. Its primary key is {StudentId, ProjectId}.
The Functional dependencies given are -
StudentId → StudentName
ProjectId → ProjectName

Student Project
StudentId ProjectId
Name Name

1 P2 John IOT

2 P1 Claire Cloud

3 P7 Clara IOT

4 P3 Abhk Cloud

As it represents partial dependency, we decompose ProjectId Project Name


the table as follows -
P2 IOT
StudentId ProjectId Student Name
P1 Cloud
1 P2 John
P7 IOT
2 P1 Claire
P3 Cloud
3 P7 Clara

4 P3 Abhk

Here projectId is mentioned in both tables to set up a relationship between them.


Third Normal Form (3NF)
In 3NF, the given relation should be 2NF, and no transitivity dependency should exist, i.e.,
non-prime attributes should not determine non-prime attributes.
Example:
Consider the following scenario where the functional dependencies are -
A → B and B → C, where A is the primary key.
As here, a non-prime attribute can be determined by a prime attribute, which implies
transitivity dependency exists. To remove this, we decompose this and convert it into 3NF.
So, we create two relations -
R1(A, B), where A is the primary key and R2(B, C), where B is the primary key.
Boyce-Codd Normal Form(BCNF)
In BCNF, the relation should be in [Link] given a relation, say A → B, A should be a super
key in this. This implies that no prime attribute should be determined or derived from any
other prime or non-prime attribute.
Example:
Given the following table. Its candidate keys are {Student, Teacher} and {Student, Subject}.
The Functional dependencies given are -
{Student, Teacher} → Subject
{Student, Subject} → Teacher
Teacher → Subject

Student Name Subject Teacher

John Physics Olivia

Claire English Emma

Clara Physics Olivia

Abhk English Sophia

As this table is not in BCNF form, so we decompose it into the following tables:

Student Name Teacher

John Olivia

Clara Emma

Robin Olivia
Teacher Subject

Student Name Teacher Olivia Physics

Kaley Sophia Emma English

Here Teacher is mentioned in both tables Olivia Physics to set up a


relationship between them.
Sophia English
Fourth Normal Form (4NF)
For any relation to be in 4NF, it should have no multi-valued dependencies and is in Boyce
Codd Normal Form. It simplifies the database by eliminating the non-trivial multi-valued
dependencies besides those including the candidate key.
Example:
Consider the following table:

Student Subject Student Phone Number

John Physics 12345767890

Clara English 9242314321

Robin Mathematics 7689025341

Kaley Database 9878632656

In the above table, the subject and student phone numbers are two independent entities,
showing no relation between the subject and
phone number. So to convert it into 4NF, we Student Subject
decompose the table as -
John Physics
Student Student Phone Number
Clara English
John 12345767890
Robin Mathematics
Clara 9242314321
Kaley Database
Robin 7689025341

Kaley 9878632656

Here the Student is mentioned in both tables to set up a relationship between them.
Importance of Normalization in Database Design
 Reduces Data Redundancy: Normalization minimizes the duplication of data by
organizing it into related tables, ensuring that each piece of information is stored only
once.
 Improves Data Integrity: By enforcing rules and constraints, normalization ensures the
accuracy and consistency of data across the database.
 Simplifies Database Maintenance: A normalized database is easier to update, modify,
and maintain, as changes need to be made in only one place.
 Enhances Query Performance: Properly normalized databases can improve query
efficiency by reducing the amount of data scanned during operations.
 Prevents Update Anomalies: Normalization eliminates issues like insertion, deletion,
and update anomalies, ensuring that data remains consistent.
 Facilitates Scalability: A well-structured, normalized database can handle growth and
changes in data requirements more effectively.
 Supports Better Design Practices: Normalization encourages a logical and organized
approach to database design, making it easier to understand and manage.
Dependency Preservation:
A decomposition is considered dependency-preserving if, after the decomposition, all the
original functional dependencies are still valid. This means that either:
 Each functional dependency from the original relation appears directly in one of the
decomposed relations.
 The union of the functional dependencies in the decomposed relations is equivalent to
the original set of functional dependencies.
Why is it important?
 Data Integrity: Dependency preservation ensures that the relationships between data
are maintained even after decomposition.
 Reduced Redundancy: By breaking down relations, dependency preservation helps to
minimize data redundancy and improve data integrity.
 Improved Query Performance: Smaller, more focused relations can lead to faster
query performance.
 Easier Maintenance and Updates: Decomposed relations are easier to maintain and
update.
Example:
 Let's say you have a relation R(A, B, C) with functional dependencies A -> B and B -> C.
 If you decompose R into R1(A, B) and R2(B, C), then the decomposition is dependency-
preserving because both original dependencies are present in the decomposed relations.
 If you decompose R into R1(A, B) and R2(A, C), then the decomposition is not
dependency-preserving because the dependency B -> C is not present in the decomposed
relations.
Lossless Join Decomposition
 Consider there is a relation R which is decomposed into sub relations R 1, R2, …., Rn.
 This decomposition is called lossless join decomposition when the join of the sub
relations results in the same relation R that was decomposed.
 For lossless join decomposition, we always have-
R1 ⋈ R2 ⋈ R3 ……. ⋈ Rn = R
where ⋈ is a natural join operator.

Example:
Consider the following relation R( A , B , C )-
Consider this relation is decomposed into two sub relations R1( A , B ) and R2( B , C )-

The two sub relations are-

Now, let us check whether this decomposition is lossless or not. For lossless decomposition,
we must have-
R1 ⋈ R2 = R
Now, if we perform the natural join ( ⋈ ) of the sub relations R1 and R2 , we get-

This relation is same as the original relation R. Thus, we conclude that the above
decomposition is lossless join decomposition.
Note
 Lossless join decomposition is also known as non-additive join decomposition. This is
because the resultant relation after joining the sub relations is same as the decomposed
relation.
 No extraneous tuples appear after joining of the sub-relations.
2. Lossy Join Decomposition
 Consider there is a relation R which is decomposed into sub relations R 1 , R2 , …. , Rn.
 This decomposition is called lossy join decomposition when the join of the sub relations
does not result in the same relation R that was decomposed.
 The natural join of the sub relations is always found to have some extraneous tuples.
 For lossy join decomposition, we always have-
R1 ⋈ R2 ⋈ R3 ……. ⋈ Rn ⊃ R
where ⋈ is a natural join operator.

Example:
Consider the above relation R( A , B , C )
Consider this relation is decomposed into two sub relations as R1( A , C ) and R2( B , C )-

The two sub relations are-

Now, let us check whether this decomposition is lossy or not.


For lossy decomposition, we must have-
R1 ⋈ R2 ⊃ R
Now, if we perform the natural join ( ⋈ ) of the sub relations R1 and R2 we get-

Clearly, R1 ⋈ R2 ⊃ R. Thus, we conclude that the above decomposition is lossy join


This relation is not same as the original relation R and contains some extraneous tuples.

decomposition.
Note
 Lossy join decomposition is also known as careless decomposition.
 This is because extraneous tuples get introduced in the natural join of the sub-relations.
 Extraneous tuples make the identification of the original tuples difficult.
Query
A query is a kind of request that is sent to the Database for retrieval of data. In a query, we
pass some specific condition, and then it matches the specific data if it is present in the
database.
We can write queries in SQL( Structured Query Language). This language is a way of
communication between the user and the relational database. We can perform different
operations on data by writing queries. We can select the data, delete the data, insert
something, etc.
Query Processing
Query Processing is the activity performed in extracting data from the database. In query
processing, it takes various steps for fetching the data from the database. The steps involved
are:
1. Parsing and translation
2. Optimization
3. Evaluation
The query processing works in the following way:
Parsing and Translation
As query processing includes certain activities for data retrieval. Initially, the given user
queries get translated in high-level database languages such as SQL. It gets translated into
expressions that can be further used at the physical level of the file system. After this, the
actual evaluation of the queries and a variety of query -optimizing transformations and takes
place. Thus before processing a query, a computer system needs to translate the query into a
human-readable and understandable language. Consequently, SQL or Structured Query
Language is the best suitable choice for humans. But, it is not perfectly suitable for the
internal representation of the query to the system. Relational algebra is well suited for the
internal representation of a query. The translation process in query processing is similar to
the parser of a query. When a user executes any query, for generating the internal form of the
query, the parser in the system checks the syntax of the query, verifies the name of the
relation in the database, the tuple, and finally the required attribute value. The parser creates
a tree of the query, known as 'parse-tree.' Further, translate it into the form of relational
algebra. With this, it evenly replaces all the use of the views when used in the query.
Thus, we can understand the working of a query processing in the below-described diagram:
Evaluation
For this, with addition to the relational algebra translation, it is required to annotate the
translated relational algebra expression with the instructions used for specifying and
evaluating each operation. Thus, after translating the user query, the system executes a query
evaluation plan.
Query Evaluation Plan
 In order to fully evaluate a query, the system needs to construct a query evaluation plan.
 The annotations in the evaluation plan may refer to the algorithms to be used for the
particular index or the specific operations.
 Such relational algebra with annotations is referred to as Evaluation Primitives. The
evaluation primitives carry the instructions needed for the evaluation of the operation.
 Thus, a query evaluation plan defines a sequence of primitive operations used for
evaluating a query. The query evaluation plan is also referred to as the query execution
plan.
 A query execution engine is responsible for generating the output of the given query. It
takes the query execution plan, executes it, and finally makes the output for the user
query.
Optimization
 The cost of the query evaluation can vary for different types of queries. Although the
system is responsible for constructing the evaluation plan, the user does need not to write
their query efficiently.
 Usually, a database system generates an efficient query evaluation plan, which minimizes
its cost. This type of task performed by the database system and is known as Query
Optimization.
 For optimizing a query, the query optimizer should have an estimated cost analysis of
each operation. It is because the overall operation cost depends on the memory
allocations to several operations, execution costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
Evaluation of Expressions
For evaluating an expression that carries multiple operations in it, we can perform the
computation of each operation one by one. However, in the query processing system, we use
two methods for evaluating an expression carrying multiple operations. These methods are:
1. Materialization
2. Pipelining
Let's take a brief discussion of these methods.
Materialization
In this method, the given expression evaluates one relational operation at a time. Also, each
operation is evaluated in an appropriate sequence or order. After evaluating all the
operations, the outputs are materialized in a temporary relation for their subsequent uses. It
leads the materialization method to a disadvantage. The disadvantage is that it needs to
construct those temporary relations for materializing the results of the evaluated operations,
respectively. These temporary relations are written on the disks unless they are small in size.
Pipelining
Pipelining is an alternate method or approach to the materialization method. In pipelining, it
enables us to evaluate each relational operation of the expression simultaneously in a
pipeline. In this approach, after evaluating one operation, its output is passed on to the next
operation, and the chain continues till all the relational operations are evaluated thoroughly.
Thus, there is no requirement of storing a temporary relation in pipelining. Such an
advantage of pipelining makes it a better approach as compared to the approach used in the
materialization method. Even the costs of both approaches can have subsequent differences
in-between. But, both approaches perform the best role in different cases. Thus, both ways
are feasible at their place.
External Sort-Merge Algorithm
Till now, we saw that sorting is an important term in any database system. It means
arranging the data either in ascending or descending order. We use sorting not only for
generating a sequenced output but also for satisfying conditions of various database
algorithms. In query processing, the sorting method is used for performing various relational
operations such as joins, etc. efficiently. But the need is to provide a sorted input value to the
system. For sorting any relation, we have to build an index on the sort key and use that index
for reading the relation in sorted order. However, using an index, we sort the relation
logically, not physically. Thus, sorting is performed for cases that include:
Case 1: Relations that are having either small or medium size than main memory.
Case 2: Relations having a size larger than the memory size.
In Case 1, the small or medium size relations do not exceed the size of the main memory. So,
we can fit them in memory. So, we can use standard sorting methods such as quicksort,
merge sort, etc., to do so.
For Case 2, the standard algorithms do not work properly. Thus, for such relations whose
size exceeds the memory size, we use the External Sort-Merge algorithm.
The sorting of relations which do not fit in the memory because their size is larger than the
memory size. Such type of sorting is known as External Sorting. As a result, the external-
sort merge is the most suitable method used for external sorting.
External Sort-Merge Algorithm
Here, we will discuss the external-sort merge algorithm stages in detail:
In the algorithm, M signifies the number of disk blocks available in the main memory buffer
for sorting.
Stage 1: Initially, we create a number of sorted runs. Sort each of them. These runs contain
only a few records of the relation.
i = 0;

repeat
read either M blocks or the rest of the relation having a smaller size;
sort the in-memory part of the relation;
write the sorted data to run file Ri;
i =i+1;

Until the end of the relation


In Stage 1, we can see that we are performing the sorting operation on the disk blocks. After
completing the steps of Stage 1, proceed to Stage 2.
Stage 2: In Stage 2, we merge the runs. Consider that total number of runs, i.e., N is less
than M. So, we can allocate one block to each run and still have some space left to hold one
block of output. We perform the operation as follows:
read one block of each of N files Ri into a buffer block in memory;
repeat
select the first tuple among all buffer blocks (where selection is made in sorted orde
r);
write the tuple to the output, and then delete it from the buffer block;
if the buffer block of any run Ri is empty and not EOF(Ri)
then read the next block of Ri into the buffer block;
Until all input buffer blocks are empty
After completing Stage 2, we will get a sorted relation as an output. The output file is then
buffered for minimizing the disk-write operations. As this algorithm merges N runs, that's
why it is known as an N-way merge.
However, if the size of the relation is larger than the memory size, then either M or more
runs will be generated in Stage 1. Also, it is not possible to allocate a single block for each
run while processing Stage 2. In such a case, the merge operation process in multiple passes.
As M-1 input buffer blocks have sufficient memory, each merge can easily hold M-1 runs as
its input. So, the initial phase works in the following way:
 It merges the first M-1 runs for getting a single run for the next one.
 Similarly, it merges the next M-1 runs. This step continues until it processes all the initial
runs. Here, the number of runs has a reduced M-1 value. Still, if this reduced value is
greater than or equal to M, we need to create another pass. For this new pass, the input
will be the runs created by the first pass.
 The work of each pass will be to reduce the number of runs by M-1 value. This job
repeats as many times as needed until the number of runs is either less than or equal to M.
 Thus, a final pass produces the sorted output.
Estimating cost for External Sort-Merge Method
The cost analysis of external sort-merge is performed using the above-discussed stages in the
algorithm:
 Assume br denote number of blocks containing records of relation r.
 In the first stage, it reads each block of the relation and writes them back. It takes a total
of 2br block transfers.
 Initially, the value of the number of runs is [br/M]. As the number of runs decreases by
M-1 in each merge pass, so it needs a total number of [log M-1(br/M)] merge passes.
Every pass read and write each block of the relation only once. But with two exceptions:
 The final pass can give a sorted output without writing its result to the disk
 There might be chances that some runs may not be read or written during the pass.
Neglecting such small exceptions, the total number of block transfers for external sorting
comes out:
b r (2 Γ log M-1 (b r /M) ˥ + 1)
We need to add the disk seek cost because each run needs seeks reading and writing data for
them. If in Stage 2, i.e., the merge phase, each run is allocated with b b buffer blocks or each
run reads bb data at a time, then each merge needs [br /bb] seeks for reading the data. The
output is written sequentially, so if it is on the same disk as input runs, the head will need to
move between the writes of consecutive blocks. Therefore, add a total of 2[br /bb] seeks for
each merge pass and the total number of seeks comes out:
2 Γ b r /M ˥ + Γb r /b b ˥(2ΓlogM-1(b r /M)˥ - 1)
Thus, we need to calculate the total number of disk seeks for analyzing the cost of the
External merge-sort algorithm.
Hash Join Algorithm
The Hash Join algorithm is used to perform the natural join or equi join operations. The
concept behind the Hash join algorithm is to partition the tuples of each given relation into
sets. The partition is done on the basis of the same hash value on the join attributes. The hash
function provides the hash value. The main goal of using the hash function in the algorithm
is to reduce the number of comparisons and increase the efficiency to complete the join
operation on the relations.
For example, suppose there are two tuples a and b where both of them satisfy the join
condition. It means they have the same value for the join attributes. Suppose that both a and
b tuples consist of a hash value as i. It implies that tuple a should be in a i, and tuple b should
be in bi. Thus, we only compare a tuples in ai with b tuples of bi. We do not need to compare
the b tuples in any other partition. Therefore, in this way, the hash join operation works.
Hash Join Algorithm
//Partition s//
for each tuple ts in s do begin
i = h(ts [JoinAttrs]);
Hsi = Hsi U {ts};
end
//Partition r//
for each tuple tr in r do begin
i = h(tr[JoinAttrs]);
Hri = Hri U {tr};
end
//Perform the join operation on each partition//
for i= 0 to nh do begin
read Hsi and build an in-memory hash index on it;
for each tuple tr in Hri do begin
probe the hash index on Hsi to locate all tuples
such that ts[JoinAttrs] = tr[JoinAttrs];
for each matching tuple ts in Hsi do begin
add tr ⋈ ts to the result;
end
end
end
It is the Hash join algorithm in which we have computed the natural join of two given
relations r and s. In the algorithm, there are various terms used:
tr ⋈ ts: It defines the concatenation of the attributes of tuple tr and ts, which is further
followed by projecting out the repeated attributes.
tr and ts: These are the tuples of relations r and s, respectively.
Let's understand the hash join algorithm with the following steps:
Step 1: In the algorithm, firstly, we have partitioned both relations r and s.
Step 2: After partitioning, we perform a separate indexed nested-loop join on each of the
partition pairs i using for loop as i = 0 to nh.
Step 3: For performing the nested-loop join, it initially creates a hash index on each s i and
then probes with tuples from ri. In the algorithm, relation r is the probe input, and relation s
is the build input.
There is a benefit of using the Hash Join algorithm i.e., the hash index on si is built-in
memory, so for fetching the tuples, we do not need to access the disk. It is good to use
smaller input relations as the build relations.
Recursive Partitioning in Hash Join
Recursive partitioning is the one in which the system repeats the partitioning of the input
until each partition of the build input fits into the memory. The recursive partitioning is
needed when the value of nh is greater than or equal to the number of memory blocks. It
becomes difficult to split the relation in one pass since there can be insufficient buffer
blocks. So, it's better to split the relation in repeated passes. In one pass, we can split the
input as several partitions because there are sufficient blocks available to be used as output
buffers. Each bucket build by the pass is read separately and further partitioned in the next
pass so as to create smaller partitions. Also, the hash functions are different in different
passes. So, it is better to use recursive partitioning for handling such cases.
Overflows in Hash Join
The overflow condition in hash-table occurs in any partition i of the build relation s due to
the following cases:
Case 1: When the hash index on si is greater than the main memory, the overflow condition
occurs.
Case 2: When there are multiple tuples in the build relation with the same values for the join
attributes.
Case 3: When the hash function does not hold randomness and uniformity characteristics.
Case 4: When some of the partitions have more tuples than the average and others have
fewer tuples, then such type of partitioning is known as skewed.
Handling the Overflows
We can handle such cases of hash-table overflows using various methods.
 Using Fudge Factor
We can handle a small amount of skew by increasing the number of partitions with the use of
the fudge factor. The fudge factor is a small value that increases the number of partitions.
So, it will help to reduce the expected size of each partition, including their hash index less
than the memory size. Unfortunately, the use of a fudge factor makes the user conservative
on the size of the partitions. Thus, the chances of overflow are still possible. However, the
use of the fudge factor is suitable for handling small overflows, but it is not sufficient for
handling large overflows in the hash-table.
As a result, we have two more methods for handling the overflows.
1. Overflow Resolution
The overflow resolution method is applied during the build phase when a hash index
overflow is detected. The overflow resolution works in the following way:
It finds si for any partition i if having size larger than the memory size. It again partitions
such build relation si into smaller partitions through a different hash function. Similarly, it
partitions the probe relation ri through the new hash function, and only those tuples are
joined, which are having matching partitions. But, it is a less careful approach because this
method waits for such conditions to occur, and then take the necessary actions to resolve the
problem.
2. Overflow Avoidance
The overflow avoidance method uses a careful approach while partitioning in order to avoid
the occurrence of overflow in the build phase. The overflow avoidance works in the
following way:
It initially partitions the build relation s into several small partitions and then combines some
of the partitions. These partitions are combined in such a way that each combined partition
fits in the memory. Similarly, it partitions the probe relation r as the combined partitions on
s. But, the size of ri does not matter in this method.
Both overflow resolution and overflow avoidance methods may fail on some partitions if a
large number of tuples in s have the same value for the join attributes. In such a case, it is
better to use block nested-loop join rather than applying the hash join technique for
completing the join operation on those partitions.
Cost Analysis of Hash Join
For analyzing the cost of a hash join, we consider that no overflow occurs in the hash join.
We will consider only two cases where:
1. Recursive partitioning is not needed
We need to read and write relations r and s completely for partitioning them. For this, a total
of 2(b r + b s ) block transfers are required. The term b r and b s are the number of blocks
holding records of relations r and s. Both relations read each partition once for more b r +
bs blocks transfers. However, the partitions might have occupied slightly more number of
blocks than br + bs, which results in partially filled blocks. To access such partially filled
blocks can include the overhead of 2nh approximately for each relation. Thus, a hash join
cost estimates need:
Number of block transfers = 3(br + bs) + 4nh
Here, we can neglect the overhead value of 4nh since it is much smaller than br + bs value.
Number of disk seeks = 2(Γbr/bb ꓶ + Γbs/bb ꓶ) + 2nh
Here, we have assumed that each input buffers are allocated with bb blocks, and the build, as
well as probe phase, needs only one seek for each nh partition of the relation, as we can read
each partition sequentially.
2. Recursive partition is required
In this case, each pass reduces the size of each partition by M-1 expected factor, and also
passes are repeated until it makes the size of each partition as M blocks at most. Therefore,
for partitioning the relation s, we need:
Number of passes = ΓlogM-1(bs) - 1 ꓶ
The number of passes required in the partitioning of the build and probe relations is the
same. As in each pass, each block of s is read and written out and needs a total of 2b sΓlogM-
1(bs) - 1 ꓶ block transfers for splitting relation s. Thus, a hash join cost estimates need:

Number of block transfers = 2(br + bs)ΓlogM-1(bs) - 1 ꓶ + br + bs


Number of disk seeks = 2(Γbr/bb ꓶ + Γbs/bb ꓶ)ΓlogM-1(bs) - 1 ꓶ
Here, we assume that for buffering each partition we allocate bb blocks to them. Also, we
have neglected a relatively small number of seeks during the build and probe phase.
As a result, the hash join algorithm can be further improved if the size of the main memory
increases or is large.
Hybrid Hash Join
It is a type of hash join that is useful for performing the join operations in which the memory
size is relatively large. But still, the build relation does not fit in the memory completely. So,
the hybrid hash join algorithm resolves the drawback of the hash join algorithm.
Materialization in Query Processing
Materialization is an easy approach for evaluating multiple operations of the given query and
storing the results in the temporary relations. The result can be the output of any join
condition, selection condition, and many more. Thus, materialization is the process of
creating and setting a view of the results of the evaluated operations for the user query. It is
similar to the cache memory where the searched data get settled temporarily. We can easily
understand the working of materialization through the pictorial representation of the
expression. An operator tree is used for representing an expression.
The materialization uses the following approach for evaluating operations of the given
expression:
 In the operator tree, we begin from the lowest-level operations (at the bottom of the
tree) in the expression. The inputs to the lowest level operations are stored in the
form of relations in the database. For example, suppose we want to fetch the name
of the student as 'John' from the 'Student' relation.
The relation expression will be:
σ name= "John" (Student)
In this example, there is only one operation of selecting the name from the given
relation. Also, this operation is the lowest-level operation. So, we will begin by
evaluating this selection operation.
 Now, we will use an appropriate algorithm which is suitable for evaluating the
operation. Like in our example, we will use an appropriate selection algorithm for
retrieving the name from the Student relation.
 Then, store the result of the operation in the temporary relations.
 We use these temporary relations for evaluating the next-level operation in the
operator tree. The result works as an input for every next level up in the tree.
 Repeat these steps until all operators at the root of tree will be evaluated, and the
final result of the expression will be generated.
We also call the described evaluation as Materialized evaluation because the result of one
operation is materialized and used in the evaluation of next operation and so on.
Cost Estimation of Materialized Evaluation
The process of estimating the cost of the materialized evaluation is different from the process
of estimating the cost of an algorithm. It is because in analyzing the cost of an algorithm, we
do not include the cost of writing the results on to the disks. But in the evaluation of an
expression, we not only compute the cost of all operations but also include the cost of
writing the result of currently evaluated operation to disk.
To estimate the cost of the materialized evaluation, we consider that results are stored in the
buffer, and when the buffer fills completely, the results are stored to the disk.
Let, a total of br number of blocks are written. Thus, we can estimate br as:
br = nr/fr.
Here, nr is the estimated number of tuples in the result relation r and fr is the number of
records of relation r that fits in a block. Thus, fr is a blocking factor of the resultant relation
r.
With this, we also need to calculate the transfer time by estimating the number of required
disks. It is so because the disk head may have moved in-between the successive writes of the
block. Thus, we can estimate:
Number of seeks = Γ br/ bb ꓶ
Here, bb defines the size of the output buffer, i.e., measured in blocks.
We can optimize the cost estimation of the materialization process by using the concept of
double buffering. Double buffering is the method of using two buffers, where one buffer
executes the algorithm continuously, and the other is being written out. It makes the
algorithm to execute more fastly by performing CPU activities parallel with I/O activities.
We can also reduce the number of seeks by allocating the extra blocks to the output buffer
and altogether writing out multiple blocks.
Merge Join Algorithm
The merge joins are used for performing natural joins and equi-joins for given relations r and
s. We use an algorithm for performing the merge join, known as the Merge Join algorithm.
It is also known as a sort-merge-join algorithm.
Merge Join Algorithm
The merge join algorithm is given below:
pr = address of first tuple of relation r;
ps = address of first of relation s;
while (ps!=null && pr!=null) do begin
ts = tuple to which ps points;
Ss = {ts};
set ps to point the next tuple of relation s;
done = false;
while (!done && ps!=null) do begin
ts? = tuple to which ps points;
if (ts?[JoinAttrs] = ts[JoinAttrs])
begin
Ss = Ss U {ts?};
set ps to point the next tuple of relation s;
end
else
done = true;
end
tr = tuple to which pr points;
while (pr !=null && tr [JoinAttrs] < ts[JoinAttrs]) do begin
for each ts in Ss do begin
add ts ⋈ tr to result;
end
set pr to point nest tuple of r;
tr = tuple to which pr points;
end
In the algorithm, there are various terms used:
JoinAttrs: It denotes the attributes in the intersection of r ꓵ s.
r ꓵ s: The r ꓵ s refers to those attributes which are common in relations r and s.
ts ⋈ tr: A concatenated expression of the attributes of ts and tr tuples. It is further followed by
projecting out repeated attributes.
ts and tr: These are two tuples having the same value of JoinAttrs.
Ss: It reads those join attributes of a group of tuples of a relation which are having the same
values.
In the merge join algorithm, it associates each relation with a pointer. Initially, the pointer
points to the first tuple of the relation and then moves towards the next one as soon the
algorithm proceeds. Also, the algorithm needs that each set of tuples S s fits in the memory
even if the size of the relation s is large. However, if for some attribute values, S s seem larger
than the available memory size, we can perform block nested-loop join for it. Somehow if
the given input relations r and s are not sorted on the join attributes or anyone is unsorted, we
need to sort them before applying the merge join algorithm.
Cost Analysis of Merge Join Algorithm
If the relations are sorted and tuples having the same value on the join attributes are placed
consecutively. Then we need to read each tuple only once, and thus the block will also be
read for once. Thus,
Number of block transfers = br + bs
Also, in both files, the number of block transfers is equal.
Number of Disk seeks = [br/bb] + [bs/bb]
Here, br and bs are the numbers of blocks of the given relations r and s. The term bb means
that we are assuming that bb buffer blocks are allocated to both relations. But, we know that
data transfer is less expensive than disk seeks, so we should allocate multiple blocks to each
given relation. Consequently, it will provide extra memory space too.
Hybrid Merge Join Algorithm
The Hybrid merge join is different from the merge join. In merge join operation, we saw that
it is a must to sort the given relations before applying the merge join technique. However,
both join attributes consist of secondary indices, then also we can perform a variation of the
merge join operations on unsorted tuples too. For doing so, the applied merge join algorithm
will scan the records through the indices, which will enable to retrieve the records in a sorted
manner. Thus, such variation of the merge join operations leads to a significant drawback,
i.e.:
 It is possible that the records might be placed in different file blocks. It means they might
be scattered In several blocks of files. So, for accessing each tuple, we also need to access
the particular file block, and it is a costly step.
For preventing ourselves from such expensive access, we use a new technique which is
known as 'Hybrid Merge Join' technique. The hybrid merge join operation combines the
indices with merge join.
Nested-Loop Join Algorithm
In our previous section, we learned about joins and various types of joins. In this section, we
will know about the nested loop join algorithm.
A nested loop join is a join that contains a pair of nested for loops. To perform the nested
loop join i.e., θ on two relations r and s, we use an algorithm known as the Nested loop join
algorithm. The computation takes place as:
r⋈θs
where r is known as the outer relation and s is the inner relation of the join. It is because
the for loop of r encloses the for loop of s.
Nested-Loop Join Algorithm
Let's discuss the algorithm of nested-loop join:
for each tuple tr in r do begin
for each tuple ts in s do begin
test pair (tr, ts) to test if they satisfy the given join condition ?
if test satisfied
add tr . ts to the result;
end inner loop
end outer loop
In the algorithm, tr and ts are the tuples of relations r and s, respectively. The notation tr. ts is
a tuple constructed by concatenating the attribute values of tuples t r and ts.
With the help of the algorithm, we understood the following points:
 The nested-loop join does not need any indexing similar to a linear file scan for accessing
the data.
 Nested-loop join does not care about the given join condition. It is suitable for each given
join condition.
 The nested-loop join algorithm is expensive in nature. It is because it computes and
examines each pair of tuples in the given two relations.
Cost Analysis of Nested-loop Join Algorithm
For analyzing the cost of the nested-loop join algorithm, consider a number of pairs of tuples
as nr * ns. Here, nr specifies the number of tuples in relation r and ns specifies the number of
tuples in relation s. For computing the cost, perform a complete scan on relation s. Thus,
Total number of block transfers in worst case = nr * bs + br
Total number of seeks required in worst case = nr + br
Here, bs and br are the number of blocks holding the tuples of relation r and s, respectively.
In the best case, both relations r and s have sufficient memory to fit in the memory
simultaneously. So, each block will read only once. Thus,
Total number of block transfers in best case = br + bs.
Total number of seeks required = 2(nr + br)
In case if any one of the relations given fits entirely into the memory, it is a must to use that
relation as the inner relation. It is because we will read the inner relation only once. Thus,
Total number of block transfers in such case = br + bs
Total number of seeks required = 2(nr + br)
Block Nested-Loop Join
Block Nested-Loop Join is a variant of nested-loop join in which each block of the inner
relation is paired with each block of the outer relation. The block nested-loop join saves
major block access in a situation where the buffer size is small enough to hold the entire
relation into the memory. It does so by processing the relations on the basis of per block
rather on the basis of per tuple. Within each pair of blocks, the block nested-loop join pairs
each tuple of one block with each tuple in the other block to produce all pairs of tuples. It
pairs only those tuples that satisfy the given join condition and them to the result.
Block Nested-Loop Join Algorithm
The algorithm that is used to perform the block nested-loop join is known as the Block
Nested-Loop Join algorithm. We will use the same relations r and s in this algorithm.
for each block br of r do begin
for each block bs of s do begin
for each tuple tr in br do begin
for each tuple ts in bs do begin
test pair (tr, ts) to determine if they pass the given join condition
if test passed
add tr . ts to the result;
end
end
end
end
Cost Analysis of Block Nested-Loop Join Algorithm
There is a major difference between the cost of block nested-loop join and nested loop-join
algorithm. In the worst case of block nested-loop join, each block in the inner relation s is
read only for one time for each block in the outer relation r. On the other hand, the nested-
loop join reads each tuple in the inner relation s for one time for each tuple in the outer
relation r. Thus in block nested-loop join,
Total number of block transfers in worst case = br * bs+ br
Total number of seeks required = 2 * br
Here, br and bs are the number of blocks holding records of the given relation r and s,
respectively. Also, each scan of s (inner relation) needs only one seek, and r (outer relation)
requires one seek per block. In the best case, the inner relation fits entirely into memory.
Thus,
Total number of block transfers in best case = br + bs
Total number of seeks required = 2(nr + br)
In the case where none of the given relations r and s fits entirely into the memory, it is
efficient to use the inner relation i.e., s as the outer relation.
Improving Performance of Nested-Loop and Block Nested-Loop Joins
After understanding both joins, it was evaluated that the performance of both the joins can be
further improved:
1. If in an equijoin or natural join, the join attributes form a key on the given inner relation
s, then as soon the first match is found, the inner loop terminates for each outer relation
tuple.
2. Instead of using disk blocks in the block nested-loop join algorithm, we can use the
biggest size that could fit into memory and also leave enough space for the buffers of the
inner relation s and its output. As a result, it will reduce the number of scans of the inner
relation and also minimizes the cost.
3. We can perform scanning of the inner loop in both forward and backward direction in an
alternate manner. This approach reduces the number of requirements of disk accesses by
keeping the requests for disk blocks in order. Ordering of the request also helps to reuse
the remaining data left in the buffer after its previous scan.
4. If an index is present on the join attribute of the inner loop, we can replace the file scans
with highly efficient index lookups. Such type of join method is known as indexed
nested-loop join. Indexed nested-loop join can be used either with existing indexes or
temporary indexes, which are created for the evaluation of the joins. It is an optimized
technique for improving the performance of the nested-loop join.
Selection Operation in Query Processing
Generally, the selection operation is performed by the file scan. File scans are the search
algorithms that are used for locating and accessing the data. It is the lowest-level operator
used in query processing.
Let's see how selection using a file scan is performed.
Selection using File scans and Indices
In RDBMS or relational database systems, the file scan reads a relation only if the whole
relation is stored in one file only. When the selection operation is performed on a relation
whose tuples are stored in one file, it uses the following algorithms:
 Linear Search: In a linear search, the system scans each record to test whether satisfying
the given selection condition. For accessing the first block of a file, it needs an initial
seek. If the blocks in the file are not stored in contiguous order, then it needs some extra
seeks. However, linear search is the slowest algorithm used for searching, but it is
applicable in all types of cases. This algorithm does not care about the nature of selection,
availability of indices, or the file sequence. But other algorithms are not applicable in all
types of cases.
Selection Operation with Indexes
The index-based search algorithms are known as Index scans. Such index structures are
known as access paths. These paths allow locating and accessing the data in the file. There
are following algorithms that use the index in query processing:
 Primary index, equality on a key: We use the index to retrieve a single record that
satisfies the equality condition for making the selection. The equality comparison is
performed on the key attribute carrying a primary key.
 Primary index, equality on nonkey: The difference between equality on key and
nonkey is that in this, we can fetch multiple records. We can fetch multiple records
through a primary key when the selection criteria specify the equality comparison on a
nonkey.
 Secondary index, equality on key or nonkey: The selection that specifies an equality
condition can use the secondary index. Using secondary index strategy, we can either
retrieve a single record when equality is on key or multiple records when the equality
condition is on nonkey. When retrieving a single record, the time cost is equal to the
primary index. In the case of multiple records, they may reside on different blocks. This
results in one I/O operation per fetched record, and each I/O operation requires a seek and
a block transfer.
Selection Operations with Comparisons
For making any selection on the basis of a comparison in a relation, we can proceed it either
by using the linear search or via indices in the following ways:
 Primary index, comparison: When the selection condition given by the user is a
comparison, then we use a primary ordered index, such as the primary B+-tree index. For
example, when A attribute of a relation R compared with a given value v as A>v, then we
use a primary index on A to directly retrieve the tuples. The file scan starts its search
from the beginning till the end and outputs all those tuples that satisfy the given selection
condition.
 Secondary index, comparison: The secondary ordered index is used for satisfying the
selection operation that involves <, >, ≤, or ≥ In this, the files scan searches the blocks of
the lowest-level index.
(< ≤): In this case, it scans from the smallest value up to the given value v.
(>, ≥): In this case, it scans from the given value v up to the maximum value.
However, the use of the secondary index should be limited for selecting a few records. It
is because such an index provides pointers to point each record, so users can easily fetch
the record through the allocated pointers. Such retrieved records may require an I/O
operation as records may be stored on different blocks of the file. So, if the number of
fetched records is large, it becomes expensive with the secondary index.
Implementing Complex Selection Operations
Working on more complex selection involves three selection predicates known as
Conjunction, Disjunction, and Negation.
Conjunction: A conjunctive selection is the selection having the form as:
σ θ1 ꓥ θ2 ꓥ…ꓥ θn (r)
A conjunction is the intersection of all records that satisfies the above selection condition.
Disjunction: A disjunctive selection is the selection having the form as:
σ θ1 ꓦ θ2 ꓦ…ꓦ θn (r)
A disjunction is the union of all records that satisfies the given selection condition θi.
Negation: The result of a selection σ¬θ(r) is the set of tuples of given relation r where the
selection condition evaluates to false. But nulls are not present, and this set is only the set of
tuples in relation r that are not in σθ(r).
Using these discussed selection predicates, we can implement the selection operations by
using the following algorithms:
 Conjunctive selection using one index: In such type of selection operation
implementation, we initially determine if any access path is available for an attribute. If
found one, then algorithms based on the index will work better. Further completion of the
selection operation is done by testing that each selected records satisfy the remaining
simple conditions. The cost of the selected algorithm provides the cost of this algorithm.
 Conjunctive selection via Composite index: A composite index is the one that is
provided on multiple attributes. Such an index may be present for some conjunctive
selections. If the given selection operation proves true on the equality condition on two or
more attributes and a composite index is present on these combined attribute fields, then
directly search the index. Such type of index evaluates the suitable index algorithms.
 Conjunctive selection via the intersection of identifiers: This implementation involves
record pointers or record identifiers. It uses indices with the record pointers on those
fields which are involved in the individual selection condition. It scans each index for
pointers to tuples satisfying the individual condition. Therefore, the intersection of all the
retrieved pointers is the set of pointers to the tuples that satisfies the conjunctive
condition. The algorithm uses these pointers to fetch the actual records. However, in
absence of indices on each individual condition, it tests the retrieved records for the other
remaining conditions.
 Disjunctive selection by the union of identifiers: This algorithm scans those entire
indexes for pointers to tuples that satisfy the individual condition. But only if access paths
are available on all disjunctive selection conditions. Therefore, the union of all fetched
records provides pointers sets to all those tuples which satisfy or prove the disjunctive
condition. Further, it makes use of pointers for fetching the actual records. Somehow, if
the access path is not present for anyone condition, we need to use a linear search to find
those tuples that satisfy the condition. Thus, it is good to use a linear search for
determining such tests.
Cost Estimation
Here, the overall cost of the algorithm is composed by adding the cost of individual index
scans and cost of fetching the records in the intersection of the retrieved lists of pointers. We
can minimize the cost by sorting the list of pointers and fetching the sorted records. So, we
found the following two points for cost estimation:
 We can fetch all selected records of the block using a single I/O operation because each
pointer in the block appears together.
 The disk-arm movement gets minimized as blocks are read in sorted order.
Cost Estimation Chart for various Selection algorithms
Here, br is the number of blocks in the file.
hi denotes the height of the index
b is the number of blocks holding records with specified search key
n is the number of fetched records
Selection Algorithms Cost Why So?

Linear Search ts + br * tT It needs one initial seek with


br block transfers.

Linear Search, Equality on ts + (br/2) * tT It is the average case where it


Key needs only one record
satisfying the condition. So
as soon as it is found, the
scan terminates.

Primary B+-tree index, (hi +1) * (tr + ts) Each I/O operation needs one
Equality on Key seek and one block transfer to
fetch the record by traversing
the height of the tree.

Primary B+-tree index, hi * (tT + ts) + b * tT It needs one seek for each
Equality on a Nonkey level of the tree, and one seek
for the first block.

Secondary B+-tree index, (hi + 1) * (tr + ts) Each I/O operation needs one
Equality on Key seek and one block transfer to
fetch the record by traversing
the height of the tree.

Secondary B+-tree index, (hi + n) * (tr + ts) It requires one seek per
Equality on Nonkey record because each record
may be on a different block.

Primary B+-tree index, hi * (tr + ts) + b * tT It needs one seek for each
Comparison level of the tree, and one seek
for the first block.

Secondary B+-tree index, (hi + n) * (tr + ts) It requires one seek per
Comparison record because each record
may be on a different block.

Double-Pipelined Join Algorithm


There are several operations used in accessing the data from any particular system. But few
of them are inherently blocking operations, and others are not. Blocking operations are those
which do not output any results until all the input tuples are examined.
For example, operations such as hash-join is a blocking operation as before outputting any
result. It needs both its input to be fetched entirely as well as partitioned. On the other hand,
the indexed nested loop is able to output the resulting tuples as soon it gets tuples for the
outer relation. So, it is pipelined at its outer relation and blocking on its indexed input. It is
so because the indexed is created completely before the execution of the indexed nested
loop.
But in some cases where we want to perform join operation on two inputs. However, both
inputs are not already sorted, and we need to put them in a pipeline of the join operation. For
such cases, we use an alternative approach known as the Double-pipelined join method. The
double-pipelined join technique uses an evaluation algorithm for the implementation of the
pipeline, which is known as the Double-pipelined join algorithm.
Implementation of Pipelining
For implementing a pipeline in order to evaluate multiple operations of the given user query,
we need to construct a single and complex operation that merges the multiple operations of
the given query, which will implement a pipeline. However, such an approach is feasible and
efficient for some frequently occurring conditions.
The system can use any of the following ways for executing a pipeline:
Demand-driven Pipeline
In the demand-driven pipeline, the system repeatedly makes tuples request from the
operation, which is at the top of the pipeline. Whenever the operation gets the system request
for the tuples, initially, it computes those next tuples which will be returned, and after that, it
returns the requested tuples. The operation repeats the same process each time it receives any
tuples request from the system. In case, the inputs of the operation are not pipelined, then we
compute the next returning tuples from the input relations only. However, the system keeps
track of all tuples which have been returned so far. But if there are some pipelined inputs
present, the operation will make a request for tuples from its pipelined inputs also. After
receiving tuples from its pipelined inputs, the operation uses them for computing tuples for
its output or result and then passes them to its parent which is at the upper-level. So, in the
demand-driven pipeline, a pipeline is implemented on the basis of the demand or request of
tuples made by the system.
Implementing demand-driven pipeline
In the demand-driven pipeline, it implements each operation as an iterator. The iterator
provides three basic functions to implement a demand-driven pipeline. The functions are
open(), next(), and close(). These functions work as follows:
 After invoking the open() function, each call to next() returns the next tuple as an output
of the operation.
 In turn, the implementation of the operation invokes the open(), and next() functions on
its inputs so that the input tuples may be easily available when needed.
 After fulfilling the requirements, the close() function tells the iterator that there is no
more tuple requirement.
 Also, in-between the calling process or calls, the iterator maintains its state of execution.
As a result, the successive next() function receives tuples of the successive result.
Producer-driven Pipeline
The producer-driven pipeline is different from the demand-driven pipeline. In the producer-
driven pipeline, the operations do not wait for the system request for producing the tuples.
Instead, the operations are eager to produce such tuples. In the producer-driven pipeline, it
models each operation as a separate thread or process within the system. Here, the system
gets a stream of tuples from its pipelined inputs and finally generates or produces a stream of
tuples for its output. The producer-driven pipeline follows such an approach.
Implementing Producer-driven Pipeline
The way of implementing the producer-driven pipeline varies from demand-driven pipeline.
The implementation processes in the following described steps:
 For each pair of adjacent operations, the system constructs a buffer that holds the tuples
which are being passed from one operation to the next operation.
 After creating the buffer, the processes which are corresponding to different operations
are concurrently executed.
 All those operations which are present at the bottom of the pipeline continually produce
the output tuples put them in the output buffer until the buffer becomes full.
 As soon the operation uses a tuple from a pipelined input, it removes that tuple from its
input buffer.
 In case the output buffer becomes full, the operation waits until the buffer creates more
space for more tuples. What happens, the parent operation of the specified operation is
responsible for removing the tuples form the buffer. So, in actuality, the operation waits
for its parent operation to do so.
 So, when the buffer creates more space again, the operation restarts its tuples production
and continues until the buffer becomes full again.
 The operation repeats this process until the generation of all the output tuples.
Note: It becomes necessary for the system to switch operations if an input buffer is empty,
the output buffer is full, or when it needs more input tuples for generating more output
tuples.
Difference between Producer-driven pipeline and Demand-driven pipeline
There are the following difference points between the demand-driven pipeline and producer-
driven pipeline:

Demand-driven Pipeline Producer-driven Pipeline


It is similar to pulling data up from the top of It is similar to pushing data up from the
an operation tree. below of an operation tree.

Tuples are generated in a lazy manner. Tuples are eagerly generated.

It is easy to implement. It is not so easy to implement a producer-


driven pipeline.

It is most commonly used for evaluating an It is typical so rarely used in the systems. But,
expression. it is good for systems such as parallel
processing systems.

Pipelining in Query Processing


Pipelining helps in improving the efficiency of the query-evaluation by decreasing the
production of a number of temporary files. Actually, we reduce the construction of the
temporary files by merging the multiple operations into a pipeline. The result of one
currently executed operation passes to the next operation for its execution, and the chain
continues till all operations are completed, and we get the final output of the expression.
Such type of evaluation process is known as Pipelined Evaluation.
Advantages of Pipeline
There are following advantages of creating a pipelining of operations:
 It reduces the cost of query evaluation by eliminating the cost of reading and writing the
temporary relations, unlike the materialization process.
 If we combine the root operator of a query evaluation plan in a pipeline with its inputs,
the process of generating query results becomes quick. As a result, it is beneficial for the
users as they can view the results of their asked queries as soon as the outputs get
generated. Else, the users need to wait for high-time to get and view any query results.
Pipelining vs. Materialization

Pipelining Materialization

It is a modern approach to evaluate multiple It is a traditional approach to evaluate


operations. multiple operations.

It does not use any temporary relations for It uses temporary relations for storing the
storing the results of the evaluated results of the evaluated operations. So, it
operations. needs more temporary files and I/O.

It is a more efficient way of query evaluation It is less efficient as it takes time to generate
as it quickly generates the results. the query results.

It requires memory buffers at a high rate for It does not have any higher requirements for
generating outputs. Insufficient memory memory buffers for query evaluation.
buffers will cause thrashing.

Poor performance if trashing occurs. No trashing occurs in materialization. Thus,


in such cases, materialization is having better
performance.

It optimizes the cost of query evaluation. As The overall cost includes the cost of
it does not include the cost of reading and operations plus the cost of reading and
writing the temporary storages. writing results on the temporary storage.

Unit-3
The records in databases are stored in file formats. Physically, the data is stored in
electromagnetic format on a device. The electromagnetic devices used in database systems
for data storage are classified as follows:
1. Primary Memory
2. Secondary Memory
3. Tertiary Memory
Types of Memory
1. Primary Memory
The primary memory of a server is the type of data storage that is directly accessible by the
central processing unit, meaning that it doesn’t require any other devices to read from it.
The primary memory must, in general, function flawlessly with equal contributions from the
electric power supply, the hardware backup system, the supporting devices, the coolant that
moderates the system temperature, etc.
 The size of these devices is considerably smaller and they are volatile.
 According to performance and speed, the primary memory devices are the fastest devices,
and this feature is in direct correlation with their capacity.
 These primary memory devices are usually more expensive due to their increased speed
and performance.
The cache is one of the types of Primary Memory.
 Cache Memory: Cache Memory is a special very high-speed memory. It is used to speed
up and synchronize with a high-speed CPU. Cache memory is costlier than main memory
or disk memory but more economical than CPU registers. Cache memory is an extremely
fast memory type that acts as a buffer between RAM and the CPU.
2. Secondary Memory
Data storage devices known as secondary storage, as the name suggests, are devices that can
be accessed for storing data that will be needed at a later point in time for various purposes
or database actions. Therefore, these types of storage systems are sometimes called backup
units as well. Devices that are plugged or connected externally fall under this memory
category, unlike primary memory, which is part of the CPU. The size of this group of
devices is noticeably larger than the primary devices and smaller than the tertiary devices.
 It is also regarded as a temporary storage system since it can hold data when needed and
delete it when the user is done with it. Compared to primary storage devices as well as
tertiary devices, these secondary storage devices are slower with respect to actions and
pace.
 It usually has a higher capacity than primary storage systems, but it changes with the
technological world, which is expanding every day.
Some commonly used Secondary Memory types that are present in almost every system are:
 Flash Memory: Flash memory, also known as flash storage, is a type of nonvolatile
memory that erases data in units called blocks and rewrites data at the byte level. Flash
memory is widely used for storage and data transfer in consumer devices, enterprise
systems, and industrial applications. Unlike traditional hard drives, flash memories are
able to retain data even after the power has been turned off
 Magnetic Disk Storage: A Magnetic Disk is a type of secondary memory that is a flat
disc covered with a magnetic coating to hold information. It is used to store various
programs and files. The polarized information in one direction is represented by 1, and
vice versa. The direction is indicated by 0.
3. Tertiary Memory
For data storage, Tertiary Memory refers to devices that can hold a large amount of data
without being constantly connected to the server or the peripherals. A device of this type is
connected either to a server or to a device where the database is stored from the outside.
 Due to the fact that tertiary storage provides more space than other types of device
memory but is most slowly performing, the cost of tertiary storage is lower than primary
and secondary. As a means to make a backup of data, this type of storage is commonly
used for making copies from servers and databases.
 The ability to use secondary devices and to delete the contents of the tertiary devices is
similar.
Some commonly used Tertiary Memory types that are almost present in every system are:
 Optical Storage: It is a type of storage where reading and writing are to be performed
with the help of a laser. Typically data written on CDs and DVDs are examples
of Optical Storage.
 Tape Storage: Tape Storage is a type of storage data where we use magnetic tape to
store data. It is used to store data for a long time and also helps in the backup of data
in case of data loss.
Memory Hierarchy
A computer system has a hierarchy of memory. Direct access to a CPU’s main memory and
inbuilt registers is available. Accessing the main memory takes less time than running a
CPU. Cache memory is introduced to minimize this difference in speed. Data that is most
frequently accessed by the CPU resides in cache memory, which provides the fastest access
time to data. Fastest-accessing memory is the most expensive. Although large storage
devices are slower and less expensive than CPU registers and cache memory, they can store
a greater amount of data.
1. Magnetic Disks
Present-day computer systems use hard disk drives as secondary storage devices. Magnetic
disks store information using the concept of magnetism. Metal disks are coated with
magnetizable material to create hard disks. Spindles hold these disks vertically. As the
read/write head moves between the disks, it de-magnetizes or magnetizes the spots under it.
There are two magnetized spots: 0 (zero) and 1 (one). Formatted hard disks store data
efficiently by storing them in a defined order. The hard disk plate is divided into many
concentric circles, called tracks. Each track contains a number of sectors. Data on a hard disk
is typically stored in sectors of 512 bytes.
2. Redundant Array of Independent Disks(RAID)
In the Redundant Array of Independent Disks technology, two or more secondary storage
devices are connected so that the devices operate as one storage medium. A RAID array
consists of several disks linked together for a variety of purposes. Disk arrays are categorized
by their RAID levels.
 RAID 0: At this level, disks are organized in a striped array. Blocks of data are divided
into disks and distributed over disks. Parallel writing and reading of data occur on each
disk. This improves performance and speed. Level 0 does not support parity and backup.

 RAID 1: Mirroring is used in RAID 1. A RAID controller copies data across all disks
in an array when data is sent to it. In case of failure, RAID level 1 provides 100%
redundancy.
 RAID 2: The data in RAID 2 is striped on different disks, and the Error Correction
Code is recorded using Hamming distance. Similarly to level 0, each bit within a word
is stored on a separate disk, and ECC codes for the data words are saved on a separate
set of disks. As a result of its complex structure and high cost, RAID 2 cannot be
commercially deployed.

 RAID 3: Data is striped across multiple disks in RAID 3. Data words are parsed to
generate a parity bit. It is stored on a different disk. Thus, single-disk failures can be
avoided.

 RAID 4: This level involves writing an entire block of data onto data disks, and then
generating the parity and storing it somewhere else. At level 3, bytes are striped, while
at level 4, blocks are striped. Both levels 3 and 4 require a minimum of three disks.

 RAID 5: The data blocks in RAID 5 are written to different disks, but the parity bits
are spread out across all the data disks rather than being stored on a separate disk.
 RAID 6: The RAID 6 level extends the level 5 concept. A pair of independent parities
are generated and stored on multiple disks at this level. A pair of independent parities
are generated and stored on multiple disks at this level. Ideally, you need four disk
drives for this level.

Unit-4
Transactions
A transaction is a program including a collection of database operations, executed as a
logical unit of data processing. The operations performed in a transaction include one or
more of database operations like insert, delete, update or retrieve data. It is an atomic process
that is either performed into completion entirely or is not performed at all. A transaction
involving only data retrieval without any data update is called read-only transaction.
Each high level operation can be divided into a number of low level tasks or operations. For
example, a data update operation can be divided into three tasks −
 read_item() − reads data item from storage to main memory.
 modify_item() − change value of item in the main memory.
 write_item() − write the modified value from main memory to storage.
Database access is restricted to read_item() and write_item() operations. Likewise, for all
transactions, read and write forms the basic database operations.
Transaction Operations
The low level operations performed in a transaction are −
 begin_transaction − A marker that specifies start of transaction execution.
 read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.
 end_transaction − A marker that specifies end of transaction.
 commit − A signal to specify that the transaction has been successfully completed in its
entirety and will not be undone.
 rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes in the database are undone. A committed transaction cannot be rolled
back.
Transaction States
A transaction may go through a subset of five states, active, partially committed, committed,
failed and aborted.
 Active − The initial state where the transaction enters is the active state. The transaction
remains in this state while it is executing read, write or other operations.
 Partially Committed − The transaction enters this state after the last statement of the
transaction has been executed.
 Committed − The transaction enters this state after successful completion of the
transaction and system checks have issued commit signal.
 Failed − The transaction goes from partially committed state or active state to failed state
when it is discovered that normal execution can no longer proceed or system checks fail.
 Aborted − This is the state after the transaction has been rolled back after failure and the
database has been restored to its state that was before the transaction began.
The following state transition diagram depicts the states in the transaction and the low level
transaction operations that causes change in states.

Desirable Properties of Transactions


Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation,
and Durability.
 Atomicity − This property states that a transaction is an atomic unit of processing, that is,
either it is performed in its entirety or not performed at all. No partial update should exist.
 Consistency − A transaction should take the database from one consistent state to another
consistent state. It should not adversely affect any data item in the database.
 Isolation − A transaction should be executed as if it is the only one in the system. There
should not be any interference from the other concurrent transactions that are
simultaneously running.
 Durability − If a committed transaction brings about a change, that change should be
durable in the database and not lost in case of any failure.
Concurrency Control
• Concurrency Control in Database Management System is a procedure of managing
simultaneous operations without conflicting with each other.
• It ensures that Database transactions are performed concurrently and accurately to
produce correct results without violating data integrity of the respective Database.
• DBMS Concurrency Control is used to address such conflicts, which mostly occur with a
multi-user system.
• Therefore, Concurrency Control is the most important element for proper functioning of a
Database Management System where two or more database transactions are executed
simultaneously, which require access to the same data.
Schedules and Conflicts
In a system with a number of simultaneous transactions, a schedule is the total order of
execution of operations. Given a schedule S comprising of n transactions, say T1, T2,
T3..Tn; for any transaction Ti, the operations in Ti must execute as laid down in the schedule
S.
Types of Schedules
There are two types of schedules −
 Serial Schedules − In a serial schedule, at any point of time, only one transaction is
active, i.e. there is no overlapping of transactions. This is depicted in the following
graph −

 Parallel Schedules − In parallel schedules, more than one transactions are active
simultaneously, i.e. the transactions contain operations that overlap at time. This is
depicted in the following graph −

Conflicts in Schedules
In a schedule comprising of multiple transactions, a conflict occurs when two active
transactions perform non-compatible operations. Two operations are said to be in conflict,
when all of the following three conditions exists simultaneously −
 The two operations are parts of different transactions.
 Both the operations access the same data item.
 At least one of the operations is a write_item() operation, i.e. it tries to modify the data
item.
Serializability
A serializable schedule of n transactions is a parallel schedule which is equivalent to a serial
schedule comprising of the same n transactions. A serializable schedule contains the
correctness of serial schedule while ascertaining better CPU utilization of parallel schedule.
Equivalence of Schedules
Equivalence of two schedules can be of the following types −
 Result equivalence − Two schedules producing identical results are said to be result
equivalent.
 View equivalence − Two schedules that perform similar action in a similar manner are
said to be view equivalent.
 Conflict equivalence − Two schedules are said to be conflict equivalent if both contain
the same set of transactions and has the same order of conflicting pairs of operations.
Various concurrency control techniques are:
1. Locking Based Protocol
2. Time stamp ordering Protocol
3. Multi version concurrency control
4. Validation concurrency control
Two Phase Locking
The Two-Phase Locking (2PL) Protocol is a key technique used in database management
systems to manage how multiple transactions access and modify data at the same time. When
many users or processes interact with a database, it’s important to ensure that data remains
consistent and error-free. Without proper management, issues like data conflicts or
corruption can occur if two transactions try to use the same data simultaneously.
The Two-Phase Locking Protocol resolves this issue by defining clear rules for managing
data locks. It divides a transaction into two phases:
1. Growing Phase: In this step, the transaction gathers all the locks it needs to access the
required data. During this phase, it cannot release any locks.
2. Shrinking Phase: Once a transaction starts releasing locks, it cannot acquire any new
ones. This ensures that no other transaction interferes with the ongoing process.
Time Stamp Ordering Protocol
A timestamp is a tag that can be attached to any transaction or any data item, which denotes
a specific time on which the transaction or the data item had been used in any way. A
timestamp can be implemented in 2 ways. One is to directly assign the current value of the
clock to the transaction or data item. The other is to attach the value of a logical counter that
keeps increment as new timestamps are required. The timestamp of a data item can be of 2
types:
• W-timestamp(X): This means the latest time when the data item X has been written into.
• R-timestamp(X): This means the latest time when the data item X has been read from.
These 2 timestamps are updated each time a successful read/write operation is performed
on the data item X.
Multiversion Concurrency Control
Multiversion schemes keep old versions of data item to increase concurrency. Multiversion
2 phase locking: Each successful write results in the creation of a new version of the data
item written. Timestamps are used to label the versions. When a read(X) operation is issued,
select an appropriate version of X based on the timestamp of the transaction.
Validation Concurrency Control
The optimistic approach is based on the assumption that the majority of the database
operations do not conflict. The optimistic approach requires neither locking nor time
stamping techniques. Instead, a transaction is executed without restrictions until it is
committed. Using an optimistic approach, each transaction moves through 2 or 3 phases,
referred to as read, validation and write.
• During read phase, the transaction reads the database, executes the needed computations
and makes the updates to a private copy of the database values. All update operations of
the transactions are recorded in a temporary update file, which is not accessed by the
remaining transactions.
• During the validation phase, the transaction is validated to ensure that the changes made
will not affect the integrity and consistency of the database. If the validation test is
positive, the transaction goes to a write phase. If the validation test is negative, he
transaction is restarted and the changes are discarded.
• During the write phase, the changes are permanently applied to the database.
Deadlock in DBMS
A deadlock is a condition where two or more transactions are waiting indefinitely for one
another to give up locks. Deadlock is said to be one of the most feared complications in
DBMS as no task ever gets finished and is in waiting state forever.
Deadlock in DBMS
• Every process need some resource for its execution and these resources are granted in
sequential order
• First the process request some resource.
• OS grants the resource to the process if it is available or else it places the request in the
wait queue.
• The process uses it and releases on the completion so that it can be granted to another
process
• Deadlock is a situation where two or more transactions are waiting indefinitely for each
other to give up their locks.
Necessary conditions for Deadlocks
• Mutual Exclusion: It implies if two processes cannot use the same resource at the same
time.
• Hold and Wait: A process waits for some resources while holding another resource at the
same time.
• No pre-emption: The process which once scheduled will be executed till the completion.
No other process can be scheduled by the scheduler meanwhile.
• Circular Wait: All the processes must be waiting for the resources in a cyclic manner so
that the last process is waiting for the resource which is being held by the first process.
Starvation in a database management system (DBMS) is when a transaction or process is
repeatedly delayed or stalled because it cannot access the resources it needs to move
forward. When other transactions or processes take precedence over the one that is starving,
this might occur.
Reasons for Starvation –
 If the waiting scheme for locked items is unfair. ( priority queue )
 Victim selection. (the same transaction is selected as a victim repeatedly )
 Resource leak.
 Via denial-of-service attack.
Solutions of Starvation –
1. Increasing Priority – Starvation occurs when a transaction has to wait for an indefinite
time, In In this situation, we can increase the priority of that particular transaction/s. But
the drawback with this solution is that it may happen that the other transaction may have
to wait longer until the highest priority transaction comes and proceeds.
2. Modification in Victim Selection algorithm – If a transaction has been a victim of
repeated selections, then the algorithm can be modified by lowering its priority over other
transactions.
3. First Come First Serve approach – A fair scheduling approach i.e FCFS can be adopted,
In which the transaction can acquire a lock on an item in the order, in which the requested
the lock.
4. Wait for the die and wound wait scheme – These are the schemes that use the timestamp
ordering mechanism of transaction.
Database Recovery Techniques
Database Systems like any other computer system, are subject to failures but the data stored
in them must be available as and when required. When a database fails it must possess the
facilities for fast recovery. It must also have atomicity i.e. either transactions are completed
successfully and committed (the effect is recorded permanently in the database) or the
transaction should have no effect on the database.
Types of Recovery Techniques in DBMS
Database recovery techniques are used in database management systems (DBMS) to restore
a database to a consistent state after a failure or error has occurred. The main goal of
recovery techniques is to ensure data integrity and consistency and prevent data loss.
There are mainly two types of recovery techniques used in DBMS
• Rollback/Undo Recovery Technique
• Commit/Redo Recovery Technique
• CheckPoint Recovery Technique
Rollback/Undo Recovery Technique
The rollback/undo recovery technique is based on the principle of backing out or undoing the
effects of a transaction that has not been completed successfully due to a system failure or
error. This technique is accomplished by undoing the changes made by the transaction using
the log records stored in the transaction log. The transaction log contains a record of all the
transactions that have been performed on the database. The system uses the log records to
undo the changes made by the failed transaction and restore the database to its previous state.
Commit/Redo Recovery Technique
The commit/redo recovery technique is based on the principle of reapplying the changes
made by a transaction that has been completed successfully to the database. This technique is
accomplished by using the log records stored in the transaction log to redo the changes made
by the transaction that was in progress at the time of the failure or error. The system uses the
log records to reapply the changes made by the transaction and restore the database to its
most recent consistent state.
Checkpoint Recovery Technique
Checkpoint Recovery is a technique used to improve data integrity and system stability,
especially in databases and distributed systems. It entails preserving the system’s state at
regular intervals, known as checkpoints, at which all ongoing transactions are either
completed or not initiated. This saved state, which includes memory and CPU registers, is
kept in stable, non-volatile storage so that it can withstand system crashes. In the event of a
breakdown, the system can be restored to the most recent checkpoint, which reduces data
loss and downtime. The frequency of checkpoint formation is carefully regulated to decrease
system overhead while ensuring that recent data may be restored quickly.
Overall, recovery techniques are essential to ensure data consistency and availability
in Database Management System, and each technique has its own advantages and limitations
that must be considered in the design of a recovery system.
Database Systems
There are both automatic and non-automatic ways for both, backing up data and recovery
from any failure situations. The techniques used to recover lost data due to system crashes,
transaction errors, viruses, catastrophic failure, incorrect command execution, etc. are
database recovery techniques. So to prevent data loss recovery techniques based on deferred
updates and immediate updates or backing up data can be used. Recovery techniques are
heavily dependent upon the existence of a special file known as a system log. It contains
information about the start and end of each transaction and any updates which occur during
the transaction. The log keeps track of all transaction operations that affect the values of
database items. This information is needed to recover from transaction failure.
• The log is kept on disk start_transaction(T): This log entry records that transaction T
starts the execution.
• read_item(T, X): This log entry records that transaction T reads the value of database
item X.
• write_item(T, X, old_value, new_value): This log entry records that transaction T
changes the value of the database item X from old_value to new_value. The old value is
sometimes known as a before an image of X, and the new value is known as an
afterimage of X.
• commit(T): This log entry records that transaction T has completed all accesses to the
database successfully and its effect can be committed (recorded permanently) to the
database.
• abort(T): This records that transaction T has been aborted.
• checkpoint: A checkpoint is a mechanism where all the previous logs are removed from
the system and stored permanently in a storage disk. Checkpoint declares a point before
which the DBMS was in a consistent state, and all the transactions were committed.
A transaction T reaches its commit point when all its operations that access the database
have been executed successfully i.e. the transaction has reached the point at which it will
not abort (terminate without completing). Once committed, the transaction is permanently
recorded in the database. Commitment always involves writing a commit entry to the log and
writing the log to disk. At the time of a system crash, the item is searched back in the log for
all transactions T that have written a start_transaction(T) entry into the log but have not
written a commit(T) entry yet; these transactions may have to be rolled back to undo their
effect on the database during the recovery process.
• Undoing: If a transaction crashes, then the recovery manager may undo transactions i.e.
reverse the operations of a transaction. This involves examining a transaction for the log
entry write_item(T, x, old_value, new_value) and setting the value of item x in the
database to old-value. There are two major techniques for recovery from non-catastrophic
transaction failures: deferred updates and immediate updates.
• Deferred Update: This technique does not physically update the database on disk until a
transaction has reached its commit point. Before reaching commit, all transaction updates
are recorded in the local transaction workspace. If a transaction fails before reaching its
commit point, it will not have changed the database in any way so UNDO is not needed.
It may be necessary to REDO the effect of the operations that are recorded in the local
transaction workspace, because their effect may not yet have been written in the database.
Hence, a deferred update is also known as the No-undo/redo algorithm.
• Immediate Update: In the immediate update, the database may be updated by some
operations of a transaction before the transaction reaches its commit point. However,
these operations are recorded in a log on disk before they are applied to the database,
making recovery still possible. If a transaction fails to reach its commit point, the effect of
its operation must be undone i.e. the transaction must be rolled back hence we require
both undo and redo. This technique is known as undo/redo algorithm.
• Caching/Buffering: In this one or more disk pages that include data items to be updated
are cached into main memory buffers and then updated in memory before being written
back to disk. A collection of in-memory buffers called the DBMS cache is kept under the
control of DBMS for holding these buffers. A directory is used to keep track of which
database items are in the buffer. A dirty bit is associated with each buffer, which is 0 if
the buffer is not modified else 1 if modified.
• Shadow Paging: It provides atomicity and durability. A directory with n entries is
constructed, where the ith entry points to the ith database page on the link. When a
transaction began executing the current directory is copied into a shadow directory. When
a page is to be modified, a shadow page is allocated in which changes are made and when
it is ready to become durable, all pages that refer to the original are updated to refer new
replacement page.
• Backward Recovery: The term ” Rollback ” and ” UNDO ” can also refer to backward
recovery. When a backup of the data is not available and previous modifications need to
be undone, this technique can be helpful. With the backward recovery method, unused
modifications are removed and the database is returned to its prior condition. All
adjustments made during the previous traction are reversed during the backward
recovery. In other words, it reprocesses valid transactions and undoes the erroneous
database updates.
• Forward Recovery: “ Roll forward “and ” REDO ” refers to forwarding recovery.
When a database needs to be updated with all changes verified, this forward recovery
technique is helpful. Some failed transactions in this database are applied to the database
to roll those modifications forward. In other words, the database is restored using
preserved data and valid transactions counted by their past saves.
Backup Techniques
There are different types of Backup Techniques. Some of them are listed below.
• Full database Backup: In this full database including data and database, Meta
information needed to restore the whole database, including full-text catalogs are backed
up in a predefined time series.
• Differential Backup: It stores only the data changes that have occurred since the last full
database backup. When some data has changed many times since the last full database
backup, a differential backup stores the most recent version of the changed data. For this
first, we need to restore a full database backup.
• Transaction Log Backup: In this, all events that have occurred in the database, like a
record of every single statement executed is backed up. It is the backup of transaction log
entries and contains all transactions that had happened to the database. Through this, the
database can be recovered to a specific point in time. It is even possible to perform a
backup from a transaction log if the data files are destroyed and not even a single
committed transaction is lost.

Unit-5
Database Security
Database security in DBMS is a technique for protecting and securing a database from
intentional or accidental threats. Security considerations will apply not only to the data
stored in an organization's database: a breach of security may impact other aspects of the
system, which may ultimately affect the database structure. As a result, database
security encompasses hardware parts, software parts, human resources, and data.
To use the security efficiently, appropriate controls are required, which are separated into a
specific goal and purpose for the system. The demand for effective security, which was
frequently neglected or overlooked in the past, is now being rigorously verified by many
businesses.
We consider database security in the following scenarios:
 Theft and fraudulent.
 Loss of Data privacy.
 Loss of Data integrity.
 Loss of confidentiality or secrecy
 Loss of availability of data.
These above-stated conditions generally represent the areas where the organization should
focus on lowering the risk, which is the possibility of data loss or destruction within a
database. Since all of the data inside an organization is interrelated, an activity that results
in a loss in one area may also lead to a loss in another.
Why Database Security is Important?
Security is an important concern in database management because the information stored in a
database is a very valuable and, at times, quite sensitive commodity. As a result, data in a
database management system must be protected from abuse and illegal access and updates.
 Compromise of intellectual property: Our intellectual property—trade secrets,
inventions, or unique methods—could be essential for our ability to sustain an advantage
in our industry. If our intellectual property is stolen or leaked, then we will lose our
competitive advantage and it may be difficult to maintain or recover.
 The reputational harm is done to our brand: Customers or partners may refuse to buy
goods or services from us (or do business with us) if they do not believe they can trust our
company to protect their data or their own.
 The concept of business continuity (or lack of it): Some businesses are unable to
operate until a breach has been resolved.
 Penalties or fines to be paid for failure: The cost of failing to comply with international
regulations such as the Sarbanes-Oxley Act (SAO) or Payment Card Industry Data
Security Standard (PCI DSS) specific to industry regulations on data privacy, such as
HIPAA, or regional privacy laws like the European Union's General Data Protection
Regulation (GDPR) could be significant, with fines exceeding many millions of dollars in
the worst-case scenario.
 Costs of correcting breaches and notifying consumers about them: Along with
notifying customers of a breach, the organization that was breached must fund the
investigation and forensic services such as crisis management, triage repairs to the
affected systems, and much more.
Database Security Threats
Many software vulnerabilities, misconfigurations, or practices of misuse or carelessness
could lead to breaches. The following are some of the most well-known causes and types of
database security cyber threats.
1) SQL/NoSQL Injection Attacks
It is a type of attack that occurs when a malicious code is injected into frontend (web) apps
and then transmitted to the backend database. SQL injections provide hackers with
unrestricted access to any data saved in a database. There are two types of such computer
attacks: SQL injection attacks on traditional databases and NoSQL injection attacks on
large data databases. Typically, these are queries generated as an extension of online
application forms or received via HTTP requests. Any database system is vulnerable to these
attacks if developers do not follow secure coding practices and the organization does not
conduct regular vulnerability testing.
Countermeasures:
 Direct queries should be replaced with stored procedures.
 The MVC Architecture must be implemented.
2) Malware
Malware is software designed to corrupt data or harm a database. Malware could enter your
system via any endpoint device connected to the database's network and exploit
vulnerabilities in your system. Malware protection is important on any endpoint, but it is
particularly necessary on database servers due to their high value and
sensitivity. Examples of common malware include spyware, Trojan viruses, viruses, worms,
adware, and ransomware.
Countermeasures:
 Safeguard any endpoints on your databases. Install specialized malware protection
software and spread awareness among users on risk and preventative measures.
3) Lack of Security Expertise and Education
Databases are breached and leaked due to insufficient level of IT security expertise and
education of non-technical employees, who may violate basic database security standards
and endanger databases. IT security employees may also lack the necessary expertise to
create security controls, enforce rules, or execute incident response processes.
Countermeasures:
 Database users must be trained in database security.
 IT security professionals will be encouraged to advance their professional level and
qualifications.
4) Denial of Service (DoS/DDoS) Attacks
In a denial of service (DoS) attack, the cybercriminal uses a huge number of fake requests
to overwhelm the target service—in this case, the database server. As a result, the server
cannot handle legitimate requests from actual users and frequently crashes or becomes
unstable.
A DoS (Denial of Service) attack causes a database server to slow down and may render it
unavailable to all users. Even though a DoS attack does not reveal the contents of a database,
it can cost victims a significant amount of time and money. Moreover, what is the use of
having a database if you can't use or access it?
In a distributed denial of service (DDoS) attack, fake traffic is generated by a large number
of computers that are part of an attacker-controlled botnet. This results in extremely high
traffic volumes, which are difficult to stop without a highly scalable defensive architecture.
Cloud-based DDoS prevention services can dynamically scale up to deal with massive DDoS
attacks.
Countermeasures:
 Harden the TCP/IP stack by increasing the size of the TCP connection queue with the
appropriate registry settings.
 Reduce the time it takes to establish a connection.
 Use dynamic backlog methods to ensure that the connection queue is never full.
 Use an Intrusion Detection System (IDS) on your network.
5) Exploitation of Database Software Vulnerabilities
Attackers are continuously attempting to isolate and target software vulnerabilities, and
database management software is a particularly desirable target. New vulnerabilities are
identified on a daily basis, and security updates are issued regularly by all open-source
database management platforms and commercial database software manufacturers. However,
if you do not apply these changes immediately, your database may be vulnerable to attack.
Even if you deploy patches on time, there is always the risk of zero-day attacks, which occur
when attackers find a vulnerability that the database vendor has not yet found and patched.
Countermeasures:
 Encrypt any sensitive information in your database (s).
 Apply the necessary database controls and permissions.
 Conduct a regular search for new sensitive data in your databases. You may
accomplish this very successfully with the Periodic Data Discovery tool and
Compliance Manager, which will automatically discover and secure newly uploaded
sensitive data.
6) Excessive Database Privileges
Database users in DBMS may have varying levels of access. However, users may abuse
them, and the three basic categories of privilege abuse are as follows: excessive privilege
abuse, legitimate privilege abuse, and unused privilege abuse. Excessive privileges always
introduce unnecessary risks. According to statistics, 80% of attacks on company databases
are carried out by current or former workers.
Countermeasures:
 It is recommended that a strict access and privileges control policy be implemented
and enforced.
 Don't give staff too many privileges, and revoke outdated privileges as soon as
possible.
7) Weak Audit Trail
If a database is not audited, it represents a risk of noncompliance with sensitive data
protection rules at the national and international levels. All database events must be
automatically logged and registered, and automatic auditing solutions must be used. Failure
or unwillingness to do so represents a major risk on multiple levels.
Countermeasures:
 Use automatic auditing solutions that have no impact on database performance.
Control Measures for the Security of Data in Databases
The following are the key control measures used to ensure data security in databases:
Authentication
 Authentication is the process of confirming whether a user logs in only with the rights
granted to him to undertake database operations. A certain user can only log in up to
his privilege level, but he cannot access any other sensitive data.
 The ability to access sensitive data is restricted by the use of authentication. For
example, a mobile phone performs authentication by requesting a PIN, fingerprint, or
by face recognition. Similarly, a computer verifies a username by requesting the
appropriate password.
 However, in the context of databases, authentication takes on a new dimension
because it can occur at multiple levels. It can be done by the database itself, or the
configuration can be adjusted to allow the operating system or another external means
to authenticate users.
 For example, when creating a database in Microsoft's SQL Server a user must specify
whether to use database authentication, operating system authentication, or both (the
so-called mixed-mode authentication). Other databases that prioritize security use
near-foolproof authentication methods such as fingerprint recognition and retinal
scanning.
 By using various authentication technologies such as biometrics for retina and
fingerprints, you can protect your data from unauthorized/malicious users.
Access Control
 Database access control is a means of restricting access to sensitive company data to
only those people (database users) who are authorized to access such data and
permitting access to unauthorized persons. It is a key security concept that reduces risk
to the business or organization.
 Physical and logical access control are the two types of access control. Access to
campuses, buildings, rooms, and physical IT assets is restricted through physical
access control. Connections to computer networks, system files, and data are restricted
through logical access control.
 To safeguard a facility, corporations use electronic access control systems to track
employee access to restricted company locations and private regions, such as data
centers, using user credentials, access card readers, auditing, and reports. Some of
these systems include access control panels to restrict access to rooms and buildings,
as well as alarms and lockdown features to prevent unauthorized access or operations.
 Logical access control systems execute user and entity identification, authentication,
and authorization by evaluating needed login credentials, which can include
passwords, personal identification numbers, biometric scans, security tokens, or other
authentication factors. Multifactor authentication (MFA), which needs two or more
authentication factors, is frequently used as part of a layered defense to safeguard
access control systems.
 The most well-known Database Access Control examples are:
o Discretionary Access Control (DAC): The data owner grants access to DAC
models. DAC is a method for assigning access rights based on rules defined by
the user.
o Mandatory Access Control (MAC): In MAC, people are permitted access
based on an information clearance, designed using a nondiscretionary paradigm.
MAC refers to a policy that assigns access permissions based on central
authority regulations.
o Role-Based Access Control (RBAC): RBAC uses fundamental security
principles like “least privilege” and “separation of privilege” to give access
depending on a user’s role. As a result, someone wanting to access information
can only access the data required for their function.
o Attribute-Based Access Control (ABAC): Each resource and user in ABAC
receives a set of attributes. This dynamic approach makes a judgment on
resource access based on comparing the user’s features, such as time of day,
position, and location.
Inference Control
 Inference control in databases, also known as Statistical Disclosure Control (SDC), is a
discipline that aims to secure data so that it can be published without disclosing sensitive
information associated with specific individuals among those to whom the data
corresponds.
 It prevents the user from completing any inference channel. This strategy prevents
sensitive information from indirect disclosure. There are two kinds of inferences: identity
disclosure and attribute disclosure.
 SDC is used to protect the privacy of respondents in fields such as government statistics,
health statistics, e-commerce (sharing of customer data), etc. Since data modification
ultimately implies data protection, SDC aims to achieve protection with the minimum
amount of accuracy loss for database users.
Flow Control
 Distributed systems involve a large amount of data flow from one site to another as well
as within a site. Flow control prohibits data from being transferred in such a way that
unauthorized agents cannot access it.
 A flow policy specifies the channels through which data can flow. It also defines security
classes for data as well as transactions. Convert channels are the pathways for
information to flow implicitly in ways that violate a company's privacy policy.
Database Security Applying Statistical Method
 Statistical database security focuses on the protection of sensitive individual values stored
in so-called statistical databases and used for statistical purposes, as well as retrieving
summaries of values based on categories. They do not allow the retrieval of individual
information.
 Examples include patient records utilized by medical researchers and extensive phone call
records statistically examined by phone companies to improve their services.
 It provides access to the database to obtain statistical information about the number of
employees in the company but not to obtain detailed confidential/personal information
about a specific individual employee.
 The techniques used to prevent statistical database compromise are classified into two
types: noise addition, in which all data and/or statistics are available but are only
approximate rather than exact, and restriction, in which the system only delivers statistics
and/or data that are deemed safe.
Encryption
 Data encryption protects data confidentiality by converting it to encoded information
known as ciphertext, which can only be decoded with a unique decryption key generated
either during or before encryption.
 Data encryption can be used during data storage or transfer, and it is usually used in
conjunction with authentication services to ensure that keys are only given to or used by
authorized users.
 Data is more accessible and desirable to attackers than ever before, increasing the need
for security. Additionally, many firms must comply with data protection regulations,
many of which specifically require the use of encryption.
 Aside from the clear advantages of enhanced security, privacy protection, and the
avoidance of illegal access, encryption also helps to assure data integrity. Encryption
protects content against unauthorized change and can be used to validate the origin and
authenticity of data.
Data Protection Tools and Platforms
Several companies now offer data protection platforms and tools. A comprehensive approach
should include the following features:
 Discovery: The ability to discover is often required to meet regulatory compliance
requirements. Look for a solution that can discover and classify vulnerabilities in our
databases, whether they're in the cloud or on-premises. It will also offer recommendations
to address any discovered vulnerabilities.
 Data Activity Monitoring: The solution should be capable of monitoring and analyzing
all data activity in all databases, regardless of whether our application is on-premises, in
the cloud, or inside a container. It will notify us of suspicious activities in real-time,
allowing us to respond to threats more rapidly. It also provides visibility into the current
state of our data via an integrated and comprehensive user interface. It is also important to
select a system that enforces regulations, procedures, and the division of roles. Be
sure that the solution we select can generate the reports we require to comply with the
rules.
 Data Tokenization and Encryption: In the case of an incident, encryption serves as an
additional line of protection against any breach. Any software we deploy must be capable
of protecting data in the cloud, on-premises hybrid, or multi-cloud environments. Find a
tool that has volume, file, and application encryption features that meet our company's
compliance requirements. Tokenization (data concealment) or advanced key management
of security keys may be required.
 Data Security and Risk Analysis Optimization: An application that provides contextual
insights by combining security data with advanced analytics will allow the user to
perform optimizing, risk assessment, and reporting a breeze. Select a technology that can
store and combine vast amounts of recent and historical data regarding the security and
health of your databases. Also, choose a solution that allows for data analysis, auditing,
and reporting capabilities via a comprehensive but user-friendly self-service interface.

Unit-6
An object-oriented database (OODB) is a database that combines object-oriented
programming concepts with relational database principles. It is managed by an object-
oriented database management system (OODBMS). OODBs contain the following elements:
 Objects. The basic building block and an instance of a class. The type is either
built-in or user-defined.
 Classes. A schema or blueprint that defines object structure and behavior.
 Methods. A blueprint that defines the behavior of a class.
 Pointers. An entity that helps access elements of an object database. They also
help establish relationships between objects.
The main feature of objects is the possibility of user-constructed types. An object created in
a programming language (such as Java) saves into an OODB as is. All the information about
an object is in a single package instead of spread across multiple tables.
Object-Oriented Programming Concepts in Database Management
Object-oriented databases closely relate to object-oriented programming (OOP) concepts.
The four main features of OOP are:
 Polymorphism.
 Inheritance.
 Encapsulation.
 Abstraction.
These four attributes describe the main characteristics of object-oriented management
systems in general. The sections below explain each of these concepts in greater depth.
Polymorphism
Polymorphism is the capability of an object to take multiple forms. This ability allows the
same program code to work with different data types.
To illustrate, a Vehicle class can be defined to have a method called brake(). A Car and
a Bike can both inherit from the class and implement their version of the brake() method.
The same method is applied for different behaviors, resulting in polymorphism.
Inheritance
Inheritance creates a hierarchical relationship between related classes while making parts of
code reusable. Defining new types inherits all the existing class fields and methods plus
further extends them. The existing class is the parent class, while the child class extends the
parent.
For example, a parent class called Vehicle can have child classes such as Car and Bike.
Both child classes inherit information from the parent class. They also extend the parent
class with new information depending on the methods defined for each vehicle type.
Encapsulation
Encapsulation allows grouping variables and methods into a single object to create access
protection. This process hides information and details of how an object works from the rest
of the code and results in data and function security.
Classes interact with each other through interface methods without the need to know how
particular methods work.
For example, a Car class can have properties such as color, make, and model and methods
such as changeColor(). You can change the color of a car through a method, yet
the model and make are not accessible. Encapsulation bundles all the car information into
one entity, where some elements are modifiable while some are not.
Abstraction
Abstraction is the process of focusing on the essential characteristics to
provide functionality. The process selects vital information while unnecessary details stay
hidden. Abstraction helps reduce data complexity and simplifies code reusability.
For example, when a web browser connects to the internet, it doesn't need to know the
specific connection details. Whether the connection is established through Wi-
Fi or Ethernet is irrelevant. The specific connection type is hidden from the browser to
create an abstraction, whereas the various types of connections represent different
implementations of the abstraction.
Features of Object-Oriented Databases
OODB has different features and implementations. Most contain the attributes described in
the table below:

Feature Description

Query Language Finds objects and retrieve data from the database.

Transparent Allows accessing and using data with an object-oriented programming


Persistence language without special handling.

ACID Ensures that ACID transactions, and guarantees all transactions are
Transactions completed without conflicting changes.

Creates a partial replica of the database in memory. Allows faster access


Database Caching
to a database without reading from disk.

Provides disaster recovery mechanisms in case of application or


Recovery
system failure.

Object-Relational Database (ORD)


An Object-Relational Database (ORD) is a database management system (DBMS) that
integrates object-oriented database model features into relational databases. ORDs aim to
bridge the gap between relational databases and the object-oriented modeling techniques that
are commonly used in programming languages. This type of database supports data types,
structures, and behaviors directly in the database schema and query language.
Key Features of ORDBMS
Object-Relational Database Management Systems (ORDBMS) offer a range of features that
combine the advantages of both relational and object-oriented database systems. Here are
some key features:
1. Object-Oriented Capabilities:
 Support for Complex Data Types: ORDBMS can handle complex data
types, including user-defined types, which allow for the representation of
real-world entities more accurately.
 Inheritance: Objects in an ORDBMS can inherit properties and methods
from other objects, enabling code reuse and simplification of data modeling.
 Encapsulation: Data and operations on data can be encapsulated in objects,
promoting modularity and reusability.
 Polymorphism: ORDBMS supports polymorphism, allowing objects to be
treated as instances of their parent class, enhancing flexibility in data
manipulation.
2. Relational Database Capabilities:
 SQL Support: ORDBMS fully supports SQL for querying and manipulating
data, ensuring compatibility with existing applications and tools.
 Transactional Integrity: ORDBMS provides robust support for ACID
(Atomicity, Consistency, Isolation, Durability) properties, ensuring data
integrity and reliability.
 Scalability and Performance: ORDBMS are designed to handle large
volumes of data and complex queries efficiently, making them suitable for
enterprise-level applications.
3. Enhanced Data Modeling:
 Support for Multimedia and Spatial Data: ORDBMS can store and
manage multimedia data (images, videos, audio) and spatial data (geographic
information), which are often required in modern applications.
 Extensibility: Users can extend the database with custom functions and
procedures, tailored to specific application needs.
4. Interoperability:
 Integration with Programming Languages: ORDBMS often provide
integration with popular programming languages (e.g., Java, Python, C++),
allowing developers to use familiar tools and frameworks.
Challenges and Considerations
While Object-Relational Database Management Systems (ORDBMS) offer numerous
benefits, they also present certain challenges and considerations that organizations should be
aware of:
Complexity
1. Learning Curve: The combination of relational and object-oriented paradigms can
be complex, requiring database administrators and developers to have a solid
understanding of both models.
2. Complex Schema Design: Designing a schema that leverages both relational and
object-oriented features can be intricate and time-consuming.
Performance Overhead
1. Processing Overhead: The additional features and capabilities of an ORDBMS
can introduce processing overhead, potentially impacting performance compared to
simpler relational databases.
2. Optimization Challenges: Optimizing queries and data structures in an ORDBMS
can be more challenging due to the complexity of the object-relational model.
Compatibility and Integration
1. Compatibility Issues: Integrating an ORDBMS with existing systems and
applications that rely on traditional relational databases may require significant
adjustments.
2. Tool Support: Not all database management tools and applications fully support
the advanced features of ORDBMS, which can limit their utility in certain
environments.
Cost
1. Higher Costs: ORDBMS can be more expensive to implement and maintain due to
the need for specialized skills and more sophisticated hardware and software
infrastructure.
2. Licensing Fees: Commercial ORDBMS products may come with higher licensing
fees compared to traditional relational databases.
Data Migration
1. Migration Efforts: Migrating data from a traditional relational database to an
ORDBMS can be complex and require significant effort to ensure data integrity
and compatibility.
2. Data Transformation: The process of transforming data to fit the object-relational
model can introduce risks and require thorough testing and validation.
Logical Database
A Logical Database is a special type of ABAP (Advance Business Application and
Programming) that is used to retrieve data from various tables and the data is interrelated to
each other. Also, a logical database provides a read-only view of Data.
Structure Of Logical Database:
A Logical database uses only a hierarchical structure of tables i.e. Data is organized in a
Tree-like Structure and the data is stored as records that are connected to each other through
edges (Links). Logical Database contains Open SQL statements which are used to read data
from the database. The logical database reads the program, stores them in the program if
required, and passes them line by line to the application program.
Features of Logical Database:
In this section, let us look at some features of a logical database:
 We can select only that type of Data that we need.
 Data Authentication is done in order to maintain security.
 Logical Database uses hierarchical Structure due to this data integrity is maintained.
Goal Of Logical Database:
The goal of Logical Database is to create well-structured tables that reflect the need of the
user. The tables of the Logical database store data in a non-redundant manner and foreign
keys will be used in tables so that relationships among tables and entities will be supported.
Tasks Of Logical Database:
Below is some important task of Logical Database:
 With the help of the Logical database, we will read the same data from multiple
programs.
 A logical database defines the same user interface for multiple programs.
 Logical Database ensures the Authorization checks for the centralized sensitive
database.
 With the help of a Logical Database, Performance is improved. Like in Logical
Database we will use joins instead of multiple SELECT statements, which will
improve response time and this will increase the Performance of Logical Database.
Data View Of Logical Database:
Logical Database provides a particular view of Logical Database tables. A logical database is
appropriately used when the structure of the Database is Large. It is convenient to use flow
i.e
 SELECT
 READ
 PROCESS
 DISPLAY
In order to work with databases efficiently. The data of the Logical Database is hierarchical
in nature. The tables are linked to each other in a Foreign Key relationship.
Points To Remember:
 Tables must have Foreign Key Relationship.
 A logical Database consists of logically related tables that are arranged in a
hierarchical manner used for reading or retrieving Data.
 Logical Database consist of three main elements:
o Structure of Database
o Selections of Data from Database
o Database Program
 If we want to improve the access time on data, then we use VIEWS in Logical
Database.
Web Database
 A web database is a system for storing and displaying information that is accessible from
the Internet / web. The database might be used for any of a wide range of functions, such
as a membership database, client list, or inventory database. A web database is a type of
Web Application.
 A Web database is a database application designed to be managed and accessed through
the Internet. Website operators can manage this collection of data and present analytical
results based on the data in the Web database application. Databases first appeared in the
1990s, and have been an asset for businesses, allowing the collection of seemingly
infinite amounts of data from infinite amounts of customers.
 A web database is ideal for situations when the information should be shared, or when it
must be accessed from various locations. It is especially beneficial when the system is to
be shared between locations or different devices (like tablets, computers and cell phones).
 A web database can be used for a range of different purposes. Each field in a table has to
have a defined data type. For example, numbers, strings, and dates can all be inserted into
a web database. Proper database design involves choosing the correct data type for each
field in order to reduce memory consumption and increase the speed of access. Although
for small databases this often isn't so important, big web databases can grow to millions
of entries and need to be well designed to work effectively.
Benefits that come through the use of web-based DBMS are:
 Provides simplicity
 Web-DBMS is Platform independence
 Provides Graphical User Interface (GUI)
 Standardization
 Provides Cross-platform support
 Facilitates transparent network access
 Scalability
 Innovation
Distributed Database System
A distributed database is basically a database that is not limited to one system, it is spread
over different sites, i.e, on multiple computers or over a network of computers. A distributed
database system is located on various sites that don’t share physical components. This may
be required when a particular database needs to be accessed by various users globally. It
needs to be managed such that for the users it looks like one single database.
Types:
1. Homogeneous Database:
In a homogeneous database, all different sites store database identically. The operating
system, database management system, and the data structures used – all are the same at all
sites. Hence, they’re easy to manage.
2. Heterogeneous Database:
In a heterogeneous distributed database, different sites can use different schema and software
that can lead to problems in query processing and transactions. Also, a particular site might
be completely unaware of the other sites. Different computers may use a different operating
system, different database application. They may even use different data models for the
database. Hence, translations are required for different sites to communicate.
Distributed Data Storage :
There are 2 ways in which data can be stored on different sites. These are:
1. Replication –
In this approach, the entire relationship is stored redundantly at 2 or more sites. If the entire
database is available at all sites, it is a fully redundant database. Hence, in replication,
systems maintain copies of data.
This is advantageous as it increases the availability of data at different sites. Also, now query
requests can be processed in parallel.
However, it has certain disadvantages as well. Data needs to be constantly updated. Any
change made at one site needs to be recorded at every site that relation is stored or else it
may lead to inconsistency. This is a lot of overhead. Also, concurrency control becomes way
more complex as concurrent access now needs to be checked over a number of sites.
2. Fragmentation –
In this approach, the relations are fragmented (i.e., they’re divided into smaller parts) and
each of the fragments is stored in different sites where they’re required. It must be made sure
that the fragments are such that they can be used to reconstruct the original relation (i.e, there
isn’t any loss of data).
Fragmentation is advantageous as it doesn’t create copies of data, consistency is not a
problem.
Fragmentation of relations can be done in two ways:
 Horizontal fragmentation – Splitting by rows –
The relation is fragmented into groups of tuples so that each tuple is assigned to at
least one fragment.
 Vertical fragmentation – Splitting by columns –
The schema of the relation is divided into smaller schemas. Each fragment must
contain a common candidate key so as to ensure a lossless join.
In certain cases, an approach that is hybrid of fragmentation and replication is used.
Applications of Distributed Database:
 It is used in Corporate Management Information System.
 It is used in multimedia applications.
 Used in Military’s control system, Hotel chains etc.
 It is also used in manufacturing control system.
A distributed database system is a type of database management system that stores data
across multiple computers or sites that are connected by a network. In a distributed database
system, each site has its own database, and the databases are connected to each other to form
a single, integrated system.
The main advantage of a distributed database system is that it can provide higher
availability and reliability than a centralized database system. Because the data is stored
across multiple sites, the system can continue to function even if one or more sites fail. In
addition, a distributed database system can provide better performance by distributing the
data and processing load across multiple sites.
There are several different architectures for distributed database systems, including:
Client-server architecture: In this architecture, clients connect to a central server, which
manages the distributed database system. The server is responsible for coordinating
transactions, managing data storage, and providing access control.
Peer-to-peer architecture: In this architecture, each site in the distributed database system
is connected to all other sites. Each site is responsible for managing its own data and
coordinating transactions with other sites.
Federated architecture: In this architecture, each site in the distributed database system
maintains its own independent database, but the databases are integrated through a
middleware layer that provides a common interface for accessing and querying the data.
Distributed database systems can be used in a variety of applications, including e-
commerce, financial services, and telecommunications. However, designing and managing a
distributed database system can be complex and requires careful consideration of factors
such as data distribution, replication, and consistency.
Advantages of Distributed Database System :
1) There is fast data processing as several sites participate in request processing.
2) Reliability and availability of this system is high.
3) It possess reduced operating cost.
4) It is easier to expand the system by adding more sites.
5) It has improved sharing ability and local autonomy.
Disadvantages of Distributed Database System :
1) The system becomes complex to manage and control.
2) The security issues must be carefully managed.
3) The system require deadlock handling during the transaction processing otherwise
the entire system may be in inconsistent state.
4) There is need of some standardization for processing of distributed database
system.
Data Warehousing
It is a technology that aggregates structured data from one or more sources so that it can be
compared and analyzed rather than transaction processing. A data warehouse is designed to
support the management decision-making process by providing a platform for data cleaning,
data integration, and data consolidation. A data warehouse contains subject-oriented,
integrated, time-variant, and non-volatile data. The Data warehouse consolidates data from
many sources while ensuring data quality, consistency, and accuracy. Data
warehouse improves system performance by separating analytics processing from
transactional databases. Data flows into a data warehouse from the various databases. A data
warehouse works by organizing data into a schema that describes the layout and type of data.
Query tools analyze the data tables using schema.
Advantages of Data Warehousing
 The data warehouse’s job is to make any form of corporate data easier to understand.
The majority of the user’s job will consist of inputting raw data.
 The capacity to update continuously and frequently is the key benefit of this
technology. As a result, data warehouses are perfect for organizations and
entrepreneurs who want to stay current with their target audience and customers.
 It makes data more accessible to businesses and organizations.
 A data warehouse holds a large volume of historical data that users can use to evaluate
different periods and trends in order to create predictions for the future.
Disadvantages of Data Warehousing
 There is a great risk of accumulating irrelevant and useless data. Data loss and erasure
are other potential issues.
 Data is gathered from various sources in a data warehouse. Cleansing and
transformation of the data are required. This could be a difficult task.
Data Mining
It is the process of finding patterns and correlations within large data sets to identify
relationships between data. Data mining tools allow a business organization to predict
customer behavior. Data mining tools are used to build risk models and detect fraud. Data
mining is used in market analysis and management, fraud detection, corporate analysis, and
risk management.
Advantages of Data Mining
 Data mining aids in a variety of data analysis and sorting procedures. The
identification and detection of any undesired fault in a system is one of the best
implementations here. This method permits any dangers to be eliminated sooner.
 In comparison to other statistical data applications, data mining methods are both cost-
effective and efficient.
 Companies can take advantage of this analytical tool by providing appropriate and
easily accessible knowledge-based data.
 The detection and identification of undesirable faults that occur in the system are one
of the most astonishing data mining techniques.
Disadvantages of Data Mining
 Data mining isn’t always 100 percent accurate, and if done incorrectly, it can lead to
data breaches.
 Organizations must devote a significant amount of resources to training and
implementation. Furthermore, the algorithms used in the creation of data mining tools
cause them to work in different ways.
Difference Between Data Mining and Data Warehousing

Basis of
Comparison Data Warehousing Data Mining

A data warehouse is a
database system that is
Data mining is the process of analyzing
Definition designed for analytical
data patterns.
analysis instead of
transactional work.

Process Data is stored periodically. Data is analyzed regularly.

Data warehousing is the


process of extracting and Data mining is the use of pattern
Purpose
storing data to allow easier recognition logic to identify patterns.
reporting.

Managing Data warehousing is solely Data mining is carried out by business


Authorities carried out by engineers. users with the help of engineers.

Data warehousing is the


Data Data mining is considered as a process of
process of pooling all
Handling extracting data from large data sets.
relevant data together.

Subject-oriented,
AI, statistics, databases, and machine
integrated, time-varying and
Functionality learning systems are all used in data
non-volatile constitute data
mining technologies.
warehouses.

Data warehousing is the


process of extracting and Pattern recognition logic is used in data
Task
storing data in order to make mining to find patterns.
reporting more efficient.
Basis of
Comparison Data Warehousing Data Mining

It extracts data and stores it This procedure employs pattern


Uses in an orderly format, making recognition tools to aid in the
reporting easier and faster. identification of access patterns.

Data mining aids in the creation of


When a data warehouse is
suggestive patterns of key parameters.
connected with operational
Customer purchasing behavior, items,
business systems like CRM
Examples and sales are examples. As a result,
(Customer Relationship
businesses will be able to make the
Management) systems, it
required adjustments to their operations
adds value.
and production.

You might also like