Understanding the ER Model in Databases

Uploaded by

k.arunkumar

Introduction of ER Model
Last Updated : 09 Sep, 2025

The Entity-Relationship Model (ER Model) is a conceptual model for designing databases. It represents the logical structure of a database, including entities, their attributes and the relationships between them.

Entity: An object that is stored as data, such as Student, Course or Company.
Attribute: A property that describes an entity, such as StudentID, CourseName or EmployeeEmail.
Relationship: A connection between entities, such as "a Student enrolls in a Course".
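
As a rough illustration of how these three building blocks can show up in a relational schema, the sketch below uses Python's built-in sqlite3 module. The table and column names (student, course, enrolls_in, student_id and so on) and the sample row values are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Entities become tables, attributes become columns.
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, course_name TEXT NOT NULL);
-- The relationship "a Student enrolls in a Course" becomes its own table.
CREATE TABLE enrolls_in (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  INTEGER NOT NULL REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
""")
conn.execute("INSERT INTO student VALUES (1, 'Asha')")
conn.execute("INSERT INTO course VALUES (10, 'DBMS')")
conn.execute("INSERT INTO enrolls_in VALUES (1, 10)")

# Follow the relationship from student to course.
rows = conn.execute("""
    SELECT s.name, c.course_name
    FROM enrolls_in e
    JOIN student s ON s.student_id = e.student_id
    JOIN course  c ON c.course_id  = e.course_id
""").fetchall()
print(rows)
```

Whether a relationship gets its own table or is folded into one of the entity tables depends on its cardinality; a separate table is simply the most general choice.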

Components of ER Diagram

The graphical representation of this model is called an Entity-Relationship Diagram (ERD).

ER Model in Database Design Process

We typically follow these steps when designing a database for an application.

Gather the requirements (functional and data) by asking questions of the database users.
Create a logical or conceptual design of the database. This is where the ER model plays a role. It is one of the most widely used graphical representations of the conceptual design of a database.
After this, focus on physical database design (like indexing) and external design (like views).

Why Use ER Diagrams In DBMS?

ER diagrams represent the ER model of a database, which makes them easy to convert into relations (tables).
They model real-world objects directly, which makes them especially useful.
Unlike technical schemas, ER diagrams require no technical knowledge of the underlying DBMS.
They visually model data and its relationships, making complex systems easier to understand.

Symbols Used in ER Model

The ER Model is used to model the logical view of the system from a data perspective. It uses these symbols:

Rectangles: Rectangles represent entities in the ER Model.
Ellipses: Ellipses represent attributes in the ER Model.
Diamonds: Diamonds represent relationships among entities.
Lines: Lines link attributes to entities and entity sets to relationship types.
Double Ellipse: Double ellipses represent multi-valued attributes, such as a student's multiple phone numbers.
Double Rectangle: Double rectangles represent weak entities, which depend on other entities for identification.

Symbols used in ER Diagram

What is an Entity?
An Entity represents a real-world object, concept or thing about which data is stored in a database. It acts as a building block of a database. Tables in a relational database represent these entities.

Example of entities:

Real-World Objects: Person, Car, Employee etc.

Concepts: Course, Event, Reservation etc.
Things: Product, Document, Device etc.

The entity type defines the structure of an entity, while individual instances of that type
represent specific entities.

What is an Entity Set?

An entity refers to an individual object of an entity type, and the collection of all entities of a
particular type is called an entity set. For example, E1 is an entity that belongs to the entity
type "Student," and the group of all students forms the entity set.

In the ER diagram below, the entity type is represented as:

Entity Set

We can represent the entity sets in an ER Diagram but we can't represent individual
entities because an entity is like a row in a table, and an ER diagram shows the structure
and relationships of data, not specific data entries (like rows and columns). An ER diagram
is a visual representation of the data model, not the actual data itself.

Types of Entity
There are two main types of entities:

1. Strong Entity

A Strong Entity is a type of entity that has a key Attribute that can uniquely identify each
instance of the entity. A Strong Entity does not depend on any other Entity in the Schema
for its identification. It has a primary key that ensures its uniqueness and is represented by
a rectangle in an ER diagram.

2. Weak Entity

A Weak Entity cannot be uniquely identified by its own attributes alone. It depends on a strong entity, called its identifying entity, for identification. A weak entity is represented by a double rectangle. The participation of a weak entity type in its identifying relationship is always total. The relationship between the weak entity type and its identifying strong entity type is called the identifying relationship, and it is represented by a double diamond.

Example:

A company may store information about the dependents (parents, children, spouse) of an employee. But a dependent can't exist in the database without the employee. So Dependent is a weak entity type, and Employee, a strong entity type, is its identifying entity type.
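
One common way to carry a weak entity into tables is sketched below with Python's sqlite3 module. It assumes hypothetical employee and dependent tables in which dep_name distinguishes dependents of the same employee (textbooks usually call such an attribute a partial key), so the dependent's primary key combines the owner's key with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dependent (
    emp_id   INTEGER NOT NULL REFERENCES employee(emp_id) ON DELETE CASCADE,
    dep_name TEXT NOT NULL,           -- only unique per employee
    relation TEXT,
    PRIMARY KEY (emp_id, dep_name)    -- owner key + partial key
);
""")
conn.execute("INSERT INTO employee VALUES (1, 'Ravi')")
conn.execute("INSERT INTO dependent VALUES (1, 'Meena', 'Spouse')")

# A dependent that points at no existing employee is rejected.
try:
    conn.execute("INSERT INTO dependent VALUES (99, 'Nobody', 'Child')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the employee removes the dependents that relied on it.
conn.execute("DELETE FROM employee WHERE emp_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM dependent").fetchone()[0]
print(orphan_rejected, remaining)
```

The NOT NULL foreign key reflects total participation, and ON DELETE CASCADE is one possible way to express that a dependent cannot outlive its employee.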

Strong Entity and Weak Entity

Attributes in ER Model
Attributes are the properties that define the entity type. For example, for a Student entity
Roll_No, Name, DOB, Age, Address, and Mobile_No are the attributes that define entity
type Student. In ER diagram, the attribute is represented by an oval.

Attribute

Types of Attributes

1. Key Attribute

The attribute which uniquely identifies each entity in the entity set is called the key
attribute. For example, Roll_No will be unique for each student. In ER diagram, the key
attribute is represented by an oval with an underline.

Key Attribute

2. Composite Attribute

An attribute composed of several other attributes is called a composite attribute. For example, the Address attribute of the Student entity type consists of Street, City, State, and Country. In an ER diagram, a composite attribute is represented by an oval comprising other ovals.

Composite Attribute

3. Multivalued Attribute

An attribute that can hold more than one value for a given entity is called a multivalued attribute. For example, Phone_No (a student can have more than one). In an ER diagram, a multivalued attribute is represented by a double oval.

Multivalued Attribute

4. Derived Attribute

An attribute that can be derived from other attributes of the entity type is known as a derived attribute, e.g. Age (which can be derived from DOB). In an ER diagram, a derived attribute is represented by a dashed oval.

Derived Attribute

The Complete Entity Type Student with its Attributes can be represented as:

Entity and Attributes
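
The four attribute types can also be sketched in code. The Python dataclasses below are only an illustration: the class layout, the age method and the sample values are assumptions, not part of the ER notation.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Address:
    """Composite attribute: made up of simpler attributes."""
    street: str
    city: str
    state: str
    country: str


@dataclass
class Student:
    roll_no: int                  # key attribute
    name: str
    dob: date
    address: Address              # composite attribute
    mobile_nos: list[str] = field(default_factory=list)  # multivalued attribute

    def age(self, today: date) -> int:
        """Derived attribute: computed from dob instead of being stored."""
        had_birthday = (today.month, today.day) >= (self.dob.month, self.dob.day)
        return today.year - self.dob.year - (0 if had_birthday else 1)


s = Student(
    roll_no=1,
    name="Asha",
    dob=date(2004, 9, 15),
    address=Address("MG Road", "Pune", "Maharashtra", "India"),
    mobile_nos=["9000000001", "9000000002"],
)
age_before_birthday = s.age(date(2025, 9, 9))
age_on_birthday = s.age(date(2025, 9, 15))
print(age_before_birthday, age_on_birthday)
```

Computing Age on demand avoids storing a value that would go stale, which is the usual reason derived attributes are not stored.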

Relationship Type and Relationship Set

A Relationship Type represents the association between entity types. For example, ‘Enrolled in’ is a relationship type that exists between the entity types Student and Course. In an ER diagram, a relationship type is represented by a diamond that is connected to the participating entities with lines.

Entity-Relationship Set

A set of relationships of the same type is known as a relationship set. The following relationship set depicts S1 as enrolled in C2, S2 as enrolled in C1, and S3 as enrolled in C3.

Relationship Set

Degree of a Relationship Set

The number of different entity sets participating in a relationship set is called the degree of
a relationship set.

1. Unary/Recursive Relationship: When only ONE entity set participates in a relationship, the relationship is called a unary relationship. For example, one person is married to only one person.

Unary Relationship

Cardinality, the maximum number of times an entity of an entity set can participate in a relationship set, can be of different types:

1. One-to-One

When each entity in each entity set can take part only once in the relationship, the
cardinality is one-to-one. Let us assume that a male can marry one female and a female
can marry one male. So the relationship will be one-to-one.

One to One Cardinality

Using Sets, it can be represented as:

Set Representation of One-to-One

2. One-to-Many

In a one-to-many mapping, one entity can be related to more than one entity of the other entity set. Let us assume that one surgery department can accommodate many doctors. So the cardinality will be 1 to M. It means one department has many doctors.

one to many cardinality

Using sets, one-to-many cardinality can be represented as:

Set Representation of One-to-Many

3. Many-to-One

When entities in one entity set can take part only once in the relationship set and entities in
other entity sets can take part more than once in the relationship set, cardinality is many to
one.

Let us assume that a student can take only one course but one course can be taken by
many students. So the cardinality will be n to 1. It means that for one course there can be n
students but for one student, there will be only one course.

many to one cardinality

Using Sets, it can be represented as:

Set Representation of Many-to-One

In this case, each student is taking only 1 course but 1 course has been taken by many
students.

4. Many-to-Many

When entities in all entity sets can take part more than once in the relationship, the cardinality is many-to-many. Let us assume that a student can take more than one course and one course can be taken by many students. So the relationship will be many-to-many.

many to many cardinality

Using Sets, it can be represented as:

Many-to-Many Set Representation

In this example, student S1 is enrolled in C1 and C3, and course C3 is taken by S1, S3, and S4. So it is a many-to-many relationship.
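
The set view of cardinality can be made concrete with a small Python sketch. The helper name cardinality and the sample pairs are assumptions for illustration. Note that the function only describes one snapshot of data, whereas the cardinality drawn in an ER diagram is a rule that every valid snapshot must obey.

```python
from collections import Counter


def cardinality(pairs):
    """Classify a binary relationship set given as (left, right) pairs."""
    pairs = set(pairs)
    left_counts = Counter(left for left, _ in pairs)
    right_counts = Counter(right for _, right in pairs)
    # Does any entity on a side take part in more than one pair?
    left_repeats = any(n > 1 for n in left_counts.values())
    right_repeats = any(n > 1 for n in right_counts.values())
    if left_repeats and right_repeats:
        return "many-to-many"
    if right_repeats:
        return "many-to-one"   # several left entities share one right entity
    if left_repeats:
        return "one-to-many"
    return "one-to-one"


# S1 takes C1 and C3; C3 is taken by S1, S3 and S4.
enrolled_in = {("S1", "C1"), ("S1", "C3"), ("S3", "C3"), ("S4", "C3")}
# Each student takes exactly one course; C1 is shared.
takes_one_course = {("S1", "C1"), ("S2", "C1"), ("S3", "C2")}

result_many_to_many = cardinality(enrolled_in)
result_many_to_one = cardinality(takes_one_course)
print(result_many_to_many, result_many_to_one)
```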

Participation Constraint
Participation Constraint is applied to the entity participating in the relationship set.

1. Total Participation: Each entity in the entity set must participate in the relationship. If
each student must enroll in a course, the participation of students will be total. Total
participation is shown by a double line in the ER diagram.

2. Partial Participation: An entity in the entity set may or may NOT participate in the relationship. If some courses have no students enrolled in them, the participation of Course will be partial.

The diagram depicts the ‘Enrolled in’ relationship set with Student Entity set having total
participation and Course Entity set having partial participation.

Total Participation and Partial Participation

Using sets, it can be represented as:

Set representation of Total Participation and Partial Participation

Every student in the Student Entity set participates in a relationship but there exists a
course C4 that is not taking part in the relationship.
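
The same check can be written as a short Python sketch. The function name participation and the concrete student and course identifiers are assumptions; the data mirrors the situation described above, where every student is enrolled but course C4 has no students.

```python
def participation(entity_set, participating):
    """Return 'total' if every entity takes part in the relationship, else 'partial'."""
    return "total" if set(entity_set) <= set(participating) else "partial"


students = {"S1", "S2", "S3"}
courses = {"C1", "C2", "C3", "C4"}
enrolled_in = {("S1", "C1"), ("S2", "C2"), ("S3", "C3")}

student_participation = participation(students, {s for s, _ in enrolled_in})
course_participation = participation(courses, {c for _, c in enrolled_in})
print(student_participation, course_participation)
```

As with cardinality, this inspects one data snapshot; in the diagram, total participation is a constraint that must hold for all data.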

How to Draw an ER Diagram

1. Identify Entities: The very first step is to identify all the Entities. Represent these
entities in a Rectangle and label them accordingly.

2. Identify Relationships: The next step is to identify the relationship between them and
represent them accordingly using the Diamond shape. Ensure that relationships are not
directly connected to each other.

3. Add Attributes: Attach attributes to the entities by using ovals. Each entity can have
multiple attributes (such as name, age, etc.), which are connected to the respective entity.

4. Define Primary Keys: Assign primary keys to each entity. These are unique identifiers
that help distinguish each instance of the entity. Represent them with underlined attributes.

5. Remove Redundancies: Review the diagram and eliminate unnecessary or repetitive entities and relationships.

6. Review for Clarity: Review the diagram to make sure it is clear and effectively conveys the relationships between the entities.

Difference between Generalization and Specialization in DBMS

Last Updated : 12 Jul, 2025

Generalization and Specialization are two essential ideas used to describe the hierarchical connections between entities in a database in the context of Enhanced Entity-Relationship (EER) diagrams. These principles help organize and structure data by building connections among various entity levels. Specialization is the act of breaking down a higher-level entity into more focused, lower-level entities, while generalization is the act of integrating lower-level entities into a higher-level entity. Understanding these ideas is essential for effective database design and administration.

What is Generalization?
In EER diagrams, generalization is a bottom-up method used to combine lower-level entities into a higher-level entity. This approach creates a more generic entity, known as a superclass, by combining entities with similar features. By removing duplication and arranging the data in a more organized manner, generalization streamlines the data model.

Advantages of Generalization

Cuts Down on Redundancy: Cuts down on data duplication by combining related entities into a single entity.
Simplifies Schema: Combines many entities into a single, clearer schema.
Enhances Data Organization: By cohesively presenting related entities, it makes better organization possible.

Disadvantages of Generalization

Loss of Specificity: The generic entity may take center stage over the distinctive
qualities of lower-level entities.
Complexity of Querying: As data becomes more abstracted, queries may get more
complicated.

Example of Generalization

Consider two entities, Student and Patient. Each has some characteristics of its own. For example, the Student entity has Roll_No, Name, and Mob_No, while the Patient entity has PId, Name, and Mob_No. The attributes Name and Mob_No, which are common to both Student and Patient, can be combined into one higher-level entity, Person. This process is called generalization.
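
The result of this generalization can be sketched with Python dataclasses. Mapping entity types to classes is only an analogy, and the field names and sample values below are assumptions derived from the attribute names in the example.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    """Higher-level entity holding the attributes common to both."""
    name: str
    mob_no: str


@dataclass
class Student(Person):
    roll_no: int      # specific to Student


@dataclass
class Patient(Person):
    p_id: int         # specific to Patient


student = Student(name="Asha", mob_no="9000000001", roll_no=42)
patient = Patient(name="Ravi", mob_no="9000000002", p_id=7)

student_attrs = [f.name for f in fields(Student)]
patient_attrs = [f.name for f in fields(Patient)]
print(student_attrs, patient_attrs)
```

Name and Mob_No are now defined once in Person, while Roll_No and PId stay with the lower-level entities.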
What is Specialization?
In EER diagrams, specialization is a top-down method where a higher-level entity is split
into two or more lower-level entities according to their distinct qualities. This technique,
which includes splitting a single entity set into subgroups, is often connected to
inheritance, in which attributes from the higher-level entity are passed down to the lower-
level entities.

Advantages of Specialization

Enhances Specificity: By forming specialized subgroups, it is possible to depict entities in more depth.
Encourages Inheritance: Relationships and characteristics of higher-level entities are passed down to lower-level entities.
Enhances Data Integrity: Ensures that every entity has the distinct qualities relevant to its specialization.

Disadvantages of Specialization

Expands Schema Size: Adding additional entities may lead to an increase in the
schema's complexity and size.
Can Cause Redundancy: There might be certain characteristics that are duplicated
across specialized entities.

Example of Specialization

Consider an entity Account with the attributes Acc_No and Balance. Account can be specialized into two lower-level entities, Current_Acc and Savings_Acc. Current_Acc has Acc_No, Balance and Transactions, while Savings_Acc has Acc_No, Balance and Interest_Rate. Hence we can say that the specialized entities inherit the characteristics of the higher-level entity.
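
One possible table layout for this specialization is sketched below with Python's sqlite3 module. It assumes one table per entity, with each specialized table sharing the primary key of account; the column types and sample values are illustrative assumptions, and other mappings (for example a single table with a type column) are equally valid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE account (
    acc_no  INTEGER PRIMARY KEY,
    balance REAL NOT NULL
);
-- Each specialized table reuses the account key and adds its own attribute.
CREATE TABLE savings_acc (
    acc_no        INTEGER PRIMARY KEY REFERENCES account(acc_no),
    interest_rate REAL NOT NULL
);
CREATE TABLE current_acc (
    acc_no       INTEGER PRIMARY KEY REFERENCES account(acc_no),
    transactions INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO account VALUES (101, 5000.0)")
conn.execute("INSERT INTO savings_acc VALUES (101, 3.5)")

# A savings account sees the inherited attributes through a join.
savings_row = conn.execute("""
    SELECT a.acc_no, a.balance, s.interest_rate
    FROM savings_acc s
    JOIN account a ON a.acc_no = s.acc_no
""").fetchone()
print(savings_row)
```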

After applying generalization or specialization, the structure of the resulting diagrams is the same.

Difference Between Generalization and Specialization

| GENERALIZATION | SPECIALIZATION |
|---|---|
| Generalization works in a bottom-up approach. | Specialization works in a top-down approach. |
| In Generalization, the size of the schema gets reduced. | In Specialization, the size of the schema gets increased. |
| Generalization is normally applied to a group of entities. | We can apply Specialization to a single entity. |
| Generalization can be defined as a process of creating groupings from various entity sets. | Specialization can be defined as a process of creating subgroupings within an entity set. |
| In the Generalization process, the union of two or more lower-level entity sets is taken to produce a higher-level entity set. | Specialization is the reverse of Generalization. It is a process of taking a subset of a higher-level entity set to form a lower-level entity set. |
| The Generalization process starts with a number of entity sets and creates a high-level entity with the help of some common features. | The Specialization process starts from a single entity set and creates different entity sets by using some different features. |
| In Generalization, the differences between lower entities are ignored and their similarities are used to form a higher entity. | In Specialization, a higher entity is split to form lower entities. |

Conclusion
In database design, generalization and specialization are both crucial ideas for producing effective and well-organized data structures. Generalization reduces duplication and simplifies the schema by combining related entities into a higher-level entity. Specialization, on the other hand, splits a higher-level entity into smaller, more focused entities to improve specificity and facilitate inheritance. Although they work in different directions (top-down for specialization and bottom-up for generalization), the end objective is the same: an adaptable and effective data model that satisfies the organization's requirements.

Enhanced ER Model
Last Updated : 09 Sep, 2025

As data complexity grows, the traditional ER model becomes less effective for database
modeling. Enhanced ER diagrams extend the basic ER model to better represent complex
applications. They support advanced concepts like subclasses, generalization,
specialization, aggregation and categories.

The ER model is the abstract representation of a database structure that defines:

Entities in a database.
Attributes that they have.
Relationships between them.

ER Model

What is an Enhanced ER model?

Enhanced ER Models are high-level models that represent the requirements and
complexities of complex databases. The EER model includes all modeling concepts of the
ER model. In addition, EER includes the following concepts:

Subclasses and Superclasses

Generalization and Specialization
Category or Union type
Attribute and Relationship Inheritance

1. Superclass and Subclass

A superclass is a higher-level entity set that holds the common attributes. A subclass is a lower-level entity set that inherits attributes and relationships from its superclass but also has its own specific attributes or relationships. This supports the concept of inheritance, where a subclass automatically possesses the features of the superclass.

Example: Science is a Super class which has subclasses like Physics, Chemistry,
Biology.

2. Generalization and Specialization

Generalization and Specialization are common relationships added as enhancements to the classical ER model. A subclass (specialized class) inherits from a superclass (generalized class), similar to object-oriented concepts. This is best understood using IS-A relationships like “Technician IS-A Employee” or “Laptop IS-A Computer.” These are tools for organizing and simplifying data by abstracting or specifying entity relationships.

Example: Here we have three sets of employees: Secretary, Technician and Engineer. Employee is the super-class of these three sets, and each sub-class is a subset of the Employee set.

Employee Set

An entity belonging to a sub-class is related to some super-class entity. For instance, emp no 1001 is a secretary and his typing speed is 68. Emp no 1009 is an engineer (sub-class) and her trade is “Electrical”, and so forth.
A sub-class entity “inherits” all attributes of the super-class; for example, employee 1001 will have the attributes eno, name, salary and typing speed.

Enhanced ER Model of Above Example:

Enhanced ER Model

Constraints: There are two types of constraints on the “Sub-class” relationship.

1. Total or Partial Sub-classing:

Total: Every entity in the superclass must be in at least one subclass (e.g., every
employee is either salaried or hourly).
Partial: Some entities may not belong to any subclass (e.g., not all employees are a
secretary, engineer or technician).
Total subclassing means complete coverage, while partial subclassing means incomplete coverage.

2. Overlapped or Disjoint Sub-Classing:

Overlapped: An entity can belong to multiple subclasses.

Disjoint: An entity can belong to only one subclass.
In the given examples, both job-type and salary-based subclassing are disjoint,
meaning no overlap.
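
Both constraints can be checked on sample data with a short Python sketch. The function name classify_specialization is an assumption, and so are employee numbers 1005 and 1010, which are invented here to complete the example (only 1001 and 1009 appear in the text).

```python
def classify_specialization(superclass, subclasses):
    """Return (coverage, overlap) for one data snapshot.

    subclasses maps each subclass name to the set of its members.
    """
    member_sets = [set(members) for members in subclasses.values()]
    covered = set().union(*member_sets)
    coverage = "total" if set(superclass) <= covered else "partial"
    # If no entity is counted twice, the subclasses do not overlap.
    memberships = sum(len(members) for members in member_sets)
    overlap = "disjoint" if memberships == len(covered) else "overlapped"
    return coverage, overlap


employees = {1001, 1005, 1009, 1010}

by_job = {"Secretary": {1001}, "Technician": {1005}, "Engineer": {1009}}
by_pay = {"Salaried": {1001, 1009, 1010}, "Hourly": {1005}}

job_result = classify_specialization(employees, by_job)   # 1010 has no job subclass
pay_result = classify_specialization(employees, by_pay)
print(job_result, pay_result)
```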

3. Category or Union Type

A Category (or Union Type) is a subclass that is derived from two or more superclasses that may not be related. It allows the model to represent an entity that can be a member of more than one entity set.

Example: The set of Library Members is the UNION of Faculty, Student and Staff. A union relationship indicates either type; for example, a library member is either Faculty or Staff or Student. Below are two examples that show how UNION can be depicted in an ERD: Vehicle Owner is the UNION of Person and Company, and RTO Registered Vehicle is the UNION of Car and Truck.

Enhanced ER Model with Union

Sub-class and UNION are easy to confuse. In the figure above, Vehicle is a super-class of Car and Truck, which normally implies that Car and Truck inherit the attributes of Vehicle. In the RTO Registered Vehicle case, however, Car and Truck form a union without inheriting from Vehicle; each has its own independent attributes.
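
The Library Members example can be sketched with plain Python sets. The member identifiers and the helper name source_of are assumptions used only to show that each category member must come from one of the superclasses.

```python
faculty = {"F1", "F2"}
student = {"S1", "S2", "S3"}
staff = {"T1"}
superclasses = {"Faculty": faculty, "Student": student, "Staff": staff}


def source_of(member):
    """Names of the superclasses that a category member comes from."""
    return [name for name, members in superclasses.items() if member in members]


# The category is a subset of the union of its superclasses.
library_members = {"F1", "S2", "T1"}
all_valid = all(source_of(m) for m in library_members)
s2_source = source_of("S2")
print(all_valid, s2_source)
```

Unlike an ordinary subclass, a library member here takes its attributes from only the one superclass it comes from, not from all three.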

4. Attribute and Relationship Inheritance

In the EER model, subclasses inherit all attributes and relationships of their superclasses. This supports reusability and data consistency, as common attributes don’t need to be redefined. An entity type can also be a sub-class of multiple entity types, in which case it has multiple super-classes. In such multiple inheritance, the attributes of the sub-class are the union of the attributes of all its super-classes.

Example: If Employee has attributes like Name and ID, all subclasses like Manager or
Engineer will automatically have these, in addition to their own unique attributes (like
Department or Project).
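
The multiple-inheritance case can be illustrated with Python dataclasses. The TeachingAssistant entity and all field names below are hypothetical; the point is only that the sub-class ends up with the union of the attributes of both super-classes plus its own.

```python
from dataclasses import dataclass, fields


@dataclass
class Employee:
    emp_id: int
    salary: float


@dataclass
class Student:
    roll_no: int
    course: str


@dataclass
class TeachingAssistant(Student, Employee):
    """Sub-class with two super-classes (hypothetical example)."""
    hours_per_week: int


ta = TeachingAssistant(emp_id=7, salary=1200.0, roll_no=42,
                       course="DBMS", hours_per_week=10)

employee_attrs = {f.name for f in fields(Employee)}
student_attrs = {f.name for f in fields(Student)}
ta_attrs = {f.name for f in fields(TeachingAssistant)}
print(sorted(ta_attrs))
```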

Key Features of the EER Model

Subtypes and Supertypes: The EER model allows for the creation of subtypes and supertypes, where a supertype represents general attributes and subtypes represent specialized entities.
Generalization and Specialization: Generalization is the process of identifying common attributes and combining them into a supertype, while Specialization is the process of defining subtypes with unique attributes from a supertype.
Inheritance: Inheritance is a mechanism that allows subtypes to inherit attributes and
relationships from their supertype.
Constraints: The EER model allows for the specification of constraints that must be
satisfied by entities and relationships.
Subclasses and Superclasses: EER model allows for the creation of a hierarchical
structure of entities.
Attribute Inheritance: EER model allows attributes to be inherited from a superclass to
its subclasses.
Union Types: EER model allows for the creation of a union type, which is a combination
of two or more entity types.
Aggregation: EER model allows for the creation of an aggregate entity that represents a
group of entities as a single entity.
Multi-valued Attributes: EER model allows an attribute to have multiple values for a
single entity instance.
Relationships with Attributes: EER model allows relationships between entities to
have attributes. These attributes can describe the nature of the relationship or provide
additional information about the relationship.

Overall, these features make the EER model more expressive and powerful than the
traditional ER model, allowing a more accurate representation of complex relationships
between entities.


ROHINI COLLEGE OF ENGINEERING & TECHNOLOGY

THE ENHANCED ER MODEL

As the complexity of data increased in the late 1980s, it became more and more difficult to use the traditional ER Model for database modelling. Hence some improvements or enhancements were made to the existing ER Model so that it could handle complex applications better.
EER is a high-level data model that incorporates the extensions to the original ER model. It is a diagrammatic technique for displaying the following concepts:
• Sub Class and Super Class
• Specialization and Generalization
• Union or Category
• Aggregation
These concepts are used in an EER schema, and the resulting schema diagrams are called EER diagrams.
Features of EER Model
• EER creates a design that is more accurate to database schemas.
• It reflects the data properties and constraints more precisely.
• It includes all modeling concepts of the ER model.
• Its diagrammatic technique helps in displaying the EER schema.
• It includes the concepts of specialization and generalization.
• It can represent a collection of objects that is the union of objects of different entity types.
A. Sub Class and Super Class
• The sub class and super class relationship leads to the concept of inheritance.
• The relationship between sub class and super class is denoted with a symbol.
1. Super Class
• A super class is an entity type that has a relationship with one or more subtypes.
• An entity cannot exist in the database merely by being a member of a super class.
For example: the Shape super class has the sub groups Square, Circle and Triangle.
2. Sub Class
• A sub class is a group of entities with unique attributes.

CS3492-DATABASE MANAGEMENT SYSTEMS

• A sub class inherits properties and attributes from its super class.
For example: Square, Circle and Triangle are the sub classes of the Shape super class.

Hence, as part of the Enhanced ER Model, along with other improvements, three new concepts were added to the existing ER Model:
1. Generalization
2. Specialization
3. Aggregation
1. Generalization
Generalization is a bottom-up approach in which two or more lower level entities combine to form a higher level entity. The higher level entity can in turn combine with other lower level entities to form a still higher level entity.
It resembles the superclass and subclass system; the only difference is the approach, which is bottom-up. Entities are combined to form a more generalised entity; in other words, sub-classes are combined to form a super-class.

For example, the Saving and Current account entities can be generalised into an entity named Account, which covers both.

2. Specialization

• Specialization is a process in which a group of entities is divided into sub groups based on their characteristics.
• It is a top-down approach, in which one higher level entity can be broken down into two or more lower level entities.
• It maximizes the difference between the members of an entity by identifying the unique characteristics or attributes of each member.
• It defines one or more sub classes for the super class and also forms the superclass/subclass relationship.

3. Aggregation

• Aggregation is a process that represents a relationship between a whole object and its component parts.
• It abstracts a relationship between objects and views the relationship as an object.
• It is a process in which a relationship between two entities is treated as a single entity.

In the above example, the relationship between College and Course acts as an entity in a relationship with Student. Likewise, in the diagram above, the relationship between Center and Course, taken together, acts as an entity that is in a relationship with another entity, Visitor. In the real world, if a Visitor or a Student visits a Coaching Center, he/she will never enquire about the center only or just about the course; rather, he/she will enquire about both.
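
A possible relational sketch of this aggregation, using Python's sqlite3 module, is shown below. The relationship names offers and enquiry, the column names and the sample rows are assumptions, since the original diagram is not reproduced here. The key idea is that enquiry refers to a (center, course) pair as a whole rather than to either entity alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE center  (center_id  INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE visitor (visitor_id INTEGER PRIMARY KEY, name  TEXT);

-- The Center-Course relationship, later treated as one unit.
CREATE TABLE offers (
    center_id INTEGER REFERENCES center(center_id),
    course_id INTEGER REFERENCES course(course_id),
    PRIMARY KEY (center_id, course_id)
);

-- A visitor enquires about a (center, course) pair, not about each alone.
CREATE TABLE enquiry (
    visitor_id INTEGER REFERENCES visitor(visitor_id),
    center_id  INTEGER,
    course_id  INTEGER,
    PRIMARY KEY (visitor_id, center_id, course_id),
    FOREIGN KEY (center_id, course_id) REFERENCES offers(center_id, course_id)
);

INSERT INTO center  VALUES (1, 'Downtown');
INSERT INTO course  VALUES (7, 'DBMS');
INSERT INTO course  VALUES (8, 'Networks');
INSERT INTO visitor VALUES (100, 'Kiran');
INSERT INTO offers  VALUES (1, 7);
""")

conn.execute("INSERT INTO enquiry VALUES (100, 1, 7)")   # offered pair: accepted

# Course 8 is not offered at center 1, so this pair does not exist.
try:
    conn.execute("INSERT INTO enquiry VALUES (100, 1, 8)")
    unoffered_rejected = False
except sqlite3.IntegrityError:
    unoffered_rejected = True

enquiry_count = conn.execute("SELECT COUNT(*) FROM enquiry").fetchone()[0]
print(unoffered_rejected, enquiry_count)
```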

Category or Union
• A category represents a single superclass/subclass relationship with more than one super class.
• Its participation can be total or partial.
For example, in car booking, a car owner can be a person, a bank (which holds a possession on a car) or a company. The category (sub class) Owner is a subset of the union of the three super classes Company, Bank and Person. A category member must exist in at least one of its super classes.

Generalization and Specialization

These are very common relationships found among real entities. However, this kind of relationship was added later as an enhanced extension to the classical ER model.
Specialized classes are often called subclasses, while a generalized class is called a superclass, a naming probably inspired by object-oriented programming. A sub-class is best understood by “IS-A analysis”, as in the statements “Technician IS-A Employee” and “Laptop IS-A Computer”.
An entity can be a specialized type/class of another entity. For example, a Technician is a special Employee, and in a university system Faculty is a special class of Employee. We call this phenomenon generalization/specialization. In the example here, Employee is a generalized entity class, while Technician and Faculty are specialized classes of Employee.
Example: This is an example instance of “sub-class” relationships. Here we have three sets of employees: Secretary, Technician, and Engineer. Employee is the super-class of these three sets, and each sub-class is a subset of the Employee set.

• An entity belonging to a sub-class is related to some super-class entity. For instance, emp no 1001 is a secretary, and his typing speed is 68. Emp no 1009 is an engineer (sub-class) and her trade is “Electrical”, and so forth.
• A sub-class entity “inherits” all attributes of its super-class; for example, employee 1001 will have the attributes eno, name, salary, and typing speed.
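The inheritance described above can be sketched with two tables and a join. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, the table and column names (employee, secretary, typing_speed) are chosen here, and the employee names and salaries are made-up sample data; only employee numbers 1001 and 1009 and the typing speed of 68 come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Super-class: attributes shared by every employee.
CREATE TABLE employee (eno INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
-- Sub-class: only the extra attribute, keyed by the super-class key.
CREATE TABLE secretary (
    eno INTEGER PRIMARY KEY REFERENCES employee(eno),
    typing_speed INTEGER
);
-- Names and salaries are hypothetical sample values.
INSERT INTO employee VALUES (1001, 'Asha', 30000), (1009, 'Ravi', 55000);
INSERT INTO secretary VALUES (1001, 68);
""")

# A secretary "inherits" eno, name and salary by joining on the shared key.
secretary_row = conn.execute("""
    SELECT e.eno, e.name, e.salary, s.typing_speed
    FROM employee e JOIN secretary s ON s.eno = e.eno
""").fetchone()
print(secretary_row)
```

Keeping the shared attributes in one place means a change to an employee's name or salary never has to be repeated in a sub-class table.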


SQL UNION Operator

Last Updated : 15 Jul, 2025

The SQL UNION operator is used to combine the result sets of two or more SELECT
queries into a single result set. It is a powerful tool in SQL that helps aggregate data from
multiple tables, especially when the tables have similar structures.

In this guide, we'll explore the SQL UNION operator, how it differs from UNION ALL, and
provide detailed examples to demonstrate its usage.

What is SQL UNION Operator?

The SQL UNION operator combines the results of two or more SELECT statements into
one result set. By default, UNION removes duplicate rows, ensuring that the result set
contains only distinct records.

There are some rules for using the SQL UNION operator.

Rules for SQL UNION

Each SELECT statement within UNION must return the same number of columns.
The corresponding columns must have compatible data types.
The columns in each SELECT statement must be in the same order.

Syntax:

The syntax of the SQL UNION operator is:

SELECT columnnames FROM table1
UNION
SELECT columnnames FROM table2;

The UNION operator returns unique rows by default. To keep duplicate rows, use UNION ALL.

Note: The difference between SQL UNION and UNION ALL is that UNION removes duplicate rows from the result set, while UNION ALL retains all rows, including duplicates.

Examples of SQL UNION

Let's look at an example of UNION operator in SQL to understand it better.

Let's create two tables, "Emp1" and "Emp2".

Emp1 Table

Write the following SQL query to create the Emp1 table.

CREATE TABLE Emp1 (
    EmpID INT PRIMARY KEY,
    Name VARCHAR(50),
    Country VARCHAR(50),
    Age INT,
    mob INT
);
-- Insert some sample data into the Emp1 table
INSERT INTO Emp1 (EmpID, Name, Country, Age, mob)
VALUES (1, 'Shubham', 'India', 23, 738479734),
       (2, 'Aman', 'Australia', 21, 436789555),
       (3, 'Naveen', 'Sri lanka', 24, 34873847),
       (4, 'Aditya', 'Austria', 21, 328440934),
       (5, 'Nishant', 'Spain', 22, 73248679);

SELECT * FROM Emp1;

Output:

Emp1 Table

Emp2 Table

Write the following SQL query to create the Emp2 table.

CREATE TABLE Emp2 (
    EmpID INT PRIMARY KEY,
    Name VARCHAR(50),
    Country VARCHAR(50),
    Age INT,
    mob INT
);
-- Insert some sample data into the Emp2 table
INSERT INTO Emp2 (EmpID, Name, Country, Age, mob)
VALUES (1, 'Tommy', 'England', 23, 738985734),
       (2, 'Allen', 'France', 21, 43678055),
       (3, 'Nancy', 'India', 24, 34873847),
       (4, 'Adi', 'Ireland', 21, 320254934),
       (5, 'Sandy', 'Spain', 22, 70248679);

SELECT * FROM Emp2;

Output:

Emp2 Table

Example 1: SQL UNION Operator

In this example, we will find the countries (only unique values) from both the "Emp1" and the "Emp2" tables:

Query:

SELECT Country FROM Emp1
UNION
SELECT Country FROM Emp2
ORDER BY Country;

Output:

output

Example 2: SQL UNION ALL

In the example below, we will find the countries (including duplicate values) from both the "Emp1" and the "Emp2" tables:

Query:

SELECT Country FROM Emp1
UNION ALL
SELECT Country FROM Emp2
ORDER BY Country;

Output:

Country
Australia
Austria
England
France
India
India
Ireland
Spain
Spain
Sri lanka
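To see the difference between the two operators on the sample data, here is a small sketch using Python's built-in sqlite3 module as a stand-in for whichever database you use. The Age and mob columns are left out for brevity; that trimming is a choice made here, not part of the original tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp1 (EmpID INTEGER PRIMARY KEY, Name TEXT, Country TEXT);
CREATE TABLE Emp2 (EmpID INTEGER PRIMARY KEY, Name TEXT, Country TEXT);
INSERT INTO Emp1 VALUES (1, 'Shubham', 'India'), (2, 'Aman', 'Australia'),
    (3, 'Naveen', 'Sri lanka'), (4, 'Aditya', 'Austria'), (5, 'Nishant', 'Spain');
INSERT INTO Emp2 VALUES (1, 'Tommy', 'England'), (2, 'Allen', 'France'),
    (3, 'Nancy', 'India'), (4, 'Adi', 'Ireland'), (5, 'Sandy', 'Spain');
""")

# UNION removes duplicate rows from the combined result.
union_rows = [row[0] for row in conn.execute(
    "SELECT Country FROM Emp1 UNION SELECT Country FROM Emp2 ORDER BY Country")]

# UNION ALL keeps every row, so countries present in both tables appear twice.
union_all_rows = [row[0] for row in conn.execute(
    "SELECT Country FROM Emp1 UNION ALL SELECT Country FROM Emp2 ORDER BY Country")]

print(len(union_rows), union_rows)
print(len(union_all_rows), union_all_rows)
```

India and Spain occur in both tables, so the UNION ALL result has ten rows while the UNION result has eight.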

SQL UNION ALL With WHERE

You can use the WHERE clause with UNION ALL in SQL. The WHERE clause filters records and is added to each SELECT statement that needs it.

Example : SQL UNION ALL with WHERE

The following SQL statement returns the Country and Name of the matching rows (including duplicates) from both the "Emp1" and the "Emp2" tables:

Query:

SELECT Country, Name FROM Emp1
WHERE Name='Aditya'
UNION ALL
SELECT Country, Name FROM Emp2
WHERE Country='Ireland'
ORDER BY Country;

Output:

output
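As a rough check of what this query returns, the sketch below runs it with Python's built-in sqlite3 module. Only the rows and columns needed for the filter are loaded, which is a simplification made here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp1 (EmpID INTEGER PRIMARY KEY, Name TEXT, Country TEXT);
CREATE TABLE Emp2 (EmpID INTEGER PRIMARY KEY, Name TEXT, Country TEXT);
INSERT INTO Emp1 VALUES (4, 'Aditya', 'Austria'), (5, 'Nishant', 'Spain');
INSERT INTO Emp2 VALUES (4, 'Adi', 'Ireland'), (5, 'Sandy', 'Spain');
""")

# Each SELECT applies its own WHERE filter before the results are combined;
# ORDER BY then sorts the combined result.
filtered = conn.execute("""
    SELECT Country, Name FROM Emp1 WHERE Name = 'Aditya'
    UNION ALL
    SELECT Country, Name FROM Emp2 WHERE Country = 'Ireland'
    ORDER BY Country
""").fetchall()
print(filtered)
```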

Important Points About SQL UNION Operator

The SQL UNION operator combines the result sets of two or more SELECT queries.
UNION returns unique rows, eliminating duplicate entries from the result set.
UNION ALL includes all rows, including duplicate rows.
The columns of each SELECT statement must be in the same order and have compatible data types.
UNION is useful for aggregating data from multiple tables or applying different
filters to data from the same table.

Conclusion
The SQL UNION operator is a powerful tool for combining multiple SELECT statements into
one result set. Whether you need to eliminate duplicates or include them, UNION and
UNION ALL provide flexible options for aggregating data from multiple tables.
Understanding how and when to use these operators will make your SQL queries more
efficient and effective for data retrieval and analysis.


Generalization, Specialization and Aggregation in ER Model

Last Updated : 16 Jul, 2025

Using the ER model for larger data sets adds considerable complexity to database design. To reduce this complexity, Generalization, Specialization and Aggregation were introduced into the ER model. They are data abstraction mechanisms, that is, ways of hiding the details of a set of objects.

Generalization
Generalization is the process of extracting common properties from a set of entities and
creating a generalized entity from it. It is a bottom-up approach in which two or more
entities can be generalized to a higher-level entity if they have some attributes in common.

Generalization

Example: STUDENT and FACULTY can be generalized to a higher-level entity called PERSON, as shown in the diagram below. In this case, common attributes like P_NAME and P_ADD become part of the higher-level entity (PERSON), and specialized attributes like S_FEE become part of a specialized entity (STUDENT).

Generalization is also called a 'bottom-up approach'.

Specialization
In specialization, an entity is divided into sub-entities based on its characteristics. It is a
top-down approach where the higher-level entity is specialized into two or more lower-
level entities.

Specialization

Example: An EMPLOYEE entity in an employee management system can be specialized into DEVELOPER, TESTER, etc., as shown in the figure below. In this case, common attributes like E_NAME, E_SAL, etc. become part of the higher-level entity (EMPLOYEE), and specialized attributes like TES_TYPE become part of a specialized entity (TESTER).

Specialization is also called a "top-down approach".

Inheritance

It is an important feature of generalization and specialization. In specialization, a higher-level entity is divided into lower-level sub-entities that inherit its attributes. In generalization, similar lower-level entities are combined into a higher-level entity that holds the common attributes. In both cases, inheritance allows sub-entities to reuse the properties of the parent entity.

1. Attribute inheritance: Lower-level entities inherit the attributes of higher-level entities. In the diagram, Car is a sub-entity of Vehicle, so Car acquires the attributes of Vehicle. Example: Car acquires the Model attribute of Vehicle.
2. Relationship inheritance: Sub-entities also inherit the relationships of the parent entity.
3. Overriding inheritance: Sub-entities can override or add their own attributes or behaviors that differ from the parent.
4. Participation inheritance: The inheritance of participation constraints from a higher-level entity (superclass) to a lower-level entity (subclass). It ensures that subclasses adhere to the same participation rules in relationships, although attributes and relationships themselves are inherited differently.

Example of Relation

Example: In the diagram, the Vehicle entity has a relationship with the Cycle entity, but Cycle does not automatically acquire that relationship itself. Participation inheritance refers only to the inheritance of participation constraints, not to the actual relationships between entities.

Aggregation
An ER diagram is not capable of representing the relationship between an entity and a
relationship which may be required in some scenarios. In those cases, a relationship with its
corresponding entities is aggregated into a higher-level entity. Aggregation is an
abstraction through which we can represent relationships as higher-level entity sets.

Aggregation

Example: An employee working on a project may require some machinery. So a REQUIRE relationship is needed between the relationship WORKS_FOR and the entity MACHINERY. Using aggregation, the WORKS_FOR relationship with its entities EMPLOYEE and PROJECT is aggregated into a single entity, and the relationship REQUIRE is created between the aggregated entity and MACHINERY.

Aggregation is also called a "higher-order relationship".

Representing Aggregation Via Schema

To represent aggregation in a relational schema, follow these steps:

1. Create Schema for the Aggregated Relationship

This will be treated like an entity set.

It includes the primary keys of the participating entities in the base relationship.
It also includes any descriptive attributes of the base relationship.

2. Create Schema for the Higher-Level Relationship (Aggregation)

This schema includes:

The primary key of the aggregated relationship schema.
The primary key of the associated entity it relates to.
Any additional descriptive attributes of this higher-level relationship.
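The two steps above can be sketched for the EMPLOYEE / PROJECT / MACHINERY example as follows. This is a minimal illustration using Python's built-in sqlite3 module; the column names (e_id, p_id, m_id) and the sample rows are assumptions made here, not taken from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE employee  (e_id INTEGER PRIMARY KEY, e_name TEXT);
CREATE TABLE project   (p_id INTEGER PRIMARY KEY, p_name TEXT);
CREATE TABLE machinery (m_id INTEGER PRIMARY KEY, m_name TEXT);

-- Step 1: schema for the aggregated relationship WORKS_FOR,
-- keyed by the primary keys of its participating entities.
CREATE TABLE works_for (
    e_id INTEGER REFERENCES employee(e_id),
    p_id INTEGER REFERENCES project(p_id),
    PRIMARY KEY (e_id, p_id)
);

-- Step 2: schema for the higher-level relationship REQUIRE, which holds
-- the key of the aggregated relationship plus the key of MACHINERY.
CREATE TABLE require (
    e_id INTEGER,
    p_id INTEGER,
    m_id INTEGER REFERENCES machinery(m_id),
    PRIMARY KEY (e_id, p_id, m_id),
    FOREIGN KEY (e_id, p_id) REFERENCES works_for(e_id, p_id)
);

-- Hypothetical sample rows.
INSERT INTO employee  VALUES (1, 'Meera'), (2, 'John');
INSERT INTO project   VALUES (10, 'Bridge');
INSERT INTO machinery VALUES (100, 'Crane');
INSERT INTO works_for VALUES (1, 10);
INSERT INTO require   VALUES (1, 10, 100);
""")

# Employee 2 does not work on project 10, so this pair cannot require machinery.
try:
    conn.execute("INSERT INTO require VALUES (2, 10, 100)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The composite foreign key is what ties REQUIRE to an existing WORKS_FOR pair rather than to an employee or a project on its own.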


Chapter 9 1
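Figure 3.2 ER schema diagram for the company database.

[Diagram. Entity types: EMPLOYEE (Name composed of Fname, Minit, Lname; Ssn; Bdate; Address; Sex; Salary), DEPARTMENT (Name, Number, Locations, NumberOfEmployees), PROJECT (Name, Number, Location), and DEPENDENT (Name, Sex, BirthDate, Relationship). Relationship types: WORKS_FOR (EMPLOYEE N : 1 DEPARTMENT), MANAGES (1:1, with StartDate), CONTROLS (DEPARTMENT 1 : N PROJECT), WORKS_ON (EMPLOYEE and PROJECT, with Hours), SUPERVISION (EMPLOYEE as supervisor 1 : N supervisee), DEPENDENTS_OF (EMPLOYEE 1 : N DEPENDENT).]
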
© Addison Wesley Longman, Inc. 2000, Elmasri/Navathe, Fundamentals of Database Systems, Third Edition
Step 1: For each regular entity type E
• Create a relation R that includes all the
simple attributes of E.
• Include all the simple component attributes
of composite attributes.
• Choose one of the key attributes of E as
primary key for R.
• If the chosen key of E is composite, the set
of simple attributes that form it will together
form the primary key of R.
Figure 7.5 Schema diagram for the COMPANY relational database schema; the primary key of each relation is given in brackets.

EMPLOYEE(FNAME, MINIT, LNAME, SSN, BDATE, ADDRESS, SEX, SALARY, SUPERSSN, DNO) [primary key: SSN]
DEPARTMENT(DNAME, DNUMBER, MGRSSN, MGRSTARTDATE) [primary key: DNUMBER]
DEPT_LOCATIONS(DNUMBER, DLOCATION) [primary key: DNUMBER, DLOCATION]
PROJECT(PNAME, PNUMBER, PLOCATION, DNUM) [primary key: PNUMBER]
WORKS_ON(ESSN, PNO, HOURS) [primary key: ESSN, PNO]
DEPENDENT(ESSN, DEPENDENT_NAME, SEX, BDATE, RELATIONSHIP) [primary key: ESSN, DEPENDENT_NAME]

Step 2: For each weak entity type W with
owner entity type E
• Create a relation R, and include all simple
attributes and simple components of
composite attributes of W as attributes of R.
• In addition, include as foreign key attributes
of R the primary key attribute(s) of the
relation(s) that correspond to the owner
entity type(s).
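As an illustration of Step 2, the sketch below maps the weak entity type DEPENDENT with owner EMPLOYEE, using Python's built-in sqlite3 module. The ON DELETE CASCADE clause, the trimmed column lists, and the sample rows are assumptions added here to show the existence dependency; the slide itself only prescribes the foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce foreign keys
conn.executescript("""
CREATE TABLE EMPLOYEE (SSN TEXT PRIMARY KEY, LNAME TEXT);

-- The owner's key (ESSN) plus the partial key (DEPENDENT_NAME) form the primary key.
CREATE TABLE DEPENDENT (
    ESSN TEXT NOT NULL REFERENCES EMPLOYEE(SSN) ON DELETE CASCADE,
    DEPENDENT_NAME TEXT NOT NULL,
    RELATIONSHIP TEXT,
    PRIMARY KEY (ESSN, DEPENDENT_NAME)
);

-- Hypothetical sample rows: two employees each have a dependent named Ram.
INSERT INTO EMPLOYEE VALUES ('111', 'Rao'), ('222', 'Iyer');
INSERT INTO DEPENDENT VALUES ('111', 'Ram', 'Son'), ('222', 'Ram', 'Son');
""")

# Deleting the owner removes its dependents, since a weak entity cannot exist alone.
conn.execute("DELETE FROM EMPLOYEE WHERE SSN = '111'")
remaining = conn.execute("SELECT ESSN, DEPENDENT_NAME FROM DEPENDENT").fetchall()
print(remaining)
```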

Step 3: For each binary 1:1 relationship
type R
• Identify the relations S and T that correspond to
the entity types participating in R. Choose one of
the relations, say S, and include as foreign key in
S the primary key of T.
• It is better to choose an entity type with total
participation in R in the role of S.
• Include the simple attributes of the 1:1
relationship type R as attributes of S.
• If both participations are total, we may merge the
two entity types and the relationship into a single
relation.
Step 4: For each regular binary 1:N
relationship type R
• Identify the relation S that represents the
participating entity type at the N-side of the
relationship type.
• Include as foreign key in S the primary key
of the relation T that represents the other
entity type participating in R.
• Include any simple attributes of the 1:N
relationship type as attributes of S.

Step 5: For each binary M:N relationship
type R
• Create a new relation S to represent R.
• Include as foreign key attributes in S the
primary keys of the relations that represent
the participating entity types; their
combination will form the primary key of S.
• Also, include any simple attributes of the
M:N relationship type as attributes of S.
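Step 5 can be sketched with the WORKS_ON relation of the COMPANY schema, using Python's built-in sqlite3 module. The trimmed EMPLOYEE and PROJECT tables and the sample rows are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE (SSN TEXT PRIMARY KEY);
CREATE TABLE PROJECT (PNUMBER INTEGER PRIMARY KEY);

-- New relation S for the M:N relationship: both foreign keys together form
-- the primary key, and HOURS is the relationship's own attribute.
CREATE TABLE WORKS_ON (
    ESSN TEXT REFERENCES EMPLOYEE(SSN),
    PNO INTEGER REFERENCES PROJECT(PNUMBER),
    HOURS REAL,
    PRIMARY KEY (ESSN, PNO)
);

-- Hypothetical sample rows.
INSERT INTO EMPLOYEE VALUES ('111'), ('222');
INSERT INTO PROJECT VALUES (1), (2);
INSERT INTO WORKS_ON VALUES ('111', 1, 32.5), ('111', 2, 7.5), ('222', 1, 20.0);
""")

# The same (employee, project) pair cannot be recorded twice.
try:
    conn.execute("INSERT INTO WORKS_ON VALUES ('111', 1, 5.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

hours_per_employee = dict(conn.execute(
    "SELECT ESSN, SUM(HOURS) FROM WORKS_ON GROUP BY ESSN"))
print(duplicate_rejected, hours_per_employee)
```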

Step 6: For each multi-valued attribute A
• Create a new relation R that includes an
attribute corresponding to A plus the
primary key attribute K (as a foreign key in
R) of the relation that represents the entity
type or relationship type that has A as an
attribute.
• The primary key of R is the combination of
A and K. If a multi-valued attribute is
composite, we include its components.
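For Step 6, the multi-valued attribute Locations of DEPARTMENT becomes the relation DEPT_LOCATIONS. The sketch below uses Python's built-in sqlite3 module; the department and location values are illustrative sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENT (DNUMBER INTEGER PRIMARY KEY, DNAME TEXT);

-- One row per (department, location) value; K = DNUMBER, A = DLOCATION,
-- and the primary key is the combination of the two.
CREATE TABLE DEPT_LOCATIONS (
    DNUMBER INTEGER REFERENCES DEPARTMENT(DNUMBER),
    DLOCATION TEXT,
    PRIMARY KEY (DNUMBER, DLOCATION)
);

-- Illustrative sample rows.
INSERT INTO DEPARTMENT VALUES (5, 'Research');
INSERT INTO DEPT_LOCATIONS VALUES (5, 'Bellaire'), (5, 'Houston'), (5, 'Sugarland');
""")

locations = [row[0] for row in conn.execute(
    "SELECT DLOCATION FROM DEPT_LOCATIONS WHERE DNUMBER = 5 ORDER BY DLOCATION")]
print(locations)
```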

Step 7: For each n-ary relationship type R,
n>2
• Create a new relation S to represent R.
• Include as foreign key attributes in S the
primary keys of the relations that represent
the participating entity types.
• Also include any simple attributes of the n-ary
relationship type as attributes of S.
• The primary key for S is usually a
combination of all the foreign keys that
reference the relations representing the
participating entity types.
TERNARY RELATIONSHIPS
Figure 9.1 Mapping the n-ary relationship type SUPPLY from Figure 4.13(a).

SUPPLIER(SNAME, ...) [primary key: SNAME]
PROJECT(PROJNAME, ...) [primary key: PROJNAME]
PART(PARTNO, ...) [primary key: PARTNO]
SUPPLY(SNAME, PROJNAME, PARTNO, QUANTITY)

• However, if the participation constraint
(min,max) of one of the entity types E
participating in R has max = 1, then the
primary key of S can be the single foreign
key attribute that references the relation E’
corresponding to E.
• This is because, in this case, each entity e
in E will participate in at most one
relationship instance of R and hence can
uniquely identify that relationship instance.
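Step 7 can be sketched with the SUPPLY relation of Figure 9.1, using Python's built-in sqlite3 module. The sketch takes the usual case where the primary key is the combination of all three foreign keys; the supplier, project and part values are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SUPPLIER (SNAME TEXT PRIMARY KEY);
CREATE TABLE PROJECT (PROJNAME TEXT PRIMARY KEY);
CREATE TABLE PART (PARTNO INTEGER PRIMARY KEY);

-- Relation S for the ternary relationship: three foreign keys form the key,
-- and QUANTITY is the relationship's own attribute.
CREATE TABLE SUPPLY (
    SNAME TEXT REFERENCES SUPPLIER(SNAME),
    PROJNAME TEXT REFERENCES PROJECT(PROJNAME),
    PARTNO INTEGER REFERENCES PART(PARTNO),
    QUANTITY INTEGER,
    PRIMARY KEY (SNAME, PROJNAME, PARTNO)
);

-- Hypothetical sample rows: the same supplier and part for two projects.
INSERT INTO SUPPLIER VALUES ('Acme');
INSERT INTO PROJECT VALUES ('Bridge'), ('Tower');
INSERT INTO PART VALUES (10);
INSERT INTO SUPPLY VALUES ('Acme', 'Bridge', 10, 500), ('Acme', 'Tower', 10, 200);
""")

# Repeating an existing (supplier, project, part) triple violates the key.
try:
    conn.execute("INSERT INTO SUPPLY VALUES ('Acme', 'Bridge', 10, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

supply_count = conn.execute("SELECT COUNT(*) FROM SUPPLY").fetchone()[0]
print(duplicate_rejected, supply_count)
```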

Step 8: To convert each super-class/sub-class relationship into a relational schema, use one of the four options below.

Let C be the super-class, K its primary key, and A1, A2, …, An its remaining attributes, and let S1, S2, …, Sm be the sub-classes.

Option 8A (multiple relation option):

• Create a relation L for C with attributes Attrs(L) = {K, A1, A2, …, An} and PK(L) = K.
• Create a relation Li for each subclass Si, 1 ≤ i ≤ m, with the attributes Attrs(Li) = {K} ∪ {attributes of Si} and PK(Li) = K.
• This option works for any constraints: disjoint or overlapping; total or partial.
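A small sketch of Option 8A, using Python's built-in sqlite3 module and two of the subclasses from the EMPLOYEE example. The sample rows are assumptions chosen to include an employee in no subclass and an employee in two subclasses, which is what "any constraints" allows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Relation L for the superclass C.
CREATE TABLE EMPLOYEE (SSN TEXT PRIMARY KEY, LName TEXT);
-- One relation Li per subclass, holding K plus the subclass attributes.
CREATE TABLE SECRETARY (SSN TEXT PRIMARY KEY REFERENCES EMPLOYEE(SSN), TypingSpeed INTEGER);
CREATE TABLE ENGINEER (SSN TEXT PRIMARY KEY REFERENCES EMPLOYEE(SSN), EngType TEXT);

-- Hypothetical rows: '2' is in both subclasses (overlapping), '3' is in none (partial).
INSERT INTO EMPLOYEE VALUES ('1', 'Ali'), ('2', 'Bose'), ('3', 'Chen');
INSERT INTO SECRETARY VALUES ('1', 68), ('2', 55);
INSERT INTO ENGINEER VALUES ('2', 'Electrical');
""")

# Outer joins rebuild the full picture without losing employee '3'.
rows = conn.execute("""
    SELECT e.SSN, s.TypingSpeed, g.EngType
    FROM EMPLOYEE e
    LEFT JOIN SECRETARY s ON s.SSN = e.SSN
    LEFT JOIN ENGINEER g ON g.SSN = e.SSN
    ORDER BY e.SSN
""").fetchall()
print(rows)
```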

Option 8B (multiple relation option):

• Create a relation Li for each subclass Si, 1 ≤ i ≤ m, with Attrs(Li) = {attributes of Si} ∪ {K, A1, A2, …, An} and PK(Li) = K.
• This option works well only for disjoint and total constraints.
• If not disjoint, the inherited attributes are stored redundantly.
• If not total, an entity not belonging to any sub-class is lost.

Figure 9.2 Options for mapping specializations (or generalizations) to relations.
(a) Mapping the EER schema of Figure 4.4 to relations by using Option A. (b) Mapping the EER schema of Figure 4.3(b) into relations by using Option B. (c) Mapping the EER schema of Figure 4.4 by using Option C, with JobType playing the role of type attribute. (d) Mapping the EER schema of Figure 4.5 by using Option D, with two Boolean type fields Mflag and Pflag.

(a) EMPLOYEE(SSN, FName, MInit, LName, BirthDate, Address, JobType)
    SECRETARY(SSN, TypingSpeed)
    TECHNICIAN(SSN, TGrade)
    ENGINEER(SSN, EngType)

(b) CAR(VehicleId, LicensePlateNo, Price, MaxSpeed, NoOfPassengers)
    TRUCK(VehicleId, LicensePlateNo, Price, NoOfAxles, Tonnage)

(d) PART(PartNo, Description, MFlag, DrawingNo, ManufactureDate, BatchNo, PFlag, SupplierName, ListPrice)

Option 8C (Single Relation Option)
• Create a single relation L with attributes
Attrs(L) = {K, A1, …, An} ∪ {attributes of S1} ∪ … ∪ {attributes of Sm} ∪ {T}
and PK(L) = K.
• This option is for a specialization whose subclasses are DISJOINT. T is a type attribute that indicates the subclass to which each tuple belongs, if any.
• Not recommended if many specific attributes are defined in the subclasses, because it will result in many null values.
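Option 8C can be sketched as one wide table, here with Python's built-in sqlite3 module and JobType as the type attribute T. The CHECK constraint and the sample rows are assumptions added for illustration; the query at the end simply counts the null values that this layout produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Single relation L: superclass attributes, all subclass attributes, and T.
CREATE TABLE EMPLOYEE (
    SSN TEXT PRIMARY KEY,
    LName TEXT,
    JobType TEXT CHECK (JobType IN ('Secretary', 'Technician', 'Engineer')),
    TypingSpeed INTEGER,   -- meaningful only for secretaries
    TGrade INTEGER,        -- meaningful only for technicians
    EngType TEXT           -- meaningful only for engineers
);

-- Hypothetical rows; employee '3' belongs to no subclass, so JobType is NULL.
INSERT INTO EMPLOYEE VALUES
    ('1', 'Ali', 'Secretary', 68, NULL, NULL),
    ('2', 'Bose', 'Engineer', NULL, NULL, 'Electrical'),
    ('3', 'Chen', NULL, NULL, NULL, NULL);
""")

engineers = conn.execute(
    "SELECT SSN, EngType FROM EMPLOYEE WHERE JobType = 'Engineer'").fetchall()

# Count NULLs across the three subclass-specific columns.
null_cells = conn.execute("""
    SELECT SUM((TypingSpeed IS NULL) + (TGrade IS NULL) + (EngType IS NULL))
    FROM EMPLOYEE
""").fetchone()[0]
print(engineers, null_cells)
```

Seven of the nine subclass-specific cells are NULL even in this tiny example, which is the cost the slide warns about.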

Option 8D (Single Relation Option)
• Create a single relation schema L with attributes
Attrs(L) = {K, A1, …, An} ∪ {attributes of S1} ∪ … ∪ {attributes of Sm} ∪ {T1, …, Tm}
and PK(L) = K.
• This option is for a specialization whose subclasses are overlapping. Each Ti, 1 ≤ i ≤ m, is a Boolean attribute indicating whether a tuple belongs to subclass Si.
• This option could be used for disjoint subclasses too.
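Option 8D can be sketched with the PART example and its two flags, using Python's built-in sqlite3 module. Only some of the PART attributes are kept, the flags are stored as 0/1 integers because SQLite has no separate Boolean type, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PART (
    PartNo INTEGER PRIMARY KEY,
    Description TEXT,
    MFlag INTEGER NOT NULL CHECK (MFlag IN (0, 1)),  -- 1 if a manufactured part
    DrawingNo TEXT,
    PFlag INTEGER NOT NULL CHECK (PFlag IN (0, 1)),  -- 1 if a purchased part
    SupplierName TEXT
);

-- Hypothetical rows; part 1 is both manufactured and purchased (overlapping).
INSERT INTO PART VALUES
    (1, 'Bolt', 1, 'D-17', 1, 'Acme'),
    (2, 'Nut',  0, NULL,   1, 'Acme'),
    (3, 'Gear', 1, 'D-42', 0, NULL);
""")

in_both_subclasses = [row[0] for row in conn.execute(
    "SELECT PartNo FROM PART WHERE MFlag = 1 AND PFlag = 1")]
print(in_both_subclasses)
```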

Figure 9.3 Mapping the EER specialization lattice shown in Figure 4.7 using multiple options.

PERSON(SSN, Name, BirthDate, Sex, Address)
EMPLOYEE(SSN, Salary, EmployeeType, Position, Rank, PercentTime, RAFlag, TAFlag, Project, Course)
ALUMNUS(SSN)
ALUMNUS_DEGREES(SSN, Year, Degree, Major)
STUDENT(SSN, MajorDept, GradFlag, UndergradFlag, DegreeProgram, Class, StudAssistFlag)

Option 8A for
• PERSON/{EMPLOYEE, ALUMNUS, STUDENT}

Option 8C for
• EMPLOYEE/{STAFF, FACULTY, STUDENT_ASSISTANT}

Option 8D for
• STUDENT_ASSISTANT/{RESEARCH_ASSISTANT, TEACHING_ASSISTANT}
• STUDENT/{STUDENT_ASSISTANT}
• STUDENT/{GRADUATE_STUDENT, UNDERGRADUATE_STUDENT}


Mapping from ER Model to Relational Model

Last Updated : 23 Jul, 2025

Converting an Entity-Relationship (ER) diagram to a Relational Model is a crucial step in database design. The ER model represents the conceptual structure of a database, while the Relational Model is a logical representation that can be implemented directly in a Relational Database Management System (RDBMS) like Oracle or MySQL. In this article, we will explore how to convert an ER diagram to a Relational Model for different scenarios, including binary relationships with various cardinalities and participation constraints.

Case 1: Binary Relationship with 1:1 cardinality with total participation of an entity

A person has 0 or 1 passport, and a passport is always owned by exactly 1 person. So this is 1:1 cardinality with total participation from Passport.

First, convert each entity and relationship to a table. The Person table corresponds to the Person entity with key Per-Id. Similarly, the Passport table corresponds to the Passport entity with key Pass-No. The Has table represents the relationship between Person and Passport (which person has which passport), so it takes the attribute Per-Id from Person and Pass-No from Passport.

Person
Per-Id   Other Person Attribute
PR1      -
PR2      -
PR3      -

Has
Per-Id   Pass-No
PR1      PS1
PR2      PS2

Passport
Pass-No  Other Passport Attribute
PS1      -
PS2      -

Table 1

As we can see from Table 1, each Per-Id and Pass-No has only one entry in the Has table, so we can merge all three tables into one with the attributes shown in Table 2. Each Per-Id is unique and not null, so it is the key. Pass-No can't be the key because it can be NULL for some persons.

Per-Id (PK)   Other Person Attribute   Pass-No   Other Passport Attribute

Table 2
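The merged table of Case 1 might look like the following sketch, written with Python's built-in sqlite3 module. The table and column names are chosen here, and the UNIQUE constraint on the passport number is an added assumption that keeps one passport from being assigned to two persons; the article itself only fixes Per-Id as the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Person, Has and Passport merged into one table keyed by the person id.
CREATE TABLE person_passport (
    per_id TEXT PRIMARY KEY,
    person_attr TEXT,
    pass_no TEXT UNIQUE,      -- NULL for a person without a passport
    passport_attr TEXT
);
INSERT INTO person_passport VALUES
    ('PR1', '-', 'PS1', '-'),
    ('PR2', '-', 'PS2', '-'),
    ('PR3', '-', NULL, NULL);
""")

without_passport = [row[0] for row in conn.execute(
    "SELECT per_id FROM person_passport WHERE pass_no IS NULL")]

# A second person cannot be given passport PS1.
try:
    conn.execute("INSERT INTO person_passport VALUES ('PR4', '-', 'PS1', '-')")
    shared_passport_rejected = False
except sqlite3.IntegrityError:
    shared_passport_rejected = True
print(without_passport, shared_passport_rejected)
```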

Case 2: Binary Relationship with 1:1 cardinality and partial participation of both
entities

A male marries 0 or 1 female, and vice versa. So this is 1:1 cardinality with partial participation from both entities. First, convert each entity and relationship to a table. The Male table corresponds to the Male entity with key M-Id. Similarly, the Female table corresponds to the Female entity with key F-Id. The Marry table represents the relationship between Male and Female (which male marries which female), so it takes the attribute M-Id from Male and F-Id from Female.

Male
M-Id   Other Male Attribute
M1     -
M2     -
M3     -

Marry
M-Id   F-Id
M1     F2
M2     F1

Female
F-Id   Other Female Attribute
F1     -
F2     -
F3     -

Table 3

As we can see from Table 3, some males and some females do not marry. If we merged the three tables into one, F-Id would be NULL for some M-Id and M-Id would be NULL for some F-Id, so no attribute would always be non-NULL and none could serve as the key. So we can't merge all three tables into one, but we can convert them into two tables. In Table 4, each M-Id who is married has the associated F-Id; for the others, F-Id is NULL. Table 5 holds the information of all females. Primary keys are marked (PK).

M-Id (PK)   Other Male Attribute   F-Id

Table 4

F-Id (PK)   Other Female Attribute

Table 5

Note: A binary relationship with 1:1 cardinality needs 2 tables if both entities participate partially in the relationship. If at least 1 entity has total participation, the number of tables required is 1.
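Tables 4 and 5 can be sketched as below with Python's built-in sqlite3 module. The table and column names are chosen here, and the UNIQUE constraint on the foreign key is an added assumption that stops two males from referring to the same female.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Table 5: all females.
CREATE TABLE female (f_id TEXT PRIMARY KEY, female_attr TEXT);
-- Table 4: all males, with a nullable reference to the spouse.
CREATE TABLE male (
    m_id TEXT PRIMARY KEY,
    male_attr TEXT,
    f_id TEXT UNIQUE REFERENCES female(f_id)
);
INSERT INTO female VALUES ('F1', '-'), ('F2', '-'), ('F3', '-');
INSERT INTO male VALUES ('M1', '-', 'F2'), ('M2', '-', 'F1'), ('M3', '-', NULL);
""")

unmarried_males = [row[0] for row in conn.execute(
    "SELECT m_id FROM male WHERE f_id IS NULL")]

# Females with no matching row in male are unmarried.
unmarried_females = [row[0] for row in conn.execute("""
    SELECT f.f_id FROM female f
    LEFT JOIN male m ON m.f_id = f.f_id
    WHERE m.m_id IS NULL
""")]
print(unmarried_males, unmarried_females)
```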

Case 3: Binary Relationship with N:1 cardinality

In this scenario, every student can enroll in only one elective course, but an elective course can have more than one student. First, convert each entity and relationship to a table. The Student table corresponds to the Student entity with key S-Id. Similarly, the Elective_Course table corresponds to the Elective_Course entity with key E-Id. The Enrolls table represents the relationship between Student and Elective_Course (which student enrolls in which course), so it takes the attribute S-Id from Student and E-Id from Elective_Course.

Student
S-Id   Other Student Attribute
S1     -
S2     -
S3     -
S4     -

Enrolls
S-Id   E-Id
S1     E1
S2     E2
S3     E1
S4     E1

Elective_Course
E-Id   Other Elective Course Attribute
E1     -
E2     -
E3     -

Table 6

As we can see from Table 6, S-Id does not repeat in the Enrolls table, so it can be taken as the key of Enrolls. The Student and Enrolls tables then have the same key, so we can merge them into a single table. The resulting tables are shown in Table 7 and Table 8. Primary keys are marked (PK).

S-Id (PK)   Other Student Attribute   E-Id

Table 7

E-Id (PK)   Other Elective Course Attribute

Table 8
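Tables 7 and 8 can be sketched with Python's built-in sqlite3 module as follows; the table and column names are chosen here, and the rows mirror Table 6.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Table 8: elective courses.
CREATE TABLE elective_course (e_id TEXT PRIMARY KEY, course_attr TEXT);
-- Table 7: the Enrolls relationship is folded into Student as a foreign key.
CREATE TABLE student (
    s_id TEXT PRIMARY KEY,
    student_attr TEXT,
    e_id TEXT REFERENCES elective_course(e_id)
);
INSERT INTO elective_course VALUES ('E1', '-'), ('E2', '-'), ('E3', '-');
INSERT INTO student VALUES
    ('S1', '-', 'E1'), ('S2', '-', 'E2'), ('S3', '-', 'E1'), ('S4', '-', 'E1');
""")

# Number of students per elective course, including courses with none.
enrolment = conn.execute("""
    SELECT c.e_id, COUNT(s.s_id)
    FROM elective_course c
    LEFT JOIN student s ON s.e_id = c.e_id
    GROUP BY c.e_id
    ORDER BY c.e_id
""").fetchall()
print(enrolment)
```

Placing the foreign key on the N side means each student row can name at most one course, which is exactly the N:1 constraint.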

Case 4: Binary Relationship with M:N cardinality

In this scenario, every student can enroll in more than 1 compulsory course, and a compulsory course can have more than 1 student. First, convert each entity and relationship to a table. The Student table corresponds to the Student entity with key S-Id. Similarly, the Compulsory_Courses table corresponds to the Compulsory_Courses entity with key C-Id. The Enrolls table represents the relationship between Student and Compulsory_Courses (which student enrolls in which course), so it takes the attribute S-Id from Student and C-Id from Compulsory_Courses.

Student
S-Id   Other Student Attribute
S1     -
S2     -
S3     -
S4     -

Enrolls
S-Id   C-Id
S1     C1
S1     C2
S3     C1
S4     C3
S4     C2
S3     C3

Compulsory_Courses
C-Id   Other Compulsory Course Attribute
C1     -
C2     -
C3     -
C4     -

Table 9

As we can see from Table 9, both S-Id and C-Id repeat in the Enrolls table, but their combination is unique, so it can be taken as the key of Enrolls. Since the keys of the three tables are all different, the tables can't be merged. The primary keys are S-Id for Student, (S-Id, C-Id) for Enrolls, and C-Id for Compulsory_Courses.

Case 5: Binary Relationship with weak entity

In this scenario, an employee can have many dependents, and one dependent can depend on only one employee. A dependent does not exist without an employee (e.g., you as a child can be a dependent of your father in his company). So Dependents is a weak entity, and its participation is always total. A weak entity does not have a key of its own, so its key is the combination of the key of its identifying entity (E-Id of Employee in this case) and its partial key (D-Name).

First, convert each entity and relationship to a table. The Employee table corresponds to the Employee entity with key E-Id. Similarly, the Dependents table corresponds to the Dependents entity with key (D-Name, E-Id). The Has table represents the relationship between Employee and Dependents (which employee has which dependents), so it takes the attribute E-Id from Employee and D-Name from Dependents.

Employee
E-Id   Other Employee Attribute
E1     -
E2     -
E3     -

Has
E-Id   D-Name
E1     RAM
E1     SRINI
E2     RAM
E3     ASHISH

Dependents
D-Name   E-Id   Other Dependents Attribute
RAM      E1     -
SRINI    E1     -
RAM      E2     -
ASHISH   E3     -

Table 10

As we can see from Table 10, (E-Id, D-Name) is the key for both the Has and the Dependents tables, so we can merge these two into one. The resulting tables are shown in Tables 11 and 12. Primary keys are marked (PK).

E-Id (PK)   Other Employee Attribute

Table 11

D-Name (PK)   E-Id (PK)   Other Dependents Attribute

Table 12

Conclusion
Converting an ER diagram to a Relational Model is a crucial step in database design. The ER model represents the conceptual structure, while the Relational Model is a logical representation that can be implemented directly in an RDBMS like Oracle or MySQL. We've covered five cases of binary relationships with various cardinalities and participation constraints, highlighting the key considerations and the resulting table structures for each. Understanding these cases helps database designers and developers translate conceptual ER models into relational schemas that are ready for implementation.

HOME C PROGRAMMING JAVA PYTHON JAVASCRIPT CRYPTOGRAPHY DBMS OTHER 

Home  DBMS Crash Course  Advanced SQL in DBMS

Advanced SQL in DBMS

By Prakhar - April 14, 2025

Basic SQL provides a strong foundation for working with databases, covering tasks such as
inserting, selecting, and retrieving data. Advanced SQL in DBMS extends the capabilities
of SQL and provides more sophisticated techniques, features and functions for working
with complex data.

Let’s learn more about Advanced SQL in DBMS.
What is Advanced SQL?

Advanced SQL refers to the use of the more sophisticated and complex features and
techniques of SQL (Structured Query Language) to perform more intricate and powerful
operations on a database.

Advanced SQL is used to handle complex data through these advanced concepts and
capabilities of SQL.

What are the Key Components of Advanced SQL?

Here is the list of the Key Components of Advanced SQL.

Subqueries: Queries nested within other queries are called subqueries. The result
of the inner query is used as the input of the outer query. This provides more
flexibility and enables more complex data manipulation and retrieval.
Subqueries can be used with INSERT, SELECT, DELETE and UPDATE statements.

Joins: A join combines data from multiple tables based on related attributes and
specified conditions, giving a more comprehensive and meaningful result. Common
types of joins include INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN.

Window Functions: These functions perform calculations across a set of rows that
are related to the current row. This set of rows, called a window, is defined by
partitioning and ordering the dataset. Common uses are ranking, moving averages
and cumulative sums, which makes window functions well suited to advanced
analytical queries.

Views: Views are virtual tables derived from the output of a query. They are used
to restrict access to sensitive and personal information and to present the data
in a more user-friendly format.

Indexing and Optimization of Performance: Advanced SQL also involves understanding
and optimizing the performance of queries in the database. The techniques involved
include creating appropriate indexes on tables and analyzing query performance with
tools such as a query profiler or EXPLAIN, which also helps in optimizing query
execution plans.

Common Table Expressions (CTEs): Within a SQL statement, CTEs provide a way to
define temporary named result sets. Large, complex queries can thus be broken
into smaller, more manageable parts, which improves maintainability and
readability. CTEs can also be recursive, which makes it possible to handle
hierarchical data and write hierarchical queries.

Advanced Data Manipulation: In Advanced SQL, advanced data manipulation includes
operations such as INSERT ... SELECT, UPDATE ... FROM and DELETE ... FROM (the exact
syntax varies between database systems). These statements modify data based on the
results of join operations and subqueries, which allows more precise changes to
the data set.

By using these components of advanced SQL, database administrators and developers
can perform more complex operations on data and optimize performance. These
components also help in creating more scalable and efficient database solutions.
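Three of the components above (subqueries, joins, and recursive CTEs) can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a convenient SQL engine. The departments and employees tables and all of their rows are hypothetical sample data, not taken from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY, name TEXT, salary INTEGER,
        dept_id INTEGER, manager_id INTEGER
    );
    INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
    INSERT INTO employees VALUES
        (1, 'Asha', 9000, 1, NULL),
        (2, 'Ben',  6000, 1, 1),
        (3, 'Chen', 4000, 2, 1),
        (4, 'Dina', 3000, 2, 3);
""")

# Subquery: the inner query computes the average salary (5500),
# and the outer query uses it as a filter.
above_avg = conn.execute("""
    SELECT name FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY name
""").fetchall()

# Join: combine each employee with the matching department row.
by_dept = conn.execute("""
    SELECT e.name, d.name
    FROM employees AS e
    INNER JOIN departments AS d ON d.id = e.dept_id
    ORDER BY e.id
""").fetchall()

# Recursive CTE: walk the reporting hierarchy from the top manager down.
chain = conn.execute("""
    WITH RECURSIVE reports(id, name, level) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, r.level + 1
        FROM employees AS e
        JOIN reports AS r ON e.manager_id = r.id
    )
    SELECT name, level FROM reports ORDER BY level, name
""").fetchall()

print(above_avg)  # [('Asha',), ('Ben',)]
print(by_dept)
print(chain)      # [('Asha', 0), ('Ben', 1), ('Chen', 1), ('Dina', 2)]
```

The same statements should carry over to other systems with little change, although join and CTE support differs in detail between DBMSs.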


What are the additional data types in advanced SQL?

In advanced SQL, there are some additional data types beyond the basic ones found
in almost every database. These additional data types often provide more
flexibility in managing complex data and information.

Let’s see the list of some additional data types of the advanced SQL.

JavaScript Object Notation (JSON): JSON is used to represent structured data.
Many databases that support advanced SQL have native support for a JSON data type,
which allows JSON documents to be stored, queried, retrieved and manipulated
directly in the database. It is used for handling hierarchical and semi-structured
data.

eXtensible Markup Language (XML): XML is a markup language used for structuring
and representing data. The XML data type is supported in some advanced SQL
databases. It is particularly useful when data is exchanged with external systems
or web services that use XML as their data format.

Arrays: An array is a collection of multiple values of the same data type. It
provides efficient storage because multiple related values are stored and
retrieved together. Examples are an array of integers, a list of tags, etc.

Geospatial Types: These types are designed to represent and query geographical
information. Some advanced SQL databases can store and manipulate geographical
data such as points, lines and polygons, and can build spatial indexes on it.

Binary Large Object (BLOB): This data type is used for storing very large data
objects such as images, videos, or documents. Databases that support the BLOB
data type can store, retrieve and manipulate such large objects, often through
specialized functions.

The availability of these additional data types varies with the database being
used; not all DBMSs support all of them.
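As a small illustration of one of these types, the sketch below stores and reads back a BLOB value. It uses Python's built-in sqlite3 module, which is only one of many engines with BLOB support. The files table, the file name and the byte payload are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, content BLOB)")

# Hypothetical binary payload; in practice this could be image or document bytes.
payload = bytes(range(256))
conn.execute("INSERT INTO files VALUES (?, ?)", ("sample.bin", payload))

# For a BLOB, LENGTH() reports the size in bytes.
stored, size = conn.execute(
    "SELECT content, LENGTH(content) FROM files WHERE name = ?",
    ("sample.bin",),
).fetchone()

print(size)               # 256
print(stored == payload)  # True
```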


What are the functions in advanced SQL?

Advanced SQL supports a wide variety of functions that go beyond basic string and
arithmetic manipulation.

Let’s see the list of the functions supported by Advanced SQL.

1) Aggregate Functions: These perform a calculation on a set of values and
return a single result. Some examples are:

SUM(): It is used to calculate the sum of the set of values in the query.

AVG(): It is used to calculate the average of the set of values in the query.

COUNT(): It is used to count the number of rows, or the number of non-null values,
in the given set of values.

MAX(): It is used to find the maximum value from the set of values.

MIN(): It is used to find the minimum values from the set of values.

2) String Functions: These operate on character data and are used for analyzing
and manipulating string values. Some examples are:

CONCAT(): It is used for combining two or more strings.

LENGTH(): It is used to calculate the length of the given string.

SUBSTRING(): It is used to extract the specific portion from the given string.

UPPER(): It is used to convert the string to the Upper case.

LOWER(): It is used to convert the string to the Lower case.

TRIM(): It is used to remove the trailing and the leading spaces from the string.

3) Mathematical Functions: Advanced SQL also includes mathematical functions that
are used to perform more complex calculations. Some examples are:

ABS(): It is used to get the absolute value of a number.

ROUND(): It is used to round the given decimal value to the closest whole number
or, when a second argument is given, to the specified number of decimal places.

POWER(): It is used to raise a number to the specific power.

SQRT(): It is used to calculate the square root of the given number.

LOG(): It is used to calculate the logarithm of the given number.

4) Conditional Functions: These return specific results based on the conditions
applied. Some examples are:

CASE WHEN THEN END: It is used to evaluate multiple conditions and return a
specific result based on the first condition that holds. It is similar to the
IF ELSE statements of other programming languages.

NULLIF(): It is used to compare two expressions; it returns null if the
expressions are equal and otherwise returns the first expression.

COALESCE(): It is used to return the first non-null value from the list of
arguments passed.
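The three conditional functions can be compared side by side in a short sketch. It runs the SQL through Python's built-in sqlite3 module. The employees table, its rows and the 5000 salary threshold are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER, bonus INTEGER);
    INSERT INTO employees VALUES
        ('Asha', 9000, NULL),
        ('Ben',  4000, 0),
        ('Chen', 6000, 500);
""")

rows = conn.execute("""
    SELECT name,
           CASE WHEN salary > 5000 THEN 'High' ELSE 'Low' END AS level,
           COALESCE(bonus, 0) AS bonus_or_zero,   -- NULL becomes 0
           NULLIF(bonus, 0)   AS bonus_or_null    -- 0 becomes NULL
    FROM employees
    ORDER BY name
""").fetchall()

for row in rows:
    print(row)
# ('Asha', 'High', 0, None)
# ('Ben', 'Low', 0, None)
# ('Chen', 'High', 500, 500)
```

COALESCE and NULLIF act as rough opposites here: the first replaces a missing bonus with 0, and the second turns a 0 bonus back into NULL.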


Advanced SQL has a variety of powerful features that database administrators and
developers can use to manipulate and retrieve complex data more easily. It also
offers a wide range of functions for performing various types of calculations on
large data sets, which makes it very helpful for business users.


© 2025 UseMyNotes. All Rights Reserved


SQL | Advanced Functions

Last Updated : 08 Sep, 2025

SQL provides a variety of advanced functions for performing complex calculations,
transformations and aggregations on data. These functions are essential for data analysis,
reporting, and efficient database management.

1. Aggregate Functions
Aggregate functions perform calculations on multiple rows and return a single value.

SUM(): Calculates the sum of values in a column.

AVG(): Computes the average of values in a column.
COUNT(): Returns the number of rows or non-null values in a column.
MIN(): Finds the minimum value in a column.
MAX(): Retrieves the maximum value in a column.

Example

SELECT COUNT(*), AVG(Salary), SUM(Salary), MIN(Salary), MAX(Salary)
FROM Employees;

Output :

Returns count of rows, average, total salary, min salary, and max salary.
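To see concrete numbers, the same query can be run against a tiny table. The sketch below uses Python's built-in sqlite3 module. The three Employees rows are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("A", 3000), ("B", 5000), ("C", 7000)],  # hypothetical rows
)

# One result row: count, average, total, minimum and maximum salary.
result = conn.execute("""
    SELECT COUNT(*), AVG(Salary), SUM(Salary), MIN(Salary), MAX(Salary)
    FROM Employees
""").fetchone()

print(result)  # (3, 5000.0, 15000, 3000, 7000)
```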

2. Conditional Functions
Conditional functions help apply logic inside SQL queries.

CASE WHEN: Allows conditional logic to be applied in the SELECT statement.

COALESCE(): Returns the first non-null value in a list.
NULLIF(): Compares two expressions and returns null if they are equal; otherwise,
returns the first expression.

Example

SELECT Name,
CASE WHEN Salary > 5000 THEN 'High'
ELSE 'Low' END AS Salary_Level
FROM Employees;

Output:

Labels employees as High or Low salary.

3. Mathematical Functions
Mathematical functions are used for numeric calculations. Some commonly used
mathematical functions are given below:

ABS(): Returns the absolute value of a number.

ROUND(): Rounds a number to a specified number of decimal places.
POWER(): Raises a number to a specified power.
SQRT(): Calculates the square root of a number.

Example

SELECT ABS(-15), ROUND(25.678, 2), POWER(2, 3), SQRT(49);

Output :

15, 25.68, 8, 7

4. Advanced Functions in SQL

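Beyond aggregates and math, MySQL offers system and utility functions for deeper insights. The functions below are shown with MySQL syntax; their availability in other database systems varies.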

BIN()

Convert decimal to binary

SELECT BIN(18);

Output: 10010

BINARY

Converts a value to a binary string. In MySQL, BINARY is an operator rather than a function.

SELECT BINARY "GeeksforGeeks";

Output:

COALESCE()

Returns the first non-null expression in a list

SELECT COALESCE(NULL,NULL,'GeeksforGeeks',NULL,'Geeks');

Output: GeeksforGeeks

CONNECTION_ID()

Returns the unique connection ID for the current connection

SELECT CONNECTION_ID();

Output:

CURRENT_USER()

Returns the user name and host name of the MySQL account that the server used to authenticate the current client.

SELECT CURRENT_USER();

Output:

DATABASE()

Returns the name of the default database.

SELECT DATABASE();

Output:

IF()

Returns one value if a condition is TRUE, or another value if a condition is FALSE

SELECT IF(200<500, "YES", "NO");

Output: YES

LAST_INSERT_ID()

Returns the first AUTO_INCREMENT value generated by the most recently executed INSERT
statement on the current connection

SELECT LAST_INSERT_ID();

Output:

NULLIF()

Returns NULL if equal

SELECT NULLIF(115, 115);

Output: NULL

SESSION_USER()

Returns the user name and host name for the current MySQL user

SELECT SESSION_USER();

Output:

SYSTEM_USER()

Returns the user name and host name for the current MySQL user.

SELECT SYSTEM_USER();

Output:

USER()

It returns the user name and host name for the current MySQL user

SELECT USER();

Output:

VERSION()

It returns the version of the MySQL database

SELECT VERSION();

Output:

Comment S Sakshi98 Follow 15

Article Tags : Misc SQL SQL-Functions


@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved


File Organization in DBMS

Last Updated : 09 Sep, 2025

File organization in DBMS refers to the method of storing data records in a file so they can
be accessed efficiently. It determines how data is arranged, stored, and retrieved from
physical storage.

The Objective of File Organization

It helps in the faster selection of records, which speeds up data access.
Operations such as inserting, deleting, and updating records become faster
and easier.
It helps prevent duplicate records from being inserted through these operations.
It helps in storing the records or the data efficiently at a minimal cost.

Types of File Organizations

Various methods have been introduced to organize files. Each method has advantages
and disadvantages with respect to access or selection, so it is up to the programmer
to choose the file organization method best suited to the requirements.

Some types of File Organizations are:

Sequential File Organization

Heap File Organization
Clustered File Organization
ISAM (Indexed Sequential Access Method)
Hash File Organization
B+ Tree File Organization

We will discuss each of these file organizations in further sets of this article, along with
the differences, advantages, and disadvantages of each method.

Sequential File Organization

The easiest method for file organization is the sequential method. In this method, the
records of a file are stored one after another in a sequential manner. There are two ways
to implement this method:

1. Pile File Method

This method is quite simple, in which we store the records in a sequence i.e. one after the
other in the order in which they are inserted into the tables.

Pile File Method

Insertion of the new record: Let R1, R3, R5, and R4 be four records in the sequence.
Here, a record is simply a row in a table. Suppose a new record R2 has to be inserted
in the sequence; it is simply placed at the end of the file.

New Record Insertion

2. Sorted File Method

In this method, as the name suggests, a new record is always inserted so that the file
stays sorted (in ascending or descending order). The sorting of records may be based on
the primary key or on any other key.

Sorted File Method

Insertion of the new record: Let us assume that there is a preexisting sorted sequence of
four records R1, R3, R7, and R8. Suppose a new record R2 has to be inserted in the
sequence; it is first appended at the end of the file, and the sequence is then sorted
again.

New Record Insertion
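The two insertion strategies can be contrasted in a minimal sketch. It models a file as a Python list of record names. The function names pile_insert and sorted_insert are made up for this illustration, and the record names follow the examples in the text.

```python
def pile_insert(file_records, record):
    """Pile file: the new record goes at the end, in arrival order."""
    file_records.append(record)


def sorted_insert(file_records, record):
    """Sorted file: append the record, then re-sort the whole sequence."""
    file_records.append(record)
    file_records.sort()


pile = ["R1", "R3", "R5", "R4"]
pile_insert(pile, "R2")
print(pile)         # ['R1', 'R3', 'R5', 'R4', 'R2']

sorted_file = ["R1", "R3", "R7", "R8"]
sorted_insert(sorted_file, "R2")
print(sorted_file)  # ['R1', 'R2', 'R3', 'R7', 'R8']
```

The re-sort after every insertion is where the extra time of the sorted file method comes from. A real system would more likely locate the insertion point directly, but the sketch follows the append-then-sort description in the text.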

Advantages of Sequential File Organization

Fast and efficient when large amounts of data are processed in sequence.
Simple design.
Files can be easily stored on magnetic tapes, i.e. a cheaper storage mechanism.

Disadvantages of Sequential File Organization

Time is wasted because we cannot jump directly to a required record; we have to
move through the file sequentially.
The sorted file method is inefficient because sorting the records takes time and space.

Heap File Organization

Heap File Organization works with data blocks. In this method, records are inserted at the
end of the file, into the data blocks. No sorting or ordering is required. If a data block
is full, the new record is stored in some other block. This other data block need not be
the very next data block; it can be any block in memory. It is the responsibility of the
DBMS to store and manage the new records.

Heap File Organization

Insertion of the new record: Suppose we have five records in the heap, R1, R5, R6, R4,
and R3, and a new record R2 has to be inserted. Since the last data block, i.e. data
block 3, is full, R2 will be inserted in any of the data blocks selected by the DBMS,
let's say data block 1.

New Record Insertion

If we want to search, delete or update data in the heap file Organization we will traverse
the data from the beginning of the file till we get the requested record. Thus if the
database is very huge, searching, deleting, or updating the record will take a lot of time.
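The insertion and search behaviour described above can be sketched as follows. The HeapFile class, the block capacity of two records and the initial layout of the blocks are assumptions chosen to mirror the R1 to R6 example. A real DBMS manages blocks on disk and picks the target block by its own policy.

```python
BLOCK_CAPACITY = 2  # hypothetical number of records per data block


class HeapFile:
    def __init__(self, blocks):
        self.blocks = blocks  # list of data blocks, each a list of records

    def insert(self, record):
        """Try the last block first, then any other block with free space."""
        last = len(self.blocks) - 1
        for i in [last] + list(range(last)):
            if len(self.blocks[i]) < BLOCK_CAPACITY:
                self.blocks[i].append(record)
                return i
        self.blocks.append([record])  # every block is full: add a new one
        return len(self.blocks) - 1

    def search(self, record):
        """Scan from the beginning; return (block index, records scanned)."""
        scanned = 0
        for i, block in enumerate(self.blocks):
            for r in block:
                scanned += 1
                if r == record:
                    return i, scanned
        return None, scanned


# Data blocks 1, 2 and 3 of the example (list indexes 0, 1 and 2).
heap = HeapFile([["R1"], ["R5", "R6"], ["R4", "R3"]])
print(heap.insert("R2"))  # 0, i.e. data block 1, because block 3 is full
print(heap.search("R3"))  # (2, 6): six records read before R3 is found
```

The search result shows why heap files scale poorly: locating one record may require reading every record stored before it.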

Other types of file organization are discussed in later articles.

Advantages of Heap File Organization

Fetching and retrieving records is faster than in sequential file organization, but only
in the case of small databases.
When a large amount of data needs to be loaded into the database at one time, this
method of file organization is best suited.

Disadvantages of Heap File Organization

The problem of unused memory blocks.
Inefficient for larger databases.

Read related articles:

File Organization in DBMS | Set 2 (Hashing in DBMS)
File Organization in DBMS | Set 3


Comment S Smith… Follow 52

Article Tags : Misc DBMS GATE CS dbms


Hashing in DBMS: Techniques and Applications

Updated on February 17, 2025

Hashing is a technique in DBMS that is used to search for records in databases of any size. In larger databases, which contain thousands or millions of records, the indexing data structure technique becomes inefficient because searching for a specific record using indexing consumes more time. To counter this problem, hashing techniques are used. In this article, we will go through various hashing techniques.

What is Hashing in DBMS?
The hashing technique uses a hash function to store data records in an auxiliary hash table. There are three major components in hashing:

Hash Table: A hash table is an array or similar data structure whose size is determined by the total number of data records in the database. Each memory location in a hash table is referred to as a “bucket” or hash index; it stores the precise location of a data record and is accessed via a hash function.
Bucket: A bucket is a memory location (index) in the hash table where the data record is stored. A bucket typically holds a disk block, which in turn holds numerous records. Another name for it is the hash index.
Hash Function: A hash function is an algorithm or mathematical equation that takes the primary key of a data record as input and computes the hash index as output.


Properties of Hashing in DBMS

In this strategy, data is stored in data blocks whose addresses are generated by a hash function. The memory locations where these records are kept are called data buckets or data blocks.

The hash function can generate the address from any column value, but it most often uses the primary key. A hash function can be anything from a simple mathematical function to a complex one. The primary key itself can even be used as the address of the data block, in which case each row is stored at the address equal to its primary key.

In that simplest case, the primary key values correspond directly to the data block addresses. Alternatively, the hash function can be a straightforward mathematical function such as exponential, mod, cos, or sin. Suppose we want to determine the address of the data block using the mod (5) hash function. Applying mod (5) to five primary keys produces, for example, the results 3, 3, 1, 4, and 2, and the records are then saved at those data block positions.
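The mod (5) example can be reproduced in a few lines. The five primary keys below are hypothetical. They were chosen only so that the computed addresses match the results 3, 3, 1, 4, and 2 quoted in the text, since the original key values are not given here.

```python
NUM_BUCKETS = 5


def hash_mod(primary_key, num_buckets=NUM_BUCKETS):
    """Map a primary key to a data block address using the mod function."""
    return primary_key % num_buckets


keys = [103, 108, 111, 114, 117]  # hypothetical primary keys
addresses = [hash_mod(k) for k in keys]
print(addresses)  # [3, 3, 1, 4, 2]
```

The first two keys map to the same address, 3. This is the collision case that the probing techniques described under Important Terminologies are meant to handle.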

Important Terminologies

Data Bucket
Data buckets are storage locations within a hash table where actual data records are kept. Each bucket can hold one or more records,
depending on the implementation.

Hash Function
A hash function is a mathematical algorithm that converts a given input (often a primary key) into a specific address within the hash table.
This address indicates where the corresponding data record is stored.

Hash Index
The hash index is the result produced by the hash function, representing the address of the data block within the hash table. It serves as a
quick reference to locate the desired data.

Linear Probing
Linear probing is a collision resolution technique used when the initial bucket calculated by the hash function is already occupied. It
checks the following buckets in the hash table one by one until an empty one is found.

Quadratic Probing
Quadratic probing is another collision resolution method similar to linear probing. However, instead of checking the next bucket linearly, it
uses a quadratic function to determine the next bucket to check, reducing clustering issues.

Bucket Overflow
Bucket overflow occurs when the bucket identified by the hash function is already full. This situation requires additional handling
strategies, such as linear or quadratic probing, to find an alternative bucket for storing the new record.
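The difference between the two probing strategies shows up when several keys hash to the same bucket. In the sketch below, the table size of 7, the keys and the helper names probe_insert, linear and quadratic are assumptions made for the illustration. Each bucket holds a single key to keep the example small.

```python
def linear(i):
    return i        # offsets 0, 1, 2, 3, ...


def quadratic(i):
    return i * i    # offsets 0, 1, 4, 9, ...


def probe_insert(table, key, step):
    """Insert key into the first free slot along its probe sequence."""
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + step(i)) % size
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("bucket overflow: no free slot found while probing")


# 10, 17 and 24 all hash to bucket 3 in a table of size 7.
linear_table = [None] * 7
linear_slots = [probe_insert(linear_table, k, linear) for k in (10, 17, 24)]
print(linear_slots)     # [3, 4, 5]

quadratic_table = [None] * 7
quadratic_slots = [probe_insert(quadratic_table, k, quadratic) for k in (10, 17, 24)]
print(quadratic_slots)  # [3, 4, 0]
```

With linear probing the colliding keys pile up in neighbouring buckets. Quadratic probing spreads them further apart, which is the reduction in clustering mentioned above.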

Working of Hash Function

A hash index is produced by the hash function using the data record’s primary key. There are now two options:

There isn’t a value that already occupies the generated hash index. Thus, this is where the data record’s address will be kept.
There is already another value occupying the hash index that was generated. This is referred to as a collision, so a collision resolution
technique will be used to combat it.

Whenever we query a specific record, the same hash function is applied. It returns the data record much faster than indexing, because
the hash function gives the exact location of the data record without having to search through the index entries one by one.

Types of Hashing in DBMS

There are two primary hashing techniques in DBMS:

Static Hashing
Dynamic Hashing

Static Hashing in DBMS

Static hashing in a Database Management System (DBMS) is a technique where the size and structure of the hash table are fixed when it
is created. Here are some key points about static hashing:

Fixed Number of Buckets: The number of data buckets remains constant throughout. Each bucket is a storage location where records
are stored.
Hash Function: A hash function is used to map search-key values to bucket addresses. For example, if the hash function is mod 5, it
will always map a given key to the same bucket address.

Operations in Static Hashing

Insertion: When a record needs to be entered using a static hash, the hash function h determines the bucket address (where the record
will be stored) for search key K.

Address of a bucket = h(K)

Search: The address of the bucket containing the data can be obtained using the same hash function when a record needs to be
retrieved.
Delete: This is simply a search followed by a deletion operation.
Update: The record is searched using the hash function and then updated.
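The four operations can be sketched with a fixed number of buckets. The StaticHashFile class and the sample keys are hypothetical. Letting each bucket grow as a Python list is a simplification that stands in for an overflow mechanism, which the text does not prescribe.

```python
class StaticHashFile:
    def __init__(self, num_buckets=5):
        # The number of buckets is fixed at creation time.
        self.buckets = [[] for _ in range(num_buckets)]

    def _address(self, key):
        # Address of a bucket = h(K), here h(K) = K mod number of buckets.
        return key % len(self.buckets)

    def insert(self, key, record):
        self.buckets[self._address(key)].append((key, record))

    def search(self, key):
        for k, record in self.buckets[self._address(key)]:
            if k == key:
                return record
        return None

    def update(self, key, record):
        bucket = self.buckets[self._address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, record)
                return True
        return False

    def delete(self, key):
        bucket = self.buckets[self._address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False


table = StaticHashFile()
table.insert(103, "Alice")
table.insert(108, "Bob")      # 108 mod 5 = 3, same bucket as 103
table.update(108, "Robert")
print(table.search(103), table.search(108))  # Alice Robert
table.delete(103)
print(table.search(103))                     # None
```

Every operation starts with the same address computation, so search, update and delete only ever inspect one bucket.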

Dynamic Hashing in DBMS

Dynamic hashing is a technique used in DBMS that handles the limitations of static hashing like bucket overflow. Here are the key aspects
of dynamic hashing:

Variable Number of Buckets: Unlike static hashing, the number of buckets in dynamic hashing can grow or shrink based on the
number of records.
Directory: A directory is used to keep track of the buckets. The directory itself can grow or shrink dynamically.
Hash Function: The hash function generates a hash value, and the directory uses a certain number of bits from this hash value to
determine the bucket address.
Bucket Splitting: When a bucket overflows, it is split into two, and the directory is updated to reflect this change. This helps in
distributing the records more evenly.

Operations in Dynamic Hashing

Querying: Look at the depth value of the hash index and use that many bits of the hash value to compute the bucket address.
Update: To perform a query as above and update the data.
Deletion: Perform a query to locate the desired data and delete the same.
Insertion: Compute the address of the bucket.
    If the bucket is already full:
        Add more buckets.
        Add additional bits to the hash value.
        Re-compute the hash function.
    Else:
        Add data to the bucket.
    If all the buckets are full, perform the remedies of static hashing.
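One common way to realize the directory, bit selection and bucket splitting described above is extendible hashing. The sketch below is a minimal in-memory version under that assumption, not a definitive implementation. The class names, the bucket capacity of two and the use of the lowest bits of Python's built-in hash() are all choices made for the illustration, and there is no guard for keys that share an identical hash value.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth   # local depth: number of hash bits this bucket uses
        self.items = {}


class ExtendibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        # The directory uses the lowest global_depth bits of the hash value.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key):
        return self.directory[self._index(key)].items.get(key)

    def delete(self, key):
        items = self.directory[self._index(key)].items
        if key in items:
            del items[key]
            return True
        return False

    def put(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.capacity:
                bucket.items[key] = value
                return
            self._split(bucket)  # bucket overflow: split, then retry

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Double the directory so that one more hash bit is used.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.depth += 1
        sibling = Bucket(bucket.depth)
        new_bit = 1 << (bucket.depth - 1)
        # Directory slots with the new bit set now point to the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = sibling
        # Redistribute the records between the two buckets.
        for k in list(bucket.items):
            if hash(k) & new_bit:
                sibling.items[k] = bucket.items.pop(k)


table = ExtendibleHash(capacity=2)
for k in range(16):
    table.put(k, f"record-{k}")

print(table.global_depth)                          # 3
print(len({id(b) for b in table.directory}))       # 8 distinct buckets
print(table.get(11))                               # record-11
```

Only the overflowing bucket is split on each overflow, and the directory doubles only when that bucket already uses as many bits as the directory does. This is what lets the structure grow gradually instead of rehashing every record.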

Static Vs Dynamic Hashing in DBMS

Static Hashing | Dynamic Hashing

In static hashing, hash tables have a fixed size. When a table is filled up, an overflow page is allocated to take in the overflowing data. | Dynamic hashing employs an alternative methodology. Instead of being fixed, dynamic hash tables grow to hold more data.

Because static hashing only requires two input/output operations (read and write), it is a quick way to perform insert and delete operations. | If a dynamic hash table is empty, it can be resized to save memory usage.

The overflow chain can grow rather lengthy if the hash table isn’t optimized, which will lower performance and complicate scaling. | The advantage of dynamic hashing lies in its flexibility: you don’t have to plan its size in advance.

Best Use Cases for Hashing

Hashing algorithms excel in various scenarios where fast data retrieval, security, and efficiency are essential. Graphics processing,
where quick access to pixel data or image properties is crucial, benefits from hashing’s speed. Similarly, game boards, particularly in
complex video games, rely on hashing to quickly determine the state of the game and make efficient moves.

In cybersecurity, hashing plays a pivotal role in generating message digests, ensuring data integrity, and detecting tampering. Moreover,
during compiler operation, hashing helps expedite symbol table lookups and optimize code generation. Password verification is another
prime application, enhancing security by storing and comparing password hashes instead of the actual passwords.

Lastly, when linking file names and paths in file systems, hashing enables rapid access and reduces the need for exhaustive searches.
In these use cases, hashing offers a powerful solution to streamline operations and enhance both performance and security.


Conclusion
Hashing in databases is a technique that speeds up data retrieval in large databases by using hash functions to locate
records. It avoids having to search through index entries one by one. There are two main hashing methods, static and dynamic: static hashing
uses fixed-size tables, while dynamic hashing adjusts itself to the data volume. Each type has its advantages: static hashing is simple and
dynamic hashing is flexible. Hashing is widely used in areas that need fast access, security, and performance, like cybersecurity, graphics
processing, and file systems.

FAQs
What is a pointer in C?

What are null pointers?

A null pointer is a pointer that does not point to any valid memory location. It is often initialized with NULL to
signify that it is not currently pointing to a valid object.

Updated on February 17, 2025 Link

Programs Blogs Topics Free Courses Enterprise More Start Learning

Home Learning Hub Articles

Hashing in DBMS: Techniques and Applications

Updated on February 17, 2025

Properties of Hashing in DBMS

What is Hashing in DBMS?
The hashing technique uses a hash function to store data records in an auxiliary hash table. There are three major components in
hashing:
Important Terminologies

Also read: Cardinality in DBMS

Best Use Cases for Hashing

Conclusion Get curriculum highlights, career paths, industry insights and accelerate
your technology journey.
FAQs Download brochure

Properties of Hashing in DBMS

In this strategy, data is stored in blocks called addresses that are generated by the hashing process. These records are kept in memory at
locations called data buckets or data blocks.

Important Terminologies

Data Bucket
Data buckets are storage locations within a hash table where actual data records are kept. Each bucket can hold one or more records,
depending on the implementation.

Hash Index
The hash index is the result produced by the hash function, representing the address of the data block within the hash table. It serves as a
quick reference to locate the desired data.

Working of Hash Function

A hash index is produced by the hash function using the data record’s primary key. There are now two options:

Types of Hashing in DBMS

There are two primary hashing techniques in DBMS:

Static Hashing
Dynamic Hashing

Static Hashing in DBMS

Static hashing in a Database Management System (DBMS) is a technique where the size and structure of the hash table are fixed when it
is created. Here are some key points about static hashing:

Operations in Static Hashing

Insertion: When a record needs to be entered using a static hash, the hash function h determines the bucket address (where the record
will be stored) for search key K.

Address of a bucket = h(K)

Dynamic Hashing in DBMS

Dynamic hashing is a technique used in DBMS that handles the limitations of static hashing like bucket overflow. Here are the key aspects
of dynamic hashing:

Operations in Dynamic Hashing

Static Vs Dynamic Hashing in DBMS

Static Hashing Dynamic Hashing

Dynamic hashing employs an alternative methodology.

Because static hashing only requires two input/output operations—

If a dynamic hash table is empty, it can be resized to save
read and write—it is a quick way to perform insert and delete
memory usage.
operations.

Best Use Cases for Hashing

Also Read: Data Integrity in DBMS

FAQs
What is a pointer in C?

What are null pointers?

A null pointer is a pointer that does not point to any valid memory location. It is often initialized with NULL to
signify that it is not currently pointing to a valid object.

Updated on February 17, 2025 Link


Hashing in DBMS
Last Updated : 31 Jul, 2025

Hashing in DBMS is a technique to quickly locate a data record in a database irrespective of
the size of the database. For larger databases containing thousands or millions of
records, searching through an index structure can become slow, because locating a
specific record takes more time as the index grows. This doesn't align with the goals
of a DBMS, where data retrieval time should be minimized. So, to
counter this problem, the hashing technique is used. In this article, we will learn about various
hashing techniques.

The hashing technique utilizes an auxiliary hash table to store the data records using a
hash function. There are 3 key components in hashing:

Hash Table: A hash table is an array or similar data structure whose size is determined by the
total volume of data records present in the database. Each memory location in a hash
table is called a 'bucket' or hash index; it stores a data record's exact location and
can be accessed through a hash function.
Bucket: A bucket is a memory location (index) in the hash table that stores the data
record. A bucket generally stores a disk block, which in turn stores multiple records.
It is also known as the hash index.
Hash Function: A hash function is a mathematical equation or algorithm that takes a
data record's primary key as input and computes the hash index as output.

Hash Function
A hash function is a mathematical algorithm that computes the index or the location where
the current data record is to be stored in the hash table so that it can be accessed
efficiently later. This hash function is the most crucial component that determines the speed
of fetching data.

Working of Hash Function

The hash function generates a hash index through the primary key of the data record.

Now, there are 2 possibilities:

1. The hash index generated isn't already occupied by any other value. So, the address of
the data record will be stored here.
2. The hash index generated is already occupied by some other value. This is called a
collision; to counter it, a collision resolution technique is applied.

Whenever we query a specific record, the hash function is applied again and returns
the data record comparatively faster than indexing, because we can directly reach the
exact location of the data record through the hash function rather than searching through
indices one by one.

Example:

Hashing

Types of Hashing in DBMS

There are four primary hashing techniques in DBMS.

1. Static Hashing
In static hashing, the hash function always generates the same bucket address for a given
key. For example, if we have a data record for employee_id = 106 and the hash function is
mod-5, that is H(x) = x % 5, where x = id, then the operation will take place like this:

H(106) = 106 % 5 = 1.
This indicates that the data record should be placed or searched in the 1st bucket (or
1st hash index) in the hash table.

Example:

Static Hashing Technique

The primary key is used as the input to the hash function and the hash function generates
the output as the hash index (bucket's address) which contains the address of the actual
data record on the disk block.
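
The mod-5 example above can be sketched in a few lines of Python. This is only an illustrative sketch: the names NUM_BUCKETS, bucket_address, insert and search are assumptions made for the example, and an in-memory list stands in for disk blocks.

```python
NUM_BUCKETS = 5  # fixed when the table is created; static hashing never changes it

# One list per bucket; each list stands in for a disk block of records.
buckets = [[] for _ in range(NUM_BUCKETS)]

def bucket_address(key: int) -> int:
    """Mod-5 hash function: the same key always maps to the same bucket."""
    return key % NUM_BUCKETS

def insert(key: int, record: str) -> None:
    """Store the record in the bucket chosen by the hash function."""
    buckets[bucket_address(key)].append((key, record))

def search(key: int):
    """Look only inside the one bucket the key hashes to."""
    for stored_key, record in buckets[bucket_address(key)]:
        if stored_key == key:
            return record
    return None

insert(106, "employee 106")
```

Here search(106) inspects only bucket 1, which is where the speed of hashing comes from: no other bucket is ever read for that key.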

Static Hashing has the following Properties

Fixed Table Size: The number of buckets remains constant.

Simple Hash Function: Typically uses a modulo function.
Best for Known Data Size: Efficient when the number of records is known and stable.
Inefficient with Dynamic Data: As data grows, collisions increase, leading to bucket
overflows or skew.

To resolve this problem of bucket overflow, techniques such as - chaining and open
addressing are used. Here's a brief info on both:

2. Chaining (Separate Chaining)

Chaining is a mechanism in which the hash table is implemented as an array of nodes,
where each bucket is the head of a linked list that stores the data records. So, even if
the hash function generates the same value for several data records, each one can still
be stored in the bucket by adding a new node to its list.

Given:

Hash table size = 5

Hash function: h(key) = key % 5
Keys to insert: 10, 15, 20, 25, 30, 11

Step-by-step hashing:

Key | Hash Index (key % 5) | Inserted At
10 | 0 | Bucket 0 → [10]
15 | 0 | Bucket 0 → [10 → 15]
20 | 0 | Bucket 0 → [10 → 15 → 20]
25 | 0 | Bucket 0 → [10 → 15 → 20 → 25]
30 | 0 | Bucket 0 → [10 → 15 → 20 → 25 → 30]
11 | 1 | Bucket 1 → [11]

Final Hash Table (with separate chaining):

Index | Linked List (Bucket)
0 | 10 → 15 → 20 → 25 → 30
1 | 11
2 | --
3 | --
4 | --

Key Points:

All keys that hash to the same index (like 10, 15, 20, etc.) are stored in a linked list at
that index.
Separate chaining avoids clustering and makes insertion easier.
It keeps working when the hash table load factor is high, although lookups slow down as
the chains grow.

However, this gives rise to the problem of bucket skew: if the hash function keeps
generating the same value again and again, hashing becomes inefficient, because the
remaining data buckets stay unoccupied or store minimal data.
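
The chaining example and the skew it produces can be reproduced with a short Python sketch. Python lists are used here in place of linked lists, and the function name build_chained_table is an assumption made for illustration.

```python
def build_chained_table(keys, size):
    """Insert each key into the chain at index key % size."""
    table = [[] for _ in range(size)]  # one chain (list) per bucket
    for key in keys:
        table[key % size].append(key)  # a collision just extends the chain
    return table

table = build_chained_table([10, 15, 20, 25, 30, 11], 5)

# Bucket skew shows up as one long chain while the other buckets stay empty.
longest_chain = max(len(chain) for chain in table)
empty_buckets = sum(1 for chain in table if not chain)
```

With these keys, five of the six records share one chain, so a search for 30 has to walk past four other keys even though three buckets hold nothing at all.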

3. Open Addressing (Closed Hashing)

Open addressing, also called closed hashing, solves the problem of collision by looking
for the next available empty slot that can store the data. It uses techniques like linear
probing, quadratic probing, double hashing, etc.

Example:

Hash table size = 7

Hash function: h(key) = key % 7
Collision resolution: Linear Probing

Insert the keys: 50, 700, 76, 85, 92, 73

Step-by-step insertion:

Key | Hash (key % 7) | Insert At | Collision? | Final Position (after probing)
50 | 50 % 7 = 1 | 1 | No | 1
700 | 700 % 7 = 0 | 0 | No | 0
76 | 76 % 7 = 6 | 6 | No | 6
85 | 85 % 7 = 1 | 1 | Yes | 2 (next slot)
92 | 92 % 7 = 1 | 1 | Yes | 3 (after 1 and 2 are filled)
73 | 73 % 7 = 3 | 3 | Yes | 4 (next slot after 3)

Final Hash Table (index → value):

Index | 0 | 1 | 2 | 3 | 4 | 5 | 6
Value | 700 | 50 | 85 | 92 | 73 | -- | 76
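
The insertion sequence in this example can be checked with a small Python sketch of linear probing. The function name insert_linear_probing is an assumption for illustration, and None marks an empty slot.

```python
def insert_linear_probing(table, key):
    """Place key in the first free slot at or after key % len(table)."""
    size = len(table)
    home = key % size
    for step in range(size):
        slot = (home + step) % size  # wrap around at the end of the table
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

table = [None] * 7
positions = {key: insert_linear_probing(table, key)
             for key in [50, 700, 76, 85, 92, 73]}
```

Note how 73 hashes to slot 3 without sharing a hash value with any other key, yet still collides because 92 was pushed there by earlier probing. This knock-on effect is the clustering that separate chaining avoids.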

4. Dynamic Hashing
Dynamic hashing, also known as extendible hashing, is used to handle databases whose
data sets change frequently. This method offers a way to add and remove data
buckets on demand. As the number of data records varies, the number of
buckets grows and shrinks accordingly.

Properties of Dynamic Hashing

Flexible Hash Function: Adapts based on data size.

Directories: Point to buckets and may grow in size.
Global Depth: Number of bits in directory indices.
Bucket Splitting: Prevents overflow and ensures balanced distribution.

Working of Dynamic Hashing

Example: If the global depth is k = 2, the keys will be mapped according to the hash index.
The k bits starting from the LSB (least significant bit) are taken to map a key to the
buckets. That leaves us with the following 4 possibilities: 00, 11, 10, 01.

Dynamic Hashing - mapping

As we can see in the above image, the k bits from LSBs are taken in the hash index to map
to their appropriate buckets through directory IDs. The hash indices point to the directories,
and the k bits are taken from the directories' IDs and then mapped to the buckets. Each
bucket holds the value corresponding to the IDs converted in binary.
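
The mapping from a key to a directory ID can be sketched in Python as follows. This is a minimal illustration that assumes the hash value is the integer key itself; the names directory_id and directory are hypothetical, and bucket splitting is not shown.

```python
def directory_id(hash_value: int, global_depth: int) -> str:
    """Return the k least significant bits of hash_value as a bit string."""
    mask = (1 << global_depth) - 1           # k = 2 gives binary 11
    return format(hash_value & mask, f"0{global_depth}b")

# With global depth 2 there are four directory entries, each pointing to a bucket.
directory = {"00": [], "01": [], "10": [], "11": []}
for key in [4, 5, 6, 7, 9]:
    directory[directory_id(key, 2)].append(key)
```

Increasing the global depth by one doubles the number of directory entries, because one more bit of the hash value is taken into account.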


@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved


Indexing in Databases
Last Updated : 31 Jul, 2025

Indexing in DBMS is used to speed up data retrieval by minimizing disk scans. Instead of
searching through all rows, the DBMS uses index structures to quickly locate data using key
values.

When an index is created, it stores sorted key values and pointers to actual data rows. This
reduces the number of disk accesses, improving performance especially on large datasets.

Structure of Index in Database

Attributes of Indexing
Several important attributes of indexing affect the performance and efficiency of database
operations:

1. Access Types: This refers to the type of access such as value-based search, range
access, etc.
2. Access Time: It refers to the time needed to find a particular data element or set of
elements.
3. Insertion Time: It refers to the time taken to find the appropriate space and insert new
data.
4. Deletion Time: Time taken to find an item and delete it as well as update the index
structure.
5. Space Overhead: It refers to the additional space required by the index.

File Organization in Indexing

File organization refers to how data and indexes are physically stored in memory or on disk.
The following are the common types of file organizations used in indexing:

1. Sequential (Ordered) File Organization

In this type of organization, the indices are based on a sorted ordering of the values. This
is a traditional storage mechanism that generally gives fast access. Ordered or
sequential file organizations might store the index in a dense or sparse format.

i. Dense Index: Every search key value in the data file corresponds to an index record. This
method ensures that each key value has a reference to its data location.

Example: If a table contains multiple entries for the same key, a dense index ensures that
each key value has its own index record.

Dense Index

ii. Sparse Index: The index record appears only for a few items in the data file. Each item
points to a block as shown.

Access Method: To locate a record, we find the index record with the largest key value less
than or equal to the search key, and then follow the pointers sequentially.

Access Cost = $\log_2(n) + 1$, where n is the number of blocks involved in the index
file. A binary search over the index takes about $\log_2(n)$ block accesses, and one more
access fetches the data block.
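
The access method for a sparse index can be sketched in Python using the standard bisect module. The sample blocks, the sparse_index list and the lookup function are assumptions made for illustration; each inner list stands in for one sorted data block.

```python
import bisect

# Sorted data file split into blocks; each entry is (key, record).
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Sparse index: only the first key of each block gets an index record.
sparse_index = [block[0][0] for block in blocks]

def lookup(key):
    """Find the largest index key <= key, then scan that block sequentially."""
    pos = bisect.bisect_right(sparse_index, key) - 1  # binary search on the index
    if pos < 0:
        return None  # key is smaller than every indexed key
    for stored_key, record in blocks[pos]:
        if stored_key == key:
            return record
    return None
```

The index holds three entries for nine records, which is the space saving a sparse index offers in exchange for the short sequential scan inside the block.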

Sparse Index

2. Hash File Organization

Uses a hash function to map keys to buckets.

Offers fast access for exact-match queries.

Not suitable for range queries.

Types of Indexing Methods

There are different types of indexing techniques, each optimized for specific use cases.

1. Clustered Indexing

Clustered Indexing stores related records together in the same file, reducing search time
and improving performance, especially for join operations. Data is stored in sorted order
based on a key (often a non-primary key) to group similar records, like students by
semester. If the indexed column isn't unique, multiple columns can be combined to form a
unique key. This makes data retrieval faster by keeping related records close and allowing
quicker access through the index.

Clustered Indexing

2. Primary Indexing

This is a type of clustered indexing in which the data is sorted according to the search key
and the primary key of the database table is used to create the index. It is the default form
of indexing and induces sequential file organization. As primary keys are unique and
stored in sorted order, search operations are quite efficient.

Key Features: The data is stored in sequential order, making searches faster and more
efficient.

3. Non-clustered or Secondary Indexing

A non-clustered index just tells us where the data lies, i.e. it gives us a list of
pointers or references to the location where the data is actually stored. Data is not
physically stored in the order of the index. Instead, the leaf nodes of the index hold
pointers to the data.

Example: The contents page of a book. Each entry gives us the page number or location of
the information stored. The actual data here(information on each page of the book) is not
organized but we have an ordered reference(contents page) to where the data points
actually lie. We can have only dense ordering in the non-clustered index as sparse ordering
is not possible because data is not physically organized accordingly.

It requires more time than a clustered index because extra work is needed to fetch the
data by following the pointer. In the case of a clustered index, the data is stored
together with the index.

Non Clustered Indexing

4. Multilevel Indexing

As the size of the database grows, the indices also grow. The index is ideally kept in
main memory, but a single-level index might become too large to fit there, so searching it
requires multiple disk accesses. Multilevel indexing splits the index into several smaller
blocks so that the outermost level can be stored in a single block.

The outer blocks are divided into inner blocks which in turn are pointed to the data blocks.
This can be easily stored in the main memory with fewer overheads. This hierarchical
approach reduces memory overhead and speeds up query execution.
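
A two-level version of this idea can be sketched in Python. The data layout and the names data_blocks, inner_index, outer_index and find_block are assumptions made for illustration; a real DBMS would keep these structures in disk blocks rather than Python lists.

```python
import bisect

# Sorted data file: four data blocks holding two keys each.
data_blocks = [[1, 2], [3, 4], [5, 6], [7, 8]]

# Inner index blocks: (first key of a data block, data block number).
inner_index = [
    [(1, 0), (3, 1)],
    [(5, 2), (7, 3)],
]

# Outer index: first key of each inner block; small enough to stay in memory.
outer_index = [block[0][0] for block in inner_index]

def find_block(key):
    """Walk outer index -> inner index -> data block; return the block number."""
    outer_pos = bisect.bisect_right(outer_index, key) - 1
    if outer_pos < 0:
        return None  # key is smaller than every indexed key
    inner = inner_index[outer_pos]
    inner_keys = [first_key for first_key, _ in inner]
    inner_pos = bisect.bisect_right(inner_keys, key) - 1
    block_no = inner[inner_pos][1]
    return block_no if key in data_blocks[block_no] else None
```

Each lookup touches one outer block, one inner block and one data block, regardless of how many inner blocks exist.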

Multilevel Indexing

Advantages of Indexing
Faster Queries: Indexes allow quick search of rows matching specific values, speeding
up data retrieval.
Efficient Access: Reduces disk I/O by keeping frequently accessed data in memory.
Improved Sorting: Speeds up sorting by indexing the relevant columns.
Consistent Performance: Maintains query speed even as data grows.
Data Integrity: Ensures uniqueness in columns indexed as unique, preventing duplicate
entries.

Disadvantages of Indexing
While indexing offers many advantages, it also comes with certain trade-offs:

Increased Storage Space: Indexes require additional storage. Depending on the size of
the data, this can significantly increase the overall storage requirements.
Increased Maintenance Overhead: Indexes must be updated whenever data is
inserted, deleted, or modified, which can slow down these operations.
Slower Insert/Update Operations: Since indexes must be maintained and updated,
inserting or updating data takes longer than in a non-indexed database.
Complexity in Choosing the Right Index: Determining the appropriate indexing strategy
for a particular dataset can be challenging and requires an understanding of query
patterns and access behaviors.

Features of Indexing
Several key features define the indexing process in databases:

Efficient Data Structures: Indexes use efficient data structures like B-trees, B+ trees,
and hash tables to enable fast data retrieval.
Periodic Index Maintenance: Indexes need to be periodically maintained, especially
when the underlying data changes frequently. Maintenance tasks include updating,
rebuilding, or removing obsolete indexes.
Query Optimization: Indexes play a critical role in query optimization. The DBMS query
optimizer uses indexes to determine the most efficient execution plan for a query.
Handling Fragmentation: Index fragmentation can reduce the effectiveness of an index.
Regular defragmentation can help maintain optimal performance.



BLOCK III:
INTRODUCTION TO OBJECT ORIENTED,
DISTRIBUTED, MULTIMEDIA AND SPATIAL
DATABASES
Unit 1 : Object Oriented Database System
Unit 2 : Distributed Database
Unit 3 : Image and Multimedia Database
Unit 4 : Spatial Database

(7)
UNIT 1: OBJECT ORIENTED DATABASE SYSTEM

Unit Structure:
1.1 Introduction
1.2 Unit Objectives
1.3 Concepts of Object-Oriented Databases
1.3.1 Key Features of Object Databases
1.3.2 Drawbacks of Object Databases
1.3.3 Popular Object Databases
1.4 Standards, Languages and Design
1.4.1 Standards and Languages
1.4.2 Object Database and Relational Database Design
1.5 Object Relational Database Systems
1.5.1 Relational Database Management System (RDBMS)
1.5.2 History of Object Relational Database System
1.5.3 Object-Oriented Relational Database Management
System (OORDBMS)
1.5.4 Comparative Analysis of RDBMS and OORDBMS
1.6 Summing Up
1.7 Answers to Check Your Progress
1.8 Possible Questions
1.9 References and Suggested Readings

1.1 INTRODUCTION

In the earlier chapters, learners were acquainted with some advanced
forms of traditional database concepts. In this chapter, learners are
introduced to a new dimension, or extension, of existing traditional
database theory. In this area, all data are visualized as objects, in
addition to the relational form used by existing database management systems.
The data are stored, manipulated and accessed as objects, as is
done in object-oriented programming paradigms. The concept of an object
can be realized by defining a class with underlying characteristics
such as data abstraction, information hiding and encapsulation, and by
imposing on it other object-oriented features such as inheritance,
polymorphism, early binding and late binding. The object-oriented
database approach came into existence because of the acceptance of the
object-oriented programming approach among a wide range of users
worldwide. Some object databases that are widely accepted and appreciated
by the database community are mentioned in this unit. The standards
required in the design of object-oriented database systems, and the
associated query languages, are discussed in order to give a detailed
insight into them. The relational database system is the basis on which
the OORDBMS approach is evolving. The history of the object relational
database system is covered, followed by its detailed description. Finally,
the object-oriented relational database management approach is compared
with the classic relational database management approach.

1.2 UNIT OBJECTIVES

After going through this unit, you will be able to:

• Learn about the object-oriented database system.
• Learn the need for the object-oriented approach in databases.
• Learn the advantages and disadvantages of object-oriented
databases.
• Become familiar with some popular object databases.
• Learn the standards, languages and design issues of object
based databases.
• Understand the history of the object-oriented relational database
system.
• Understand how an object-oriented relational database system
works.
• Compare RDBMS and OORDBMS.

1.3 CONCEPTS OF OBJECT-ORIENTED DATABASES

A database is generally considered as data storage. This storage is
further used for searching, editing or updating,
generating reports etc. Data storage can be classified into four
widely recognized categories, viz., Traditional File System, Relational
DBMS, Object Oriented DBMS and Object-Oriented Relational DBMS.
These categories are distinguished by the pattern of the data they hold.
The object-oriented paradigm is the basis upon which the object-oriented
database is designed. The data or information in an object-oriented
database is represented and stored in the form of
objects. The object-oriented database is also known as an object database
and is handled through object-oriented database management systems
(OODBMS).

Fig-1.1: Conceptual block diagram for OODBMS

The OODBMS combines conventional DBMS features with object-oriented
features. The conventional DBMS features include data integrity,
persistence, concurrency, security, backup, recovery, query processing
etc., while the object-oriented features include encapsulation, class,
object, overloading, overriding, inheritance, early binding, late
binding etc. Some of the popularly known object-oriented programming
(OOP) languages are C++, Java, Perl, Ruby, Python and JavaScript.
Object-oriented databases are administered through object database
management systems (ODBMS). The idea of object-oriented databases
emerged in the 1980s, and it has since become
common for various OOP based languages, such as C++, Java,
Smalltalk and LISP. For example, Smalltalk is used in GemStone, LISP
is used in Gbase, COP is used in Vbase and so on.
Objects are composed of data members and member functions or
methods, which are encapsulated within a single unit with individual
values and certain properties. Objects come into existence by
instantiation of user-defined classes. Objects generally go
through a cycle that includes the creation or allocation of objects, use of
the objects and the deletion or de-allocation of objects. Object databases
are common in many modern high-performance applications that need
high-speed data access and manipulation. Some of the
significant areas where object databases play a pivotal role are
real-time systems, architectural engineering for 3-D modeling,
telecommunications, robotics, molecular science, astronomy and many
more.

Fig-1.2: Object Oriented Model in OODBMS

1.3.1 Key Features of Object Databases

• Object-oriented databases, or object-oriented DBMS systems,
provide persistent storage for objects.
• They can store and read data and convert it into program objects,
so that object-based data can be loaded directly into memory.
• Object databases bring permanent persistence to objects.
• The reading and mapping of the data of an object database to
objects is direct, without any API-like tool.
Object databases facilitate quick access to data or information and
better performance. Some object databases also support multiple
programming languages. For example, GemStone is an
object database that supports the C++, Smalltalk and Java programming
languages.

1.3.2 Drawbacks of Object Databases

• Object databases are still not as popular among the wider community of
database users as RDBMS.
• Fewer developers are experienced in handling object databases.
• Not many programming languages support object databases.
• RDBMS have SQL as a standard query language. Object
databases do not have an equally widely adopted standard.
• Object databases are difficult to learn for non-programmers.

1.3.3 Popular Object Databases

Following are some of the popular object databases. These databases are
widely accepted by database users because of their highly flexible features,
which conform to the needs of current users. A few such
databases are described below.

Cache
Cache is a high-performing object database developed by InterSystems.
It provides a set of services that includes data storage, concurrency
management, transaction handling and process management. The Cache
engine can be treated as a full-fledged database toolkit with extensive
relational database features. The database can be queried and modified
using standard SQL via ODBC or JDBC, or through object-based methods.
Cache offers high computational efficiency and is a reliable, highly
scalable database. Some of the important features of the Cache
database are mentioned below.
• Able to model data as objects, eliminating the mismatch
between databases and object-oriented applications.
• Supports user-defined data types.
• Able to take advantage of features such as methods and
inheritance.
• Object extensions for SQL to handle object identity and
relationships.
• Able to combine SQL and object-based access within a
single application.
• Clustering is used to store data, ensuring maximum
performance.
ConceptBase
ConceptBase is a multi-user, deductive, object-oriented database
system. It is a powerful tool for
meta-modeling and is very useful for customizing modeling languages.
ConceptBase comes with an associated graphical user interface (GUI)
that provides users with some common routines. ConceptBase is
developed by the ConceptBase Team at the University of Skövde (HIS)
and the University of Aachen (RWTH). Commonly available operating
systems like Linux, Windows and Mac support ConceptBase. There is
also a pre-configured virtual application available with ConceptBase, which
contains the associated executable files and source files along with the
tools for compiling. The system is distributed under a FreeBSD-style
license.
ObjectDB
ObjectDB is a powerful object-oriented database management system
(ODBMS) based on the Java language. It is a compact but reliable system,
which is easy to use and fast in terms of object database
access. It supports both the client-server mode and the embedded mode.
ObjectDB provides all the standard database management services,
which makes the development process easier and the
applications faster. It is capable of handling advanced
queries and providing enhanced indexing facilities. It is
effective in multi-user environments with heavy concurrent
use. ObjectDB can easily be embedded in applications
irrespective of their size and type. It has
been tested with Tomcat, Jetty, GlassFish, JBoss and Spring.
Several other popular object based databases are ObjectDatabase++,
GemStone/S, Perst, ZODB, Wakanda, ODABA and Objectivity/DB. The
discussion of these object databases is beyond the scope of this
syllabus. Learners can use various internet sources to gather
detailed knowledge on these object based databases.

1.4 STANDARDS, LANGUAGES AND DESIGN

There should always be a standard agreed upon by all vendors of a
particular type of database system. A standard can be likened to an
agreed roadmap that maintains uniformity among all stakeholders so that
they proceed through a common model.

1.4.1 Standards and Languages

Some of the sound reasons for the need of standards are as follows.
• Standards provide support in maintaining the portability of
database applications. Portability is defined as the capability to
execute particular software or an application on different platforms
with minimal modifications.
• Standards help in achieving interoperability. Interoperability
refers to the ability of an application to access multiple systems.
Here, the same application program may access some data stored
under one ODBMS package, and other data stored under another
source or package.
• Standards allow customers to compare commercial products of
various vendors more easily by determining which parts of the
standards are applied in their purchased product.
ODMG (Object Data Management Group) is an association concerned with
object-oriented database management activities. This
association proposed a standard for ODBMS in 1993, named
ODMG 1.0, followed by ODMG 2.0 in 1997 and ODMG 3.0
in 2000.
The ODMG 3.0 standard has the following major specifications:
• Object Model
• Object Definition Language (ODL)
• Object Query Language (OQL)
• C++ Language Binding
• Smalltalk Language Binding
• Java Language Binding

Object Model

The object model specifies the ODMS based constructs. The basic
building blocks of the object model are – objects and literals.
An object is referred to as the instance of its class type. The state of an
object is composed of the values that the object carries for a certain set
of properties. On the other hand, the behavior of an object is defined by
the set of operations executed by the objects.

Fig-1.3: Hierarchy of classes/objects in OODBMS

An object is described with some associated parameters, i.e. identifier,
name, lifetime and structure. The details of these parameters are
mentioned below.
Object Identifier: An object can be differentiated from all other nearby
objects within its storage domain by using the object identifier. The
objects always preserve the same object identifier in its lifetime during
execution of a computer program. Thus, the value of an object’s
identifier never changes. The object remains the same, even if its
attributes or the relationship values change. Object identifiers are
generated by the OODBMS, but not by the other applications.
Object Name: In addition to the object identifier, the OODBMS may
assign one or more names to the objects that are meaningful for the
programmers or the end users. The system can refer to an object by its
object name. It applies certain mapping functions to determine the
object identifiers and locate the desired object.
Object Lifetime: The lifetime of an object determines the extent of
memory or storage time allowed to the object. The object model
supports two kinds of lifetime: transient and persistent. A transient
object is allocated memory that is managed by the program's runtime
system; when the process terminates, the memory is de-allocated. A
persistent object, in contrast, is allocated storage that is managed
by the OODBMS runtime system. Such an object continues to exist after
the process started by the application program has terminated, so it
has a long lifetime compared to a transient object.
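The difference between the two lifetimes can be sketched as follows. This is only an analogy: a pickle file in a temporary directory stands in for the storage managed by the OODBMS runtime, which is an assumption made for illustration.

```python
import os
import pickle
import tempfile

# Transient object: lives only in this process's memory.
transient = {"name": "scratch", "value": 1}

# Persistent object: written to storage so that it can outlive the process.
# The pickle file is a stand-in for OODBMS-managed storage (an assumption).
path = os.path.join(tempfile.mkdtemp(), "store.pkl")
with open(path, "wb") as fh:
    pickle.dump({"name": "kept", "value": 2}, fh)

# Simulate the end of the process: the in-memory reference is dropped.
del transient

# A later run can still load the persistent object from storage.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```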
Object Structure: The structure of an object can be either atomic or
non-atomic (if the object is composed of other objects). The atomic
object referred here is user-defined in nature. There is no built-in atomic
object type included in OODBMS object models.
Some other important definitions useful for the demonstration of an
object model are stated in the following section. The terms used here are
class, interface, struct, literals and various literal types.
 A class defines both the abstract state and the abstract behaviour of
the object.
 The interface defines only the abstract behavior of some objects.
 The struct defines the abstract state of some literals.
 A literal has no identifier and cannot stand alone. Literals are
embedded in objects and cannot be individually referenced. Objects
and literals are both categorized by their types, and all elements
of a given type share a common range of states and behaviors. A
literal definition, however, defines only the abstract state of a
literal type. The value of a literal does not change. A few
examples of literal values are 67, 17.161576, ‘P’, ‘Q’, “GUIDOL” and
“August-15, 2021”. These are constant numbers, characters and
strings.

 In addition to struct definitions and the primitive literal datatypes
(boolean, char, short, long, float, double, string), object definition
languages provide declarations for user-defined collection, union and
enumeration literal types.
 The object model supports three kinds of literals: atomic,
collection and structured literals.
o Atomic literals correspond to values of the basic data types.
Numbers, characters and Boolean values are examples of atomic
literals.
o Collection literals are values of the collection types found in
the ODMG object model: set<t>, bag<t>, list<t>, array<t> and
dictionary<t, v>, where t is the type of the objects or values
in the collection (the key type for a dictionary) and v is the
value type of a dictionary.
o Structured literals correspond to values built with the tuple
constructor. They include the built-in structures date, time,
interval and timestamp as well as any user-defined structures.
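The three kinds of literals can be pictured with ordinary Python values. The mapping from ODMG literal types to Python types below is an assumption chosen for illustration, and the `Address` structure and its sample values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

# Atomic literals: plain values of basic types.
atomic_examples = (67, 17.161576, "P", True)

# Collection literals: rough Python analogues of the ODMG collection types.
roll_set = frozenset({2020001, 2020002})   # set<long>: no duplicates
grade_bag = Counter(["A", "B", "A"])       # bag<string>: duplicates are counted
name_list = ["A", "D", "B"]                # list<string>: ordered
branch_of = {2020001: "MSc. IT"}           # dictionary<long, string>


# Structured literal: a value built by a tuple constructor. frozen=True makes
# it immutable, and equality is by value because a literal has no identifier.
@dataclass(frozen=True)
class Address:
    city: str
    pin: str


a1 = Address("Guwahati", "781014")
a2 = Address("Guwahati", "781014")
enrolled_on = date(2021, 8, 15)            # date is a built-in structured literal
```

Two `Address` values with the same fields compare equal even though they are separate Python instances, which mirrors the idea that literals are identified by their value alone.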
1.4.1.2 Object Definition Language (ODL)
ODL is a language for specifying the structure of databases in
object-oriented terms. ODL is an extension of the Interface Definition
Language (IDL), which is a component of CORBA (Common Object Request
Broker Architecture). CORBA is a standard for distributed,
object-oriented computing, which will be discussed in later chapters.
ODL is a specification (design) language used to define object types
that conform to the ODMG object model. It plays a role similar to that
of the E/R diagram in relational database design. ODL is independent of
any programming language and is not used for database manipulation.
1.4.1.3 Object Query Language (OQL)
OQL is the query language defined by the Object Data Management Group
(ODMG) for object-oriented databases. OQL works closely with
programming languages such as C++: OQL statements embedded in a host
language return objects that are compatible with that language and can
be processed further. The syntax of OQL is similar to that of SQL, with
additional features for handling objects. The language is designed to
operate on databases described through ODL. Unlike SQL, which produces
tables of rows, OQL produces collections (sets, bags, lists) of
objects. OQL therefore fits naturally into object-oriented host
languages: the returned objects are assigned to variables of the host
program, which then uses them for further processing.
1.4.1.4 C++ Language Binding
The binding of ODMG implementations to C++ is intended to support
writing portable C++ code that manipulates persistent objects. The C++
constructs used for this manipulation form the Object Manipulation
Language (OML). The C++ language binding includes a version of ODL that
uses C++ syntax, an interface for invoking OQL, and procedures for
operations on databases and transactions.
1.4.1.5 Smalltalk Language Binding
The binding of ODMG implementations to Smalltalk is defined in terms of
a mapping between ODL and Smalltalk. The Smalltalk binding also
includes a mechanism to invoke OQL and procedures for operations on
databases and transactions.

1.4.1.6 Java Language Binding

This part of the standard defines the binding between the ODMG Object
Model (ODL and OML) and the Java programming language. The Java
language binding includes a mechanism for invoking OQL and procedures
for operations on databases and transactions.

1.4.2 Object Database and Relational Database Design

One of the main differences between object database (ODB) design and
relational database (RDB) design lies in how relationships are
handled.
In relational database design, relationships among tuples (records)
are specified by attributes with matching values. These can be termed
value references and are specified through foreign keys, which are
values of primary key attributes repeated in tuples of the referencing
relation or table. These values are limited to being atomic
(single-valued) in each record.
In object database design, relationships are handled by reference
attributes that hold the object identifiers (OIDs) of the related
objects. Both single references and collections of references are
allowed. Another notable difference between ODB and RDB design is how
inheritance is handled. In ODB design, inheritance structures are built
into the model, so the mapping is achieved directly with the inheritance
constructs derived (:) and extends. In relational design there are
several options to choose from, because the classic relational model
has no built-in construct for inheritance.
A third difference concerns operations. In ODB design, the operations
must be specified early in the design because they are part of the
class specifications. Specifying operations during the design phase is
important for all types of databases, but in RDB design it may be
delayed, because it is not mandatory until the implementation phase.
This reflects a practical difference between the relational model and
the object model in terms of behavioral specification: relational data
models do not compel database designers to define a set of valid
behaviors or operations, whereas this is an implicit requirement in
object models.
1.4.2.1 Mapping of an Enhanced Entity Relationship (EER) Schema into
an Object Database (ODB) Schema
Mapping an EER schema to an ODB schema is relatively simple, because
ODB schemas provide built-in support for inheritance. Once the mapping
is complete, operations must be added to the ODB schema, because EER
schemas, unlike ODB schemas, do not include operations. The mapping of
an EER schema into an ODB schema can be carried out in the following
steps.
Step - 1
 Create an ODL class for each EER entity type or subclass.
 Declare multi-valued attributes using sets, bags or lists.
 Map composite attributes using the tuple constructor (struct).
Step - 2
 Add reference attributes for each binary relationship into the
ODL classes that participate in the relationship.
 Make the reference single-valued for the 1:1 and N:1 directions
and set-valued for the 1:N and M:N directions.
 Create relationship attributes, if any, through the use of tuple
constructors.
Step - 3
 Include the operations corresponding to each class.
 The EER schema does not provide these operations; they must be
added to the database design by referring to the original
requirements.
 The associated constructor and destructor operations must also
be included.
Step - 4
 Specify inheritance relationships via the extends clause.
 An ODL class that corresponds to a subclass in the EER schema
inherits the type and methods of its superclass in the ODL
schema.
 Its non-inherited attributes, relationship references and
operations are specified as described in the earlier steps.
Step - 5
 Weak entity types can be mapped in the same way as regular
entity types.
 A weak entity type that does not participate in any relationship
other than its identifying relationship may alternatively be
represented as a composite multi-valued attribute of the owner
entity type.
 In that case, the attributes of the weak entity are included in
the struct <... > construct.
Step - 6
 Map categories (union types) to ODL.
 The mapping may follow the one used for EER-to-relational
mapping.
 Declare a class to represent the category.
 Define a 1:1 relationship between the category and each of its
superclasses.
Step - 7
 Map n-ary relationships, that is, relationships whose degree is
greater than 2.
 Map each such relationship into a separate class with
appropriate references to each participating class.
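Steps 1 to 4 can be sketched with plain Python classes standing in for ODL classes. This is an analogy rather than ODL itself; the entity types `Person`, `Student` and `Course`, their attributes and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(eq=False)
class Course:                                   # Step 1: one class per entity type
    title: str
    students: List["Student"] = field(default_factory=list)  # Step 2: set-valued side


@dataclass(eq=False)
class Person:
    name: Tuple[str, str]                       # composite attribute -> tuple (first, last)
    phones: Set[str] = field(default_factory=set)  # multi-valued attribute -> set

    def full_name(self) -> str:                 # Step 3: operation added from requirements
        return " ".join(self.name)


@dataclass(eq=False)
class Student(Person):                          # Step 4: subclass inherits ("extends")
    roll_no: int = 0
    courses: List[Course] = field(default_factory=list)  # M:N -> collection of references

    def enroll(self, course: Course) -> None:
        # Keep both directions of the relationship consistent.
        self.courses.append(course)
        course.students.append(self)


s = Student(name=("Asha", "Das"), phones={"555-0100"}, roll_no=2020001)
c = Course("DBMS")
s.enroll(c)
```

`eq=False` keeps comparison by identity, which is closer to how objects with OIDs behave than comparison by attribute values. The relationship is navigated in both directions through references, with no key matching involved.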

1.5 OBJECT RELATIONAL DATABASE SYSTEMS

Object–relational database systems are commonly termed object–relational
database management systems (abbreviated OORDBMS in this unit). An
OORDBMS is an object-oriented extension of the traditional relational
database management system (RDBMS). It is a hybrid approach that can
handle both the object-oriented and the relational aspects of a DBMS,
which fits current industry requirements well.

1.5.1 Relational Database Management System (RDBMS)

An RDBMS is simply the relational version of a traditional DBMS. It
builds its operation on notions such as relations (tables), attributes
(columns), integrity and security. An RDBMS works with a number of
tables to store, edit, update and delete data, taking into account
normal forms such as 1NF, 2NF, 3NF and BCNF. Standard SQL statements
are used to operate on an RDBMS. Commonly used RDBMSs include Oracle,
Microsoft SQL Server and MySQL. Although these products address most
of the needs of a common database user, object-oriented aspects could
not be incorporated in the classic relational design. This is why the
object-oriented relational database management system has become
important. The following sections elaborate on these aspects and then
state the differences between RDBMS and OORDBMS.

1.5.2 History of Object Relational Database System

The object–relational database system, or OORDBMS, came to light in
the early 1990s. It emerged from research that extended relational
database concepts with the concept of objects. The researchers aimed
to retain a declarative query language based on predicate calculus as
a central component of the OORDBMS. The most notable research project
of this period, Postgres (UC Berkeley), gave rise to two products:
Illustra and PostgreSQL. In the mid-1990s, early commercial products
appeared. These included Illustra, Omniscience and UniSQL, which
eventually came to IBM, Oracle and KCOMS respectively through
acquisitions. The Ukrainian developer Ruslan Zasukhin, founder of
Paradigma Software, Inc., developed and released the first version of
the Valentina database in the mid-1990s as a C++ SDK. About a decade
later, PostgreSQL had become a commercially viable database and the
basis for several current products that incorporate OORDBMS features.
Experts in the field began to refer to these products as
object-oriented relational database management systems, or OORDBMS.
Many of the ideas of the early object–relational efforts have largely
been incorporated into SQL:1999 through structured types. For example,
IBM's DB2, Oracle Database and Microsoft SQL Server claim to support
most OORDBMS requirements and do so with varying degrees of success.
SQL statements in an RDBMS are written like this:
CREATE TABLE Customers (
    Id        CHAR(10)    NOT NULL PRIMARY KEY,
    Surname   VARCHAR(30) NOT NULL,
    FirstName VARCHAR(30) NOT NULL,
    DOB       DATE        NOT NULL  -- DOB: date of birth
);

SELECT InitCap(Surname) || ', ' || InitCap(FirstName)
FROM Customers
WHERE Month(DOB) = Month(getdate())
AND Day(DOB) = Day(getdate());

Most SQL databases also allow user-defined functions, which make
queries of the following kind possible:
SELECT Formal(Id)
FROM Customers
WHERE Birthday(DOB) = Today();
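The idea of calling user-defined functions from SQL can be tried out with Python's built-in sqlite3 module, as in the sketch below. Several things here are assumptions made for illustration: SQLite is used as the database, the sample rows are invented, the date is fixed so that the result is repeatable, and `Formal` takes the two name columns rather than `Id` as in the query above.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers ("
    " Id CHAR(10) NOT NULL PRIMARY KEY,"
    " Surname VARCHAR(30) NOT NULL,"
    " FirstName VARCHAR(30) NOT NULL,"
    " DOB DATE NOT NULL)"
)
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?, ?)",
    [("C1", "das", "asha", "1990-08-15"), ("C2", "roy", "bina", "1985-01-02")],
)

TODAY = date(2021, 8, 15)  # fixed date so the example is repeatable


def birthday(dob_text):
    # Return the month and day of an ISO date string as "MM-DD".
    return dob_text[5:10]


def formal(surname, first_name):
    # Format a name as "Surname, Firstname" with initial capitals.
    return f"{surname.capitalize()}, {first_name.capitalize()}"


# Register the Python functions so that SQL statements can call them by name.
conn.create_function("Birthday", 1, birthday)
conn.create_function("Formal", 2, formal)

rows = conn.execute(
    "SELECT Formal(Surname, FirstName) FROM Customers WHERE Birthday(DOB) = ?",
    (TODAY.strftime("%m-%d"),),
).fetchall()
```

Only the customer whose month and day of birth match the fixed date is returned, formatted by the registered `Formal` function.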

In an OORDBMS, user-defined data types and expressions such as
BirthDay() can be used as shown below:
CREATE TABLE Customers (
    Id   Cust_Id    NOT NULL PRIMARY KEY,
    Name PersonName NOT NULL,
    DOB  DATE       NOT NULL
);
SELECT Formal( C.Id )
FROM Customers C
WHERE BirthDay( C.DOB ) = TODAY;

Object–relational models offer another useful capability: the
database can use the relationships between data to fetch related
records easily. For example, in an address book application, an
additional table is added to the existing ones to hold the addresses
of customers. In a traditional RDBMS, collecting information about
both a customer and the customer's address requires a "join", as shown
below:
SELECT InitCap(C.Surname) || ', ' || InitCap(C.FirstName), A.city
FROM Customers C JOIN Addresses A ON A.Cust_Id = C.Id
WHERE A.city = 'New York';

In an object–relational database, the same query can be written more
simply, as shown below:
SELECT Formal( C.Name )
FROM Customers C
WHERE C.Address.city = 'New York';
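The contrast between the two query styles can be sketched in Python. The first half runs an explicit join in SQLite; the second half reaches the address by following an object reference, which is roughly what the object–relational path expression does. The table layout follows the text, while the sample rows and the `Customer` and `Address` classes are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customers (Id TEXT PRIMARY KEY, Surname TEXT, FirstName TEXT);
    CREATE TABLE Addresses (Cust_Id TEXT REFERENCES Customers(Id), city TEXT);
    INSERT INTO Customers VALUES ('C1', 'Das', 'Asha'), ('C2', 'Roy', 'Bina');
    INSERT INTO Addresses VALUES ('C1', 'New York'), ('C2', 'Boston');
    """
)

# Relational form: the link between the tables is spelled out as a join.
joined = conn.execute(
    "SELECT C.Surname || ', ' || C.FirstName, A.city "
    "FROM Customers C JOIN Addresses A ON A.Cust_Id = C.Id "
    "WHERE A.city = 'New York'"
).fetchall()


# Object-style form: the address is reached by following a reference.
class Address:
    def __init__(self, city):
        self.city = city


class Customer:
    def __init__(self, surname, first_name, address):
        self.surname = surname
        self.first_name = first_name
        self.address = address


customers = [
    Customer("Das", "Asha", Address("New York")),
    Customer("Roy", "Bina", Address("Boston")),
]
navigated = [
    (f"{c.surname}, {c.first_name}", c.address.city)
    for c in customers
    if c.address.city == "New York"
]
```

Both forms select the same customer; the difference is whether the relationship is expressed by matching key values or by navigating a reference.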

1.5.3 Object-Oriented Relational Database Management System (OORDBMS)

An object-relational database is managed by a relational database
management system that is combined with an object-oriented database
model, in which data and data models can be treated as objects.
Properties such as data abstraction, data hiding, early and late
binding, polymorphism and inheritance are directly supported in the
database schemas, and the associated query languages support
object-based data access. Oracle is one of the popular RDBMSs that
meets industry standards. Object-relational database systems are an
attempt to merge two dissimilar trends. They can be viewed as an
extension of the relational model with object database features,
which results in a hybrid design. One of the most visible signs of
this is the addition of object database features to revisions of the
SQL standard. The difficult part of the relational model emerges when
one tries to describe complex objects.
The object-oriented relational approach gained importance with the
introduction of type constructors for describing complex objects: the
row type and the array type, complemented by other collection types
such as sets and lists. Mechanisms for specifying object identity,
encapsulation and inheritance have also helped the OORDBMS to gain
importance. Note that the core technology of an OORDBMS is still based
on the relational model. Commercial products (e.g., Microsoft SQL
Server) have added a layer of object-oriented principles on top of the
relational database management system. Translating object-oriented
concepts into relational ones is one of the challenging tasks for a
typical OORDBMS. This problem is typically addressed by an
intermediate object-oriented layer that handles the communication
between the object-oriented application and the underlying relational
database.
The relational and object-oriented approaches differ considerably in
their underlying principles. The object-relational model therefore
seeks a compromise between the two and adopts intermediate measures
for the convenience of developers. One important aim is to allow
objects to be stored and retrieved in the way an RDBMS stores and
retrieves data. This gives query languages considerable freedom to
work with object-oriented principles. Common implementations include
Oracle Database, PostgreSQL and Microsoft SQL Server. IBM DB2 also
supports objects and can be considered an OORDBMS.
In an OORDBMS, the approach is essentially that of relational
databases: the data resides in the database and is manipulated
collectively with queries written in a query language. In an OODBMS,
by contrast, the database is essentially a persistent object store for
software written in an object-oriented programming language, and a
programming API is responsible for storing and retrieving objects. In
that case there is little or no specific support for query languages.
The basic need for object–relational databases arises from the fact
that relational and object databases each have their own advantages
and drawbacks. Although object-oriented databases allow sets, lists,
arbitrary user-defined data types and nested objects, they do not
provide a mathematical foundation for in-depth analysis. The basic
goal of the object–relational database is to bridge the gap between
relational databases and the object-oriented modeling techniques used
in programming languages such as C++, C#, Java and Visual Basic .NET.
Further, an OORDBMS allows software developers to integrate their own
data types, and the methods that apply to them, into the DBMS. Some of
the leading characteristics of an OORDBMS are complex data, type
inheritance and object behavior.
Complex data creation is based on a basic schema definition through
user-defined types. When structured complex data is organized in a
hierarchy, it offers an additional property called type inheritance:
a structured type can have subtypes that reuse all of its attributes
and contain additional attributes specific to the subtype. Finally,
object behavior is related to access to program objects. Such program
objects must be storable and transportable for database processing,
which is why they are usually called persistent objects. Inside a
database, all relations with a persistent program object are
relations with its object identifier. All of these points can be
addressed in a proper relational system, although the SQL standard and
its implementations impose arbitrary restrictions and some additional
complexity. In particular, extending the data model with custom data
types and methods is possible in a properly designed relational
system.

1.5.4 Comparative Analysis of RDBMS and OORDBMS

A comparison of a typical RDBMS with an OORDBMS helps in understanding
the changes made in the OORDBMS.

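Table 1.1: Comparative Analysis of RDBMS and OORDBMS

RDBMS                                 | OORDBMS
--------------------------------------+--------------------------------------
Ensures only the data independence    | Ensures data independence as well as
part                                  | data encapsulation and abstraction
                                      | of data
--------------------------------------+--------------------------------------
Data can only be recognized without   | Data as well as class/object can be
affecting the mode of using it        | recognized without affecting the
                                      | mode of using it
--------------------------------------+--------------------------------------
Stores only the data                  | Stores not only the data but also
                                      | the methods imposed on that data
--------------------------------------+--------------------------------------
Data can be partitioned depending     | The data can be used in direct
upon the user’s requirements and      | access mode and also through the
specific user’s applications          | class/object methods and sometimes
                                      | the entire data can be made public
                                      | using specific access controls
--------------------------------------+--------------------------------------
Users can perceive data as columns,   | Apart from handling complex
rows or tuples (records) and tables   | structure of data this system can
                                      | handle relational data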
Thus, an OORDBMS can be understood as a DBMS with both relational and
object-oriented capabilities. Because the two underlying approaches
differ in function, an OORDBMS combines them through a compromise in
which they work hand in hand.

CHECK YOUR PROGRESS
Multiple choice questions:
1. Object databases are based on
i) Relational approach
ii) Object based approach
iii) Both (i) and (ii)
iv) None of these
2. The term attribute refers to a
i) Record
ii) Row
iii) Column
iv) Key
3. Which of the following can be defined using ODL?
i) Structure
ii) Attribute
iii) Operation
iv) All of above
4. Which of the following belongs to an atomic literal?
i) String
ii) Boolean
iii) Long
iv) All of above
5. Which among the following is/are not Object Based Database(s)?
i) Cache
ii) Foxpro
iii) Wakanda
iv) Both (i) and (iii)
State whether True or False:
6. A single programming paradigm acts behind a single programming
language.
7. A class is an instance of an object in OOP.
8. ODMG looks after the object models in an OODBMS.
9. MS SQL Server does not support OOP principles in any of its
versions.
10. OORDBMS works in the principles of OOPs as well as the
relational models.

1.6 SUMMING UP

 A DBMS is a software platform in which data are stored, managed
and retrieved as per the user’s requirements through standard
query languages such as SQL.
 An RDBMS is a DBMS that likewise stores, manages and retrieves
data when needed by the users, but it has the additional
capability of maintaining relational tables of data.
 ODMG stands for Object Data Management Group. It is a
consortium responsible for monitoring object-oriented database
management activities.
 An OODBMS is a DBMS in which data are stored, managed and
retrieved as objects. Object-oriented features such as
inheritance and polymorphism are applicable here.
 An OORDBMS is an RDBMS with an object-oriented extension. It is
capable of implementing object-oriented features such as classes
and objects, inheritance and polymorphism in addition to the
classic RDBMS features.
 ODL is a language that specifies the structure of databases in
object-oriented terms.
 OQL is the query language defined by the Object Data Management
Group (ODMG) for object-oriented databases.

1.7 ANSWERS TO CHECK YOUR PROGRESS

Multiple choice questions:
1. (ii) 2. (iii) 3. (iv) 4. (iv) 5. (ii)
State whether True or False:
6. False 7. False 8. True 9. False 10. True

1.8 POSSIBLE QUESTIONS
Short answer type questions:
1) What is the difference between an object and a literal in the object
oriented data model (OODM)?
2) What is an object? What is an object model with reference to
ODMG standards?
3) What are the main differences between designing a relational
database and an object database?
4) Differentiate between:
i) Interface and Class
ii) Atomic object and Collection object
iii) Object identifier and Object lifetime
iv) Persistent object and Transient object
5) What is the significance of ODL in OODBMS?
Long answer type questions:
1) Explain the major specifications mentioned in ODMG 3.0 standard.
2) Describe the differences and similarities between objects and
literals in the ODMG object model.
3) Describe the steps involved in mapping the EER schema into ODB
schema.
4) Explain in detail the OORDBMS concept with the introduction to
all its organizing components.
5) Describe in detail the differences between RDBMS and
OORDBMS.

UNIT 2: DISTRIBUTED DATABASE

Unit Structure:
2.1 Introduction
2.2 Unit Objectives
2.3 Distributed Database
2.4 Data Fragmentation
2.5 Data Replication and Allocation Technique
2.6 Types of Distributed Database System
2.7 Query Processing in Distributed Database
2.8 Concurrency and Recovery in Distributed Database
2.9 Summing Up
2.10 Answers to Check Your Progress
2.11 Possible Questions
2.12 References and Suggested Readings

2.1 INTRODUCTION
A database is a collection of structured information. A distributed
database is one whose files are stored at different sites or systems.
This unit gives an overview of the Distributed Database Management
System (DDBMS) and shows the uses of distributed databases. Data
fragmentation, replication and allocation are very important in a
distributed database; they are explained in detail in this unit with
an example of a distributed database. The types of distributed
database are also explained with examples. Query processing and data
recovery in a distributed database are illustrated using an example
database.

2.2 UNIT OBJECTIVES

After going through this unit, you will be able to know
i) About the distributed database
ii) About the types of distributed database
iii) About data fragmentation, replication and allocation in
a distributed system
iv) About query processing in a distributed database system
v) About concurrency and recovery in a distributed
database

2.3 DISTRIBUTED DATABASE

A database is a collection of structured information. A distributed
database is one whose files are stored on different computer systems
or sites, which are connected through a communication network. The
application service layer of the distributed system provides services
to the user. Users are unaware of the distributed storage structure of
the system; to them it appears that a single database provides the
services. Data is distributed among the sites through the
communication network, and it is controlled by the Distributed
Database Management System (DDBMS).

Fig. 2.1. Architecture of DDBMS

In Fig. 2.1, four different systems are interconnected through a
communication network. This type of system is known as a distributed
system, also called a loosely coupled system. Because the data is
distributed among the systems, the database of such a system is known
as a distributed database. Every DDBMS has the following features.
I. The databases of a DDBMS are logically interrelated and are
connected through a communication network. Often, the
DDBMS presents them to the user as a single database.
II. Data is physically stored at multiple sites, and the data at
each site is managed by a local DBMS that is independent
of the other sites.
III. A distributed database is not a loosely connected file system.
IV. A distributed database incorporates transaction processing.
V. The DDBMS synchronizes the distributed database periodically
and makes the distribution transparent to the users.
Every distributed database is built with certain goals, which are as
follows.
i) Reliability: In a DDBMS, if one of the systems fails, the
other systems continue to provide the service to the user.
Another system can complete the task of the failed system.
ii) Availability: In a DDBMS, multiple sites or systems are
available. If one site fails, other sites can provide the
service, which maintains the availability of the system as
a whole.
iii) Performance: The performance of a DDBMS is achieved by
distributing data over sites in different locations, so that
the data is available to every location and is kept
reachable through the communication channel.
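The reliability and availability goals can be sketched with a toy failover read. The `Site` class and the `read_with_failover` function are hypothetical names used only for illustration, and the sketch assumes that every site holds a full copy of the data.

```python
class Site:
    """One node of a toy distributed database (names are illustrative)."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.up = True

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"site {self.name} is down")
        return self.data[key]


def read_with_failover(sites, key):
    # Try each site in turn; the first reachable site answers the request.
    for site in sites:
        try:
            return site.name, site.read(key)
        except ConnectionError:
            continue
    raise ConnectionError("no site available")


replica = {2020001: "MSc. IT"}
sites = [Site("A", replica), Site("B", replica)]
sites[0].up = False                      # simulate the failure of site A
served_by, value = read_with_failover(sites, 2020001)
```

The user's request is still answered, this time by site B, which is the behaviour the reliability and availability goals describe.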

CHECK YOUR PROGRESS-I

1. What do you mean by distributed database?
2. What is reliability in DDB?
3. What are the goals of a distributed system?

2.4 DATA FRAGMENTATION IN DDB

Fragmentation is the process of dividing a database into smaller
parts. In a distributed database, a table (relation) is divided into
subtables or sub-relations so that each of them can be stored at a
different site of the distributed system. These subtables or
sub-relations are the logical units of the DDBMS. The fragmentation is
done in such a way that combining the subunits gives back the original
distributed database. Suppose you have a table T in your distributed
system and it is fragmented into subtables t1, t2, t3, ..., tn. These
fragments must contain sufficient information to restore the original
table by combining t1, t2, t3, ..., tn with the UNION operation (for
row-wise fragments) or the JOIN operation (for column-wise fragments).
The subtables are known as fragments and the process is known as data
fragmentation in DDBMS. The fragments are independent of each other,
and users need not be concerned with how the data is fragmented. This
is known as fragmentation transparency.
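Row-wise fragmentation and its reconstruction by UNION can be sketched as follows. The table T and its rows are illustrative, and the helper `horizontal_fragments` is a hypothetical name.

```python
# Table T as a list of (roll_no, name, branch) rows; the data is illustrative.
T = [
    (2020001, "A", "MSc. IT"),
    (2019001, "D", "BSc. IT"),
    (2020002, "B", "MSc. IT"),
]


def horizontal_fragments(rows, key):
    # Group whole rows by the value of a predicate (here: the branch column).
    fragments = {}
    for row in rows:
        fragments.setdefault(key(row), []).append(row)
    return fragments


frags = horizontal_fragments(T, key=lambda row: row[2])
t1, t2 = frags["MSc. IT"], frags["BSc. IT"]

# Reconstruction: the UNION of all fragments gives back the original rows.
reconstructed = set(t1) | set(t2)
```

Each row lands in exactly one fragment, so the fragments are disjoint and their union is complete. Those two properties are what make the reconstruction lossless.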
Data fragmentation in a distributed database has some advantages:
I. As the data is fragmented and can be stored locally, the
performance of the DDBMS increases.
II. Because data is stored at the local sites, local query
optimization is possible in the DDBMS.
III. Fragmentation helps to maintain the security and privacy of the
local systems, which in turn helps to maintain the overall
security of the DDBMS.
There are three methods for fragmenting a table:
i) Horizontal Fragmentation.
ii) Vertical Fragmentation.
iii) Hybrid Fragmentation.

2.4.1 Horizontal Fragmentation

Horizontal fragmentation divides a table horizontally into subsets of
its rows, that is, row-wise (by tuple). Suppose you have a table IDOL
as follows.

S_Roll_No    S_Name    Branch
2020001      A         MSc. IT
2019001      D         BSc. IT
2020002      B         MSc. IT

2.4.2 Vertical Fragmentation

Vertical fragmentation divides the table column-wise (by attribute).
It is more complex than horizontal fragmentation. To reconstruct the
original table from the fragments, the primary key must be present in
every fragment. The reconstruction is done using a join. For the IDOL
table above, you can create the following vertical fragments.
Vertical Fragmentation 1 =

S_Roll_No    S_Name
2020001      A
2019001      D
2020002      B
2020003      C
2019002      E

Vertical Fragmentation 2 =

S_Roll_No    Branch
2020001      MSc. IT
2019001      BSc. IT
2020002      MSc. IT
2020003      MSc. IT
2019002      BSc. IT
Vertical fragmentation 1 and vertical fragmentation 2 have one field
in common, namely S_Roll_No, the primary key of the IDOL table. It is
required to perform the join operation between the fragments. With T1
and T2 denoting the two fragments, you can join them on S_Roll_No to
get back the original table:
IDOL = T1 ⋈ T2
In a DDBMS, the vertical fragments are stored at different sites. In
Fig. 2.3, fragment 1 is stored at site A and fragment 2 is stored at
site B.
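The reconstruction by join can be checked with a short sketch that uses the rows of the two fragments shown above. Representing the rows as Python tuples is an assumption made for illustration.

```python
IDOL = [
    (2020001, "A", "MSc. IT"),
    (2019001, "D", "BSc. IT"),
    (2020002, "B", "MSc. IT"),
    (2020003, "C", "MSc. IT"),
    (2019002, "E", "BSc. IT"),
]

# Each vertical fragment keeps the primary key S_Roll_No plus some columns.
frag1 = [(roll, name) for roll, name, _ in IDOL]      # S_Roll_No, S_Name
frag2 = [(roll, branch) for roll, _, branch in IDOL]  # S_Roll_No, Branch

# Natural join on the shared key rebuilds the original rows.
branch_by_roll = dict(frag2)
rejoined = [(roll, name, branch_by_roll[roll]) for roll, name in frag1]
```

If the key column were dropped from either fragment, there would be no way to pair a name with its branch, which is why the primary key has to appear in every vertical fragment.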


CHECK YOUR PROGRESS-II

4. True or False
i) Vertical fragmentation divides the table column-wise
(attribute).
ii) Horizontal fragmentation allows dividing a table
horizontally into subsets of tables
5. Suppose you have a table COURSE as follows.

S_Roll_No S_Name Course

2020001 A MSc. IT
2019001 D BSc. IT
2020002 B MSc. IT
2020003 C MSc. IT
2019002 E BSc. IT

i) Create one horizontal fragment for Course = MSc. IT.
ii) Create one vertical fragment on Roll No and Course, for
Course = BSc. IT.

2.5 DATA REPLICATION AND ALLOCATION

The process of storing data or information at more than one site
or system in a distributed system is known as data replication. It
is useful in improving the availability of data. Data replication
copies the data from the database of one system to another
system, so that users at different sites can access the same data
without inconsistency. The main goals of data replication are to
increase the availability of the data and to speed up query
processing. There are two types of data replication.
i) Synchronous Data Replication: In this type of replication,
once changes are made in a table of the database, they are
replicated immediately.
ii) Asynchronous Data Replication: In asynchronous
replication, the changes are replicated after the commit
operation of the database.
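
The difference between the two types can be seen in a small Python sketch. It is a toy model under stated assumptions: the replicas are plain dictionaries standing in for remote sites, and the class name Primary and its methods are hypothetical, not part of any DBMS.

```python
class Primary:
    """Toy primary site that replicates writes to other sites."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas   # dictionaries standing in for remote sites
        self.pending = []          # committed changes not yet shipped

    def write_sync(self, key, value):
        # Synchronous: every replica is updated before the write returns.
        self.data[key] = value
        for replica in self.replicas:
            replica[key] = value

    def write_async(self, key, value):
        # Asynchronous: commit locally now and queue the change for later.
        self.data[key] = value
        self.pending.append((key, value))

    def flush(self):
        # Ship the queued changes to every replica in commit order.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()


site_b, site_c = {}, {}
primary = Primary([site_b, site_c])
primary.write_sync("2020001", "MSc. IT")
primary.write_async("2019001", "BSc. IT")
stale = "2019001" not in site_b   # the replica lags until flush() runs
primary.flush()
```

The synchronous write makes the caller wait for every site, while the asynchronous write returns at once but leaves the replicas briefly out of date.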

Apart from the types above, a few other kinds of data replication
are used in distributed databases.
i) Transactional Replication: Transactional replication is
generally used in server-to-server communication. In this
replication, one system holds a full copy of the database
and receives update notifications from the other system
whenever the data changes. Replication is done in real
time, so it gives a consistency guarantee. For example,
Azure SQL.
ii) Snapshot Replication: In this replication, a snapshot of
the database is sent from one database to another. Data is
not updated continuously; it is refreshed infrequently, at
specific times. It is simpler than transactional replication
but copies the whole snapshot on each refresh. For example,
SQL Server replication.
iii) Merge Replication: In this replication, the data of one
database is combined with that of another database. Because
the data can be updated in both databases, it is hard to
maintain consistency and concurrency. For example,
server and client communication (SQL Server).
Data replication in a DDBMS happens in different modes. They are
as follows.
i) Full Replication: In this mode, a full copy of the database
is present at every site of the distributed system. This mode
gives the highest availability of data, and users get the
best read performance because every query can be answered
locally. However, it is hard to maintain concurrency control.
ii) No Replication: Here, the data is divided into fragments
and each fragment is stored at only one site; the sites are
in different locations. Data availability is lower than with
full replication, but concurrency is easier to control.
iii) Partial Replication: Here some of the data fragments of
the database are replicated and others are not. Whether a
fragment is replicated depends on the demand for that
fragment.
Data allocation is the process of deciding where exactly the data
is stored, that is, which data is stored at which location. The
data allocation technique assigns each data fragment, or each
replica of a fragment, to a particular site of the distributed
database. The choice of sites and the degree of replication
depend on the demand for the data fragments, on the required
system performance and availability, and on the number of
transactions submitted at each site. If users demand high
availability of data, then full replication is a good choice.
Otherwise, if only some fragments are in demand, partial
replication can be used to allocate the data.
There are three main data allocation methods, as follows.
i) Centralized: Here the entire database is stored at a single
site. No data distribution or replication occurs.
ii) Partitioned: In this technique, data is divided into
fragments and those fragments are stored at different sites
of the distributed system.
iii) Replicated: In this technique, copies of the database are
kept at different locations and accessed from those
locations.

2.6 TYPES OF DDB SYSTEM

There are two types of distributed database: homogeneous and
heterogeneous.
i) Homogeneous Database
In a homogeneous database, the physical and logical structures of
the database are identical at all sites. This means that the OS,
DBMS, and software are the same at every site. Hence, a
homogeneous distributed database is easy to manage. For
example, Oracle Database server.
ii) Heterogeneous Database
In a heterogeneous distributed database, the physical and logical
structures of the database are not the same. The sites use
different schemas and software, so it is hard to manage
concurrency and transactions. Here, one site of the distributed
system may be completely unaware of the other sites. For
example, Oracle8.

2.7 QUERY PROCESSING IN DDB

In a distributed database system, query processing is done at
both the user site and the server site. A query originates at the
user site, where it is checked and optimized at the local level.
The query then reaches the server site, where it is processed and
optimized at the global level.
In a homogeneous distributed database, a query coming from a user
site is easy to manage because all sites have the same physical
and logical structure. In heterogeneous systems this is not so
easy, so techniques are needed to handle queries across the
different databases. There are two mechanisms for this in
heterogeneous systems.
i) Multi-Database
In this method, a dynamic schema is created for each of the
databases involved. If a user site uses a database D, a
dynamic schema is created to connect to D. This schema lets
the user query adapt to the database.
ii) Federated mechanism
In this method, a global schema is used to access the
databases. That is, one centralized schema is used to
access all the databases of the distributed system. This
global schema works properly even though the data is
fragmented and distributed over different sites.
When a federated mechanism is used, a few things have to be
managed during database access. They are presented below.
i) Data Models
When the global schema is defined, it must take care of the
data models, because a distributed database consists of
different databases, each with its own physical and logical
structure. The federated schema should therefore be compatible
with all these systems and able to handle queries against them.
ii) Constraints
Each database of the distributed system has its own way of
defining data constraints and its own method of accessing
the data. The federated schema should handle these
constraints.
iii) Query Language
In a distributed database, the databases vary from site to
site, so the query languages also vary from site to site.
Hence the federated schema should provide a common
language that is compatible with all the query languages.
iv) Data Transfer Cost
In a distributed database, the databases are distributed, so
their tables are distributed too, and some tables are even
fragmented. During query processing it may therefore be
necessary to access tables in different databases or at
different locations. This incurs request and data transfer
costs, which need to be optimized.
To explain data transfer cost, suppose you have two distributed
databases, IDOL_EMP and IDOL_DEPT. IDOL_EMP has a table EMP
which is present at one location (location 1) of the distributed
system, and IDOL_DEPT has a table DEPT at another location
(location 2). The EMP table contains the basic information of
each employee, whereas the DEPT table contains the name of the
department where the employee works. Suppose the EMP table has
500 records of 50 bytes each and the DEPT table has 10 records
of 10 bytes each. Now suppose a query to find the name of each
employee and their department is issued from a third location
(location 3). The result of this query will include 500 records,
assuming that every employee is related to a department. Suppose
that each record in the query result is 40 bytes long. In this
situation, you can execute the query in three ways, compare
their costs, and choose the cheapest.
CASE I: Transfer both the EMP and DEPT tables to location 3 and
execute the query there. For this case, the cost is as below.
i) Cost of transferring EMP data: 500 records * 50 bytes =
25,000 bytes.
ii) Cost of transferring DEPT data: 10 records * 10 bytes =
100 bytes.
iii) Therefore, total cost = 25,000 bytes + 100 bytes
= 25,100 bytes.
CASE II: Transfer the EMP table from location 1 to location 2,
process the query there, and transfer the result to location 3.
For this case, the cost is as below.
i) Cost of transferring EMP data: 500 records * 50 bytes =
25,000 bytes.
ii) Cost of transferring the result: 500 records * 40 bytes =
20,000 bytes.
iii) Therefore, total cost = 25,000 bytes + 20,000 bytes
= 45,000 bytes.
CASE III: Transfer the DEPT table from location 2 to location 1,
process the query there, and transfer the result to location 3.
For this case, the cost is as below.
i) Cost of transferring DEPT data: 10 records * 10 bytes =
100 bytes.
ii) Cost of transferring the result: 500 records * 40 bytes =
20,000 bytes.
iii) Therefore, total cost = 100 bytes + 20,000 bytes
= 20,100 bytes.
Comparing the costs of CASE I, CASE II, and CASE III, the cost of
CASE III is the lowest, so it is the optimized choice. Using this
method, you can perform your query in the distributed database
at minimal cost.
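
The comparison can be checked with a short Python sketch. It simply repeats the arithmetic of the three cases; the variable names are chosen for this illustration, and the only cost counted is the number of bytes transferred, as in the example above.

```python
# Sizes taken from the EMP/DEPT example in the text.
EMP_RECORDS, EMP_RECORD_BYTES = 500, 50
DEPT_RECORDS, DEPT_RECORD_BYTES = 10, 10
RESULT_RECORDS, RESULT_RECORD_BYTES = 500, 40

emp_bytes = EMP_RECORDS * EMP_RECORD_BYTES
dept_bytes = DEPT_RECORDS * DEPT_RECORD_BYTES
result_bytes = RESULT_RECORDS * RESULT_RECORD_BYTES

costs = {
    "CASE I": emp_bytes + dept_bytes,       # ship both tables to location 3
    "CASE II": emp_bytes + result_bytes,    # ship EMP to location 2, result to 3
    "CASE III": dept_bytes + result_bytes,  # ship DEPT to location 1, result to 3
}

cheapest = min(costs, key=costs.get)
for case, cost in costs.items():
    print(f"{case}: {cost:,} bytes")
print("Cheapest:", cheapest)
```

Changing the record counts or sizes shows how the best strategy depends on the data: if the query result were much larger than both tables, CASE I would become the cheapest.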

2.8 CONCURRENCY AND RECOVERY IN DDB

Distributed databases face several issues during concurrency
control and recovery. They are presented below.
i) Multiple copies of data:
A distributed system may keep a copy of the database at
each site to increase the availability of the system.
Concurrency control and recovery must keep these copies
consistent with one another, which is not easy.
ii) Failure of a site:
In a distributed system, the database of one site may fail.
The DDBMS should continue to work with the other sites,
and it should try to recover the failed site and bring its
data up to date.
iii) Failure of Communication Network:
The DDBMS must deal with communication failures and
try to maintain concurrency control and recover the sites
as soon as possible. If network partitioning occurs due to
network failure, it is hard to recover the sites and
maintain consistency.
iv) Distributed Commit:
Problems occur when a transaction is being committed in a
DDBMS and one of the participating sites has failed. The
two-phase commit protocol is often used to deal with this
problem.
v) Distributed Deadlock:
Deadlock may occur among transactions at different sites
of a distributed system. It is therefore necessary to
maintain consistency and recoverability when a deadlock
occurs.
The techniques that deal with concurrency control in a DDBMS
are explained below.
I. Lock-based protocol:
When two transactions access the same data, one transaction
can apply a read or write lock to the data to avoid
concurrency issues while the other transaction tries to
access it.
II. Shared lock system (Read lock):
A shared lock is a read lock. It can be held by several
transactions at the same time, and any transaction can
acquire it for reading.
III. Exclusive lock:
An exclusive lock is acquired by one transaction for both
read and write operations. While it is held, no other
transaction can acquire a lock, for reading or writing, on
the same data.

A lock-based concurrency protocol locks the data. A lock is a
variable that controls read and write operations on data. There
are two types of locking protocol.
i) One-phase Locking Protocol:
In this technique, a transaction applies a lock on data
before using it and releases the lock after the
transaction is complete.
ii) Two-phase locking protocol:
In the two-phase locking protocol, a transaction acquires
all its locks in the first phase and does not release any
lock until it has finished all read and write operations. In
the second phase, the transaction releases its locks and
never requests a new lock.
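
The two phases can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: it tracks the locks of a single transaction only, it does not model contention between transactions or shared versus exclusive modes, and the class name TwoPhaseTransaction is hypothetical.

```python
class TwoPhaseTransaction:
    """Tracks the locks of one transaction and enforces the two phases."""

    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False   # becomes True after the first release

    def lock(self, item):
        # Growing phase only: no new lock once any lock has been released.
        if self.shrinking:
            raise RuntimeError(
                f"{self.name}: cannot lock {item!r} after releasing a lock"
            )
        self.locks.add(item)

    def unlock(self, item):
        # The first release moves the transaction into the shrinking phase.
        self.shrinking = True
        self.locks.discard(item)


t = TwoPhaseTransaction("T1")
t.lock("EMP")
t.lock("DEPT")      # growing phase: acquire everything that is needed
t.unlock("EMP")     # shrinking phase starts here
try:
    t.lock("IDOL")  # not allowed under two-phase locking
    violated = False
except RuntimeError:
    violated = True
print(violated)
```

The rule that no lock is requested after the first release is what separates two-phase locking from simply locking and unlocking items one at a time.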
Recovery is one of the most important processes in a DDBMS. It is
required to recover information at a site. Recovery is
required for the following reasons.
i) The receiver site may be down.
ii) The system at the receiver site may crash.
iii) The communication link between the sender and receiver
sites may break.
A two-phase commit protocol is used to overcome the issue of
data recovery in a DDBMS. This atomic protocol coordinates the
processes of the DDBMS that take part in a transaction and
decides whether to commit or terminate the transaction. It
provides an automatic recovery option in case of a site failure.
The site where the transaction originates is known as the
coordinator, and the other sites taking part in the transaction
are known as cohorts. The protocol executes in two phases.
i) Commit request: In the commit-request phase, the
coordinator prepares the list of cohorts and asks each of
them whether it can commit the transaction.
ii) Commit phase: Based on the responses from the cohorts,
the coordinator decides to commit or terminate the
transaction.
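
The decision rule of the two phases can be sketched in a few lines of Python. This is a minimal illustration, not a full protocol: each cohort is represented by a function that returns its vote, and timeouts, logging, and recovery after a failure are left out. The function name two_phase_commit and the site names are assumptions made for the example.

```python
def two_phase_commit(cohorts):
    """cohorts maps a site name to a callable that returns True if it can commit."""
    # Phase 1 (commit request): the coordinator collects a vote from every cohort.
    votes = {site: can_commit() for site, can_commit in cohorts.items()}
    # Phase 2 (commit): commit only if every cohort voted yes, otherwise abort.
    decision = "COMMIT" if all(votes.values()) else "ABORT"
    return decision, votes


all_ready = {"site_A": lambda: True, "site_B": lambda: True}
one_failed = {"site_A": lambda: True, "site_B": lambda: False}

print(two_phase_commit(all_ready)[0])
print(two_phase_commit(one_failed)[0])
```

A single negative vote is enough to terminate the transaction at every site, which is how the protocol keeps the commit all-or-nothing.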

CHECK YOUR PROGRESS-III
6. All sites in a distributed database commit at exactly the
same instant. TRUE/FALSE
7. Fill in the blanks.
i) The real use of the Two-phase commit protocol is
______________.
ii) Read one, write all available protocol is used to
increase ___________ in a distributed database
system.
iii) Commit and rollback in DDB are related to ..........
iv) If distributed transactions are well-formed and
two-phase locked, then ................ is the correct locking
mechanism in distributed transactions as well as in
centralized databases.
v) A distributed transaction can be ............. if queries
are issued at one or more nodes.

2.9 SUMMING UP

• A distributed database is a collection of structured
information. Among all database systems, a distributed
database is one where files are stored in different computer
systems or sites. These sites are connected through a
communication network.
• Data in a DDB is physically stored at multiple sites, and the
data at each site is managed by a local DBMS that is
independent of the other sites.
• A distributed database integrates transaction processing.
• In a DDBMS, if one of the systems fails, the other systems
continue to provide the service to the user and can complete
the task of the failed system.
• Fragmentation is the process of dividing the database into
different tables. In a distributed database, the entire
database is divided into subtables or sub-relations so
that each subtable or sub-relation can be saved at a different
site of the distributed system.

• There are three methods of fragmenting a table:
o Horizontal Fragmentation.
o Vertical Fragmentation.
o Hybrid Fragmentation.
• The process of storing data or information at more than one
site or system in a distributed system is known as data
replication. It is useful in improving the availability of data.
• There are two types of data replication.
o Synchronous Data Replication: In this type of
replication, once changes are made in a table of the
database, they are replicated immediately.
o Asynchronous Data Replication: In asynchronous
replication, the changes are replicated after the commit
operation of the database.
• Data allocation is the process of deciding where exactly the
data is stored, that is, which data is stored at which
location.
• There are two types of distributed database:
a) Homogeneous Database b) Heterogeneous Database
• In a distributed database system, query processing is done at
both the user site and the server site. A query originates at
the user site, where it is checked and optimized at the local
level. The query then reaches the server site, where it is
processed and optimized at the global level.
• Distributed databases face several issues during concurrency
control and recovery:
o Multiple copies of data,
o Failure of a site,
o Failure of Communication Network,
o Distributed Commit,
o Distributed Deadlock.

• A lock is a variable that controls read and write operations
on data. There are two types of locking protocol.
o One-phase Locking Protocol: In this technique, a
transaction applies a lock on data before using it and
releases the lock after the transaction is complete.
o Two-phase locking protocol: In the two-phase locking
protocol, a transaction acquires all its locks in the first
phase and does not release any lock until it has finished
all read and write operations. In the second phase, the
transaction releases its locks and never requests a new
lock.

2.10 ANSWERS TO CHECK YOUR PROGRESS

1) A distributed database is a collection of structured
information. Among all database systems, a distributed
database is one where files are stored in different computer
systems or sites. These sites are connected through a
communication network.
2) In a DDBMS, reliability means that if one of the systems
fails, the other systems continue to provide the service to
the user and can complete the task of the failed system.
3) The goals of DDB are as follows.
I. Reliability
II. Availability
III. Performance
4) i) TRUE ii) TRUE
5) i)
S_Roll_No S_Name Course
2020001 A MSc. IT
2020002 B MSc. IT
2020003 C MSc. IT

ii)
S_Roll_No Course
2019001 BSc. IT
2019002 BSc. IT
6) FALSE

7)
i) Atomicity, i.e., all-or-nothing commits at all sites
ii) Both availability and robustness
iii) Data consistency
iv) Two-phase locking
v) Partially read-only

2.11 POSSIBLE QUESTIONS

Short answer type questions:

1. What is a distributed system?
2. What is a distributed database?
3. What are the goals of a distributed database?
4. What are the reliability and availability of a distributed
database?
5. What is the difference between the one-phase and two-phase
locking protocols?
6. What types of distributed database systems are
available?
7. What are the different modes of data replication in a
distributed system?
8. What is the difference between lock-based and shared lock
systems?
9. What is data replication in a DDBMS? What are its types?
Long answer type questions:
1. Explain the distributed database system with an example.
2. Explain with examples the fragmentation of tables in the
distributed system.
3. How are concurrency and recovery achieved in the
distributed database?
4. Explain data replication and allocation in DDBMS.
5. Explain the query processing in DDBMS.

2.12 REFERENCES AND SUGGESTED READINGS

i) Principles of Distributed Database Systems. Author: M. Tamer
Özsu.
ii) Distributed System: Concepts, Design, and Applications
Publisher: O, Reilly, Author: [Link]

UNIT 3: IMAGE AND MULTIMEDIA DATABASE
Unit Structure:
3.1 Introduction
3.2 Unit objectives
3.3 Concept of Image
3.4 Image Database and Multimedia database
3.5 Requirement of Multimedia database
3.6 Challenges of multimedia database
3.7 Contents of multimedia database
3.8 Application of multimedia database
3.9 Summing Up
3.10 Answers to Check Your Progress
3.11 Possible Questions
3.12 References and Suggested Readings

3.1 INTRODUCTION

This unit gives an overview of multimedia databases, with
particular attention to image databases. An image is a
collection of pixels, and the pixels carry the information of
the image. The unit discusses how images are stored in a
database, the contents and challenges of a multimedia
database, and finally its different applications.

3.2 UNIT OBJECTIVES

After completing this unit, you will be able to explain

i) the definition of a multimedia database,
ii) the types of multimedia data, including images,
iii) image and multimedia databases,
iv) the contents of a multimedia database,
v) the challenges of a multimedia database,
vi) the applications of a multimedia database.

3.3 CONCEPT OF IMAGE

An image is multimedia data. It consists of pixels, and the
pixels of an image contain all the necessary information about
the image. An image may be color, grayscale, or black and white.
You can extract color information from the pixels of an image.
Apart from color, other features such as texture and shape can
also be extracted from an image. These features can be stored in
a database. Images are used in many different fields, so it is
necessary to store them in databases. A database where images
are stored is known as a multimedia database.

3.4 IMAGE AND MULTIMEDIA DATABASE

An image cannot be stored in a database by typing it directly
into a standard SQL INSERT command. Instead, SQL embedded in a
host program is used to insert images into a database, and the
database must support image data for an image to be inserted.
Images are stored in binary form in a cell of a database table,
and the data type of that cell is Binary Large Object (BLOB). In
MySQL, BLOB is a data type that can store not only image data
but also other kinds of binary data. Applications such as an
employee database or a student database need to store images
alongside their ordinary records. Databases of this type are
known as multimedia databases, and storing images is one part
of such a database.
Let us look at BLOB in MySQL using Python. Here, you will
learn how to insert and retrieve multimedia files such as images,
video, or songs in a MySQL multimedia database using Python. To
store and retrieve multimedia data, i.e., BLOB data, in a MySQL
table, you need a table with columns for binary data, or you can
alter an existing table by adding an extra column for the BLOB
data. You can execute the following queries for the BLOB data.
i) Table Creation Query: CREATE TABLE `idol_emp` (
`emp_id` INT NOT NULL, `emp_name` TEXT NOT NULL,
`emp_photo` BLOB NOT NULL, `emp_biodata` BLOB NOT
NULL, PRIMARY KEY (`emp_id`))
In query (i), the two fields emp_photo and emp_biodata hold
BLOB data, so their data type is BLOB.
ii) Data Insertion Query: BLOB is a MySQL data type, and MySQL
provides the following four BLOB types, which differ in the
maximum length of the data they can hold.
a) TINYBLOB
b) BLOB
c) MEDIUMBLOB
d) LONGBLOB
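
To make the difference between the four types concrete, the following Python sketch picks the smallest BLOB type that can hold a value of a given size. The helper name smallest_blob_type is hypothetical; the limits are the maximum lengths MySQL documents for each type (2^8 - 1, 2^16 - 1, 2^24 - 1, and 2^32 - 1 bytes).

```python
# Maximum length in bytes of each MySQL BLOB type, smallest first.
BLOB_LIMITS = [
    ("TINYBLOB", 2**8 - 1),      # 255 bytes
    ("BLOB", 2**16 - 1),         # 65,535 bytes
    ("MEDIUMBLOB", 2**24 - 1),   # 16,777,215 bytes
    ("LONGBLOB", 2**32 - 1),     # 4,294,967,295 bytes
]


def smallest_blob_type(num_bytes):
    """Return the name of the smallest BLOB type that fits num_bytes."""
    for name, limit in BLOB_LIMITS:
        if num_bytes <= limit:
            return name
    raise ValueError("value is too large for any MySQL BLOB type")


print(smallest_blob_type(200))        # a very small icon
print(smallest_blob_type(2_000_000))  # a photo of about 2 MB
```

Note that a photo larger than 65,535 bytes does not fit in a plain BLOB column, so for a table such as idol_emp a MEDIUMBLOB column may be the safer choice for photographs.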
To insert data into `idol_emp` using BLOB and Python, you
need to perform the following steps.
a) Install the MySQL connector for Python (the
mysql-connector-python package) using pip, and then
establish the connection.
b) Write a Python function that converts images and other
multimedia data into binary data.
c) Define your insert query and execute it using the
cursor.execute() function.
d) After the query execution, commit your database
changes.
e) Close your cursor and database connection.
f) Finally, verify your result.
The code for insertion into the database using BLOB is given
below.

import mysql.connector

def multimediaToBinary(filename):
    # Read an image or other multimedia file and return its raw bytes
    with open(filename, 'rb') as file:
        binaryData = file.read()
    return binaryData

def insertBLOB(emp_id, emp_name, emp_photo, emp_biodata):
    print("Inserting multimedia data into idol_emp")
    connection = None
    try:
        connection = mysql.connector.connect(host='localhost',
                                             database='idol_db',
                                             user='idol',
                                             password='idolidol')
        cursor = connection.cursor()
        sql_insert_blob = """INSERT INTO idol_emp
                             (emp_id, emp_name, emp_photo, emp_biodata)
                             VALUES (%s, %s, %s, %s)"""
        # Convert both files to binary before binding them as parameters
        emp_photo = multimediaToBinary(emp_photo)
        emp_biodata = multimediaToBinary(emp_biodata)
        insert_blob = (emp_id, emp_name, emp_photo, emp_biodata)
        cursor.execute(sql_insert_blob, insert_blob)
        connection.commit()
        print("Image and biodata inserted successfully")
    except mysql.connector.Error as error:
        print("Failed inserting multimedia data {}".format(error))
    finally:
        # Close the cursor and connection only if the connection was opened
        if connection is not None and connection.is_connected():
            cursor.close()
            connection.close()
            print("MySQL connection is closed")

insertBLOB(1, "idol_emp1", "path of the image", "path of the text")

After inserting data into the database using BLOB in MySQL, you
can retrieve the data from the database as shown below. As
before, the MySQL connector for Python is required. The select
query is executed with the cursor.execute() function, and then
cursor.fetchall() is used to retrieve the rows from the
database.
import mysql.connector

def write_file(data, filename):
    # Write the binary data to a file on disk
    with open(filename, 'wb') as file:
        file.write(data)

def readBLOB(emp_id, emp_photo, emp_biodata):
    print("Reading data from idol_emp table")
    connection = None
    try:
        connection = mysql.connector.connect(host='localhost',
                                             database='idol_db',
                                             user='idol',
                                             password='idolidol')
        cursor = connection.cursor()
        sql_fetch = """SELECT * FROM idol_emp WHERE emp_id = %s"""
        cursor.execute(sql_fetch, (emp_id,))
        record = cursor.fetchall()
        for row in record:
            print("Employee Id = ", row[0])
            print("Employee Name = ", row[1])
            image = row[2]
            biodata = row[3]
            print("Storing employee's photo and biodata on the local PC")
            write_file(image, emp_photo)
            write_file(biodata, emp_biodata)
    except mysql.connector.Error as error:
        print("Failed to read data from idol_emp {}".format(error))
    finally:
        # Close the cursor and connection only if the connection was opened
        if connection is not None and connection.is_connected():
            cursor.close()
            connection.close()
            print("MySQL connection is closed")

readBLOB(1, "path of the image", "path of the text")

CHECK YOUR PROGRESS-I

1. What do you mean by a multimedia database?
2. What are the BLOB data types?
3. State true or false
i. TINYBLOB is a BLOB data type.
ii. MEDIUMBLOB is not a BLOB data type.
iii. Images consist of pixels.
4. What is the cursor.execute() function?
5. What is the role of cursor.fetchall()?

3.5 REQUIREMENT OF MULTIMEDIA DATABASE

Like other DBMS, a multimedia database should address the
following requirements.
i) Integration: The data of a multimedia database should
not be duplicated.
ii) Concurrency control: Like other DBMS, a multimedia
database should control the concurrency of
transactions. Otherwise, consistency issues will arise.
iii) Data Independence: In a multimedia database, the data
of the different media should be independent of one
another. It should be manageable from the user side.
iv) Persistence: The data of a multimedia database should
be saved so that it can be reused by other transactions.
v) Recovery: Data should be recoverable after a failure. A
system may fail for different reasons, but the recovery
option of a multimedia database should restore the data
when it is needed.

3.6 CHALLENGES OF MULTIMEDIA DATABASE

Like other DBMS, multimedia databases also have some
challenges. They are presented below.
i) Designing a multimedia database is not easy because data
in many different formats is present in it.
ii) A multimedia database consists of images, text, video,
mp3 audio, etc., and converting one file format to another
is not easy.
iii) Storing multimedia data requires a large amount of space,
and designing for such a large dataset is not easy.
iv) Processing is another issue of multimedia databases because
processing multimedia data takes more time.
v) Multimedia query processing and execution is another issue
of the multimedia database.

CHECK YOUR PROGRESS-II
6. What is concurrency in a multimedia database?
7. State true or false
i. Integration is a requirement of a multimedia database.
ii. Data independence should be a part of a multimedia
database.
iii. All multimedia should not be recovered.
8. State two challenges of a multimedia database.

3.7 CONTENTS OF MULTIMEDIA DATABASE

The multimedia database stores multimedia information. It
contains the following information.
i) Media data: the multimedia data itself, such as audio,
video, text, animations, and images.
ii) Media format data: information about the format of the
media, such as the sampling rate and the frame rate of the
data signal.
iii) Keyword data: keywords that describe the data of a
multimedia database. For an image, for example, the
keywords are its date, time, and description.
iv) Media feature data: the features of the data. For
example, for an image these are its color, texture, and
shape.

3.8 APPLICATION OF MULTIMEDIA DATABASE

The multimedia database can be applied in the following areas.

i) Multimedia databases can be applied in document and
record management systems, such as insurance claim
records.
ii) Multimedia databases can be applied in digital libraries.
For example, IR@INFLIBNET.
iii) Multimedia databases are used in video on demand. For
example, Netflix.
iv) Multimedia databases are used in music. For example,
Gaana.
v) Multimedia databases are used in GIS. For example,
Landsat 8.

CHECK YOUR PROGRESS-III

9. State true or false
i. Audio is not a part of a multimedia database.
ii. [Link] is an example of a multimedia database.
10. State two applications of a multimedia database.
11. What is Netflix?
12. What is INFLIBNET?

3.9 SUMMING UP

• A multimedia database is a collection of multimedia data
such as text, images, video, audio, etc.
• An image is multimedia data. It consists of pixels, and the
pixels of an image contain all the necessary information
about the image.
• An image may be color, grayscale, or black and white.
• Embedded SQL is used to insert images into a database. The
database must support image data for an image to be
inserted.
• Images are stored in binary form in a cell of a database
table, and the data type of that cell is Binary Large
Object (BLOB). In MySQL, BLOB is a data type that can store
not only image data but also other kinds of binary data.

• Table Creation Query: CREATE TABLE `idol_emp` (
`emp_id` INT NOT NULL, `emp_name` TEXT NOT
NULL, `emp_photo` BLOB NOT NULL, `emp_biodata`
BLOB NOT NULL, PRIMARY KEY (`emp_id`))
• Data Insertion Query: BLOB is a MySQL data type, and MySQL
provides the following four BLOB types, which differ in the
maximum length of the data they can hold.
a) TINYBLOB
b) BLOB
c) MEDIUMBLOB
d) LONGBLOB
• Like other DBMS, a multimedia database should address the
following requirements.
1. Integration
2. Concurrency control
3. Data independence
4. Persistence
5. Recovery

• Like other DBMS, multimedia databases also have some
challenges. They are:
1. Designing the multimedia database.
2. Converting one file format to another.
3. Processing, because processing multimedia data takes
more time.
4. Multimedia query processing and execution.
• The multimedia database stores multimedia information.
It contains the following information.
1. Audio, video, text, animations, and images.
2. The sampling rate and the frame rate of the data signal.
3. Keyword data, such as the date, time, and description of
an image.
4. Features of the data.

• The multimedia database can be applied in the following
areas.
1. Document and record management, such as insurance claim
records.
2. Digital libraries, such as IR@INFLIBNET.
3. Video on demand, such as Netflix.
4. Music, such as Gaana.
5. GIS, such as Landsat 8.

3.10 ANSWERS TO CHECK YOUR PROGRESS

1. A multimedia database is a collection of multimedia data such as texts, images, videos, audios, etc.
2. The four BLOB data types are
i. TINYBLOB
ii. BLOB
iii. MEDIUMBLOB
iv. LONGBLOB
3. i) True ii) False iii) True
4. To insert data in the multimedia database, the insert queries
are executed using the [Link]() function.
5. Using the [Link](), one can retrieve the data from
the multimedia database.
6. In a database management system (DBMS), concurrency control manages simultaneous access to a database. Like other DBMS, a multimedia database should control the concurrency of its transactions. Otherwise, consistency issues will arise.
7. i) True ii) True iii) False
8. Like other DBMS, multimedia databases also have some challenges. They are presented below.
i. Designing a multimedia database is not easy, as data in many different formats is present in it.
ii. A multimedia database consists of images, text, video, mp3, etc., and converting one file format to another is not easy.
9. i) False ii) True
10. The multimedia database can be applied in the following areas.
i. Multimedia databases can be applied in digital libraries, for example, IR@INFLIBNET.
ii. Multimedia databases are used in video on demand, for example, Netflix.
11. Netflix is an example of a multimedia database. It is a
streaming service that offers a wide variety of award-
winning TV shows, movies, anime, documentaries, etc.
12. Information and Library Network (INFLIBNET) is an
example of a multimedia database and is an autonomous
Inter-University Centre of the University Grants
Commission (UGC) that provides access to e-resources to
colleges, universities, and centrally funded technical
institutions.

3.11 POSSIBLE QUESTIONS

Short answer type questions:

1. What is a multimedia database?
2. Define the terms image, video, and audio.
3. What is BLOB?
4. Why do you need BLOB?
5. What are the data types of BLOB?
6. What is an embedded query?
7. How do you create a multimedia table?
8. How do you insert queries in a multimedia table?
9. State two issues of a multimedia database.
10. State two design issues of the multimedia database.
11. State two challenges of the multimedia database.
12. State two applications of a multimedia database.
Long answer type questions:

1. Explain data insertion and retrieval in a multimedia database using Python.
2. Explain the different challenges of a multimedia database.
3. Explain the requirements of a multimedia database.

3.12 REFERENCES AND SUGGESTED READINGS

1. Multimedia Database Management Systems, Author: Prabhakaran, Publisher: Springer
2. Multimedia Database Management Systems, Author: Guojun Lu.

UNIT 4: SPATIAL DATABASE

Unit Structure:
4.1 Introduction
4.2 Unit Objectives
4.3 Spatial Database Concept
4.4 Spatial DBMS Data Models
4.5 Content-based Indexing and Retrieval
4.6 Different Indexing Techniques
4.7 Summing Up
4.8 Answers to Check Your Progress
4.9 Possible Questions
4.10 References and Suggested Readings

4.1 INTRODUCTION

This unit gives an overview of the spatial database. A spatial database is one in which geographic location information is saved. The concept of a spatial database is explained here along with its indexing techniques. Content-based indexing is also discussed in this unit, along with the retrieval techniques. Content-based image indexing means that the properties of an image are saved in the database and the image is retrieved based on those properties. Finally, different indexing techniques such as the R tree, R+ tree, and KD tree are discussed in the unit.

4.2 UNIT OBJECTIVES

After completing this unit, you will be able to learn

i) About the concept of a spatial database.
ii) About the process of saving locations in a database.
iii) About Content-Based Indexing (CBI) and retrieval.
iv) About indexing techniques such as the R tree, R+ tree, and KD tree.

4.3 SPATIAL DATABASE CONCEPT

Spatial data is data with which a geographic location, such as a village, town, or city, is associated. A spatial database is where you can save this type of information in terms of location object data. Technically, a spatial database is optimized for storing and querying object data in a geometric space. A spatial database stores geometric objects like points, lines, and polygons, but some databases also store 2D objects, linear networks, etc. It is a part of the GIS database (Fig. 3.1).
Let's understand the concept of a spatial database with the help of the following example. Consider a satellite image of a road. Though it is an image, it carries geographic information such as points, lines, and polygons, which represent the buildings, roads, etc. In this image, the spatial data is represented by vector data and raster data. You can say that spatial data is of two types.
i) Vector data: The data is represented using lines, points, and polygons.
ii) Raster data: The data is presented using a matrix of cells. For example, data for a building.
The database system which manages spatial data is known as a Spatial Database Management System (SDBMS). The SDBMS plays a prominent role in the management of queries on spatial data. Spatial data is used in many disciplines such as geography, remote sensing, urban planning, and natural resource management. As mentioned above, the spatial database establishes a specification to represent the above two types of spatial data. Using this specification, spatial queries are processed using SQL (for example, in PostgreSQL with PostGIS, which tools such as QGIS can query).

v) Environmental applications: In fire or pollution monitoring, the SDBMS is used.
vi) Administrative applications: In public network administration and vehicle navigation, the SDBMS is used.
An SDBMS is designed to meet the following requirements.
i) Manipulation of very large amounts of data, e.g., terabytes of data per day from satellite images.
ii) Data distinction, e.g., handling both spatial and non-spatial (alphanumeric) data.
iii) Complex spatial relationships and operations, e.g., topological, directional, and metric relationships.
iv) Complex spatial relationships, e.g., find all cities adjacent to a river, find all dark shapes to the left of the heart, and find the 5 closest hospitals to a given location.
v) Spatial join: an expensive operation, e.g., find the 5 closest hospitals to any highway.
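The "closest hospitals to a given location" query in the list above can be sketched without any index, as a brute-force scan. This is only an illustration under stated assumptions: the hospital names and coordinates are made up, distances are plain Euclidean distances in the plane, and `k_nearest` is a hypothetical helper rather than a function of any SDBMS.

```python
import heapq
from math import dist

# Hypothetical hospital locations as (x, y) points in the plane.
hospitals = {
    "A": (1, 1),
    "B": (4, 5),
    "C": (0, 2),
    "D": (9, 9),
    "E": (3, 3),
    "F": (2, 1),
}

def k_nearest(places, location, k):
    """Return the names of the k places closest to location (brute force)."""
    return heapq.nsmallest(k, places, key=lambda name: dist(places[name], location))

nearest = k_nearest(hospitals, (0, 0), 3)
print(nearest)
```

A scan like this touches every stored object, which is why an SDBMS relies on spatial indexes for the same question on large data sets.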

CHECK YOUR PROGRESS-I

1. What do you mean by spatial data and spatial database?
2. State a few applications of spatial databases.
3. Which classes do the spatial data types in MySQL correspond to?
4. State true or false
i) SPATIAL indexes cannot be created on NOT NULL
spatial columns.
ii) By ‘spatial data’ we mean data that has position value.

4.4 SPATIAL DBMS DATA MODELS

In section 4.3, the two types of data model of the SDBMS were already mentioned. They are
i) Raster Model: In the raster model, the space is subdivided into cells of regular size and shape such as squares, triangles, hexagons, etc. Each cell of the raster is assigned exactly one value, the value of the attribute it represents. Different attributes are stored in separate files (layers).
ii) Vector Model: In the vector model, the subdivision of the space is based on the position of the geographic features, i.e., it is irregular. The features are represented in 2-D space as Points (x,y), Lines (x1,y1, x2,y2, ..., xn,yn), and Regions (x1,y1, ..., xn,yn, x1,y1).
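The two models above can be sketched with plain Python structures. This is a minimal illustration, assuming made-up coordinates and cell values; `polygon_area` and `cell_value` are hypothetical helper names, and the raster is assumed to start at the origin with square cells of side 1.

```python
# Vector model: features are coordinate sequences.
point = (2.0, 3.0)
line = [(0.0, 0.0), (2.0, 1.0), (4.0, 1.0)]
# A region repeats its first vertex at the end to close the ring.
region = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0), (0.0, 0.0)]

def polygon_area(ring):
    """Area of a closed ring using the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

# Raster model: one attribute value per regular cell (one layer).
# Row index grows with y, column index grows with x.
raster = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 0],
]

def cell_value(grid, x, y, cell_size=1.0):
    """Return the value of the cell that contains the position (x, y)."""
    col = int(x // cell_size)
    row = int(y // cell_size)
    return grid[row][col]

area = polygon_area(region)
value = cell_value(raster, 1.5, 1.2)
```

The vector region keeps exact boundaries, while the raster answers "what is at this position" with a single array lookup.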

4.5 CONTENT-BASED INDEXING AND RETRIEVAL

Like other database systems, an SDBMS also needs indexing of spatial data for faster query processing and efficient searching. Content-based indexing is one where data is saved based on the properties of the data and is retrieved based on these properties. "Content-based" means that the search analyses the contents of the data rather than metadata such as keywords, tags, or descriptions associated with the data. This results in faster query processing and searching. For example, a Content-Based Image Indexing and Retrieval (CBIR) system is one where images are saved in the database based on the properties of the image data, meaning color, texture, and shape. Based on these, you can index the images and also retrieve them.
Content-based indexing and retrieval is applicable in many areas, of which image, video, text, and music are the most popular. Let's explain CBIR with the help of an example. Suppose you have many images in your database and a query image. First, features such as color, texture, position, and shape are extracted from the query image and saved as a feature vector. The same process is applied to every image in the database. Once you have both sets of feature vectors, compare the feature vector of the query image with the feature vector of each database image; the database images whose vectors are closest to the query are returned.
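The comparison of feature vectors can be sketched as follows. This is a toy illustration, not the method of any particular CBIR system: it assumes that the only feature is a coarse color histogram, that images are given as lists of (R, G, B) pixels, and that similarity is Euclidean distance. The image names and pixel values are made up.

```python
from math import sqrt

def color_histogram(pixels, bins=4):
    """Feature vector: normalized histogram over bins**3 quantized RGB colors."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1
    return [count / len(pixels) for count in hist]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query_pixels, database):
    """Return database image names ordered from most to least similar."""
    query_vec = color_histogram(query_pixels)
    scored = [(distance(query_vec, color_histogram(px)), name)
              for name, px in database.items()]
    return [name for _, name in sorted(scored)]

# Hypothetical database of tiny "images".
database = {
    "sunset": [(250, 120, 30)] * 90 + [(20, 20, 60)] * 10,
    "forest": [(20, 140, 40)] * 80 + [(90, 60, 30)] * 20,
    "ocean": [(10, 60, 200)] * 100,
}
query = [(240, 110, 20)] * 85 + [(30, 30, 70)] * 15

ranking = rank(query, database)
print(ranking)
```

In practice the database feature vectors would be computed once and stored, so that only the query image needs feature extraction at search time.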


Fig. 3.2: CBIR system

CHECK YOUR PROGRESS-II

5. What are the spatial data models?
6. Can we use images and videos in a CBIR system?
7. Give two examples of CBIR search engines.
8. What features of an image are considered for CBIR?

4.6 DIFFERENT INDEXING TECHNIQUES

For the indexing of spatial data, different indexing techniques are used.
i) R Tree
ii) R+ Tree
iii) KD Tree
Let's explain these techniques with the help of examples.
I. R tree: The R tree is a tree data structure for storing spatial data efficiently. It is used to store spatial data indexes and is highly useful for spatial queries, storage, and the indexing of multi-dimensional information, for example, handling game data, implementing virtual maps, and handling geospatial coordinates. The properties of the R tree are given below.
i. It consists of a single root with internal and leaf nodes.
ii. The root node contains a pointer to the largest region.
iii. The parent nodes contain pointers to their child nodes, where the region of each child node lies completely within the region of its parent node.
iv. Leaf nodes contain the actual data objects together with their Minimum Bounding Regions (MBRs), where an MBR is the smallest sub-region of the space that encloses a group of data.

Fig. 3.3 (a), (b): R tree in SDBMS

To locate an object, the search algorithm descends the tree from the root. The algorithm recursively traverses down the subtrees of bounding rectangles that intersect the query rectangle. When a leaf node is reached, bounding rectangles are tested against the query rectangle and their objects are fetched for testing if they intersect the query rectangle.
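The search just described can be sketched on a small hand-built tree. This is a simplified illustration: insertion and node splitting are left out, rectangles are assumed to be (xmin, ymin, xmax, ymax) tuples, and `RNode`, `intersects`, `search`, and the object names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

@dataclass
class RNode:
    mbr: Rect                                          # bounding rectangle of the node
    children: List["RNode"] = field(default_factory=list)
    entries: List[Tuple[Rect, str]] = field(default_factory=list)  # leaf data

def search(node: RNode, query: Rect) -> List[str]:
    """Collect the names of all objects whose rectangle intersects query."""
    if not intersects(node.mbr, query):
        return []                                      # prune this whole subtree
    if not node.children:                              # leaf: test each object
        return [name for rect, name in node.entries if intersects(rect, query)]
    found = []
    for child in node.children:                        # descend into every child
        found.extend(search(child, query))
    return found

leaf_a = RNode((0, 0, 4, 4), entries=[((0, 0, 1, 1), "school"), ((3, 3, 4, 4), "park")])
leaf_b = RNode((5, 5, 9, 9), entries=[((5, 5, 6, 6), "hospital"), ((8, 8, 9, 9), "mall")])
root = RNode((0, 0, 9, 9), children=[leaf_a, leaf_b])

hits = search(root, (2.5, 2.5, 5.5, 5.5))
print(hits)
```

Both leaves are visited here because the query rectangle touches both bounding rectangles; a query far from one leaf would skip it entirely.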
II. R+ Tree: The R+ tree is a variant of the R tree in which data is indexed using (x,y) coordinates. It is a compromise between the R tree and the KD tree. In the R+ tree, nodes are not guaranteed to be at least half full, and the internal nodes avoid overlapping by inserting an object into multiple leaves where necessary. The R+ tree therefore has minimal coverage and minimal overlap, and it overcomes the overlapping issue of the R tree. The advantages and disadvantages of the R+ tree are presented below.
Advantages:
i) Because the nodes do not overlap, point queries perform better: a point is covered by at most one node at each level.
ii) A single path is followed to reach the node for a point query.
Disadvantages:
i) Since rectangles are duplicated, an R+ tree can be
larger than an R tree built on the same data set.
ii) Construction and maintenance of R+ trees are more
complex than the R trees and other variants of the R
tree.
The duplication of objects across nodes is what keeps the entries of an R+ tree non-overlapping. During a search, a subtree is visited only if its covering rectangle intersects the query region. Because the covering rectangles are disjoint, a point query avoids the multiple search paths of the R tree.
To insert an object, multiple paths may be traversed. At a node, the subtrees whose covering rectangles intersect the object's bounding rectangle must be traversed. On reaching the leaf nodes, the object identifier is stored in them, so multiple nodes of the R+ tree may store the same object. Three cases need care during insertion.
i) The object is inserted into a node where the covering rectangles of all entries do not intersect the object's bounding rectangle.
ii) The bounding rectangle of the new object only partially intersects the bounding rectangles of the entries.
iii) The third case is more serious: the covering rectangles of some entries can prevent each other from expanding to include the new object.
III. KD Tree: A KD tree (K-dimensional tree) is a binary search tree in which each node holds a point in K-dimensional space. Because it recursively divides the space, it is also known as a space-partitioning data structure. Each non-leaf node splits the space into two half-spaces along one dimension: points that fall to the left of the splitting point go into the left subtree, and points to the right go into the right subtree. Construction of the KD tree is as follows:
i) The axis used to generate the splitting planes is cycled from level to level (x, y, x, ... in two dimensions).
ii) The nodes are selected by taking the median of the data being placed in the subtree.

Let's understand the basic concept of the KD tree by considering a 2D tree. At each node, only the coordinate of that level's splitting axis is compared: the left subtree contains the points whose coordinate is smaller than that of the node, and the right subtree contains the points whose coordinate is greater than or equal to it.
Let's build a k-d tree with the points (30,40), (5,25), (70,70), (50,30), (35,45). Let the root node be x-aligned, so the levels alternate between x and y.
i) Take the first point (30,40). As the tree is empty, make it the root node of the tree.
ii) The 2nd point is (5,25). Its x value is 5 and 5<30, so it goes to the left subtree.
iii) The 3rd point is (70,70). Now 70>30, so it goes to the right subtree.
iv) The 4th point is (50,30). First, compare x with the root: 50>30, so go right. (70,70) is already in the right subtree and is y-aligned, so compare the y values: 30<70. So, it goes to the left subtree of (70,70).
v) The final point is (35,45). Comparing x with the root, 35>30, so it goes to the right subtree. There, comparing y with (70,70), you find 45<70, so it goes to the left subtree of (70,70), where (50,30) is already present. That node is x-aligned again, so compare the x values: 35<50. So, it is placed on the left of (50,30).
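The insertion steps above can be replayed with a short sketch. It assumes points are inserted one at a time in the given order (not the median-based bulk construction), with smaller values going left and greater-or-equal values going right; `KDNode` and `insert` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[int, int]

@dataclass
class KDNode:
    point: Point
    left: Optional["KDNode"] = None
    right: Optional["KDNode"] = None

def insert(node: Optional[KDNode], point: Point, depth: int = 0, k: int = 2) -> KDNode:
    """Insert point into the subtree rooted at node and return the subtree root."""
    if node is None:
        return KDNode(point)
    axis = depth % k                      # 0 = compare x, 1 = compare y, then repeat
    if point[axis] < node.point[axis]:
        node.left = insert(node.left, point, depth + 1, k)
    else:
        node.right = insert(node.right, point, depth + 1, k)
    return node

root = None
for p in [(30, 40), (5, 25), (70, 70), (50, 30), (35, 45)]:
    root = insert(root, p)

print(root.right.left.left.point)
```

Inserting in a different order produces a differently shaped tree, which is why bulk construction usually picks the median at each level to keep the tree balanced.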

Fig. 3.4 KD tree in SDBMS

CHECK YOUR PROGRESS-III

9. What are the R tree and the R+ tree?
10. In what time can a 2-d tree be constructed?
11. In a k-d tree, what does K mean?
12. Each level in a k-d tree has a cutting dimension. (True or False)

4.7 SUMMING UP

• Spatial data is data with which a geographic location, such as a village, town, or city, is associated. A spatial database is where you can save this type of information in terms of location object data.
• Spatial data is of two types.
o Vector data: The data is represented using lines, points, and polygons.
o Raster data: The data is presented using a matrix of cells. For example, data for a building.
• Spatial queries are processed using SQL (for example, in PostgreSQL with PostGIS, which tools such as QGIS can query).
• An SDBMS is designed to meet the following requirements.
o Manipulation of very large amounts of data.
o Data distinction.
o Complex spatial relationships and operations.
o Complex spatial relationships.
o Spatial join.
• A Content-Based Image Indexing and Retrieval (CBIR) system is one where images are saved in the database based on the properties of the image data.
• The R tree is a tree data structure for storing spatial data efficiently. It is used to store spatial data indexes and is highly useful for spatial queries, storage, and the indexing of multi-dimensional information.
• The R+ tree is a variant of the R tree where data is indexed using (x,y) coordinates. It is a compromise between the R tree and the KD tree.
• The KD tree is a binary search tree that is also known as the K-dimensional tree. Each node of the tree represents a K-dimensional point in space.
• Construction of the KD tree is as follows:
o The axis used to generate the splitting planes is cycled repeatedly.
o The nodes are selected by taking the median of the data being placed in the subtree.

4.8 ANSWERS TO CHECK YOUR PROGRESS

1. Spatial data is data with which a geographic location, such as a village, town, or city, is associated. A spatial database is where you can save this type of information in terms of location object data.
2. A few applications of spatial DBMS are
i) Image and multimedia databases
ii) Time-series databases
iii) Traditional DBMS
3. OpenGIS
4. i) FALSE
ii) TRUE
5. Two types of spatial data models are raster and vector
model.
6. For CBIR, we can use only images.
7. eBay image Search and Google Image Search.
8. Color, shape, and texture.
9. The R tree and the R+ tree are spatial indexing techniques in spatial DBMS.
10. O(n log n)
11. Number of dimensions.
12. True

4.9 POSSIBLE QUESTIONS

Short answer type questions:

1) What do you mean by spatial data and spatial database system?
2) Explain the raster and vector models in spatial DBMS.
3) What is CBIR? Give examples.
4) What are the applications of spatial DBMS?
5) What are the requirements of spatial DBMS?
6) State the difference between the R tree and the R+ tree.
7) Give some examples of spatial query languages.
8) What are the advantages of the R+ tree over the R tree?
9) Why do you need the KD tree if you have the R tree and the R+ tree?
10) What are the requirements of spatial DBMS?
Long answer type questions:
1) Explain the CBIR system with an example and diagrams.
2) Explain the KD tree with an example.
3) What are the indexing techniques of Spatial DBMS?
Explain.

4.10 REFERENCES AND SUGGESTED READINGS

1) Spatial Databases: With Application to GIS, by Michel O. Scholl
2) Spatial Data Management by Nikos Mamoulis

SQL - Object-Relational Extensions — CSCI 4380 Database Systems... [Link]

SQL - Object-Relational Extensions
Contents
• Semantic Hierarchies - Inheritance
• Complex objects

• Postgresql and many other databases actually have many extensions that go well
beyond the relational data model.
• As these extensions violate relational data model, think about what you are giving up
and use them sparingly!
◦ Simplicity of data model and queries
◦ Optimizations may not be as easy to perform
• We will go through some of these here, using Postgresql as an example.

Semantic Hierarchies - Inheritance

• Recall in E-R diagrams, we talked about ISA relationships.
◦ A isa B, meaning A inherits all the attributes of B (and adds some more)
• Postgresql allows you to define class hierarchies:
See example database to be used.

CREATE TABLE cities (

name text
, population float
, altitude int -- in feet
);

CREATE TABLE capitals (

state char(2)
) INHERITS (cities);

• Querying subtables:


SELECT
name
, altitude
FROM
cities
WHERE
altitude > 50;

Includes all cities, i.e. capitals as well.

SELECT
name
, altitude
FROM ONLY
cities
WHERE
altitude > 50;

Includes only cities, not capitals.

• To find out which table a row comes from:

SELECT
[Link]
, [Link]
, [Link]
FROM
cities c
, pg_class p
WHERE
[Link] > 50
and [Link] = [Link];

• Semantic hierarchies about sets of objects and their relationship to each other.
◦ A type of object (capital) is a special type of city.
◦ All cities include the capitals.

Complex objects


• You can create user defined types

create type phone_type as (

num varchar(12)
, type varchar(50)
);

create table person (

id int
, name varchar(30)
, phone phone_type
) ;

insert into person values(

1
, 'Kara Danvers'
, ('555-1234','work')::phone_type
) ;

select * from person ;

id | name | phone
----+--------------+-----------------
1 | Kara Danvers | (555-1234,work)

• These complex types really go against the first normal form: that all values should be
atomic. But, they allow multiple related values to be encapsulated.
• You can access the types using dot notation

select * from person where (phone).type = 'work';

• Technically you should store both attributes for phone separately, but this way, you can tell that they belong together.
• You can also define user defined types to be restricted domains of values and then
use in multiple places.

Collection of Values
• In addition to records (like the one above), you can also define collection of values.
• Arrays:


CREATE TABLE tictactoe (

squares integer[3][3]
);

INSERT INTO tictactoe VALUES('{{1,2,3},{4,5,6},{7,8,9}}');

SELECT squares[3][2] FROM tictactoe; --not zero indexed

squares
---------
8
(1 row)

CREATE TABLE messages (

msg text[]
) ;

INSERT INTO messages VALUES ('{"hello", "world"}') ;

INSERT INTO messages VALUES ('{"I", "feel", "so", "free"}') ;

SELECT msg[2] FROM messages ;

msg
-------
world
feel
(2 rows)

SELECT msg[2:3] FROM messages; --slicing, really?

msg
-----------
{world}
{feel,so}
(2 rows)

• The best use of complex types is in procedures/functions written in pl/pgsql or a programming language like C.

Typed objects and methods

• The main use of typed objects is to create extensions for handling specific types of
data.
• For each data type, there are specific methods that apply to them, like an object-
oriented programming language!
• Some really useful examples:
◦ Geographic data: points (geo locations), polygons (state, city boundaries), line
segments (roads, rivers)


◦ Text data: vectors of words and weights for each word

◦ JSON

SELECT '{"foo": {"bar": "baz"}}'::jsonb;

jsonb
-------------------------
{"foo": {"bar": "baz"}}

SELECT '{"foo": {"bar": "baz"}}'::jsonb->'foo';

?column?
----------------
{"bar": "baz"}

Geographic Data
• PostGIS is an extension for supporting geographic data with many useful data types and functions.
• First install postgis and create the extension from a superuser:

create extension postgis;

create database geodb owner sibeladali template template_postgis ;

• Now you can use all the data types and methods available in postgis.

CREATE TABLE bwithloc (

name VARCHAR(100)
, location geography(POINT,4326)
) ;

-- WKT points are written as POINT(longitude latitude).
insert into bwithloc values('Rensselaer Polytechnic Institute',
ST_GeographyFromText('SRID=4326;POINT(-73.6816793 42.7308634)'));

insert into bwithloc values('Shalimar Restaurant',
ST_GeographyFromText('SRID=4326;POINT(-73.688473 42.732293)'));

insert into bwithloc values('The Placid Baker',
ST_GeographyFromText('SRID=4326;POINT(-73.690868 42.7313916)'));

• SRID identifies the spatial reference system of the coordinates; 4326 is WGS 84, which uses latitude and longitude.
• You can also enter polygons as arrays of points, line segments are arrays of lines, etc.
• Many geography functions are available (distance is in meters):


SELECT
[Link]
, [Link]
, ST_DISTANCE([Link], [Link])
FROM
bwithloc b1
, bwithloc b2
WHERE
[Link] < [Link] ;

• Other examples:
◦ Check whether a point is inside a polygon (which city is this restaurant in)?
◦ Check the length of a line segment
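To get a feel for what a distance in meters between two stored locations looks like, the following sketch computes a great-circle distance with the haversine formula. It is an approximation for illustration only: it assumes a spherical Earth with a mean radius of 6,371 km, so the result can differ slightly from the value PostGIS reports. The function name `haversine_m` is hypothetical; the coordinates are the latitude and longitude values of two of the rows inserted above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# (latitude, longitude) pairs taken from the example rows.
rpi = (42.7308634, -73.6816793)
shalimar = (42.732293, -73.688473)

d = haversine_m(*rpi, *shalimar)
print(round(d), "meters")
```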

Text Querying
• The text queries we have seen so far are very simplistic: find if the text contains a specific word.
• More sophisticated approaches treat text as a collection of words or tokens.
◦ If you want to learn more, information retrieval is a field that studies this!
• Postgresql supports text processing:

SELECT to_tsvector('fat cats ate fat rats');

to_tsvector
-----------------------------------
'ate':3 'cat':2 'fat':1,4 'rat':5

numbers show the location of the keyword in the text.

• Text queries will consist of boolean connection of keywords, tokenized and stop
words removed:

SELECT to_tsquery('english', 'The & Fat & Rats');

to_tsquery
---------------
'fat' & 'rat'

• You can search a keyword query in a document by relevance. The number of times a
word appears will increase the relevance of the text to the query.
We will use the Yelp database as an example:


SELECT
[Link]
, ts_rank_cd(to_tsvector('english', r.review_text), query) AS rank
FROM
reviews r
, businesses b
, to_tsquery('pizza & (crust | sauce) & (delicious|tasty)') query
WHERE
b.business_id = r.business_id
and to_tsvector('english', r.review_text) @@ query
ORDER BY
rank DESC
LIMIT 10;

Summary
• Postgresql is extensible with many new data types and associated methods.
• We will also see how it is possible to create the appropriate indices for these data
types.


Database Design in DBMS

Last Updated : 23 Jul, 2025

Before designing a database, it's crucial to understand important terms and concepts. A
properly structured database guarantees efficiency, data accuracy, and usability. From
understanding data storage to the principles that define data relationships, these concepts
are essential for anyone involved in designing, managing, or optimizing databases.

Whether you are new to database design or an experienced professional, these

fundamental ideas serve as the building blocks for creating strong and scalable database
systems.

What is database design?

Database Design can be defined as a set of procedures or collection of tasks involving various steps taken to implement a database. A good database design is important. It helps you get the right information when you need it. Following are some critical points to keep in mind to achieve a good database design:

1. Data consistency and integrity must be maintained.

2. Low Redundancy
3. Faster searching through indices
4. Security measures should be taken by enforcing various integrity constraints.
5. Data should be stored in fragmented bits of information in the most atomic format
possible.

However, depending on specific requirements above criteria might change. But these are
the most common things that ensure a good database design.

What are the Following Steps that can be taken by a Database Designer to Ensure Good Database Design?
Step 1: Determine the goal of your database, and ensure clear communication with the
stakeholders (if any). Understanding the purpose of a database will help in thinking of
various use cases & where the problem may arise & how we can prevent it.

Step 2: List down all the entities that will be present in the database & what relationships
exist among them.

Step 3: Organize the information into different tables such that no or very little redundancy
is there.

Step 4: Ensure uniqueness in every table. The uniqueness of records present in any relation
is a very crucial part of database design that helps us avoid redundancy. Identify the key
attributes to uniquely identify every row from columns. You can use various key constraints
to ensure the uniqueness of your table, also keep in mind the uniquely identifying records
must consume as little space as possible & shall not contain any NULL values.

Step 5: After all the tables are structured, and information is organized apply Normalization
Forms to identify anomalies that may arise & redundancy that can cause inconsistency in
the database.

Primary Terminologies Used in Database Design

Following are the terminologies that a person should be familiar with before designing a
database:

Redundancy: Redundancy refers to the duplicity of the data. There can be specific use
cases when we need or don't need redundancy in our Database. For ex: If we have a
banking system application then we may need to strictly prevent redundancy in our
Database.
Schema: Schema is a logical container that defines the structure & manages the
organization of the data stored in it. It consists of rows and columns having data types
for each column.
Records/Tuples: A record and a tuple are the same thing: a single row of data stored in a table.
Indexing: Indexing is a data structure technique to promote efficient retrieval of the data
stored in our database.
Data Integrity & Consistency: Data integrity refers to the quality of the information
stored in our database and consistency refers to the correctness of the data stored.
Data Models: Data models provide us with visual modeling techniques to visualize the data & the relationships that exist among those data. Ex: ER model, Network Model, Object Oriented Model, Hierarchical model, etc.
Normalization: The process of organizing data to reduce redundancy and dependency by dividing larger tables into smaller ones and defining relationships. It ensures efficient data storage and consistency.
Functional Dependency: Functional Dependency is a relationship between two
attributes of the table that represents that the value of one attribute can be determined
by another. Ex: {A -> B}, A & B are two attributes and attribute A can uniquely determine
the value of B.
Transaction: Transaction is a single logical unit of work. It signifies that some changes
are made in the database. A transaction must satisfy the ACID or BASE properties
(depending on the type of Database).
Schedule: Schedule defines the sequence of transactions in which they're executed by
one or multiple users.
Concurrency: Concurrency refers to allowing multiple transactions to operate
simultaneously without interfering with one another.
Constraints: Constraints are the rules applied to fields in a table to enforce data
integrity. e.g., NOT NULL, UNIQUE, CHECK, etc. It ensures data quality and accuracy.
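The Functional Dependency entry in the list above can be made concrete with a small check. This is a sketch under stated assumptions: a table is modeled as a list of dictionaries, the function name `holds` and the sample rows are hypothetical, and the check only tells you whether a dependency holds in the given rows, not whether it is a rule of the design.

```python
def holds(rows, lhs, rhs):
    """Return True if the attributes in lhs determine the attributes in rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # The same left-hand side must always map to the same right-hand side.
        if seen.setdefault(key, value) != value:
            return False
    return True

enrollments = [
    {"student_id": 1, "student_name": "Asha", "course": "DBMS"},
    {"student_id": 1, "student_name": "Asha", "course": "OS"},
    {"student_id": 2, "student_name": "Ravi", "course": "DBMS"},
]

id_determines_name = holds(enrollments, ["student_id"], ["student_name"])
id_determines_course = holds(enrollments, ["student_id"], ["course"])
print(id_determines_name, id_determines_course)
```

Here student_id determines student_name but not course, which is the kind of observation that normalization uses to decide how a table should be split.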

Database Design Lifecycle

The database design lifecycle goes something like this:

Lifecycle of Database Design

1. Requirement Analysis

It's very crucial to understand the requirements of our application so that you can think in productive terms and apply appropriate integrity constraints to maintain data integrity & consistency.

2. Logical & Physical Design

This is the actual design phase that involves various steps that are to be taken while
designing a database. This phase is further divided into two stages:

Logical Data Model Design: This phase produces a high-level design of the database,
based on the requirements gathered initially, that structures & organizes the data. A
high-level overview of the database is made on paper without considering the physical
level design. The phase proceeds by identifying the kind of data to be stored and the
relationships that will exist among those data.
Identifying entities, key attributes & the constraints to be implemented is the core
activity of this phase. It involves techniques such as data modeling to visualize
data and normalization to prevent redundancy.
Physical Design of Data Model: This phase involves the implementation of the logical
design made in the previous stage. All the relationships among data and integrity
constraints are implemented to maintain consistency & generate the actual database.

3. Data Insertion and testing for various integrity Constraints

Finally, after implementing the physical design of the database, we're ready to insert
data & test the integrity constraints. This phase involves testing the database to see whether
anything was left out or needs to be added, & then integrating it with the target
application.

Logical Data Model Design

The logical data model design defines the structure of data and what relationship exists
among those data. The following are the major components of the logical design:

1. Data Models: Data modeling is a visual modeling technique used to get a high-level
overview of our database. Data models help us understand the needs and requirements of
our database by defining the design of our database through diagrammatic representation.
Ex: model, Network model, Relational Model, object-oriented data model.

Data Models

2. Entity: Entities are objects in the real world that have certain properties; these
properties are referred to as the attributes of that entity. There are 2 types of entities:
strong and weak. A strong entity has a key attribute that uniquely identifies it. A weak
entity has no key attribute of its own: its existence depends on one specific strong entity,
and it has full (total) participation in the relationship with that entity.

Weak entity example: Loan -> A loan is given to a customer (taking a loan is optional for the
customer) & the loan is identified by the customer_id of the customer to whom it is granted.

3. Relationships: How data is logically related to each other defines the relationship of
that data with other entities. In simple words, the association of one entity with another is
defined here.

A relationship can be further categorized into - unary, binary, and ternary relationships.

Unary: In this, the associating entity & the associated entity are the same entity type. Ex:
an Employee manages other Employees, and a Student who is given the post of monitor
is still a Student who monitors other Students.
Binary: This is a very common relationship that you will come across while designing a
database.
Ex: Student is enrolled in courses, Employee is managed by different managers, One
student can be taught by many professors.
Ternary: In this, we have 3 entities involved in a single relationship. Ex: an employee
works on a project for a client. Note that, here we have 3 entities: Employee, Project &
Client.

4. Attributes: Attributes are nothing but properties of a specific entity that define its
behavior. For example, an employee can have unique_id, name, age, date of birth (DOB),
salary, department, Manager, project id, etc.

5. Normalization: After all the entities are put in place and the relationships among data are
defined, we need to look for loopholes or possible ambiguities that may arise as a result of
CRUD operations, in order to prevent anomalies such as insertion, update, and
deletion anomalies.

Data Normalization is a basic procedure defined for databases to eliminate such anomalies
& prevent redundancy.
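
As a rough illustration of how normalization removes redundancy, the sketch below decomposes a flat table on the dependency dept -> dept_city. The table contents and names are hypothetical and are used only for illustration.

```python
# Unnormalized: the department's city is repeated on every employee row.
employees_flat = [
    {"emp_id": 1, "name": "Asha", "dept": "Sales", "dept_city": "Pune"},
    {"emp_id": 2, "name": "Ravi", "dept": "Sales", "dept_city": "Pune"},
    {"emp_id": 3, "name": "Mina", "dept": "HR", "dept_city": "Delhi"},
]

# Decompose on the dependency dept -> dept_city.
departments = {row["dept"]: row["dept_city"] for row in employees_flat}
employees = [{k: row[k] for k in ("emp_id", "name", "dept")} for row in employees_flat]

# Moving Sales now touches exactly one place, so rows cannot disagree.
departments["Sales"] = "Mumbai"
cities_for_sales = {departments[e["dept"]] for e in employees if e["dept"] == "Sales"}
```

In the flat table the same move would require updating every Sales row, and missing one of them would produce an update anomaly.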

An Example of Logical Design

Logical Design Example

Physical Design
The main purpose of the physical design is to actually implement the logical design that is,
show the structure of the database along with all the columns & their data types, rows,
relations, relationships among data & clearly define how relations are related to each other.

Following are the steps taken in physical design

Step 1: Entities are converted into tables or relations that consist of their properties
(attributes)

Step 2: Apply integrity constraints: establish foreign key, unique key, and composite key
relationships among the data, and apply any other required constraints.

Step 3: Entity names are converted into table names, property names are translated into
attribute names, and so on.

Step 4: Apply normalization & modify as per the requirements.

Step 5: Final schemas are defined based on the entities & attributes derived in logical
design.
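
The steps above can be sketched with Python's built-in sqlite3 module. This is only one possible mapping, under the assumption of a small schema with a customer, a loan (the weak entity from the earlier example), and an employee with a manager; all table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
-- Weak entity: a loan is identified only together with its owning customer.
CREATE TABLE loan (
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id) ON DELETE CASCADE,
    loan_no     INTEGER NOT NULL,
    amount      REAL NOT NULL CHECK (amount > 0),
    PRIMARY KEY (customer_id, loan_no)
);
-- Unary relationship: an employee is managed by another employee.
CREATE TABLE employee (
    emp_id     INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    manager_id INTEGER REFERENCES employee(emp_id)
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Asha')")
conn.execute("INSERT INTO loan VALUES (1, 1, 5000.0)")
conn.execute("DELETE FROM customer WHERE customer_id = 1")
remaining_loans = conn.execute("SELECT COUNT(*) FROM loan").fetchone()[0]
```

The composite primary key and the cascading delete express the existence dependency of the weak entity: once the customer row is removed, its loans go with it.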

Physical Design

Conclusion
In conclusion, a good database design is an essential part of a strong database
management system (DBMS). It provides the basis for data governance, data storage, and
data retrieval. The quality of a database has a direct impact on a system’s overall
performance and dependability. It is important to consider data organization,
standardization, performance, integrity, and more when designing a database to meet the
needs of your organization and your users.


Conceptual Database Design


Conceptual database design is the process of identifying the essential data elements,
relationships, and constraints in a data model, which represents a particular
organization's business requirements. The conceptual design stage is the first step in
the database design process, which precedes the logical and physical design stages.
In this article, we will discuss the conceptual database design, its objectives, its
process, and the key components of a conceptual data model.

Objectives of Conceptual Database Design

The primary objective of conceptual database design is to create a high-level data
model that reflects the business requirements and provides a clear understanding of
the data elements, relationships, and constraints involved. This data model serves as
a blueprint for the logical and physical database design stages. The key objectives of
conceptual database design are as follows:

1 of 3 13-10-2025, 21:52
Conceptual Database Design [Link]

Identify the entities and their attributes: Entities are objects or
concepts that exist in the real world and can be distinguished from each
other. Attributes are the properties or characteristics of the entities. The
first objective of conceptual database design is to identify the entities and
their attributes that are relevant to the organization's business
requirements.

Define the relationships: Relationships are the associations between
entities. The second objective of conceptual database design is to define
the relationships between the identified entities. Relationships can be
one-to-one, one-to-many, or many-to-many.

Establish the constraints: Constraints are the rules that govern the
relationships between entities. The third objective of conceptual database
design is to establish the constraints between entities, which ensure data
consistency and integrity.

Process of Conceptual Database Design

The process of conceptual database design involves the following steps:

Requirements gathering: The first step in conceptual database design
is to gather the business requirements from the stakeholders. This
involves identifying the data elements, relationships, and constraints that
are essential to the organization's business requirements.

Entity-relationship modeling: The second step in conceptual database
design is to create an entity-relationship (ER) model, which represents the
entities, attributes, and relationships between the entities. The ER model
is a graphical representation of the data elements and their relationships.

Normalization: The third step in conceptual database design is to
normalize the ER model, which ensures that the data is organized
efficiently and reduces data redundancy.

Review and feedback: The fourth step in conceptual database design is
to review the ER model with the stakeholders and incorporate their
feedback into the design.


Components of Conceptual Data Model

The key components of a conceptual data model are as follows:

Entities: Entities are objects or concepts that exist in the real world and
can be distinguished from each other. Examples of entities include
customers, orders, products, and employees.

Attributes: Attributes are the properties or characteristics of the entities.
Examples of attributes include name, address, date of birth, and product
code.

Relationships: Relationships are the associations between entities.
Examples of relationships include a customer placing an order, an
employee managing a department, and a product belonging to a category.

Cardinality: Cardinality is the number of instances of an entity that can
be associated with instances of another entity. Examples of cardinality
include one-to-one, one-to-many, and many-to-many relationships.

Constraints: Constraints are the rules that govern the relationships
between entities. Examples of constraints include referential integrity,
which ensures that a foreign key value in one table matches a primary key
value in another table, and uniqueness, which ensures that a field value is
unique within a table.
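
The two constraints named above, referential integrity and uniqueness, can be seen in action with a small sketch. It uses Python's built-in sqlite3 module, and the customers and orders tables are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to enforce foreign keys
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL UNIQUE)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "customer_id INTEGER NOT NULL REFERENCES customers(id))"
)
conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")

def rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

orphan_order = rejected("INSERT INTO orders VALUES (10, 99)")  # no customer 99
duplicate_email = rejected("INSERT INTO customers VALUES (2, 'a@example.com')")
valid_order = rejected("INSERT INTO orders VALUES (11, 1)")
```

The first two statements are refused by the database itself, while the third is accepted. Declaring the rules in the schema means every application that writes to the tables is held to them.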

Conclusion
Conceptual database design is an essential process in database development, as it
lays the foundation for the logical and physical design stages. The objectives of
conceptual database design are to identify the entities and their attributes, define the
relationships, and establish the constraints. The process of conceptual database
design involves requirements gathering, entity-relationship modeling, normalization,
and review and feedback.

Overview of the C++ Language Binding in the ODMG Standard [Link]

Overview of the C++ Language Binding in the ODMG Standard

Introduction
Diving into the world of data management and modeling can be a complex task,
especially when dealing with standards like the Object Data Management Group
(ODMG). Did you know that ODMG provides an essential standard for object-oriented
database systems, including a C++ language binding? This article will guide you
through an easy-to-understand overview of this very aspect of ODMG, highlighting its
key features such as ODL constructs and transactions.

Overview of the ODMG Standard

The ODMG standard, developed by the Object Data Management Group (ODMG),
provides a framework for managing object-oriented databases. It includes the Object
Definition Language (ODL) and Object Manipulation Language (OML), which are used
to define objects and manipulate them within an object-oriented database
management system.

Object Data Management Group (ODMG)

The Object Data Management Group (ODMG) is a consortium composed of
several prominent object database and object-relational vendors. ODMG aims to
establish standards for programming-language-centric data management.

By providing specifications for C++, Java, and Smalltalk, it seeks to bridge the gap
between object-oriented programming languages and databases. The ODMG standard
hinges on the premise that integrating the database with the client language helps
streamline application development, a game-changing paradigm in modern computing
environments.

Purpose and Importance of the Standard

The purpose of the ODMG standard is to provide a consistent and standardized way of
managing object-oriented data in databases. It aims to define a set of standards and
specifications that can be implemented by various object-oriented database systems,
allowing for interoperability and portability across different platforms and
programming languages.

1 of 5 13-10-2025, 22:06

This standard plays a crucial role in data modeling and management within object-
oriented databases. By defining an Object Definition Language (ODL) and an Object
Manipulation Language (OML), it provides a clear syntax and semantics for creating,
manipulating, querying, and deleting objects in the database.

The importance of the ODMG standard lies in its ability to bridge the gap between
application development using object-oriented programming languages like C++ or
Java, and underlying databases that store persistent data.

Object Definition Language (ODL)

The Object Definition Language (ODL) is a key component of the ODMG standard,
which plays a crucial role in object-oriented database systems. It serves as a
declarative portion of the ODMG specification that allows developers to define objects
and their relationships within an object-oriented database.

By using ODL, developers can specify the structure, behavior, and constraints of
objects in a concise manner. This includes defining classes, attributes, methods,
inheritance hierarchies, and associations between objects.

With ODL, developers can easily model complex data structures and implement them
in an object-oriented database system. The language provides a standardized syntax
for representing various constructs related to object modeling.

By adhering to the ODL specifications, different implementations of object-oriented
databases can ensure interoperability and compatibility across platforms.

Object Manipulation Language (OML)

The Object Manipulation Language (OML) is a key component of the C++ language
binding in the ODMG standard. It allows users to perform various operations on
objects, such as creating, naming, manipulating, and deleting them.

OML provides a set of commands and syntax for interacting with objects stored in an
object-oriented database management system (OODBMS). This includes features like
transaction support for ensuring data consistency and integrity during updates.

C++ Language Binding

The C++ language binding in the ODMG Standard provides a seamless integration
between the powerful C++ programming language and object-oriented database
systems. Discover how ODMG standardizes object creation, manipulation, and deletion
in this overview.

Mapping ODL Constructs to C++ Constructs

The C++ language binding in the ODMG Standard involves mapping ODL constructs to
C++ constructs. This allows users to utilize the power of C++ programming language
for object-oriented data management. Here is an overview of how ODL constructs are
mapped to C++ constructs:

Object Definition Language (ODL) declarations are mapped to C++ class
definitions. This means that objects and their attributes are defined using
C++ classes.

Relationships between objects, such as associations and aggregations, are
represented using references between C++ objects. The binding defines
smart-reference templates such as d_Ref<T> for this purpose, so that
relationships to persistent objects can be navigated much like pointers.

Inheritance hierarchies defined in ODL are implemented using C++
inheritance syntax. This allows for code reuse and polymorphism in the C++
codebase.

ODL methods, which define behaviors associated with objects, are
implemented as member functions within the corresponding C++ classes.
This enables object-specific operations to be performed using familiar C++
syntax.

Data types defined in ODL, such as strings, integers, or floating-point
numbers, are mapped to equivalent data types available in the C++
programming language. This ensures compatibility and seamless
integration between ODMG-compliant databases and C++ applications.


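ODL collections, such as sets or lists, are mapped to collection template
classes. The binding defines its own templates for this (for example d_Set<T>,
d_Bag<T>, and d_List<T>), which are used in much the same way as the container
classes of the C++ Standard Library. This allows multiple objects to be stored
and retrieved within a single data structure.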

C++ Class Library for Object Manipulation

The C++ class library provides the integration between the C++ programming language and the
ODMG standard, making it easier for programmers to work with object-oriented data
modeling and management. The C++ class library also supports transactions,
ensuring that changes made to objects can be committed or rolled back as needed.

Object Creation, Naming, Manipulation, and Deletion

The C++ language binding in the ODMG Standard provides a set of
constructs and functionalities for object creation, naming, manipulation,
and deletion.

With the C++ class library provided by the ODMG Standard, developers
can easily create new objects in their C++ applications.

The naming of objects is also supported, allowing developers to assign
unique names to objects for easy referencing and identification.

Object manipulation operations, such as updating attributes or invoking
methods on objects, can be performed using the provided C++ constructs.

When an object is no longer needed, it can be deleted using the
appropriate functions provided by the ODMG Standard.

Transactions

Transactions play a crucial role in the C++ language binding of the ODMG Standard.
In this context, transactions refer to a set of operations that are executed as a single
logical unit, ensuring consistency and integrity of data.

Transactions provide atomicity, which means that either all the operations within a
transaction are completed successfully or none of them is applied at all. This helps in
maintaining data integrity by preventing partial updates or inconsistent states.

Transactions also provide durability by ensuring that once committed, changes made
during the transaction persist even in case of failures. With the C++ language binding
in the ODMG Standard, developers can easily work with transactions using a
well-defined API and perform actions such as starting a transaction, committing changes,
or rolling back if needed.
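
The begin, commit, and rollback cycle described above is not specific to ODMG. As a hedged illustration of atomicity, the sketch below uses Python's built-in sqlite3 module rather than an ODMG binding; the account table and the transfer helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(src, dst, amount):
    """Move amount between accounts as one unit; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(1, 2, 30)       # both updates are applied
failed = transfer(1, 2, 500)  # the debit violates the CHECK, so the credit is undone too
balances = dict(conn.execute("SELECT id, balance FROM account"))
```

After the failed transfer the balances are the same as before it started, which is the all-or-nothing behavior the text describes.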

Conclusion
The C++ language binding in the ODMG Standard provides an efficient and powerful
way to interact with object-oriented databases. By mapping ODL constructs to C++
constructs and providing a comprehensive class library for object manipulation,
developers are able to easily create, name, manipulate, and delete objects.

Additionally, transactions ensure data integrity and consistency within the database.
Overall, the C++ language binding in the ODMG Standard is a valuable tool for anyone
working with object-oriented database systems.

FAQs
1. What is the C++ language binding in the ODMG standard?

The C++ language binding in the ODMG (Object Data Management Group) standard is
a set of specifications and guidelines that define how C++ programming language can
be used to implement object-oriented databases.

2. How does the C++ language binding in the ODMG standard work?

The C++ language binding provides a set of classes, interfaces, and methods that
allow developers to interact with object-oriented databases using C++. It includes
features such as object persistence, query capabilities, and transaction management.

3. What are the benefits of using the C++ language binding in the ODMG
standard?

Using the C++ language binding allows developers familiar with C++ to leverage their
existing skills and knowledge for building applications that interact with object-
oriented databases. It provides a standardized approach for database interaction,
which promotes interoperability and code reusability.

4. Is knowledge of the ODMG standard necessary to use the C++ language binding?

While having an understanding of the overall concepts and principles of object-oriented
databases would be helpful, it is not strictly necessary to have in-depth
knowledge of the ODMG standard specifically. The documentation provided with the
C++ language binding should provide all necessary information for utilizing it
effectively.

ROHINI COLLEGE OF ENGINEERING & TECHNOLOGY

5. QUERY PROCESSING IN DBMS.

Query processing is the set of activities performed to extract data from the database.
Fetching the data takes several steps. The steps involved are:
1. Parsing and translation
2. Optimization
3. Evaluation

The query processing works in the following way:

Parsing and Translation
o The scanning, parsing, and validating module produces an internal representation of the
query. The query optimizer module devises an execution plan, which is the execution
strategy to retrieve the result of the query from the database files.
o A query typically has many possible execution strategies differing in performance, and
the process of choosing a reasonably efficient one is known as query optimization.
o The code generator generates the code to execute the plan. The runtime database
processor runs the generated code to produce the query result.
o Relational algebra is well suited for the internal representation of a query.

CS8492-DATABASE MANAGEMENT SYSTEMS

Translation begins with the parser. When a user executes a query, the parser
generates the internal form of the query: it checks the syntax of the query and verifies
that the relations and attributes named in the query exist in the database. The parser
creates a tree representation of the query, known as a 'parse tree', which is then
translated into relational algebra. During this translation, every view used in the
query is replaced by its definition.
It is done in the following steps:
Step-1:
Parser: During the parse call, the database performs the following checks: syntax check,
semantic check, and shared pool check (refer to the detailed diagram):
1. Syntax check – verifies that the SQL statement is syntactically valid. Example:
SELECT * FORM employee
Here the misspelling of FROM is reported by this check.
2. Semantic check – determines whether the statement is meaningful. Example: a
query that names a table which does not exist is rejected by this check.
3. Shared pool check – every statement is assigned a hash code. This check
looks for that hash code in the shared pool; if the code already exists there,
the database skips the additional optimization steps and goes on to execution.
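
The difference between the syntax check and the semantic check can be observed with any SQL engine. The sketch below uses Python's built-in sqlite3 module as a stand-in; the shared pool check is specific to some database products and is not shown. The helper name `check` is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_name TEXT, salary INTEGER)")

def check(sql):
    """Return 'ok' or the error message the engine reports while preparing sql."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.OperationalError as exc:
        return str(exc)

syntax_result = check("SELECT * FORM employee")    # misspelled FROM
semantic_result = check("SELECT * FROM employes")  # table does not exist
valid_result = check("SELECT * FROM employee")
```

The first statement fails with a syntax error before any table is looked up, whereas the second is well formed and fails only because the named table is missing.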

Hard Parse and Soft Parse –

If the query is new and its hash code does not exist in the shared pool, the
query has to pass through additional steps; this is known as hard parsing. If the hash
code already exists, the query skips those steps and passes directly to the
execution engine (refer to the detailed diagram). This is known as soft parsing.
A hard parse includes the following steps: optimization and row source generation.
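
The hard parse and soft parse distinction can be pictured as a cache lookup. The class below is a toy model only, assuming the statement text is hashed with SHA-256 and the plan is a placeholder string; it is not how any particular database implements its shared pool.

```python
import hashlib

class PlanCache:
    """Toy stand-in for a shared pool: maps a statement hash to a stored plan."""

    def __init__(self):
        self.plans = {}
        self.hard_parses = 0
        self.soft_parses = 0

    def parse(self, sql):
        key = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        if key in self.plans:
            self.soft_parses += 1  # reuse the stored plan
        else:
            self.hard_parses += 1  # optimize and generate the row source
            self.plans[key] = f"plan for: {sql}"
        return self.plans[key]

cache = PlanCache()
for statement in ["SELECT * FROM employee", "SELECT * FROM employee", "SELECT * FROM dept"]:
    cache.parse(statement)
```

Running the same statement text twice costs one hard parse and one soft parse, which is why reusing identical statement text tends to reduce parsing overhead.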

Step-2:
Optimizer: The database must perform a hard parse at least once for every
unique DML statement, and it performs optimization during this parse. The database does not
optimize DDL unless it includes a DML component, such as a subquery, that requires optimization.

Optimization is a process in which multiple execution plans that satisfy a query are examined
and the most efficient plan is selected for execution.
The database catalog stores the execution plans, and the optimizer then passes the
lowest-cost plan on for execution.
Step-3:
Execution Engine: Finally, the execution engine runs the query and displays the required result.
The working of query processing can be understood through the following
example:
Suppose a user executes a query. As we have learned, there are various methods of
extracting data from the database. Suppose that, in SQL, a user wants to fetch the names of the
employees whose salary is greater than 10000. The following query does this:
SELECT EMP_NAME FROM EMPLOYEE WHERE SALARY>10000;
To make the system understand the user query, it needs to be translated into
relational algebra. This query can be written in relational algebra form as:

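o πEmp_Name(σsalary>10000 (Employee))
o πEmp_Name(σsalary>10000 (πEmp_Name, salary(Employee)))
The selection needs the salary attribute, so any projection applied before it must keep salary;
projecting onto Emp_Name alone first would leave nothing to select on.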
After translating the given query, we can execute each relational algebra operation by using
different algorithms. So, in this way, a query processing begins its working.
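
To make the translation concrete, the sketch below evaluates the query as a composition of selection and projection over an in-memory relation. The `select` and `project` helpers and the sample rows are assumptions for illustration; note that the selection needs the salary attribute, so any projection applied before it must retain salary.

```python
def select(rows, predicate):
    """Relational selection (sigma): keep rows that satisfy predicate."""
    return [row for row in rows if predicate(row)]

def project(rows, attrs):
    """Relational projection (pi): keep only attrs and drop duplicate tuples."""
    result = []
    for row in rows:
        narrowed = {a: row[a] for a in attrs}
        if narrowed not in result:
            result.append(narrowed)
    return result

# Hypothetical contents of the Employee relation.
employee = [
    {"emp_name": "Asha", "salary": 12000},
    {"emp_name": "Ravi", "salary": 9000},
    {"emp_name": "Mina", "salary": 15000},
]
high = lambda row: row["salary"] > 10000

# Select first, then project onto the name.
plan_a = project(select(employee, high), ["emp_name"])
# Project early (keeping salary), select, then project onto the name.
plan_b = project(select(project(employee, ["emp_name", "salary"]), high), ["emp_name"])
```

Both orderings return the same names, which is what makes them interchangeable candidates for the optimizer.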
Evaluation
In addition to translating the query into relational algebra, the system must annotate the
translated relational algebra expression with instructions specifying how to evaluate
each operation. Thus, after translating the user query, the system executes a query evaluation
plan.
Query Evaluation Plan
o In order to fully evaluate a query, the system needs to construct a query evaluation plan.
o A query evaluation plan defines a sequence of primitive operations used for evaluating a
query. The query evaluation plan is also referred to as the query execution plan.
o A query execution engine is responsible for generating the output of the given query. It
takes the query execution plan, executes it, and finally makes the output for the user
query.
Optimization

o The cost of query evaluation can vary for different types of queries. Because the
system is responsible for constructing the evaluation plan, the user does not need to
write the query in its most efficient form.
o Usually, a database system generates an efficient query evaluation plan, which minimizes
its cost. This task is performed by the database system and is known as Query
Optimization.
o For optimizing a query, the query optimizer should have an estimated cost analysis of
each operation. It is because the overall operation cost depends on the memory
allocations to several operations, execution costs, and so on.
Finally, after selecting an evaluation plan, the system evaluates the query and produces the
output of the query.
Example:
SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > (SELECT MAX (SALARY) FROM
EMPLOYEE WHERE DNO=5);
The inner block
(SELECT MAX (SALARY) FROM EMPLOYEE WHERE DNO=5)
o Translated into: ℱ MAX SALARY (σDNO=5(EMPLOYEE)), where ℱ denotes the aggregate function
operator (MAX is an aggregate, not a projection).
The outer block
SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > C
o Translated into: ∏ LNAME, FNAME (σSALARY>C (EMPLOYEE))
(C represents the result returned from the inner block.)
o The query optimizer would then choose an execution plan for each block.
o The inner block needs to be evaluated only once (uncorrelated nested query).
o It is much harder to optimize the more complex correlated nested queries.
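
Because the inner block does not refer to the outer query, it can be evaluated once and its result reused as the constant C. The sketch below shows this block-by-block evaluation with Python's built-in sqlite3 module; the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (lname TEXT, fname TEXT, salary INTEGER, dno INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [("Rao", "Asha", 40000, 5), ("Iyer", "Ravi", 30000, 5),
     ("Shah", "Mina", 55000, 4), ("Das", "Anil", 25000, 1)],
)

# Inner block: evaluated once, because it does not reference the outer query.
c = conn.execute("SELECT MAX(salary) FROM employee WHERE dno = 5").fetchone()[0]

# Outer block: uses the constant c returned by the inner block.
two_step = conn.execute(
    "SELECT lname, fname FROM employee WHERE salary > ?", (c,)
).fetchall()

# The original nested form, for comparison.
nested = conn.execute(
    "SELECT lname, fname FROM employee "
    "WHERE salary > (SELECT MAX(salary) FROM employee WHERE dno = 5)"
).fetchall()
```

The two-step evaluation and the nested statement return the same rows. A correlated subquery could not be split this way, since its inner block would depend on the current outer row.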
External Sorting
It refers to sorting algorithms that are suitable for large files of records on disk that do not fit
entirely in main memory, such as most database files.
Sorting is needed to implement:
o ORDER BY.
o Sort-merge algorithms for JOIN and other operations (UNION, INTERSECTION).
o Duplicate elimination algorithms for the PROJECT operation (DISTINCT).
A typical external sorting algorithm uses a sort-merge strategy:
Sort phase: Create small sorted sub-files (sorted sub-files are called runs).

Merge phase: The sorted runs are then merged. An N-way merge uses N memory buffers to
buffer input runs, and 1 block to buffer output. Select the 1st record (in the sort order) among
the input buffers, write it to the output buffer, and delete it from its input buffer. If the output
buffer is full, write it to disk. If an input buffer is empty, read the next block from the
corresponding run. E.g. 2-way
Sort-Merge
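
The two phases can be sketched in a few lines. In this simplified version, Python lists stand in for disk files and `run_size` stands in for the amount of data that fits in memory; both are assumptions made to keep the example self-contained.

```python
import heapq

def external_sort(records, run_size):
    """Sort records using sorted runs of at most run_size items, then merge them.

    Lists stand in for disk files here; a real implementation would write each
    run to disk and stream it back block by block.
    """
    # Sort phase: cut the input into chunks that fit in memory and sort each one.
    runs = [sorted(records[i:i + run_size]) for i in range(0, len(records), run_size)]
    # Merge phase: repeatedly take the smallest head element among all runs.
    return runs, list(heapq.merge(*runs))

runs, merged = external_sort([9, 4, 7, 1, 8, 2, 6, 3, 5], run_size=3)
```

With nine records and a run size of three, the sort phase produces three runs and the merge phase performs a 3-way merge over them.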

7. HEURISTIC-BASED QUERY OPTIMIZATION

In general, many different relational algebra expressions, and hence many different
query trees, can be equivalent; that is, they can represent the same query.

The query parser will typically generate a standard initial query tree to correspond to an
SQL query, without doing any optimization.
For example, for a SELECT-PROJECT-JOIN query, such as Q2, the initial tree is shown in
Figure. The CARTESIAN PRODUCT of the relations specified in the FROM clause is first
applied; then the selection and join conditions of the WHERE clause are applied, followed by the
projection on the SELECT clause attributes.
Such a canonical query tree represents a relational algebra expression that is very
inefficient if executed directly, because of the CARTESIAN PRODUCT (×) operations.
The heuristic query optimizer will transform this initial query tree into an equivalent final
query tree that is efficient to execute.
The optimizer must include rules for equivalence among relational algebra
expressions that can be applied to transform the initial tree into the final, optimized query tree.
First we discuss informally how a query tree is transformed by using heuristics, and then we
discuss general transformation rules and show how they can be used in an algebraic heuristic
optimizer.

1. Break up SELECT operations with conjunctive conditions into a cascade of SELECT
operations
2. Using the commutativity of SELECT with other operations, move each SELECT operation as
far down the query tree as is permitted by the attributes involved in the select condition
3. Using commutativity and associativity of binary operations, rearrange the leaf nodes of the
tree
4. Combine a CARTESIAN PRODUCT operation with a subsequent SELECT operation in the tree
into a JOIN operation, if the condition represents a join condition
5. Using the cascading of PROJECT and the commuting of PROJECT with other operations,
break down and move lists of projection attributes down the tree as far as possible by
creating new PROJECT operations as needed

6. Identify sub-trees that represent groups of operations that can be executed by a single
algorithm
Query "Find the last names of employees born after 1957 who work on a project named
'Aquarius'."
SQL
SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE > '1957-12-31';
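
The effect of moving selections down the tree can be seen by counting intermediate tuples. The sketch below evaluates a query of the same shape in two ways over small, made-up relations (a birth year replaces the full birth date to keep the data short): first as a Cartesian product followed by selection, then with the selections applied before the joins.

```python
from itertools import product

# Hypothetical relation contents.
employee = [{"ssn": i, "lname": f"E{i}", "byear": 1950 + i} for i in range(1, 11)]
project_rel = [{"pnumber": 1, "pname": "Aquarius"}, {"pnumber": 2, "pname": "Orion"}]
works_on = [{"essn": i, "pno": 1 + i % 2} for i in range(1, 11)]

def matches(e, w, p):
    """The full WHERE condition applied to one combined tuple."""
    return (p["pname"] == "Aquarius" and p["pnumber"] == w["pno"]
            and w["essn"] == e["ssn"] and e["byear"] > 1957)

# Canonical plan: full Cartesian product first, selection afterwards.
canonical_input = list(product(employee, works_on, project_rel))
canonical = sorted(e["lname"] for e, w, p in canonical_input if matches(e, w, p))

# Heuristic plan: apply each selection as early as possible, then join.
young = [e for e in employee if e["byear"] > 1957]
aquarius = [p for p in project_rel if p["pname"] == "Aquarius"]
step1 = [(w, p) for w in works_on for p in aquarius if w["pno"] == p["pnumber"]]
step2 = [(e, w, p) for e in young for (w, p) in step1 if w["essn"] == e["ssn"]]
optimized = sorted(e["lname"] for e, w, p in step2)
```

Both plans return the same last names, but the canonical plan materializes 200 combined tuples while the heuristic plan never holds more than a handful.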

Cost Components of Query Execution

The cost of executing the query includes the following components:
o Access cost to secondary storage.
o Storage cost.
o Computation cost.
o Memory usage cost.
o Communication cost.
Importance of Access cost
Out of the above five cost components, the most important is the secondary storage
access cost.
o The emphasis of cost minimization depends on the size and type of database
application.
o For example, in a smaller database the emphasis is on minimizing computation cost,
because most of the data in the files involved in the query can be stored completely in
main memory.
o For a large database, the main emphasis is on minimizing the access cost to secondary
storage.
o For a distributed database, the emphasis is on minimizing communication cost, because
many sites are involved in the data transfer.
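Cost functions for selection on a relation R, measured in block accesses. Here nBlocks(R) is
the number of blocks of R, SCA(R) is the selection cardinality of attribute A, bFactor(R) is
the blocking factor, and nLevelA(I) is the number of levels of index I on attribute A.
Linear Search :
– [nBlocks(R)/2], if the record is found.
– [nBlocks(R)], if no record satisfies the condition.
Binary Search :
– [log2(nBlocks(R))], if the equality condition is on a key attribute, because SCA(R) = 1 in this case.
– [log2(nBlocks(R))] + [SCA(R)/bFactor(R)] – 1, otherwise.
Equality condition on Primary key
– [nLevelA(I) + 1]
– [nLevelA(I) + 1] + [nBlocks(R)/2]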
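
The cost functions above are easy to compare once numbers are plugged in. The sketch below assumes the square brackets denote rounding up, and the figures used (2000 blocks, a selection cardinality of 50, a blocking factor of 30, a 2-level index) are invented for illustration.

```python
import math

def linear_search_cost(n_blocks, found_on_key):
    """Blocks read by a linear scan: half the file on a key hit, all of it otherwise."""
    return math.ceil(n_blocks / 2) if found_on_key else n_blocks

def binary_search_cost(n_blocks, sca=1, b_factor=1):
    """Blocks read by binary search on an ordered file.

    With sca == 1 (equality on a key) the second term vanishes.
    """
    return math.ceil(math.log2(n_blocks)) + math.ceil(sca / b_factor) - 1

def primary_index_equality_cost(n_levels):
    """Index levels traversed plus one block read for the record itself."""
    return n_levels + 1

n_blocks = 2000
costs = {
    "linear, key found": linear_search_cost(n_blocks, True),
    "linear, no match": linear_search_cost(n_blocks, False),
    "binary, key": binary_search_cost(n_blocks),
    "binary, non-key": binary_search_cost(n_blocks, sca=50, b_factor=30),
    "primary index": primary_index_equality_cost(2),
}
```

Under these assumptions a linear scan reads on the order of a thousand blocks, a binary search about a dozen, and the primary index three, which is why the optimizer favors indexed access paths when they are available.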

Cost functions for JOIN Operation

The join operation is the most time-consuming operation to process.
• An estimate of the size (number of tuples) of the file that results from the JOIN operation
is required to develop reasonably accurate cost functions for JOIN operations.
• The JOIN operation defines a relation containing the tuples that satisfy a specific predicate
F from the Cartesian product of two relations R and S.
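The definition of JOIN as a filtered Cartesian product can be written down directly. The sketch below is a naive illustration of that definition, not an efficient join algorithm; the relation contents and the helper name theta_join are hypothetical. It also shows why the product of the two tuple counts is an upper bound on the size of the result.

```python
from itertools import product

def theta_join(r, s, predicate):
    """Return the tuples of the Cartesian product R x S that satisfy F."""
    return [(a, b) for a, b in product(r, s) if predicate(a, b)]

# Hypothetical sample relations
R = [(1, "Aquarius"), (2, "Orion")]        # (PNUMBER, PNAME)
S = [("111", 1), ("222", 1), ("333", 2)]   # (ESSN, PNO)

# Predicate F: PNUMBER = PNO
joined = theta_join(R, S, lambda p, w: p[0] == w[1])

# The result can never be larger than the Cartesian product itself.
upper_bound = len(R) * len(S)
print(len(joined), "of at most", upper_bound, "tuples")
```

Here three of the six candidate pairs satisfy the predicate; a size estimate of this kind is what a cost function for JOIN needs as input.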

What is Heuristic Optimization in DBMS?

Last Updated : 23 Jul, 2025

Database management systems (DBMS) use optimization techniques to produce better-performing
queries and improve overall system performance. Among those techniques, heuristic
optimization is one of the most prominent: it relies on rules of thumb and simple methods
rather than complex ones to optimize query execution plans quickly. This article describes
heuristic optimization in DBMS, covering its merits, its methods, and its effect on
overall database performance.

What is Heuristic Optimization?

In a DBMS, heuristic optimization is a procedure for arriving at a good execution plan
quickly and efficiently. Unlike exhaustive optimization methods, which enumerate all
feasible plan alternatives and evaluate each of them, heuristic optimization applies
rules derived from empirical knowledge, together with approximate algorithms, to speed
up the process. By applying heuristics, the DBMS optimizer converges on an efficient
query execution plan while keeping the computational cost of optimization low.

Key Components of Heuristic Optimization

1. Cost-Based Heuristics: Cost estimation is a critical process in a DBMS, and heuristic
strategies are applied to it. The cost of each candidate query execution plan is estimated
from statistics on the data distribution, system parameters, and the physical
characteristics of the machine. For example, row-based cost estimation and cardinality
estimation are the principal techniques used to determine the cost associated with each
plan. By favoring plans with lower estimated cost, heuristic optimization uses resources
judiciously and lets the query execute with good overall performance.

2. Join Order Heuristics: When a query involves multiple tables, choosing a good join order
is necessary to reduce execution time. Join ordering heuristics include greedy algorithms,
dynamic programming approaches, and others. These methods help the system explore
different join orders and select the most efficient one. In addition, cardinality
estimation techniques predict the size of the intermediate results that joins produce,
so that better decisions can be made about the order of joins.
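A greedy join-ordering heuristic can be sketched as follows. This is only one possible rule, assumed here for illustration: start from the smallest table and repeatedly add the table that yields the smallest estimated intermediate result. The table sizes and join selectivities are hypothetical statistics, and a pair with no listed selectivity is treated as a cross product.

```python
def greedy_join_order(cardinalities, selectivities):
    """Pick a join order greedily.

    cardinalities: {table name: estimated row count}
    selectivities: {frozenset({a, b}): fraction of a x b kept by the join}
    Returns (order, estimated size of the final result).
    """
    remaining = dict(cardinalities)
    # Start with the smallest table.
    first = min(remaining, key=remaining.get)
    order = [first]
    size = remaining.pop(first)

    while remaining:
        def joined_size(table):
            # Apply every join predicate linking `table` to tables already joined.
            sel = 1.0
            for done in order:
                sel *= selectivities.get(frozenset((done, table)), 1.0)
            return size * remaining[table] * sel

        # Add the table that keeps the intermediate result smallest.
        nxt = min(remaining, key=joined_size)
        size = joined_size(nxt)
        order.append(nxt)
        del remaining[nxt]
    return order, size

# Hypothetical statistics
cards = {"EMPLOYEE": 10_000, "WORKS_ON": 50_000, "PROJECT": 100}
sels = {
    frozenset(("EMPLOYEE", "WORKS_ON")): 1 / 10_000,
    frozenset(("PROJECT", "WORKS_ON")): 1 / 100,
}
order, size = greedy_join_order(cards, sels)
print(order, round(size))
```

The greedy choice avoids the PROJECT x EMPLOYEE cross product, but because it never revisits an earlier decision it can miss the best order on other inputs; dynamic programming explores more orders at a higher planning cost.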

3. Index Selection Heuristics: Indexes accelerate query processing by making data retrieval
fast. Index selection heuristics choose, from the available indexes, the ones most useful
for executing a query. Factors such as the query predicates, their selectivity, and the
cost of accessing the index are considered when choosing an indexing strategy. With
queries planned using index selection heuristics, DBMS optimizers achieve better
performance and shorter overall response times.

4. Query Rewriting Heuristics: Heuristic optimization also rewrites and transforms
statements into semantically equivalent ones that are better suited to efficient
execution. Rules based on domain knowledge are used to restructure the query and improve
performance. Methods such as moving operations within the query and splitting a query into
parts reduce the data flowing between operations and lead to better execution strategies.
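One common rewriting, assumed here to be the kind of operation reordering the text has in mind, is to apply a selection before a join rather than after it. The sketch below compares the two forms on hypothetical data, counting how many tuple pairs the join has to examine in each case; the function names are illustrative only.

```python
def join_then_filter(employees, works_on, min_bdate):
    """Join first, then apply the birth-date condition to each pair."""
    pairs_examined = 0
    result = []
    for e in employees:
        for w in works_on:
            pairs_examined += 1
            if e["ssn"] == w["essn"] and e["bdate"] > min_bdate:
                result.append((e["lname"], w["pno"]))
    return result, pairs_examined

def filter_then_join(employees, works_on, min_bdate):
    """Rewritten form: apply the selection first, then join the survivors."""
    selected = [e for e in employees if e["bdate"] > min_bdate]
    pairs_examined = 0
    result = []
    for e in selected:
        for w in works_on:
            pairs_examined += 1
            if e["ssn"] == w["essn"]:
                result.append((e["lname"], w["pno"]))
    return result, pairs_examined

# Hypothetical sample data
employees = [
    {"ssn": "111", "lname": "Smith", "bdate": "1965-01-09"},
    {"ssn": "222", "lname": "Wong", "bdate": "1955-12-08"},
    {"ssn": "333", "lname": "English", "bdate": "1972-07-31"},
]
works_on = [
    {"essn": "111", "pno": 1},
    {"essn": "222", "pno": 1},
    {"essn": "333", "pno": 2},
]

before, cost_before = join_then_filter(employees, works_on, "1957-12-31")
after, cost_after = filter_then_join(employees, works_on, "1957-12-31")
print(before == after, cost_before, cost_after)
```

Both forms return the same rows, so the rewrite is semantically equivalent, but the second form feeds fewer tuples into the join.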

Challenges and Considerations

While heuristic optimization offers significant advantages in terms of speed and efficiency,
it is not without its challenges and considerations:

Suboptimality: Heuristic approaches can sometimes yield plans that execute worse than
those found by exhaustive optimization. In exchange, they avoid an exhaustive search
and rely only on broadly applicable rules, so they remain useful for many queries.
Cost Estimation Accuracy: The outcome of heuristic optimization relies on the
accuracy of the cost estimates, which in turn may be affected by the scale of the
data, query complexity, or system dynamics, among other factors.
Trade-offs: Heuristic optimization strikes a balance between optimality and
efficiency, which raises the question of how much plan quality to give up for
planning speed.

Conclusion
Heuristic optimization is an integral part of DBMS optimization and is used to handle
the complexity of query optimization efficiently. By applying heuristic methods to cost
estimation, join-order selection, index usage, and query rewriting, optimizers arrive at
an effective plan and improve query execution time. Although heuristic optimization
speeds up the optimization process, it has its own limits, so speed must be balanced
against plan optimality to achieve good database performance.

Query Execution Plan in SQL

Last Updated : 08 Sep, 2025

An execution plan is a roadmap that shows how SQL Server retrieves the data for a query.
It breaks down the exact steps—like which indexes to use, how tables are joined, and in
what order operations are performed. The query optimizer creates this plan, evaluates
multiple options, and chooses the most efficient one. Once generated, plans are stored in
the plan cache for reuse.

In SQL Server, execution plans can be viewed as Graphical, Text, or XML formats.

Types of Execution Plan in SQL

There are two types of Execution Plans in SQL:

1. Actual Execution Plan

The actual execution plan is produced after the query has been executed. It reflects the
real operations carried out by SQL Server, along with runtime performance details.

Generated after the query runs.
Includes runtime details like resource usage and warnings.
Shows the exact plan used by the Database Engine.
Displays the actual rows processed and other execution statistics.

2. Estimated Execution Plan

The estimated execution plan is created before the query executes. It represents the query
optimizer’s prediction of how the query will run, based on available statistics.

Generated before the query runs.
Represents the query optimizer’s prediction of execution steps.
Based on database statistics, schema, and indexes.
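The idea of an estimated plan, obtained without running the query, can be tried without SQL Server. The sketch below swaps in SQLite, whose EXPLAIN QUERY PLAN statement is a rough analogue and not the SQL Server feature described above; the table, index name, and helper function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (ssn TEXT, lname TEXT, bdate TEXT)")

def estimated_plan(sql):
    """Return the plan steps SQLite reports for `sql` without executing it."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each row holds the human-readable step description.
    return [row[3] for row in rows]

query = "SELECT lname FROM employee WHERE lname = 'Smith'"

# Without an index the only option is to scan the whole table.
plan_without_index = estimated_plan(query)

# After adding an index, the planner can search it instead.
conn.execute("CREATE INDEX idx_employee_lname ON employee (lname)")
plan_with_index = estimated_plan(query)

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan steps varies between SQLite versions, but the first plan is a table scan and the second names the index, which mirrors how an estimated plan depends on the schema and indexes available.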

Execution Plan Generation and Saving in SSMS

SQL Server can produce execution plans both before and after a query executes. Actual and
estimated execution plans can be obtained with the following steps:

Generation of Actual Execution Plans

The actual execution plan can be obtained in the following ways in SQL Server:

1. After writing the query, press Ctrl+M to include the actual execution plan, then
execute the query; the plan is generated along with the results.
2. Right-click in the query window and select ‘Include Actual Execution Plan’ from the
context menu, then execute the query.
3. Or select the ‘Include Actual Execution Plan’ icon directly from the toolbar, then
execute the query.

Generation of Estimated Execution Plans

An estimated execution plan can be obtained in the following ways in SQL Server:

1. After writing the query, press Ctrl+L, and the plan will be generated.
2. Right-click in the query window and select "Display Estimated Execution Plan" from
the context menu.
3. Or select the "Display Estimated Execution Plan" icon directly from the toolbar.

How to save a Query Execution plan?

After reviewing the plan produced for a query, you can save it for later analysis. SQL
Server Management Studio saves execution plans in files with the ".sqlplan" extension.

Steps to save an execution plan:

1. Go to the plan window and right-click.
2. Click on ‘Save Execution Plan As’.
3. Choose the folder or location where you want to save the execution plan, give the
plan a name, and click on ‘Save’.

Cost Based optimization

Last Updated : 15 Jul, 2025

Query optimization is the process of choosing the most efficient or most favorable way
of executing an SQL statement. It is both an art and a science of applying rules to
rewrite the tree of operators that is invoked in a query and to produce an optimal plan. A
plan is said to be optimal if it returns the answer in the least time or by using the least
space.

Cost-Based Optimization:
For a given query and environment, the Optimizer assigns a numerical cost to each step of
a possible plan and then combines these values to get a cost estimate for the plan or
possible strategy. After calculating the costs of all possible plans, the Optimizer
chooses the plan with the lowest cost estimate. For that reason, the Optimizer is
sometimes referred to as the Cost-Based Optimizer. Below are some of the features of
cost-based optimization-

1. Cost-based optimization is based on the cost of the query to be optimized.
2. A query can be executed along many paths, depending on the available indexes, sorting
methods, constraints, etc.
3. The aim of query optimization is to choose the most efficient path for implementing the
query, at the lowest possible cost, in the form of an algorithm.
4. The query Optimizer needs to estimate the cost of executing each algorithm so that
the most suitable one can be selected for an operation.
5. The cost of an algorithm also depends upon the cardinality of the input.
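The selection step described above can be sketched in a few lines. This is a minimal illustration, assuming that the cost of a plan is simply the sum of its step costs; the plan names and cost figures are hypothetical.

```python
def plan_cost(step_costs):
    """Combine the per-step estimates into one estimate for the plan."""
    return sum(step_costs)

def choose_plan(candidate_plans):
    """Return the name of the plan with the lowest estimated cost.

    candidate_plans: {plan name: [estimated cost of each step]}
    """
    return min(candidate_plans, key=lambda name: plan_cost(candidate_plans[name]))

# Hypothetical candidate plans with per-step cost estimates
plans = {
    "full scan + nested-loop join": [2000, 6000],
    "index lookup + nested-loop join": [3, 450],
    "full scan + hash join": [2000, 120],
}

best = choose_plan(plans)
print(best, plan_cost(plans[best]))
```

A real optimizer usually does not enumerate every plan this way; the number of candidates grows quickly with the number of tables, which is one reason cost-based optimization can itself be expensive.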

Cost Estimation:
To estimate the cost of the available execution plans or execution strategies, the
query tree is viewed and studied as a data structure that contains a series of basic
operations linked together to perform the query. The cost of an operation in the query
depends on its selectivity, that is, the proportion of the input that forms the output.
It is also important to know the expected cardinality of an operation's output, because
it forms the input to the next operation.
The cost of optimization of the query depends upon the following-

1. Cardinality-
Cardinality is the number of rows returned by performing the
operations specified by the query execution plan. The cardinality estimates must
be accurate, as they strongly affect the choice among execution plans.
2. Selectivity-
Selectivity refers to the proportion of rows that are selected. It depends on the
condition in the query: the rows that satisfy the condition are the ones selected.
The condition can be any predicate, depending upon the situation.
3. Cost-
Cost refers to the amount of resources the system spends to execute the query. The
measure of cost depends upon the work done or the number of resources used.
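Selectivity and cardinality are related by simple arithmetic, which the sketch below illustrates. It assumes selectivity is measured on a small sample of rows and then scaled to the table's row count; the sample rows, the row count of 10,000, and the function names are hypothetical.

```python
def selectivity(rows, condition):
    """Fraction of `rows` that satisfy `condition` (0.0 for an empty input)."""
    if not rows:
        return 0.0
    matching = sum(1 for row in rows if condition(row))
    return matching / len(rows)

def estimated_cardinality(table_row_count, sel):
    """Expected number of output rows for a selection with selectivity `sel`."""
    return round(table_row_count * sel)

# Hypothetical sample of an EMPLOYEE table
sample = [
    {"bdate": "1965-01-09"},
    {"bdate": "1955-12-08"},
    {"bdate": "1972-07-31"},
    {"bdate": "1949-03-02"},
]

sel = selectivity(sample, lambda row: row["bdate"] > "1957-12-31")
estimate = estimated_cardinality(10_000, sel)
print(sel, estimate)
```

The estimated output cardinality of this selection then becomes the input cardinality of whatever operation follows it in the plan.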

The first step is to use the ANALYZE TABLE COMPUTE STATISTICS SQL command to compute
table statistics. Use the DESCRIBE EXTENDED SQL command to inspect the statistics.

Table Statistics:
The table statistics can be computed for tables, partitions, and columns, and are as follows-

1. Total size (in bytes) of a table or table partitions.
2. Row count of a table or table partitions.
3. Column statistics like min, max, num_nulls, distinct_count, avg_col_len, max_col_len,
histogram.

ANALYZE TABLE COMPUTE STATISTICS SQL Command:

Cost-Based Optimization uses the statistics stored in a meta store (i.e. external catalog),
which are computed using the ANALYZE TABLE SQL command-

ANALYZE TABLE tableIdentifier partitionSpec? COMPUTE STATISTICS (NOSCAN | FOR COLUMNS identifierSeq)?

Depending on the variant, ANALYZE TABLE computes different statistics, i.e. of a table,
partitions, or columns-

ANALYZE TABLE with neither PARTITION specification nor FOR COLUMNS clause.
ANALYZE TABLE with PARTITION specification (but no FOR COLUMNS clause).
ANALYZE TABLE with FOR COLUMNS clause (but no PARTITION specification).

DESCRIBE EXTENDED SQL Command:

The statistics of a table, partition, or column (stored in a meta store) can be viewed
using the DESCRIBE EXTENDED SQL command-

(DESC | DESCRIBE) TABLE? (EXTENDED | FORMATTED)? tableIdentifier partitionSpec? describeColName?

Cost Components Of Query Execution:

The following are the cost components of the execution of a query-

1. Access cost to secondary storage-
This is the cost of searching for, reading, or writing data blocks that reside on
secondary storage, especially on disk. The cost of searching for records in a file
also depends upon the type of access structure that the file has.
2. Memory usage cost-
The cost of memory usage can be calculated simply from the number of memory
buffers that are needed for the execution of the query.
3. Storage cost-
The storage cost is the cost of storing any intermediate files (files that result from
processing the input but are not the final result) that are generated by the execution
strategy for the query.
4. Computational cost-
This is the cost of performing in-memory operations on the records in the data
buffers, such as searching for records, merging records, or sorting
records. This can also be called the CPU cost.
5. Communication cost-
This is the cost associated with sending the query and its
results from one place to another. It also includes the cost of transferring tables and
results between the various sites during query evaluation.

Issues In Cost-Based Optimization:

The following are the issues in cost-based optimization-

1. In cost-based optimization, the number of execution strategies that can be considered is
not fixed. It may vary based on the situation.
2. The process can be very time-consuming, and it does not always guarantee finding the
best strategy.
3. It is an expensive process.

Email Marketing: Meaning, Types, Process, Benefits and Drawbacks
Last Updated : 23 Jul, 2025

Email marketing involves sending commercial emails to promote business offerings to
existing and potential customers. It is a digital marketing strategy used to engage existing
customers and attract new ones. Effective emails have compelling subject lines,
personalized content, visuals, clear calls-to-action, and mobile optimization. Campaigns
promote updates, offers, events, and content to communicate the brand story. In this article,
let’s understand what email marketing is, along with its benefits and drawbacks.

Table of Content
Types of Email Marketing
Steps to do Email Marketing
Benefits of Email Marketing
Drawbacks of Email Marketing
Conclusion
Frequently Asked Questions (FAQs)

Key Takeaways:

Process: It involves sending commercial emails to customers or prospects to promote
business offerings and generate sales.
Purpose: It facilitates direct communication with targeted subscriber lists to nurture
leads and retain existing customers.
Key Metrics: Open rates, click-through rates, and conversion rates are tracked to
optimize performance.

What is Email Marketing?

Email marketing refers to a digital marketing strategy that uses email to promote business
offerings and build relationships with potential or existing customers. The core goal is
driving sales revenue through email communications. In email marketing, businesses create
customized email campaigns targeted at certain subscriber lists. For example, they may
send promotional newsletters or product updates to customers who have purchased before
or signed up to receive such emails. The business may also acquire new email list contacts
interested in their offerings to expand their subscriber base.

Each email campaign involves carefully crafting compelling subject lines and content that
speaks to the unique interests and needs of the recipients. Calls-to-action are integrated at
key points, guiding the next click. The business works to build trust and nurture ongoing
dialogues with its email subscribers over time. The success of email campaigns can be
measured by metrics like open rates, click-through rates on links, and conversion rates on
desired outcomes like purchases. Email marketing analytics provide insight into optimizing
messages and segments for improved results. When used correctly and following best
practices, email allows meaningful customer connections that may ultimately lead to sales.

Types of Email Marketing

Marketers have many choices for how to contact customers by email, but some kinds of
emails help a company more than others. The main types are as follows:

1. Promotional Emails: These are emails focused on promoting special offers, sales, new
products, or other commercial announcements to drive purchases and transactions. For
example, coupon emails, sale announcement emails, or new product launch emails. They
advertise the business's latest deals.

2. Newsletters: Newsletters are regular, recurring emails that provide new and updated
content like articles, company news, blog summaries, tips, or other useful information to
subscribers. Rather than directly promoting products, they aim to build engagement.

3. Welcome Emails: Welcome emails are some of the most important emails sent. They are
the first email contact when a person signs up and sets the tone of the subscriber
relationship. Well-crafted welcome emails introduce the business, highlight subscription
benefits, and start subscriber engagement.

4. Cart Abandonment Emails: When customers add items to an online shopping cart but
don't complete the purchase, cart abandonment emails remind them to return and check
out. These transactional emails recover lost sales from shoppers needing an extra prompt
to buy.

5. Customer Re-engagement Emails: These emails target subscribers who have been
inactive for some time by re-engaging with them in an attempt to bring them back for
repeat business. Tactics may include sending promo codes, linking to the newest content,
or showcasing recently added inventory.

6. Onboarding Drip Campaigns: These nurture new subscribers by sending helpful
orientation content over their first thirty, sixty, or ninety days. The onboarding series covers
topics like frequently asked questions, product tutorials, sizing guides, user community
details, or member benefits to aid in getting started.

7. Holiday or Event Emails: These capitalize on major holidays, events, or cultural moments
to send relevant communications. For example, Independence Day sales emails, Mother's
Day gift ideas emails, or event promotion emails around occasions like music festivals or
industry tradeshows. They tie into seasonal moments.

8. Ratings and Reviews Emails: These emails request customer reviews or star ratings
after a purchase. The feedback allows businesses to monitor satisfaction and improve
products. Review emails tend to see high open rates as customers want to share evaluative
input.

Process of Email Marketing

1. Define your Audience: Clearly define your target audience by developing customer
personas. Analyze your current customer base to determine key demographics like location,
age, income level, gender, occupation, etc. Group them by common interests and behaviors.
Get very specific in terms of their unique preferences and needs to shape content that
resonates with them.

2. Establish your Goals: Decide on the purpose and goals of your email campaigns. Are
you aiming to drive traffic, generate leads, increase sales, boost customer engagement, or
promote brand awareness? Set specific KPIs related to your objectives, such as email open
rates, click-through rates, conversion rates, revenue metrics, or subscribers gained.

3. Create your Email List: Build your list through methods like offering opt-in forms on your
website, blog, or social channels, capturing leads at in-person events and promotions, and
through strategic list acquisition and partnerships. Focus on acquiring email contacts within
your target personas. Incentivize subscribers.

4. Pick an Email Campaign Type: Select campaign categories that align with audience
preferences and business goals. Campaign types include promo emails, content
newsletters, win-back offers, post-purchase follow-ups, holiday themes, and more. Map a
campaign calendar to your KPIs with campaigns scheduled.

5. Make a schedule: Build an email cadence and systematic schedule for how often to send
emails to each segment—weekly, monthly, etc. Welcome new subscribers with an
onboarding drip series. Leverage automation tools to schedule recurring campaigns like
win-back offers. Maintain a sense of exclusivity and anticipation without fatigue.

6. Measure your Results: Link the email platform to Google Analytics and add campaign
UTM tracking to monitor performance. See what emails drove the most website traffic,
subscriber growth, and sales to double down on those while reworking laggards.

Benefits of Email Marketing

1. Boosted Brand Awareness: Regularly connecting with subscribers through value-driven
email campaigns is a proven way to grow meaningful awareness of your brand, offerings,
and what sets you apart. Emails that resonate with audiences in a cluttered inbox
successfully gain mindshare.

2. Cost-Effective Reach: Email is considered an extremely cost-effective marketing
channel, often with higher ROI than traditional print or direct mail campaigns. When using
email service provider tools, there is very little incremental spending associated with
adding more contacts and limited variable costs involved in scaling campaigns.

3. Driving Website Traffic: Calls-to-action within email campaigns can effectively direct
engaged subscribers to targeted pages on your website or online store. Things like
promotional offers, gated content previews, and newsletter highlights convert existing
awareness into tangible website visits.

4. Lead Generation: Email often sits at the top of the purchase funnel, moving subscribers
from awareness into consideration. Asking for a lead-generating action within emails, such
as downloading an educational whitepaper or eBook, subscribing to a service trial,
registering for a demo, etc., can capture key contact information on hot prospects.

5. Enhanced Customer Retention: Ongoing email nurturing beyond the initial sale or sign-
up helps retain customers longer. Transactional and promotional emails focused squarely
on existing purchasers or loyal members build satisfaction and brand affinity, improving
customer lifetime value.

6. Sales Growth: Calls-to-action that directly elicit desired conversion events, be it a
purchase, account sign-up, or service enrollment, directly generate incremental revenue
and pipeline velocity. Of all marketing channels, properly executed email marketing fuels
some of the highest customer conversion rates over time.

Drawbacks of Email Marketing

1. Reaching Inboxes is Hard: With so many emails sent, it can be difficult to have your
emails make it into subscriber inboxes instead of getting marked as spam or promotions.
Standing out will be a challenge.

2. Audience Burnout: If you send too many emails or emails that are not relevant or
valuable, subscribers will disengage, open fewer emails, and may even unsubscribe from
your list altogether. Preventing this requires continual optimization.

3. Time-Consuming to Create: Designing great-looking email templates with compelling
content takes extensive time and creative effort. For best results, dedicated staff may be
needed, which is an added expense.

4. Advanced Analytics requires Work: While email providers offer basic reporting,
integrating deeper web and customer analytics requires manually implementing additional
tracking tools that may be outside of their core capabilities.

5. Reliance on Tech Platforms: Executing email campaigns relies on third-party email
service providers. If their deliverability or functionality faces technical issues, your email
reliability may suffer through no direct fault of your own.

Conclusion
Email marketing can be super helpful for connecting with customers and growing a
business when done right. With the perfect foundation built on customer needs, creativity,
and constantly optimizing based on data, an email marketing program can be a game-
changer. By understanding the dynamics and employing best practices, businesses can
leverage the strengths of email marketing while mitigating its drawbacks. Ultimately, a
well-executed email strategy will have the potential to promote meaningful connections,
drive sales, and fortify brand loyalty.

Ailiate Marketing - Working, Types, Advantages &

Disadvantages
Last Updated : 23 Jul, 2025

Affiliate marketing is a way for brands and people to work together and make money.
Whether you're a company wanting to sell more or someone looking to earn money by
promoting products, knowing about affiliate marketing is important.

This article is here to help you understand affiliate marketing better and use it to your
advantage. You'll also learn about the advantages, such as making money without creating
a product, and the disadvantages, like the potential unpredictability of earnings.

Table of Content
What is Affiliate Marketing?
How does Affiliate Marketing Work?
Types of Affiliate Marketing
Who are Affiliate Marketers?
Affiliate Marketing Examples
Advantages of Affiliate Marketing
Disadvantages of Affiliate Marketing
How to Start Affiliate Marketing

What is Affiliate Marketing?

Affiliate marketing is defined as a performance-based marketing strategy in which
companies (called merchants or advertisers) collaborate with people or other organizations
(called affiliates). Affiliates use a variety of digital platforms, including blogs, social media,
email marketing, and websites, to promote the goods and services of the advertisers. The
advertiser provides them with special tracking links or codes so they can monitor their
marketing campaigns.

Geeky Takeaways:
The affiliate receives a commission or a portion of the money made when
customers click on these links created by the affiliate and complete the specified
activity, such as making a purchase, signing up, or downloading.
Affiliates are encouraged by this pay plan to sell the advertiser's products
successfully and increase website traffic and sales.

How does Affiliate Marketing Work?

Affiliate marketing is a performance-based marketing strategy that involves three main
parties: the merchant (also known as the 'advertiser' or 'retailer'), the affiliate (also
known as the 'publisher'), and the customer. Here's a simplified overview of how it works:

1. Advertiser/Merchant: This is the firm or enterprise that provides a good or service that
needs to be advertised. To determine which affiliate is responsible for bringing in a sale, they
provide affiliates with personalized tracking links or affiliate codes.

2. Affiliate: An affiliate is a person or a different company that collaborates with the
advertiser. Through a variety of marketing channels, including websites, blogs, social
media, email marketing, and other online platforms, they advertise the advertiser's goods
and services.

3. Consumers: These are the people who go to the advertiser's website by clicking on the
affiliate's special tracking link after viewing the affiliate's promotional content.

4. Conversion: A conversion occurs when a user completes a desired activity on the
advertiser's website, such as buying something, subscribing to a newsletter, or
completing a contact form.

5. Commission: In exchange for referring a transaction, the affiliate receives a commission
or a portion of the sale. Rates and compensation structures differ and are frequently
pre-agreed upon between the advertiser and affiliate.

Types of Affiliate Marketing

Affiliate marketing is available in a variety of formats and can be customised to fit a range
of niches and company types. Typical forms of affiliate marketing include the following:

1. Content Affiliate Marketing

Under this strategy, affiliates produce videos, articles, blogs, and reviews that contain
affiliate links to goods and services. The affiliate receives a commission if readers or
viewers click on these links and buy anything. This is a widely used technique among
writers and content producers.

2. Coupon and Deal Websites

Affiliates in this group concentrate on providing their audience with coupons, discounts,
and exclusive offers. The affiliate is paid when customers utilise these offers to make
purchases. Websites with coupons and deals are especially common in the e-commerce
industry.

3. Email Marketing

Some affiliates create email lists and use email marketing campaigns to offer goods and
services to their members. They include affiliate links in their emails, and the affiliate
receives income from subscribers who click and make purchases.

4. Social Media Affiliate Marketing

To advertise goods and services, affiliates use social media sites like Facebook, Instagram,
and YouTube. They include affiliate links in their descriptions or posts. Influencers
frequently employ this strategy.

5. Comparison and Review Websites

Affiliates build online resources that provide in-depth analyses and evaluations of different
goods and services in a particular market. When consumers click on the affiliate links in
their product reviews and make a purchase, the affiliate receives a commission.

6. Niche-Specific Affiliate Marketing

Certain affiliates concentrate on certain markets or sectors, such as technology, finance, or
health and wellness. They develop into authorities in the market they have selected and
educate their audience about pertinent goods and services.

7. Software and Apps Affiliate Marketing

In this strategy, affiliates market mobile or software apps. They could include affiliate links
for these items along with reviews, tutorials, and instructions.

8. Multi-Tier Affiliate Marketing

Under this approach, affiliates have the ability to suggest other affiliates, and they are paid
for both their own and their recruited affiliates' referrals. A multi-tiered commission system
is produced as a result.

9. Lead Generation Affiliate Marketing

The goal of lead generation affiliates is to obtain leads or prospective consumers for the
advertiser, as opposed to concentrating on sales. They receive payment for bringing in
prospects who, frequently through forms or sign-ups, indicate interest in a good or service.

10. Pay-Per-Click (PPC) and Search Engine Marketing (SEM)

Affiliates in this category advertise goods and services on search engines and social media
platforms by running sponsored advertisements. When people click on their
advertisements and complete a desired activity on the advertiser's website, they are paid a
commission.

Who are Affiliate Marketers?

Affiliate Marketers are people or organisations that advertise goods or services provided
by other businesses (often referred to as merchants or advertisers) in exchange for a
commission that is earned through lead generation or traffic generation. To put it briefly,
they serve as partners or middlemen in internet marketing.

Salary of an Affiliate Marketer

The salary of an affiliate marketer can vary widely based on factors like experience, niche,
platform used (e.g., blog, YouTube, social media), and the amount of time they invest. An
average salary of an Affiliate Marketer based on the level of experience can be given as:

Experience Level | Average Yearly Earnings (INR)
Beginner | 8-9 LPA
Intermediate | 13-16 LPA
Advanced | 18 LPA+

Affiliate Marketing Examples

Some examples of Affiliate Marketing are:

1. Blog Posts: A beauty blogger reviews a new skincare product and includes an affiliate
link to purchase the product online. When readers buy the product through this link, the
blogger earns a commission.
2. Social Media Influencers: A fitness influencer, for example, might share a post about
their favorite protein powder with a discount code provided by the company. Purchases
made with this code generate earnings for the influencer.
3. Email Newsletters: An email from a travel blogger could include recommendations for
travel gear with links to buy the items on an e-commerce site, earning the blogger a
commission on sales.
4. Product Review Websites: A tech review site might publish detailed reviews on the
latest smartphones with affiliate links to buy the phones from online retailers. The site
earns a commission for each sale made through these links.
5. Coupon and Deal Sites: For example, a coupon site might offer a special discount code
for online electronics stores, and each use of the code generates revenue for the site.

Advantages of Affiliate Marketing

I. Advantages for Advertisers
1. Cost-Effective: Advertisers pay commissions only for actual sales or leads, making
affiliate marketing an economical strategy. For instance, an online retailer pays affiliates
only for sales directly generated from their links.
2. Broader Reach: Affiliates help brands reach new audiences through their online
followings, like a travel company expanding its visibility through travel bloggers.
3. Risk Mitigation: Working with diverse affiliates across various platforms reduces
reliance on a single marketing channel, decreasing potential risks.

II. Advantages for Affiliates

1. Low Entry Barrier: With no need to create products, affiliate marketing is accessible to
anyone, including bloggers who can monetize their content through relevant affiliate
promotions.
2. Passive Income: Affiliates earn commissions from links within their content over time,
allowing for income even when not actively working.
3. No Inventory or Customer Service: Affiliates focus solely on marketing, without the
need to manage product stocks or customer inquiries, as these are handled by the
advertisers.

Disadvantages of Affiliate Marketing

I. Disadvantages for Advertisers
1. Risk of Unethical Practices: Some affiliates might use deceptive tactics, harming the
advertiser's reputation. For instance, false claims about a product's effectiveness can
lead to customer complaints.
2. Management Difficulties: Overseeing a large affiliate network requires significant
resources for monitoring performance and managing payments, with issues like
outdated links complicating operations.
3. Commission Costs: Paying commissions cuts into profits, especially if a significant
revenue portion comes from affiliate-driven sales, affecting overall profitability.

II. Disadvantages for Affiliates

1. Advertiser Dependence: Changes in affiliate program policies or commission rates can
negatively impact affiliates, such as a sudden decrease in payout rates.
2. High Competition: The crowded affiliate market makes it hard to stand out, with many
affiliates promoting similar products.
3. Income Fluctuation: Affiliate income can vary greatly, making it a less stable income
source, especially when marketing seasonal products.

How to Start Affiliate Marketing

Here are some of the quick and easy steps to start Affiliate Marketing:

1. Choose a Niche: Select a topic you're passionate about or have knowledge in.
2. Find Affiliate Programs: Look for programs in your niche that offer good commission
rates.
3. Create Content: Start a blog, YouTube channel, or social media account to share content.
4. Promote Products: Use your content to promote products with your affiliate links.
5. Drive Traffic: Use SEO, social media, and email marketing to attract viewers.
6. Track and Optimize: Monitor your performance and optimize your strategies for better
results.

Conclusion
In conclusion, affiliate marketing is a smart way for companies and individuals to work
together, helping each other grow. It's like a partnership where both sides can win:
companies get more customers, and individuals or websites earn money by promoting
products. As the online world changes, affiliate marketing will too, but its essence of
rewarding effort and quality will stay the same, making it a valuable strategy for the digital
age.


Distributed Database System

Last Updated : 19 Sep, 2025

A Distributed Database System (DDBS) is a collection of multiple databases spread across
different physical locations, connected via a network. Unlike a centralized system, where all
data is stored in one place, a distributed system manages data across various sites while
making it appear as a single database to users. It improves data availability, reliability, and
performance by enabling local access, parallel processing, and fault tolerance.

Distributed Database

Types
Some of the types of distributed database systems are:

1. Homogeneous Database

In a homogeneous database, all sites store the database identically. The operating
system, database management system, and data structures used are the same at all
sites. Hence, they're easy to manage.

Features:

Unified query language and interface.

Low integration complexity.
Efficient synchronization.

Example: A bank with branches in different cities uses Oracle DB at every location. All
databases have the same structure and are synchronized regularly.

2. Heterogeneous Database

In a heterogeneous distributed database, different sites may use different DBMSs,
schemas, or data models, making query processing and transactions difficult. Some sites
may not even be aware of others, so translation mechanisms are needed for
communication.

Features:

Supports interoperability between diverse systems.

Complex query optimization and transaction management.
Useful in mergers or collaborations between organizations.

Example: A logistics company uses MySQL for inventory, MongoDB for vehicle tracking,
and PostgreSQL for billing. Integration middleware allows unified querying across these
platforms.

3. Client-Server Distributed Database System

In this model, the server stores and manages the database, while clients send queries over
the network. It offers centralized control with distributed access, making it ideal for
enterprise systems and web applications. Clients can be lightweight while the server
handles heavy processing. Example: Web application interacting with a central PostgreSQL
server.

Features:

Simplifies resource management.

Central servers can be optimized for performance.
Easily scalable with more clients.

Example: An e-commerce website where the frontend (client) is hosted separately and
interacts with a central PostgreSQL server to manage orders, users, and inventory.

4. Peer-to-Peer Distributed Database System

Here, all nodes are equal, with no fixed client or server roles. Each node can store data and
also process queries, leading to decentralized control. It supports fault tolerance and high
availability. Example: Blockchain networks like Ethereum, where each node maintains a part
of the distributed ledger.

Features:

No single point of failure.

Useful in decentralized and distributed apps.
High availability and data redundancy.

Example: Blockchain-based databases like Ethereum or BitTorrent-based systems, where
each peer maintains part of the ledger and participates equally in transactions.

5. Cloud-Based Distributed Database System

These systems are deployed on cloud platforms and span multiple geographic regions for
scalability and reliability. They abstract infrastructure details and are offered as DBaaS,
making them ideal for dynamic workloads. Example: Google Cloud Spanner and Amazon
DynamoDB used for global applications.

Features:

Automatic scaling and replication.

Pay-as-you-use pricing.
Global availability and disaster recovery.

Example:

Google Cloud Spanner: Global-scale relational database.

Amazon DynamoDB: Key-value and document database with high performance.
Azure Cosmos DB: Multi-model, globally distributed DBMS.

Key Components and Challenges of a Distributed Database

Let's now look at the definition of each key concept:

1. Replication

In replication, copies of the same data are stored at two or more sites. If every site has the
full database, it's called full replication. This improves data availability and allows faster,
parallel query processing. However, updates must be made at all sites, or data may
become inconsistent. It also adds overhead and makes concurrency control more complex.
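
The trade-off described above can be illustrated with a minimal sketch of full replication. The class name ReplicatedStore, the site names, and the key are hypothetical, and plain dictionaries stand in for the databases at each site; a real system would propagate updates over a network.

```python
class ReplicatedStore:
    """Toy full replication: every site holds a complete copy of the data."""

    def __init__(self, site_names):
        # One independent dictionary per site (hypothetical site names).
        self.sites = {name: {} for name in site_names}

    def write_all(self, key, value):
        # Apply the update at every site so that all copies agree.
        for copy in self.sites.values():
            copy[key] = value

    def write_one(self, site, key, value):
        # Update a single site only; the other copies become stale.
        self.sites[site][key] = value

    def read(self, site, key):
        # Any site can answer a read from its local copy.
        return self.sites[site].get(key)

    def is_consistent(self, key):
        # True when every site reports the same value for the key.
        values = {repr(copy.get(key)) for copy in self.sites.values()}
        return len(values) == 1


store = ReplicatedStore(["delhi", "mumbai", "chennai"])
store.write_all("balance:42", 100)
consistent_after_write_all = store.is_consistent("balance:42")

store.write_one("delhi", "balance:42", 80)
consistent_after_write_one = store.is_consistent("balance:42")
```

After write_all every copy agrees, while the single-site write leaves the other two copies stale, which is the inconsistency the paragraph warns about.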

2. Fragmentation

In this approach, the relations are fragmented (i.e., divided into smaller parts) and each
fragment is stored at the site where it is required. The fragments must be chosen so that
they can be used to reconstruct the original relation (i.e., there is no loss of data).
Fragmentation is advantageous because it doesn't create copies of data, so consistency is
not a problem.
Fragmentation of relations can be done in two ways:

Horizontal fragmentation - Splitting by rows
The relation is fragmented into groups of tuples so that each tuple is assigned to at
least one fragment.
Vertical fragmentation - Splitting by columns
The schema of the relation is divided into smaller schemas. Each fragment must contain
a common candidate key so as to ensure a lossless join.

In certain cases, a hybrid of fragmentation and replication is used.
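
As a rough illustration of both kinds of fragmentation and of lossless reconstruction, the sketch below splits a small in-memory relation and rebuilds it. The EMPLOYEE relation, its attribute names, and the helper functions are all hypothetical; horizontal fragments are rebuilt with a union and vertical fragments with a join on the shared key.

```python
# Hypothetical EMPLOYEE relation; emp_id plays the role of the candidate key.
EMPLOYEE = [
    {"emp_id": 1, "name": "Asha", "city": "Delhi", "salary": 50000},
    {"emp_id": 2, "name": "Ravi", "city": "Mumbai", "salary": 60000},
    {"emp_id": 3, "name": "Meena", "city": "Delhi", "salary": 55000},
]


def horizontal_fragment(rows, predicate):
    """Split by rows: keep only the tuples that satisfy the predicate."""
    return [dict(row) for row in rows if predicate(row)]


def vertical_fragment(rows, key, columns):
    """Split by columns: keep the key plus the requested attributes."""
    return [{name: row[name] for name in (key, *columns)} for row in rows]


def union_by_key(key, *fragments):
    """Rebuild a horizontally fragmented relation, dropping duplicate keys."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged[row[key]] = row
    return [merged[k] for k in sorted(merged)]


def join_on_key(key, left, right):
    """Rebuild a vertically fragmented relation by joining on the shared key."""
    right_index = {row[key]: row for row in right}
    return [
        {**row, **right_index[row[key]]}
        for row in left
        if row[key] in right_index
    ]


# Horizontal fragments: one per city.
delhi = horizontal_fragment(EMPLOYEE, lambda r: r["city"] == "Delhi")
mumbai = horizontal_fragment(EMPLOYEE, lambda r: r["city"] == "Mumbai")
rebuilt_from_rows = union_by_key("emp_id", delhi, mumbai)

# Vertical fragments: both carry emp_id, so the join is lossless.
personal = vertical_fragment(EMPLOYEE, "emp_id", ["name", "city"])
payroll = vertical_fragment(EMPLOYEE, "emp_id", ["salary"])
rebuilt_from_columns = join_on_key("emp_id", personal, payroll)
```

If the vertical fragments did not both include emp_id, there would be no attribute to join on and the original tuples could not be recovered.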

3. Concurrency Control

Concurrency control ensures data remains accurate when multiple transactions run at the
same time. Without it, issues like lost updates or dirty reads can occur. Its goal is to make
parallel transactions behave as if run one by one. Common methods include locking,
timestamps, and optimistic concurrency.
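
One way to picture optimistic concurrency is a version check at write time; this is only a sketch of that idea, not a description of any particular DBMS. The VersionedRecord class and the two transactions T1 and T2 are hypothetical.

```python
class VersionedRecord:
    """A value guarded by a version counter (optimistic concurrency sketch)."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        # A transaction remembers the version it read.
        return self.value, self.version

    def write(self, new_value, expected_version):
        # Reject the write if another transaction committed in the meantime.
        if expected_version != self.version:
            return False
        self.value = new_value
        self.version += 1
        return True


account = VersionedRecord(100)

# T1 and T2 both read the same snapshot of the balance.
t1_value, t1_version = account.read()
t2_value, t2_version = account.read()

# T1 deposits 50 and commits first.
t1_ok = account.write(t1_value + 50, t1_version)

# T2 tries to withdraw 30 based on its stale read; the write is rejected,
# which prevents T1's update from being lost.
t2_ok = account.write(t2_value - 30, t2_version)

# T2 retries with a fresh read and now succeeds.
t2_value, t2_version = account.read()
t2_retry_ok = account.write(t2_value - 30, t2_version)
```

The final balance equals what a serial execution (T1, then T2) would produce. Without the version check, T2's stale write would overwrite T1's deposit, which is the lost-update problem mentioned above.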

4. Semantic Heterogeneity

Semantic heterogeneity happens when different databases use the same data labels but
with different meanings, formats, or units. For example, one system may store salary in
dollars, another in rupees. This can cause confusion during data integration, so resolving it
is important for accurate results.
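
The salary example can be sketched as a small normalization step applied during integration. The site layouts, the helper names, and in particular the exchange rate INR_PER_USD are assumptions made for illustration; the rate is a placeholder, not a real quotation.

```python
# Illustrative placeholder rate, not a real exchange rate.
INR_PER_USD = 80.0

# Both sites use the label "salary", but with different units.
SITE_A = {"unit": "USD", "rows": [{"emp_id": 1, "salary": 4000}]}
SITE_B = {"unit": "INR", "rows": [{"emp_id": 2, "salary": 250000}]}


def to_inr(amount, unit):
    """Convert an amount to rupees so values from all sites are comparable."""
    if unit == "INR":
        return float(amount)
    if unit == "USD":
        return amount * INR_PER_USD
    raise ValueError(f"unknown unit: {unit}")


def integrate(*sites):
    """Merge rows from several sites into one list with an explicit unit."""
    merged = []
    for site in sites:
        for row in site["rows"]:
            merged.append(
                {
                    "emp_id": row["emp_id"],
                    "salary_inr": to_inr(row["salary"], site["unit"]),
                }
            )
    return merged


integrated = integrate(SITE_A, SITE_B)
```

Renaming the merged attribute to salary_inr makes the unit part of the label, so the same name no longer carries two meanings.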

Functions of Distributed Database:

It is used in corporate management information systems.
It is used in multimedia applications.
It is used in military control systems, hotel chains, etc.
It is also used in manufacturing control systems.

Advantages of Distributed Database System :

There is fast data processing as several sites participate in request processing.

Reliability and availability of this system are high.
It has reduced operating costs.
It is easier to expand the system by adding more sites.
It has improved sharing ability and local autonomy.

Disadvantages of Distributed Database System :

The system becomes complex to manage and control.
The security issues must be carefully managed.
The system requires deadlock handling during transaction processing; otherwise,
the entire system may be left in an inconsistent state.
There is a need for some standardization for processing of distributed database
systems.



@GeeksforGeeks, Sanchhaya Education Private Limited, All rights reserved

SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

UNIT – I – DISTRIBUTED DATABASE – SCS1613

UNIT - I

INTRODUCTION TO DISTRIBUTED DATABASE

Introduction of Distributed Databases-Features of Distributed Databases-Distributed
databases versus Centralized Databases- Principles—Levels Of Distribution-
Transparency-Reference Architecture- Types of Data Fragmentation- Integrity Constraints
in Distributed Databases- Architectural Issues- Alternative Client/Server Architecture.
A database is an ordered collection of related data that is built for a specific purpose. A
database may be organized as a collection of multiple tables, where a table represents a real
world element or entity. Each table has several different fields that represent the characteristic
features of the entity.
For example, a company database may include tables for projects, employees, departments,
products and financial records. The fields in the Employee table may be Name, Company_Id,
Date_of_Joining, and so forth.
A database management system is a collection of programs that enables creation and
maintenance of a database. DBMS is available as a software package that facilitates
definition, construction, manipulation and sharing of data in a database. Definition of a
database includes description of the structure of a database. Construction of a database
involves actual storing of the data in any storage medium. Manipulation refers to
retrieving information from the database, updating the database and generating reports.
Sharing of data allows data to be accessed by different users or programs.
Examples of DBMS Application Areas
Automatic Teller Machines
Train Reservation System
Employee Management System
Student Information System
Examples of DBMS Packages
MySQL
Oracle
SQL Server
dBASE
FoxPro
PostgreSQL, etc.
Database Schemas
A database schema is a description of the database which is specified during database design
and subject to infrequent alterations. It defines the organization of the data, the relationships
among them, and the constraints associated with them. Databases are often represented
through the three-schema architecture or ANSI-SPARC architecture. The goal of this
architecture is to separate the user application from the physical database. The three levels are
−
Internal Level having Internal Schema − It describes the physical structure, details of
internal storage and access paths for the database.
Conceptual Level having Conceptual Schema − It describes the structure of the whole
database while hiding the details of physical storage of data. This illustrates the entities,
attributes with their data types and constraints, user operations and relationships.
External or View Level having External Schemas or Views − It describes the portion of a
database relevant to a particular user or a group of users while hiding the rest of database.
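
The three levels can be loosely approximated with Python's built-in sqlite3 module. This is only a sketch: the table plays the role of the conceptual schema, the view acts as an external schema, and the index stands in for an internal-level access path. The salary column and the names employee_directory and idx_employee_name are hypothetical additions to the Employee fields mentioned earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Conceptual level: entities, attributes, data types and constraints.
    CREATE TABLE employee (
        company_id      INTEGER PRIMARY KEY,
        name            TEXT NOT NULL,
        date_of_joining TEXT NOT NULL,
        salary          INTEGER NOT NULL
    );

    -- Internal level: an access path chosen for physical efficiency.
    CREATE INDEX idx_employee_name ON employee(name);

    -- External level: one user group sees names but not salaries.
    CREATE VIEW employee_directory AS
        SELECT company_id, name FROM employee;
    """
)
conn.execute("INSERT INTO employee VALUES (1, 'Asha', '2021-04-01', 50000)")

cursor = conn.execute("SELECT * FROM employee_directory")
view_columns = [column[0] for column in cursor.description]
directory_rows = cursor.fetchall()
```

Dropping or adding the index changes only how rows are located, not what the view or the table returns, which is the separation the architecture aims for.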
Types of DBMS
Hierarchical DBMS
In hierarchical DBMS, the relationships among data in the database are established so that
one data element exists as a subordinate of another. The data elements have parent-child
relationships and are modelled using the “tree” data structure. These are very fast and simple.

Figure 1.1 Hierarchical DBMS

Network DBMS
A network DBMS is one where the relationships among data in the database are of type
many-to-many, in the form of a network. The structure is generally complicated due to the
existence of numerous many-to-many relationships. A network DBMS is modelled using the
“graph” data structure.

Figure 1.2 Network DBMS

Relational DBMS
In relational databases, the database is represented in the form of relations. Each relation
models an entity and is represented as a table of values. In the relation or table, a row is
called a tuple and denotes a single record. A column is called a field or an attribute and
denotes a characteristic property of the entity. RDBMS is the most popular database
management system.
For example − A Student Relation −

Figure 1.3 A Student Relation

Object Oriented DBMS
Object-oriented DBMS is derived from the model of the object-oriented programming
paradigm. They are helpful in representing both persistent data, as stored in databases, and
transient data, as found in executing programs. They use small, reusable elements called
objects. Each object contains a data part and a set of operations that work upon the data.
The object and its attributes are accessed through pointers instead of being stored in
relational table models.
For example − A simplified Bank Account object-oriented database −

Figure 1.4 A simplified Bank Account object-oriented database
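
The idea of an object that bundles a data part with its operations can be sketched with an ordinary Python class. This is an in-memory object only, not an object-oriented DBMS, and the attribute and method names are hypothetical.

```python
class BankAccount:
    """Data part (account_number, owner, balance) plus operations on it."""

    def __init__(self, account_number, owner, balance=0):
        self.account_number = account_number
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        # Operation that changes the data part in a controlled way.
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balance += amount

    def withdraw(self, amount):
        # Reject withdrawals that would overdraw the account.
        if amount <= 0 or amount > self.balance:
            raise ValueError("invalid withdrawal amount")
        self.balance -= amount


account = BankAccount("ACC-001", "Asha", balance=100)
account.deposit(50)
account.withdraw(30)
```

Callers change the balance only through deposit and withdraw, so the rules for valid updates live with the data they protect.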

Distributed DBMS
A distributed database is a set of interconnected databases that is distributed over the
computer network or internet. A Distributed Database Management System (DDBMS)
manages the distributed database and provides mechanisms so as to make the databases
transparent to the users. In these systems, data is intentionally distributed among multiple
nodes so that all computing resources of the organization can be optimally used.
A distributed database is a collection of multiple interconnected databases, which are
spread physically across various locations that communicate via a computer network.
Features
• Databases in the collection are logically interrelated with each other. Often they
represent a single logical database.
• Data is physically stored across multiple sites. Data in each site can be managed by
a DBMS independent of the other sites.
• The processors in the sites are connected via a network. They do not have any
multiprocessor configuration.
• A distributed database is not a loosely connected file system.
• A distributed database incorporates transaction processing, but it is not synonymous
with a transaction processing system.

Distributed Database Management System

A distributed database management system (DDBMS) is a centralized software system that
manages a distributed database in a manner as if it were all stored in a single location.
Features
• It is used to create, retrieve, update and delete distributed databases.
• It synchronizes the database periodically and provides access mechanisms by the
virtue of which the distribution becomes transparent to the users.
• It ensures that the data modified at any site is universally updated.
• It is used in application areas where large volumes of data are processed and accessed
by numerous users simultaneously.
• It is designed for heterogeneous database platforms.
• It maintains confidentiality and data integrity of the databases.

Factors Encouraging DDBMS

• Distributed Nature of Organizational Units − Most organizations in the current times
are subdivided into multiple units that are physically distributed over the globe. Each
unit requires its own set of local data. Thus, the overall database of the organization
becomes distributed.
• Need for Sharing of Data − The multiple organizational units often need to
communicate with each other and share their data and resources. This demands
common databases or replicated databases that should be used in a synchronized
manner.
• Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and
Online Analytical Processing (OLAP) work upon diversified systems which may
have common data. Distributed database systems support both kinds of processing by
providing synchronized data.
• Database Recovery − One of the common techniques used in DDBMS is replication
of data across different sites. Replication of data automatically helps in data recovery
if database in any site is damaged. Users can access data from other sites while the
damaged site is being reconstructed. Thus, database failure may become almost
inconspicuous to users.
• Support for Multiple Application Software − Most organizations use a variety of
application software each with its specific database support. DDBMS provides a
uniform functionality for using the same data among different platforms.
Advantages of Distributed Databases
• Modular Development − If the system needs to be expanded to new locations or new
units, in centralized database systems, the action requires substantial efforts and
disruption in the existing functioning. However, in distributed databases, the work
simply requires adding new computers and local data to the new site and finally
connecting them to the distributed system, with no interruption in current functions.
• More Reliable − In case of database failures, the total system of centralized databases
comes to a halt. However, in distributed systems, when a component fails, the
system continues to function, possibly at reduced performance. Hence
DDBMS is more reliable.
• Better Response − If data is distributed in an efficient manner, then user requests can
be met from local data itself, thus providing faster response. On the other hand, in
centralized systems, all queries have to pass through the central computer for
processing, which increases the response time.
• Lower Communication Cost − In distributed database systems, if data is located
locally where it is mostly used, then the communication costs for data manipulation
can be minimized. This is not feasible in centralized systems.

Adversities of Distributed Databases

• Need for complex and expensive software − DDBMS demands complex and often
expensive software to provide data transparency and co-ordination across the several
sites.
• Processing overhead − Even simple operations may require a large number of
communications and additional calculations to provide uniformity in data across the
sites.
• Data integrity − The need for updating data in multiple sites poses problems of data
integrity.
• Overheads for improper data distribution − Responsiveness of queries is largely
dependent upon proper data distribution. Improper data distribution often leads to
very slow response to user requests.

Distributed Database Vs Centralized Database

Centralized DBMS | Distributed DBMS
The database is stored at only one site. | The database is stored at different sites and accessed over a network.
The data is stored at a single computer site that can be used by multiple users. | The database and DBMS software are distributed over many sites connected by a computer network.
The database is maintained at one site. | The database is maintained at a number of different sites.
If the centralized system fails, the entire system is halted. | If one system fails, the system continues to work with the other sites.
It is less reliable. | It is more reliable.

Figure 1.5 Centralized database

Figure 1.6 Distributed database

Types of Distributed Databases

Figure 1.7 Types of Distributed Databases

Distributed databases can be broadly classified into homogeneous and heterogeneous
distributed database environments
Homogeneous Distributed Databases
In a homogeneous distributed database, all the sites use identical DBMS and operating
systems. Its properties are −
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process user
requests.
• The database is accessed through a single interface as if it is a single database.

Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database −
Autonomous − Each database is independent and functions on its own. The databases are
integrated by a controlling application and use message passing to share data updates.
Non-autonomous − Data is distributed across the homogeneous nodes and a central or
master DBMS co-ordinates data updates across the sites.
Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating systems,
DBMS products and data models. Its properties are −
• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs like relational, network,
hierarchical or object oriented.
• Query processing is complex due to dissimilar schemas. Transaction processing is
complex due to dissimilar software.
• A site may not be aware of other sites and so there is limited co-operation in
processing user requests.

Types of Heterogeneous Distributed Databases

Federated − The heterogeneous database systems are independent in nature and integrated
together so that they function as a single database system.
Un-federated − The database systems employ a central coordinating module through which
the databases are accessed.
Distributed DBMS Architectures
DDBMS architectures are generally developed depending on three parameters −
• Distribution − It states the physical distribution of data across the different sites.
• Autonomy − It indicates the distribution of control of the database system and the
degree to which each constituent DBMS can operate independently.
• Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system
components and databases.

Architectural Models
• Client - Server Architecture for DDBMS
• Peer - to - Peer Architecture for DDBMS
• Multi - DBMS Architecture

Client - Server Architecture for DDBMS

This is a two-level architecture where the functionality is divided into servers and clients. The
server functions primarily encompass data management, query processing, optimization and
transaction management. Client functions mainly include the user interface. However, clients
also have some functions such as consistency checking and transaction management.
The functionality is distinguished and divided into two classes, server functions and
client functions.
Server does most of the data management work
– Query processing
– Data management
– Optimization
– Transaction management etc.
Client performs
– Application
– User interface
– DBMS Client model

The two different client-server architectures are −

Single Server Multiple Client
Single Server accessed by multiple clients
• A client server architecture has a number of clients and a few servers connected in a
network.
• A client sends a query to one of the servers. The earliest available server solves it and
replies.
• A client-server architecture is simple to implement and execute due to the centralized
server system.
Figure 1.8 Single Server Multiple Client
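
The rule that the earliest available server answers a query can be sketched as a small scheduling function. The function name dispatch, the server names, and the query durations are hypothetical, and the sketch assumes that all queries are already waiting at time 0 and are served in the order given.

```python
import heapq


def dispatch(queries, server_names):
    """Assign each query to whichever server becomes free first.

    queries is a list of (query_id, duration) pairs. The result lists
    (query_id, server, start_time) in the order the queries were given.
    """
    # Min-heap of (time_when_free, server_name); all servers start idle.
    free_at = [(0, name) for name in server_names]
    heapq.heapify(free_at)

    assignments = []
    for query_id, duration in queries:
        time_free, server = heapq.heappop(free_at)
        assignments.append((query_id, server, time_free))
        # The server is busy until it finishes this query.
        heapq.heappush(free_at, (time_free + duration, server))
    return assignments


assignments = dispatch([("q1", 5), ("q2", 2), ("q3", 1)], ["s1", "s2"])
```

With two servers, q1 and q2 start immediately on different servers, and q3 waits for s2, which finishes its short query before s1 does.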

Multiple Server Multiple Client

Figure 1.9 Multiple Servers accessed by multiple clients

Peer- to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and a server for imparting database services.
The peers share their resource with other peers and co-ordinate their activities.
This architecture generally has four levels of schemas −
• Local internal schema: the individual internal schema definition at each site.
• Global conceptual schema: describes the enterprise view of the data.
• Local conceptual schema: describes the local organization of data at each site.
• External schemas: support user applications and user access to the database.
Local conceptual schemas are mappings of the global schema onto each site.
Databases are typically designed in a top-down fashion and, therefore, all external
view definitions are made globally.
Major Components of a Peer-to-Peer System
– User Processor
– Data processor
User Processor
• User-interface handler
• responsible for interpreting user commands, and formatting the result data
• Semantic data controller
• checks if the user query can be processed.
• Global Query optimizer and decomposer
• determines an execution strategy
• Translates global queries into local one.
• Distributed execution
• Coordinates the distributed execution of the user request
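
The decomposition and distributed execution steps listed above can be sketched for the simple case of a selection over a horizontally fragmented relation. The site names, the fragment contents, and the function names are hypothetical, and in-memory lists stand in for the data processors at each site.

```python
# Hypothetical fragments of a global ORDERS relation, one per site.
FRAGMENTS = {
    "site_north": [
        {"id": 1, "region": "north", "total": 120},
        {"id": 2, "region": "north", "total": 80},
    ],
    "site_south": [
        {"id": 3, "region": "south", "total": 200},
    ],
}


def decompose(min_total):
    """Translate one global selection into one local subquery per site."""
    return {
        site: (lambda row, limit=min_total: row["total"] >= limit)
        for site in FRAGMENTS
    }


def execute_local(site, predicate):
    """Run a local subquery against the fragment stored at one site."""
    return [row for row in FRAGMENTS[site] if predicate(row)]


def execute_global(min_total):
    """Coordinate the local executions and merge their results."""
    results = []
    for site, predicate in decompose(min_total).items():
        results.extend(execute_local(site, predicate))
    return sorted(results, key=lambda row: row["id"])


result = execute_global(100)
```

The user asks one question about the global relation; the coordinator hides the fact that the answer is assembled from two sites.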

Data processor
• Local query optimizer
• Acts as the access path selector
• Responsible for choosing the best access path
• Local Recovery Manager
• Makes sure local database remains consistent
• Run-time support processor
• Is the interface to the operating system and contains the database buffer
• Responsible for maintaining the main memory buffers and managing the data access.

Figure 1.10 Frame Work

Multi - DBMS Architectures

This is an integrated database system formed by a collection of two or more autonomous
database systems.
Multi-DBMS can be expressed through six levels of schemas −
• Multi-database View Level − Depicts multiple user views comprising subsets of
the integrated distributed database.
• Multi-database Conceptual Level − Depicts the integrated multi-database, which
comprises global logical multi-database structure definitions.
• Multi-database Internal Level − Depicts the data distribution across different sites and
multi-database to local data mapping.
• Local database View Level − Depicts public view of local data.
• Local database Conceptual Level − Depicts local data organization at each site.
• Local database Internal Level − Depicts physical data organization at each site.

There are two design alternatives for multi-DBMS −

Model with multi-database conceptual level.
Models Using a Global Conceptual Schema

Figure 1.11 Models Using a Global Conceptual Schema

• GCS is defined by integrating either the external schemas of local autonomous
databases or parts of their local conceptual schemas
• Users of a local DBMS define their own views on the local database.
• If heterogeneity exists in the system, then two implementation alternatives exist:
unilingual and multilingual
• Unilingual requires the users to utilize possibly different data models and
languages
• Basic philosophy of multilingual architecture is to permit each user to access the
global database.
GCS in multi-DBMS
– Mapping is from local conceptual schema to a global schema

– Bottom-up design

Model without multi-database conceptual level.

• Consists of two layers: the local system layer and the multi-database layer.
• The local systems present to the multi-database layer the part of their local
database they are willing to share with users of other databases.
• System views are constructed above this layer
• Responsibility of providing access to multiple databases is delegated to the
mapping between the external schemas and the local conceptual schemas.
• Full-fledged DBMSs exist, each of which manages a different database.
GCS in Logically integrated distributed DBMS
– Mapping is from global schema to local conceptual schema

– Top-down procedure

Global Directory Issues

The global directory is an extension of the normal directory. For a distributed
DBMS or a multi-DBMS that uses a global conceptual schema, it includes information about the
location of the fragments as well as the makeup of the fragments.
• Relevant for distributed DBMS or a multi-DBMS that uses a global
conceptual schema
• Includes information about the location of the fragments as well as the makeup of
fragments.
• Directory is itself a database that contains meta-data about the actual data stored in
database.

Three issues
– A directory may either be global to the entire database or local to each site.
– Directory may be maintained centrally at one site, or in a distributed fashion by
distributing it over a number of sites.
➢ If system is distributed, directory is always distributed
– Replication may be single copy or multiple copies.
➢ Multiple copies would provide more reliability

Organization of Distributed systems

Three orthogonal dimensions
• Level of sharing
➢ No sharing, each application and data execute at one site
➢ Data sharing, all the programs are replicated at other sites but not the data.
➢ Data-plus-program sharing, both data and program can be shared
• Behavior of access patterns
➢ Static
Does not change over time
Very easy to manage
➢ Dynamic
Most of the real life applications are dynamic
• Level of knowledge on access pattern behavior.
➢ No information
➢ Complete information
– Access patterns can be reasonably predicted
– No deviations from predictions
➢ Partial information
– Deviations from predictions

Top Down Design

• Suitable for applications where the database needs to be built from scratch
• Activity begins with requirement analysis
• Requirement document is input to two parallel activities:
➢ view design activity, deals with defining the interfaces for end users
➢ conceptual design, process by which enterprise is examined
– Can be further divided into 2 related activity groups
– Entity analyses, concerned with determining the entities, attributes and the
relationship between them
– Functional analyses, concerned with determining the fun
➢ Distributed design activity consists of two steps
– Fragmentation
– Allocation

Bottom-Up Approach
• Suitable for applications where database already exists
• Starting point is individual conceptual schemas
• Exists primarily in the context of heterogeneous database.
Design Alternatives
The distribution design alternatives for the tables in a DDBMS are as follows −
Non-replicated and non-fragmented
Fully replicated
Partially replicated
Fragmented
Mixed
Non-replicated & Non-fragmented
In this design alternative, different tables are placed at different sites. Data is placed so that it
is at a close proximity to the site where it is used most. It is most suitable for database
systems where the percentage of queries needed to join information in tables placed at
different sites is low. If an appropriate distribution strategy is adopted, then this design
alternative helps to reduce the communication cost during data processing.
Fully Replicated
In this design alternative, one copy of all the database tables is stored at each site. Since
each site has its own copy of the entire database, queries are very fast, requiring negligible
communication cost. On the other hand, the massive redundancy in data incurs a huge cost
during update operations. Hence, this is suitable for systems where a large number of
queries must be handled whereas the number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the
tables is done in accordance with the frequency of access. This takes into consideration the
fact that the frequency of accessing the tables varies considerably from site to site. The
number of copies of the tables (or portions) depends on how frequently the access queries
execute and on the sites which generate the access queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or partitions,
and each fragment can be stored at different sites. This considers the fact that it seldom
happens that all data stored in a table is required at a given site. Moreover, fragmentation
increases parallelism and provides better disaster recovery. Here, there is only one copy of
each fragment in the system, i.e. no redundant data.
The three fragmentation techniques are −
• Vertical fragmentation
• Horizontal fragmentation
• Hybrid fragmentation

Mixed Distribution: This is a combination of fragmentation and partial replications. Here, the
tables are initially fragmented in any form (horizontal or vertical), and then these fragments
are partially replicated across the different sites according to the frequency of accessing the
fragments.
Design Strategies
The previous section introduced different design alternatives. This section covers the
strategies that aid in adopting those designs. The strategies can be broadly divided
into replication and fragmentation. However, in most cases, a combination of the two is
used.
Data Replication
Data replication is the process of storing separate copies of the database at two or more
sites. It is a popular fault tolerance technique of distributed databases.
Advantages of Data Replication
• Reliability − In case of failure of any site, the database system continues to work
since a copy is available at another site(s).
• Reduction in Network Load − Since local copies of data are available, query
processing can be done with reduced network usage, particularly during prime hours.
Data updating can be done at non-prime hours.
• Quicker Response − Availability of local copies of data ensures quick query
processing and consequently quick response time.
• Simpler Transactions − Transactions require fewer joins of tables located at
different sites and minimal coordination across the network. Thus, they become
simpler in nature.

Disadvantages of Data Replication

• Increased Storage Requirements − Maintaining multiple copies of data is associated
with increased storage costs. The storage space required is in multiples of the storage
required for a centralized system.
• Increased Cost and Complexity of Data Updating − Each time a data item is updated,
the update needs to be reflected in all the copies of the data at the different sites. This
requires complex synchronization techniques and protocols.
• Undesirable Application – Database coupling − If complex update mechanisms are
not used, removing data inconsistency requires complex coordination at application
level. This results in undesirable application – database coupling.

Some commonly used replication techniques are

Snapshot replication
Near-real-time replication
Pull replication
Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the
table are called fragments. Fragmentation can be of three types: horizontal, vertical, and
hybrid (combination of horizontal and vertical). Horizontal fragmentation can further be
classified into two techniques: primary horizontal fragmentation and derived horizontal
fragmentation.
Fragmentation should be done in such a way that the original table can be reconstructed from
the fragments whenever required. This requirement is called “reconstructiveness.”
Advantages
1. Permits a number of transactions to be executed concurrently
2. Results in parallel execution of a single query
3. Increases level of concurrency, also referred to as intra-query concurrency
4. Increased System throughput.
5. Since data is stored close to the site of usage, efficiency of the database system is
increased.
6. Local query optimization techniques are sufficient for most queries since data is
locally available.
7. Since irrelevant data is not available at the sites, security and privacy of the database
system can be maintained.

Disadvantages
1. Applications whose views are defined on more than one fragment may suffer
performance degradation, if applications have conflicting requirements.
2. Simple tasks like checking for dependencies, would result in chasing after data in a
number of sites
3. When data from different fragments are required, the access times may be very
high.
4. In case of recursive fragmentations, the job of reconstruction will need expensive
techniques.
5. Lack of back-up copies of data in different sites may render the database ineffective in
case of failure of a site.

Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In
order to maintain reconstructiveness, each fragment should contain the primary key field(s)
of the table. Vertical fragmentation can be used to enforce privacy of data.
Grouping
• Starts by assigning each attribute to one fragment
o At each step, joins some of the fragments until some criterion is satisfied.
• Results in overlapping fragments
Splitting
• Starts with a relation and decides on beneficial partitioning based on the access
behavior of applications to the attributes
• Fits more naturally within the top-down design
• Generates non-overlapping fragments
For example, let us consider that a University database keeps records of all registered
students in a Student table having the following schema.
STUDENT
Regd_No   Name   Course   Address   Semester   Fees   Marks

Now, the fees details are maintained in the accounts section. In this case, the designer will
fragment the database as follows −

CREATE TABLE STD_FEES AS
SELECT Regd_No, Fees
FROM STUDENT;
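A minimal Python sketch of the same idea may help: it projects a table onto the key plus a chosen set of columns, and rebuilds the original rows by joining the fragments on the key. The sample rows and the helper names `vertical_fragment` and `reconstruct` are assumptions made for illustration, not part of any DDBMS.

```python
# Hypothetical rows of the STUDENT table; all values are made up for illustration.
STUDENT = [
    {"Regd_No": 1, "Name": "Asha", "Course": "Computer Science",
     "Address": "Chennai", "Semester": 3, "Fees": 50000, "Marks": 82},
    {"Regd_No": 2, "Name": "Ravi", "Course": "Physics",
     "Address": "Madurai", "Semester": 1, "Fees": 40000, "Marks": 74},
]


def vertical_fragment(rows, key, columns):
    """Project each row onto the primary key plus the requested columns."""
    return [{c: row[c] for c in (key, *columns)} for row in rows]


def reconstruct(frag_a, frag_b, key):
    """Rebuild full rows by joining two vertical fragments on the shared key."""
    index = {row[key]: row for row in frag_b}
    return [{**row, **index[row[key]]} for row in frag_a]


# Fragment kept by the accounts section, and the fragment with everything else.
std_fees = vertical_fragment(STUDENT, "Regd_No", ["Fees"])
std_rest = vertical_fragment(
    STUDENT, "Regd_No", ["Name", "Course", "Address", "Semester", "Marks"]
)

# Because both fragments carry Regd_No, the original table can be recovered.
rebuilt = reconstruct(std_rest, std_fees, "Regd_No")
```

Keeping the primary key in every fragment is what makes the join in `reconstruct` possible; without it the fragments could not be matched row by row.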

Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more
fields. Horizontal fragmentation should also conform to the rule of reconstructiveness. Each
horizontal fragment must have all columns of the original base table.
• Primary horizontal fragmentation is defined by a selection operation on the owner
relation of a database schema.
• Given a relation R, its horizontal fragments are given by
R_i = σ_{F_i}(R), 1 ≤ i ≤ w
where F_i is the selection formula used to obtain fragment R_i.
For example, splitting an Emp relation on salary can be represented using the above formula as
Emp_1 = σ_{Sal ≤ 20K}(Emp)
Emp_2 = σ_{Sal > 20K}(Emp)
For example, in the student schema, if the details of all students of the Computer Science course
need to be maintained at the School of Computer Science, then the designer will
horizontally fragment the database as follows −

CREATE TABLE COMP_STD AS
SELECT * FROM STUDENT
WHERE COURSE = 'Computer Science';
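Primary horizontal fragmentation can be sketched in Python as a set of selection predicates applied to the rows of a table. The sample rows and the second fragment name `OTHER_STD` are assumptions for illustration; only `COMP_STD` comes from the example above.

```python
# Hypothetical rows; only the columns needed for the illustration are shown.
STUDENT = [
    {"Regd_No": 1, "Name": "Asha", "Course": "Computer Science"},
    {"Regd_No": 2, "Name": "Ravi", "Course": "Physics"},
    {"Regd_No": 3, "Name": "Meena", "Course": "Computer Science"},
]

# One selection formula per fragment. The two predicates are complementary,
# so every row lands in exactly one fragment.
predicates = {
    "COMP_STD": lambda row: row["Course"] == "Computer Science",
    "OTHER_STD": lambda row: row["Course"] != "Computer Science",
}


def horizontal_fragments(rows, predicates):
    """Apply each selection predicate to the table to obtain its fragment."""
    return {name: [r for r in rows if p(r)] for name, p in predicates.items()}


fragments = horizontal_fragments(STUDENT, predicates)

# Reconstruction is the union of all fragments (sorted here only for comparison).
rebuilt = sorted(
    (row for frag in fragments.values() for row in frag),
    key=lambda row: row["Regd_No"],
)
```

Each fragment keeps every column of the base table, and the union of the fragments gives back the original rows, which is the reconstructiveness rule for horizontal fragments.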

Derived Horizontal Fragmentation

• Defined on a member relation of a link according to a selection operation
specified on its owner.

• Link between the owner and the member relations is defined as equi-join

• An equi-join can be implemented by means of semijoins.

• Given a link L where owner(L) = S and member(L) = R, the derived horizontal
fragments of R are defined as
R_i = R ⋉ S_i, 1 ≤ i ≤ w
where
S_i = σ_{F_i}(S),
w is the maximum number of fragments that will be defined on R, and
F_i is the formula using which the primary horizontal fragment S_i is defined.
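The definition above can be illustrated with a short Python sketch in which the owner relation is fragmented by selection and the member relation is then fragmented by a semijoin with each owner fragment. The relations COURSE (owner) and STUDENT (member), their rows, and the join attribute are assumptions chosen only for illustration.

```python
# Owner relation S (hypothetical): each course belongs to one school.
COURSE = [
    {"Course": "Computer Science", "School": "Computing"},
    {"Course": "Physics", "School": "Science"},
]

# Member relation R (hypothetical): linked to the owner through Course.
STUDENT = [
    {"Regd_No": 1, "Course": "Computer Science"},
    {"Regd_No": 2, "Course": "Physics"},
    {"Regd_No": 3, "Course": "Computer Science"},
]


def select(rows, predicate):
    """Selection: keep the rows that satisfy the formula."""
    return [row for row in rows if predicate(row)]


def semijoin(member, owner_fragment, attr):
    """Keep the member rows that have a matching owner row on attr."""
    keys = {row[attr] for row in owner_fragment}
    return [row for row in member if row[attr] in keys]


# Primary horizontal fragments of the owner: one per school.
S1 = select(COURSE, lambda r: r["School"] == "Computing")
S2 = select(COURSE, lambda r: r["School"] == "Science")

# Derived horizontal fragments of the member: semijoin with each owner fragment.
R1 = semijoin(STUDENT, S1, "Course")
R2 = semijoin(STUDENT, S2, "Course")
```

The member rows follow their owner rows, so a student record ends up at the same site as the course record it refers to.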

Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques
is used. This is the most flexible fragmentation technique since it generates fragments with
minimal extraneous information. However, reconstruction of the original table is often an
expensive task.
Hybrid fragmentation can be done in two alternative ways −
At first, generate a set of horizontal fragments; then generate vertical fragments from one or
more of the horizontal fragments.
At first, generate a set of vertical fragments; then generate horizontal fragments from one or
more of the vertical fragments.
Transparency
Transparency in a DBMS is the separation of the high-level semantics of the system from
low-level implementation issues. High-level semantics are what the end user sees, whereas
low-level implementation concerns how the data is physically stored and accessed in the
database. Transparency is implemented by providing data independence at the various layers of
the database.
Distribution transparency is the property of distributed databases by the virtue of which the
internal details of the distribution are hidden from the users. The DDBMS designer may
choose to fragment tables, replicate the fragments and store them at different sites.
However, since users are oblivious of these details, they find the distributed database easy to
use like any centralized database.
Unlike normal DBMS, DDBMS deals with communication network, replicas and fragments
of data. Thus, transparency also involves these three factors.
Following are three types of transparency:
1. Location transparency
2. Fragmentation transparency
3. Replication transparency
Location Transparency
Location transparency ensures that the user can query any table(s) or fragment(s) of a
table as if they were stored locally at the user’s site. The fact that the table or its fragments
are stored at a remote site in the distributed database system should be completely hidden from
the end user. The address of the remote site(s) and the access mechanisms are completely
hidden. In order to incorporate location transparency, the DDBMS should have access to an updated
and accurate data dictionary and DDBMS directory, which contain the details of the locations
of data.
Fragmentation Transparency
Fragmentation transparency enables users to query any table as if it were unfragmented.
Thus, it hides the fact that the table the user is querying is actually a fragment or a union of
some fragments. It also conceals the fact that the fragments are located at diverse sites. This is
somewhat similar to SQL views, where the user may not know that they are using a
view of a table instead of the table itself.
Replication Transparency
Replication transparency ensures that replication of databases is hidden from the users. It
enables users to query a table as if only a single copy of the table existed. Replication
transparency is associated with concurrency transparency and failure transparency. Whenever
a user updates a data item, the update is reflected in all the copies of the table. However, this
operation should not be known to the user. This is concurrency transparency. Also, in case of
failure of a site, the user can still proceed with their queries using replicated copies without
any knowledge of the failure. This is failure transparency.
Combination of Transparencies
In any distributed database system, the designer should ensure that all the stated
transparencies are maintained to a considerable extent. The designer may choose to fragment
tables, replicate them and store them at different sites; all oblivious to the end user.
However, complete distribution transparency is a tough task and requires considerable design
efforts.
Database Control
Database control refers to the task of enforcing regulations so as to provide correct data to
authentic users and applications of a database. In order that correct data is available to users,
all data should conform to the integrity constraints defined in the database. Besides, data
should be screened away from unauthorized users so as to maintain security and privacy of
the database. Database control is one of the primary tasks of the database administrator
(DBA).
The three dimensions of database control are −
• Authentication
• Access Control
• Integrity Constraints

Authentication
In a distributed database system, authentication is the process through which only legitimate
users can gain access to the data resources.
Authentication can be enforced in two levels −
Controlling Access to Client Computer − At this level, user access is restricted while login
to the client computer that provides user-interface to the database server. The most common
method is a username/password combination. However, more sophisticated methods like
biometric authentication may be used for high security data.
Controlling Access to the Database Software − At this level, the database
software/administrator assigns some credentials to the user. The user gains access to the
database using these credentials. One of the methods is to create a login account within the
database server.
Access Rights
A user’s access rights refer to the privileges that the user is given regarding DBMS
operations such as the rights to create a table, drop a table, add/delete/update tuples in a
table or query the table.
In distributed environments, since there are a large number of tables and a still larger number of
users, it is not feasible to assign individual access rights to users. So, the DDBMS defines
certain roles. A role is a construct with certain privileges within a database system. Once the
different roles are defined, the individual users are assigned one of these roles. Often a
hierarchy of roles is defined according to the organization’s hierarchy of authority and
responsibility.
For example, the following SQL statements create a role "Accountant" and then assign this
role to user "ABC".

CREATE ROLE ACCOUNTANT;
GRANT SELECT, INSERT, UPDATE ON EMP_SAL TO ACCOUNTANT;
GRANT INSERT, UPDATE, DELETE ON TENDER TO ACCOUNTANT;
GRANT INSERT, SELECT ON EXPENSE TO ACCOUNTANT;
GRANT ACCOUNTANT TO ABC;
COMMIT;
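How a role-based check might work can be sketched in a few lines of Python. The privilege table mirrors the grants in the SQL above, while the dictionaries and the function `is_allowed` are hypothetical names used only for illustration; a real DDBMS keeps this information in its catalog.

```python
# Privileges attached to each role, per table (taken from the grants above).
ROLE_PRIVILEGES = {
    "ACCOUNTANT": {
        "EMP_SAL": {"SELECT", "INSERT", "UPDATE"},
        "TENDER": {"INSERT", "UPDATE", "DELETE"},
        "EXPENSE": {"INSERT", "SELECT"},
    }
}

# Users are assigned roles instead of individual privileges.
USER_ROLES = {"ABC": {"ACCOUNTANT"}}


def is_allowed(user, operation, table):
    """Return True if any role held by the user grants the operation on the table."""
    return any(
        operation in ROLE_PRIVILEGES.get(role, {}).get(table, set())
        for role in USER_ROLES.get(user, set())
    )


can_read_salary = is_allowed("ABC", "SELECT", "EMP_SAL")
can_delete_salary = is_allowed("ABC", "DELETE", "EMP_SAL")
unknown_user = is_allowed("XYZ", "SELECT", "EMP_SAL")
```

Adding a new accountant then needs a single entry in `USER_ROLES` rather than one grant per table, which is the point of defining roles.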

Semantic Integrity Control

Semantic integrity control defines and enforces the integrity constraints of the database
system.
The integrity constraints are as follows −
Data type integrity constraint
Entity integrity constraint
Referential integrity constraint
Data Type Integrity Constraint
A data type constraint restricts the range of values and the type of operations that can be
applied to the field with the specified data type.
For example, let us consider that a table "HOSTEL" has three fields - the hostel number,
hostel name and capacity. The hostel number should start with capital letter "H" and cannot
be NULL, and the capacity should not be more than 150. The following SQL command can
be used for data definition −

CREATE TABLE HOSTEL (
  H_NO VARCHAR2(5) NOT NULL,
  H_NAME VARCHAR2(15),
  CAPACITY INTEGER,
  CHECK ( H_NO LIKE 'H%'),
  CHECK ( CAPACITY <= 150)
);

Entity Integrity Control

Entity integrity control enforces the rules so that each tuple can be uniquely identified from
other tuples. For this a primary key is defined. A primary key is a set of minimal fields that
can uniquely identify a tuple. Entity integrity constraint states that no two tuples in a table
can have identical values for primary keys and that no field which is a part of the primary key
can have NULL value.
For example, in the above hostel table, the hostel number can be assigned as the primary key
through the following SQL statement (ignoring the checks) −

CREATE TABLE HOSTEL (
  H_NO VARCHAR2(5) PRIMARY KEY,
  H_NAME VARCHAR2(15),
  CAPACITY INTEGER
);

Referential Integrity Constraint

Referential integrity constraint lays down the rules of foreign keys. A foreign key is a field in
a data table that is the primary key of a related table. The referential integrity constraint lays
down the rule that the value of the foreign key field should either be among the values of the
primary key of the referenced table or be entirely NULL.
For example, let us consider a student table where a student may opt to live in a hostel. To
include this, the primary key of hostel table should be included as a foreign key in the
student table. The following SQL statement incorporates this −

CREATE TABLE STUDENT (
  S_ROLL INTEGER PRIMARY KEY,
  S_NAME VARCHAR2(25) NOT NULL,
  S_COURSE VARCHAR2(10),
  S_HOSTEL VARCHAR2(5) REFERENCES HOSTEL
);
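The three kinds of constraints can be tried out with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: it uses SQLite rather than the Oracle dialect of the statements above, so VARCHAR replaces VARCHAR2, foreign keys must be switched on explicitly, and the "starts with H" rule is written with substr because SQLite's LIKE is case-insensitive by default. The sample hostel and student values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("""
    CREATE TABLE HOSTEL (
        H_NO     VARCHAR(5) PRIMARY KEY NOT NULL,
        H_NAME   VARCHAR(15),
        CAPACITY INTEGER,
        CHECK (substr(H_NO, 1, 1) = 'H'),
        CHECK (CAPACITY <= 150)
    )
""")
conn.execute("""
    CREATE TABLE STUDENT (
        S_ROLL   INTEGER PRIMARY KEY,
        S_NAME   VARCHAR(25) NOT NULL,
        S_COURSE VARCHAR(10),
        S_HOSTEL VARCHAR(5) REFERENCES HOSTEL (H_NO)
    )
""")
conn.execute("INSERT INTO HOSTEL VALUES ('H1', 'Ganga', 100)")


def rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


results = {
    # Data type (domain) integrity: CHECK constraints
    "capacity over 150": rejected("INSERT INTO HOSTEL VALUES ('H2', 'Yamuna', 200)"),
    "number without H": rejected("INSERT INTO HOSTEL VALUES ('X3', 'Kaveri', 50)"),
    # Entity integrity: primary key values must be unique
    "duplicate primary key": rejected("INSERT INTO HOSTEL VALUES ('H1', 'Copy', 10)"),
    # Referential integrity: foreign key must match a hostel or be NULL
    "unknown hostel": rejected("INSERT INTO STUDENT VALUES (1, 'Asha', 'CSE', 'H9')"),
    "NULL hostel": rejected("INSERT INTO STUDENT VALUES (2, 'Ravi', 'CSE', NULL)"),
}
```

The first four statements are refused by the database itself, while the student with a NULL hostel is accepted, matching the rule that a foreign key is either a valid reference or NULL.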

SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

UNIT – II – DISTRIBUTED DATABASE – SCS1613

UNIT - II
QUERIES AND OPTIMIZATION
Global Queries to Fragment Queries-Equivalence Transformations for
Queries-Distributed Grouping and Aggregate Function Evaluation-
Parametric Queries-Optimization of Access Strategies-Framework for Query
Optimization-Join Queries- General Queries-Introduction to Distributed
Transactions.

Global Queries to Fragment Queries

When a query is placed, it is at first scanned, parsed and validated. An internal
representation of the query is then created such as a query tree or a query graph. Then
alternative execution strategies are devised for retrieving results from the database tables.
The process of choosing the most appropriate execution strategy for query processing is
called query optimization.

Query Optimization Issues in DDBMS

In DDBMS, query optimization is a crucial task. The complexity is high since the number of
alternative strategies may increase exponentially due to the following factors −

• The presence of a number of fragments.

• Distribution of the fragments or tables across various sites.

• The speed of communication links.

• Disparity in local processing capabilities.

Hence, in a distributed system, the target is often to find a good execution strategy for query
processing rather than the best one. The time to execute a query is the sum of the following

• Time to communicate queries to databases.

• Time to execute local query fragments.

• Time to assemble data from different sites.

• Time to display results to the application.

Query Processing
Query processing is the set of all activities from query placement to displaying the
results of the query. The steps are as shown in the following diagram −
Figure 2.1 Steps in query processing

Global Query Optimization

Input: Fragment query

• Find the best (not necessarily optimal) global schedule

➡ Minimize a cost function

➡ Distributed join processing

✦ Bushy vs. linear trees

✦ Which relation to ship where?

✦ Ship-whole vs ship-as-needed

➡ Decide on the use of semijoins

✦ Semijoin saves on communication at the expense of more local
processing.

➡ Join methods

✦ nested loop vs ordered joins (merge join or hash join)

Cost-Based Optimization

• Solution space

➡ The set of equivalent algebra expressions (query trees).

• Cost function (in terms of time)

➡ I/O cost + CPU cost + communication cost

➡ These might have different weights in different distributed environments
(LAN vs WAN).

➡ Can also maximize throughput

• Search algorithm

➡ How do we move inside the solution space?

➡ Exhaustive search, heuristic algorithms (iterative improvement, simulated
annealing, genetic, …)

Query Optimization Process

Figure 2.2 Query Optimization Process

Search Space

• Search space characterized by alternative execution

• Focus on join trees

• For N relations, there are O(N!) equivalent join trees that can be obtained by applying
commutativity and associativity rules

SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
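The factorial growth of the search space can be seen by simply enumerating join orders. The sketch below counts only the orderings of the three relations in the query, treating each permutation as one linear join order; this is a simplification of the full space of join trees.

```python
from itertools import permutations
from math import factorial

relations = ["EMP", "ASG", "PROJ"]

# Every permutation is one candidate order in which to join the relations.
join_orders = list(permutations(relations))

# The count grows as N!, so exhaustive search quickly becomes impractical.
growth = {n: factorial(n) for n in (3, 5, 10)}
```

For three relations there are only 6 orders, but `growth` shows that ten relations already give more than three million, which is why optimizers restrict the search space or use heuristics.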

Cost Functions

• Total Time (or Total Cost)

➡ Reduce each cost (in terms of time) component individually

➡ Do as little of each cost component as possible

➡ Optimizes the utilization of the resources

Increases system throughput

• Response Time

➡ Do as many things as possible in parallel

➡ May increase total time because of increased total activity

• Summation of all cost factors

• Total cost = CPU cost + I/O cost + communication cost

• CPU cost = unit instruction cost * no. of instructions

• I/O cost = unit disk I/O cost * no. of disk I/Os

• communication cost = message initiation + transmission
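The cost formulas above can be put into a small Python sketch. The unit costs are arbitrary integers chosen for illustration, communication cost is modeled as a per-message initiation cost plus a per-byte transmission cost, and response time assumes the two sites work fully in parallel; all of these are assumptions, not measured values.

```python
def query_cost(instructions, disk_ios, messages, bytes_sent,
               unit_cpu=1, unit_io=100, msg_init=50, per_byte=2):
    """Total cost = CPU cost + I/O cost + communication cost (arbitrary units)."""
    cpu_cost = unit_cpu * instructions
    io_cost = unit_io * disk_ios
    comm_cost = msg_init * messages + per_byte * bytes_sent
    return cpu_cost + io_cost + comm_cost


# Cost of the work done at two hypothetical sites for one query.
site_costs = [
    query_cost(instructions=1000, disk_ios=10, messages=2, bytes_sent=100),
    query_cost(instructions=500, disk_ios=5, messages=1, bytes_sent=50),
]

total_time = sum(site_costs)     # all work added up, regardless of where it runs
response_time = max(site_costs)  # the slowest site determines the elapsed time
```

Minimizing `total_time` favors doing less work overall, while minimizing `response_time` favors spreading work across sites, even if the sum increases.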

2-Step – Problem Definition

• Given

➡ A set of sites S = {s1, s2, …, sn} with the load of each site

➡ A query Q = {q1, q2, q3, q4} such that each subquery qi is the maximum
processing unit that accesses one relation and communicates with its
neighboring queries

➡ For each qi in Q, a feasible allocation set of sites Sq = {s1, s2, …, sk} where each
site stores a copy of the relation in qi

• The objective is to find an optimal allocation of Q to S such that

➡ the load unbalance of S is minimized

➡ The total communication cost is minimized

• For each q in Q compute load (Sq)

• While Q not empty do

➡ Select subquery a with least allocation flexibility

➡ Select best site b for a (with least load and best benefit)
➡ Remove a from Q and recompute loads if needed

2-Step Algorithm Example

• Let Q = {q1, q2, q3, q4} where q1 is associated with R1, q2 is associated with R2 joined
with the result of q1, etc.

• Iteration 1: select q4, allocate to s1, set load(s1)=2

• Iteration 2: select q2, allocate to s2, set load(s2)=3

• Iteration 3: select q3, allocate to s1, set load(s1) =3

• Iteration 4: select q1, allocate to s3 or s4
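The selection loop can be sketched as a greedy procedure in Python. This is a simplified sketch: "least allocation flexibility" is taken to mean the fewest candidate sites, the best site is simply the least loaded one (the communication benefit is ignored), and each allocated subquery adds one unit of load. The initial loads and copy placement are made-up inputs chosen so that the trace resembles the iterations listed above; the notes do not give these values.

```python
def allocate(feasible_sites, load):
    """Greedily assign each subquery to one of the sites holding a copy of its relation."""
    load = dict(load)              # work on a copy of the site loads
    pending = dict(feasible_sites)
    allocation = {}
    while pending:
        # Subquery with the fewest candidate sites goes first (ties broken by name).
        a = min(pending, key=lambda q: (len(pending[q]), q))
        # Among its candidate sites, pick the least loaded (ties broken by name).
        b = min(pending[a], key=lambda s: (load[s], s))
        allocation[a] = b
        load[b] += 1               # recompute the load of the chosen site
        del pending[a]
    return allocation, load


# Hypothetical inputs: which sites store a copy of the relation of each subquery.
feasible = {
    "q1": ["s1", "s3", "s4"],
    "q2": ["s2", "s3"],
    "q3": ["s1", "s2"],
    "q4": ["s1"],
}
initial_load = {"s1": 1, "s2": 2, "s3": 2, "s4": 2}

allocation, final_load = allocate(feasible, initial_load)
```

With these inputs q4 is placed first because it has only one feasible site, and q1 is placed last because it has the most freedom.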

Relational Algebra :
• Relational algebra defines the ways in which relations (tables) can be
operated on to manipulate their data.
• This algebra is composed of unary operations (involving a single table) and binary
operations (involving multiple tables).
• Join and semi-join are binary operations in relational algebra.
Join
• Join is a binary operation in Relational Algebra.
• It combines records from two or more tables in a database.
• A join is a means for combining fields from two tables by using values common to
each.
Semi-Join
• A join where the result only contains the columns from one of the joined tables.
• Useful in distributed databases, so we don't have to send as much data over the network.
• Can dramatically speed up certain classes of queries.
What is a “Semi-Join”?
Semi-join strategies are techniques for query processing in distributed database systems. They are used
for reducing communication cost.
A semi-join between two tables returns rows from the first table where one or more matches
are found in the second table.
The difference between a semi-join and a conventional join is that rows in the first table will
be returned at most once. Even if the second table contains two matches for a row in the first
table, only one copy of the row will be returned.
Semi-joins are written using EXISTS or IN.

A Simple Semi-Join Example: “Give a list of departments with at least one employee.”
Query written with a conventional join:
SELECT D.deptno, D.dname FROM dept D, emp E WHERE E.deptno = D.deptno
ORDER BY D.deptno;
◦ A department with N employees will appear in the list N times.
◦ We could use a DISTINCT keyword to get each department to appear only once.

Query written with a semi-join:
SELECT D.deptno, D.dname FROM dept D WHERE EXISTS (SELECT 1 FROM
emp E WHERE E.deptno = D.deptno) ORDER BY D.deptno;
◦ No department appears more than once.
◦ Oracle stops processing each department as soon as the first employee in that
department is found.
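Why a semi-join saves communication in a distributed setting can be shown with a small Python sketch. It assumes the department table sits at one site and the employee table at another; the rows and the column names `deptno`, `dname` and `empno` are hypothetical.

```python
# Site A holds the departments, site B holds the employees (made-up rows).
dept = [
    {"deptno": 10, "dname": "ACCOUNTING"},
    {"deptno": 20, "dname": "RESEARCH"},
    {"deptno": 40, "dname": "OPERATIONS"},
]
emp = [
    {"empno": 1, "deptno": 10},
    {"empno": 2, "deptno": 10},
    {"empno": 3, "deptno": 20},
]


def conventional_join(dept, emp):
    """One output row per matching (department, employee) pair."""
    return [d for d in dept for e in emp if e["deptno"] == d["deptno"]]


def semi_join(dept, emp):
    """Departments with at least one employee, each returned at most once."""
    # Only the distinct join-column values need to travel from site B to site A.
    shipped = {e["deptno"] for e in emp}
    return [d for d in dept if d["deptno"] in shipped], shipped


joined = conventional_join(dept, emp)
reduced, shipped = semi_join(dept, emp)
```

Here the conventional join repeats department 10 once per employee, while the semi-join returns each department once and ships only two department numbers instead of three full employee rows.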

UNIT – III – DISTRIBUTED DATABASE – SCS1613

Unit III
MANAGEMENT OF DISTRIBUTED TRANSACTIONS

Management of Distributed Transactions- Framework for Transaction Management-

Supporting Atomicity of Distributed Transactions- Concurrency Control for Distributed
Transactions- Architectural Aspects of Distributed Transactions-Concurrency Control-
Foundation of Distributed Concurrency Control- Distributed Deadlocks-Concurrency
Control based on Timestamps- Optimistic Methods for Distributed Concurrency Control

A transaction is a program including a collection of database operations, executed as a
logical unit of data processing. The operations performed in a transaction include one or more
database operations like insert, delete, update or retrieve data.
• read_item() − reads data item from storage to main memory.

• modify_item() − change value of item in the main memory.

• write_item() − write the modified value from main memory to storage.

Transaction Operations
The low level operations performed in a transaction are −

• begin_transaction − A marker that specifies start of transaction execution.

• read_item or write_item − Database operations that may be interleaved with main
memory operations as a part of transaction.

• end_transaction − A marker that specifies end of transaction.

• commit − A signal to specify that the transaction has been successfully completed in
its entirety and will not be undone.
• rollback − A signal to specify that the transaction has been unsuccessful and so all
temporary changes in the database are undone. A committed transaction cannot be
rolled back.
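The operations listed above can be illustrated with a toy Python transaction over an in-memory dictionary. This is a minimal sketch, assuming a single transaction at a time and using saved old values to undo writes; the class, the account names and the `transfer` helper are hypothetical.

```python
class Transaction:
    """Toy transaction over a dict that stands in for the database on disk."""

    def __init__(self, storage):
        self.storage = storage
        self.memory = {}         # main-memory copies of data items
        self.before_images = {}  # old stored values, kept so writes can be undone
        self.state = "active"    # plays the role of begin_transaction

    def read_item(self, name):
        self.memory[name] = self.storage[name]
        return self.memory[name]

    def modify_item(self, name, value):
        self.memory[name] = value

    def write_item(self, name):
        # Remember the value from before the first write to this item.
        self.before_images.setdefault(name, self.storage[name])
        self.storage[name] = self.memory[name]

    def commit(self):
        self.before_images.clear()
        self.state = "committed"

    def rollback(self):
        if self.state == "committed":
            raise RuntimeError("a committed transaction cannot be rolled back")
        for name, old_value in self.before_images.items():
            self.storage[name] = old_value
        self.state = "aborted"


def transfer(storage, amount, fail=False):
    """Move amount from account A to account B, optionally failing midway."""
    t = Transaction(storage)
    t.modify_item("A", t.read_item("A") - amount)
    t.write_item("A")
    if fail:
        t.rollback()  # undo the partial update to A
        return t
    t.modify_item("B", t.read_item("B") + amount)
    t.write_item("B")
    t.commit()
    return t


accounts = {"A": 100, "B": 50}
t1 = transfer(accounts, 30, fail=True)
after_rollback = dict(accounts)
t2 = transfer(accounts, 30)
after_commit = dict(accounts)
```

The failed transfer leaves both balances untouched, while the committed one changes both, so no partial update is ever visible once the transaction ends.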

Desirable Properties of Transactions

Any transaction must maintain the ACID properties, viz. Atomicity, Consistency, Isolation,
and Durability.

• Atomicity − This property states that a transaction is an atomic unit of processing,
that is, either it is performed in its entirety or not performed at all. No partial update
should exist.

• Consistency − A transaction should take the database from one consistent state to
another consistent state. It should not adversely affect any data item in the database.

• Isolation − A transaction should be executed as if it is the only one in the system.
There should not be any interference from the other concurrent transactions that are
simultaneously running.

• Durability − If a committed transaction brings about a change, that change should be
durable in the database and not lost in case of any failure.
States of a transaction
Active: Initial state and during the execution
Partially committed: After the final statement has been executed
Committed: After successful completion
Failed: After the discovery that normal execution can no longer proceed
Aborted: After the transaction has been rolled back and the DB restored to its state
prior to the start of the transaction. Restart it again or kill it.
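The states can be viewed as a small state machine. The notes list the states but not the transitions, so the transition table in this Python sketch follows the usual textbook diagram and should be read as an assumption.

```python
# Allowed next states for each transaction state (usual textbook diagram).
TRANSITIONS = {
    "active": {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed": {"aborted"},
    "committed": set(),  # terminal state
    "aborted": set(),    # terminal state (the transaction may be restarted as a new one)
}


def is_valid_path(path):
    """Check that every consecutive pair of states is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))


successful = is_valid_path(["active", "partially committed", "committed"])
failed_run = is_valid_path(["active", "failed", "aborted"])
impossible = is_valid_path(["committed", "aborted"])
```

The last path is rejected because a committed transaction has no outgoing transitions, which matches the rule that a committed transaction cannot be rolled back.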

Goal:
The goal of transaction management in a distributed database is to control the execution of
transactions so that transactions have atomicity, durability, serializability and isolation
properties. In doing so, the following aspects must also be taken into account:
• CPU and main memory utilization
• Control messages
• Response time
• Availability

Distributed Transactions
A distributed transaction is a database transaction in which two or more network hosts are
involved. Usually, hosts provide transactional resources, while the transaction manager is
responsible for creating and managing a global transaction that encompasses all operations
against such resources.

Supporting Atomicity of Distributed Transactions

Logs:
A log contains information for undoing or redoing all actions which are performed by
transactions. The log record contains
• Identifier of the transaction
• Identifier of the record
• Type of action (insert, delete, modify)
• Old record value
• New record value
• Auxiliary information for the recovery procedure
Recovery procedures:
When a failure occurs a recovery procedure reads the log file and performs the following
operations,
1. Determine all noncommitted transactions that have to be undone.
2. Determine all transactions which need to be redone.
3. Undo the transactions determined at step 1 and redo the transactions determined at
step 2.
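
The recovery procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: the `LogRecord` layout, the "update"/"commit" record kinds and the dictionary standing in for the database are hypothetical, not part of any particular DBMS.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class LogRecord:
    txn: str                      # identifier of the transaction
    kind: str                     # "update" or "commit" (hypothetical encoding)
    record: Optional[str] = None  # identifier of the record
    old: Any = None               # old record value
    new: Any = None               # new record value


def recover(log, db):
    """Undo noncommitted transactions, then redo committed ones."""
    committed = {r.txn for r in log if r.kind == "commit"}
    # Undo pass: scan backwards and restore old values of noncommitted transactions.
    for r in reversed(log):
        if r.kind == "update" and r.txn not in committed:
            db[r.record] = r.old
    # Redo pass: scan forwards and reapply new values of committed transactions.
    for r in log:
        if r.kind == "update" and r.txn in committed:
            db[r.record] = r.new
    return db


log = [
    LogRecord("T1", "update", "x", old=1, new=5),
    LogRecord("T2", "update", "y", old=2, new=9),
    LogRecord("T1", "commit"),
]
# After the failure, T1's committed write to x was lost and T2's
# noncommitted write to y had reached the database.
print(recover(log, {"x": 1, "y": 9}))  # {'x': 5, 'y': 2}
```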

Recovery of distributed transactions

Each site has a local transaction manager (LTM) which is capable of implementing
local transactions.

Figure 3.1 Reference model of distributed transaction recovery

The relationship between distributed transaction management and local transaction
management is represented in the reference model. At the bottom level we have the local
transaction managers, which do not need communication between them. The LTMs
implement the interface local_begin, local_commit, and local_abort. At the next higher level
we have the distributed transaction manager (DTM). The DTM is by its nature a distributed
level; it is implemented by a set of local DTM agents which exchange messages
between them. The DTM implements the interface begin_transaction, commit, abort, and create.
At the highest level we have the distributed transaction, constituted by the root agent and the
other agents.

The 2-phase commit protocol

Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The
steps performed in the two phases are as follows −

Phase 1: Prepare Phase

• After each slave has locally completed its transaction, it sends a “DONE” message to
the controlling site. When the controlling site has received “DONE” message from
all slaves, it sends a “Prepare” message to the slaves.

• The slaves vote on whether they still want to commit or not. If a slave wants to
commit, it sends a “Ready” message.

• A slave that does not want to commit sends a “Not Ready” message. This may
happen when the slave has conflicting concurrent transactions or there is a timeout.

Phase 2: Commit/Abort Phase

• After the controlling site has received “Ready” message from all the slaves −

o The controlling site sends a “Global Commit” message to the slaves.

o The slaves apply the transaction and send a “Commit ACK” message to the
controlling site.

o When the controlling site receives “Commit ACK” message from all the
slaves, it considers the transaction as committed.

• After the controlling site has received the first “Not Ready” message from any slave
−

o The controlling site sends a “Global Abort” message to the slaves.

o The slaves abort the transaction and send an “Abort ACK” message to the
controlling site.

o When the controlling site receives “Abort ACK” message from all the slaves,
it considers the transaction as aborted.
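
The message flow of the two phases can be illustrated with a minimal, single-process sketch. The function name `two_phase_commit` and the dictionary of votes are assumptions made for the example; a real implementation would exchange these messages over the network and log each state change.

```python
def two_phase_commit(votes):
    """Simulate the controlling site.

    votes maps each slave name to True ("Ready") or False ("Not Ready").
    Returns the global decision, the acknowledgements and the final outcome.
    """
    # Phase 1: the controlling site sends "Prepare" and collects the votes.
    all_ready = all(votes.values())
    # Phase 2: the controlling site broadcasts the decision and collects the ACKs.
    decision = "Global Commit" if all_ready else "Global Abort"
    ack = "Commit ACK" if all_ready else "Abort ACK"
    acks = {site: ack for site in votes}
    outcome = "committed" if all_ready else "aborted"
    return decision, acks, outcome


print(two_phase_commit({"site_a": True, "site_b": True})[0])   # Global Commit
print(two_phase_commit({"site_a": True, "site_b": False})[0])  # Global Abort
```

A single “Not Ready” vote is enough to abort, which matches the rule that the controlling site acts on the first “Not Ready” message it receives.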
Concurrency control for distributed Transactions

Locking Based Concurrency Control Protocols

Locking-based concurrency control protocols use the concept of locking data items.
A lock is a variable associated with a data item that determines whether read/write
operations can be performed on that data item. Generally, a lock compatibility matrix is used
which states whether a data item can be locked by two transactions at the same time.

Locking-based concurrency control systems can use either one-phase or two-phase locking
protocols.

One-phase Locking Protocol

In this method, each transaction locks an item before use and releases the lock as soon as it
has finished using it. This locking method provides for maximum concurrency but does not
always enforce serializability.

Two-phase Locking Protocol

In this method, all locking operations precede the first lock-release or unlock operation. The
transaction comprises two phases. In the first phase, a transaction only acquires the
locks it needs and does not release any lock. This is called the expanding or the growing
phase. In the second phase, the transaction releases the locks and cannot request any new
locks. This is called the shrinking phase.

Every schedule in which all transactions follow the two-phase locking protocol is guaranteed
to be serializable.

However, this approach provides low parallelism between two conflicting transactions.
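
The two-phase rule itself is easy to enforce per transaction. The sketch below is illustrative only: the class `TwoPhaseTransaction` is a hypothetical name, and it checks only the growing/shrinking discipline, not lock compatibility between different transactions.

```python
class TwoPhaseTransaction:
    """Tracks one transaction's locks and enforces the two-phase rule."""

    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False  # becomes True after the first unlock

    def lock(self, item):
        # Growing phase only: no new lock once any lock has been released.
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot lock {item!r} after releasing a lock")
        self.held.add(item)

    def unlock(self, item):
        self.held.remove(item)
        self.shrinking = True  # the shrinking phase has started


t1 = TwoPhaseTransaction("T1")
t1.lock("A")
t1.lock("B")
t1.unlock("A")
try:
    t1.lock("C")
except RuntimeError as err:
    print(err)
```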

Architectural Aspects of Distributed Transactions

• Structure of the computation

• Communication of a distributed transactions

• Sessions and datagrams:
The communications between processes or servers can be performed through sessions and
datagrams. Sessions have a basic advantage: the authentication and identification functions
need to be performed only once and then messages can be exchanged without repeating
these operations.

Communication structure for commit protocols

• Centralized

Figure 3.2 Centralized (messages shown: Prepare; Ready or Abort; Commit or Abort; ACK)

• Hierarchical

Figure 3.3 Hierarchical (messages shown: Prepare; Ready or Abort; Commit or Abort; ACK)

Figure 3.4 Hierarchical

• Linear

Figure 3.5 Linear (sites 1, 2, 3, 4 in sequence; messages shown: Prepare or Ready; Commit or Abort)

An ordering of the sites is defined.

• Distributed

Figure 3.6 Distributed (messages shown: Prepare; Ready or Abort. No messages are required for the decision)

Concurrency Control
Concurrency controlling techniques ensure that multiple transactions are executed
simultaneously while maintaining the ACID properties of the transactions and serializability
in the schedules.

Serializability in distributed database

In a system with a number of simultaneous transactions, a schedule is the total order of
execution of operations. Given a schedule S comprising n transactions, say T1, T2,
T3, ..., Tn; for any transaction Ti, the operations in Ti must execute as laid down in the
schedule S.
Types of Schedules
There are two types of schedules −

• Serial Schedules − In a serial schedule, at any point of time, only one transaction is
active, i.e. there is no overlapping of transactions. This is depicted in the following
graph −

Figure 3.7 Serial Schedules

• Parallel Schedules − In parallel schedules, more than one transaction is active
simultaneously, i.e. the transactions contain operations that overlap in time. This is
depicted in the following graph −

Figure 3.8 Parallel Schedules

Conflicts in Schedules
In a schedule comprising multiple transactions, a conflict occurs when two active
transactions perform non-compatible operations. Two operations are said to be in conflict
when all of the following three conditions exist simultaneously −

• The two operations are parts of different transactions.

• Both the operations access the same data item.

• At least one of the operations is a write_item() operation, i.e. it tries to modify the
data item.

Serializability
A serializable schedule of ‘n’ transactions is a parallel schedule which is equivalent to a
serial schedule comprising the same ‘n’ transactions. A serializable schedule retains the
correctness of a serial schedule while achieving the better CPU utilization of a parallel schedule.

Equivalence of Schedules
Equivalence of two schedules can be of the following types −

• Result equivalence − Two schedules producing identical results are said to be result
equivalent.

• View equivalence − Two schedules are said to be view equivalent if, in both schedules,
every read operation reads the value written by the same operation (or the initial value)
and the final write on each data item is made by the same transaction.

• Conflict equivalence − Two schedules are said to be conflict equivalent if both
contain the same set of transactions and have the same order of conflicting pairs of
operations.

Serial schedules have less resource utilization and low throughput. To improve this, two or
more transactions are run concurrently. But concurrency of transactions may lead to
inconsistency in the database. To avoid this, we need to check whether these concurrent
schedules are serializable or not.
Conflict Serializable: A schedule is called conflict serializable if it can be transformed into a
serial schedule by swapping non-conflicting operations.
Conflicting operations: Two operations are said to be conflicting if all of the following
conditions hold:
• They belong to different transactions
• They operate on the same data item
• At least one of them is a write operation
Example: –
• The pair (R1(A), W2(A)) is conflicting because the operations belong to two different
transactions, act on the same data item A, and one of them is a write operation.
• Similarly, (W1(A), W2(A)) and (W1(A), R2(A)) pairs are also conflicting.
• On the other hand, the (R1(A), W2(B)) pair is non-conflicting because they operate on
different data items.
• Similarly, the (W1(A), W2(B)) pair is non-conflicting.
Consider the following schedule:
S1: R1(A), W1(A), R2(A), W2(A), R1(B), W1(B), R2(B), W2(B)
If Oi and Oj are two operations in a transaction and Oi< Oj (Oi is executed before Oj),
same order will follow in schedule as well. Using this property, we can get two
transactions of schedule S1 as:
T1: R1(A), W1(A), R1(B), W1(B)
T2: R2(A), W2(A), R2(B), W2(B)
Possible Serial Schedules are: T1->T2 or T2->T1
-> Moving R1(B) ahead of W2(A) and then ahead of R2(A) in S1 (each step swaps two adjacent
non-conflicting operations, since they access different data items), the schedule becomes,
S11: R1(A), W1(A), R1(B), R2(A), W2(A), W1(B), R2(B), W2(B)
-> Similarly, moving W1(B) ahead of W2(A) and then ahead of R2(A) in S11, the
schedule becomes,
S12: R1(A), W1(A), R1(B), W1(B), R2(A), W2(A), R2(B), W2(B)
S12 is a serial schedule in which all operations of T1 are performed before starting any
operation of T2. Since S1 has been transformed into a serial schedule S12 by swapping
non-conflicting operations of S1, S1 is conflict serializable.
Let us take another Schedule:
S2: R2(A), W2(A), R1(A), W1(A), R1(B), W1(B), R2(B), W2(B)
Two transactions will be:
T1: R1(A), W1(A), R1(B), W1(B)
T2: R2(A), W2(A), R2(B), W2(B)
Possible Serial Schedules are: T1->T2 or T2->T1
Original Schedule is:
S2: R2(A), W2(A), R1(A), W1(A), R1(B), W1(B), R2(B), W2(B)
Only adjacent operations that do not conflict may be swapped. In S2, W2(A) comes before
R1(A) and both access A, so this pair conflicts and T2 must precede T1 in any equivalent
serial schedule. But W1(B) comes before R2(B) and both access B, so T1 must also precede
T2. Neither conflicting pair can be reordered by swapping, and no serial schedule satisfies
both requirements. So S2 is not conflict serializable.
Conflict Equivalent: Two schedules are said to be conflict equivalent when one can be
transformed to another by swapping non-conflicting operations. In the example discussed
above, S11 is conflict equivalent to S1 (S1 can be converted to S11 by swapping
non-conflicting operations). Similarly, S11 is conflict equivalent to S12 and so on.
Note 1: A schedule that is not conflict serializable, such as S2, is not conflict equivalent
to any serial schedule: no sequence of swaps of non-conflicting operations turns it into
T1->T2 or T2->T1.
Note 2: A schedule which is conflict serializable is always conflict equivalent to one of
the serial schedules. The S1 schedule discussed above (which is conflict serializable) is
equivalent to the serial schedule (T1->T2).
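
Instead of swapping operations by hand, conflict serializability is commonly tested with a precedence graph: add an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj, and check that the graph has no cycle. The sketch below applies this standard test to the two example schedules; the helper names `parse`, `precedence_edges` and `is_conflict_serializable` are chosen for illustration.

```python
from itertools import combinations


def parse(schedule):
    """Turn 'R1(A), W2(A)' into [('R', '1', 'A'), ('W', '2', 'A')]."""
    ops = []
    for token in schedule.split(","):
        token = token.strip()
        kind, rest = token[0], token[1:]
        txn, item = rest.rstrip(")").split("(")
        ops.append((kind, txn, item))
    return ops


def precedence_edges(ops):
    """Edge (Ti, Tj) for every conflicting pair where Ti's operation comes first."""
    edges = set()
    for (k1, t1, x1), (k2, t2, x2) in combinations(ops, 2):
        if t1 != t2 and x1 == x2 and "W" in (k1, k2):
            edges.add((t1, t2))
    return edges


def is_conflict_serializable(schedule):
    edges = precedence_edges(parse(schedule))
    nodes = {t for edge in edges for t in edge}
    # Repeatedly remove transactions with no incoming edge; leftovers form a cycle.
    while nodes:
        sources = {n for n in nodes
                   if not any(b == n and a in nodes for a, b in edges)}
        if not sources:
            return False
        nodes -= sources
    return True


S1 = "R1(A), W1(A), R2(A), W2(A), R1(B), W1(B), R2(B), W2(B)"
S2 = "R2(A), W2(A), R1(A), W1(A), R1(B), W1(B), R2(B), W2(B)"
print(is_conflict_serializable(S1))  # True
print(is_conflict_serializable(S2))  # False
```

For S1 the only edge is T1 -> T2, so the equivalent serial order is T1->T2. For S2 the graph contains both T2 -> T1 (on item A) and T1 -> T2 (on item B), which is a cycle.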

Distributed deadlocks
Distributed deadlocks can occur in distributed systems when distributed transactions or
concurrency control is being used. Distributed deadlocks can be detected either by
constructing a global wait-for graph from local wait-for graphs at a deadlock detector or by
a distributed algorithm like edge chasing.
Transaction processing in a distributed database system is also distributed, i.e. the same
transaction may be processing at more than one site. The two main deadlock handling
concerns in a distributed database system that are not present in a centralized system
are transaction location and transaction control. Once these concerns are addressed,
deadlocks are handled through any of deadlock prevention, deadlock avoidance or deadlock
detection and removal.

Transaction Location
Transactions in a distributed database system are processed in multiple sites and use data
items in multiple sites. The amount of data processing is not uniformly distributed among
these sites. The time period of processing also varies. Thus the same transaction may be
active at some sites and inactive at others. When two conflicting transactions are located in a
site, it may happen that one of them is in inactive state. This condition does not arise in a
centralized system. This concern is called transaction location issue.

This concern may be addressed by Daisy Chain model. In this model, a transaction carries
certain details when it moves from one site to another. Some of the details are the list of
tables required, the list of sites required, the list of visited tables and sites, the list of tables
and sites that are yet to be visited and the list of acquired locks with types. After a
transaction terminates by either commit or abort, the information should be sent to all the
concerned sites.

Transaction Control
Transaction control is concerned with designating and controlling the sites required for
processing a transaction in a distributed database system. There are many options regarding
the choice of where to process the transaction and how to designate the center of control,
like −

• One server may be selected as the center of control.

• The center of control may travel from one server to another.
• The responsibility of controlling may be shared by a number of servers.
Distributed Deadlock Prevention
Just like in centralized deadlock prevention, in distributed deadlock prevention approach, a
transaction should acquire all the locks before starting to execute. This prevents deadlocks.

The site where the transaction enters is designated as the controlling site. The controlling
site sends messages to the sites where the data items are located to lock the items. Then it
waits for confirmation. When all the sites have confirmed that they have locked the data
items, transaction starts. If any site or communication link fails, the transaction has to wait
until they have been repaired.

Though the implementation is simple, this approach has some drawbacks −

• Pre-acquisition of locks takes a long time because of communication delays. This
increases the time required for the transaction.

• In case of site or link failure, a transaction has to wait for a long time so that the sites
recover. Meanwhile, in the running sites, the items are locked. This may prevent
other transactions from executing.

• If the controlling site fails, it cannot communicate with the other sites. These sites
continue to keep the locked data items in their locked state, thus resulting in
blocking.

Distributed Deadlock Avoidance

As in a centralized system, distributed deadlock avoidance handles deadlock prior to
occurrence. Additionally, in distributed systems, transaction location and transaction control
issues need to be addressed. Due to the distributed nature of the transaction, the following
conflicts may occur −

• Conflict between two transactions in the same site.

• Conflict between two transactions in different sites.
In case of conflict, one of the transactions may be aborted or allowed to wait as per
distributed wait-die or distributed wound-wait algorithms.
Let us assume that there are two transactions, T1 and T2. T1 arrives at Site P and tries to
lock a data item which is already locked by T2 at that site. Hence, there is a conflict at Site
P. The algorithms are as follows −

• Distributed Wait-Die

o If T1 is older than T2, T1 is allowed to wait. T1 can resume execution after
Site P receives a message that T2 has either committed or aborted
successfully at all sites.

o If T1 is younger than T2, T1 is aborted. The concurrency control at Site P
sends a message to all sites where T1 has visited to abort T1. The controlling
site notifies the user when T1 has been successfully aborted in all the sites.

• Distributed Wound-Wait

o If T1 is older than T2, T2 needs to be aborted. If T2 is active at Site P, Site P
aborts and rolls back T2 and then broadcasts this message to other relevant
sites. If T2 has left Site P but is active at Site Q, Site P broadcasts that T2 has
been aborted; Site Q then aborts and rolls back T2 and sends this message to
all sites.

o If T1 is younger than T2, T1 is allowed to wait. T1 can resume execution after
Site P receives a message that T2 has completed processing.
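
Both algorithms reduce to a timestamp comparison made at the site where the conflict occurs. In the sketch below, `ts_requester` is the timestamp of T1 (the transaction asking for the lock) and `ts_holder` is that of T2 (the transaction holding it); a smaller timestamp means an older transaction. The function names and return strings are illustrative assumptions, and the messages that propagate an abort to other sites are left out.

```python
def wait_die(ts_requester, ts_holder):
    """Wait-die: an older requester waits, a younger requester is aborted."""
    return "wait" if ts_requester < ts_holder else "abort requester"


def wound_wait(ts_requester, ts_holder):
    """Wound-wait: an older requester aborts the holder, a younger one waits."""
    return "abort holder" if ts_requester < ts_holder else "wait"


# T1 (timestamp 5) is older than T2 (timestamp 8).
print(wait_die(5, 8))    # wait
print(wound_wait(5, 8))  # abort holder
# T1 (timestamp 9) is younger than T2 (timestamp 8).
print(wait_die(9, 8))    # abort requester
print(wound_wait(9, 8))  # wait
```

In both schemes only the younger transaction is ever aborted, so an old transaction cannot be starved by repeated restarts as long as it keeps its original timestamp.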

Distributed Deadlock Detection

Just like centralized deadlock detection approach, deadlocks are allowed to occur and are
removed if detected. The system does not perform any checks when a transaction places a
lock request. For implementation, global wait-for-graphs are created. Existence of a cycle in
the global wait-for-graph indicates deadlocks. However, it is difficult to spot deadlocks
since transaction waits for resources across the network.
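
A minimal sketch of the idea, assuming each site reports its local wait-for graph as a dictionary mapping a waiting transaction to the set of transactions it waits for (the representation and the names `merge_graphs` and `has_deadlock` are hypothetical):

```python
def merge_graphs(local_graphs):
    """Union the local wait-for graphs into one global wait-for graph."""
    global_graph = {}
    for graph in local_graphs:
        for waiter, holders in graph.items():
            global_graph.setdefault(waiter, set()).update(holders)
    return global_graph


def has_deadlock(graph):
    """Depth-first search; reaching a node that is still being visited means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


site_p = {"T1": {"T2"}}  # at Site P, T1 waits for T2
site_q = {"T2": {"T1"}}  # at Site Q, T2 waits for T1
print(has_deadlock(site_p), has_deadlock(site_q))    # False False
print(has_deadlock(merge_graphs([site_p, site_q])))  # True
```

Neither local graph contains a cycle; the deadlock is visible only in the global graph, which is why local detection alone is not enough.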

Alternatively, deadlock detection algorithms can use timers. Each transaction is associated
with a timer which is set to a time period in which a transaction is expected to finish. If a
transaction does not finish within this time period, the timer goes off, indicating a possible
deadlock.

Another tool used for deadlock handling is a deadlock detector. In a centralized system,
there is one deadlock detector. In a distributed system, there can be more than one deadlock
detector. A deadlock detector can find deadlocks for the sites under its control. There are
three alternatives for deadlock detection in a distributed system, namely:

• Centralized Deadlock Detector − One site is designated as the central deadlock
detector.

• Hierarchical Deadlock Detector − A number of deadlock detectors are arranged in a
hierarchy.

• Distributed Deadlock Detector − All the sites participate in detecting deadlocks and
removing them.

Time and time stamps in a distributed database

A timestamp is a unique identifier created by the DBMS to identify a transaction. Timestamps
are usually assigned in the order in which transactions are submitted to the system. We refer
to the timestamp of a transaction T as TS(T).
Timestamp Ordering Protocol –
The main idea of this protocol is to order the transactions based on their timestamps. A
schedule in which the transactions participate is then serializable, and the only equivalent
serial schedule permitted has the transactions in the order of their timestamp values. Stated
simply, the schedule is equivalent to the particular serial order corresponding to the order of
the transaction timestamps. The algorithm must ensure that, for each item accessed
by conflicting operations in the schedule, the order in which the item is accessed does not
violate the timestamp ordering. To ensure this, it uses two timestamp values relating to each
database item X.

• W_TS(X) is the largest timestamp of any transaction that
executed write(X) successfully.
• R_TS(X) is the largest timestamp of any transaction that
executed read(X) successfully.

Timestamp Concurrency Control Algorithms

Timestamp-based concurrency control algorithms use a transaction’s timestamp to
coordinate concurrent access to a data item to ensure serializability. A timestamp is a unique
identifier given by DBMS to a transaction that represents the transaction’s start time.

These algorithms ensure that transactions commit in the order dictated by their timestamps.
An older transaction should commit before a younger transaction, since the older transaction
enters the system before the younger one.

Timestamp-based concurrency control techniques generate serializable schedules such that
the equivalent serial schedule is arranged in order of the age of the participating transactions.

Some timestamp-based concurrency control algorithms are −

• Basic timestamp ordering algorithm.

• Conservative timestamp ordering algorithm.

• Multiversion algorithm based upon timestamp ordering.

Timestamp-based ordering follows three rules to enforce serializability −
• Access Rule − When two transactions try to access the same data item
simultaneously, for conflicting operations, priority is given to the older transaction.
This causes the younger transaction to wait for the older transaction to commit first.

• Late Transaction Rule − If a younger transaction has written a data item, then an
older transaction is not allowed to read or write that data item. This rule prevents the
older transaction from committing after the younger transaction has already
committed.

• Younger Transaction Rule − A younger transaction can read or write a data item
that has already been written by an older transaction.
Basic Timestamp Ordering –
Every transaction is issued a timestamp based on when it enters the system. If an
old transaction Ti has timestamp TS(Ti), a new transaction Tj is assigned timestamp TS(Tj)
such that TS(Ti) < TS(Tj). The protocol manages concurrent execution such that the
timestamps determine the serializability order. The timestamp ordering protocol ensures that
any conflicting read and write operations are executed in timestamp order. Whenever some
transaction T tries to issue a R_item(X) or a W_item(X), the Basic TO algorithm compares
the timestamp of T with R_TS(X) and W_TS(X) to ensure that the timestamp order is not
violated. The Basic TO protocol is described by the following two cases.
1. Whenever a transaction T issues a W_item(X) operation, check the following
conditions:
• If R_TS(X) > TS(T) or if W_TS(X) > TS(T), then abort and roll back T and reject
the operation, else
• Execute the W_item(X) operation of T and set W_TS(X) to TS(T).
2. Whenever a transaction T issues a R_item(X) operation, check the following
conditions:
• If W_TS(X) > TS(T), then abort and roll back T and reject the operation, else
• If W_TS(X) <= TS(T), then execute the R_item(X) operation of T and set
R_TS(X) to the larger of TS(T) and current R_TS(X).
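
The two cases translate almost directly into code. The sketch below is a minimal illustration: the class name `BasicTO`, the use of integers as timestamps and the default timestamp 0 for items never accessed are assumptions, and the actual rollback of an aborted transaction is not modelled.

```python
class BasicTO:
    """Keeps R_TS(X) and W_TS(X) per data item and applies the Basic TO checks."""

    def __init__(self):
        self.r_ts = {}  # R_TS(X): largest timestamp that read X
        self.w_ts = {}  # W_TS(X): largest timestamp that wrote X

    def write(self, ts, x):
        # Case 1: reject if a younger transaction has already read or written X.
        if self.r_ts.get(x, 0) > ts or self.w_ts.get(x, 0) > ts:
            return "abort"
        self.w_ts[x] = ts
        return "ok"

    def read(self, ts, x):
        # Case 2: reject if a younger transaction has already written X.
        if self.w_ts.get(x, 0) > ts:
            return "abort"
        self.r_ts[x] = max(self.r_ts.get(x, 0), ts)
        return "ok"


sched = BasicTO()
print(sched.read(2, "X"))   # ok
print(sched.write(1, "X"))  # abort
print(sched.write(3, "X"))  # ok
print(sched.read(2, "X"))   # abort
```

In the usage lines, the write by the transaction with timestamp 1 is rejected because X has already been read by timestamp 2, and the later read by timestamp 2 is rejected because X has since been written by timestamp 3.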
Whenever the Basic TO algorithm detects two conflicting operations that occur in incorrect
order, it rejects the later of the two operations by aborting the transaction that issued it.
Schedules produced by Basic TO are guaranteed to be conflict serializable. Because no
transaction ever waits under timestamp ordering, the schedules are also deadlock free.
One drawback of the Basic TO protocol is that cascading rollback is still possible. Suppose
we have transactions T1 and T2, and T2 has used a value written by T1. If T1 is aborted and
resubmitted to the system, then T2 must also be aborted and rolled back. So the problem of
cascading aborts still prevails.
To summarize the advantages and disadvantages of the Basic TO protocol:
• The timestamp ordering protocol ensures serializability.
• The timestamp protocol ensures freedom from deadlock, as no transaction ever waits.
• But the schedule may not be cascade free, and may not even be recoverable.
Optimistic Concurrency Control Algorithm
In systems with low conflict rates, the task of validating every transaction for serializability
may lower performance. In these cases, the test for serializability is postponed to just before
commit. Since the conflict rate is low, the probability of aborting transactions which are not
serializable is also low. This approach is called optimistic concurrency control technique.

In this approach, a transaction’s life cycle is divided into the following three phases −

• Execution Phase − A transaction fetches data items to memory and performs
operations upon them.

• Validation Phase − A transaction performs checks to ensure that committing its
changes to the database passes the serializability test.

• Commit Phase − A transaction writes back modified data items in memory to the
disk.

This algorithm uses three rules to enforce serializability in validation phase −

Rule 1 − Given two transactions Ti and Tj, if Ti is reading the data item which Tj is writing,
then Ti’s execution phase cannot overlap with Tj’s commit phase. Tj can commit only after
Ti has finished execution.

Rule 2 − Given two transactions Ti and Tj, if Ti is writing the data item that Tj is reading,
then Ti’s commit phase cannot overlap with Tj’s execution phase. Tj can start executing only
after Ti has already committed.

Rule 3 − Given two transactions Ti and Tj, if Ti is writing the data item which Tj is also
writing, then Ti’s commit phase cannot overlap with Tj’s commit phase. Tj can start to
commit only after Ti has already committed.
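
The three rules can be checked mechanically if each phase is modelled as a time interval. The sketch below is one possible reading under stated assumptions: phases are half-open `(start, end)` intervals, `Txn` and `violates` are hypothetical names, and the check is applied to the ordered pair (Ti, Tj) exactly as the rules are worded.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    reads: set        # data items read
    writes: set       # data items written
    execution: tuple  # (start, end) of the execution phase
    commit: tuple     # (start, end) of the commit phase


def overlaps(a, b):
    """True if the half-open intervals a and b share any point in time."""
    return a[0] < b[1] and b[0] < a[1]


def violates(ti, tj):
    """Return the first rule broken by the ordered pair (Ti, Tj), or None."""
    if ti.reads & tj.writes and overlaps(ti.execution, tj.commit):
        return "Rule 1"
    if ti.writes & tj.reads and overlaps(ti.commit, tj.execution):
        return "Rule 2"
    if ti.writes & tj.writes and overlaps(ti.commit, tj.commit):
        return "Rule 3"
    return None


ti = Txn(reads={"x"}, writes=set(), execution=(0, 5), commit=(6, 7))
tj = Txn(reads=set(), writes={"x"}, execution=(1, 3), commit=(4, 5))
print(violates(ti, tj))  # Rule 1
```

Here Tj commits its write to x while Ti, which reads x, is still executing, so Rule 1 is broken; delaying Tj's commit phase to (5, 6) would remove the overlap.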
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

UNIT – IV – DISTRIBUTED DATABASE – SCS1613

UNIT IV
RELIABILITY AND PROTECTION
Reliability - Basic Concepts - Reliability and Concurrency Control - Determining a
Consistent View of the Network - Detection and Resolution of Inconsistency - Checkpoints
and Cold Restart - Distributed Database Administration - Catalog Management in Distributed
Databases - Authorization and Protection
RELIABILITY
Reliability is defined as a measure of the success with which the system conforms to
some authoritative specification of its behavior. When the behavior deviates from that which
is specified for it, this is called a failure. The reliability of the system is inversely related to
the frequency of failures.
The reliability of a system can be measured in several ways, which are based on the
incidence of failures. Measures include Mean Time Between Failures (MTBF), Mean Time To
Repair (MTTR), and availability, defined as the fraction of the time that the system meets its
specification. MTBF is the average amount of time between system failures. MTTR is
the average amount of time taken to repair a failed system.
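
As a small worked example, availability is commonly estimated from the other two measures as MTBF / (MTBF + MTTR). This assumes that MTBF is taken as the mean operating time between failures, excluding repair time; the function name `availability` is only illustrative.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of time the system is up, assuming MTBF excludes repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# A system that runs 999 hours between failures and takes 1 hour to repair
# is up for 999 of every 1000 hours.
print(availability(999.0, 1.0))
```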
BASIC CONCEPTS
In a database system application, the highest level specification is application-
dependent. It is convenient to split the reliability problem into two separate parts, an
application-dependent part and an application-independent part.
We emphasize two aspects of reliability: correctness and availability. It is important
not only that a system behave correctly, i.e., in accordance with the specification, but also
that it be available when necessary.
In some applications, like banking applications, correctness is an absolute requirement,
and errors which may corrupt the consistency of the database cannot be tolerated. Other
applications may tolerate the risk of inconsistencies in order to achieve a greater availability.
When a communication network fails the following problems may arise,
1. Commitment of transactions
2. Multiple copies of data and robustness of concurrency control
3. Determining the state of the network
4. Detection and resolution of inconsistencies.
5. Checkpoints and cold restart.

NONBLOCKING COMMITMENT PROTOCOLS

A commitment protocol is called blocking if the occurrence of some kinds of failure
forces some of the participating sites to wait until the failure is repaired before terminating
the transaction. A transaction that cannot be terminated at a site is called pending at this site.
State diagrams are used for describing the evolution of the coordinator and participants
during the execution of a protocol.

Figure 4.1 two state diagrams for the 2-phase-commitment protocol

The above figure shows the two state diagrams for the 2-phase-commitment protocol without
the ACK messages. For each transition, an input message and an output message are
indicated. A transition occurs when an input message arrives and causes the output message
to be sent.
State information must be recorded into stable storage for recovery purposes. This
helps in writing appropriate records in the logs.
Consider a transition from state X to state Y with input I and output O. The following
behavior is assumed:
1. The input message I is received.
2. The new state Y is recorded on stable storage.
3. The output message O is sent.

If the site fails between the first and the second event, the state remains X, and the input
message is lost. If the site fails between the second and third event, then the site reaches state
Y, but the output message is not sent.
1. NONBLOCKING COMMITMENT PROTOCOLS WITH SITE FAILURE

The termination protocol for the 2-phase-commitment protocol must allow the
transactions to be terminated at all operational sites when a failure of the coordinator site
occurs. This is possible in the following two cases:
1. At least one of the participants has received the command. In this case, the other
participants can be told by this participant of the outcome of the transactions and can
terminate it.
2. None of the participants has received the command, and only the coordinator site has
crashed, so that all participants are operational. In this case, the participants can elect
a new coordinator and resume the protocol.

In the above cases, the transactions can be correctly terminated at all operational sites.
Termination is impossible when no operational participant has received the command and at
least one participant has failed, because the operational participants cannot know what the
failed participant has done and cannot take an independent decision. So, if the coordinator
fails together with a participant, termination may be impossible.
This problem can be eliminated by modifying the 2-phase-commitment protocol into the
3-phase-commitment protocol.
The 3-phase-commitment protocol
In this protocol, the participants do not directly commit the transactions during the second
phase of commitment, instead they reach in this phase a new prepared-to-commit(PC) state.
So an additional third phase is required for actually committing the transactions.

Figure 4.2 two state diagrams for the 3-phase-commitment protocol

This protocol eliminates the blocking problem of the 2-phase-commitment protocol
because,
1. If one of the operational participants has received the command and the command
was ABORT, then the operational participants can abort the transactions. The failed
participant will abort the transaction at restart if it has not done so already.
2. If one of the operational participants has received the command and the command
was ENTER-PREPARED-STATE, then all the operational participants can commit
the transactions, terminating the second phase if necessary and performing the third
phase.
3. If none of the operational participants has received the ENTER-PREPARED-STATE
command, the 2-phase-commitment protocol cannot be terminated. But with the new
protocol, the operational participants can abort the transactions, because the failed
participants have not committed. The failed participants therefore abort the transactions
at restart.

This new protocol requires three phases for committing a transaction and two phases for
aborting it.
Termination protocol for 3-phase-commitment
“If at least one operational participant has not entered the prepared-to-commit state, then
the transactions can be aborted. If at least one operational participant has entered the
prepared-to-commit state, then the transactions can be safely committed.”
Since the above two conditions are not mutually exclusive, in several cases the
termination protocol can decide whether to commit or abort the transactions. A protocol
which always commits the transactions when both cases are possible is called
progressive.
The simplest termination protocol is the centralized, nonprogressive protocol. First a
new coordinator is elected by the operational participants. Then the new coordinator
behaves as follows:
1. If the new coordinator is in the prepared-to-commit state, it issues to all
operational participants the command to enter this state as well. When it has
received all the OK messages, it issues the COMMIT command.
2. If the new coordinator is in commit state, i.e. it has committed the transactions, it
issues the COMMIT command to all participants.
3. If the new coordinator is in the abort state, it issues the ABORT command to all
participants.
4. Otherwise, the new coordinator orders all participants to go back to a state
previous to the prepared-to-commit and after it has received all the
acknowledgements, it issues the ABORT command.
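
The behaviour of the new coordinator is a four-way case analysis on its own state, which can be written down as a small function. The state strings and the command name "GO-BACK" are hypothetical labels chosen for this sketch; where two commands are returned, the second is issued only after all replies to the first have been received.

```python
def new_coordinator_commands(state):
    """Commands the newly elected coordinator issues, in order, given its state."""
    if state == "prepared-to-commit":
        # Bring every operational participant to the same state, then commit.
        return ["ENTER-PREPARED-STATE", "COMMIT"]
    if state == "commit":
        return ["COMMIT"]
    if state == "abort":
        return ["ABORT"]
    # Any other state: move participants back before prepared-to-commit, then abort.
    return ["GO-BACK", "ABORT"]


print(new_coordinator_commands("prepared-to-commit"))  # ['ENTER-PREPARED-STATE', 'COMMIT']
print(new_coordinator_commands("ready"))               # ['GO-BACK', 'ABORT']
```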

2. COMMITMENT PROTOCOLS AND NETWORK PARTITIONS

Existence of nonblocking protocols for partitions

The existence of nonblocking protocols for partitions is closely related to the existence of
protocols which allow independent recovery in case of site failures.
Such a protocol must work as in the following example. Suppose that we can build a
protocol such that if one site, say site 2, fails, then
1. The other site, site 1, terminates the transactions.
2. Site 2 at restart terminates the transactions correctly without requiring any
additional information from site 1.

The following four propositions hold for nonblocking commitment protocols:

1. Independent recovery protocols exist only for single-site failures; however there exists
no independent recovery protocol which is resilient to multiple-site failures.
2. There exists no nonblocking protocol that is resilient to a network partition if
messages are lost when the partition occurs.
3. There exist nonblocking protocols which are resilient to a single network partition if
all undeliverable messages are returned to the sender.
4. There exists no nonblocking protocol which is resilient to a multiple partition.

Protocols which can deal with partitions

It is convenient to allow the termination of the transactions by at least one group of sites,
possibly the largest group, so that blocking is minimized. However, a group cannot determine
whether it is the largest one, because it does not know the size of the other groups.
There are two approaches to this problem: the primary site approach and the majority
approach.
In the primary site approach, a site is designated as the primary site, and the group that
contains it is allowed to terminate the transactions.
In the majority approach, only the group which contains a majority of sites can terminate the
transactions. Here it is possible that no single group reaches a majority; in this case, all
groups are blocked.
A. Primary Site Approach

If the 2-phase-commitment protocol is used together with a primary site approach, then it is
possible to terminate all the transactions of the group of the primary site (the primary group) if
and only if the coordinators of all pending transactions belong to this group. This can be
achieved by assigning to the primary site the coordinator function for all transactions.
This approach is inefficient in most types of networks and it is very vulnerable to primary site
failure. To avoid this condition, the 3-phase-commitment protocol can be used within the
primary group.
B. Majority Approach and Quorum-Based Protocols

The majority approach avoids the disadvantages of the primary site approach. The basic idea
is that a majority of sites must agree on the abort or commit of a transaction before the
transaction is aborted or committed. A majority approach requires a specialized commitment
protocol. It cannot be applied with the standard 2-phase-commitment.
A straightforward generalization of the basic majority approach consists of assigning
different weights to the sites. Protocols which use a weighted majority are called
quorum-based protocols. The weights which are assigned to the sites are usually called
votes, since they are used when a site “votes” on the commit or abort of a transaction.
The basic rules of a quorum-based protocol are:
1. Each site i has associated with it a number of votes Vi, Vi being a positive integer.
2. Let V indicate the sum of the votes of all sites of the network.
3. A transaction must collect a commit quorum Vc before committing.
4. A transaction must collect an abort quorum Va before aborting.
5. Va + Vc > V.

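Rule 5 implements the basic majority idea: since Va + Vc > V, a commit quorum and an abort
quorum cannot both be collected from disjoint groups of sites, so a transaction is either
committed or aborted, never both. In practice, the choice Va + Vc = V + 1 is the most convenient one.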
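The effect of rule 5 can be checked by brute force on a small network. The sketch below uses illustrative vote values that are not taken from the notes, and the function names are hypothetical; it enumerates every split of the sites into a group and its complement and looks for a split in which one side reaches Vc while the other reaches Va.

```python
from itertools import combinations


def quorums_are_valid(votes, vc, va):
    """Rules 1 and 5: positive integer votes and Va + Vc > V."""
    total = sum(votes.values())
    positive = all(isinstance(v, int) and v > 0 for v in votes.values())
    return positive and va + vc > total


def group_votes(votes, group):
    """Sum of the votes of the sites in `group`."""
    return sum(votes[site] for site in group)


def conflicting_split_exists(votes, vc, va):
    """True if some group reaches Vc while the remaining sites reach Va."""
    sites = list(votes)
    for size in range(len(sites) + 1):
        for group in combinations(sites, size):
            rest = [s for s in sites if s not in group]
            if group_votes(votes, group) >= vc and group_votes(votes, rest) >= va:
                return True
    return False


# Illustrative assignment: V = 4, with Va + Vc = V + 1.
votes = {"s1": 2, "s2": 1, "s3": 1}
print(quorums_are_valid(votes, vc=3, va=2))         # rule 5 holds
print(conflicting_split_exists(votes, vc=3, va=2))  # no conflicting split
```

With Vc = 2 and Va = 2 rule 5 is violated, and the split {s1} against {s2, s3} lets one group commit while the other aborts.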
A commitment protocol which implements this rule must guarantee that at one time a number
of sites such that the sum of their votes is at least Vc agree to commit. It means these
sites have entered a prepared-to-commit state. Therefore a quorum-based commitment
protocol can be obtained from the 3-phase-commitment protocol by adding the quorum
requirement.

Figure 4.3 Quorum based 3 phase commitment protocol

Termination and restart are more complex in this protocol. Once a site has participated in
building a commit (abort) quorum, it cannot participate in an abort (commit) quorum. Since a
site can fail after participating in building a quorum, its participation must be recorded in
stable storage.
A centralized termination protocol for the quorum-based 3-phase-commitment has the
following structure.
1. A new coordinator is elected.
2. The coordinator collects state information and acts according to the following rules:
a. If at least one site has committed (aborted), send a COMMIT (ABORT)
command to the other sites.
b. If the number of votes of sites that reached the prepared-to-commit state is
greater than or equal to Vc, send a COMMIT command.
c. If the number of votes of sites in the prepared-to-abort state reaches the abort
quorum Va, send an ABORT command.
d. If the number of votes of sites that reached the prepared-to-commit state plus
the number of votes of uncertain sites is greater than or equal to Vc, send a
PREPARE-TO-COMMIT command to uncertain sites, and wait for condition 2b to
occur.
e. If the number of votes of sites that reached the prepared-to-abort state plus the
number of votes of uncertain sites is greater than or equal to Va, send a
PREPARE-TO-ABORT command to uncertain sites, and wait for condition 2c to occur.
f. Otherwise, wait for the repair of some failure.
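Rules 2a to 2f amount to an ordered decision over the votes found in each state. The following sketch assumes the coordinator has already collected the states of the sites it can reach; the state strings, the function name `decide` and the vote values are illustrative assumptions, not part of the protocol definition above.

```python
def decide(states, votes, vc, va):
    """Return the command the new coordinator sends, following rules 2a-2f.

    `states` maps each reachable site to one of: "committed", "aborted",
    "prepared-to-commit", "prepared-to-abort", "uncertain".
    `votes` maps every site to its number of votes.
    """
    def votes_in(state):
        return sum(votes[s] for s, st in states.items() if st == state)

    # Rule 2a: a decision has already been taken somewhere.
    if "committed" in states.values():
        return "COMMIT"
    if "aborted" in states.values():
        return "ABORT"

    pc = votes_in("prepared-to-commit")
    pa = votes_in("prepared-to-abort")
    uncertain = votes_in("uncertain")

    if pc >= vc:                 # rule 2b
        return "COMMIT"
    if pa >= va:                 # rule 2c
        return "ABORT"
    if pc + uncertain >= vc:     # rule 2d: try to build a commit quorum
        return "PREPARE-TO-COMMIT"
    if pa + uncertain >= va:     # rule 2e: try to build an abort quorum
        return "PREPARE-TO-ABORT"
    return "WAIT"                # rule 2f: wait for the repair of some failure


# Illustrative network: V = 4, Vc = 3, Va = 2.
votes = {"s1": 2, "s2": 1, "s3": 1}
print(decide({"s1": "prepared-to-commit", "s2": "uncertain"}, votes, 3, 2))
print(decide({"s3": "uncertain"}, votes, 3, 2))
```

In the second call the isolated site s3 holds a single vote, which is below both quorums, so its group stays blocked until the failure is repaired.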

RELIABILITY AND CONCURRENCY CONTROL

This section addresses the problem which arises when a failure happens: we have to maximize
the number of transactions which are executed during the failure by the operational part of the
system.
Consider a transaction T having read-set RS(T) and write-set WS(T), and suppose that we
want to run T alone, so that no concurrency control is needed. In order to run T it is necessary
that at least one copy of each data item x belonging to RS(T) be available. If this elementary
necessary condition is not satisfied, T cannot be executed, because it lacks input data. The
availability of the data items of the write-set of T is not strictly required if we run T alone
during a failure, because a list of deferred updates can be produced which will be applied to
the database when the failure is repaired. Deferred updates can be implemented using a
“spooler” method.
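The availability condition for a transaction run alone can be written down directly. In this sketch the data layout (a mapping from each data item to the set of sites holding a copy) and the function names are assumptions made for illustration.

```python
def can_run_alone(read_set, copies, up_sites):
    """T can run if every item of RS(T) has at least one copy at an up site."""
    return all(copies[x] & up_sites for x in read_set)


def deferred_updates(write_set, copies, up_sites):
    """For each item of WS(T), list the down sites whose copy must be updated later."""
    pending = {}
    for x in write_set:
        down = copies[x] - up_sites
        if down:
            pending[x] = sorted(down)
    return pending


# Illustrative placement: x is stored at sites 1 and 2, y only at site 3.
copies = {"x": {1, 2}, "y": {3}}
print(can_run_alone({"x"}, copies, up_sites={1, 2}))
print(can_run_alone({"x", "y"}, copies, up_sites={1, 2}))
print(deferred_updates({"x", "y"}, copies, up_sites={1}))
```

The dictionary returned by `deferred_updates` plays the role of the spooled list: it records which copies still have to receive the writes once the failed sites recover.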
The availability of a system which allows only one transaction to be run during a failure is
not satisfactory; therefore, concurrency control must be taken into account. The strongest
limitations on the execution of transactions in the presence of failures are due to the need for
concurrency control.
A. NONREDUNDANT DATABASES

If the database is nonredundant, then it is very simple to determine which transactions can
be executed. Let us assume that 2-phase-locking is used for concurrency control. A transaction
tries to lock all data items of its read- and write-sets before commitment. As there is only one
copy of each data item, this copy is either available or not. If the unique copy of some data
item of the read- or write-set is not available, the transaction cannot commit and must
therefore be aborted.
If we assume that only site crashes occur but no partitions, then the availability of the
items which belong only to the write-set is not required, and it is possible to spool the
update messages for these items. All transactions which have their read-set available are
executed completely, including commitment; but the updates affecting sites which are down
are stored at spooler sites. When recovery happens, the restart procedures of the failed sites
will receive this list of deferred updates and execute them. We consider a crashed site as
exclusively locked: no other transaction can read the values of data items
which are stored there. In the case of partitions the deferred updates will cause inconsistent
results to be produced; the failure is catastrophic.
In conclusion, if the database is nonredundant, there is not very much to do in order to
increase its availability in the presence of failures. Therefore, most reliability techniques
consider the case of redundant databases.
B. REDUNDANT DATABASE

The reasons why redundancy is introduced in a distributed database are twofold:

1. To increase the locality of reads, especially in those applications where the
number of reads is much larger than the number of writes
2. To increase the availability and reliability of the system.

We deal here essentially with the second aspect; however, in designing reliable
concurrency control methods for replicated data the first goal also should be kept in mind.
There are three main approaches to concurrency control based on 2-phase-locking in a
redundant database: write-locks-all, majority locking, and primary copy locking.
I. WRITE-LOCKS-ALL

For transactions with a small write-set, and especially for read-only transactions, the
system is much more available than for transactions with a large write-set. Read-only
transactions can sometimes run in more than one group, because if a data item has two copies
in two different groups, then no update transaction can write on it and read-only
transactions can use each copy consistently.
If we make the assumption that no partitions occur, but only site crashes, then the
same approach can be used as with a nonredundant database, i.e., the updates of unavailable
copies of data items can be spooled. In this case, the availability of the database for update
transactions increases very much. In fact, since only the read-set matters in this case,
transactions 1, 4 and 7 have the same availability as transaction 10; transactions 2, 5 and 8 as
transaction 11; and transactions 3, 6 and 9 as transaction 12. So the example must be carefully
interpreted. The fact that a transaction can run in a given group now means that it can be run
if all other sites are down, instead of forming separate groups. The high increase in
availability is obtained at the risk of catastrophic partitions.
Requests to lock or unlock a data item and the messages of the 2-phase-commitment
protocol are required for the control of transactions. Control messages carry control
information and are short. Data messages contain database information and can be long. With
the write-locks-all approach, a copy of data item x at site i has the following benefit and cost:
1. Benefit - for each transaction executed at site i having x in its read-set, one lock message
and one data message are saved.
2. Cost - for each transaction which is not executed at site i and has x in its write-set, one
lock message and one data message are required, plus the messages required by the
commitment protocol.

Figure 4.4 Availability of Transaction

II. WEIGHTED MAJORITY LOCKING

The pure majority locking approach is not very suitable for our example, because two copies
of each data item exist; hence to lock a majority we must lock both. So consider a weighted
majority approach, or quorum approach, which adopts the same rules which have been used
for quorum-based commitment and termination protocols.
These rules, applied to the locking problem, consist of assigning to each data item x a total
number of votes V(x), and assigning votes V(xi) to each copy xi in such a way that V(x) is
the sum of all V(xi). A read quorum Vr(x) and a write quorum Vw(x) are then determined,
such that:

Vr(x) + Vw(x) > V(x)
Vw(x) > V(x)/2

A transaction can read (write) x if it obtains read (write) locks on so many copies of x
that the sum of their votes is greater than or equal to Vr(x) (Vw(x)). Due to the first condition,
all conflicts between read and write operations are determined, because two transactions
which perform a read and a write operation on x cannot reach the read and write quorum
using two disjoint subsets of copies. Likewise, because of the second condition, all conflicts
between write operations are determined. Notice that the second condition can be omitted if
transactions read all data items which are written.
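The two quorum conditions and the locking test can be expressed in a few lines. The sketch below uses illustrative vote values for a data item with two copies; the function names are hypothetical.

```python
def quorum_rules_hold(copy_votes, vr, vw):
    """Check Vr(x) + Vw(x) > V(x) and Vw(x) > V(x)/2 for one data item."""
    v = sum(copy_votes.values())
    return vr + vw > v and 2 * vw > v


def quorum_reached(copy_votes, locked_copies, quorum):
    """True if the votes of the locked copies add up to at least the quorum."""
    return sum(copy_votes[c] for c in locked_copies) >= quorum


# Illustrative item x with two copies: V(x) = 3, read and write quorum both 2.
x_votes = {"x1": 2, "x2": 1}
print(quorum_rules_hold(x_votes, vr=2, vw=2))
print(quorum_reached(x_votes, {"x1"}, 2))   # the heavy copy alone is enough
print(quorum_reached(x_votes, {"x2"}, 2))   # the light copy alone is not
```

Because any set of copies reaching the quorum must include x1, a reader and a writer of x always meet at that copy, which is how the conflict is determined.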
Let us assign votes to the copies of the data items of the figure in the following way:
V(x) = V(y) = V(z) = 3
V(x1) = V(y1) = V(z2) = 2
V(x2) = V(y3) = V(z3) = 1

With this assignment we can now consider the availability of the system in the case of
partitions. We choose the read and write quorums to be 2 for all data items. The availability
for the 12 transactions is shown in the figure. The following can be observed:
1. Transactions 1, 2, 3, 4, 7 and 10 all have the same availability. They are characterized by
the fact that they access all three data items either for reading or for writing or for both. Since
the read quorum is equal to the write quorum, it makes no difference whether the data item is
read or written from the viewpoint of availability. For the same reason, transactions 5, 6, 8 and
11, which access only data items x and y, have the same availability. Also, transactions 9 and
12 have the same availability as the latter group, because the copy with highest weight for y.
2. The availability for update transactions is greater with the weighted majority approach
than with write-locks-all, while the availability for read-only transactions is smaller.
3. With this method, read-only transactions increase their availability if they can read
an inconsistent database, i.e., if they do not need to lock items; in fact, columns 10’, 11, and
12 are the same for the majority approach as for the write-locks-all approach.

With the majority approach it is not reasonable to consider the assumption that
partitions do not occur. Notice that if we assume the absence of partitions, then the majority
approach is dominated by the write-locks-all approach (an approach is dominated by another
one if it is worse under all circumstances). In fact, we have seen that the majority and quorum
ideas have been developed essentially for dealing with partitions.
Consider now the locality aspect. A transaction reads a data item x at its site of origin if
a copy is locally available. Hence, also in this case a data message is saved if a local copy is
available. However, read locks must be obtained at a number of copies corresponding to the
read quorum. Therefore, the addition of a copy of x can also force transactions which read x
to request more read locks at remote sites. An additional cost is also incurred by transactions
which have x in their write-set, which must obtain write locks at a number of sites
corresponding to the write quorum. Moreover, they have to send a data message to all the
sites where there are copies of x.
It is clear that, considering only data messages, the same advantages and disadvantages
exist for the majority and the write-locks-all methods. When control messages are also
considered, the situation is more complex; however, some of the locality motivations for
read-only transactions are lost.
III. PRIMARY COPY LOCKING

In the primary copy locking approach, all locks for a data item x are requested at the
site of the primary copy. We will assume first that all the read and write operations are also
performed on this copy; however, writes are then propagated to all other copies.
Several enhancements of the primary copy approach exist which make it more attractive.
The principal ones are:
1. Allowing consistent reads at copies other than the primary, even if read locks are
requested only at the primary; this enhances the locality of reads.
2. Allowing the migration of the primary copy if a site crash makes it unavailable; this
enhances availability.
3. Allowing the migration of the primary copy depending on its usage pattern. This also
enhances the locality aspect.

The first point deserves a comment. In order to obtain consistent reads at different
copies from the primary one, we should use the primary copy method for synchronization,
but perform the write and read operations according to the “write all/read one” method. In
this approach, the locks are all requested at the primary copy. So, at commitment all copies
are updated before releasing the write lock. A read can be performed in this way at any copy,
obtaining consistent data.
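The combination of primary copy synchronization with “write all/read one” can be illustrated with a deliberately simplified, single-process model. The class and method names below are hypothetical, there is no real messaging or failure handling, and lock requests are refused rather than queued.

```python
class PrimaryCopyItem:
    """One replicated data item; all locks are managed at the primary copy."""

    def __init__(self, sites, primary, value):
        self.copies = {site: value for site in sites}
        self.primary = primary
        self.writer = None       # transaction holding the write lock
        self.readers = set()     # transactions holding read locks

    def lock_read(self, txn):
        """Request a read lock at the primary; refused while another txn writes."""
        if self.writer not in (None, txn):
            return False
        self.readers.add(txn)
        return True

    def lock_write(self, txn):
        """Request a write lock at the primary; refused if others hold any lock."""
        if self.writer not in (None, txn) or self.readers - {txn}:
            return False
        self.writer = txn
        return True

    def read(self, txn, site):
        """Read one copy (any site) once a lock is held at the primary."""
        if txn not in self.readers and self.writer != txn:
            raise PermissionError("no lock held at the primary copy")
        return self.copies[site]

    def commit_write(self, txn, value):
        """Update all copies, and only then release the write lock."""
        if self.writer != txn:
            raise PermissionError("write lock not held")
        for site in self.copies:
            self.copies[site] = value
        self.writer = None

    def unlock_read(self, txn):
        self.readers.discard(txn)


item = PrimaryCopyItem(sites=[1, 2, 3], primary=1, value=10)
item.lock_write("T1")
print(item.lock_read("T2"))    # refused while T1 holds the write lock
item.commit_write("T1", 20)
item.lock_read("T2")
print(item.read("T2", site=3)) # any copy now returns the committed value
```

Since the write lock is released only after every copy has been updated, a reader that obtains its lock at the primary sees the same value whichever copy it reads.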
DETERMINING A CONSISTENT VIEW OF THE NETWORK
There are two aspects of this problem: monitoring the state of the network, so that
state transitions of a site are discovered as soon as possible, and propagating new state
information to all sites consistently. Normally we use timeouts in the algorithms in order to
discover whether a site is down. The use of timeouts can lead to an inconsistent view of the
network. Consider the following example in a 3-site network: site 1 sends a message to site 2
requesting an answer. If no answer arrives before a given timeout, site 1 assumes that
site 2 is down. If site 2 was just slow, then site 1 has a wrong view of the state of site 2,
which is inconsistent with the view of site 2 about itself. Moreover, a third site, site 3, could try
the same operation at the same time as site 1, obtain an answer within the timeout, and assume
that site 2 is up. So it has a different view than site 1.
A generalized network-wide mechanism is built such that all higher-level programs
are provided with the following facilities:
1. There is at each site a state table containing an entry for each site. The entry can be up
or down. A program can send an inquiry to the state table for state information.
2. Any program can set a “watch” on any site, so that it receives an interrupt when the
site changes state.

The meaning of the state table and of a consistent view in the presence of partition
failures is defined as follows: a site considers up only those sites with which it can
communicate; all crashed sites and all sites which belong to a different group in case of
partitions are considered down. A consistent view can be achieved only between sites of the
same group. In case of a partition there are as many consistent views as there are isolated
groups of sites. The consistency requirement is therefore that a site has the same state table as
all other sites which are up in its state table.
Monitoring the State of the Network
The basic mechanism for deciding whether a site is up or down is to request a message from
it and to wait for a timeout. The requesting site is called the controller and the other site is
called the controlled site. In a generalized monitoring algorithm, instead of having the controller
request messages from the controlled site, it is more convenient to have the controlled site
send I-AM-UP messages periodically to the controller.
Note that if only site crashes are considered, the monitoring function essentially has to detect
transitions from up to down states, because the opposite transition is detected by the site
which performs recovery and restart; this site will inform all the others. If, however,
partitions are also considered, then the monitoring function also has to determine transitions
from down to up states. When a partition is repaired, sites of one group must detect that sites
of the other group have become available.
Using this mechanism for detecting whether a site is up or down, the problem consists of
assigning controllers to each site so that the overall message overhead is minimized and the
algorithm survives correctly the failure of a controller. The latter requirement is of extreme
importance, since in a distributed approach each site is controlled and at the same time
performs the function of controller of some other site.
A possible solution is to assign a circular ordering to the sites and to assign to each site the
function of controller of its predecessor. In the absence of failures, each site periodically
sends an I-AM-UP message to its successor and controls that the I-AM-UP message from its
predecessor arrives in time. If the I-AM-UP message from the predecessor does not arrive in
time, then the controller assumes that the controlled site has failed, updates the state table and
broadcasts the updated state table to all other sites.
If the predecessor of a site is down, then the site also has to control the predecessor of its
predecessor, and if this one is also down, the predecessor of that one, and so on backward in
the ordering until an up site is found (it may happen that no up site is found, because the site
is isolated or all other sites have crashed; this does not invalidate the
algorithm). In this way, each operational site always has a controller. For example, in the figure
site k controls site k-3, and it is responsible for discovering that sites k-1 and k-2 recover from
down to up. Symmetrically, if the successor of a site is down, then this site has as a controller the
first operational site following it in the ordering. For example, site k-3 has site k as controller.
Note that in the figure sites k-1 and k-2 are not necessarily crashed; they could belong to a
different group after a partition. The view of the network of sites k and k-3 is therefore not
necessarily the “real” state.
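The assignment of controllers in the circular ordering can be sketched as two small functions. Sites are numbered 0 to n-1 here, which is an assumption for illustration, as are the function names; the set `up` stands for the sites the caller currently believes to be up.

```python
def controlled_site(site, n_sites, up):
    """Nearest up predecessor of `site` in the ring, or None if there is none."""
    for step in range(1, n_sites):
        candidate = (site - step) % n_sites
        if candidate in up:
            return candidate
    return None  # the site is isolated or all other sites have crashed


def controller_of(site, n_sites, up):
    """Nearest up successor of `site` in the ring, or None if there is none."""
    for step in range(1, n_sites):
        candidate = (site + step) % n_sites
        if candidate in up:
            return candidate
    return None


# Six sites; sites 3 and 4 are down, so site k = 5 controls site k-3 = 2.
up = {0, 1, 2, 5}
print(controlled_site(5, 6, up))
print(controller_of(2, 6, up))
print(controlled_site(5, 6, {5}))
```

While it walks backward, site 5 also passes over sites 4 and 3, which corresponds to its responsibility for noticing when those sites come back up.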

Broadcasting a New State

Each time that the monitor function detects a state change, this function is activated. The
purpose of this function is to broadcast the new state table so that all sites of the same group
have the same state table. Since
this function could be activated by several sites in parallel, some mechanism is needed to
control interference. A possible mechanism is to attach a globally unique timestamp to each
new version of a state table. By including the version number of the current state table in the
I-AM-UP message, all sites in the same group can check that they have a consistent view.
The site which starts the propagation of a new state table first performs a synchronization
step in order to obtain a timestamp and then sends the state table to all sites which have
answered.
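One way to picture the timestamped broadcast is the following sketch. It assumes, purely for illustration, that a timestamp is a pair (counter, site identifier), that the synchronization step consists of taking the largest counter among the answering sites plus one, and that a site adopts a table only if its timestamp is larger than the one it holds. None of these details is prescribed by the text above.

```python
class Site:
    """Holds a state table together with the timestamp of its version."""

    def __init__(self, site_id, table):
        self.site_id = site_id
        self.version = (0, 0)        # (counter, originating site), assumed format
        self.table = dict(table)

    def receive(self, version, table):
        """Adopt the table only if its timestamp is newer than the current one."""
        if version > self.version:
            self.version = version
            self.table = dict(table)
            return True
        return False


def broadcast(initiator, new_table, answering_sites):
    """Synchronization step followed by propagation to the answering sites.

    `answering_sites` must include the initiator itself.
    """
    counter = max(s.version[0] for s in answering_sites) + 1
    version = (counter, initiator.site_id)   # unique because site ids are unique
    for s in answering_sites:
        s.receive(version, new_table)
    return version


all_up = {1: "up", 2: "up", 3: "up"}
s1, s2, s3 = Site(1, all_up), Site(2, all_up), Site(3, all_up)
v = broadcast(s1, {1: "up", 2: "up", 3: "down"}, [s1, s2])
print(v, s2.table[3])
```

If two sites start a broadcast with the same counter, the site identifier breaks the tie, so every site in the group ends up with the same table regardless of the order in which the messages arrive.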
DETECTION AND RESOLUTION OF INCONSISTENCY
When a partition of the network occurs, transactions should be run in at most one
group of sites if we want to preserve strictly the consistency of the database. In some
applications it is acceptable to lose consistency in order to achieve more availability. In this
case, transactions are allowed to run in all partitions where there is at least one copy of the
necessary data. When the failure is repaired, one can try to eliminate the inconsistencies
which have been introduced into the database. For this purpose it is necessary first to discover
which portions of the data have become inconsistent, and then to assign to these portions a
value which is the most reasonable in consideration of what has happened. The first problem
is called the detection problem. The second is called the resolution of the
inconsistencies. While exact solutions can be found for the detection problem, the resolution
problem has no general solution, because transactions have not been executed in a serializable
way. Therefore the word “reasonable” and not the word “correct” is used for the value which is
assigned by the resolution procedure.
DETECTION OF INCONSISTENCIES
Let us assume that, during a partition, transactions have been executed in two or more
groups of sites, and that independent updates may have been performed on different copies of
the same fragment. Let us first observe that the most naïve solution, consisting of comparing
the contents of the copies to check that they are identical, is not only inefficient, but also not
correct in general. For example, consider an airline reservation system. If, during the partition,
reservations for the same flight are made independently on different copies until the maximum
number is reached, then all copies might have the same value for the number of reservations;
however, the flight would be overbooked in this case.
A correct approach to the detection of inconsistencies can be based on version numbers.
Assume that one of the approaches described above is used for determining, for each data item,
the one group of sites which is allowed to operate on it. The copies of the data item which are
stored at the sites of this group are called master copies; the others are called isolated copies.
During normal operation all copies are master copies and are mutually consistent. For
each copy an original version number and a current version number are maintained.
Initially the original version number is set to 0, and the current version number is set to 1;
only the current version number is incremented each time that an update is performed on the
copy. When a partition occurs, the original version number of each isolated copy is set to the
value of its current version number. In this way, the original version number is not altered
until the partition is repaired. At this time, the comparison of the current and original version
numbers of all copies reveals inconsistencies.
Let us consider an example of this method. Assume that copies x1, x2 and x3 of data
item x are stored at three different sites. Let V1, V2 and V3 be the version numbers of x1, x2
and x3. Each Vi is in fact a pair with the original and current version number. Initially all
three copies are consistently updated. Suppose that one update has been performed, so that
the situation is
V1=(0,2), V2=(0,2), V3=(0,2)
Now a partition occurs separating x3 from the other two copies. A majority algorithm
is used which chooses x1 and x2 as master copies. The version numbers now become
V1=(0,2), V2=(0,2), V3=(2,2)
Suppose now that only the master copies are updated during the partition. The
version numbers become
V1=(0,3), V2=(0,3), V3=(2,2)
After the repair it is possible to see that x3 has not been modified, since its
current and original version numbers are equal. In this case, no inconsistency has occurred
and it is sufficient to apply to x3 the updates performed on the master copies during the
partition. Suppose instead that only the isolated copy x3 has been updated during the
partition. We have
V1=(0,2), V2=(0,2), V3=(2,3)
Since the original version number of x3 is equal to the current version number of x1
and x2, the master copies have not been updated. If there are no other copies, then we can
simply apply to the master copies the updates of x3, since the situation is exactly symmetrical
to the previous one. If there are other isolated copies, for example x4 with V4=(2,3), we
cannot tell whether x4 was updated consistently with x3 even if the version numbers are the
same; hence we have to assume inconsistency.
Finally, if both the master and the isolated copies have been updated, then the original and
the current version number of the isolated copy are different, and the original version number
of the isolated copy is also different from the current version number of the master copies.
This reveals an inconsistency; for example
V1=(0,3), V2=(0,3), V3=(2,3)
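The comparison of version numbers after the repair can be summarized in one function. The sketch below represents each copy by its pair (original, current); the function name and the returned labels are hypothetical. It also assumes that an isolated copy which was not updated does not by itself create an inconsistency, a case the example above does not spell out.

```python
def detect(masters, isolated):
    """Classify the situation after a partition is repaired.

    `masters` and `isolated` are lists of (original, current) version pairs.
    Master copies are mutually consistent, so the first one is representative.
    """
    if not isolated:
        return "consistent"
    master_current = masters[0][1]
    partition_version = isolated[0][0]   # current version when the partition began
    master_updated = master_current != partition_version
    updated_isolated = [v for v in isolated if v[0] != v[1]]

    if not master_updated and not updated_isolated:
        return "consistent"
    if master_updated and not updated_isolated:
        return "apply master updates to isolated copies"
    if not master_updated and len(updated_isolated) == 1:
        return "apply isolated updates to master copies"
    # Both sides updated, or several isolated copies updated independently.
    return "inconsistent"


print(detect([(0, 3), (0, 3)], [(2, 2)]))
print(detect([(0, 2), (0, 2)], [(2, 3)]))
print(detect([(0, 2), (0, 2)], [(2, 3), (2, 3)]))
print(detect([(0, 3), (0, 3)], [(2, 3)]))
```

The four calls reproduce the four situations of the example: only the master copies updated, only x3 updated, x3 and x4 both updated, and both sides updated.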
RESOLUTION OF INCONSISTENCIES
After a partition has been repaired and an inconsistency has been detected, a common
value must be assigned to all copies of the same data item. The problem of resolution of
inconsistencies is the determination of this value.
Since transactions have been executed in the different groups without mutual
synchronization, it seems correct to assign as a common value the one which would be
produced by some serializable execution of these same transactions. However, besides the
difficulty of obtaining this new value, this is not a satisfactory solution, because the
transactions which have been executed have produced effects outside of the system which
cannot be undone and cannot be simply ignored.
Note that the transactions requiring the high degree of availability which motivates the
acceptance of inconsistencies are exactly those which produce effects outside of the system.
For example, take the airline reservation example considered before. The reason for running
transactions while the system is partitioned is to tell the customers that flights are available;
otherwise, it would be simpler to collect the customer requests and to apply them to the
database after the failure has been repaired.
However, if overbooking has occurred during the partition, then forcing a
serializable execution would force the system to perform arbitrary cancellations. From the
viewpoint of the application, it might be better to keep the overbookings and let normal user
cancellations reduce the number of reservations. A possible way of reducing or eliminating
overbooking due to partitions is to assign to each site a number of reservations which is
smaller than the total number. This number could be proportional to the size of each group or
to some other application-dependent value.
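As a small illustration of the quota idea, the free seats can be divided among the groups in proportion to their size, rounding down so that the quotas never add up to more than the total. The function name and the numbers are assumptions made for the example.

```python
def seat_quotas(free_seats, group_sizes):
    """Split the free seats among groups in proportion to their number of sites."""
    total_sites = sum(group_sizes.values())
    return {group: free_seats * size // total_sites
            for group, size in group_sizes.items()}


# Ten free seats, a partition into a group of two sites and a group of one site.
quotas = seat_quotas(10, {"A": 2, "B": 1})
print(quotas, sum(quotas.values()) <= 10)
```

Rounding down can leave a few seats unassigned during the partition, which is the price paid for ruling out overbooking.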
The above example shows that the resolution of inconsistencies is, in general,
application-dependent, and hence not within the scope of this book, which deals with
generalized mechanisms.
CHECKPOINTS AND RESTART
There are two types of errors: omission errors and commission errors. Omission
errors occur when an action (commit/abort) is left out of the transactions being executed.
Commission errors occur when an action (commit/abort) is incorrectly included in the
transactions executed. An error of omission in one transaction will be counted as an error of
commission in another transaction.
Cold restart is required after some catastrophic failure which has caused the loss of
log information on stable storage, so that the current consistent state of the database cannot be
reconstructed and a previous consistent state must be restored. A previous consistent state is
marked by a checkpoint.
In a distributed database, the problem of cold restart is worse than in a centralized
one; this is because if one site has to establish an earlier state then all other sites also have to
establish earlier states which are consistent with the one of the site, so that the global state of
the distributed database as a whole is consistent. This means that the recovery process is
intrinsically global, affecting all sites of the database, although the failure which caused the
cold restart is typically local.
A consistent global state C is characterized by the following two properties:
1. For each transaction T, C contains the updates performed by all subtransactions of T
at any site, or it does not contain any of them; in the former case we say that T is
contained in C.
2. If a transaction T is contained in C, then all conflicting transactions which have
preceded T in the serialization order are also contained in C.

Property 1 is related to the atomicity of transactions: either all effects of T or none
of them can appear in a consistent state. Property 2 is related to the serializability of
transactions: if a conflicting transaction T’ has preceded T, then the updates performed by T’
have affected the execution of T; hence, if we keep the effects of T, we must also keep all
the effects of T’. Note that the durability of transactions cannot be ensured if we are forced to a
cold restart; the effect of some transactions is lost.
The simplest way to reconstruct a global consistent state in a distributed database is to
use local dumps, local logs, and global checkpoints. A global checkpoint is a set of local
checkpoints which are performed at all sites of the network and are synchronized by the
following condition: if a subtransaction of a transaction T is contained in the local checkpoint
at some site, then all other subtransactions of T must be contained in the corresponding local
checkpoint at other sites.
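The synchronization condition for a global checkpoint can be checked mechanically. In the sketch below each local checkpoint is modeled as the set of transactions whose local subtransaction it contains; this representation and the function name are assumptions made for illustration.

```python
def is_global_checkpoint(local_checkpoints, subtransaction_sites):
    """Check the all-or-nothing condition on subtransactions.

    `local_checkpoints` maps each site to the set of transactions whose
    subtransaction at that site is contained in the local checkpoint.
    `subtransaction_sites` maps each transaction to the set of sites
    where it has a subtransaction.
    """
    for txn, sites in subtransaction_sites.items():
        contained_at = {s for s in sites if txn in local_checkpoints[s]}
        # Either every subtransaction is contained, or none of them is.
        if contained_at and contained_at != set(sites):
            return False
    return True


# T has subtransactions at sites 2 and 3.
where = {"T": {2, 3}}
print(is_global_checkpoint({2: set(), 3: {"T"}}, where))   # only one side recorded T
print(is_global_checkpoint({2: {"T"}, 3: {"T"}}, where))   # both sides recorded T
```

The first call models a pair of local checkpoints in which only one of the two subtransactions of T has been recorded, so the pair does not form a global checkpoint.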
If global checkpoints are available, the reconstruction problem is relatively easy. First,
at the failed site the latest local checkpoint which can be considered safe is determined; this
determines which earlier global state has to be reconstructed. Then all other sites are required
to reestablish the local states of the corresponding local checkpoints.
The main problem with the above approach consists in recording global checkpoints.
It is not sufficient for one site to broadcast a “write checkpoints” message to all other sites,
because it is possible that the situation shown in the figure arises; in this situation, T2 and T3
are subtransactions of the same transaction T, and the local checkpoint C2 does not contain
subtransaction T2, while the local checkpoint C3 contains subtransaction T3, thus violating
the basic requirement for global checkpoints. The figure also shows that the fact that T
performs a 2-phase-commitment does not eliminate this problem, because the
synchronization of subtransactions during 2-phase-commitment and of sites during recording
of the global checkpoint is independent.
The simplest way to avoid the above problem is to require that all sites become
inactive before each one records its local checkpoint. Note that all sites must remain
inactive simultaneously, and therefore coordination is required. A protocol which is very
similar to 2-phase-commitment can be used for this purpose: a coordinator broadcasts
“prepare for checkpoint” to all sites, each site terminates the execution of subtransactions and
then answers READY, and then the coordinator broadcasts “perform checkpoint”. This type
of method is unacceptable in practice because of the inactivity which is required of all the sites.
A site has to remain inactive not only for the time required to record its checkpoints, but until
all other sites have finished their active transactions. Three more efficient solutions are
possible:
1. To find less expensive ways to record global checkpoints, so-called loosely
synchronized checkpoints. All sites are asked by a coordinator to record a global
checkpoint; however, they are free to perform it within a large time interval. The
responsibility of guaranteeing that all subtransactions of the same transaction are
contained in the local checkpoints corresponding to the same global checkpoint is left
to transaction management. If the root agent of transaction T starts after checkpoint
Ci and before checkpoint Ci+1, then each other subtransaction at a different site can be
started only after Ci has been recorded at its site and before Ci+1 has been recorded.
Observing the first condition may force a subtransaction to wait; observing the second
condition can cause transaction aborts and restarts.
2. To avoid building global checkpoints at all, let the recovery procedure take the
responsibility of reconstructing a consistent global state at cold restart. With this
approach, the notion of global checkpoint is abandoned. Each site records its local
checkpoints independently from other sites, and the whole effort of building a
consistent global state is therefore performed by the cold restart procedure.
3. To use the 2-phase-commitment protocol for guaranteeing that the local checkpoints
created by each site are ordered in a globally uniform way. The basic idea is to
modify the 2-phase-commitment protocol so that the checkpoints of all subtransactions which
belong to two distributed transactions T and T’ are recorded in the same order at all
sites where both transactions are executed. Let Ti and Tj be subtransactions of T, and
let T’i and T’j be subtransactions of T’. If at site i the checkpoint of subtransaction Ti
precedes the checkpoint of subtransaction T’i, then at site j the checkpoint of Tj should
precede the checkpoint of subtransaction T’j.
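The start rule of the first alternative (loosely synchronized checkpoints) can be sketched as a small decision function. This is only an illustration under the assumption that checkpoints are numbered with increasing integers; the function name and return values are hypothetical.

```python
def start_decision(root_checkpoint, site_checkpoint):
    """Decide what a remote subtransaction may do at its site.

    root_checkpoint: index i such that the root agent started after
        checkpoint C_i and before C_(i+1).
    site_checkpoint: index of the latest checkpoint already recorded
        at the site where the subtransaction wants to start.
    """
    if site_checkpoint < root_checkpoint:
        # C_i has not been recorded here yet: the subtransaction waits.
        return "wait"
    if site_checkpoint > root_checkpoint:
        # C_(i+1) has already been recorded here: the transaction
        # must be aborted and restarted.
        return "abort"
    return "start"


print(start_decision(3, 2))  # wait
print(start_decision(3, 3))  # start
print(start_decision(3, 4))  # abort
```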

DISTRIBUTED DATABASE ADMINISTRATION

Database administration refers to a variety of activities for the development, control,
maintenance, and testing of the software of the database application. Database administration
is not only a technical problem, since it involves the statement of policies under which users
can access the database, which is clearly also an organizational problem.
The technical aspects of database administration in a distributed environment focus on the
following problems:
1. The content and management of the catalogs. With this name, we designate the
information which is required by the system for accessing the database. In distributed
systems, catalogs include the description of fragmentation and allocation of data and the
mapping to local names.
2. The extension of protection and authorization mechanisms to distributed systems.

CATALOG MANAGEMENT IN DISTRIBUTED DATABASES

Catalogs of distributed databases store all the information which is useful to the
system for accessing data correctly and efficiently and for verifying that users have the
appropriate access rights to them.
Catalogs are used for:
1. Translating applications - Data referenced by applications at different levels of
transparency are mapped to physical data (physical images in our reference architecture).
2. Optimizing applications - Data allocation, access methods available at each site, and
statistical information (recorded in the catalogs) are required for producing an access plan.
3. Executing applications - Catalog information is used to verify that access plans are
valid and that the users have the appropriate access rights.

Catalogs are usually updated when the users modify the data definitions. This happens
when global relations, fragments, or images are created or moved, local access structures are
modified, or authorization rules are changed.
I. CONTENT OF CATALOGS

Several classifications of the information which is typically stored in distributed database
catalogs are possible.
1. Global schema description - It includes the names of global relations and of attributes.
2. Fragmentation description - In horizontal fragmentation, it includes the qualification
of fragments. In vertical fragmentation, it includes the attributes which belong to each
fragment. In mixed fragmentation, it includes both the fragmentation tree and the description
of the fragmentation corresponding to each nonleaf node of the tree.
3. Allocation description - It gives the mapping between fragments and physical
images.
4. Mappings to local names - It is used for binding the names of physical images to the
names of local data stored at each site.
5. Access method description - It describes the access methods which are locally
available at each site. For instance, in the case of a relational system, it includes the
number and types of indexes available.
6. Statistics on the database - They include the profiles of the database.
7. Consistency information (protection and integrity constraints) - It includes
information about the users' authorization to access the database, or integrity
constraints on the allowed values of data.
Examples of authorization rules are:
a. Assessing the rights of users to perform specific actions on data. The typical
actions considered are: read, insert, delete, update, move.
b. Giving to users the possibility of granting to other users the above rights.
Some references in the literature also include in the catalog content state
information (such as locking or recovery information); it seems more
appropriate to consider this information as part of a system's data structure and
not of the catalog's content.

II. THE DISTRIBUTION OF CATALOGS

When catalogs are used for the translation, optimization, and execution of
applications, their information is only retrieved. When they are used in conjunction with a
change in data definitions, they are updated. In a few systems, statistics are updated after each
execution, but typically updates to statistics are batched. In general, retrieval usage is
quantitatively the most important, and therefore the ratio between updates and queries is
small.
Solutions given to catalog management with and without site autonomy are very
different. Catalogs can be allocated in the distributed database in many different ways. The
three basic alternatives are:
Centralized catalogs
The complete catalog is stored at one site. This solution has obvious
limitations, such as the loss of locality of applications which are not at the central site
and the loss of availability of the system, which depends on this single central site.
Fully replicated catalogs
Catalogs are replicated at each site. This solution makes the read-only use of
the catalog local to each site, but increases the complexity of modifying catalogs,
since this requires updating catalogs at all sites.
Local catalogs
Catalogs are fragmented and allocated in such a way that they are stored at the
same site as the data to which they refer.
A practical solution which is used in several systems consists of periodically caching
catalog information which is not locally stored. This solution differs from having totally
replicated catalogs, because cached information is not kept up-to-date.
If an application is translated and optimized with a different catalog version than the
up-to-date one, this is revealed by the difference in the version numbers. This difference can
be observed either at the end of compilation, when the access plan is transmitted to remote
sites, or at execution time.
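The version-number check can be sketched as follows. This is a minimal illustration, assuming that each catalog entry carries an integer version and that an access plan remembers the versions it was compiled against; the dictionaries and the function name are hypothetical.

```python
def stale_catalog_entries(plan_versions, current_versions):
    """Return the objects whose catalog version differs from the one
    the access plan was compiled with (hypothetical data layout)."""
    return sorted(
        name
        for name, version in plan_versions.items()
        if current_versions.get(name) != version
    )


# Versions seen in the cached catalog at compilation time.
plan_versions = {"EMP": 3, "DEPT": 7}
# Up-to-date versions held at the sites that store the objects.
current_versions = {"EMP": 4, "DEPT": 7}

stale = stale_catalog_entries(plan_versions, current_versions)
print(stale)  # ['EMP']
```

A non-empty result would mean that the plan has to be retranslated and reoptimized with fresh catalog information before it is executed.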
In the design of catalogs for Distributed-INGRES, five alternatives were considered:
1. The centralized approach
2. The full replication of items 1, 2, and 3 of catalog content and the local
allocation of remaining items
3. The full replication of items 1, 2, 3, 4, and 5 of catalog content and the local
allocation of remaining items
4. The full replication of all items of catalog content
5. The local allocation of all items with remote "caching"

SDD-1 considers catalog information as ordinary user data; therefore an arbitrary
level of redundancy is supported. Security, concurrency, and recovery mechanisms of the
system are also used for catalog management.

III. OBJECT NAMING AND CATALOG MANAGEMENT WITH SITE AUTONOMY

We now turn our attention to the different problems which arise when site autonomy
is required. The major requirement is to allow each local user to create and name his or her
local data independently from any global control, at the same time allowing several users to
share data.
1. Data definition should be performed locally.
2. Different users should be able, independently, to give the same name to
different data.
3. Different users at different sites should be able to reference the same data.

In the solution given to these problems in the R* prototype, two types of names are used:
1. System wide names are unique names given to each object in the system.
They have four components:
a. The identifier of the user who creates the object
b. The site of that user
c. The object name
d. The birth site of the object, i.e., the site at which the object was
created.

An example of a systemwide name is

User_1@San_Jose.EMP@Zurich
where the symbol @ is a separator which precedes site names.
Here, User_1 from San Jose has created a global relation EMP at Zurich. The same
user name at different sites corresponds to different users (i.e., John@SF is not the
same as John@LA). This allows creating user names independently.
2. Print names are shorthand names for systemwide names. Since in systemwide
names the a, b, and d parts can be omitted, name resolution is made by context, where
a context is defined as the current user at the local site.

a. A missing user identifier is replaced by the identifier of the current
user.
b. A missing user site or object site is replaced by the current site.
It is also possible for each user to define synonyms, which map simple names to
systemwide names. Synonyms are created for a specific user at a specific site. The
mapping of a simple name to a systemwide name is attempted before name resolution.
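The resolution rules for print names can be sketched in a few lines. This is a minimal illustration and not the R* implementation: the class and function names are hypothetical, the textual layout "user@user_site.object@birth_site" is taken from the example above, and site names are assumed not to contain a period.

```python
from typing import NamedTuple


class SystemwideName(NamedTuple):
    user: str        # a. identifier of the creating user
    user_site: str   # b. site of that user
    obj: str         # c. object name
    birth_site: str  # d. site at which the object was created

    def __str__(self):
        return f"{self.user}@{self.user_site}.{self.obj}@{self.birth_site}"


def resolve(print_name, current_user, current_site, synonyms=None):
    """Expand a print name into a systemwide name using the context."""
    # Synonyms belong to one user at one site and are tried first.
    key = (current_user, current_site, print_name)
    if synonyms and key in synonyms:
        return synonyms[key]
    # Split "user@user_site.object@birth_site"; every part except the
    # object name may be missing.
    creator, _, rest = print_name.rpartition(".")
    obj, _, birth_site = rest.partition("@")
    user, _, user_site = creator.partition("@")
    return SystemwideName(
        user or current_user,
        user_site or current_site,
        obj,
        birth_site or current_site,
    )


print(resolve("EMP@Zurich", "User_1", "San_Jose"))
# User_1@San_Jose.EMP@Zurich
```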
Catalog management in R* satisfies the following requirements:
1. Global replication of a catalog is unacceptable, since this would violate the
possibility of autonomous data definition.
2. No site should be required to maintain catalog information of objects which
are not stored or created there.
3. The name resolution should not require a random search of catalog entries in
the network.
4. Migration of objects should be supported without requiring any change in
programs.

The above requirements are met by storing catalog entries of each object as follows:
1. One entry is stored at the birth site of the object, until the object is destroyed.
If the object is still stored at its birth site, the catalog contains all the
information; otherwise, it indicates the sites at which there are copies of the
object.
2. One entry is stored at every site where there is a copy of the object.
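Because the birth site is part of the systemwide name, an object can be located without searching the network. The sketch below illustrates this with a hypothetical catalog layout (one dictionary per site, with invented field names); it is an assumption made for illustration, not the R* data structure.

```python
def locate(obj_name, birth_site, catalogs):
    """Return the sites that hold the object, starting at its birth site.

    catalogs: dict mapping a site to its local catalog, itself a dict
        from object name to an entry with the hypothetical fields
        "stored_here" (bool) and "copies_at" (list of sites).
    """
    entry = catalogs[birth_site][obj_name]
    if entry["stored_here"]:
        # The birth site still stores the object and has the full entry.
        return [birth_site]
    # The object migrated: the birth site only points to the copies.
    return list(entry["copies_at"])


catalogs = {
    "Zurich": {"EMP": {"stored_here": False, "copies_at": ["Paris"]}},
    "Paris": {"EMP": {"stored_here": True, "copies_at": []}},
}
print(locate("EMP", "Zurich", catalogs))  # ['Paris']
```

Since programs refer to the object by its systemwide name, the migration from Zurich to Paris in this example requires no change in them.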

The catalog content in R* includes relation names, column names and types,
authorization rules, low-level objects' names, available access paths, and profiles. R*
supports the "caching" of catalogs, using version numbers to verify the validity of cached
information.
AUTHORIZATION AND PROTECTION

I. Site-to-Site Protection

The first security problem which arises in a distributed database is initiating and
protecting intersite communication. When two database sites communicate, it is important to
make sure that:
1. At the other side of the communication line is the intended site (and not an
intruder).
2. No intruder can either read or manipulate the messages which are exchanged
between the sites.

The first requirement can be accomplished by establishing an identification protocol
between remote sites. When two remote databases communicate with each other, on the first
request they also send each other a password. When two sites decide to share some data they
follow the R* mechanism.
The second requirement is to protect the content of transmitted messages once the two
identified sites start to communicate. Messages in a computer network are typically routed
along paths which involve several intermediate nodes and transmissions, with intermediate
buffering.
The best solution to this problem consists of using cryptography, a standard technique
commonly used in distributed information systems, for instance for protecting
communications between terminals and processing units. Messages ("plaintext") are initially
encoded into cipher messages ("ciphertext") at the sender site, then transmitted in the
network, and finally decoded at the receiver site.

II. User Identification

When a user connects to the database system, they must be identified by the
system. The identification is a crucial aspect of preserving security, because if an intruder
could pretend to be a valid user, then security would be violated.
In a distributed database, users could identify themselves at any site of the distributed
database. However, this feature can be implemented in two ways which both show negative
aspects.
1. Passwords could be replicated at all the sites of the distributed database. This
would allow user identification to be performed locally at each site, but would
also compromise the security of passwords, since it would be easier for an
intruder to access them.
2. Users could each have a "home" site where their identification is performed; in
this scenario, a user connecting to a different site would be identified by
sending a request to the home site and letting this site perform the
identification.

A reasonable solution is to restrict each user to identifying themselves at the home
site. This solution is consistent with the idea that users seem to be more "static" than, for
instance, data or programs. A "pass-through" facility could be used to allow users at remote
sites to connect their terminals to their "home" sites in order to identify themselves.

III. Enforcing Authorization Rules

Once users are properly identified, database systems can use authorization rules to
regulate the actions performed upon database objects by them. In a distributed environment,
additional problems include the allocation of these rules, which are part of the catalog, and
the distribution of the mechanisms used for enforcing them. Two alternative
solutions are possible:
1. Full replication of authorization rules. This solution is consistent with having fully
replicated catalogs, and requires mechanisms for distributing online updates to
them. However, this solution allows authorization to be checked either at the beginning
of compilation or at the beginning of execution.
2. Allocation of authorization rules at the same sites as the objects to which they
refer. This solution is consistent with local catalogs and does not incur the update
overhead as in the first case.
The second solution is consistent with site autonomy, while the first is consistent with
considering a distributed database as a single system.
The authorizations that can be given to users of a centralized database include the
abilities of reading, inserting, creating, and deleting object instances (tuples) and of creating
and deleting objects (relations or fragments).

IV. Classes of Users

For simplifying the mechanisms which deal with authorization and the amount of
stored information, individual users are grouped into classes, which are all granted the same
privileges.
In distributed databases, the following considerations apply to classes of users:
1. A "natural" classification of users is the one which is induced by the distribution
of the database to different sites. It is likely that "all users at site x" have some
common properties from the viewpoint of authorization. An explicit naming
mechanism for this class should be provided.
2. Several interesting problems arise when groups of users include users from
multiple sites. Problems are particularly complex when multiple-site user groups
are considered in the context of site autonomy. Mechanisms for managing such groups
involve the consensus of the majority or of the totality of involved sites, or a decision
made by a higher-level administrator. Multiple-site user groups therefore contrast with
pure site autonomy.
SCHOOL OF COMPUTING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

UNIT – V – DISTRIBUTED DATABASE – SCS1613

UNIT 5
DATABASE INTEGRATION AND MANAGEMENT
Database Integration- Scheme Translation- Scheme Integration- Query Processing- Query
Processing Layers in Distributed Multi-DBMSs- Query Optimization Issues- Transaction
Management- Transaction and Computation Model- Multidatabase Concurrency Control-
Multidatabase Recovery- Object Orientation And Interoperability- Object Management
Architecture- Distributed Component Model.
DATABASE INTEGRATION
Database integration means that multiple different applications have their data stored in a
specific database – the integration database – so that data is available across all of these
different applications. In other words, the data is available between two different parties and
therefore, can be easily accessed and implemented into a different application without having
to transfer to a different database.
For the database integration to work successfully, it needs to have a plan that allows for all of
the client applications to be taken into account. Whether the scheme is more complex, general
or both is irrelevant because a separate group controls the database to negotiate between the
numerous different applications and the database group. In other words, this plan makes it
possible for all the applications to be grouped together into that one database group.

Figure 5.1 Database Integration

NECESSITY OF DATABASE INTEGRATION

The fundamental reason that database integration is necessary is that it allows data to
be shared throughout an organization without the need for another set of integration
services on each application. It would be a tremendous waste of resources if each application
needed something to convert the data into a form it can read. With database integration, the
information is integrated automatically, so that whenever the data is needed
it can be pulled up and accessed.
On top of that, it helps when two companies that are merging have their data integrated,
because when their databases come together, the data can mesh easily. If the data were not
integrated, a server manager would have to go in and integrate everything manually, which
can become a hassle and, as previously mentioned, result in a waste of resources. Therefore,
integrating before a merger is ideal.
Database integration can also be used in the scientific community.
When collecting data, a scientist might use one application for one piece of data and then
go to a different application for another piece of data. With database integration, the data
becomes readily available across applications without any wasted time.
This results in more successful experiments.
All in all, database integration is becoming a technology that more companies are investing
in, especially as the quantity and connectivity of data increase. As people need to access
more data and share data between departments, companies have realized that having all the data
integrated in a database is an incredible time saver.

DATABASE INTEGRATION = TRANSLATION AND SCHEMA INTEGRATION

• Database integration is done in most cases in two steps: schema translation (or
simply translation) and schema integration.
• Scheme translation means the translation of the participating local schemes into
a common intermediate canonical representation.
e.g. if a network model and a relational model are used, then an
intermediate data model should be chosen; if it is the relational
one, the database scheme formulated in the network model is
translated into a scheme based on the relational model.
• Scheme translation is of course only necessary if different data models are
involved.
• Scheme integration integrates the intermediate schemes into a global
conceptual scheme.
• All intermediate schemes are based on the same data model, the so-called target model,
which is of course the data model for the global conceptual scheme.

THE EXAMPLE FOR THE TRANSLATION AND THE SCHEMA INTEGRATION

• We consider the following three local schemas. The first one is based on the relational
model, the second one on the network model (the CODASYL network) and the third
one on the entity-relationship data model.
• First scheme, the Relational Engineering Database Representation :

E(ENO, ENAME, TITLE)                 Engineer description
J(JNO, CNAME, JNAME, BUDGET, LOC)    Job description
G(ENO, JNO, RESP, DUR)               Engineer-to-job relationship description
S(TITLE, SALARY)                     Salary description

• Second scheme : the CODASYL Network Definition of the Employee Database :

Two records, DEPARTMENT and EMPLOYEE, with their attributes (DEP-NAME
and so on). One link between the records, drawn with → and named employs, links the
two corresponding records. It can model only one-to-many relationships. The schema
representation looks like :
DEPARTMENT : DEP-NAME BUDGET MANAGER →(employs)
EMPLOYEE : E# NAME ADDRESS TITLE SALARY

SCHEMA TRANSLATION :

• Schema translation is the task of mapping one schema to another.
• It requires the specification of the target data model for the global conceptual schema
definition.
• Some rare approaches merge the translation and integration phases, but this increases
the complexity of the whole process.
• In the example, the Entity-Relationship model is chosen as the target model.
• The first scheme translation is from the CODASYL network schema to an E-R
scheme.

SCHEME TRANSLATION 1 : CODASYL SCHEMA TO E-R SCHEMA

• One entity is created for each record. Thus, one EMPLOYEE entity and one DEPARTMENT
entity are created.
• The attributes of the records are taken directly into the E-R scheme.
• Finally, the link employs becomes a many-to-one relationship from the EMPLOYEE
entity to the DEPARTMENT entity. The final model looks like :

Figure 5.2 Schema Translation 1

Remark : Many-to-many relationships modelled in the network by some intermediate
records can be represented directly by one many-to-many relationship (→ translation
should be optimized).

SCHEME TRANSLATION 2 : RELATIONAL SCHEMA TO E-R SCHEMA

• The example relational model of the engineering database consists of four relations :

E(ENO, ENAME, TITLE)                 Engineer description
J(JNO, CNAME, JNAME, BUDGET, LOC)    Job description
G(ENO, JNO, RESP, DUR)               Engineer-to-job relationship description
S(TITLE, SALARY)                     Salary description

• Identify the base relations : E and J clearly correspond to entities.
• Identify the relationships : G corresponds to a relationship; ENO and JNO are foreign
keys, thus a relationship between J and E can be identified.
• Handling of S is difficult.
• First, it can be an entity. In such a case a relationship between S and E must be
established (this would be a many-to-one relationship, e.g. pay between S and E). No
relation is specified for this relationship.
An employee could have only one salary, but a salary can belong to many employees.
• Second, salary could be an attribute of E. This is cleaner, but the relationship between the
title and the salary is lost.
• See below the resulting E-R scheme, with SAL as an attribute of E.

Figure 5.3 Schema Translation 2

SCHEME INTEGRATION :

• All local schemes are now translated to an intermediate scheme based on the target
model. The task of the schema integration is now to generate the global conceptual
schema (GCS), which can be queried by the user of the MDBMS.
• Ozsu defines schema integration as the process of identifying the components of
a database which are related to one another, selecting the best representation for the
global conceptual schema and finally integrating the components of each
intermediate schema.
• Integration methodologies are either binary or n-ary.
• Binary integration methodologies involve the manipulation of two schemas at a time.
These occur either stepwise (ladder, a linear tree) or purely binary (a bushy tree).
• N-ary methodologies are either one-shot (integration of all schemas at once) or iterative
(integration of 2, 3, 4, ... schemas at a time). Binary approaches are a special case of the latter.
In general, the one-shot approach is very complex and rarely used; mostly the binary
approach is used (determine the best ordering!).
• Very good graphical tools now exist which help with the identification and integration
approach.

OVERVIEW OF THE SCHEMA INTEGRATION :

• Preintegration : identifies the keys and defines the ordering of the binary processing
approach.
• Comparison : Identification of naming and structural conflicts.
• Conformation : Resolution of the naming and structural conflicts.
• Restructuration and Merging of the different intermediate schemas into the global
conceptual scheme (GCS).
• Interaction with an integrator is absolutely necessary.

Preintegration

• Preintegration establishes the rules of the integration process, i.e. the integration
method is selected (e.g. binary or iterative n-ary) and then the order of the schema
integration (i.e. which intermediate schema is integrated with which one first).
• Candidate keys are determined. Here, for each of the entities in all intermediate
schemes, the keys are determined.
• Potentially equivalent domains of attributes are detected and transformation rules
between the domains should be determined (e.g. one scheme defines the attribute
temperature in degrees Celsius, the other one in Fahrenheit; transformation rules
between the different domains should be prepared for further integration).
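A transformation rule between two equivalent domains can be as simple as a pair of conversion functions. The sketch below uses the temperature example; the rule table and the function name are hypothetical and only illustrate how such rules might be prepared during preintegration.

```python
# Hypothetical rule table: (source domain, target domain) -> conversion.
DOMAIN_RULES = {
    ("fahrenheit", "celsius"): lambda f: (f - 32.0) * 5.0 / 9.0,
    ("celsius", "fahrenheit"): lambda c: c * 9.0 / 5.0 + 32.0,
}


def to_domain(value, source, target):
    """Convert a value from the source domain to the target domain."""
    if source == target:
        return value
    return DOMAIN_RULES[(source, target)](value)


print(to_domain(212.0, "fahrenheit", "celsius"))  # 100.0
print(to_domain(100.0, "celsius", "fahrenheit"))  # 212.0
```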

Comparison

• The comparison phase detects naming conflicts, relationships between
schemes and structural conflicts.
• Naming conflicts are either synonym or homonym problems.
• Two identical entities which have different names are synonyms, and two different
entities that have identical names are homonyms.
• Example 1 : ENGINEER and E are synonyms; they both refer to an engineer
entity.
• Example 2 : TITLE in the network model refers to an employee and is different from the
title related to an engineer, thus these are homonyms.
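Once an integrator has stated which real-world concept each schema element denotes, the two kinds of naming conflict can be listed mechanically. The sketch below assumes such annotations are available as (schema, name, concept) triples; the schema labels, concept strings and function name are hypothetical.

```python
from itertools import combinations


def naming_conflicts(elements):
    """Split cross-schema element pairs into synonyms and homonyms.

    elements: list of (schema, name, concept) triples, where the
        concept is supplied by the integrator (an assumption here).
    """
    synonyms, homonyms = [], []
    for (s1, n1, c1), (s2, n2, c2) in combinations(elements, 2):
        if s1 == s2:
            continue  # only compare elements of different schemas
        if c1 == c2 and n1 != n2:
            synonyms.append(((s1, n1), (s2, n2)))   # same thing, two names
        elif n1 == n2 and c1 != c2:
            homonyms.append(((s1, n1), (s2, n2)))   # one name, two things
    return synonyms, homonyms


elements = [
    ("er", "ENGINEER", "engineer"),
    ("relational", "E", "engineer"),
    ("network", "TITLE", "title of an employee"),
    ("relational", "TITLE", "title of an engineer"),
]
synonyms, homonyms = naming_conflicts(elements)
print(synonyms)  # [(('er', 'ENGINEER'), ('relational', 'E'))]
print(homonyms)  # [(('network', 'TITLE'), ('relational', 'TITLE'))]
```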

Relationship between the schemes

• The determination of the relationship is based on the recognition of the synonyms as
described before.
• There are four possible relationships between schemes :

1) Equivalent
2) One is a subset of the other
3) Some components of one occur in the other
4) No overlap at all.

Structural conflicts

• Type Conflicts : A type conflict happens if the same object is represented by an
attribute in one intermediate scheme and by an entity in another scheme.
• Dependency Conflicts : This conflict occurs when different relationship modes
(e.g. one-to-one versus many-to-many) are used to represent the same thing in
different schemas.
• Key Conflicts : This conflict happens if different candidate keys are available and
different primary keys are selected in different schemes.
• Behavioral Conflicts are implied by the modeling process. For example, deleting the
last employee from a department can result in an empty department, whereas for
engineers this may not be allowed.

Conformation

• Conformation is the resolution of the conflicts that are determined at the comparison
phase.
• Naming conflicts are resolved by simply renaming the conflicting elements. In the case
of homonyms, the identically named entities or attributes are extended with the name of the
entity and the name of the scheme they belong to.
• Structural conflicts are resolved by transforming entities/attributes or relationships
between them.

Resolving structural conflicts

• Resolving structural conflicts means the restructuring of some schemes to eliminate the
conflicts.
• Attribute & Entity : A non-key attribute can be transformed into an entity by creating
an intermediate relationship connecting the new entity and a new attribute to represent it.
• Remark : Key attributes to entities require supplemental steps.
• Dependency Conflicts will be resolved by choosing the most general relationship.
• The restructuring is virtually an art rather than a science. Semantic knowledge about
all the intermediate schemas is required, which makes an automatic resolution
difficult. There exist many supporting tools.

Merging and Restructuration

• All modified and non-conflicting schemes must first be merged into a single database
schema and then restructured to create the 'best' (see later) one.
• The merging follows the integration order fixed in the preintegration phase. The
merging should be complete, i.e. all components of all the intermediate schemas
should find their place in the merged one.
• Next, a restructuration takes place which searches for the minimal schema; thus
the redundant relationships are removed.
• Finally, the scheme can be re-transformed to be more understandable. Because this
process is largely automatic and ignores understandability, it is often necessary for the
integrator to rebuild or extend some relationships (here minimality can be lost) so that
the user can understand the scheme and thus formulate correct queries.

ACID PROPERTIES : atomicity, consistency, isolation, and durability.

Atomicity
A transaction's changes to the state are atomic: either all happen or none happen. These
changes include database changes, messages, and actions on transducers.
Consistency
A transaction is a correct transformation of the state. The actions taken as a group do not
violate any of the integrity constraints associated with the state.
Isolation
Even though transactions execute concurrently, it appears to each transaction T, that others
executed either before T or after T, but not both.
Durability
Once a transaction completes successfully (commits), its changes to the database survive
failures.
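Atomicity and consistency can be observed with a small experiment. The sketch below uses Python's standard sqlite3 module with an in-memory database; the account table, its CHECK constraint and the transfer function are assumptions made for illustration. A transfer that would violate the integrity constraint is rolled back as a whole, so the credit that was already applied disappears again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"  # integrity constraint
)
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates happen or none."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account ORDER BY name"))


print(transfer(conn, "A", "B", 30), balances(conn))   # True {'A': 70, 'B': 80}
print(transfer(conn, "A", "B", 500), balances(conn))  # False {'A': 70, 'B': 80}
```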
QUERY PROCESSING :
• Query Processing Overview
• Query Optimization
• Distributed Query Processing Steps

QUERY PROCESSING :
Query processing is a set of all activities starting from query placement to displaying the
results of the query. The steps are as shown in the following diagram
Figure 5.4 Query Processing

RELATIONAL ALGEBRA
Relational algebra defines the basic set of operations of the relational database model. A
sequence of relational algebra operations forms a relational algebra expression. The result of
this expression represents the result of a database query.

The basic operations are −

• Projection
• Selection
• Union
• Intersection
• Minus
• Join
Projection
Projection operation displays a subset of fields of a table. This gives a vertical partition of
the table.

Syntax in Relational Algebra

π<AttributeList>(<TableName>)

For example, let us consider the following Student database −

STUDENT

Roll_No  Name             Course  Semester  Gender
2        Amit Prasad      BCA     1         Male
4        Varsha Tiwari    BCA     1         Female
5        Asif Ali         MCA     2         Male
6        Joe Wallace      MCA     1         Male
8        Shivani Iyengar  BCA     1         Female

Table 5.1 Student Data

If we want to display the names and courses of all students, we will use the following
relational algebra expression −
π_{Name, Course}(STUDENT)

Selection
Selection operation displays a subset of tuples of a table that satisfies certain conditions.
This gives a horizontal partition of the table.

SYNTAX IN RELATIONAL ALGEBRA

σ_{<Conditions>}(<TableName>)

For example, in the Student table, if we want to display the details of all students who have
opted for the MCA course, we will use the following relational algebra expression −
σ_{Course="MCA"}(STUDENT)

Combination of Projection and Selection Operations

For most queries, we need a combination of projection and selection operations. There are
two ways to write these expressions −

• Using a sequence of projection and selection operations.

• Using the rename operation to generate intermediate results.
For example, to display the names of all female students of the BCA course −
Relational algebra expression using a sequence of projection and selection operations

• π_{Name}(σ_{Gender="Female" AND Course="BCA"}(STUDENT))

Relational algebra expression using the rename operation to generate intermediate results

• FemaleBCAStudent ← σ_{Gender="Female" AND Course="BCA"}(STUDENT)
  Result ← π_{Name}(FemaleBCAStudent)
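The two operators can be sketched in Python over the rows of Table 5.1. The function names
project and select, and the representation of a relation as a list of dictionaries, are
choices made for this illustration only.

```python
# Rows of Table 5.1 (Student Data).
STUDENT = [
    {"Roll_No": 2, "Name": "Amit Prasad", "Course": "BCA", "Semester": 1, "Gender": "Male"},
    {"Roll_No": 4, "Name": "Varsha Tiwari", "Course": "BCA", "Semester": 1, "Gender": "Female"},
    {"Roll_No": 5, "Name": "Asif Ali", "Course": "MCA", "Semester": 2, "Gender": "Male"},
    {"Roll_No": 6, "Name": "Joe Wallace", "Course": "MCA", "Semester": 1, "Gender": "Male"},
    {"Roll_No": 8, "Name": "Shivani Iyengar", "Course": "BCA", "Semester": 1, "Gender": "Female"},
]

def project(relation, attributes):
    """Projection: keep only `attributes` and drop duplicate tuples."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

def select(relation, predicate):
    """Selection: keep only the tuples for which `predicate` holds."""
    return [row for row in relation if predicate(row)]

# Intermediate result first, then the projection, as in the second form above.
female_bca_student = select(
    STUDENT, lambda r: r["Gender"] == "Female" and r["Course"] == "BCA")
result = project(female_bca_student, ["Name"])
```

Nesting the calls as project(select(...), ["Name"]) corresponds to the first form, a single
expression without a named intermediate result.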

Union
If P is a result of an operation and Q is a result of another operation, the union of P and Q
(P ∪ Q) is the set of all tuples that are in P, in Q, or in both, without duplicates.
For example, to display all students who are either in Semester 1 or are in BCA course −
Sem1Student ← σ_{Semester=1}(STUDENT)
BCAStudent ← σ_{Course="BCA"}(STUDENT)
Result ← Sem1Student ∪ BCAStudent

Intersection
If P is a result of an operation and Q is a result of another operation, the intersection of P and
Q (P ∩ Q) is the set of all tuples that are in both P and Q.
For example, given the following two schemas −

EMPLOYEE

EmpID Name City Department Salary

PROJECT

PId City Department Status

To display the names of all cities where a project is located and also an employee resides −
CityEmp ← π_{City}(EMPLOYEE)
CityProject ← π_{City}(PROJECT)
Result ← CityEmp ∩ CityProject

Minus
If P is a result of an operation and Q is a result of another operation, P - Q is the set of all
tuples that are in P and not in Q.

For example, to list all the departments which do not have an ongoing project (projects with
status = ongoing) −
AllDept ← π_{Department}(EMPLOYEE)
ProjectDept ← π_{Department}(σ_{Status="ongoing"}(PROJECT))
Result ← AllDept − ProjectDept
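Union, intersection and minus map directly onto Python's set operators. The sketch below
applies all three to the rows of Table 5.1. The notes give no rows for EMPLOYEE and PROJECT,
so the student data is reused here instead, and the helper roll_numbers is an assumption of
this example.

```python
# Rows of Table 5.1 as tuples: (Roll_No, Name, Course, Semester, Gender)
STUDENT = {
    (2, "Amit Prasad", "BCA", 1, "Male"),
    (4, "Varsha Tiwari", "BCA", 1, "Female"),
    (5, "Asif Ali", "MCA", 2, "Male"),
    (6, "Joe Wallace", "MCA", 1, "Male"),
    (8, "Shivani Iyengar", "BCA", 1, "Female"),
}

sem1_student = {t for t in STUDENT if t[3] == 1}      # selection on Semester = 1
bca_student = {t for t in STUDENT if t[2] == "BCA"}   # selection on Course = "BCA"

union = sem1_student | bca_student          # in either relation, no duplicates
intersection = sem1_student & bca_student   # in both relations
minus = sem1_student - bca_student          # in the first but not in the second

def roll_numbers(relation):
    """Sorted roll numbers, to make the results easy to compare."""
    return sorted(t[0] for t in relation)
```

Because a Python set cannot hold the same tuple twice, the duplicate elimination required
by the union comes for free.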

Join
Join operation combines related tuples of two different tables (results of queries) into a
single table.

For example, consider two schemas, Customer and Branch in a Bank database as follows −

CUSTOMER

CustID AccNo TypeOfAc BranchID DateOfOpening

BRANCH

BranchID BranchName IFSCcode Address

To list the customer details along with branch details −

Result ← CUSTOMER ⋈_{Customer.BranchID = Branch.BranchID} BRANCH
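A join on a shared attribute can be sketched as a small hash join. The attribute names follow
the CUSTOMER and BRANCH schemas above, but every row value below is invented for this
illustration, and equijoin is a hypothetical helper rather than a library function.

```python
def equijoin(left, right, attr):
    """Join two relations on equality of a shared attribute `attr`."""
    index = {}
    for r in right:                      # build a lookup table on the join attribute
        index.setdefault(r[attr], []).append(r)
    # Combine each left tuple with every matching right tuple.
    return [{**l, **r} for l in left for r in index.get(l[attr], [])]

# Hypothetical sample rows.
CUSTOMER = [
    {"CustID": 1, "AccNo": 1001, "TypeOfAc": "Savings", "BranchID": "B1",
     "DateOfOpening": "2020-01-15"},
    {"CustID": 2, "AccNo": 1002, "TypeOfAc": "Current", "BranchID": "B2",
     "DateOfOpening": "2021-06-30"},
    {"CustID": 3, "AccNo": 1003, "TypeOfAc": "Savings", "BranchID": "B9",
     "DateOfOpening": "2022-03-01"},
]
BRANCH = [
    {"BranchID": "B1", "BranchName": "Central", "IFSCcode": "X0001", "Address": "Main Street"},
    {"BranchID": "B2", "BranchName": "North", "IFSCcode": "X0002", "Address": "Hill Road"},
]

result = equijoin(CUSTOMER, BRANCH, "BranchID")
```

Customer 3 refers to a branch that is not in BRANCH, so it has no partner tuple and does not
appear in the result.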
LAYERS OF QUERY PROCESSING :

Figure 5.5 layers of query processing

The problem of query processing can itself be decomposed into several subproblems, corresponding
to various layers. A generic layering scheme for query processing is shown where each layer solves a
well-defined subproblem. To simplify the discussion, let us assume a static and semicentralized
query processor that does not exploit replicated fragments. The input is a query on global data
expressed in relational calculus. This query is posed on global (distributed) relations, meaning that
data distribution is hidden. Four main layers are involved in distributed query processing. The first
three layers map the input query into an optimized distributed query execution plan. They perform
the functions of query decomposition, data localization, and global query optimization. Query
decomposition and data localization correspond to query rewriting. The first three layers are
performed by a central control site and use schema information stored in the global directory. The
fourth layer performs distributed query execution by executing the plan and returns the answer to
the query. It is done by the local sites and the control site.

GENERIC LAYERING SCHEME FOR DISTRIBUTED QUERY PROCESSING

QUERY DECOMPOSITION
The first layer decomposes the calculus query into an algebraic query on global relations. The
information needed for this transformation is found in the global conceptual schema
describing the global relations. However, the information about data distribution is not used
here but in the next layer. Thus the techniques used by this layer are those of a centralized
DBMS.
Query decomposition can be viewed as four successive steps. First, the calculus query is
rewritten in a normalized form that is suitable for subsequent manipulation. Normalization of
a query generally involves the manipulation of the query quantifiers and of the query
qualification by applying logical operator priority.
Second, the normalized query is analyzed semantically so that incorrect queries are detected
and rejected as early as possible. Techniques to detect incorrect queries exist only for a subset
of relational calculus. Typically, they use some sort of graph that captures the semantics of
the query.
Third, the correct query (still expressed in relational calculus) is simplified. One way to
simplify a query is to eliminate redundant predicates. Note that redundant queries are likely
to arise when a query is the result of system transformations applied to the user query. Such
transformations are used for performing semantic data control (views, protection, and
semantic integrity control).
Fourth, the calculus query is restructured as an algebraic query. Recall that several algebraic
queries can be derived from the same calculus query, and that some algebraic queries are “better”
than others. The quality of an algebraic query is defined in terms of expected performance.
The traditional way to do this transformation toward a “better” algebraic specification is to
start with an initial algebraic query and transform it in order to find a “good” one. The initial
algebraic query is derived immediately from the calculus query by translating the predicates
and the target statement into relational operators as they appear in the query. This directly
translated algebra query is then restructured through transformation rules. The algebraic
query generated by this layer is good in the sense that the worst executions are typically
avoided. For instance, a relation will be accessed only once, even if there are several select
predicates. However, this query is generally far from providing an optimal execution, since
information about data distribution and fragment allocation is not used at this layer.
DATA LOCALIZATION :
The input to the second layer is an algebraic query on global relations. The main role of the
second layer is to localize the query’s data using data distribution information in the fragment
schema. We saw that relations are fragmented and stored in disjoint subsets, called fragments,
each being stored at a different site. This layer determines which fragments are involved in
the query and transforms the distributed query into a query on fragments. Fragmentation is
defined by fragmentation predicates that can be expressed through relational operators. A
global relation can be reconstructed by applying the fragmentation rules, and then deriving a
program, called a localization program, of relational algebra operators, which then act on
fragments. Generating a query on fragments is done in two steps. First, the query is mapped
into a fragment query by substituting each relation by its reconstruction program (also
called materialization program). Second, the fragment query is simplified and restructured to
produce another “good” query. Simplification and restructuring may be done according to the
same rules used in the decomposition layer. As in the decomposition layer, the final fragment
query is generally far from optimal because information regarding fragments is not utilized.
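The two localization steps can be sketched for a horizontally fragmented relation. Everything
concrete below is assumed for illustration: the relation EMP, its fragments EMP1 to EMP3, the
range predicates on eno, and the sample rows do not appear in the notes.

```python
# Hypothetical horizontal fragmentation of a global relation EMP.
# Each fragment holds the tuples with lo <= eno < hi (its fragmentation predicate).
FRAGMENTS = {
    "EMP1": {"lo": 0, "hi": 100,
             "rows": [{"eno": 10, "name": "Ann"}, {"eno": 70, "name": "Bob"}]},
    "EMP2": {"lo": 100, "hi": 200, "rows": [{"eno": 150, "name": "Cy"}]},
    "EMP3": {"lo": 200, "hi": 300, "rows": [{"eno": 250, "name": "Di"}]},
}

def reconstruct(fragments):
    """Localization program for horizontal fragments: the union of all fragments."""
    return [row for f in fragments.values() for row in f["rows"]]

def relevant_fragments(fragments, eno):
    """Simplification: drop fragments whose predicate contradicts eno = value."""
    return [name for name, f in fragments.items() if f["lo"] <= eno < f["hi"]]

def localized_select(fragments, eno):
    """Evaluate the selection eno = value on the remaining fragments only."""
    rows = []
    for name in relevant_fragments(fragments, eno):
        rows.extend(r for r in fragments[name]["rows"] if r["eno"] == eno)
    return rows
```

Step one replaces EMP by its reconstruction program (EMP1 ∪ EMP2 ∪ EMP3). Step two removes
the fragments that cannot contribute, so a selection on eno = 150 touches only EMP2 yet
returns the same answer as the selection on the reconstructed relation.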
GLOBAL QUERY OPTIMIZATION :
The input to the third layer is an algebraic query on fragments. The goal of query
optimization is to find an execution strategy for the query which is close to optimal.
Remember that finding the optimal solution is computationally intractable. An execution
strategy for a distributed query can be described with relational algebra operators
and communication primitives (send/receive operators) for transferring data between sites.
The previous layers have already optimized the query, for example, by eliminating redundant
expressions. However, this optimization is independent of fragment characteristics such as
fragment allocation and cardinalities. In addition, communication operators are not yet
specified. By permuting the ordering of operators within one query on fragments, many
equivalent queries may be found.
Query optimization consists of finding the “best” ordering of operators in the query,
including communication operators, that minimizes a cost function. The cost function, often
defined in terms of time units, refers to computing resources such as disk space, disk I/Os,
buffer space, CPU cost, communication cost, and so on. Generally, it is a weighted
combination of I/O, CPU, and communication costs. Nevertheless, a typical simplification
made by the early distributed DBMSs, as we mentioned before, was to consider
communication cost as the most significant factor. This used to be valid for wide area
networks, where the limited bandwidth made communication much more costly than local
processing. This is no longer true today, and communication cost can be lower than I/O
cost. To select the ordering of operators it is necessary to predict execution costs of
alternative candidate orderings. Determining execution costs before query execution (i.e.,
static optimization) is based on fragment statistics and the formulas for estimating the
cardinalities of results of relational operators. Thus the optimization decisions depend on the
allocation of fragments and available statistics on fragments, which are recorded in the
allocation schema.
An important aspect of query optimization is join ordering, since permutations of the joins
within the query may lead to improvements of orders of magnitude. One basic technique for
optimizing a sequence of distributed join operators is through the semijoin operator. The
main value of the semijoin in a distributed system is to reduce the size of the join operands
and thus the communication cost. However, techniques which consider local processing costs
as well as communication costs may not use semijoins because they might increase local
processing costs. The output of the query optimization layer is an optimized algebraic query
on fragments with communication operators included. It is typically represented and saved
(for future executions) as a distributed query execution plan.
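The effect of a semijoin on communication cost can be sketched with two relations held at
different sites. The relations R and S, their sizes, and the crude cost measure (number of
tuples or key values shipped) are assumptions of this example, not figures from the notes.

```python
def semijoin(r, s, attr):
    """Semijoin: the tuples of r that have at least one match in s on `attr`."""
    keys = {row[attr] for row in s}
    return [row for row in r if row[attr] in keys]

def join(r, s, attr):
    """Plain nested-loop equijoin, used here only to compare results."""
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

# Hypothetical operands: R is stored at site 1, S at site 2, result needed at site 2.
R = [{"k": i, "a": f"r{i}"} for i in range(1000)]
S = [{"k": i, "b": f"s{i}"} for i in (3, 7, 500)]

# Strategy A: ship all of R to site 2 and join there.
cost_ship_whole_r = len(R)

# Strategy B: ship the join keys of S to site 1, reduce R there with a semijoin,
# then ship only the reduced R back to site 2 for the final join.
join_keys = {row["k"] for row in S}
reduced_r = semijoin(R, S, "k")
cost_with_semijoin = len(join_keys) + len(reduced_r)
```

Both strategies produce the same join result. The semijoin pays off here because S is very
selective; with a less selective S the extra round trip and the local processing could cost
more than they save, which is the trade-off described above.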
DISTRIBUTED QUERY EXECUTION :
The last layer is performed by all the sites having fragments involved in the query. Each
subquery executing at one site, called a local query, is then optimized using the local schema
of the site and executed. At this time, the algorithms to perform the relational operators may
be chosen. Local optimization uses the algorithms of centralized systems.
The goal of distributed query processing may be summarized as follows: given a calculus
query on a distributed database, find a corresponding execution strategy that minimizes a
system cost function that includes I/O, CPU, and communication costs. An execution strategy
is specified in terms of relational algebra operators and communication primitives
(send/receive) applied to the local databases (i.e., the relation fragments). Therefore, the
complexity of relational operators that affect the performance of query execution is of major
importance in the design of a query processor.
TRANSACTION AND COMPUTATION MODEL
• Page Model
• Object Model

Page Model
Syntax
A transaction t is a partial order of steps (actions) of the form r(x) or w(x), where x ∈ D, and
reads and writes as well as multiple writes applied to the same object are ordered.
We write t = (op, <)
for transaction t with step set op and partial order <.
Example: r(s) w(s) r(t) w(t)
Semantics
Interpretation of the j-th step, p_j, of t:
If p_j = r(x), then the interpretation is the assignment v_j := x to a local variable v_j.
If p_j = w(x), then the interpretation is the assignment x := f_j(v_j1, ..., v_jk),
with an unknown function f_j and j1, ..., jk denoting t's prior read steps.
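The semantics above can be sketched by interpreting a totally ordered transaction step by
step. Since the functions f_j are unknown, the sketch stores each written value as a symbolic
term. The function interpret and the initial values s0 and t0 are assumptions of this
illustration.

```python
def interpret(steps, db):
    """Interpret page-model steps given as ("r", x) or ("w", x) pairs.

    A read records the current value of x as a local variable v_j.
    A write assigns x := f_j(...) over the values of all prior reads,
    kept symbolic because f_j is unknown.
    """
    db = dict(db)      # work on a copy of the database state
    reads = []         # values v_j1, ..., v_jk read so far
    for j, (kind, x) in enumerate(steps, start=1):
        if kind == "r":
            reads.append(db[x])
        else:
            db[x] = f"f{j}({', '.join(reads)})"
    return db

# The example transaction r(s) w(s) r(t) w(t) on initial values s0 and t0.
steps = [("r", "s"), ("w", "s"), ("r", "t"), ("w", "t")]
final_state = interpret(steps, {"s": "s0", "t": "t0"})
```

The write of t depends on both earlier reads, so its value is a term over s0 and t0, while
the write of s depends only on the first read.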
Object Model
A transaction t is a (finite) tree of labeled nodes with
• the transaction identifier as the label of the root node,
• the names and parameters of invoked operations as labels of inner nodes, and
• page-model read/write operations as labels of leaf nodes, along with a partial order <
on the leaf nodes such that for all leaf-node operations p and q with p of the form w(x)
and q of the form r(x) or w(x) or vice versa, we have
p<q ∨ q<p
Special case: layered transactions (all leaves have the same distance from the root)
Derived inner-node ordering: a < b if all leaf-node descendants of a precede all leaf-node
descendants of b
Example: DBS Internal Layers

Figure 5.6 DBS Internal Layers

Figure 5.7 Business Objects

MULTIDATABASE CONCURRENCY CONTROL :
Concurrency control in hierarchical MDBSs. In this section, we present a framework for the
design of concurrency control mechanisms for hierarchical MDBSs. In a hierarchical MDBS,
for the global schedule S to be serializable, the projection of S onto data items in each domain
D ∈ ∆ (that is, S^D) must be serializable. However, as illustrated in the following example,
ensuring serializability of S^D, for each D ∈ ∆, is not sufficient to ensure global
serializability.
For example, in a schedule generated by a serialization-graph-testing (SGT) scheduler, it may
not be possible to associate a serialization function with transactions. However, in such
schedules, serialization functions can be introduced by forcing direct conflicts between
transactions.
Let τ′ ⊆ τ be a set of transactions in a schedule S. If each transaction in τ′ executed a
conflicting operation (say a write operation on data item ticket) in S, then the function that
maps a transaction Ti ∈ τ′ to its write operation on ticket is the serialization function for the
transactions in S with respect to the set of transactions τ′. Associating serialization functions
with global transactions makes the task of ensuring serializability of S^D relatively simple.
Since at each local DBMS the order in which transactions that are global with respect to the
local DBMSs are serialized is consistent with the order in which their ser_Sk operations
execute, serializability of S^D can be ensured by simply controlling the execution order of the
ser_Sk operations belonging to the transactions global with respect to the local DBMSs. To see
how this can be achieved, for a global transaction Ti, let us denote its projection to its
serialization function values over the local DBMSs as a transaction T̃_i^D.
Formally, T̃_i^D is defined as follows.
Definition 1. Let Ti be a transaction and D be a simple domain such that global(Ti, DBk)
for some DBk, where child(DBk, D). T̃_i^D is a restriction of Ti consisting of all the
operations in the set {ser_Sk(Ti) | Ti executes in DBk, and child(DBk, D)}.
Further, for the global schedule S, we define a schedule S̃^D to be the restriction of S
consisting of the set of operations belonging to transactions T̃_i^D. Thus,
S̃^D = (τ_{S̃^D}, ≺_{S̃^D}), where τ_{S̃^D} = {T̃_i^D | global(Ti, DBk) for some DBk, where
child(DBk, D)}, and for all operations o_q, o_r in S̃^D, o_q ≺_{S̃^D} o_r iff o_q ≺_S o_r.
In the schedule S̃^D the conflict between operations is defined as follows:
Definition 2. Let S be a global schedule. Operations ser_Sk(Ti) and ser_Sl(Tj) in schedule
S̃^D, Ti ≠ Tj, are said to conflict if and only if k = l.
It is not too difficult to show that the serializability of the schedule S^D can be ensured by
ensuring the serializability of the schedule S̃^D. Essentially, ensuring serializability of S̃^D
enforces a total order over global transactions (with respect to the local DBMSs), such that if
Ti occurs before Tj in the total order, then the ser_Sk operation of Ti occurs before the ser_Sk
operation of Tj for all sites s_k at which they execute in common, thereby ensuring
serializability of S^D.
Notice that operations in the schedule S̃^D belong only to global transactions. Thus, since
global transactions execute under the control of the MDBS software, the MDBS software can
control the execution of the operations in S̃^D to ensure its serializability, thereby ensuring
serializability of S^D. How this can be achieved, that is, how the MDBS software can ensure
serializability of S̃^D, is a topic of the next section. Recall that the above-described
mechanism for ensuring serializability of S^D has been developed under the assumption that
D is a simple domain. In the remainder of this section, we extend the mechanism suitably to
ensure serializability of the schedule S^D for an arbitrary domain D. One way we can extend
the mechanism to arbitrary domains in hierarchical MDBSs is by suitably extending the
notion of the serialization function to the set of domains.
Definition 3. Let D be any arbitrary domain in ∆. An extended serialization function is a
function sf(Ti, D) that maps a given transaction Ti and a domain D to some operation of Ti
that executes in D such that the following holds. For all Ti, Tj, if global(Ti, D), global(Tj,
D), and Ti →*_{S^D} Tj, then sf(Ti, D) ≺_{S^D} sf(Tj, D).
We refer to sf(Ti, D) as a serialization function of transaction Ti with respect to the domain
D. To see how such a serialization function will aid us in ensuring serializability within a
domain, consider a domain D ≠ DBk, k = 1, 2, . . . , m. To develop the intuition, let us assume
that the above-defined serialization function exists for transactions in every child domain of
D, that is, for every Dk, where child(Dk, D). If such a serialization function can be associated
with the child domains, we can simply use the mechanism developed for simple domains to
ensure serializability of S^D. We will, however, have to appropriately extend our definitions of
the transaction T̃_i^D and the schedule S̃^D with respect to the newly defined serialization
function. This is done below.
Definition 4. Let Ti be a transaction and D be a domain such that global(Ti, Dk) for some
Dk, where child(Dk, D). T̃_i^D is a restriction of Ti consisting of all the operations in the set
{sf(Ti, Dk) | Ti executes in Dk, and child(Dk, D)}.
As before, schedule S̃^D is simply the schedule consisting of the operations in the
transactions T̃_i^D. That is, S̃^D = (τ_{S̃^D}, ≺_{S̃^D}), where τ_{S̃^D} = {T̃_i^D | global(Ti, Dk)
for some Dk, where child(Dk, D)}, and for all operations o_q, o_r in S̃^D, o_q ≺_{S̃^D} o_r iff
o_q ≺_S o_r. Similar to the case of simple domains, two operations in S̃^D, where D is an
arbitrary domain, conflict if they are both serialization function values of different
transactions over the same child domain.
Definition 5. Let S be a global schedule. Operations sf(Ti, Dk) and sf(Tj, Dl) in schedule
S̃^D, Ti ≠ Tj, are said to conflict if and only if k = l.
It is not difficult to see that, similar to the case of simple domains, serializability of S^D can
be ensured, where D is an arbitrary domain, by ensuring the serializability of the schedule
S̃^D, under the assumption that, for all child domains Dk of D, the schedule S^Dk is
serializable and further a serialization function sf can be associated with transactions that are
global with respect to Dk (see Lemma 1 in the appendix for a formal proof). In fact, this
result can be applied recursively over the domain hierarchy to ensure serializability of the
schedules S^D for arbitrary domains D in hierarchical MDBSs. To see this, consider a
hierarchical MDBS shown in Fig. 4. To ensure serializability of S^D3, it suffices to ensure
serializability of the schedule S̃^D3, under the assumption that S^D1 and S^D2 are serializable
and further that an appropriate serialization function sf can be associated with transactions
that are global with respect to D1 and D2. In turn, serializability of S^D1 (S^D2) can be ensured
by ensuring that the schedule S̃^D1 (S̃^D2) is serializable, under the assumption that S^DB1 and
S^DB2 (S^DB3 and S^DB4) are serializable and further that an appropriate serialization function
sf can be associated with transactions that are global with respect to DB1 and DB2 (DB3 and
DB4). The recursion ends when D is a simple domain, since the child domains are local
DBMSs and by assumption the schedule at each local DBMS is serializable. Thus, if we can
associate an appropriate serialization function sf with transactions in each domain D ∈ ∆, we
can ensure serializability of S^D by ensuring serializability of S̃^D for all domains D ∈ ∆.
Note that, for a domain D = DBk, the function sf is simply the function ser_Sk introduced
earlier. We now define the function sf for an arbitrary domain D ∈ ∆, which is done
recursively over the domain ordering relation.
Definition 6. Let D be a domain and Ti be a transaction such that global(Ti, D). The
serialization function for transaction Ti in domain D is defined as follows:
sf(Ti, D) = ser_Sk(Ti), if D = DBk for some DBk;
sf(Ti, D) = ser_{S̃^D}(T̃_i^D), if D ≠ DBk for all DBk.
Let us illustrate the above definition of the serialization function using the following example.
Example 3. Consider an MDBS environment consisting of local databases: DBMS1 with data
item a, DBMS2 with data item b, DBMS3 with data item c, and DBMS4 with data item d. Let
the domain ordering relation be as illustrated in Fig. 4. The set of domains: ∆ = {DB1, DB2,
DB3, DB4, D1, D2, D3}
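The conflict rule of Definition 2 suggests a simple serializability test for a schedule of
serialization-function operations: build a precedence graph and check it for cycles. The
sketch below assumes a schedule is given as a list of (transaction, site) pairs in execution
order, one pair per ser_Sk(Ti) operation. The function name and the sample schedules are
hypothetical, and this is one possible check, not the mechanism of the notes.

```python
from graphlib import TopologicalSorter, CycleError

def serialization_order(schedule):
    """Return a serial order consistent with the schedule, or None if none exists.

    schedule: list of (transaction, site) pairs in execution order.
    Two operations conflict iff they belong to different transactions
    and execute at the same site (k = l).
    """
    predecessors = {t: set() for t, _ in schedule}
    for i, (ti, k) in enumerate(schedule):
        for tj, l in schedule[i + 1:]:
            if ti != tj and k == l:
                predecessors[tj].add(ti)   # ti must be serialized before tj
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None                        # cyclic graph: not serializable

# Same relative order at both sites: serializable as T1, T2.
consistent = [("T1", "DB1"), ("T2", "DB1"), ("T1", "DB2"), ("T2", "DB2")]
# T1 before T2 at DB1 but T2 before T1 at DB2: no total order exists.
inconsistent = [("T1", "DB1"), ("T2", "DB1"), ("T2", "DB2"), ("T1", "DB2")]
```

An acyclic graph yields exactly the total order over global transactions described above,
in which the operations of any two transactions appear in the same relative order at every
site they share.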
MULTIDATABASE RECOVERY :
ReMT - A Recovery Strategy for MDBSs
As already mentioned, reliability in MDBSs requires the design of two different types of
protocols: commit and recovery protocols. A commit protocol enforces commit atomicity of
global transactions. In this section, we will present a strategy, called ReMT, for recovering
multidatabase consistency after failures, without human intervention. In MDBSs, recovering
multidatabase consistency has a twofold meaning. First, for global transaction aborts,
recovering multidatabase consistency means to undo the effects of locally committed
subsequences belonging to the aborted global transactions from a semantic point of view. In
addition, the effects of transactions which have accessed objects updated by aborted global
transactions should be preserved (recall that, after the last operation of a subsequence, all
locks held by the subsequence are released). For the other types of failures, recovering
multidatabase consistency means to restore the most recent global transaction-consistent
state. We say that a multidatabase is in a global transaction-consistent state if all local
DBMSs reflect the effects of locally committed subsequences.
The ReMT strategy consists of a collection of recovery protocols which are distributed among
the components of an MDBS. Hence, some of them are performed by the GRM, some by the
servers and some are provided by the LDBMSs. We assume that every participating LDBMS
provides its own recovery mechanism. Local recovery mechanisms should be able to restore
the most recent transaction-consistent state of local databases after local failures. For each
type of failure, we propose a specific recovery scheme.
Transaction Failures
As seen before, we identify different kinds of transaction failures which may occur in a
multidatabase environment. Each of them can be dealt with in a different manner. First, a
particular global transaction may fail. This can be caused by a decision of the GTM or can be
requested by the transaction itself. Second, a given subsequence of a global transaction may
fail. In the following, we will propose recovery procedures to cope with failures of global
transactions and subsequences.
A global transaction failure may occur for two reasons. The abort can be requested by the
transaction or it occurs on behalf of the MDBS. The GTM can identify the reason which has
caused the abort. This is because the GTM receives an abort operation from the transaction,
whenever the abort is required by the transaction. We have observed that the recovery
protocol for global transaction failures can be optimized if the following design decision is
used: specific recovery actions should be defined for each situation in which a global
transaction abort occurs. Therefore, we have designed recovery actions which should be
triggered when the global transaction requires the abort, and recovery actions for coping with
aborts which occur on behalf of the MDBS.
Aborts Required by Transactions
Since we assume that updates of a global transaction Gi may be viewed by other transactions,
we cannot restore the database state which existed before the execution of Gi, if Gi aborts.
This implies that the standard transaction undo action cannot be used in such a situation.
However, the effects of a global transaction must be somehow removed from the database, if it
aborts. For that reason, we need a more adequate recovery paradigm for such an abort
scenario. This new recovery paradigm should primarily focus on the fact that the effects of
transactions which have accessed the objects updated by an aborted global transaction Gi, and
database consistency, should be preserved when removing the effects of Gi from the database.
The key to this new recovery paradigm is the notion of compensating transactions. A
compensating transaction CT “undoes” the effect of a particular transaction T from a semantic
point of view. That means, CT does not restore the physical database state which existed
before the execution of the transaction T. The compensation guarantees that a consistent (in
the sense that all integrity constraints are preserved) database state is established based on
semantic information, which is application-specific.
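The difference between a semantic undo and a physical undo can be sketched with a bank
account. The Account class, the withdraw function and the amounts are assumptions of this
example. The sketch is not the ReMT implementation.

```python
class Account:
    """Hypothetical data object shared by several transactions."""
    def __init__(self, balance):
        self.balance = balance

def withdraw(account, amount):
    """Forward transaction T. Returns its compensating transaction CT."""
    account.balance -= amount

    def compensate():
        # Semantic inverse of the withdrawal: deposit the same amount.
        # It does not restore the before-image of the balance.
        account.balance += amount

    return compensate

account = Account(100)
before_image = account.balance     # physical state before T: 100

ct = withdraw(account, 30)         # T commits locally, balance is 70
account.balance += 50              # another transaction sees 70 and deposits 50
ct()                               # CT is serialized after T, balance is 150
```

Restoring the before-image (100) would silently erase the deposit made by the other
transaction. The compensating transaction removes only the effect of T and preserves the
effect of the transaction that accessed the updated object.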
By definition, a compensating transaction CTi should be associated with a transaction Ti and
may only be executed within the context of Ti. That means that the existence of CTi depends
on Ti. In other words, CTi may only be executed, if Ti has been executed before. Hence, CTi
must be serialized after Ti. We will assume that persistence of compensation is guaranteed,
that is, once the compensating action has been started, it is completed successfully. For our
purpose the concept of compensation is realized as follows. For a given transaction Gi
consisting of subsequences SUBi;1, SUBi;2, :::, SUBi;n, a global compensating transaction
CTi is dened, which in turn consists of a collection of local compensating transactions CTi;k,
0 < k n. Each local compensating transaction CTi;k is associated to the corresponding
subsequence SUBi;k of transaction Gi. Of course, CTi;k must be performed at the same local
site as does SUBi;k and must be serialized after SUBi;k. Now, we are in a position to
describe the recovery strategy for aborts required by transactions. When the GS receives an
abort request from a global transaction Gi, the GS forwards this operation to the GRM. The
GRM reads the global log in order to identify which subsequences of Gi are still active. For
each active subsequence, the GRM sends a local abort operation to the servers responsible for
the execution of the subsequence. The GRM then waits for an acknowledgment from these
servers conrming that the subsequences were aborted. After that, the GRM triggers the
corresponding local compensating transactions for every subsequence which has already been
locally committed. This information can be retrieved from the global log le. Operations of the
compensating transactions are scheduled by the GS. Therefore, the execution of local
compensating transactions will undo the effect of committed subsequences from a semantic
point of view. Since we have assumed that the LDBMSs implement 2PL to enforce local
serializability, the compensation mechanism described above satisfies the following
requirement. A particular transaction T (subsequence or local transaction) running at a local
system either views a database state reacting the effects of an updating subsequence SUBi;k
or it accesses a state produced by the compensating transaction of SUBi;k, namely CTi;k. In
other words, T cannot access objects updated by SUBi;k and by CTi;k. Such a constraint is
required for preserving local database consistency. Thus far, we have assumed that the effect
of any transaction can be removed from the database by means of a compensating
transaction. However, not all transactions are compensatable. There are some actions,
classified by Gray as real actions ,which present the following property: once they are done,
they cannot be undone anymore. For some of these actions, the user does not know how they
can be compensated, that is, the semantic of such compensating transactions is unknow. For
instance, the action ring a missile cannot be undone. Moreover, the semantic of a
compensating transaction for this action cannot be defined. For that reason, we say that
transactions involving such real actions are not compensatable. In order to overcome this
problem, we propose the following mechanism. The execution of local commit operations for
non-compensatable subsequences should be delayed until the GTM receives a commit for the
global transaction containing the non-compensatable subsequences. This mechanism requires
that the following two conditions are satisfied. First, the user should specify which
subsequences of a global transaction are non-compensatable3 . 3When it is not specified that
a subsequence is non-compensatable, it is assumed that the subsequence is compensatable.
This is a reasonable requirement, since our recovery strategy relies on a compensation
mechanism. This latter mechanism presumes that the user defines compensation transactions,
when he or she is designing transactions. Hence, the user can identify at this point, which
subsequences of a global transaction may not be compensatable. Second, the information
identifying which subsequences are non-compensatable should be made available to the
GTM. For instance, the GTM can be designed to receive this information as an input
parameter of subsequences. The procedure of delaying the execution of local commit
operations for non-compensatable subsequences can be realized according to the following
protocol:
1. When the GTM receives the first operation of a particular subsequence, it must
identify whether the subsequence is compensatable. If the subsequence is non-compensatable,
the GTM saves this information in the log record of the subsequence. The log record should
be stored in the global log file.
2. If the GTM receives a local commit operation for a non-compensatable subsequence, it
marks the log record of the subsequence stored in the global log with a flag. This flag
captures the information that the local commit operation for the subsequence can be
processed when the global transaction is to be committed.
3. Whenever the GTM receives a commit operation for a given global transaction Gi, it
verifies in the global log if there are local commit operations to be processed for
subsequences of Gi. This can be realized by reading the log records of all subsequences
belonging to Gi.
Following this protocol, we ensure that the effects of non-compensatable subsequences are
reflected in the local databases only when the global transaction is to be committed. This
eliminates the possibility of undoing the effect of such subsequences. Unfortunately, this
mechanism has the following disadvantage. Locks held by non-compensatable subsequences can
only be released when the global transaction completes its execution.

Another drawback of the compensation approach is the specification
of compensating transactions for interactive transactions such as design activities. As a
solution for overcoming such a problem, we propose the following strategy. When an
interactive global transaction G has to be aborted and G has some locally committed
subsequences, the GTM reads the global log file in order to identify which subsequences of G
were already locally committed. After that, the GTM notifies the user that the effects of some
subsequences of G must be "manually" undone. The GRM informs which subsequences
should be undone and what operations these subsequences have executed. Moreover, the
GRM informs the user on which objects these operations have been performed. The user then
starts another transaction in order to undo the effect of such subsequences. Objects updated
by these subsequences may have been viewed by other global transactions. For that reason,
the user must know which global transactions have read these objects. With this knowledge
the user can notify other designers that the values of the objects x, y, z they have read (the
GRM has provided this information) are invalid.

Aborts on Behalf of the MDBS

Usually, such aborts occur when global transactions are involved in deadlocks. Deadlocks are
provoked by transactions trying to access the same objects with conflicting locks. Committed
subsequences have already released their locks. Besides this, they are not competing for locks
anymore. Hence, operations of such subsequences can neither provoke nor be involved in
deadlocks. This observation has an important impact in designing recovery actions to cope
with transaction aborts required by the MDBS. It is not necessary to abort entire global
transactions to resolve deadlock situations. Aborting active subsequences is sufficient.
However, we need to replay the execution of the aborted subsequences in order to ensure
commit atomicity. This implies that new results may be produced by the resubmission of the
subsequences. In such a situation, the user must be notified that the subsequences were
aborted and, for that reason, they must be replayed, which may produce different results from
those he/she has already received. With this knowledge, the user can decide to accept the new
results or to abort the entire global transaction. Observe that, if the original values read by the
failed subsequences were not communicated to other subsequences (those reads may be
invalid), the resubmission of the aborted subsequences will produce no inconsistency in the
execution of entire global transaction. Such a requirement is reasonable in a multidatabase
environment. Based on these observations, we propose the following strategy for dealing with
global transaction aborts which occur on behalf of the MDBS. When the GTM (or another
component of the MDBS) decides to abort a transaction Gi, the GRM must be informed that
Gi has to be aborted. When the GRM receives this signal, it verifies in the global log which
subsequences of Gi are still active. For each active subsequence, the GRM sends a local abort
operation to the servers (through the GS, of course) responsible for the submission of these
subsequences to the local systems. In the meantime, the GRM waits for an acknowledgment
from the servers confirming the local aborts of the subsequences. Furthermore, the GRM
sends to the user responsible for the execution of Gi the notification informing that some
subsequences of Gi have to be aborted and they will be replayed. The GRM is able to inform
the user which operations have to be reexecuted. The user can then decide to wait for the
resubmission of the aborted subsequences or to abort the entire global transaction. If the user
decides to abort the entire global transaction, the process of replaying the subsequences is
cancelled and the recovery protocol for global transaction failure requested by the transaction
is triggered. Otherwise, the recovery protocol for global transaction aborts which occur on
behalf of the MDBS goes on as described below. When the GRM has received the
acknowledgments that the subsequences were aborted in the local DBMSs, the GRM starts to
replay the execution of each aborted subsequence SUBi,k. For that purpose, the GRM must
read from the global log file the log record which contains information about the installation
point of each subsequence to be replayed. This record can be identified by the fields SUBID
and LRT. Observe that LRT must have the value 'IP'.
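
The strategy above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `LogRecord` layout, the `info` payload, and the `send_local_abort` callback (standing in for the GRM sending a local abort through a server and waiting for its acknowledgment) are assumptions introduced only for illustration. Only the SUBID and LRT fields and the value 'IP' come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class LogRecord:
    sub_id: str  # SUBID field named in the text
    lrt: str     # LRT field; "IP" marks the installation-point record
    info: str    # assumed free-form payload (hypothetical)


def installation_point(global_log: Iterable[LogRecord],
                       sub_id: str) -> Optional[LogRecord]:
    """Return the record with matching SUBID and LRT == 'IP', if any."""
    for rec in global_log:
        if rec.sub_id == sub_id and rec.lrt == "IP":
            return rec
    return None


def abort_and_plan_replay(global_log: List[LogRecord],
                          active_subs: List[str],
                          send_local_abort: Callable[[str], bool]) -> List[LogRecord]:
    """Abort each active subsequence; plan a replay for acknowledged aborts.

    send_local_abort is a hypothetical callback that returns True once the
    server has acknowledged the local abort.
    """
    acknowledged = [s for s in active_subs if send_local_abort(s)]
    plan = []
    for sub_id in acknowledged:
        ip_record = installation_point(global_log, sub_id)
        if ip_record is not None:
            plan.append(ip_record)
    return plan
```

In this sketch a subsequence is replayed only after its local abort is acknowledged, which mirrors the waiting step described above. The user's option to cancel the replay and abort the whole global transaction is left out.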
As mentioned before, a subsequence of a particular global transaction may abort for many
reasons. However, there are two situations of subsequence aborts which should be handled in
a different manner. The first situation is when the subsequence is aborted on behalf of the
local DBMS. The second situation is when the subsequence decides to abort its execution. In
this section, we describe a recovery method to deal with these two subsequence abort
situations.

Aborts on Behalf of the Local DBMS

Typically, DBMSs decide to abort subsequences when such subsequences are involved in local
deadlocks. After such aborts, the effects of failed subsequences are undone by the LDBMSs.
Locks held by the aborted subsequences are released. As soon as the server recognizes that a
particular subsequence has been aborted by the local DBMS, the server reads the server log
file and retrieves the log records of the aborted subsequence. The server stores a new log
record for the subsequence with LRT='ST' in the server log file. Moreover, the server sends a
message to the GTM reporting that the subsequence has been aborted by the LDBMS. The GRM
forces a log record of the failed subsequence to the global log file. By doing this, the new
state of the subsequence is stored in the global log file as well. After that, the server
forces a log record with the new state of the subsequence to the server log and starts the
resubmission of the aborted subsequence. As already seen, new results may be produced by
such a resubmission. However, we propose a notification mechanism which gives the user the
necessary support to decide whether to accept the new results or to abort the entire global
transaction. It is important to notice here that a given subsequence SUBi,k belonging to a
global transaction Gi may have more than one log record with LRT='ST' (in each log file)
during the execution of Gi. For such a subsequence, only the last record with LRT='ST' should
be considered.
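
The rule that only the last 'ST' record counts can be sketched as a single scan over a log. This is a minimal illustration under assumptions: each log entry is modelled as a hypothetical `(sub_id, lrt, value)` tuple in write order, which the text does not prescribe.

```python
from typing import Iterable, Optional, Tuple

LogEntry = Tuple[str, str, str]  # (SUBID, LRT, value), an assumed layout


def current_state(log: Iterable[LogEntry], sub_id: str) -> Optional[str]:
    """Return the value of the last 'ST' record written for sub_id."""
    state = None
    for entry_sub_id, lrt, value in log:
        if entry_sub_id == sub_id and lrt == "ST":
            state = value  # later records override earlier ones
    return state
```

For a subsequence that was aborted and then resubmitted, the function returns the state stored by the most recent 'ST' record and ignores the older ones.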
Aborts Required by Subsequences

When the subsequence identifies some internal error condition (e.g., violation of some
integrity constraints or bad input), it aborts its execution. Sometimes the resubmission of
the subsequence is sufficient to overcome the error situation. However, we cannot guarantee
that the subsequence will be committed after being resubmitted a certain number of times.
This is because the abort is caused when some internal error condition occurs (e.g. division
by 0). Hence, it is impossible to predict whether or not the same problem will occur in a
repeated execution of the subsequence. In this case, the solution is to abort the complete
global transaction. The user or the GTM should be able to make such a decision. Observe
that, when an internal error occurs, it is necessary that the subsequence reads new values
(new input) and produces new results in order to overcome the internal error condition. Based
on this observation, we propose the following actions for dealing with aborts required by
subsequences. When the subsequence decides to abort its execution, an explicit abort
operation is submitted to the GS, which in turn sends this operation to the GRM. The GRM
then writes a new log record with LRT='ST' for the subsequence in order to reflect its new
state. Thereafter, the GRM forwards the abort operation to the server. In turn, the server
forces a log record with the new state of the subsequence to the server log and submits the
abort operation to the LDBMS. After the subsequence is aborted by the LDBMS, the GTM
resubmits the aborted subsequence to the LDBMS.

6.2 Local System Failures

Local DBSs
reside in heterogeneous and autonomous computer systems (sites). When a system failure
occurs at a particular site, we assume that the LDBMS is able to perform recovery actions in
order to restore the most recent transaction-consistent state. These actions are executed
outside the control of the MDBS. After an LDBMS completes the recovery actions, the
interface server assumes the control of the recovery processing. While the server is executing
its recovery actions, no local transaction can be submitted to the restarted DBMS. Before
describing the strategy for recovering from local system failures, we need to define the
states of a subsequence in a given server. A subsequence may be in one of four states in a
server. A subsequence is said to be active when no termination operation for the subsequence
has been submitted to the local DBMS by the server. When the server submits a commit
operation, the subsequence enters the to-be-committed state. If the commit operation
submitted by the server has been successfully executed by the local DBMS, the subsequence
enters the locally-committed state. When the subsequence aborts, it enters the locally-aborted
state.
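
The four states can be written down as a small state machine. The sketch below is illustrative only: the state names follow the text, but the set of allowed transitions (in particular, treating both committed and aborted as terminal for a single run of a subsequence) is an assumption.

```python
from enum import Enum


class SubState(Enum):
    ACTIVE = "active"
    TO_BE_COMMITTED = "to-be-committed"
    LOCALLY_COMMITTED = "locally-committed"
    LOCALLY_ABORTED = "locally-aborted"


# Assumed transition table; a resubmission is modelled as a new run
# that starts again in ACTIVE rather than as a transition out of ABORTED.
ALLOWED = {
    SubState.ACTIVE: {SubState.TO_BE_COMMITTED, SubState.LOCALLY_ABORTED},
    SubState.TO_BE_COMMITTED: {SubState.LOCALLY_COMMITTED,
                               SubState.LOCALLY_ABORTED},
    SubState.LOCALLY_COMMITTED: set(),
    SubState.LOCALLY_ABORTED: set(),
}


def transition(current: SubState, target: SubState) -> SubState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

A server could call `transition` before writing each state record, so that an impossible sequence such as committed followed by active is caught early.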
We assume that the GTM can identify when a given server has failed. The protocol for
handling server failures is the following:
1. When the GTM recognizes that a server has failed, it aborts the execution of all active
subsequences which were being executed in the failed server. Log records (with LRT='ST')
for the aborted subsequences are forced to the global log file in order to store the information
that these subsequences have passed from the active to the aborted state. Moreover, the GTM
stops submitting operations to that server. In order to decide what kind of recovery actions
should be performed for to-be-committed subsequences, the GTM must wait until the server
has been restarted, since the GTM must know whether the subsequence was successfully
committed by the local DBMS.
2. After the server is restarted, it should trigger the following recovery procedures:
(a) The server log is sequentially read. For each subsequence which was active immediately
before the occurrence of the failure, the server sends an abort operation to the local DBMS. If
the subsequence was to-be-committed, the server may query the external interface of the local
DBMS in order to know whether or not the subsequence was successfully committed by the
local DBMS. The server then forwards this information to the GTM.
(b) The server log must be updated. For instance, if a particular to-be-committed subsequence
was aborted by the local DBMS before the occurrence of the failure, the server writes a
record in the server log file in order to capture this information.
(c) After the server log is read and updated, the server sends a message to the GTM
informing that it is in operation.
3. When the GRM receives a message from the server reporting that it is operational, the GRM
replays the aborted subsequences. After that, the recovery procedure for server failure is
completed.
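
Step 2 of this protocol (the work done by the restarted server) can be sketched as follows. This is a simplified illustration, not the original design: the server log is reduced to a hypothetical mapping from subsequence ID to its last recorded state, and `was_committed` stands in for the query to the external interface of the local DBMS.

```python
from typing import Callable, Dict


def recover_after_restart(last_states: Dict[str, str],
                          was_committed: Callable[[str], bool]) -> Dict[str, str]:
    """Decide one recovery action per subsequence found in the server log.

    last_states: assumed view of the server log (sub_id -> last state).
    was_committed: hypothetical probe of the local DBMS external interface.
    """
    actions = {}
    for sub_id, state in last_states.items():
        if state == "active":
            # Step (a): active subsequences are aborted at the local DBMS.
            actions[sub_id] = "abort-at-ldbms"
        elif state == "to-be-committed":
            # Step (a): ask the local DBMS for the outcome, report it to the GTM.
            if was_committed(sub_id):
                actions[sub_id] = "report-committed"
            else:
                actions[sub_id] = "report-aborted"
        # Subsequences already committed or aborted need no action here.
    return actions
```

The returned actions would then drive the log updates of step (b) and the final message of step (c).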
Communication Failures

The components of an MDBS are interconnected via
communication links. Typically, communication failures break the communication among
some of the components of an MDBS. According to Figure 2, there may be two types of
communication links in MDBSs. One type of link, which we call Server-LDBS link, connects
servers to local systems. If the interface servers are not integrated with the GTM, that is, each
server resides at a different site from the GTM site, the other type of link connects the GTM
to servers. Such a communication link is denoted GTM-Server link. We propose different
recovery strategies for handling failures in each type of communication link. In order to
enable MDBSs to cope with communication failures, the following requirement must be
satisfied. Each server in an MDBS must know the timeout period of the local DBMS with
which the server is associated. We also assume that each server has its own timeout period
and this timeout period is larger than the timeout period of the respective local DBMS.
Failures in Server-LDBS links

In such failures, the link between a particular local system and a server is broken. The
local system and the server will continue to work correctly. Such a situation can lead the
local system to abort the execution of some subsequences (which are being executed at the
local system) by timeout. For coping with communication failures between a server and a
local system, we propose the following strategy. If the communication link is reestablished
before the timeout period of the local DBMS is reached, no recovery action is necessary.
This is because no subsequence was aborted by timeout. In the case that the communication
link is reestablished after the timeout period of the local DBMS is reached, but before the
timeout period of the server, the following recovery actions should be performed by the
server:
1. The server scans the server log file. During the scan process, the
following recovery actions should be performed.
(a) For each subsequence which was active before the occurrence of the failure, the server
executes recovery actions, since such subsequences were aborted by the LDBMS by timeout.
These recovery actions are the same as those which should be performed for recovering from
subsequence failures required by LDBMSs.
(b) If the subsequence was to-be-committed, the server may query the external interface of
the LDBMS in order to know whether the subsequence was successfully committed. If it was,
the server performs actions to confirm the fact that the subsequence was committed (for
instance, log records with LRT='ST' must be forced to the server log and global log files).
Otherwise, it considers the subsequence as locally aborted and performs actions for
recovering from subsequence failures required by LDBMSs.

If the timeout period of the server is reached before the communication link is
reestablished, the server sends a message to the GTM reporting that it cannot process
subsequences anymore. After that, the GTM aborts the execution of all subsequences which were
being executed in the failed server. Log records for the aborted subsequences are stored in
the global log file with their new state (aborted). The GTM stops submitting operations to
that server. If the communication link is reestablished before the timeout period of the GTM
is reached, recovery actions for recovering from server failures are executed.

If the timeout period of the GTM is reached before the Server-LDBS link is reestablished, the
global log file is sequentially read. For each global transaction which has submitted a
subsequence to the server whose Server-LDBS link is broken, the subsequence's log record with
LRT='ST' is read. If the subsequence is active or
to-be-committed, the GRM aborts the global transaction. In this case, recovery actions for
global transaction recovery should be triggered. Observe that a subsequence which was
submitted to the server with a broken Server-LDBS link and has a to-be-committed state in
the global log may have been committed by the local DBMS. In this case, after the link is
reestablished, the server must be able to query the external interface of the LDBMS to know
whether or not the subsequence was successfully committed. If the subsequence was
committed, a compensating transaction for such a subsequence should be executed.

Failures in GTM-Server links

Of course, such a failure has only to be considered if it is assumed that
the interface servers reside at different sites from the GTM's site. When a failure in the GTM-
Server link occurs, the link between the GTM and a server is broken. In order to enable
MDBSs to cope with failures in GTM-Server links, we propose the strategy described below.
Without loss of generality, consider that the link between the GTM and the server Serverk is
broken. Serverk is associated with local system LDBSk. If the communication link is
reestablished after the timeout period of LDBSk is reached, but before the timeout period of
the server, the following actions are performed: The server log is sequentially read.
1. For each subsequence which was active before the failure, the server executes recovery
actions for subsequence failures required by local DBMSs, since such transactions were
aborted by the local DBMS (timeout).
2. If the subsequence was to-be-committed, the server may query the external interface of the
local DBMS in order to know whether the subsequence was committed. If it was, the
server performs the actions to reflect the fact that the subsequence was locally committed.
Otherwise, it performs actions for recovering from subsequence failures required by local
DBMSs.

If the link is reestablished after the timeout period of Serverk, but before the timeout
period of the GTM is reached, actions for recovering from server failures are started. If the
timeout period of the GTM is reached before the link is reestablished, the GRM reads the
global log in order to identify active global transactions which have submitted a subsequence
to Serverk whose link with the GTM is broken. For each global transaction satisfying this
condition, the GRM verifies the state of the subsequence submitted to Serverk. If the
subsequence was active or to-be-committed, the GRM aborts the global transaction.
Recovery actions for global transaction recovery should be triggered. A subsequence which
has a to-be-committed state in the global log may have been committed by the local DBMS.
In this case, after the communication link is reestablished, the server must be able to query
the external interface of the LDBMS in order to know whether or not the subsequence was
successfully committed. If the subsequence was committed, a compensating transaction for
such a subsequence should be executed.
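
Both link-failure strategies choose a recovery action by comparing the length of the outage against three nested timeout periods. The sketch below illustrates that decision for a Server-LDBS link under stated assumptions: the text requires the server timeout to exceed the local DBMS timeout, while the GTM timeout exceeding the server timeout, the treatment of exact boundary values, and the returned labels are assumptions made here.

```python
def recovery_for_link_outage(outage: float,
                             ldbms_timeout: float,
                             server_timeout: float,
                             gtm_timeout: float) -> str:
    """Map the duration of a Server-LDBS link outage to a recovery action."""
    if not (ldbms_timeout < server_timeout < gtm_timeout):
        raise ValueError("expected ldbms_timeout < server_timeout < gtm_timeout")
    if outage < ldbms_timeout:
        # No subsequence was aborted by timeout.
        return "no-recovery-needed"
    if outage < server_timeout:
        # The server scans its log and recovers individual subsequences.
        return "subsequence-recovery-by-server"
    if outage < gtm_timeout:
        # The server was declared unavailable; use server-failure recovery.
        return "server-failure-recovery"
    # The GTM timed out: affected global transactions are aborted.
    return "abort-global-transactions"
```

For example, with timeouts of 10, 30, and 60 time units, an outage of 20 units falls into the second case, so the server recovers the affected subsequences itself.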
OBJECT ORIENTATION AND INTEROPERABILITY:
Interoperating applications are often developed independently of each other in environments
that may differ in the following dimensions:
• Locations
• Machine architectures
• Operating systems
• Programming languages
• Models of information.
Applications can interoperate along the following dimensions:
• “Horizontal” peer-to-peer sharing of services and information, such as an editor invoking a
spreadsheet processor to embed a spreadsheet in a document.
• “Vertical” cascading through levels of implementation. A student registration service may
use a database service which in turn uses a file manager which uses a device driver.
• “Time-line” through the life cycle of an application. Enterprise modeling may be done in
terms of one set of constructs which are translated into constructs of the application
programming language which are compiled into constructs of the run-time environment. Or, a
graphical language used to capture a user’s conceptual model of a business domain is
translated into a computer-executable simulation language, with the results of the simulation
then being input either to an analysis tool to allow refinement of the simulation, or to a report
generator to produce the final result.
• Others, e.g., the “viewpoints” of the ISO/CCITT Reference Model for Open Distributed
Processing (RM-ODP):
– Enterprise viewpoint
– Information viewpoint
– Computational viewpoint
– Engineering viewpoint
– Technology viewpoint
Interoperation is concerned with such things as:
• Application interconnection:
– Finding services and information in a distributed environment.
– Coping with operational differences between requesters and providers of
services, such as interface/communication protocols, synchronization, exception handling,
work coordination, resource management, etc.
• Information compatibility.
OMA (OBJECT MANAGEMENT ARCHITECTURE):

OMA is an architecture developed by the OMG (Object Management Group) that provides an
industry standard for developing object-oriented applications to run on distributed networks.
The goal of the OMG is to provide a common architectural framework for object-oriented
applications based on widely available interface specifications.

The OMA reference model identifies and characterizes components, interfaces, and protocols
that comprise the OMA. It consists of components that are grouped into application-oriented
interfaces, industry-specific vertical applications, object services, and ORBs (object request
brokers). The ORB defined by the OMG is known more commonly as CORBA (Common
Object Request Broker Architecture).

The Common Object Request Broker Architecture (CORBA) is a specification developed by
the Object Management Group (OMG). CORBA describes a messaging mechanism by which
objects distributed over a network can communicate with each other irrespective of the
platform and language used to develop those objects.
There are two basic types of objects in CORBA. The object that includes some functionality
and may be used by other objects is called a service provider. The object that requires the
services of other objects is called the client. The service provider object and client object
communicate with each other independent of the programming language used to design them
and independent of the operating system in which they run. Each service provider defines an
interface, which describes the services that the provider offers to clients.
CORBA enables separate pieces of software written in different languages and running on
different computers to work with each other like a single application or set of services. More
specifically, CORBA is a mechanism in software for normalizing the method-call semantics
between application objects residing either in the same address space (application) or remote
address space (same host, or remote host on a network).
CORBA applications are composed of objects that combine data and functions that represent
something in the real world. Each object has multiple instances, and each instance is
associated with a particular client request. For example, a bank teller object has multiple
instances, each of which is specific to an individual customer. Each object indicates all the
services it provides, the input essential for each service and the output of a service, if any, in
the form of a file in a language known as the Interface Definition Language (IDL). The client
object that is seeking to access a specific operation on the object uses the IDL file to see the
available services and marshal the arguments appropriately.

Figure 5.8 Object Management Architecture

The CORBA specification dictates that there will be an object request broker (ORB) through which an
application interacts with other objects. In practice, the application simply initializes the ORB, and
accesses an internal object adapter, which maintains things like reference counting, object (and
reference) instantiation policies, and object lifetime policies. The object adapter is used to register
instances of the generated code classes. Generated code classes are the result of compiling the user
IDL code, which translates the high-level interface definition into an OS- and language-specific class
base to be applied by the user application. This step is necessary in order to enforce CORBA
semantics and provide a clean user process for interfacing with the CORBA infrastructure.
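
The request path described above (a client marshals a call, a broker locates the registered object, and the result is marshalled back) can be imitated with a toy broker. The sketch is purely illustrative and does not use any real CORBA API: `ToyBroker`, `BankTeller`, and `call` are hypothetical names, and JSON stands in for the wire format, whereas real ORBs use IDL-generated stubs and the GIOP/IIOP protocols.

```python
import json


class ToyBroker:
    """Minimal stand-in for an ORB plus object adapter (assumed design)."""

    def __init__(self):
        self._servants = {}

    def register(self, object_name, servant):
        # Plays the role of registering an instance with the object adapter.
        self._servants[object_name] = servant

    def dispatch(self, request_bytes):
        # Unmarshal the request, invoke the servant, marshal the reply.
        request = json.loads(request_bytes.decode("utf-8"))
        servant = self._servants[request["object"]]
        method = getattr(servant, request["operation"])
        result = method(*request["args"])
        return json.dumps({"result": result}).encode("utf-8")


class BankTeller:
    """Example service provider, echoing the bank teller example in the text."""

    def __init__(self, balances):
        self._balances = dict(balances)

    def balance(self, customer):
        return self._balances[customer]


def call(broker, object_name, operation, *args):
    """Client-side stub: marshal the arguments and unmarshal the result."""
    request = json.dumps(
        {"object": object_name, "operation": operation, "args": list(args)}
    ).encode("utf-8")
    reply = broker.dispatch(request)
    return json.loads(reply.decode("utf-8"))["result"]
```

Because the client only ever sees bytes going in and out of `dispatch`, the servant could in principle be written in another language or run on another host, which is the property the CORBA description emphasizes.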

DISTRIBUTED COMPONENT MODEL:

DCOM is a programming construct that allows a computer to run programs over the network
on a different computer as if the program was running locally. DCOM is an acronym that
stands for Distributed Component Object Model. DCOM is a proprietary Microsoft software
component that allows COM objects to communicate with each other over the network.
An extension of COM, DCOM solves a few inherent problems with the COM model so that it works
better over a network:
Marshalling: Marshalling solves a need to pass data from one COM object instance to
another on a different computer; in programming terms, this is called “passing arguments.”
For example, if I wanted Zaphod’s last name, I would call the COM Object LastName with
the argument of Zaphod. The LastName function would use a Remote Procedure Call (RPC)
to ask the other COM object on the target server for the return value for LastName(Zaphod),
and then it would send the answer, Beeblebrox, back to the first COM object.
Distributed Garbage Collection: Designed to scale DCOM in order to support high volume
internet traffic, Distributed Garbage Collection also provides a way to destroy and reclaim
completed or abandoned DCOM objects to avoid exhausting the memory on webservers. In
turn, it communicates with the other servers in the transaction chain to let them know they
can get rid of the related objects.
DCE/RPC as the underlying RPC mechanism: To achieve the previous items and to attempt to
scale to support high volume web traffic, Microsoft implemented DCE/RPC as the underlying
technology for DCOM, which is where the D in DCOM came from.
How Does DCOM Work?
In order for DCOM to work, the COM object needs to be configured correctly on both
computers – in our experience they rarely were, and you had to uninstall and reinstall the
objects several times to get them to work.
The Windows Registry contains the DCOM configuration data in 3 identifiers:
CLSID – The Class Identifier (CLSID) is a Global Unique Identifier (GUID). Windows
stores a CLSID for each installed class in a program. When you need to run a class, you need
the correct CLSID, so Windows knows where to go and find the program.
PROGID – The Programmatic Identifier (PROGID) is an optional identifier a programmer
can substitute for the more complicated and strict CLSID. PROGIDs are usually easier to
read and understand. A basic PROGID for our previous example could be
[Link]. There are no restrictions on how many PROGIDs can have the same
name, which causes issues on occasion.
APPID – The Application Identifier (APPID) identifies all of the classes that are part of the
same executable and the permissions required to access it. DCOM cannot work if the APPID
isn’t correct. You will probably get permissions errors trying to create the remote object, in
my experience.
A basic DCOM transaction looks like this:
The client computer requests the remote computer to create an object by its CLSID or
PROGID. If the client passes a PROGID, the remote computer uses it to look up the CLSID.
The remote machine checks the APPID and verifies the client has permissions to create the
object.
[Link] (if an exe) or [Link] (if a dll) will create an instance of the class
the client computer requested.
Communication is successful!
The Client can now access all functions in the class on the remote computer.
If the APPID isn’t configured correctly, or the client doesn’t have the correct permissions, or
the CLSID is pointing to an old version of the exe or any other number of issues, you will
likely get the dreaded “Can’t Create Object” message.
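
The activation steps above can be sketched as a lookup against a toy registry. This is only an illustration of the described flow and does not call any Windows or COM API: the registry layout, the identifier strings, and the `create_remote_object` function are all hypothetical.

```python
class ActivationError(Exception):
    """Raised when the toy activation flow cannot create the object."""


def create_remote_object(registry, user, clsid=None, progid=None):
    """Resolve PROGID to CLSID, check APPID permissions, then instantiate."""
    if clsid is None:
        # A PROGID is only a friendlier alias that maps to a CLSID.
        clsid = registry["progids"].get(progid)
        if clsid is None:
            raise ActivationError("unknown PROGID")
    entry = registry["classes"].get(clsid)
    if entry is None:
        raise ActivationError("unknown CLSID")
    app = registry["appids"].get(entry["appid"])
    if app is None or user not in app["allowed_users"]:
        # Mirrors the permission failure described in the text.
        raise ActivationError("Can't Create Object")
    return entry["factory"]()


# Hypothetical registry contents, for illustration only.
EXAMPLE_REGISTRY = {
    "progids": {"Example.LastName": "{CLSID-EXAMPLE}"},
    "classes": {"{CLSID-EXAMPLE}": {"appid": "{APPID-EXAMPLE}",
                                    "factory": dict}},
    "appids": {"{APPID-EXAMPLE}": {"allowed_users": {"arthur"}}},
}
```

With this registry, `create_remote_object(EXAMPLE_REGISTRY, "arthur", progid="Example.LastName")` returns a new object, while the same call for a user who is not listed under the APPID raises `ActivationError`.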
DCOM vs. CORBA
Common Object Request Broker Architecture (CORBA) is a language-neutral specification from
the OMG that serves much the same purpose as DCOM. Unlike DCOM, CORBA isn’t tied to any
particular Operating System (OS), and implementations run on UNIX, Linux, Solaris, OS X, and
other platforms.
Neither proved secure or scalable enough to become a standard for high volume web traffic.
DCOM and CORBA didn’t play well with firewalls, so HTTP became the default standard
protocol for the internet.

Mobile Marketing : Meaning, Types, Importance and Examples

Last Updated : 06 Aug, 2025

What is Mobile Marketing?

Mobile Marketing is a dynamic marketing strategy that capitalizes on mobile channels,
including SMS, MMS messaging, smartphones, tablets, and mobile apps, to effectively
promote products or services to a targeted consumer audience. This approach's primary
objective is to consistently engage with consumers on their handheld devices, offering a
personalized and precisely targeted marketing experience. Key components integral to the
success of mobile marketing initiatives encompass diverse elements such as mobile ads,
SMS and MMS messaging, mobile apps, and location-based marketing.

Geeky Takeaways:

Mobile Marketing is a dynamic strategy that utilizes channels like MMS, smartphones,
tablets, SMS, and apps to promote products or services.
The primary goal is to engage users consistently on their mobile devices, providing
personalized and targeted marketing experiences.
Responsive mobile websites, interactive mobile apps, social commerce, and direct SMS
marketing are key components.
Examples include IKEA's interactive initiatives, Burger King's mobile outreach, and
Swiggy's mobile engagement.

Table of Content
How does Mobile Marketing Work?
Types of Mobile Marketing
Why is Mobile Marketing important?
Advantages of Mobile Marketing
Disadvantages of Mobile Marketing
How to start a Mobile Marketing Business?
Examples of Mobile Marketing
Differentiate Mobile Marketing from Traditional Marketing
Mobile Marketing Strategy
Free & Paid Mobile Marketing Tools
How much does Mobile Marketing Cost?
Mobile Marketing - FAQs

How does Mobile Marketing Work?

Mobile Marketing campaigns encompass a range of sophisticated elements aimed at
engaging consumers through diverse channels. The intricacy lies in the variety of methods
employed to capture user attention and encourage interaction. Crucially, understanding the
nuances of mobile marketing requires acknowledging a shift from demographic-centric
approaches to a focus on consumer behavior. While demographic factors still play a role,
the central emphasis is on understanding and responding to users' actions. This
underscores the dynamic and behavior-driven nature of successful mobile marketing
strategies, where tailored and responsive content takes precedence over generalized
demographic targeting.

Types of Mobile Marketing

1. Responsive Mobile Websites: Mobile websites are meticulously designed to be
responsive, adapting seamlessly to smaller screens of smartphones and tablets. This
optimization ensures a user-friendly experience by adjusting layouts and content, making
navigation and information access convenient for users on various devices.

2. Interactive Mobile Apps: Mobile applications are purpose-built software programs
tailored for mobile devices. They elevate user experiences through interactive features such
as in-app purchases, push notifications, and loyalty programs. Businesses leverage mobile
apps to engage with their audience in personalized and dynamic ways.

3. Social Commerce: Social media platforms seamlessly integrate e-commerce features,
allowing businesses to showcase and sell products directly to users. This mobile marketing
approach leverages the popularity of social media channels to enhance sales and create
interactive customer engagements.

4. Direct SMS Marketing: SMS marketing delivers promotional messages directly to
customers' mobile phones, offering a direct and effective communication channel for timely
offers, updates, and reminders. This method ensures immediate visibility and engagement.

5. AI-Driven Chatbots: Chatbots simulate conversations with users through messaging
platforms. They enhance the user experience on mobile devices by answering queries,
providing customer support, and facilitating purchases, contributing to streamlined
interactions.

6. Augmented Reality: AR technology enriches mobile experiences by overlaying digital
content onto the real world through a device's camera. Marketers leverage AR to create
interactive engagements, such as virtual product trials or immersive environments,
captivating users in unique ways.

7. Location-Based Marketing: Targeting consumers based on their geographic location,
location-based marketing utilizes GPS data or beacon technology. Businesses deliver
personalized offers, promotions, or notifications to users when they are in proximity to
specific locations, enhancing the relevance of marketing efforts.

8. Optimized Social Media Marketing: Social media platforms offer potent channels for
businesses to connect, share content, run ads, and build relationships. Mobile social media
marketing specifically tailors content for mobile viewing and engagement, optimizing for
features like live videos or stories to effectively interact with followers.

9. In-Game Advertising: Placing ads within mobile games engages a captive audience of
gamers. These ads, whether banners, videos, or sponsored content, integrate seamlessly
into the gaming experience, providing marketers with unique opportunities to connect with
users.

10. Engaging Mobile Video Ads: Leveraging the high engagement of video content on
mobile devices, mobile video ads are strategically placed on social media platforms,
websites, or apps. This approach promotes products or services in a visually compelling way.

11. Mobile Wallet Marketing: Mobile wallet marketing harnesses digital wallets like
Google Pay or Apple Pay to deliver loyalty cards, coupons, promotions, or payment options
directly to users' smartphones. This method offers businesses a convenient way to engage
with customers during transactions.

12. Bluetooth Proximity Marketing: Utilizing Bluetooth technology, proximity marketing
sends targeted messages or promotions to users when they are near Bluetooth-enabled
devices or beacons. This approach delivers relevant content based on physical location.

13. Voice Search Optimization: With the surge in voice assistants, optimizing content for
voice search is imperative. Voice search optimization involves tailoring keywords and
content to match natural language queries made through voice commands, enhancing
visibility in voice-enabled searches.

14. In-App Advertising: In-app advertising involves strategically placing ads within mobile
applications, ranging from banners to videos or native ads. This method capitalizes on
users' active engagement with the app, providing marketers with valuable opportunities to
connect with their target audience.

Why is Mobile Marketing important?

1. Expanded Reach: Mobile Marketing serves as a potent tool for businesses to establish
connections with consumers through their devices, ensuring a broad reach. This approach
facilitates engagement with customers regardless of their location, enabling businesses to
establish a pervasive presence in the lives of their target audience.

2. Precision in Targeting: The specificity inherent in mobile marketing empowers
businesses to direct their campaigns with precision toward the intended consumer base. By
honing in on mobile devices, marketers can fine-tune their strategies to resonate with the
right audience at opportune moments, thereby maximizing the impact and effectiveness of
their marketing endeavors.

3. Enhanced User Experience: Crafting well-designed mobile marketing campaigns goes
beyond mere advertising as it transforms user experience by incorporating interactive
features, personalized content, and seamless navigation. This captivates the audience and
supports heightened brand engagement and customer satisfaction, creating a positive and
memorable interaction between the business and its clientele.

4. Geographically Targeted Outreach: Mobile Marketing brings forth the capability to tailor
campaigns based on users' specific locations, embracing the power of location-based
targeting. Businesses can curate customized strategies that align with users' geographic
locations and behaviors, thereby adding a layer of relevance and context to their marketing
initiatives.

5. Agile Adaptability: Mobile Marketing, driven by real-time data and user feedback, offers
businesses the flexibility to swiftly adapt and optimize their strategies. This responsiveness
ensures that businesses remain agile and able to navigate and respond effectively to the
ever-evolving landscape of the digital sphere.

Advantages of Mobile Marketing

1. Expanding Audience Reach: Mobile Marketing serves as a dynamic channel for
businesses to establish connections with a broader audience, leveraging the widespread
use of mobile devices in people's daily lives. This expansive reach ensures that businesses
can engage with a diverse consumer base, fostering increased visibility and brand
presence.

2. Lead Generation and Customer Expansion: Strategically employing mobile marketing
opens avenues for businesses to generate a higher volume of leads, facilitating the
expansion of their customer base. Through targeted mobile campaigns, firms can attract
potential customers, nurturing them into valuable leads for sustained growth.

3. Sales Amplification: Effective mobile marketing campaigns contribute to heightened
sales by providing customers with a seamless, personalized shopping experience. This
convenience, coupled with strategic marketing efforts, translates into increased sales
figures, making mobile marketing a pivotal driver of business revenue.

4. Cost-Effective Campaigns: Compared to traditional marketing methods, mobile
marketing offers a cost-effective approach for businesses to reach a larger and more
targeted audience. This efficiency enables businesses to optimize their marketing budget,
allocating resources strategically for maximum impact.

5. Personalization for Enhanced Engagement: Mobile Marketing empowers businesses to
infuse personalization into their campaigns, tailoring content based on user preferences,
behaviors, and locations. This personalized approach enhances user engagement, making
interactions more meaningful and relevant to individual consumers.

Disadvantages of Mobile Marketing

1. Privacy and User Data Handling: The utilization of user data in mobile marketing
initiatives often sparks privacy concerns, prompting a need for transparent practices and
robust data protection measures. Businesses must navigate this landscape carefully,
ensuring responsible data usage, to build and maintain consumer trust.

2. Addressing Spam Challenges: The prevalence of spam messages and notifications in
mobile marketing necessitates a delicate balance to avoid being perceived as intrusive.
Mitigating spam-related issues is crucial for maintaining positive user experiences and
fostering a receptive audience for mobile campaigns.

3. Navigating Creative Constraints: Crafting compelling and visually appealing content for
mobile devices presents a unique set of challenges due to constraints like screen size and
design limitations. Successfully navigating these restrictions requires innovative approaches
to captivate users effectively within the mobile interface.

4. Budgeting for App Development: For businesses, especially smaller ones with limited
budgets, the high cost associated with designing and maintaining mobile applications can
pose financial challenges. Strategic budgeting and consideration of cost-effective
alternatives become essential to overcome this obstacle.

How to start a Mobile Marketing Business?

1. Understanding your Audience: Begin your mobile marketing journey by clearly defining
your target audience. Delve into their demographics, behaviors, and preferences to
establish a comprehensive understanding of your ideal customers. This foundational step
lays the groundwork for tailoring your mobile marketing strategies to effectively meet their
specific needs and expectations.

2. Strategically Choosing Mobile Approaches: Select a mobile marketing strategy that
aligns seamlessly with your business objectives and resonates with your identified target
audience. Choices range from SMS marketing and mobile applications to in-app advertising
and location-based marketing.

3. Building an Opt-In Database for SMS Campaigns: For effective SMS marketing
campaigns, prioritize the creation of an opt-in database. Encourage users to willingly
subscribe to your text messages, fostering a relationship built on consent. This approach
complies with regulatory requirements and ensures that your messages reach an audience
genuinely interested in your offerings.

4. Ensuring Mobile-Friendly Website: Optimize your website for mobile users by
embracing responsive design, streamlined navigation, and rapid loading times. A mobile-
friendly website enhances the user experience, facilitating easy access to information and
services.

5. Exploring Native Ads for Seamless Integration: Consider investing in native ads, a form
of advertising that seamlessly integrates with the platform's form and function. Native ads
provide a less disruptive and more engaging user experience. This approach fosters a
natural flow within the platform, increasing the likelihood of capturing the audience's
attention and interest.

6. Utilizing QR Codes for Engagement: Implement QR codes strategically within your
marketing campaigns to encourage user interaction. QR codes serve as gateways for users
to engage with your content and gather more information about your business. This
interactive element adds depth to your campaigns, promoting user involvement and
curiosity.

7. Monitoring and Adapting with Analytics: Track the performance of your mobile
marketing endeavors using robust analytics tools. Regularly monitor and analyze the
results to gain insights into campaign effectiveness. Use the acquired data to make
informed adjustments and refinements to your strategies, ensuring continuous improvement
and optimal outcomes.

Examples of Mobile Marketing

1. IKEA's Interactive Mobile Initiatives: IKEA has successfully utilized mobile marketing to
pioneer interactive experiences for customers. Introducing augmented reality apps, IKEA
enables users to visualize furniture in their homes before making a purchase. This
innovative approach enhances the customer journey and exemplifies how mobile marketing
can redefine and elevate the retail experience.

2. Burger King's Mobile Outreach: Burger King employs mobile marketing strategies to
effectively reach consumers on their mobile devices, driving both awareness and sales for
their diverse range of products. Through innovative approaches, Burger King leverages the
ubiquity of mobile devices to engage with their audience, ensuring their presence in the
dynamic digital landscape.

3. Swiggy's Mobile Engagement: Swiggy, a prominent food delivery platform, has adeptly
executed impactful mobile marketing campaigns aimed at engaging customers and
amplifying awareness of their services. Leveraging mobile channels, Swiggy strategically
connects with users, employing tailored strategies to promote their platform and foster
customer loyalty.

Differentiate Mobile Marketing from Traditional Marketing

Basis | Mobile Marketing | Traditional Marketing
Interactivity | Allows for two-way communication, enabling direct interaction between customers and businesses through their mobile devices. | Typically one-way communication with limited interaction.
Targeting | Involves creating campaigns specifically targeting consumers on mobile devices, leading to a more targeted audience and potentially higher conversion rates. | May have a broader reach but may lack specific targeting.
Cost | More affordable, as it enables businesses to reach a targeted audience with lower costs. | Can be expensive, notably for firms with limited budgets.
Personalization | Allows for personalization based on user preferences, behaviors, and locations, providing more engaging and relevant experiences. | Relies on mass communication with limited personalization.
Real-time Data | Provides instant access to real-time data, empowering businesses to optimize campaigns based on immediate user feedback and behavior. | May not offer the same level of real-time data and optimization.
Technical Challenges | More flexible and adaptable due to the wide range of mobile devices and platforms available, reducing potential technical challenges in campaign execution. | May face technical challenges in designing and implementing complex campaigns.

Mobile Marketing Strategy

A mobile marketing strategy refers to a tailored plan or methodology delineating how a
business intends to leverage diverse mobile marketing channels and tactics to attain its
marketing objectives. This comprehensive strategy encompasses pivotal decisions on the
selection of mobile channels deemed most effective, the optimization of content to suit
mobile platforms, and the formulation of engaging approaches to connect with the target
audience through mobile devices. It serves as a roadmap for businesses, guiding them in
navigating the dynamic landscape of mobile marketing to maximize reach, engagement,
and overall effectiveness in achieving their marketing aspirations.

Free & Paid Mobile Marketing Tools

I. Free Mobile Marketing Tools

1. Google Analytics 4: Google Analytics 4 emerges as a robust tool offering profound
insights into app performance, user behavior, and acquisition. By tracking crucial metrics like
user engagement, retention rates, app installs, and conversion events, businesses can
scrutinize user interactions within mobile apps. This analysis empowers them to optimize
marketing strategies and enhance overall user experience.

2. Flurry Analytics: Tailored for mobile apps, Flurry Analytics serves as a free analytics
tool providing valuable insights into user engagement, retention, and in-app behavior. Key
metrics such as session lengths, active users, and conversion rates are tracked, enabling
businesses to comprehend user-app interactions. With data-driven decisions, businesses
can improve app performance and foster heightened user engagement.

3. App Annie: A comprehensive platform, App Annie, delves into app market data,
competitor insights, and industry trends. Businesses gain access to critical information,
including app downloads, revenue estimates, user demographics, and market share. Armed
with this knowledge, businesses can strategically position themselves, identify growth
opportunities, and optimize their app marketing strategies.

4. Branch: Branch, a deep linking platform, facilitates seamless user experiences across
diverse devices and platforms. Through deep linking, users navigate directly to specific
content within an app or website, elevating user engagement and retention. Additionally,
Branch offers attribution analytics, allowing businesses to track campaign effectiveness and
refine user acquisition strategies.

5. Usability Hub: Usability Hub, a user testing platform, empowers businesses to gather
feedback on app usability and design. Employing remote usability tests, surveys, and
preference tests, businesses gain insights into user interactions with their mobile apps. This
feedback proves invaluable in identifying usability issues, enhancing the user experience,
and making informed design decisions.

II. Paid Mobile Marketing Tools

1. AppsFlyer: As a prominent mobile attribution and marketing analytics platform,
AppsFlyer aids businesses in gauging the effectiveness of mobile marketing campaigns.
Real-time data on app installs, in-app events, revenue attribution, and ROI analysis
enables businesses to optimize user acquisition strategies, monitor campaign performance
across channels, and maximize marketing ROI.

2. CleverTap: CleverTap, a customer engagement platform, offers personalized messaging,
behavioral analytics, and campaign optimization tools for mobile marketing. By tailoring
campaigns based on user behavior, preferences, and lifecycle stage, businesses enhance
customer engagement, boost retention rates, and increase conversions through
personalized communication.

3. Attentive: Specializing in personalized SMS marketing for e-commerce businesses,
Attentive is a mobile messaging platform that enables targeted messaging based on
customer preferences, purchase history, and browsing behavior. By leveraging this
platform, businesses can elevate customer engagement, drive sales through mobile
messaging campaigns, and cultivate enduring relationships with their audience.

How much does Mobile Marketing Cost?

Mobile Marketing expenses can vary based on the chosen strategies and tools. However,
these costs can be customized to your budget. Effectively allocate resources and monitor
campaign performance for a positive return on investment. Here are the average costs for
various mobile marketing components:

Mobile Marketing Application Fees: Licensing a mobile marketing application can
range from a few hundred to several thousand dollars, depending on the required
functionalities.
App Market Research: Researching the app market typically costs between $5,000 and
$15,000 on average.
Social Media Advertising: Advertising an app on social media may range from $5,000
to over $50,000 monthly, determined by your budget and campaign resources.
Paid Campaigns: For simplicity, assume a budget of $1,000 per month split between
four campaigns, which comes to $250 per month per paid campaign, run over three months.
Influencer Marketing: Influencer Marketing costs vary widely, from a few hundred
dollars to tens of thousands, contingent on the influencer's reach and campaign scope.


Distributed DBMS Architecture

1. Distributed databases allow data to be stored across multiple networked computers in a unified manner so that transactions can be processed in a distributed way. 2. Key advantages includ…
Uploaded by Mahboob

Distributed DBMS Architecture

Reference book
Database Systems: A Practical Approach to Design, Implementation and Management,
by Thomas M. Connolly and Carolyn E. Begg. Ch 22

Principles of Distributed Database Systems

by M. Tamer Özsu and Patrick Valduriez (Ch. 1, pages 21-35),
and internet resources.

Distributed Database Concepts

A distributed database system processes a unit of execution (a transaction) in
a distributed manner. That is, a transaction can be
executed by multiple networked computers in a unified
manner.
It can be defined as follows:
A distributed database (DDB) is a collection of multiple
logically related databases distributed over a computer
network, and a distributed database management
system (DDBMS) is a software system that manages a distributed
database while making the distribution transparent to
the user.

Shared nothing architecture

Centralized database

Distributed database

Distributed Database System

Advantages
1. Management of distributed data with different levels
of transparency: The physical placement of data (files,
relations, etc.) is hidden from the user
(distribution transparency).
[Figure: five sites (Site 1 to Site 5) connected through a communications network]

Distributed Database System

Advantages
The EMPLOYEE, PROJECT, and WORKS_ON tables may be fragmented
horizontally and stored with possible replication as shown below.
[Figure: five sites connected through a communications network, each holding the following data]
• Chicago (headquarters): EMPLOYEES - All; PROJECTS - All; WORKS_ON - All
• New York: EMPLOYEES - New York; PROJECTS - All; WORKS_ON - New York Employees
• San Francisco: EMPLOYEES - San Francisco and LA; PROJECTS - San Francisco; WORKS_ON - San Francisco Employees
• Los Angeles: EMPLOYEES - LA; PROJECTS - LA and San Francisco; WORKS_ON - LA Employees
• Atlanta: EMPLOYEES - Atlanta; PROJECTS - Atlanta; WORKS_ON - Atlanta Employees

Distributed Database System

Advantages
• Distribution and network transparency: Users do not have to
worry about operational details of the network. Location
transparency refers to the freedom to issue a command from any
location without affecting how it works. Naming transparency
allows access to any named object (files, relations, etc.) from
any location.
• Replication transparency: Copies of data may be stored at
multiple sites, as shown in the above diagram, to minimize
access time to the required data.
• Fragmentation transparency: A relation may be fragmented
horizontally (creating a subset of its tuples) or vertically
(creating a subset of its columns).
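
The two kinds of fragmentation can be sketched in a few lines of Python. This is only an illustration: the EMPLOYEE rows, the column names, and the helper functions are hypothetical and do not come from the slides. It assumes each vertical fragment keeps the key column so that the original rows can be rebuilt.

```python
# Hypothetical EMPLOYEE relation; the column names and rows are illustrative only.
EMPLOYEE = [
    {"emp_id": 1, "name": "Asha", "city": "New York", "salary": 5000},
    {"emp_id": 2, "name": "Ben", "city": "Atlanta", "salary": 4200},
    {"emp_id": 3, "name": "Chen", "city": "New York", "salary": 6100},
]


def horizontal_fragment(relation, predicate):
    """Return the subset of tuples (rows) that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, columns, key="emp_id"):
    """Return a subset of columns; the key is always kept so rows can be rejoined."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in relation]


def rejoin(left, right, key="emp_id"):
    """Rebuild full rows from two vertical fragments by joining on the key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


# Horizontal fragments: one per city (e.g. stored at the matching site).
new_york = horizontal_fragment(EMPLOYEE, lambda r: r["city"] == "New York")
atlanta = horizontal_fragment(EMPLOYEE, lambda r: r["city"] == "Atlanta")

# Vertical fragments: general columns versus payroll columns.
public = vertical_fragment(EMPLOYEE, ["name", "city"])
payroll = vertical_fragment(EMPLOYEE, ["salary"])
```

Under fragmentation transparency the user would query EMPLOYEE as a whole; the union of the horizontal fragments, or the join of the vertical ones, is the system's job.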

Distributed Database System

Advantages
2. Increased reliability and availability: Reliability refers to system
uptime, that is, the system is running efficiently most of the time.
Availability is the probability that the system is continuously
available (usable or accessible) during a time interval. A distributed
database system has multiple nodes (computers), and if one fails
then others are available to do the job.
3. Improved performance: A distributed DBMS fragments the
database to keep data closer to where it is needed most. This
reduces data management (access and modification) time
significantly.
4. Easier expansion (scalability): New nodes (computers) can be
added at any time without changing the entire configuration.
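
As a rough illustration of why multiple nodes raise availability, the sketch below computes the chance that at least one copy of the data is reachable. It rests on two assumptions that are not in the slides: node failures are independent, and any single surviving replica can serve the request. The function name is hypothetical.

```python
def availability_with_replicas(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent nodes is up.

    Assumes independent failures and that one surviving replica is enough,
    which is a simplification of a real distributed DBMS.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All replicas are down with probability (1 - a) ** n.
    return 1.0 - (1.0 - node_availability) ** replicas
```

For example, with each node available 90% of the time, three replicas give roughly 1 - 0.1³ = 0.999 under these assumptions.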

Types of Distributed Database Systems

Federated Database Management Systems Issues

• Differences in data models: relational, object-oriented,
hierarchical, network, etc.
• Differences in constraints: Each site may have its
own data access and processing constraints.
• Differences in query language: Some sites may use SQL,
some may use SQL-89, some may use SQL-92, and so on.

Types of Distributed Database System

A DDBMS is one of two types:
• Homogeneous
• Heterogeneous

Homogeneous Distributed Database Systems

• In a homogeneous distributed database, all the sites
use identical DBMS and operating systems. Its properties are:
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other
sites to process user requests.
• The database is accessed through a single interface as if it is a
single database.

Example
• An example of a homogeneous database system is an
enterprise's nation-wide ERP system that
comprises distributed databases, all of
which are Oracle.

Homogeneous Database

Same software

Types of Homogeneous Distributed Database
• There are two types of homogeneous distributed
database:
• Autonomous − Autonomous distributed databases are independent
databases (separate data residing in each database) that function
independently but are integrated by the controlling application
software.
• Non-autonomous − Non-autonomous distributed databases are
homogeneous databases where data is distributed
across homogeneous nodes and is controlled by the DBMS at each
node.

Example
• An example of an autonomous distributed
database system is a set of Oracle-based data marts
that manage data pertaining to sales,
distribution and inventory. An example of a
non-autonomous distributed database system is an
Oracle-based global sales database that is
partitioned across multiple databases.

Advantages of Homogeneous Distributed Database

• Easy to use
• Easy to manage
• Easy to design

Disadvantages of Homogeneous Distributed Database

• Difficult for most organizations to force a
homogeneous environment

Heterogeneous Distributed Database Systems

In this type of database, different data centers may run different DBMS products, with
possibly different underlying data models.

This occurs when sites have implemented their own databases and integration is considered
later.

Translations are required to allow for:

o Different hardware.
o Different DBMS products.
o Different hardware and different DBMS products.

Heterogeneous Distributed database

Sql oracle

Heterogeneous DDBMS

• In a heterogeneous distributed database, different
sites may use different schema and software.
• In heterogeneous systems, different nodes may
have different hardware & software and data
structures at various nodes or locations are also
incompatible.
• Different computers and operating systems,
database applications or data models may be
used at each of the locations.

Heterogeneous DDBMS (contd..)

• In a heterogeneous system, translations are
required to allow communication between
different sites (or DBMSs).
• A heterogeneous system is often not
technically or economically feasible. In such a
system, a user at one location may be able to
read but not update the data at another
location.

Types of Heterogeneous Distributed Databases
• Federated − The heterogeneous database
systems are independent in nature and
integrated together so that they function as a
single database system.
• Un-federated − Unfederated database systems
are a collection of homogeneous database
systems that are generally non-autonomous
by nature and employ centralized control.

Advantages of Heterogeneous Distributed Database

Large volumes of data from different data centers can be stored in one
global center.

Remote access is done using the global schema.

Different DBMSs may be used at each node.

Disadvantages of Heterogeneous Distributed Database

Difficult to manage.

Difficult to design.

Distributed DBMS Architecture

• The architecture of a system defines its structure.
• This means that the components of the system
are identified, the function of each component is
specified, and the interrelationships and
interactions among these components are
defined.
• DDBMS architectures are generally developed
depending on three parameters: distribution, autonomy, and heterogeneity.

Architectural Models for Distributed DBMSs
• Distribution − It states the physical
distribution of data across the different sites.
• Autonomy − It indicates the distribution of
control of the database system and the degree
to which each constituent DBMS can operate
independently.
• Heterogeneity − It refers to the uniformity or
dissimilarity of the data models, system
components and databases.

ANSI/SPARC Architecture
• In late 1972, the Computer and Information
Processing Committee (X3) of the American
National Standards Institute (ANSI) established a
Study Group on Database Management Systems
under the auspices of its Standards Planning and
Requirements Committee (SPARC).
• The mission of the study group was to study the
feasibility of setting up standards in this area, as
well as determining which aspects should be
standardized if it was feasible.

ANSI/SPARC

Architectural Models

• Some of the common architectural models are –

• Client - Server Architecture for DDBMS

• Peer - to - Peer Architecture for DDBMS
• Multi - DBMS Architecture

Client - Server Architecture for DDBMS

• This is a two-level architecture where the
functionality is divided into servers and clients.
• The server functions primarily encompass data
management, query processing, optimization and
transaction management.
• Client functions mainly cover the user interface.
However, clients also perform some functions such as
consistency checking and transaction
management.

Types of client/server architecture

• 1) Multiple client, single server
There is only one server, which is accessed by multiple
clients.
• 2) Multiple client, multiple server
Two alternative management strategies are possible: either
each client manages its own
connection to the appropriate server, or each client knows of
only its home server, which then
communicates with other servers as required.
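
The two management strategies for the multiple-server case can be contrasted with a small sketch. All class names and the table-to-server catalog here are hypothetical, and real systems would use network connections rather than in-process objects; the point is only who decides where a request goes.

```python
class Server:
    """A toy server holding some tables locally (illustrative only)."""

    def __init__(self, name, tables):
        self.name = name
        self.tables = tables  # table name -> rows stored on this server
        self.peers = []       # other servers this server can forward to

    def query(self, table):
        """Answer locally if possible, otherwise forward to a peer holding the table."""
        if table in self.tables:
            return self.name, self.tables[table]
        for peer in self.peers:
            if table in peer.tables:
                return peer.name, peer.tables[table]
        raise KeyError(table)


class DirectClient:
    """Strategy 1: the client keeps its own catalog and contacts the right server."""

    def __init__(self, servers):
        self.catalog = {t: s for s in servers for t in s.tables}

    def query(self, table):
        server = self.catalog[table]
        return server.name, server.tables[table]


class HomeServerClient:
    """Strategy 2: the client knows only its home server, which forwards as needed."""

    def __init__(self, home):
        self.home = home

    def query(self, table):
        return self.home.query(table)
```

In the first strategy the client carries the knowledge of data placement; in the second that knowledge stays on the server side, which keeps clients simpler at the cost of an extra hop.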

Peer-to-Peer Architecture for DDBMS

• In these systems, each peer acts both as a
client and a server for imparting database
services.
• The peers share their resources with other
peers and coordinate their activities.

This architecture generally has four levels of schemas −
• Global Conceptual Schema − Depicts the
global logical view of data.
• Local Conceptual Schema − Depicts logical
data organization at each site.
• Local Internal Schema − Depicts physical data
organization at each site.
• External Schema − Depicts user view of data

Multi - DBMS
• A multi-database refers to multiple databases
where each database has full autonomy. It can be
seen as a collection of autonomous databases
(similar to federated databases).
• In a relational DB context there is a separate
schema for each database.
• Database integration relates
data from multiple databases.
• For example, there could be a manufacturing
database that records products and a separate
sales database that records sales. The two
databases can make up a multi-database system.


• Multi-database systems usually reflect a
situation where, for historical reasons, the
data that an organization needs to operate is
held in multiple different databases in
different locations, and possibly from different
vendors.


Design Alternatives

• The distribution design alternatives for the
tables in a DDBMS are as follows −
• Non-replicated and non-fragmented
• Fully replicated
• Partially replicated
• Fragmented
• Mixed


Non-replicated & Non-fragmented

• In this design alternative, different tables are
placed at different sites.
• Data is placed so that it is in close proximity to
the site where it is used most.
• It is most suitable for database systems where
the percentage of queries that need to join
information in tables placed at different sites is
low.
• If an appropriate distribution strategy is adopted,
then this design alternative helps to reduce the
communication cost during data processing.


Fully Replicated

• In this design alternative, one copy
of all the database tables is stored at each site.
• Since each site has its own copy of the entire
database, queries are very fast and require
negligible communication cost.
• On the other hand, the massive redundancy in data
makes update operations very costly.
• Hence, this is suitable for systems where a large
number of queries must be handled
whereas the number of database updates is low.
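
The read/update trade-off of full replication can be made concrete with a small message-count sketch. The cost model below is an assumption chosen only for illustration (a read costs one remote message when the site has no local copy, and an update costs one message per remote copy); real systems have more complex costs.

```python
def remote_messages(reads, updates, copies, local_copy):
    """Count remote messages under a deliberately simple, assumed model:
    - a read costs 0 messages if the site holds a copy, otherwise 1;
    - an update costs 1 message for every copy held at another site.
    """
    read_cost = 0 if local_copy else reads
    remote_copies = copies - 1 if local_copy else copies
    return read_cost + updates * remote_copies


# Read-heavy workload: full replication over 5 sites vs. a single remote copy.
read_heavy_full = remote_messages(reads=1000, updates=10, copies=5, local_copy=True)
read_heavy_single = remote_messages(reads=1000, updates=10, copies=1, local_copy=False)

# Update-heavy workload: the same two placements.
write_heavy_full = remote_messages(reads=10, updates=1000, copies=5, local_copy=True)
write_heavy_single = remote_messages(reads=10, updates=1000, copies=1, local_copy=False)

print(read_heavy_full, read_heavy_single)    # 40 1010
print(write_heavy_full, write_heavy_single)  # 4000 1010
```

Under these assumptions, full replication wins clearly for the read-heavy workload and loses for the update-heavy one, which matches the guidance in the bullets above.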


Partially Replicated

• Copies of tables or portions of tables are stored
at different sites.
• The distribution of the tables is done in
accordance with the frequency of access.
• This takes into consideration the fact that the
frequency of accessing the tables varies
considerably from site to site.
• The number of copies of the tables (or portions)
depends on how frequently the access queries
execute and on the sites that generate the access
queries.


Fragmented

• In this design, a table is divided into two or more
pieces referred to as fragments or partitions, and
each fragment can be stored at a different site.
• This considers the fact that it seldom happens
that all data stored in a table is required at a
given site.
• Moreover, fragmentation increases parallelism
and provides better disaster recovery.


• The three fragmentation techniques are −

• Vertical fragmentation
• Horizontal fragmentation
• Hybrid fragmentation
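
The three techniques listed above can be sketched on an in-memory table. This is a minimal illustration, not how a DDBMS implements fragmentation; the EMPLOYEE rows, column names, and helper function names are all hypothetical.

```python
# Hypothetical EMPLOYEE table, represented as a list of row dictionaries.
employees = [
    {"emp_id": 1, "name": "Asha", "city": "Delhi", "salary": 50000},
    {"emp_id": 2, "name": "Ravi", "city": "Mumbai", "salary": 60000},
    {"emp_id": 3, "name": "Meena", "city": "Delhi", "salary": 55000},
]


def horizontal_fragment(rows, predicate):
    """Horizontal fragmentation: keep whole rows that satisfy a predicate."""
    return [dict(row) for row in rows if predicate(row)]


def vertical_fragment(rows, columns, key="emp_id"):
    """Vertical fragmentation: keep a subset of columns.

    The key column is always kept so fragments can be joined back together.
    """
    wanted = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in wanted} for row in rows]


def hybrid_fragment(rows, predicate, columns, key="emp_id"):
    """Hybrid fragmentation: a horizontal split followed by a vertical one."""
    return vertical_fragment(horizontal_fragment(rows, predicate), columns, key)


delhi_rows = horizontal_fragment(employees, lambda r: r["city"] == "Delhi")
payroll_columns = vertical_fragment(employees, ["salary"])
delhi_payroll = hybrid_fragment(employees, lambda r: r["city"] == "Delhi", ["salary"])

print(delhi_payroll)
# [{'emp_id': 1, 'salary': 50000}, {'emp_id': 3, 'salary': 55000}]
```

Each fragment could then be stored at the site that uses it most, for example the Delhi rows at a Delhi site.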


The CAP Theorem in DBMS

Last Updated : 15 Jul, 2025

The inherent trade-offs in networked shared-data system design make it very difficult to
create a dependable and effective system. The CAP theorem, or CAP principle, is a central
foundation for comprehending these trade-offs in distributed systems. The CAP theorem
emphasizes the limitations that system designers have while addressing distributed data
replication. It states that only two of the three properties—consistency, availability, and
partition tolerance—can be concurrently attained by a distributed system.

Developers must carefully balance these attributes according to their particular application
demands because of this underlying restriction. Designers may decide which qualities to
prioritize to obtain the best performance and reliability for their systems by knowing the
CAP theorem. This article will provide a thorough analysis of all the properties given in the
CAP theorem, investigate the associated trade-offs, and talk about how these ideas relate
to distributed systems in the real world.

What is the CAP Theorem?

The CAP theorem is a fundamental concept in distributed systems theory that was first
proposed by Eric Brewer in 2000 and subsequently proved by Seth Gilbert and Nancy
Lynch in 2002. It asserts that all three of the following qualities cannot be concurrently
guaranteed in any distributed data system:

1. Consistency

Consistency means that all the nodes (databases) inside a network will have the same
copies of a replicated data item visible for various transactions. It guarantees that every
node in a distributed cluster returns the same, most recent, successful write, so that
every client has the same view of the data. There are various types of consistency
models. Consistency in CAP refers to linearizability (atomic consistency), a very strong form of
consistency.

Note that the concepts of consistency in ACID and CAP are different. In CAP, it
refers to the consistency of the values in different copies of the same data item in a
replicated distributed system. In ACID, it refers to the fact that a transaction will not violate
the integrity constraints specified on the database schema.

For example, a user checks his account balance and knows that he has 500 rupees. He
spends 200 rupees on some products. Hence the amount of 200 must be deducted
changing his account balance to 300 rupees. This change must be committed and
communicated with all other databases that hold this user's details. Otherwise, there will
be inconsistency, and the other database might show his account balance as 500 rupees
which is not true.
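
The account balance example can be sketched with two in-memory replicas. This is only an illustration under simplifying assumptions: the `Replica` class and both write functions are hypothetical, and propagation is modeled as a plain loop rather than a real replication protocol.

```python
class Replica:
    """A hypothetical database node holding one copy of the account balance."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance


def write_to_one(replica, new_balance):
    """Update a single node without telling the others (inconsistent)."""
    replica.balance = new_balance


def write_to_all(replicas, new_balance):
    """Update every node that holds the data item (consistent)."""
    for replica in replicas:
        replica.balance = new_balance


node_a = Replica("A", 500)
node_b = Replica("B", 500)

# The user spends 200 rupees, but only node A learns about it.
write_to_one(node_a, 500 - 200)
stale_view = (node_a.balance, node_b.balance)

# The change is now communicated to all nodes holding the user's details.
write_to_all([node_a, node_b], 300)
consistent_view = (node_a.balance, node_b.balance)

print(stale_view)       # (300, 500)
print(consistent_view)  # (300, 300)
```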

Consistency problem

2. Availability

Availability means that each read or write request for a data item will either be processed
successfully or will receive a message that the operation cannot be completed. Every non-
failing node returns a response for all the read and write requests in a reasonable amount
of time. The key word here is "every". In simple terms, every node (on either side of a
network partition) must be able to respond in a reasonable amount of time.

For example, user A is a content creator with 1000 other users subscribed to his channel.
Another user B, who is far away from user A, tries to subscribe to user A's channel. Since the
distance between the two users is huge, they are connected to different database nodes of
the social media network. If the distributed system follows the principle of availability, user
B must be able to subscribe to user A's channel.

Availability problem

3. Partition Tolerance

Partition tolerance means that the system can continue operating even if the network
connecting the nodes has a fault that results in two or more partitions, where the nodes in
each partition can only communicate among each other. That means, the system continues
to function and upholds its consistency guarantees in spite of network partitions. Network
partitions are a fact of life. Distributed systems guaranteeing partition tolerance can
gracefully recover from partitions once the partition heals.

For example, take the same social media network, where two users are
trying to find the subscriber count of a particular channel. Due to some technical fault, a
network outage occurs, and the second database, to which user B is connected, loses its connection
with the first database. The subscriber count is therefore shown to user B from a
replica of the data in database 1 that was made before the network
outage. Hence the distributed system is partition tolerant.

Partition Tolerance

The CAP theorem states that distributed databases can have at most two of the three
properties: consistency, availability, and partition tolerance. As a result, database
systems prioritize only two properties at a time.

Venn diagram of CAP theorem

The Trade-Offs in the CAP Theorem

The CAP theorem implies that a distributed system can only provide two out of three
properties:

1. CA (Consistency and Availability)

Systems of this type always accept requests to view or modify the data sent
by the user, and they always respond with data that is consistent across
all the database nodes of a big, distributed network.

However, such distributed systems are not realizable in the real world because when a
network failure occurs, there are two options: either send old data that was replicated
moments before the network failure, or do not allow the user to access the
stale data. If we choose the first option, our system will be Available, and if we choose
the second option, our system will be Consistent.

The combination of consistency and availability is not possible in distributed systems.
To achieve CA, the system has to be monolithic, so that when a user updates the state
of the system, all other users accessing it see the new changes, which
means that consistency is maintained. And since it follows a monolithic architecture, all
users are connected to a single system, which means it is also available. These types of
systems are generally not preferred when distributed computing is required, because distribution
is only possible when consistency or availability is sacrificed for partition tolerance.

Example databases: MySQL, PostgreSQL

CAP diagram

2. AP (Availability and Partition Tolerance)

Systems of this type are distributed in nature, ensuring that requests sent by
the user to view or modify the data in the database nodes are not
dropped and are processed even in the presence of a network partition.

The system prioritizes availability over consistency and can respond with possibly stale
data which was replicated from other nodes before the partition was created due to some
technical failure. Such design choices are generally used while building social media
websites such as Facebook, Instagram, Reddit, etc. and online content websites like
YouTube, blogs, news sites, etc., where strict consistency is usually not required and a bigger problem
arises if the service is unavailable, causing corporations to lose money since the users may
shift to a new platform. The system can be distributed across multiple nodes and is designed
to operate reliably even in the face of network partitions.

Example databases: Amazon DynamoDB, Apache Cassandra.

3. CP (Consistency and Partition Tolerance)

Systems of this type are distributed in nature, ensuring that requests sent by
the user to view or modify the data in the database nodes are rejected
instead of being answered with inconsistent data in the presence of a network partition.

The system prioritizes consistency over availability and does not allow users to read crucial
data from a stored replica that was made prior to the occurrence of the network
partition. Consistency is chosen over availability for critical applications where the latest data
plays an important role, such as stock market, ticket booking, and
banking applications, where problems arise if stale data is presented to users.

For example, in a train ticket booking application, there is one seat which can be booked. A
replica of the database is created, and it is sent to other nodes of the distributed system. A
network outage occurs, which causes the user connected to the partitioned node to fetch
details from this replica. Some user connected to the unpartitioned part of the distributed
network has already booked the last remaining seat. However, the user connected to the
partitioned node will still see one seat, which makes the available data inconsistent. It would
have been better to show this user an error, making the system unavailable for that user
while maintaining consistency. Hence consistency is chosen in such scenarios.

Example databases: Apache HBase, MongoDB, Redis.

Conclusion
The CAP theorem provides a framework for understanding the trade-offs in designing
distributed systems. It highlights that only two out of three properties—Consistency,
Availability, and Partition Tolerance—can be achieved simultaneously. Depending on the
application requirements, developers must prioritize the properties that best meet their
needs. It's also important to note that many modern databases offer configurations to
balance these properties dynamically based on specific use cases.

Choosing an AWS NoSQL Database AWS Whitepaper

Types of NoSQL databases

To support specic needs and use cases, NoSQL databases use a variety of data models for
managing and accessing the data. The following section describes some of the common NoSQL
database categories:

• Key-value pair

• Document-oriented

• Column-oriented

• Graph-based

• Time series

These types of databases are optimized specifically for applications that need large data volumes,
flexible data models, and low latency. To achieve these objectives, NoSQL databases employ
various techniques, and it's important to note that not all database options prioritize the same set
of factors that are mentioned here:

• Consistency

• Atomicity, Consistency, Isolation, and Durability (ACID) transactions

• Query language and data access richness (simplified Create, Read, Update, and Delete (CRUD)-style
operations with known predictable cost)

• Sharding/partitioning of data sets on primary identifier keys

• Shifting the burden of data and schema validation to the application (removing referential
integrity enforcement by the database and so on)

Here’s a brief overview of the most popular NoSQL data models.

1. Key-value — A key-value data store is a type of database that stores data as a collection of
key-value pairs. In this type of data store, each data item is identified by a unique key, and the value
associated with that key can be anything, such as a string, number, object, or even another data
structure.


An example of data stored as key-value pairs.

AWS oers Amazon DynamoDB as a key-value managed database service.

2. Document — In a document database, the data is stored in documents. Each document is
typically a nested structure of keys and values. The values can be atomic data types, or complex
elements such as lists, arrays, nested objects, or child collections (for example, a collection in
the document database is analogous to a table in a relational database, except there is no single
schema enforced upon all documents).

Documents are retrieved by unique keys. It may also be possible to retrieve only parts of a
document--for example, the cost of an item--to run queries such as aggregation, querying using
examples based on a text string, or even full-text search. Most document databases also allow
you to define secondary indexes.

You can transfer the application code object model directly into a document using several
different formats. The most commonly used are JavaScript Object Notation (JSON), Binary
JavaScript Object Notation (BSON), and Extensible Markup Language (XML).
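
As a small illustration of the document model, the sketch below serializes a nested application object to JSON with Python's standard `json` module and then reads back only part of it. The order document and its field names are hypothetical, and no document database is involved.

```python
import json

# A hypothetical order document: nested keys and values, including a list.
order = {
    "_id": "order-1001",
    "customer": {"name": "Asha", "city": "Delhi"},
    "items": [
        {"sku": "A1", "cost": 12.5, "qty": 2},
        {"sku": "B7", "cost": 3.0, "qty": 1},
    ],
}

# The application object maps directly to a JSON document and back.
encoded = json.dumps(order)
decoded = json.loads(encoded)

# Retrieve only part of the document: the cost of each item.
item_costs = [item["cost"] for item in decoded["items"]]

# A simple aggregation over the nested list.
total = sum(item["cost"] * item["qty"] for item in decoded["items"])

print(item_costs)  # [12.5, 3.0]
print(total)       # 28.0
```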


An example of a document data model

AWS oers a specialized document database service called Amazon DocumentDB (with
MongoDB compatibility).

3. Wide-column — A wide column data store is a type of NoSQL database that stores data in
columns rather than rows, making it highly scalable and flexible. In a wide column data store,
data is organized into column families, which are groups of columns that share the same
attributes. Each row in a wide column data store is identified by a unique row key, and the
columns in that row are further divided into column names and values.

Unlike traditional relational databases, which have a fixed number of columns and data types,
wide column data stores allow for a variable number of columns and support multiple data
types. The most significant benefit of having column-oriented databases is that you can store
large amounts of data within a single column. This feature allows you to reduce disk resources
and the time it takes to retrieve information from it.
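
The row key, column family, and column structure described above can be sketched with nested dictionaries. This is only an illustration of the data model; the `put_cell` and `get_family` helpers and the sample rows are hypothetical and unrelated to the Cassandra or Amazon Keyspaces APIs.

```python
# table[row_key][column_family][column_name] = value
table = {}


def put_cell(row_key, family, column, value):
    """Store one value; rows may have different columns in the same family."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value


def get_family(row_key, family):
    """Return a copy of all columns of one column family for a row."""
    return dict(table.get(row_key, {}).get(family, {}))


put_cell("user#1", "profile", "name", "Asha")
put_cell("user#1", "profile", "email", "asha@example.com")
put_cell("user#2", "profile", "name", "Ravi")  # no email column for this row
put_cell("user#2", "activity", "last_login", "2024-01-15")

print(get_family("user#1", "profile"))
# {'name': 'Asha', 'email': 'asha@example.com'}
print(get_family("user#2", "profile"))
# {'name': 'Ravi'}
```

Note that the two rows hold a different number of columns, which a fixed relational schema would not allow without NULL values.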

An example of the kind of data you might store in a wide-column data store

AWS oers Amazon Keyspaces (for Apache Cassandra) as a wide-column managed database
service.

4. Graph — Graph databases are used to store and query highly connected data. Data can be
modeled in the form of entities (also referred to as nodes, or vertices) and the relationships
between those entities (also referred to as edges). The strength or nature of the relationships
also carries significant meaning in graph databases.

Users can then traverse the graph structure by starting at a defined set of nodes or edges and
travel across the graph, along defined relationship types or strengths, until they reach some
defined condition. Results can be returned in the form of literals, lists, maps, or graph traversal
paths. Graph databases provide a set of query languages that contain syntax designed for
traversing a graph structure, or matching a certain structural pattern.


An example of a social network graph. Given the people (nodes) and their relationships (edges), you
can ﬁnd out who the "friends of friends" of a particular person are—for example, the friends of
Howard's friends.
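
The "friends of friends" traversal can be sketched over an adjacency map. This is a plain Python illustration, not a graph query language such as those used with Amazon Neptune; apart from Howard, the names and the friendship edges are hypothetical.

```python
# Each person (node) maps to the set of people they are friends with (edges).
friends = {
    "Howard": {"Jack", "Jill"},
    "Jack": {"Howard", "Sara"},
    "Jill": {"Howard", "Sara", "Tom"},
    "Sara": {"Jack", "Jill"},
    "Tom": {"Jill"},
}


def friends_of_friends(graph, person):
    """Return people exactly two hops away, excluding direct friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())
    # Remove the starting person and anyone who is already a direct friend.
    return two_hops - direct - {person}


print(sorted(friends_of_friends(friends, "Howard")))  # ['Sara', 'Tom']
```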

AWS oers Amazon Neptune as a managed graph database service.

5. Time series — A time series database is designed to store and retrieve data records that are
sequenced by time, which are sets of data points that are associated with timestamps and stored
in time sequence order. Time series databases make it easy to measure how measurements or
events change over time; for example, temperature readings from weather sensors or intraday
stock prices.
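
Storing points in time order and querying a time range can be sketched with Python's standard `bisect` module. This is an in-memory illustration only and does not represent the Amazon Timestream API; the `record` and `query_range` helpers and the sample temperature readings are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort

# Readings are (timestamp_in_seconds, value) tuples kept sorted by timestamp.
readings = []


def record(timestamp, value):
    """Insert a data point while keeping the list in time sequence order."""
    insort(readings, (timestamp, value))


def query_range(start, end):
    """Return all points with start <= timestamp <= end."""
    lo = bisect_left(readings, (start, float("-inf")))
    hi = bisect_right(readings, (end, float("inf")))
    return readings[lo:hi]


# Points may arrive out of order; insort places them correctly.
record(1700000120, 21.9)
record(1700000000, 21.5)
record(1700000060, 21.7)
record(1700000180, 22.4)

print(query_range(1700000060, 1700000120))
# [(1700000060, 21.7), (1700000120, 21.9)]
```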

An example of a time series data model

AWS oers Amazon Timestream as a managed time series database service.

Understanding Amazon NoSQL data stores

AWS provides the broadest selection of managed NoSQL databases, allowing you to save, grow,
and innovate faster. With Amazon NoSQL databases, you get high performance, enterprise-grade
security, and automatic, instant scalability. The following table lists some of the AWS managed
NoSQL database services offered, and their key characteristics:

Table 1 — AWS database service comparison

Amazon DocumentDB (with MongoDB compatibility)*
• Characteristics: Mid TB range; Data format: JSON, BSON, XML; NoSQL type: document; Consistency: strong/eventual
• Use cases: User profile/personalization, catalogs, mobile, and content management, retail and marketing (for example, tracking customers who purchase similar items)
• Strengths: Flexible schema and indexing, ad hoc queries on any attributes, including nested attributes
• Security: Capability to enable encryption at rest and in transit

Amazon DynamoDB*
• Characteristics: High TB range; Data format: JSON, BSON, or XML; NoSQL type: key-value, document; Consistency: strong/eventual
• Use cases: User preferences, session management, shopping cart, product catalog, high-traffic web apps, near real-time bidding
• Strengths: Performance at scale, serverless, simple data model
• Security: Encrypts all data by default, row/column security

Amazon Keyspaces (for Apache Cassandra)
• Characteristics: Data format: JSON; NoSQL type: wide column; Consistency: one, local_one, local-quorum
• Use cases: Highly scalable apps for equipment maintenance, fleet management, route optimization
• Strengths: Extreme write speeds with relatively less velocity reads; being serverless, allocates storage and read/write throughput directly to tables
• Security: Tables are encrypted by default; capability to enable encryption at rest and in transit

Amazon Neptune*
• Characteristics: Mid TB range; Data format: Gremlin, RDF, openCypher; NoSQL type: graph; Consistency: immediate consistency
• Use cases: Recommendations, social patterns, relationship traversal, fraud detection, risk assessment
• Strengths: Highly connected data is locally indexed and purpose-built to answer questions about relationships; optimized for efficient storage and retrieval
• Security: Capability to enable encryption at rest and in transit

Amazon Timestream*
• Characteristics: NoSQL type: time series; Consistency: eventual
• Use cases: Server metrics, application performance monitoring, network data, IoT apps, sensor data, events, clicks, financial forecasting, many other types of analytics data
• Strengths: Analytics over time series data
• Security: Encrypts all data by default

Amazon ElastiCache (Memcached)
• Characteristics: Low TB range; NoSQL type: in-memory, key-value
• Use cases: Caching repeat requests, sticky sessions (to store session state)
• Strengths: Simple caching model, multi-threaded performance
• Security: Capability to enable encryption at rest and in transit

Amazon ElastiCache (Redis OSS)
• Characteristics: Low TB range; NoSQL type: in-memory, key-value
• Use cases: Gaming leaderboards, geospatial applications, sorting and ranking, pub/sub messaging
• Strengths: Complex data structures, geospatial capabilities
• Security: Capability to enable encryption at rest and in transit

Amazon MemoryDB
• Characteristics: NoSQL type: in-memory, database; Consistency: strong/eventual
• Use cases: High concurrency, streaming media, data feeds
• Strengths: Durable database, complex data structures
• Security: Capability to enable encryption at rest and in transit

* ACID compliant

NoSQL databases

• Amazon DynamoDB

• Amazon Keyspaces (for Apache Cassandra)

• Amazon Neptune

• Amazon Timestream

• Amazon ElastiCache

• Amazon DocumentDB (with MongoDB compatibility)

• Amazon MemoryDB

Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service. Some key capabilities of
DynamoDB include:

• High performance — Designed to provide single-digit millisecond latency for read and write
operations at any scale.


• Scalability — The ability to automatically scale throughput capacity in response to demand, so
you can start small and scale as needed.
• Flexibility — Supports both document and key-value data models, and provides rich data types
such as lists and maps, making it easy to store any type of data.
• Durability — Provides consistent, single-digit millisecond performance for read and write
operations, even in the face of hardware failures.
• Global tables — DynamoDB global tables provide multi-Region data replication, which makes it
easy to build globally distributed applications.
• Integrations — Integration with other AWS services such as AWS Lambda, Amazon Simple
Storage Service (Amazon S3), Amazon DynamoDB Streams, and Amazon Kinesis, making it
easy to build serverless and data-driven applications. Amazon DynamoDB integrates with
Amazon CloudWatch Contributor Insights to provide information about the most accessed and
throttled items in a table or index. DynamoDB delivers this information to you using CloudWatch
Contributor Insights rules, reports, and graphs of report data.

Amazon Keyspaces (for Apache Cassandra)

Amazon Keyspaces (for Apache Cassandra) (Amazon Keyspaces) is a fully managed, Apache
Cassandra-compatible database service. Some key features of Amazon Keyspaces include:

• Apache Cassandra compatibility — Full compatibility with Cassandra, allowing you to use your
existing Cassandra applications and tools with minimal changes.
• Scalability — Designed to handle millions of requests per second and terabytes of data, making
it suitable for high-scale applications.
• Serverless — Instead of deploying, managing, and maintaining storage and compute resources
for your workload through nodes in a cluster, Amazon Keyspaces allocates storage and read/
write throughput resources directly to tables.
• Global distribution — Supports global distribution of data, allowing you to store and access
data from multiple Regions, reducing latency and improving application performance.
• Monitoring and management — Provides an easy-to-use, web-based console for monitoring
and managing your database, as well as integration with Amazon CloudWatch for metrics and
alerts.
• Integration with other AWS services — Integration with other AWS services such as Amazon S3,
Amazon Redshift, and Amazon EMR, making it easy to build data-driven applications.

• Highly available and secure — Data is replicated automatically across multiple AWS Availability
Zones using a replication factor of three. Amazon Keyspaces encrypts all customer data at rest by
default, and is integrated with AWS IAM to help you manage access to your tables and data.

Amazon Neptune

Amazon Neptune is a fully managed graph database service. Neptune makes it easy to build and
run applications that work with highly connected datasets, including identity graph, customer 360
(C360), security, fraud, and knowledge graph applications. Some key features of Amazon Neptune include:

• High performance — Provides low-latency and high-throughput performance for both read and
write operations, making it suitable for real-time applications.

• Scalability — Neptune can handle billions of vertices and edges, and is designed to
automatically scale to meet the demands of your application.

• Compatibility — Supports the popular graph query languages, including Apache TinkerPop
Gremlin and SPARQL, making it easy to use with existing applications and tools.

• Durability — Automatically replicates data across multiple Availability Zones (AZs) for high
availability (HA) and data durability.

• Integration with other AWS services — Integration with other AWS services such as Amazon
S3, Amazon OpenSearch Service, and Amazon SageMaker AI, making it easy to build data-driven
applications.

• Management and monitoring — Provides an easy-to-use, web-based console for monitoring
and managing your database, as well as integration with Amazon CloudWatch for metrics and
alerts.

Amazon Timestream

Amazon Timestream is a managed time series database. It is specifically designed to handle
time-stamped data, such as IoT device data and operational logs, and provides a fast, scalable, and
cost-effective way to store and analyze large amounts of time series data.

Some of the key features of Amazon Timestream include:

• Scalable storage — Automatically scales storage as your data grows, so you don’t have to worry
about running out of space.

• Fast querying — Provides fast and efficient querying of your time series data, allowing you to
quickly and easily analyze your data.
• Integrations — Integration with other AWS services, such as Amazon Kinesis, Amazon S3, and
QuickSight, making it easy to collect, store, and analyze your data.
• Cost-effective — Provides cost-effective storage and analysis of your time series data, with the
ability to choose between standard and memory-optimized performance tiers.

Amazon ElastiCache

Amazon ElastiCache is a web service that makes it easy to deploy, operate, and scale an in-memory
cache in the cloud. It provides a simple way to cache frequently-used data in memory, reducing
the need to repeatedly fetch this data from a slower disk-based data store such as a relational
database.

ElastiCache supports two popular in-memory cache engines: Memcached and Redis. These engines
can be used to significantly improve the performance of web applications, mobile apps, and gaming
platforms.

Some of the key features of Amazon ElastiCache include:

• Easy setup — Provides a simple, one-click setup process, making it easy to get started with
in-memory caching.
• Scalable performance — Automatically scales cache nodes as your application's needs change, so
you can easily adjust cache performance to meet the demands of your application.
• High availability — Provides built-in replication and failover capabilities, ensuring high
availability and durability of your cache data.
• Integrations — Integration with other AWS services, such as Amazon Elastic Compute Cloud
(Amazon EC2), Amazon Relational Database Service (Amazon RDS), and Amazon S3, making it
easy to use caching in your overall application architecture.

Amazon DocumentDB (with MongoDB compatibility)

Amazon DocumentDB (with MongoDB compatibility) (Amazon DocumentDB) is a fully managed,
scalable, and highly available document database service. It is designed to be compatible with
the MongoDB API, allowing you to use existing MongoDB applications and tools with Amazon
DocumentDB.

DocumentDB supports a range of use cases, such as content management systems, e-commerce
applications, and mobile backends. It allows you to store and retrieve JSON-like documents, with
support for complex queries, indexing, and aggregation. It also supports transactions, allowing you
to group multiple write operations into a single atomic unit of work.

Some of the key features of Amazon DocumentDB include:

• Compatibility — DocumentDB is compatible with existing MongoDB drivers and tools, and
applications can be used with DocumentDB with little or no change.
• Scalability — DocumentDB Elastic Clusters scale within minutes to handle millions of reads and
writes with petabytes of storage capacity, helping you cost-effectively meet the needs of your
most demanding document workloads.
• Flexibility — Supports a flexible data model that allows you to store and retrieve JSON-like
documents with varying structure and complexity. This makes it well-suited for a wide range of
use cases, such as content management, user profiles, product catalogs, and more.
• Durability — Designed to provide high durability and availability for your data. It provides
automatic backup and recovery, point-in-time recovery, and data replication across multiple
Availability Zones for high availability and disaster recovery (DR). DocumentDB automatically
backs up your data and transaction logs to Amazon S3, which is designed for 99.999999999%
durability. This helps ensure that your data is protected against data loss or corruption, even in
the event of a disaster or outage.
• Global clusters — DocumentDB global clusters provide disaster recovery from Region-wide
outages and enable low-latency global reads.
• Integration with other AWS services — You can integrate with AWS Glue to import and export
data from and to DocumentDB to other AWS services such as Amazon S3, Amazon Redshift,
and Amazon OpenSearch Service.

Amazon MemoryDB
Amazon MemoryDB (MemoryDB) is a fully managed, in-memory database service. It is designed to
provide high performance and low latency for applications that require fast and frequent access to
data.

Some of the key features of MemoryDB include:

• Compatibility — Compatibility with Redis, an open-source, in-memory data store that is widely
used for caching, near real-time analytics, and other high-performance applications. MemoryDB
supports the same set of Redis data types and parameters, and requires no code change to
migrate from Redis.
• Scalability — Designed to be highly available and durable, with automatic failover and data
replication across multiple Availability Zones for high availability and disaster recovery.
• Data durability — Data is stored across multiple Availability Zones, while ensuring single-digit
millisecond response using AWS proprietary architecture design.
• Security — Supports security features such as Amazon Virtual Private Cloud (Amazon VPC),
encryption with AWS Key Management Service (AWS KMS), and authentication and authorization
with Redis ACLs.
• Flexibility — Provides a number of features and capabilities to help you optimize your
application's performance, including read and write replicas, Multi-AZ deployments, automatic
scaling, and exible pricing options based on usage. MemoryDB is well-suited for a wide range of
use cases, including caching, near real-time analytics, and session stores. It is particularly useful
for applications that require fast and frequent access to data, such as gaming, e-commerce, and
advertising.

Amazon NoSQL databases integrate with AWS Identity and Access Management (AWS IAM) for
access control and security. IAM allows you to manage access to your NoSQL databases by creating
policies that define permissions for specific users, groups, or roles. You can use IAM to control
access to specific tables or resources within your NoSQL databases, as well as to enforce
fine-grained permissions for read and write operations.
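
As an illustration of the fine-grained permissions described above, the following Python sketch builds one possible IAM policy document for a DynamoDB table. It is a minimal example, not a recommended production policy: the table ARN, account ID, and the helper name `build_leading_key_policy` are hypothetical, and the condition shown (`dynamodb:LeadingKeys`) is specific to DynamoDB, so other NoSQL services would need different actions and conditions.

```python
import json


def build_leading_key_policy(table_arn: str) -> dict:
    """Build a policy that only allows access to items whose partition
    key value equals the caller's federated user ID (illustrative only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Read and write actions granted on the table.
                "Action": [
                    "dynamodb:GetItem",
                    "dynamodb:Query",
                    "dynamodb:PutItem",
                ],
                "Resource": table_arn,
                # Item-level restriction: partition key must match the user ID.
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "dynamodb:LeadingKeys": ["${www.amazon.com:user_id}"]
                    }
                },
            }
        ],
    }


# Hypothetical table ARN used purely for demonstration.
policy = build_leading_key_policy(
    "arn:aws:dynamodb:us-east-1:123456789012:table/GameScores"
)
print(json.dumps(policy, indent=2))
```

The resulting JSON could be attached to a user, group, or role; the sketch only constructs the document and does not call any AWS API.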

Decision making
This section outlines a decision framework you can use to help you determine which AWS-managed
NoSQL database service, or combination of database services, would fit your workload needs best.

While there is no simple formula you can follow that is comprehensive enough to apply generally,
there are a few important questions related to your application that should be answered during the
selection process:

What type of data is your application planning to persist (such as JSON structures, telemetry
data, image files, geospatial data)?

Different databases allow you to access stored data differently. If you plan to store unstructured
data such as images or encoded payloads, you need a data store that can store and retrieve
the entire unstructured binary payload fast, but you may not need a rich set of data access features
to introspect the unstructured data.

Conversely, a catalog system needs a richer feature set to access data based on patterns, but must also
allow for flexibility to expand the set of attributes collected for each item in the catalog. These
capabilities may be more important than the absolute fastest way to retrieve data.

What performance requirements and service-level commitments have you made to your end
users (for example, a service level agreement that guarantees microsecond or millisecond-level
response latency for queries)?

If your workload requires extremely high read performance with a response time measured in
microseconds rather than single-digit milliseconds, then you may consider using in-memory
caching solutions alongside your database, or a database that supports in-memory data access.
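
The idea of placing an in-memory cache alongside a database can be sketched with the cache-aside pattern. The Python example below is a simplified illustration only: `SlowStore` is a hypothetical stand-in for a disk-based database, the cache is a plain dictionary rather than ElastiCache or MemoryDB, and the TTL value is arbitrary.

```python
import time


class SlowStore:
    """Hypothetical stand-in for a slower, disk-based database."""

    def __init__(self, data):
        self._data = dict(data)
        self.reads = 0  # counts how often the backing store is hit

    def get(self, key):
        self.reads += 1
        return self._data.get(key)


class CacheAside:
    """Serve reads from memory when possible, fall back to the store."""

    def __init__(self, store, ttl_seconds=60.0, clock=time.monotonic):
        self._store = store
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, expiry time)

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: no database round trip
        value = self._store.get(key)  # cache miss: read from the store
        self._cache[key] = (value, now + self._ttl)
        return value


store = SlowStore({"user:1": {"name": "Ana"}})
cached = CacheAside(store)
first = cached.get("user:1")   # miss, reads the store
second = cached.get("user:1")  # hit, served from memory
print(first, second, store.reads)
```

Only the first lookup reaches the backing store. The trade-off is staleness: data changed in the store is not visible through the cache until the entry expires or is invalidated.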

Also consider how predictable your performance needs to be. A database such as Amazon
DynamoDB can deliver consistent, predictable response latencies to reads and writes, but it does
so because it supports a small number of query patterns that have a known cost. It’s a great fit for
point queries accessing one or a very small number of records at high frequency.

If your data access patterns require accessing a variable or unpredictable number of records or
volume of data, your performance will also have more variability. Consider also that modern
architectures are implemented using decoupled microservices, each with different data access
requirements, compounding the end-to-end latency or performance of the end user request.

What are your resiliency requirements?

Workloads with high availability requirements (such as mission-critical applications that can’t
tolerate any downtime) can span multiple Regions to provide further resiliency in case a specific
AWS Region becomes unreachable. For example, you can use DynamoDB global tables to deploy
globally across supported Regions and read or write to the local copies of the tables concurrently.
Amazon DocumentDB also supports multiple Regions through Amazon DocumentDB global
clusters, but you can only write in the primary Region cluster.

After you address these questions, you can use the following decision tree for further direction on
how to narrow down your choices. The decision tree covers two scenarios:

1. If you’re already using a NoSQL database on premises and would like to consider migrating to a
fully managed scalable, highly available AWS NoSQL database service, start your review of our
decision tree at Step 1.

2. If you want to modernize your application and are considering a NoSQL database, you can use
the decision tree to choose the most appropriate AWS-managed NoSQL database service for
your use case based on your requirements by starting at Step 2. You can start with the data
model that is appropriate for your use case.

Considerations

Best practices and limitations

• For implementation best practices and limitations, refer to the AWS documentation for the
respective database service.

• If your use cases use relatively static schemas, perform complex table lookups, require accessing
data across multiple keys, and might experience high service throughput, they might be a better fit
for Amazon RDS offerings.

Some customer journeys and lessons learned

• Amazon DynamoDB on Production: FinBox’s Compilation of Lessons Learned in a Year
• Refer to this case study to learn how McAfee migrated from Microsoft SQL Server
to DynamoDB to power its next-generation messaging platform and drive its ecommerce business.
• Watch this video on why Amazon Fulfillment chose Amazon DocumentDB to power their
inventory authority platform (IAP), considerations for performance and scale, and some learnings
from their experience.
• Watch this video to hear about FINRA’s story on how they modernized their data collection
platform used by FINRA customers from a relational database using XML to Amazon
DocumentDB.
• Idea to product: PricewaterhouseCoopers launches Check-In within three months on Amazon
Keyspaces
• Watch this video to learn about the key features of ElastiCache (Redis OSS) and dive deep into
how Groupon uses ElastiCache for deal curation.
• Blog post on how Near was able to reduce latency by four times and achieve 99.9% uptime of its
critical RTB platform applications by moving to ElastiCache.
• LexisNexis presentation on using graph to store relationships between legal documents using
Amazon Neptune.
• Cox Automotive presentation on using graph to store relationships between user identities on
their web platforms to power marketing and advertising.

References
• Scale and performance characteristics of Timestream – Deriving near real-time insights over
petabytes of time series data with Amazon Timestream.

• This blog post provides you with a quick summary and set of resources for common topics so you
can quickly ramp up on Amazon DocumentDB.

• This blog post provides improved performance characteristics of Amazon Keyspaces, lightweight
transactions API, advanced design patterns, and operational best practices.

• AWS Online Tech Talks: ElastiCache best practices

• Getting started with Amazon Neptune by creating a graph of all of your AWS resources.

• How to migrate an application from using GridFS to using Amazon S3 and Amazon DocumentDB.

• Graph data model lets you traverse through relationships without requiring joins and indexes.
For more information, refer to the “How Do I Know I Need an Amazon Neptune Graph
Database?” video.

• Graph data model lets you traverse through relationships without requiring joins and indexes.
For more information, refer to “How Do I Know I Need an Amazon Neptune Graph Database?”.

• Complex data models (such as arrays, nested fields, and deep relationships) let you consider a
wider range of application needs. For more information, refer to the “When to use DocumentDB
vs DynamoDB” video.

• DynamoDB provides extreme scale for certain data access patterns. For more information, refer
to “How to determine if Amazon DynamoDB is appropriate for your needs”.

• Refer to this tech talk to learn about DocumentDB use cases, and how Amazon DocumentDB
cluster architecture provides better performance, scalability, and availability.

• Amazon MemoryDB is a durable, in-memory database for workloads that require an ultra-fast
Redis-compatible primary database. If you require sub-millisecond performance and need to add
persistence and durability, consider using MemoryDB rather than in-memory cache for Redis.
Refer to this tech talk to learn about Amazon MemoryDB.

Developer references

• Why purpose-built database? This hands-on tutorial will help you get an idea of how AWS
NoSQL databases can help run your specific workloads.

Training and guidance

• To ensure that development teams are comfortable with transitioning to Amazon, it is essential
to train the teams on AWS NoSQL databases and cloud-based design patterns (through tech talks,
workshops, and Immersion Days).

Conclusion
NoSQL databases have become increasingly popular over the years due to their scalability,
flexibility, and ability to handle large volumes of complex data. This whitepaper has provided
an overview of the different types of NoSQL databases, including document-based, key-value,
column-family, and graph databases, as well as their unique strengths and weaknesses.

It has also explored the various NoSQL database services offered by Amazon Web Services,
including Amazon DynamoDB, Amazon Keyspaces, Amazon Neptune, Amazon DocumentDB, and
Amazon MemoryDB. We hope it helps you make an informed decision on which database to choose
based on your specific needs.

Contributors
Contributors to this document include:

• Ashish Bhatia, Senior Solution Architect, Amazon Web Services

• Malathi Pinnamaneni, Senior Solution Architect, Amazon Web Services

Document revisions
To be notied about updates to this whitepaper, subscribe to the RSS feed.

Change               Description                                   Date

Minor update         Updated Amazon Keyspaces information.         July 28, 2023
                     Minor editorial corrections throughout.

Initial publication  Whitepaper published.                         April 25, 2023

Notices
Customers are responsible for making their own independent assessment of the information in
this document. This document: (a) is for informational purposes only, (b) represents current AWS
product oerings and practices, which are subject to change without notice, and (c) does not create
any commitments or assurances from AWS and its aliates, suppliers or licensors. AWS products or
services are provided “as is” without warranties, representations, or conditions of any kind, whether
express or implied. The responsibilities and liabilities of AWS to its customers are controlled by
AWS agreements, and this document is not part of, nor does it modify, any agreement between
AWS and its customers.

Neo4j : The Graph Database

Last Updated : 12 Jul, 2025

Neo4j is one of the most widely used graph database management systems, and it is also a NoSQL
database system. Unlike MySQL or MongoDB, Neo4j is designed to efficiently store and query
highly interconnected data, which sets it apart from other database management systems.

Neo4j is a cutting-edge database designed to handle and analyze connected data more
efficiently than traditional databases. Instead of using tables, it uses graph structures to
store and query data, making it ideal for applications with complex relationships. Neo4j is
known for its high performance, scalability, and flexibility, and is used in various fields like
social network analysis, fraud detection, recommendation systems, and knowledge graphs.
This article will explore Neo4j's key features, benefits, and uses, showing how it transforms
data management and helps uncover deeper insights.

Table of Content
What is Neo4j?
Neo4j structure
What is a Graph Database?
Features of Neo4j
Neo4j Usage

What is Neo4j?
Neo4j is a powerful, high-performance, open-source graph database that enables the
efficient management and querying of highly connected data. Unlike traditional relational
databases, Neo4j uses graph structures to represent and store data, making it uniquely
suited for applications involving complex relationships and dynamic, interconnected data.
As the world's leading graph database, Neo4j has become essential for organizations
looking to leverage the power of graph technology for a variety of use cases.

Neo4j is a powerful and flexible graph database management system, designed to
efficiently store and query highly interconnected data. Unlike traditional relational
databases, which store data in tables, Neo4j uses a graph structure to represent and
navigate relationships between data entities.

Neo4j structure
Neo4j stores and presents data as a graph rather than in a tabular or JSON format. All data is
represented by nodes, and you can create relationships between those nodes. The whole database
therefore looks like a graph, which is what distinguishes Neo4j from other database management
systems.

A graph Database

Relational database management systems such as MS Access and SQL Server use tables, with rows
and columns, to store and present data. Neo4j does not use tables, rows, or columns to store or
present data.

What is a Graph Database?

A graph database uses graph theory to store, map, and query relationships. It consists of
nodes, edges, and properties, where:

Nodes represent entities such as people, businesses, or any data item.

Edges (or relationships) connect nodes and illustrate how entities are related.
Properties provide additional information about nodes and relationships.

This structure allows graph databases to model real-world scenarios more naturally and
intuitively than traditional relational databases.
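
To make the node, edge, and property terminology concrete, here is a small in-memory property graph written in Python. It is a teaching sketch under simple assumptions, not how Neo4j stores data: the class `PropertyGraph`, the relationship names, and the sample people are all invented for illustration.

```python
from collections import deque


class PropertyGraph:
    """Minimal property graph: nodes and edges both carry properties."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = {}  # node id -> list of (relationship type, target id, properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, rel_type, target, **props):
        self.edges[source].append((rel_type, target, props))

    def reachable(self, start, rel_type, max_hops):
        """Breadth-first traversal along one relationship type."""
        seen = {start}
        frontier = deque([(start, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop limit
            for rtype, target, _props in self.edges[node]:
                if rtype == rel_type and target not in seen:
                    seen.add(target)
                    frontier.append((target, depth + 1))
        seen.discard(start)
        return seen


g = PropertyGraph()
g.add_node("alice", label="Person", name="Alice")
g.add_node("bob", label="Person", name="Bob")
g.add_node("carol", label="Person", name="Carol")
g.add_node("acme", label="Business", name="Acme")
g.add_edge("alice", "FRIEND_OF", "bob", since=2019)
g.add_edge("bob", "FRIEND_OF", "carol", since=2021)
g.add_edge("carol", "WORKS_AT", "acme")

friends_within_two_hops = g.reachable("alice", "FRIEND_OF", max_hops=2)
print(sorted(friends_within_two_hops))
```

The traversal follows edges directly from node to node, which is the kind of relationship query that would need repeated joins in a relational schema.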

Features of Neo4j

High Performance and Scalability

Neo4j is designed to handle massive amounts of data and complex queries quickly and
efficiently. Its native graph storage and processing engine ensure high performance and
scalability, even with billions of nodes and relationships.

Cypher Query Language

Neo4j uses Cypher, a powerful and expressive query language tailored for graph
databases. Cypher makes it easy to create, read, update, and delete data, allowing users
to perform complex queries with concise and readable syntax.

ACID Compliance

Neo4j ensures data integrity and reliability through ACID (Atomicity, Consistency, Isolation,
Durability) compliance. This guarantees that all database transactions are processed
reliably and ensures the consistency of the database even in the event of failures.

Flexible Schema

Unlike traditional databases, Neo4j offers a flexible schema, allowing users to add or
modify data models without downtime. This adaptability makes it ideal for evolving data
structures and rapidly changing business requirements.

Neo4j Usage
If your data has many interconnecting relationships, Neo4j is a strong choice. It is well suited
to storing data that contains multiple connections between nodes, and it can be more comfortable
to use with highly related data than a relational database. Because Neo4j doesn't require a
predefined schema, you just load the data, and the data itself defines the structure. It is a
schema-optional database management system.

Some features may lead you to choose Neo4j over other database management systems. Neo4j is
built around relationships, but there is no need to set up primary key or foreign key constraints
on the data. You can add any relationship between any nodes you want. That makes Neo4j well
suited to network-shaped data. Below is a list of areas where you can use this database
management system.

Social network analysis, as in Facebook, Twitter, or Instagram
Network diagrams
Fraud detection
Graph-based search of digital assets
Data management
Real-time product recommendation

Advantages of Neo4j:

1. Representation of connected data is very easy.
2. Retrieval, traversal, and navigation of connected data are very fast.
3. It uses a simple and powerful data model.
4. It can represent semi-structured data easily.

Disadvantages of Neo4j:

1. OLAP support for these types of databases is not well developed.
2. Graph databases are still an area of active research.

Conclusion
Neo4j stands out as a leading graph database solution, offering unparalleled capabilities
for managing and querying highly interconnected data. Its native graph architecture
provides flexibility and performance advantages over traditional relational databases,
especially in scenarios involving complex relationships and real-time queries. As
organizations increasingly recognize the value of understanding relationships within their
data, Neo4j continues to play a crucial role in enabling advanced analytics, recommendation
engines, and knowledge graphs.


Enhanced Data Models: Introduction to Active, Temporal, Spatial, Multimedia,
and Deductive Databases
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe
26.1 Active Database Concepts and Triggers
• Database systems implement rules that specify actions automatically triggered by certain events
• Triggers
  • Technique for specifying certain types of active rules
• Commercial relational DBMSs have various versions of triggers available
  • Oracle syntax used to illustrate concepts

Generalized Model for Active Databases and Oracle Triggers
• Event-condition-action (ECA) model
  • Event triggers a rule
    • Usually database update operations
  • Condition determines whether rule action should be completed
    • Optional
    • Action will complete only if condition evaluates to true
  • Action to be taken
    • Sequence of SQL statements, transaction, or external program
Example
• Events that may cause a change in value of Total_sal attribute
  • Inserting new employee
  • Changing salary
  • Reassigning or deleting employees

Figure 26.1 A simplified COMPANY database used for active rule examples

Example (cont’d.)
• Condition to be evaluated
  • Check that value of Dno attribute is not NULL
• Action to be taken
  • Automatically update the value of Total_sal
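
The event, condition, and action of the Total_sal example can be tried out with Python's built-in `sqlite3` module. This is a sketch in SQLite trigger syntax, not the Oracle notation the slides use, and the simplified `department` and `employee` tables and the trigger names are assumptions made for the demonstration. Reassigning an employee to another department (a change of Dno) is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    dno INTEGER PRIMARY KEY,
    total_sal REAL NOT NULL DEFAULT 0
);
CREATE TABLE employee (
    ssn TEXT PRIMARY KEY,
    salary REAL NOT NULL,
    dno INTEGER REFERENCES department(dno)
);

-- Event: INSERT. Condition: Dno is not NULL. Action: add the salary.
CREATE TRIGGER total_sal_insert
AFTER INSERT ON employee
FOR EACH ROW
WHEN NEW.dno IS NOT NULL
BEGIN
    UPDATE department SET total_sal = total_sal + NEW.salary
    WHERE dno = NEW.dno;
END;

-- Event: salary change. Action: apply the difference.
CREATE TRIGGER total_sal_update
AFTER UPDATE OF salary ON employee
FOR EACH ROW
WHEN NEW.dno IS NOT NULL
BEGIN
    UPDATE department SET total_sal = total_sal + NEW.salary - OLD.salary
    WHERE dno = NEW.dno;
END;

-- Event: DELETE. Action: subtract the salary.
CREATE TRIGGER total_sal_delete
AFTER DELETE ON employee
FOR EACH ROW
WHEN OLD.dno IS NOT NULL
BEGIN
    UPDATE department SET total_sal = total_sal - OLD.salary
    WHERE dno = OLD.dno;
END;
""")

conn.execute("INSERT INTO department (dno) VALUES (5)")
conn.execute("INSERT INTO employee VALUES ('111', 30000, 5)")
conn.execute("INSERT INTO employee VALUES ('222', 40000, 5)")
conn.execute("INSERT INTO employee VALUES ('333', 25000, NULL)")  # condition is false
conn.execute("UPDATE employee SET salary = 45000 WHERE ssn = '222'")
conn.execute("DELETE FROM employee WHERE ssn = '111'")

total_sal = conn.execute(
    "SELECT total_sal FROM department WHERE dno = 5"
).fetchone()[0]
print(total_sal)
```

Each trigger is a row-level AFTER trigger, and the `WHEN` clause plays the role of the optional condition in the ECA model.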

Figure 26.2 Specifying active rules as triggers in Oracle notation (a) Triggers for
automatically maintaining the consistency of Total_sal of DEPARTMENT

Figure 26.2 (cont’d.) Specifying active rules as triggers in Oracle notation (b)
Trigger for comparing an employee’s salary with that of his or her supervisor

Design and Implementation Issues for Active Databases
• Deactivated rule
  • Will not be triggered by the triggering event
• Activate command
  • Makes the rule active again
• Drop command
  • Deletes the rule from the system
• Approach: group rules into rule sets
  • Entire rule set can be activated, deactivated, or dropped

Design and Implementation Issues for Active Databases (cont’d.)
• Timing of action
  • Before trigger executes trigger before executing event that caused the trigger
  • After trigger executes trigger after executing the event
  • Instead of trigger executes trigger instead of executing the event
• Action can be considered separate transaction
  • Or part of same transaction that triggered the rule

Design and Implementation Issues for Active Databases (cont’d.)
• Rule consideration
  • Immediate consideration
    • Condition evaluated as part of same transaction
    • Evaluate condition either before, after, or instead of executing the triggering event
  • Deferred consideration
    • Condition evaluated at the end of the transaction
  • Detached consideration
    • Condition evaluated as a separate transaction

Design and Implementation Issues for Active Databases (cont’d.)
• Row-level rule
  • Rule considered separately for each row
• Statement-level rule
  • Rule considered once for entire statement
• Difficult to guarantee consistency and termination of rules

Examples of Statement-Level Active Rules in STARBURST

Figure 26.5 Active rules using statement-level semantics in STARBURST notation

Potential Applications for Active Databases
• Allow notification of certain conditions that occur
• Enforce integrity constraints
• Automatically maintain derived data
• Maintain consistency of materialized views
• Enable consistency of replicated tables

Triggers in SQL-99

Figure 26.6 Trigger T1 illustrating the syntax for defining triggers in SQL-99

26.2 Temporal Database Concepts
• Temporal databases require some aspect of time when organizing information
  • Healthcare
  • Insurance
  • Reservation systems
  • Scientific databases
• Time considered as ordered sequence of points
  • Granularity determined by the application

Temporal Database Concepts (cont’d.)
• Chronon
  • Term used to describe minimal granularity of a particular application
• Reference point for measuring specific time events
  • Various calendars
• SQL2 temporal data types
  • DATE, TIME, TIMESTAMP, INTERVAL, PERIOD

Temporal Database Concepts (cont’d.)
• Point events or facts
  • Typically associated with a single time point
  • Time series data
• Duration events or facts
  • Associated with specific time period
  • Time period represented by start and end points
• Valid time
  • True in the real world

Temporal Database Concepts (cont’d.)
• Transaction time
  • Value of the system clock when information is valid in the system
• User-defined time
• Bitemporal database
  • Uses valid time and transaction time
• Valid time relations
  • Used to represent history of changes

Temporal Database Concepts
(cont’d.)

Figure 26.7 Different types of temporal relational databases (a) Valid time database
schema (b) Transaction time database schema (c) Bitemporal database schema

Temporal Database Concepts
(cont’d.)

Figure 26.8 Some tuple versions in the valid time relations EMP_VT and DEPT_VT
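
A valid time relation such as EMP_VT can be approximated in an ordinary SQL table by adding a valid start time and a valid end time to every tuple version. The sketch below uses Python's `sqlite3` module. The column names `vst` and `vet`, the use of `'9999-12-31'` to stand for "now", and all sample rows are assumptions chosen for illustration and are not taken from Figure 26.8.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE emp_vt (
        name   TEXT,
        ssn    TEXT,
        salary REAL,
        dno    INTEGER,
        vst    TEXT NOT NULL,  -- valid start time (ISO date)
        vet    TEXT NOT NULL   -- valid end time; '9999-12-31' means "now"
    )
""")

# Made-up tuple versions: each row is one version of an employee.
rows = [
    ("Smith", "123456789", 25000, 5, "2002-06-15", "2003-05-31"),
    ("Smith", "123456789", 30000, 5, "2003-06-01", "9999-12-31"),
    ("Wong", "333445555", 25000, 4, "1999-08-20", "2001-01-31"),
    ("Wong", "333445555", 30000, 5, "2001-02-01", "9999-12-31"),
]
conn.executemany("INSERT INTO emp_vt VALUES (?, ?, ?, ?, ?, ?)", rows)


def salary_as_of(connection, ssn, day):
    """Return the salary that was valid on `day`, or None if no version applies."""
    row = connection.execute(
        "SELECT salary FROM emp_vt WHERE ssn = ? AND vst <= ? AND ? <= vet",
        (ssn, day, day),
    ).fetchone()
    return None if row is None else row[0]


print(salary_as_of(conn, "123456789", "2003-01-01"))
print(salary_as_of(conn, "123456789", "2004-01-01"))
```

Because ISO dates sort lexicographically, plain string comparison is enough to find the version whose valid period contains the requested day.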

Temporal Database Concepts (cont’d.)
• Types of updates
  • Proactive
  • Retroactive
  • Simultaneous
• Timestamp recorded whenever change is applied to database
• Bitemporal relations
  • Application requires both valid time and transaction time

Temporal Database Concepts
(cont’d.)
n Implementation considerations
n Store all tuples in the same table
n Create two tables: one for currently valid
information and one for the rest
n Vertically partition temporal relation attributes into
separate relations
n New tuple created whenever any attribute updated
n Append-only database
n Keeps complete record of changes and
corrections
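
The tuple-versioning idea above (a new tuple is created whenever an attribute is updated) can be sketched in a few lines. This is only an illustration under stated assumptions: the class `EmpVersion`, the helpers `update_salary` and `salary_on`, the sample employee, and the half-open period convention [vst, vet) are all hypothetical choices, not taken from the slides.

```python
from dataclasses import dataclass
from datetime import date

NOW = date.max  # assumption: date.max stands in for the open-ended "now" end point


@dataclass
class EmpVersion:
    name: str
    salary: int
    vst: date  # valid start time (inclusive)
    vet: date  # valid end time (exclusive); NOW while the version is current


def update_salary(relation, name, new_salary, effective):
    """Close the current version of `name` and append a new version."""
    for v in relation:
        if v.name == name and v.vet == NOW:
            v.vet = effective  # the old version stops being valid here
            break
    relation.append(EmpVersion(name, new_salary, effective, NOW))


def salary_on(relation, name, day):
    """Return the salary that was valid in the real world on `day`, or None."""
    for v in relation:
        if v.name == name and v.vst <= day < v.vet:
            return v.salary
    return None


# Hypothetical data: one employee with one salary change.
emp_vt = [EmpVersion("Smith", 25000, date(2002, 6, 15), NOW)]
update_salary(emp_vt, "Smith", 30000, date(2003, 6, 1))
print(salary_on(emp_vt, "Smith", date(2003, 1, 1)))
print(salary_on(emp_vt, "Smith", date(2004, 1, 1)))
```

Keeping every version in one list corresponds to the "store all tuples in the same table" option; splitting current and historical versions into two tables would only change where `update_salary` writes.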

Temporal Database Concepts
(cont’d.)
n Attribute versioning
n Simple complex object used to store all temporal
changes of the object
n Time-varying attribute
n Values versioned over time by adding temporal
periods to the attribute
n Non-time-varying attribute
n Values do not change over time

Figure 26.10 Possible ODL schema for a temporal valid time EMPLOYEE_VT
object class using attribute versioning
Copyright © 2016 Ramez Elmasri and Shamkant B. Navathe Slide 26- 26
Temporal Database Concepts
(cont’d.)
n TSQL2 language
n Extends SQL for querying valid time and
transaction time tables
n Used to specify whether a relation is temporal or
nontemporal
n Temporal database query conditions may involve
time and attributes
n Pure time condition involves only time
n Attribute and time conditions

Temporal Database Concepts
(cont’d.)
n CREATE TABLE statement
n Extended with optional AS clause
n Allows users to declare different temporal options
n Examples:
n AS VALID STATE<GRANULARITY> (valid time
relation with valid time period)
n AS TRANSACTION (transaction time relation with
transaction time period)
n Keywords STATE and EVENT
n Specify whether a time period or point is
associated with valid time dimension
Temporal Database Concepts
(cont’d.)
n Time series data
n Often used in financial, sales, and economics
applications
n Special type of valid event data
n Event’s time points predetermined according to
fixed calendar
n Managed using specialized time series
management systems
n Supported by some commercial DBMS packages

26.3 Spatial Database Concepts
n Spatial databases support information about
objects in multidimensional space
n Examples: cartographic databases, geographic
information systems, weather information
databases
n Spatial relationships among the objects are
important
n Optimized to query data such as points, lines,
and polygons
n Spatial queries

Spatial Database Concepts (cont’d.)
n Measurement operations
n Used to measure global properties of single
objects
n Spatial analysis operations
n Uncover spatial relationships within and among
mapped data layers
n Flow analysis operations
n Help determine shortest path between two points

Spatial Database Concepts (cont’d.)
n Location analysis
n Determine whether given set of points and lines lie
within a given polygon
n Digital terrain analysis
n Used to build three-dimensional models

Spatial Database Concepts (cont’d.)

Table 26.1 Common types of analysis for spatial data

Spatial Database Concepts (cont’d.)
n Spatial data types
n Map data
n Geographic or spatial features of objects in a map
n Attribute data
n Descriptive data associated with map features
n Image data
n Satellite images
n Models of spatial information
n Field models
n Object models
Spatial Database Concepts (cont’d.)
n Spatial operator categories
n Topological operators
n Properties do not change when topological
transformations applied
n Projective operators
n Express concavity/convexity of objects
n Metric operators
n Specifically describe object’s geometry
n Dynamic spatial operators
n Create, destroy, and update

Spatial Database Concepts (cont’d.)
n Spatial queries
n Range queries
n Example: find all hospitals within the Metropolitan
Atlanta city area
n Nearest neighbor queries
n Example: find police car nearest location of a crime
n Spatial joins or overlays
n Example: find all homes within two miles of a lake
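
The three query types above can be illustrated on plain 2-D points. This is a minimal sketch that scans every point; a real spatial DBMS would use an index such as an R-tree instead. The function names, coordinates, and place names are hypothetical.

```python
import math


def range_query(points, xmin, ymin, xmax, ymax):
    """Range query: names of all points inside an axis-aligned rectangle."""
    return sorted(
        name for name, (x, y) in points.items()
        if xmin <= x <= xmax and ymin <= y <= ymax
    )


def nearest_neighbor(points, query):
    """Nearest neighbor query: name of the point closest to `query`."""
    return min(points, key=lambda name: math.dist(points[name], query))


def within_distance(left, right, d):
    """Spatial join: all (left, right) pairs at most distance d apart."""
    return sorted(
        (a, b)
        for a, pa in left.items()
        for b, pb in right.items()
        if math.dist(pa, pb) <= d
    )


# Hypothetical coordinates in arbitrary units.
hospitals = {"A": (1.0, 1.0), "B": (4.0, 5.0), "C": (9.0, 9.0)}
homes = {"h1": (0.0, 0.0), "h2": (5.0, 5.0)}
lakes = {"l1": (1.0, 1.0)}

print(range_query(hospitals, 0, 0, 5, 5))
print(nearest_neighbor(hospitals, (8.0, 8.0)))
print(within_distance(homes, lakes, 2.0))
```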

Spatial Database Concepts (cont’d.)
n Spatial data indexing
n Grid files
n R-trees
n Spatial join index
n Spatial data mining techniques
n Spatial classification
n Spatial association
n Spatial clustering

26.4 Multimedia Database Concepts
n Multimedia databases allow users to store and
query images, video, audio, and documents
n Content-based retrieval
n Automatic analysis
n Manual identification
n Color often used in content-based image retrieval
n Texture and shape
n Object recognition
n Scale-invariant feature transform (SIFT) approach

Multimedia Database Concepts
(cont’d.)
n Semantic tagging of images
n User-supplied tags
n Automated generation of image tags
n Web Ontology Language (OWL) provides concept
hierarchy
n Analysis of audio data sources
n Text-based indexing
n Content-based indexing

26.5 Introduction to Deductive
Databases
n Deductive database uses facts and rules
n Inference engine can deduce new facts using rules
n Prolog/Datalog notation
n Based on providing predicates with unique names
n Predicate has an implicit meaning and a fixed
number of arguments
n If arguments are all constant values, predicate
states that a certain fact is true
n If arguments are variables, considered as a query or
part of a rule or constraint

Prolog Notation and The Supervisory
Tree

Figure 26.11 (a) Prolog notation (b) The supervisory tree

Introduction to Deductive Databases
(cont’d.)
n Datalog notation
n Program built from basic objects called atomic
formulas
n Literals of the form p(a1,a2,…an)
n p is the predicate name
n n is the number of arguments for predicate p
n Interpretations of rules
n Proof-theoretic versus model-theoretic
n Deductive axioms
n Ground axioms
Introduction to Deductive Databases
(cont’d.)

Figure 26.12 Proving a new fact

Introduction to Deductive Databases
(cont’d.)
n Safe program or rule
n Generates a finite set of facts
n Nonrecursive query
n Includes only nonrecursive predicates
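
How an inference engine deduces new facts from rules can be sketched with a naive bottom-up evaluation. The sketch assumes two illustrative rules, superior(X,Y) :- supervise(X,Y) and superior(X,Y) :- supervise(X,Z), superior(Z,Y); the second one is recursive, and the loop still terminates because the rules are safe and the fact set is finite. The predicate names and the sample facts are assumptions made for illustration.

```python
def superior_closure(supervise):
    """Derive all superior(X, Y) facts from supervise(X, Y) facts.

    Rule 1: superior(X, Y) :- supervise(X, Y).
    Rule 2: superior(X, Y) :- supervise(X, Z), superior(Z, Y).
    The rules are applied repeatedly until no new fact is produced.
    """
    superior = set(supervise)  # rule 1
    changed = True
    while changed:
        changed = False
        for (x, z) in supervise:
            for (z2, y) in list(superior):
                if z == z2 and (x, y) not in superior:
                    superior.add((x, y))  # rule 2 produced a new fact
                    changed = True
    return superior


# Hypothetical ground facts: supervise(supervisor, employee).
supervise = {
    ("james", "franklin"),
    ("franklin", "john"),
    ("franklin", "ramesh"),
}

derived = superior_closure(supervise)
print(("james", "john") in derived)
print(len(derived))
```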

Use of Relational Operations

Figure 26.16
Predicates for
illustrating
relational operations

26.6 Summary
n Active databases
n Specify active rules
n Temporal databases
n Involve time concepts
n Spatial databases
n Involve spatial characteristics
n Multimedia databases
n Store images, audio, video, documents, and more
n Deductive databases
n Prolog and Datalog notation

Uploaded by BalaMuruganSamuthira PDF, PPTX 1,612 views

Triggers and active database

AI-enhanced description

This document discusses active databases and database triggers. It defines a trigger as a procedure that is automatically invoked by the
database management system in response to specified changes made to the database. An active database is one that has associated
triggers. Triggers have three parts: an event that activates the trigger, an optional condition, and an action that is executed if the
condition evaluates to true. Triggers allow maintaining database integrity and performing additional actions in response to insert, update,
or delete statements. They can also be used for auditing and logging changes made to the database.


Triggers and active database

1. ACTIVE DATABASE TRIGGER. Presented by [Link] [Link], Dept. of Computer Science, SBK College, Aruppukottai.
2. TRIGGERS AND ACTIVE DATABASE. A trigger is a procedure that is automatically invoked by the DBMS in response to specified changes to the database, and is typically specified by the
DBA. A database that has a set of associated triggers is called an active database. A trigger description contains three parts. Event: a change to the database that activates the
trigger. Condition: a query or test that is run when the trigger is activated.
3. Action: a procedure that is executed when the trigger is activated and its condition is true. The event specification is an INSERT, DELETE, or UPDATE statement. A condition in a trigger can be a
true/false statement (e.g., all employee salaries are less than $100,000) or a query. A query is interpreted as true if the answer set is nonempty and false if the query has no answers. If the condition part
evaluates to true, the action associated with the trigger is executed.
4. A trigger action can examine the answers to the query in the condition part of the trigger, refer to old and new values of tuples modified by the statement activating the trigger, execute new queries, and make changes to the database. An action can even execute a series of data-definition commands (e.g., create new tables, change
authorizations) and transaction-oriented commands (e.g., commit), or call host-language procedures. EXAMPLE OF TRIGGERS. For example, we have to examine the age field of each inserted student record to decide whether to increment the count.
    CREATE TRIGGER init_count BEFORE INSERT ON STUDENT /* Event */
    DECLARE count INTEGER;
    BEGIN /* Action */
        count := 0;
    END
5.  CREATE TRIGGER incr_count AFTER INSERT ON student /* Event */
    WHEN (new.age < 18) /* Condition; 'new' is the just-inserted tuple */
    FOR EACH ROW
    BEGIN /* Action; a procedure in Oracle's PL/SQL syntax */
        count := count + 1;
    END
The triggering event should be defined to occur for each modified record; the FOR EACH ROW clause is used to do this. Such a trigger is called a row-level trigger.
The init_count trigger is executed just once per INSERT statement, regardless of the number of records inserted, because we have omitted the FOR EACH ROW phrase.
6. Such a trigger is called a statement-level trigger. The keyword clause NEW TABLE enables us to give a table name (InsertedTuples) to the set of newly inserted tuples. The FOR EACH
STATEMENT clause specifies a statement-level trigger and can be omitted because it is the default. The definition does not have a WHEN clause; if such a clause is included, it follows the FOR EACH
STATEMENT clause, just before the action specification. The trigger is evaluated once for each SQL statement that inserts tuples into student, and it inserts a single tuple into a table that contains
statistics on modifications.
7. For example,
    CREATE TRIGGER set_count AFTER INSERT ON student /* Event */
    REFERENCING NEW TABLE AS InsertedTuples
    FOR EACH STATEMENT
    INSERT /* Action */
        INTO StatisticsTable (ModifiedTable, ModificationType, Count)
        SELECT 'students', 'insert', COUNT(*)
        FROM InsertedTuples I
        WHERE I.age < 18
DESIGNING ACTIVE DATABASES. Triggers offer a powerful mechanism for dealing with changes to a database.
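
The row-level counting trigger described in this transcript can be tried out with Python's built-in sqlite3 module. This is a hedged sketch, not the PL/SQL from the slides: SQLite supports only FOR EACH ROW triggers, so the counter is kept in a one-row table, and the `statistics` table and its `minors` column are hypothetical names introduced for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (name TEXT, age INTEGER);
CREATE TABLE statistics (minors INTEGER);   -- hypothetical counter table
INSERT INTO statistics VALUES (0);

-- Event: an insert on student. Condition: the new student is under 18.
-- Action: increment the counter. The trigger fires once per inserted row.
CREATE TRIGGER incr_count AFTER INSERT ON student
FOR EACH ROW
WHEN NEW.age < 18
BEGIN
    UPDATE statistics SET minors = minors + 1;
END;
""")

# Three inserted rows, two of which satisfy the trigger condition.
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("Ann", 17), ("Bob", 20), ("Cy", 16)],
)
minors = conn.execute("SELECT minors FROM statistics").fetchone()[0]
print(minors)
```

Because the condition is evaluated per row, the row for the 20-year-old student fires no action, which is the behaviour the WHEN clause is meant to give.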
8. WHY TRIGGERS CAN BE HARD TO UNDERSTAND. The DBMS processes a trigger by evaluating its condition part and then (if the condition evaluates to true) executing its action part. An
important point is that the execution of the action part of a trigger could in turn activate another trigger. In particular, the execution of the action part of a trigger could again activate the same
trigger; such triggers are called recursive triggers.
9. CONSTRAINTS VERSUS TRIGGERS. A constraint also prevents the data from being made inconsistent by any kind of statement, whereas a trigger is activated by a specific kind of
statement (INSERT, DELETE, UPDATE). Again, this restriction makes a constraint easier to understand. On the other hand, triggers allow us to maintain database integrity in more flexible ways, as
the following examples illustrate. Suppose that we have a table called Orders with fields itemid, quantity, customerid, and unitprice. When a customer places an order, the first three field
values are filled in by the user (in this example, a sales clerk).
10. The fourth field's value can be obtained from a table called Items. But it is important to include it in the Orders table to have a complete record of the order, in case the price of the item is
subsequently changed. We can define a trigger to look up this value and include it in the fourth field of a newly inserted record. Continuing with this example, we may want to perform some
additional action when an order is received. For example, if the purchase is being charged to a credit line issued by the company, we may want to check whether the total cost of the purchase is
within the current credit limit. We can use a trigger to do the check; indeed, we can even use a CHECK constraint.
11. OTHER USES OF TRIGGERS. Triggers can generate a log of events to support auditing and security checks. For example, each time a customer places an order, we can create a record with the
customer's ID and current credit limit and insert this record in a customer history table.
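
The order example from slides 9 to 11 (copy the item price into the new order row, and log the customer's credit limit) can be sketched with Python's sqlite3 module. All table and column names below, and the sample rows, are assumptions chosen for illustration; the slides do not give a schema beyond the four order fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (itemid INTEGER PRIMARY KEY, price REAL);
CREATE TABLE customers (customerid INTEGER PRIMARY KEY, creditlimit REAL);
CREATE TABLE orders (itemid INTEGER, quantity INTEGER,
                     customerid INTEGER, unitprice REAL);
CREATE TABLE customer_history (customerid INTEGER, creditlimit REAL);

CREATE TRIGGER order_received AFTER INSERT ON orders
FOR EACH ROW
BEGIN
    -- Fill the fourth field with the price at the time of the order.
    UPDATE orders
       SET unitprice = (SELECT price FROM items
                         WHERE items.itemid = NEW.itemid)
     WHERE rowid = NEW.rowid;
    -- Audit log: record the customer's credit limit for this order.
    INSERT INTO customer_history
    SELECT customerid, creditlimit FROM customers
     WHERE customers.customerid = NEW.customerid;
END;
""")

conn.execute("INSERT INTO items VALUES (1, 9.5)")
conn.execute("INSERT INTO customers VALUES (7, 500.0)")
# The sales clerk fills in only the first three fields.
conn.execute("INSERT INTO orders (itemid, quantity, customerid) VALUES (1, 3, 7)")
# A later price change must not alter the recorded order.
conn.execute("UPDATE items SET price = 12.0 WHERE itemid = 1")

unit_price = conn.execute("SELECT unitprice FROM orders").fetchone()[0]
history = conn.execute(
    "SELECT customerid, creditlimit FROM customer_history"
).fetchall()
print(unit_price, history)
```

Storing the looked-up price in the order row is what keeps the order record complete after the item's price changes.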
12. THANK U

Multimedia Database
Last Updated : 02 Jan, 2025

A multimedia database is a collection of interrelated multimedia data that includes text,
graphics (sketches, drawings), images, animations, video, audio, etc., and typically holds vast amounts
of multisource multimedia data. The framework that manages different types of multimedia
data, which can be stored, delivered, and utilized in different ways, is known as a multimedia
database management system. Multimedia databases fall into three classes:
static media, dynamic media, and dimensional media.

Contents of a multimedia database management system:

1. Media data - The actual data representing an object.

2. Media format data - Information such as sampling rate, resolution, encoding scheme
etc. about the format of the media data after it goes through the acquisition, processing
and encoding phase.
3. Media keyword data - Keyword descriptions relating to the generation of the data. It is also
known as content descriptive data. Example: date, time, and place of recording.
4. Media feature data - Content dependent data such as the distribution of colors, kinds of
texture and different shapes present in data.

Types of multimedia applications, based on data management characteristics:

1. Repository applications - A large amount of multimedia data, as well as metadata
(media format data, media keyword data, media feature data), is stored for
retrieval purposes, e.g., a repository of satellite images, engineering drawings, or radiology
scanned pictures.
2. Presentation applications - They involve delivery of multimedia data subject to
temporal constraints. Optimal viewing or listening requires the DBMS to deliver data at
a certain rate, offering quality of service above a certain threshold. Here data is
processed as it is delivered. Example: annotation of video and audio data, real-time
editing analysis.
3. Collaborative work using multimedia information - It involves executing a complex
task by merging drawings and changing notifications. Example: an intelligent healthcare
network.

There are still many challenges to multimedia databases, some of which are:

1. Modelling - Work in this area can draw on both database and information retrieval
techniques. Documents constitute a specialized area and deserve special
consideration.
2. Design - The conceptual, logical, and physical design of multimedia databases has not
yet been addressed fully, as performance and tuning issues at each level are far more
complex: the data comes in a variety of formats such as JPEG, GIF, PNG, and MPEG, which are not
easy to convert from one form to another.
3. Storage - Storing a multimedia database on any standard disk presents problems
of representation, compression, mapping to device hierarchies, archiving, and buffering
during input-output operations. In a DBMS, a "BLOB" (Binary Large Object) facility allows
untyped bitmaps to be stored and retrieved.
4. Performance - For an application involving video playback or audio-video
synchronization, physical limitations dominate. The use of parallel processing may
alleviate some problems, but such techniques are not yet fully developed. Apart from
this, multimedia databases consume a lot of processing time as well as bandwidth.
5. Queries and retrieval - For multimedia data such as images, video, and audio, accessing data
through queries opens up many issues, such as efficient query formulation, query execution,
and optimization, which still need to be worked on.

Areas where multimedia database is applied are :

Documents and record management : Industries and businesses that keep detailed
records and variety of documents. Example: Insurance claim record.
Knowledge dissemination : Multimedia database is a very effective tool for knowledge
dissemination in terms of providing several resources. Example: Electronic books.
Education and training : Computer-aided learning materials can be designed using
multimedia sources which are nowadays very popular sources of learning. Example:
Digital libraries.
Marketing, advertising, retailing, entertainment and travel. Example: a virtual tour of
cities.
Real-time control and monitoring : Coupled with active database technology,
multimedia presentation of information can be a very effective means of monitoring and
controlling complex tasks. Example: Manufacturing operation control.

Several issues must be addressed if multimedia data are to be stored in a database:

The database must support large objects, since multimedia data such as videos can
occupy up to a few gigabytes of storage. Many database systems do not support
objects larger than a few gigabytes. Larger objects could be split into smaller pieces
and stored in the database. Alternatively, the multimedia object may be stored in a file
system, but the database may contain a pointer to the object; the pointer would typically
be a file name. The SQL/MED standard (MED stands for Management of External Data)
allows external data, such as files, to be treated as if they are part of the database.
With SQL/MED, the object would appear to be part of the database, but can be stored
externally.
The retrieval of some types of data, such as audio and video, has the requirement that
data delivery must proceed at a guaranteed steady rate. Such data are sometimes
called isochronous data, or continuous-media data. For example, if audio data are not
supplied in time, there will be gaps in the sound. If the data are supplied too fast,
system buffers may overflow, resulting in loss of data.
Similarity-based retrieval is needed in many multimedia database applications. For
example, in a database that stores fingerprint images, a query fingerprint image is
provided, and fingerprints in the database that are similar to the query fingerprint must
be retrieved. Index structures such as B+-trees and R-trees cannot be used for this
purpose; special index structures need to be created.
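
Similarity-based retrieval can be sketched as ranking stored feature vectors by their similarity to a query vector. The sketch below assumes each image has already been reduced to a small color-distribution vector (media feature data) and uses cosine similarity with a linear scan; the function names, vectors, and image names are hypothetical, and a real system would use a specialized index instead of scanning.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, database, k=1):
    """Return the names of the k stored items most similar to `query`."""
    ranked = sorted(
        database,
        key=lambda name: cosine_similarity(query, database[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-bin color histograms (red, green, blue fractions).
images = {
    "sunset": [0.7, 0.2, 0.1],
    "forest": [0.1, 0.8, 0.1],
    "ocean": [0.1, 0.2, 0.7],
}

print(most_similar([0.6, 0.3, 0.1], images, k=2))
```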

Reference -

Multimedia database - Wikipedia

Comment H Himanshi_Singh 34


Introduction To Temporal Database

Last Updated : 23 Jul, 2025

Pre-requisites: Database

A temporal database is a database that needs some aspect of time for the organization of
information. In a temporal database, each tuple in a relation is associated with time. It
stores information about the states of the real world over time. A conventional (non-temporal) database
does not store information about past states; it only stores information about the current state.
Whenever the state of such a database changes, the information in the database is
overwritten. In many fields, it is necessary to store information about past states. For
example, a stock database must store past stock prices for analysis.
Without temporal support, historical information has to be modeled manually in the schema.

There are various terminologies in the temporal database:

Valid Time: The valid time is a time in which the facts are true with respect to the real
world.
Transaction Time: The transaction time of a fact is the time period during which the fact is
stored as current in the database.
Decision Time: Decision time in the temporal database is the time at which the decision
is made about the fact.

Temporal databases are often built on top of a relational database. But plain relational databases have
some shortcomings for temporal data: they do not provide built-in support for complex temporal
operations, and their query languages offer poor support for temporal queries.

Applications of Temporal Databases

1. Finance: It is used to maintain the stock price histories.
2. Factory monitoring: It can be used in a factory monitoring system for storing information about current and
past readings of sensors in the factory.
3. Healthcare: The histories of the patient need to be maintained for giving the right
treatment.
4. Banking: For maintaining the credit histories of the user.

Examples of Temporal Databases

1. An EMPLOYEE table records the department that each employee is assigned to.
If an employee is transferred to another department at some point in time, this can be
tracked if the EMPLOYEE table is an application-time period table that assigns the
appropriate time period to each department he/she works for.

Temporal Relation

A temporal relation is defined as a relation in which each tuple in a table of the database is
associated with time, the time can be either transaction time or valid time.

Types of Temporal Relation

There are mainly three types of temporal relations:

1. Uni-Temporal Relation: The relation which is associated with valid or transaction time is
called Uni-Temporal relation. It is related to only one time.

2. Bi-Temporal Relation: The relation which is associated with both valid time and
transaction time is called a Bi-Temporal relation. Valid time has two parts namely start time
and end time, similar in the case of transaction time.

3. Tri-Temporal Relation: The relation which is associated with three aspects of time,
namely valid time, transaction time, and decision time, is called a Tri-Temporal relation.
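
A bi-temporal relation can be sketched as rows that carry two periods each. The example below is an illustration under assumptions: the `Version` class, the `as_of` helper, the half-open period convention, and the sample employee are all hypothetical. It shows a retroactive correction: the database first believed Ann stayed in Sales, and later recorded that she had moved to R&D on 1 February.

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max  # assumption: date.max marks an open-ended period


@dataclass(frozen=True)
class Version:
    emp: str
    dept: str
    valid_from: date  # valid time: when the fact holds in the real world
    valid_to: date
    tx_from: date     # transaction time: when the row is current in the DB
    tx_to: date


def as_of(rows, emp, valid_day, tx_day):
    """What did the database believe on tx_day about emp's dept on valid_day?"""
    for r in rows:
        if (r.emp == emp
                and r.valid_from <= valid_day < r.valid_to
                and r.tx_from <= tx_day < r.tx_to):
            return r.dept
    return None


rows = [
    # Recorded on 10 Jan; superseded on 5 Mar by the correction below.
    Version("Ann", "Sales", date(2024, 1, 1), FOREVER,
            date(2024, 1, 10), date(2024, 3, 5)),
    # Recorded on 5 Mar: Sales only until 1 Feb, R&D from then on.
    Version("Ann", "Sales", date(2024, 1, 1), date(2024, 2, 1),
            date(2024, 3, 5), FOREVER),
    Version("Ann", "R&D", date(2024, 2, 1), FOREVER,
            date(2024, 3, 5), FOREVER),
]

print(as_of(rows, "Ann", date(2024, 2, 15), date(2024, 2, 20)))
print(as_of(rows, "Ann", date(2024, 2, 15), date(2024, 4, 1)))
```

The two queries ask about the same real-world day but different database states, which is exactly what a uni-temporal relation cannot express.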

Features of Temporal Databases

The temporal database provides built-in support for the time dimension.
Temporal database stores data related to the time aspects.
A temporal database contains historical data in addition to current data.
It provides a uniform way to deal with historical data.

Challenges of Temporal Databases

1. Data Storage: In temporal databases, each version of the data needs to be stored
separately. As a result, storing the data in temporal databases requires more storage as
compared to storing data in non-temporal databases.

2. Schema Design: The temporal database schema must accommodate the time
dimension. Creating such a schema is more difficult than creating a schema for non-
temporal databases.

3. Query Processing: Processing the query in temporal databases is slower than

processing the query in non-temporal databases due to the additional complexity of
managing temporal data.

Comment M meher… Follow 6


1. Introduction to information retrieval

Prof. Dr. Goran Glavaš
(Partially based on slides from Laura Dietz and Jan Šnajder)

Data and Web Science Group

Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik
Universität Mannheim

CreativeCommons Attribution-NonCommercial-ShareAlike 4.0 International

11.2.2019.
After this lecture, you’ll...
2

 Understand the basic concepts in information retrieval

 Know how to represent and preprocess text for IR
 Understand the general formalization of IR models

 Know what this course is about and be glad you’ve enrolled in it

 Know which topics we will cover
 Hopefully be intrigued by some of the topics
 Know what your part of the job is to earn credits

IR & WS, Lecture 1: Introduction to Information Retrieval 11.2.2019.

Outline
3

 Part one
 What is information retrieval?
 Text representations and preprocessing
 General information retrieval model

 Part two
 About the course (IE 663 + IE 681)
 Topics
 Organization


„Retrieval” and „search”
5

 What is your first association to „information retrieval”?

 What is your first association to „search” (or „search engine”)?


Retrieval and search
6


What is information retrieval?
7

 Information retrieval is the activity of obtaining information resources relevant
for a user’s information need from a collection of information resources.

 Elements of an information retrieval process:

1. Information needs (users express them in the form of queries)
2. Information (re)sources, most often unstructured (text, images, video, audio, etc.)
3. A system/method/model for identifying (re)sources relevant for a given
information need (usually from a large collection of information resources)


Information needs
8

 Information need is an individual’s or group’s desire to locate and obtain
information to satisfy a conscious or unconscious need
 I.e., needs and interests that call for information

 Information needs (conscious or unconscious) are expressed as queries

 When retrieving texts, queries are words or phrases (e.g., „Olympics in London”)
 In image retrieval queries can also be images


Why text information retrieval?
9

 Because of large repositories of unstructured information sources

 Companies – technical documentation, business documents, contracts, ...
 Governments – documentation, regulation, laws, ...
 Science – publications (e.g., Google Scholar)
 Personal collections – books, emails, files

 World Wide Web – the largest document collection of all

 Additional challenges due to sheer scale

Why text information retrieval?

 Unstructured sources (text) vs. structured sources (databases)

[Figure: comparison of unstructured and structured sources, 1996 and 2009]

Text information retrieval

 This course is about retrieval of text, where models differ in:

 Representations of documents and queries
 Methods for determining (degree of) relevance of a document for a given query

 In most IR models relevance is expressed as a score and not a binary decision

 Documents are ranked in decreasing order according to assigned relevance scores
 Relevance scores usually incorporate an element of uncertainty

Outline

 Part one
 What is information retrieval?
 Text representations and preprocessing
 General information retrieval model

 Part two
 About the course
 Topics
 Organization

Text representations in IR

1. Unstructured representation
 Text represented as an unordered set of terms (the so-called bag of words)
 Considerable oversimplification
 We are ignoring the syntax, semantics, and pragmatics of text
 Is this problematic?

Q: „Revenue of Apple”
D: „Apple Pencil 2 'to launch in March 2017‘... Microsoft faces drop in revenue in the 3rd quarter...”

 Despite oversimplifying, BoW representations yield good IR performance

 BoW is de facto standard IR representation

 Due to simplicity and speed

Text representations in IR

2. Weakly-structured representations
 Certain groups of terms given more importance – e.g., nouns or named entities
 Other terms’ contribution is either downscaled or completely ignored
 Some natural language processing (NLP) tools required
 Part-of-speech (POS) tagger to identify nouns or named entity recognizer (NER) to
identify named entities
 Additional preprocessing can be costly
3. Structured representations
 For example, graphs in which nodes represent some terms/concepts and edges
semantic relations between them
 Sophisticated information extraction (IE) and NLP tools needed to induce structure
 IE models typically not accurate enough and time-costly
 Structured representations are virtually not used in IR

Text representations in IR

 Document snippet
„One evening Frodo and Sam were walking together in the cool twilight. Both of
them felt restless again. On Frodo suddenly the shadow of parting had falling: the
time to leave Lothlorien was near. ”

 Unstructured (bag-of-words) representation

{(One, 1), (evening, 1), (Frodo, 2), (and, 1), (Sam, 1), (were, 1), (walking, 1),
(together, 1), (in, 1), (the, 3), (cool, 1), (twilight, 1), (Both, 1), (of, 2), (them, 1), (felt,
1), (restless, 1), (again, 1), (On, 1), (suddenly, 1), (shadow, 1), (parting, 1), (had, 1),
(falling, 1), (time, 1), (to, 1), (leave, 1), (Lothlorien, 1), (was, 1), (near, 1)}
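
A bag-of-words representation like the one above can be computed in a few lines. This is a minimal sketch, assuming a simple tokenization (split on whitespace, strip surrounding punctuation, keep case); the function name `bag_of_words` is illustrative and not part of the lecture.

```python
from collections import Counter

# The document snippet exactly as quoted on the slide.
snippet = (
    "One evening Frodo and Sam were walking together in the cool twilight. "
    "Both of them felt restless again. On Frodo suddenly the shadow of parting "
    "had falling: the time to leave Lothlorien was near."
)


def bag_of_words(text):
    """Count term occurrences; word order is discarded, case is kept."""
    # Assumption: whitespace splitting plus stripping of surrounding punctuation.
    tokens = [chunk.strip(".,:;!?\"'") for chunk in text.split()]
    return Counter(token for token in tokens if token)


bow = bag_of_words(snippet)
print(bow["Frodo"], bow["the"], bow["of"])  # 2 3 2
```

Because only counts are kept, any reordering of the same words yields the same representation, which is exactly the oversimplification discussed earlier.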

Text representations in IR

 Weakly-structured representations
 Bag of nouns
{(evening, 1), (Frodo, 2), (Sam, 1), (twilight, 1), (shadow, 1), (parting, 1), (time, 1),
(Lothlorien, 1)}

 Bag of named entities

{(Frodo, 2), (Sam, 1), (Lothlorien, 1)}

Text representations in IR

„One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless
again. On Frodo suddenly the shadow of parting had falling: the time to leave Lothlorien was near. ”
 Structured representation
 For example, event-based structure

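[Figure: event graph linking the events „walking”, „felt restless”, „had falling” and „leave” through temporal relations („Same time”, „Before”), with participants Frodo, Sam, shadow and Lothlorien]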

 Building such structure requires sophisticated natural language processing tools

 Structured document representations have not been shown beneficial for IR

Text preprocessing

 So, in IR, we most often use unstructured text representations

 Text is represented as unordered set of terms (i.e., bag of words)
 However, many details about the exact representation are still undefined
 How do we „split” text into terms? Can this be done in more than one way?
 Do we consider all terms, or do we want to eliminate some?
 E.g., functional words that have little meaning like articles and prepositions?
 How do we treat different forms of the same word?
 E.g., should „house” be treated the same as „houses”? What about „housing”?
 What about synonyms or same concepts in different languages?
 On a more technical side: what about different document formats?

Text preprocessing

 The preprocessing (i.e., preparing text for the retrieval process) usually involves
the following steps:
1. Extracting pure textual content (e.g., from HTML, PDF, Word)
2. Language detection
 Optional – if you’re dealing with multilingual document collections
3. Tokenization (separating text into character sequences)
4. Morphological normalization (lemmatization or stemming)
5. Stopword removal

 After preprocessing, the text (i.e., the document) is ready to be indexed

 More on indexing in the upcoming lectures

Tokens and terms

 Word is a delimited string of characters as it appears in the text

 Term is a normalized form of the word (accounting for morphology, spelling, etc.)
 Word and term are in the same equivalence class – in informal speech they are often
used interchangeably
 Token is an instance of a word or term occurring in a document
 Tokens are „words” in the general sense
 But numbers, punctuation, and special characters are also tokens
 Tokenization is a process, typically automated, of breaking down the text (one
long string) into a sequence of tokens (shorter strings)

Tokenization

 Two types of methods for tokenization

 Rule-based (i.e., heuristic)
 Based on supervised machine learning models
 Learn from manually tokenized texts
 Tokenization might seem simple, but it’s not always unambiguous
 E.g., a simple rule: split string on all whitespaces
 „Hewlett-Packard declared losses” -> „Hewlett-Packard”, „declared”, „losses”
 Would we want to split „Hewlett” from „Packard”? What about „lower-case”?
 What about „Denmark’s mountains”: „Denmark” and „’s”, or „Denmarks”, or „Denmark”?
 What about tokenizing numbers and punctuation?
 „19/1/2017”, „55 B.C.”, „+49 176 832 40 332”, „IP: [Link]”
 Sometimes spaces are not an indication of an end of a token
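
To make the ambiguity concrete, here is a small sketch comparing two rule-based tokenizers on the examples above. The regular expression is one possible heuristic chosen for illustration (it keeps hyphenated words and apostrophe suffixes together); it is not the rule set used in the course.

```python
import re

text = "Hewlett-Packard declared losses in Denmark's mountains."

# Rule 1: split on whitespace only; punctuation stays glued to the words.
whitespace_tokens = text.split()

# Rule 2 (an assumed heuristic): keep hyphenated words and apostrophe
# suffixes as one token, and emit other punctuation as separate tokens.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(whitespace_tokens)
print(regex_tokens)
```

The two rules disagree on the final token („mountains.” versus „mountains” followed by „.”), and a different pattern would be needed to split „Denmark” from „'s”.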

Issues in tokenization

 What about different languages?

 German has numerous compounds
 „Lebensversicherungsgesellschaftsangestellter” (life insurance company employee)
 Is this a single token or 4 tokens?
 IR systems for German texts greatly benefit from a compound splitting module
 How about languages that don’t segment text using whitespaces at all?
 E.g., Chinese
 „莎拉波娃现在居住在美国东南部的佛罗里达”

Normalization

 Normalization or standardization can involve various changes to the token

 Error/spelling correction – repairing the incorrect word
 Case-folding – converting all letters to lower case
 „Morgen will ich in MIT” – is this German preposition „mit”?
 Often best to lower case everything (queries and documents)

 How does Google do it?

 „C.A.T.” (information need: Caterpillar Inc.)
returns cat (animal) as the first result

 Morphological normalization
 Reducing different forms of the „same” word to a common representative form

Morphological normalization

 Inflectional normalization (or lemmatization) reduces all lexico-syntactic forms of the same word to one standard form, the lemma (dictionary headword form)
 Nouns: singular form in „nominative” case
 Verbs: infinitive form
 E.g., „houses” -> „house”, „tried” -> „try”
 Derivational normalization reduces all words syntactically derived from some
word to the original word (even if the derived word has different meaning)
 Derivational operators often change the part-of-speech of the word
 E.g., „destruction” -> „destroy”

 Most IR systems perform inflectional but not derivational normalization

Stemming

 Lemmatization reduces words to dictionary headword entries

 I.e., the resulting lemma is a string that is again a valid word in the language
 Stemming is the procedure of reducing the word to its grammatical (morpho-
syntactic) root
 The result of stemming is not necessarily a valid word of the language
 E.g., „recognized” -> „recogniz”, „incredibly” -> „incredibl”
 Stemming removes suffixes with heuristics
 E.g., „automates”, „automatic”, „automation” will all be reduced to „automat”
 Stemming is „more aggressive” than lemmatization and „less aggressive” than
derivational normalization
 „More aggressive” means more different words are normalized to the same form
 Stemming is more frequently used in IR systems than lemmatization

Porter’s algorithm

 Most common algorithm for English stemming

 Rule-based algorithm
 Grammatical conventions and 5 phases of reduction
 Phases are executed sequentially, one at a time
 Each phase consists of a set of concurrent suffix-trimming rules
 If multiple rules apply, use the one that removes the longest suffix

 More on Porter’s stemmer:

 [Link]

 Similar algorithms have been developed for other languages as well

Porter’s algorithm

 Examples of rules
 „-ing” -> „”
 „ly” -> „”
 „sses” -> „ss”
 „ational” -> „ate”
 „tional” -> „tion”
 Rules are sensitive to the measure of „how much of a word” a string is
 Rules consider sequences of consonants and vowels, e.g., [C](VC)^m[V]
 Rules also often take into account the length of the remaining „root”
 E.g., „ement” -> „” is valid only if the remaining word has more than one syllable
 „replacement” -> „replac” but „cement” -> „cement”
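
The „ement” example can be sketched in code. This is a simplified illustration of a single rule, not the full five-phase algorithm: `measure` counts the vowel-to-consonant transitions in the remaining stem (the m in [C](VC)^m[V]), and the suffix is removed only if m > 1. The function names are illustrative.

```python
def is_consonant(word, i):
    """Return True if word[i] acts as a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a vowel when it follows a consonant, otherwise a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-run -> consonant-run transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m


def strip_ement(word):
    """Apply only the rule 'ement' -> '' under the condition m > 1."""
    suffix = "ement"
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 1:
            return stem
    return word


print(strip_ement("replacement"))  # replac
print(strip_ement("cement"))       # cement
```

For „replacement” the stem „replac” has m = 2, so the suffix is removed; for „cement” the stem „c” has m = 0, so the word is left unchanged.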

Expansion vs. normalization

 An alternative to normalization is the expansion of the query words

 I.e., We search for alternative forms of the query word as well
 Example
 Query: window Search: window, windows
 Query: windows Search: Windows, windows, window
 Query: Windows Search: Windows
 Theoretically more powerful (no need for imperfect normalization)
 In practice less efficient as we need to index all words we will be looking for
 Some languages are highly inflectional and one word can have many different forms
 E.g., Finnish can have up to 14 different case forms for nouns
 omena (apple) -> omenan, omenaa, omenaan, omenat, omenien, omenoiden,
omenojen, omenain, omenia, omenoita, omenoja, omeniin, omenoihin

Stopword removal

 Stopwords are semantically poor terms such as articles, prepositions, conjunctions, pronouns, etc.
 Removal of stopwords is one of the most common steps of IR text preprocessing
 Q: Why would we want to remove the stopwords?
 A: Because stopwords have very little meaning, they do not determine whether a
document is relevant or not
 A: Removing stopwords reduces the size of vocabulary (and index) and makes
retrieval process more efficient
 A: Including stopwords may lead to false positives because of stopword matches
between query and documents
 Stopword lists for a number of languages:
 [Link]
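
A minimal sketch of stopword removal, assuming the text has already been tokenized. The stopword list here is a tiny hypothetical one made up for the example; real systems use much longer, language-specific lists.

```python
# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "to", "was", "were"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lower-cased form is in the stopword list."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = "On Frodo suddenly the shadow of parting".split()
print(remove_stopwords(tokens))  # ['Frodo', 'suddenly', 'shadow', 'parting']
```

Fewer distinct terms remain after filtering, which is what shrinks the vocabulary and the index.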

Outline

 Part one
 What is information retrieval?
 Text representations and preprocessing
 General information retrieval model

 Part two
 About the course
 Topics
 Organization

General information retrieval model

 We’ve seen what information retrieval is and how to preprocess text

 Now, let’s formalize the general information retrieval model
 Consider this as a „placeholder” for all concrete IR models we will cover later

 Each functional retrieval system implements the following three components

1. Representation of a raw query text
 To be used for matching against documents in the collection
2. Representation of a raw document text
 To be used for matching against the query
 May or may not be the same representation as the one used for query
3. A function for determining the relevance of documents for the query
 Taking as input document and query representations – (1) and (2)

General information retrieval model

 Formally, a general retrieval model is a triple of functions (fd, fq, r):

1. fd is a function that maps documents (raw text) to their representation for
retrieval, i.e., fd(d) = pd, where pd is the retrieval representation of the document d;

2. fq is a function that maps queries (raw text) to their representation for retrieval,
i.e., fq(q) = sq, where sq is the retrieval representation of the query q;
 Depending on the IR model, fd and fq may or may not be the same function

3. r is a ranking function which computes a real number indicating the potential relevance of document d for query q, using representations pd and sq:

rel(d,q) = r(fd(d), fq(q)) = r(pd, sq)
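
The triple of functions can be sketched directly in code. The concrete choices below are assumptions made only for illustration: documents are represented as term counts, queries as term sets, and the ranking function sums the counts of the query terms. The names `f_d`, `f_q`, `r` and `rel` mirror the symbols of the formal definition.

```python
from collections import Counter


def f_d(document):
    """Map raw document text to its retrieval representation (term counts)."""
    return Counter(document.lower().split())


def f_q(query):
    """Map raw query text to its retrieval representation (set of terms)."""
    return set(query.lower().split())


def r(p_d, s_q):
    """Toy ranking function: total count of query terms in the document."""
    return float(sum(p_d[term] for term in s_q))


def rel(document, query):
    """rel(d, q) = r(f_d(d), f_q(q))."""
    return r(f_d(document), f_q(query))


docs = [
    "apple reports record revenue",
    "microsoft revenue drops",
    "frodo leaves lothlorien",
]
query = "revenue of apple"
ranked = sorted(docs, key=lambda d: rel(d, query), reverse=True)
print([(d, rel(d, query)) for d in ranked])
```

Here `f_d` and `f_q` are different functions, which the definition explicitly allows; swapping in other representations or another `r` yields a different IR model with the same overall shape.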

Index terms and term weights

 Index terms are all terms in the collection (i.e., the vocabulary)
 Except those we ignore in preprocessing (like stopwords)
 The set of all index terms: K = {k1, k2, ..., kt}
 Each term ki is, for each document dj, assigned a weight wij
 The weight of the index terms not appearing in the document is 0
 Document dj is represented by term vector [w1j, w2j, ..., wtj] where t is the number
of index terms
 Let g be the function that computes the weights, i.e., wij = g(ki, dj)
 Different choices for the weight-computation function g and the ranking function
r define different IR models
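
As a small worked example of the notation, the sketch below builds the index term set K and one weight vector per document. The weighting function `g` is assumed here to be the raw term frequency, which is just one possible choice; the toy documents are made up.

```python
# Toy collection: document id -> preprocessed tokens (made-up data).
docs = {
    "d1": ["frodo", "sam", "frodo"],
    "d2": ["sam", "lothlorien"],
}

# K: the set of all index terms, in a fixed (sorted) order.
K = sorted({term for tokens in docs.values() for term in tokens})


def g(term, tokens):
    """Assumed weighting function: raw term frequency (0 if the term is absent)."""
    return tokens.count(term)


# Each document d_j becomes the vector [w_1j, ..., w_tj] with t = len(K).
vectors = {name: [g(k, tokens) for k in K] for name, tokens in docs.items()}

print(K)        # ['frodo', 'lothlorien', 'sam']
print(vectors)  # {'d1': [2, 0, 1], 'd2': [0, 1, 1]}
```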

IR paradigms

 Information retrieval models roughly fall into the following paradigms:

1. Set theoretic models
 Boolean model
 Extended Boolean model
2. Algebraic models
 Vector space model
 Latent models
 Latent semantic indexing (LSI), Random indexing, Topic modelling for IR
3. Probabilistic retrieval
 Classic probabilistic retrieval: Binary independence model, BM11, BM25
 Language models for IR
4. Semantic ad-hoc retrieval
 Embedding models

IR paradigms

 Different models are used in Web search

 Due to sheer size of the Web
 Because users have no control over the content of the collection
 Q: What is the problem if only content is considered for relevance?
 A: Easy to create spam documents that would be very relevant for certain queries
 Ranking algorithms also exploit the linked structure of the Web
 PageRank
 HITS
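
To give a flavour of link-based ranking, here is a minimal sketch of the basic PageRank iteration on a made-up three-page graph. The damping factor of 0.85 and the fixed number of iterations are common textbook choices assumed for the example, and every link target is assumed to also appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict: page -> list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Made-up graph: a and b both link to c, c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = pagerank(links)
print(sorted(scores, key=scores.get, reverse=True))  # ['c', 'a', 'b']
```

Page c ranks highest because two pages link to it, regardless of what any of the pages contain; that independence from content is what makes pure keyword spam less effective.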

Outline

 Part one
 What is information retrieval?
 Text representations and preprocessing
 General information retrieval model

 Part two
 About the course
 Topics
 Organization

Course description

 Q: Why this course?

 A: Because large collections of unstructured documents from which we retrieve
information are all around
 A: Because there are many efficient models to retrieve information, some more
suitable than others in different settings
 A: Because as information workers and data scientists you are likely to sooner or
later have to design/implement a system that retrieves some information from
unstructured data collections
 Course purpose
 Provide a systematic overview of both traditional and advanced methods for text
retrieval and web search

Course description

 Target audience are students who want to

 Gain theoretical understanding of basic and advanced information retrieval models
 Obtain practical (hands-on) experience implementing IR & WS techniques
 Prerequisites
 Fundamental knowledge of
 Linear algebra
 Probability theory
 Algorithms and data structures
 For IE 681: Programming skills in a higher-level programming language
 E.g., Java, Python, C#, C++
 Necessary for homeworks and project
 Helpful, but not necessary:
 Knowledge of natural language processing
 Knowledge of machine learning

Course description

 What this course covers

 Basic theoretical concepts in information retrieval
 Several traditional information retrieval models
 Some advanced/recent IR models and techniques
 IR evaluation
 Web search and web ranking algorithms

Course description

 What this course doesn’t cover

 Natural language processing / Computational linguistics
 We’ll cover only as much as needed for IR, but won’t go into much depth
 Machine learning
 We’ll cover basics needed for IR, but won’t explain the inner workings of ML algorithms
 Multimedia retrieval (search for images, video, audio)
 Out of focus, we are interested primarily in text

Textbooks

 C. D. Manning, P. Raghavan and H. Schütze, Introduction to

Information Retrieval, Cambridge University Press, 2008 (available
at [Link]

 B. Croft, D. Metzler, T. Strohman, Search Engines: Information

Retrieval in Practice, Addison-Wesley, 2009 (available
at [Link] ).

 R. Baeza-Yates, B. Ribeiro-Neto, Modern Information

Retrieval, Addison-Wesley, 2011 (2nd Edition).

Course content and schedule

 Lecture 01: Introduction to Information Retrieval (Feb 11)

 Lecture 02: Boolean Retrieval and Term Indexing (Feb 18)
 Lecture 03: Data Structures in IR and Tolerant Retrieval (Feb 25)
 Lecture 04: Term Weighting and Vector Space Model (Mar 4)
 Lecture 05: Probabilistic Information Retrieval (Mar 18)
 Lecture 06: Language Modelling for Information Retrieval (Mar 25)
 Lecture 07: Relevance Feedback and Query Expansion (Apr 1)
 Project coaching: Apr 8
 Easter break: Apr 15 and Apr 22
 Lecture 08: Latent and Semantic Information Retrieval Models (Apr 29)
 Lecture 09: Classification, Clustering, Learning to Rank, Evaluation (May 6)
 Project coaching: May 13
 Lecture 10: Web Search and Link Analysis (May 20)
 Project presentations: May 27

Examination and grading

 IE 663: Final exam

 Exam will assess both theoretical and practical knowledge
 Preparation for the exam:
 Exercises
 50% of points necessary to pass the course

 IE 681: IR Team Project

 Practical IR problems to be solved
 Done in groups of 3 students
 Expected output:
1. Program code (i.e., software)
2. Written project report
3. Oral presentation of the project

Team project

 Examples of project topics

 Implement a prominent IR model, (build a test collection), and evaluate performance
 Implement indexing techniques and evaluate efficiency on several collections
 Implement link analysis algorithms over a baseline IR model and evaluate performance
 Evaluation
1. Quality of the implementation (i.e., does it work, how stable it is, code quality)
2. Written project report (5-10 pages)
3. Presentation (clarity, style, ...)
 Points (max. 50) assigned to the group
 Group members propose the distribution of the points among themselves
 Example: we assign 72 points to a group of 3 students, students then propose how to
distribute 3*72 = 216 points among themselves
 A single student cannot be assigned more than 100 points
 All students should contribute – we will check!
 Our final decision on project points can differ from the distribution proposed by the group

Team project

 Tentative schedule
 Topics published: approx. March 1
 Topics selected and confirmed: approx. March 15

 Project coaching:
 Two sessions, on April 8 and May 13
 We check the progress of your projects
 Help you resolve dilemmas and problems you might be facing

 Project presentations: May 27

 Present what you did: methods/models, implementation, evaluation
 10-15 minutes per team
 All team members should present and clearly state what their contribution was
 We will ask questions to all team members

Communication

 This course is powered by the Data and Web Science (DWS) group
 Your IR & WS teachers
 Prof. Dr. Goran Glavaš (lecturer)
 Robert Litschko (teaching assistant)
 Office hours (Goran)
 Fridays at 15:00 (in lecture weeks only)
 B6 29, building C, Room C1.02
 Visits should be previously announced via email
 E-mail communication
 Only for really urgent matters, otherwise come in office hours
 If you’re wondering whether your matter is urgent or not, it probably isn’t
 All relevant information will be posted timely in ILIAS

Is this course hard?

 To an extent, this depends

 On your previous knowledge (linear algebra, probability theory, NLP, ML, ...)
 On your programming skills

 But primarily this depends on

 Your interest in the IR & WS topics
 Your enthusiasm and willingness to learn new stuff
 The amount of time and effort you invest into this course

 This course is 6 (3+3) ECTS credits

 One credit should amount to 25-30 hours of your time
 Our job is to make sure that this is the amount of effort you put in the course

Now you...

 Understand the basic concepts in information retrieval

 Know how to represent and preprocess text for IR
 Understand the general formalization of IR models

 Know what this course is about and are glad you’ve enrolled in it

 Know which topics we will cover
 Are hopefully intrigued by some of the topics
 Know what your part of the job is to earn credits

Can I pass this course?

What is Information Retrieval?

Last Updated : 15 Jul, 2025

Information Retrieval (IR) is the task of finding relevant information in large collections of
documents. An IR system can be defined as a software program that deals with the organization,
storage, retrieval and evaluation of information from documents. It is like a smart
librarian who doesn’t give you direct answers but tells you where to find the right book: the
IR system scans the documents and pulls out the ones that match your query.

When you search for something, an Information Retrieval (IR) model finds the most
relevant documents and ranks them for your query. It works by comparing your query
with the documents in the system using a matching function. This function gives each
document a retrieval status value (RSV), which is used to rank the most relevant results first.
To do this, IR systems represent documents using descriptors, i.e., the most important keywords
from a vocabulary (V).

Estimation of the probability of the user’s relevance judgement rel for each document d and query q
with respect to a set R_q of training documents: Prob(rel | d, q, R_q)

Components of Information Retrieval/ IR Model

The Information Retrieval (IR) model can be broken down into key components that
involve both the system and the user. Here’s how it works in a simple flow:

1. User Side (Search Process)

Problem Identification: A student wants to learn about machine learning and types a
query into a search engine.
Representation: The user converts the need into a search query made of keywords or
phrases. For example, instead of asking "How do machines learn?" the student types
"machine learning basics" into Google.
Query: The user submits the search query to the IR system.
Feedback: The user can refine or modify the search based on the retrieved results.

2. System Side (Retrieval Process)

Acquisition: The system collects and stores a large number of documents or data
sources. These can include web pages, books, research papers or any other text-based
information.
Representation: Each document in the system is analyzed and represented in a
structured way using keywords (terms). Example: If the document talks about "machine
learning" it is tagged with relevant terms like "AI, deep learning, algorithms, models" to
help retrieval.
File Organization: The documents are indexed and stored efficiently so the system can
quickly find relevant ones. Like organizing a library so books can be found easily based
on topics.
Matching: The system compares the user's search query with stored documents to
find the best matches. It uses matching functions that rank documents based on
relevance.
Retrieved Object: The system returns the most relevant documents to the user. These
documents are ranked so the most useful ones appear at the top.

3. Interaction Between User & System

The user reviews the retrieved results and may provide feedback to refine the search.
The system then processes the updated query and retrieves better results.

Acquisition: In this step, documents and other objects are selected from various web
resources that consist of text-based documents. The required data is
collected by web crawlers and stored in the database.
Representation: This consists of indexing with free-text terms or a controlled
vocabulary, using manual as well as automatic techniques. Examples: abstracting, which
involves summarizing, and bibliographic description, which records author, title, sources,
data and metadata.
File Organization: There are two types of file organization methods: sequential, which
stores the data document by document, and inverted, which stores a list of records under
each term.
Query: An IR process starts when a user enters a query into the system. Queries are
formal statements of information needs. For example, search strings in web search
engines. In IR a query does not uniquely identify a single object in the collection. Instead
several objects may match the query, perhaps with different degrees of relevancy.
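
The inverted file organization mentioned above can be sketched as a mapping from each term to the set of documents containing it. This is a minimal illustration on made-up documents, assuming whitespace tokenization and lower-casing; the function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Made-up collection for illustration.
docs = {
    1: "machine learning basics",
    2: "deep learning models",
    3: "library books",
}
index = build_inverted_index(docs)

print(sorted(index["learning"]))                     # [1, 2]
print(sorted(index["machine"] & index["learning"]))  # [1]
```

Looking up a query term touches only the list stored under that term, instead of scanning every document as a sequential organization would.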

Difference Between Information Retrieval and Data Retrieval

| Information Retrieval | Data Retrieval |
|---|---|
| The software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. | Deals with obtaining data from a database management system such as an ODBMS. It is a process of identifying and retrieving data from the database based on the query provided by a user or application. |
| Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data. |
| Small errors are likely to go unnoticed. | A single error object means total failure. |
| Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics. |
| Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system. |
| The results obtained are approximate matches. | The results obtained are exact matches. |
| Results are ordered by relevance. | Results are not ordered by relevance. |
| It is a probabilistic model. | It is a deterministic model. |

Advantages of Information Retrieval

Efficient Access: Information retrieval techniques make it possible for users to easily
locate and retrieve vast amounts of data or information.
Personalization of Results: User profiling and personalization techniques are used to
tailor search results to individual preferences and behaviors.
Scalability: They are capable of handling increasing data volumes.
Precision: These systems can provide highly accurate and relevant search results,
reducing the likelihood of irrelevant information appearing among them.

Disadvantages of Information Retrieval

Information Overload: When a lot of information is available, users often face
information overload, making it difficult to find the most useful and relevant material.
Lack of Context: They may fail to understand the context of a user's query leading to
inaccurate results.
Privacy and Security Concerns: They often access sensitive user data that can raise
privacy and security concerns.
Maintenance Challenges: Keeping these systems up to date and effective requires a
lot of effort, including regular updates, data cleaning and algorithm adjustments.
Bias and Fairness: These systems can exhibit biases, so care is needed to ensure that
they provide fair and unbiased results.


Search Engine
Last Updated : 23 Jul, 2025

Imagine you are in a library and are looking for a particular book. Now if you have to go
through every book in each category, it will be a tedious and difficult task. Moreover, if the
library has more than a million books then this task seems next to impossible. You are
definitely going to need a librarian who can bring the relevant books for you without any
delay. Well, that’s where a search engine comes in.

Search engine spamming refers to the practice of creating Web pages, or sets of Web
pages, designed to get a high relevance rank for some queries, even though the sites are
not popular sites. Popularity ranking schemes such as PageRank make the job of search
engine spamming more difficult, since just repeating words to get a high TF-IDF score is
no longer sufficient. However, even these techniques can be spammed, by creating a
collection of Web pages that point to each other, increasing their popularity rank.
Techniques such as using sites instead of pages as the unit of ranking (with appropriately
normalized jump probabilities) have been proposed to avoid some spamming techniques,
but are not fully effective against other spamming techniques. The war between search
engine spammers and search engines continues even today.
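
The link-farm effect described above can be illustrated with a small sketch. It uses a simplified PageRank computed by power iteration; the toy graph, the page names (`promo`, `farm1`, ...) and the damping value are assumptions chosen for illustration, not details of any real search engine.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration.

    links maps every page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small share from the random-jump step.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no out-links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "promo" is the page the spammer wants to boost.
honest_web = {
    "a": ["b"],
    "b": ["c"],
    "c": ["a"],
    "promo": ["a"],
}

# The same web plus a link farm: three pages that point to each other
# and to "promo", which links back into the farm.
farm = ["farm1", "farm2", "farm3"]
spammed_web = dict(honest_web)
spammed_web["promo"] = list(farm)
for page in farm:
    spammed_web[page] = ["promo"] + [other for other in farm if other != page]

before = pagerank(honest_web)["promo"]
after = pagerank(spammed_web)["promo"]
print(f"promo rank without farm: {before:.4f}")
print(f"promo rank with farm:    {after:.4f}")
```

In this toy graph nobody links to `promo` at first, so it only receives the random-jump share; once the farm pages point to it, its rank rises even though no independent page endorses it.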

The hubs and authorities approach of the HITS algorithm is more susceptible to spamming.
A spammer can create a Web page containing links to good authorities on a topic, and
gains a high hub score as a result. In addition, the spammer’s Web page includes links to
pages that they wish to popularize, which may not have any relevance to the topic. Because
these linked pages are pointed to by a page with high hub score, they get a high but
undeserved authority score.
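
A minimal sketch of this weakness, assuming the basic HITS iteration (authority score = sum of the hub scores of in-linking pages, hub score = sum of the authority scores of linked pages, each normalized per round). The graph and the page names are hypothetical.

```python
import math


def normalize(scores):
    """Scale scores so that their squares sum to 1."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    return {page: value / norm for page, value in scores.items()}


def hits(links, iterations=50):
    """Basic HITS iteration over a dict {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {page: 0.0 for page in pages}
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        auth = normalize(auth)
        # Hub score: sum of the authority scores of pages linked to.
        hub = normalize(
            {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        )
    return hub, auth


# Two honest hubs link to two good authorities. The spammer's page copies
# those links and adds a link to the page it wants to promote.
links = {
    "hub1": ["auth1", "auth2"],
    "hub2": ["auth1", "auth2"],
    "spam": ["auth1", "auth2", "promoted"],
}

hub, auth = hits(links)
print("hub scores:", {page: round(hub[page], 3) for page in links})
print("authority of promoted page:", round(auth["promoted"], 3))
```

Because the spam page links to everything the honest hubs link to, its hub score is at least as high as theirs, and the promoted page inherits a nonzero authority score from it.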

Table of Contents
What is a Search Engine?
History of search engines
Working of a search engine
Architecture Of Search Engine
How queries are processed in search engine?
Search Engine Advantages:
Examples Of Popularly Used Search Engines

What is a Search Engine?

A search engine is software that brings users the information they search for from the
vast library of data available on the World Wide Web. Users can search for many kinds
of content, including documents, images, videos and web pages. Search engines are
built to return the required information efficiently by crawling the web and searching
the databases built from the crawled data.

Most of us use a search engine every day, often several times a day, even to look up
basic things. Google is one of the most widely used search engines in the world, owing
to its variety of services such as web search, image search and video search.

History of search engines

• Archie, developed in 1990, was the first well-developed search engine. It searched for
files on FTP servers by indexing and matching their names.
• In 1992, the Veronica search engine was developed for Gopher-based sites.
• In 1993, W3Catalog and Aliweb, both web search engines, were created.
• WebCrawler was the first search engine to let users search by keyword.
• In 1994, Yahoo! was launched and gained immense popularity. It began as just a
directory; a search feature was added in 1995.
• Google was founded in 1998 and remains the most used and preferred search engine
across the globe. Other frequently used search engines include Baidu, Bing and
Yandex.

Working of a search engine

Let's understand how a search engine works.

Going back to the earlier example, the search engine acts as a librarian that gathers the
relevant books, that is, the required information, from the library of data available on the
internet.

To summarize, web crawlers scan, or crawl, the data available on the web and gather
information (crawling). The gathered information is then organized into a catalog or
database (indexing) so that relevant web pages can be selected quickly. When a user
searches for something, the search engine picks the most relevant results according to
its ranking and displays them on the search engine results page (SERP). It is quite a
technical process, but it happens so quickly that the user gets results almost as soon as
they submit a search.

Architecture Of Search Engine

If we talk about the architecture, or framework, of a search engine, it can be described in
terms of three main components:

• Web crawlers – As the name suggests, these act as spiders that crawl all over the
web to collect the required information. They are special bots that traverse the
internet by following links and accumulate data.

• Database – A collection of the data gathered by the web crawlers while searching
throughout the World Wide Web.

• Search Interface – A medium through which users can access and search the
database for the required information.
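
As a rough sketch of the first two components, the crawler below follows links breadth-first from a seed page and stores what it finds in a dictionary that stands in for the database. The in-memory `FAKE_WEB`, its URLs and the `fetch` helper are hypothetical stand-ins for real HTTP requests and HTML parsing.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "site/home": ("Welcome to the home page", ["site/about", "site/blog"]),
    "site/about": ("About this site", ["site/home"]),
    "site/blog": ("Posts about search engines", ["site/about", "site/missing"]),
    "site/orphan": ("No page links here", []),
}


def fetch(url):
    """Return (text, links) for a URL, or None if the page does not exist."""
    return FAKE_WEB.get(url)


def crawl(seed, max_pages=100):
    """Breadth-first crawl from seed; returns {url: text} for visited pages."""
    seen = {seed}
    queue = deque([seed])
    database = {}
    while queue and len(database) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:
            continue  # broken link: nothing to store
        text, links = page
        database[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return database


pages = crawl("site/home")
for url, text in pages.items():
    print(url, "->", text)
```

A page that no crawled page links to (`site/orphan` here) is never discovered, which is why crawlers depend on links to accumulate data.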

How queries are processed in search engine?

Whenever we search anything on the search engine, it only takes a second or two for the
output generation. However, a lot goes on in the backend. Indexing and Querying are two
essential components behind the processing of a search engine. They are like the building
blocks of search engines. Let’s take a look at these processes -

Indexing

• The indexing process begins with web crawling where the so-called spiders crawl across
the world wide web and collect data.

• The collected data is stored in a database for indexing. This step is also termed
text acquisition.

• The collected data is then broken down into tokens, or keywords, which the search
engine uses to create indexes. Each keyword is associated with the documents that
contain it. Indexing organizes the data and helps the search engine retrieve a particular
piece of information quickly.
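
The tokenizing and indexing steps can be sketched as a small inverted index, which maps each token to the set of documents containing it. The sample documents and the very simple tokenizer (lowercase alphanumeric runs, no stemming) are assumptions for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build an inverted index: token -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return dict(index)


# Hypothetical documents collected by a crawler.
documents = {
    1: "Search engines crawl the web",
    2: "Crawlers collect data from web pages",
    3: "An index helps a search engine retrieve data quickly",
}

index = build_index(documents)
print("search ->", sorted(index["search"]))
print("web    ->", sorted(index["web"]))
```

Looking up a keyword is now a single dictionary access instead of a scan over every document. Note that without stemming, `engine` and `engines` are indexed as different tokens.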

Querying

• When a user searches for something on the search engine, a query is generated from
the input.

• The search engine then parses the query and searches the indexes for the matching
documents.

• Using a ranking algorithm, the search engine ranks the documents by relevance.
Finally, the resulting list is presented to the user with the most relevant results at
the top.
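
The three querying steps can be sketched as follows. The ranking here is a simple TF-IDF-style score with a smoothed IDF, `log(1 + N / df)`, which is one possible choice and an assumption of this sketch; real search engines combine many more signals. The documents and function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Index with term counts: term -> {doc_id: occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return dict(index)


def search(query, index, num_docs):
    """Parse the query, look up each term and rank the matching documents."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in any document
        # Rarer terms get a higher weight (smoothed inverse document frequency).
        idf = math.log(1 + num_docs / len(postings))
        for doc_id, term_count in postings.items():
            scores[doc_id] += term_count * idf
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


documents = {
    "a": "web crawler collects web pages",
    "b": "ranking orders pages by relevance",
    "c": "the crawler follows links",
}
index = build_index(documents)

print(search("web crawler", index, len(documents)))
```

Document `a` contains both query terms (and `web` twice), so it ranks above `c`, which contains only `crawler`; `b` matches neither term and is not returned.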

Search Engine Advantages:

• Search engines have made it possible to navigate the internet. Even a person
without any technical knowledge can use a search engine to answer a query.

• The quick and efficient responses of search engines make it easy for users to get
results for their searches immediately.

• A search engine supports not only text results but also images, videos, maps,
documents and various other formats, offering users a one-stop solution.

• Today, people use search engines not only for technical purposes but also for
research, education and day-to-day life, because of the diversity of results they
provide.

• The user-friendly interface, organized results, customization features and diversity of
search results make search engines one of the most essential tools for surfing the
internet.

Examples Of Popularly Used Search Engines

Today, Google is the most widely used search engine, but several other popular search
engines are also in use. Some of them are listed below:

• Google - Founded in 1998 by Larry Page and Sergey Brin, Google is the most popular
and widely used search engine. It has an attractive, user-friendly interface and a
versatile set of features, which makes it the first choice for most users.

• Bing - Launched in 2009 by Microsoft, Bing is quite similar to other search engines. It
also allows users to search through images.

• Yahoo - Founded in 1994 by Jerry Yang and David Filo, Yahoo was among the earliest
widely used search engines. However, its popularity has declined over time. Yahoo
formerly offered a platform called "Yahoo Answers" where users could ask or answer
questions.

• Baidu, DuckDuckGo and Yandex are some other popular search engines today.


Deductive Database Semantics and Query Evaluation

Last Updated : 23 Jul, 2025

Pre-requisites: What is a Database?

We classify each relation in a Datalog program, or deductive database, as either an output
relation or an input relation. Output relations are defined by rules, whereas input relations
have a set of tuples explicitly listed (e.g., Assembly). Given instances of the input relations,
we must compute instances for the output relations.

The major advantage of a deductive database is the ability to express queries, including
recursive ones, declaratively as rules. We can understand deductive databases more
easily using the following diagram.

[Figure: deductive database]

The meaning of a Datalog program is usually defined in two different ways, both of which
essentially describe the relation instances for the output relations. Technically, a query is
a selection over one of the output relations. However, the meaning of a query is clear
once we understand how relation instances are associated with the output relations in a
Datalog program.

Safe Datalog Programs

There are two common approaches to defining the semantics of a Datalog program:

Least model semantics

• This approach gives users a way to understand the program without thinking about
how the program will be executed.
• The semantics is declarative, like relational calculus semantics, rather than
operational, like relational algebra semantics.
• This matters because recursive rules make it difficult to understand a program in
terms of an evaluation strategy.

Least fixpoint semantics

• Least fixpoint semantics gives a conceptual evaluation strategy for computing the
desired relation instances.
• It serves as the basis for recursive query evaluation.
• More efficient evaluation strategies are used in actual implementations.
• The correctness of those strategies is demonstrated by showing their equivalence to
the least fixpoint approach.
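
The conceptual evaluation strategy can be sketched as naive bottom-up evaluation: apply the rules repeatedly until no new tuples appear. The sketch assumes an input relation Assembly(part, subpart, qty) and an output relation Components(part, subpart) defined by two rules, one of them recursive; the rules and tuples below are illustrative assumptions, not taken from the text above.

```python
def components_fixpoint(assembly):
    """Naive bottom-up evaluation of the (assumed) rules

        Components(P, S) :- Assembly(P, S, Q).
        Components(P, S) :- Assembly(P, P2, Q), Components(P2, S).

    Starts from an empty Components relation and reapplies both rules
    until nothing new is derived; the result is the least fixpoint.
    """
    components = set()
    while True:
        # Rule 1: every direct subpart is a component.
        derived = {(part, sub) for (part, sub, _qty) in assembly}
        # Rule 2: a component of a direct subpart is also a component.
        derived |= {
            (part, sub)
            for (part, part2, _qty) in assembly
            for (owner, sub) in components
            if owner == part2
        }
        if derived <= components:
            return components  # no new tuples: fixpoint reached
        components |= derived


# Input relation: explicitly listed tuples (part, subpart, quantity).
assembly = {
    ("trike", "wheel", 3),
    ("trike", "frame", 1),
    ("wheel", "spoke", 2),
    ("wheel", "tire", 1),
    ("tire", "rim", 1),
}

components = components_fixpoint(assembly)

# A query is a selection over the output relation.
trike_parts = sorted(sub for (part, sub) in components if part == "trike")
print(trike_parts)
```

Each pass of the loop corresponds to one application of the rules. This naive version rederives old tuples on every pass, which is the kind of inefficiency that refined evaluation strategies aim to avoid.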

Research in this area aims to improve existing transformation-based methods and to
develop new ones for evaluating stratifiable as well as unstratifiable recursion. The
results ought to provide a realistic framework of efficient evaluation techniques for
extending existing relational database systems.

Query Evaluation for Deductive Database

The query evaluation for the deductive database is as follows:

Phase 1: Storage and access

• The deductive database stores rules and facts as Datalog formulas in clausal form.
• General formulas can contain existential and universal quantifiers. In clausal form,
every variable is implicitly universally quantified, so the quantifiers are not written.
• A formula in clausal form is made up of a number of clauses. Each clause is
composed of a number of literals connected by logical OR, and the clauses themselves
are connected by logical AND.

Phase 2: Interpretation of rules

• The deductive database then interprets the rules. There are two main methods of
interpretation.
• In the proof-theoretic interpretation, facts are considered axioms, that is, statements
given to be true. Rules are called deductive axioms and are used to construct proofs
that derive new facts from existing facts.
• In the model-theoretic interpretation, we are given a finite or infinite domain of
constant values and assign a truth value to each predicate for every possible
combination of values as its arguments.
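
A minimal sketch of the model-theoretic view over a finite domain: an interpretation lists the argument combinations for which a predicate is true, and it is a model if every rule holds under it. The `parent` and `ancestor` predicates, the two rules and the domain are hypothetical examples.

```python
from itertools import product

# Finite domain of constant values (hypothetical).
DOMAIN = ["ann", "bob", "cy"]

# Facts: the combinations for which parent(X, Y) is true.
PARENT = {("ann", "bob"), ("bob", "cy")}


def is_model(ancestor):
    """Check whether an assignment of truth values to ancestor(X, Y)
    satisfies both rules for every combination of domain values:

        ancestor(X, Y) :- parent(X, Y).
        ancestor(X, Y) :- parent(X, Z), ancestor(Z, Y).
    """
    for x, y in product(DOMAIN, repeat=2):
        if (x, y) in PARENT and (x, y) not in ancestor:
            return False  # first rule violated
    for x, z, y in product(DOMAIN, repeat=3):
        body_true = (x, z) in PARENT and (z, y) in ancestor
        if body_true and (x, y) not in ancestor:
            return False  # second rule violated
    return True


too_small = {("ann", "bob"), ("bob", "cy")}
least = {("ann", "bob"), ("bob", "cy"), ("ann", "cy")}
everything = set(product(DOMAIN, repeat=2))

print(is_model(too_small), is_model(least), is_model(everything))
```

The interpretation that makes `ancestor` true for every pair is also a model, but it asserts facts that the rules do not force. The smallest model contains exactly the facts the rules require, which is the idea behind least model semantics.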

Deductive Database Prototype

Many deductive database prototypes are available, and many of them are memory based.
Such a system assumes that all required permanent relations are stored in main memory
and that the temporary relations generated during computation can also be held in
memory.

Examples include RDL/c and megalog.


