MCA – DBS: UNIT-4: DDB (CH-25.1, 25.2, 25.3, 25.4)
Distributed Databases: Distributed Database Concepts, Types of Distributed Database
Systems, Distributed Database Architectures, Data Fragmentation, Replication, and
Allocation Techniques for Distributed Database Design.
Distributed Database
Distributed databases bring the advantages of distributed computing to the database
management domain.
DDB technology resulted from a merger of two technologies: database technology, and
network and data communication technology.
DDB: A collection of multiple, logically interrelated databases distributed over a computer
network.
DDBMS: A software system that manages a distributed database while making the
distribution transparent to the user.
For a database to be called distributed, these minimum conditions should be satisfied:
Network Connection: All database sites (computers) must be connected via a
communication network to share data and commands, as shown later in Figure(c).
Logical Relationship:
The data stored in different sites must be logically related.
No Uniformity Required:
The sites can differ in their data, hardware, or software — they don’t have to be the same.
Distributed Database Concepts:
i) Fragmentation:
The process of dividing the database into multiple smaller parts is called fragmentation.
These fragments are stored at different locations.
Fragmentation should be carried out in such a way that the original database can be
reconstructed from the fragments.
The system partitions a relation into several fragments and stores each fragment at a
different site.
Horizontal Data Fragmentation
It breaks relation R by assigning each tuple of R to one or more fragments.
Each fragment is a subset of the tuples in original relation R.
Horizontal (using union operation) → R ⟨R₁, R₂⟩ → R₁ ∪ R₂ = R
Vertical Data Fragmentation
It breaks relation R by decomposing schema.
Each fragment is a subset of the attributes of the original relation R.
Vertical (using join operation) → R ⟨R₁, R₂⟩ → R₁ ⨝ R₂ = R
Mixed Fragmentation:
It is a combination of both horizontal and vertical fragmentation.
The original relation is reconstructed by a combination of join and union operations.
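The three kinds of fragmentation and their reconstruction rules can be sketched in a few lines of Python. This is only an illustration: the EMPLOYEE relation, its attribute names, and the helper functions are assumptions made for the example, and rows are held in plain lists rather than at real sites.

```python
# Hypothetical EMPLOYEE relation; attribute names and values are illustrative only.
EMPLOYEE = [
    {"emp_id": 1, "name": "Asha", "dept": "Sales", "salary": 50000},
    {"emp_id": 2, "name": "Ravi", "dept": "HR", "salary": 45000},
    {"emp_id": 3, "name": "Meena", "dept": "Sales", "salary": 52000},
]

def horizontal_fragment(relation, predicate):
    """Keep the tuples (rows) that satisfy the predicate."""
    return [dict(row) for row in relation if predicate(row)]

def vertical_fragment(relation, attributes, key="emp_id"):
    """Keep only the given attributes; the key is always retained for the join."""
    cols = [key] + [a for a in attributes if a != key]
    return [{c: row[c] for c in cols} for row in relation]

def union(*fragments, key="emp_id"):
    """R1 ∪ R2: merge horizontal fragments, dropping duplicate keys."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

def join(left, right, key="emp_id"):
    """R1 ⨝ R2: combine vertical fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]}
            for row in left if row[key] in right_by_key]

# Horizontal: split by department, rebuild with union.
sales = horizontal_fragment(EMPLOYEE, lambda r: r["dept"] == "Sales")
others = horizontal_fragment(EMPLOYEE, lambda r: r["dept"] != "Sales")
horizontal_rebuilt = union(sales, others)

# Vertical: split by attributes, rebuild with join.
personal = vertical_fragment(EMPLOYEE, ["name", "dept"])
payroll = vertical_fragment(EMPLOYEE, ["salary"])
vertical_rebuilt = join(personal, payroll)

# Mixed: fragment the Sales rows vertically, rebuild with join then union.
sales_personal = vertical_fragment(sales, ["name", "dept"])
sales_payroll = vertical_fragment(sales, ["salary"])
mixed_rebuilt = union(join(sales_personal, sales_payroll), others)
```

Keeping the key in every vertical fragment is what makes the join lossless; without it the original rows could not be matched up again.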
ii) Replication:
It means storing a copy (or replica) of a relation or relation fragments in two or more sites.
Full Replication:
A copy of the entire relation is stored at every site.
Partial Replication:
Only some fragments of a relation are replicated.
Why Replication is Desirable:
i) Increased availability of data
ii) Better performance
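The availability gain can be made concrete with a small calculation. The sketch below rests on two simplifying assumptions that are not part of these notes: every site is up with the same probability p, and sites fail independently. Under those assumptions, a data item with n replicas is reachable with probability 1 - (1 - p)^n.

```python
def availability_with_replicas(p_site_up: float, n_replicas: int) -> float:
    """Probability that at least one of n replicas is on a live site.

    Assumes independent site failures and the same uptime probability
    for every site (both are simplifications for illustration).
    """
    if not 0.0 <= p_site_up <= 1.0:
        raise ValueError("p_site_up must be between 0 and 1")
    if n_replicas < 1:
        raise ValueError("n_replicas must be at least 1")
    return 1.0 - (1.0 - p_site_up) ** n_replicas

# With each site up 90% of the time, every extra copy shrinks the chance
# that the data is unreachable by a factor of ten.
for n in (1, 2, 3):
    print(n, round(availability_with_replicas(0.9, n), 6))
```

The same copies also help read performance, because a query can be answered from the nearest replica; the price is that every update must reach all copies.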
iii) Transparency:
In a distributed system, the user should be able to access the database exactly as if it
were local.
Hiding details such as where data is stored and how it is accessed is called data transparency.
Types of Transparency:
i) Location transparency: refers to the fact that the command used to perform a task is
independent of the location of the data and the location of the node where the command
was issued.
ii) Fragmentation transparency: Fragmentation transparency makes the user unaware of the
existence of fragments.
iii) Replication transparency: Copies of the same data objects may be stored at multiple sites
for better availability, performance, and reliability. Replication transparency makes the user
unaware of the existence of these copies.
iv) Naming transparency: implies that once a name is associated with an object, the named
objects can be accessed unambiguously without additional specification as to where the
data is located.
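A minimal sketch of how these transparencies look from the user's side is given below. The classes `Site` and `TransparentDB`, the fragment names, and the key ranges are all hypothetical; a real DDBMS does this inside its query processor. The point is that the caller names only the logical relation and a key, never a site, a fragment, or a copy.

```python
class Site:
    """A hypothetical database site holding fragment copies in memory."""
    def __init__(self, name):
        self.name = name
        self.up = True
        self.fragments = {}  # fragment name -> {key: row}

class TransparentDB:
    """Hides fragments, replicas, and site locations behind one read() call."""
    def __init__(self):
        # logical relation -> list of (fragment name, key test, replica sites)
        self._catalog = {}

    def register(self, relation, fragment, owns_key, replicas):
        self._catalog.setdefault(relation, []).append((fragment, owns_key, replicas))

    def read(self, relation, key):
        for fragment, owns_key, replicas in self._catalog[relation]:
            if not owns_key(key):
                continue  # fragmentation transparency: pick the right fragment
            for site in replicas:
                if site.up:  # replication transparency: any live copy will do
                    return site.fragments[fragment].get(key)
            raise RuntimeError(f"no live replica of {fragment}")
        return None

site_a, site_b = Site("A"), Site("B")
low = {1: {"emp_id": 1, "name": "Asha"}}
high = {150: {"emp_id": 150, "name": "Ravi"}}
site_a.fragments["EMP_LOW"] = dict(low)    # EMP_LOW is replicated at A and B
site_b.fragments["EMP_LOW"] = dict(low)
site_b.fragments["EMP_HIGH"] = dict(high)  # EMP_HIGH is stored only at B

db = TransparentDB()
db.register("EMPLOYEE", "EMP_LOW", lambda k: k < 100, [site_a, site_b])
db.register("EMPLOYEE", "EMP_HIGH", lambda k: k >= 100, [site_b])

row_before = db.read("EMPLOYEE", 1)  # served by site A
site_a.up = False                    # simulate a failure of site A
row_after = db.read("EMPLOYEE", 1)   # same call, now served by site B
```

The user's call is identical before and after the failure, which is exactly what location and replication transparency promise.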
iv) Autonomy:
Autonomy determines the extent to which individual nodes or DBs in a connected DDB can
operate independently.
Autonomy refers to the degree of independence each site (or node) in a distributed
database has over its own operations — such as managing data, running queries, or
handling users.
v) Reliability and Availability:
Reliability is broadly defined as the probability that a system is running (not down) at a
certain time point.
Availability is the probability that the system is continuously available during a time interval.
Features of Distributed Database:
i) Data is stored at a number of sites.
ii) Sites are interconnected by a network
iii) The DDB is logically a single database.
iv) DDBMS has full functionality of DBMS.
Advantages of Distributed Database:
i) Sharing of Data
ii) Improved Availability and Reliability
iii) Autonomy
iv) Easier expansion
v) Reduced operating cost
Disadvantages of Distributed Database:
i) Complexity of management and control.
ii) Deadlock handling
iii) Security
iv) Lack of standards
Types of Distributed Databases:
Homogeneous
i) Share a common global schema.
ii) Run identical DBMS s/w.
iii) Each site surrenders part of its autonomy in terms of its right to change schema or s/w.
iv) Same s/w – No problem in transaction processing.
v) Same schema – No problem in query processing.
Heterogeneous
i) Different sites can have different schema.
ii) Run different DBMS s/w.
iii) Each site maintains its own right to change the schema or s/w.
iv) Different s/w – Major problem in transaction processing.
v) Different schema – Problem in query processing.
Classification of Distributed Database Systems:
A: Centralized Database System
No distribution, no heterogeneity, high autonomy
One site handles everything.
Example: A clinic management system in a small private hospital where patient records,
billing, and appointments are stored on a single local server using MySQL. All operations are
performed on one machine; there is no need for distribution.
B: Pure Distributed Database System
Fully distributed, homogeneous, zero local autonomy
Looks like a single centralized DB to the user.
All data access is through a common interface.
Single global schema
Sites do not act independently.
[Example: Google Spanner used by Google for managing distributed data across its global
data centers. Appears as a single unified database to users, despite being distributed; it is fully
homogeneous with a single global schema. Spanner is used for mission-critical applications
that require high availability, global scale, and strong consistency, such as financial services,
gaming, and e-commerce platforms.]
C: Federated Database System (FDBS)
Some distribution, some heterogeneity, moderate autonomy
Sites have local users and local DBAs.
There is a global schema shared across sites.
Sites can run independently, but participate in a shared federation.
Example: Healthcare Information Systems connecting various hospitals that use different
local databases (Oracle, SQL Server, etc.) but share patient data under a unified health
program. Each hospital retains control over its local database but can participate in a shared
health data ecosystem.
D: Peer-to-Peer System
High distribution, high heterogeneity, full local autonomy
No global schema exists.
Each site constructs necessary schemas only when needed.
Sites can run on different DBMS models (relational, object, hierarchical, etc.)
Example: University Collaboration System
Different universities maintain their own local databases (student info, courses,
results), each built using different DBMSs.
They occasionally share data for student exchange programs or research
collaboration, but there’s no unified global schema.
Each university decides what to share, when, and how, often using custom-built APIs
or schema mappings.
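The "schema mappings" mentioned in this example can be pictured as small translation functions, one per university, that convert local rows into an agreed exchange format only when data is actually shared. The sketch below is hypothetical throughout: the field names, the two local schemas, and the exchange format are invented for illustration.

```python
# Two hypothetical universities with unrelated local schemas.
UNIV_A_ROWS = [{"roll_no": "A101", "full_name": "Asha Rao", "cgpa": 8.4}]
UNIV_B_ROWS = [{"student_id": 77, "first": "Ravi", "last": "Kumar", "percent": 81.0}]

def from_univ_a(row):
    """University A's mapping: it chooses to share identity only, not grades."""
    return {"id": f"A:{row['roll_no']}", "name": row["full_name"]}

def from_univ_b(row):
    """University B's mapping: builds the shared name from two local columns."""
    return {"id": f"B:{row['student_id']}", "name": f"{row['first']} {row['last']}"}

def exchange_view(sources):
    """Build a temporary combined view from (rows, mapping) pairs on demand."""
    return [mapping(row) for rows, mapping in sources for row in rows]

shared = exchange_view([(UNIV_A_ROWS, from_univ_a), (UNIV_B_ROWS, from_univ_b)])
```

No global schema is stored anywhere: the combined view exists only for the duration of the exchange, and each site controls its own mapping.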
Concepts/Techniques in Distributed Database Design
1. Fragmentation
o The process of breaking up the database into logical units called fragments.
o Each fragment can represent a portion of a table (horizontal or vertical).
o Purpose: to improve locality of access and efficiency.
o Types of fragmentation:
Horizontal fragmentation: rows are divided across sites.
Vertical fragmentation: columns (attributes) are divided.
Mixed/hybrid fragmentation: combination of both horizontal and
vertical.
2. Replication
o The technique of storing copies of data (or fragments) at multiple sites.
o Increases data availability and fault tolerance.
o Comes at the cost of maintaining data consistency during updates.
3. Allocation
o The process of assigning fragments or replicas to various sites in the
distributed system.
o Allocation strategies:
Centralized: all data is stored at one site.
Partitioned: fragments are stored at different sites.
Replicated: multiple copies of fragments are stored at several sites.
4. Global Directory
o Stores metadata about the fragmentation, replication, and allocation of
data.
o Acts as a catalog used by the Distributed Database System (DDBS) to locate
and access data.
o Must be efficiently maintained and accessible to all DDBS applications.
5. Purpose of These Techniques
o Improve performance, reliability, scalability, and availability of the
distributed database system.
o These decisions are made during the design phase of a DDBS.
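The three allocation strategies and the role of the global directory can be sketched together. In this illustration the function `allocate` returns a dictionary that plays the part of the directory, mapping each fragment to the sites that hold it. The function name, the round-robin placement, and the choice of the first site as the central one are assumptions made for the example, not rules prescribed by the notes.

```python
from itertools import cycle

def allocate(fragments, sites, strategy, copies=2):
    """Return a directory: fragment name -> list of sites holding a copy."""
    if not sites:
        raise ValueError("at least one site is required")
    if strategy == "centralized":
        # All data at one site (here, arbitrarily, the first one).
        return {f: [sites[0]] for f in fragments}
    if strategy == "partitioned":
        # Each fragment at exactly one site, assigned round-robin.
        ring = cycle(sites)
        return {f: [next(ring)] for f in fragments}
    if strategy == "replicated":
        # Each fragment at `copies` consecutive sites (capped at the site count).
        k = min(copies, len(sites))
        return {f: [sites[(i + j) % len(sites)] for j in range(k)]
                for i, f in enumerate(fragments)}
    raise ValueError(f"unknown strategy: {strategy}")

fragments = ["F1", "F2", "F3"]
sites = ["S1", "S2"]

centralized = allocate(fragments, sites, "centralized")
partitioned = allocate(fragments, sites, "partitioned")
replicated = allocate(fragments, sites, "replicated", copies=2)

# The directory answers the question "where is fragment F2?"
sites_for_f2 = replicated["F2"]
```

In practice the placement would be driven by where each fragment is accessed most often rather than by round-robin, but the resulting directory has the same shape.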
Q. Describe any two distributed database architectures with diagrams.
1. A three-tier Client-Server Architecture [TB page - 921]
Clients (users or applications) request services from servers.
Database servers manage the data and respond to queries.
Key Components:
Client Tier: User interfaces or front-end applications.
Application Server: Handles business logic.
Database Server: Manages storage, query processing, and transaction management.
Advantages:
Clear separation of concerns.
Centralized control over data.
Easy to scale and maintain.
Diagram:
Figure: The three-tier client-server architecture.
2. Peer-to-Peer (P2P) or Fully Distributed Architecture
All sites (or nodes) in the network function as peers.
Each site has equal responsibility and autonomy.
No central server or controller.
Key Features:
High local autonomy.
Sites may run different DBMSs (heterogeneous).
Data can be fragmented and replicated across sites.
No global schema needed; sites interact only when necessary.
Advantages:
Highly scalable and fault-tolerant.
Flexible and decentralized.
Supports dynamic and evolving environments.
Diagram:
These two architectures, Client-Server and Peer-to-Peer (P2P), are the ones most commonly
described.
Extra:
Advances in Database Management Systems
Coursework syllabus:
Distributed Database Concepts: Distributed Database Concepts, Data Fragmentation,
Replication, and Allocation Techniques for Distributed Database Design
Overview of Concurrency Control and Recovery in Distributed Databases
Overview of Transaction Management in Distributed Databases
Query Processing and Optimization in Distributed Databases
Types of Distributed Database Systems, Distributed Database Architectures, Distributed
Catalogue Management.
Parallel computing utilizes multiple processors within a single machine, while distributed
computing uses multiple, independent computers connected over a network.
Parallel Computing:
Focus: Executes multiple parts of a single task simultaneously on different processors
within the same machine (e.g., multi-core CPUs, GPUs).
Communication: Processors share memory and communicate through shared
resources, typically with low latency.
Goal: To speed up the execution of a single task by breaking it down into smaller,
parallelizable parts.
Example: Using a multi-core processor to render a complex 3D scene in a video
game, where each core handles a portion of the image.
Distributed Computing:
Focus: Uses multiple independent computers (nodes) connected over a network to
work together on a task.
Communication: Nodes communicate by sending messages over the network, which
can have higher latency than shared memory communication.
Goal: To handle large workloads or solve complex problems that are too large or
resource-intensive for a single machine.
Example: A search engine distributing a query across many servers to find results
from a massive database.
A centralized database stores all data in a single location, while a distributed database stores
data across multiple locations.
Centralized Database:
Location: Data is stored on a single server or site.
Management: Easier to manage and maintain due to the single location.
Backup: Backups are simpler and more straightforward.
Performance: Can experience performance bottlenecks if many users access it
simultaneously.
Reliability: A single point of failure, meaning if the central server goes down, the
entire system is affected.
Scalability: Scaling is often limited by the capabilities of the single server.
Distributed Database:
Location: Data is spread across multiple servers or sites.
Management: More complex to manage and synchronize data across different
locations.
Backup: Backups require coordination across multiple sites.
Performance: Can offer better performance due to data distribution.
Reliability: More resilient to failures as data can be accessed from other locations.
Scalability: Horizontal scalability, meaning it can easily handle larger workloads by
adding more nodes.
In essence, a centralized database is like having all your books in one library, while a
distributed database is like having multiple library branches with some books at each
location. This difference in data storage location leads to varying implications for
management, reliability, and performance.