UNIT – I
Introduction : Distributed Data Processing, Distributed Database System, Promises of DDBSs, Problem areas.
Distributed DBMS Architecture : Architectural Models for Distributed DBMS, DDBMS Architecture.
Distributed Database Design : Alternative Design Strategies, Distribution Design issues, Fragmentation,
Allocation.
Introduction – Distributed Databases:
A distributed database is a database that runs and stores data across multiple computers, as
opposed to doing everything on a single machine.
Typically, distributed databases operate on two or more interconnected servers on a computer
network.
Each location where a version of the database is running is often called an instance or a node.
A distributed database is basically a database that is not limited to one system; it is spread over
different sites, i.e., on multiple computers or over a network of computers.
A distributed database system is located on various sites that don’t share physical components.
This may be required when a particular database needs to be accessed by various users globally.
It needs to be managed such that for the users it looks like one single database.
Fig: Distributed Database System
A distributed database system is a type of database management system that stores data across
multiple computers or sites that are connected by a network. In a distributed database system, each
site has its own database, and the databases are connected to each other to form a single, integrated
system.
The main advantage of a distributed database system is that it can provide higher availability and
reliability than a centralized database system. Because the data is stored across multiple sites, the
system can continue to function even if one or more sites fail. In addition, a distributed database
system can provide better performance by distributing the data and processing load across multiple
sites.
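The availability claim above can be made concrete with a small calculation. The sketch below assumes that every site holds a full copy of the data and that sites fail independently, each being up with probability p. Both are simplifying assumptions, and the function name system_availability is only illustrative.

```python
def system_availability(p: float, n: int) -> float:
    """Probability that at least one of n independent replicas is up.

    p is the assumed availability of a single site (0.0 to 1.0).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The system is down only if all n sites are down at the same time.
    return 1.0 - (1.0 - p) ** n


for sites in (1, 2, 3):
    print(sites, round(system_availability(0.99, sites), 6))
```

Under these assumptions, each extra copy multiplies the chance of a total outage by (1 - p), which is why adding sites raises availability so quickly.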
Distributed Database Features
Some general features of distributed databases are:
Location independence - Data is physically stored at multiple sites and managed by an
independent DDBMS.
Distributed query processing - Distributed databases answer queries in a distributed
environment that manages data at multiple sites. High-level queries are transformed into a
query execution plan for simpler management.
Distributed transaction management - Provides a consistent distributed database through
commit protocols, distributed concurrency control techniques, and distributed recovery
methods in case of many transactions and failures.
Seamless integration - Databases in a collection usually represent a single logical database,
and they are interconnected.
Network linking - All databases in a collection are linked by a network and communicate
with each other.
Transaction processing - Distributed databases incorporate transaction processing. A
transaction is a program unit made up of one or more database operations. A transaction
is atomic: it is either executed entirely or not at all.
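Commit protocols are what keep a transaction atomic when it touches several sites. A minimal sketch of one such protocol, two-phase commit, is given below. The Site class, its can_commit flag, and the run_transaction function are hypothetical names used for illustration. A real DDBMS also handles logging, timeouts, and coordinator failure, which are left out here.

```python
class Site:
    """A hypothetical participant site holding a small key-value store."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # False simulates a local failure
        self.data = {}
        self.pending = None

    def prepare(self, updates):
        # Phase 1: stage the updates and vote yes (True) or no (False).
        if not self.can_commit:
            return False
        self.pending = dict(updates)
        return True

    def commit(self):
        # Phase 2, all sites voted yes: make the staged updates permanent.
        self.data.update(self.pending or {})
        self.pending = None

    def abort(self):
        # Phase 2, some site voted no: discard the staged updates.
        self.pending = None


def run_transaction(sites, updates):
    """Apply the updates at every site or at none of them."""
    votes = [site.prepare(updates) for site in sites]
    if all(votes):
        for site in sites:
            site.commit()
        return True
    for site in sites:
        site.abort()
    return False


a, b = Site("A"), Site("B")
print(run_transaction([a, b], {"x": 1}), a.data, b.data)

c, d = Site("C"), Site("D", can_commit=False)
print(run_transaction([c, d], {"x": 1}), c.data, d.data)
```

In the second run, site D votes no, so site C discards its staged update as well and both stores stay empty.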
Applications / Uses of Distributed Database
It is used in Corporate Management Information System.
It is used in multimedia applications.
It is used in military control systems, hotel chains, etc.
It is also used in manufacturing control system.
Architectures for distributed database systems
There are several different architectures for distributed database systems, including:
Client-server architecture: In this architecture, clients connect to a central server,
which manages the distributed database system. The server is responsible for
coordinating transactions, managing data storage, and providing access control.
Peer-to-peer architecture: In this architecture, each site in the distributed database
system is connected to all other sites. Each site is responsible for managing its own data
and coordinating transactions with other sites.
Federated architecture: In this architecture, each site in the distributed database system
maintains its own independent database, but the databases are integrated through a
middleware layer that provides a common interface for accessing and querying the data.
Distributed database systems can be used in a variety of applications, including e-
commerce, financial services, and telecommunications. However, designing and
managing a distributed database system can be complex and requires careful
consideration of factors such as data distribution, replication, and consistency.
Distributed Database Types
There are two types of distributed databases:
Homogeneous
Heterogeneous
Homogeneous
A homogeneous distributed database is a network of identical databases stored on multiple
sites. The sites have the same operating system, DDBMS, and data structure, making them
easily manageable.
Homogeneous databases allow users to access data from each of the databases seamlessly.
The following diagram shows an example of a homogeneous database:
Heterogeneous
A heterogeneous distributed database uses different schemas, operating systems, DDBMSs,
and data models at its sites.
In a heterogeneous distributed database, a particular site can be completely unaware of
other sites, which limits cooperation in processing user requests. Because of this limitation,
translations are required to establish communication between sites.
The following diagram shows an example of a heterogeneous database:
Distributed Database Storage
Distributed database storage is managed in two ways:
Replication
Fragmentation
Replication
In database replication, the systems store copies of data on different sites. If an entire
database is available on multiple sites, it is a fully redundant database.
The advantage of database replication is that it increases data availability on different sites
and allows for parallel query requests to be processed.
However, database replication means that data requires constant updates and synchronization
with other sites to maintain an exact database copy. Any changes made on one site must be
recorded on other sites, or else inconsistencies occur.
Constant updates cause a lot of server overhead and complicate concurrency control, as a lot
of concurrent queries must be checked in all available sites.
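The trade-off described above can be sketched in a few lines. The example assumes eager (synchronous) replication, where a write is applied to every copy before it returns. The ReplicatedStore class and the site names are hypothetical.

```python
import random


class ReplicatedStore:
    """Hypothetical store that keeps a full copy of the data at every site."""

    def __init__(self, site_names):
        self.replicas = {name: {} for name in site_names}

    def write(self, key, value):
        # Every write must reach every copy. This is the update overhead
        # that replication introduces.
        for copy in self.replicas.values():
            copy[key] = value

    def read(self, key, site=None):
        # A read can be served by any single copy, so reads can be spread
        # over the sites.
        if site is None:
            site = random.choice(list(self.replicas))
        return self.replicas[site][key]


store = ReplicatedStore(["site1", "site2", "site3"])
store.write("balance", 100)
print(store.read("balance", site="site2"))
```

If write skipped one of the copies, a later read from that site would return stale or missing data, which is the inconsistency the text warns about.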
Fragmentation
In fragmentation, the relations of the database are split into smaller parts called
fragments. Each fragment is stored at the site where it is required.
The prerequisite for fragmentation is to make sure that the fragments can later be
reconstructed into the original relation without losing data.
The advantage of fragmentation is that there are no data copies, which prevents data
inconsistency.
There are two types of fragmentation:
Horizontal fragmentation - The relation is split into groups of rows (tuples), and each
group forms one fragment.
Vertical fragmentation - The relation schema is fragmented into smaller schemas, and each
fragment contains a common candidate key to guarantee a lossless join.
Note: In some cases, a mix of fragmentation and replication is possible.
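Both kinds of fragmentation, and the requirement that the original relation can be reconstructed, can be illustrated with a small sketch. The employees relation, its column names, and the helper functions are made-up sample material, not taken from any particular system.

```python
# A relation is modelled as a list of rows (dicts); emp_id is the key.
employees = [
    {"emp_id": 1, "name": "Asha", "city": "Delhi", "salary": 50000},
    {"emp_id": 2, "name": "Ravi", "city": "Mumbai", "salary": 60000},
    {"emp_id": 3, "name": "Meena", "city": "Delhi", "salary": 55000},
]


def horizontal_fragment(relation, predicate):
    """Keep the whole rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def vertical_fragment(relation, columns, key="emp_id"):
    """Keep the chosen columns, always including the key."""
    keep = [key] + [c for c in columns if c != key]
    return [{c: row[c] for c in keep} for row in relation]


def by_key(rows, key="emp_id"):
    """Sort rows by key so two relations can be compared."""
    return sorted(rows, key=lambda row: row[key])


# Horizontal: reconstruct with a union of the fragments.
delhi = horizontal_fragment(employees, lambda r: r["city"] == "Delhi")
others = horizontal_fragment(employees, lambda r: r["city"] != "Delhi")
assert by_key(delhi + others) == by_key(employees)

# Vertical: reconstruct with a join on the shared key.
personal = vertical_fragment(employees, ["name", "city"])
payroll = vertical_fragment(employees, ["salary"])
payroll_index = {row["emp_id"]: row for row in payroll}
joined = [{**row, **payroll_index[row["emp_id"]]} for row in personal]
assert by_key(joined) == by_key(employees)
```

The vertical fragments can be joined back without loss only because each of them carries emp_id. Dropping the key from either fragment would make the rows impossible to match.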
Distributed Database Advantages and Disadvantages
Below are some key advantages and disadvantages of distributed databases:
Advantages                  | Disadvantages
----------------------------+---------------------------
Modular development         | Costly software
Reliability                 | Large overhead
Lower communication costs   | Data integrity
Better response             | Improper data distribution
The advantages and disadvantages are explained in detail in the following sections.
Advantages / Benefits of Distributed Databases:
Modular Development. Modular development of a distributed database implies that a system
can be expanded to new locations or units by adding new servers and data to the existing
setup and connecting them to the distributed system without interruption. This type of
expansion causes no interruptions in the functioning of distributed databases.
Reliability. Distributed databases offer greater reliability in contrast to centralized databases.
In case of a database failure in a centralized database, the system comes to a complete stop.
In a distributed database, the system functions even when failures occur, only delivering
reduced performance until the issue is resolved.
Lower Communication Cost. Locally storing data reduces communication costs for data
manipulation in distributed databases. Local data storage is not possible in centralized
databases.
Better Response. Efficient data distribution in a distributed database system provides a faster
response when user requests are met locally. In centralized databases, user requests pass
through the central machine, which processes all requests. The result is an increase in response
time, especially with a lot of queries.
Disadvantages / Issues of Distributed Databases:
Costly Software. Ensuring data transparency and coordination across multiple sites often
requires using expensive software in a distributed database system.
Large Overhead. Many operations on multiple sites require numerous calculations and
constant synchronization when database replication is used, causing a lot of processing
overhead.
Data Integrity. A possible issue when using database replication is data integrity, which is
compromised by updating data at multiple sites.
Improper Data Distribution. Responsiveness to user requests largely depends on proper
data distribution. That means responsiveness can be reduced if data is not correctly distributed
across multiple sites.
Centralized Database Vs Distributed Database
Centralized DBMS | Distributed DBMS
-----------------+-----------------
The database is stored at only one site. | The database is stored at different sites and is accessed over a network.
The data is stored at a single computer site, which can be used by multiple users. | The database and DBMS software are distributed over many sites, connected by a computer network.
The database is maintained at one site. | The database is maintained at a number of different sites.
If the centralized system fails, the entire system is halted. | If one system fails, the system continues to work with the other sites.
It is less reliable. | It is more reliable.
Fig : Centralized database
Fig : Distributed database
Fig : Types of Distributed Databases
Examples of distributed databases
Some common examples of distributed databases include:
Apache Ignite
Apache Cassandra
Apache HBase
Couchbase Server
Amazon SimpleDB
Clusterpoint
FoundationDB
Distributed data processing
Distributed data processing refers to the approach of handling and analyzing data across multiple
interconnected devices or nodes. (or)
Data processing in which the database files are located at different sites in a network is
known as DDP (Distributed Data Processing).
In contrast to centralized data processing, where all data operations occur on a single, powerful
system, distributed processing decentralizes these tasks across a network of computers.
Distributed Processing is a computing approach that involves dividing tasks across multiple
machines or nodes in a network. Instead of relying on a single machine to process large amounts of
data, the workload is distributed among multiple machines, enabling parallel processing. The
distributed nature of processing allows for increased performance, scalability, and fault tolerance.
How Distributed Data Processing Works
In a distributed processing system, a central coordinator assigns tasks to different nodes in the
network. Each node processes its assigned task independently and communicates the results back to
the coordinator. The coordinator then combines the results to produce the final output.
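The coordinator pattern described above can be sketched as follows. Threads from Python's standard concurrent.futures module stand in for the nodes, which is an assumption made to keep the example runnable on one machine. The count_words task and the coordinator function are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Task run by one worker: count the words in its chunk of lines."""
    return sum(len(line.split()) for line in chunk)


def coordinator(lines, workers=3):
    """Split the input, hand one chunk to each worker, combine the results."""
    chunks = [lines[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    return sum(partial_counts)


lines = ["a distributed database", "runs on many sites", "one logical view"]
print(coordinator(lines))  # prints 10
```

Each worker sees only its own chunk, and the coordinator alone knows how to combine the partial counts into the final output.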
Distributed processing can be achieved through various mechanisms, including message passing,
shared memory, or a combination of both. Communication between nodes can occur through direct
point-to-point connections or via a shared communication infrastructure such as a message queue
or distributed file system.
In a Distributed data processing system, a massive amount of data flows through several different
sources into the system. This process of data flow is known as data ingestion.
Once the data streams in, different layers in the system architecture break down the
entire processing into several parts.
Fig : Data Ingestion
Data Collection and Preparation:
This layer takes care of collecting data from different external sources and prepares it to be
processed by the system.
The data may be text, audio, video, images, tax return forms, insurance forms, medical bills, etc.
The task of the data preparation layer is to convert the data into a consistent standard format
and to classify it according to the business logic so that the system can process it. This is
done in an automated fashion, without human intervention.
Data Security Layer
The role of this layer is to ensure that data in transit is secure by protecting it throughout
with security protocols, encryption, and similar measures.
Data Storage Layer
The data storage layer stores the large amount of data that enters the system.
Data Processing Layer
This is the layer that contains the business logic for data processing. Machine Learning, predictive,
descriptive and decision modeling are primarily used to extract meaningful information.
Data Visualization Layer
All the information extracted is sent to the data visualization layer which typically contains browser
based dashboards which display the information in the form of graphs, charts and infographics.
Why Distributed Processing is important
Distributed processing offers several benefits that make it important for data processing and
analytics:
Improved Performance: By distributing the workload across multiple machines, distributed
processing can significantly reduce the processing time compared to a single machine. This is
especially crucial when dealing with large datasets or complex computational tasks.
Scalability: Distributed processing allows organizations to scale their computing resources by
adding or removing nodes as needed. This flexibility enables businesses to handle increased
workloads and accommodate future growth without a significant impact on performance.
Fault Tolerance: In a distributed processing system, if one node fails or experiences issues, the
workload can be automatically rerouted to other available nodes. This fault tolerance ensures that
processing continues uninterrupted and reduces the risk of data loss.
Cost Efficiency: With distributed processing, organizations can utilize commodity hardware
instead of relying on expensive high-end servers. This reduces hardware costs and allows
businesses to achieve higher computing power at a lower price point.
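The fault tolerance benefit above, where work is rerouted from a failed node to another one, can be sketched as a simple retry loop. The node functions, the NodeDown exception, and run_with_failover are hypothetical names. Real systems also detect failures through timeouts and health checks, which are not shown.

```python
class NodeDown(Exception):
    """Raised by a hypothetical node that cannot take work."""


def make_node(name, healthy=True):
    """Return a function that simulates one node running a task."""
    def run(task):
        if not healthy:
            raise NodeDown(name)
        return f"{task} done on {name}"
    return run


def run_with_failover(task, nodes):
    """Try each node in turn until one completes the task."""
    for node in nodes:
        try:
            return node(task)
        except NodeDown:
            continue  # reroute the task to the next available node
    raise RuntimeError("no node available for " + task)


nodes = [make_node("node1", healthy=False), make_node("node2")]
print(run_with_failover("report", nodes))  # prints "report done on node2"
```

The task is lost only when every node is down, in which case the sketch raises an error instead of silently dropping the work.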
Advantages of distributed data processing (DDP)
Inexpensive:
Because the data is distributed, nodes (computers) can be added and removed easily. A
distributed network can be built with Beowulf cluster technology. In a Beowulf cluster, remote
computers are assigned processing work through network switches and routers.
Easy to replace remote computers:
Microsoft Windows Server has a feature called failover clustering that helps deal with faulty
computers. If a computer in the cluster fails or becomes corrupted, its work is automatically
taken over by another computer.
Optimized processing:
Managing data on an online server avoids slow processing. A personal computer also runs
other tasks, and those tasks consume processor power. An online server is dedicated to one
type of processing, so more of its processing power is available for that work. For example,
a database server handles only database queries and a file server only stores files, so data
processing is optimized.
Easy to expand:
If your company needs more data processing than expected, you can easily attach more
computers to the distributed network.
Parallel processing:
Adding and removing computers from the network does not disturb the data flow. Data from
different computers is processed in parallel. Parallel processing means that several nodes
work on the data at the same time.
Better performance:
The overall performance of the company gets better and data is filtered and processed more
rapidly in the distributed environment.
Backup of data:
Data can be backed up from any computer connected to the network. A user can therefore back up
data at different times, work with that data locally, and then upload it to the server.
Local data synchronization:
All the computers on the network can have local storage of important data. Suppose there are
different office branches interconnected to each other. All branch computers are interlinked with
the main branch office. All office branch computers have a local copy of data. Office users edit
and update data and then upload to the main server. So the data is synced and available to all
computers. Working locally with data is easy and fast and when the user thinks that his work is
complete then at the end of the day he can sync that data with the main server.
Data recovery:
If some data, such as a database, is lost on any computer, it can be recovered from another
interconnected computer, i.e. the main database server.
Disadvantages of distributed data processing (DDP)
Complexity:
Computers attached in DDP are difficult to troubleshoot, design, and administer.
Planning data synchronization is difficult:
Correct synchronization of data is difficult to develop. Sometimes data is updated in the
wrong order, so administrators have to plan for this before building a distributed network.
Data security:
If an unauthorized computer is connected to a distributed network, it can affect the
performance of other computers, and data can also be lost.
Examples of distributed data processing
Hosting a website on the online server
Online photo editing tools
Airline ticketing system
Processing user data by mobile companies
Dropbox, Google drive, MSN drive, Google photos
Report generation from satellite
Weather forecast system
Promises of DDBSs
There are four fundamentals which may also be viewed as promises of DDBS technology:
Transparent management of distributed and replicated data
Reliable access to data through distributed transactions
Improved performance
Easier system expansion
1. Transparent Management of Distributed and Replicated Data
A transparent system “hides” the implementation details from users.
The advantage of a fully transparent DBMS is the high level of support that it provides for the
development of complex applications.