0% found this document useful (0 votes)
9 views32 pages

Chapter 6-Distributed Database System

Chapter 6 discusses distributed databases and client-server architectures, covering key concepts such as data fragmentation, replication, and allocation. It outlines the advantages and disadvantages of distributed databases, types of distributed database systems, and the complexities involved in query processing and concurrency control. The chapter emphasizes the importance of optimizing data transfer and ensuring data integrity across multiple sites in a distributed environment.

Uploaded by

abduseidmehammed
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views32 pages

Chapter 6-Distributed Database System

Chapter 6 discusses distributed databases and client-server architectures, covering key concepts such as data fragmentation, replication, and allocation. It outlines the advantages and disadvantages of distributed databases, types of distributed database systems, and the complexities involved in query processing and concurrency control. The chapter emphasizes the importance of optimizing data transfer and ensuring data integrity across multiple sites in a distributed environment.

Uploaded by

abduseidmehammed
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Chapter 6

Distributed Databases
and
Client-Server Architectures

1
Outline

1. Distributed Database Concepts


2. Data Fragmentation, Replication and Allocation
3. Types of Distributed Database Systems
4. Query Processing
5. Concurrency Control and Recovery
6. 3-Tier Client-Server Architecture

2
1. Distributed Database Concepts
▪ A transaction can be executed by multiple networked
computers in a unified manner.
▪ A distributed database (DDB) processes Unit of execution (a
transaction) in a distributed manner.
▪ A distributed database (DDB) can be defined as :
– A collection of multiple logically related database
distributed over a computer network, and a distributed
database management system as a software system that
manages a distributed database while making the
distribution transparent to the user.
– The physical placement of data (files, relations, etc.) which
is not known to the user (distribution transparency).

3
• Example : EMPLOYEE, PROJECT, and WORKS_ON
tables may be fragmented horizontally and stored with
possible replication as shown below.

4
❑ Advantages of DDB :
i. Distribution and Network transparency:
▪ Users do not have to worry about operational details of
the network.
– There is Location transparency, which refers to
freedom of issuing command from any location
without affecting its working.
– Then there is Naming transparency, which allows
access to any names object (files, relations, etc.) from
any location.
ii. Replication transparency:
▪ It allows to store copies of a data at multiple sites as
shown in the above diagram.
▪ This is done to minimize access time to the required
data.
iii. Fragmentation transparency:
• Allows to fragment a relation horizontally (create a
subset of rows of a relation) or vertically (create a subset
5
of columns of a relation).
iv. Increased reliability and availability:
▪ Reliability refers to system live time, that is, system is
running efficiently most of the time. Availability is the
probability that the system is continuously available
(usable or accessible) during a time interval.
▪ A distributed database system has multiple nodes
(computers) and if one fails then others are available to
do the job.
v. Improved performance:
▪ A distributed DBMS fragments the database to keep data
closer to where it is needed most.
▪ This reduces data management (access and
modification) time significantly.
vi. Easier expansion (scalability):
▪ Allows new nodes (computers) to be added anytime
without changing the entire configuration. 6
❑ Disadvantages of Distributed Database

i. Complexity- The data replication , failure recovery , network


management …make the system more complex than the central
DBMSs
ii. Cost- Since DDBMS needs more people and more hardware ,
maintaining and running the system can be more expensive
than the centralized system .
[Link] of connecting Dissimilar Machine- Additional layers
of operation system software are needed to translate and
coordinate the flow of data between machines.
[Link] integrity and security problem - Because data maintained
by distributed systems can be accessed at locations in the
network, controlling the integrity of a database can be
difficult.
7
2. Data Replication and Fragmentation: Distributed data storage
▪ There are two approaches to store the relation in the distributed
database : Replication and Fragmentation
I. Data Replication
▪ The system maintain several identical copies of the relation & store
each copy at a different site
▪ In general it enhance the performance of read operation and
increase the availability of data to read only transaction.
Advantages :
▪ Increased availability (data can be accessed even if one site fails).
▪ Reduced latency (data is closer to the user).
Challenges :
▪ Ensuring consistency across replicas.
▪ Managing update propagation and synchronization.

8
I. Data Fragmentation
– To divide data into smaller, manageable pieces for efficient querying
and storage.
–Types :
▪ Horizontal Fragmentation : Divides a table into subsets of rows.
Each fragment contains a subset of the rows but all columns.
Example: Dividing an employee table into two fragments
one for employees in Department A and another for Department
B.
▪ Vertical Fragmentation : Divides a table into subsets of columns.
Each fragment contains a subset of columns but all rows.
Example: Splitting an employee table into one fragment with
personal details (e.g., name, address) and another with
employment details (e.g., salary, department).
▪ Hybrid Fragmentation : Combines horizontal and vertical
fragmentation for more granular control.
9
– Consider the Employee relation. A vertical fragment of can be
created by keeping the values of Name, Bdate, Sex, and
Address.
– Because there is no condition for creating a vertical fragment,
each fragment must include the primary key attribute of the
parent relation Employee. In this way all vertical fragments of
a relation are connected.
▪ Representation
▪ There three rules that must be followed during fragmentation
▪ Completeness – if a relation r is decomposed into
fragments r1, r2… rn , each data item that can be found in r
must appear in at least one fragment
▪ Reconstruction – it must be possible to define a relation
operation that will reconstruct the relation r from fragments
▪ Disjointness –if a data item di appears in fragment ri ,
then it shouldn’t appear in any other fragment
10
–The main reasons for fragmenting a relation are

• Efficiency- data that is not needed by the local applications is


not stored

• Parallelism – a transaction can be divided

• into several subqueries that operate on fragments which will


increase the degree of concurrency

–but reconstruction of the whole relation will require accessing


data from all sites containing part of the relation

11
3. Data Allocation
– Purpose : To decide where to place fragments or replicas in
the distributed system.

Strategies :

– Centralized Allocation : All data is stored at a single site.

– Partitioned Allocation : Data is divided and stored


across multiple sites without replication.

– Replicated Allocation : Data is replicated across multiple


sites for redundancy.

12
3. Types of Distributed Database Systems

• Homogeneous Window
– All sites of the database Site 5 Unix
Oracle Site 1
system have identical Oracle
setup, i.e., same database Window
system software. Site 4 Communications
– The system may have network
little or no local
autonomy Oracle
– The underlying operating Site 3 Site 2
systems can be a mixture Linux Oracle Linux Oracle
of Linux, Window, Unix,
etc.
– Example: Multiple
branches of a bank using
the same DBMS software. 13
• Heterogeneous
– Definition : A distributed database system where at least one
database is from a different vendor (e.g., Oracle, MySQL,
PostgreSQL).
– Key Characteristics :
– Each site may use a different DBMS with varying schemas,
query languages, and data models.
– Translation mechanisms are required to enable
communication between heterogeneous systems. Object Unix Relational
Oriented Site 5 Unix
– Data integration is achieved Site 1
Hierarchical
through middleware or gateways Window
Communications
Site 4
that translate queries and data formats. network

– Example : Object
Network
DBMS
– Site A uses Oracle for its relational database. Site 3
Linux
Oriented Site 2
Linux
Relational

– Site B uses MongoDB for its NoSQL database.


– Middleware translates SQL queries into
14
MongoDB-compatible queries when accessing data from Site B.
–Federated: A distributed database system where each site may run a different DBMS, but data
access is managed through a single conceptual schema (global schema).
–Key Characteristics :
– Global Schema : A unified view of all databases in the system,
allowing users to query data as if it were stored in a single database.
– Local Autonomy : Sites retain some degree of autonomy, but they
must adhere to a centralized access policy defined by the global schema.
– Integration : The global schema integrates local schemas, providing a
consistent interface for users.
Implications : -The degree of local autonomy is minimum because each
site must conform to the centralized access policy.
– Query processing involves mapping global queries to local queries
using the global schema.
Example : Site A uses PostgreSQL and Site B uses MySQL .
– A federated system defines a global schema that combines customer
data from both sites.
– Users can execute a query like SELECT * FROM Customers without
worrying about which database stores the data. 15
▪ Federated Database Management Systems Issues
– Differences in data models:
• Relational, Objected oriented, hierarchical, network, etc.
– Differences in constraints:
• Each site may have their own data accessing and
processing constraints.
– Differences in query language:
• Some site may use SQL, some may use SQL-89, some
may use SQL-92, and so on.

16
–Multidatabase: A distributed database system where there is no
single conceptual global schema . Instead, a schema is constructed
dynamically as needed by the application software.
• Key Characteristics : Dynamic Schema Construction : Applications
construct a schema on-the-fly based on the specific data access
requirements.
• No Centralized Policy : There is no global schema or centralized
access policy. Each site maintains full autonomy over its data.
• Loose Coupling : Sites are loosely coupled, and data integration
is handled at the application level.
• Implications : The degree of local autonomy is maximum because
each site operates independently.
• Query processing is more complex, as applications must handle
schema mismatches and data translation manually.
• Example : Site A uses Oracle , Site B uses MongoDB , and Site C uses
Cassandra . 17
4. Query Processing in Distributed Databases
▪ Query processing in distributed databases involves breaking down a user query
into subqueries, executing them on relevant nodes, and combining the results. Key
steps include:
Query Decomposition :
▪ Parsing, validating, and transforming the query into a relational algebra
expression.
Example: Breaking a SQL query into smaller operations like selection, projection,
and join.
Data Localization :
▪ Identifying which fragments or nodes contain the required data.
▪ Example: Mapping a query to specific horizontal or vertical fragments.
Global Query Optimization :
▪ Deciding the most efficient way to execute the query, considering factors like
network latency, data transfer costs, and local processing costs.
e.g: Choosing between retrieving data from a remote site or performing a local join.
▪ Execution and Result Integration :
▪ Executing subqueries on relevant nodes and combining the results to produce the
18
final output.
4. Query Processing in Distributed Databases
▪ Issues
– Cost of transferring data (files and results) over the network.
• This cost is usually high, so some optimization is necessary.
• Example: suppose there are three sites. Where the relation Employee
at site 1, Department at Site 2 and no relation at site 3
– Employee at site 1. 10,000 rows. Row size = 100 bytes. Table
size = 106 bytes.
– Department at Site 2. 100 rows. Row size = 35 bytes. Table size
= 3,500 bytes.
– And a query is initiated from S3 to retrieve employees [First Name (15
byte long), Last name (15 byte long) and Department name (10 byte long)
total of 40 bytes]
• Q: For each employee, retrieve employee Fname, Lname, and
department name
• Q: Fname,Lname,Dname (Employee Dno = Dnumber Department)
Fname Minit Lname SSN Bdate Address Sex Salary Superssn Dno

Dname Dnumber Mgrssn Mgrstartdate 19


▪ Assumption
– The result of this query will have 10,000 rows, assuming
that every employee is related to a department.
– Suppose each result row 40 bytes long. The query is
submitted at site 3 and the result is sent to this site.
– Problem: Employee and Department relations are not
present at site 3.

• what is your best strategy that can optimize data


transportation cost?

20
• Strategies : Minimizing data transfer.
1. Transfer Employee and Department to site 3.
• Total transfer bytes = 1,000,000 + 3500 = 1,003,500 bytes.
2. Transfer Employee to site 2, execute join at site 2 and send
the result to site 3.
• Transferring employees data from site 1 to site 2: 1,000,000 bytes
• Query result size = 40 * 10,000 = 400,000 bytes.
• Total transfer size = 1,000,000 + 400,000 = 1,400,000 bytes.
3. Transfer Department relation to site 1, execute the join at site
1, and send the result to site 3.
• Data Transfer from site 2 to site 1: 3500 bytes
• Query result size = 40 * 10,000 = 400,000 bytes
• Total bytes transferred = 3500+ 400,000 = 403,500 bytes.
– Preferred approach: strategy 3.

21
Example 2 : Consider the query
– Q’: For each department, retrieve the department name ,Fname
and LName of the department manager
• Relational Algebra expression:
– Fname,Lname,Dname (Employee Mgrssn = SSN Department)
• The result of this query will have 100 tuples, assuming that every
department has a manager, the execution strategies are:
1. Transfer Employee and Department to the result site and
perform the join at site 3.
• Total bytes transferred = 1,000,000 + 3500 = 1,003,500
bytes.
2. Transfer Employee to site 2, execute join at site 2 and send the
result to site 3.
• data ransfer from site 1 to site 2: 1,000,000
• Query result size = 40 * 100 = 4000 bytes.
• Total transfer size = 4000 + 1,000,000 = 1,004,000 bytes.
3. Transfer Department relation to site 1, execute join at site 1 and
send the result to site 3.
• Total transfer size = 4000 + 3500 = 7500 bytes. 22
Preferred strategy: Choose strategy 3.
Example 3: Now suppose the result is needed at site2. Possible strategies
:
1. Transfer Employee relation to site 2, execute the query and
present the result to the user at site 2.
• Total transfer size = 1,000,000 bytes for both queries Q
and Q’.
2. Transfer Department relation to site 1, execute join at site 1
and send the result back to site 2.
• Total transfer size for
– Q = 400,000 + 3500 = 403,500 bytes
– Q’ = 4000 + 3500 = 7500 bytes.

❑ Preferred strategy: Choose strategy 2.

23
5. Concurrency Control and Recovery
▪ Distributed Databases encounter a number of concurrency
control and recovery problems which are not present in
centralized databases. Some of them are listed below.

– Dealing with multiple copies of data items:


➢The concurrency control must maintain global
consistency. Likewise the recovery mechanism must
recover all copies and maintain consistency after
recovery.
– Failure of individual sites:
• Database availability must not be affected due to the
failure of one or two sites and the recovery scheme must
recover them before they are available for use.

24
– Communication link failure:

• This failure may create network partition which would affect


database availability even though all database sites may be
running.

– Distributed commit:

• A transaction may be fragmented and they may be executed by a


number of sites. This require a two or three-phase commit
approach for transaction commit.

– Distributed deadlock:

• Since transactions are processed at multiple sites, two or more


sites may get involved in deadlock. This must be resolved in a
distributed manner. 25
5. 1 Distributed Concurrency control
i. Primary site technique: A single site is designated as a
primary site which serves as a coordinator for transaction
management.

Primary site
Site 5
Site 1

Site 4 Communications neteork

Site 3 Site 2

26
• Transaction management:
– Concurrency control and commit are managed by this site.
– In two phase locking, this site manages locking and
releasing data items. If all transactions follow two-phase
policy at all sites, then serializability is guaranteed.

– Advantages:
• An extension to the centralized two phase locking so
implementation and management is simple.
• Data items are locked only at one site but they can be
accessed at any site.
– Disadvantages:
• All transaction management activities go to primary site
which is likely to overload the site.
• If the primary site fails, the entire system is inaccessible.
– To aid recovery a backup site is designated which behaves as
a shadow of primary site. In case of primary site failure,
backup site can act as primary site. 27
i. Primary Copy Technique:

– In this approach, instead of a site, a data item partition is


designated as primary copy. To lock a data item just the
primary copy of the data item is locked.

• Advantages:

– Since primary copies are distributed at various sites, a


single site is not overloaded with locking and unlocking
requests.

• Disadvantages:

– Identification of a primary copy is complex. A distributed


directory must be maintained, possibly at all sites. 28
Recovery from a coordinator failure

• In both approaches a coordinator site or copy may become unavailable. This


will require the selection of a new coordinator.

– Primary site approach with no backup site:

• Aborts and restarts all active transactions at all sites. Elects a new
coordinator and initiates transaction processing.

– Primary site approach with backup site:

• Suspends all active transactions, designates the backup site as the


primary site and identifies a new back up site.

• Primary site receives all transaction management information to


resume processing.

– Primary and backup sites fail or no backup site:

• Use election process to select a new coordinator site. 29


i. Concurrency control based on voting:

– There is no primary copy of coordinator.

– Send lock request to sites that have data item.

– If majority of sites grant lock then the requesting transaction


gets the data item.

– Locking information (grant or denied) is sent to all these


sites.

– To avoid unacceptably long wait, a time-out period is


defined. If the requesting transaction does not get any vote
information then the transaction is aborted.
30
Client-Server Database Architecture

• It consists of clients running client software, a set of servers


which provide all database functionalities and a reliable
communication infrastructure.

Server 1 Client 1

Client 2

Server 2 Client 3

Server n Client n

31
• Clients reach server for desired service, but server does reach
clients.
• The server software is responsible for local data management
at a site, much like centralized DBMS software.
• The client software is responsible for most of the distribution
function.
• The communication software manages communication among
clients and servers.
• The processing of a SQL queries goes as follows:
– Client parses a user query and decomposes it into a number
of independent sub-queries. Each subquery is sent to
appropriate site for execution.
– Each server processes its query and sends the result to the
client.
– The client combines the results of subqueries and produces
the final result. 32

You might also like