0% found this document useful (0 votes)
6 views19 pages

Distributed Database Concepts Explained

The document discusses distributed databases (DDB) and their management systems (DDBMS), emphasizing concepts such as data fragmentation, sharding, replication, and query processing. It explains horizontal and vertical fragmentation methods for distributing data across sites, as well as the complexities involved in query optimization and data allocation. Additionally, it categorizes types of distributed database systems based on homogeneity and local autonomy, highlighting federated and multidatabase systems.

Uploaded by

Abenezer Daniel
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views19 pages

Distributed Database Concepts Explained

The document discusses distributed databases (DDB) and their management systems (DDBMS), emphasizing concepts such as data fragmentation, sharding, replication, and query processing. It explains horizontal and vertical fragmentation methods for distributing data across sites, as well as the complexities involved in query optimization and data allocation. Additionally, it categorizes types of distributed database systems based on homogeneity and local autonomy, highlighting federated and multidatabase systems.

Uploaded by

Abenezer Daniel
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Chapter Two

Distributed Database
Concepts
Distributed Database Concepts
A distributed database (DDB) is a collection of
multiple logically interrelated databases distributed
over a computer network.
A distributed database management system
(DDBMS) is a software system that manages a
distributed database while making the distribution
transparent to the user.
Data Fragmentation and Sharding
 The simplest logical units are the relations
themselves; that is, each whole relation is to be
stored at a particular site.
 For example, we must decide on a site to store
each of the relations EMPLOYEE, DEPARTMENT,
PROJECT, WORKS_ON, and DEPENDENT in
COMPANY DATABASE.
 In many cases, however, a relation can be divided
into smaller logical units for distribution.
 For example, assume that there are three computer
sites—one for each department in the company.
 We may want to store the database information
relating to each department at the computer site
for that department.
Cont ..
A horizontal fragment or shard of a relation is a
subset of the tuples in that relation.
 The tuples that belong to the horizontal fragment
can be specified by a condition on one or more
attributes of the relation, or by some other
mechanism.
 Often, only a single attribute is involved in the
condition.
 For example, we may define three horizontal
fragments on the EMPLOYEE with the following
conditions: (Dno = 5), (Dno = 4), and (Dno = 1)—
each fragment contains the EMPLOYEE tuples
working for a particular department.
 Derived horizontal fragmentation applies the
partitioning of a primary relation (DEPARTMENT in
Cont ..
 Horizontal fragmentation divides a relation
horizontally by grouping rows to create subsets of
tuples, where each subset has a certain logical
meaning.
 These fragments can then be assigned to different
sites (nodes) in the distributed system.
 Each site may not need all the attributes of a
relation, which would indicate the need for a
different type of fragmentation.
 Vertical fragmentation divides a relation “vertically”
by columns.
 A vertical fragment of a relation keeps only certain
attributes of the relation.
 For example, we may want to fragment the
EMPLOYEE relation into two vertical fragments.
Cont ..
 This vertical fragmentation is not quite proper,
because if the two fragments are stored separately,
we cannot put the original employee tuples back
together since there is no common attribute
between the two fragments.
 It is necessary to include the primary key or some
unique key attribute in every vertical fragment so
that the full relation can be reconstructed from the
fragments.
 Hence, we must add the Ssn attribute to the
personal information fragment.
 We can intermix the two types of fragmentation,
yielding a mixed fragmentation.
 For example, we may combine the horizontal and
vertical fragmentations of the EMPLOYEE relation
Cont ..
The following figures show horizontal and vertical
fragments from primary relation-departement and
secondary relation-employee, for departement 5 and
4, which can be stored at different sites.
Data Replication and Allocation
 If a fragment is stored at more than one site, it is
said to be replicated.
 Each fragment—or each copy of a fragment—must
be assigned to a particular site in the distributed
system.
 This process is called data distribution (or data
allocation).
 The choice of sites and the degree of replication
depend on the performance and availability goals
of the system and on the types and frequencies of
transactions submitted at each site.
 For example, if high availability is required,
transactions can be submitted at any site, and
most transactions are retrieval only, a fully
replicated database is a good choice.
Cont ..
 Data that is accessed at multiple sites can be
replicated at those sites.
 If many updates are performed, it may be useful to
limit replication.
 Finding an optimal or even a good solution to
distributed data allocation is a complex
optimization problem.
 The next figure shows one example of data
distribution and replication among distributed
databases based on the company database.
Cont ..
Query Processing and Optimization
in Distributed Databases
 In a distributed system, several additional factors
further complicate query processing.
 The first is the cost of transferring data over the
network.
 This data includes intermediate files that are
transferred to other sites for further processing, as
well as the final result files that may have to be
transferred to the site where the query result is
needed.
 Although these costs may not be very high if the
sites are connected via a high-performance local
area network, they become significant in other
types of networks.
 Hence, DDBMS query optimization algorithms
Cont ..
 For example, suppose that the EMPLOYEE (10,000
records, 100 bytes long each) and DEPARTMENT
(100 records, 35 bytes long each) relations are
distributed at site one and two resepectively.
 We will assume in this example that neither relation
is fragmented.
 The size of the EMPLOYEE relation is 100 * 10,000 =
106 bytes, and the size of the DEPARTMENT relation
is 35 * 100 = 3,500 bytes.
 Consider the query Q: For each employee, retrieve
the employee name and the name of the
department for which the employee works.
 This can be stated as follows in the relational
algebra:
Cont ..
 The result of this query will include 10,000 records,
assuming that every employee is related to a
department.
 Suppose that each record in the query result is 40
bytes long.
 The query is submitted at a distinct site 3, which is
called the result site because the query result is
needed there.
 Neither the EMPLOYEE nor the DEPARTMENT
relations reside at site 3.
 There are three simple strategies for executing this
distributed query:
 1. Transfer both the EMPLOYEE and the
DEPARTMENT relations to the result site, and
perform the join at site 3.
Cont ..
 2. Transfer the EMPLOYEE relation to site 2, execute
the join at site 2, and send the result to site 3.
 The size of the query result is 40 * 10,000 =
400,000 bytes, so 400,000 + 1,000,000 =
1,400,000 bytes must be transferred.
 3. Transfer the DEPARTMENT relation to site 1,
execute the join at site 1, and send the result to
site 3. In this case, 400,000 + 3,500 = 403,500
bytes must be transferred.
 If minimizing the amount of data transfer is our
optimization criterion, we should choose strategy 3.
 With distributed query processing using the
semijoin operation, we can reduce the number of
tuples in a relation before transferring it to another
site.
Cont ..
 Intuitively,the idea is:
 1. Project the join attributes of one relation R
(DEPARTMENT in our example) at site 2 and send
the joining column of R (DEPARTMENT) to site 1
where the other relation S (EMPLOYEE in our
example) is located;
 2. This transferred column file is then joined with S
(EMPLOYEE) at site 1.
 Following that, the join attributes, along with the
attributes required in the result, are projected out
and transferred back to site 2.
 3. Join the transferred file with R (DEPARTMENT)
and present the result to the user at site 2.
Cont ..
 Hence, only the joining column of R (DEPARTMENT)
is transferred in one direction, and
 A subset of S (EMPLOYEE) with no extraneous
tuples or attributes is transferred in the other
direction.
 We limited the EMPLOYEE attributes and tuples
transmitted to site 2 in step 2 to only those that will
actually be joined with a DEPARTMENT tuple in step
3.
 For query Q, this turned out to include all
EMPLOYEE tuples, so little improvement was
achieved.
 If only a small fraction of the tuples in S
(EMPLOYEE) participate in the join, this can be an
efficient solution to minimizing data transfer.
Types of Distributed Database Systems
 There are criteria and factors that make DDBMSs
different.
 The first factor is the degree of homogeneity of the
DDBMS software.
 If all servers (or individual local DBMSs) use
identical software and all users (clients) use
identical software, the DDBMS is called
homogeneous; otherwise, it is called
heterogeneous.
 Another factor related to the degree of
homogeneity is the degree of local autonomy.
 If there is no provision for the local site to function
as a standalone DBMS, then the system has no
local autonomy.
 On the other hand, if direct access by local
Cont ..
 For a centralized database, there is complete
autonomy but a total lack of distribution and
heterogeneity.
 But, the degree of local autonomy provides further
ground for classification into federated and
multidatabase systems.
 In such systems, each server is an independent and
autonomous centralized DBMS that has its own
local users, local transactions, and DBA, and hence
has a very high degree of local autonomy.
 The term federated database system (FDBS) is
used when there is some global view or schema of
the federation of databases that is shared by the
applications.
 On the other hand, a multidatabase system has full
Cont ..
 Both systems are hybrids between distributed and
centralized systems, and the distinction we made
between them is not strictly followed.
 We will refer to them as FDBSs in a generic sense.
 There is also a system with full local autonomy and
full heterogeneity—this could be a peer-to-peer
database system.
 In a heterogeneous FDBS, one server may be a
relational DBMS, another a network DBMS (such as
Computer Associates’ IDMS or HP’S IMAGE/3000),
and a third an object DBMS (such as Object
Design’s ObjectStore) or hierarchical DBMS (such as
IBM’s IMS);
 In such a case, it is necessary to have a canonical
system language and to include language

You might also like