0% found this document useful (0 votes)
10 views35 pages

Parallel and Distributed Database Overview

The document discusses Parallel and Distributed Databases, covering concepts such as parallel database architecture, types of parallelism, and distributed database systems. It highlights the advantages of parallel databases, including performance improvement, high availability, and proper resource utilization, while also explaining partitioning techniques and the differences between homogeneous and heterogeneous distributed databases. Additionally, it details data storage strategies like fragmentation and replication to enhance database performance and availability.

Uploaded by

myhousehold510
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views35 pages

Parallel and Distributed Database Overview

The document discusses Parallel and Distributed Databases, covering concepts such as parallel database architecture, types of parallelism, and distributed database systems. It highlights the advantages of parallel databases, including performance improvement, high availability, and proper resource utilization, while also explaining partitioning techniques and the differences between homogeneous and heterogeneous distributed databases. Additionally, it details data storage strategies like fragmentation and replication to enhance database performance and availability.

Uploaded by

myhousehold510
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Parallel and Distributed Databases

Syllabus Content
• Parallel Database:
• Architecture, I/O Parallelism, Interquery, Intraquery
• Intraoperation and Interoperation Parallelism
• Distributed Databases
• Types of Distributed Database Systems,
• Distributed Data Storage, Distributed Query Processing
Parallel DBMS
• Parallel DBMS is a Database Management System that runs through
multiple processors and disks.
• They combine two or more processors also disk storage that helps
make operations and executions easier and faster.
• A parallel server can allow access to a single database by users on
multiple machines. It also performs many parallelization operations
like data loading, query processing, building indexes, and evaluating
queries.
Working of parallel database

• Parallel processing divides a large task into many smaller tasks and
executes the smaller tasks concurrently on several CPU’s and
completes it more quickly.
• The driving force behind parallel database systems is the demand of
applications that have to query extremely large databases of the
order of terabytes or that have to process a large number of
transactions per second.
• In parallel processing, many operations are performed simultaneously
as opposed to serial processing, in which the computational steps are
performed sequentially.
Advantages of Parallel
Databases
• Performance Improvement –
By connecting multiple resources like CPU and disks in parallel we can significantly
increase the performance of the system.

• High availability –
In the parallel database, nodes have less contact with each other, so the failure of one
node doesn’t cause for failure of the entire system. This amounts to significantly higher
database availability.

• Proper resource utilization –


Due to parallel execution, the CPU will never be ideal. Thus, proper utilization of
resources is there.

• Increase Reliability –
When one site fails, the execution can continue with another available site which is having
a copy of data. Making the system more reliable.
Speed up and Scale up
• The ability to execute the tasks in less time by
increasing the number of resources is called
Speedup.
• Speedup =time original/time parallel
• Where ,
• time original = time required to execute the task
using 1 processor
• time parallel = time required to execute the task
using 'n' processors fig. Ideal Speedup curve
Speed up and Scale up
• Scaleup means the ability to maintain the
performance of the system when both
workload and resources increase
proportionally.
• Scaleup = Volume Parallel/Volume
Original
• Where ,
• Volume Parallel = volume executed in a
fig. Ideal Scaleup curve
given amount of time using 'n' processor
• Volume Original = volume executed in a
given amount of time using 1 processor
Parallel Database-Architectural Models

• Architectural Models
• There are several architectural models for parallel Database, which are
given below −
• Shared memory architecture.
• Shared disk architecture.
• Shared nothing architecture.
Parallel Database
• Shared Memory System
• Every computer processor is able to access and
process data from common memory modules or
unit through intercommunication channel.
• This architecture is also commonly known as SMP
or Symmetric Multi-processing
• Shared Disk System
• A Shared Disk System is an architecture of
Database Management System where every
computer processors can access multiple disk
through intercommunication network.
• It can also access and utilize every local memory.
Parallel Database
• Shared Nothing System
• A Shared Nothing System is an architecture of
Database Management System where every
processor has their own disk and memory for the
objective of efficient workflows.
• The processors can communicate with other
processors using intercommunication network.
• Each of the processors act like servers to store
data on the disk.
I/O parallelism in parallel database- Partitioning
• Parallel I/O refers to the process of writing to, or reading from, two or more I/O devices
simultaneously; it can result in significant improvements in throughput. This can be achieved
through partitioning.
• Horizontal Fragmentation –In horizontal partitioning, the tuples of a relation are divided (or
declustered) among many disks, so that each tuple resides on one disk.
• Partitioning the tables using the conditions specified through WHERE clause of SQL query to
distribute bunch of tuples (records). For example, consider the following schema;
• STUDENT (Regno, SName, Address, Branch, Phone)
• In this table, Branch column is used to store the academic branch in which the student
admitted in. suppose, there are various branches like ‘BTech CIVIL’, ‘BTech MECH’, ‘BTech CSE’,
and so on, then the following query would be used to partition the table using horizontal
partitioning.
• SELECT * FROM student WHERE Branch = branch name;
• In the result, there won’t be any changes in the schema of the table; i.e the structure of the
table is unaltered. Only, ‘BTech CIVIL’ students fragmented into one partition, ‘BTech CSE’
I/O parallelism in parallel
database -Partitioning
• Vertical Fragmentation – Partitioning tables using the decomposition
rules to break and distribute the tables into multiple partitions
vertically (different schemas) is called Vertical fragmentation.
• For example, if we would like to break STUDENT into different tables
like STUDENT(Regno, SName, Address, Branch) and
STU_PHONE(Regno, Phone), i.e into two different schema on columns
is Vertical partitioning.
Partitioning Techniques
•There are various partitioning strategies proposed to manage the data
distribution into multiple processors evenly.
•Round-robin.
•This strategy scans the relation in any order and sends the ith tuple to disk
number i mod n.
•The round-robin scheme ensures an even distribution of tuples across disks;
that is, each disk has approximately the same number of tuples as the others.
•For example:
•If the number of available disks n is 10, then first record goes to D1 (1 mod 10
= 1), second record goes to D2 (2 mod 10 =2), and so on and 10th record goes to
D0 (10 mod 10 = 0), 11th record goes to D1 (11 mod 10 = 1). This scheme
distributes data evenly in all the disks.
Partitioning Techniques
•Hash partitioning.
•This declustering strategy designates one or more attributes from the given relation’s schema as the
partitioning attributes.
•A hash function is chosen whose range is {0, 1, . . . , n − 1}. Each tuple of the original relation is hashed on
the partitioning attributes. If the hash function returns i, then the tuple is placed on disk Di.
• For example, consider the following table;
• EMPLOYEE(ENo, EName, DeptNo, Salary, Age)
• If we choose DeptNo attribute as the partitioning attribute and if we have 10 disks to distribute the data,
then the following would be a hash function;
• h(DeptNo) = DeptNo mod 10
• If we have 10 departments, then according to the hash function, all the employees of department 1 will go
into disk 1, department 2 to disk 2 and so on.
• As another example, if we choose the EName of the employees as partitioning attribute, then we could
have the following hash function;
• h(EName) = (Sum of ASCII value of every character in the name) mod n,
Range Partitioning
• For example, for the EMPLOYEE relation given above, if the partitioning attribute
is Salary, then the vector would be one as follows;
• [5000, 15000, 30000],
• where every value means the individual range of salaries. That is, 5000 represents
the first range (0 – 5000), 15000 represents the range (5001 – 15000), 30000
represents the third range (15001 – 30000), and it includes the final range which is
(30001 – rest). Hence, the vector with 3 values represents 4 disks/partitions.

The above discussed partition strategies must be chosen carefully according to the
workload of your parallel database system. The workload may involve many
components like, which attribute is frequently used in any queries as filtering
condition, the number of records in a table, the size of the database, approximate
number of incoming queries in a day, etc.
Partitioning example

• Perform following partitioning techniques.


• A. Round-Robin Partitioning
• B. Hash Partitioning
• C. Range Partitioning
Interquery and Intraquery parallelism
• Interquery Parallelism
• In interquery parallelism, different queries or transaction execute in parallel
with one another.
• This form of parallelism can increase transactions throughput. The response
times of individual transactions are not faster than they would be if the
transactions were run in isolation.
• Thus, the primary use of interquery parallelism is to scale up a transaction
processing system to support a more significant number of transactions per
second.
• Interquery parallelism is the easiest form of parallelism to support in a
database system—particularly in a shared-memory parallel system.
Interquery and Intraquery parallelism
• Intraquery Parallelism
• Intraquery parallelism defines the execution of a single query in parallel on
multiple processors and disks.
• Using intraquery parallelism is essential for speeding up long-running queries.
• This application of parallelism decomposes the serial SQL, query into lower-
level operations such as scan, join, sort, and aggregation.
• To illustrate the parallel evaluation of a query, consider a query that requires a
relation to be sorted. Suppose that the relation has been partitioned across
multiple disks by range partitioning on some attribute, and the sort is
requested on the partitioning attribute. The sort operation can be
implemented by sorting each partition in parallel, then concatenating the
sorted partitions to get the final sorted relation.
Intraoperation and Interoperation Parallelism
• we may be able to pipeline the output of one operation to another operation.
The two operations can be executed in parallel on separate processors,
• one generating output that is consumed by the other, even as it is generated.
• In summary, the execution of a single query can be parallelized in two ways:
• Intraoperation parallelism.
• We can speed up processing of a query by parallelizing the execution of each
individual operation, such as sort, select, project, and join.
• Interoperation parallelism.
• We can speed up processing of a query by executing in parallel the different
operations in a query expression.
Distributed Databases
• What is a distributed database?
• Distributed database system is one in which the
data belonging to a single logical database is
distributed to two or more physical databases
to ensure reliability and availability
• A distributed database is a database in which all
storage devices are not attached to a common
CPU. Data may be stored in multiple sites
separate from each other.
• In a distributed database, the data is spread or
replicated among several databases which are
physically separate from each other. These
databases are connected through a network so
that they appear as a single database to the
user.
Types of Distributed Databases
• Distributed databases can be broadly
classified into homogeneous and
heterogeneous distributed database
environments
• Homogeneous Distributed Databases
• In a homogeneous distributed database, all
the sites use identical DBMS and operating
systems. Its properties are
• The sites use very similar software.
• The sites use identical DBMS or DBMS from
the same vendor.
• Each site is aware of all other sites and
cooperates with other sites to process user
requests.
• The database is accessed through a single
interface as if it is a single database.
Types of Distributed Databases
• There are two types of homogeneous
distributed database are:
[Link] − Each database is
independent that functions on its
own. They are integrated by a
controlling application and use
message passing to share data
updates.
[Link]-autonomous − Data is distributed
across the homogeneous nodes and a
central or master DBMS co-ordinates
data updates across the sites.
Types of Distributed Databases
• Heterogeneous Distributed Databases
• In a heterogeneous distributed database,
different sites have different operating systems,
DBMS products and data models. Its properties
are −
• Different sites use dissimilar schemas and
software.
• The system may be composed of a variety of
DBMSs like relational, network, hierarchical or
object oriented.
• Query processing is complex due to dissimilar
schemas.
• Transaction processing is complex due to
dissimilar software.
• A site may not be aware of other sites and so
there is limited co-operation in processing user
requests.
Types of Distributed Databases
• Types of Heterogeneous Distributed
Databases
[Link] − The heterogeneous
database systems are independent in
nature and integrated together so
that they function as a single
database system.
[Link]-federated − The database systems
employ a central coordinating module
through which the databases are
accessed.
Distributed Data Storage

• Distributed Data storage is an intelligent distribution of your data pieces,


(called data fragments) to improve database performance and Data
Availability for end-users.
• It aims to reduce overall costs of transaction processing while also
providing accurate data rapidly in your DDBMS systems.
• Distributed Data storage is one of the key steps in building your
Distributed Database Systems.
• There are two common strategies used in optimal Data Allocation: Data
Fragmentation and Data Replication.
Distributed Data Storage
• Fragmentation –
In this approach, the relations are fragmented (i.e., they’re divided into smaller parts) and
each of the fragments is stored in different sites where they’re required.
• Fragmentation is a process of disintegrating relations or tables into several partitions in
multiple sites. It divides a database into various subtables and sub relations so that data can
be distributed and stored efficiently. Fragmentation of relations can be done in two ways:
•Horizontal fragmentation– Splitting by rows – The relation is fragmented into groups of
tuples so that each tuple is assigned to at least one fragment.
• For example, in the student schema, if the details of all students of Computer Science
Course needs to be maintained at the School of Computer Science, then the designer will
horizontally fragment the database as follows −
• CREATE COMP_STD AS
• SELECT * FROM STUDENT
• WHERE COURSE = "Computer Science";
Distributed Data Storage
•Vertical fragmentation – Splitting by columns –
•The schema of the relation is divided into smaller schemas. Each fragment must contain a
common candidate key so as to ensure a lossless join.
•In certain cases, an approach that is hybrid of fragmentation and replication is used.
•For example, let us consider that a University database keeps records of all registered
students in a Student table having the following schema.
• STUDENT
Regd_No Name Course Address Semester Fees Marks

• Now, the fees details are maintained in the accounts section. In this case, the designer will
fragment the database as follows −
• CREATE TABLE STD_FEES AS
• SELECT Regd_No, Fees
• FROM STUDENT;
Distributed Data Storage
•Hybrid Fragmentation
•In hybrid fragmentation, a combination of horizontal and vertical
fragmentation techniques are used.
•Hybrid fragmentation can be done in two alternative ways −
•At first, generate a set of horizontal fragments; then generate vertical
fragments from one or more of the horizontal fragments.
•At first, generate a set of vertical fragments; then generate horizontal
fragments from one or more of the vertical fragments.
Distributed Data Storage
•Fragmentation Example
Distributed Data Storage
• Replication –
In this approach, the entire relationship is stored redundantly at 2 or more sites. If the entire database is
available at all sites, it is a fully redundant database. Hence, in replication, systems maintain copies of data.
• This is advantageous as it increases the availability of data at different sites.
• However, it has certain disadvantages as well. Data needs to be constantly updated. Any change made at one site
needs to be recorded at every site that relation is stored or else it may lead to inconsistency. This is a lot of
overhead. Also, concurrency control becomes way more complex as concurrent access now needs to be checked
over a number of sites.
• Advantages of Data Replication
• Reliability − In case of failure of any site, the database system continues to work since a copy is available at another
site(s).
• Reduction in Network Load − Since local copies of data are available, query processing can be done with reduced
network usage, particularly during prime hours. Data updating can be done at non-prime hours.
• Quicker Response − Availability of local copies of data ensures quick query processing and consequently quick
response time.
• Simpler Transactions − Transactions require less number of joins of tables located at different sites and minimal
coordination across the network. Thus, they become simpler in nature.
Distributed Data Storage
•Types of Data Replication In DBMS
•Transactional Replication
•Snapshot Replication
•Merge Replication
•Transactional Replication
•Transactional Replication makes a complete copy of your database, as well as copies of new data changes. In this type of
Data Replication, changes to your database are synced in real-time and in the same order as they occur. This guarantees
transactional consistency.
•Snapshot Replication
•Snapshot Replication is perhaps the simplest type of Data Replication that copies “snapshots” of your database. It
replicates the current state of your database as is, at a specific point in time, without including any changes/updates to
your data. This kind of replication is helpful when changes made to your databases are infrequent.
•Merge Replication
•Merge Replication combines data from several databases into a single database. This type of Data Replication tracks
subsequent data changes and schema modifications made at publishers and subscribers and synchronizes the same to your
database using merge agents. A great advantage of using Merge Replication is that it allows publishers and subscribers to
independently modify the database.
Query Processing in Distributed DBMS

• Query processing in a distributed database management


system requires the transmission of data between the computers in a
network.
• Query processing in DBMS is different from query processing in
centralized DBMS due to the communication cost of data transfer
over the network.
• The transmission cost is low when sites are connected through high-
speed Networks and is quite significant in other networks.
• 1. Costs (Transfer of data) of Distributed Query processing :
• In Distributed Query processing, the data transfer cost of distributed query
processing means the cost of transferring intermediate files to other sites for
processing and therefore the cost of transferring the ultimate result files to the
location where that result’s required.
• Let’s say that a user sends a query to site S1, which requires data from its own
and also from another site S2. Now, there are three strategies to process this
query which are given below:
• We can transfer the data from S2 to S1 and then process the query
• We can transfer the data from S1 to S2 and then process the query
• We can transfer the data from S1 and S2 to S3 and then process the query. So
the choice depends on various factors like, the size of relations and the results,
the communication cost between different sites, and at which the site result
will be utilized.
• Commonly, the data transfer cost is calculated in terms of the size of
the messages. By using the below formula, we can calculate the data
transfer cost:
• Data transfer cost = C * Size Where C refers to the cost per byte of data
transferring and Size is the no. of bytes transmitted.

• 2. Using Semi join in Distributed Query processing :


• The semi-join operation is used in distributed query processing to
reduce the number of tuples in a table before transmitting it to
another site. This reduction in the number of tuples reduces the
number and the total size of the transmission ultimately reducing the
total cost of data transfer.

You might also like