0% found this document useful (0 votes)
51 views7 pages

Query Optimization in DDBMS

Uploaded by

Gamer Bhagvan
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
51 views7 pages

Query Optimization in DDBMS

Uploaded by

Gamer Bhagvan
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

UNIT – II – DISTRIBUTED DATABASE – KCA045

UNIT - II
QUERIES AND OPTIMAZATION
Global Queries to Fragment Queries-Equivalence Transformations for
Queries-Distributed Grouping and Aggregate Function Evaluation-
Parametric Queries-Optimization of Access Strategies-Framework for Query
Optimization-Join Queries- General Queries-Introduction to Distributed
Transactions.

Global Queries to Fragment Queries


When a query is placed, it is at first scanned, parsed and validated. An internal
representation of the query is then created such as a query tree or a query graph. Then
alternative execution strategies are devised for retrieving results from the database tables.
The process of choosing the most appropriate execution strategy for query processing is
called query optimization.

Query Optimization Issues in DDBMS


In DDBMS, query optimization is a crucial task. The complexity is high since number of
alternative strategies may increase exponentially due to the following factors −

 The presence of a number of fragments.

 Distribution of the fragments or tables across various sites.

 The speed of communication links.

 Disparity in local processing capabilities.

Hence, in a distributed system, the target is often to find a good execution strategy for query
processing rather than the best one. The time to execute a query is the sum of the following

 Time to communicate queries to databases.

 Time to execute local query fragments.

 Time to assemble data from different sites.

 Time to display results to the application.

Query Processing
Query processing is a set of all activities starting from query placement to displaying the
results of the query. The steps are as shown in the following diagram −
Figure 2.1 step in query processing

Global Query Optimization

Input: Fragment query

• Find the best (not necessarily optimal) global schedule

➡ Minimize a cost function

➡ Distributed join processing

✦ Bushy vs. linear trees

✦ Which relation to ship where?

✦ Ship-whole vs ship-as-needed

➡ Decide on the use of semijoins


✦ Semijoin saves on communication at the expense of more local
processing.

➡ Join methods

✦ nested loop vs ordered joins (merge join or hash join)

Cost-Based Optimization

• Solution space

➡ The set of equivalent algebra expressions (query trees).

• Cost function (in terms of time)

➡ I/O cost + CPU cost + communication cost

➡ These might have different weights in different distributed environments


(LAN vs WAN).

➡ Can also maximize throughput

• Search algorithm

➡ How do we move inside the solution space?

➡ Exhaustive search, heuristic algorithms (iterative improvement, simulated


annealing, genetic,…)

Query Optimization Process


Figure 2.2 Query Optimization Process

Search Space

• Search space characterized by alternative execution

• Focus on join trees

• For N relations, there are O(N!) equivalent join trees that can be obtained by applying
commutativity and associativity rules

SELECT ENAME,RESP

FROM EMP, ASG,PROJ

WHERE [Link]=[Link]

AND [Link]=[Link]

Cost Functions

• Total Time (or Total Cost)

➡ Reduce each cost (in terms of time) component individually

➡ Do as little of each cost component as possible

➡ Optimizes the utilization of the resources


Increases system throughput
• Response Time

➡ Do as many things as possible in parallel

➡ May increase total time because of increased total activity

• Summation of all cost factors

• Total cost = CPU cost + I/O cost + communication cost

• CPU cost = unit instruction cost * [Link] instructions

• I/O cost = unit disk I/O cost * no. of disk I/Os

• communication cost = message initiation + transmission

2- Step – Problem Definition

• Given

➡ A set of sites S = {s1, s2, …,sn} with the load of each site

➡ A query Q ={q1, q2, q3, q4} such that each subqueryqiis the maximum
processing unit that accesses one relation and communicates with its
neighboring queries

➡ For each qi in Q, a feasible allocation set of sites Sq={s1, s2, …,sk} where each
site stores a copy of the relation in qi

• The objective is to find an optimal allocation of Q to S such that

➡ the load unbalance of S is minimized

➡ The total communication cost is minimized

• For each q in Q compute load (Sq)

• While Q not empty do

➡ Select subquerya with least allocation flexibility

➡ Select best site b fora (with least load and best benefit)
➡ Remove a from Q and recompute loads if needed

2- Step Algorithm Example


Let Q = {q1, q2, q3, q4} where q1 is associated with R1, q2 is associated with R2 joined
with the result of q1, etc.


Iteration 1: select q4, allocate to s1, set load(s1)=2


Iteration 2: select q2, allocate to s2, set load(s2)=3


Iteration 3: select q3, allocate to s1, set load(s1) =3


Iteration 4: select q1, allocate to s3 or s4

Relational Algebra :
 The Relational Algebra is used to define the ways in which relations (tables) can be
operated to manipulate their data.
 This Algebra is composed of Unary operations (involving a single table) and Binary
operations (involving multiple tables).
 Join, Semi-join these are Binary operations in Relational Algebra.
Join

Join is a binary operation in Relational Algebra.

It combines records from two or more tables in a database.

A join is a means for combining fields from two tables by using values common to
each.
Semi-Join
•A Join where the result only contains the columns from one of the joined tables.
•Useful in distributed databases, so we don't have to send as much data over the network.
•Can dramatically speed up certain classes of queries.
What is “Semi-Join” ?
Semi-join strategies are technique for query processing in distributed database systems. Used
for reducing communication cost.
A semi-join between two tables returns rows from the first table where one or more matches
are found in the second table.
The difference between a semi-join and a conventional join is that rows in the first table will
be returned at most once. Even if the second table contains two matches for a row in the first
table, only one copy of the row will be returned.
Semi-joins are written using EXISTS or IN.

A Simple Semi-Join Example “Give a list of departments with at least one employee.” Query
written with a conventional join:
SELECT [Link], [Link] FROM dept D, emp E WHERE [Link] = [Link]
ORDER BY [Link];
◦ A department with N employees will appear in the list N times.
◦ We could use a DISTINCT keyword to get each department to appear only once.

A Simple Semi-Join Example “Give a list of departments with at least one employee.” Query
written with a semi-join:
SELECT [Link], [Link] FROM dept D WHERE EXISTS (SELECT 1 FROM
emp E WHERE [Link] = [Link]) ORDER BY [Link];
◦ No department appears more than once.
◦ Oracle stops processing each department as soon as the first employee in that
department is found.

Common questions

Powered by AI

In distributed databases, semi-join operations decrease the amount of data transferred across the network by eliminating unnecessary data that doesn't match query conditions from one of the join tables . This optimization reduces communication costs significantly and speeds up processing by limiting the volume of data that needs to be processed at various sites .

The two-step algorithm aims to minimize load imbalance among sites and reduce total communication costs during query allocation . These goals are achieved by iteratively selecting subqueries with the least allocation flexibility and allocating them to sites with the least load that offer the best benefit. The process involves computing loads for feasible allocation sites and adjusting after each subquery is allocated .

Exhaustive search explores all possible execution strategies to find the optimal solution, which is computationally intensive and often impractical in distributed environments due to the large search space . In contrast, heuristic algorithms use methods like iterative improvement or simulated annealing to quickly converge on a good solution without exploring every possibility, thus offering a more feasible approach to query optimization in distributed databases .

Parallel execution in query processing aims to do as many things simultaneously as possible, effectively reducing response time . However, it may increase the total time due to the rise in total activity resulting from increased simultaneous operations . By managing the balance between parallel execution and total activity, distributed systems can optimize for faster query responses without unnecessary increases in resource usage .

Bushy trees allow multiple joins to be processed in parallel, potentially improving execution time in systems where such parallelism can be exploited, whereas linear trees process joins sequentially . The choice between them impacts the query optimization process, with bushy trees offering more optimization possibilities but requiring more complex resource scheduling .

Relational algebra operations, such as joins and semi-joins, impact the optimization of parametric queries by defining efficient methods for data retrieval and manipulation . These operations allow the query optimizer to find alternative execution plans that minimize resource use and execution time. In distributed databases, employing operations like semi-joins can greatly reduce communications costs and improve response times by processing only necessary data across networked sites .

Commutativity and associativity rules in relational algebra allow for the rearrangement of operations in query trees to generate equivalent trees, offering multiple execution paths for a given query . These transformations increase the search space for optimization by providing diverse strategies, enabling the query optimizer to select a route that minimizes execution costs according to the system's specific constraints .

Query optimization in DDBMS is complex because the number of alternative execution strategies can increase exponentially due to factors such as the presence of numerous fragments, the distribution of fragments across various sites, varying speeds of communication links, and disparities in local processing capabilities . To manage these complexities, strategies such as minimizing cost functions and employing distributed join processing (bushy vs. linear trees, ship-whole vs. ship-as-needed), and the use of semi-joins are adopted, which help reduce communication costs at the expense of more local processing .

The semi-join strategy is beneficial in distributed databases as it reduces communication costs by only returning rows from the first table for which matches are found in the second table, thereby reducing the amount of data transmitted over the network . Unlike a conventional join, which includes all matches between the tables and potentially duplicates rows, a semi-join returns rows from the first table at most once, making it more efficient for certain queries .

In distributed systems, cost functions in query optimization measure the total cost of executing a query in terms of CPU, I/O, and communication costs . These functions guide optimization decisions by aiming to reduce each cost component individually, which increases system throughput. Cost functions may also prompt parallel execution to minimize response time, although this might increase the total activity time . By balancing these costs, the system can optimize resource utilization .

You might also like