Distributed query Optimization
• Query Optimization
• Centralized Query Optimization
• Join Ordering
• Distributed Query Optimization Algorithm
Query Optimization
• Query optimization is a crucial and difficult part of the overall query processing.
Objective of query optimization is a minimize the cost function: I/O cost + CPU
cost + Communication cost.
• Query optimization tries to minimize the response time for a given query
language and max of query types in a given system environment.
• Activity of choosing an efficient execution strategy for processing query. As there
are many equivalent transformations of same high-level query, aim of query
optimization is to choose one that minimizes resource usage.
Query Optimization
(Cont.)
Components of Distributed Query Optimization
1. Access Method
• Table can be accessed by 2 different ways:
a) Complete scanning the entire table
b) Using an index
2. Join Criteria
• If more than one table is accessed, the manner in which they are to be
joined together must be determined. Usually will provide several
different methods of joining tables.
3. Transmission Costs
• If data from multiple sites must be joined to satisfy a single query, then the
cost of transmitting the results from intermediate steps needs to
be factored into the equation.
Search
Space
Search
Space
• Query execution plans are typically abstracted by means of
operator trees which define the order in which the operations are
executed.
• Search space characterized by alternative execution
• Focus on join trees
• For N relations, there are @N( equivalent join trees that can
be obtained by applying commutatively and associatively rules
Search Space
(Cont.)
Query optimizers typically restrict the size of the search space they
consider.
The first restriction is to use heuristics.
The most common heuristic is to perform selection and projection when
accessing base relations.
Perform unary operations before binary operations.
The second restriction on shape of the join tree.
A linear tree is a tree such that at least one operand of each operator node is a base relation.
Abushy tree is more general and may have operators with no base
relations as operands
Search Space
(Cont.)
Search Strategy
• How to “move” in the search space.
• Deterministic algorithm
• Start from base relations and build plans by adding one relation at each step
• Dynamic programming: breadth-first
• Greedy: depth-first
• Small relations
Randomized algorithm
Search for optimality s around a particular starting point Do not guarantee that the best
solution is obtained
But avoid the high cost of optimization (terms of memory and time) Trade
optimization time for execution time
Better when > 10 relations
Search Strategy (Cont.)
• Deterministic
R1 R1 R1
Step
1
Search Strategy (Cont.)
• Randomized
R, R2 R R3
1
Cost Functions
Distribution Cost Model
• Total Time (or Total Cost)
• Reduce each cost (in terms of time) component individually
• Do as little of each cost component as possible
• Optimizes the utilization of the resources
Increases system throughput
Distribution Cost Model
(Cont.)
• Local Processing Time
• CPU time and I/O time
• Communication Time
• Fixed time to initiate a message and Time to transmit the data
• Response Time
• Do as many things as possible in parallel
• May increase total time because of increased total activity
Cost based Query Optimization for Distributed Database
• Summation of all cost factor
• Total cost= CPU cost+1/0 cost+ communication cost
• CPU cost = unit instruction cost * no. of instructions
• 1/0 cost = unit disk 1/0 cost * no. of disk I/0s
• communication cost= message initiation+ transmission
• Response time = CPU time + I/O time + Communication time
• CPU time = unit instruction time * number of sequential instructions
• I/O time = unit I/O time * number of sequential I/0s
• Communication time = (unit message initiation time * number of sequential
message) + (unit transmission time * number of sequential bytes)
Steps in a Cost-based Query Optimization
1. Parsing
2. Transformation
3. Implementation
4. Plan selection based on cost estimates
Steps in a Cost-based Query Optimization
(Cont.)
• Query parser:
• Verify validity of the SQL statement. Translate query into an internal structure
using relational calculus.
• Query Optimizer:
• Find the best expression from various different algebraic expressions. Criteria
used is “ Cheapness ”.
• Code generator:
• Make calls for the query processor as a result of the work done by the
optimizer.
Steps in a Cost-based Query Optimization(Cont.)
• Query processor:
• Execute the calls obtained from the code generator
• When there is an interactive user query, the query goes through the query
parser, query optimizer, code generator and query processor each time.
• Embedded query:
• When there is an embedded query, the query does not have to through
the query parser, query optimizer, code generator, and the query processor
each time. In an embedded query, the calls generated by the code generator
are stored in the database.
Total Cost Factors
• Wide area network
• Message initiation and transmission costs high
• Local processing cost is law (fast mainframes or minicomputers)
• Ratio of communication to I/O costs= 20:1
• Local area networks
• Communication and local processing casts are more or lex equal
• Ratio= 1:1.6
Centralized Query Optimization
• In this section we present the main query optimization techniques for centralized
• systems. This presentation is a prerequisite to understanding distributed
query optimization
• for three reasons. First, a distributed query is translated into local queries,
• each of which is processed in a centralized way. Second, distributed query
optimization
• techniques are often extensions of the techniques for centralized systems.
• Finally, centralized query optimization is a simpler problem; the minimization of
• communication costs makes distributed query optimization more complex.
• the optimization timing, which can be dynamic, static
Centralized Query Optimization (Cont.)
• In a centralized system, query processing is done with the following
aim —
• Minimization of response time of query (time taken to produce the
results to user's query).
• Maximize system throughput (the number of requests that are
processed in a given amount of time).
• Reduce the amount of memory and storage required for processing.
• Increase parallelism.
Centralized Query Optimization (Cont.)
• Centralized query optimization has divide into 3 approaches:
• Static Query Optimization
• Dynamic Query Optimization
• Hybrid Query Optimization
Dynamic Query Optimization
• Dynamic query optimization combines the two phases of
query decomposition and
• optimization with execution. The QEP is dynamically constructed
by the query
• optimizer which makes calls to the DBMS execution engine
for executing the query's
• operations. Thus, there is no need for a cost model.
• The most popular dynamic query optimization algorithm is that
of INGRES one of the first relational DBMS.
Dynamic Query Optimization (Cont.)
• The algorithm recursively breaks up a query expressed in relational
calculus (i.e., SQL) into smaller pieces which are executed along the
way. The query is first decomposed into a sequence of queries having
a unique relation in common. Then each monorelation query is
processed by selecting, based on the predicate, the best access
method to that relation (e.g., index, sequential scan).
• Advantage
• All information required to select an optimum strategy is up to date
• Disadvantage
• The performance of the query is affected because the query has to be parsed, validated and
optimized before it can be executed.
Dynamic Query Optimization (Cont.)
INGRES Algorithm:
• Que.
• To illustrate the detachment technique, we apply it to the following
query: “Names of employees working on the CAD/CAM project”
• Ans.
• This query can be expressed in SQL by the following query q1 on
the engineering database
Dynamic Query Optimization (Cont.)
• After detachment of the selections, query q1 is replaced by q11
followed by q* where JVAR is an intermediate relation.
• The successive detachments of q” may generate
Dynamic Query Optimization (Cont.)
• Note that other sub queries are also possible.
• Thus query q1 has been reduced to the subsequent queries q11 -> q12 -
> q13. Query q11 is monorelation and can be executed. However,
q12 and q13 are not monorelation and cannot be reduced by detachment.
Static Query Optimization
• Simple (i.e., mono-relation) queries are executed according to the best access
path
• Execute joins
• Determine the possible ordering of joins
• Determine the cost of each ordering
• Choose the join ordering with minimal cost
Static Query Optimization (Cont.)
• For joins, two alternative algorithms :
• Nested loops
for each tuple of external relation (cardinality n,)
for each tuple of erno/ relation (cardinality n2) join two tuples if the
join predicate is true
end end
• Complexity: n,* n2
• Merge join
sort relations
merge relations
Static Query Optimization (Cont.)
• Names of employees working on the CAD/CAM project
• Assume
• EMP has an index on ENO,
• ASG has an index on PNO,
• PROJ hasan index on PNO and an index on PNAME
Hybrid Query Optimization
• It is mixer of static and dynamic approaches; the approach is mainly
static, but dynamic query optimization may take place when high
difference between predicated and actual sizes is detected
Join Ordering in Distributed Queries
• The distributed join is a query operator that combines two relations
stored at different sites in the following way: each tuple from the first
relation is concatenated with each tuple from the second relation that
satisfies a given join condition, e.g., the match in two attributes.
• The main characteristics of a distributed join is that at least one of the
operand relations has to be transferred to another site.
Join Ordering in Distributed Queries (Cont.)
• Two basic approaches exist to order joins in distributed queries. One tries to
optimize the ordering of joins directly, whereas the other replaces joins
by combinations of semi joins in order to minimize communication costs.
• Let us first concentrate on the simple problem of operand transfer in a single join.
Consider 2 relations only.
If size(R) < Size(S)
If size(R) > Size(S)
Join Ordering in Distributed Queries (Cont.)
• The query is R l>1 S, where R and S are relations stored at diPerent sites. The
obvious choice of the relation to transfer is to send the smaller relation to the site of the
larger one.
• Multiple relations more difficult because too many alternatives.
• Compute the cost of all alternatives and select the best one.
• Necessary to compute the size of intermediate relations which is difficult.
• Use heuristics
Join Ordering- Example
Join Ordering- Example (Cont.)
• Note that we have made certain assumptions about the locations of the
three relations. This query can be executed in at least five different
ways. We describe these strategies by the following programs,
where (R Osite j) stands for “relation R is transferred to site j.”
Join Ordering- Example (Cont.) Site
2
ASG
EN PNO
O
Site 1 PROJ Site 3
EMP
Join Ordering- Example (Cont.)
• To select one of these programs, the following sizes must be known or predicted:
size(EMP), size(ASG), size(PROJ), size(EMP ASG), and size(ASG PROJ).
Furthermore, if it is the response time that is being considered, the optimization
must take into account the fact that transfers can be done in parallel with strategy
5.
• An alternative to enumerating all the solutions is to use heuristics that consider
only the sizes of the operand relations by assuming, for example, that the
cardinality of the resulting join is the product of operand cardinalities. In this
case, relations are ordered by increasing sizes and the order of execution is given by
this ordering and the join graph. For instance, the order (EMP, ASG, PROJ) could
use strategy 1, while the order (PROJ, ASG, EMP) could use strategy 4.
Semijoi-based Algorithm
• In this section we show how the semijoin operation can be used to decrease
the total time of join queries. The main shortcoming of the join
approach described in the preceding section is that entire operand
relations must be transferred between sites. The semijoin acts as a size
reducer for a relation much as a selection does.
• The join of two relations R and S over attribute A, stored at sites 1 and
2, respectively, can be computed by replacing one or both operand
relations by a semijoin with the other relation, using the following rules:
Semijoi-based Algorithm (Cont.)
• Consider the following relations, where attribute CITY has been added
to relations EMP (renamed ET), PROJ (renamed PT) and ASG
(renamed AT) of the engineering database. Attribute CITY of AT
corresponds to the city where the employee identified by ENO lives.
Semijoi-based Algorithm (Cont.)
• Transformation of Cyclic Query
Join Vs Semijoin Approaches
Distributed Query Optimization
• Distributed query optimization requires evaluation of a large number of
query trees each of which produce the required results of a query. This is
primarily due to the presence of large amount of replicated and fragmented
data. Hence, the target is to find an optimal solution instead of the best
solution.
• The main issues for distributed query optimization are —
• Optimal utilization of resources in the distributed system.
• Query trading.
• Reduction of solution space of the query.
Distributed Query Optimization (Cont.)
1. The optimization timing is dynamic for distributed INFRES, while it is static for
the others.
2. The objective function of SDD-1 and R* is to minimize total time, while
distributed INGRES aims at decreasing a combination of response time and
total time.
3. The optimization factors of the cost function are the message size for SDD-1.
system R*, which takes into account local processing time, uses message size,
number of message , I/O and CPU costs. Distributed INGRES considers both
message size and local processing time.
4. The network topology is assumed to be a wide area point-to-point network by
SDD-1. Distributed INGRES and R* can work in both local and wide area
networks.
Distributed Query Optimization
(Cont.)
5. The use of semi joins as a query optimization technique is employed by SDD-1.
Distributed INGRES and R* perform joins in a fashion similar to that of
centralized query optimization algorithms of their counterparts: INGRES and
system R.
6. Each algorithm assumes statistical information about the data. Semijoi
algorithms typically use more information.
7. INGRES can handle fragments.
SDD-1 Algorithm
• Refinements of an initial feasible solution are recursively computed until
no more cost improvement can be made. It is devised for wide area point
to point network.
• The cost of transferring the result to final site is ignored.
• It is based on hill climbing algorithm. It's a greedy approach it finds the
local minimum and iteratively tries to improve it. It may not reach global
solution all the time.
SDD-1 Algorithm
(Cont.)
• SDD-1 algorithm has 3 important characteristics as follows:
• It uses the semi-join operation to handle strategy.
• The relationship of the whole sites is not repetitive and fragmented.
• During price estimation of the whole algorithm, the transmission cost to
the starting site is not calculated.
SDD-1 Algorithm (Cont.)
• The basic algorithm is as follows:
1. Evaluate the income of all semi-join programs in the query graph.
2. Choose the semi-join with maximum income, execute it and recount the
income of all the semi-join.
3. Execute step 2 circularly until all of the semi-join value of whose income is
greater than 0 has been executed.
4. Select the site with minimum communication cost as the last executing
site. Transmit all the relationship to the site and connect them to got the ultimate
result.