Chapter 2
Query Processing and Optimization
Query
query language
Query processing
Query optimization
1
Query Processing and Optimization
Query processing :
• In general, a query is a form of questioning, in a line of inquiry
• A query language is a language in which user requests
information from the database
The aim of query processing is to find information in one or
more databases and deliver it to the user
Query optimization:
It is the process of choosing a suitable execution strategy for
processing a query
Two internal representations of a query:
Query Tree
Query graph
2
Query Processing Activities involved in retrieving data from the
database.
Aims of QP: transform query written in high-level language (e.
g. SQL), into correct and efficient execution strategy
expressed in low-level language (implementing RA); execute
strategy to retrieve required data.
Query Optimization Activity of choosing an efficient execution
strategy for processing query.
As there are many equivalent transformations of same high-
level query, aim of QO is to choose one that minimizes resource
usage.
Generally, reduce total execution time of query.
May also reduce response time of query
Query Processing and Optimization
Scanner: identify language components.
keywords, attribute, relation names
Parser : check query system
Validation: check attributes & relations
Query tree (query graph) : internal
representation
Execution strategy: plan
Query optimization : choose a strategy
(reasonably efficient strategy)
Query Processing (cont…)
5
Query Processing (cont…)
Query Processing can be divided into four main phases:
Decomposition:
Optimization
Code generation, and
Execution
Decomposition: it is the process of transforming a high
level query into a relational algebra query, and to check
that the query is syntactically and semantically correct.
Query decomposition consists of parsing and validation
6
Query Processing (cont…)
Typical stages in query decomposition are:
Analysis: lexical and syntactical analysis of the query (correctness).
Query tree will be built for the query.
Normalization: convert the query into a normalized form.
Semantic Analysis: to reject normalized queries that are not correctly
formulated or contradictory.
Incorrect if components do not contribute to generate result.
Contradictory if the predicate can not be satisfied by any tuple.
Simplification: to detect redundant qualifications, eliminate common
sub-expressions, and transform the query to a semantically equivalent
but more easily and effectively computed form.
Query Restructuring: When more than one translation is possible, use
transformation rules to re arranging nodes so that the most
restrictive condition will be executed first.
7
Transformation Rules for Relational Algebra
Relational algebra operations:
Select, Project , Join, Union, Intersection, Cartesian Poduct
- refer the text book for detail understanding
General Transformation rules for relational
algebra
1. Cascade of : A conjunctive selection condition can be broken
up into a cascade (sequence) of individual operations:
c1 AND c2 AND ... AND cn(R) = c1 (c2 (...(cn(R))...) )
2. Commutativity of :
The operation is commutative:
c1 (c2(R)) = c2 (c1(R))
3. Cascade of : In a cascade (sequence) of operations, all
but the last one can be ignored:
List1 (List2 (...(Listn(R))...) ) = List1(R)
8
Transformation Rules for Relational Algebra (cont…)
4. Commuting with :
If the selection condition c involves only the
attributes A1, ..., An in the projection list, the
two operations can be commuted:
A1, A2, ..., An (c (R)) = c (A1, A2, ..., An (R))
5. Commuting of and X: both are commutative
R cS = S c R, RXS=SXR
9
Transformation Rules for Relational Algebra
(cont…)
6. Commuting with : If projection list is L = {A1, ..., An,
B1, ..., Bm}, where A1, ..., An are attributes of R and B1, .
.., Bm are attributes of S and the join condition c involves
only attributes in L, the two operations can be commuted
as follows:
L ( R CS) = (A1, ..., An (R)) C ( B1, ..., Bm (S))
7. Converting a (, x) sequence into : If the condition c of
a that follows a x Corresponds to a join condition,
convert the (, x) sequence into a as follows:
( C (R x S)) = (R C S)
10
Query Processing
Query Block:
A query block contains a single SELECT-FROM-WHERE
expression, as well as GROUP BY and HAVING clause if
these are part of the block.
It is the basic unit that can be translated into the algebraic
operators
Nested queries within a query are identified as
separate query blocks.
Example: Find the name of employees whose salary
is below the average salary of all employees who are
working at department number 5
11
Use the following logical model to understand the
discussions in this chapter
Example: Find the name of employees
whose salary is below the average
salary of all employees who are
working at department number 5
12
Query Processing (cont…)
SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE SALARY < ( SELECT AVG (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);
SELECT FNAME, LNAME SELECT AVG (SALARY)
FROM EMPLOYEE FROM EMPLOYEE
WHERE SALARY < C WHERE DNO = 5
πFNAME, LNAME (σSALARY<C(EMPLOYEE)) C=ℱAVGSALARY (σDNO=5 (EMPLOYEE))
13
Basic algorithms for executing query
operations
External sorting:
Refers to sorting algorithms that are suitable for large files of
records stored on disk that do not fit entirely in main memory, such
as most database files
External Sorting Uses Sort-Merge Strategy:
Starts by sorting small sub files (runs) of the main file and then
merges the sorted runs, creating larger sorted sub files that are
merged in turn
Sorting Phase:
Number subfiles (runs) n = (b/n )
R B
The size of a run and number of initial run depends on the number of file blocks (b) and
available buffer space (nB)
Merging phase:
Degree of merging(dM) = Min (n -1, n ); n = (log
B R P dM(nR))
nR: number of initial runs;
b: number of file blocks;
nB: available buffer space;
dM: degree of merging(no of runs that can be merged together in each pass)
np: number of passes(stored runs are merged during one or more passes)
14
Sort-Merge strategy (cont…)
Example: if nB=5 blocks and size of the file b=1024 blocks,
Sorting phase:
The size of a run and number of initial run depends on the number of
file blocks (b) and available buffer space (nB)
nR=(b/nB)= (1024/5) =205 runs each of size 5 blocks
except the last run which will have 4 blocks.
Hence, after the sort phase, 205 sorted runs are stored
as temporary subfiles on disk.
15
Sort-Merge strategy (cont…)
In the merging phase, the sorted runs are merged during
one or more passes.
The degree of merging (dM) is the number of runs that
can be merged.
dM=min((nB-1), nR))
The number of passes=[(logdM (nR))]
In each pass, one buffer block is needed to hold one block
from each of the runs being merged and one block is
needed for containing one block of the merge result
16
Sort-Merge strategy (cont…)
In the above example, dM=4(four way merging)
Hence, the 205 initial sorted runs would be merged into:
52 at the end of the first pass
13 at the end of the second pass
4 at the end of the third pass
1 at the end of the fourth pass
17
Implementing SELECT operation
There are many options for executing a SELECT
operation
Examples: Use the logical model provided in the
previous slide to understand the following operations:
(OP1) σSsn=12 (EMPLOYEE)
equality comparison on key attribute
(OP2) σDNumber > 5 (DEPARMENT)
nonequality comparison on key attribute
(OP3) σDNO=5 (EMPLOYEE)
equality comparison on non key attribute
(OP4) σDNO=5 AND SALARY >30000 AND SEX=F(EMPLOYEE)
conjunctive condition
(OP5) σSsn=12 AND PNO=10 (WORKS_ON)
conjunctive condition and composite key
18
Implementing the SELECT Operation (cont…)
Search Methods for implementing Simple Selection:
S1 Linear search (brute force):
Retrieve every record in the file, and test whether its
attribute values satisfy the selection condition.
S2 Binary search:
If the selection condition involves an equality comparison on a
key attribute on which the file is ordered, binary search (which
is more efficient than linear search) can be used. An example is
OP1 if Ssn is the ordering attribute for EMPLOYEE file
S3 Using a primary index or hash key to retrieve a single
record:
If the selection condition involves an equality comparison on a
key attribute with a primary index (or a hash key), use the
primary index (or the hash key) to retrieve the record.
For Example, OP1 use primary index to retrieve the record
19
Implementing the SELECT Operation (cont…)
Search Methods for implementing Simple Selection(cont…)
S4 Using a primary index to retrieve multiple records:
If the comparison condition is >, ≥, <, or ≤ on a key
field with a primary index, use the index to find the
record satisfying the corresponding equality condition,
then retrieve all subsequent records in the (ordered)
file. (see OP2)
S5 Using a clustering index to retrieve multiple records:
If the selection condition involves an equality
comparison on a non-key attribute with a clustering
index, use the clustering index to retrieve all the
records satisfying the selection condition. (See OP3)
20
Implementing the SELECT Operation (cont…)
• Search Methods for implementing complex Selection
• S6: Conjunctive selection using an individual index :
If an attribute involved in any single simple condition in
the conjunctive condition has an access path that
permits the use of one of the methods S2 to S6, use that
condition to retrieve the records and then check whether
each retrieved record satisfies the remaining simple
conditions in the conjunctive condition
21
Implementing the SELECT Operation (summarized)
Whenever a single condition specifies the selection, we
can only check whether an access path exists on the
attribute involved in that condition.
If an access path exists, the method corresponding
to that access path is used;
Otherwise, the “brute force” linear search
approach of method S1 is used
For conjunctive selection conditions, whenever more
than one of the attributes involved in the conditions
have an access path, query optimization should be done
to choose the access path that retrieves the fewest
records in the most efficient way
22
Implementing the SELECT Operation
(summarized)
Disjunctive selection conditions: This is a situation where
simple conditions are connected by the OR logical connective
rather than AND
Compared to conjunctive selection, It is much harder to
process and optimize
Example: DNO=5 OR SALARY>3000 OR SEX=‘F’(EMPLOYEE)
Little optimization can be done because the records
satisfying the disjunctive condition are the union of the
records satisfying the individual conditions
Hence, if any of the individual conditions does not have
an access path, we are compelled to use the brute force
approach
23
Implementing the JOIN Operation:
The join operation is one of the most time
consuming operation in query processing
Join
two–way join: a join on two files
e.g. R A=B S
multi-way joins: joins involving more than two files
e.g. R A=B S C=D T
Examples
(OP6): EMPLOYEE DNO=DNUMBER DEPARTMENT
(OP7): DEPARTMENT MGRSSN=SSN EMPLOYEE
24
Implementing the JOIN Operation (cont…)
Methods for implementing joins:
J3 Sort-merge join:
If the records of R and S are physically sorted
(ordered) by value of the join attributes A and B,
respectively, we can implement the join in the most
efficient way possible.
Both files are scanned in order of the join
attributes, matching the records that have the
same values for A and B.
In this method, the records of each file are scanned
only once each for matching with the other
file—unless both A and B are non-key attributes
25
Algorithms for SELECT and JOIN
Operations (cont…)
Implementing the JOIN Operation (cont...):
Factors affecting JOIN performance
Available buffer space
Join selection factor
Choice of inner VS outer relation
26
Algorithms for PROJECT and SET
Operations
Algorithm for PROJECT operations ()
<attribute list>(R)
1. If <attribute list> has a key of relation R, extract all
tuples from R with only the values for the attributes in
<attribute list>.
2. If <attribute list> does NOT include a key of relation R,
duplicated tuples must be removed from the results.
Methods to remove duplicate tuples
1. Sorting
2. Hashing
27
Query Optimization
Given the database structure, the challenge of query
optimization is to find the sequence of steps that produces
the answer to user request in the most efficient manner
The performance of a query is affected by the tables or
queries that underlies the query and by the complexity of
the query.
Given a request for data manipulation or retrieval, an
optimizer will choose an optimal plan for evaluating the
request from among the manifold alternative strategies.
There are many ways (access paths) for accessing desired
file/record. The optimizer tries to select the most efficient
(cheapest) access path for accessing the data.
DBMS is responsible to pick the optimal execution strategy
based on various considerations.
28
Approaches to Query Optimization
Heuristics Approach
The heuristic approach uses the knowledge of the
characteristics of the relational algebra operations
and the relationship between the operators to
optimize the query.
Thus the heuristic approach of optimization will make
use of:
Properties of individual operators
Association between operators
Query Tree: a graphical representation of the
operators, relations, attributes and processing
sequence during query processing.
29
Using Heuristics in Query Optimization
Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query
The main heuristic is to apply first the operations that
reduce the size of intermediate results
E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations
30
Using Heuristics in Query Optimization (cont…)
Heuristic Optimization of Query Trees:
The same query could correspond to many different
relational algebra expressions — and hence many different
query trees.
The task of heuristic optimization of query trees is to find a
final query tree that is efficient to execute.
Example:
Query : Find the last name of employees in AQUARIUS
project who were born before 1957-12-31
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = ‘AQUARIUS’ AND PNMUBER=PNO
AND ESSN=SSN AND BDATE > ‘1957-12-31’;
31
Steps in converting a query tree during heuristic optimization:
Initial query tree for the query Q on slide 34
32
Steps in converting a query tree during heuristic
optimization:
Moving the select operation down the tree
33
Steps in converting a query tree during heuristic optimization:
Applying the more restrictive select operation first
34
Steps in converting a query tree during heuristic optimization:
Replacing Cartesian product and select with join
35
Steps in converting a query tree during heuristic
optimization:
Moving project operations down the query tree
36
Using Heuristics in Query Optimization
Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that
reduce the size of intermediate results
2. Perform select operations as early as possible to reduce the
number of tuples and perform project operations as early
as possible to reduce the number of attributes. (This is done
by moving select and project operations as far down the
tree as possible.)
3. The select and join operations that are most restrictive
should be executed before other similar operations. (This is
done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree appropriately.
)
37
Using Selectivity and Cost Estimates in
Query Optimization
Cost-based query optimization:
Estimate and compare the costs of executing a query
using different execution strategies and choose the
strategy with the lowest cost estimate
Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost
Note: Different database systems may focus on different
cost components.
38
Semantic Query Optimization
Semantic Query Optimization:
Uses constraints specified on the database schema in order to
modify one query into another query that is more efficient to
execute
Consider the following SQL query,
SELECT [Link], [Link]
FROM EMPLOYEE E, EMPLOYEE M
WHERE [Link]=[Link] AND [Link]>[Link]
Explanation:
Suppose that we had a constraint on the database schema that
stated that no employee can earn more than his or her direct
supervisor. If the semantic query optimizer checks for the
existence of this constraint, it need not execute the query at all
because it knows that the result of the query will be empty.
39