Chapter 2
Query Processing and Optimization
◼ Query
◼ query language
◼ Query processing
◼ Query optimization
1
Query Processing and Optimization
Query processing :
• In general, a query is a form of questioning, in a line of inquiry
• A query language is a language in which user requests
information from the database
◼ The aim of query processing is to find information in one or
more databases and deliver it to the user
Query optimization:
It is the process of choosing a suitable execution strategy for
processing a query
◼ Two internal representations of a query:
◼ Query Tree
◼ Query graph
2
Query Processing Activities involved in retrieving data from the
database.
Aims of QP: transform query written in high-level language (e.g.
SQL), into correct and efficient execution strategy expressed in low-
level language (implementing RA); execute strategy to retrieve
required data.
Query Optimization Activity of choosing an efficient execution
strategy for processing query.
As there are many equivalent transformations of same high-level
query, aim of QO is to choose one that minimizes resource usage.
❖ Generally, reduce total execution time of query.
❖ May also reduce response time of query
Query Processing and Optimization
◼ Scanner: identify language components.
keywords, attribute, relation names
◼ Parser : check query system
◼ Validation: check attributes & relations
◼ Query tree (query graph) : internal
representation
◼ Execution strategy: plan
◼ Query optimization : choose a strategy
(reasonably efficient strategy)
Query Processing (cont…)
5
Query Processing (cont…)
◼ Query Processing can be divided into four main phases:
◼ Decomposition:
◼ Optimization
◼ Code generation, and
◼ Execution
◼ Decomposition: it is the process of transforming a high
level query into a relational algebra query, and to check
that the query is syntactically and semantically correct.
Query decomposition consists of parsing and validation
6
Query Processing (cont…)
◼ Typical stages in query decomposition are:
◼ Analysis: lexical and syntactical analysis of the query (correctness).
◼ Query tree will be built for the query.
◼ Normalization: convert the query into a normalized form.
◼ Semantic Analysis: to reject normalized queries that are not correctly
formulated or contradictory.
◼ Incorrect if components do not contribute to generate result.
◼ Contradictory if the predicate can not be satisfied by any tuple.
◼ Simplification: to detect redundant qualifications, eliminate common
sub-expressions, and transform the query to a semantically equivalent
but more easily and effectively computed form.
◼ Query Restructuring: When more than one translation is possible,
use transformation rules to re arranging nodes so that the most
restrictive condition will be executed first.
7
Transformation Rules for Relational Algebra
◼ Relational algebra operations:
◼ Select, Project , Join, Union, Intersection, Cartesian Poduct
- refer the text book for detail understanding
▪ General Transformation rules for relational
algebra
1. Cascade of : A conjunctive selection condition can be
broken up into a cascade (sequence) of individual
operations:
◼ c1 AND c2 AND ... AND cn(R) = c1 (c2 (...( cn(R))...) )
2. Commutativity of :
◼ The operation is commutative:
◼ c1 (c2(R)) = c2 (c1(R))
3. Cascade of : In a cascade (sequence) of operations, all
but the last one can be ignored:
◼ List1 (List2 (...(Listn(R))...) ) = List1(R)
8
Transformation Rules for Relational Algebra
(cont…)
4. Commuting with :
◼ If the selection condition c involves only the
attributes A1, ..., An in the projection list, the two
operations can be commuted:
◼ A1, A2, ..., An (c (R)) = c (A1, A2, ..., An (R))
5. Commuting of and X: both are commutative
R cS = S c R, RXS=SXR
9
Transformation Rules for Relational Algebra
(cont…)
6. Commuting with : If projection list is L = {A1, ..., An,
B1, ..., Bm}, where A1, ..., An are attributes of R and B1,
..., Bm are attributes of S and the join condition c involves
only attributes in L, the two operations can be commuted
as follows:
◼ L ( R C S ) = (A1, ..., An (R)) C ( B1, ..., Bm (S))
7. Converting a (, x) sequence into : If the condition c of
a that follows a x Corresponds to a join condition,
convert the (, x) sequence into a as follows:
(C (R x S)) = (R C S)
10
Query Processing
◼ Query Block:
◼ A query block contains a single SELECT-FROM-
WHERE expression, as well as GROUP BY and
HAVING clause if these are part of the block.
◼ It is the basic unit that can be translated into the algebraic
operators
◼ Nested queries within a query are identified as
separate query blocks.
◼ Example: Find the name of employees whose
salary is below the average salary of all employees
who are working at department number 5
11
Use the following logical model to understand the
discussions in this chapter
Example: Find the name of
employees whose salary is below
the average salary of all employees
who are working at department
number 5
12
Query Processing (cont…)
SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE SALARY < ( SELECT AVG (SALARY)
FROM EMPLOYEE
WHERE DNO = 5);
SELECT FNAME, LNAME SELECT AVG (SALARY)
FROM EMPLOYEE FROM EMPLOYEE
WHERE SALARY < C WHERE DNO = 5
πFNAME, LNAME (σSALARY<C(EMPLOYEE)) C=ℱAVGSALARY (σDNO=5 (EMPLOYEE))
13
Basic algorithms for executing query operations
◼ External sorting:
◼ Refers to sorting algorithms that are suitable for large files of records
stored on disk that do not fit entirely in main memory, such as most
database files
◼ External Sorting Uses Sort-Merge Strategy:
◼ Starts by sorting small sub files (runs) of the main file and then
merges the sorted runs, creating larger sorted sub files that are
merged in turn
◼ Sorting Phase:
◼ Number subfiles (runs) nR = (b/nB)
◼ The size of a run and number of initial run depends on the number of file blocks (b) and
available buffer space (nB)
◼ Merging phase:
◼ Degree of merging(dM) = Min (nB-1, nR); nP = (logdM(nR))
◼ nR: number of initial runs;
◼ b: number of file blocks;
◼ nB: available buffer space;
◼ dM: degree of merging(no of runs that can be merged together in each pass)
◼ np: number of passes(stored runs are merged during one or more passes)
14
Sort-Merge strategy (cont…)
◼ Example: if nB=5 blocks and size of the file b=1024 blocks,
◼ Sorting phase:
◼ The size of a run and number of initial run depends on the number of
file blocks (b) and available buffer space (nB)
◼ nR=(b/nB)= (1024/5) =205 runs each of size 5 blocks
except the last run which will have 4 blocks.
◼ Hence, after the sort phase, 205 sorted runs are stored
as temporary subfiles on disk.
15
Sort-Merge strategy (cont…)
◼ In the merging phase, the sorted runs are merged during
one or more passes.
◼ The degree of merging (dM) is the number of runs that
can be merged.
◼ dM=min((nB-1), nR))
◼ The number of passes=[(logdM (nR))]
◼ In each pass, one buffer block is needed to hold one
block from each of the runs being merged and one block
is needed for containing one block of the merge result
16
Sort-Merge strategy (cont…)
◼ In the above example, dM=4(four way merging)
◼ Hence, the 205 initial sorted runs would be merged into:
◼ 52 at the end of the first pass
◼ 13 at the end of the second pass
◼ 4 at the end of the third pass
◼ 1 at the end of the fourth pass
17
Implementing SELECT operation
◼ There are many options for executing a SELECT
operation
◼ Examples: Use the logical model provided in the
previous slide to understand the following operations:
(OP1) σSsn=12 (EMPLOYEE)
equality comparison on key attribute
(OP2) σDNumber > 5 (DEPARMENT)
nonequality comparison on key attribute
(OP3) σDNO=5 (EMPLOYEE)
equality comparison on non key attribute
(OP4) σDNO=5 AND SALARY >30000 AND SEX=F(EMPLOYEE)
conjunctive condition
(OP5) σSsn=12 AND PNO=10 (WORKS_ON)
conjunctive condition and composite key
18
Implementing the SELECT Operation (cont…)
◼ Search Methods for implementing Simple Selection:
◼ S1 Linear search (brute force):
◼ Retrieve every record in the file, and test whether its attribute
values satisfy the selection condition.
◼ S2 Binary search:
◼ If the selection condition involves an equality comparison on a
key attribute on which the file is ordered, binary search (which
is more efficient than linear search) can be used. An example
is OP1 if Ssn is the ordering attribute for EMPLOYEE file
◼ S3 Using a primary index or hash key to retrieve a
single record:
◼ If the selection condition involves an equality comparison on a
key attribute with a primary index (or a hash key), use the
primary index (or the hash key) to retrieve the record.
◼ For Example, OP1 use primary index to retrieve the record
19
Implementing the SELECT Operation (cont…)
◼ Search Methods for implementing Simple Selection(cont…)
◼ S4 Using a primary index to retrieve multiple records:
◼ If the comparison condition is >, ≥, <, or ≤ on a key
field with a primary index, use the index to find the
record satisfying the corresponding equality condition,
then retrieve all subsequent records in the (ordered)
file. (see OP2)
◼ S5 Using a clustering index to retrieve multiple
records:
◼ If the selection condition involves an equality
comparison on a non-key attribute with a clustering
index, use the clustering index to retrieve all the
records satisfying the selection condition. (See OP3)
20
Implementing the SELECT Operation (cont…)
• Search Methods for implementing complex Selection
• S6: Conjunctive selection using an individual index :
◼ If an attribute involved in any single simple condition in
the conjunctive condition has an access path that
permits the use of one of the methods S2 to S6, use that
condition to retrieve the records and then check whether
each retrieved record satisfies the remaining simple
conditions in the conjunctive condition
21
Implementing the SELECT Operation (summarized)
◼ Whenever a single condition specifies the selection, we
can only check whether an access path exists on the
attribute involved in that condition.
◼ If an access path exists, the method corresponding
to that access path is used;
◼ Otherwise, the “brute force” linear search approach
of method S1 is used
◼ For conjunctive selection conditions, whenever
more than one of the attributes involved in the
conditions have an access path, query optimization
should be done to choose the access path that
retrieves the fewest records in the most efficient way
22
Implementing the SELECT Operation
(summarized)
◼ Disjunctive selection conditions: This is a situation where
simple conditions are connected by the OR logical
connective rather than AND
◼ Compared to conjunctive selection, It is much harder to
process and optimize
◼ Example: DNO=5 OR SALARY>3000 OR SEX=‘F’(EMPLOYEE)
◼ Little optimization can be done because the records
satisfying the disjunctive condition are the union of the
records satisfying the individual conditions
◼ Hence, if any of the individual conditions does not have
an access path, we are compelled to use the brute force
approach
23
Implementing the JOIN Operation:
◼ The join operation is one of the most time
consuming operation in query processing
◼ Join
◼ two–way join: a join on two files
◼ e.g. R A=B S
◼ multi-way joins: joins involving more than two files
◼ e.g. R A=B S C=D T
◼ Examples
◼ (OP6): EMPLOYEE DNO=DNUMBER DEPARTMENT
◼ (OP7): DEPARTMENT MGRSSN=SSN EMPLOYEE
24
Implementing the JOIN Operation (cont…)
◼ Methods for implementing joins:
◼ J3 Sort-merge join:
◼ If the records of R and S are physically sorted
(ordered) by value of the join attributes A and B,
respectively, we can implement the join in the most
efficient way possible.
◼ Both files are scanned in order of the join attributes,
matching the records that have the same values for
A and B.
◼ In this method, the records of each file are scanned
only once each for matching with the other file—
unless both A and B are non-key attributes
25
Algorithms for SELECT and JOIN
Operations (cont…)
◼ Implementing the JOIN Operation (cont...):
◼ Factors affecting JOIN performance
◼ Available buffer space
◼ Join selection factor
◼ Choice of inner VS outer relation
26
Algorithms for PROJECT and SET
Operations
◼ Algorithm for PROJECT operations ()
<attribute list>(R)
1. If <attribute list> has a key of relation R, extract all tuples
from R with only the values for the attributes in <attribute
list>.
2. If <attribute list> does NOT include a key of relation R,
duplicated tuples must be removed from the results.
◼ Methods to remove duplicate tuples
1. Sorting
2. Hashing
27
Query Optimization
◼ Given the database structure, the challenge of query
optimization is to find the sequence of steps that produces
the answer to user request in the most efficient manner
◼ The performance of a query is affected by the tables or
queries that underlies the query and by the complexity of the
query.
◼ Given a request for data manipulation or retrieval, an
optimizer will choose an optimal plan for evaluating the
request from among the manifold alternative strategies.
◼ There are many ways (access paths) for accessing desired
file/record. The optimizer tries to select the most efficient
(cheapest) access path for accessing the data.
◼ DBMS is responsible to pick the optimal execution strategy
based on various considerations.
28
Approaches to Query Optimization
◼ Heuristics Approach
◼ The heuristic approach uses the knowledge of the
characteristics of the relational algebra operations and
the relationship between the operators to optimize the
query.
◼ Thus the heuristic approach of optimization will make
use of:
◼ Properties of individual operators
◼ Association between operators
◼ Query Tree: a graphical representation of the
operators, relations, attributes and processing
sequence during query processing .
29
Using Heuristics in Query Optimization
◼ Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups of
operations based on the access paths available on the files
involved in the query
◼ The main heuristic is to apply first the operations that
reduce the size of intermediate results
◼ E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations
30
Using Heuristics in Query Optimization (cont…)
◼ Heuristic Optimization of Query Trees:
◼ The same query could correspond to many different
relational algebra expressions — and hence many different
query trees.
◼ The task of heuristic optimization of query trees is to find a
final query tree that is efficient to execute.
◼ Example:
Query : Find the last name of employees in AQUARIUS
project who were born before 1957-12-31
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = ‘AQUARIUS’ AND
PNMUBER=PNO AND ESSN=SSN
AND BDATE > ‘1957-12-31’;
31
Steps in converting a query tree during heuristic optimization:
Initial query tree for the query Q on slide 34
32
Steps in converting a query tree during heuristic optimization:
Moving the select operation down the tree
33
Steps in converting a query tree during heuristic optimization:
Applying the more restrictive select operation first
34
Steps in converting a query tree during heuristic optimization:
Replacing Cartesian product and select with join
35
Steps in converting a query tree during heuristic optimization:
Moving project operations down the query tree
36
Using Heuristics in Query Optimization
◼ Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations that
reduce the size of intermediate results
2. Perform select operations as early as possible to reduce
the number of tuples and perform project operations as
early as possible to reduce the number of attributes. (This
is done by moving select and project operations as far
down the tree as possible.)
3. The select and join operations that are most restrictive
should be executed before other similar operations. (This
is done by reordering the leaf nodes of the tree among
themselves and adjusting the rest of the tree
appropriately.)
37
Using Selectivity and Cost Estimates in
Query Optimization
◼ Cost-based query optimization:
◼ Estimate and compare the costs of executing a query
using different execution strategies and choose the
strategy with the lowest cost estimate
◼ Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost
◼ Note: Different database systems may focus on different
cost components.
38
Semantic Query Optimization
◼ Semantic Query Optimization:
◼ Uses constraints specified on the database schema in order to
modify one query into another query that is more efficient to
execute
◼ Consider the following SQL query,
SELECT [Link], [Link]
FROM EMPLOYEE E, EMPLOYEE M
WHERE [Link]=[Link] AND [Link]>[Link]
◼ Explanation:
◼ Suppose that we had a constraint on the database schema that
stated that no employee can earn more than his or her direct
supervisor. If the semantic query optimizer checks for the
existence of this constraint, it need not execute the query at all
because it knows that the result of the query will be empty.
39