Chapter 2
Query Processing and
Optimization
Outline
Query processing steps
Query Optimization:
Heuristic based query optimization
Cost based query optimization
Selection and Sorting operation optimization
Advanced database-Ch2: Query Processing & Optimization Slide 2
Query Processing
Activities involved in retrieving data from
the database.
Aims of QP:
transform query written in high-level language
(e.g. SQL), into correct and efficient execution
strategy expressed in low-level language
execute strategy to retrieve required data.
Advanced database-Ch2: Query Processing & Optimization Slide 3
Query Optimization
Activity of choosing an efficient execution
strategy for processing query.
As there are many equivalent transformations
of same high-level query, aim of QO is to
choose one that minimizes resource usage.
Generally, reduce total execution time of
query. May also reduce response time of
query.
Problem: computationally intractable with
large number of relations, so strategy
adopted is reduced to finding near optimum
solution.
Advanced database-Ch2: Query Processing & Optimization Slide 4
Basic Steps in Query Processing
Is a 3-step process that transforms a high-
level query (of relational calculus/SQL) into
an equivalent and more efficient lower-
level query (of relational algebra).
1. Parsing and translation
Check syntax and verify relations.
Translate the query into an equivalent
relational algebra (RA)expression.
2. Optimization
Generate an optimal evaluation plan
(with lowest cost) for the query plan.
Advanced database-Ch2: Query Processing & Optimization Slide 5
Basic Steps in Query Processing …
Evaluation
The query-execution engine takes an (optimal)
evaluation plan, executes that plan, and
returns the answers to the query.
Advanced database-Ch2: Query Processing & Optimization Slide 6
Basic Steps in Query Processing …
Advanced database-Ch2: Query Processing & Optimization Slide 7
Basic Steps in Query Processing …
Advanced database-Ch2: Query Processing & Optimization Slide 8
Example: SQL To RA Query
Find students GPA and address of
students whose GPA>3.5
Advanced database-Ch2: Query Processing & Optimization Slide 9
Bad query processing plan example
Cartesian product first:
Student x People
Selection criteria next:
Gpa>3.5 AND sname=pname
GROUP BY; HAVING (if available)
Projections
SELECT gpa,address
ORDER BY last
Incredibly inefficient with huge
intermediate results
Advanced database-Ch2: Query Processing & Optimization Slide 10
Query Optimization
• Aim: Transform query into faster, equivalent query
equivalent query 1
equivalent query 2 faste
query
rquer
y
equivalent query n
• Heuristic (logical) optimization
Query tree (relational algebra) optimization
Query graph optimization
• Cost-based (physical) optimization
Advanced database-Ch2: Query Processing & Optimization Slide 11
Query Optimization Heuristics
Apply heuristic rules on standard initial
query tree to find optimized equivalent
query tree
Main heuristic: Favor operations that
reduce the size of intermediate results first
Apply SELECT and PROJECT operations before
join or other set operations
Apply more selective SELECT and join first
General transformation rules for relational
algebra operators
Advanced database-Ch2: Query Processing & Optimization Slide 12
Some Transformation Rules
• Cascade of σ: σc1 ∧c2 ∧... ∧cn(R) = σc1(σc2(...(σ cn(R))...))
• Commutativity of σ: σc1(σc2(R)) = σc2(σ c1(R))
• Commuting σ with π: πL(σc (R)) = σc(πL(R))
Only if c involves solely attributes in L.
• Commuting σ with : σc (R S) = σc (R) S
Only if c involves solely attributes in R.
• Commuting σ with set operations: σc(R θ S) = σc(R) θ
σc(S)
Where θ is one of ∪, ∩, or
-.
• Commutativity of ∪, ∩, and :RθS=SθR
Where θ is one of ∪, ∩
and .
• Associativity of , ∪, ∩: (R θ S) θ T = R θ (S θ T)
Advanced database-Ch2: Query Processing & Optimization Slide 13
Transformation Algorithm Outline
• Transform a query represented in relational algebra to
an equivalent one (generates the same result.)
• Step 1: Decompose σ operations.
• Step 2: Move σ as far down the query tree as possible.
• Step 3: Rearrange leaf nodes to apply the most
restrictive σ operations first.
• Step 4: Form joins from × and subsequent σ operations.
• Step 5: Decompose π and move down the query tree as
far as possible.
• Step 6: Identify candidates for combined operations.
Advanced database-Ch2: Query Processing & Optimization Slide 14
Heuristic Query Optimization Summary
• Heuristic optimization transforms the query-tree by
using a set of rules (Heuristics) that typically (but not
in all cases) improve execution performance.
Perform selection early (reduces the number of tuples)
Perform projection early (reduces the number of attributes)
Perform most restrictive selection and join operations (i.e.
with smallest result size) before other similar operations.
• Generate initial query tree from SQL statement.
• Transform query tree into more efficient query tree, via
a series of tree modifications, each of which hopefully
reduces the execution time.
• A single query tree is involved.
Advanced database-Ch2: Query Processing & Optimization Slide 15
Query Tree (plan)
A tree data structure that corresponds to a
relational algebra expression
Leaf nodes = input relations
Internal nodes = RA operations
Execution of query tree
Start at the leaf nodes
Execute internal node whenever its operands are
available and replace node by result
Advanced database-Ch2: Query Processing & Optimization Slide 16
Query Tree Optimization Example
• What are the names of customers living on Elm Street
who have checked out “Terminator”?
• SQL query:
SELECT Name
FROM Customer CU, CheckedOut CH, Film F
WHERE Title = ’Terminator’ AND [Link] = [Link]
AND [Link] = [Link] AND [Link] = ‘Elm’
• Canonical query tree πName
σTitle= ‘Terminator’∧[Link] = [Link] ∧[Link] = [Link] ∧[Link] = ‘Elm’
×
× Note the use of
F
Cartesian product!
CU CH
Advanced database-Ch2: Query Processing & Optimization Slide 17
Apply Selections Early
πName
σ [Link] = [Link]
σ[Link] = [Link] σTitle= ‘Terminator’
× F
σStreet = ‘Elm’ CH
CU
Advanced database-Ch2: Query Processing & Optimization Slide 18
Apply More Restrictive Selections Early
πName
σ [Link] = [Link]
σ [Link] = [Link] σStreet= ‘Elm’
× CU
σTitle= ‘Terminator’ CH
F
Advanced database-Ch2: Query Processing & Optimization Slide 19
Form Joins
πName
[Link] =
[Link]
[Link] = [Link] σStreet= ‘Elm’
σTitle= ‘Terminator’ CH CU
Advanced database-Ch2: Query Processing & Optimization Slide 20
Apply Projections Early
πName
[Link] =
[Link]
[Link] = [Link] πName, CustomerID
πFilmID πFilmID, CustomerID σStreet= ‘Elm’
σTitle= ‘Terminator’ CU
CH
F
Advanced database-Ch2: Query Processing & Optimization Slide 21
Cost-Based Optimization
• Cost Components for Query Execution
1. Access cost to secondary storage
2. Storage cost
3. Computation cost
4. Memory usage cost
5. Communication cost
Note: Different database systems may
focus on different cost components
Advanced database-Ch2: Query Processing & Optimization Slide 22
Relevant Statistics
• Catalog Information Used in Cost Functions:
• Information about the size of a file
• number of records (tuples) (r),
• record size (R),
• number of blocks (b)
• blocking factor (bfr)
• Information about indexes and indexing attributes of
a file
• Number of levels (x) of each multilevel index
• Number of first-level index blocks (bI1)
• Number of distinct values (d) of an attribute
• Selectivity (sl) of an attribute
• Selection cardinality (s) of an attribute. (s = sl * r)
Advanced database-Ch2: Query Processing & Optimization Slide 23
Selection Queries
• Primary key, point
σFilmID= 2 (Film)
• Point
σTitle= ‘Terminator’ (Film)
• Range
σ1 < RentalPrice< 4 (Film)
• Conjunction
σType= ‘M’∧Distributor= ‘MGM’ (Film)
• Disjunction
σPubDate< 2004 ∨Distributor= ‘MGM’ (Film)
Advanced database-Ch2: Query Processing & Optimization Slide 24
Selection Strategies
• Linear search
• Expensive, but always applicable.
• Binary search
• Applicable only when the file is appropriately ordered.
• Hash index search
• Single record retrieval; does not work for range queries.
• Clustering index search
• Multiple records for each index item.
• Implemented with single pointer to block with first
associated record.
• Secondary index search
• Implemented with dense pointers, each to a single record.
Advanced database-Ch2: Query Processing & Optimization Slide 25
Selection Strategies for Conjunctive Queries
• Use any available indices for attributes involved in
simple conditions.
• If several are available, use the most selective index. Then
check each record with respect to the remaining conditions.
• Attempt to use composite indices.
• This can be very efficient.
• Do intersection of record pointers.
• If several indices with record pointers are applicable to the
selection predicate, retrieve and intersect the pointers. Then
retrieve (and check) the qualifying records.
• Disjunctive queries provide little opportunity for smart
processing.
Advanced database-Ch2: Query Processing & Optimization Slide 26
Examples of Cost Functions for SELECT
S1. Linear search (brute force) approach
• CS1a = b;
• For an equality condition on a key, CS1a = (b/2) if
the record is found; otherwise CS1a = b.
S2. Binary search:
• CS2 = log2b + (s/bfr) –1
• For an equality condition on a unique (key)
attribute, CS2 =log2b
S3. Using a primary index (S3a) or hash key
(S3b) to retrieve a single record
• CS3a = x + 1; CS3b = 1
Advanced database-Ch2: Query Processing & Optimization Slide 27
Joins
• Join Strategies
Nested loop join
Index-based join
Sort-merge join
• Strategies work on a per block (not per record) basis.
Need to estimate #I/Os (block retrievals)
• Relation sizes and join selectivities impact join cost.
Query selectivity = #tuples in result / #candidates
‘More selective’ means smaller ‘selectivity value’
For join, #candidates is the size of Cartesian product
Advanced database-Ch2: Query Processing & Optimization Slide 28
Nested Loop Join and Index-Based Join
• Nested loop join
• Exhaustive comparison (i.e., brute force approach)
• The ordering (outer/inner) of files and allocation of buffer
space is important.
• Index-based join
• Requires (at least) one index on a join attribute.
• At times, a temporary index is created for the purpose of a
join.
• The ordering (outer/inner) of files is important.
Advanced database-Ch2: Query Processing & Optimization Slide 29
Nested Loop
• Basically, for each block of the outer table (r), scan the
entire inner table (s).
• Requires quadratic time, O(n ) 2
• Improved when buffer is used.
Outer Main Memory
Buffer
r
spill Output
Scan Inner when full
Advanced database-Ch2: Query Processing & Optimization Slide 30
Example of Nested-Loop Join
Customer [Link]= [Link]
CheckedOut
• Parameters
rCheckedOut = rCustomer = 200
40,000
bCheckedOut = bCustomer = 10
2,000
nB = 6 (size of main memory buffer)
• Algorithm:
repeat: read (nB - 2) blocks from outer relation
repeat: read 1 block from inner relation
compare tuples
• Cost: bouter + ( ⎡bouter/ (nB -2)⎤ ) × binner
• CheckedOut as outer: 2,000 + ⎡2,000/4⎤ × 10 =
7,000
• Customer as outer: 10 + ⎡10/4⎤ × = 5,010
Advanced database-Ch2: Query Processing & Optimization
2,000 Slide 31
Index-based Join
• Requires (at least) one index on a join attribute
A temporary index can be created
for each index on r
record of
s, query
r
Scan Outer in index
if joins
s
Output
spill
when full
Main Memory
Advanced database-Ch2: Query Processing & Optimization Slide 32
Example of Index-Based Join
Customer [Link]= [Link]
• Cost:CheckedOut
bouter + router × cost use of index
• Assume that the video store has 10 employees.
There are 10 distinct EmpIDs in
CheckedOut.
• Assume 1-level index on CustomerID of Customer.
• Iterate through all 40,000 tuples in CheckedOut
(outer
rel.) 2,000 disk reads (bCheckedOut) to scan CheckedOut
For each CheckedOut tuple, search for matching
Customer
tuples
using
0 disk index.
reads for index (in main memory) + 1 disk read for actual data
block
• Cost: 2,000 + 40,000 × (0 + 1) = 42,000
Advanced database-Ch2: Query Processing & Optimization Slide 33
Sort-Merge Join
• Sort each relation using multiway merge-sort
• Perform merge-join
Scan
sorted
Main Memory
r1 r
•••
unsorted r2 sorted ••• Match
r ••• r •••
Scan •••
rn
sorted
s
Output
Advanced database-Ch2: Query Processing & Optimization Slide 34
External or Disk-based Sorting
• Relation on disk often too large to fit into memory
• Sort in pieces, called runs
Main Memory r1
Buffer (N blocks)
Scan Output r2
when sorted
r •••
rMdivN
Unsorted relation rlast
(M blocks)
Runs
(M div N runs each of
size N blocks, and maybe one
last run of < N leftover blocks)
Advanced database-Ch2: Query Processing & Optimization Slide 35
External or Disk-based Sorting, Cont.
• Runs are now repeatedly merged
• One memory buffer used to collect output
Main Memory Main Memory
m1
r1 Buffer (N blocks)
Buffer (N blocks)
Output m2
r2
when merged
••• •••
rN-1 mlast-1
•••
rMdivN output
mlast output
Merged runs
rlast ((M div N) div N-1) runs each of
sorted runs size N*N-1 blocks, and maybe one
(N blocks each) last run of < N*N-1 leftover blocks)
Advanced database-Ch2: Query Processing & Optimization Slide 36
External Sorting (Multiway Merge Sort)
• Buffer size is nB = 4 (N)
Orginal rel.
32 5 1712 1 14 8 3 313023 2629 2 2511 6 7 1516 28 4 9 221821 102013 272419
Initial runs
5 121732 1 3 8 14 23263031 2 11 2529 6 7 1516 4 9 2228 101820 21 13192427
First merge
1 3 5 8 12141723263031 32 2 4 6 7 9 11 151622252829 1013181920202427
Second merge
1 2 3 4 5 6 7 8 9 1011 121314151617181920212223242526272829303132
• Cost: 2 × brelation + 2 × brelation × ⎡lognB-1 (brelation/nB)⎤
• 2 × 32 + 2 × 32 × log3(32/4) = 192
Advanced database-Ch2: Query Processing & Optimization Slide 37
Example of Sort-Merge Join
• Cost to sort CheckedOut (bCheckedOut=
CostSortChecedOut = 2 × 2,000 + 2 × 2,000 × ⎡log5(2,000/6)⎤ =
2000)
to sort Customer relation (bCustomer= 10)
20,000
• Cost
CostSortCustomer= 2 × 10 + 2 × 10 × ⎡log5(10/6)⎤ =
40 for merge join
• Cost
Cost to scan sorted Customer + cost to scan sorted
CheckedOut
Costmergejoin= 10 + 2,000 = 2,010
• Costsort-merge join = CostSortCustomer + CostSortChecedOut + Costmergejoin
• Costsort-merge join= 20,000 + 40 + 2,010 = 22,050
Advanced database-Ch2: Query Processing & Optimization Slide 38
Cost and Applicability of Join Strategies
• Nested-loop join
Brute-force
Can handle all types of joins (=, <, >)
• Index-based join
Requires minimum one index on join attributes
• Sort-merge join
Requires that the files are sorted on the join attributes.
Sorting can be done for the purpose of the join.
A variation is also applicable when secondary indices are
available instead.
Advanced database-Ch2: Query Processing & Optimization Slide 39