Query Processing and Optimization
Chapter 11
11.1 Introduction
Query processing requires that the DBMS identify and execute a strategy for retrieving the
results of the query. The query determines what data is to be found, but does not define
the method by which the data manager searches the database. Therefore Query optimization
is necessary to determine the optimal alternative to process a query. There are two main
techniques for query optimization. The first approach is to use a rule based or heuristic
method for ordering the operations in a query execution strategy. The second approach
estimates the cost of different execution strategies and chooses the best solution. In general
most commercial database systems use a combination of both techniques.
11.2 Basics of Query Processing
Query Processing : Query Processing is a procedure of converting a query written in high-
level language (Ex. SQL, QBE) into a correct and efficient execution plan expressed in low-
level language, which is used for data manipulation.
Query Processor : Query processor is responsible for generating execution plan.
Execution Plan : Query processing is a stepwise process. Before retrieving or updating
data in database, a query goes through a series of query compilation steps. These steps are
known as execution plan.
The success of a query language also depends upon its query processor i.e., how much
efficient execution plan it can create? The better execution plan leads to low time and cost.
In query processing, the first phase is transformation in which parser first checks the syntax
of query and also checks the relations and attributes used in the query that are defined in
the database. After checking the syntax and verifying the relations, query is transformed into
equivalent expression that are more efficient to execute. Transformation, depends upon various
419
420 Introduc tion to Database Management System
factors like existence of certain database structures, presence of different indexes, file is sorted
or not, cost of transformation, physical characteristics of data etc. After transformation of
query, transformed query is evaluated by using number of strategies known as access plans.
While generating access plans, factors like physical properties of data and storage are taken
into account and the optimal access plan is executed. The next step is to validate the user
privileges and ensure that the query does not disobey the relevant integrity constraints.
Finally, execution plan is executed to generate the result.
11.2.1 General Strategy for Query Processing
The general strategy for query processing is as follows:
(i) Representation of query : Query written by user cannot be processed directly by
system. Query processor first checks the syntax and existence of relations and their
attributes in database. After validations, query processor transform it into equivalent
and more efficient expression for example query will be converted into a standard
internal format that parser can manipulate. Parser also adds some additional predicates
to the query to enforce security. Internal form may be relational algebra, relational
calculus, any low-level language, operator graphs etc.
(ii) Operator graphs : Operator graphs are used to represent query. It gives the sequence
of operations that can be performed. It is easy to understand the query represented
by operator graphs. It is useful to determine redundancy in query expressions, result
of transformation, simplify the view etc.
(iii) Response time and Data characteristics consideration : Data characteristics like length
of records, expected sizes of both intermediate and final results, size of relations etc.,
are also considered for optimizing the query. In addition to this overall response time
is also determined.
11.2.2 Steps in Query Processing
Various steps in query processing are shown in Figure 11.1. Suppose that user inputs a query
in general query language say QBE, then it is first converted into high-level query language
say SQL etc. Other steps in query processing are discussed below in detail:
(i) Syntax Analysis : Query in high-level language is parsed into tokens and tokens are
analyzed for any syntax error. Order of tokens are also maintained to make sure that all the
rules of language grammars are followed. In case of any error, query is rejected and an error
code with explanation for rejection is returned to the user. (Only syntax is checked in this
step).
(ii) Query Decomposition : In this step, query is decomposed into query blocks which
are the low-level operations. It starts with the high-level query that is transformed into low-
level operations and checks whether that query is syntactically and semantically correct. For
example, a SQL query is decomposed into blocks like Select block, From block, Where block
etc. Various stages in query decomposition are shown in Figure 11.2.
Query Processing and Optimiz ation 421
User input General query
Transform into
High level query language
ex. SQL
Scanning, parsing, Syntax checking and verification
validating by parser
Check existence of relations and attributes,
Database Query decomposer semantic analysis, decomposing complex
catalog query into smaller ones
Optimization to reduce execution time and
Query optimizer cost by substituting equivalent expressions
for those in the query
Execution plan Query modification
Query code generator Generate code for queries
Main Runtime database processor Deals with database to do necessary
database operation
Query result
FIGURE 11.1. Steps in query processing.
SQL query Query analysis
Query normalization
Semantic analysis
Query simplifier
Query restructuring
Algebraic expressions
FIGURE 11.2. Steps in query decomposition.
422 Introduc tion to Database Management System
(a) Query Analysis : In the query analysis stage, programming language compiler checks
that the query is lexically and syntactically correct. A syntactically correct query is
analyzed using system catalogues to verify the existence of relations and attributes used
in query. After analysis a correct query is converted into some internal representation,
which is more efficient for processing.
The type specification of the query qualifier and result is also checked at this stage.
The internal representation may be, query tree or query graph.
Query tree notation : A Typical internal representation of query is query tree. It is
also known as relational algebra tree. A query tree is constructed using tree data
structure that corresponds to the relational algebra expression. Main components of
query tree are:
Root of tree–represents result of query.
Leaf nodes–represent input relations of the query.
Internal nodes–represent intermediate relation that is the output of applying an
operation in the algebra.
The sequence of operations is directed from leaves to the root node.
For example,
Employee_Name, Address, JOB, Department, Department-location
[Link] = [Link]
E D
Employee Department
FIGURE 11.3. Query treee notation.
Query graph notation : Graph data structure is also used for internal representation
of query. In graphs:
Relation nodes–represent relations by single circle.
Constant nodes–represent constant values by double circle.
Edges–represent relation and join conditions.
Square brackets–represent attributes retrieved from each relation.
[E.Employee_Name, [Link],
[[Link]-ID]
[Link], [Link]]
‘Delhi’ D E
Department-Location = Delhi [Link] = [Link]
FIGURE 11.4. Query graph notation.
Query Processing and Optimiz ation 423
(b) Query normalization : After query analysis, it is normalized to remove any redundancy.
In this phase query is converted into normalized form that can be easily manipulated.
A set of equivalency rules are applied to query to simplify the projection and selection
operations to avoid redundancy. Query can be converted into one of the following
two normal forms :
Conjunctive normal form : It is a sequence of conjuncts that are connected with
‘AND’ operator. A conjunct consists of one or more terms connected with ‘OR’
operator. A conjuctive selection consists only those tuples that satisfy all conjuncts.
Example. (Emp_Job = ‘Analyst’ ∨ salary < 50000) ∧ (Hire_Date > 1–1–2000)
Disjunctive normal forms : It is a sequence of disjuncts that are connected with ‘OR’
operator. A disjunct consists of one or more terms connected with ‘AND’ operator.
A disjunctive selection contains those tuples that satisfy anyone of the disjunct.
Example. (Emp_Job= ‘Analyst’ ∧ salary < 5000) ∨ (Hire_Date > 1–1–2000)
Disjunctive normal form is more useful as it allows the query to break into a series
of independent sub-queries linked by union.
(c) Semantic analyzer : The semantic analyzer performs the following tasks :
It helps in reducing the number of predicates.
It rejects contradictory normalized forms.
In case of missing join specification, components of query do not contribute to
generation of results. It identifies these queries and rejects them.
It makes sure that each object in query is referenced correctly according to its data
type.
(d) Query simplifier : The major tasks of query simplifier are as follows :
It eliminates common sub-expressions.
It eliminates redundant qualification.
It introduces integrity constraints, view definitions into the query graph representation.
It eliminates query that voids any integrity constraint without accessing the database.
It transforms sub-graphs into semantically equivalent and more efficient form.
It deals with user access rights.
Idempotence rules of Boolean Algebra are applied to get final form of simplification.
(e) Query Restructuring : At the final stage of query decomposition, transformation rules
are applied to restructure the query to give a more efficient implementation.
(iii) Query Optimization : The aim of the query optimization step is to choose the best
possible query execution plan with minimum resources required to execute that plan. Query
optimization is discussed in detail in section 11.3.
(iv) Execution Plan : Execution plan is the basic algorithm used for each operation in
the query. Execution plans are classified into following Four types : (a) Left-deep tree query
execution plane, (b) Right-deep tree query execution plan, (c) Linear tree execution plan,
(d) Bushy execution plan.
424 Introduc tion to Database Management System
(a) Left-deep tree query execution plan : In left-deep tree query execution plan, development
of plan starts with a single relation and successively adding a operation involving
a single relation until the query is completed. For example, Only the left hand side
of a join is allowed to participate in result from a previous join and hence named
left-deep tree. It is shown in Figure 11.5.
Result
R4
R3
R1 R2
FIGURE 11.5. Left-deep execution plan.
Advantages : The main advantages of left-deep tree query execution plan are
It reduces search space.
Query optimiser is based on dynamic programming techniques.
It is convenient for pipelined evaluation as only one input to each join is pipelined.
Disadvantage : The disadvantages of left-deep tree query execution plan are
Reduction in search space leads to miss some lower cost execution strategies.
(b) Right-deep tree query execution plan : It is almost same as left-deep query execution
plan with the only difference that only the right hand side of a join is allowed to
participate in result from a previous join and hence named right-deep tree. It is
applicable on applications having a large main memory. It is shown in Figure 11.6.
Result
R4
R3
R2 R1
FIGURE 11.6. Right-deep execution plan.
(c) Linear tree execution plan : The combination of left-deep and right-deep execution
plans with a restriction that the relation on one side of each operator is always a
base relation is known as linear trees. It is shown in Figure 11.7.
Query Processing and Optimiz ation 425
Result
R4
R3
R2 R1
FIGURE 11.7. Linear tree execution plan.
(d) Bushy execution plan : Bushy execution plan is the most general type of execution
plan. More than one relation can participate in intermediate results. It is shown in
Figure 11.8.
Result
R3 R4 R5
R2 R1
FIGURE 11.8. Bushy execution plan.
The main advantage of bushy execution plan is the flexibility provided by it in
choosing the best execution plan by increasing search space but this flexibility may
leads to considerably increase the search space.
(v) Query Code Generator : After selecting the best possible execution plan query is
converted into low-level language so that it can be taken as input by runtime database
process.
(vi) Runtime Database Processor : It deals directly with main database and do the
necessary operation mentioned in query and returns the result to user.
11.3 Query Optimization
Query performance of a database systems is dependent not only on the database structure,
but also on the way in which the query is optimized. Query optimization means converting a
query into an equivalent form which is more efficient to execute. It is necessary for high-level
relation queries and it provides an opportunity to DBMS to systematically evaluate alterative
426 Introduc tion to Database Management System
query execution strategies and to choose an optimal strategy. A typical query optimization
process is shown in Figure 11.9.
Statistical data
Estimation formulas
Simplified relational
(determine cardinality
algebra query tree
of intermediate result tables)
Query optimiser
Cost model Execution plan generator
Execution plan in form of
optimized relational algebra query
FIGURE 11.9. Query optimization process.
The main issues that need to be considered in query optimization are:
1. Reduction of data transfer with database.
2. Use of available indexes for fast searching.
3. Reduction of number of times the database is manipulated.
4. The order in which joins should be performed.
5. How to store intermediate results?
Following are the three relations we used in each example:
Employee (Emp-ID, Emp-Name, Age, Salary, Dept-ID)
Department (Dept-ID, Proj-ID, Dept-Name)
Project (Proj-ID, Name, Location, Duration)
There are two main techniques used to implement query optimization. These are heuristic
query optimization and cost based query optimization.
11.3.1 Transformation Rules for Relational Algebra
The transformation rules are used to formulate a relational algebra expression into different
ways and query optimizer choose the most efficient equivalent expression to execute. Two
expressions are considered to be equivalent if they have same set of attributes in different
order but representing the same information.
Let us consider relations R, S, and T with set of attributes
X = {X1, X2, ..., Xn}, Y = {Y1, Y2, ..., Yn} and Z = {Z1, Z2, ..., Zn}
respectively, where X, Y, and Z represent predicates and L, L1, L2, M, M1, M2, and N denote
sets of attributes.
Rule 1. Cascading of selection (σ)
σX ∧ Y ∧ Z (R) ≡ σX (σY(σZ (R))).
Query Processing and Optimiz ation 427
It means that conjunctive selection operations can be transformed into individual selection
operations and vice versa.
σAge
Ex. = 35∧ salary > 50000 (Employee) = σAge = 35 (σsalary> 50000 (Employee)).
Rule 2. Commutativity of selection (σ)
σX (σY (R)) ≡ σY (σX (R)).
Ex. σAge = 35 (σSalary > 50000 (Employee)) ≡ σSalary > 5000 (σAge = 35 (Employee)).
Rule 3. Cascading of projection (π)
πL πM ... πN (R) ≡ πL (R)
Ex. πEmp-name πEmp-name, Age (Employee) ≡ πEmp-name (Employee)
Rule 4. Commutativity of selection (σ) and projection (π)
pX1, X2, ..., Xn (sA (R)) ≡ sA (pX1, X2, ..., Xn (R))
πEmp-name, Age (σSalary > 50000 (Employee)) ≡ σSalary > 5000 (πEmp-name, Age (Employee))
Rule 5. Commutativity of join () and Cartesian Product (X)
R Y S ≡ S Y R
R × S ≡ S × R
Ex. Employee [Link]-ID = [Link]-ID Department ≡
Department [Link]-ID = [Link]-ID Employee
Rule 6. Commutavity of selection (σ) and join () or Cartesian product (X)
(σX R Y S ≡ (σX (R) Y S)
σX (R × S) ≡ (σX (R)) × S
Ex. σ[Link] > 30 ∧ Dept-Name = ‘MARKETING’ (Employee) [Link]-ID = [Link]-ID
(Department) ≡ σ[Link]>30 (Employee) [Link]-ID = [Link]-ID
(σDept-Name = ‘MARKETING’ (Department))
Rule 7. Commutavity of projection (π) and join () or Cartesian product (X).
pL1 ∪ L2 (R Z S) ≡ (pL1 (R)) Z (pL2 (S))
Ex. πEmp-Name, Dept-Name, Dept-ID (Employee [Link]-ID = [Link]-ID Department) ≡
(πEmp-name, Dept-ID (Employee)) [Link]-ID = [Link]-ID (πDept-Name, Dept-ID (Department))
Rule 8. Commutativity of Union (∪) and Intersection (∩)
R ∪ S = S ∪ R
R ∩ S = S ∩ R
Rule 9. Commutativity of Selection (σ) and Union (∪) or Intersection (∩) or
Differerence (–).
σX (R ∪ S ) = σX (S) ∪ σX (R)
σX (R ∩ S) = σX (S) ∩ σX (R)
σX (R – S) = σX (S) – σX (R)
428 Introduc tion to Database Management System
Rule 10. Comutativity of projection (π) and Union (∪)
πL (R ∪ S) = πL (S) ∪ πL (R)
Rule 11. Associativity of Join () and Cartesian product (X)
(R S) T ≡ R (S T)
(R X S) X T ≡ R X (S X T)
Rule 12. Associativity of Union (∪) and Intersection (∩)
(R ∪ S ) ∪ T ≡ S ∪ (R ∪ T)
(R ∩ S) ∩ T ≡ S ∩ (R ∩ T)
Rule 13. Converting a selection (σ) and Cartesian product (X) sequence into Join ()
σX (R X S ) ≡ (R X S)
11.3.2 Heuristic Query Optimization
Heuristic query optimization technique is used to modify the internal representation of a
query by using heuristic rules and transformation rules. Heuristic rules are used in the
form of a query tree or query graph structure. Optimiser starts with initial query tree and
transform it into an equivalent and efficient query tree using transformation rules.
Heuristic Optimization Algorithm : DBMS use heuristic optimization algorithms to improve
the efficiency of query by converting initial query tree into an equivalent and optimized
query tree. Optimizers utilize transformation rules to optimize the structure of query tree.
Following are the steps of heuristic optimization algorithm.
Step 1. Perform Selection operation as early as possible : By using selection operation at
early stages, you can reduce the unwanted number of record or data, to transfer
from database to primary memory.
Optimizer use transformation rule 1 to divide selection operations with conjunctive
conditions into a cascade of selection operations.
Step 2. Perform commutativity of selection operation with other operations as early as
possible : Optimizer use transformation rule 2, 4, 6, and 9 to move selection
operation as far down the tree as possible and keep selection predicates on the
same relation together. By keeping selection operation down at tree reduces the
unwanted data transfer and by keeping selection predicates together on same
relations reduces the number of times of database manipulation to retrieve
records from same database table.
Step 3. Combine the Cartesian Product with subsequent selection operation whose predicates
represents a join condition into a JOIN operation : Optimizer uses transformation rule
13 to convert a selection and cartesian product sequence into join. It reduces
data transfer. It is always better to transfer only required data from database
instead of transferring whole data and then refine it. (Cartesian product combines
all data of all the tables mention in query while join operation retrieves only
those records from database that satisfy the join condition).
Step 4. Use Commutativity and Associativity of Binary operations : Optimizer use transformation
rules 5, 11, and 12 to execute the most restrictive selection operations first.
Query Processing and Optimiz ation 429
It rearranges the leaf nodes of query tree. By using the most restrictive selection
operations, the number of records fetched from database reduces and also
subsequent operations can be performed on less number of records.
Step 5. Perform projection operations as early as possible : After performing selection operations,
optimizer use transformation rules 3, 4, 7 and 10 to reduce the number of
columns of a relation by moving projection operations as far down the tree as
possible and keeping projection predicates on the same relation together.
Step 6. Compute common expressions only once: It is used to identify sub-trees that represent
groups of operations that can be executed by a single algorithm.
Consider the query below in SQL and transformation of its initial query tree into an
optimal query tree.
Select Emp_Name
From Employee e, Department d, Project p
Where [Link] = ‘LUXMI PUB.’
AND d.Proj_ID = p.Proj_ID
AND e.Dept_ID= d.Dept_ID
AND [Link] > 35
This query needs to display names of all employees working for project “LUXMI PUB.”
and having age more than 35 years.
Figure 11.10 shows the initial query tree for the given SQL query. If the tree is executed
directly then it results in the Cartesian product of entire Employee, Department, and Project
table but in reality, the query needed only one record from relation Project and only the
employee records for those whose age is greater than 35 years.
Emp_Name
[Link] = ‘LUXMI PUB.’ d.Proj_ID = p.Proj_ID e.Dept_ID = d.Dept_ID [Link] > 35
X Project
Employee Department
FIGURE 11.10. Initial query tree.
— We can improve the performance by first applying selection operations to reduce the
number of records that appear in Casterian product. Figure 11.11 shows the improved query
tree.
430 Introduc tion to Database Management System
Emp_Name
d.Proj_ID = p.Proj_ID
e.Dept_ID = d.Dept_ID [Link] = ‘LUXMI PUB.’
X Project
[Link] > 35 Department
Employee
FIGURE 11.11. Improved query tree by first applying selection operations.
— The query tree can be further improved by applying more restrictive selection operation.
So, switch the positions of relations Project and Employee as you know that in a single
project it may be more than one employee. Figure 11.12 shows the improved query tree.
Emp_Name
e.Dept_ID = d.Dept_ID
d.Proj_ID = p.Proj_ID [Link] > 35
X Employee
[Link] = ‘LUXMI PUB.’ Department
Project
FIGURE 11.12. Improved query tree by applying more restrictive selection operations.
— A further improvement can be done by replacing Cartesian product operations by
Join operations with a join condition as shown in Figure 11.13.
Query Processing and Optimiz ation 431
Emp_Name
p.Dept_ID = d.Dept_ID
d.Proj_ID = p.Proj_ID [Link] > 35
Employee
[Link] = ‘LUXMI PUB.’ Department
Project
FIGURE 11.13. Improved query tree by replacing Cartesian product and
selection operations by join operations.
— Further improvement can be done in query tree by keeping only required attributes
(columns) of relations by applying projection operations as early as possible in the query
tree. Optimizer keep the attributes required to display, and the attributes needed by
the subsequent operation in the intermediate relations. Modified query tree is shown in
Figure 11.14.
Emp_Name
e.Dept_ID = d.Dept_ID
d.Dept_ID e.Emp_ID, e.Emp_Name, e.Dept_ID
d.Proj_ID = p.Proj_ID [Link] > 35
p.Proj_ID d.Dept_ID, d.Proj_ID Employee
[Link] = ‘LUXMI PUB.’ Department
Project
FIGURE 11.14. Improved query tree by applying and moving projection
operations down the query tree.
The SELECT clause in SQL is equivalent to projection operation and WHERE clause in SQL is
equivalent to selection operation in relational algebra.
432 Introduc tion to Database Management System
11.3.3 Cost Based Query Optimization
In cost based query optimization, optimizer estimates the cost of running of all alternatives
of a query and choose the optimum alternative. The alternative which uses the minimum
resources is having minimum cost. The cost of a query operation is mainly depend on its
selectivity i.e., the proportion of the input relations that forms the output. Following are the
main components used to determine the cost of execution of a query:
(a) Access cost of secondary storage : Access cost to secondary storage consists of cost of
database manipulation operations which includes searching, writing, reading of data
blocks stored in the secondary memory. The cost of searching depends upon the type
of indexes (primary, secondary, hashed), type of file structure, ordering of relation
in addition to physical storage location like file blocks are allocated contiguously on
the same disk or scattered on the disk.
(b) Storage cost : Storage cost consists of cost of storing intermediate results (tables or
files) that are generated by the execution strategy for the query.
(c) Computation cost : Computation cost consists of performing in-memory operations
during query execution such as sorting of records in a file, merging of records,
performing computations on field values, searching of records. These are mainly
performed on data buffers.
(d) Memory usage cost : It consists of cost of pertaining to the number of memory
buffers needed during query execution.
(e) Communication cost : It consists of the cost of transferring query and its result from
database location to the location of terminal where the query is originated.
From all the above components, the most important is access cost to secondary storage
because secondary storage is comparatively slower than other devices. Optimizer try to
minimize computation cost for small databases as most of the data files are stored in main
memory. For large database, it try to minimize the access cost to secondary storage and for
distributed databases, it trys to minimize the communication cost because various sites are
involved for data transfer.
To estimate the cost of execution strategies, optimizer access statistical data stored in
DBMS catalog. The information stored in DBMS catalog is given below:
(i) Number of records in relation X, given as R.
(ii) Number of blocks required to store relation X, given as B.
(iii) Blocking factor of relation X, given as BFR.
(iv) Primary access method for each file and attributes for each file.
(v) Number of levels for each multi-level index for a attribute A given as IA.
(vi) Number of first-level index blocks for a attribute A, given as BAI1.
(vii) Selection cardinality of attribute A in relation R, given as SA, where SA= R × SLA,
where SLA is the selectivity of the attributes.
Cost Function for Selection Operation : Selection operation works on a single relation in
relation algebra and retrieves the records that satisfy the given condition. Depending upon
the structure of file, available indexes, searching methods, the estimated cost of strategies
for selection operation is as given below.
Query Processing and Optimiz ation 433
[Link]. Strategies Cost
1. Linear search B/2, if record is found
B, if record not found
2. Binary search log2 B, if equality condition is one a
unique key attribute
log2 B + (SA/BFR)-1, otherwise
3. Using primary index to retrive a single 1, assuming no overflow
record
4. Equality condition on primary key IA + 1
5. Equality condition on hash key 1, assuming no overflow
6. Inequality condition of primary index IA + B/2
7. Inequality condition of any ordered index IA + B/2
8. Equality condition on clustering index [IA + 1] + [SA/BFR]
(secondary index)
9. Equality condition on non-clustering [IA + 1] + [SA]
index (secondary index)
10. Inequality condition on B+-tree index IA + [BAI1/2] + [R/2]
(secondary index)
Now consider the example of Employee table. Consider, there are 10,000 records stored
in 2000 blocks. Also the following indexes are available.
A secondary index on the Emp-ID with 4 levels.
A clustering index on salary with 3 levels and average selection cardinality of 30.
A secondary index on non-key attributes Age with 2 levels and 4 first level index
blocks. There are 200 distinct values for Age.
Consider the queries:
(i) σEmp-ID = 1A (Employee).
(ii) σAge > 20 (Employee).
(iii) σAge = 20 AND salary > 9000 (Employee).
Consider the following cost components.
Number of records (R) = 10,000
Number of blocks (B)= 2000
Blocking Factor (BFR) = 10000/2000 = 5
IEmp-ID = 4, SEmp-ID = 10000/10000 = 1
IAge = 2, SAge = 10000/200 = 50
ISalary = 2, SSalary = 30
Now cost of the above queries (suppose records are in table)
(i) If linear search is used then cost will be B/2 = 2000/2= 1000
(ii) If linear search is used then cost will be B = 2000.
434 Introduc tion to Database Management System
(iii) This query has a conjunctive selection condition. To estimate the cost of use using
anyone of the two components of selection condition, to retrieve the records plus
the linear search. The linear search costs 2000 and condition salary > 9000 first gives
cost estimate of Isalary + (B/2)= 2 + 1000 = 1002 and cost of condition Age = 20 is
30 + 2 = 32 So, total cost is 1034.
Cost Function for Join Operation : Join operation is the most time consuming operation
in databases and an accurate cost function for join operations depends upon the estimate of
size of file (number of records) after the join operation. The estimated cost of strategies for
join operation is given below:
[Link]. Strategies Cost
1. Nested loop joins (Block) (a) B(R) + [B(R) * B(S)], if the buffer has only
one block
(b) B(R) + [B(S) * (B(R)/(Buffer-2))], if [Buffer-2]
block for relation R.
(c) B(R) + B(S), if all blocks of R can be read into
database buffer
2. Indexed nested-loop join (a) B(R) + R * (IA + 1), if join attribute A in
relation S is a primary key.
(b) B(R) + R * [IA + (SA(R)/BFR(R))], for clustering
index I on attribute A
3. Sort merged join (a) B(R) * [log2 (B(R) + B(S) * log2 (BCR)))], for
sorts
(b) B(R) + B(S) for merge
4. Hash join (a) (3 * B(R)) + B(S), if hash index is held in
memory
(b) 2[B(R) + B(S)] * [log (B(S) – 1)] + B(R) +
B(S), otherwise
(R) and (S) are relations R and S respectively.
Example. Consider we have 600 records in Department table. BFR for department table
is 60 and number of blocks are 600/60 = 10.
For the Join operation, Employee Dept-ID Department
Type of Join Cost
Block nested-loop joins 22000 (Buffer has only one block)
2010 (if all blocks of employee can be read into
database buffer)
Hash Join 6010 (is hash index is held in memory)
Example. Given two unary relation (contains only one attribute), R1 and R2.
R1 R2
7 8
2 4
Query Processing and Optimiz ation 435
9 2
8 1
3 3
9 2
1 7
3 3
6
Show the result of joining R1 and R2 using each of the following algorithms. List the
results in the order that would be output by the join algorithm.
(a) Nested Loop Join. Use R1 for outer Loop and R2 for the inner loop.
(b) Sort Merge Join.
Solution. The result relation contains only one attribute, which is the common attribute
between R and S.
(a) Nested loop join. For every tuple in R1, find matches in R2.
7, 2, 2, 8, 3, 3, 1, 3, 3
(b) Sort merge join. The result appears in sorted order on the join attribute.
1, 2, 2, 3, 3, 3, 3, 7, 8
SOLVED PROBLEMS
Problem 1. Consider two relations R and S. The number of tuples in R is 500,000 each
of size 50 bytes. The number of tuples in S is 125,000 each of size 40 bytes. The page size
is 5 KB. We have a buffer of size 250 KB. Answer the following questions with detail steps
for following the join query:
SELECT *
FROM R, S
WHERE [Link]=[Link].
1. Find the cost of page oriented Simple Nested Loops Join.
2. Find the cost of Block Nested Loops Join.
3. Find the cost of Sort-Merge Join.
4. Can you do any refinement in the Sort-Merge Join Cost? Explain?
5. Find the cost of Hash Join.
Solution. The given information is as follows:
Page size = 5 K, Buffer size = 250 K = 250/50 = 50 pages.
For R, number of tuples = 500,000, size of one tupe = 50 bytes.
Size of R = 500,000 × 50 byte = 500,000×50/5000 pages = 5000 pages.
For S, number of tuples = 125,000, size of each tuple = 40 bytes.
Size of S = 125,000 × 40 byte = 1250,000 × 40/5000 pages = 1000 pages.
436 Introduc tion to Database Management System
Consider S to be the outer relation and R to be the inner relation.
Note. We considered S to be outer because it is the smaller relation. You can also
consider R to be outer but the cost will be more.
Consider N=size of R=5000 pages and M=size of S=1000 pages.
1. Page Oriented Simple Nested Loops Join.
Cost=M+M*N
(Reading outer Relation once + for each outer page reading the whole inner
relation)
Cost=1000+5000*1000=5,001,000 I/O
2. Block Nested Loops Join.
Cost= (scan outer)+(scan inner)*(number of outer blocks)
(number of outer blocks)= ceiling of ((size of outer relation)/(Block size) ).
(block size)=(Buffer Size)-2. (one buffer for reading and one buffer for writing)
Cost=M+N*ceiling of (M/B-2).
Cost=1000+5000*21=106,000 I/O.
3. Sort-Merge Join.
Cost=sorting both relations + reading both relations.
Cost of sorting relation = 2*(number of passes)*Size. (2 for reading and writing)
Number of passes= ceiling of (log of (size/buffer) to the base of B-1) + 1.
The one comes here for pass 0. After pass zero you get (Size/Buffer) sorted runs
For Relation S, (# of passes)=2; by applying the above formula.
Cost of sorting S=2*2*1000 =4000.
Explanation: You can sort S in two passes.
In pass 0, you produce 20 (1000/50) sorted runs of size 50.
In pass 1, you can sort these 20 sorted runs, and hence the whole relation.
For Relation R, (# of passes)=3; by applying the above formula.
Cost of sorting R=2*3*5000 =30,000.
Explanation: You can sort R in three passes.
In pass 0, you produce 100 (5000/50) sorted runs of size 50.
In pass 1, you sort these 100 sorted runs, and produce 2 sorted runs of size 50.
In pass 2, you sort these 2 sorted runs and hence the whole relation.
Final Cost=1000 + 5000 + 4000 + 30,000 = 40,000 I/O.
4. To do the refinement of combing the merging phase in the sorting of R and S,
size of buffer must be greater than sqrt of (size of Larger relation)
In this example, size of Larger relation = 5000 and buffer size = 50. Hence this
condition is not satisfied. So we cannot do this refinement.
5. Hash Join
Cost= 3*(M+N)
Cost= 3*6000=18,000 I/O.
Query Processing and Optimiz ation 437
Problem 2. Consider the join between relations R and S on R.a=S.b, given the following
information about the relations to be joined. The cost metric is the number of page I/Os
unless otherwise noted, and the cost of writing out the result should be uniformly ignored.
• Relation R contains 10,000 tuples and has 10 tuples per page.
• Relation S contains 2,000 tuples and also has 10 tuples per page.
• Attribute b of relation S is the primary key for S.
• Both relations are stored as simple heap .les.
• Neither relation has any indexes built on it.
• 52 buffer pages are available.
(a) What is the cost of joining R and S using a page-oriented simple nested loops
join? What is the minimum number of buffer pages required for this cost to
remain unchanged?
(b) What is the cost of joining R and S using a block-nested loops join? What is the
minimum number of buffer pages required for this cost to remain unchanged?
(c) What is the cost of joining R and S using a sort-merge join? What is the minimum
number of buffer pages required for this cost to remain unchanged?
(d) What is the cost of joining R and S using a hash join? What is the minimum
number of buffer pages required for this cost to remain unchanged? Neglect the
size involved in building a hash table.
Solution. Let M = 1000 be the number of pages in R, N = 200 be the number of pages
in S, and B = 52 be the number of [Link] pages available.
(a) Basic idea is to read each page of the outer relation, and for each page scan
the inner relation for matching tuples. Total cost would be #pages in outer +
(#pages in outer #pages in inner) which is minimized by having the smaller
relation be the outer relation.
Total Cost = N + (N*M) = 200, 200
The minimum number of buffer pages for this cost is 3.
(b) This time read the outer relation in blocks, and for each block scan the inner
relation for matching tuples. So the outer relation is still read once, but the inner
relation is scanned only once for each outer block, of which there are:
ceiling of (#pagesinouter/ B-2] = 4.
Total Cost = N +M* ceiling (N/B-2) = 4, 200.
If the number of buffer pages is less than 52, the number of scans of the inner
would be more than 4 since ceiling(200/49) is 5. The minimum number of buffer
pages for this cost is therefore 52.
(c) Since B > √M > √N we can use the refinement to Sort-Merge discussed on pages
254–255 in the text.
Total Cost = 3 (M + N) = 3, 600
Note. If S.b were not a key, then the merging phase could require more than
one pass over one of the relations, making the cost of merging M N I/Os in
the worst case.
438 Introduc tion to Database Management System
The minimum number of buffer pages required is 25. With 25 buffer pages, the
initial sorting pass will split R into 20 runs of size 50 and split S into 4 runs
of size 50 (approximately). These 24 runs can then be merged in one pass, with
one page left over to be used as an output buffer. With fewer than 25 buffer
pages the number of runs produced by the first pass over both relations would
exceed the number of available pages, making a one-pass merge impossible.
(d) The cost of Hash Join is 3(M+N) if B > √ N where N is the number of pages in
the smaller relation, S. Since √N =14, this condition is met. We will also assume
uniform partitioning from our hash function.
Total Cost = 3 (M + N) = 3, 600.
The minimum number of buffer pages required, and a good guess is that we
need B >√N.
Problem 3. Consider the relations r1(A,B,C), r2(C,D,E), and r3(E, F), with primary keys
A, C, and E, respectively. Assume that r1 has 1000 tuples, r2 has 1500 tuples, and r3 has
750 tuples.
(a) Give an efficient strategy (in which order) for computing r1 1 r2 1 r3
(b) Estimate the size of the above joins.
Solution.
(a) the order shall be (r1 1 r2) 1 r3. This is because the estimate size of r1 1 r2 is
1000, the estimate size of r1 1 r3 is 750,000, while that of r2 1 r3 is 1500.
(b) less than 1000.
Note that (a) and (b) are not necessarily related.
Problem 4. Consider the tables WORKS AT(SSN, gymID) and GYM(gymID, name). Notice
that gymID is not a candidate key for the table GYM.
WORKS AT(SSN, gymID) consists of N1 = 100,000 tuples and has
• V(SSN,WORKS AT) = 50,000 distinct values of SSN
• V(gymID,WORKS AT) = 20,000 distinct values of gymID.
GYM(gymID, name) consists of N1 = 40,000 tuples and has
• V(gymID,GYM) = 20,000 distinct values of gymID
• V(name,GYM) = 30,000 distinct values of name.
(a) Estimate the number of qualifying tuples of the query:
SELECT *
FROM WORKS_AT
WHERE SSN = 123456789;
(b) Can SSN be a candidate key for the table WORKS AT? Give a short explanation
for your answer.
(c) Estimate the number of qualifying tuples of the query:
SELECT *
FROM GYM
WHERE name = “Gym planet”;
Query Processing and Optimiz ation 439
(d) Estimate the number of qualifying tuples of the query:
SELECT *
FROM WORKS_AT
WHERE SSN = 123456789 AND gymID=101;
(e) Notice that gymID is not a candidate key for the table GYM. Estimate the
number of qualifying tuples of the query:
SELECT SSN, [Link], name
FROM WORKS_AT JOIN GYM
WHERE [Link] = WORKS_AT.gymID;
(f) Estimate the number of qualifying tuples of the query:
SELECT [Link], [Link]
FROM WORKS_AT AS WA1 JOIN WORKS_AT AS WA2
WHERE [Link] = [Link];
Solution.
(a) N1/V (SSN, WORKS AT) = 100, 000/50, 000 = 2
(b) No, because it has multiple duplicates in the table.
(c) N2/V (name, GYM) = 40, 000/30, 000 = 1.33
(d) N1/(V (SSN, WORKS AT)∗V (gymID, WORKS AT)) = 100, 000/50, 000/20, 000 =0.0001
(e) gymID is primary key in the table GYM and foreign key in the table WORKS AT. So,
N1 ∗ N2/V (gymID, WORKS AT) = 100, 000 ∗ 40, 000/20, 000 = 200, 000
(f) gymID is not primary key in the table WORKS AT. So, N1 ∗ N1/V (gymID, WORKSAT)
= 100, 000 ∗ 100, 000/20, 000 = 500,000
TEST YOUR KNOWLEDGE
True/False
1. In a query processing system, the optimizer optimizes the query before it is given to the
parser and translator.
2. Working with data in the data cache is many times faster than working with data in the data
files.
3. The cost of processing a query is usually dominated by secondary storage access, which is
slow compared to memory access.
4. The CPU time taken to process a query is usually taken as the cost of the query.
5. The SQL execution activities are performed by the query optimizer.
6. In general, a query plan that generates few intermediate result tuples is preferable to one that
generates a lot of intermediate result tuples.
7. Query processing is the procedure of selecting the most appropriate plan that is used in
responding to a database request.
8. Binary search is better than linear search in the sense that it always works, no matter what.
440 Introduc tion to Database Management System
9. The internal query representation is usually a binary query tree.
10. A cost-based optimizer uses a set of preset rules and points to determine the best approach
to execute a query.
11. To access the records in a file in a specific sort order on the value of an attribute, we could
create an index on that attribute, and use that index to access the record in the desired order.
12. A query tree is also called a relational algebra tree.
13. We can implement a disjunctive selection (“or”s) by taking the intersection of pointers to
records satisfying each component of the disjunction.
14. It is more advantageous (in terms of query processing cost) to use nested loop join instead
of block nested loop join.
15. In the hash-join algorithm, the number of partitions is determined by the size of memory and
the size of the probe relation.
16. Heuristic rules are used as an optimization technique to modify the internal representation of
a query.
17. The cost of processing a query is not dependent on disk access.
18. In general, heuristic rules are used in the form of query tree or query graph data structure.
19. The heuristic optimization algorithm utilizes some of the transformation rules to transform an
initial query tree into an optimised and efficiently executable query tree.
20. Character field comparisons are faster than numeric, date, and NULL comparisons.
Fill in the Blanks
1. Query processing has __________ phases.
2. The ______ analyzes the SQL query and finds the most efficient way to access the data.
3. _____________ play an important role in speeding up data access.
4. ________ is a measure of how likely an index will be used in query processing.
5. Once an SQL statement is transformed, the DBMS creates a(n) _______ plan.
6. In syntax-checking phase of query processing the system _______ the query and checks that
it obeys the ________ rules.
7. The two types of query optimization techniques are (a) ________ and (b) ________.
8. A query tree is also called a ________ tree.
9. Query transformation is performed by transforming the query into ________ that are more
efficient to execute.
10. In the ________ stage, the query is lexically and syntactically analysed using parsers to find
out any syntax error.
11. In ________ stage, the query is converted into normalised form that can be more easily
manipulated.
12. ________ uses the transformation rules to convert one relational algebraic expression into an
equivalent form that is more efficient.
13. In general, the heuristic rules are used in the form of ________ or _______ structure.
Multiple Choice Questions
1. The DBMS ____ the SQL query using the chosen execution plan.
(a) parses (b) executes (c) fetches (d) processes
Query Processing and Optimiz ation 441
2. The objective of query simplifier is
(a) transformation of the query to a semantically equivalent and more efficient form.
(b) detection of redundant qualifications
(c) elimination of common sub-expressions
(d) all of these.
3. If there is no index, the DBMS will perform a ____ scan
(a) loop (b) range
(c) row ID table access (d) full table
4. One measure that determines the need for an index is the ____ of the column you want to
index. ____ refers to the number of different values a column could possibly have
(a) Database statistics (b) Data sparsity
(c) Primary keys (d) Query optimization
5. Knowing the sparsity of a column helps you decide whether the use of ____ is appropriate
(a) query processing (b) query optimization
(c) an index (d) a full table scan
6. ____ is/are the central activity during the parsing phase in query processing
(a) Database statistics (b) Data sparsity
(c) SQL query (d) Query optimization
7. Hash and Merge join are applicable
(a) for any join condition
(b) only when the join condition is an equality (=)
(c) when the join condition has an inequality (> or <)
(d) when the join condition has a \like”
8. Two very large relations R and S are have the physical sort order of R.c1 and S.c1 respectively
(assume a clustering index exists on R.c1 and S.c1). For the query select * from R, S where
R.c1 = S.c1 order by R.c1 the likely best join method is
(a) nested loops join (b) hash join
(c) merge join (d) index nested loops join
9. The component of a database management system responsible for determining the most efficient
way to execute a query is
(a) Query plan (b) Query optimize
(c) Composite index (d) Query
10. Cardinality estimation depends on estimates of the
(a) Selection factors of predicates in the query
(b) Selection factors of query optimizer
(c) Selection factors of query process
(d) Transitional database management system
11. Relational query optimization depends on knowledge of the physical storage of data which is
(i) Readily available
(ii) Problematic
442 Introduc tion to Database Management System
(iii) With query optimizer
(iv) Access path
which of the following are correct?
(a) (i), (iii) (b) (i), (iv) (c) (i), (ii) (d) (iii), (iv)
12. Efficiency of a query can be measured by
(i) Response time
(ii) Execution time
(iii) Execution sequence
(iv) Correctness
which of the following satisfies the statement?
(a) (i), (iii) (b) (ii), (iv) (c) (iii), (iv) (d) (i), (iv)
13. The data processor consists of three elements
(i) Local query optimizer
(ii) Distributed execution monitor
(iii) Local recovery manager
(iv) Run-time support processor
which of the following are true?
(a) (i), (ii) (b) (i), (iv) (c) (iii), (iv) (d) (i), (ii)
14. One of the following steps is not involved in processing a query
(a) Parsing and translation (b) Optimization
(c) Evaluation (d) Distribution
For the Q. 15–19, let r and s be relations, b s (number of blocks occupied by s) =
400, ns (number of records in s)= 1600, b r (number of blocks occupied by r) = 1600,
nr (number of records in r) = 3200. Assume the final result of operations is not written to disk.
15. Let the buffer size M = 400. What is the minimum achievable cost of a join using the block
nested loop join algorithm, with s being traversed in the outside loop?
(a) 400 + 160 (b) 1600 + 400*2 (c) 1600 + 400*4 (d) None of the above
16. Let the buffer size M = 201. What is the minimum achievable cost of a join using the block
nested loop join algorithm, with the r relation being traversed in the outer loop?
(a) 400 + 1600 (b) 400 + 1600 + 1599
(c) 1600 + 400 + 7*399 (d) 400 + 1600 * 1600
17. Assuming we use merge join to join r and s, and the memory allocated for the sort operation
M = 10, what is the cost of the join if both r and s are already sorted on the join attribute?
(a) 400 + 600 (b) 400*10 + 600
(c) 400*10 + 600*10 (d) None of the above
18. Assuming we use hash join, r is the build relation, and M = 10. How many partitioning
passes are needed?
(a) 1 (b) 2 (c) 3 (d) 4
19. Suppose we want to sort the s relation using the sort-merge algorithm, and M = 4. How
many merge passes are needed to get the final result?
(a) Log3400 (b) Log3100 (c) Log4100 (d) Log4400
Query Processing and Optimiz ation 443
For Q. 20–25 below, assume r(A,B,C), s(C,D,E), t(E,F,G) are relations, nr=1000, ns=5000, nt=600,
V(C,r)=200, V(C,s)=100, V(B,r)=400,Min(A)=41, Max(A)=50, in(B)=31, Max(B)=40, Min(C)=80,
Max(C)=89, Min(D)=10, Max(D)=14, V(E,s)=50, V(E,t)=200.
In Q. 20–24, estimate how many tuples there are in the result of the queries.
20. sA≥49 ∨ A<42 (r)
(a) 200 (b) 300 (c) 400 (d) 401
21. sA≥49 ∨ B≥39 ∨ C≥86 (r)
(a) 384 (b) 616 (c) 614 (d) 386
22. r |X| s if C is not a superkey of r and C is not a superkey of s
(a) 25,000 * (b) 50,000 (c) 75,000 (d) 78,000
23. s |X| t if E is the primary key of t and s
(a) At most 600 (b) At most 5,000
(c) At most 3,000,000 (d) None of the above
24. CGsum(A) (r)
(a) 200 (b) 1000 (c) 200,000 (d) 1,200
25. Estimate the number of distinct values V(B, s A≥46(r) )
(a) 50 (b) 100 (c) 200 (d) 400
26. Which of the following cost is the most important cost component to be considered during
the cost-based query optimization?
(a) memory uses cost (b) secondary storage access cost
(c) communication cost (d) all of these
27. In general, the heuristic rules are used in the form of
(a) query tree (b) query graph data structure.
(c) both (a) and (b) (d) either (a) or (b)
Answers
True/False
1. F 2. T 3. T
4. F 5. F 6. T
7. T 8. F 9. T
10. F 11. T 12. T
13. F 14. F 15. F
16. T 17. T 18. T
19. T 20. F
Fill in the Blanks
1. Three 2. query optimizer 3. Indexes
4. Index selectivity 5. execution 6. analyzed, language
7. heuristic, cost based 8. relational algebra 9. equivalent expression
10. syntax analysis 11. query normalization 12. query simplifier
13. query tree, query graph
444 Introduc tion to Database Management System
Multiple Choice Questions
1. (b) 2. (d) 3. (d)
4. (b) 5. (c) 6. (d)
7. (b) 8. (c) 9. (b)
10. (a) 11. (a) 12. (d)
13. (b) 14. (d) 15. (d)
16. (c) 17. (d) 18. (c)
19. (b) 20. (b) 21. (b)
22. (a) 23. (d) 24. (a)
25. (d) 26. (d) 27. (c)
Exercises
Short Answer Questions
1. What is query processing?
2. What are the objectives of query processing?
Ans. The aim of query processing is to transform a query written in a high level language
into a correct and efficient execution strategy expressed in a low level language and execute
the strategy to retrieve the data.
3. What are the 3 steps of query processing?
(i) Parsing and translation, (ii) Optimization, (iii) Evaluation
4. What is query processor?
5. What is execution/query plan?
6. What is specified in a query plan (evaluation plan)?
Ans. Query plan defines what algorithm is used for each operation and how the execution
of the operations is coordinated.
7. Explain the general strategy for query processing?
8. Explain various steps in query processing.
9. What is query decomposition?
10. What are the various stages of query processing?
11. What is query analysis?
12. What is query tree notation?
13. What is query graph notation?
14. What is query optimization? What is its aim?
15. What are various query execution plans?
16. What is left deep tree query execution plan?
17. What is right deep tree query execution plan?
18. What is linear tree query execution plan?
19. What is bushy execution plan?
20. What is query code generator?
Query Processing and Optimiz ation 445
21. What is runtime database processor?
22. What is the goal of query optimization?
Ans. Find the most efficient query evaluation plan (query plan) for a given query, i.e., the
one which can be executed most efficiently.
23. What are the main issues that must be considered in query optimization?
24. At what time during query processing does (query) optimization occur?
Ans. After parsing the query and translating it to relational algebra.
25. What are the various rules for transforming relational algebra query?
26. What is heuristic query optimization?
27. Explain heuristic query optimization algorithm.
28. What is cost based query optimization?
29. How cost-based optimization works?
Ans. Cost-based optimization works in 3 steps:
1. Generating logically equivalent expressions (using equivalence rules);
2. Annotating resultant expressions to get alternative query plans;
3. Choosing the cheapest plan based on estimated cost.
30. What are the main components that are used to determine the cost of execution of a query?
31. What is the cost function for selection operation?
32. What is the cost function for join operation?
33. Two relations R with 60000 tuples and occupying of 300 blocks is to be joined with a relation
S with 40000 tuples and occupying 400 blocks. What is the total cost using the algorithms
of nested-loop-join and block-nested loop join? Give both the best case and the worst case
figures.
Ans. Nested-loop-join: Worst Case: If R as the outer relation 60000 × 400 + 300 = 24000300
Disk access, if S as the outer relation we need 40000 × 300 + 400 = 12000400 disk accesses
Best Case: 300 + 400 = 700 disk accesses will be required.
Block-nested loop join:
Worst Case: If R is the outer relation, we need 300 * 400 + 300 = 120300 disk access, if S is
the outer relation we need 400 * 300 + 400 = 120400 disk access
Best Case: 300+ 400 = 700 disk accesses will be required.
34. How do you estimate the query cost for natural join when
(i) R S = j
(ii) R S is a foreign key
Ans.
(i) If R S = F , then r s is the same as r × s.
(ii) If R S in S is a foreign key in S referencing R, then the number of tuples in r s is
exactly the same as the number of tuples in S. The case for R S being a foreign key
referencing S is symmetric.
35. Given two relations R (A, B) and S (B, C) with number of tuples in R and S equal to 500
and 1000 respectively and B is the foreign key in R, what is the number of tuples in R S.
Ans. The number of tuples in R S is 500000.
36. Assuming M memory blocks, what is the best way to use these blocks in the block nested
loop join?
Ans. M−2 blocks for the outer relation, 1 block for the inner relation, 1 block for the output
446 Introduc tion to Database Management System
37. What is the number of block accesses for a merge-join, assuming that the relations are sorted
and the join attribute(s) in one relation forms a key?
Ans. br + bs, where br and bs is the number of blocks of relation r and s, respectively.
38. Is E1 θ E2 = E2 θ E1 a correct equivalence rule (for query optimization)?
Ans. Yes
39. What is the number of block accesses for a merge-join, assuming that the relations are sorted
and the join attribute(s) in one relation forms a key?
Ans. br + bs, where br and bs are the number of blocks in relation r and s, respectively.
40. Consider a relation r(A, B), which is sorted on A, and the query sB=v(r). Is binary search a
valid strategy to evaluate this query?
Ans. No
41. What are optimizer choices? How many modes do optimizer choices have?
Ans. Query optimization is the central activity during the parsing phase in query processing.
In this phase, the DBMS must choose what indexes to use, how to perform join operations,
and what table to use first, and so on. Each DBMS has its own algorithms for determining
the most efficient way to access the data. The query optimizer can operate in one of two
modes:
A rule-based optimizer uses preset rules and points to determine the best approach to execute
a query. The rules assign a “fixed cost” to each SQL operation; the costs are then added to
yield the cost of the execution plan. For example, a full table scan has a set cost of 10, while
a table access by row ID has a set cost of 3.
A cost-based optimizer uses sophisticated algorithms based on the statistics about the objects
being accessed to determine the best approach to execute a query. In this case, the optimizer
process adds up the processing cost, the I/O costs, and the resource costs (RAM and temporary
space) to come up with the total cost of a given execution plan.
42. How can queries be written to perform the fastest when equality and inequality comparisons
are needed?
Ans. Equality comparisons are faster than inequality comparisons. As a general rule, equality
comparisons are processed faster than inequality comparisons. For example, PRICE = 10.00
is processed faster because the DBMS can do a direct search using the index in the column.
If there are no exact matches, the condition is evaluated as false. However, if you use an
inequality symbol (>, >=, <, <=), the DBMS must perform additional processing to complete
the request. The reason is because there will almost always be more “greater than” or “less
than” values than exactly “equal” values in the index. The “not equal” symbol (<>) yields
slower searches, especially when the sparsity of the data is high, that is, when there are many
more different values than there are equal values.
Long Answer Questions
1. What are the general strategies for query processing ? Explain. Also discuss query optimization
techniques through suitable example.
2. What is meant by query processing ? Discuss the various methods for query processing and
query optimization with the help of suitable examples.
3. What do you mean by query optimization ? Explain the various query optimization methods.
4. Explain the basic algorithm for executing query-processing operation with examples.
5. Discuss various Heuristic based query optimization techniques with examples.
6. Discuss the issues that are considered in designing of query optimizer.
Query Processing and Optimiz ation 447
7. What is query processing ? Describe the steps involved in query processing.
8. Describe how a query that involves join, selection and projection operations can be optimized.
Explain the above with a suitable example.
9. Write the relational algebraic expression and create the query graph for the query. List the
names of the employees who draw a salary above. ` 10,000 and are working on a project
whose title is “Disaster management”. Suggest query improvements in the query graph, if any.
10. A query-involving join on three relations is to be executed. What factors will a DBMS consider
to do evaluation of the query in an optimal fashion ? Explain with the help of an example.
11. Does the data dictionary have any role to play in query processing? Describe with the help
of an SQL query requiring Join operation, SELECTION and PROJECTION.