CHAPTER – 2
QUERY PROCESSING AND OPTIMIZATION
ADVANCED DATABASE SYSTEM
Compiled by: Temesgen Tilahun
CHAPTER OUTLINE
Translating SQL Queries into Relational Algebra
Basic Algorithms for Executing Query Operations
Using Heuristic in Query Optimization
Using Selectivity and Cost Estimates in Query Optimization
Semantic Query Optimization
Compiled by: Temesgen Tilahun 2
INTRODUCTION
Query processing requires that the DBMS identify and execute a strategy for
retrieving the results of the query. Query optimization is necessary to determine
the optimal alternative to process a query.
There are two main techniques for query optimization.
The first approach is to use a rule based or heuristic method for ordering the
operations in a query execution strategy.
The second approach estimates the cost of different execution strategies and
chooses the best solution.
In general most commercial database systems use a combination of both
techniques.
Compiled by: Temesgen Tilahun 3
INTRODUCTION
In this chapter we discuss the techniques used internally by a DBMS to
process, optimize, and execute high-level queries.
A query expressed in a high-level query language such as SQL must first
be scanned, parsed, and validated.
The scanner identifies the query tokens such as SQL keywords, attribute
names, and relation names that appear in the text of the query.
The parser checks the query syntax to determine whether it is formulated
according to the syntax rules (rules of grammar) of the query language.
Compiled by: Temesgen Tilahun 4
INTRODUCTION
The query must also be validated by checking that all attribute and
relation names are valid and semantically meaningful names in the
schema of the particular database being queried.
An internal representation of the query is then created, usually as a tree
data structure called a query tree.
It is also possible to represent the query using a graph data structure
called a query graph.
The DBMS must then formulate an execution strategy or query plan for
retrieving the results of the query from the database files.
Compiled by: Temesgen Tilahun 5
Translating SQL Queries into Relational Algebra
In practice, SQL is the query language that is used in most commercial
RDBMSs.
SQL query is first translated into an equivalent extended relational algebra
expression represented as a query tree data structure that is then
optimized.
Typically, SQL queries are decomposed into query blocks, which form the
basic units that can be translated into the algebraic operators and
optimized.
A query block contains a single SELECT – FROM – WHERE expression, as
well as GROUP BY and HAVING clauses if these are part of the block.
Compiled by: Temesgen Tilahun 6
Query Processing
Query Processing: It is a procedure of converting a query written in high level language (Eg. SQL)
into a correct and efficient execution plan expressed in low level language, which is used for
data manipulation.
Query Processing is the activity performed in extracting data from the database.
Before retrieving, updating or deleting data in database, a query goes through a series of query
compilation steps. These steps are known as execution plan.
Execution Plan: It is the basic algorithm used for each operation in the query.
Execution Plan: Query processing is a stepwise process.
Query Processor : Query processor is responsible for generating execution plan.
In query processing, the first phase is transformation in which parser first checks the syntax of
query and also checks the relations and attributes used in the query that are defined in the
database.
Compiled by: Temesgen Tilahun 7
Query Processing
After checking the syntax and verifying the relations, query is transformed into
equivalent expression that are more efficient to execute.
Transformation, depends upon various factors like existence of certain database
structures, presence of different indexes, file is sorted or not, cost of
transformation, physical characteristics of data, etc.
After transformation of query, transformed query is evaluated by using number of
strategies known as access plans.
The next step is to validate the user privileges and ensure that the query does not
disobey the relevant integrity constraints.
Finally, execution plan is executed to generate the result.
Compiled by: Temesgen Tilahun 8
General Strategy for Query Processing
(i) Representation of query: Query written by user cannot be processed directly
by the system.
Query processor first checks the syntax and existence of relations and their
attributes in database.
After validations, query processor transform it into equivalent and more efficient
expression.
For example, query will be converted into a standard internal format that parser
can manipulate.
Internal form may be relational algebra, relational calculus, any low-level
language, operator graphs, etc.
Compiled by: Temesgen Tilahun 9
General Strategy for Query Processing
(ii) Operator Graphs: Operator graphs are used to represent query.
It gives the sequence of operations that can be performed.
It is easy to understand the query represented by operator graphs.
It is useful to determine redundancy in query expressions, result of transformation,
simplify the view etc.
(iii) Response time and Data characteristics consideration: Data characteristics
like length of records, expected sizes of both intermediate and final results, size of
relations etc., are also considered for optimizing the query.
In addition to this overall response time is also determined.
Compiled by: Temesgen Tilahun 10
Steps in Query Processing
I. Syntax Analysis III. Query Optimization
II. Query Decomposition IV. Execution Plan
A. Query Analysis A. Left-deep tree
B. Query Normalization B. Right-deep tree
C. Semantic Analyzer C. Linear tree
D. Query Simplifier D. Bushy
E. Query Restructuring V. Query Code Generator
VI. Runtime Database Processor
Compiled by: Temesgen Tilahun 11
Steps in Query Processing
Compiled by: Temesgen Tilahun 12
Steps in Query Processing
(i) Syntax Analysis: Query in high-level language is parsed
into tokens and tokens are analyzed for any syntax error.
Order of tokens are also maintained to make sure that all the
rules of language grammars are followed.
In case of any error, query is rejected and an error code with
explanation for rejection is returned to the user.
(Only syntax is checked in this step).
Compiled by: Temesgen Tilahun 13
Steps in Query Processing
(ii) Query Decomposition: In this step, query is decomposed into query
blocks which are the low-level operations.
It starts with the high-level query that is transformed into low level
operations & checks whether that query is syntactically & semantically
correct.
For example, SQL Query is decomposed into blocks like Select block,
From block, Where block, etc.
Compiled by: Temesgen Tilahun 14
Steps in Query Decompositions
(A) Query Analysis: In this stage, programming language compiler
checks that the query is lexically and syntactically correct.
A syntactically correct query is analyzed using system catalogues to
verify the existence of relations and attributes used in query.
After analysis a correct query is converted into some internal
representation, which is more efficient for processing.
The type specification of the query qualifier and result is also checked at
this stage.
The internal representation may be, query tree or query graph.
Compiled by: Temesgen Tilahun 15
Query Tree Notation
The internal representation may be query tree or query graph.
Query tree notation: A typical internal representation of query is query tree.
It is also known as relational algebra tree.
A query tree is constructed using tree data structure that corresponds to the relational
algebra expression.
Main Components of Query Tree:
⧫ Root of tree – represents result of query.
⧫ Internal nodes – represent intermediate relation that is the output of applying an
operation in the algebra.
⧫ Leaf nodes – represent input relations of the query.
Compiled by: Temesgen Tilahun 16
Query Tree Notation
The sequence of operations is
Directed from leaves to the root node.
Example of Query Tree Notation:
Compiled by: Temesgen Tilahun 17
Query Graph Notation
Query Graph Notation: Graph data structure is also used for internal
representation of query.
Main Components of Graphs Notations:
⧫ Relation nodes – represent relations by single circle.
⧫ Constant nodes – represent constant values by double circle.
⧫ Edges – represent relation and join conditions.
⧫ Square brackets – represent attributes retrieved from each relation.
Compiled by: Temesgen Tilahun 18
Query Graph Notation
Figure of Query Graph Notation:
Compiled by: Temesgen Tilahun 19
Steps in Query Decompositions
(B) Query Normalization: After query analysis, it is normalized to
remove any redundancy.
In this phase, query is converted into normalized form that can be
easily manipulated.
A set of equivalency rules are applied to query and simplify the
projection and selection operations to avoid redundancy.
Query can be converted into one of the following two normal forms:
Conjunctive normal form and Disjunctive normal forms
Compiled by: Temesgen Tilahun 20
Steps in Query Decompositions
(1) Conjunctive normal form: It is a sequence of conjuncts that are connected with ‘AND’
operator.
A conjunct consists of one or more terms connected with ‘OR’ operator.
A conjunctive selection consists only those tuples that satisfy all conjuncts.
Example: (Emp_Job = ‘Analyst’ ∨ salary < 50000) ∧ (Hire_Date > 1–1–2000)
(2) Disjunctive normal forms: It is a sequence of disjuncts that are connected with ‘OR’ operator.
A disjunct consists of one or more terms connected with ‘AND’ operator.
A disjunctive selection contains those tuples that satisfy anyone of the disjunct.
Example: (Emp_Job= ‘Analyst’ ∧ salary < 5000) ∨ (Hire_Date > 1–1–2000).
Disjunctive normal form is more useful as it allows the query to break into a series of
independent sub-queries linked by union.
Compiled by: Temesgen Tilahun 21
Steps in Query Decompositions
(C) Semantic Analyzer: It performs the following tasks :
It helps in reducing the number of predicates & rejects contradictory
normalized forms.
In case of missing join specification, components of query don’t
contribute to generation of results.
It identifies these queries and rejects them.
It makes sure that each object in query is referenced correctly
according to its data type.
Compiled by: Temesgen Tilahun 22
Steps in Query Decompositions
(D) Query Simplifier: The major tasks of query simplifier are as follows:
It eliminates common sub-expressions and redundant qualification.
It introduces integrity constraints, view definitions into the query graph representation.
It eliminates query that voids any integrity constraint without accessing the database.
It transforms sub-graphs into semantically equivalent and more efficient form.
It deals with user access rights.
(E) Query Restructuring: At the final stage of query decomposition, transformation
rules are applied to restructure the query to give a more efficient implementation.
Compiled by: Temesgen Tilahun 23
Steps in Query Processing
(iii) Query Optimization: The aim of the query optimization step is to
choose the best possible query execution plan with minimum
resources required to execute that plan.
Query Optimization means converting a query into an equivalent
form which is more efficient to execute.
Compiled by: Temesgen Tilahun 24
Query Optimization Process
Compiled by: Temesgen Tilahun 25
Steps in Query Processing
(iv) Execution Plan: It is the basic algorithm used for each
operation in the query.
Execution plans are classified into following four types:
(Check Page 6 of the PDF file)
1) Left-deep tree query execution plan
2) Right-deep tree query execution plan
3) Linear tree execution plan and
4) Bushy execution plan
Compiled by: Temesgen Tilahun 26
Steps in Query Processing
(v) Query Code Generator: After selecting the best possible
execution plan query is converted into low-level language so
that it can be taken as input by runtime database process.
(vi) Runtime Database Processor: It deals directly with main
the database and do the necessary operation mentioned in
query and returns the result to user.
Compiled by: Temesgen Tilahun 27
Query Optimization
The main issues that need to be considered in query optimization are:
Reduction of data transfer with database.
Use of available indexes for fast searching.
Reduction of number of times the database is manipulated.
The order in which joins should be performed.
How to store intermediate results?
There are two main techniques used to implement query optimization.
➔ These are heuristic query optimization and cost based query optimization.
Compiled by: Temesgen Tilahun 28
Transformation Rules for Relational Algebra
The transformation rules are used to formulate a relational algebra
expression into different ways and query optimizer choose the most
efficient equivalent expression to execute.
Two expressions are considered to be equivalent if they have same
set of attributes in different order but representing the same
information.
Compiled by: Temesgen Tilahun 29
Heuristic Query Optimization
Heuristic query optimization technique is used to modify the internal
representation of a query by using heuristic rules and transformation rules.
Heuristic rules are used in the form of a query tree or query graph
structure.
Optimizer starts with initial query tree and transform it into an equivalent
and efficient query tree using transformation rules.
Heuristic Optimization Algorithm: DBMS use heuristic optimization
algorithms to improve the efficiency of query by converting initial query
tree into an equivalent and optimized query tree.
Compiled by: Temesgen Tilahun 30
Following are the steps of heuristic optimization algorithm
Step 1. Perform Selection operation as early as possible.
Step 2. Perform commutativity of selection operation with other operations
as early as possible.
Step 3. Combine the Cartesian Product with subsequent selection
operation whose predicates represents a join condition into a JOIN
operation.
Step 4. Use Commutativity and Associativity of Binary operations.
Step 5. Perform projection operations as early as possible.
Step 6. Compute common expressions only once.
Compiled by: Temesgen Tilahun 31
Cost based Query Optimization
In cost based query optimization, optimizer estimates the cost of running of all
alternatives of a query and choose the optimum/best alternative.
The alternative which uses the minimum resources is having minimum cost.
The cost of a query operation is mainly depend on its selectivity i.e., the proportion
of the input relations that forms the output.
The main components used to determine the cost of execution of a query:
Access cost of secondary storage, Storage cost, Computation cost, Memory
usage cost and Communication cost.
Compiled by: Temesgen Tilahun 32
Cost based Query Optimization
(a) Access cost of Secondary Storage: Consists of cost of database
manipulation operations which includes searching, writing, reading of
data blocks stored in the secondary memory.
The cost of searching depends upon the type of indexes (primary,
secondary, hashed), type of file structure and ordering of relation.
(b) Storage cost: Storage cost consists of cost of storing intermediate
results (tables or files) that are generated by the execution strategy for the
query.
Compiled by: Temesgen Tilahun 33
Cost based Query Optimization
(c) Computation cost: Consists of performing in-memory operations
during query execution such as sorting of records in a file, merging of
records, performing computations on field values, searching of records.
➢ These are mainly performed on data buffers.
(d) Memory usage cost: It consists of cost of relating to the number of
memory buffers needed during query execution.
(e) Communication cost: It consists of the cost of transferring query and
its result from database location to the location of terminal where the
query is originated.
Compiled by: Temesgen Tilahun 34
Cost based Query Optimization
From all the above components, the most important is access cost to secondary
storage because secondary storage is comparatively slower than other devices.
Optimizer try to minimize computation cost for small databases as most of the
data files are stored in main memory.
For large database, it try to minimize the access cost to secondary storage and
For distributed databases, it tries to minimize the communication cost because
various sites are involved for data transfer.
To estimate the cost of execution strategies, optimizer access statistical data
stored in DBMS catalog.
Compiled by: Temesgen Tilahun 35
The Information Stored in DBMS catalog is given below:
A. Number of records in relation X, given as R.
B. Number of blocks required to store relation X, given as B.
C. Blocking factor of relation X, given as BFR.
D. Primary access method for each file and attributes for each file.
E. Number of levels for each multi-level index for an attribute A given as IA.
F. Number of first-level index blocks for an attribute A, given as BA I 1 .
G. Selection cardinality of attribute A in relation R, given as SA, where SA = R ×
SLA , where SLA is the selectivity of the attributes.
Compiled by: Temesgen Tilahun 36
Cost Function for Selection Operation
Cost Function for Selection Operation: Selection operation
works on a single relation in relation algebra and retrieves the
records that satisfy the given condition.
Depending upon the structure of file, available indexes,
searching methods, the estimated cost of strategies for
selection operation is as given below.
Compiled by: Temesgen Tilahun 37
The Estimated Cost of Strategies for Selection Operation is as given below:
Compiled by: Temesgen Tilahun 38
Example of Employee Table
Consider, there are 10,000 records stored in 2000 blocks.
➢ Also the following indexes are available.
A secondary index on the Emp-ID with 4 levels.
A clustering index on salary with 3 levels and average selection
cardinality of 30.
A secondary index on non-key attributes Age with 2 levels and
4 first level index blocks.
There are 200 distinct values for Age.
Compiled by: Temesgen Tilahun 39
EXAMPLE – 1 CONT’D . . .
Compiled by: Temesgen Tilahun 40
EXAMPLE – 1 CONT’D . . .
(iii) This query has a conjunctive selection condition: To estimate the
cost of use using anyone of the two components of selection condition,
to retrieve the records plus the linear search.
The linear search costs 2000 and condition salary > 9000 first gives
cost estimate of Isalary + (B/2) = 2 + (2000/2) = 2 + 1000 = 1002 and
Cost of condition Age = 20 is 30 + 2 = 32
σAge = 20 AND salary > 9000 (Employee) = total cost is 32 + 1002 = 1034
Compiled by: Temesgen Tilahun 41
Cost Function for Join Operation
Cost Function for Join Operation: Join operation is the
most time consuming operation in databases and
An accurate cost function for join operations depends
upon the estimate of size of file (number of records) after
the join operation.
Compiled by: Temesgen Tilahun 42
The Estimated Cost of Strategies for Join Operation is given below:
Compiled by: Temesgen Tilahun 43
Cost Function for Join Operation : EXAMPLE
Example: Consider we have 600 records in department table.
BFR for department table is 60 and number of blocks are 600/60 = 10
For the Join Operation,
Type of JOINS COST
Block nested-loop joins 22000 (Buffer has only one block)
2010 (if all blocks of employee can be
read into database buffer)
Hash Join 6010 (is hash index is held in memory)
Compiled by: Temesgen Tilahun 44
Cost Function for Join Operation : EXAMPLE
Block nested-loop joins 22000 (Buffer has only one block)
B(R) => B(EMPLOYEE) = 2000
B(S) => B(DEPARTMENT) = 10
B(R) + [B(R) * B(S)] = 2000 + [2000 X 10] = 2000 + 20000 = 22000
Compiled by: Temesgen Tilahun 45
Cost Function for Join Operation : EXAMPLE
Block nested-loop joins 2010 (if all blocks of employee can be
read into database buffer)
B(R) => B(EMPLOYEE) = 2000
B(S) => B(DEPARTMENT) = 10
B(R) + B(S) = 2000 + 10 = 2010
Compiled by: Temesgen Tilahun 46
Cost Function for Join Operation : EXAMPLE
Hash Join 6010 (is hash index is held in memory)
Hash Join : If hash index is held in memory => (3 * B(R)) + B(S)
B(R) => B(EMPLOYEE) = 2000
B(S) => B(DEPARTMENT) = 10
(3 * B(R)) + B(S) = 3 X 2000 + 10 = 6000 + 10 = 6010
Compiled by: Temesgen Tilahun 47
Join Operation : EXAMPLE
Given two unary relation (contains only one attribute), R1 and R2
R1 = 7, 2, 9, 8, 3, 9, 1, 3, 6
R2 = 8, 4, 2, 1, 3, 2, 7, 3
Compiled by: Temesgen Tilahun 48
Thank You !
?
Compiled by: Temesgen Tilahun 49