Evaluation of relational algebra expressions
Materialized evaluation − Evaluate one operation at a time. Evaluate
the expression in a bottom-up manner and stores intermediate results to
temporary files.
Store the result of A ⋈ B in a temporary file.
Store the result of C ⋈ D in a temporary file.
Finally, join the results stored in temporary files.
The overall cost=sum of costs of individual operations + cost of writing
intermediate results to disk, cost of writing results to results to temporary
files and reading them back is quite high.
Pipelined evaluation − Evaluate several operations simultaneously.
Result of one operation is passed to the next operation. Evaluate the
expression in a bottom-up manner and don’t store intermediate results to
temporary files.
Don’t store the result of A ⋈ B in a temporary file. Instead the result is
passed directly for projection with C and so on.
Query Equivalence
Query: A query is a request for information from a database.
Query Plans: A query plan (or query execution plan) is an ordered set of
steps used to access data in a SQL relational database management
system.
Query Optimization: A single query can be executed through different
algorithms or re-written in different forms and structures. Hence, the
question of query optimization comes into the picture – Which of these
forms or pathways is the most optimal? The query optimizer attempts to
determine the most efficient way to execute a given query by
considering the possible query plans.
Importance: The goal of query optimization is to reduce the
system resources required to fulfill a query, and ultimately provide
the user with the correct result set faster.
First, it provides the user with faster results, which makes the
application seem faster to the user.
Secondly, it allows the system to service more queries in
the same amount of time, because each request takes less
time than unoptimized queries.
Thirdly, query optimization ultimately reduces the amount of
wear on the hardware (e.g. disk drives), and allows the
server to run more efficiently (e.g. lower power
consumption, less memory usage).
There are broadly two ways a query can be optimized:
1. Analyze and transform equivalent relational expressions:
Try to minimize the tuple and column counts of the intermediate
and final query processes (discussed here).
2. Using different algorithms for each operation: These
underlying algorithms determine how tuples are accessed from
the data structures they are stored in, indexing, hashing, data
retrieval and hence influence the number of disk and block
accesses (discussed in query processing).
Analyze and transform equivalent relational expressions.
Here, we shall talk about generating minimal equivalent expressions. To
analyze equivalent expression, listed are a set of equivalence rules.
These generate equivalent expressions for a query written in relational
algebra. To optimize a query, we must convert the query into its
equivalent form as long as an equivalence rule is satisfied.
Transformation of Relational Expressions
• Two relational algebra expressions are said to be equivalent if the two
expressions generate the same set of tuples on every legal database
instance
Note: order of tuples is irrelevant
we don’t care if they generate different results on databases that violate
integrity constraints
• In SQL, inputs and outputs are multisets of tuples
Two expressions in the multiset version of the relational algebra are said
to be equivalent if the two expressions generate the same multiset of
tuples on every legal database instance.
• An equivalence rule says that expressions of two forms are equivalent
Can replace expression of first form by second, or vice versa
1. Conjunctive selection operations can be written as a
sequence of individual selections. This is called a sigma-
cascade.
σθ 1∧θ 2 ( E)=σ θ 1 (σθ 2 (E))
Explanation: Applying condition intersection is expensive.
Instead, filter out tuples satisfying condition (inner selection)
and then apply condition (outer selection) to the then
resulting fewer tuples. This leaves us with less tuples to process
the second time. This can be extended for two or more
intersecting selections. Since we are breaking a single condition
into a series of selections or cascades, it is called a “cascade”.
2. Selection is commutative.
σθ 1 (σθ 2 ( E))=σθ 2 (σθ 1 ( E))
Explanation: condition is commutative in nature. This means,
it does not matter whether we apply first or first. In
practice, it is better and more optimal to apply that selection
first which yields a fewer number of tuples. This saves time on
our outer selection.
3. All following projections can be omitted, only the first
projection is required. This is called a pi-cascade.
Π L1 ( Π L 2 (…( Π Ln( E))…))=Π L 1 (E)
Explanation: A cascade or a series of projections is
meaningless. This is because in the end, we are only selecting
those columns which are specified in the last, or the outermost
projection. Hence, it is better to collapse all the projections into
just one i.e. the outermost projection.
4. Selections on Cartesian Products can be re-written as
Theta Joins.
Equivalence 1
σθ (E1X E2 ) = E1 θ E2
Explanation: The cross product operation is known to
be very expensive. This is because it matches each
tuple of E1 (total m tuples) with each tuple of E2 (total
n tuples). This yields m*n entries. If we apply a
selection operation after that, we would have to scan
through m*n entries to find the suitable tuples which
satisfy the condition . Instead of doing all of this, it is
more optimal to use the Theta Join, a join specifically
designed to select only those entries in the cross
product which satisfy the Theta condition, without
evaluating the entire cross product first.
Equivalence 2
σθ1 (E1 θ2 E2 ) = E1 θ1∧ θ2E2
Explanation: Theta Join radically decreases the
number of resulting tuples, so if we apply an
intersection of both the join conditions i.e. and
into the Theta Join itself, we get fewer scans to do. On
the other hand, a condition outside unnecessarily
increases the tuples to scan.
5. Theta Joins are commutative.
E1 θ E2 = E2 θ E1
Explanation: Theta Joins are commutative, and the query
processing time depends to some extent which table is used as
the outer loop and which one is used as the inner loop during
the join process (based on the indexing structures and blocks).
6. Join operations are associative.
Natural Join
Explanation: Joins are all commutative as well as
associative, so one must join those two tables first
which yield less number of entries, and then apply the
other join.
Theta Join
Explanation: Theta Joins are associative in the above
manner, where involves attributes from only E2 and
E3.
7. Selection operation can be distributed.
Equivalence 1
Explanation: Applying a selection after doing the
Theta Join causes all the tuples returned by the Theta
Join to be monitored after the join. If this selection
contains attributes from only E1, it is better to apply
this selection to E1 (hence resulting in a fewer number
of tuples) and then join it with E2.
Equivalence 2
Explanation: This can be extended to two selection
conditions, and , where Theta1 contains the
attributes of only E1 and contains attributes of only
E2. Hence, we can individually apply the selection
criteria before joining, to drastically reduce the number
of tuples joined.
8. Projection distributes over the Theta Join.
Equivalence 1
Explanation: The idea discussed for selection can be
used for projection as well. Here, if L1 is a projection
that involves columns of only E1, and L2 another
projection that involves the columns of only E2, then it
is better to individually apply the projections on both
the tables before joining. This leaves us with a fewer
number of columns on either side, hence contributing
to an easier join.
Equivalence 2
Explanation: Here, when applying projections L1 and
L2 on the join, where L1 contains columns of only E1
and L2 contains columns of only E2, we can introduce
another column E3 (which is common between both the
tables). Then, we can apply projections L1 and L2 on E1
and E2 respectively, along with the added column L3.
L3 enables us to do the join.
9. Union and Intersection are commutative.
Explanation: Union and intersection are both distributive; we
can enclose any tables in parentheses according to requirement
and ease of access.
10. Union and Intersection are associative.
Explanation: Union and intersection are both distributive; we
can enclose any tables in parentheses according to requirement
and ease of access.
11. Selection operation distributes over the union,
intersection, and difference operations.
Explanation: In set difference, we know that only those tuples
are shown which belong to table E1 and do not belong to table
E2. So, applying a selection condition on the entire set
difference is equivalent to applying the selection condition on
the individual tables and then applying set difference. This will
reduce the number of comparisons in the set difference step.