0% found this document useful (0 votes)

10 views26 pages

Distributed Query Optimization Techniques

The document discusses distributed query optimization in the context of distributed database management systems, highlighting the importance of query decomposition, data localization, and optimization strategies. It details the elements of the optimization process, including search space, cost models, and join ordering, while emphasizing the challenges specific to distributed settings. The use of semijoins and bit arrays for reducing communication costs during joins is also explored, alongside the roles of master and apprentice sites in executing distributed queries.

Uploaded by

Ganga Bhattacharjee

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views26 pages

Distributed Query Optimization Techniques

Uploaded by

Ganga Bhattacharjee

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Distributed query optimization

Data Management for Big Data

2018-2019 (spring semester)

Dario Della Monica

These slides are a modified version of the slides provided with the book
Özsu and Valduriez, Principles of Distributed Database Systems (3rd Ed.), 2011
The original version of the slides is available at: [Link]

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/1

Outline (distributed DB)
• Introduction (Ch. 1) ⋆

• Distributed Database Design (Ch. 3) ⋆

• Distributed Query Processing (Ch. 6-8) ⋆

➡ Overview (Ch. 6) ⋆

➡ Query decomposition and data localization (Ch. 7) ⋆

➡ Distributed query optimization (Ch. 8) ⋆

• Distributed Transaction Management (Ch. 10-12) ⋆

⋆ Özsu and Valduriez, Principles of Distributed Database Systems (3rd Ed.), 2011

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/2

Outline (today)
• Distributed query optimization (Ch. 8) ⋆
➡ Overview
➡ Join Ordering in Localized Queries
➡ Semijoin-based Algorithm
➡ Distributed query optimization strategies
➡ Hybrid approaches

⋆ Özsu and Valduriez, Principles of Distributed Database Systems (3rd Ed.), 2011

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/3

Distributed Query Optimization
• In previous chapter (Ch. 7) :
⋆

➡ A distributed query is mapped into a query over fragments (decomposition and data localization)
➡ Reduction (“optimization”) independent from relation (fragment) statistics (e.g., cardinality)

• In this chapter (Ch. 8) ⋆:

➡ Optimization based on DB statistics (order of operations and operands, algorithm to perform simple
operations) to produce a query execution plan (QEP)
✦ In the distributed case a QEP is further extended with communication operations to support execution
of queries over fragment sites
➡ Once again: the problem is NP-hard, so not looking for the optimal solution
➡ Statement of the problem
✦ Input: Fragment query
✦ Output: the ”best” global strategy

➡ Additional problems specific to the distributed setting

✦ Where to execute (partial) queries? Which relation to ship where?
✦ Choose between data transfer methods : ship-whole vs. fetch-as-needed
✦ Decide on the use of semijoins (semijoins save on communication at the expense of more local
processing)

⋆ Özsu and Valduriez, Principles of Distributed Database Systems (3rd Ed.), 2011

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/4

Elements of the Optimizer
• The element of the optimization process are similar in distributed and
centralized cases
➡ Search space (aka solution space)
✦ The set of equivalent QEP: algebra expressions enriched with implementation details
and communication choices
➡ Cost model
✦ Cost function (in terms of time)
✓ I/O cost + CPU cost + communication cost
✓ In early approach only communication costs were considered; due to fast communication
technology, communication and I/O costs become comparable
✓ These might have different weights in different distributed environments (LAN vs WAN)
➡ Search algorithm (aka search strategy)
✦ How do we move inside the solution space?
✓ Exhaustive search, heuristic algorithms
✦ Goal is searching the solution space to find a good strategy according to the cost model
• Difference between centralized and distributed settings: search space and cost
model (search strategy remains the same)

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/5

Search Space
• Search space is large
➡ N relations ((2(N-1))!)/((N-1)!) ⋆ equivalent join trees (by join commutativity
and associativity)
➡ Larger search space due to more options

• QEP are decorated with more information (on data exchange)

• Focus on join and semijoin order
• Different candidate solution in the search space
➡ A good heuristics for centralized context: left-deep trees

➡ In distributed context: non left-deep trees allow for parallelization

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/6

Centralized vs. Distributed Query
Optimization
• Relation between centralized and distributed query optimization
➡ Distributed query optimization (DQO) employs techniques and solutions
from the centralized context
✦ A distributed query is translated into local ones (localized queries): centralized
query optimization (CQO) techniques
✦ Distributed query optimization is a more general (and thus difficult) problem
✓ Most solution to DQO extend solutions to CQO

➡ We focus on communication costs (local CPU and I/O costs are ignored)

✦ Clearly, cost of localized queries (handled with CQO techniques) is computed as

in the centralized case (mainly I/O costs)

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/7

Join Ordering in the Distributed
Context
• Join ordering is important in centralized query optimization

• It is even more in distributed query optimization (affect communication costs)

• Use of semijoins to reduce relation sizes (and thus communication costs) before
performing join operations

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/8

Join Ordering – 2 relations
• We assume query to be already localized (i.e., on fragments)
➡ Fragments are relations entirely stored at a single site
✦ We often use “fragments” and “relations” indistinguishably (no technical reason to
distinguish them)
• We first focus on ordering issues without using
semijoins
➡ Consider 2-relation join: R ⋈ S
(where R and S are stored at different sites)
if size(R) < size(S)
✦ Move the smaller relation to the site of the larger one R S
✦ If size(R) and size(S) are (more or less) the same if size(R) > size(S)
(and not other factor comes into play),
then moving outer relation R has benefits:
✓ No need for storing R in nested-loop or block nested-loop join
algorithms
✓ indexed nested-loop join algorithm remains available as index on
inner relation S is preserved (index is lost when transfering S)

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/9

Join Ordering – Multiple
Relations
• Multiple relations case: more difficult because too many alternatives
• Goal is still transmit small operands (relations)
➡ Compute the cost of all alternatives and select the best one
✦ Necessary to compute the size of intermediate relations which is difficult
✓ In distributed context it is even more because information may be not available on site

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/10

Join Ordering – Example
Consider PROJ ⋈PNO ASG ⋈ENO EMP Site 2
ASG
ENO PNO
Execution alternatives:
EMP PROJ
1. EMP→ Site 2 Site 1 Site 3
Site 2 computes EMP'=EMP ⋈ ASG
Join graph of distributed query
EMP'→ Site 3
Site 3 computes EMP' ⋈ PROJ
2. ASG → Site 1 4. PROJ → Site 2
Site 1 computes EMP'=EMP⋈ ASG Site 2 computes PROJ'=PROJ ⋈ ASG
EMP' → Site 3 PROJ' → Site 1
Site 3 computes EMP’ ⋈ PROJ Site 1 computes PROJ' ⋈ EMP
3. ASG → Site 3 5. EMP → Site 2
Site 3 computes ASG'=ASG ⋈ PROJ PROJ → Site 2
ASG' → Site 1 Site 2 computes EMP ⋈ PROJ ⋈ ASG
Site 1 computes ASG' ▷◁EMP

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/11

Semijoin Algorithms
• Semijoins can be used to reduce the sizes of operands to transfer (similar to what
selections do)
➡ Reduced communication costs

• Consider the join of two relations:

➡ R (at site 1)
➡ S (at site 2)
• Alternatives:
1. Do the join R ⋈AS
2. Perform one of the semijoin-based equivalent options
Tradeoff between
R ⋈ AS ⇔ (R ⋉AS) ⋈AS
a) cost to compute and send semijoin to other
⇔ R ⋈A (S ⋉A R) site (and then perform the join there)
b) Cost to send the whole relation to other
⇔ (R ⋉A S) ⋈A (S ⋉A R) site (and then perform the join there)

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/12

Semijoin Algorithms – Example
• Perform the join
➡ Send R to Site 2
➡ Site 2 computes R ⋈A S
• Consider semijoin (R ⋉AS) ⋈AS
➡ S' = ΠA(S)
➡ S' → Site 1
➡ Site 1 computes R' = R ⋉AS'
➡ R'→ Site 2
➡ Site 2 computes R' ⋈AS
• Semijoin is better if
size(ΠA(S)) + size(R ⋉AS)) < size(R)
➡ Only communication costs (time to transfer relations)

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/13

Semijoin Algorithms – Sum up
• Using semijoin is convenient if R ⋉AS has high selectivity (select few tuples) and/or size
of R is large
• It is bad otherwise, due to the additional transfer of ΠA(S)
• Cost of transferring ΠA(S) can be reduced by using bit arrays
• A disadvantage of using semijoin is the loss of indices

Bit arrays
• Let h be a hash function that distributes possible values for A into n buckets:

h : Dom(A) { 0, …, n-1 }
• Bit array BA[0 .. n-1] over relation S is defined as:
BA[i] = 1 iff ∃ value v for attribute A in S s.t. h(v) = i
• Transfer BA (n bits) rather than ΠA(S)
• A tuple of R with value v for attribute A belongs to R’ iff BA[h(v)] = 1
• R’ is an (over-)approximation of R ⋉AS
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/14
Bit Arrays for Seminoins
R S • Recall:
idR A idS A o BA[i] = 1 iff ∃ value v for attribute A in S s.t. h(v) = i
o a tuple of R with value v for A belongs to R’ iff BA[h(v)] = 1
1 1 1 5
2 2 2 5
3 2 3 3
• h(x) = x mod 4
4 5 4 5 • n=4 (4 buckets)
5 4 5 3 • h(1) = h(5) = 1
6 5 • BA[0] = 0 (no value v occurs in S.A s.t. h(v) = 0)
7 4
8 5
• BA[1] = 1 (due to occurrence of 5 for attribute A in S)
• BA[2] = 0 (no value v occurs in S.A s.t. h(v) = 2)
R’ ⊋ R ⋉A S • BA[3] = 1 (due to occurrence of 3 for attribute A in S)

idR A idS A R’ contains tuple <1,1> that does not

1 1 4 5 belong to R ⋉A S
4 5 6 5 However, R’ is a good approximation
6 5 8 5 because h has only one conflict (h(1) =
8 5 h(5)) among values for attribute A in R
and S
R’ : R ⋉AS computed
with bit array

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/15

Semijoins for Joins among
Multiple Relations
• Semijoins to optimize joins among more than 2 operands
EMP ⋈ ASG ⋈ PROJ = EMP’ ⋈ ASG’ ⋈ PROJ
where EMP’ = EMP ⋉ ASG
and ASG’ = ASG ⋉ PROJ
• Each operand can be further reduced using more than one semijoin in cascade
EMP’’ = EMP ⋈ (ASG ⋈ PROJ)
We have size(ASG ⋈ PROJ) <= size(ASG) Semijoin
program
Therefore size(EMP’’) <= size(EMP’)

• Full reducer for a relation is the semijoin program that reduces the relation the most
• Finding full reducer for a relation with exhaustive brute force approach
➡ For cyclic queries full reducer cannot be found
✦ Solution: break the cycle
➡ With other queries: inefficient (NP-hard)
✦ Solution: only use semijoin when problem is simple
✓ e.g., for chained queries, where relations are in sequence and each one joins with the next one

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/16

Distributed Query Optimization
• We focus on optimization of joins
• The algorithm for optimizing a join is adapted from the one for the centralized
case
• In distributed context
➡ There is a coordinator (master site) where query is initiated

➡ Coordinator chooses
1. execution site and
2. transfer method
➡ Apprentice sites (where fragments are stored and queries are executed)
✦ Apprentices behave as in the case of centralized query optimization in optimizing
localized queries (over fragments) assigned to them
✓ Choose best join ordering, join algorithm, and access method for relations

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/17

Choices of the Master Site
1. Choice of the execution sites
➡ E.g., R ⋈ S can be executed:
✦ at the site where R is stored
✦ at the site where S is stored
✦ at a third site (e.g., where a 3rd relation waits to be joined – allows for parallel transfer)
2. Transfer method
➡ ship-whole: relation is transferred to the join execution site entirely
✦ In some cases (e.g., for outer relations of in case of merge join) there is no need to store the relation:
join as it arrives, in pipelined mode
➡ fetch-as-needed (only needed tuples are transferred, i.e., tuples selected by the join):
✦ equivalent to perform semijoin of one relation with tuple of the other one (to reduce size of the
former) before executing the join
✦ e.g., semi-join of inner relation wrt outer one (only needed tuples of inner relation are transferred)
✓ tuples of the outer relation are sent (only the join attribute) to the site of the inner relation
✓ matching tuples of the inner relation are sent to the site of the external relation to execute the join

Choices of the master produce 4 strategies (not all combinations are worth being considered)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/18
Strategy 1 – ship-whole/inner site
1. ship-whole/site of inner relation: move outer relation (R) to the site of the inner
relation (S)

(a) Retrieve outer tuples • CT(x): communication time to transfer x bytes

• LT(x): local processing time to perform op. x
(b) Send them to the inner relation site • s = card(S ⋉A R)/card(R): average number of
tuples of S that match a tuple of R
(c) Join them as they arrive

Total Cost = LT ( retrieve card(R) tuples from R )

+ CT ( size(R) )
+ LT ( retrieve s tuples from S ) * card(R)

Join is done as R comes because R is the outer relation

Strategy 2 – ship-whole/outer site
2. ship-whole/site of outer relation: move inner relation (S) to the site of outer
relation (R)
Cannot join as S arrives; it needs to be stored

Total cost = LT ( retrieve card( S ) tuples from S )

+ CT ( size(S) )
+ LT ( store card(S) tuples in temporary relation T)
+ LT ( retrieve card(R) tuples from R )
+ LT ( retrieve s tuples from T ) * card(R)
• CT(x): communication time to transfer x bytes
• LT(x): local processing time to perform op. x
• s = card(S ⋉A R)/card(R): average number of
tuples of S that match a tuple of R

Strategy 3 – fetch-as-
needed/outer site
3. fetch-as-needed/site of outer relation
(a) Retrieve tuples at outer relation (R) site
(b) For each tuple of R, send join attribute values to inner relation (S) site
(c) Retrieve matching inner tuples at inner relation site
(d) Send the matching inner tuples to outer relation site
(e) Join as they arrive
Total Cost = LT ( retrieve card( R ) tuples from R )
+ CT ( length ( A ) ) * card ( R )
+ LT ( retrieve s tuples from S ) * card ( R )
+ CT ( s * length ( S ) ) * card ( R )
• CT(x): communication time to transfer x bytes
• LT(x): local processing time to perform op. x
• s = card(S ⋉A R)/card(R): average number of
tuples of S that match a tuple of R

Strategy 4 – Move Both Relation
at Third Site
4. move both inner (S) and outer (R) relations to another site

Total cost = LT ( retrieve card ( S ) tuples from S )

+ CT ( size ( S ) )
+ LT ( store card(S) tuples in temporary relation T)
+ LT ( retrieve card ( R ) tuples from R )
+ CT ( size( R ) )
+ LT ( retrieve s tuples from T ) * card ( R )

• CT(x): communication time to transfer x bytes

• LT(x): local processing time to perform op. x Moving inner relation S first is
• s = card(S ⋉A R)/card(R): average number of better so we can then join as outer
tuples of S that match a tuple of R relation R arrives

Strategy comparison
PROJ ⋈PNO ASG
• PROJ (outer rel.) and ASG (inner rel.) are stored at different sites
• Index on PNO for relation ASG

1. Ship whole PROJ at site of ASG CT ( size(PROJ) )

2. Ship whole ASG at site of PROJ CT ( size(ASG) )
3. Fetch tuples of ASG as needed at site of PROJ CT ( length ( A ) ) * card ( PROJ )
+ CT ( s * length ( ASG ) ) * card (PROJ )
4. Move both ASG and PROJ to a third site CT ( size ( ASG ) ) + CT ( size ( PROJ ) )

• If there is no upper level operation then 4 is a bad choice

• If size ( PROJ ) >> size ( ASG ), then 2 is a good choice (if local processing time is not too
bad compared with 1 and 3 (1 and 3 can exploit index on ASG in their local processing)
• If PROJ is large/few tuples of ASG match, then 3 is better than 1
• Otherwise, 1 is better than 3
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/23
Hybrid approach
• So far, focus on static approaches, i.e., strategies (QEP, expressed as decorated
trees) are evaluated and compared at compile time
• Advantages: query optimization is done once and used for several query
executions
• Disadvantages: cost evaluation is not that accurate
➡ it is not always done on exact values but on estimations based on statistics
✦ e.g., size of intermediate results
➡ some parameter of a query might be known only at runtime
• Problems of static query optimization are much more severe in the distributed
context: more infomation variability at runtime
➡ Sites may become unavailable or overloaded
➡ Selection of site and fragment copy should be done at runtime to increase
availability and load balancing
• An hybrid solution (some decisions are taken at runtime) is implemented by
means of the CP (choose-plan) operator, which is resolved at runtime, when an
exact plan comparison can be done

The CP (choose-plan) Operator
SELECT *
FROM EMP, PAY
WHERE SALARY > $a
where $a is a variable whose value is specified by the user at runtime

CP
Normally, pushing σ
inside ⋈ is a good
heuristics, but it can be
⋈ σSALARY > $a bad if selection rate of
⋈ is higher than the
σSALARY > $a EMP ⋈ one of σ

PAY PAY EMP

2-Step Optimization
• 2-Step optimization: a simpler approach (more efficient, less exhaustive) than the
one based on CP operator; it reduces workload at runtime (no CP operator)
➡ At runtime labels are added about site and fragment copy selection only

1. At compile time, generate a

static plan with operation
ordering and access methods
only
2. At startup time, carry out site
and copy selection and
allocate operations to sites

• Site (and copy) selection is done in a greedy fashion

➡ best load balancing,
➡ best benefit (# of queries already executed at the site, possible saving of
communication costs as the site might have already data available)
Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/26

Distributed Query Optimization Techniques
No ratings yet
Distributed Query Optimization Techniques
26 pages
Distributed Database Query Optimization
No ratings yet
Distributed Database Query Optimization
106 pages
Distributed Database Management Systems
No ratings yet
Distributed Database Management Systems
53 pages
Layers of Query Processing in DDBMS
80% (10)
Layers of Query Processing in DDBMS
20 pages
Distributed Query Optimization Techniques
No ratings yet
Distributed Query Optimization Techniques
72 pages
Layers of Query Processing in DBMS
No ratings yet
Layers of Query Processing in DBMS
24 pages
Distributed Query Processing Overview
No ratings yet
Distributed Query Processing Overview
121 pages
7 ddb07
No ratings yet
7 ddb07
46 pages
Query
No ratings yet
Query
104 pages
Distributed Query Optimization Strategies
86% (7)
Distributed Query Optimization Strategies
48 pages
Distributed Database Query Processing
No ratings yet
Distributed Database Query Processing
23 pages
Query Processing in Distributed Database
No ratings yet
Query Processing in Distributed Database
20 pages
Distributed Query Processing Techniques
No ratings yet
Distributed Query Processing Techniques
37 pages
4 Query Processing
No ratings yet
4 Query Processing
73 pages
NKM
No ratings yet
NKM
6 pages
Distributed Query Processing Overview
No ratings yet
Distributed Query Processing Overview
8 pages
Distributed Database Management Systems
No ratings yet
Distributed Database Management Systems
14 pages
Optimizing Queries in Distributed Databases
No ratings yet
Optimizing Queries in Distributed Databases
4 pages
Steps in Distributed Query Processing
No ratings yet
Steps in Distributed Query Processing
13 pages
Query Optimization in Distributed Databases
No ratings yet
Query Optimization in Distributed Databases
42 pages
6-Query Intro
No ratings yet
6-Query Intro
15 pages
Distributed Query Processing Overview
No ratings yet
Distributed Query Processing Overview
31 pages
Distributed Query Optimization Guide
No ratings yet
Distributed Query Optimization Guide
44 pages
Principles of Distributed Database Systems: M. Tamer Özsu Patrick Valduriez
No ratings yet
Principles of Distributed Database Systems: M. Tamer Özsu Patrick Valduriez
73 pages
Outline: Distributed Query Processing
No ratings yet
Outline: Distributed Query Processing
8 pages
Understanding Distributed Transactions
No ratings yet
Understanding Distributed Transactions
17 pages
Join Query Optimization in Distributed Databases
No ratings yet
Join Query Optimization in Distributed Databases
3 pages
Query Decomposition in CSE 453
No ratings yet
Query Decomposition in CSE 453
72 pages
Query Optimization in Distributed Databases
No ratings yet
Query Optimization in Distributed Databases
10 pages
Distributed Database Query Processing
No ratings yet
Distributed Database Query Processing
79 pages
Query Optimization in DDBMS
No ratings yet
Query Optimization in DDBMS
7 pages
Distributed Database Systems Overview
No ratings yet
Distributed Database Systems Overview
73 pages
Multidatabase Query Processing Overview
No ratings yet
Multidatabase Query Processing Overview
41 pages
Distributed Database Query Processing
No ratings yet
Distributed Database Query Processing
72 pages
Query Processing in Distributed Databases
No ratings yet
Query Processing in Distributed Databases
48 pages
Query Optimization in Distributed Databases
No ratings yet
Query Optimization in Distributed Databases
7 pages
ADB - Unit - III (Chapter-2) - Query Processing and Decomposition
No ratings yet
ADB - Unit - III (Chapter-2) - Query Processing and Decomposition
42 pages
Survey of Distributed Query Optimization
No ratings yet
Survey of Distributed Query Optimization
10 pages
Overview of Distributed Query Processing
100% (1)
Overview of Distributed Query Processing
3 pages
Distributed Query Optimization Issues
No ratings yet
Distributed Query Optimization Issues
9 pages
Query Processing in Distributed Databases
No ratings yet
Query Processing in Distributed Databases
97 pages
C4 - Decomposing Global Queries Into Fragment Queries
No ratings yet
C4 - Decomposing Global Queries Into Fragment Queries
78 pages
Distributed Query Processing Overview
No ratings yet
Distributed Query Processing Overview
13 pages
1 2d Query Localization Ozsu ch7 SPLIT
No ratings yet
1 2d Query Localization Ozsu ch7 SPLIT
33 pages
Join Ordering in Distributed Databases
No ratings yet
Join Ordering in Distributed Databases
7 pages
Distributed Query Processing Overview
No ratings yet
Distributed Query Processing Overview
17 pages
Distributed Query Optimization Techniques
No ratings yet
Distributed Query Optimization Techniques
47 pages
Query Optimization in Distributed Databases
No ratings yet
Query Optimization in Distributed Databases
28 pages
Distributed Query Processing Overview
No ratings yet
Distributed Query Processing Overview
69 pages
Strategies for Distributed Data Storage
No ratings yet
Strategies for Distributed Data Storage
19 pages
Overview of Distributed Databases
No ratings yet
Overview of Distributed Databases
37 pages
Distributed Database Concepts Explained
No ratings yet
Distributed Database Concepts Explained
19 pages
Distributed Database Management Systems
No ratings yet
Distributed Database Management Systems
27 pages
Query Decomposition in Distributed Database
No ratings yet
Query Decomposition in Distributed Database
3 pages
Query Processing in Distributed Databases
No ratings yet
Query Processing in Distributed Databases
11 pages
Query Optimization with Heuristics & GA
No ratings yet
Query Optimization with Heuristics & GA
20 pages
Evaluating Multiple Join Queries in A Distributed Database System
No ratings yet
Evaluating Multiple Join Queries in A Distributed Database System
16 pages
Query Decomposition & Data Localization
No ratings yet
Query Decomposition & Data Localization
9 pages
Understanding Distributed Databases
No ratings yet
Understanding Distributed Databases
9 pages
LaTeX Subfigures and amsmath Guide
No ratings yet
LaTeX Subfigures and amsmath Guide
41 pages
Cryptography Applications in Modern Tech
No ratings yet
Cryptography Applications in Modern Tech
13 pages
Overview of Natural Language Processing
No ratings yet
Overview of Natural Language Processing
28 pages
Key Management Principles and Practices
No ratings yet
Key Management Principles and Practices
36 pages
Network Security: Malware & Attack Steps
No ratings yet
Network Security: Malware & Attack Steps
10 pages
Textbook vs. Research Paper Explained
No ratings yet
Textbook vs. Research Paper Explained
3 pages
Primality Testing Methods Explained
No ratings yet
Primality Testing Methods Explained
22 pages
Interval Estimation and Hypothesis Testing
No ratings yet
Interval Estimation and Hypothesis Testing
75 pages
Probability Practice Set Solutions
No ratings yet
Probability Practice Set Solutions
1 page
Bluetooth Security Protocols Explained
No ratings yet
Bluetooth Security Protocols Explained
4 pages
PGFPlots: Creating Histograms in LaTeX
No ratings yet
PGFPlots: Creating Histograms in LaTeX
21 pages
Cryptography and Network Security Course
No ratings yet
Cryptography and Network Security Course
1 page
Introduction to LATEX Basics
No ratings yet
Introduction to LATEX Basics
17 pages
Point Doubling and Addition in Modular Arithmetic
No ratings yet
Point Doubling and Addition in Modular Arithmetic
3 pages
WEP vs WPA: Enhancing Wireless Security
No ratings yet
WEP vs WPA: Enhancing Wireless Security
1 page
Document Scanned by CamScanner
No ratings yet
Document Scanned by CamScanner
19 pages
Advanced Consensus Mechanism for Blockchain
No ratings yet
Advanced Consensus Mechanism for Blockchain
64 pages
BiBa Broadcast Authentication Protocol
No ratings yet
BiBa Broadcast Authentication Protocol
10 pages
Overview of Distributed Databases
No ratings yet
Overview of Distributed Databases
7 pages
Software Developer Profile & Skills
No ratings yet
Software Developer Profile & Skills
5 pages
T24 Application Development Guide
No ratings yet
T24 Application Development Guide
68 pages
Legal Directory Submissions - Putting Your Best Foot Forward (A LexBlurb Publication)
100% (2)
Legal Directory Submissions - Putting Your Best Foot Forward (A LexBlurb Publication)
22 pages
Airline Management System Project Report
No ratings yet
Airline Management System Project Report
23 pages
Doculink Configuration Steps
80% (5)
Doculink Configuration Steps
31 pages
K201 Lab CH 4
No ratings yet
K201 Lab CH 4
2 pages
AI Diagnostic System for Rural Healthcare
No ratings yet
AI Diagnostic System for Rural Healthcare
4 pages
Java Servlets: Dynamic Web Page Creation
No ratings yet
Java Servlets: Dynamic Web Page Creation
54 pages
10th Practical File 2024 25
No ratings yet
10th Practical File 2024 25
22 pages
Data Engineering Course Overview
No ratings yet
Data Engineering Course Overview
2 pages
E-Commerce Management System Project
No ratings yet
E-Commerce Management System Project
21 pages
PL/SQL Basics and Syntax Overview
No ratings yet
PL/SQL Basics and Syntax Overview
34 pages
Data Warehousing and Mining Overview
No ratings yet
Data Warehousing and Mining Overview
142 pages
Login & Account Creation System Overview
No ratings yet
Login & Account Creation System Overview
2 pages
DBMS Questions and Answers Guide
No ratings yet
DBMS Questions and Answers Guide
15 pages
Deleting Your Saylor Academy Account
No ratings yet
Deleting Your Saylor Academy Account
13 pages
Crime File System Project Report
67% (3)
Crime File System Project Report
79 pages
B.E IV Semester Exam Results 2025
No ratings yet
B.E IV Semester Exam Results 2025
23 pages
Data Engineering with Scala & Spark
No ratings yet
Data Engineering with Scala & Spark
10 pages
BI Analytics Manager & Tableau Expert
No ratings yet
BI Analytics Manager & Tableau Expert
7 pages
Candidate Key vs Primary Key Explained
No ratings yet
Candidate Key vs Primary Key Explained
2 pages
Technical Interview Questions Overview
No ratings yet
Technical Interview Questions Overview
6 pages
Relational Data Mining Overview
No ratings yet
Relational Data Mining Overview
57 pages
Big Data Trends in Tourism Research
No ratings yet
Big Data Trends in Tourism Research
9 pages
Code and Data
No ratings yet
Code and Data
45 pages
Online Voting System Development in Python
No ratings yet
Online Voting System Development in Python
69 pages
Oracle HRMS Table Overview
No ratings yet
Oracle HRMS Table Overview
16 pages
DB2 PureScale Redbook
100% (1)
DB2 PureScale Redbook
306 pages
Elfolab 16.x Software Installation Guide
No ratings yet
Elfolab 16.x Software Installation Guide
151 pages
PHP Basics: Variables, Arrays, and Syntax
No ratings yet
PHP Basics: Variables, Arrays, and Syntax
18 pages

Distributed Query Optimization Techniques

Uploaded by

Distributed Query Optimization Techniques

Uploaded by

Distributed query optimization

Data Management for Big Data

Dario Della Monica

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/1

• Distributed Database Design (Ch. 3) ⋆

• Distributed Query Processing (Ch. 6-8) ⋆

➡ Query decomposition and data localization (Ch. 7) ⋆

• Distributed Transaction Management (Ch. 10-12) ⋆

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/2

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/3

• In this chapter (Ch. 8) ⋆:

➡ Additional problems specific to the distributed setting

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/4

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/5

• QEP are decorated with more information (on data exchange)

➡ In distributed context: non left-deep trees allow for parallelization

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/6

✦ Clearly, cost of localized queries (handled with CQO techniques) is computed as

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/7

• It is even more in distributed query optimization (affect communication costs)

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/8

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/9

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/10

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/11

• Consider the join of two relations:

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/12

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/13

idR A idS A R’ contains tuple <1,1> that does not

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/15

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/16

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/17

(a) Retrieve outer tuples • CT(x): communication time to transfer x bytes

Total Cost = LT ( retrieve card(R) tuples from R )

Join is done as R comes because R is the outer relation

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/19

Total cost = LT ( retrieve card( S ) tuples from S )

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/20

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/21

Total cost = LT ( retrieve card ( S ) tuples from S )

• CT(x): communication time to transfer x bytes

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/22

1. Ship whole PROJ at site of ASG CT ( size(PROJ) )

• If there is no upper level operation then 4 is a bad choice

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/24

PAY PAY EMP

Distributed DBMS © M. T. Özsu & P. Valduriez Ch.8/25

1. At compile time, generate a

• Site (and copy) selection is done in a greedy fashion

You might also like