0% found this document useful (0 votes)
9 views15 pages

Query Processing and Optimization in DBMS

The document outlines the assignment for 4th-year students of Haramaya University on Advanced Database Systems, focusing on query processing, optimization, distributed database systems, data fragmentation, replication, and allocation. It details the phases of query processing, types of query optimization, characteristics of distributed databases, and techniques for concurrency control and recovery. The document serves as a comprehensive guide for understanding complex database concepts and their practical applications.

Uploaded by

fetenatusura80
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views15 pages

Query Processing and Optimization in DBMS

The document outlines the assignment for 4th-year students of Haramaya University on Advanced Database Systems, focusing on query processing, optimization, distributed database systems, data fragmentation, replication, and allocation. It details the phases of query processing, types of query optimization, characteristics of distributed databases, and techniques for concurrency control and recovery. The document serves as a comprehensive guide for understanding complex database concepts and their practical applications.

Uploaded by

fetenatusura80
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

HARAMAYA UNIVERSITY COLLEAGUES OF COMPUTING AND

INFORMATICS CCI
Department Of It Summery For Bed 4th Year
Assignment of ADVANCED DATA BASE SYSTEM. (In TC 2110)

NO NAME OF STUDENTS ID NUMBER

1 Fetena Tusura Nure K/1462/11

2 Hasen Kedir Fato K/1466/11


3 Husein Gameda Hinsene K/1467/11
4 Husein Gamedi Kena K/1419/10
5 Radia Aliyi Boke K/1492/10
6 Roman Tagane Damake K/1495/10

7 Kedir Abdi Teji K/1263/11


1. Discuss about query processing phases.
 Query processing in a Database Management System (DBMS) is the series of steps that
the system follows to transform a high-level query (like SQL) into an efficient execution
plan and then retrieve the correct result.
 It involves several phases, usually divided into four main ones:
Phases of Query Processing.
1. Parsing and Translation
 The DBMS parses the SQL query to check for syntax errors and validates that all tables,
attributes, and operations exist in the schema.
 The query is then translated into an internal representation, often a relational algebra
expression.
Example: SELECT name FROM student WHERE age > 18; → relational algebra form.
2. Optimization
 The system generates many possible strategies (query execution plans) to execute the
query.
Optimization can be:
Heuristic-based: Uses predefined rules (e.g., apply selection early, avoid Cartesian product).
Cost-based: Estimates the cost of different plans (I/O, CPU, memory usage) using statistics about
tables and indexes. The best (lowest cost) plan is chosen.
3. Evaluation Plan Generation (Code Generation)
 The chosen relational algebra expression is converted into a physical execution plan.
 This specifies how operations are carried out (e.g., use of indexes, nested-loop join, sort-
merge join, or hash join).
 The plan is expressed as a tree of operators, where each node is an operation (scan, join,
sort, etc.).
4. Query Execution.
 The DBMS executes the plan step by step.
 Intermediate results may be stored in memory or on disk.
 The final result is produced and returned to the user.
 Evaluation Plan → Decide the physical operations.

1
 Execution → Run the plan and output results.
2. Discuss in detail about query optimization.
Query Optimization in DBMS
 Query optimization is the process of choosing the most efficient way to execute a
database query by considering different possible query execution plans and selecting the
one with minimum cost (in terms of time, CPU, memory, and I/O).
 Since the same SQL query can be executed in multiple ways, the optimizer ensures that
the query runs as fast as possible with minimal resource usage.
Types of Query Optimization.
1. Heuristic (Rule-Based) Optimization
Uses predefined rules/heuristics to transform the query into a more efficient one.
Example rules: Perform selections early (reduce the size of intermediate results).
 Use projection early (eliminate unnecessary columns).
 Replace Cartesian product + selection with a join.
Advantages: Simple, fast.
Disadvantages: May not always give the best plan.
2. Cost-Based Optimization (CBO)
Considers multiple possible execution plans and estimates their cost (I/O operations, CPU,
memory).
 Cost estimation uses statistics like:
 Table size (number of tuples).
 Index availability.
 Selectivity of predicates.
 Chooses the plan with the lowest estimated cost.
Advantages: Produces better plans.
Disadvantages: Complex requires statistics.
 Steps in Query Optimization
1. Parsing & Translation
SQL → relational algebra expression.
2. Logical Optimization

2
Transform query algebraically to an equivalent but more efficient expression.
Example: Move WHERE conditions close to the base table (predicate pushdown).
3. Physical Optimization
Decide how each operation will be performed:
 Table scan vs. Index scan.
 Nested loop join vs. Hash join vs. Sort-merge join.
 Sorting vs. Hashing for grouping.
4. Cost Estimation
 Assigns cost to each plan based on:
 Disk I/O (most expensive).
 CPU computation.
 Memory usage.
 Communication cost (in distributed DBs).
5. Plan Selection
 The plan with the minimum estimated cost is chosen for execution.
 Techniques Used in Optimization
 Predicate Pushdown → Apply filters early.
 Join Reordering → Perform smaller joins first.
 Importance of Query Optimization
 Reduces execution time.
 Saves CPU and memory resources.
 Reduces disk I/O (which is the most costly operation).
 Improves scalability of large databases.
 Provides better user experience with faster responses.
 In Short: Query optimization is all about finding the best execution strategy for a query. It
includes heuristic and cost-based methods, and focuses on reducing disk I/O, CPU time,
and memory usage by using techniques like predicate pushdown, join reordering, and
index utilization.

3
3. Discuss in detail about distributed database system.
 Distributed Database System (DDBS)
A Distributed Database System (DDBS) is a database in which the data is stored across multiple
locations (sites/computers) but managed as a single logical database.
 The sites are connected via a network.
 Users see the system as one integrated database, even though the data is physically
distributed.
Characteristics of Distributed Database System.
1. Transparency Features (Users should not notice the distribution):
 Distribution Transparency → Users don’t need to know where data is stored.
 Replication Transparency → Users don’t know if multiple copies of the same data exist.
 Fragmentation Transparency → Data may be split into fragments, but queries behave as
if it’s one table.
 Concurrency Transparency → Multiple users can access the system simultaneously
without conflict.
 Fault Tolerance Transparency → System continues to work even if one site fails.
2. Heterogeneity → May involve different DBMSs, hardware, or OSs.
3. Autonomy → each site can control its own data but still participate in the global database.
 Types of Distributed Databases.
1. Homogeneous DDBS.
 All sites use the same DBMS and schema.
 Easy to manage and query.
2. Heterogeneous DDBS.
 Sites may use different DBMSs, schemas, or data models.
 More complex, requires special integration tools.
 Data Distribution Strategies.
1. Replication
Copies of the same data are stored at multiple sites.
Advantages: High availability, faster access.
Disadvantages: High update cost (synchronization).

4
2. Fragmentation.
Data is divided into fragments:
 Horizontal Fragmentation → Rows are distributed across sites.
 Vertical Fragmentation → Columns are distributed across sites.
 Hybrid Fragmentation → Combination of both.
3. Mixed (Replication + Fragmentation)
 Some fragments are replicated, others are unique.
 Advantages of Distributed Database System
 Reliability & Availability → even if one site fails, others can continue.
 Scalability → Easy to add new sites without redesigning the whole system.
 Faster Access → Data stored near where it is most frequently used.
 Modularity → each site can be managed independently.
 Cost-Effective → can use multiple smaller systems instead of one large system.
 Disadvantages of Distributed Database System.
 Complexity → Query optimization, concurrency control, and recovery are harder.
 Costly Maintenance → more expensive to manage multiple sites.
 Data Consistency Issues → keeping replicas synchronized is challenging.
 Communication Overhead → Network delays can slow down queries.
 Security Risks → more entry point’s → higher security challenges.
 Applications of DDBS.
 Banking Systems → Branches store local customer data but share global access.
 E-commerce → Customer data distributed worldwide for fast access.
 Telecommunications → Subscriber information spread across different regions.
 Government & Defense → Sensitive data distributed for security and availability.
4. Explain about Data Fragmentation, Replication and Allocation in
distributed database.
1. Data Fragmentation.
Fragmentation means dividing a database into smaller parts (fragments) that can be stored at
different sites in a distributed database system.

5
Types of Fragmentation:
1. Horizontal Fragmentation
 Divides a relation (table) into subsets of rows (tuples)
 Each fragment stores rows that satisfy a condition.
Example: A STUDENT table can be divided based on Campus:
Fragment 1: Students from Campus A
Fragment 2: Students from Campus B
2. Vertical Fragmentation
 Divides a relation into subsets of columns (attributes).
Each fragment must include the primary key to allow reconstruction of the full table.
Example: STUDENT table split as:
Fragment 1: (Student ID, Name, Age)
Fragment 2: (Student ID, Address, GPA)
3. Hybrid (Mixed) Fragmentation
 Combination of horizontal and vertical fragmentation.
Example: First divide STUDENT table horizontally by Campus, then vertically by attributes.
Benefit: Reduces query cost since data is stored close to where it is used.
2. Data Replication
Replication means storing multiple copies of the same data at different sites.
Types of Replication:
1. Full Replication
 Whole database replicated at all sites.
Advantage: High availability, fault tolerance, fast local access.
Disadvantage: High update cost (every update must be applied to all copies).
2. Partial Replication
Only frequently accessed or critical fragments are replicated.
Advantage: Balance between availability and cost.
3. No Replication
 Each fragment stored only at one site.
Advantage: Low redundancy, less overhead.

6
Disadvantage: If site fails, data becomes unavailable.
Benefit: Increases availability and reliability, but synchronization is challenging.
3. Data Allocation Allocation means deciding where to store each fragment/replica in the
distributed system.
 Types of Allocation:
1. Centralized Allocation
All data stored at one site, others act as remote users.
Easy to manage, but not fault tolerant.
2. Partitioned Allocation
Database divided into fragments, and each fragment stored at one site only.
Example: Sales data for Region A stored at Site A, Region B stored at Site B.
3. Replicated Allocation
Copies of fragments stored at multiple sites.
Improves fault tolerance and query response time.
 Goal of Allocation: Minimize communication cost, maximize local access, and
balance load across sites.
Relationship between the Three 3
 Fragmentation decides how to split the data.
 Replication decides how many copies to keep.
 Allocation decides where to place fragments/replicas.
 In Short:
 Fragmentation → Splitting data into smaller pieces (horizontal, vertical, hybrid).
 Replication → Making copies of data (full, partial, none).
 Allocation → Deciding at which site(s) the fragments/replicas are stored.
5. Query Processing in Distributed Databases.
 Query Processing in Distributed Databases.
Definition
 Query processing in a distributed database refers to the steps of transforming a high-level
query (SQL) into an efficient execution plan that retrieves the required data from multiple
distributed sites, while minimizing cost (communication, I/O, CPU).

7
 Unlike centralized systems, here the system must decide which site(s) to access, how to
combine results, and how to reduce data transfer over the network.
 Phases of Query Processing in Distributed DB
1. Query Decomposition
 SQL query is parsed and checked for correctness.
 Transformed into relational algebra expression.
 High-level query → equivalent logical query.
Example: SELECT * FROM STUDENT WHERE Age > 18; → relational algebra form.
2. Data Localization
 Transforms the query using knowledge of fragmentation, replication, and allocation.
 Identifies which fragments of data are needed and at which site(s) they are located.
Example: If STUDENT table is horizontally fragmented by campus, only relevant fragments are
retrieved.
3. Global Optimization
 Multiple possible strategies are generated for retrieving data.
 Optimizer selects the plan that minimizes communication cost + I/O + CPU usage.
 Important because network communication cost dominates in distributed DB.
Example techniques: Semi joins to reduce data transfer.
Perform selection/projection at remote sites before sending data.
4. Local Optimization
Each site optimizes its part of the query independently.
Example: If Site A must execute a join, it decides whether to use index scan, nested-loop join,
etc.
5. Query Execution
 Execution plan is carried out.
 Partial results are collected from remote sites.
 Results are merged and presented to the user as if from a single centralized DB.
 Challenges in Distributed Query Processing
 Data Distribution → Data may be fragmented, replicated, or located at multiple sites.
 Communication Cost → Transferring data between sites is expensive.

8
 Heterogeneity → Sites may run different DBMSs.
 Concurrency Control → Ensuring consistency when multiple users access data at
different sites.
 Fault Tolerance → Handling failures of sites or network during query execution.
 Techniques to Improve Query Processing
 Predicate Pushdown → Apply filters at local sites to reduce transferred data.
 Semi joins → Send only join attributes first, then fetch relevant tuples.
 Replication Usage → Access the nearest copy of data.
 Parallel Execution → Execute sub queries at multiple sites simultaneously.
 Query processing in DDBS has five phases:
1. Query Decomposition
2. Data Localization
3. Global Optimization
4. Local Optimization
5. Query Execution
Main objective: Reduce communication cost and improve performance while maintaining
correctness and transparency.
6. Concurrency Control and Recovery in distributed database system.
 Concurrency control ensures that multiple transactions executing simultaneously in
different sites of a distributed database maintain consistency, isolation, and correctness.
Objectives
 Preserve ACID properties (Atomicity, Consistency, Isolation, and Durability).
 Avoid problems like lost update, dirty read, uncommitted data, and deadlocks.
 Ensure serializability (execution is equivalent to some serial order).
 Concurrency Control Techniques
1. Distributed Two-Phase Locking (2PL)
Transactions must acquire locks before accessing data.
Two phases:
 Growing phase → obtain locks.
 Shrinking phase → release locks.

9
 Guarantees serializability but may cause deadlocks.
2. Distributed Timestamp Ordering
 Each transaction gets a timestamp (unique identifier).
 Transactions are executed in timestamp order.
 Ensures serializability without deadlocks, but may lead to starvation.
3. Distributed Optimistic Concurrency Control (OCC)
 Transactions execute without locks.
 At commit, validation checks if conflicts occurred.
 If conflict → transaction is rolled back.
 Works best in systems with low conflict rate.
4. Distributed Deadlock Detection & Resolution
 Deadlocks can happen across multiple sites.
 Detection methods:
 Centralized → one site checks for deadlocks.
 Distributed → all sites cooperate to detect cycles.
 Resolution: Abort one transaction to break the deadlock.
2. Recovery in Distributed Databases
 Recovery ensures that the database returns to a consistent state after a failure (site
crash, communication failure, or transaction failure).
Types of Failures in Distributed DB
1. Transaction Failures → e.g., logical errors (invalid input).
2. Site Failures → one computer/node crashes.
3. Communication Failures → messages lost or delayed between sites.
4. Distributed Deadlocks → transactions blocked across sites.
 Recovery Techniques
1. Distributed Commit Protocols
 Ensure that either all sites commit a transaction or none do.
 Two-Phase Commit (2PC)
 Phase 1 (Prepare): Coordinator asks participants if they can commit.
 Phase 2 (Commit/Abort): Based on replies, coordinator decides.

10
 Ensures atomicity but may block if coordinator fails.
 Three-Phase Commit (3PC)
2. Check pointing
 Sites periodically save their state (checkpoint).
 After failure, recovery restarts from the last checkpoint instead of from scratch.
3. Logging (Write-Ahead Logging – WAL)
 Each site maintains a log file recording updates.
 After failure, log is used to redo committed transactions and undo uncommitted ones.
4. Message Logging in Distributed Systems
 Keeps track of messages between sites.
 Helps in recovery from communication failures.
 Concurrency Control in DDBS ensures correct execution of concurrent transactions using
techniques like 2PL, Timestamp ordering, OCC, and deadlock handling.
 Recovery ensures that failures do not corrupt the database, using commit protocols
(2PC/3PC), check pointing, and logging.
 Both together maintain the integrity, consistency, and reliability of distributed databases.
 Concurrency Control → Deals with simultaneous transaction execution (methods: 2PL,
Timestamp, OCC).
 Recovery → Deals with failures (methods: 2PC, 3PC, check pointing, logging.
6. Reference
“Concurrency Control and Recovery in Distributed Database System” is a standard exam
question in DBMS / Distributed Database courses.
You can find references in widely used textbooks:
� Textbook References
1. Abraham Silberschatz, Henry Korth, S. Sudarshan – Database System Concepts (6th/7th
Edition)
Chapter 23 & 24 → Concurrency Control and Recovery in Centralized DBMS.
Chapter 25 → extends these ideas to Distributed Databases.
2. M. Tamer Özsu & Patrick Valduriez – Principles of Distributed Database Systems (3rd
Edition)

11
Chapter 12: Distributed Concurrency Control
Chapter 13: Distributed Database Recovery.
3. Raghu Ramakrishna & Johannes Gehrke – Database Management Systems (3rd Edition)
Has a section on transaction management, concurrency control, and recovery, including
distributed aspects.
� Academic / Course References.
This topic is part of Transaction Management and Distributed DBMS modules in Computer
Science courses.
Usually covered under:
Unit: Concurrency Control
Unit: Recovery Systems
Unit: Distributed DBMS
� So your Question 6: Concurrency Control and Recovery in Distributed Database System is
directly referenced in:
Silberschatz et al., Database System Concepts
Özsu & Valduriez, Principles of Distributed Database Systems.
5. References
1. Query Processing Phases
Silberschatz, Korth & Sudarshan – Database System Concepts (6th/7th Edition)
Chapter 15: Query Processing
Raghu Ramakrishnan & Johannes Gehrke – Database Management Systems (3rd Edition)
Chapter 12: Query Evaluation
Elmasri & Navathe – Fundamentals of Database Systems (6th/7th Edition)
Chapter 15: Query Processing.
2. Reference
2. Query Optimization
Silberschatz, Korth & Sudarshan – Database System Concepts
Chapter 16: Query Optimization
Raghu Ramakrishnan & Johannes Gehrke – Database Management Systems
Chapter 13: Query Optimization

12
Elmasri & Navathe – Fundamentals of Database Systems
Chapter 16: Query Optimization.
3. Reference
3. Distributed Database System
M. Tamer Özsu & Patrick Valduriez – Principles of Distributed Database Systems (3rd Edition)
Chapter 1: Introduction
Chapter 5–9: Distributed Database Architecture & Design
Silberschatz, Korth & Sudarshan – Database System Concepts
Chapter 25: Distributed Databases
Elmasri & Navathe – Fundamentals of Database Systems
Chapter 25: Distributed Databases
Q1 & Q2 → Silberschatz et al. (Ch. 15 & 16), Ramakrishnan (Ch. 12 & 13), Elmasri & Navathe
(Ch. 15 & 16).
Q3 → Özsu & Valduriez (Ch. 5–9), Silberschatz et al. (Ch. 25), Elmasri & Navathe (Ch. 25).
4. References
1. M. Tamer Özsu & Patrick Valduriez – Principles of Distributed Database Systems (3rd
Edition)
Chapter 5: Data Storage
Covers fragmentation, replication, and allocation strategies in detail.
2. Abraham Silberschatz, Henry Korth, S. Sudarshan – Database System Concepts (6th/7th
Edition)
Chapter 25: Distributed Databases
Sections: Data Storage, Fragmentation, Replication, and Allocation.
3. Elmasri & Navathe – Fundamentals of Database Systems (6th/7th Edition)
Chapter 25: Distributed Databases
Özsu & Valduriez (Ch. 5)
Silberschatz et al. (Ch. 25)
Elmasri & Navathe (Ch. 25)

13
1. Reference
You can find them in these commonly used references:
� Textbook References
1. “Database System Concepts” – by Abraham Silberschatz, Henry Korth, and S. Sudarshan
Chapter: Distributed Databases & Query Processing.
Very standard book for DBMS courses.
2. “Principles of Distributed Database Systems” – by M. Tamer Özsu and Patrick Valduriez
Comprehensive reference specifically for distributed databases.
Covers query processing, optimization, fragmentation, replication, etc.
3. “Database Management Systems” – by Raghu Ramakrishna & Johannes Gherkin
Contains sections on query processing and distributed databases.
� Academic Sources
Distributed query processing is a major topic in Database Systems courses (BSc/MSc Computer
Science, IT, Software Engineering).
It often appears in university exam questions, especially under Distributed Database
Management Systems (DDBMS) modules.
Silberschatz et al., Database System Concepts (6th/7th Edition), Chapter 25 (Distributed DBMS)
Özsu & Valduriez, Principles of Distributed Database Systems (3rd Edition), Chapters 7–9.

14

You might also like