HEC NSCT 2026 Databases MCQs Answers Notes
HEC NSCT 2026 Databases MCQs Answers Notes
Mapped directly to the official P@SHA/HEC syllabus. Features a mix of scenario-based, conceptual, and
code-tracing MCQs. Each question shows the correct answer in bold, followed by the explanation
immediately under it.
Quick Notes
• DBMS vs File System: DBMS solves data redundancy, data isolation, and
concurrent access anomalies present in traditional OS file systems.
• Program-Data Independence: In a DBMS, the structure of the data files is stored
in the DBMS catalog separately from the access programs.
• Database Catalog (Metadata): "Data about data." Stores table definitions, data
types, constraints, and user access roles.
• Schema vs Instance: Schema is the structural design (rarely changes). Instance is
the snapshot of actual data at a given moment (changes constantly).
MCQs
Because the same data is stored in two places (Redundancy), updating one without the other
leads to conflicting facts about the student's address. A DBMS prevents this via centralized
storage and normalization.
Q2. (Comparison) Which of the following best describes the difference between a
Database Schema and a Database Instance?
The schema is the defined structure (tables, columns, data types) and rarely changes. An
instance is the snapshot of the data (the rows) inside those tables, which changes with every
INSERT, UPDATE, or DELETE operation.
Q3. (Code-Tracing) A database administrator executes the following SQL query to inspect
the database. What specific DBMS component is being accessed here?
mysql-terminal
Q4. (Conceptual) In traditional file systems, if you change the length of a 'ZipCode' field
from 5 to 9 characters in the storage file, you must also rewrite the application code reading
that file. A DBMS eliminates this requirement. This characteristic is known as:
A) ACID Compliance
B) Program-Data Independence
C) Data Isolation
D) Concurrency Control
Program-Data Independence means the structure of the data is managed by the DBMS and
separated from the application programs. Applications interact with the DBMS through APIs,
shielding them from low-level file structure changes.
Q5. (Scenario-Based) Two travel agents attempt to book the last remaining seat on a flight
at the exact same millisecond. In a traditional file system, both might succeed, causing an
overbooking. What inherent feature of a DBMS prevents this?
A) Data Abstraction
B) Data Dictionary
C) Concurrency Control
D) Crash Recovery
Concurrency Control (often implemented via locks) ensures that when multiple transactions
attempt to update the same data simultaneously, they are executed in a safe, serializable
manner, preventing lost updates or overbooking anomalies.
Q6. (Conceptual) While a DBMS offers many advantages, there are scenarios where
utilizing a DBMS is considered unnecessary or harmful. Which of the following is a valid
reason to avoid using a full-scale DBMS?
A) Employee A
B) Employee B
C) Employee C
D) Both B and C
A Database Administrator (DBA) is primarily responsible for the physical realization of the
database, performance tuning, backups, security, and hardware allocation. Employee B is a
Database Designer, and C is an Application Developer.
A DBMS manages data storage, retrieval, and integrity. It does not compile application
software (like Java or Python code) into machine code; that is the job of a programming
language compiler.
Q9. (Comparison) What is the primary difference between a "Data Definition Language
(DDL)" and a "Data Manipulation Language (DML)"?
A) Integrity Constraints
B) Security and Authorization
C) Transaction Management and Recovery
D) Query Optimization
• Slotted Page Architecture: Unlike file systems that read continuous streams of
bytes, a DBMS divides the hard drive into fixed-size blocks called Pages (typically
8KB in PostgreSQL or 16KB in MySQL InnoDB). Each page uses a "Slotted Page
Layout." The page header contains metadata, followed by an array of line pointers
growing downward, while the actual row data (tuples) grows upward from the
bottom. This allows the DBMS to delete rows and dynamically defragment the page
without shifting all other rows.
• Bypassing the OS Cache (O_DIRECT): Modern Operating Systems try to cache
file reads in RAM. However, a high-performance DBMS explicitly tells the OS to
bypass this cache (using flags like O_DIRECT in Linux). Why? Because the DBMS's
own Buffer Manager understands database workloads better than the OS. The
DBMS knows that an "Index Root Node" page should be kept in RAM forever, while
a "Sequential Scan Table" page should be evicted immediately after reading.
• Write-Ahead Logging (WAL) & Sequential I/O: Hard drives (especially HDDs) are
incredibly slow at random writes, but fast at sequential writes. To guarantee ACID
Durability without destroying performance, the DBMS does not write your UPDATE
directly to the table's data file immediately. Instead, it writes the change to the WAL
(a sequential log file) and simply updates the page in RAM (marking it as "dirty").
Even if the power goes out a millisecond later, the database can reconstruct the
exact state by replaying the sequential WAL upon reboot.
Quick Notes
◦ Internal Level: Describes physical storage structures and access paths (B-Trees,
Hash indexes).
◦ Conceptual Level: Describes the structure of the whole database for a
community of users (Entities, Data Types, Relationships).
◦ External Level: Describes the part of the database a specific user group is
interested in (Views).
• Data Independence: Capacity to change the schema at one level without having to
change the schema at the next higher level.
• Client-Server Architectures:
• DBMS Components: Query Optimizer (finds best execution plan), Buffer Manager
(handles RAM caching), Transaction Manager (enforces ACID).
MCQs
Q1. (Scenario-Based) A DBA notices that queries searching for a specific user's email are
too slow. They create a B-Tree index on the email column. The backend application's SQL
queries remain exactly the same, but execution speed increases 10x. This is a practical
example of:
Q2. (Comparison) In the context of database application architecture, what is the primary
distinction between a 2-Tier and a 3-Tier architecture?
A) 2-Tier systems do not use a network connection, while 3-Tier systems do.
B) 2-Tier uses NoSQL databases; 3-Tier uses Relational databases.
C) In 2-Tier, business logic is embedded in the client application; in 3-Tier, an
intermediate application server handles business logic.
D) 2-Tier cannot support concurrent users, while 3-Tier can.
In a 2-Tier model (like a thick desktop client), the UI connects via ODBC/JDBC directly to the
DB, meaning the client holds the business rules. A 3-Tier model inserts a middle layer (like a
[Link] or Spring Boot server) to handle business logic, drastically improving security and
scalability.
Q3. (Code-Tracing) A developer runs the following command in their terminal. Which
specific internal component of the DBMS architecture is responsible for generating the
output shown?
postgres - query_plan
-- Output:
Index Scan using idx_orders_cust on orders (cost=0.29..8.31 rows=1)
The Query Optimizer analyzes the SQL statement and determines the most cost-effective
execution strategy (e.g., using an Index Scan rather than a Sequential Table Scan). The
EXPLAIN command directly reveals the Optimizer's chosen plan.
Q4. (Conceptual) In the Three-Schema Architecture, what mechanism allows the DBMS to
translate a request on an External View into a request against the Conceptual Schema, and
finally into a read operation on the physical disk?
Mappings are the rules/programs used by the DBMS to transform requests and results
between the levels (External/Conceptual mapping and Conceptual/Internal mapping). This
process of transforming requests is what actually provides Data Independence.
Because the Conceptual schema (the tables themselves) was altered, the external application
broke. If the DBA had created a VIEW named Customers that joined the two new tables, they
would have successfully achieved Logical Data Independence, shielding the app from the
change.
In a modern 3-Tier or n-Tier architecture, Tier 1 is the browser/app (Frontend), Tier 2 is the
web server/API (Backend Logic), and Tier 3 is the dedicated Database Server (Data Layer)
where the DBMS engine runs.
Q7. (Code-Tracing) Look at the SQL script below. The object hr_employee_view
belongs to which level of the database architecture?
mysql - schema_setup
A View is a virtual table tailored to a specific group of users (in this case, an HR intern who
shouldn't see inactive employees or salary data). This customized, restricted perspective is
the definition of the External Level.
Q8. (Conceptual) What is the primary role of the Storage Manager module in a database
system?
A) To act as an interface between the low-level data stored in the OS file system
and the application queries submitted to the system.
B) To check user passwords and roles.
C) To compile SQL into machine code.
D) To format the output of queries into JSON or XML.
The Storage Manager (which contains the Buffer Manager, File Manager, and Transaction
Manager) handles the actual retrieval, storage, and caching of data blocks on the physical
hard drives.
Q9. (Comparison) Why is Logical Data Independence generally considered much harder
to fully achieve than Physical Data Independence?
Changing physical storage (moving from RAID 0 to RAID 5, or adding an index) is invisible to
a SQL query. But if you change the logical structure (e.g., changing a 1:1 relationship to an
M:N relationship), the application's underlying code and queries usually must change to
handle the new logic.
Q10. (Conceptual) When a database crash occurs, the recovery manager ensures that
incomplete transactions are undone and committed transactions are preserved. Which file
structure, maintained by the DBMS architecture, is strictly required to accomplish this?
The Write-Ahead Log (WAL) records all changes made to the database *before* those
changes are written to the actual data files. In the event of a crash, the recovery manager
reads the WAL to Redo or Undo transactions.
When an application sends a query (e.g., SELECT * FROM Users WHERE id = 5),
it travels through the DBMS architecture in a precise sequence:
Quick Notes
• Relational vs. ER Model: The ER Model is strictly for conceptual design (drawing
on a whiteboard). The Relational Model is the logical implementation (creating
actual tables with primary keys).
• Network Model vs. Graph Databases: The classic Network Model (CODASYL)
relied on rigid, hardcoded pointers. Modern Graph Databases (like Neo4j) use
flexible nodes and edges designed for deep traversal, though both are technically
graphs.
MCQs
Q1. (Conceptual) Which early database model organizes data in a strict upside-down tree
structure, where every child record is allowed to have exactly one parent record?
A) Network Model
B) Relational Model
C) Hierarchical Model
D) Object-Relational Model
The Hierarchical model connects data in a tree. For example, a "Department" node has child
"Employee" nodes. Its major flaw is that an employee cannot easily belong to two
departments (a Many-to-Many relationship is not natively supported).
Q2. (Comparison) How did the Network Data Model fundamentally improve upon the
limitations of the Hierarchical Data Model?
The Network model (standardized by CODASYL) solved the M:N limitation of the hierarchical
model by allowing network-like graph relationships, where a single child record could link to
multiple parent records using pointers.
Q3. (Scenario-Based) A software architect is designing the backend for a social media
application. The most critical, high-frequency query is finding "friends of friends of friends"
(3rd-degree connections). Which data model is most highly optimized for this specific
traversal operation?
A) Relational Model
B) Hierarchical Model
C) Graph Data Model
D) Column-Family Data Model
Q4. (Conceptual) In the Relational Data Model introduced by E.F. Codd in 1970, what
fundamental mathematical concept serves as the absolute foundation for tables?
A) Calculus matrices
B) Set theory and relations
C) Directed acyclic graphs
D) Doubly Linked Lists
A "Relation" is a mathematical term for a set of tuples (rows). The relational model treats data
strictly as sets, which allows the use of Relational Algebra (Union, Intersection, Select,
Project) to declaratively query data.
Q5. (Scenario-Based) During the initial requirements phase of a new university database, a
database architect draws a diagram on a whiteboard. They draw rectangles for `Students`,
diamonds representing `Enrollments`, and ovals representing `Student_IDs`. What specific
data model is being utilized here?
A) Physical Data Model
B) Relational Data Model
C) Entity-Relationship (ER) Model
D) Object-Oriented Model
The ER Model is a high-level conceptual data model. It uses standard visual notations
(Rectangles = Entities, Diamonds = Relationships, Ovals = Attributes) to represent business
rules before writing any SQL table definitions.
Q6. (Code/Data Tracing) Examine the raw data representation extracted from a database
below. What type of data model is being utilized to store this record?
db_export.json
{
"_id": ObjectId("60f1b2b3"),
"name": "Sarah Connor",
"skills": ["Python", "Machine Learning"],
"contact": {
"email": "sarah@[Link]",
"phone": null
}
}
Q8. (Conceptual) Database design typically progresses through three distinct modeling
phases. Which data model specifies the absolute low-level storage details, such as record
formats, file structures, table spaces, and B-Tree access paths?
Database design follows three phases: Conceptual (ER diagram representing business logic),
Logical (Relational tables and foreign keys), and Physical (Database-specific syntax, index
creation, disk partitioning, and file organization).
A) The hierarchical model strictly prohibits a child node (Account) from having
more than one parent node (Customer).
B) The hierarchical model does not support integer data types required for financial
balances.
C) The database lacks primary keys, which are required for any relationship.
D) Hierarchical databases cannot store more than a certain amount of records per tree
branch.
Because the hierarchical model uses a strict tree topology, 1:N relationships are easy.
However, M:N relationships (Many customers mapped to Many accounts, or multiple parents
for one child) break the tree structure, requiring messy, redundant data duplication to simulate.
Q10. (Conceptual) In the context of database models and architecture, what specific
problem does "Data Abstraction" solve for the end-user?
A) It automatically encrypts data so unauthorized users cannot read it over the network.
B) It hides complex low-level storage and hardware details, presenting a simplified
logical view of the data.
C) It automatically deletes duplicate records from a table during an INSERT.
D) It converts relational tables into NoSQL document stores on the fly.
Data Abstraction allows a developer to write a simple `SELECT * FROM Users` without
needing to know exactly how the operating system reads the magnetic disk sectors, or how it
physically traverses a B-Tree index structure to locate that data.
• Structure Difference: OO uses nested objects, lists, and graphs. Relational uses
flat tables.
• Identity: OO identity is based on memory reference (e.g., obj1 == obj2).
Relational identity is based on Primary Keys.
• Inheritance: OO naturally supports inheritance (e.g., Manager extends
Employee). Relational DBs have no native concept of inheritance, requiring
workarounds like Single Table Mapping or Joined Table Mapping.
Quick Notes
◦ Super Key: A set of one or more attributes that uniquely identifies a tuple.
◦ Candidate Key: A minimal Super Key (no redundant attributes).
◦ Primary Key: The specific Candidate Key chosen by the DBA to uniquely identify
rows.
◦ Foreign Key: An attribute that matches the Primary Key of another table to create
a relationship.
• Candidate Key vs. Super Key: All Candidate Keys are Super Keys, but not all
Super Keys are Candidate Keys. If {ID} is a key, then {ID, Name} is a Super
Key, but NOT a Candidate Key because Name is redundant for uniqueness.
• Primary Key vs. Unique Key: A table can have only one Primary Key, and it strictly
cannot contain NULL values. A table can have multiple Unique Keys, and in many
systems (like PostgreSQL/MySQL), they can contain NULLs.
• Foreign Key NULLs: Foreign Keys can be NULL unless explicitly marked NOT
NULL. A NULL Foreign Key represents an optional relationship.
MCQs
Degree refers to the number of columns (attributes). Here, there are 5 columns. Cardinality
refers to the number of rows (tuples). Here, there are 500 rows.
Q2. (Conceptual) According to the Entity Integrity Constraint, which of the following
statements is true regarding a Primary Key?
A) A Primary Key can be NULL if the row is not referenced by any Foreign Key.
B) A Primary Key must be an integer.
C) No component of a Primary Key is allowed to accept a NULL value.
D) A Primary Key must be a single attribute only.
Entity Integrity states that no part of the primary key can be null. This ensures that every tuple
can be uniquely identified; a null value would imply an unknown identity, which violates the
fundamental requirement of a key.
Q3. (Scenario-Based) A database designer identifies three possible unique identifiers for a
Vehicle table: VIN (Vehicle Identification Number), EngineNumber, and LicensePlate.
In relational terminology, what are these three attributes called before one is officially
selected as the Primary Key?
A) Super Keys
B) Secondary Keys
C) Candidate Keys
D) Foreign Keys
Candidate Keys are the set of all minimal super keys that are eligible to become the Primary
Key. Once the designer chooses one (e.g., VIN), the others are referred to as Alternate Keys.
Q4. (Code-Tracing) A developer attempts to execute the following SQL script. Which
relational constraint will likely cause the second INSERT to fail?
sql-shell - constraints_test
The table Emp has a foreign key d_ref that references Dept(dept_id). Since the value 20
does not exist in the Dept table (only 10 exists), the insertion violates Referential Integrity.
Q5. (Comparison) What is the technical difference between a Super Key and a Candidate
Key?
A) Super Keys are used for relationships, Candidate Keys are used for indexing.
B) Super Keys must be numeric, Candidate Keys can be strings.
C) A Candidate Key is a Super Key with no redundant attributes (minimal); a Super
Key may contain unnecessary attributes.
D) Super Keys are chosen by the DBMS; Candidate Keys are chosen by the user.
A Super Key is any set of attributes that uniquely identifies a row. A Candidate Key is the
"minimal" version. For example, if {PassportNumber} is unique, then {PassportNumber,
EyeColor} is a Super Key, but not a Candidate Key because EyeColor is redundant.
In a composite key, the combination of the two values must be unique. Inserting (501, 'A')
twice violates the uniqueness requirement of the Primary Key.
Q7. (Conceptual) In the relational model, why is it stated that "tuples are unordered"?
Q8. (Code-Tracing) Consider the following table and data. What happens if you try to
execute: DELETE FROM Parent WHERE id = 1;?
postgres - delete_logic
The ON DELETE CASCADE option tells the DBMS that if a referenced row in the Parent table
is deleted, all rows in the Child table that reference it should also be deleted automatically.
Q9. (Comparison) How does a Foreign Key differ from a Primary Key regarding NULL
values?
Entity Integrity forbids NULLs in Primary Keys. However, a Foreign Key can be NULL if the
relationship is optional (e.g., an Employee who is not currently assigned to any Department).
Q10. (Calculation) A relation R has attributes {A, B, C, D}. The only unique
combinations of attributes are {A, B} and {A, C}. How many Candidate Keys does this
relation have?
A) 1
B) 2
C) 3
D) 4
Since both {A, B} and {A, C} are minimal sets that uniquely identify tuples, they are both
Candidate Keys. Attributes B and C are part of different keys.
Q11. (Conceptual) What does the term "Domain" represent in a relational schema?
A) The physical location of the database on the server.
B) The set of all possible atomic values that can be assigned to an attribute.
C) The group of users who have access to a table.
D) The number of tables in the database.
A Domain is a pool of values from which an attribute takes its values. For example, the
domain for a Age attribute might be integers between 0 and 120.
Q12. (Scenario-Based) An e-commerce database has a table Orders. The designer wants
to ensure that the total_amount column never accepts negative values. Which relational
concept should be applied?
A) Referential Integrity
B) Entity Integrity
C) Domain Constraint (using CHECK)
D) Foreign Key Constraint
A Domain Constraint restricts the values allowed in a column. In SQL, this is typically
implemented using a CHECK constraint (e.g., CHECK (total_amount >= 0)).
Q13. (Calculation) If Relation A has 10 tuples and Relation B has 5 tuples, what is the
Cardinality of the Cartesian Product (A x B)?
A) 15
B) 10
C) 50
D) 5
The Cartesian Product of two sets contains every possible combination of elements. In
database terms, the resulting cardinality is the product of the cardinalities of the two relations:
10 * 5 = 50.
Q14. (Code-Tracing) Observe the SQL below. What is the Primary Key of the
Enrollments table?
sql-editor - composite_keys
A) student_id
B) course_id
C) A composite key consisting of (student_id, course_id)
D) student_id, course_id, and semester
A Primary Key can consist of multiple columns. This is known as a composite key. In this
case, neither student_id nor course_id alone is the PK; only their combination is.
Q15. (Comparison) What happens if you try to insert a duplicate value into a column that
has a UNIQUE constraint, but is NOT the Primary Key?
Both PRIMARY KEY and UNIQUE constraints enforce uniqueness. The primary difference is
that a table can have only one PK and it cannot be NULL, whereas it can have many UNIQUE
columns.
• The 3-Valued Logic (3VL): In standard logic, something is either True or False.
Because of NULLs, SQL uses 3VL: True, False, and Unknown. For example, NULL
= NULL does not return True; it returns Unknown. This is why you must use IS
NULL instead of = NULL.
• Referential Integrity Mechanics: When a Foreign Key is defined, the DBMS
doesn't just check it during INSERT. It must also monitor the Parent table. There are
four standard behaviors for ON DELETE and ON UPDATE:
• Codd's Rule #10 (Physical Data Independence): This rule states that the logical
level of the database (how you write your SQL) must not be impacted by how the
data is physically stored. This is why we use Primary Keys rather than "Memory
Addresses" or "Row IDs" to link tables.
Quick Notes
• Relational Algebra: A procedural query language. It tells the system how to retrieve
the data using a sequence of operations.
• Fundamental Operations:
• Union Compatibility: Two relations are only union-compatible if they have the
same number of attributes (degree) and the domains of corresponding attributes are
identical. You cannot union a Users table with an Orders table directly.
• Duplicates: In pure Relational Algebra, operations like Projection automatically
remove duplicate rows. In SQL, you must explicitly use DISTINCT to achieve this.
• Procedural vs. Declarative: Remember that Algebra is like a recipe (steps), while
Calculus is like a wishlist (the result). SQL is primarily based on Relational Calculus
but executed using Relational Algebra.
MCQs
Q1. (Conceptual) Which of the following best describes the fundamental difference
between Relational Algebra and Relational Calculus?
Relational Algebra provides a set of operations (Select, Project, Join) that are executed in a
specific order to produce a result. Relational Calculus simply describes the properties that the
resulting data must satisfy.
Q2. (Calculation) Relation R has 10 tuples and Relation S has 20 tuples. If we perform the
operation R ∪ S, what is the minimum and maximum possible number of tuples in the
result? (Assume R and S are union-compatible).
Maximum: If R and S have no common tuples, the result is 10 + 20 = 30. Minimum: If all 10
tuples of R are already present in S, the result is simply the size of S, which is 20 (since sets
remove duplicates).
Q4. (Conceptual) For the Set Difference (R - S) operation to be valid, which of the
following conditions must be met?
Set operations (Union, Intersection, Difference) require the participating relations to have the
same structure (number of columns and corresponding data types). This is called Union
Compatibility.
Q5. (Scenario-Based) A marketing team wants a list of all Customers who have never
placed an Order. Which relational algebra operator is best suited for this task?
To find customers with no orders, you take the set of all Customer IDs and subtract the set of
Customer IDs that appear in the Orders table: πcust_id(Customers) - πcust_id(Orders).
Q6. (Code-Tracing) Trace the logic of the following Tuple Relational Calculus expression:
{ t | t ∈ Employees ∧ [Link] > 5000 }
calculus-engine
// Relation: Employees
// id | name | salary
// 1 | John | 4000
// 2 | Sarah | 6000
// 3 | Mike | 5500
A) Returns all columns for John.
B) Returns all columns for Sarah and Mike.
C) Returns only the salary column for everyone.
D) Returns a count of employees.
The expression defines a set of tuples 't' such that 't' is a member of the Employees relation
and the salary attribute of 't' is greater than 5000. Sarah and Mike both meet this criteria.
Q7. (Calculation) Relation R has degree 3 (A, B, C) and 5 tuples. Relation S has degree 2
(D, E) and 4 tuples. What is the Degree and Cardinality of the Cartesian Product R × S?
A) Degree: 6, Cardinality: 9
B) Degree: 5, Cardinality: 9
C) Degree: 5, Cardinality: 20
D) Degree: 6, Cardinality: 20
Q8. (Conceptual) Why is the Rename (ρ) operator considered essential in Relational
Algebra, particularly during a Self-Join scenario?
When joining a table to itself (e.g., finding employees who earn more than their managers
within the same Employee table), we must use the Rename operator to give the table two
different aliases so the engine can distinguish the attributes.
A) { t | t ∈ Users ∧ [Link] = 5 }
B) { t | ¬(t ∈ Users) }
C) { t | t ∈ Users ∨ t ∈ Admins }
D) { t | [Link] = 'Alice' ∧ t ∈ Users }
An unsafe expression like "all tuples not in the Users table" is problematic because the
universe of possible tuples is infinite. Safe expressions are those that only return values
present in the database domains.
A) Projection (π)
B) Union (∪)
C) Theta Join (⨝θ)
D) Set Difference (-)
A Theta Join is specifically defined as σcondition(R × S). It combines the tables and then
immediately filters them based on the provided predicate (θ).
Q11. (Comparison) What is the primary difference between Tuple Relational Calculus
(TRC) and Domain Relational Calculus (DRC)?
In TRC, you query for rows: {t | Employee(t)}. In DRC, you query for components: {<id, name,
sal> | Employee(id, name, sal) ∧ sal > 5000}. Both are equally powerful (Relational
Completeness).
Q12. (Calculation) Relation R(A, B) has 3 tuples: {(1, 2), (3, 4), (5, 6)}. Relation S(B, C) has
2 tuples: {(2, 7), (4, 8)}. What is the result of the Natural Join (R ⨝ S)?
A query language is said to be Relationally Complete if it can express any query that
can be formulated in the fundamental Relational Algebra. Both Tuple Relational
Calculus and Domain Relational Calculus are relationally complete.
The Query Optimizer Secret: Even though we write SQL (which is declarative/
Calculus), the DBMS does not execute it directly as a "wishlist." Instead, the Query
Optimizer parses the SQL and translates it into an internal Relational Algebra Tree.
This tree represents the procedural steps. The optimizer then rearranges the nodes of
this tree (e.g., performing Selection as early as possible to reduce the number of rows)
to find the most efficient path. This process is known as Algebraic Query
Optimization.
Quick Notes
• SQL Categories: DDL (Structure), DML (Data), DCL (Security), TCL (Transactions),
DQL (Retrieval).
• SELECT Statement: The primary DQL command. Syntax order: SELECT → FROM
→ WHERE → GROUP BY → HAVING → ORDER BY.
• NULL Handling: NULL represents a missing value. Standard comparisons (=
NULL) always fail; use IS NULL or IS NOT NULL.
• Wildcards: In the LIKE operator, % matches zero or more characters, and _
matches exactly one character.
• Aggregate Functions: COUNT(), SUM(), AVG(), MIN(), and MAX(). They
generally ignore NULL values (except COUNT(*)).
• DELETE vs. TRUNCATE: DELETE is DML (logged per row, can have a WHERE
clause). TRUNCATE is DDL (drops data pages, faster, cannot be filtered).
• Precedence of AND/OR: AND has higher precedence than OR. Always use
parentheses to ensure logic like (A OR B) AND C works as intended.
• Counting NULLs: COUNT(column_name) counts only non-null values. COUNT(*)
counts every row regardless of content.
• WHERE vs. HAVING: WHERE filters raw rows before aggregation. HAVING filters
grouped results after aggregation.
MCQs
Q1. (Code-Tracing) A developer executes the following query on a Products table. What
will be the final calculated value in the discounted_price column for a product with a
price of 100 and a tax of 10?
mysql - query_editor
SELECT name,
price + tax * 0.90 AS discounted_price
FROM Products;
A) 99.00
B) 90.00
C) 109.00
D) 110.00
Operator precedence applies in SQL. Multiplication (*) happens before Addition (+).
Calculation: 100 + (10 * 0.90) = 100 + 9 = 109. To apply the discount to the total,
parentheses would be needed: (price + tax) * 0.90.
Q2. (Conceptual) Which of the following SQL keywords is strictly required to remove
duplicate rows from the result set of a SELECT query?
A) UNIQUE
B) DISTINCT
C) GROUP BY
D) ALL
The DISTINCT keyword follows SELECT to ensure that only unique combinations of values
are returned in the output. While GROUP BY can achieve similar results, DISTINCT is the
standard for duplicate removal in DQL.
Q3. (Scenario-Based) A marketing database contains a Customers table. You need to find
all customers whose names start with 'S' and end with 'n' (e.g., 'Steven', 'Susan'). Which
WHERE clause is correct?
The LIKE operator uses % as a wildcard for any sequence of characters. 'S%n' means 'S'
followed by anything, ending with 'n'. 'S_n' would only match three-letter names like 'Sun'.
Q4. (Calculation) A table Survey has a column feedback. There are 10 rows total: 7 rows
have text comments, and 3 rows have NULL values. What will be the result of SELECT
COUNT(feedback) FROM Survey;?
A) 10
B) 7
C) 3
D) 0
Aggregate functions like COUNT(column_name) ignore NULL values and only count rows
where data is present. COUNT(*) would have returned 10.
Q5. (Code-Tracing) Observe the following logical conditions in a WHERE clause. Which
rows will be returned?
sql-logic-test
A) Rows 2 and 3
B) Rows 1 and 3
C) Only Row 3
D) All 3 Rows
AND has higher precedence than OR. The query is interpreted as: (category = 'A') OR
(category = 'B' AND stock > 60).
Row 1: True OR (False) = True.
Row 2: False OR (True AND False) = False.
Row 3: True OR (False) = True.
Q6. (Scenario-Based) A database administrator needs to change the data type of the
phone_number column from INT to VARCHAR(20) in an existing Contacts table. Which
SQL command should be used?
A) UPDATE Contacts SET phone_number = VARCHAR(20);
B) MODIFY TABLE Contacts phone_number VARCHAR(20);
C) ALTER TABLE Contacts ALTER COLUMN phone_number VARCHAR(20);
D) CHANGE Contacts phone_number TO VARCHAR(20);
The ALTER TABLE command is used to modify the structure of an existing table (DDL). The
specific sub-syntax varies slightly by engine (e.g., MODIFY in MySQL, ALTER COLUMN in
PostgreSQL/SQL Server), but ALTER TABLE is the root command.
Q7. (Conceptual) Why is the TRUNCATE command generally much faster than the DELETE
command for removing all rows from a large table?
TRUNCATE is a DDL operation that records the deallocation of data pages in the transaction
log rather than logging every individual row deletion. This drastically reduces I/O and
overhead.
Q8. (Code-Tracing) What is the result of the following query if the salary column contains
a NULL for the employee 'Bob'?
sql-null-trap
In SQL, NULL compared to any value (even NULL itself) using standard operators (=, <>, >)
results in UNKNOWN. Since the WHERE clause only returns rows where the condition is TRUE,
Bob's row is filtered out because NULL <> 5000 is UNKNOWN.
Q9. (Calculation) You need to sort a list of students by grade (highest first) and then by
last_name alphabetically for students with the same grade. Which clause is correct?
DESC applies only to the attribute immediately preceding it. To sort grade descending and
last_name ascending, we specify grade DESC followed by last_name (ASC is the default).
Q10. (Scenario-Based) A banking application needs to find all accounts that do not have a
secondary_owner assigned. Which SQL condition is mathematically correct?
NULL represents the absence of a value. It cannot be checked using = because NULL =
NULL is unknown. The special operator IS NULL is provided for this purpose.
Q11. (Conceptual) In SQL, which of the following is true about the BETWEEN operator (e.g.,
WHERE price BETWEEN 10 AND 20)?
The BETWEEN operator is shorthand for val >= low AND val <= high. It is always
inclusive of the boundary values provided.
Q12. (Code-Tracing) Trace the output of this update operation. How many rows are
affected?
postgres - update_test
A) 1 row
B) 2 rows
C) 3 rows
D) 4 rows
Q13. (Scenario-Based) You need to add a new employee to the database, but you only
know their name and department. The salary should be left for later. Which SQL syntax
is correct?
When only providing some columns, you must list the column names in parentheses after the
table name. If the columns are omitted, SQL expects values for every column in the table in
their defined order.
Q14. (Calculation) What is the result of SELECT AVG(val) FROM T; if table T has
values: 10, 20, NULL, 30?
A) 15
B) 20
C) 22.5
D) NULL
The AVG function ignores NULLs. The calculation is (10 + 20 + 30) / 3 (three non-null
values). 60 / 3 = 20.
Q15. (Conceptual) The AS keyword in SQL is primarily used for:
AS is the alias keyword. It allows you to rename output columns for clarity (e.g., SELECT
price * tax AS total_cost) or rename tables for brevity in complex joins.
Q16. (Code-Tracing) Which query returns exactly the same result as: SELECT * FROM
Items WHERE id IN (1, 2, 3);?
The IN operator is a logical shorthand for multiple OR conditions on the same attribute. The
AND option (A) would return nothing because an id cannot be three different values
simultaneously.
Q17. (Scenario-Based) An HR database needs to delete all employees who belong to the
'Temporary' department. Which statement is correct?
DELETE is the standard DML command for removing specific rows based on a condition.
TRUNCATE and DROP are DDL and do not support WHERE clauses.
Q18. (Calculation) If a table has 100 rows and you execute SELECT DISTINCT country
FROM Users;, and 40 users are from 'USA', 40 are from 'UK', and 20 are from 'Germany',
how many rows are in the result?
A) 100
B) 20
C) 3
D) 1
The DISTINCT keyword removes all duplicate values. Since there are only 3 unique strings in
the country column ('USA', 'UK', 'Germany'), only 3 rows are returned.
mysql-terminal
ALTER TABLE ... DROP COLUMN is a DDL command that modifies the schema by
permanently deleting a column and all data stored within it.
Q20. (Conceptual) Which of the following is a key requirement for the UNION operator to
work between two SELECT statements?
For UNION to be valid, the two result sets must be "Union Compatible." This means the
number of columns and their corresponding data types must match so the rows can be
merged into a single set.
Deep Dive: SQL Execution Order vs. Syntax Order
One of the most confusing parts of SQL for developers is that the engine executes
clauses in a completely different order than you write them. While you write SELECT
first, it is actually one of the last things the engine processes.
• 1. FROM & JOIN: The engine first identifies the source tables and performs any
joins to create a "working set" of data.
• 2. WHERE: The engine filters the working set based on the row-level conditions.
• 3. GROUP BY: The remaining rows are grouped together based on shared values in
the specified columns.
• 4. HAVING: The aggregate groups are filtered (e.g., "only show groups where
SUM(sales) > 1000").
• 5. SELECT: The engine finally decides which columns to project. This is where
DISTINCT and AS aliases are applied.
• 6. ORDER BY: The final result set is sorted.
Why this matters: This is why you cannot use a column alias defined in the SELECT
clause inside a WHERE clause (because the WHERE happens before the SELECT), but
you can use it in the ORDER BY clause.
• Wikipedia: SQL
• W3Schools: SQL Tutorial
Quick Notes
• Joins:
• JOIN vs. WHERE Filtering: Putting a filter in the ON clause of a LEFT JOIN
behaves differently than putting it in the WHERE clause. Filtering in WHERE can
accidentally turn a LEFT JOIN into an INNER JOIN.
• UNION vs. UNION ALL: UNION performs a costly deduplication step. Use UNION
ALL if you know the sets are disjoint or if you specifically want duplicates for
performance reasons.
• RANK() vs. DENSE_RANK(): RANK() leaves gaps in numbering after ties (1, 2, 2,
4). DENSE_RANK() does not leave gaps (1, 2, 2, 3).
• IN vs. EXISTS: EXISTS is generally more efficient for subqueries because it stops
searching as soon as it finds a single match (Short-circuiting).
MCQs
Q1. (Code-Tracing) Analyze the following tables and the query below. What will be the
result of the query?
sql-join-challenge
A) ('Alice', 'NY')
B) ('Alice', 'NY'), ('Bob', 'LA')
C) ('Alice', 'NY'), ('Bob', NULL)
D) ('Alice', 'NY'), (NULL, 'LA')
A LEFT JOIN keeps all rows from the left table (Users). Alice has a match (NY). Bob has no
match in Profiles, so his city is returned as NULL. The row for 'LA' in Profiles is ignored
because it has no match in the left table.
Q2. (Scenario-Based) You have an Employees table with columns emp_id, name, and
manager_id. Which SQL technique is required to generate a list of all employees
alongside the name of their direct manager in a single result set?
A) It is a subquery that executes once and stores the result in a temporary table.
B) It is a subquery that depends on values from the outer query, executing once for
every row processed by the outer query.
C) It is a subquery that uses the UNION operator.
D) It is a subquery placed specifically in the FROM clause.
A correlated subquery refers to columns in the outer query's tables. Because its results
depend on the specific row being processed by the outer query, the engine must re-evaluate
the subquery for each row, which can impact performance.
Q4. (Code-Tracing) What is the purpose of the WITH clause in the following SQL snippet?
postgres - cte_logic
WITH RegionalSales AS (
SELECT region, SUM(amount) AS total
FROM Orders GROUP BY region
)
SELECT region FROM RegionalSales
WHERE total > (SELECT AVG(total) FROM RegionalSales);
The WITH clause defines a CTE named RegionalSales. This allows the developer to treat
the aggregated result like a temporary table within the scope of a single query, making it
easier to reference the result multiple times (as seen in the SELECT and the subquery).
Q5. (Calculation) Table A has 5 rows and Table B has 10 rows. If you perform a CROSS
JOIN between A and B, but the WHERE clause contains a condition that is always false (e.g.,
WHERE 1=0), how many rows will be returned?
A) 50
B) 15
C) 5
D) 0
A CROSS JOIN (Cartesian Product) initially produces 5 * 10 = 50 rows. However, the WHERE
clause is applied to the result set. If the condition is always false, all 50 rows are filtered out,
resulting in an empty set.
Q6. (Scenario-Based) A bank wants a list of customers who have a "Savings Account" but
do not have a "Credit Card". Both accounts are stored in the Accounts(cust_id,
type) table. Which set operator is most appropriate?
A) UNION
B) INTERSECT
C) EXCEPT (or MINUS)
D) UNION ALL
The EXCEPT operator returns all distinct rows from the first query that are not present in the
second query. By subtracting the set of 'Credit Card' owners from the set of 'Savings' owners,
we get the desired list.
Q7. (Code-Tracing) Examine the window function below. If three employees have the exact
same salary and are tied for the highest pay, what will the RANK() function return for them?
sql-window-functions
A) 1, 2, 3
B) 1, 1, 1 (and the next employee will be 4)
C) 1, 1, 1 (and the next employee will be 2)
D) NULL
RANK() assigns the same rank to ties. However, it leaves a "gap" in the sequence. If three
people are ranked 1, the 4th person receives rank 4. (If DENSE_RANK() were used, the 4th
person would be rank 2).
Q8. (Comparison) What is the primary operational difference between a standard VIEW
and a MATERIALIZED VIEW?
Standard Views are just "saved queries" that run every time they are called. Materialized
Views save the data result of the query. This makes them much faster for heavy aggregations,
though they must be "refreshed" when the underlying data changes.
Q9. (Calculation) Table T1 has 3 rows with values {1, 2, NULL}. Table T2 has 3 rows with
{1, NULL, NULL}. How many rows are returned by an INNER JOIN on these tables using
[Link] = [Link]?
A) 3
B) 2
C) 1
D) 0
In SQL joins, NULL = NULL is Unknown, not True. Therefore, NULL values never match
each other in a join. Only the value 1 matches between the two tables, resulting in exactly one
row.
Q10. (Code-Tracing) What does the EXISTS keyword do in the following query?
mysql - subquery_test
EXISTS returns true if the subquery returns *any* rows. It doesn't matter what columns are
selected (which is why SELECT 1 is common). It effectively finds departments that meet the
criteria.
Q11. (Scenario-Based) Which of the following is the most standard SQL approach to find
the second highest salary in an Employees table without using window functions?
Option B is standard SQL. It first finds the absolute maximum salary in a subquery, then finds
the maximum salary from the set of salaries that are strictly less than that absolute maximum.
(Option C is MySQL specific).
Q12. (Comparison) When should a developer prefer UNION ALL over UNION?
A) When they need to ensure the final result set contains no duplicates.
B) When the developer knows the two sets are already disjoint or when
performance is more important than removing duplicates.
C) When the column names in the two tables are different.
D) When the query contains a GROUP BY clause.
UNION performs a sort and unique operation to remove duplicates, which is CPU and memory
intensive. UNION ALL simply appends the two sets, making it significantly faster if
deduplication isn't required.
Q13. (Code-Tracing) Analyze the use of the CASE expression. What is returned for a
product with stock = 0?
sql-logic - case_expression
SELECT name,
CASE
WHEN stock > 10 THEN 'In Stock'
WHEN stock > 0 THEN 'Low Stock'
ELSE 'Out of Stock'
END as availability
FROM Products;
A) Low Stock
B) In Stock
C) Out of Stock
D) NULL
SQL evaluates CASE conditions in order. Since 0 is not > 10 and not > 0, it falls through to the
ELSE block, returning 'Out of Stock'.
Most joins are "Equijoins" (using =). A Non-equijoin uses other logic, such as joining an
employee to a Salary_Grades table where the salary is BETWEEN the low and high bounds
of the grade.
Q15. (Scenario-Based) You have a table with millions of duplicate rows due to a bug. You
want to delete duplicates but keep exactly one instance of each record. Which approach is
most efficient?
Q16. (Code-Tracing) What is the result of the HAVING clause in this query?
sql-aggregation-filter
HAVING filters groups after the GROUP BY has been processed. Here, it calculates the count
of each department group and only allows the department to appear in the final output if its
size is > 5.
Q17. (Calculation) If Relation R(A, B, C) and Relation S(C, D, E) are joined using a
NATURAL JOIN, what will be the attributes (Degree) of the resulting relation?
A) {A, B, C, C, D, E} (Degree 6)
B) {A, B, C, D, E} (Degree 5)
C) {C} (Degree 1)
D) {A, B, D, E} (Degree 4)
A Natural Join automatically joins on columns with the same name (C) and removes the
duplicate column from the output. The resulting attributes are the union of the two sets: A, B,
C, D, and E.
Q18. (Conceptual) What is a Scalar Subquery?
A scalar subquery is a query that returns a single value. Because it returns only one value, it
can be used in places where a single constant is expected, such as in the SELECT list or as a
comparison in a WHERE clause.
Q19. (Scenario-Based) An e-commerce company wants to keep an audit trail. Every time a
price is updated in the Products table, the old price and the time of change must be saved
into a Price_History table automatically. Which SQL object is designed for this?
A) Stored Procedure
B) Constraint
C) Trigger
D) Index
A Trigger is a special type of stored procedure that automatically runs ("fires") when a specific
event occurs in the database, such as an UPDATE, INSERT, or DELETE.
Q20. (Code-Tracing) Trace the logic of this three-table join. Which customers will appear in
the output?
sql-triple-join
SELECT [Link]
FROM Customers c
JOIN Orders o ON [Link] = o.cust_id
JOIN OrderItems i ON [Link] = i.order_id
WHERE i.product_name = 'Laptop';
When you write a JOIN, the Database Engine chooses between three primary physical
algorithms to execute it:
• Nested Loop Join: For every row in Table A, it scans Table B. Excellent for small
tables or when the join column is indexed.
• Hash Join: The engine creates a hash table in memory for the smaller table and
then probes it with the larger table. Very fast for large, unsorted datasets.
• Sort-Merge Join: Both tables are sorted by the join key, and then the engine "zips"
them together. Most efficient when both tables are already sorted or have clustered
indexes on the join column.
The "SARGable" rule: To ensure joins and filters are fast, queries must be "Search
ARGumentable." If you use a function on a join column (e.g., ON UPPER([Link]) =
[Link]), the database cannot use an index on [Link], forcing a slow full table scan.
Always try to keep columns "naked" in join conditions.
• Inner Join: Intersection. Left Join: Left side + NULLs for missing right.
• Self Join: Table joined to itself (Aliases mandatory).
• EXISTS: Returns True/False, stops early (Efficient).
• CTEs: Readable temporary result sets (WITH).
• RANK(): Gaps in ties. DENSE_RANK(): No gaps in ties.
• UNION ALL: Faster than UNION (no deduplication).
• NULL in Joins: NULL never equals NULL.
Quick Notes
• Functional Dependency (FD): X → Y means if two tuples have the same X value,
they must have the same Y value.
• 1NF (First Normal Form): All attribute values must be atomic (no multi-valued or
composite attributes).
• 2NF (Second Normal Form): Must be in 1NF and have No Partial Dependencies
(every non-prime attribute must depend on the whole primary key, not just part of it).
• 3NF (Third Normal Form): Must be in 2NF and have No Transitive Dependencies
(non-prime attributes should not depend on other non-prime attributes).
• BCNF (Boyce-Codd Normal Form): A stricter version of 3NF. For every FD X → Y,
X must be a super key.
• Decomposition: Splitting a table to remove anomalies. It must be Lossless-Join
(you can reconstruct the original table) and ideally Dependency Preserving.
• 2NF vs 3NF: Partial dependencies (2NF) only exist when you have a composite
primary key. If your PK is a single column, the table is automatically in 2NF.
• Prime vs Non-Prime: A prime attribute is part of any candidate key. A non-prime
attribute is part of no candidate key.
• Over-Normalization: While higher normal forms reduce redundancy, they increase
the number of JOINs required, which can degrade read performance in high-traffic
systems.
• Lossy Decomposition: If you split a table incorrectly, joining them back might
create "spurious tuples" (extra rows that weren't in the original data), making the
data invalid.
MCQs
A) Insertion Anomaly
B) Deletion Anomaly
C) Update Anomaly
D) Referential Integrity Violation
An Update Anomaly occurs when data is redundant, and changing a value in one place
requires changes in multiple rows. If the fee were stored in a separate Courses table linked
by course_id, only one update would be needed.
Q2. (Code-Tracing) Examine the following SQL table definition. Why does this table violate
1NF (First Normal Form)?
sql-schema-audit
A) {A, B}
B) {A, B, C}
C) {A, B, C, D}
D) {A}
Attribute closure is the set of all attributes that can be determined by a given set.
1. Start with {A}.
2. A → B, so {A, B}.
3. B → C, so {A, B, C}.
4. C → D, so {A, B, C, D}.
A) 1NF
B) 2NF
C) 3NF
D) BCNF
A) 1NF
B) 2NF
C) 3NF
D) BCNF
The table is in 2NF because the PK is a single column (no partial dependencies). However, it
violates 3NF because staff_id → dept_id → dept_manager_id is a Transitive
Dependency. dept_manager_id (non-prime) depends on dept_id (non-prime).
Q7. (Comparison) What is the key difference between 3NF and BCNF?
BCNF is more restrictive than 3NF. In 3NF, the FD X → A is allowed if A is a prime attribute. In
BCNF, X MUST be a super key, regardless of whether A is prime or not.
Q8. (Calculation) Given the relation R(A, B, C) and FDs: {A → B, B → A, B → C}. Which of
the following is a Candidate Key?
A) Only {A}
B) Only {B}
C) Both {A} and {B}
D) {A, B}
{A}+ = {A, B, C} and {B}+ = {B, A, C}. Since both A and B can uniquely identify all attributes in
the relation, they are both candidate keys. {A, B} is a super key but not a candidate key
because it isn't minimal.
A) R1 ∩ R2 = empty set.
B) R1 ∪ R2 = R.
C) R1 ∩ R2 is a candidate key for either R1 or R2.
D) The number of rows in R1 plus R2 equals the rows in R.
Lossless-join decomposition ensures that when you join R1 and R2 back together, you get
exactly the original relation R. This is mathematically guaranteed if the common attributes (the
intersection) form a candidate key in at least one of the decomposed tables.
Q10. (Scenario-Based) In an Orders table, if deleting a single customer's last order also
results in the database losing all of that customer's contact information (because it was only
stored in the order record), what anomaly has occurred?
A) Insertion Anomaly
B) Deletion Anomaly
C) Update Anomaly
D) Integrity Anomaly
A Deletion Anomaly occurs when the deletion of data representing one real-world fact (an
order) unintentionally results in the loss of data representing a completely different fact
(customer contact info) because the table is poorly designed.
A) {P, R, Q}
B) {P, R, Q, S}
C) {P, R, Q, S, T}
D) {P, R}
Q12. (Code-Tracing) You are auditing a database and find the following schema. Which
normal form does it currently violate?
sql-schema-check
A) 1NF
B) 2NF
C) 3NF
D) BCNF
This violates 2NF because of partial dependencies. item_name depends only on item_id,
and warehouse_location depends only on warehouse_id. Both non-prime attributes
depend on only part of the composite primary key.
In BCNF, every determinant must be a super key. In the FD Email → CourseID, Email is
the determinant. Since Email alone cannot uniquely identify a row (a student can have many
courses), it is not a super key. Therefore, the table violates BCNF.
Q15. (Calculation) How many candidate keys exist for R(A, B, C) if the FDs are {A → B, B
→ C, C → A}?
A) 1
B) 2
C) 3
D) 0
{A}+ = {A, B, C}. {B}+ = {B, C, A}. {C}+ = {C, A, B}. Since each single attribute can determine
the entire relation, A, B, and C are each separate candidate keys.
MVD (X ↠ Y) occurs when a value of X is associated with a set of values for Y, and this set is
independent of any other attributes. 4NF removes these dependencies to prevent redundant
rows.
Q17. (Scenario-Based) A developer refuses to normalize a table because they claim "the
data is small and JOINs are slow." Which risk are they primarily ignoring?
A) Disk space exhaustion.
B) SQL syntax complexity.
C) Data inconsistency and anomalies during future data entry.
D) Hardware compatibility.
A) It is in 4NF.
B) It is in 3NF.
C) It has no composite keys.
D) It has only one candidate key.
The normal forms are a hierarchy. BCNF is stricter than 3NF. Therefore, any relation that
satisfies BCNF automatically satisfies 3NF, 2NF, and 1NF.
Q19. (Conceptual) In database normalization theory, what is the "Minimal Cover" (or
Canonical Cover)?
The Minimal Cover is used during the normalization process to ensure that we are working
with the simplest possible representation of data relationships, removing any FDs that can be
inferred from others.
Q20. (Code-Tracing) You are tasked with fixing a 2NF violation in a schema. What is the
standard SQL procedure to achieve this?
If the intersection of R1 and R2 is NOT a key for either, the join will produce Spurious
Tuples. These are rows that were not in the original database. For example, if you split
a table of Employees and Departments by only using "Location" as a link, and two
different departments are in the same location, the join will incorrectly associate every
employee in that location with both departments. Normalization prevents this by
ensuring tables are only linked via Candidate Keys (Foreign Key → Primary Key
relationships).
Quick Notes
• Transaction: A single logical unit of work that accesses and possibly modifies the
contents of a database.
• ACID Properties:
◦ Atomicity: All operations of the transaction are reflected in the database, or none
are (All or Nothing).
◦ Consistency: Execution of a transaction in isolation preserves the consistency of
the database (Valid state to Valid state).
◦ Isolation: Transactions are hidden from each other until they are finished
(Concurrency control).
◦ Durability: Once a transaction is committed, changes persist even in the event of
a system failure.
MCQs
A) Consistency
B) Isolation
C) Atomicity
D) Durability
Atomicity ensures "All or Nothing." Since the entire purchase "unit of work" was not completed
(the shipping label wasn't logged), the DBMS rolls back all partial changes to ensure no partial
data exists.
Q2. (Code-Tracing) Analyze the following transaction block. What will be the final value of
balance in the database for id = 1?
sql-transaction-trace
A) 500
B) 700
C) 400
D) 1000
1. Start: 1000.
2. Subtract 200: 800.
3. Savepoint sp1.
4. Subtract 300: 500.
5. Rollback to sp1: Reverts to 800.
6. Subtract 100: 700.
7. Commit: 700 is saved.
Q3. (Conceptual) In the standard transaction state diagram, what is the critical difference
between the Partially Committed state and the Committed state?
A transaction is "Partially Committed" when the code finishes. It is "Committed" only when the
DBMS confirms that the recovery logs are safely on the disk. If a crash happens while
"Partially Committed," the transaction must be rolled back.
Q4. (Scenario-Based) A developer writes a transaction that transfers 500 from Account A to
Account B. If the developer accidentally writes code that adds 500 to Account B but only
subtracts 400 from Account A, the database will finish in an invalid state. This is a violation
of which property?
A) Atomicity
B) Consistency
C) Isolation
D) Durability
Consistency is a "semantic" property. The total sum of money in the bank must remain
constant. If the logic of the transaction is flawed, the DBMS cannot always detect it (unless
constraints are set); it is the programmer's responsibility to ensure the transaction moves the
DB from one valid state to another.
Q5. (Code-Tracing) What is the result of the following psql session?
postgres-session
BEGIN;
INSERT INTO logs VALUES ('Action 1');
DELETE FROM logs WHERE msg = 'Action 1';
ROLLBACK;
SELECT COUNT(*) FROM logs;
A) 1
B) 0 (assuming table was empty initially)
C) Error: Cannot delete what was just inserted.
D) NULL
A ROLLBACK undoes every operation performed since the BEGIN command. Even though an
insert and delete happened, the rollback wipes the slate clean as if the transaction never
started.
Q6. (Comparison) How does Durability typically handle data that is currently in the DB's
RAM buffer but has not yet been written to the data file when a crash occurs?
Durability is achieved by the Log Manager. Before any data is updated in the "lazy" buffer, a
record of the change is written sequentially to a log file on disk. On restart, the DBMS scans
this log to re-apply any committed changes that didn't make it to the main data files.
A) Partially Committed
B) Aborted
C) Failed
D) Terminated
A transaction enters the Failed state as soon as it is discovered that normal execution can no
longer proceed. Once the DBMS finishes the rollback of partial changes, the state changes to
Aborted.
A) Durability
B) Consistency
C) Isolation
D) Atomicity
Isolation ensures that concurrent transactions do not see each other's intermediate or
uncommitted states. This prevents "Dirty Reads" and ensures data looks consistent to every
user.
recovery-log-viewer
A) T1 and T2
B) Only T1
C) Only T2
D) Neither
T1 committed before the crash, so its changes must be permanent (Redo). T2 was active but
did not commit, so its changes must be undone (Undo). Checkpoints help the DBMS know
that everything before T2 was definitely on disk, but committed transactions since the
checkpoint still need re-applying.
A) It is rejected.
B) It stays in the "Active" state until the user logs out.
C) It is treated as an "Autocommit" transaction (Implicit Transaction).
D) It is only saved if the user types 'COMMIT' manually later.
Standard SQL operates in "Autocommit" mode by default. Every single statement is wrapped
in its own implicit transaction that commits immediately upon successful execution.
Q11. (Conceptual) Which database component is primarily responsible for ensuring the
Atomicity and Durability of transactions?
A) Query Optimizer
B) Recovery Manager (Log Manager)
C) Concurrency Controller
D) Schema Compiler
The Recovery Manager uses logs (Undo logs for Atomicity and Redo logs for Durability) to
ensure that transactions are either fully completed or fully undone, even across system
crashes.
Q12. (Comparison) What is the advantage of using Shadow Paging over Log-based
recovery?
Shadow Paging maintains a "Current" page table and a "Shadow" page table. Changes are
made to a copy; if the transaction commits, the pointer switches to the copy. It avoids log
processing but suffers from data fragmentation and is rarely used in modern massive RDBMS.
Q13. (Scenario-Based) An application requires that if a transaction is committed on a
database in the USA, a synchronized backup database in Singapore must also have that
data permanently stored before the transaction is acknowledged as "Success" to the user.
This is an example of:
A) Local Durability
B) Synchronous Distributed Durability
C) Isolation Level 4
D) Lazy Writing
Distributed transactions must ensure Durability across all nodes. Synchronous durability
means the "Commit" doesn't return to the user until all replicas have confirmed the write to
their respective non-volatile storages.
Durability dictates that once the system acknowledges a success (Step 3), the data change is
permanent. The DB must have flushed the log to disk *before* it sent that success message.
A) Atomicity
B) Consistency
C) Isolation
D) Persistence
The 'C' in ACID stands for Consistency. A transaction must leave the database in a consistent
state, meaning all structural and logical rules (integrity constraints) are followed.
• Steal: The Buffer Manager is allowed to "steal" a page from an active, uncommitted
transaction and write it to disk to make room for other data. This requires an Undo
Log to reverse the change if the transaction later aborts.
• No-Force: The DBMS is NOT required to "force" every modified page to disk
exactly when a transaction commits. This is much faster as it allows grouped disk
writes. This requires a Redo Log to re-play the changes if a crash happens while
the data was only in RAM.
Together, this is handled by the Write-Ahead Log (WAL) protocol: Never write a data
page to disk before the corresponding log record is safely stored in non-volatile
memory.
Quick Notes
◦ Lost Update: Two transactions update the same data; the second overwrites the
first.
◦ Dirty Read: A transaction reads data updated by an uncommitted transaction.
◦ Non-Repeatable Read: A transaction reads the same row twice and gets different
values because another transaction committed an update.
◦ Phantom Read: A transaction executes a query twice and gets a different number
of rows because another transaction inserted/deleted rows.
• Locking Protocols: Shared (S) locks for reading, Exclusive (X) locks for writing.
◦ Two-Phase Locking (2PL): Growing Phase (acquiring locks) and Shrinking Phase
(releasing locks). Prevents conflict serializability issues but NOT deadlocks.
◦ Strict 2PL: All Exclusive locks are held until the transaction commits or aborts.
Prevents cascading rollbacks.
• Deadlock vs. Starvation: A deadlock is a circular wait where no one can move.
Starvation is where one transaction is repeatedly passed over and waits indefinitely
while others proceed.
• 2PL does NOT mean No Deadlocks: A common exam trap is assuming 2PL
prevents deadlocks. In fact, 2PL can easily lead to deadlocks. Protocols like "Wait-
Die" or "Wound-Wait" are needed to prevent them.
• Isolation Levels: READ COMMITTED is the default in many DBs (like Postgres),
which prevents Dirty Reads but still allows Non-Repeatable and Phantom reads.
MCQs
Q1. (Scenario-Based) User A and User B are both trying to buy the last ticket for a concert.
1. User A reads the ticket count (Count = 1). 2. User B reads the ticket count (Count = 1). 3.
User A decrements the count and writes (Count = 0). 4. User B decrements the count and
writes (Count = 0). Both users receive a "Success" message, but only one physical ticket
exists. Which concurrency problem occurred?
A) Dirty Read
B) Phantom Read
C) Lost Update
D) Non-Repeatable Read
This is a classic Lost Update. User B's update was based on a stale version of the data
(Count=1), and when User B wrote the result, it effectively erased the fact that User A had
already "taken" the ticket.
Q2. (Code-Tracing) Analyze the following SQL execution sequence between two sessions.
What will Transaction 1 see at Time T3?
sql-isolation-test
-- Transaction 2
T2: UPDATE accounts SET balance = 1000 WHERE id = 1;
T2: COMMIT;
-- Transaction 1 again
T3: SELECT balance FROM accounts WHERE id = 1;
A) 500
B) 1000
C) NULL
D) Error: Transaction must be restarted.
In READ COMMITTED isolation, a transaction sees any data committed by others as soon as it
happens. Between T1 and T3, Transaction 2 committed an update, so T1 sees the new value
(1000). This is a "Non-Repeatable Read," which is permitted in this isolation level.
Q4. (Scenario-Based) A transaction follows the Basic Two-Phase Locking (2PL) protocol.
It has just released a shared lock on Table A. Can it now acquire an exclusive lock on Table
B?
Under 2PL, a transaction cannot acquire any new locks once it has released its first lock.
Releasing a lock marks the end of the "Growing Phase" and the start of the "Shrinking
Phase."
Q5. (Calculation) In a database using the Wait-Die scheme for deadlock prevention,
Transaction T1 (Timestamp 10) requests a lock held by Transaction T2 (Timestamp 20).
What happens?
SELECT ... FOR UPDATE is used for "Pessimistic Locking." It places an Exclusive (X) lock
on the rows. While a normal UPDATE also locks the row, FOR UPDATE is specifically used to
lock the row during a SELECT so that the application can calculate a change before writing it
back.
Q7. (Conceptual) What is the primary benefit of Strict Two-Phase Locking (Strict 2PL)
over Basic 2PL?
Strict 2PL holds all exclusive locks until the transaction commits or aborts. This ensures that
no other transaction can read "dirty" uncommitted data. If a transaction fails, it can be rolled
back without needing to roll back any other transactions that might have read its data.
A) Read Phase
B) Validation Phase
C) Write Phase
D) Recovery Phase
OCC has three phases: 1. Read (work on a local copy), 2. Validation (check if any other
transaction modified the data in the meantime), and 3. Write (apply changes to the DB). The
check happens during the Validation Phase.
Q9. (Calculation) If there are 3 transactions (T1, T2, T3) executing concurrently, how many
possible Serial Schedules exist?
A) 3
B) 6
C) 9
D) 27
A serial schedule executes transactions one after another in some order. For $n$ transactions,
there are $n!$ possible serial schedules. $3! = 3 \times 2 \times 1 = 6$.
Q10. (Conceptual) The Thomas Write Rule is an optimization of the Basic Timestamp
Ordering protocol. What does it do when a transaction Ti tries to write to a data item that
already has a Write-Timestamp newer than Ti's own timestamp?
The Thomas Write Rule handles out-of-order writes. If a newer transaction has already written
a value, an older transaction's write is "obsolete." Instead of aborting the older transaction, we
simply ignore its write, as the newer value is already the "current" one.
Q11. (Comparison) In the context of database locking, what is the difference between
Deadlock and Starvation?
A) Dirty Read
B) Phantom Read
C) Inconsistent Analysis (or Fuzzy Read)
D) Lost Update
Inconsistent Analysis occurs when a transaction (usually an aggregation) reads a partial state
of the database while other updates are occurring, leading to a result that never existed in any
single consistent version of the database.
Q13. (Calculation) Which of the following lock requests would be GRANTED if the data
item is currently held under a Shared (S) Lock by another transaction?
Shared locks are compatible with other Shared locks. Multiple transactions can read the same
data simultaneously. Exclusive locks, however, are incompatible with any other locks.
Q15. (Comparison) When comparing Pessimistic and Optimistic locking, which is more
suitable for a system with high data contention (many users trying to update the same
rows)?
In high contention, Optimistic locking would result in too many transaction restarts (aborts
during validation), wasting CPU. Pessimistic locking (using actual locks) is better because it
forces transactions to wait in a queue rather than constantly failing and retrying.
• For every transaction $T_i$ and $T_j$, an edge $T_i \rightarrow T_j$ exists if:
• If this graph contains any cycle (e.g., $T1 \rightarrow T2 \rightarrow T1$), the
schedule is not conflict serializable.
• View Serializability: A schedule might be View Serializable even if it's not Conflict
Serializable. View serializability is harder to check (it's an NP-Complete problem)
and focuses on ensuring each transaction reads the same values it would in a serial
schedule. All conflict-serializable schedules are also view-serializable.
Stable Web Resources
Quick Notes
• Recovery Goal: To restore the database to the most recent consistent state that
existed before a failure occurred.
• Log-Based Recovery: Records all database modifications in a stable log.
◦ Deferred Update: Database is not updated until the transaction commits. Only
Redo is needed.
◦ Immediate Update: Database is updated as the transaction executes. Both Undo
and Redo are needed.
• Write-Ahead Logging (WAL): The fundamental rule that log records representing a
change must be flushed to stable storage *before* the actual data page is written to
disk.
• Checkpoint: A point in the log where the DBMS ensures all buffers are flushed to
disk. It drastically shortens the recovery time by limiting how far back the log must
be scanned.
• Idempotency: Recovery operations (Redo/Undo) must be idempotent; executing
them multiple times must result in the same state as executing them once.
• Undo vs. Redo Logic: If a transaction has a [Start] but no [Commit] or [Abort] in the
log at the time of crash, it must be Undone. If it has both [Start] and [Commit], it
must be Redone.
• Checkpoint Fallacy: Many students think a checkpoint stops all transactions.
Modern databases use "Fuzzy Checkpoints" which allow transactions to continue
running while dirty pages are being flushed to disk.
• Storage Failure: Recovery management usually refers to "System Failures" (crash/
power loss). If the physical disk fails ("Media Failure"), the only recovery path is
restoring from a Backup and then replaying the logs.
MCQs
db-recovery-log
T1 has a [Commit] entry, meaning its changes were logically completed and must be re-
applied to the data files (Redo). T2 was active but did not commit before the crash; its partial
changes must be reversed (Undo) to maintain atomicity.
A) The item will have the new value, and the recovery manager must undo it.
B) The item will still have the old value because no writes were "forced" to the data
files yet.
C) The item will be corrupted.
D) The database will automatically delete the table.
Deferred update ensures that no changes are written to the actual database files until the
transaction reaches its commit point. If a crash occurs before commit, the data on disk is still
in its original state, so no "Undo" is necessary. Recovery only involves re-applying (Redoing)
transactions that successfully committed but whose changes were only in the RAM buffer.
WAL is the golden rule of recovery. If the database page were written first and the system
crashed before the log was written, the recovery manager would have no way to "Undo" that
change, violating atomicity.
A checkpoint guarantees that all transactions that committed *before* the checkpoint have
their changes safely on disk. The recovery manager only needs to process log records from
the checkpoint forward, though it must also look back to find the [Start] records of transactions
that were still active during the checkpoint.
Q5. (Scenario-Based) A data center experiences a catastrophic fire that physically destroys
the primary database hard drives. To recover the data, which of the following is strictly
required?
Recovery from "Media Failure" is different from "System Failure." Since the physical disks are
gone, the internal logs on those disks are also gone. Recovery requires restoring the most
recent off-site backup and then "Rolling Forward" by replaying all transaction logs generated
since that backup.
Q6. (Conceptual) Why must the Redo and Undo operations in a recovery algorithm be
Idempotent?
If the system crashes *during* recovery, the recovery process will start again from the last
checkpoint. This means some "Redo" operations might be performed a second time. If they
weren't idempotent (e.g., "Add 5 to Balance" is not idempotent, whereas "Set Balance to 500"
is), the data would become incorrect.
Q7. (Comparison) What is the primary difference between a System Crash and an
Aborted Transaction from a recovery perspective?
When a single transaction aborts (e.g., due to a constraint violation), the DBMS simply uses
the log to "Undo" that specific transaction. When a system crash occurs, the entire state of the
RAM buffer is lost, requiring a full scan of the log to determine which transactions need Redo
and which need Undo.
Q8. (Code-Tracing) Consider the following log entry for an Immediate Update protocol:
<T1, X, 50, 100>. During the Undo phase of recovery, what value is written to data
item X?
A) 100
B) 50
C) 150
D) NULL
In an Immediate Update log record <T, Item, Old_Val, New_Val>, the Undo operation
uses the Old_Val (50) to restore the state, while the Redo operation uses the New_Val
(100).
Q9. (Calculation) In the ARIES recovery algorithm, the "Analysis Phase" creates two
tables. What are they?
The Analysis Phase of ARIES scans the log from the last checkpoint to the end to identify: 1.
Which transactions were active at the time of the crash (Transaction Table) and 2. Which data
pages in the buffer were "dirty" and might not have been written to disk (Dirty Page Table).
Q10. (Conceptual) What is the purpose of the Log Sequence Number (LSN) in modern
recovery management?
A) To uniquely identify each log record and allow the DBMS to compare the "age"
of a data page with the "age" of the log.
B) To count the number of users logged into the system.
C) To sort the results of a SELECT query.
D) To encrypt the communication between the client and server.
Each data page in the database contains the LSN of the last log record that modified it. During
recovery, the DBMS checks the page's LSN; if the page's LSN is greater than or equal to the
LSN of the Redo log record, the DBMS knows the change is already on disk and can skip that
Redo operation.
Most industrial-strength databases (DB2, SQL Server) use the ARIES (Algorithms for
Recovery and Isolation Exploiting Semantics) method. It operates in three distinct
phases:
1. Analysis Phase: Starts from the last Checkpoint. It determines which pages were
dirty (not on disk) and which transactions were active at the time of the crash.
2. Redo Phase: It repeats history. It starts from the oldest dirty page and re-applies all
changes in the log (even those for transactions that eventually aborted). This brings
the database to the exact state it was in at the moment of the crash.
3. Undo Phase: It scans backward from the end of the log and reverses the changes
of all transactions that were active (uncommitted) at the time of the crash.
The genius of ARIES is Logging during Undo: When it reverses a change, it writes a
"Compensation Log Record" (CLR). If the system crashes again *during* the recovery,
the CLR ensures the system doesn't try to undo the same thing twice, preventing
infinite loops in recovery.
Quick Notes
• The Purpose: Indexing is a data structure technique used to minimize the number
of disk I/O operations required to locate specific data.
• B-Tree vs. B+ Tree: In a B-Tree, data and pointers exist in all nodes. In a B+ Tree
(the standard for RDBMS), actual data records (or pointers to them) exist only in
leaf nodes. Leaf nodes are linked together for fast sequential range scans.
• Clustered Index: The actual physical rows in the table are sorted based on the
index key. A table can have only one clustered index (usually the Primary Key).
• Non-Clustered (Secondary) Index: A separate structure containing the key and a
pointer (RID) to the actual data row. A table can have many non-clustered indexes.
• Hashing: Best for exact matches (O(1) average), but extremely poor for range
queries (e.g., WHERE price > 100).
• File Organizations: Heap (unordered), Sequential (sorted by key), Hashed (spread
by hash function), and Clustered.
Common Traps & Confusions
• Index on Every Column: While indexes speed up SELECT, they significantly slow
down INSERT, UPDATE, and DELETE because the index structure must be updated
synchronously.
• Primary Index vs. Primary Key: Usually, they are the same, but they are different
concepts. The Key is a logical constraint; the Index is a physical storage
optimization.
• Index Selectivity: An index on a "Gender" column (Low Selectivity) is often useless
because the engine might decide a full table scan is faster than jumping back and
forth between the index and the data.
• B+ Tree Fan-out: High fan-out (many children per node) is the goal, as it keeps the
tree short (usually 3-4 levels for millions of rows), resulting in fewer disk reads.
MCQs
Q1. (Scenario-Based) An e-commerce platform's "Product Search" is slow when users filter
by "Category" and "Price". The DBA creates a single index on (category, price). Now,
searches for WHERE category = 'Toys' are fast, but searches for WHERE price =
50.00 are still slow. Why?
In a composite index on (A, B), the data is sorted by A first, then by B. Searching by B alone is
like searching for a name in a phonebook using only the first name—the alphabetical sorting
of the last names (the first column) makes the search impossible without a full scan.
Q2. (Comparison) Why do modern RDBMS (like MySQL and SQL Server) utilize B+ Trees
instead of standard B-Trees for disk-based indexing?
Because B+ Trees store data only at the leaf level, internal nodes can fit more keys per block
(higher fan-out), making the tree shallower. Furthermore, the linked list connecting leaf nodes
allows the engine to perform range queries (e.g., BETWEEN) by finding the first leaf and then
just following the pointers.
Q3. (Code-Tracing) Analyze the following execution plan output. Why did the database
engine decide to ignore the index on the status column?
mysql - explain_output
-- Output:
type: ALL (Full Table Scan)
Possible Keys: idx_status
key: NULL
B-tree indexes are optimized for finding specific values or ranges. Most optimizers will skip an
index for "NOT EQUAL" (!=) queries because "everything except X" usually involves a large
portion of the table, making a sequential scan more efficient than random-access index jumps.
Q4. (Calculation) A database block is 4KB (4096 bytes). An index entry consists of a 4-byte
integer key and an 8-byte child pointer. Ignoring header overhead, what is the maximum
fan-out (number of children) of an internal node in this B+ Tree?
A) 1024
B) 512
C) 341
D) 256
Each internal node has $n$ pointers and $n-1$ keys.
Formula: $(n \times 8) + ((n-1) \times 4) \le 4096$.
$8n + 4n - 4 \le 4096 \rightarrow 12n \le 4100 \rightarrow n \le 341.6$.
The fan-out is 341.
In a clustered index, the "leaf" of the index is the actual data row. Since physical rows on a
disk can only be sorted in one way, you can have only one clustered index per table.
A) Hash Index
B) B+ Tree Index
C) Linear Probing
D) Inverted Index
Range queries (BETWEEN, >, <) require an ordered data structure. B+ Trees maintain keys in
sorted order, allowing the engine to find the start of the range and scan sequentially. Hash
indexes are unsuitable because they randomize the location of keys.
Q7. (Comparison) What is the main difference between a Dense Index and a Sparse
Index?
A) A Dense Index has an entry for every search key in the data file; a Sparse Index
only has entries for some search keys (e.g., one per block).
B) Dense indexes are for strings; Sparse indexes are for numbers.
C) Sparse indexes are faster for insertion; Dense indexes are faster for deletion.
D) Dense indexes only work on clustered tables.
In a Dense Index, every record is represented in the index. In a Sparse Index, we only store a
pointer to the start of each block. Sparse indexes use less space but require the data file to be
sorted by the search key.
Q8. (Code-Tracing) You execute the following command to optimize a report. What is the
technical term for the type of index created?
sql-shell - optimization
A) Clustered Index
B) Bitmap Index
C) Covering Index (or Index with Included Columns)
D) Reverse Key Index
By "including" total_amount, the index now contains all the columns needed for a query
like SELECT total_amount FROM sales WHERE sale_date = '...'. The engine
can return the result directly from the index without ever reading the actual table (Heap),
which is called an "Index-Only Scan."
Q9. (Calculation) A table is stored as an unordered Heap File with 1,000,000 blocks. If no
index exists, what is the average number of block reads required to find a record by its
Primary Key?
A) 1
B) 1,000
C) 500,000
D) 1,000,000
In a heap file (unordered), the only way to find a record is a linear search. On average, the
record will be found after scanning half the blocks (N/2). In the worst case, 1,000,000 reads.
Q10. (Conceptual) Why is it often recommended to use a Surrogate Key (like an Auto-
incrementing Integer) as a Clustered Index rather than a natural key (like a UUID or Name)?
A) Integers take less disk space than strings.
B) Sequential values prevent "Page Splitting" in the B+ Tree, keeping the tree
balanced and reducing fragmentation.
C) Integers are the only data type supported by B+ Trees.
D) Surrogate keys automatically encrypt the data.
When you insert random values (like a UUID) into a clustered index, the engine must insert
rows in the middle of pages. If a page is full, it must "split" the page to make room, which is an
expensive operation and leads to low "Fill Factor" and fragmentation. Sequential integers
always append to the end.
A) B+ Tree Index
B) Bitmap Index
C) Hash Index
D) Spatial Index
Bitmap indexes use bit arrays (1s and 0s) to represent the presence of a value. They are
extremely space-efficient for columns with very few unique values (low cardinality) and allow
for lightning-fast logical operations (AND, OR, NOT) when multiple filters are applied.
Q12. (Calculation) If a B+ Tree index has a height of 3 (Root, 1 Intermediate level, 1 Leaf
level) and the table is NOT clustered, how many I/O operations are required to retrieve a
single row using this index?
A) 3
B) 4
C) 2
D) 1
The engine must read 3 index blocks (Root → Intermediate → Leaf). Since it's a non-
clustered index, the leaf only contains a pointer to the data. One final I/O is required to fetch
the actual row from the data file. Total = 4.
Static hashing has a fixed number of buckets. As data exceeds the initial capacity, many keys
map to the same bucket, forcing the use of "overflow chains" (linked lists), which makes
search time linear (O(N)) instead of constant (O(1)). Dynamic/Extendible hashing solves this
by splitting buckets.
Q14. (Conceptual) In indexing theory, what does "Index Selective" refer to?
Selectivity = (Distinct Values / Total Rows). An index on a Primary Key has a selectivity of 1.0
(Highest). High selectivity means the index is very effective at pinpointing specific rows. Low
selectivity (like Gender) often leads the Query Optimizer to ignore the index.
sql-shell - performance_bug
When you insert a row into a table with a B+ Tree index, the engine traverses the tree
to find the correct leaf page. If that page is full (usually kept at 80% to allow for some
growth), a Page Split occurs:
This is why Clustered Index fragmentation happens. Frequent splits lead to pages
that are physically scattered across the disk, making sequential scans much slower.
Regularly "Rebuilding" or "Reorganizing" indexes defragments the pages and restores
sequential order.
• Wikipedia: B+ Tree
• PostgreSQL: Index Types
• B+ Tree: Standard DB index. Data in leaves, leaves linked. Best for range queries.
• Hash Index: Fast O(1) for = queries. Useless for >, <, or BETWEEN.
• Clustered: Data sorted by key. Max 1 per table.
• Non-Clustered: Pointer to data. Multiple per table allowed.
• SARGable: Keep indexed columns "naked" in queries (no functions) to ensure
index usage.
• Low Cardinality: Use Bitmap indexes (if supported).
• Composite Index: Order of columns matters (Leftmost prefix rule).
Quick Notes
1. Parsing & Translation: Checks syntax and translates SQL into a Relational
Algebra expression (Query Tree).
2. Optimization: Evaluates multiple execution plans to find the "cheapest" one.
3. Execution: The engine runs the chosen plan and returns results.
• Declarative vs. Procedural: SQL is declarative (you say what). The Query
Optimizer makes it procedural (it decides how).
• Stale Statistics: If a table grows from 100 rows to 1,000,000 rows but the
"Statistics" aren't updated, the optimizer might still think a Full Table Scan is faster
than an Index Scan, causing massive performance lag.
• The "Magic" of the Optimizer: The optimizer is not perfect. Sometimes a "hint"
(like FORCE INDEX) is required to override its decision if the statistics are
misleading.
MCQs
Q1. (Conceptual) During the Query Translation phase, into which mathematical
representation is a declarative SQL statement typically converted?
SQL is a high-level language. To process it, the DBMS translates it into Relational Algebra,
which provides a sequence of operations (Select, Project, Join) that the engine can actually
execute step-by-step.
One of the most powerful heuristic rules is "Selection Pushing." By filtering rows (Selection)
and removing unnecessary columns (Projection) early, the engine reduces the size of the data
it has to process in the expensive Join phases.
Q3. (Code-Tracing) Analyze the output of the EXPLAIN command below. What does the
"cost" value (e.g., 0.00..10.50) represent to the optimizer?
postgres - plan_analyzer
-- Output:
Index Scan using pk_users on users (cost=0.00..10.50 rows=1 width=40)
The "cost" is an estimate, not a real time. It usually represents the number of disk page
fetches and CPU cycles. The optimizer compares the costs of multiple plans and chooses the
one with the lowest total cost.
Q4. (Conceptual) In a Cost-Based Optimizer (CBO), where does the engine retrieve the
information needed to estimate the cost of a plan (e.g., number of rows, distinct values)?
A CBO relies on the Data Dictionary/System Catalog. This catalog stores statistics (histogram
of values, row counts, height of B+ trees) that are updated periodically (e.g., via the ANALYZE
or RUNSTATS commands).
Q5. (Scenario-Based) A query that was previously fast suddenly becomes extremely slow
after 1 million new rows are inserted into the table, even though an index exists. What is the
most likely cause?
If statistics aren't updated, the optimizer might still think the table is small and choose a "Full
Table Scan." Updating the statistics allows the optimizer to "see" the new data and realize that
using the index is now significantly faster.
Q6. (Comparison) What is the difference between Materialization and Pipelining in the
context of query execution?
Pipelining is generally faster as it avoids slow disk I/O. As rows are produced by one operation
(e.g., a Scan), they are immediately "piped" into the next operation (e.g., a Filter) in memory.
Materialization is only used when the intermediate result is too large for RAM or must be
sorted.
A) Use Table A as the "Build" table (the hash table in RAM) and scan Table B.
B) Use Table B as the "Build" table and scan Table A.
C) Join them both into a third temporary table first.
D) Join order does not affect performance in a Hash Join.
In a Hash Join, the optimizer builds a hash table in RAM for one table. To save memory, it
should choose the smaller table (Table A) to be the "Build" table and then stream the larger
table (Table B) against it.
Q8. (Conceptual) Why is the Join Order one of the most complex parts of query
optimization?
Joining multiple tables is associative and commutative. For 4 tables, there are 24 possible
orders. For 10 tables, there are over 3.6 million. The optimizer uses "Dynamic Programming"
or "Genetic Algorithms" to find a good order without checking every single one.
Q9. (Code-Tracing) A developer notices that a query using SELECT * is slower than
SELECT id, name. From a query processing perspective, why is this?
If the query only needs id and name, and an index exists on those two columns, the optimizer
can perform an "Index-Only Scan" (Pipelining). If * is used, the engine must perform a
"Bookmark Lookup" to fetch the full row from the disk, adding significant I/O cost.
This is a "Type Mismatch" error. If the column is an Integer but you provide a String (or vice
versa), the DBMS may perform an implicit conversion on every row (like CAST(id AS
VARCHAR) = '500'). This transformation makes the query Non-SARGable, forcing a full
scan.
The "Cost" of a plan is calculated using a cost model that assigns weights to different
hardware operations. In a typical model (like PostgreSQL):
• seq_page_cost (1.0): The cost to read one 8KB page sequentially from disk.
• random_page_cost (4.0): The cost to read one page via a random access (much
higher because of disk seek time).
• cpu_tuple_cost (0.01): The cost to process one row in the CPU.
• cpu_index_tuple_cost (0.005): The cost to process one index entry.
The optimizer calculates: Total Cost = (Number of Pages * Page Cost) +
(Number of Rows * CPU Cost). This is why on very small tables (e.g., 5 rows),
the optimizer often chooses a Sequential Scan over an Index Scan—the
"random_page_cost" of jumping to the index and then to the data is higher than simply
reading the one sequential page containing all 5 rows.
Quick Notes
• Principal of Least Privilege (PoLP): A user should be granted only the minimum
level of access (privileges) necessary to perform their specific job functions.
• Discretionary Access Control (DAC): The owner of an object (e.g., a table) has
the authority to grant or revoke access to that object to other users (using GRANT
and REVOKE).
• Role-Based Access Control (RBAC): Privileges are assigned to roles (e.g.,
'Analyst', 'Admin'), and users are assigned to those roles. This simplifies
management.
• SQL Injection (SQLi): A vulnerability where an attacker "injects" malicious SQL
code into a query via user input. Prevented primarily by Prepared Statements
(Parameterized Queries).
• Encryption:
• Database Auditing: The process of tracking and logging database activities (who
accessed what data and when).
• GRANT ALL: A common security mistake in development. Never use GRANT ALL
in production; it violates PoLP and provides a massive attack surface if a service
account is compromised.
• Obscurity vs. Security: Changing a database port from 3306 to 9000 is "Security
by Obscurity." It might slow down a script kiddie but won't stop a determined
attacker. Real security relies on robust authentication and authorization.
• SQLi in Stored Procedures: Many believe stored procedures are inherently safe
from SQLi. This is false; if the procedure uses dynamic SQL (string concatenation)
inside, it is still vulnerable.
• Revoking "Public": In some systems (like SQL Server), every user is a member of
the public role. Revoking permissions from a specific user but leaving them on
public might mean the user still has access.
MCQs
Q1. (Scenario-Based) A newly hired intern needs to generate a report on monthly sales.
The intern should be able to read the Orders and Products tables but must not be able to
modify them or see the Salaries table. Which security principle is the DBA applying by
granting only SELECT on those two specific tables?
A) Separation of Duties
B) Mandatory Access Control (MAC)
C) Principle of Least Privilege (PoLP)
D) Defense in Depth
PoLP states that a user should only have the exact permissions required for their task.
Granting SELECT (read-only) on only the relevant tables follows this principle strictly.
Q2. (Code-Tracing) Trace the result of the following DCL commands. Can 'User_A' delete
records from the Logs table at Time T4?
sql-auth-monitor
A) Yes, because User_A inherited the DELETE privilege from the 'DataCleanup' role.
B) No, because User_A was not granted the privilege directly.
C) No, because roles cannot hold DML privileges.
D) Yes, but only if User_A is also the owner of the table.
This is the core of Role-Based Access Control (RBAC). Privileges granted to a role are
automatically inherited by any user assigned to that role, making permission management
much more scalable.
Q3. (Scenario-Based) A web application takes a username from a login form and builds a
query like this:
"SELECT * FROM Users WHERE username = '" + userInput + "';"
An attacker enters ' OR '1'='1 into the field. What is the most effective architectural way
to prevent this SQL Injection attack?
Q4. (Code-Tracing) A medical database contains sensitive patient data. The DBA creates
the following object. What security benefit does this provide?
postgres - view_security
Views are a powerful security tool. By granting access to the view but not the underlying table,
the DBA can strictly control which columns and rows (or aggregates) a user can see,
effectively masking Personally Identifiable Information (PII).
Q5. (Conceptual) In a highly secure government database, access is granted based on the
sensitivity label of the data (e.g., Top Secret) and the clearance level of the user. The user
cannot pass their access rights to others. This is known as:
Q6. (Scenario-Based) A company needs to show customer support agents the last 4 digits
of a credit card number (e.g., XXXX-XXXX-XXXX-1234) but hide the rest. Which database
security technique is specifically used for this?
Data Masking hides sensitive data in the result set of a query without changing the actual data
on the disk. It is ideal for call centers or support environments where only partial info is
needed.
At-Rest encryption (like TDE) ensures that if a backup tape or hard drive is stolen, the data
cannot be read. In-Transit encryption (like TLS/SSL) ensures that an attacker on the same
network cannot "see" the SQL queries or results being exchanged between the app and the
DB.
Q8. (Code-Tracing) An auditor reviews the following system log. Which security
vulnerability is being attempted here?
db-audit-log
The attacker is using a semicolon (;) to terminate the legitimate query and "piggyback" a
second, malicious command (DROP TABLE). The -- at the end is used to comment out the
rest of the original query so it doesn't throw a syntax error.
Q9. (Scenario-Based) A financial organization requires that the person who writes the
software to update account balances must not be the same person who administers the
production database. This is a classic example of:
A) Least Privilege
B) Separation of Duties (SoD)
C) Multi-Factor Authentication
D) Non-Repudiation
SoD is a security principle where more than one person is required to complete a critical task,
preventing any single individual from having enough power to commit and conceal fraud.
Q10. (Conceptual) Which of the following is the primary goal of Database Auditing?
Auditing doesn't prevent an attack; it records it. It is essential for compliance (like HIPAA or
GDPR) and for forensic analysis after a security incident to determine what was stolen or
changed.
Deep Dive: Parameterized Queries and the Query Plan
This not only provides security but also Performance, as the database can reuse the
same execution plan for thousands of different inputs.
Quick Notes
MCQs
Q1. (Scenario-Based) A global retail company stores its Inventory table across three
servers: Tokyo, London, and New York. The Tokyo server only contains rows where region
= 'Asia'. What specific type of distributed database design is this?
A) Vertical Fragmentation
B) Horizontal Fragmentation (Sharding)
C) Full Replication
D) Heterogeneous Distribution
Horizontal fragmentation (also called sharding) involves dividing a table into subsets of rows
based on a specific attribute (like region). This keeps data geographically close to the users
who access it most frequently.
Q2. (Conceptual) According to the CAP Theorem, if a network partition occurs (nodes
cannot communicate), a database that chooses to remain Available (A) must sacrifice
which of the following?
If nodes cannot talk to each other (Partition), and you allow users to keep writing data to any
available node (Availability), those nodes will eventually hold different data. Therefore, you
can no longer guarantee that every user sees the exact same "Consistent" state.
Q3. (Code-Tracing) A user executes the following query in a distributed environment. If the
system possesses Location Transparency, what is true about the user's knowledge?
distributed-shell
A) The user must specify the IP address of the server holding account 9001.
B) The user must know whether the table is fragmented or replicated.
C) The user does not need to know where the data is physically stored; the DBMS
handles the routing.
D) The query will only work if the data is stored on the local machine.
Location Transparency is a property of a DDBMS where the user can query any table by its
logical name without needing to know the physical node or site where the data resides.
In synchronous replication, a transaction is not considered "Committed" until all replicas have
written the data. This ensures high consistency but means the user must wait for multiple
network round-trips, slowing down every write operation.
The 2PC protocol follows an "All or Nothing" rule. If even one node votes to ABORT during the
first phase (Prepare phase), the coordinator must instruct every node in the transaction to
rollback to maintain global atomicity.
Q6. (Conceptual) In a Heterogeneous Distributed Database System, which of the
following is true?
A) All nodes must run the exact same DBMS software (e.g., all MySQL).
B) All nodes must have the same hardware specifications.
C) Different nodes may run different DBMS software (e.g., some Oracle, some
PostgreSQL) using a common interface.
D) Data is only stored in one central location.
Heterogeneous systems integrate different types of database software across nodes. This is
common during company mergers where different departments use different database
technologies but need a unified query view.
Q7. (Comparison) Why would a database designer choose Vertical Fragmentation for a
Personnel table?
Q8. (Logic/Calculation) A distributed query requires joining Table A (Node 1) and Table B
(Node 2). To minimize network traffic, the DBMS sends only the unique Join-Keys from
Table A to Node 2, filters Table B there, and sends back only the matching rows. What is this
optimization technique called?
A) Full Join
B) Remote Join
C) Semi-Join
D) Distributed Scan
A) Strong Consistency
B) High Availability and Low Latency
C) Data Integrity
D) Vertical Scalability
Eventual consistency is a trade-off. By not forcing every node in the world to synchronize
immediately (which would be slow), the system remains fast (Low Latency) and stays online
even if some nodes are slow (Availability), with the promise that all nodes will "eventually" be
the same.
The coordinator is the "Master" node for a specific transaction. It initiates the 2PC protocol,
collects votes from participating "Participant" nodes, and issues the final global decision
(Commit or Abort).
In distributed databases, the "Network Cost" is usually the bottleneck, often 100x
slower than local disk I/O. Suppose we need to join Employees (Site 1, 10,000 rows)
and Departments (Site 2, 50 rows) on dept_id. Instead of shipping 10,000
employee rows to Site 2, the DDBMS performs a Semi-Join:
1. Site 2 sends the list of its 50 dept_ids to Site 1. (Very small transfer).
2. Site 1 filters the Employees table locally, keeping only those who belong to those
50 departments.
3. Site 1 sends the filtered employees to Site 2 for the final join.
This drastically reduces the amount of data crossing the network, which is the primary
goal of distributed query optimization.
Quick Notes
• Schemaless: Data can be inserted without a predefined structure, allowing for rapid
evolution of data models.
• "NoSQL means no SQL": Many NoSQL databases actually support SQL-like query
languages (e.g., CQL for Cassandra or N1QL for Couchbase). The "No" stands for
"Not Only".
• Consistency Misconception: NoSQL is not "Always Inconsistent." Many NoSQL
systems allow you to tune consistency (e.g., Read-at-Quorum) to act like an ACID
database when needed.
• Joins: NoSQL generally does not support server-side Joins. If you need to join
data, you either denormalize the data (duplicate it) or perform the join in the
application code.
• Vector Databases: A modern trend for AI. They store data as high-dimensional
vectors (embeddings) to allow for "Similarity Search" rather than exact keyword
matches.
MCQs
Q1. (Scenario-Based) A startup is building a real-time analytics dashboard for a global IoT
network. Sensors send millions of small updates every second. The schema changes
frequently as new sensor types are added. Which database type is most appropriate for this
high-velocity, flexible-schema workload?
NoSQL databases like MongoDB or Cassandra are designed for "Horizontal Scaling"
(handling massive velocity by adding nodes) and "Schema Flexibility" (handling changing data
structures without migrations), making them ideal for IoT and real-time big data.
mongo-shell
[Link]({
"item": "Laptop",
"specs": { "cpu": "i7", "ram": "16GB" },
"tags": ["electronics", "sale"]
});
Document databases allow "Nested Data Models." In SQL, specs and tags would likely
require separate tables and Joins. In NoSQL, related data is co-located in one document for
faster retrieval.
Q3. (Conceptual) The BASE consistency model used by many NoSQL systems prioritizes
"Availability" and "Partition Tolerance". What does the "Soft-state" property imply?
Soft-state means the system is "not stable" at all times. Due to "Eventual Consistency," the
data on Node A might differ from Node B for a short window until the background
synchronization processes align them.
Q4. (Comparison) In a relational database, you "Scale Up" (Vertical Scaling). In NoSQL,
you "Scale Out" (Horizontal Scaling). What is the main architectural challenge of Scaling
Out?
A) It is more expensive to buy many small servers than one giant server.
B) It makes the SQL syntax more complex.
C) It requires complex data partitioning (Sharding) and dealing with the CAP
theorem trade-offs.
D) It limits the database to a maximum of 1,000 rows.
Scaling out involves distributing data across a cluster. This introduces the "Network Partition"
problem. Deciding how to split data (Sharding) so that queries don't have to hit every single
server is the primary challenge of horizontal scaling.
A) Key-Value Store
B) Document Store
C) Graph Database
D) Wide-Column Store
Graph databases (like Neo4j) treat relationships as "first-class citizens." Instead of performing
multiple index-lookups or joins, they use "pointer chasing" to traverse links between nodes in
constant time, making them exponentially faster for social network analysis.
Q6. (Code-Tracing) A developer is using Redis to store session data. Which of the
following commands demonstrates the "Key-Value" nature of this database?
redis-cli
T1: SET session:user:101 "{'id': 101, 'name': 'Alice'}"
T2: EXPIRE session:user:101 3600
T3: GET session:user:101
A) T3, because it retrieves a specific blob of data using a unique string identifier.
B) T2, because it manages the hardware memory.
C) T1, because it defines a table schema for users.
D) All of the above are relational commands.
Key-Value stores act like a massive distributed dictionary/hash-map. You provide a Key
(session:user:101) and get back a Value (the JSON string). There is no "internal"
structure visible to the database for that value; it is just a blob.
In SQL, we normalize to save space and prevent anomalies. In NoSQL, "Disk is cheap, but
Joins are expensive." By duplicating data (e.g., putting the Customer Name inside the Order
document), we ensure that the entire order can be read in one operation without a network
jump.
Q8. (Conceptual) What is the primary use-case for a Column-Family Database (like
Cassandra or HBase)?
Q9. (Scenario-Based) A modern AI company needs to build a search engine that finds
"Images similar to this one." They use a Vector Database. How does this search work
technically?
Vector databases (like Pinecone or Milvus) store embeddings (arrays of numbers representing
features). Searching involves finding vectors that are "closest" to the input vector in high-
dimensional space, enabling "Semantic Search" or similarity matching.
NewSQL is a modern class of RDBMS that seeks to provide the "Scale-out" benefits of
NoSQL without sacrificing the "Strong Consistency" and ACID transactions that enterprise
financial systems require.
Deep Dive: Polyglot Persistence
• Relational (SQL): For structured data with complex relationships and strict
consistency needs (e.g., Users, Financial Transactions).
• Redis (Key-Value): For ultra-fast caching of session data and leaderboard scores in
RAM.
• ElasticSearch (Document/Search): For full-text search across product catalogs.
• Neo4j (Graph): For the social connection engine or fraud detection patterns.
• Prometheus (Time-Series): For monitoring server metrics and logs over time.
This approach ensures that every microservice uses the data model that optimizes its
specific performance and scaling requirements.
• Wikipedia: NoSQL
• AWS: What is NoSQL?
Quick Notes
• Data Warehouse (DW): A central repository of integrated data from one or more
disparate sources, used for reporting and data analysis.
• OLTP vs. OLAP:
• ETL Process: Extract (get data), Transform (clean/format), Load (insert into DW).
• DW Schemas:
◦ Star Schema: One central Fact table connected to multiple Dimension tables.
Simplest and fastest for joins.
◦ Snowflake Schema: A normalized Star Schema where dimension tables are
broken down into further sub-dimensions.
• Data Mining: The process of discovering patterns in large data sets (KDD -
Knowledge Discovery in Databases).
• Mining Tasks: Classification (predicting labels), Regression (predicting numbers),
Clustering (grouping), Association (finding relationships).
• Data Warehouse vs. Database: A database is for "running" the business (Current
data). A warehouse is for "analyzing" the business (Historical data).
• Fact vs. Dimension: A Fact is a quantitative measurement (e.g., Sales Amount,
Quantity). A Dimension is the context (e.g., Time, Location, Product).
• Data Mart: A subset of a Data Warehouse, usually oriented to a specific business
line or team (e.g., a "Marketing Data Mart").
• Correlation is not Causation: A common data mining pitfall. Finding that "Ice
cream sales and drowning incidents increase together" (Correlation) doesn't mean
one causes the other (both are caused by Summer).
MCQs
Q1. (Comparison) A bank uses two systems. System A processes every ATM withdrawal
and update instantly. System B generates a report every quarter showing spending trends
across different age groups over the last 5 years. How are these systems categorized?
Q2. (Code/Logic Tracing) Look at the following SQL structure. Which Data Warehouse
schema is being implemented here?
dw-schema-design
A) Star Schema
B) Snowflake Schema
C) Relational 3NF Schema
D) Galaxy Schema
This is a classic Star Schema. There is a single, central Fact table (Sales_Fact) surrounded
by non-normalized, flat Dimension tables. Each dimension is directly connected to the fact
table by a single join.
Classification is used to predict a discrete label or category (e.g., Churn vs. No-Churn, Spam
vs. Not-Spam). Regression would be used if the company wanted to predict a continuous
numerical value, like the "Probability" of churn.
Q4. (Conceptual) During the ETL (Extract, Transform, Load) process, in which phase
does "Data Cleaning" and "Deduplication" primarily occur?
A) Extract
B) Transform
C) Load
D) Data Warehousing
The Transformation phase is the most complex part of ETL. This is where data from different
sources is mapped to a common format, cleaned of errors, and merged to remove duplicate
records before being loaded into the Warehouse.
A) Roll-up
B) Drill-down
C) Slice
D) Dice
Drill-down navigates from less-detailed data (Year) to more-detailed data (Quarter/Month).
The reverse (moving from Month to Year) is called Roll-up.
A) Regression Analysis
B) K-Means Clustering
C) Association Rule Mining (Market Basket Analysis)
D) Classification
Association Rule Mining finds interesting correlations and co-occurrences between items in a
dataset. Market Basket Analysis is the most common application of this technique.
Q7. (Comparison) What is the primary architectural difference between a Data Warehouse
and a Data Mart?
A Data Warehouse is the "Single Source of Truth" for the entire organization. A Data Mart is a
decentralized slice of that data tailored for the specific needs of a single business unit, making
it easier and faster for them to query.
Q8. (Conceptual) In Data Mining, the KDD (Knowledge Discovery in Databases) process
defines several steps. Which of the following represents the correct sequence?
KDD starts with selecting the relevant data, cleaning/preprocessing it, transforming it into a
mineable format, applying mining algorithms, and finally interpreting/evaluating the results.
Q9. (Logic/Schema) How does a Snowflake Schema differ from a Star Schema regarding
Normalization?
A Snowflake Schema normalizes the dimension tables (e.g., splitting a Product dimension
into Product and Category tables). This saves space but results in more complex queries
with more JOINs compared to a Star Schema.
Clustering is an "Unsupervised" task because the algorithm identifies natural groupings in the
data without being told what those groups are (no training labels).
Q11. (Code/Logic Tracing) In a data mining context, what does Confidence represent in
an association rule {A} → {B}?
association-metrics
A) 30 / 100 (30%)
B) 30 / 40 (75%)
C) 40 / 100 (40%)
D) 10 / 40 (25%)
Confidence is the probability that item B is purchased given that item A is purchased. Formula:
Count(A and B) / Count(A). Here: 30 / 40 = 0.75. (30/100 would be the "Support"
of the rule).
A Factless Fact table doesn't have numerical measures (like 'Amount'). It is used to track
events, like "Student X attended Class Y on Date Z." The "measurement" is simply the count
of those occurrences.
Q13. (Scenario-Based) A real estate website wants to estimate the selling price of a house
based on its square footage, number of bedrooms, and location. Which data mining
technique is appropriate?
A) Classification
B) Regression
C) Clustering
D) Association
Q14. (Comparison) What is the main reason an organization would choose Cloud Data
Warehousing (e.g., Snowflake, BigQuery) over a traditional On-Premise Data Warehouse?
Q15. (Conceptual) In OLAP operations, what does "Slicing" the data cube involve?
A "Slice" picks one specific value of one dimension (e.g., Year = 2023) to see the rest of the
data. A "Dice" picks a sub-range of multiple dimensions (e.g., Year 2023 AND City = Karachi).
To ensure that different Data Marts (Finance, Sales, HR) can talk to each other, Ralph
Kimball (a pioneer of DW) introduced the Bus Architecture based on Conformed
Dimensions:
• The Problem: If the Finance team defines "Date" differently than the Sales team,
their reports will never match.
• The Solution: The organization builds one single Date_Dimension and one
single Customer_Dimension and shares them across every single Fact table in
the enterprise.
• Benefit: This allows "Drill-across" queries, where you can join the Sales_Fact and
the Finance_Fact tables because they both reference the exact same Date_ID.
This creates a unified "Single Source of Truth."