0% found this document useful (0 votes)
30 views64 pages

Comprehensive Guide to DBMS Concepts

A Database Management System (DBMS) is software that is used to create, manage databases.

Uploaded by

yeah.9121921
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views64 pages

Comprehensive Guide to DBMS Concepts

A Database Management System (DBMS) is software that is used to create, manage databases.

Uploaded by

yeah.9121921
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

DataBase Management System

📘 1. Introduction to DBMS

• What is a DBMS?
• Database vs File System
• Advantages of DBMS
• Users of DBMS (DBA, developers, end users)
• Types of DBMS (Hierarchical, Network, Relational, Object-oriented)

📗 2. Data Models

• Data model concepts


• Hierarchical Model
• Network Model
• Relational Model
• Entity-Relationship (ER) Model
o Entities and Attributes
o Relationships (1:1, 1:M, M:N)
o Keys (Primary, Candidate, Super, Foreign)
o ER Diagrams

📙 3. Relational Model

• Relational schema and instances


• Keys (Primary, Super, Candidate, Foreign)
• Integrity Constraints
o Domain Constraint
o Key Constraint
o Entity Integrity
o Referential Integrity
• Relational Algebra (operations: select, project, join, union, etc.)
• Relational Calculus (Tuple & Domain)

📕 4. SQL (Structured Query Language)

• DDL (CREATE, DROP, ALTER)


• DML (SELECT, INSERT, UPDATE, DELETE)
• DCL (GRANT, REVOKE)
• TCL (COMMIT, ROLLBACK, SAVEPOINT)
• Advanced SQL:
o Joins (INNER, LEFT, RIGHT, FULL)
o Subqueries and Nested Queries
o Views
o Indexes
o Triggers
o Stored Procedures and Functions
o Cursors

📒 5. Database Design

• Normalization:
o 1NF, 2NF, 3NF, BCNF, 4NF, 5NF
o Decomposition and Lossless Join
o Dependency Preservation
• Functional Dependencies
• Design Strategies:
o ER to Relational Mapping
o Top-Down and Bottom-Up Design

📓 6. Transaction Management

• Transactions and Properties (ACID)


• States of Transactions
• Serializability:
o Conflict and View Serializability
o Precedence Graph
• Concurrency Control:
o Lock-Based Protocols
o Two-Phase Locking (2PL)
o Deadlocks and Prevention
• Timestamp Ordering

📔 7. Recovery System

• Types of Failures
• Recovery Techniques:
o Log-Based Recovery
o Shadow Paging
o Checkpoints
o ARIES Algorithm

📘 8. Indexing and Hashing

• Need for Indexing


• Single-Level and Multi-Level Index
• Dense vs Sparse Index
• B+ Tree Indexing
• Hashing Techniques:
o Static Hashing
o Dynamic Hashing (Extendible, Linear)

📙 9. File Organization

• File Structures (Heap, Sorted, Hashed)


• Sequential File Organization
• Clustering and Multi-key Access

📗 10. Query Processing and Optimization

• Steps in Query Processing


• Measures of Query Cost
• Query Trees and Evaluation Plans
• Transformation Rules
• Heuristics for Query Optimization
• Cost-Based Optimization

📕 11. Distributed Databases (Advanced)

• Characteristics of Distributed DB
• Fragmentation and Replication
• Distributed Transactions
• Commit Protocols (2PC, 3PC)
• CAP Theorem
📒 12. NoSQL and New Trends

• Introduction to NoSQL
• Types: Key-Value, Document, Column, Graph DBs
• CAP Theorem Revisited
• BASE vs ACID
• Comparison with RDBMS

📓 13. Security and Authorization

• Database Security Issues


• Access Control Models
• Discretionary and Mandatory Access Control
• Role-Based Access Control
• SQL Injection and Mitigation

🧩 Recommended Tools & Practice Platforms

• Databases: MySQL, PostgreSQL, SQLite, MongoDB (NoSQL)


• ER Diagram Tool: [Link], Lucidchart
• Practice: LeetCode (DB problems), HackerRank, SQLZoo, W3Schools SQL

📘 1. Introduction to DBMS

🔹 What is a DBMS?

A Database Management System (DBMS) is software that allows users to create, retrieve,
update, and manage data in databases. It provides an interface between the database and its
users or application programs, ensuring data is organized and easily accessible.

Example:

• Consider an online store like Amazon. A DBMS stores all data like:
o Product details
o Customer info
o Orders
o Payments

Without a DBMS, managing and retrieving this vast amount of information efficiently would be
extremely difficult.

🔹 Database vs File System

Feature File System DBMS


Data Redundancy High Low (uses normalization and relationships)
Data Integrity Hard to enforce Enforced via constraints
Data Access Sequential or manual Query-based (SQL)
Concurrency Control Very limited Supports multiple users simultaneously
Security Basic file permissions Fine-grained access control
Backup and Recovery Manual Automatic and robust

Example:

• File System: A company stores employee records in Excel or text files.


• DBMS: The same company uses MySQL to manage employee records with SQL
queries, constraints, and user permissions.

🔹 Advantages of DBMS

1. Data Redundancy Control:


a. Eliminates unnecessary duplication using normalization.
b. Example: Stores customer information only once, even if they place multiple
orders.
2. Data Integrity and Accuracy:
a. Ensures data correctness through constraints (e.g., primary key, foreign key).
b. Example: You can’t insert a sales record without a valid product ID.
3. Data Security:
a. Access can be controlled with authentication and authorization.
b. Example: Only HR staff can view salary details.
4. Concurrent Access:
a. Supports multiple users accessing the same database simultaneously.
b. Example: Multiple users adding products to their cart at the same time.
5. Backup and Recovery:
a. Automatic mechanisms to recover data after system failure.
b. Example: Restoring the latest backup after a crash.
6. Data Independence:
a. Application programs are insulated from how data is structured and stored.
b. Example: Changing the data format doesn’t require changes in application
code.

🔹 Users of DBMS

1. Database Administrator (DBA):


a. Manages DBMS, configures settings, sets up security, backups, etc.
b. Example: Sets user roles and access levels in Oracle.
2. Database Designers:
a. Designs the database schema and structure.
b. Example: Designs ER diagrams and converts them to tables.
3. Application Developers:
a. Build apps that interact with the database using queries and APIs.
b. Example: A developer builds a login system using SQL queries.
4. End Users:
a. Regular users who interact with the database indirectly via applications.
b. Example: An Amazon customer searching for products.

🔹 Types of DBMS

1. Hierarchical DBMS
a. Organizes data in a tree-like structure (parent-child relationships).
b. Example: Windows Registry or early IBM systems.
c. Limitation: Rigid structure; each child has only one parent.

markdown
CopyEdit
Company
└── Department
└── Employee
2. Network DBMS
a. More flexible than hierarchical: a child can have multiple parents.
b. Example: Integrated Data Store (IDS), used in early defense systems.

markdown
CopyEdit
Student ─┬─> Course
└─> Instructor

3. Relational DBMS (RDBMS)


a. Stores data in tables (relations) with rows and columns.
b. Uses SQL to manage data.
c. Example: MySQL, PostgreSQL, Oracle, MS SQL Server

pgsql
CopyEdit
Table: Employees
+----+--------+--------+
| ID | Name | DeptID |
+----+--------+--------+

Table: Departments
+--------+------------+
| DeptID | DeptName |
+--------+------------+

4. Object-Oriented DBMS (OODBMS)


a. Stores data as objects, similar to object-oriented programming.
b. Supports complex data types (e.g., multimedia, CAD data).
c. Example: db4o, ObjectDB

java
CopyEdit
class Employee {
String name;
Address address; // Address is another object
}
📗 2. Data Models
A data model defines how data is structured, stored, and accessed in a database. It includes
rules for relationships between data elements and provides a framework for designing and
implementing databases.

🔹 Data Model Concepts

Key components of a data model include:

• Entities – Real-world objects (e.g., Student, Book)


• Attributes – Properties of entities (e.g., Name, ID)
• Relationships – Associations among entities (e.g., Student borrows Book)
• Constraints – Rules (e.g., each student must have a unique ID)

🔸 1. Hierarchical Model

• Organizes data in a tree-like structure.


• Each parent can have multiple children, but each child has only one parent.

Example:

markdown
CopyEdit
University
└── Faculty
└── Department
└── Professor

• A professor belongs to one department only.


• Good for applications with strict hierarchical relationships.

Drawback: Rigid; difficult to represent complex many-to-many relationships.


🔸 2. Network Model

• Uses graph-like structure.


• Entities (called records) are connected through multiple relationships.
• Each child can have multiple parents.

Example:

markdown
CopyEdit
Student ─────┐
├──> Course
Instructor ──┘

• Students and Instructors are connected to multiple Courses.


• Supports many-to-many relationships.

Drawback: Complex to design and navigate using pointers.

🔸 3. Relational Model

• Most widely used model (basis for SQL).


• Represents data as tables (relations) with rows (tuples) and columns (attributes).
• Relationships are established using keys.

Example:

Students Table

StudentID Name Major


101 Alice CS
102 Bob EE

Courses Table

CourseID CourseName
C1 DBMS
C2 Networks
Enrollments Table

StudentID CourseID
101 C1
102 C1

• Easy to query using SQL and supports normalization.

🔸 4. Entity-Relationship (ER) Model

A high-level conceptual model for database design. It visually represents:

• Entities
• Attributes
• Relationships

Used during database design phase to map real-world objects to tables later.

🔹 Entities and Attributes

• Entity: A real-world object with an independent existence.


o Example: Student, Course, Library
• Attribute: A property or characteristic of an entity.
o Example: Student has attributes like StudentID, Name, Age

Types of Attributes:

• Simple: Cannot be divided (e.g., Age)


• Composite: Can be divided (e.g., Name → First, Last)
• Derived: Computed from other attributes (e.g., Age from DOB)
• Multi-valued: More than one value (e.g., PhoneNumbers)

🔹 Relationships

Defines how two or more entities are associated.

Types:
Relationship Type Description Example
1:1 One entity relates to one of another Person ↔ Passport
1:M One entity relates to many of another Department → Employees
M:N Many entities relate to many others Students ↔ Courses

🔹 Keys

Keys uniquely identify rows in tables and establish relationships.

🔸 Primary Key

• Uniquely identifies each row.


• Cannot be NULL.
• Example: StudentID in Student table.

🔸 Candidate Key

• All possible choices for primary key.


• Example: Email, StudentID could both uniquely identify a student.

🔸 Super Key

• Any set of attributes that uniquely identifies a row.


• Example: {StudentID}, {StudentID, Name}

🔸 Foreign Key

• A field that links to the primary key of another table.


• Example: StudentID in Enrollments table refers to StudentID in Students table.

🔹 ER Diagrams (Entity-Relationship Diagrams)

A visual representation of entities, attributes, and relationships.

Symbols:

• Entity → Rectangle
• Attribute → Oval
• Relationship → Diamond
• Primary Key → Underlined Attribute
• Lines connect entities to relationships and attributes

Example ER Diagram (Textual):

less
CopyEdit
[Student] ──(enrolls)── [Course]
| |
[ID] [CourseID]
[Name] [Title]

This shows:

• A Student enrolls in a Course


• Student has ID, Name
• Course has CourseID, Title

📙 3. Relational Model
The Relational Model organizes data into tables (relations) consisting of rows (tuples) and
columns (attributes). It’s the foundation of Relational Database Management Systems
(RDBMS) like MySQL, PostgreSQL, Oracle, etc.

🔹 Relational Schema and Instances

• Relational Schema: The structure or blueprint of a relation (table).


o It defines the table name, column names, and their data types.
o Example:

php
CopyEdit
Student(StudentID: int, Name: varchar, Major: varchar)
• Instance: A snapshot of the data in the table at a particular moment.
o Example:

pgsql
CopyEdit
+-----------+--------+--------+
| StudentID | Name | Major |
+-----------+--------+--------+
| 101 | Alice | CS |
| 102 | Bob | EE |

🔹 Keys

🔸 Primary Key

• Uniquely identifies each tuple.


• Must be unique and NOT NULL.
• Example: StudentID in Student table

🔸 Candidate Key

• A set of attributes that can uniquely identify tuples.


• A table may have multiple candidate keys; one is chosen as the primary key.
• Example: Email and StudentID can both uniquely identify a student.

🔸 Super Key

• Any combination of attributes that can uniquely identify a tuple.


• Includes candidate keys plus additional attributes.
• Example: {StudentID}, {StudentID, Name}

🔸 Foreign Key

• An attribute that references the primary key of another table.


• Ensures referential integrity between related tables.
• Example: StudentID in Enrollment(StudentID, CourseID) refers to
Student(StudentID)
🔹 Integrity Constraints

Rules that ensure accuracy and consistency of data.

🔸 1. Domain Constraint

• Each attribute must contain only values from a predefined domain (data type +
constraints).
• Example:
o Age INT CHECK (Age > 0)
o Grade VARCHAR(2)

🔸 2. Key Constraint

• No two tuples can have the same value for a primary key.
• Example:

sql
CopyEdit
StudentID must be unique:
+-----------+--------+
| StudentID | Name |
+-----------+--------+
| 101 | Alice |
| 101 | Bob | Invalid

🔸 3. Entity Integrity

• Primary key cannot be NULL.


• Ensures each entity (record) is identifiable.
• Example:

pgsql
CopyEdit
+-----------+--------+
| StudentID | Name |
+-----------+--------+
| NULL | Alice | Invalid

🔸 4. Referential Integrity

• A foreign key must match a primary key in another table or be NULL.


• Ensures relationships between tables remain consistent.
• Example:

scss
CopyEdit
Enrollments(StudentID) must exist in Student(StudentID)

🔹 Relational Algebra

A set of theoretical operations used to query and manipulate relational data.

Key Operations:

Sym
Operation Description Example
bol
Select σ Filters rows based on condition σ Major='CS'(Student)
Project π Selects specific columns π Name, Major(Student)
Union ∪ Combines tuples from 2 relations CS_Students ∪ EE_Students
All_Students −
Set Difference − Tuples in one relation but not in another
Graduated_Students
Cartesian
× All possible combinations Student × Course
Product
Rename ρ Renames a relation or attributes ρ(S ← Student)
Combines related tuples from two
Join ⨝ Student ⨝ Enrollments
relations

Example:

Given:
• Student(StudentID, Name, Major)
• Enrollments(StudentID, CourseID)

Get names of students enrolled in courses:

scss
CopyEdit
π Name (Student ⨝ Enrollments)

🔹 Relational Calculus

A non-procedural query language (what to retrieve, not how).

🔸 1. Tuple Relational Calculus (TRC)

• Uses tuple variables.


• Syntax: { t | P(t) }
("Return all tuples t such that predicate P(t) is true.")

Example:

text
CopyEdit
{ s | Student(s) ∧ [Link] = 'CS' }
→ Returns all students whose major is CS

🔸 2. Domain Relational Calculus (DRC)

• Uses domain variables (each variable ranges over a domain of values).


• Syntax: { <x1, x2, ... xn> | P(x1, x2, ..., xn) }

Example:

text
CopyEdit
{ <n> | ∃ id, m (Student(id, n, m) ∧ m = 'CS') }
→ Returns names of all CS students

✅ Summary
Concept Key Idea
Relational Schema Defines table structure
Instance Current data in a table
Keys Uniquely identify tuples; maintain relationships
Integrity Constraints Ensure valid and consistent data
Relational Algebra Procedural query language with set-based operations
Relational Calculus Non-procedural language; describes what to retrieve

📕 4. SQL (Structured Query Language)


SQL is the standard language used to communicate with relational databases. It is used for
defining structure, manipulating data, controlling access, and ensuring transaction integrity.

SQL commands are divided into several categories:

🔹 1. DDL – Data Definition Language


Used to define or modify the structure of database objects (tables, schemas, etc.).

🔸 CREATE

Creates a new table or object.

sql
CopyEdit
CREATE TABLE Students (
StudentID INT PRIMARY KEY,
Name VARCHAR(50),
Major VARCHAR(50)
);

🔸 DROP

Deletes an object (table, view, index).

sql
CopyEdit
DROP TABLE Students;

🔸 ALTER

Modifies an existing table.

sql
CopyEdit
ALTER TABLE Students ADD Age INT;
ALTER TABLE Students DROP COLUMN Age;

🔹 2. DML – Data Manipulation Language


Used to manipulate data in tables.

🔸 SELECT

Retrieves data from the table.

sql
CopyEdit
SELECT * FROM Students;
SELECT Name FROM Students WHERE Major = 'CS';
🔸 INSERT

Adds new data.

sql
CopyEdit
INSERT INTO Students (StudentID, Name, Major) VALUES (1, 'Alice', 'CS');

🔸 UPDATE

Modifies existing data.

sql
CopyEdit
UPDATE Students SET Major = 'IT' WHERE StudentID = 1;

🔸 DELETE

Deletes data.

sql
CopyEdit
DELETE FROM Students WHERE StudentID = 1;

🔹 3. DCL – Data Control Language


Controls access permissions to the database.

🔸 GRANT

Gives privileges to users.

sql
CopyEdit
GRANT SELECT, INSERT ON Students TO user1;
🔸 REVOKE

Takes back privileges.

sql
CopyEdit
REVOKE INSERT ON Students FROM user1;

🔹 4. TCL – Transaction Control Language


Used to manage transactions.

🔸 COMMIT

Saves all changes made during the transaction.

sql
CopyEdit
COMMIT;

🔸 ROLLBACK

Undoes changes made during the transaction.

sql
CopyEdit
ROLLBACK;

🔸 SAVEPOINT

Creates a checkpoint to rollback to.

sql
CopyEdit
SAVEPOINT sp1;
-- Some operations
ROLLBACK TO sp1;
🔹 5. Advanced SQL

🔸 Joins

Combines rows from two or more tables based on related columns.

✅ INNER JOIN

Returns matching rows.

sql
CopyEdit
SELECT [Link], [Link]
FROM Students
INNER JOIN Enrollments ON [Link] = [Link]
INNER JOIN Courses ON [Link] = [Link];

✅ LEFT JOIN

Returns all from the left table + matched from the right.

sql
CopyEdit
SELECT [Link], [Link]
FROM Students
LEFT JOIN Enrollments ON [Link] = [Link]
LEFT JOIN Courses ON [Link] = [Link];

✅ RIGHT JOIN / FULL JOIN

Similar logic, with RIGHT prioritizing the second table and FULL including all.
🔸 Subqueries and Nested Queries

Query inside another query.

sql
CopyEdit
SELECT Name FROM Students
WHERE StudentID IN (
SELECT StudentID FROM Enrollments WHERE CourseID = 'CSE101'
);

🔸 Views

Virtual tables based on SELECT queries.

sql
CopyEdit
CREATE VIEW CS_Students AS
SELECT * FROM Students WHERE Major = 'CS';

-- Use it like a table


SELECT * FROM CS_Students;

🔸 Indexes

Speeds up data retrieval (especially on large tables).

sql
CopyEdit
CREATE INDEX idx_major ON Students(Major);

🔸 Triggers

Automatic actions triggered by events on a table (e.g., insert, update).

sql
CopyEdit
CREATE TRIGGER log_insert
AFTER INSERT ON Students
FOR EACH ROW
BEGIN
INSERT INTO LogTable (Action, Time) VALUES ('Insert', NOW());
END;

🔸 Stored Procedures

Saved SQL code that can be executed later.

sql
CopyEdit
CREATE PROCEDURE GetAllStudents()
BEGIN
SELECT * FROM Students;
END;

CALL GetAllStudents();

🔸 Functions

Similar to procedures but return a value.

sql
CopyEdit
CREATE FUNCTION GetStudentCount()
RETURNS INT
BEGIN
DECLARE count INT;
SELECT COUNT(*) INTO count FROM Students;
RETURN count;
END;
🔸 Cursors

Used to process rows one by one in a loop (like a pointer in SQL).

sql
CopyEdit
DECLARE cur CURSOR FOR SELECT Name FROM Students;
OPEN cur;
FETCH cur INTO @name;
-- Process @name
CLOSE cur;

✅ Summary Table
Category Commands Purpose
DDL CREATE, DROP, ALTER Define database structure
DML SELECT, INSERT, UPDATE, DELETE Manipulate data
DCL GRANT, REVOKE Control access
TCL COMMIT, ROLLBACK, SAVEPOINT Control transactions
Advanced JOINS, VIEWS, INDEXES, TRIGGERS, etc. Enhanced querying and automation

📒 5. Database Design
Database design ensures efficient storage, minimal redundancy, and data integrity. It
typically involves normalization and using design strategies like ER mapping.

🔹 Normalization
Normalization is the process of organizing data to:
• Eliminate redundancy
• Prevent anomalies (insert, update, delete)
• Improve data integrity

It uses normal forms (NF) — each level refines the design further.

🔸 1NF – First Normal Form

A table is in 1NF if:

• All attributes contain atomic (indivisible) values.


• Each record is unique.

Example (violates 1NF):

StudentID Name Courses


1 Alice Math, Physics

Courses contains multiple values.

After 1NF:

StudentID Name Course


1 Alice Math
1 Alice Physics

🔸 2NF – Second Normal Form

A table is in 2NF if:

• It is already in 1NF.
• No partial dependency (non-prime attribute depends on a part of a composite key).

Example:

Consider:

scss
CopyEdit
Enrollment(StudentID, CourseID, StudentName)
Here, (StudentID, CourseID) is the primary key.

StudentName depends only on StudentID → Partial dependency.

Solution: Decompose:

• Student(StudentID, StudentName)
• Enrollment(StudentID, CourseID)

🔸 3NF – Third Normal Form

A table is in 3NF if:

• It is in 2NF.
• There is no transitive dependency (non-key → non-key).

Example:

scss
CopyEdit
Employee(EmpID, DeptID, DeptName)

DeptName depends on DeptID, not directly on EmpID.

Solution:

• Employee(EmpID, DeptID)
• Department(DeptID, DeptName)

🔸 BCNF – Boyce-Codd Normal Form

A stricter version of 3NF:

• Every determinant is a candidate key.

Example:

scss
CopyEdit
Course(CourseID, Instructor, Room)

If Instructor → Room but Instructor is not a candidate key, this violates BCNF.

Decompose:

• InstructorRoom(Instructor, Room)
• Course(CourseID, Instructor)

🔸 4NF – Fourth Normal Form

• Removes multi-valued dependencies.

Example:

scss
CopyEdit
Student(StudentID, Course, Hobby)

Each student can take multiple courses and have multiple hobbies independently.

Decompose:

• StudentCourses(StudentID, Course)
• StudentHobbies(StudentID, Hobby)

🔸 5NF – Fifth Normal Form

• Removes join dependencies that are not implied by candidate keys.

Used in rare, complex cases with multiple independent many-to-many relationships.


🔹 Decomposition and Lossless Join
Decomposition: Splitting a relation into two or more relations.

Lossless Join: When decomposed tables can be joined back to get the original table without
loss of data.

Example:

Original:

scss
CopyEdit
Employee(EmpID, DeptID, DeptName)

Decompose:

• Employee(EmpID, DeptID)
• Department(DeptID, DeptName)

Join:

sql
CopyEdit
SELECT * FROM Employee
JOIN Department USING(DeptID);

→ No data loss

🔹 Dependency Preservation
Dependency Preservation: All functional dependencies must be preserved after decomposition.

Example:

Original FD:

css
CopyEdit
A → B, B → C

After decomposition, both FDs should still hold in the new relations.

🔹 Functional Dependencies (FD)


A → B means: If two rows have the same value for A, they must have the same value for B.

Example:

nginx
CopyEdit
StudentID → Name

If StudentID is same, Name should also be the same.

Used to determine:

• Candidate Keys
• Decomposition
• Normalization

🔹 Design Strategies

🔸 1. ER to Relational Mapping

Steps:

1. Entity → Table
a. Entity: Student(StudentID, Name) → Table: Student
2. 1:1 Relationship → Merge or create foreign key
a. Student ↔ Passport → Student(PassportID)
3. 1:N Relationship → Foreign key in "N" side
a. Department 1:N Employee → Employee(DeptID)
4. M:N Relationship → Create junction table
a. Student M:N Course → Enrollment(StudentID, CourseID)
5. Attributes → Table columns
a. Multi-valued → New table
b. Derived → Usually ignored

🔸 2. Top-Down Design

Start with a general overview (high-level entities) and refine:

• Use ER diagrams
• Normalize step by step

Best for complex databases from scratch.

🔸 3. Bottom-Up Design

Start with existing data and build relations.

• Identify patterns, dependencies


• Useful in legacy systems

✅ Summary
Concept Description
Normalization Reduces redundancy and anomalies through normal forms
1NF to 5NF Gradual refinement from atomic values to join dependency free
Lossless Join Ensures no data is lost after decomposition
Dependency Preservation Ensures all FDs are maintained post-decomposition
Functional Dependencies Used to define relationships and normalization boundaries
ER to Relational Mapping Translates design to actual schema
Design Strategies Top-Down (from ER) or Bottom-Up (from data)
📓 6. Transaction Management
A transaction is a sequence of database operations that must be executed as a single logical
unit of work. Its goal is to preserve data integrity.

🔹 Transactions and ACID Properties

ACID ensures that transactions are processed reliably:

Property Description
Atomicity All operations in a transaction are completed or none at all.
Consistency A transaction brings the database from one valid state to another.
Isolation Transactions are isolated from each other. No intermediate state is visible.
Durability Once a transaction is committed, changes are permanent.

Example: Transferring ₹500 from A to B

sql
CopyEdit
BEGIN;
UPDATE Accounts SET Balance = Balance - 500 WHERE AccNo = 'A';
UPDATE Accounts SET Balance = Balance + 500 WHERE AccNo = 'B';
COMMIT;

If any part fails → ROLLBACK the transaction.

🔹 States of a Transaction

1. Active: Transaction is executing.


2. Partially Committed: Final operation executed.
3. Committed: Changes permanently saved.
4. Failed: Error occurred, cannot proceed.
5. Aborted: Rolled back; no changes persisted.

pgsql
CopyEdit
Active → (Partial Commit) → Commit or Abort
↘︎ Failed

🔹 Serializability (Correctness in Concurrency)


Ensures that concurrent transactions produce the same result as if they executed serially.

🔸 Conflict Serializability

Two operations conflict if:

• They access the same data item and


• At least one of them is a write

Schedule: Order of operation execution.

Conflict-serializable: If a schedule can be transformed into a serial schedule by swapping


non-conflicting operations.

🔸 View Serializability

Two schedules are view-equivalent if:

• They read the same initial values


• They perform same final writes
• Every read operation reads the same value in both schedules

Less strict than conflict serializability.

🔸 Precedence Graph (Serialization Graph)

Used to test conflict serializability:

• Nodes = Transactions
• Edges = Conflicting operations (T1 → T2 if T1 executes before T2)
A schedule is conflict serializable if the graph is acyclic.

🔹 Concurrency Control
Concurrency control ensures isolation and consistency when multiple transactions run
simultaneously.

🔸 Lock-Based Protocols

Use locks to control access:

• Shared (S) Lock: Read


• Exclusive (X) Lock: Read/Write

Rules:

• Only one X lock at a time


• Multiple S locks allowed
• No other locks allowed with X lock

Example:

• T1 gets S(A) → reads A


• T2 wants X(A) → must wait

🔸 Two-Phase Locking (2PL)

2PL ensures serializability:

1. Growing phase: Only acquire locks


2. Shrinking phase: Only release locks

Once a transaction releases a lock, it cannot acquire new ones.

Example:

text
CopyEdit
T1: lock(A) → lock(B) → unlock(A) → unlock(B)

Any schedule following 2PL is conflict-serializable.

🔸 Deadlocks and Prevention

Deadlock: Two or more transactions wait indefinitely for each other’s locks.

Detection:

• Wait-For Graph (WFG): cycle → deadlock

Prevention Techniques:

• Wait-die: Older transaction waits; younger aborts.


• Wound-wait: Older transaction preempts; younger waits.
• Timeout: Abort after waiting too long.

🔸 Timestamp Ordering

Every transaction gets a timestamp (TS) when it starts.

Each data item Q has:

• read_TS(Q): Largest TS of transaction that read Q


• write_TS(Q): Largest TS of transaction that wrote Q

Rules:

1. If T wants to read Q:
a. Allow if TS(T) ≥ write_TS(Q)
b. Else → abort T
2. If T wants to write Q:
a. Allow if TS(T) ≥ read_TS(Q) and TS(T) ≥ write_TS(Q)
b. Else → abort T

Ensures serializability based on timestamps


✅ Summary Table
Topic Description
ACID Ensures reliable transaction execution
Transaction States Lifecycle: Active → Commit/Abort
Serializability Correctness in concurrency
Conflict/View Serial. Conflict (operations-based), View (result-based)
Precedence Graph Tests conflict serializability
Locks Shared/Exclusive access to data items
2PL Ensures serializability via lock phases
Deadlocks Circular waits; resolved via prevention/detection
Timestamp Ordering Orders transactions to ensure serializability

📔 7. Recovery System
A recovery system in a DBMS ensures that the database can be restored to a consistent state
after failures. It relies on logs, backups, and recovery protocols to undo or redo transactions.

🔹 Types of Failures
1. Transaction Failure
a. Logical errors (e.g., division by zero)
b. System errors (e.g., unexpected abort)
2. System Crash
a. Hardware/software failure; memory contents lost but disk data safe
3. Disk Failure
a. Data is lost or corrupted due to hardware failure (e.g., head crash)
4. Media Failure
a. Affects all or parts of the storage media
5. Communication Failure
a. In distributed systems; network issues affect transaction coordination
🔹 Recovery Techniques

🔸 1. Log-Based Recovery

A log file records all changes made by transactions. It’s stored in stable storage (not lost in
crashes).

Types of logs:

• WRITE(T, X, old_val, new_val)


• START T
• COMMIT T
• ABORT T

Two main policies:

• Undo Logging: Rollback uncommitted transactions


• Redo Logging: Redo committed transactions

Example:

css
CopyEdit
Log:
START T1
WRITE(T1, A, 50, 60)
WRITE(T1, B, 30, 40)
COMMIT T1

If crash occurs after COMMIT, redo T1.

🔸 2. Shadow Paging

Used for atomicity and durability without logs.


How it works:

• Maintain two page tables:


o Shadow Page Table: Points to current consistent state
o Current Page Table: Used for ongoing transactions
• On commit: Replace shadow table with current table

Example:

less
CopyEdit
Page Table:
[Page1 → Addr100], [Page2 → Addr200]

On write:
Copy Page2 to Addr300
Current Table: [Page2 → Addr300]

• On commit: Shadow Table ← Current Table


• On crash: Use Shadow Table (unchanged)

Drawback: High overhead due to page copying

🔸 3. Checkpoints

Checkpointing is a mechanism to limit the number of log entries to be processed during


recovery.

Process:

1. Pause transactions briefly


2. Flush all changes to disk
3. Write a CHECKPOINT record to the log

On recovery:

• Start from the last checkpoint


• Redo committed transactions after the checkpoint
• Undo uncommitted ones
Example:

css
CopyEdit
Log:
CHECKPOINT
START T3
WRITE(T3, A, 10, 20)

→ Start recovery from CHECKPOINT, not from the beginning

🔸 4. ARIES Algorithm (Advanced Recovery)

ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is the most widely used
recovery algorithm.

It uses:

• Write-Ahead Logging (WAL)


• Repeating history during redo
• Three-phase recovery

Phases:

1. Analysis Phase:
a. Determine which transactions were active at the time of crash
b. Identify dirty pages in buffer
2. Redo Phase:
a. Repeat all actions starting from the last checkpoint
b. Redo all operations to bring DB to the state at crash time
3. Undo Phase:
a. Rollback uncommitted transactions

Example:

sql
CopyEdit
Log:
START T1
WRITE T1 A 100 110
START T2
WRITE T2 B 200 220
COMMIT T1
--- crash ---

• Redo all actions from the last checkpoint


• Undo T2 because it did not commit

✅ Summary Table
Concept Description
Log-Based Recovery Uses logs to undo/redo transactions
Shadow Paging Uses a backup page table; no logs required
Checkpoints Periodic save points to reduce recovery time
ARIES Algorithm Industry-standard, robust algorithm for crash recovery
Types of Failures Transaction, system, disk, media, and communication failures

💡 Quick Visual (Recovery Overview)


pgsql
CopyEdit
+-------------------+ +-----------------------+
| Failure | ---> | Recovery Module |
+-------------------+ +-----------------------+
|
+--------------------------+---------------------------+
| | |
Log-Based Recovery Shadow Paging ARIES Algorithm
| | |
Undo / Redo Page Tables Analyze / Redo / Undo
📘 8. Indexing and Hashing
Indexing and hashing are techniques used to quickly locate data in a database without scanning
every record.

🔹 Need for Indexing


In large databases, searching through every row (linear scan) is inefficient. Indexes work like
the index of a book, allowing faster access to specific rows.

🔸 Without Index:

sql
CopyEdit
SELECT * FROM Employee WHERE EmpID = 104;

• Full table scan is needed.

🔸 With Index:

• DBMS can jump directly to the location of EmpID = 104.

🔹 Types of Indexing

🔸 1. Single-Level Index

A file contains a list of key-pointer pairs sorted by key. The index is one level above the data.

Example:

mathematica
CopyEdit
Index File:
[101 → Block 1]
[105 → Block 2]
[110 → Block 3]

Search for 105 → Go to Block 2 directly

🔸 2. Multi-Level Index

If the index file is large, we can index the index file (hierarchical index):

mathematica
CopyEdit
Level 2 (Top): [Key1 → Level 1 Block A], [Key2 → Block B]
Level 1: [Key101 → Data Block 1], [Key105 → Block 2]

Speeds up lookup using a binary search tree-like approach

🔸 3. Dense vs Sparse Index

Type Description Example


Dense Index has one entry per record Every EmpID → Pointer
Sparse Index has entry per data block Only first EmpID of each block

Dense: Fast search


Sparse: Less space

🔹 B+ Tree Indexing
B+ Trees are the most commonly used dynamic indexing structure.

🔸 Structure:

• All data entries are stored at the leaf level


• Internal nodes contain only keys and pointers
• Leaves are linked for fast range queries
Properties:

• Balanced: All leaves at the same depth


• Fast for equality and range queries

🔸 Example (Order = 3):

less
CopyEdit
Internal Nodes:
[30 | 60]
/ | \
Leaf: [10 20] [30 40 50] [60 70]

Search 40 → Go through [30 | 60] → Middle leaf

🔹 Hashing Techniques
Hashing is used to compute a direct address based on a hash function.

🔸 1. Static Hashing

• Hash function maps a key to a fixed number of buckets.

text
CopyEdit
h(EmpID) = EmpID mod 10

• EmpID 101 → 1 → Goes to Bucket 1

Problem: Doesn’t scale well when data grows (overflow occurs)

🔸 2. Dynamic Hashing

Used to overcome limitations of static hashing.


🔹 a) Extendible Hashing

• Uses a directory that grows/shrinks dynamically


• Directory entries point to buckets
• Each bucket has a local depth, directory has global depth

Example:

sql
CopyEdit
h(EmpID) = binary bits of hash
Global depth = 2 → 4 buckets (00, 01, 10, 11)

Insert 101 → 1100101 → Index = 01 → Bucket 01


If bucket full → Split bucket, increase depth

🔹 b) Linear Hashing

• Buckets are split incrementally


• No directory needed
• Uses a level and a split pointer

Good for large dynamic datasets, avoids sudden rehashing

✅ Summary Table
Concept Description
Indexing Improves query speed
Single-Level Index One index file with key-pointer pairs
Multi-Level Index Index on index; good for large data
Dense Index One index entry per record
Sparse Index One index entry per block
B+ Tree Balanced tree with fast range and point queries
Static Hashing Fixed buckets using hash function
Extendible Hashing Uses binary directory that grows dynamically
Linear Hashing Incremental bucket split without directory
✅ Visual Example: B+ Tree Lookup
css
CopyEdit
[50]
/ \
[10 20 30] [50 60 70]

Search for 60 → Traverse to right child → Found at leaf

📙 9. File Organization
File organization refers to how records are arranged in a file on storage media (like hard
disks). It directly affects performance, especially for searching, inserting, updating, and
deleting records.

🔹 1. File Structures

🔸 a) Heap File Organization (Unordered File)

• Records are stored in no particular order.


• New records are inserted at the end of the file.
• Fast for inserts, but slow for searches unless indexed.

Use Case: Temporary tables, logs.

Example:

less
CopyEdit
Employee Table: (No specific order)
[103, "Alex"] → [101, "John"] → [104, "Jane"] → ...
Searching for EmpID = 104 may require scanning entire file.

🔸 b) Sorted File Organization

• Records are stored in sorted order based on a key (e.g., EmpID).


• Binary search can be used → faster than heap files for searches.
• Insertion is costly (requires shifting records to maintain order).

Use Case: Static or rarely updated datasets.

Example:

less
CopyEdit
Sorted by EmpID:
[101, "John"] → [103, "Alex"] → [104, "Jane"]

🔸 c) Hashed File Organization

• Uses a hash function to compute the address (bucket) of a record.


• Very fast for exact match queries, poor for range queries.

Use Case: Lookup tables, exact match queries.

Example:

lua
CopyEdit
h(EmpID) = EmpID mod 10

EmpID 104 → Bucket 4 → [104, "Jane"]

🔹 2. Sequential File Organization

• A type of sorted file where records are stored in a sequential manner.


• Supports sequential processing (e.g., reports, summaries).
• Often used with a primary key index and optionally overflow areas to handle inserts.

Example: Bank accounts sorted by Account Number.


Inserts are added in overflow area until a batch reorganization.

🔹 3. Clustering and Multi-Key Access

🔸 a) Clustering

• Related records from different tables are stored physically close together.
• Reduces disk I/O in join operations and frequently accessed groups.

Example: Clustering customer and their orders:

yaml
CopyEdit
[CustomerID: 101, Name: John]
└─ [OrderID: 1, Amount: $100]
└─ [OrderID: 2, Amount: $150]

Instead of storing customers and orders in separate files, they are clustered together.

🔸 b) Multi-key Access (Inverted File Organization)

• Allows retrieval of records based on multiple keys.


• Uses inverted indexes to map each attribute to a list of matching records.

Use Case: Text search, library systems, keyword indexing.

Example: Search a book database by both Author and Genre:

diff
CopyEdit
Index: Author → [BookID 1, BookID 3]
Index: Genre → [BookID 2, BookID 3]

→ BookID 3 matches both.

✅ Comparison Table
File Organization Search Efficiency Insert/Update Use Case
Heap Low (linear scan) Fast Logs, temp data
Sorted High (binary search) Slow Static lookup tables
Hashed High (exact match) Moderate Lookups, user authentication
Sequential Medium Moderate Reports, batch processing
Clustered High (related access) Moderate Joined queries
Multi-key High (multi-attribute) Complex Search engines, catalogs

✅ Summary

• Heap files: Fast inserts, poor for searching.


• Sorted/sequential files: Good for searches, expensive insertions.
• Hashed files: Great for exact searches, not for range.
• Clustering: Stores related records together, helpful in joins.
• Multi-key: Enables indexing/searching on multiple attributes.

📗 10. Query Processing and Optimization


Query processing and optimization in a DBMS is the process of translating a high-level SQL
query into an efficient execution strategy. The goal is to minimize resource usage (cost) like
CPU, I/O, and memory while producing the correct result.
🔹 1. Steps in Query Processing
Query processing involves multiple stages to transform and execute a query:

✅ Step 1: Parsing and Translation

• The SQL query is parsed for syntax and semantics.


• Converted into an internal query representation (usually a relational algebra
expression).

Example:

sql
CopyEdit
SELECT name FROM Employee WHERE dept = 'HR';

→ Translated to:

bash
CopyEdit
π_name (σ_dept='HR'(Employee))

✅ Step 2: Query Optimization

• Multiple logical plans are generated.


• DBMS chooses the least-cost plan using transformation rules and heuristics.

✅ Step 3: Evaluation Plan Generation

• A physical query plan is generated (e.g., which join algorithm, which index to use).
• A query tree is produced with execution operators.

✅ Step 4: Query Execution

• The chosen execution plan is run, and results are returned to the user.
🔹 2. Measures of Query Cost
The cost of executing a query depends on:

Measure Description
Disk I/O Cost of reading/writing from/to disk (major cost)
CPU Time Time to process tuples, perform comparisons, etc.
Memory Usage Space needed for sorting, hashing, buffering
Communication In distributed systems, cost of sending/receiving data

Goal of Optimization: Minimize total estimated cost.

🔹 3. Query Trees and Evaluation Plans

✅ Query Tree

A tree representation of relational algebra operations.

Example SQL:

sql
CopyEdit
SELECT name FROM Employee WHERE salary > 50000;

Query Tree:

markdown
CopyEdit
π_name
|
σ_salary > 50000
|
Employee

Each node is an operator, and children are its inputs.


✅ Evaluation Plan

• Specifies the physical operators (e.g., table scan, index scan, hash join).
• Includes access paths and join methods.

🔹 4. Transformation Rules
These are equivalency rules used to transform one relational algebra expression into another
equivalent one (that may be more efficient).

✅ Common Rules:

1. Cascade of selections:

bash
CopyEdit
σ_age > 30 ∧ dept = 'HR'(Employee)
= σ_age > 30(σ_dept = 'HR'(Employee))

2. Commutativity of joins:

nginx
CopyEdit
A⋈B=B⋈A

3. Associativity of joins:

mathematica
CopyEdit
(A ⋈ B) ⋈ C = A ⋈ (B ⋈ C)

4. Push selections down:


• Perform selections as early as possible to reduce intermediate results.

Example:

sql
CopyEdit
SELECT * FROM Employee, Department WHERE [Link] = [Link] AND
[Link] = 'HR';

Transformation:

pgsql
CopyEdit
σ_Department.name='HR'(Employee ⋈ Department)
→ Employee ⋈ σ_name='HR'(Department)

Fewer records are joined.

🔹 5. Heuristics for Query Optimization


Heuristic rules help generate a good query plan quickly (without exhaustively checking all
possibilities):

✅ Common Heuristics:

1. Perform selection and projection early to reduce tuple size.


2. Use indexes if available.
3. Use the most selective selection conditions first (those that reduce rows most).
4. Perform joins in order of increasing result size (small joins first).
5. Replace nested queries with joins where possible.
6. Avoid cross products unless necessary.

✅ Example: Optimization in Action


SQL Query:

sql
CopyEdit
SELECT name
FROM Employee, Department
WHERE Employee.dept_id = [Link]
AND [Link] = 'HR'
AND [Link] > 30;

Step 1: Relational Algebra

bash
CopyEdit
π_name (σ_Department.name='HR' ∧ [Link] > 30 (Employee ⋈ Employee.dept_id =
[Link] Department))

Step 2: Apply Transformation Rules

scss
CopyEdit
π_name (
(σ_age > 30(Employee))

(σ_name = 'HR'(Department))
)

→ Now each relation is filtered before joining — reduces size of join input.

Step 3: Evaluation Plan

• Use index scan on [Link] = 'HR'


• Use index scan or selection on [Link] > 30
• Use hash join or merge join on dept_id = id

✅ Summary Table
Concept Description
Query Processing Converts SQL to efficient physical execution
Query Cost Measured by disk I/O, CPU, memory, etc.
Query Trees Represent algebra expressions as trees
Transformation Rules Logical rewrites of queries for efficiency
Heuristics Practical rules to quickly find a good (not necessarily best) query plan
📕 11. Distributed Databases (Advanced)
A Distributed Database System (DDBS) is a database where data is stored across multiple
locations (nodes), but it appears to users as a single logical database.

🔹 1. Characteristics of Distributed Databases


Feature Description
Location Transparency Users don’t need to know the physical location of data
Replication Transparency Users don’t know whether data is replicated or not
Fragmentation Transparency Users don’t know if data is split across multiple sites (fragmented)
Concurrency Control Ensures consistency in multi-user environments across locations
Fault Tolerance System continues operating despite failures in some nodes
Scalability Easy to add new nodes and distribute data load

Example: A bank with branches in different cities stores customer data locally but provides
unified access from any branch.

🔹 2. Fragmentation and Replication

✅ a) Fragmentation

Splits the database into smaller pieces (fragments) distributed across multiple sites.

Types of Fragmentation:

• Horizontal Fragmentation: Rows are split based on conditions.


o Example:

sql
CopyEdit
Employee_USA = σ_country='USA'(Employee)
Employee_India = σ_country='India'(Employee)

• Vertical Fragmentation: Columns are split.


o Example:

text
CopyEdit
Fragment 1: (EmpID, Name) Fragment 2: (EmpID, Salary)

• Hybrid: Combination of horizontal and vertical.

✅ b) Replication

Copies of data are maintained at multiple sites for availability and fault tolerance.

Types of Replication:

• Full Replication: Entire DB is copied at each site.


• Partial Replication: Only some data is replicated.
• No Replication: Each site has unique data (minimal redundancy).

Example: Customer data is stored in both New York and London servers to improve
availability and performance for users in both regions.

🔹 3. Distributed Transactions
A distributed transaction is one that accesses and modifies data on more than one site.

Characteristics:

• Must satisfy ACID properties across sites.


• Coordinated using a Transaction Manager (TM).

Example: A fund transfer involves:

• Debiting from Account A (Site 1)


• Crediting to Account B (Site 2)
This must be atomic — both must succeed or both must fail.

🔹 4. Commit Protocols
Used to ensure atomicity in distributed transactions.

✅ a) Two-Phase Commit Protocol (2PC)

Used to ensure that all nodes agree to commit or abort.

Phases:

1. Prepare Phase (Voting):


a. Coordinator asks all participants: “Can you commit?”
b. Participants respond: “Yes” or “No”
2. Commit/Abort Phase:
a. If all say “Yes” → Commit
b. If any say “No” → Abort

Example:

pgsql
CopyEdit
Client initiates transaction → Coordinator asks Site A, B
→ Both respond YES → All Commit

Problem: If the coordinator crashes after the “prepare” phase, participants may wait
indefinitely (blocking problem).

✅ b) Three-Phase Commit Protocol (3PC)

Solves the blocking problem of 2PC by adding an intermediate step.

Phases:

1. Can Commit: Like prepare


2. Pre-Commit: Coordinator tells all to prepare for commit
3. Commit: Final commit command

Advantage: Non-blocking even if coordinator crashes during phase 2.

🔹 5. CAP Theorem
Proposed by Eric Brewer, the CAP Theorem states that a distributed system cannot
simultaneously guarantee all three of the following:

Property Meaning
C - Consistency All nodes see the same data at the same time (like in ACID)
A - Availability Every request gets a response (no downtime)
P - Partition Tolerance System continues to operate despite network failures

✅ Implication:

• Must choose 2 out of 3 in real-world systems.

Type Examples
CP HBase, MongoDB (default)
AP CouchDB, Cassandra
CA Traditional RDBMS (non-distributed, no partitioning)

Example: During a network partition, you must sacrifice consistency or availability.

✅ Summary Table
Concept Explanation
Distributed DB DB split across sites, appears as a single logical DB
Fragmentation Dividing DB into parts (horizontally or vertically)
Replication Copying data to multiple sites
Distributed Txns Transactions involving multiple sites
2PC Protocol Commit based on all participants voting
3PC Protocol Adds safety step to avoid blocking
A system can only guarantee 2 of: Consistency, Availability, Partition
CAP Theorem
Tolerance
📒 12. NoSQL and New Trends
🔹 1. Introduction to NoSQL
NoSQL (Not Only SQL) databases are designed to handle large volumes of unstructured,
semi-structured, or rapidly changing data. They are ideal for big data, real-time web apps,
and distributed systems where traditional relational databases face scalability or flexibility
issues.

✅ Key Features:

• Schema-less or dynamic schema


• High scalability (horizontal scaling)
• High availability
• Fast read/write performance
• Suitable for cloud and distributed environments

Example Use Cases: Social networks, IoT, recommendation systems, real-time analytics

🔹 2. Types of NoSQL Databases


There are four main categories, each optimized for specific data models and access patterns.

✅ a) Key-Value Stores

• Data is stored as key-value pairs.


• Simple and fast for lookups by key.
• Ideal for caching, session management.

Example:

json
CopyEdit
"User123": { "name": "Alice", "age": 30 }

• Popular DBs: Redis, Riak, DynamoDB


✅ b) Document Stores

• Store data in documents (typically JSON, BSON, or XML).


• Flexible schema: each document can have different fields.

Example:

json
CopyEdit
{
"_id": "123",
"name": "Alice",
"address": {
"city": "NYC",
"zip": "10001"
}
}

• Popular DBs: MongoDB, CouchDB

✅ c) Column-Family Stores

• Store data in tables with rows and dynamic columns, grouped in families.
• Optimized for reading/writing large volumes of data across many rows.

Example:

yaml
CopyEdit
RowID: 001
Name: John | Email: john@[Link] | Phone: 12345

• Popular DBs: Cassandra, HBase


✅ d) Graph Databases

• Designed for graph-like relationships (nodes and edges).


• Best for social networks, recommendation engines, and fraud detection.

Example:

scss
CopyEdit
(Alice) —[FOLLOWS]→ (Bob)

• Popular DBs: Neo4j, ArangoDB

🔹 3. CAP Theorem Revisited in NoSQL


NoSQL systems are often designed to tolerate network partitions (P), and then choose
between:

Model Guarantees
CP Consistency & Partition Tolerance (e.g., HBase)
AP Availability & Partition Tolerance (e.g., Cassandra, CouchDB)

NoSQL DBs sacrifice strict consistency for availability and performance, especially under
distributed environments.

🔹 4. BASE vs ACID
Property ACID (Traditional DBs) BASE (NoSQL)
Atomicity All or nothing transactions May allow partial updates
Consistency Always in a valid state Eventual consistency
Transactions do not interfere with each
Isolation Soft isolation
other
Durability Once committed, data is safe Eventually persists
Basically System appears to work most of the

Available time
Soft-state — State may change even without input
Eventual Consistent eventually, not

consistency immediately
Example:

• In a shopping cart, adding an item may not reflect instantly but will be eventually
consistent across nodes.

🔹 5. Comparison: NoSQL vs RDBMS


Feature RDBMS NoSQL
Schema Fixed schema Flexible/dynamic schema
Scalability Vertical (scale-up) Horizontal (scale-out)
Transactions ACID-compliant BASE-compliant
Data Model Tabular (tables, rows) Key-value, document, graph, column
Joins Supported Generally not supported
Best for Structured data, complex queries Big Data, real-time, high-speed environments
Examples MySQL, PostgreSQL, Oracle MongoDB, Cassandra, Redis, Neo4j

✅ Summary Table
Concept Summary
NoSQL Flexible, scalable databases for big and unstructured data
Types Key-value, Document, Column, Graph
CAP Theorem NoSQL systems usually offer either CP or AP
BASE vs ACID BASE trades strong consistency for availability and performance
Use Cases IoT, social media, real-time analytics, recommendation engines, big data

📓 13. Security and Authorization


Database security ensures that data is protected from unauthorized access, breaches, and
misuse, while maintaining data integrity and availability. It covers authentication,
authorization, auditing, and attack prevention.
🔹 1. Database Security Issues
Some common security concerns in databases include:

Threat Type Description Example


Unauthorized
Users gaining access without proper credentials Hacker accessing user data
Access
SQL Injection Malicious code inserted into SQL queries ' OR 1=1 -- in login fields
A user exploiting a flaw to become
Privilege Escalation Low-privileged user gaining high-level access
admin
Data Tampering Modifying data without permission Altering financial records
Data Leakage Sensitive data exposed to outsiders Credit card details exposed
Malware/Ransomw Destructive or ransom-based attacks on the Encrypting DB and demanding
are database payment

🔹 2. Access Control Models


Access control models define how users are permitted to interact with the database and
what operations they're allowed to perform.

✅ a) Discretionary Access Control (DAC)

• Based on user identity and privileges.


• Users can grant/revoke access to other users.

Example (SQL):

sql
CopyEdit
GRANT SELECT ON Employee TO user123;
REVOKE SELECT ON Employee FROM user123;

Advantage: Flexible
Disadvantage: Risky if users grant unnecessary access

✅ b) Mandatory Access Control (MAC)

• Based on security labels (e.g., Confidential, Secret, Top Secret)


• System enforces access rules, not users

Example:

• An employee with clearance level "Confidential" cannot access "Top Secret" data

Advantage: High-level security enforcement


Disadvantage: Less flexible, complex to manage

✅ c) Role-Based Access Control (RBAC)

• Access is based on user roles (e.g., Admin, HR, Auditor)


• Easier to manage permissions by assigning roles instead of per-user rights

Example:

sql
CopyEdit
-- Create role and grant privileges
CREATE ROLE hr_role;
GRANT SELECT, UPDATE ON Employees TO hr_role;

-- Assign user to role


GRANT hr_role TO user123;

Advantage: Scalable and easy to manage in large systems

🔹 3. SQL Injection and Mitigation

✅ a) What is SQL Injection?

An attack technique where malicious SQL statements are inserted into input fields to
manipulate the backend.

Example Attack:

sql
CopyEdit
-- User input:
username = 'admin' --
password = anything

-- Query becomes:
SELECT * FROM users WHERE username = 'admin' --' AND password = 'anything';

This bypasses login authentication.

✅ b) Mitigation Techniques

Technique Description Example


Use parameterized PreparedStatement pstmt = [Link]("SELECT * FROM
Prepared Statements
queries users WHERE id = ?")
Sanitize and
Input Validation Allow only alphanumeric characters
validate inputs
DB user has
Least Privilege Avoid using root accounts for apps
minimal rights
Web Application Filters malicious
Blocks known attack patterns
Firewall (WAF) traffic
Encapsulate SQL
Stored Procedures Reduces injection risk if written securely
logic

Best Practice: Always use prepared statements or ORM libraries that handle SQL
safely.

✅ Summary Table
Topic Key Point
Security Issues Include injection, unauthorized access, privilege abuse
DAC Users control access (flexible but risky)
MAC System enforces rules (secure but rigid)
RBAC Role-based permissions (scalable, ideal for enterprises)
SQL Injection Code injected into SQL to manipulate DB
Mitigation Use prepared statements, validation, WAF, and principle of least privilege
🔐 Example in Web App Context
Suppose you're developing an HR portal:

• RBAC:
o HR can edit salaries (role: HR)
o Managers can view reports (role: Manager)
• SQL Injection Protection:

python
CopyEdit
# Python with SQLite
[Link]("SELECT * FROM employees WHERE emp_id = ?", (emp_id,))

• Access Control:
o Use authentication tokens
o Restrict sensitive APIs based on roles

Common questions

Powered by AI

The CAP theorem states that in a distributed database, it is impossible to simultaneously guarantee Consistency, Availability, and Partition tolerance. Systems can support at most two of these properties at any given time . In NoSQL databases, a choice is often made between CP (Consistency and Partition tolerance) or AP (Availability and Partition tolerance) based on the application needs. For example, systems prioritizing Consistency and Partition tolerance may delay availability during partitions, while systems choosing Availability and Partition tolerance allow for higher availability by permitting temporary inconsistencies . This trade-off is crucial for performance tuning in environments requiring high availability, such as systems supporting real-time analytics and IoT .

ER Diagrams play a crucial role in database design by providing a visual representation of entities, attributes, and relationships among data, which helps in understanding and planning the database structure before implementation. They allow designers to map out the various entities (real-world objects) and how they interact with each other, ensuring all necessary data points are captured . This visualization aids in identifying potential redundancies and ensuring all relationships are properly defined, which in turn supports data integrity and retrieval efficiency. By using ER Diagrams, designers can normalize the data model effectively, which minimizes data redundancy and optimizes data storage .

A DBMS offers several advantages over a traditional file system, including improved data consistency and integrity due to constraints and transaction management, which prevent data anomalies and ensure ACID properties (Atomicity, Consistency, Isolation, Durability). It also supports concurrent access by multiple users through locking mechanisms and provides robust security features to protect data from unauthorized access . In addition, a DBMS allows for easier data retrieval and querying through more complex search capabilities, such as SQL, which enables users to perform complex queries on large data sets efficiently .

In the relational model, keys are fundamental to maintaining data integrity. A primary key uniquely identifies each tuple in a relation and cannot be NULL, ensuring that every record is distinct and identifiable . Candidate keys provide alternative unique identifiers, and a chosen candidate key becomes the primary key. Super keys, which include any combination of attributes that can uniquely identify tuples, provide additional layers of uniqueness . Foreign keys establish and enforce referential integrity by linking tables, ensuring that relationships between tables remain consistent and that changes in one table appropriately update related data in another .

Query optimization employs several strategies to reduce computational cost and improve performance. Heuristic rules like applying selection and projection operations early, using appropriate indexes, and prioritizing selective conditions to minimize intermediate result sizes are key . Additionally, query optimization utilizes transformation rules to reconfigure the query for more efficient execution, such as replacing nested queries with joins where appropriate and avoiding cross products unless absolutely necessary . Cost-based optimization techniques are used to evaluate different possible query execution plans by measuring resource costs such as disk I/O and CPU, selecting the most efficient one for execution .

Access control models enhance database security by defining how users access data based on predefined policies. Discretionary Access Control (DAC) allows users to grant permissions to others, providing flexibility but risking excess privilege grants . Mandatory Access Control (MAC) depends on system-enforced rules based on data sensitivity levels, offering high security but less flexibility . Role-Based Access Control (RBAC) assigns users roles with specific permissions, easing management, particularly in large systems, though it may require careful role-definition to avoid permissions overlap . Each model focuses on protecting data from unauthorized access while maintaining necessary data operations.

Normalization is a systematic process of organizing data to reduce redundancy and improve data integrity by dividing data into multiple related tables. This process involves applying a series of normal forms (from 1NF to 5NF) that each address specific logical inconsistencies and dependencies . In contrast, decomposition involves dividing a table into smaller tables without losing any data, focusing on eliminating anomalies like insertion, update, and deletion anomalies . Both processes are crucial for database efficiency as they enhance query performance by reducing data duplication, improve consistency, and simplify maintenance by ensuring that updates only affect a single source of data .

Dynamic and static hashing techniques play significant roles in database indexing by influencing performance and complexity. Static hashing maintains a fixed directory structure and is simpler to implement, but it becomes inefficient as the file grows, causing collisions and requiring complex overflow handling . Dynamic hashing, such as extendible or linear hashing, adapts to data growth by altering its directory structure, efficiently managing collisions and minimizing overflow . This flexibility allows for improved data access times and better space utilization. While dynamic hashing improves performance, it introduces management complexity as it requires handling the changing directory structures and potential index recalibrations . These trade-offs are crucial considerations when choosing a hashing mechanism for a database system.

Views in a database act as virtual tables created from SELECT queries, providing several benefits. They enhance data security by restricting user access to sensitive data through selective exposure, ensuring users only see relevant data without revealing underlying schemas . Views can also simplify complex queries by encapsulating them in a single accessible entity, reducing the need for repeated complex query logic. However, views can incur performance costs due to additional processing, especially if the view logic is complex or lacks indices to speed up data retrieval . Additionally, views might become outdated if underlying tables change, potentially leading to mismatches unless refreshed, and may restrict updates in some scenarios.

ACID properties — Atomicity, Consistency, Isolation, Durability — ensure reliable transaction processing in a database management system. Atomicity guarantees that a transaction is completely performed or not executed at all, thus leaving the database in a consistent state. Consistency ensures that a transaction only brings the database from one valid state to another, maintaining data integrity rules . Isolation prevents transactions from interfering with each other by properly managing concurrent database access, and Durability ensures that the results of a transaction are permanently stored in the database, even in case of a system failure . These properties collectively ensure that database happens consistently and accurately, even in complex multi-user environments.

You might also like