Database Management Systems Overview
Database Management Systems Overview
Lecture Notes
● Database Initial Study (Planning & Analysis): In this first phase, designers study
the current system and define the database’s purpose and scope. They analyze
organizational needs, identify problems in the current data management, and set
clear objectives. This involves understanding what data is needed, constraints (like
budget or hardware), and what the database should achieve for the business.
● Database Design (Logical and Physical Design): Here, requirements are
translated into a detailed design blueprint. It often splits into logical design, where a
high-level data model (e.g., an ER diagram) is created, and physical design, which
decides how the data will be stored on hardware. Key tasks include defining entities,
attributes, relationships, applying normalization to eliminate redundancies, and
choosing physical storage structures or indexes for performance.
● Implementation and Loading: In this phase, the design is implemented in a DBMS.
Database schemas (tables, columns, constraints) are created using DDL commands,
and initial data is loaded or migrated into the database. For example, using SQL
CREATE TABLE statements to create tables, then bulk-loading data from existing
sources or files. Proper access permissions are also configured during this phase.
● Testing and Evaluation: The new database is tested thoroughly to ensure it meets
requirements and is error-free. This includes functional testing (do queries and
transactions produce correct results?), performance testing (can it handle the
expected load quickly?), and security/recovery testing (are access controls and
backup mechanisms working?). Any issues identified are fixed or fine-tuned in this
stage.
● Operation (Deployment): Once testing is successful, the database is deployed for
real use. It goes into production operation, supporting the organization’s day-to-day
workflows. End-users and applications start interacting with the database for their
tasks. Training users and DBAs may be part of this phase to ensure smooth
adoption.
● Maintenance and Evolution: After deployment, the database enters an ongoing
maintenance phase. This involves regular tasks like performance tuning, backups,
recovery drills, updates/patches, and handling new requirements or changes over
time. As business needs evolve, the database schema might be altered or new
features added (e.g., new tables or indexes). Monitoring is continuous to ensure the
database remains reliable and efficient.
Throughout the DBLC, documentation and feedback are important. By following these
phases in order, organizations ensure the resulting database is well-designed and can be
adapted over its life cycle, providing a robust data foundation for applications.
● Homogeneous DDBMS: All sites use the same DBMS software and schema. The
distributed system appears uniform. This simplifies integration – e.g., a company
running identical database software at multiple offices. Homogeneous systems allow
seamless data sharing but may be less flexible in heterogeneous environments.
● Heterogeneous DDBMS: Different sites may use different DBMS products or
schemas. The system requires translation or middleware to enable communication
between, say, an Oracle database and a MongoDB store. This adds complexity (due
to differences in data models and query languages), but allows integrating diverse
data sources. A special case is federated databases, where independent databases
cooperate through a unified interface but retain some local autonomy.
Levels of Data and Process Distribution: We can classify database systems based on
how both data storage and processing are distributed across sites:
● Single-Site Processing, Single-Site Data (SPSD): All data resides at one site and
all query processing is done at that same site. This is essentially a centralized
database on one computer. It’s the simplest case with no distribution.
● Multiple-Site Processing, Single-Site Data (MPSD): The data is stored at one
central site, but processing is distributed across multiple sites. For example, several
application servers at different locations issue requests to a central database server.
Here, only one copy of the database exists, but various networked machines execute
parts of the application. This scenario offloads some processing to distributed nodes,
but can cause heavy network traffic since the data itself isn’t distributed.
● Multiple-Site Processing, Multiple-Site Data (MPMD): This is a fully distributed
scenario: data is partitioned and/or replicated across multiple sites, and processing
(queries/transactions) can occur at multiple sites as well. There is no central primary
location – each site holds portion(s) of the data and handles local requests, but they
work together as one database system. This offers the greatest potential for locality
and fault tolerance, but requires sophisticated DDBMS software to coordinate
distributed queries, ensure data consistency, and handle failures gracefully.
Transparency Features: A good DDBMS offers transparency, meaning users do not need
to know that data is distributed. For example, location transparency means a user’s query
doesn’t have to specify the site of the data – the system figures that out. More robust is
fragmentation transparency, where even if a single table is split into fragments across
sites, the user can query it as a whole. The system also handles distributed query
optimization and distributed transactions (often using a two-phase commit protocol to
ensure atomicity across sites).
In summary, a DDBMS “looks like” a single database to users but under the hood it manages
data and processing across a network of computers. This can yield advantages like
improved reliability (no single point of failure) and scalability (add more nodes to handle
growth) – major tech companies like Google and Amazon leverage these benefits. However,
it also introduces challenges such as maintaining consistency across replicas, higher
complexity in query processing, and added security considerations.
● Object Identity (OID): Every object is assigned a unique identifier, independent of its
attribute values. This means an object can be distinctly referenced regardless of
changes to its content. This identity stability enables objects to be shared and related
without relying on primary key values. It’s different from relational keys because an
OID is typically system-generated and not visible to the user, yet it allows for
pointers/references between objects.
● Complex Objects & Type Constructors: Objects can be composite, meaning they
can contain other objects or collections. OODBMS support complex data types
through type constructors like sets, lists, tuples, etc., allowing nested structures.
For example, an Order object might have a collection (list) of Item objects as part of
its state. These nested objects can be stored and retrieved as a single unit, reflecting
real-world hierarchies.
● Encapsulation and Methods: In an OODB, an object not only has attributes (data)
but also methods – procedures or functions that operate on that object’s data. This
is the principle of encapsulation: data and its related operations are packaged
together. The database can store the method logic, so invoking a method on a stored
object can be done through the OODBMS. This is a shift from relational DBs where
operations (queries) are separate from the data.
● Classes and Inheritance: Objects are instances of classes (or types). A class
defines the structure (attributes) and behavior (methods) that its objects will have.
OODBMS typically support class hierarchies and inheritance, meaning a class can
inherit properties from a superclass. Inheritance promotes reusability and can
simplify querying.
● Persistence of Objects: OODBMS provide object persistence, storing objects onto
disk so they outlive program termination. This can happen transparently – you can
save an object and later retrieve it with the same structure and identity. Persistence
by reachability is a common technique: any object that is reachable from an already
persistent object (via references) can automatically also be treated as persistent.
● Object Structure – Attributes and Relationships: Each object’s state is defined by
its attributes (which can be primitive or references to other objects). Relationships
between objects are represented not by foreign keys as in a relational model, but by
direct references or pointers courtesy of OIDs. This allows navigating the
relationship directly (following the pointer) instead of performing a join.
Active Database: A database with a set of such triggers is called an active database. The
DBMS continuously checks for trigger conditions as transactions execute. When a triggering
event occurs, the system automatically executes the trigger’s action. This all happens as
part of the transaction, maintaining consistency.
Use cases for triggers: They enforce complex business rules and automate reactive
behavior in the database. For instance, a trigger could be used for:
● Auditing and logging changes (recording modifications automatically).
● Maintaining derived or summary data (e.g., auto-update an inventory count after an
order insert).
● Cascading actions (like on delete, perform other clean-ups).
● Enforcing complex constraints (e.g., "any withdrawal over $10,000 should send an
alert").
Active databases thus move some logic into the DBMS, ensuring it’s consistently applied no
matter how the database is accessed. In SQL, triggers are defined with CREATE TRIGGER.
The DBA specifies the timing (BEFORE or AFTER the event), the event
(INSERT/UPDATE/DELETE), and whether it fires per statement or per row.
In summary, active databases extend the passive nature of traditional DBs by allowing
reactive behavior using triggers. This capability is particularly useful in maintaining derived
data, complex constraints, and event-driven workflows directly in the database.
● Valid Time: The time period during which a data fact is true in the real world. For
example, a row in a salary table might have a Valid Time indicating from when to
when that salary was effective for the employee. A valid-time temporal database
allows you to query past states, like “What was the price of product X on January
15?”
● Transaction Time: The time when the data was stored in the database, i.e., when it
was current in the database system. Transaction time (sometimes called system
time) tracks the history of the database itself – every row is stamped with the time it
was inserted and the time it was logically deleted. A transaction-time temporal
database preserves all changes: when a row is updated, the old version is retained
with a transaction end time. This lets you query “what did the database look like as of
last Monday?”
A bitemporal database supports both valid and transaction time. This means it can tell the
history of the world as recorded, including corrections.
● Temporal (or system-versioned) tables in the SQL:2011 standard, where the DBMS
automatically manages extra columns for transaction start and end times.
● Temporal query language support, for instance, the ability to write queries like
SELECT ... FROM Employees FOR SYSTEM_TIME AS OF '2025-01-01' to get the
state as of a certain date.
Use cases: Finance (stock price history, auditing transactions over time), insurance (policy
coverage periods), and any system requiring history/audit trails.
In summary, temporal databases provide a uniform and powerful way to deal with
time-dependent data, making it easier to answer historical and status-as-of questions directly
via queries, rather than having to maintain history tables manually.
● Spatial Data Types: A spatial DBMS provides data types like POINT (a single
location), LINESTRING (a series of points forming a line or curve), POLYGON (a
closed area), and often more complex types. These allow modeling of real-world
entities.
● Spatial Indexing: Because spatial queries involve location-based criteria (“find all
stores within 5 miles of me”), specialized indexes like R-trees and quad-trees are
used. These indexes efficiently organize the two-dimensional space so the database
can quickly find geometries that satisfy spatial relationships.
● Spatial Queries and Operators: Spatial DBMS support operations such as: overlap,
containment, adjacency, distance calculations, etc. Examples: Range query – “find
all points within a given rectangle or radius”; Nearest neighbor query – “find the
closest hospital”; Spatial join – “find all parks that intersect with flood zones”. Query
languages (like SQL with spatial extensions, e.g., PostGIS) provide functions like
ST_Contains(A,B) or ST_Distance(A,B).
● Spatial Relationships: Spatial data has relationships that are geometric.
Topological relationships (like touches, crosses), directional (north/south of), and
metric (within certain distance) are important.
A spatial database system is essentially a regular DBMS with extra capabilities for spatial
data. Use cases include Geographic Information Systems (GIS), urban planning,
location-based services (points of interest, GPS traces), and meteorology.
In summary, spatial databases treat space as a first-class citizen in data management. They
extend SQL with spatial types and operations, and use spatial indexes to handle the typically
large volume of spatial data efficiently.
Examples of multimedia databases in use: a digital library storing PDFs and allowing
full-text search; an archive of satellite images with queryable timestamps; an audio database
supporting search by artist or a short audio clip.
In summary, a multimedia DB "allows users to store and query images, video, audio, and
documents" directly, and provides functionality to query based on multimedia content. This
contrasts with a plain file system by offering query language support and possibly
content-based search, integrating media data with other enterprise data.
● Most commercial DBMSs use DAC, which is identity-based and relies on privileges
(permissions) that can be granted or revoked by users (typically, by object owners or
administrators).
● It is "discretionary" because it is at the discretion of the data owner who else gets
access.
● In SQL, this is implemented via commands like GRANT and REVOKE. For example,
GRANT SELECT, UPDATE ON Employees TO Alice;.
● DAC is flexible but relies on users to set up correctly and can be error-prone in large
settings.
Comparison – DAC vs MAC: DAC offers ease and flexibility but is less secure against
insider threats, whereas MAC offers high security assurance by enforcing a system-wide
policy, but is inflexible.
● Encryption: Critical for protecting sensitive data, both at the column level, at rest (on
disk), and in transit (using TLS/SSL).
● Auditing: Recording who did what (logins, schema changes, data accesses) to help
detect suspicious activities and support forensic analysis.
● SQL Injection prevention: Implementing procedures with strict input handling to
prevent application-level exploits.
Overall, to secure a database, one must combine multiple layers: strong authentication,
robust access controls (DAC/RBAC/MAC), encryption of data at rest and in transit, and
diligent monitoring/auditing.
● A common variant where a transaction holds all exclusive (write) locks until it
commits or aborts, and only releases them at the very end.
● This ensures serializability and simplifies recovery, and is what ensures strong
isolation (no dirty reads).
Technologies for Big Data: Traditional databases falter with big data, so technologies like
Hadoop (with its HDFS distributed storage and MapReduce processing) and NoSQL
databases emerged to handle scale. Streaming systems like Apache Kafka and Spark
Streaming handle velocity.
● Web and Social Media Analytics (aggregating and mining massive interaction data).
● E-commerce and Customer Behavior (real-time analysis of clickstreams for
recommendations).
● IoT and Sensor Data (processing millions of sensor readings per second for
monitoring).
● Finance (fraud detection by scanning for unusual patterns across millions of
transactions).
The common thread is that Big Data technologies enable organizations to derive value from
data that is too large, too fast, or too varied for traditional systems.
● Key-Value Stores: Simplest model, stores data as a key and an associated value.
Very scalable and fast for lookup by key. Use cases: caching (e.g., Redis,
Memcached), user session storage. Advantage: Simplicity and speed.
● Document Stores: Store data in documents, often JSON or BSON objects, which
can have varying structures (no fixed schema). The document is a self-contained
record. Use cases: content management, catalogs, event logging. Advantage:
Flexible schema and convenient for object-like data storage (e.g., MongoDB).
● Column-Family Stores (Wide-Column Databases): Store data in sparse,
distributed tables organized into rows and column families. Each row can have
thousands of columns, and you don’t predefine all columns. Use cases: very large
datasets where you mostly access a subset of columns (e.g., Apache Cassandra,
HBase). Advantage: Efficient storage for sparse data and fast retrieval of data by
column family.
● Graph Databases: Designed for highly interconnected data. Store data as nodes
(entities) and edges (relationships) with properties. Use cases: social networks,
recommendation engines, fraud detection (e.g., Neo4j). Advantage: The ability to
traverse relationships quickly and flexibly.
NoSQL databases are often chosen for big data and web-scale applications. However, they
may sacrifice some of the strong ACID transactional guarantees and rich querying
capabilities of SQL databases.
Basic CRUD in MongoDB: CRUD stands for Create, Read, Update, Delete.
Operation Concept Example (Mongo Shell)
Read Retrieves documents using find. Uses [Link]({ status: "A", age: {
(Query) a JSON-like query language with $gt: 25 } });
operators (e.g., $gt).
Indexing: MongoDB supports various index types to improve query performance, including:
Single Field, Compound, Multikey (on arrays), Text (for search), Geospatial, Unique, and
TTL (for data expiration).
In summary, MongoDB, with its document model, flexible schema, and built-in scaling
features (sharding, replica sets), has become a popular general-purpose NoSQL database,
often used in web applications and real-time analytics.
● A powerful, pipeline-based system for complex data processing within the database
(similar to SQL GROUP BY and joins).
● Data passes through an array of stages, such as $match (filter), $group (aggregate),
$project (reshape), $sort, and $lookup (join with another collection).
● Example: Grouping by region and summing sales. The work is done on the server,
saving bandwidth and leveraging database optimizations.
● Compound Indexes help queries that filter on a prefix of the indexed fields.
● Text Indexes enable sophisticated full-text search.
● Geospatial Indexes (2d, 2dsphere) support location-based queries (e.g., finding
points within a radius).
● TTL Indexes are used to automatically expire and remove old data.
Scaling and Sharding: MongoDB's sharding allows data to be partitioned by a shard key
across multiple servers, enabling horizontal scaling of both storage and throughput.
Choosing a good shard key (one that distributes data evenly and is commonly queried) is
vital.
In summary, MongoDB’s advanced features, combined with careful data modeling and
indexing, allow it to handle large-scale data with flexibility and high performance, serving as
a powerful, general-purpose NoSQL solution.
Table The top-level container for data. Very sparse, many rows might
not have data for all columns.
Row Key The primary identifier for rows, sorted All data access is primarily via
lexicographically. the row key. Row key design
is crucial for efficiency.
Basic Operations:
● Put: Insert or update a value, identified by row key, family, and qualifier. Puts land in
an in-memory structure (MemStore) and are logged to the WAL on HDFS for
durability. HBase treats a Put as an upsert.
● Get: Retrieve a value by a single row key (and optionally specific family/qualifier).
HBase excels at fast point lookups by row key.
● Scan: Iterate over a range of rows in sorted order (by row key). Used for sequential
reads or range queries. Scans can apply server-side filters.
● Delete: Marks a cell or row as deleted (writes a tombstone marker) to be cleaned up
later on compaction.
Use Cases and Strengths: HBase is great for time-series data and environments needing
fast random reads/writes to huge datasets (billions of rows) where you can model access
patterns around the row key. It provides strong consistency (ACID for single-row updates)
and is tightly integrated with Hadoop for analytics via MapReduce/Spark.
● Row Key: The primary index; determines storage order. Key design is essential to
distribute load and ensure related data is near each other for efficient scanning.
● Column Family & Qualifiers: Families are physical groupings on disk, allowing
efficient retrieval of subsets of a row's data. Qualifiers within a family are dynamically
defined.
● Versioning: Timestamps allow multiple versions per cell. Queries default to the latest
version, but older ones can be retrieved.
● Put:
○ Creates a new data version or overwrites the latest (upsert).
○ Data is written to the MemStore (in-memory) and synchronously to the WAL
(Write-Ahead Log on HDFS) for crash recovery and durability.
○ MemStore is eventually flushed to an HFile (on HDFS) when full.
● Get:
○ A random read operation (point lookup).
○ The region server first checks the MemStore and then all relevant HFiles for
the row key, merging the results to return the latest consistent value.
○ Very fast due to B+Tree-like indexes on keys in HFiles.
● Scan:
○ A sequential read that iterates across rows based on a range (start and
optional stop row key).
○ Used for range queries or full table analysis.
○ Efficient for sequential access because rows are physically sorted by key.
● HBase guarantees that a single Put or Delete operation that affects multiple
columns/families in a row is atomic at the row level.
● HBase is a single-row ACID system (consistent and partition-tolerant, often
sacrificing availability on partition). There is no multi-row or multi-table transaction
support built in.
● HBase provides atomic operations like increment and checkAndPut for conditional
updates, which is important for counters and other concurrency patterns.
Integration: HBase is often accessed through its native API, or via higher-level frameworks
like Apache Phoenix (a SQL layer on HBase), Hive, or directly with MapReduce/Spark
connectors for large-scale analytics.
Data Partitioning:
● Data is partitioned across nodes using a partitioner (a hash of the primary key).
● Data is stored in token ranges on a conceptual hash ring. Each node is responsible
for one or more ranges, ensuring even distribution of data.
Replication:
Coordinator Node: The node that first receives a client request acts as the coordinator. It
forwards the request to the appropriate replica nodes.
In summary, Cassandra’s architecture is ideal for high write throughput, linear scalability, and
high availability across data centers, allowing developers to balance strong consistency with
availability/latency based on the application's needs.
1. Identify Query Patterns: Determine all necessary application queries (e.g., "Get
messages by user ID and time").
2. Define Primary Keys: Design the primary key to directly support the query:
○ Partition Key: Must be included in the WHERE clause and should group all
the data that will be queried together onto the same partition (node).
○ Clustering Columns: Define the sort order within the partition. Use DESC or
ASC to match required query ordering (e.g., newest first).
3. One Query, One Table: It is common to have multiple tables storing similar data for
different access patterns. This denormalization avoids the need for joins.
4. Avoid Large Partitions: Partitions that become too large (hotspots) can slow down
the node. Strategies like "bucketing" (adding a date or hash to the partition key) can
help spread the load.
Optimizing Queries:
● Primary Key Design: The best optimization is designing the primary key to match
the query's filter and sort needs.
● Clustering Order: Using CLUSTERING ORDER BY to sort data on disk prevents
expensive in-memory sorting at query time.
● Materialized Views: Used to automatically maintain a second table (a "view") with a
different primary key, automating the data duplication for a second query pattern (but
with eventual consistency trade-offs).
● Batching: The BATCH operation is for grouping small, related writes (e.g., to
denormalized tables) that need all-or-nothing semantics; it's not for bulk data loading.
Example: To query user posts by time, you would use a table like CREATE TABLE
posts_by_user (user_id text, post_time timeuuid, content text, PRIMARY KEY((user_id),
post_time DESC)). A query on user_id will quickly retrieve the data, sorted by time.
In summary, Cassandra data modeling is about trading disk space for speed and scalability
by designing tables around your application’s access patterns. Queries that are
well-modeled (using the primary key) are extremely fast, while poorly modeled queries will
be slow or rejected.
Architecture:
● Neo4j's core strength is index-free adjacency: each node stores direct pointers to its
adjacent relationships.
● Traversal is extremely fast because it is a simple pointer-chasing operation rather
than an index lookup or join computation.
● It uses indexes only for finding starting nodes (e.g., finding the Person node with
name='Alice').
● It is ACID compliant for single-server or clustered transactions.
Cypher Query Language:
1. Identify entities and assign them node labels (:Person, :Product).
2. Identify connections and assign them relationship types (:WORKS_FOR,
:LIVES_IN).
3. Add properties (attributes) to nodes and relationships.
4. Complex relationships with their own attributes can be modeled using relationship
properties.
Neo4j Use Cases: Social Networks, Recommendation Engines, Knowledge Graphs, Fraud
Detection (finding suspicious connection rings), and Network/IT operations.
Performance Consideration: Graph databases perform best when the query is focused on
traversing (exploring the neighborhood of a few nodes). Query cost is generally proportional
to the number of nodes/edges traversed, not the total size of the graph, making deep,
localized queries extremely fast.
In summary, Neo4j provides a fundamentally different and powerful way to model and query
data where the relationships between entities are as important as the entities
themselves.