Unit – III
Data storage and manipulation
Introduction to NoSQL
A NoSQL database is a non-relational database that handles
unstructured, semi-structured, and structured data. Unlike
traditional SQL databases, NoSQL databases don't rely on fixed
schemas, making them highly flexible and adaptable. "NoSQL" stands
for "Not Only SQL," indicating their ability to support a variety of
data models beyond just relational data. These databases are optimized
for scalability, performance, and high availability.
NoSQL databases are often used for big data applications, real-time
analytics, and handling large volumes of data that may change over
time. They allow developers to store data without predefined schemas,
providing greater flexibility in handling diverse datasets.
Additionally, NoSQL databases typically scale horizontally by adding
more servers, enabling them to manage increasing data or traffic.
There are several types of NoSQL databases, including document-based,
key-value, column-family, and graph databases, each suited to specific
use cases. While NoSQL databases offer fast read and write operations,
many relax ACID guarantees in favor of eventual consistency. They are
popular in applications like social media, IoT, and e-commerce, where
rapid changes and large datasets are common. However, developers must
carefully consider their use case, as NoSQL databases may require more
complex management and often offer more limited querying capabilities
than traditional relational databases.
SQL vs NoSQL:
SQL databases are best for structured data with complex relationships
and for workloads that require strong consistency. In contrast, NoSQL
databases are ideal for large, distributed datasets with varying
structures and a need for horizontal scalability.
The table lists a few differences between the SQL and NoSQL Databases:
| Feature | SQL | NoSQL |
|---|---|---|
| Data Model | Relational (tables with rows and columns) | Non-relational (document, key-value, column, graph) |
| Schema | Fixed schema (must define schema beforehand) | Schema-less (data can be stored without predefined schema) |
| Scalability | Vertical scaling (requires more powerful hardware) | Horizontal scaling (adding more servers to distribute data) |
| ACID Compliance | ACID (Atomicity, Consistency, Isolation, Durability) | Many do not fully support ACID. Focuses on eventual consistency |
| Query Language | SQL (Structured Query Language) | Varies (e.g., MongoDB uses its own query language) |
| Data Integrity | Strong consistency and data integrity | Eventual consistency (may allow some inconsistencies temporarily) |
| Transaction Support | Full support for transactions (e.g., multi-statement ACID transactions) | Limited or no support for complex transactions |
| Use Case | Suitable for structured data with complex relationships (e.g., financial systems) | Best for flexible, large-scale or distributed data (e.g., social media, real-time analytics) |
| Example Databases | MySQL, PostgreSQL, Oracle, SQL Server | MongoDB, Cassandra, Redis, CouchDB, Neo4j |
| Performance | Can be slower for large-scale applications with complex queries | Often faster for big data, real-time analytics, and applications with high throughput |
| Flexibility | Less flexible due to rigid schema and table structures | Highly flexible with the ability to store varied data formats (e.g., JSON, key-value pairs) |
| Joins | Supports complex joins between tables | Limited or no support for joins. Data is often denormalized. |
| Normalization | Data is normalized to reduce redundancy | Data is often denormalized to improve performance. |
| Data Redundancy | Minimal redundancy due to normalization | May allow more redundancy to increase performance and scalability |
| Data Storage | Stores data in tables with rows and columns | Stores data in documents, key-value pairs, columns, or graphs |
| Consistency Model | Strong consistency; transactions are reliable. | Eventually consistent, focusing on availability and partition tolerance (CAP Theorem) |
| Scalability Model | Typically scales vertically (upgrading a single machine) | Scales horizontally (adding more servers or nodes) |
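The Schema row of the comparison can be made concrete with a small sketch. It uses Python's built-in sqlite3 module for the SQL side and a plain dictionary as a stand-in for a document collection; the table name, field names, and sample values are assumptions made up for illustration, not part of any particular product.

```python
import sqlite3

# SQL side: the schema is declared up front and enforced on every write.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Asha"))

try:
    # 'nickname' was never declared, so the database rejects the write.
    conn.execute(
        "INSERT INTO users (id, name, nickname) VALUES (?, ?, ?)",
        (2, "Ravi", "rv"),
    )
    schema_error = None
except sqlite3.OperationalError as exc:
    schema_error = str(exc)

# NoSQL side (document style): each record carries its own fields.
# A plain dict stands in for a document collection here (an assumption).
user_docs = {
    1: {"name": "Asha"},
    2: {"name": "Ravi", "nickname": "rv", "tags": ["admin"]},
}

print("SQL insert failed:", schema_error)
print("Document fields:", sorted(user_docs[2]))
```

The relational table would need an explicit schema change before it could accept the new column, whereas the second document simply carries extra fields.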
Migrating from SQL to NoSQL database:
Migrating from an SQL to a NoSQL database involves several necessary
steps and considerations:
1. Assess the Need: Evaluate if your application requires more
flexibility and scalability or handles large volumes of
unstructured data, which NoSQL is better suited for.
2. Choose the Right NoSQL Database: Select from various NoSQL types
such as document-based (MongoDB), key-value (Redis),
column-family (Cassandra), or graph databases (Neo4j), depending
on your use case.
3. Analyze the Data Model: SQL uses a structured schema with tables,
while NoSQL is more flexible. Data may need to be denormalized
for NoSQL, and relationships (joins) will be handled
differently.
4. Data Migration: Extract data from SQL, transform it as needed,
and load it into NoSQL. This may involve using ETL tools to
automate the process.
5. Modify Application Code: Replace SQL queries with NoSQL-specific
queries and APIs. Adjust for the NoSQL database's structure and
eventual consistency model.
6. Performance Optimization: Implement indexing, sharding, and
caching to optimize performance in the NoSQL environment.
7. Testing and Monitoring: Thoroughly test the migration, monitor
the database’s performance, and adjust as needed for
scalability.
8. Backup and Future Planning: Ensure you have a solid backup
strategy and regularly review the NoSQL database’s performance
to adapt to future needs.
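Steps 3 and 4 above (denormalizing the data model, then extract, transform, and load) can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical two-table source schema (customers and orders) held in an in-memory SQLite database; the load step only prints JSON, where a real migration would call the target database's client library.

```python
import json
import sqlite3

# Extract: a small normalized SQL source (hypothetical schema and data).
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         item TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 'keyboard', 45.0),
                              (11, 1, 'mouse', 15.5),
                              (12, 2, 'monitor', 180.0);
""")


def build_customer_documents(conn):
    """Transform: fold each customer's orders into one nested document."""
    documents = {}
    for cust_id, name in conn.execute(
            "SELECT id, name FROM customers ORDER BY id"):
        documents[cust_id] = {"_id": cust_id, "name": name, "orders": []}
    for order_id, cust_id, item, amount in conn.execute(
            "SELECT id, customer_id, item, amount FROM orders ORDER BY id"):
        documents[cust_id]["orders"].append(
            {"order_id": order_id, "item": item, "amount": amount})
    return list(documents.values())


# Load: printing JSON stands in for writing to the NoSQL target.
customer_docs = build_customer_documents(src)
for doc in customer_docs:
    print(json.dumps(doc))
```

Embedding the orders inside each customer document replaces the join that the relational schema would need at query time, at the cost of some duplication if orders must also be looked up on their own.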
Different Types of NoSQL Databases:
NoSQL databases are categorized based on their data models, and each
type is suited for specific use cases. Here are the four main types
of NoSQL databases:
1. Document-based NoSQL Databases
• Data Model: Stores data as documents (usually JSON, BSON, or XML
format). Each document contains key-value pairs and can be
nested.
• Use Case: Ideal for semi-structured data, like user profiles,
product catalogs, and content management systems.
• Example Databases:
o MongoDB
o CouchDB
o Couchbase
2. Key-Value Store NoSQL Databases
• Data Model: Stores data as pairs of keys and values. Each key
is unique, and the value can be any data type (e.g., string,
number, object).
• Use Case: Best for applications that require fast access to data
using a unique key, such as caching or session storage.
• Example Databases:
o Redis
o DynamoDB
o Riak
3. Column-family Store NoSQL Databases
• Data Model: Stores data in rows identified by a row key, with
columns grouped into column families. Columns in the same family
are stored together, which makes reads and writes that touch
related columns efficient.
• Use Case: Suitable for applications that require quick access
to large amounts of data or perform analytical queries on
specific columns (e.g., time-series data, event logging).
• Example Databases:
o Cassandra
o HBase
o ScyllaDB
4. Graph-based NoSQL Databases
• Data Model: Uses graph structures consisting of nodes, edges,
and properties to represent and store data. This is ideal for
managing relationships between entities.
• Use Case: Best for applications like social networks,
recommendation engines, and fraud detection, where relationships
between entities are essential.
• Example Databases:
o Neo4j
o ArangoDB
o OrientDB
Each type of NoSQL database is tailored to specific needs, offering
unique advantages in scalability, performance, and data modeling.
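To compare the four data models side by side, the sketch below represents each one with ordinary Python structures. It is only an illustration of the shape of the data; the keys, names, and values are invented, and real databases add indexing, persistence, and their own query interfaces on top.

```python
# 1. Document: one nested, self-describing record per entity.
document = {
    "_id": "u1",
    "name": "Asha",
    "address": {"city": "Pune"},
    "interests": ["chess", "cycling"],
}

# 2. Key-value: a value looked up only by its unique key.
key_value = {"session:u1": "token-abc123", "cart:u1": ["sku-1", "sku-7"]}

# 3. Column-family: row key -> column family -> column -> value.
column_family = {
    "u1": {
        "profile": {"name": "Asha", "city": "Pune"},
        "activity": {"2024-01-01": "login", "2024-01-02": "purchase"},
    }
}

# 4. Graph: nodes plus edges that carry a relationship type.
nodes = {"u1": {"name": "Asha"}, "u2": {"name": "Ravi"}, "u3": {"name": "Meera"}}
edges = [("u1", "FOLLOWS", "u2"), ("u2", "FOLLOWS", "u3")]


def followed_by(user_id):
    """Return the ids of users that user_id follows (one-hop traversal)."""
    return [dst for src, rel, dst in edges if src == user_id and rel == "FOLLOWS"]


print(document["address"]["city"])
print(key_value["session:u1"])
print(column_family["u1"]["activity"]["2024-01-02"])
print(followed_by("u1"))
```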
CAP Theorem
The CAP Theorem is a fundamental principle in distributed systems,
and it was proposed by computer scientist Eric Brewer in 2000. It
describes the trade-offs between three key properties in a distributed
database system: Consistency, Availability, and Partition Tolerance.
The theorem states that a distributed system can guarantee at most two
of these three properties at the same time.
The three properties of the CAP theorem are:
1. Consistency
• Definition: Every read request in the system returns the most
recent write. This means that once data is written to the system,
all subsequent reads will reflect that data, no matter which
node the request is directed to.
• Example: In a consistent system, if you update a user's profile
information, every subsequent read (from any part of the system)
will reflect the updated information immediately.
2. Availability
• Definition: Every request (read or write) will receive a
response, regardless of whether the data is up to date. This
means the system remains operational and returns a response even
if some nodes are down.
• Example: In an available system, if a user tries to retrieve
data, the system will still return data, even if it might not
be the latest version or some replicas of the data are
unavailable.
3. Partition Tolerance
• Definition: The system will continue to operate correctly even
if network partitions (communication breakdowns) prevent some
nodes from communicating with each other. A partitioned system
can still perform reads and writes, even if some nodes are
temporarily disconnected from the rest of the system.
• Example: If a network partition occurs between two regions, the
system can still function in both areas, allowing reads and
writes despite the lack of communication between nodes.
The Trade-off (According to CAP Theorem)
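The CAP theorem states that a distributed system can guarantee at most
two of these three properties at any given time. This means a system
must sacrifice one of the properties depending on the use case and
the design priorities. Because network partitions cannot be ruled out
in a real distributed system, the practical choice is usually between
consistency and availability while a partition lasts.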
• Consistency + Availability (CA): The system guarantees that all
nodes return the same data (consistency), and every request will
return a response (availability). However, if a network
partition occurs, the system may not function properly because
it sacrifices Partition Tolerance.
o Example: A traditional relational database running on a
single server or a tightly coupled cluster, where all nodes
must stay in sync and the system can fail when a partition
occurs.
• Consistency + Partition Tolerance (CP): The system guarantees
that all nodes will have the most recent data (consistency) and
will continue to work even if a network partition occurs.
However, if there’s a partition, some requests might fail,
sacrificing Availability.
o Example: ZooKeeper is a CP system that prioritizes
consistent data even during a partition. However, it might
refuse to process requests during a partition to maintain
consistency.
• Availability + Partition Tolerance (AP): The system guarantees
that every request gets a response (availability), and it
continues to work even in the case of network partitioning.
However, it might return stale data or inconsistent results
because Consistency is sacrificed.
o Example: A system like Cassandra, where even if some nodes
are partitioned, the system remains available and
responsive, but you might get outdated or inconsistent
data.
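The difference between the CP and AP choices can be sketched with a toy two-replica store. This is a deliberately simplified model, not how any real database is implemented: the class names, the "reachable" flag, and the rule that a CP write needs every replica are assumptions for illustration (real CP systems typically require a majority quorum instead).

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = "v1"
        self.reachable = True  # False simulates being cut off by a partition


class Cluster:
    """Toy two-replica store; mode is 'CP' or 'AP'."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica("a"), Replica("b")]

    def write(self, value):
        up = [r for r in self.replicas if r.reachable]
        if self.mode == "CP" and len(up) < len(self.replicas):
            # CP: refuse the write rather than let replicas diverge.
            raise RuntimeError("unavailable: cannot reach all replicas")
        for r in up:
            # AP (or a healthy CP cluster): apply the write where possible.
            r.value = value

    def read(self, replica_index):
        # Each side of the partition answers from its own local copy.
        return self.replicas[replica_index].value


cp, ap = Cluster("CP"), Cluster("AP")
for cluster in (cp, ap):
    cluster.replicas[1].reachable = False  # the partition isolates replica b

try:
    cp.write("v2")
    cp_result = "accepted"
except RuntimeError as exc:
    cp_result = str(exc)

ap.write("v2")
print("CP write:", cp_result)
print("AP reads:", ap.read(0), ap.read(1))
```

The CP cluster rejects the write, so both replicas still agree on the old value, while the AP cluster accepts it and a client reading from the isolated replica sees stale data until the partition heals.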
Examples of CAP Trade-offs in Real Systems
• CA (Consistency + Availability): A relational database running on
a single server (for example, a standalone MySQL or PostgreSQL
instance) is usually described as CA, because there is no network
between nodes to partition. Google Spanner is sometimes described
as effectively CA because of its very high availability, although
it is technically a CP system.
• CP (Consistency + Partition Tolerance): HBase and ZooKeeper are
examples of systems prioritizing consistency and partition
tolerance. In the event of network partitions, these systems may
refuse to serve some requests to maintain data consistency.
• AP (Availability + Partition Tolerance): Cassandra, Couchbase,
and Riak are examples of databases prioritizing availability and
partition tolerance. They remain operational even during network
partitions but might sometimes serve outdated or inconsistent
data.
Beyond the CAP Theorem: BASE vs. ACID
While the CAP theorem focuses on the trade-offs in distributed
systems, databases that prioritize Availability and Partition
Tolerance (AP) often use the BASE (Basically Available, Soft state,
Eventually consistent) model as an alternative to the ACID
(Atomicity, Consistency, Isolation, Durability) properties of
traditional relational databases.
• BASE: Allows for temporary inconsistencies but guarantees that,
over time, the system will become consistent (eventual
consistency).
• ACID: Ensures strong consistency and reliability but may not
scale as efficiently as BASE systems in distributed
environments.
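Eventual consistency can be illustrated with two replicas that accept writes independently and later exchange state. The sketch below assumes a last-write-wins rule based on a caller-supplied timestamp, which is one common conflict-resolution strategy among several; the class and method names are hypothetical.

```python
class LwwReplica:
    """Stores key -> (timestamp, value); the higher timestamp wins on merge."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        """Anti-entropy step: pull every entry from another replica."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


r1, r2 = LwwReplica(), LwwReplica()
r1.write("status", "online", timestamp=1)
r2.write("status", "away", timestamp=2)  # a later write lands on the other replica

before = (r1.read("status"), r2.read("status"))  # replicas disagree (soft state)
r1.sync_from(r2)
r2.sync_from(r1)
after = (r1.read("status"), r2.read("status"))   # replicas have converged

print("before sync:", before)
print("after sync: ", after)
```

Between the two writes and the sync, a reader may see either value depending on which replica answers; once the replicas have exchanged state, both return the most recent write.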
Sharding:
Sharding is a database partitioning technique used to horizontally
scale a database by distributing data across multiple servers (or
nodes). Sharding aims to handle large datasets, ensure high
availability, and improve system performance by distributing the load
and increasing capacity. Sharding helps databases scale out rather
than scaling up (which involves upgrading a single server). It is
beneficial in systems where data grows too large for a single server
to handle efficiently.
How Sharding Works
In sharding, the data in a database is split into smaller chunks,
known as shards, which are distributed across multiple servers (or
nodes). Each shard is a subset of the entire dataset, and each server
holds one or more subsets. These subsets are typically divided by a
shard key, a specific attribute of the data used to determine how the
data is split.
Types of Sharding
1. Horizontal Sharding (Data Partitioning):
o Definition: In horizontal sharding, the rows of a database
table are divided into smaller chunks, and each chunk is
stored on a different server or node. This allows data to
be distributed across multiple machines, improving
performance and scalability.
2. Vertical Sharding:
o Definition: In vertical sharding, different columns of a
table are stored on different servers or nodes. For
example, in a Users table, one server may store the columns
UserID, Name, and Email, while another stores PhoneNumber
and Address.
3. Directory-Based Sharding:
o Definition: In directory-based sharding, a lookup table
(or directory) is used to track where each piece of data
is stored. The directory contains the shard key and the
location of the data, so the system knows which shard to
query for specific data.
4. Range-based Sharding:
o Definition: In range-based sharding, data is split into
ranges based on the shard key. For example, if the shard
key is a UserID, the data might be divided into shards
containing ranges of UserID values (e.g., 1-1000,
1001-2000, etc.).
5. Hash-based Sharding:
o Definition: In hash-based sharding, the shard key is passed
through a hash function, and the resulting hash value
determines which shard the data will belong to. This
approach helps to distribute data evenly across shards.
6. Composite Sharding:
o Definition: Composite sharding uses more than one attribute
to determine how data is distributed across shards. It
combines multiple keys or fields in a compound sharding
strategy.
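The routing logic behind several of these strategies can be sketched in a few lines. The shard names, range boundaries, and directory entries below are invented for illustration; production systems usually keep this mapping in cluster metadata rather than in application code.

```python
import bisect
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]


def range_shard(user_id, upper_bounds=(1000, 2000)):
    """Range-based: 1-1000 -> shard-0, 1001-2000 -> shard-1, rest -> shard-2."""
    return SHARDS[bisect.bisect_left(upper_bounds, user_id)]


def hash_shard(key):
    """Hash-based: a stable hash of the key, reduced modulo the shard count."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# Directory-based: an explicit lookup table from key to shard.
directory = {"tenant-acme": "shard-2", "tenant-globex": "shard-0"}


def directory_shard(key):
    return directory[key]


def composite_shard(region, user_id):
    """Composite: combine two attributes into one key before hashing."""
    return hash_shard(f"{region}:{user_id}")


print(range_shard(1000), range_shard(1001), range_shard(2500))
print(hash_shard(42))
print(directory_shard("tenant-acme"))
print(composite_shard("eu", 42))
```

Range-based routing keeps neighboring keys together, which helps range scans but can create hot shards when keys arrive in order. Hashing spreads keys more evenly at the cost of scattering ranges across shards.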
Advantages of Sharding
Scalability:
Sharding allows databases to scale horizontally by adding more servers
to handle growing data and traffic. This enables systems to manage
massive amounts of data and large concurrent requests.
Improved Performance:
By distributing data across multiple servers, sharding reduces the
load on any single server, leading to faster query responses and
improved performance. Each server handles only a subset of the
data, which keeps individual operations fast.
Fault Tolerance:
Sharding limits the impact of a failure by spreading data across
multiple servers. If one shard becomes unavailable, the other shards
can still serve requests for their own data. The data on the failed
shard stays unavailable unless that shard is also replicated.
Load Balancing:
Sharding helps distribute the workload across multiple servers,
preventing one server from becoming a bottleneck. This ensures better
performance even with high traffic volumes.
Examples of Sharded Databases
1. MongoDB: A popular NoSQL database that supports sharding. It
allows data to be partitioned across multiple servers, improving
scalability and performance.
2. Cassandra: A highly scalable NoSQL database that uses a
decentralized approach to sharding and distributing data across
multiple nodes in a cluster.
3. Elasticsearch: A search engine that uses sharding to distribute
data and queries across a cluster of nodes, providing fast search
results.
Sharding is a powerful technique to scale a database horizontally by
distributing data across multiple servers. While it offers scalability
and improved performance, it also introduces challenges like
complexity, managing cross-shard queries, and ensuring consistency.
Choosing the right shard key, managing rebalancing, and handling
complex queries are critical to making sharding work effectively.
Sharding is commonly used in large-scale applications like social
media platforms, e-commerce websites, and cloud-based services where
the volume of data grows rapidly.
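The rebalancing challenge mentioned above can be demonstrated with a short experiment. The sketch assumes simple "hash modulo shard count" placement and integer keys, both chosen only for illustration, and counts how many keys land on a different shard when a fifth shard is added to a four-shard cluster.

```python
import hashlib


def shard_for(key, shard_count):
    """Place a key by taking a stable hash modulo the number of shards."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


keys = range(10_000)
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
moved_fraction = moved / len(keys)

print(f"{moved_fraction:.0%} of keys change shard when growing from 4 to 5 shards")
```

With plain modulo placement, most keys move whenever the shard count changes, which is why many systems use techniques such as consistent hashing or fixed-size chunks that can be reassigned individually.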