UNIT II
NOSQL Data Management
• NoSQL is a type of database management
system (DBMS) that is designed to handle
and store large volumes of unstructured and
semi-structured data.
• Unlike traditional relational databases that
use tables with pre-defined schemas to store
data, NoSQL databases use flexible data
models that can adapt to changes in data
structures and are capable of scaling
horizontally to handle growing amounts of
data.
• Horizontal scaling (scaling out) involves adding
more machines to a system to handle
increased workload, while vertical scaling
(scaling up) involves increasing the resources
(CPU, RAM, etc.) of a single machine.
• Horizontal scaling offers greater flexibility and
fault tolerance, while vertical scaling can be
simpler to implement for some applications.
• NoSQL databases are generally classified into
four main categories:
Document databases
Key-value stores
Column-family stores
Graph databases
Document databases
• Store data in JSON, BSON or XML format.
• Data are stored as documents that can contain
varying attributes.
• Examples: MongoDB, Cloudant
Key-value stores
• Data is stored as key-value pairs, making retrieval
extremely fast.
• Optimized for caching and session storage.
• Examples: Redis, Memcached, Amazon
DynamoDB
Column-family stores
• Data are stored in columns rather than rows,
enabling high-speed analytics and distributed
computations.
• Efficient for handling large-scale data with high
write/read demands.
• Examples: Apache Cassandra, HBase, Google
Bigtable
Graph databases
• Data are stored as nodes and edges, enabling
complex relationship management.
• Best suited for social networks, fraud detection,
and recommendation engines.
• Examples: Neo4j, Amazon Neptune, ArangoDB
Key Features of NoSQL :
• Dynamic schema
• Horizontal scalability
• Document-based
• Key-value-based
• Column-based
• Distributed and high availability - They are
designed to be highly available and to
automatically handle node failures and data
replication across multiple nodes in a database
cluster
• Flexibility - Allow developers to store and
retrieve data in a flexible and dynamic manner
• Performance
Advantages of NoSQL
• High scalability
• Flexibility
• High availability
• Performance
• Cost-effectiveness - deployed on commodity
hardware
• Agility - quickly and effectively extract value
and translate that value into actionable
insights and business outcomes.
Commodity hardware
• readily available, off-the-shelf computer
components that are widely used and
relatively inexpensive
Disadvantages of NoSQL
• Lack of standardization
• Lack of ACID compliance
• Narrow focus – better storage and lack in TM
• Lack of support for complex queries
• Lack of maturity – reliability and security
• Management challenge –Maintaining NOSQL
• GUI is not available
• Backup
• Large document size
When should NoSQL be used
• When a huge amount of data needs to be stored
and retrieved.
• The relationship between the data you store is
not that important
• The data changes over time and is not structured.
• Support of Constraints and Joins is not required
at the database level
• The data is growing continuously and you need to
scale the database regularly to handle the data.
NoSQL (Non-Relational
Feature SQL (Relational DB)
DB)
Flexible (Documents,
Data Model Structured, Tabular
Key-Value, Graphs)
Scalability Vertical Scaling Horizontal Scaling
Schema Predefined Dynamic & Schema-less
Limited or Eventual
ACID Support Strong
Consistency
Transactional Big data, real-time
Best For
applications analytics
MySQL, PostgreSQL, MongoDB, Cassandra,
Examples
Oracle Redis
Aggregate Data Models
• We put into four categories widely used in
the NoSQL ecosystem: key-value, document,
column-family, and graph. Of these, the first
three share a common characteristic of their
data models which we will call aggregate
orientation.
• Aggregate means collection of object that are
treated as a single unit.
• Aggregate is a term that comes from Domain-
Driven Design .
• Domain-Driven Design (DDD) is a software
development approach that emphasizes the
importance of understanding and modeling the
business domain
• In Domain-Driven Design aggregate is a
collection of related objects that are treat as a
unit.
• It doesn’t support ACID Property
• With the help of ADM, OLAP operation can be
easily performed.
Eg – Assume building a E-commerce website.
Data Model for relational Database
An aggregate data model
• The diamond shows how data fit into the
aggregate structure
• Customer contains a list of billing address
• Payment also contains the billing address
• It provides fast performance and horizontal
Scalability.
• Limited query capabilities.
• Doesn’t work well with relational data.
• When the value of data increases it is difficult
to maintain unique values.
Column-Family Stores
Graph Database
• A graph database (GDB) is a database that
uses graph structures for storing data.
• It uses nodes, edges, and properties .
• The edges represent relationships between
the nodes.
• The data is stored in the nodes of the graph
and the relationship between the data are
represented by the edges between the
nodes.
Advantages
• It solves Many-To-Many relationship problems
• When relationships between data elements are
more important
• Low latency with large scale data
Disadvantages
• Graph Databases may not be offering better
choice over the NoSQL variations.
• If application needs to scale horizontally this
may introduces poor performance.
• Not very efficient when it needs to update all
nodes with a given parameter.
Materialized view
• A materialized view is a database object that
stores the results of a query as a physical
table.
• SELECT c.customer_id,
SUM(o.order_total) as lifetime_value
FROM Customers c
JOIN Orders o ON c.customer_id =
o.customer_id
GROUP BY c.customer_id;
• CREATE MATERIALIZED VIEW
customer_lifetime_value
AS
SELECT c.customer_id,
SUM(o.order_total) as lifetime_value
FROM Customers c
JOIN Orders o ON c.customer_id =
o.customer_id
GROUP BY c.customer_id;
• SELECT * FROM customer_lifetime_value;
• Advantages:
Improve performance
Increase the speed of the queries
Efficient
• Disadvantages:
Not every DB type support
Read only
Cannot create key,constraint,triggers.
Distribution Models
• NoSQL : Data distributed over large cluster
• Data distribution model – Single and Multiple
server
• Orthogonal data distribution – Sharding and
Replication
• Advantages – handle larger quantity of data,
to process a greater read and write traffic
• Disadvantages – Cost and Complexity.
Single Server
• The first and the simplest distribution option
is — no distribution at all.
• Run the database on a single machine that
handles all the reads and writes to the data
store. We prefer this option because it
eliminates all the complexities.
• it’s easy for operations people to manage and
easy for application developers.
• When to use?
Sharding
Master-Slave Replication
Peer-to-Peer Replication
• There are two styles of distributing data:
Sharding –
Distributes different data across
multiple servers, so each server acts as
the single source for a subset of data.
Replication –
Copies data across multiple servers, so
each bit of data can be found in multiple
• A system may use either or both techniques.
Replication comes in two forms:
Master-slave replication -
Makes one node the authoritative copy that
handles writes while slaves synchronize with
the master and may handle reads.
Peer-to-peer replication –
Allows writes to any node; the nodes
coordinate to synchronize their copies of the
data.
Master-slave replication reduces the chance
of update conflicts but peer-to-peer
replication avoids loading all writes onto a
single point of failure.
CASSANDRA
• Column Oriented DB
• Peer to Peer Architecture
• Distributed, High Performance,Scalable,Fault
tolerant and NoSQL DB
• It is created at Facebook.
• Flexible on cloud as well as On Premise system
• Does not separate layer on cashing
• Writes and read - tuneable level of consistency
• Deployed on commodity hardware
• Compress – Google snappy data compression
algorithm
• CQL
Cassandra disadvantages
• Not support ACID
• High throughput of write operation
• Not support TM , join
• Data is distributed across all nodes- if there is
failure then spread across all nodes
Cassandra Architecture
• Peer to Peer architecture
• Does not have a Single pt of Failure
• Looks like a ring – gossip protocol
• Data spread using hash value
• Commit log
• Seeds & Gossip
Rack and Data Center
Key Components
• Node
• Rack
• Data Center
• Cluster
• Commit log
• Memtable (In Memory Cache)
• SSTable(Sorted String Table)
• Bloom Filter - Probabilistic data structure
• CQL
CASSANDRA DATA MODEL
Cassandra Data Model
• Query Driven Approach
• Fast read and write
• Tables – Primary key, Column family,
Alphabets (_)
• Columns – Define Data Structure within a
table
Column
• A column is the basic data structure of
Cassandra with three values, namely key or
column name, value, and a time stamp,TTL.
Given below is the structure of a column.
Row
Keyspace
• Keyspace is the outermost container for data in
Cassandra.
• The basic attributes of a Keyspace in Cassandra are −
• Replication factor − It is the number of machines in
the cluster that will receive copies of the same data.
• Replica placement strategy − It is nothing but the
strategy to place replicas in the ring. We have
strategies such as simple strategy , old network
topology strategy and network topology strategy .
• Column families − Keyspace is a container for a list of
one or more column families.
• The syntax of creating a Keyspace is as follows
CREATE KEYSPACE Keyspace name WITH
replication = {'class': 'SimpleStrategy',
'replication_factor' : 3};
Column Family
• A column family is a container for an ordered
collection of rows. Each row, in turn, is an
ordered collection of columns.
• A schema in a relational model is fixed. Once we
define certain columns for a table, while
inserting data, in every row all the columns must
be filled at least with a null value.
• In Cassandra, although the column families are
defined, the columns are not. You can freely add
any column to any column family at any time.
A Cassandra column family has the following
aIributes −
• keys_cache − Holds column family locaJon
keys (2,00,000)
• rows_cache − when a data or set of columns
is being used frequently and considered as
hot data.
SuperColumn
• A super column is a special column,
therefore, it is also a key-value pair. But not
support CQL.
CQL Data Definition Commands
• CREATE KEYSPACE − Creates a KeySpace in Cassandra.
• USE − Connects to a created KeySpace.
• ALTER KEYSPACE − Changes the properJes of a KeySpace.
• DROP KEYSPACE − Removes a KeySpace
• CREATE TABLE − Creates a table in a KeySpace.
• ALTER TABLE − Modifies the column properJes of a table.
• DROP TABLE − Removes a table.
• TRUNCATE − Removes all the data from a table.
• CREATE INDEX − Defines a new index on a single column of
a table.
• DROP INDEX − Deletes a named index.
CQL Data Manipulation Commands
• INSERT − Adds columns for a row in a table.
• UPDATE − Updates a column of a row.
• DELETE − Deletes data from a table.
• BATCH − Executes mulJple DML statements
at once.
CQL Clauses
• SELECT − This clause reads data from a table
• WHERE − The where clause is used along with
select to read a specific data.
• ORDERBY − The orderby clause is used along
with select to read a specific data in a specific
order.
• CREATE (TABLE | COLUMNFAMILY)
<tablename> ('<column-definition>' ,
'<column-definition>') (WITH <option> AND
<option>)
• ALTER (TABLE | COLUMNFAMILY)
<tablename> <instruction>
• Using ALTER command, you can perform the
following operaJons −
Add a column
Drop a column
• cqlsh> USE ks;
• cqlsh:ks>; CREATE TABLE emp(
emp_id int PRIMARY KEY,
emp_name text,
emp_city text,
emp_sal varint,
emp_phone varint );
cqlsh:ks> select * from emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------
(0 rows)