Unit 2
Introduction to NoSQL, aggregate data models, aggregates, key-value and document data
models, relationships, graph databases, schema less databases, materialized views,
distribution models, sharding, master-slave replication, peer-to-peer replication, sharding and
replication, consistency, relaxing consistency, version stamps, working with
Cassandra, table creation, loading and reading data.
Introduction to NoSQL
NoSQL ("Not Only SQL") refers to databases built specifically for very large quantities of
unstructured and semi-structured data. Unlike conventional relational databases, where data is
organized into tables using predefined schemas, NoSQL databases allow flexible data models
and scale horizontally.
Key Features of NoSQL Databases
Dynamic schema: The shape of the data can change to meet new requirements without the
need to migrate or change schemas.
Horizontal scalability: They scale out by adding more nodes to the cluster, gaining storage
for larger datasets and capacity for higher traffic by distributing the load across multiple
servers.
Document-based: Data is stored in flexible, semi-structured formats like JSON/BSON
(e.g., MongoDB).
Key-value-based: Data is stored as pairs of keys and values, giving a simple but fast access
pattern (e.g., Redis).
Column-based: Data is organized by columns (column families) rather than rows (e.g., Cassandra).
Distributed and high availability: They are designed to be highly available and to
automatically handle node failures and data replication across multiple nodes in a database
cluster.
Flexibility: Developers can store and retrieve data in a flexible and dynamic manner,
with support for multiple data types and changing data structures.
Performance: Well suited to big data, real-time analytics, and high-volume applications.
Why Use NoSQL?
Unlike relational databases, which share Structured Query Language (SQL), NoSQL databases
have no universal query language; each NoSQL database has its own approach to querying.
Traditional relational databases follow ACID principles, ensuring strong consistency and
structured relationships between the data.
Application needs have changed over time. With growing requirements around big data,
real-time analytics, and distributed environments, NoSQL emerged to provide:
Scalability, where capacity is added horizontally by adding nodes instead of upgrading the
existing machine.
Flexibility in supporting unstructured or semi-structured data without a rigid schema.
Performance, with fast read/write operations optimized for large datasets.
A distributed architecture for building highly available, partition-tolerant systems.
Challenges of NoSQL Databases
Lack of standardization: NoSQL systems can be vastly different from one another,
making it harder to choose the right one for a specific use case.
Lack of ACID compliance: Many NoSQL databases do not provide full ACID guarantees and may
offer only eventual consistency, which is a disadvantage for applications that need strict
data integrity.
Narrow focus: They are designed mainly for storage and offer less functionality, such as
transaction management, than relational databases.
Absence of complex query support: Many are not designed to handle complex queries,
which makes them a poor fit for applications that require complex data analysis
or reporting.
Lack of maturity: Being relatively new, NoSQL systems may not have the reliability, security,
and feature set of traditional relational databases.
Management complexity: For large datasets, maintaining a NoSQL database can be
considerably more complicated than managing a relational database.
Limited GUI tools: While some NoSQL databases offer GUI tools (for example, MongoDB
Compass for MongoDB), not all NoSQL databases provide flexible or user-friendly GUI tools.
SQL vs. NoSQL: When to use What
Feature | SQL (Relational DB) | NoSQL (Non-Relational DB)
Data Model | Structured, Tabular | Flexible (Documents, Key-Value, Graphs)
Scalability | Vertical Scaling | Horizontal Scaling
Schema | Predefined | Dynamic & Schema-less
ACID Support | Strong | Limited or Eventual Consistency
Best For | Transactional applications | Big data, real-time analytics
Examples | MySQL, PostgreSQL, Oracle | MongoDB, Cassandra, Redis
Popular NoSQL Databases & Their Use Cases
NoSQL Database | Type | Use Cases
MongoDB | Document-based | Content management, product catalogs
Redis | Key-Value Store | Caching, real-time analytics, session storage
Cassandra | Column-Family Store | Big data, high availability systems
Neo4j | Graph Database | Fraud detection, social networks
NoSQL databases store data in formats other than the tables of a relational database.
They are now used in nearly every industry. For people who interact with data in
such databases, the aggregate data model helps to structure that interaction.
Features of NoSQL Databases:
• Schema Agnostic: NoSQL databases do not require a specific schema or storage
structure, unlike a traditional RDBMS.
• Scalability: NoSQL databases scale horizontally. As data grows rapidly, more
commodity hardware can be added to the cluster while preserving the scalability of the
system.
• Performance: To increase the performance of a NoSQL system, one can add further
commodity servers, giving reliable and fast access to the data with
minimum overhead.
• High Availability: A traditional RDBMS relies on primary and secondary nodes
for fetching data, whereas some NoSQL databases use a masterless architecture.
• Global Availability: Because data is replicated across multiple servers and clouds, it
can be served from a location close to the user, which minimizes latency.
Aggregate Data Models:
An aggregate is a collection of related objects that we treat as a unit, that is, a
collection of data that we interact with as one. These units of data, or aggregates, form the
boundaries for ACID operations.
Example of Aggregate Data Model:
The diagram has two aggregates:
Customer and Order; the link between them is a relationship between aggregates.
The diamond shows how data fits into the aggregate structure.
Customer contains a list of billing addresses.
Payment also contains a billing address.
The address appears three times and is copied each time rather than referenced.
This fits a domain where the shipping and billing addresses recorded on an order should not change afterwards.
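The copied-address design described above can be sketched with plain Python dictionaries. This is only an illustration: the field names and values are hypothetical and are not taken from the original diagram.

```python
import copy

# Hypothetical sample data; field names and values are assumptions.
address = {"city": "Chicago"}

# Customer aggregate: owns its own copy of the billing address.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_addresses": [copy.deepcopy(address)],
}

# Order aggregate: refers to the customer by ID only, and carries its own
# copies of the shipping address and the payment's billing address.
order = {
    "id": 99,
    "customer_id": 1,
    "shipping_address": copy.deepcopy(address),
    "payments": [{"card": "1000-1000", "billing_address": copy.deepcopy(address)}],
}

# The customer moves later: only the customer aggregate changes.
customer["billing_addresses"][0]["city"] = "Boston"

# The addresses recorded on the existing order are unaffected.
print(order["shipping_address"]["city"])
print(order["payments"][0]["billing_address"]["city"])
```

Because each aggregate holds its own copy, an order can be read or written as a single unit, at the cost of storing the address three times.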
Consequences of Aggregate Orientation:
Aggregation is not a logical property of the data; it is all about how the data is used by
applications.
An aggregate structure may help with some data interactions but be an obstacle for others.
Aggregate orientation has an important consequence for transactions.
It is often said that NoSQL databases don’t support ACID transactions and thus sacrifice consistency.
More precisely, aggregate-oriented databases support the atomic manipulation of a single aggregate at a time.
Advantage:
It can be used as a primary data source for online applications.
Easy replication.
No single point of failure.
It provides fast performance and horizontal scalability.
It can handle structured, semi-structured, and unstructured data with equal effort.
Disadvantage:
No standard rules.
Limited query capabilities.
Doesn't work well with relational data.
Not so popular in the enterprise.
When the volume of data increases it is difficult to maintain unique values.
Aggregate-Oriented Databases in NoSQL
An aggregate-oriented database is a NoSQL database that does not support ACID
transactions spanning multiple aggregates, and so gives up part of the ACID guarantees.
Aggregate-oriented operations differ from relational database operations. We can perform OLAP
operations on an aggregate-oriented database. Its efficiency is high when the
data transactions and interactions take place within the same aggregate. Several fields of data
can be put in one aggregate so that they can be accessed together. We can
manipulate only a single aggregate at a time in an atomic way; we cannot manipulate multiple
aggregates atomically.
NoSQL databases are commonly classified into four major data models, of which the first three
are aggregate-oriented:
Key-value
Document
Column family
Graph-based
Each of the data models above has its own approach to querying.
Key-value Data Model: Key-value and document databases are strongly aggregate-oriented.
The key-value data model uses a key or ID to access the data of an aggregate. The aggregate
is opaque to the database: it is just a big blob of mostly meaningless bits that can only be
retrieved through its key or ID. Because the database does not interpret the value, we can
place data of any structure and data type in it. The advantage of this opacity is that we can
store whatever we like in the aggregate. The disadvantages are that the database may impose a
general size limit, so the amount of data per aggregate is limited, and that an aggregate can
only be looked up by its key, not queried by its contents.
Document Data Model: In the document data model we can access parts of an aggregate. The
data in this model can be accessed in a flexible manner: we can submit queries to the database
based on the fields in the aggregate. In return, there are restrictions on the structure and
data types of the data that can be placed in this data model, because the database can see the
structure of the aggregate.
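The difference between the two models can be sketched in a few lines of Python. This is an in-memory toy, not the API of any real database; the helper names `kv_get` and `find` and the sample records are assumptions made for illustration.

```python
import json

# Key-value model: the store holds opaque bytes and can only look up by key.
kv_store = {}
kv_store["order:99"] = json.dumps({"customer": "Martin", "total": 32.45}).encode("utf-8")

def kv_get(key):
    """Return the raw value for a key; the store never looks inside it."""
    return kv_store.get(key)

# Document model: the store can see the fields, so it can query on them.
doc_store = [
    {"_id": 99, "customer": "Martin", "total": 32.45},
    {"_id": 100, "customer": "Pramod", "total": 12.00},
]

def find(collection, **criteria):
    """Return every document whose fields match all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

blob = kv_get("order:99")      # bytes; the application must decode them itself
order = json.loads(blob)
matches = find(doc_store, customer="Martin")
print(type(blob).__name__, order["customer"], [d["_id"] for d in matches])
```

In the key-value case the application decodes the blob after fetching it; in the document case the query on `customer` is answered by the store itself.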
Column-Family Data Model: The column-family model is best thought of as a two-level map.
However we think about the structure, this Bigtable-style data model influenced later
databases such as HBase and Cassandra, and databases that use it are often referred
to as column stores. Column-family models divide the aggregate into column families. The
column-family model is a two-level aggregate structure: the first-level key
acts as a row identifier that selects the aggregate, and the second-level values
are referred to as columns.
In the above example, the row key is 234 which selects the aggregate. Here the row key
selects the column families customer and orders. Each column family contains the columns of
data. In the orders column family, we have the orders placed by the customers.
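A minimal sketch of the two-level map, using nested Python dictionaries. The row key 234 and the column families customer and orders come from the example above; the column names and values are hypothetical.

```python
# row key -> column family -> column name -> value
rows = {
    "234": {
        "customer": {"name": "Martin", "billing_city": "Chicago"},
        "orders": {"order_99": "3 items, shipped", "order_100": "1 item, pending"},
    }
}

def get_family(row_key, family):
    """First level selects the aggregate (row); then pick one column family."""
    return rows[row_key][family]

def get_column(row_key, family, column):
    """Second level selects a single column inside the family."""
    return rows[row_key][family][column]

print(get_column("234", "customer", "name"))
print(sorted(get_family("234", "orders")))
```

Grouping columns into families lets a store read the customer columns without touching the orders columns of the same row.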
Graph Data Model: In a graph data model, the data is stored in nodes that are connected by
edges. This model is preferred for large amounts of complex, multidimensional data with many
interconnections. For example, we can store Facebook user accounts in the nodes and find the
friends of a particular user by following the edges of the graph.
We can find the friends of a person by inspecting this graph: if there is an edge
between two nodes, the two users are friends. Indirect links between nodes (friends of
friends) can also be used to generate friend suggestions.
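The friend lookup and friend-suggestion idea can be sketched with an adjacency structure in Python. The user names and friendships below are made up for illustration, and a real graph database would use its own storage and query language.

```python
# Hypothetical undirected friendships (edges between user nodes).
friendships = [("Alice", "Bob"), ("Bob", "Carol"), ("Carol", "Dave"), ("Alice", "Eve")]

# Build an adjacency map: node -> set of directly connected nodes.
graph = {}
for a, b in friendships:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def friends(user):
    """Direct friends: nodes one edge away."""
    return graph.get(user, set())

def suggestions(user):
    """Friends of friends who are not already friends (indirect links)."""
    direct = friends(user)
    candidates = set()
    for friend in direct:
        candidates |= friends(friend)
    return candidates - direct - {user}

print(sorted(friends("Alice")))
print(sorted(suggestions("Alice")))
```

Both functions only touch the neighbourhood of the starting node, so their cost depends on how many edges are followed rather than on the total number of users.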
Schema Design and Relationships in NoSQL Document-Based Databases
NoSQL databases are powerful alternatives to traditional relational databases,
offering flexibility, scalability, and performance. Among the various types of NoSQL
databases, document-based databases stand out for their ability to store and retrieve data in
flexible, schema-less documents.
Understanding Document-Based Databases
Unlike traditional relational databases, which organize data into tables with predefined
schemas, document-based databases store data in flexible, self-descriptive documents. These
documents, typically in JSON or BSON format, encapsulate information in key-value pairs or
nested structures, resembling the hierarchical nature of real-world objects.
This schema-less approach liberates developers from the constraints of fixed schemas, enabling
them to iteratively evolve data models in response to changing requirements.
Schema Design in NoSQL Document-Based Databases
In NoSQL document-based databases, schema design revolves
around denormalization and data embedding, wherein related information is encapsulated
within a single document to optimize data retrieval and minimize the need for complex joins.
Let's illustrate this with an example:
Consider a blogging platform where users can create posts and comment on them. In a
relational database, you might have separate tables for users, posts, and comments, linked
through foreign key relationships. However, in a document-based database like MongoDB, you
could represent this relationship by embedding comments within each post document:
{
  "_id": "post1",
  "title": "Introduction to NoSQL Databases",
  "content": "NoSQL databases offer flexibility and scalability...",
  "author": {
    "name": "John Doe",
    "email": "john@[Link]"
  },
  "comments": [
    {
      "user": "Alice",
      "comment": "Great article!"
    },
    {
      "user": "Bob",
      "comment": "Informative read."
    }
  ]
}
By embedding comments within the post document, we eliminate the need for separate comment
documents and complex join operations, thereby streamlining data access and improving
performance.
Managing Relationships
While denormalization simplifies data access, it also raises concerns about data
consistency and redundancy. In scenarios where data updates are frequent or where the
embedded data is shared across multiple documents, maintaining consistency becomes
paramount. Let's illustrate this with an example.
Suppose you have a social media platform where users can follow each other. In a document-
based database, you might represent the follower-followee relationship as follows:
{
  "_id": "user1",
  "name": "Alice",
  "followers": ["user2", "user3"]
}
{
  "_id": "user2",
  "name": "Bob",
  "followers": ["user1"]
}
Here, each user document maintains an array of follower IDs. This design facilitates quick
retrieval of a user's followers, but every follow or unfollow must keep these arrays consistent,
and embedding full follower documents instead of IDs would duplicate user data across
documents. To limit this, you might employ a reference model, where only user IDs
are stored rather than entire user documents:
{
  "_id": "user1",
  "name": "Alice",
  "followers": ["user2", "user3"]
}
{
  "_id": "user2",
  "name": "Bob"
}
In this revised design, each user document stores only the IDs of their followers, reducing
redundancy and simplifying updates. However, retrieving follower details now requires
additional queries.
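The cost of the reference model can be made concrete with a small Python sketch. It uses an in-memory dictionary in place of a real document database; `find_by_id` is a hypothetical stand-in for a lookup by `_id`, and the `user3` document is an assumption added so that every reference resolves.

```python
# In-memory stand-in for a users collection, keyed by _id.
users = {
    "user1": {"_id": "user1", "name": "Alice", "followers": ["user2", "user3"]},
    "user2": {"_id": "user2", "name": "Bob"},
    "user3": {"_id": "user3", "name": "Carol"},  # hypothetical, not in the text above
}

lookups = 0  # counts simulated round trips to the database

def find_by_id(user_id):
    """Hypothetical helper standing in for a query by _id."""
    global lookups
    lookups += 1
    return users[user_id]

def follower_names(user_id):
    """Resolve follower IDs to names: one query for the user, one per follower."""
    user = find_by_id(user_id)
    return [find_by_id(fid)["name"] for fid in user.get("followers", [])]

names = follower_names("user1")
print(names, lookups)
```

With embedding, the same information would come back in one read; with references, the number of lookups grows with the number of followers unless the database offers a batched lookup.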
Which Data Modeling Approach is Better?
The choice between normalization and denormalization depends on various factors, including
application requirements, query patterns, and scalability concerns. Normalization is preferable
for scenarios requiring strong data consistency, complex relationships, and efficient storage
utilization.
Denormalization shines in read-heavy workloads with frequent queries involving related
data, offering superior read performance and simplified data access.
Conclusion
NoSQL document-based databases offer unparalleled flexibility in schema design, empowering
developers to build scalable and adaptable applications. By embracing denormalization and
judiciously managing relationships, developers can harness the full potential of these databases
while ensuring data consistency and performance.
While the examples presented here offer insights into schema design and relationship
management, it's crucial to tailor these principles to the specific requirements of your
application. As the landscape of database technologies continues to evolve, mastering the
nuances of NoSQL document-based databases will be indispensable for building robust and
efficient systems.
Introduction to Graph Databases in NoSQL
A graph database is a type of NoSQL database that is designed to handle data with complex
relationships and interconnections. In a graph database, data is stored as nodes and edges, where
nodes represent entities and edges represent the relationships between those entities.
1. Graph databases are particularly well-suited for applications that require deep and complex
queries, such as social networks, recommendation engines, and fraud detection systems. They
can also be used for other types of applications, such as supply chain management, network and
infrastructure management, and bioinformatics.
2. One of the main advantages of graph databases is their ability to handle and represent
relationships between entities. This is because the relationships between entities are as
important as the entities themselves, and often cannot be easily represented in a traditional
relational database.
3. Another advantage of graph databases is their flexibility. Graph databases can handle data with
changing structures and can be adapted to new use cases without requiring significant changes
to the database schema. This makes them particularly useful for applications with rapidly
changing data structures or complex data requirements.
4. However, graph databases may not be suitable for all applications. For example, they may not
be the best choice for applications that require simple queries or that deal primarily with data
that can be easily represented in a traditional relational database. Additionally, graph databases
may require more specialized knowledge and expertise to use effectively.
Some popular graph databases include Neo4j, OrientDB, and ArangoDB. These databases
provide a range of features, including support for different data models, scalability, and high
availability, and can be used for a wide variety of applications.
A graph is a representation of data as nodes and relationships, with the relationships drawn
as edges. A graph database stores data in the form of such a graph. It has three components:
nodes, relationships, and properties, which together are used to model the data. The concept
of a graph database is based on graph theory, and graph database products in their current
form began to appear around the year 2000. They are commonly classed as NoSQL databases because
data is stored as nodes, relationships, and properties instead of in the tables of a
traditional database. A graph database is very useful for heavily interconnected data.
Relationships between data are given priority and can therefore be easily
visualized. Graph databases are flexible, as new data can be added without disturbing the
existing data. They are useful in fields such as social networking, fraud detection, and AI
knowledge graphs.
The components are described as follows:
Nodes: Represent the objects or instances. They are equivalent to a row in a relational
database. A node acts as a vertex in the graph. Nodes are grouped by applying a label to each
member.
Relationships: These are the edges in the graph. They have a specific direction and type,
and they form patterns in the data. They establish the relationships between nodes.
Properties: The information associated with nodes and relationships.
Some examples of graph database software are Neo4j, Oracle NoSQL DB, and GraphBase. Of
these, Neo4j is the most popular.
In traditional relational databases, relationships between data are not stored directly; they
are reconstructed at query time by joining tables on keys. In a graph database, the
relationships are prioritized and stored as first-class entities, together with their
properties. Much of today's data is interconnected, with one item linked directly or
indirectly to others. Because the graph database is based on graph theory, it is flexible and
works very fast for associative data. Queries are fast as well, because the desired nodes can
be reached by following relationships, and join operations are not required, which reduces
the cost.
Graph databases also allow organizations to connect their data with external sources. Since
organizations hold huge amounts of data, storing it all in tables can become cumbersome.
For instance, to find data in one table that is connected to data in another table, a join
must first be performed between the tables and the result then searched, potentially row by
row. A graph database avoids this: it stores the relationships and properties along with the
data, so the connected nodes can be found by following relationships, without joins and
without scanning row by row. The cost of such a search therefore depends on the relationships
traversed rather than on the total amount of data.
Types of Graph Databases:
Property Graphs: These graphs are used for querying and analyzing data by modelling the
relationships among the data. A property graph comprises vertices that hold information about
a particular subject and edges that denote the relationships. Both vertices and edges can have
additional attributes called properties.
RDF Graphs: RDF stands for Resource Description Framework. RDF graphs focus more on data
integration and are used to represent complex data with well-defined semantics. Each statement
is represented by three elements, two vertices and an edge, which reflect the subject,
object, and predicate of a sentence. Vertices and edges are identified by URIs (Uniform
Resource Identifiers).
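The two graph types can be contrasted with a small Python sketch. The data structures below are plain dictionaries and tuples chosen for illustration, not the storage format of any particular product, and the `http://example.org/` URIs are placeholders.

```python
# Property graph: vertices and edges each carry their own properties.
nodes = {
    "n1": {"label": "Person", "properties": {"name": "Alice"}},
    "n2": {"label": "Person", "properties": {"name": "Bob"}},
}
edges = [
    {"from": "n1", "to": "n2", "type": "FRIEND_OF", "properties": {"since": 2020}},
]

# RDF graph: every fact is a (subject, predicate, object) triple.
EX = "http://example.org/"  # placeholder namespace
triples = [
    (EX + "alice", EX + "friendOf", EX + "bob"),
    (EX + "alice", EX + "name", "Alice"),  # the object here is a literal value
]

def objects(subject, predicate):
    """Return the objects of all triples matching a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(edges[0]["properties"]["since"])
print(objects(EX + "alice", EX + "friendOf"))
```

In the property graph, the `since` value lives on the edge itself; in the triple form, attaching data to a relationship would require additional triples.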
When to Use Graph Database?
Graph databases should be used for heavily interconnected data.
They should be used when the amount of data is large and relationships are present.
They can be used to represent a cohesive picture of the data.
How Graph and Graph Databases Work?
Graph databases provide graph models. Because the data is connected, they allow users to
perform traversal queries. Graph algorithms are also applied to find patterns, paths, and
other relationships, enabling deeper analysis of the data. The algorithms help to explore
neighboring nodes, cluster vertices, and analyze relationships and patterns. Countless joins
are not required in this kind of database.
Example of Graph Database:
Recommendation engines in e-commerce use graph databases to provide customers with
accurate recommendations and updates about new products, thus increasing sales and satisfying
customers' needs.
Social media companies use graph databases to find "friends of friends" or products that the
user's friends like, and send suggestions to the user accordingly.
Graph databases play a major role in fraud detection. Users can create a graph from the
transactions between entities and store other important information. Once the graph is
created, queries over it can help to identify fraud.
Advantages of Graph Database:
A potential advantage of graph databases is that relationships can also be established with
external sources.
No joins are required since relationships are already specified.
Queries depend on concrete relationships and not on the amount of data.
They are flexible and agile.
It is easy to manage data in the form of a graph.
Efficient data modeling: Graph databases allow for efficient data modeling by representing data
as nodes and edges. This allows for more flexible and scalable data modeling than traditional
relational databases.
Flexible relationships: Graph databases are designed to handle complex relationships and
interconnections between data elements. This makes them well-suited for applications that
require deep and complex queries, such as social networks, recommendation engines, and fraud
detection systems.
High performance: Graph databases are optimized for handling large and complex datasets,
making them well-suited for applications that require high levels of performance and
scalability.
Scalability: Graph databases can be easily scaled horizontally, allowing additional servers to be
added to the cluster to handle increased data volume or traffic.
Easy to use: Graph databases are typically easier to use than traditional relational databases.
They often have a simpler data model and query language, and can be easier to maintain and
scale.
Disadvantages of Graph Database:
For complex relationships, searching can become slower.
The query language is platform dependent.
They are inappropriate for transactional data.
They have a smaller user base.
Limited use cases: Graph databases are not suitable for all applications. They may not be the
best choice for applications that require simple queries or that deal primarily with data that can
be easily represented in a traditional relational database.
Specialized knowledge: Graph databases may require specialized knowledge and expertise to
use effectively, including knowledge of graph theory and algorithms.
Immature technology: The technology for graph databases is relatively new and still evolving,
which means that it may not be as stable or well-supported as traditional relational databases.
Integration with other tools: Graph databases may not be as well-integrated with other tools and
systems as traditional relational databases, which can make it more difficult to use them in
conjunction with other technologies.
Overall, graph databases offer many advantages for applications that require
complex and deep relationships between data elements. They are highly flexible, scalable, and
performant, and can handle large and complex datasets. However, they may not be suitable for
all applications, and may require specialized knowledge and expertise to use effectively.
Future of Graph Database:
A graph database is an excellent tool for storing data, but it cannot completely replace
the traditional database. It deals with a particular kind of interconnected data. Although
graph database technology is still developing, it is becoming increasingly important as
businesses and organizations use big data and graph databases help in complex analysis. These
databases are therefore valuable for today's needs and tomorrow's success.
Schemaless database
What is a schemaless database?
A schemaless database manages information without the need for a blueprint. The onset of
building a schemaless database doesn’t rely on conforming to certain fields, tables, or data model
structures. There is no Relational Database Management System (RDBMS) to enforce any
specific kind of structure. In other words, it’s a non-relational database that can handle any
database type, whether that be a key-value store, document store, in-memory, column-oriented,
or graph data model. NoSQL databases’ flexibility is responsible for the rising popularity of a
schemaless approach and is often considered more user-friendly than scaling a schema or SQL
database.
How does a schemaless database work?
With a schemaless database, you don’t need to have a fully-realized vision of what your data
structure will be. Because it doesn’t adhere to a schema, all data saved in a schemaless database
is kept completely intact. A relational database, on the other hand, picks and chooses what data it
keeps, either changing the data to fit the schema, or eliminating it altogether. Going schemaless
allows every bit of detail from the data to remain unaltered and be completely accessible at any
time. For businesses whose operations change according to real-time data, it’s important to have
that untouched data as any of those points can prove to be integral to how the database is later
updated. Without a fixed data structure, schemaless databases can include or remove data types,
tables, and fields without major repercussions, like complex schema migrations and outages.
Because it can withstand sudden changes and parse any data type, schemaless databases are
popular in industries that are run on real-time data, like financial services, gaming, and social
media.
Schemaless vs. schema databases pros and cons
How much information do you know about your new database setup? Can you see its structure
well ahead of time and know for certain it will never change? If so, you may be dealing with a
situation that best suits a schema database. Its strictness is the basis of its appeal. Let’s get
granular and weigh the pros and cons of going one way or the other.
Schema databases
Pros | Cons
Rigorous testing | Data modeling and planning must be flexible and predefined
Rules are inflexible | Difficult to expedite the launch of the database
Code is more intelligible | The rigidity makes altering the schema at a later date a laborious process
Streamlines the process of migrating data between systems | Experimenting with fields is very difficult
Schemaless databases
Pros | Cons
All data (and metadata) remains unaltered and accessible | No universal language available to query data in a non-relational database
There is no existing “schema” for the data to be structured around | Though the NoSQL community is still growing at a tremendous rate, not all troubleshooting issues have been properly documented
Can add additional fields that SQL databases can’t accommodate | Lack of compatibility with SQL instructions
Accommodates key-value store, document store, in-memory, column-oriented, or graph data models | No ACID-level compliance, as data retrievals can have inconsistencies given their distributed approach
Schemaless Database FAQs
Redis is an example of a schemaless database. It is a NoSQL, multi-model, in-memory database
that uses its various modules to allow full connectivity and interaction between the different
models within the database. It does not need a schema to manage unstructured data.
Though NoSQL/non-relational databases are called “schemaless,” it doesn’t mean that a schema
is not eventually settled upon. Whereas a relational database uses a certain language to query
data of a certain model, in a schemaless database, the developer is the one that settles on the
architecture. So, the schema exists in a schemaless database, it’s just dictated by the developer,
not the database.
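The point that the schema still exists, but lives with the developer, can be sketched in Python. The list below stands in for a schemaless collection; `REQUIRED_FIELDS` and `is_valid` are hypothetical names for rules that application code might choose to enforce.

```python
# A list stands in for a schemaless collection: it accepts any shape of record.
collection = []

def insert(doc):
    """Store the document unaltered; no schema is checked by the store."""
    collection.append(doc)

insert({"name": "Alice", "email": "alice@example.com"})
insert({"name": "Bob", "high_score": 4200, "tags": ["gaming"]})  # new fields, no migration
insert({"email": "nobody@example.com"})                          # accepted too

# The implicit schema lives in application code instead of the database.
REQUIRED_FIELDS = {"name"}

def is_valid(doc):
    """Application-level rule: a document must contain every required field."""
    return REQUIRED_FIELDS.issubset(doc)

valid = [doc for doc in collection if is_valid(doc)]
print(len(collection), len(valid))
```

The store keeps all three documents exactly as written; whether the third one counts as a usable record is decided by the application, not by the database.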
Materialized Views
The Materialized View pattern in NoSQL databases is a design approach that optimizes query
performance by precomputing and storing the results of complex queries. This pattern is
particularly effective when the data’s original format is not ideally structured for frequent query
operations. By creating these precomputed views, the pattern aids in efficient data retrieval,
especially for large datasets or where queries involve aggregations like sum, average, or count.
In practice, materialized views store only the necessary data required by specific queries,
enabling applications to access information swiftly. These views are typically disposable and can
be rebuilt from the source data, meaning they are not directly updated but refreshed or
regenerated in response to changes in the source data. This approach can significantly reduce
computational overhead during query execution, leading to faster responses and improved
system performance.
Materialized views are beneficial in scenarios such as simplifying complex queries, improving
query performance, and providing access to specific data subsets. They are also useful in
bridging different data stores to leverage their individual capabilities. However, they might not
be ideal in situations where source data changes rapidly or where high consistency between the
view and the original data is required.
Overall, the Materialized View pattern is a strategic choice in NoSQL database design,
enhancing data access efficiency and catering to specific querying needs while managing the
trade-offs between data storage and retrieval.
The Scenario:
Materialized views are a useful way of improving query performance by precalculating and
saving optimized data representations. This process involves creating derived tables that record
and maintain the results of certain queries. By doing this, materialized views address the
problem of slow and inefficient querying of the source data. In practice, materialized views
are used in different situations, each meeting different optimization needs:
1. Views with Different Partition Keys: Materialized views can be tailored to
accommodate diverse partition keys, allowing for more efficient organization and
retrieval of data based on varying criteria. This capability is particularly beneficial in
systems where data needs to be accessed and manipulated using multiple perspectives.
2. Subsets of Data: When working with large datasets, it often makes sense to focus on
specific subsets of information that are frequently queried. Materialized views can be
employed to create summarized versions of these subsets, optimizing access to relevant
data and minimizing the need for resource-intensive full-table scans.
3. Aggregate Views: Analytical queries frequently involve aggregations, such as sum,
average, or count, performed on certain data attributes. Materialized views can
precompute these aggregations, enabling swift execution of analytical queries without the
need to repeatedly process the raw data.
In essence, materialized views serve as a performance-enhancing layer that strikes a balance
between data storage and retrieval efficiency. By precomputing and storing query results, these
views significantly reduce the computational burden during query execution, leading to quicker
responses and improved overall system responsiveness.
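The idea of an aggregate view that is rebuilt rather than updated in place can be illustrated with a minimal Python sketch. The order records, the field names, and the function build_totals_view are all hypothetical and chosen only for illustration; a real database would maintain such a view itself.

```python
from collections import defaultdict

# Hypothetical source data: one record per order.
ORDERS = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 15},
    {"customer": "alice", "amount": 20},
]


def build_totals_view(orders):
    """Precompute count, sum and average of order amounts per customer."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for order in orders:
        entry = totals[order["customer"]]
        entry["count"] += 1
        entry["sum"] += order["amount"]
    # Store the average as well, so reads never have to scan the raw orders.
    return {
        customer: {**agg, "avg": agg["sum"] / agg["count"]}
        for customer, agg in totals.items()
    }


# The view is disposable: it is regenerated from the source, never edited directly.
view = build_totals_view(ORDERS)
print(view["alice"])

ORDERS.append({"customer": "bob", "amount": 5})
view = build_totals_view(ORDERS)  # refresh after the source data changes
print(view["bob"])
```

Reading view["alice"] is a single lookup, whereas answering the same question from ORDERS requires a full scan. The cost is that the view is stale until it is refreshed.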
Introduction to Apache Cassandra
Cassandra is an open-source, distributed, wide-column NoSQL database management system
designed to handle large amounts of data across many commodity servers while providing high
availability with no single point of failure. It is written in Java and developed by the
Apache Software Foundation.
Avinash Lakshman and Prashant Malik initially developed Cassandra at Facebook to power
the Facebook inbox search feature. Facebook released Cassandra as an open-source project on
Google Code in July 2008. In March 2009 it became an Apache Incubator project, and in
February 2010 it became a top-level project. Its technical features have made Cassandra very
popular.
Introduction to Cassandra
Apache Cassandra is used to manage very large amounts of structured data spread out across the
world. It provides a highly available service with no single point of failure. Some key points
about Apache Cassandra are listed below:
It is scalable and fault-tolerant, and it offers tunable consistency.
It is a column-oriented database.
Its distributed design is based on Amazon's Dynamo and its data model on Google's Bigtable.
It was created at Facebook, and it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure, but it
adds a more powerful "column family" data model. Cassandra is used by some of the
biggest companies, such as Facebook, Twitter, Cisco, Rackspace, eBay, and Netflix. The
design goal of Cassandra is to handle big data workloads across multiple nodes without any
single point of failure. Cassandra uses a peer-to-peer distributed system across its nodes, and
data is distributed among all the nodes of the cluster. All the nodes in a cluster play the
same role. Each node is independent and at the same time interconnected with the other nodes.
Each node in a cluster can accept read and write requests, regardless of where the data is
actually located in the cluster. When a node goes down, read and write requests can be served by
other nodes in the network.
Features of Cassandra:
Cassandra has become popular because of its technical features. Some of them are listed
below:
1. Easy data distribution - Cassandra provides the flexibility to distribute data where you need
it by replicating data across multiple data centers. For example, suppose there are five nodes,
N1, N2, N3, N4, and N5. A partitioning algorithm determines the token ranges, and data is
distributed accordingly: each node owns a specific token range, and the data whose tokens fall in
that range is stored on that node. The nodes can be pictured as a ring of token ranges.
Figure: Ring structure with token range.
2. Flexible data storage - Cassandra accommodates all possible data formats, including
structured, semi-structured, and unstructured. It can dynamically accommodate changes to your
data structures according to your needs.
3. Elastic scalability - Cassandra is highly scalable and allows you to add more hardware to
accommodate more customers and more data as required.
4. Fast writes - Cassandra was designed to run on cheap commodity hardware. It performs
very fast writes and can store hundreds of terabytes of data without sacrificing read
efficiency.
5. Always-on architecture - Cassandra has no single point of failure and is continuously
available for business-critical applications that cannot afford a failure.
6. Fast linear-scale performance - Cassandra is linearly scalable, so throughput increases as
you increase the number of nodes in the cluster, while response times stay low.
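The token-range idea from feature 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token space of 2**32, the evenly spaced tokens, and the use of MD5 as a stand-in partitioner are all choices made for this example and do not reproduce Cassandra's actual partitioner or token range. The function names are hypothetical.

```python
import bisect
import hashlib

NODES = ["N1", "N2", "N3", "N4", "N5"]
RING_SIZE = 2 ** 32  # illustrative token space, not Cassandra's real range

# Give each node one evenly spaced token; a node owns the range that ends
# at its token (inclusive) and starts just after the previous node's token.
STEP = RING_SIZE // len(NODES)
TOKENS = [(i + 1) * STEP - 1 for i in range(len(NODES))]


def token_for(key):
    """Hash a partition key to a token (MD5 is a stand-in partitioner here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


def owner_of(token):
    """Return the node whose token range contains the given token."""
    index = bisect.bisect_left(TOKENS, token)
    return NODES[index % len(NODES)]  # past the last token, wrap to N1


def node_for(key):
    """Return the node responsible for a partition key."""
    return owner_of(token_for(key))


for key in ["user-1", "user-2", "user-3"]:
    print(key, "->", node_for(key))
```

Because the node is derived from the key's hash alone, any node can work out where a given key lives without consulting a central directory.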
Table creation
In Cassandra, a CQL table has a name and stores rows. When you create a table, you define
the columns for the rows, the data type of each column, a mandatory primary key that identifies
each row, and any additional table options you choose. To create a table, use the CREATE TABLE
statement. The following is the general form of a table creation statement.
Syntax: Creating a Table.
CREATE TABLE [ IF NOT EXISTS ] table_name
'('
column_definition
( ', ' column_definition )*
[ ', ' PRIMARY KEY '(' primary_key ')' ]
')' [ WITH table_options ]
Next, select an existing keyspace, for example one named App_data.
use App_data;
Now create the table User_data with the fields Name, id, and address.
CREATE TABLE User_data (
Name text,
id uuid,
address text,
PRIMARY KEY (id)
);
You can now verify that the table was created and inspect its definition. The cqlsh session
below first repeats the steps above in the existing keyspace App_data.
cassandra@cqlsh> use App_data;
cassandra@cqlsh:app_data>
cassandra@cqlsh:app_data> CREATE TABLE User_data (
... id uuid,
... Name text,
... address text,
... PRIMARY KEY (id)
... );
In Cassandra, a primary key consists of a mandatory partition key (the first column or columns),
optionally followed by one or more clustering columns.
COLUMN_DEFINITION: A column_definition clause consists of the name of the column and its type,
plus two optional modifiers:
Static: A static column has the same value for all rows that share the same partition key. Only
columns that are not part of the primary key can be static.
Primary key: A primary key uniquely identifies a row, and every table must define one.
You can now verify the created table User_data.
cassandra@cqlsh:app_data> describe User_data;
CREATE TABLE app_data.user_data (
id uuid PRIMARY KEY,
address text,
name text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction =
{
'class': '[Link]
.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'
}
AND compression = {'chunk_length_in_kb': '64',
'class': '[Link]
.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Read and Write path in Cassandra
Write Path Execution in Cassandra:
In Cassandra, a write can be sent to any node in the cluster; that node acts as the coordinator.
When a user inserts data, it is written first to the commit log and then to the memtable.
Every write includes a timestamp.
When the memtable fills up, it is flushed to disk as an SSTable.
After that, a new memtable is created in memory.
A delete is a special kind of write: it records a marker called a tombstone.
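The steps above can be sketched as a toy, single-node model in Python. This is a simplified illustration, not Cassandra's implementation: the class Node, the memtable_limit parameter, and the logical clock used in place of real timestamps are all assumptions made for this example.

```python
TOMBSTONE = object()  # marker recorded by deletes


class Node:
    """Toy single-node write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory table: key -> (value, timestamp)
        self.sstables = []     # flushed, read-only snapshots ("on disk")
        self.memtable_limit = memtable_limit
        self.clock = 0         # logical clock standing in for real timestamps

    def write(self, key, value):
        self.clock += 1
        self.commit_log.append((key, value, self.clock))  # 1. commit log first
        self.memtable[key] = (value, self.clock)          # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just a write whose value is a tombstone.
        self.write(key, TOMBSTONE)

    def flush(self):
        # The full memtable becomes an SSTable and a new memtable is created.
        self.sstables.append(self.memtable)
        self.memtable = {}


node = Node(memtable_limit=2)
node.write("12345", "Ashish")
node.write("12346", "Abi")   # memtable reaches its limit and is flushed
node.delete("12345")         # recorded as a tombstone in the new memtable
print(len(node.commit_log), len(node.sstables), len(node.memtable))  # prints: 3 1 1
```

Note that the delete does not remove the earlier value from the flushed SSTable; it only adds a newer tombstone entry, which is why deletes are treated as writes.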
Inserting Data: To insert data in Cassandra, we create a keyspace, select it, create a table in
it, and then insert data into the table. Example -
// Creating a keyspace
CREATE KEYSPACE UniversityData
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
// Using the newly created keyspace (must come before creating the table)
USE UniversityData;
// Creating a table and declaring the columns
CREATE TABLE CSE_student (
student_id int,
name text,
email text,
PRIMARY KEY (student_id)
);
// Inserting values in the table for all the columns
INSERT INTO CSE_student (student_id, name, email)
VALUES (12345, 'Ashish', 'ashish@gmail.com');
INSERT INTO CSE_student (student_id, name, email)
VALUES (12346, 'Abi', 'abi@gmail.com');
INSERT INTO CSE_student (student_id, name, email)
VALUES (12347, 'Rana', 'rana@gmail.com');
INSERT INTO CSE_student (student_id, name, email)
VALUES (12348, 'Aayush', 'aayush@gmail.com');
INSERT INTO CSE_student (student_id, name, email)
VALUES (12349, 'harsh', 'haarsh@gmail.com');
Read Path Execution:
When reading data in Cassandra, any node may be queried; it acts as the coordinator.
To read the data, the coordinator contacts the nodes that hold the requested key.
Within a data center, on each of those nodes, the data is pulled from the SSTables and merged.
For read consistency, a read at a consistency level lower than ALL performs read repair in the
background (controlled by read_repair_chance).
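The merge step on a single node can be illustrated with a short Python sketch. It assumes that each stored cell is a (value, timestamp) pair, that the newest timestamp wins, and that a tombstone hides older values. It also consults an in-memory memtable alongside the SSTables. The function read and the sample data are hypothetical and serve only as an illustration.

```python
TOMBSTONE = object()  # marker left behind by a delete


def read(key, memtable, sstables):
    """Merge every stored version of a key; the newest timestamp wins."""
    versions = []
    for source in [memtable, *sstables]:
        if key in source:
            versions.append(source[key])  # each cell is (value, timestamp)
    if not versions:
        return None  # the key was never written
    value, _timestamp = max(versions, key=lambda cell: cell[1])
    # If the newest version is a tombstone, the row counts as deleted.
    return None if value is TOMBSTONE else value


# Hypothetical state of one node: two SSTables on disk and a memtable.
older_sstable = {"12345": ("Ashish", 1), "12346": ("Abi", 2)}
newer_sstable = {"12345": ("Ashish K", 3)}
memtable = {"12346": (TOMBSTONE, 4)}

print(read("12345", memtable, [older_sstable, newer_sstable]))
print(read("12346", memtable, [older_sstable, newer_sstable]))
```

The first lookup returns the value from the newer SSTable, and the second returns None because the most recent entry for that key is a tombstone.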
Reading Data : Write a cqlsh query to read data from CSE_student and give output for the
same.
select *
from CSE_student;
Output :
index | student_id | name   | email
------+------------+--------+------------------
0     | 12345      | Ashish | ashish@gmail.com
1     | 12346      | Abi    | abi@gmail.com
2     | 12347      | Rana   | rana@gmail.com
3     | 12348      | Aayush | aayush@gmail.com
4     | 12349      | harsh  | haarsh@gmail.com