Understanding NoSQL Databases and Types

The document compares NoSQL and RDBMS databases, highlighting the characteristics, advantages, and limitations of each. It explains the ACID properties of transactions in RDBMS and the BASE principles in NoSQL, emphasizing NoSQL's flexibility, scalability, and suitability for big data applications. Additionally, it discusses the CAP theorem, which outlines the trade-offs between consistency, availability, and partition tolerance in distributed systems.

Uploaded by

contentbyme247

NOSQL

RDBMS Characteristics
• Data stored in columns and tables
• Relationships represented by data
• Data Manipulation Language
• Data Definition Language
• Transactions
• Abstraction from physical layer
• Applications specify what, not how
• Physical layer can change without modifying applications
• Create indexes to support queries
• In Memory databases
Transactions – ACID Properties
• Atomic – All of the work in a transaction completes (commit) or none of it
completes.
• A transaction that transfers funds from one account to another involves a withdrawal
from the first account and a deposit into the second. If the deposit
fails, you don’t want the withdrawal to happen either.
• Consistent – A transaction transforms the database from one consistent state to
another consistent state. Consistency is defined in terms of constraints.
• A database tracking a checking account may only allow unique check numbers to exist for each
transaction.
• Isolated – The results of any changes made during a transaction are not visible until
the transaction has committed.
• A teller looking up a balance must be isolated from a concurrent transaction involving a
withdrawal from the same account. Only when the withdrawal transaction commits
successfully and the teller looks at the balance again will the new balance be reported.
• Durable – The results of a committed transaction survive failures.
• A system crash or any other failure must not be allowed to lose the results of a transaction or
the contents of the database. Durability is often achieved through separate transaction logs
that can "re-create" all transactions from some chosen point in time (like a backup).
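The funds-transfer example under "Atomic" can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a banking implementation: the `accounts` table, the account names, and the helpers `make_bank`, `transfer`, and `balances` are all hypothetical. The withdrawal and the deposit run inside one transaction, so if either step fails, neither is kept.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Withdraw from `src` and deposit into `dst` as one atomic unit."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if any statement raises.
        with conn:
            for account, delta in ((src, -amount), (dst, amount)):
                cur = conn.execute(
                    "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                    (delta, account),
                )
                if cur.rowcount != 1:
                    raise LookupError(f"unknown account: {account}")
        return True
    except (sqlite3.Error, LookupError):
        return False


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


bank = make_bank()
print(transfer(bank, "alice", "bob", 30), balances(bank))
print(transfer(bank, "alice", "carol", 30), balances(bank))  # deposit fails
```

The second call withdraws from alice first, then fails on the deposit because "carol" does not exist; the rollback undoes the withdrawal, so the balances are unchanged. The CHECK constraint plays the role of a consistency rule: an overdraft attempt is rejected the same way.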
Limitations of Relational Database
• Issues with scaling up when the database is just too big, e.g. Big Data
• Slow at large data sizes
• Resource intensive
• Not designed to be distributed.
• In the relational model (SQL), every database has a pre-defined
structure (schema). All records of a table are restricted to the same
column names and data types.
NoSQL
• NoSQL stands for:
• Non-Relational
• No RDBMS
• Not Only SQL
• NoSQL is an umbrella term for all databases and data stores that don’t follow the
RDBMS principles
• A class of products
• A collection of several (related) concepts about data storage and manipulation
• Often related to large data sets
• A NoSQL database is a non-relational data management system that does not
require a fixed schema. It avoids joins and is easy to scale.
• NoSQL is used for Big Data and real-time web apps. For example, companies like
Twitter, Facebook and Google collect terabytes of user data every single day.
• NoSQL databases follow the BASE principle, while RDBMSs follow the ACID principle.
BASE Principle in NoSQL
• BA -> Basically Available
• The system remains operational and provides a basic level of availability, even in the
presence of failures or network partitions.
• S -> Soft State
• This property allows the database to be in a state of flux, where changes may not be
immediately reflected in all replicas.
• Data in the system may be in an intermediate or transient state, allowing temporary
inconsistencies or partial updates during concurrent operations.
• E -> Eventually Consistent
• This property ensures that the database will eventually converge to a consistent state, but
this may take some time.
• The system will reach a consistent state over time, acknowledging that temporary
inconsistencies may exist due to factors like network latency or replication delays.
• BASE is a trade-off between consistency and availability. By
relaxing the consistency requirements, NoSQL databases can achieve higher
availability and scalability.
• i.e., NoSQL databases prioritize scalability, performance, and availability over strict
consistency.
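The soft-state and eventual-consistency ideas above can be illustrated with a toy simulation. It is only a sketch under simplifying assumptions: the `Replica` and `Cluster` classes are hypothetical, a single global version counter stands in for real conflict resolution, and "replication delay" is modelled as an inbox that is drained only when `sync()` is called.

```python
from collections import deque


class Replica:
    """One copy of the data, with a queue of updates not yet applied."""

    def __init__(self, name):
        self.name = name
        self.data = {}        # key -> (version, value)
        self.inbox = deque()  # replicated updates still in flight

    def apply_pending(self):
        """Apply queued updates, keeping the highest version per key."""
        while self.inbox:
            key, version, value = self.inbox.popleft()
            current = self.data.get(key)
            if current is None or version > current[0]:
                self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class Cluster:
    def __init__(self, names):
        self.replicas = [Replica(name) for name in names]
        self.version = 0  # assumption: one global counter orders all writes

    def write(self, key, value, coordinator=0):
        """Apply the write on one replica now; the others see it later."""
        self.version += 1
        update = (key, self.version, value)
        for index, replica in enumerate(self.replicas):
            replica.inbox.append(update)
            if index == coordinator:
                replica.apply_pending()

    def sync(self):
        """Deliver all in-flight updates, so the replicas converge."""
        for replica in self.replicas:
            replica.apply_pending()


cluster = Cluster(["a", "b", "c"])
cluster.write("cart", ["book"])
print([r.read("cart") for r in cluster.replicas])  # replicas disagree for now
cluster.sync()
print([r.read("cart") for r in cluster.replicas])  # all replicas agree
```

Between the write and the sync, every replica still answers reads (basically available), but two of them return stale data (soft state). After the sync they all return the same value (eventually consistent).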
NoSQL...
• Does not follow the relational model
• In the relational model, data is structured, while in NoSQL, data can be structured,
semi-structured, or unstructured
• Does not store data in tables with flat, fixed-column records
• Flexible:
• NoSQL is a non-relational database system that does not require a fixed schema.
• In the non-relational model (NoSQL), there is no pre-defined structure (i.e. it is schema-less).
Any data can be stored in any record, which provides more flexibility.
• Scalable
• Distributed: a NoSQL database can run across multiple nodes in a distributed
fashion
• Runs on low-cost (commodity) hardware
• Can offer faster performance on large data sets
Limitations of NoSQL
• Limited query capabilities
• NoSQL databases are often not as good at querying data as relational
databases.
• Doesn’t work well with relational data
• Doesn’t offer traditional database capabilities like consistency when
multiple transactions are performed simultaneously
• There are fewer tools and documentation available for NoSQL
databases than for relational databases. This can make it difficult to
learn how to use NoSQL databases and to troubleshoot problems.
• No standardization: query languages and interfaces differ from product to product
• NoSQL databases can be less secure than relational databases.
Types of NoSQL Database
NoSQL databases are mainly categorized into four
types: document-oriented, key-value pair,
column-oriented, and graph-based.

1. Document-oriented
2. Key-value pair based
3. Column-oriented (wide-column stores)
4. Graph-based
Key Value Pair Based Database
• Every single item in the database is stored as an attribute name (or key) together with its
value.
• Key-value databases store data as a hash table where each key is unique, and
the value can be an integer, a string, or a complex data type such as a set.
• Data is retrieved via an exact match of the key.
• This design is intended to handle large amounts of data and
heavy load.
• New types of data can easily be added to the database as new key-value pairs.
• The 3 operations performed on a key-value database are:
• put(key,value)
• get(key)
• delete(key)
• Redis, Dynamo, Riak, and Oracle NoSQL are some examples of key-value store
databases.
• This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc.
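The three operations listed above can be sketched as a small in-memory class backed by a Python dict (itself a hash table). The class name `KeyValueStore` and the sample keys are hypothetical; real products such as Redis or Riak expose their own client APIs and add persistence, replication, and expiry on top of the same idea.

```python
class KeyValueStore:
    """Minimal in-memory key-value store: put, get, delete."""

    def __init__(self):
        self._table = {}  # hash table: each unique key maps to one value

    def put(self, key, value):
        """Insert a new pair or overwrite the value of an existing key."""
        self._table[key] = value

    def get(self, key, default=None):
        """Exact-match lookup; returns `default` when the key is absent."""
        return self._table.get(key, default)

    def delete(self, key):
        """Remove the key; returns True if it existed, False otherwise."""
        if key in self._table:
            del self._table[key]
            return True
        return False


store = KeyValueStore()
store.put("user:1:name", "Ana")                  # simple string value
store.put("user:1:tags", {"admin", "beta"})      # complex value (a set)
print(store.get("user:1:name"))
print(store.delete("user:1:name"), store.get("user:1:name"))
```

Lookups work only by exact key; there is no way to ask for "all users named Ana" without scanning every value, which is the query limitation that document databases address.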
Key Value Pair database schema

Document Oriented database
• Document databases are similar to key-value databases in that there is a key and
a value.
• But in a document database, the value contains structured or semi-structured
data. This structured/semi-structured value is referred to as a document.
• Mostly used in content management systems, blogging platforms, real-time
analytics (log analytics), social media and e-commerce applications.

Example:
• MongoDB
• CouchDB
Document Based No SQL
• A document-oriented NoSQL DB stores and retrieves data as a key-value
pair, but the value part is stored as a document. The document
is stored in a format such as JSON or XML. The value is understood by the DB
and can be queried.
• A document in NoSQL is roughly equivalent to a row in an RDBMS.
Document Databases (Document Store)
• The central concept is the notion of a "document", which corresponds to a row in an RDBMS.
• A document comes in a standard format such as JSON.
• Documents are addressed in the database via a unique key that represents that document.
• The database offers an API or query language that retrieves documents based on their contents.
• Documents are schema-free, i.e., different documents can have structures and schemas that differ from one
another. (An RDBMS requires that each row contain the same columns.)
• Example documents (JSON-like, in MongoDB shell notation):
{
_id: ObjectId("51156a1e056d6f966f268f81"),
type: "Article",
author: "Derick Rethans",
title: "Introduction to Document Databases with MongoDB",
date: ISODate("2013-04-24T16:26:31.911Z"),
body: "This arti…"
},
{
_id: ObjectId("51156a1e056d6f966f268f82"),
type: "Book",
author: "Derick Rethans",
title: "php|architect's Guide to Date and Time Programming with PHP",
isbn: "978-0-9738621-5-7"
}
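To illustrate "retrieves documents based on their contents", here is a minimal sketch in plain Python. It is not the MongoDB API: the `find` helper is hypothetical, and the two documents above are represented as ordinary dicts with the `_id` values kept as strings and the `date` and `body` fields omitted.

```python
def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(
            field in doc and doc[field] == wanted
            for field, wanted in criteria.items()
        )
    ]


documents = [
    {
        "_id": "51156a1e056d6f966f268f81",
        "type": "Article",
        "author": "Derick Rethans",
        "title": "Introduction to Document Databases with MongoDB",
    },
    {
        "_id": "51156a1e056d6f966f268f82",
        "type": "Book",
        "author": "Derick Rethans",
        "title": "php|architect's Guide to Date and Time Programming with PHP",
        "isbn": "978-0-9738621-5-7",
    },
]

print(len(find(documents, author="Derick Rethans")))  # both documents match
print(find(documents, type="Book")[0]["isbn"])
```

The two documents share a collection even though only the book has an `isbn` field, which is what "schema-free" means in practice. A query on a field that a document lacks simply does not match that document.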
Column Store Database
• It stores data using a column-oriented model.
• Data is organized and stored by column rather than by row.
• Each column is stored separately, allowing for efficient data
retrieval and analysis.
• Column-oriented databases work on columns and are based
on the Bigtable paper by Google. Every column is treated
separately. Values of a single column are stored
contiguously.
• Column store databases use the concept of a keyspace (like a
schema in the relational model).
• A keyspace contains column families (like tables in the
relational model), which contain rows, and within rows there
are columns.
• Column-based NoSQL databases are widely used to
manage data warehouses, business intelligence, CRM, and
library card catalogs.
• HBase, Cassandra, and Hypertable are
examples of column-based NoSQL databases.
• A column family consists of multiple rows.
• Each row can contain a different number of columns from the other rows, and the columns don’t have to match the columns in the other rows (i.e. they can have
different column names, data types, etc.).
• Each column is contained within its row. It doesn’t span all rows as in a relational database. Each column contains a name/value pair, along with a timestamp.
Note that this example uses Unix/Epoch time for the timestamp.
Here’s how each row is constructed:

• Row Key. Each row has a unique key, which is a unique identifier for that row.
• Column. Each column contains a name, a value, and a timestamp.
• Name. This is the name of the name/value pair.
• Value. This is the value of the name/value pair.
• Timestamp. This provides the date and time that the data was inserted. This can be used to determine the
most recent version of data.
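The row structure described above (row key, then columns made of a name, a value, and a timestamp) can be sketched as nested dictionaries. The `ColumnFamily` class, the `users` family, and the sample rows are hypothetical, and timestamps are plain Unix epoch seconds; real wide-column stores add on-disk sorting, compaction, and distribution that this sketch leaves out.

```python
import time


class ColumnFamily:
    """Rows keyed by row key; each row holds its own set of columns."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # row key -> {column name: (value, timestamp)}

    def insert(self, row_key, column, value, timestamp=None):
        """Store a cell; an older timestamp never overwrites a newer one."""
        ts = int(time.time()) if timestamp is None else timestamp
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(column)
        if current is None or ts >= current[1]:
            columns[column] = (value, ts)

    def get(self, row_key, column):
        """Return the most recent value of a cell, or None if absent."""
        cell = self.rows.get(row_key, {}).get(column)
        return None if cell is None else cell[0]

    def columns(self, row_key):
        """Column names present in one row (rows need not match)."""
        return sorted(self.rows.get(row_key, {}))


users = ColumnFamily("users")
users.insert("u1", "name", "Ana", timestamp=1700000000)
users.insert("u1", "email", "ana@example.com", timestamp=1700000000)
users.insert("u2", "name", "Bo", timestamp=1700000000)   # no email column
users.insert("u1", "email", "ana@new.example", timestamp=1700000500)
print(users.columns("u1"), users.columns("u2"))
print(users.get("u1", "email"))
```

Row `u2` simply has no `email` column, so nothing is stored for it, which is how this model avoids spending space on nulls. The timestamp decides which write wins when the same cell is written more than once.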
Sorted Ordered Column-Oriented Stores
• Data are stored in a column-oriented way
• Data efficiently stored
• Avoids consuming space for storing nulls
• Columns are grouped in column-families
• Data isn’t stored as a single table but is stored
by column families
• Unit of data is a set of key/value pairs
• Identified by “row-key”
• Ordered and sorted based on row-key
• Notable for:
• Google's Bigtable (used in many of
Google's services)
• HBase (Facebook, StumbleUpon,
Hulu, Yahoo!, ...)
Graph-Based NoSQL Database
• It organizes data in the form of a graph.
• A graph database contains a collection of
nodes and edges.
• A node represents an entity, and an edge
represents the connection or relationship
between two entities.
• A graph database stores entities as
well as the relations amongst those entities.
• Every node and edge has a unique
identifier.
• Each node and edge can have any
number of attributes.
• Both the nodes and edges can be
labelled.
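The properties listed above (unique identifiers, labels, and arbitrary attributes on both nodes and edges) can be sketched in a few lines of Python. The `Graph` class, the identifiers, and the `FOLLOWS` / `WROTE` labels are hypothetical, and edges are treated as directed; a real graph database would also provide a query language and indexes.

```python
from collections import deque


class Graph:
    """Labelled property graph: nodes and edges both carry attributes."""

    def __init__(self):
        self.nodes = {}  # node id -> {"label": ..., "props": {...}}
        self.edges = {}  # edge id -> {"label", "src", "dst", "props"}

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, edge_id, label, src, dst, **props):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges[edge_id] = {
            "label": label, "src": src, "dst": dst, "props": props,
        }

    def neighbors(self, node_id, label=None):
        """Targets of outgoing edges, optionally filtered by edge label."""
        return [
            edge["dst"]
            for edge in self.edges.values()
            if edge["src"] == node_id
            and (label is None or edge["label"] == label)
        ]

    def reachable(self, start, label=None):
        """Breadth-first search: every node reachable from `start`."""
        seen, queue = {start}, deque([start])
        while queue:
            for nxt in self.neighbors(queue.popleft(), label):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)
        return seen


g = Graph()
g.add_node("u1", "User", name="Ana")
g.add_node("u2", "User", name="Bo")
g.add_node("u3", "User", name="Cy")
g.add_node("p1", "Post", title="Hello")
g.add_edge("e1", "FOLLOWS", "u1", "u2", since=2020)
g.add_edge("e2", "FOLLOWS", "u2", "u3", since=2021)
g.add_edge("e3", "WROTE", "u3", "p1")
print(sorted(g.reachable("u1", label="FOLLOWS")))
```

Following relationships is a traversal over stored edges rather than a join computed at query time, which is why questions such as "who can u1 reach through FOLLOWS?" are natural in this model.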
Graph Database example
• Graph databases are well-suited for storing and
querying data that has a natural graph structure.
• Example:
• Social networks: Graph databases can be used to
store and query data about social networks, such
as the relationships between users, groups, and
posts.
• Transportation: Graph databases can be used to
store and query data about transportation
networks, such as the relationships between
roads, intersections, and traffic signals.
• Logistics: Graph databases can be used to store
and query data about logistics networks, such as
the relationships between warehouses,
shipments, and transportation routes.
Comparison of Relational, Document and
Graph database models
CAP Theorem
• CAP theorem is also known as Brewer's theorem.
• It states that a distributed system can deliver only two of three desired characteristics: consistency,
availability, and partition tolerance (the ‘C,’ ‘A’ and ‘P’ in CAP).
• In other words, it is impossible to guarantee all three of these properties in a distributed system at the same time.
• Consistency:
• Consistency means that all replicas of a data item must have the same value at all times.
• Achieving strong consistency ensures that the system behaves as if it were a single, centralized database.
• Availability:
• Availability means that the system continues to operate and provide responses despite the presence of
failures or network partitions.
• High availability implies that the system remains functional (all requests for data must be answered)
even if some nodes fail or become unreachable.
• Partition tolerance:
• Partition tolerance means that the system must continue to operate even if some of the nodes are
partitioned from the network.
• A partition is a communications break within a distributed system—a lost or temporarily delayed
connection between two nodes.

CAP Theorem…
• According to the CAP theorem, in the presence of a network partition (P), a distributed
system must choose between either consistency (C) or availability (A).
• Therefore, in such a scenario we either choose to compromise on Consistency or on
Availability.
• Hence, a NoSQL distributed database is either characterized as CP or AP.
• CP Database:
• A CP database offers consistency and partition tolerance but sacrifices availability. The practical result is
that when a partition occurs, the system must make the inconsistent node unavailable until it can resolve
the partition. MongoDB and Redis are commonly cited as examples of CP databases.
• Example: financial systems such as stock trading platforms and banking systems, as well as e-commerce websites,
may prioritize consistency to ensure accurate account balances and transaction records.
• AP Database:
• An AP database provides availability and partition tolerance but not consistency in the event of a failure.
All nodes remain available when a partition occurs, but some might return an older version of the data.
CouchDB, Cassandra, and ScyllaDB are examples of AP databases.
• Example: content delivery networks and social media platforms may prioritize availability and
partition tolerance to ensure fast content delivery and reduce latency.
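The CP/AP choice above can be made concrete with a toy model. This is a sketch under strong simplifications, not how any of the named databases is implemented: the `Node` class and `replicate` function are hypothetical, there is one primary value, and a partition is just a flag on the replica.

```python
class Node:
    """A replica that behaves as CP or AP while it is partitioned."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = None
        self.partitioned = False  # True while cut off from the primary

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise RuntimeError("unavailable during partition")
        # AP: always answer, even if the value may be out of date.
        return self.value


def replicate(primary_value, replicas):
    """Copy the primary's value to every replica that is reachable."""
    for node in replicas:
        if not node.partitioned:
            node.value = primary_value


cp, ap = Node("CP"), Node("AP")
replicate("v1", [cp, ap])
cp.partitioned = ap.partitioned = True   # the network partition begins
replicate("v2", [cp, ap])                # this update cannot reach them
print(ap.read())                         # answers, but with the old value
try:
    cp.read()
except RuntimeError as err:
    print("CP node:", err)
```

During the partition the AP node stays available and returns the older value, while the CP node gives up availability to avoid an inconsistent answer. Once the partition heals and replication runs again, both return the same value.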
CAP Theorem…
• CA Database:
• A CA database delivers consistency and availability, but it can’t deliver fault tolerance if any two nodes in the
system have a partition between them.
• There are no NoSQL databases we can classify as CA under the CAP theorem.
• In a distributed database, there is no way to avoid system partitions. So, there is currently no true CA
distributed database system.
• The modern goal of CAP theorem analysis should be for system designers to generate optimal combinations of
consistency and availability for particular applications.
• The CAP theorem states that a distributed database system has to make a tradeoff
between Consistency and Availability when a Partition occurs.
• A distributed database system is bound to have partitions in a real-world system due to network
failure or some other reason. Therefore, partition tolerance is a property we cannot avoid while
building our system. So, a distributed system will either choose to give up on Consistency or
Availability but not on Partition tolerance.
• For example,
• in a distributed system, if a partition occurs between two nodes, it is impossible to provide consistent data on
both the nodes and availability of complete data.
• CA type databases are generally the monolithic databases that work on a single node and provide no
distribution. Hence, they require no partition tolerance
Big Data
• A collection of large (a relative term) and complex datasets that are difficult to
store and process using traditional database and data processing tools.
• Big Data technologies are a new generation of technologies and architectures,
designed to economically extract value from very large volumes of a wide variety
of data by enabling high-velocity capture, discovery, and/or analysis.
• A term related to extracting meaningful information by analyzing huge amounts of
complex, variously formatted data generated at high speed, which cannot be
handled or processed by traditional systems.
• Some of the challenges associated with big data include:
• Complexity: Big data can be very complex, making it difficult to understand and
analyze.
• Scalability: Big data solutions need to be scalable in order to accommodate the ever-
increasing volume of data.
• Cost: Big data solutions can be expensive to implement and maintain.
5Vs of Big Data
1. Volume:
• Huge amount of data; growing exponentially
2. Velocity:
• High speed of accumulation of data ( how fast the data is generated and processed to meet the
demands)
• This means that organizations need to be able to process data in real time or near real
time in order to make informed decisions.
3. Variety :
• Nature of data that is structured, semi-structured and unstructured ( heterogeneous sources)
• Organizations need to be able to collect and process all of these different types of data in
order to get a complete picture of their business.
4. Veracity :
• Veracity refers to the quality of data. The accuracy and reliability of the data is also
important.
• Organizations need to be able to trust that the data they are collecting is accurate and
reliable in order to make informed decisions.
5. Value:
• Data needs to be converted into something valuable, i.e. useful information must be extracted from it
• Motivation to use Big Data
• The size of data is growing rapidly, data is spread
across multiple machines and stored in different
formats
• Moving data to databases is expensive
• Possible Solutions
• Analyze the data in the format it is already in (for
example, a text file need not be loaded into a
database in order to analyze it)
• Have your code read the data where it is stored
(i.e. don’t move the data out of the box)
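As a small illustration of analyzing a text file in the format it is already in, the sketch below counts log levels directly from lines of text without loading them into a database. The function name `count_levels` and the line layout "timestamp LEVEL message" are assumptions made for the example.

```python
from collections import Counter


def count_levels(lines):
    """Count the second whitespace-separated field (the log level) per line.

    `lines` can be any iterable of strings, including an open text file,
    so the input is streamed one line at a time and never held in full.
    """
    counts = Counter()
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) >= 2:  # skip blank or malformed lines
            counts[parts[1]] += 1
    return counts


sample = [
    "2024-01-01T10:00:00 INFO service started",
    "2024-01-01T10:00:05 ERROR disk full",
    "",
    "2024-01-01T10:00:09 ERROR disk full",
]
print(count_levels(sample))
```

For a real file, the same function can be applied in place with `with open(path, encoding="utf-8") as handle: count_levels(handle)`, where `path` is whatever file holds the data. The code goes to the data, and only the small summary leaves.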
Benefits of Big Data
• Improved decision-making
• Increased customer insights
• Enhanced operational efficiency
• New product and service development
• Competitive advantage
Some examples of big data
• Transactional data: This includes data from customer transactions,
such as sales, purchases, and returns.
• Sensor data: This includes data from sensors that are used to
monitor equipment, such as machinery, vehicles, and buildings.
• Social media data: This includes data from social media platforms,
such as Facebook, Twitter, and LinkedIn.
• Log data: This includes data from logs that are generated by
applications, servers, and networks.
• Image and video data: This includes data from images and videos
that are captured by cameras, smartphones, and other devices.
Big Data Architecture Layers

• Big Data Sources Layer: Collects data from a variety of sources.
• Management & Storage Layer: This layer receives data from the source, converts the data into a format
comprehensible for the data analytics tool, and stores the data according to its format.
• Analysis Layer: This layer extracts business intelligence from the big data storage layer using analytics tools.
• Consumption Layer: This layer receives results from the big data analysis layer and presents them to the
pertinent output layer, also known as the business intelligence layer.
Big Data Architecture
• Data sources:
• All big data solutions start with one or more data sources. These can be structured data,
such as customer data from a CRM system, or unstructured data, such as social media
posts or sensor data.
• Data storage:
• Big data is often stored in a distributed fashion, using a Hadoop cluster or other
distributed file system. This allows for scalability and fault tolerance.
• Batch processing:
• Batch processing is used to process large amounts of data that does not need to be
processed in real time. This type of processing is often used for historical data analysis
or for generating reports.
• Real-time message ingestion:
• Real-time message ingestion captures and buffers incoming messages so that they can be processed in real
time. It is often used for fraud detection or for monitoring systems.
• Stream processing:
• Stream processing is a type of real-time processing that is used to process data that is
continuously flowing in. This type of processing is often used for financial trading or for
social media analytics.
Big Data Architecture…
• Machine learning:
• Machine learning is used to extract insights from big data. This type of processing is
often used for predictive analytics or for customer segmentation.
• Analytical data store:
• The analytical data store is where the processed data is stored for analysis and
reporting. This data store can be a traditional relational database or a NoSQL
database.
• Analysis and reporting:
• Analysis and reporting is the final step in the big data architecture process. This is
where the insights from the data are used to make decisions or to improve business
processes.
• Orchestration:
• It is the process of automating the flow of data through a big data architecture. It
involves the coordination of different data processing tasks, such as data ingestion,
data storage, data processing, and data analysis. Orchestration technologies such as
Azure Data Factory, or Apache Oozie and Sqoop, can be used to automate the
workflow.
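The core idea of orchestration, running each task only after the tasks it depends on, can be sketched with the standard-library `graphlib` module (Python 3.9 or later). The task names and the `run_pipeline` helper are hypothetical; tools such as Azure Data Factory or Apache Oozie additionally handle scheduling, retries, and monitoring, none of which is modelled here.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, depends_on):
    """Run every task after its dependencies; return (order, results).

    tasks:      {task name: function taking the results so far}
    depends_on: {task name: set of task names that must run first}
    """
    order = list(TopologicalSorter(depends_on).static_order())
    results = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


# Hypothetical four-stage flow: ingestion -> storage -> processing -> analysis
tasks = {
    "ingest": lambda done: ["3", "1", "2"],
    "store": lambda done: list(done["ingest"]),
    "process": lambda done: sorted(int(x) for x in done["store"]),
    "analyze": lambda done: sum(done["process"]),
}
depends_on = {
    "store": {"ingest"},
    "process": {"store"},
    "analyze": {"process"},
}

order, results = run_pipeline(tasks, depends_on)
print(order)
print(results["analyze"])
```

Declaring dependencies separately from the task code is the design choice that matters: the orchestrator, not the tasks, decides the execution order, and a cyclic dependency is rejected with an error instead of running forever.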
Hadoop
• Hadoop is an open-source software framework for distributed storage and processing of large data
sets.
• It is designed to scale up from single servers to thousands of nodes.
• Hadoop uses a master-slave architecture.
• The master node is responsible for managing the cluster, while the slave nodes are responsible for
storing and processing data.
• Hadoop uses a distributed file system called HDFS (Hadoop Distributed File System) to store data.
• HDFS is designed to be fault-tolerant and scalable.
• Hadoop uses a programming model called MapReduce to process data.
• MapReduce is a divide-and-conquer approach to processing data.
• MapReduce is a programming model and processing framework that allows for parallel processing
of data across the cluster by dividing it into smaller tasks (map and reduce) that are executed in
parallel on different nodes.
• It supports batch processing, making it suitable for large-scale data analytics, data transformation,
and ETL (Extract, Transform, Load) operations.
• Hadoop has a built-in fault tolerance mechanism. When a node fails, Hadoop automatically
redistributes the data and reassigns the failed tasks to other available nodes, ensuring
uninterrupted processing.
• Hadoop has a vibrant ecosystem with various tools and frameworks built on top of it, such as
Apache Hive (SQL-like query language for data warehousing), Apache Pig (data processing and
scripting), Apache Spark (in-memory processing), and Apache HBase (NoSQL database).
• It is being used by Facebook, Yahoo, Google, Twitter,
LinkedIn and many more.
• Hadoop can be used for a variety of big data applications, such
as:
• Log analysis
• Data mining
• Machine learning
• Business intelligence
Benefits of using Hadoop
• Scalability:
• Hadoop is designed to scale up from single servers to thousands of
nodes.
• Fault tolerance:
• Hadoop is designed to be fault-tolerant. If a node fails, the data is still
available on the other nodes.
• Cost-effectiveness:
• Hadoop is a cost-effective solution for big data processing.
• Open-source:
• Hadoop is an open-source software framework, so it is free to use and
modify.
Modules of Hadoop
• HDFS (Hadoop Distributed File System)
• HDFS was developed on the basis of Google's published Google File System (GFS) paper.
• Files are broken into blocks and stored on nodes across the distributed
architecture.
• This makes HDFS highly scalable, as it can be easily expanded to add more nodes.
• It is designed to be fault-tolerant and scalable.
• YARN
• YARN (Yet Another Resource Negotiator) is used for job scheduling and for managing the cluster.
• Map Reduce:
• MapReduce is a programming model to process large sets of data in parallel.
• It is designed to run on a large cluster of computers.
• MapReduce works by dividing the data into smaller chunks, which are then processed in parallel
by a cluster of computers. This makes MapReduce very efficient for processing large data sets.
• The Map task takes input data and converts it into a data set of
key-value pairs.
• The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired
result.
• Hadoop Common:
• These Java libraries are used to start Hadoop and are used by other Hadoop modules.
HDFS (Hadoop Distributed File System)
• HDFS (Hadoop Distributed File System) is a distributed file system
designed to run on commodity hardware.
• It is designed to be fault-tolerant and scalable.
• Data in HDFS is divided into blocks of fixed size (default is 128 MB), which
are replicated across multiple machines in the cluster for data redundancy
and availability.
• This makes HDFS highly scalable, as it can be easily expanded to add
more nodes.
• HDFS is also fault-tolerant, as it can continue to operate even if some of
the nodes in the cluster fail.
• HDFS is a key component of the Hadoop ecosystem, and it is used to
store data for a variety of big data applications.
• HDFS follows a master-slave architecture, where the NameNode serves
as the master and manages the metadata, while multiple DataNodes act
as slaves and store the actual data blocks.
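The block arithmetic described above can be worked through in a short sketch. The 128 MB block size comes from the text; the replication factor of 3, the DataNode names, and the round-robin placement are assumptions for illustration only (real HDFS placement is rack-aware and is decided by the NameNode).

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size noted above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block; only the last can be shorter."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Illustrative policy only; it is not the placement HDFS itself uses.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    return {
        block: [
            datanodes[(block + r) % len(datanodes)]
            for r in range(replication)
        ]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
print([size // MB for size in blocks])  # a 300 MB file, block sizes in MB
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

A 300 MB file becomes two full 128 MB blocks plus one 44 MB block. Because every block lives on several machines, losing any single DataNode leaves at least one other copy of each block available.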
HDFS…
• NameNode
• The NameNode is the master node in HDFS.
• It is the single master server in the HDFS cluster.
• It stores the metadata for the filesystem, such as the location of the blocks.
• Because it is a single node, it can become a single point of failure.
• It manages the file system namespace by executing operations such as
opening, renaming, and closing files.
• DataNode
• A DataNode is a slave node in HDFS.
• The HDFS cluster contains multiple DataNodes.
• Each DataNode contains multiple data blocks.
• These data blocks are used to store data.
• The DataNodes are responsible for serving read and write requests for files.
• They perform block creation, deletion, and replication upon instruction from the
NameNode.
• They also replicate the blocks of data to other DataNodes for fault tolerance.
Responsibilities of NameNode and DataNode in HDFS:

• NameNode:
• Keeps track of the location of all the blocks in the file system.
• Assigns blocks to DataNodes.
• Tracks which DataNodes have which blocks.
• Handles file operations like opening, closing, and renaming.
• DataNode:
• Stores blocks of data.
• Serves read and write requests for files.
• Replicates blocks of data to other DataNodes for fault tolerance.
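The division of responsibilities above can be sketched as a toy model in which the NameNode holds only metadata and the DataNodes hold the bytes. Everything here is hypothetical and heavily simplified: the class and function names, the replication factor of 2, the tiny block size, and the round-robin placement are chosen for readability, and `write_file` / `read_file` stand in for an HDFS client.

```python
class DataNode:
    """Stores block contents; knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}   # block id -> bytes
        self.alive = True


class NameNode:
    """Keeps metadata only: which blocks make up a file, and where they are."""

    def __init__(self, datanode_names, replication=2):
        self.datanode_names = list(datanode_names)
        if replication > len(self.datanode_names):
            raise ValueError("replication exceeds the number of DataNodes")
        self.replication = replication
        self.files = {}      # file name -> [block id, ...]
        self.locations = {}  # block id -> [DataNode name, ...]

    def allocate(self, filename, num_blocks):
        """Assign DataNodes to each new block (round-robin, for illustration)."""
        count = len(self.datanode_names)
        block_ids = []
        for i in range(num_blocks):
            block_id = f"{filename}#{i}"
            self.locations[block_id] = [
                self.datanode_names[(i + r) % count]
                for r in range(self.replication)
            ]
            block_ids.append(block_id)
        self.files[filename] = block_ids
        return block_ids


def write_file(namenode, datanodes, filename, data, block_size):
    """Client side: ask the NameNode where to write, then send the blocks."""
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for block_id, chunk in zip(namenode.allocate(filename, len(chunks)), chunks):
        for name in namenode.locations[block_id]:
            datanodes[name].blocks[block_id] = chunk


def read_file(namenode, datanodes, filename):
    """Client side: read each block from the first replica that is alive."""
    parts = []
    for block_id in namenode.files[filename]:
        for name in namenode.locations[block_id]:
            if datanodes[name].alive:
                parts.append(datanodes[name].blocks[block_id])
                break
        else:
            raise OSError(f"no live replica for {block_id}")
    return b"".join(parts)


nodes = {name: DataNode(name) for name in ("dn1", "dn2", "dn3")}
namenode = NameNode(nodes, replication=2)
write_file(namenode, nodes, "log.txt", b"abcdefghij", block_size=4)
nodes["dn1"].alive = False                      # one DataNode fails
print(read_file(namenode, nodes, "log.txt"))    # still readable
```

With one DataNode down, every block still has a live replica, so the read succeeds. If the `NameNode` object itself were lost, nothing could be located even though all the bytes remain on the DataNodes, which is the single-point-of-failure concern noted earlier.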
Map Reduce
• MapReduce is a programming model and an associated
implementation for processing and generating large data sets.
• It is designed to run on a large cluster of computers.
• MapReduce works by dividing the data into smaller chunks, which
are then processed in parallel by a cluster of computers.
• The MapReduce programming model consists of two phases: the
map phase and the reduce phase.
• In the map phase, each chunk of data is processed by a mapper
function.
• The mapper function produces a set of intermediate key-value pairs.
• In the reduce phase, the intermediate key-value pairs are grouped
together by key and then processed by a reducer function.
• The reducer function produces a final set of output key-value pairs.
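The two phases described above can be sketched with the classic word-count example. This version runs sequentially in a single Python process, so it shows the data flow (map, group by key, reduce) rather than the parallel execution Hadoop provides; the function names are illustrative and are not Hadoop's API.

```python
from collections import defaultdict


def mapper(chunk):
    """Map phase: emit an intermediate (word, 1) pair for every word."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the intermediate values by key, as happens between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: combine all values for one key into one output pair."""
    return key, sum(values)


def map_reduce(chunks):
    """Run map over every chunk, group by key, then reduce each group."""
    intermediate = [pair for chunk in chunks for pair in mapper(chunk)]
    grouped = shuffle(intermediate)
    return dict(reducer(key, values) for key, values in grouped.items())


chunks = ["big data big", "data lake"]   # each string stands for one chunk
print(map_reduce(chunks))
```

Each chunk can be mapped without looking at any other chunk, and each key can be reduced without looking at any other key. That independence is what lets a cluster run many mappers and reducers at the same time.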
