CAP and BASE Theorems in Distributed Databases
CAP and BASE Theorems in Distributed Databases
CAP
Within a distributed environment, it's expected that network partitions may occur among the nodes. The
CAP theorem establishes that when facing a network partition, the system must choose between being
Available or Consistent.
Consistency:
In a consistent system, all nodes see the same data simultaneously. If we perform a read
operation on a consistent system, it should return the value of the most recent write operation.
The read should cause all nodes to return the same data. All users see the same data at the same
time, regardless of the node they connect to. When data is written to a single node, it is then
replicated across the other nodes in the system.
Availability:
All the active nodes at any moment must be able to respond to different operations. it means
that the system remains operational all of the time. Every request will get a response regardless
of the individual state of the nodes.
Partition tolerance:
The system must be able to tolerate network partition among its participant nodes. It means that
there’s a break in communication between nodes. If a system is partition-tolerant, the system
does not fail, regardless of whether messages are dropped or delayed between nodes within the
system. To have partition tolerance, the system must replicate records across combinations of
nodes and networks.
BASE
Basically-Available:
Soft-state:
The database system may keep changing states as and when it receives new information, This
can happen due to the effects of background processes, updates to data, and other factors. The
database should be designed to handle this change gracefully and ensure that it does not lead to
data corruption or loss.
Eventually-consistent:
The elements within the system might not instantaneously display an identical value or state for
a record at any given instance. However, they will gradually reconcile this disparity over time.
This concept pertains to the eventual consistency of data within the database, even in the
presence of evolving changes. Essentially, the database is anticipated to ultimately reach a
harmonized and consistent state, even if the propagation and reflection of all updates demand a
certain duration. This stands in opposition to the instantaneous consistency demanded by
conventional ACID-compliant databases.
1 Distributed DBMSs
A distributed DBMS divides a single logical database across multiple physical resources. The application
is (usually) unaware that data is split across separated hardware. The system relies on the techniques and
algorithms from single-node DBMSs to support transaction processing and query execution in a distributed
environment. An important goal in designing a distributed DBMS is fault tolerance (i.e., avoiding a single
one node failure taking down the entire system).
The differences between parallel and distributed DBMSs are:
Parallel Database:
• Nodes are physically close to each other.
• Nodes are connected via high-speed LAN (fast, reliable communication fabric).
• The communication cost between nodes is assumed to be small. As such, one does not need to
worry about nodes crashing or packets getting dropped when designing internal protocols.
Distributed Database:
• Nodes can be far from each other.
• Nodes are potentially connected via a public network, which can be slow and unreliable.
• The communication cost and connection problems cannot be ignored (i.e., nodes can crash, and
pack- ets can get dropped).
2 System Architectures
A DBMS’s system architecture specifies what shared resources are directly accessible to CPUs. It affects
how CPUs coordinate with each other and where they retrieve and store objects in the database.
A single-node DBMS uses what is called a shared everything architecture. This single node executes work-
ers on a local CPU(s) with its own local memory address space and disk.
Shared Memory
An alternative to shared everything architecture in distributed systems is shared memory. CPUs have
access to common memory address space via a fast interconnect. CPUs also share the same disk.
In practice, most DBMSs do not use this architecture, as it is provided at the OS / kernel level. It also causes
problems, since each process’s scope of memory is the same memory address space, which can be modified
by multiple processes.
Each processor has a global view of all the in-memory data structures. Each DBMS instance on a processor
has to “know” about the other instances.
Shared Disk
In a shared disk architecture, all CPUs can read and write to a single logical disk directly via an interconnect,
but each have their own private memories. The local storage on each compute node can act as caches. This
approach is more common in cloud-based DBMSs.
The DBMS’s execution layer can scale independently from the storage layer. Adding new storage nodes
or execution nodes does not affect the layout or location of data in the other layer.
Nodes must send messages between them to learn about other node’s current state. That is, since memory
is local, if data is modified, changes must be communicated to other CPUs in the case that piece of data
is in main memory for the other CPUs.
Nodes have their own buffer pool and are considered stateless. A node crash does not affect the state of
the database since that is stored separately on the shared disk. The storage layer persists the state in the
case of crashes.
Shared Nothing
In a shared nothing environment, each node has its own CPU, memory, and disk. Nodes only communicate
with each other via network. Before the rise of cloud storage platforms, the shared nothing architecture used
to be considered the correct way to build distributed DBMSs.
It is more difficult to increase capacity in this architecture because the DBMS has to physically move data
to new nodes. It is also difficult to ensure consistency across all nodes in the DBMS, since the nodes must
coordinate with each other on the state of transactions. The advantage, however, is that shared nothing
DBMSs can potentially achieve better performance and are more efficient then other types of distributed
DBMS architectures.
3 Design Issues
Distributed DBMSs aim to maintain data transparency, meaning that users should not be required to know
where data is physically located, or how tables are partitioned or replicated. The details of how data is being
stored is hidden from the application. In other words, a SQL query that works on a single-node DBMS
should work the same on a distributed DBMS.
The key design questions that distributed database systems must address are the following:
• How does the application find data?
• How should queries be executed on a distributed data? Should the query be pushed to where the
data is located? Or should the data be pooled into a common location to execute the query?
• How does the DBMS ensure correctness?
Another design decision to make involves deciding how the nodes will interact in their clusters. Two options
are homogeneous and heterogeneous nodes, which are both used in modern-day systems.
Homogeneous Nodes: Every node in the cluster can perform the same set of tasks (albeit on potentially
different partitions of data), lending itself well to a shared nothing architecture. This makes provisioning
and failover “easier”. Failed tasks are assigned to available nodes.
Heterogeneous Nodes: Nodes are assigned specific tasks, so communication must happen between nodes
to carry out a given task. This allows a single physical node to host multiple “virtual” node types for
dedicated tasks that can independently scale from one node to other. An example is MongoDB, which has
router nodes routing queries to shards and config server nodes storing the mapping from keys to shards.
4 Partitioning Schemes
Distributed system must partition the database across multiple resources, including disks, nodes, processors.
This process is sometimes called sharding in NoSQL systems. When the DBMS receives a query, it first
analyzes the data that the query plan needs to access. The DBMS may potentially send fragments of the
query plan to different nodes, then combines the results to produce a single answer.
The goal of a partitioning scheme is to maximize single-node transactions, or transactions that only access
data contained on one partition. This allows the DBMS to not need to coordinate the behavior of concurrent
transactions running on other nodes. On the other hand, a distributed transaction accesses data at one or
more partitions. This requires expensive, difficult coordination, discussed in the below section.
For logically partitioned nodes, particular nodes are in charge of accessing specific tuples from a shared
disk. For physically partitioned nodes, each shared nothing node reads and updates tuples it contains on
its own local disk.
Implementation
The simplest way to partition tables is naive data partitioning. Each node stores one table, assuming enough
storage space for a given node. This is easy to implement because a query is just routed to a specific
partitioning. This can be bad, since it is not scalable. One partition’s resources can be exhausted if that one
table is queried on often, not using all nodes available. See Figure 2 for an example.
Another way of partitioning is vertical partitioning, which splits a table’s attributes into separate partitions.
Each partition must also store tuple information for reconstructing the original record.
More commonly, horizontal partitioning s used which splits a table’s tuples into disjoint subsets. Choose
column(s) that divides the database equally in terms of size, load or usage, called the partitioning key(s).
Figure 2: Naive Table Partitioning – Given two tables, place all the tuples in table
one into one partition and the tuples in table two into the other.
The DBMS can partition a database physically (shared nothing) or logically (shared disk) based on hashing,
data ranges or predicates. See Figure 3 for an example. The problem of hash partitioning is that when a node
is added or removed, a lot of data has to be shuffled around. The solution for this is Consistent Hashing.
Consistent Hashing assigns every node to a location on some logical ring. Then the hash of every partition
key maps to a location on the ring. The node that is closest to the key in the clockwise direction is responsible
for that key. See Figure 4 for an example. When a node is added or removed, keys are only moved between
nodes adjacent to the new/removed node and so only 1/n fraction of the keys are moved. A replication
factor of k means that each key is replicated at the k closest nodes in the clockwise direction.
Logical Partitioning: A node is responsible for a set of keys, but it doesn’t actually store those keys. This
is commonly used in a shared disk architecture.
Physical Partitioning: A node is responsible for a set of keys, and it physically stores those keys. This
is commonly used in a shared nothing architecture.
Figure 4: Consistent Hashing – All nodes are responsible for some portion of hash
ring. Here, node P 1 is responsible for storing key1 and node P 3 is responsible for
storing key2.
Middleware
Centralized coordinators can be used as middleware, which accepts query requests and routes queries to
correct partitions.
Decentralized coordinator
In a decentralized approach, nodes organize themselves. The client directly sends queries to one of the
partitions. This home partition will send results back to the client. The home partition is in charge of
communicating with other partitions and committing accordingly.
Centralized approaches give way to bottlenecks in the case that multiple clients are trying to acquire locks
on the same partitions. It can be better for distributed 2PL as it has a central view of the locks and can handle
deadlocks more quickly. This is non-trivial with decentralized approaches.
BIG DATA
For example, data might include individual prices, weights, addresses, ages, names, temperatures, dates,
or distances.
1. Accuracy
Data should be sufficiently accurate for the intended use and should be captured only once,
although it may have multiple uses. Data should be captured at the point of activity.
2. Validity
Data should be recorded and used in compliance with relevant requirements, includingthe correct
application of any rules or definitions. This will ensure consistency between periods and with
similar organizations, measuring what is intended to be measured.
3. Reliability
Data should reflect stable and consistent data collection processes across collection points and
over time. Progress toward performance targets should reflect real changesrather than variations
in data collection approaches or methods. Source data is clearly identified and readily available
from manual, automated, or other systems and records.
4. Timeliness
Data should be captured as quickly as possible after the event or activity and must beavailable
for the intended use within a reasonable time period. Data must be available quickly and
frequently enough to support information needs and to influence service or management
decisions.
5. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will require a
periodic review of requirements to reflect changing needs.
6. Completeness
Data requirements should be clearly specified based on the information needs of the
organization and data collection processes matched to these requirements.
Structured Data:
Structured data refers to any data that resides in a fixed field within a record or file.
Having a particular Data Model.
Meaningful data.
Data arranged in arow and column.
Structured data has the advantage of being easily entered, stored, queried andanalysed.
E.g.: Relational Data Base, Spread sheets.
Structured data is often managed using Structured Query Language (SQL)
Efficient storage and retrieval: Structured data is typically stored in relational databases,
which are designed to efficiently store and retrieve large amounts of data. This makes
it easy to access and process data quickly.
Enhanced data security: Structured data can be more easily secured than unstructured or
semi-structured data, as access to the data can be controlled through database security
protocols.
Clear data lineage: Structured data typically has a clear lineage or history, making it easy
to track changes and ensure data quality.
Unstructured Data:
Unstructured data can not readily classify and fit into a neat box
Also called unclassified data.
Which does not confirm to any data model.
Business rules are not applied.
Indexing is not required.
E.g.: photos and graphic images, videos, streaming instrument data, webpages, Pdf files,
PowerPoint presentations, emails, blog entries, wikis and word processing documents.
Faster data processing: Semi-structured data can be processed more quickly than
traditional structured data, as it can be indexed and queried in a more flexible way. This
makes it easier to retrieve specific subsets of data for analysis and reporting.
Improved data integration: Semi-structured data can be more easily integrated with other
types of data, such as unstructured data, making it easier to combineand analyze data
from multiple sources.
Richer data analysis: Semi-structured data often contains more contextual information
than traditional structured data, such as metadata or tags. This canprovide additional
insights and context that can improve the accuracy and relevance of data analysis.
Data security: Semi-structured data can be more difficult to secure than structured data,
as it may contain sensitive information in unstructured or less- visible parts of the data.
This can make it more challenging to identify and protect sensitive information from
unauthorized access.
Overall, while semi-structured data offers many advantages in terms of flexibility and
scalability, it also presents some challenges and limitations that need to be carefully
considered when designing and implementing data processing and analysis pipelines.
New York Stock Exchange : The New York Stock Exchange is an example of Big Data that
generates about one terabyte of new trade data per day.
Social Media: The statistic shows that 500+terabytes of new data get ingested into the databases
of social media site Facebook, every day. This data is mainly generated in terms of photo and
video uploads, message exchanges, putting comments etc.
Jet engine :A single Jet engine can generate 10+terabytes of data in 30 minutes offlight time.
With many thousand flights per day, generation of data reaches up to many Petabytes.
1.5 Big Data Characteristics
Volume:
The name Big Data itself is related to an enormous size. Big Data is a vast ‘volume’ of data
generated from many sources daily, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Variety:
Big Data can be structured, unstructured, and semi-structured that are being collected
from different sources. Data will only be collected from databases and
sheets in the past, but these days the data will comes in array forms, that are PDFs, Emails,
audios, SM posts, photos, videos, etc.
Veracity
Veracity means how much the data is reliable. It has many ways to filter or translate the data.
Veracity is the process of being able to handle and manage data efficiently. Big Data is also
essential in business development.
Value
Value is an essential characteristic of big data. It is not the data that we process or store. It is
valuable and reliable data that we store, process, and also analyze.
Velocity
Velocity plays an important role compared to others. Velocity creates the speed by which the
data is created in real-time. It contains the linking of incoming data sets speeds, rate of
change, and activity bursts. The primary aspect of Big Data is to provide demanding data
rapidly.
Big data velocity deals with the speed at the data flows from sources like applicationlogs,
business processes, networks, and social media sites, sensors, mobile devices, etc.
Companies are using Big Data to know what their customers want, who are their best customers,
why people choose different products. The more a company knows about its customers, the
more competitive it becomes.
We can use it with Machine Learning for creating market strategies based on predictions about
customers. Leveraging big data makes companies customer-centric.
Companies can use Historical and real-time data to assess evolving consumers’ preferences.
This consequently enables businesses to improve and update their marketing strategies which
make companies more responsive to customer needs.
Big Data importance doesn’t revolve around the amount of data a company has. Its importance lies in
the fact that how the company utilizes the gathered data.
Every company uses its collected data in its own way. More effectively the company uses its
data, more rapidly it grows.
The companies in the present market need to collect it and analyze it because:
1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when
they have to store large amounts of data. These tools help organizations in identifying more
effective ways of doing business.
2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various sources. Tools like
Hadoop help them to analyze data immediately thus helping in making quick decisions based on
the learnings.
If we don’t know what our customers want then it will degrade companies’ success. It will result
in the loss of clientele which creates an adverse effect on business growth. Big data analytics
helps businesses to identify customer related trends and patterns. Customer behavior analysis
leads to a profitable business.
To handle this challenge, companies are migrating their IT infrastructure to the cloud. Cloud
storage solutions can scale dynamically as more storage is needed. Big data software is
designed to store large volumes of data that can be accessed and queried quickly.
To deal with this challenge, businesses use data integration software, ETL software, and
business intelligence software to map disparate data sources into a common structure and
combine them so they can generate accurate reports.
When your business begins a data project, start with goals in mind and strategies for how
you will use the data you have available to reach those goals. The team involved in
implementing a solution needs to plan the type of data they need and the schemas they will
use before they start building the system so the project doesn't go in the wrong direction.
They also need to create policies for purging old data from the system once it is no longer
useful.
There are a few ways to solve this problem. One is to hire a big data specialist and
have that specialist manage and train your data team until they are up to speed. The
specialist can either be hired on as a full -time employee or as a consultant who trains
your team and moves on, depending on your budget. Another option, if you have time to
prepare ahead, is to offer training to your current team members so they will have the
skills once your big data project is in motion.
A third option is to choose one of the self-service analytics or business intelligence solutions
that are designed to be used by professionals who don't have a data science background.
8. Organizational resistance
Another way people can be a challenge to a data project is when they resist change. The
bigger an organization is, the more resistant it is to change. Leaders may not see the value
in big data, analytics, or machine learning. Or they may simply not want to spend the time
and money on a new project.
This can be a hard challenge to tackle, but it can be done. You can start with a smaller project
and a small team and let the results of that project prove the valueof big data to other leaders
and gradually become a data-driven business. Another option is placing big data experts in
leadership roles so they can guide your business towards transformation.
BI supports fact-based decision making using historical data rather than assumptions and gut
feeling.
BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and
charts to provide users with detailed intelligence about the nature of the business.
Why is BI important?
Step 2) The data is cleaned and transformed into the data warehouse. The table can be linked,
and data cubes are formed.
Step 3) Using BI system the user can ask quires, request ad-hoc reports or conduct any other
analysis.
Advantages of Business Intelligence
Here are some of the advantages of using Business Intelligence System:
1. Boost productivity
With a BI program, It is possible for businesses to create reports with a single click thus saves
lots of time and resources. It also allows employees to be more productiveon their tasks.
2. To improve visibility
BI also helps to improve the visibility of these processes and make it possible to identify any
areas which need attention.
3. Fix Accountability
BI system assigns accountability in the organization as there must be someone who should own
accountability and ownership for the organization’s performance against its set goals.
2. Complexity:
Another drawback of BI is its complexity in implementation of datawarehouse. It can be so
complex that it can make business techniques rigid to deal with.
3. Limited use
Like all improved technologies, BI was first established keeping in consideration the buying
competence of rich firms. Therefore, BI system is yet not affordable for many small and medium
size companies.
Hadoop
There are three core components of Hadoop as mentioned earlier. They are HDFS,
MapReduce, and YARN. These together form the Hadoop framework architecture.
Features:
• The storage is distributed to handle a large data pool
• Distribution increases data security
• It is fault-tolerant, other blocks can pick up the failure of one block
2. MapReduce:
The MapReduce framework is the processing unit. All data is distributed and processed
parallelly. There is a MasterNode that distributes data amongst SlaveNodes. The SlaveNodes
do the processing and send it back to the MasterNode.
Features:
• Consists of two phases, Map Phase and Reduce Phase.
• Processes big data faster with multiples nodes working under one CPU
Features:
• It is a filing system that acts as an Operating System for the data stored on HDFS
• It helps to schedule the tasks to avoid overloading any system
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes
Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes
DataNode and TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It
contains a master/slave architecture. This architecture consist of a single NameNode performs
the role of master, and multiple DataNodes performs the role of a slave. Both NameNode and
DataNode are capable enough to run on commodity machines. The Java language is used to
develop HDFS. So any machine that supports Java language can easily run the NameNode
and DataNode software.
NameNode
o It is a single master server exist in the HDFS cluster.
o As it is a single node, it may become the reason of single point failure.
o It manages the file system namespace by executing an operation like theopening,
renaming and closing the files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of DataNode to read and write requests from the file
system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.
Job Tracker
o The role of Job Tracker is to accept the MapReduce jobs from client and
process the data by using NameNode.
o In response, NameNode provides metadata to Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives task and code from Job Tracker and applies that code on the file. This
process can also be called as a Mapper.
MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce job
to Job Tracker. In response, the Job Tracker sends the request to theappropriate Task Trackers.
Sometimes, the TaskTracker fails or time out. In such a case, that part of the job is rescheduled.
Hadoop 1 vs Hadoop 2
Hadoop 1 Hadoop 2
HDFS HDFS
2. Daemons:
Hadoop 1 Hadoop 2
Namenode Namenode
Datanode Datanode
3. Working:
In Hadoop 1, there is HDFS which is used for storage and top of it, Map Reduce which
works as Resource Management as well as Data Processing. Due to this workload on Map
Reduce, it will affect the performance.
In Hadoop 2, there is again HDFS which is again used for storage and on the top of
HDFS, there is YARN which works as Resource Management. It basicallyallocates the
resources and keeps all the things going on.
4. Limitations:
Hadoop 2 is also a Master-Slave architecture. But this consists of multiple masters ([Link]
namenodes and standby namenodes) and multiple slaves. If here master node got crashed then
standby master node will take over it. You can make multiple combinations of active-
standby nodes. Thus Hadoop 2 will eliminate the problem of asingle point of failure.
5. Ecosystem
Oozie is basically Work Flow Scheduler. It decides the particular time of jobs to
execute according to their dependency.
Pig, Hive and Mahout are data processing tools that are working on the top of
Hadoop.
Sqoop is used to import and export structured data. You can directly import and export
the data into HDFS using SQL database.
Flume is used to import and export the unstructured data and streaming data.
MapReduce and YARN framework
The input to each phase is key-value pairs. In addition, every programmer needs to
specify two functions: map function and reduce function.
Let us understand more about MapReduce and its components. MapReduce majorly
has the following three Classes. They are,
Mapper Class
The first stage in Data Processing using MapReduce is the Mapper Class. Here,
RecordReader processes each Input record and generates the respective key-value pair.
Hadoop’s Mapper store saves this intermediate data into the local disk.
Input Split
RecordReader
It interacts with the Input split and converts the obtained data in the form of Key-
Value Pairs.
Reducer Class
The Intermediate output generated from the mapper is fed to the reducer which
processes it and generates the final output which is then saved in the HDFS.
Driver Class
Yet Another Resource Manager takes programming to the next level beyond Java ,
and makes it interactive to let another application Hbase, Spark etc. to work on
[Link] Yarn applications can co-exist on the same cluster so MapReduce, Hbase,
Spark all can run at the same time bringing great benefits for manageability and cluster
utilization.
Components Of YARN
o Map Reduce Application Master: Checks tasks running the MapReduce job.
The application master and the MapReduce tasks run in containers that are
scheduled by the resource manager, and managed by the node managers.
Jobtracker & Tasktrackerwere were used in previous version of Hadoop, which were
responsible for handling resources and checking progress management. However,
Hadoop 2.0 has Resource manager and NodeManager to overcome the shortfall of
Jobtracker & Tasktracker.
• Scheduler
• Application manager
a) Scheduler
The scheduler is responsible for allocating the resources to the running application.
The scheduler is pure scheduler it means that it performs no monitoring no tracking
for the application and even doesn’t guarantees about restarting failed tasks either due
to application failure or hardware failures.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case
of failures.
One application master runs per application. It negotiates resources from the resource
manager and works with the node manager. It Manages the application life cycle.
The AM acquires containers from the RM’s Scheduler before contacting the
corresponding NMs to start the application’s individual tasks.