0% found this document useful (0 votes)
7 views40 pages

CAP and BASE Theorems in Distributed Databases

The document discusses distributed databases and big data, focusing on the CAP and BASE theorems which address consistency, availability, and partition tolerance in distributed systems. It outlines the architecture of distributed DBMSs, design issues, partitioning schemes, and concurrency control methods, emphasizing the importance of data transparency and efficient data management. Additionally, it explains the characteristics and types of data, particularly structured, unstructured, and semi-structured data, highlighting their advantages and disadvantages.

Uploaded by

alvine degpro
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views40 pages

CAP and BASE Theorems in Distributed Databases

The document discusses distributed databases and big data, focusing on the CAP and BASE theorems which address consistency, availability, and partition tolerance in distributed systems. It outlines the architecture of distributed DBMSs, design issues, partitioning schemes, and concurrency control methods, emphasizing the importance of data transparency and efficient data management. Additionally, it explains the characteristics and types of data, particularly structured, unstructured, and semi-structured data, highlighting their advantages and disadvantages.

Uploaded by

alvine degpro
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

DISTRIBUTED DATABASES AND BIG DATA

CAP and BASE theorem

CAP

Within a distributed environment, it's expected that network partitions may occur among the nodes. The
CAP theorem establishes that when facing a network partition, the system must choose between being
Available or Consistent.

Consistency:

In a consistent system, all nodes see the same data simultaneously. If we perform a read
operation on a consistent system, it should return the value of the most recent write operation.
The read should cause all nodes to return the same data. All users see the same data at the same
time, regardless of the node they connect to. When data is written to a single node, it is then
replicated across the other nodes in the system.

Availability:

All the active nodes at any moment must be able to respond to different operations. it means
that the system remains operational all of the time. Every request will get a response regardless
of the individual state of the nodes.

Partition tolerance:

The system must be able to tolerate network partition among its participant nodes. It means that
there’s a break in communication between nodes. If a system is partition-tolerant, the system
does not fail, regardless of whether messages are dropped or delayed between nodes within the
system. To have partition tolerance, the system must replicate records across combinations of
nodes and networks.

BASE

Basically-Available:

A Database must offer availability by providing a response, whether it's an acknowledgment or


even a failure message, to every incoming request, The database may experience brief periods
of unavailability, but it should be designed to minimize downtime and provide quick recovery
from failures.

Soft-state:

The database system may keep changing states as and when it receives new information, This
can happen due to the effects of background processes, updates to data, and other factors. The
database should be designed to handle this change gracefully and ensure that it does not lead to
data corruption or loss.
Eventually-consistent:

The elements within the system might not instantaneously display an identical value or state for
a record at any given instance. However, they will gradually reconcile this disparity over time.
This concept pertains to the eventual consistency of data within the database, even in the
presence of evolving changes. Essentially, the database is anticipated to ultimately reach a
harmonized and consistent state, even if the propagation and reflection of all updates demand a
certain duration. This stands in opposition to the instantaneous consistency demanded by
conventional ACID-compliant databases.

1 Distributed DBMSs
A distributed DBMS divides a single logical database across multiple physical resources. The application
is (usually) unaware that data is split across separated hardware. The system relies on the techniques and
algorithms from single-node DBMSs to support transaction processing and query execution in a distributed
environment. An important goal in designing a distributed DBMS is fault tolerance (i.e., avoiding a single
one node failure taking down the entire system).
The differences between parallel and distributed DBMSs are:
Parallel Database:
• Nodes are physically close to each other.
• Nodes are connected via high-speed LAN (fast, reliable communication fabric).
• The communication cost between nodes is assumed to be small. As such, one does not need to
worry about nodes crashing or packets getting dropped when designing internal protocols.

Distributed Database:
• Nodes can be far from each other.
• Nodes are potentially connected via a public network, which can be slow and unreliable.
• The communication cost and connection problems cannot be ignored (i.e., nodes can crash, and
pack- ets can get dropped).

2 System Architectures
A DBMS’s system architecture specifies what shared resources are directly accessible to CPUs. It affects
how CPUs coordinate with each other and where they retrieve and store objects in the database.
A single-node DBMS uses what is called a shared everything architecture. This single node executes work-
ers on a local CPU(s) with its own local memory address space and disk.

Shared Memory
An alternative to shared everything architecture in distributed systems is shared memory. CPUs have
access to common memory address space via a fast interconnect. CPUs also share the same disk.
In practice, most DBMSs do not use this architecture, as it is provided at the OS / kernel level. It also causes
problems, since each process’s scope of memory is the same memory address space, which can be modified
by multiple processes.
Each processor has a global view of all the in-memory data structures. Each DBMS instance on a processor
has to “know” about the other instances.

Figure 1: Database System Architectures – Four system architecture approaches


ranging from sharing everything (used by non distributed systems) to sharing memory,
disk, or nothing.

Shared Disk
In a shared disk architecture, all CPUs can read and write to a single logical disk directly via an interconnect,
but each have their own private memories. The local storage on each compute node can act as caches. This
approach is more common in cloud-based DBMSs.
The DBMS’s execution layer can scale independently from the storage layer. Adding new storage nodes
or execution nodes does not affect the layout or location of data in the other layer.
Nodes must send messages between them to learn about other node’s current state. That is, since memory
is local, if data is modified, changes must be communicated to other CPUs in the case that piece of data
is in main memory for the other CPUs.
Nodes have their own buffer pool and are considered stateless. A node crash does not affect the state of
the database since that is stored separately on the shared disk. The storage layer persists the state in the
case of crashes.

Shared Nothing
In a shared nothing environment, each node has its own CPU, memory, and disk. Nodes only communicate
with each other via network. Before the rise of cloud storage platforms, the shared nothing architecture used
to be considered the correct way to build distributed DBMSs.
It is more difficult to increase capacity in this architecture because the DBMS has to physically move data
to new nodes. It is also difficult to ensure consistency across all nodes in the DBMS, since the nodes must
coordinate with each other on the state of transactions. The advantage, however, is that shared nothing
DBMSs can potentially achieve better performance and are more efficient then other types of distributed
DBMS architectures.
3 Design Issues
Distributed DBMSs aim to maintain data transparency, meaning that users should not be required to know
where data is physically located, or how tables are partitioned or replicated. The details of how data is being
stored is hidden from the application. In other words, a SQL query that works on a single-node DBMS
should work the same on a distributed DBMS.
The key design questions that distributed database systems must address are the following:
• How does the application find data?
• How should queries be executed on a distributed data? Should the query be pushed to where the
data is located? Or should the data be pooled into a common location to execute the query?
• How does the DBMS ensure correctness?
Another design decision to make involves deciding how the nodes will interact in their clusters. Two options
are homogeneous and heterogeneous nodes, which are both used in modern-day systems.
Homogeneous Nodes: Every node in the cluster can perform the same set of tasks (albeit on potentially
different partitions of data), lending itself well to a shared nothing architecture. This makes provisioning
and failover “easier”. Failed tasks are assigned to available nodes.
Heterogeneous Nodes: Nodes are assigned specific tasks, so communication must happen between nodes
to carry out a given task. This allows a single physical node to host multiple “virtual” node types for
dedicated tasks that can independently scale from one node to other. An example is MongoDB, which has
router nodes routing queries to shards and config server nodes storing the mapping from keys to shards.

4 Partitioning Schemes
Distributed system must partition the database across multiple resources, including disks, nodes, processors.
This process is sometimes called sharding in NoSQL systems. When the DBMS receives a query, it first
analyzes the data that the query plan needs to access. The DBMS may potentially send fragments of the
query plan to different nodes, then combines the results to produce a single answer.
The goal of a partitioning scheme is to maximize single-node transactions, or transactions that only access
data contained on one partition. This allows the DBMS to not need to coordinate the behavior of concurrent
transactions running on other nodes. On the other hand, a distributed transaction accesses data at one or
more partitions. This requires expensive, difficult coordination, discussed in the below section.
For logically partitioned nodes, particular nodes are in charge of accessing specific tuples from a shared
disk. For physically partitioned nodes, each shared nothing node reads and updates tuples it contains on
its own local disk.

Implementation
The simplest way to partition tables is naive data partitioning. Each node stores one table, assuming enough
storage space for a given node. This is easy to implement because a query is just routed to a specific
partitioning. This can be bad, since it is not scalable. One partition’s resources can be exhausted if that one
table is queried on often, not using all nodes available. See Figure 2 for an example.
Another way of partitioning is vertical partitioning, which splits a table’s attributes into separate partitions.
Each partition must also store tuple information for reconstructing the original record.
More commonly, horizontal partitioning s used which splits a table’s tuples into disjoint subsets. Choose
column(s) that divides the database equally in terms of size, load or usage, called the partitioning key(s).

Figure 2: Naive Table Partitioning – Given two tables, place all the tuples in table
one into one partition and the tuples in table two into the other.

The DBMS can partition a database physically (shared nothing) or logically (shared disk) based on hashing,
data ranges or predicates. See Figure 3 for an example. The problem of hash partitioning is that when a node
is added or removed, a lot of data has to be shuffled around. The solution for this is Consistent Hashing.

Figure 3: Horizontal Table Partitioning – Use hash partitioning to decide where to


send the data. When the DBMS receives a query, it will use the table’s partitioning
key(s) to find out where the data is.

Consistent Hashing assigns every node to a location on some logical ring. Then the hash of every partition
key maps to a location on the ring. The node that is closest to the key in the clockwise direction is responsible
for that key. See Figure 4 for an example. When a node is added or removed, keys are only moved between
nodes adjacent to the new/removed node and so only 1/n fraction of the keys are moved. A replication
factor of k means that each key is replicated at the k closest nodes in the clockwise direction.
Logical Partitioning: A node is responsible for a set of keys, but it doesn’t actually store those keys. This
is commonly used in a shared disk architecture.
Physical Partitioning: A node is responsible for a set of keys, and it physically stores those keys. This
is commonly used in a shared nothing architecture.

5 Distributed Concurrency Control


A distributed transaction accesses data at one or more partitions, which requires expensive coordination.
Centralized coordinator
The centralized coordinator acts as a global “traffic cop” that coordinates all the behavior. See Figure 5
for a diagram.

Figure 4: Consistent Hashing – All nodes are responsible for some portion of hash
ring. Here, node P 1 is responsible for storing key1 and node P 3 is responsible for
storing key2.

Figure 5: Centralized Coordinator – The client communicates with the coordinator


to acquire locks on the partitions that the client wants to access. Once it receives an
acknowledgement from the coordinator, the client sends its queries to those partitions.
Once all queries for a given transaction are done, the client sends a commit request
to the coordinator. The coordinator then communicates with the partitions involved in
the transaction to determine whether the transaction is allowed to commit.

Middleware
Centralized coordinators can be used as middleware, which accepts query requests and routes queries to
correct partitions.

Decentralized coordinator
In a decentralized approach, nodes organize themselves. The client directly sends queries to one of the
partitions. This home partition will send results back to the client. The home partition is in charge of
communicating with other partitions and committing accordingly.
Centralized approaches give way to bottlenecks in the case that multiple clients are trying to acquire locks
on the same partitions. It can be better for distributed 2PL as it has a central view of the locks and can handle
deadlocks more quickly. This is non-trivial with decentralized approaches.

BIG DATA

1.1 What is Data?


Data is defined as individual facts, such as numbers, words, measurements, observations or
just descriptions of things.

For example, data might include individual prices, weights, addresses, ages, names, temperatures, dates,
or distances.

There are two main types of data:


1. Quantitative data is provided in numerical form, like the weight, volume, or cost of an
item.
2. Qualitative data is descriptive, but non-numerical, like the name, sex, or eye colour
of a person.

1.2 Characteristics of Data


The following are six key characteristics of data which discussed below:
1. Accuracy
2. Validity
3. Reliability
4. Timeliness
5. Relevance
6. Completeness

1. Accuracy
Data should be sufficiently accurate for the intended use and should be captured only once,
although it may have multiple uses. Data should be captured at the point of activity.

2. Validity
Data should be recorded and used in compliance with relevant requirements, includingthe correct
application of any rules or definitions. This will ensure consistency between periods and with
similar organizations, measuring what is intended to be measured.
3. Reliability
Data should reflect stable and consistent data collection processes across collection points and
over time. Progress toward performance targets should reflect real changesrather than variations
in data collection approaches or methods. Source data is clearly identified and readily available
from manual, automated, or other systems and records.

4. Timeliness
Data should be captured as quickly as possible after the event or activity and must beavailable
for the intended use within a reasonable time period. Data must be available quickly and
frequently enough to support information needs and to influence service or management
decisions.

5. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will require a
periodic review of requirements to reflect changing needs.

6. Completeness
Data requirements should be clearly specified based on the information needs of the
organization and data collection processes matched to these requirements.

1.3 Types of Digital Data


➢ Digital data is the electronic representation of information in a format or language that
machines can read and understand.
➢ In more technical terms, Digital data is a binary format of information that's converted
into a machine-readable digital format.
➢ The power of digital data is that any analog inputs, from very simple text documents to
genome sequencing results, can be represented with the binary system.

Types of Digital Data:


Structured
Unstructured
Semi Structured Data

Structured Data:
Structured data refers to any data that resides in a fixed field within a record or file.
Having a particular Data Model.
Meaningful data.
Data arranged in arow and column.
Structured data has the advantage of being easily entered, stored, queried andanalysed.
E.g.: Relational Data Base, Spread sheets.
Structured data is often managed using Structured Query Language (SQL)

Sources of Structured Data:


SQL Databases
Spreadsheets such as Excel
OLTP Systems
Online forms
Sensors such as GPS or RFID tags
Network and Web server logs
Medical devices

Advantages of Structured Data:


Easy to understand and use: Structured data has a well-defined schema or datamodel,
making it easy to understand and use. This allows for easy data retrieval, analysis, and
reporting.
Consistency: The well-defined structure of structured data ensures consistency and
accuracy in the data, making it easier to compare and analyze data across different
sources.

Efficient storage and retrieval: Structured data is typically stored in relational databases,
which are designed to efficiently store and retrieve large amounts of data. This makes
it easy to access and process data quickly.
Enhanced data security: Structured data can be more easily secured than unstructured or
semi-structured data, as access to the data can be controlled through database security
protocols.
Clear data lineage: Structured data typically has a clear lineage or history, making it easy
to track changes and ensure data quality.

Disadvantages of Structured Data:


Inflexibility: Structured data can be inflexible in terms of accommodating new types of
data, as any changes to the schema or data model require significant changes to the
database.
Limited complexity: Structured data is often limited in terms of the complexity of
relationships between data entities. This can make it difficult to modelcomplex real-
world scenarios.
Limited context: Structured data often lacks the additional context and information that
unstructured or semi-structured data can provide, making it more difficult to understand
the meaning and significance of the data.
Expensive: Structured data requires the use of relational databases and related
technologies, which can be expensive to implement and maintain.
Data quality: The structured nature of the data can sometimes lead to missing or
incomplete data, or data that does not fit cleanly into the defined schema, leading to data
quality issues.

Unstructured Data:
Unstructured data can not readily classify and fit into a neat box
Also called unclassified data.
Which does not confirm to any data model.
Business rules are not applied.
Indexing is not required.
E.g.: photos and graphic images, videos, streaming instrument data, webpages, Pdf files,
PowerPoint presentations, emails, blog entries, wikis and word processing documents.

Sources of Unstructured Data:


Web pages
Images (JPEG, GIF, PNG, etc.)
Videos
Memos
Reports
Word documents and PowerPoint presentations
Surveys

Advantages of Unstructured Data:


Its supports the data which lacks a proper format or sequence
The data is not constrained by a fixed schema
Very Flexible due to absence of schema.
Data is portable
It is very scalable
It can deal easily with the heterogeneity of sources.
These type of data have a variety of business intelligence and analytics applications.

Disadvantages Of Unstructured data:


It is difficult to store and manage unstructured data due to lack of schema and structure
Indexing the data is difficult and error prone due to unclear structure and not having pre-
defined attributes. Due to which search results are not very accurate.
Ensuring security to data is difficult task.

Semi structured Data:


Self-describing data.
Metadata (Data about data).
Also called quiz data: data in between structured and semi structured.
It is a type of structured data but not followed data model.
Data which does not have rigid structure.
E.g.: E-mails, word processing software.
XML and other markup language are often used to manage semi structured data.

Sources of semi-structured Data:


E-mails
XML and other markup languages
Binary executables
TCP/IP packets
Zipped files
Integration of data from different sources
Web pages

Advantages of Semi-structured Data:


The data is not constrained by a fixed schema
Flexible i.e Schema can be easily changed.
Data is portable
It is possible to view structured data as semi-structured data
Its supports users who can not express their need in SQL
It can deal easily with the heterogeneity of sources.
Flexibility: Semi-structured data provides more flexibility in terms of data storage and
management, as it can accommodate data that does not fit into a strict, predefined
schema. This makes it easier to incorporate new types of data into an existing database
or data processing pipeline.
Scalability: Semi-structured data is particularly well-suited for managing large volumes
of data, as it can be stored and processed using distributed computing systems, such as
Hadoop or Spark, which can scale to handle massive amounts of data.

Faster data processing: Semi-structured data can be processed more quickly than
traditional structured data, as it can be indexed and queried in a more flexible way. This
makes it easier to retrieve specific subsets of data for analysis and reporting.
Improved data integration: Semi-structured data can be more easily integrated with other
types of data, such as unstructured data, making it easier to combineand analyze data
from multiple sources.
Richer data analysis: Semi-structured data often contains more contextual information
than traditional structured data, such as metadata or tags. This canprovide additional
insights and context that can improve the accuracy and relevance of data analysis.

Disadvantages of Semi-structured data


Lack of fixed, rigid schema make it difficult in storage of the data
Interpreting the relationship between data is difficult as there is no separation of the
schema and the data.
Queries are less efficient as compared to structured data.
Complexity: Semi-structured data can be more complex to manage and process than
structured data, as it may contain a wide variety of formats, tags, and metadata. This can
make it more difficult to develop and maintain data models and processing pipelines.
Lack of standardization: Semi-structured data often lacks the standardization and
consistency of structured data, which can make it more difficult to ensure data quality
and accuracy. This can also make it harder to compare and analyzedata across different
sources.
Reduced performance: Processing semi-structured data can be more resource- intensive
than processing structured data, as it often requires more complex parsing and indexing
operations. This can lead to reduced performance and longer processing times.
Limited tooling: While there are many tools and technologies available for working with
structured data, there are fewer options for working with semi- structured data. This can
make it more challenging to find the right tools and technologies for a particular use
case.

Data security: Semi-structured data can be more difficult to secure than structured data,
as it may contain sensitive information in unstructured or less- visible parts of the data.
This can make it more challenging to identify and protect sensitive information from
unauthorized access.
Overall, while semi-structured data offers many advantages in terms of flexibility and
scalability, it also presents some challenges and limitations that need to be carefully
considered when designing and implementing data processing and analysis pipelines.

1.4 Big Data


Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It
is a data with so large size and complexity that none of traditional data management tools can
store it or process it efficiently. Big data is also a data but with huge size.

What is an Example of Big Data?


Following are some of the Big Data examples-

New York Stock Exchange : The New York Stock Exchange is an example of Big Data that
generates about one terabyte of new trade data per day.
Social Media: The statistic shows that 500+terabytes of new data get ingested into the databases
of social media site Facebook, every day. This data is mainly generated in terms of photo and
video uploads, message exchanges, putting comments etc.

Jet engine :A single Jet engine can generate 10+terabytes of data in 30 minutes offlight time.
With many thousand flights per day, generation of data reaches up to many Petabytes.
1.5 Big Data Characteristics

Volume:
The name Big Data itself is related to an enormous size. Big Data is a vast ‘volume’ of data
generated from many sources daily, such as business processes, machines, social media
platforms, networks, human interactions, and many more.

Variety:
Big Data can be structured, unstructured, and semi-structured that are being collected
from different sources. Data will only be collected from databases and
sheets in the past, but these days the data will comes in array forms, that are PDFs, Emails,
audios, SM posts, photos, videos, etc.

Veracity
Veracity means how much the data is reliable. It has many ways to filter or translate the data.
Veracity is the process of being able to handle and manage data efficiently. Big Data is also
essential in business development.

Value
Value is an essential characteristic of big data. It is not the data that we process or store. It is
valuable and reliable data that we store, process, and also analyze.

Velocity
Velocity plays an important role compared to others. Velocity creates the speed by which the
data is created in real-time. It contains the linking of incoming data sets speeds, rate of
change, and activity bursts. The primary aspect of Big Data is to provide demanding data
rapidly.

Big data velocity deals with the speed at the data flows from sources like applicationlogs,
business processes, networks, and social media sites, sensors, mobile devices, etc.

1.6 Why Big Data?


Big Data initiatives were rated as “extremely important” to 93% of companies. Leveraging a
Big Data analytics solution helps organizations to unlock the strategic values and take full
advantage of their assets.
It helps organizations like
• To understand Where, When and Why their customers buy
• Protect the company’s client base with improved loyalty programs
• Seizing cross-selling and upselling opportunities
• Provide targeted promotional information
• Optimize Workforce planning and operations
• Improve inefficiencies in the company’s supply chain
• Predict market trends
• Predict future needs
• Make companies more innovative and competitive
• It helps companies to discover new sources of revenue

Companies are using Big Data to know what their customers want, who are their best customers,
why people choose different products. The more a company knows about its customers, the
more competitive it becomes.
We can use it with Machine Learning for creating market strategies based on predictions about
customers. Leveraging big data makes companies customer-centric.

Companies can use Historical and real-time data to assess evolving consumers’ preferences.
This consequently enables businesses to improve and update their marketing strategies which
make companies more responsive to customer needs.

Importance of big data

Big Data importance doesn’t revolve around the amount of data a company has. Its importance lies in
the fact that how the company utilizes the gathered data.
Every company uses its collected data in its own way. More effectively the company uses its
data, more rapidly it grows.

The companies in the present market need to collect it and analyze it because:

1. Cost Savings
Big Data tools like Apache Hadoop, Spark, etc. bring cost-saving benefits to businesses when
they have to store large amounts of data. These tools help organizations in identifying more
effective ways of doing business.

2. Time-Saving
Real-time in-memory analytics helps companies to collect data from various sources. Tools like
Hadoop help them to analyze data immediately thus helping in making quick decisions based on
the learnings.

3. Understand the market conditions


Big Data analysis helps businesses to get a better understanding of market situations. For
example, analysis of customer purchasing behavior helps companies to identify the products
sold most and thus produces those products accordingly. This helps companies to get ahead of
their competitors.

4. Social Media Listening


Companies can perform sentiment analysis using Big Data tools. These enable them to get
feedback about their company, that is, who is saying what about the company. Companies can
use big data tools to improve their online presence.

5. Boost Customer Acquisition and Retention


Customers are a vital asset on which any business depends on. No single business can achieve
its success without building a robust customer base. But even with a solid customer base, the
companies can’t ignore the competition in the market.

If we don’t know what our customers want then it will degrade companies’ success. It will result
in the loss of clientele which creates an adverse effect on business growth. Big data analytics
helps businesses to identify customer related trends and patterns. Customer behavior analysis
leads to a profitable business.

6. Solve Advertisers Problem and Offer Marketing Insights


Big data analytics shapes all business operations. It enables companies to fulfill customer
expectations. Big data analytics helps in changing the company’s product line. It ensures
powerful marketing campaigns.

7. The driver of Innovations and Product Development


Big data makes companies capable to innovate and redevelop their products.

1.7 Challenges of Big Data


When implementing a big data solution, here are some of the common challenges your business
might run into, along with solutions.
1. Managing massive amounts of data
It's in the name—big data is big. Most companies are increasing the amount of data they
collect daily. Eventually, the storage capacity a traditional data center can provide will be
inadequate, which worries many business leaders. Forty-three percent of IT decision-makers in
the technology sector worry about this data influx overwhelming their infrastructure [2] .

To handle this challenge, companies are migrating their IT infrastructure to the cloud. Cloud
storage solutions can scale dynamically as more storage is needed. Big data software is
designed to store large volumes of data that can be accessed and queried quickly.

2. Integrating data from multiple sources


The data itself presents another challenge to businesses. There is a lot, but it is also diverse
because it can come from a variety of different sources. A business could have analytics data
from multiple websites, sharing data from social media, user information from CRM software,
email data, and more. None of this data is structured the same but may have to be integrated
and reconciled to gather necessary insights and create reports.

To deal with this challenge, businesses use data integration software, ETL software, and
business intelligence software to map disparate data sources into a common structure and
combine them so they can generate accurate reports.

3. Ensuring data quality


Analytics and machine learning processes that depend on big data to run also depend on clean,
accurate data to generate valid insights and predictions. If the data is corrupted or
incomplete, the results may not be what you expect. But as the sources, types, and quantity
of data increase, it can be hard to determine if the data has the quality you need for accurate
insights.
Fortunately, there are solutions for this. Data governance applications will help organize,
manage, and secure the data you use in your big data projects while
also validating data sources against what you expect them to be and cleaning up corrupted and
incomplete data sets. Data quality software can also be used specifically for the task of
validating and cleaning your data before it is processed.

4. Keeping data secure


Many companies handle data that is sensitive, such as:
• Company data that competitors could use to take a bigger market share of the
industry
• Financial data that could give hackers access to accounts
• Personal user information of customers that could be used for identity theft
If a business handles sensitive data, it will become a target of hackers. To protect this data
from attack, businesses often hire cybersecurity professionals who keep up to date on
security best practices and techniques to secure their systems. Whether you hire a
consultant or keep it in-house, you need to ensure that data is encrypted, so the data is
useless without an encryption key. Add identity and access authorization control to all
resources so only the intended users can access it. Implement endpoint protection
software so malware can't infect the system and real-time monitoring to stop threats
immediately if they are detected.

5. Selecting the right big data tools


Fortunately, when a business decides to start working with data, there is no shortage of
tools to help them do it. At the same time, the wealth of options is also a challenge. Big
data software comes in many varieties, and their capabilities often overlap. How do you
make sure you are choosing the right big data tools? Often, the best option is to hire a
consultant who can determine which tools will fit best with what your business wants to
do with big data. A big data professional can look at your current and future needs and
choose an enterprise data streaming or ETL solution that will collect data from all your data
sources and aggregate it. They can configure your cloud services and scale dynamically
based on workloads. Once your system is set up with big data tools that fit your needs,
the system will run seamlessly with very little maintenance.
Thinking about hiring a data analytics company to help your business implement a big
data strategy? Browse our list of top data analytics companies, and learn more about their
services in our hiring guide.

6. Scaling systems and costs efficiently


If you start building a big data solution without a well-thought-out plan, you can spend a lot
of money storing and processing data that is either useless or notexactly what your business
needs. Big data is big, but it doesn' t mean you have to process all of your data.

When your business begins a data project, start with goals in mind and strategies for how
you will use the data you have available to reach those goals. The team involved in
implementing a solution needs to plan the type of data they need and the schemas they will
use before they start building the system so the project doesn't go in the wrong direction.
They also need to create policies for purging old data from the system once it is no longer
useful.

7. Lack of skilled data professionals


One of the big data problems that many companies run into is that their current staff have
never worked with big data before, and this is not the type of skill set you build overnight.
Working with untrained personnel can result in dead ends, disruptions of workflow, and errors
in processing.

There are a few ways to solve this problem. One is to hire a big data specialist and
have that specialist manage and train your data team until they are up to speed. The
specialist can either be hired on as a full -time employee or as a consultant who trains
your team and moves on, depending on your budget. Another option, if you have time to
prepare ahead, is to offer training to your current team members so they will have the
skills once your big data project is in motion.

A third option is to choose one of the self-service analytics or business intelligence solutions
that are designed to be used by professionals who don't have a data science background.
8. Organizational resistance
Another way people can be a challenge to a data project is when they resist change. The
bigger an organization is, the more resistant it is to change. Leaders may not see the value
in big data, analytics, or machine learning. Or they may simply not want to spend the time
and money on a new project.

This can be a hard challenge to tackle, but it can be done. You can start with a smaller project
and a small team and let the results of that project prove the valueof big data to other leaders
and gradually become a data-driven business. Another option is placing big data experts in
leadership roles so they can guide your business towards transformation.

1.8 What is Business Intelligence?


BI(Business Intelligence) is a set of processes, architectures, and technologies that convert raw
data into meaningful information that drives profitable business actions. It is a suite of software
and services to transform data into actionable intelligence and knowledge.

BI has a direct impact on organization’s strategic, tactical and operational business


decisions.

BI supports fact-based decision making using historical data rather than assumptions and gut
feeling.

BI tools perform data analysis and create reports, summaries, dashboards, maps, graphs, and
charts to provide users with detailed intelligence about the nature of the business.
Why is BI important?

• Measurement: creating KPI (Key Performance Indicators) based on historic data


• Identify and set benchmarks for varied processes.
• With BI systems organizations can identify market trends and spot businessproblems
that need to be addressed.
• BI helps on data visualization that enhances the data quality and thereby the quality
of decision making.
• BI systems can be used not just by enterprises but SME (Small and Medium
Enterprises)

How Business Intelligence systems are implemented?


Here are the steps:
Step 1) Raw Data from corporate databases is extracted. The data could be spread across
multiple systems heterogeneous systems.

Step 2) The data is cleaned and transformed into the data warehouse. The table can be linked,
and data cubes are formed.

Step 3) Using BI system the user can ask quires, request ad-hoc reports or conduct any other
analysis.
Advantages of Business Intelligence
Here are some of the advantages of using Business Intelligence System:

1. Boost productivity
With a BI program, It is possible for businesses to create reports with a single click thus saves
lots of time and resources. It also allows employees to be more productiveon their tasks.

2. To improve visibility
BI also helps to improve the visibility of these processes and make it possible to identify any
areas which need attention.

3. Fix Accountability
BI system assigns accountability in the organization as there must be someone who should own
accountability and ownership for the organization’s performance against its set goals.

4. It gives a bird’s eye view:


BI system also helps organizations as decision makers get an overall bird’s eye view through
typical BI features like dashboards and scorecards.

5. It streamlines business processes:


BI takes out all complexity associated with business processes. It also automates analytics by
offering predictive analysis, computer modeling, benchmarking and othermethodologies.

6. It allows for easy analytics.


BI software has democratized its usage, allowing even nontechnical or non-analysts users to
collect and process data quickly. This also allows putting the power of analytics from the hand’s
many people.
BI System Disadvantages
1. Cost:
Business intelligence can prove costly for small as well as for medium-sized enterprises. The
use of such type of system may be expensive for routine business transactions.

2. Complexity:
Another drawback of BI is its complexity in implementation of datawarehouse. It can be so
complex that it can make business techniques rigid to deal with.

3. Limited use
Like all improved technologies, BI was first established keeping in consideration the buying
competence of rich firms. Therefore, BI system is yet not affordable for many small and medium
size companies.

4. Time Consuming Implementation


It takes almost one and half year for data warehousing system to be completely implemented.
Therefore, it is a time-consuming process.

Hadoop

2.1 Requirement of Hadoop Framework


• Hadoop is an Apache open source framework written in java that allows distributed
processing of large datasets across clusters of computers using simple programming
models.
• The Hadoop framework application works in an environment that provides distributed
storage and computation across clusters of computers.
• Hadoop is designed to scale up from single server to thousands of machines, each
offering local computation and storage.

2.2 Design Principles of Hadoop


The Design principles of Hadoop on which it works:

a) System shall manage and heal itself


• Automatically and transparently route around failure (Fault Tolerant)
• Speculatively execute redundant tasks if certain nodes are detected to be slow
b) Performance shall scale linearly
• Proportional change in capacity with resource change (Scalability)

c) Computation should move to data


• Lower latency, lower bandwidth (Data Locality)

d) Simple core, modular and extensible (Economical)

2.3 Comparison with other system like SQL

Parameter Hadoop SQL


Hadoop supports an open-source [Link] stands for Structured Query
In Hadoop data sets are distributed acrossLanguage. It is based on domain-
Architecture computer/server clusters with parallel dataspecific language, used to handle
processingfeatures. database management operations in
relational databases.
Hadoop is used for storing, processing, SQL is used to store, process, retrieve,
retrieving, and pattern extraction fromdata and pattern mine data stored in a
Operations
across a wide range of formats like XML, relational database
Text, JSON, etc. only.
Hadoop handles both structured and
Data Type/ SQL works only for structured data
unstructured data formats. For data
Data update but unlike Hadoop, data can be written
update, Hadoop writes data once but reads
and read multiple times.
data multiple times.
Hadoop is developed for Big Data hence, it
Data Volume SQL works better on low volumesof
usually handles data volumes up to
Processed data, usually in Gigabytes.
Terabytes and Petabytes.
Hadoop stores data in the form of key-
SQL stores structured data in a
value pairs, hash, maps, tables, etc in
Data Storage tabular format using tables only with
distributed systems with dynamicschemas.
fixed schemas.

Schema Hadoop supports dynamic schema SQL supports static schema


Structure structure. structure.
Hadoop supports NoSQL data type structures,
Data SQL works on the property of
columnar data structures, etc. meaning you will
Structures Atomicity, Consistency, Isolation, and
have to provide codes forimplementation or for
Supported Durability (ACID) which is
rolling back during a
fundamental to RDBMS.
transaction.
Fault
Hadoop is highly fault-tolerant. SQL has good fault tolerance.
Tolerance

As Hadoop uses the notion of distributed SQL supporting databases areusually


computing and the principle of map-reduce available on-prises or onthe cloud,
Availability therefore it handles data therefore it can’t utilizethe benefits
availability on multiple systems across of distributed computing.
multiple geo-locations.
Integrity Hadoop has low integrity. SQL has high integrity.
Scaling in Hadoop based system requiresScaling in SQL required purchasing
connecting computers over the [Link] SQL servers and
Scaling
Horizontal Scaling with Hadoop isconfiguration which is
cheap and flexible. expensive and time-consuming.
SQL supports real-time data
Hadoop supports large-scale batch data
processing known as Online
Data
processing known as Online Analytical
Transaction Processing (OLTP)
Processing
Processing (OLAP).
thereby making it interactive and
batch-oriented.
Statements in Hadoop are executedvery
Execution SQL syntax can be slow when
quickly even when millions of
Time executed in millions of rows.
queries are executed at once.
Hadoop uses appropriate Java Database
SQL systems can read and write
Connectivity (JDBC) to interact with SQL
Interaction
data to Hadoop systems.
systems to transfer and receive data
between them.
Hadoop supports advanced machine learning
Support for ML SQL’s support for ML and AIis
and artificial intelligence techniques.
and AI limited compared to Hadoop.
Hadoop requires an advanced skill level for
The SQL skill level required to use it is
you to be proficient in using it and trying to
intermediate as it can be learned
Skill Level learn Hadoop as a beginner can be moderately
easily for beginners and entry-level
difficult as it requires
professionals.
certain kinds of skill sets.

Language Hadoop framework is built with Java SQL is a traditional database


Supported programming language. language used to perform

database management operationson


relational databases such as
MySQL, Oracle, SQL Server, etc.
When you need to manage unstructured SQL performs well in a moderate
Use Case
data, structured data, or semi-structureddata volume of data and it supports
in huge volume, Hadoop is a good fit. structured data only.
With SQL supported system,
Hardware In Hadoop, commodity hardware
propriety hardware installation is
Configuration installation is required on the server.
required.
SQL supporting systems are
Pricing Hadoop is a free open-source framework.
mostly licensed.

2.4 Comparison with other system like RDBMS


Below is the comparison table between Hadoop vs RDBMS.

Feature RDBMS Hadoop


Data Variety Mainly for Structured data Used for Structured, Semi-
Structured, and Unstructured data
Data Storage Average size data (GBS) Use for large data sets (Tbs and
Pbs)
Querying SQL Language HQL (Hive Query Language)
Schema Required on write (static Required on reading (dynamic
schema) schema)
Speed Reads are fast Both reads and writes are fast
Cost License Free
Use Case OLTP (Online transaction Analytics (Audio, video, logs, etc.),
processing) Data Discovery
Data Objects Works on Relational Works on Key/Value Pair
Tables
Throughput Low High
Scalability Vertical Horizontal
Hardware Profile High-End Servers Commodity/Utility Hardware
Integrity High (ACID) Low

2.5 Components of Hadoop

There are three core components of Hadoop as mentioned earlier. They are HDFS,
MapReduce, and YARN. These together form the Hadoop framework architecture.

1. HDFS (Hadoop Distributed File System):


It is a data storage system. Since the data sets are huge, it uses a distributed system to store this
data. It is stored in blocks where each block is 128 MB. It consists of NameNode and DataNode.
There can only be one NameNode but multiple DataNodes.

Features:
• The storage is distributed to handle a large data pool
• Distribution increases data security
• It is fault-tolerant, other blocks can pick up the failure of one block

2. MapReduce:
The MapReduce framework is the processing unit. All data is distributed and processed
parallelly. There is a MasterNode that distributes data amongst SlaveNodes. The SlaveNodes
do the processing and send it back to the MasterNode.

Features:
• Consists of two phases, Map Phase and Reduce Phase.
• Processes big data faster with multiples nodes working under one CPU

3. YARN (yet another Resources Negotiator):


It is the resource management unit of the Hadoop framework. The data which is stored can be
processed with help of YARN using data processing engines like interactive processing. It can
be used to fetch any sort of data analysis.

Features:
• It is a filing system that acts as an Operating System for the data stored on HDFS
• It helps to schedule the tasks to avoid overloading any system

2.6 Hadoop Architecture


The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes
Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes
DataNode and TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It
contains a master/slave architecture. This architecture consist of a single NameNode performs
the role of master, and multiple DataNodes performs the role of a slave. Both NameNode and
DataNode are capable enough to run on commodity machines. The Java language is used to
develop HDFS. So any machine that supports Java language can easily run the NameNode
and DataNode software.

NameNode
o It is a single master server exist in the HDFS cluster.
o As it is a single node, it may become the reason of single point failure.
o It manages the file system namespace by executing an operation like theopening,
renaming and closing the files.
o It simplifies the architecture of the system.

DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of DataNode to read and write requests from the file
system's clients.
o It performs block creation, deletion, and replication upon instruction from the
NameNode.

Job Tracker
o The role of Job Tracker is to accept the MapReduce jobs from client and
process the data by using NameNode.
o In response, NameNode provides metadata to Job Tracker.

Task Tracker
o It works as a slave node for Job Tracker.
o It receives task and code from Job Tracker and applies that code on the file. This
process can also be called as a Mapper.

MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce job
to Job Tracker. In response, the Job Tracker sends the request to theappropriate Task Trackers.
Sometimes, the TaskTracker fails or time out. In such a case, that part of the job is rescheduled.

2.7 Difference between Hadoop 1 and Hadoop 2


Hadoop is an open source software programming framework for storing a large amount of data
and performing the computation. Its framework is based on Java programming with some native
code in C and shell scripts.

Hadoop 1 vs Hadoop 2

1. Components: In Hadoop 1 we have MapReduce but Hadoop 2 has YARN(Yet


Another Resource Negotiator) and MapReduce version 2.

Hadoop 1 Hadoop 2

HDFS HDFS

Map Reduce YARN / MRv2

2. Daemons:
Hadoop 1 Hadoop 2

Namenode Namenode

Datanode Datanode

Secondary Namenode Secondary Namenode

Job Tracker Resource Manager

Task Tracker Node Manager

3. Working:

In Hadoop 1, there is HDFS which is used for storage and top of it, Map Reduce which
works as Resource Management as well as Data Processing. Due to this workload on Map
Reduce, it will affect the performance.
In Hadoop 2, there is again HDFS which is again used for storage and on the top of
HDFS, there is YARN which works as Resource Management. It basicallyallocates the
resources and keeps all the things going on.

4. Limitations:

Hadoop 1 is a Master-Slave architecture. It consists of a single master and multiple slaves.


Suppose if master node got crashed then irrespective of your best slave nodes, your cluster will
be destroyed. Again for creating that cluster means copying system files, image files, etc. on
another system is too much time consuming which will not betolerated by organizations in
today’s time.

Hadoop 2 is also a Master-Slave architecture. But this consists of multiple masters ([Link]
namenodes and standby namenodes) and multiple slaves. If here master node got crashed then
standby master node will take over it. You can make multiple combinations of active-
standby nodes. Thus Hadoop 2 will eliminate the problem of asingle point of failure.

5. Ecosystem

Oozie is basically Work Flow Scheduler. It decides the particular time of jobs to
execute according to their dependency.
Pig, Hive and Mahout are data processing tools that are working on the top of
Hadoop.
Sqoop is used to import and export structured data. You can directly import and export
the data into HDFS using SQL database.
Flume is used to import and export the unstructured data and streaming data.
MapReduce and YARN framework

3.1 Introduction to MapReduce in Hadoop

MapReduce is a software framework and programming model used for processing


huge amounts of data. MapReduce program work in two phases, namely, Map and
Reduce. Map tasks deal with splitting and mapping of data while Reduce tasks shuffle
and reduce the data.
Hadoop is capable of running MapReduce programs written in various languages:
Java, Ruby, Python, and C++. The programs of Map Reduce in cloud computing are
parallel in nature, thus are very useful for performing large-scale data analysis using
multiple machines in the cluster.

The input to each phase is key-value pairs. In addition, every programmer needs to
specify two functions: map function and reduce function.

3.2 Processing data with Hadoop using MapReduce

MapReduce is a programming framework that allows us to perform distributed and


parallel processing on large data sets in a distributed environment.

• MapReduce consists of two distinct tasks – Map and Reduce.


• As the name MapReduce suggests, the reducer phase takes place after the
mapper phase has been completed.
• So, the first is the map job, where a block of data is read and processed
to produce key-value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is input to the Reducer.
• The reducer receives the key-value pair from multiple map jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-
value pair) into a smaller set of tuples or key-value pairs which is the final
output.

Let us understand more about MapReduce and its components. MapReduce majorly
has the following three Classes. They are,

Mapper Class

The first stage in Data Processing using MapReduce is the Mapper Class. Here,
RecordReader processes each Input record and generates the respective key-value pair.
Hadoop’s Mapper store saves this intermediate data into the local disk.

Input Split

It is the logical representation of data. It represents a block of work that contains a


single map task in the MapReduce Program.

RecordReader

It interacts with the Input split and converts the obtained data in the form of Key-
Value Pairs.

Reducer Class

The Intermediate output generated from the mapper is fed to the reducer which
processes it and generates the final output which is then saved in the HDFS.

Driver Class

The major component in a MapReduce job is a Driver Class. It is responsible for


setting up a MapReduce Job to run-in Hadoop. We specify the names
of Mapper and Reducer Classes long with data types and their respective job names.

3.3 Introduction to YARN

Yet Another Resource Manager takes programming to the next level beyond Java ,
and makes it interactive to let another application Hbase, Spark etc. to work on
[Link] Yarn applications can co-exist on the same cluster so MapReduce, Hbase,
Spark all can run at the same time bringing great benefits for manageability and cluster
utilization.

Components Of YARN

o Client: For submitting MapReduce jobs.


o Resource Manager: To manage the use of resources across the cluster
o Node Manager:For launching and monitoring the computer containers on
machines in the cluster.

o Map Reduce Application Master: Checks tasks running the MapReduce job.
The application master and the MapReduce tasks run in containers that are
scheduled by the resource manager, and managed by the node managers.

Jobtracker & Tasktrackerwere were used in previous version of Hadoop, which were
responsible for handling resources and checking progress management. However,
Hadoop 2.0 has Resource manager and NodeManager to overcome the shortfall of
Jobtracker & Tasktracker.

3.4 Hadoop Yarn Architecture


Apache Yarn Framework consists of a master daemon known as “Resource Manager”,
slave daemon called node manager (one per slave node) and Application Master (one
per application).

1. Resource Manager (RM)

It is the master daemon of Yarn. RM manages the global assignments of resources


(CPU and memory) among all the applications. It arbitrates system resources between
competing applications. follow Resource Manager guide to learn Yarn Resource
manager in great detail.

Resource Manager has two Main components:

• Scheduler
• Application manager
a) Scheduler

The scheduler is responsible for allocating the resources to the running application.
The scheduler is pure scheduler it means that it performs no monitoring no tracking
for the application and even doesn’t guarantees about restarting failed tasks either due
to application failure or hardware failures.

b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case
of failures.

2. Node Manager (NM)

It is the slave daemon of Yarn. NM is responsible for containers monitoring their


resource usage and reporting the same to the ResourceManager. Manage the user
process on that machine. Yarn NodeManager also tracks the health of the node on
which it is running. The design also allows plugging long-running auxiliary services
to the NM; these are application-specific services, specified as part of the
configurations and loaded by the NM during startup. A shuffle is a typical auxiliary
service by the NMs for MapReduce applications on YARN

3. Application Master (AM)

One application master runs per application. It negotiates resources from the resource
manager and works with the node manager. It Manages the application life cycle.

The AM acquires containers from the RM’s Scheduler before contacting the
corresponding NMs to start the application’s individual tasks.

You might also like