Understanding Big Data Analytics Basics

Uploaded by

Bhavani

BIG DATA ANALYTICS

Module-1

Big Data
Big Data refers to massive datasets that grow exponentially and come from a
variety of sources, presenting challenges in handling, processing, and analysis.
These datasets can be structured, unstructured, or semi-structured. To effectively
manage this data, Hadoop comes into the picture. Let's dive into Big Data and how
Hadoop revolutionizes data processing.

What is Big Data?

Big Data is characterized by its large volume, high velocity, and diverse variety,
making it difficult to process with traditional tools.
Big Data Analytics uses advanced analytical methods to extract important
business insights from bulk datasets. These datasets contain both structured
(organized) and unstructured (unorganized) data.

Its applications span industries such as healthcare, education, insurance, AI,
retail, and manufacturing.

What is Big-Data Analytics?

Big Data Analytics is all about crunching massive amounts of information to
uncover hidden trends, patterns, and relationships. It's like sifting through a giant
mountain of data to find the gold nuggets of insight.

Here's a breakdown of what it involves:

 Collecting Data: Data arrives from many sources, such as social media, web
traffic, sensors, and customer reviews.

 Cleaning the Data: Imagine assessing a pile of rocks that contains a few
pieces of gold. You would first have to clear away the dirt and debris.
Cleaning data means fixing mistakes, removing duplicates, and formatting
the data properly.

 Analyzing the Data: This is where the real work happens. Data analysts
use powerful tools and techniques to discover patterns and trends, much
like looking for a specific pattern in all the rocks you sorted through.

How does big data analytics work?

Big Data Analytics is a powerful way to unlock the potential of large and complex
datasets. To understand it better, let's break it down into key steps:

 Data Collection: Data is the core of Big Data Analytics. This step gathers
data from different sources such as customer comments, surveys, sensors,
social media, and so on. The primary aim is to compile as much accurate
data as possible, since more accurate data supports more insights.

 Data Cleaning (Data Preprocessing): The collected data usually needs
cleaning before it can be used. This entails filling in missing data, correcting
inaccuracies, and removing duplicates. It is like sifting through a treasure
trove, separating out the rocks and debris and leaving only the valuable
gems behind.

 Data Processing: Next, the data is processed. This involves writing,
structuring, and formatting the data so that it is usable for analysis. It is like
a chef gathering and preparing the ingredients before cooking. Data
processing turns the data into a format that analytics tools can work with.

 Data Analysis: Statistical, mathematical, and machine learning methods are
applied to the processed data to extract the most important findings. For
example, analysis can uncover customer preferences, market trends, or
patterns in healthcare data.

 Data Visualization: Analysis results are usually presented visually, for
example as charts, graphs, and interactive dashboards. Visualizations
simplify large amounts of data and let decision makers quickly detect
patterns and trends.

 Data Storage and Management: Storing and managing the analyzed data
properly is of utmost importance. It is like digital scrapbooking: you may
want to revisit those findings later, so how you store them matters. Data
protection and adherence to regulations are also key issues at this stage.

 Continuous Learning and Improvement: Big data analytics is a
continuous process of collecting, cleaning, and analyzing data to uncover
hidden insights. It helps businesses make better decisions and gain a
competitive edge.
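
The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (name, rating) and the default value used for missing ratings are all hypothetical and not taken from any real dataset.

```python
def clean_records(records, default_rating=0):
    """Fill missing ratings, normalise names, and drop duplicate records."""
    cleaned = []
    seen = set()
    for rec in records:
        # Fix formatting: trim whitespace and use consistent capitalisation.
        name = rec.get("name", "").strip().title()
        rating = rec.get("rating")
        if rating is None:
            rating = default_rating  # replace missing data (assumed default)
        key = (name, rating)
        if key in seen:
            continue  # remove duplicates
        seen.add(key)
        cleaned.append({"name": name, "rating": rating})
    return cleaned


# Hypothetical raw survey records with a duplicate and a missing value.
raw = [
    {"name": " alice ", "rating": 4},
    {"name": "Alice", "rating": 4},
    {"name": "bob", "rating": None},
]
cleaned = clean_records(raw)
print(cleaned)
```

In practice the choice of default for missing values depends on the analysis; a fixed default is used here only to keep the sketch short.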

Types of Big Data Analytics

Big Data Analytics comes in several types, each serving a different purpose:

1. Descriptive Analytics: This type helps us understand past events. In social
media, it shows performance metrics, like the number of likes on a post.

2. Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the
reasons behind past events. In healthcare, it identifies the causes of high
patient re-admissions.

3. Predictive Analytics: Predictive analytics forecasts future events based on
past data. Weather forecasting, for example, predicts tomorrow's weather by
analyzing historical patterns.

4. Prescriptive Analytics: This category not only predicts results but also
recommends actions to achieve the best results. In e-commerce, it may
suggest the best price for a product to achieve the highest possible profit.

5. Real-time Analytics: Real-time analytics processes data as it arrives. It
allows traders, for example, to make swift decisions based on real-time
market events.

6. Spatial Analytics: Spatial analytics deals with location data. In urban
management, it uses data from sensors and cameras to optimize traffic flow
and minimize traffic jams.

7. Text Analytics: Text analytics delves into unstructured text data. In the
hotel business, it can use guest reviews to enhance services and guest
satisfaction.

Big Data Analytics Technologies and Tools

Big Data Analytics relies on various technologies and tools that might sound
complex, so let's simplify them:

 Hadoop: Imagine Hadoop as an enormous digital warehouse. It's used by
companies like Amazon to store tons of data efficiently. For instance, when
Amazon suggests products you might like, it's because Hadoop helps
manage your shopping history.

 Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly
analyze what you watch and recommend your next binge-worthy show.

 NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing
cabinets that Airbnb uses to store your booking details and user data. These
databases are popular because they are quick and flexible, so the platform
can provide you with the right information when you need it.

 Tableau: Tableau is like an artist that turns data into beautiful pictures. The
World Bank uses it to create interactive charts and graphs that help people
understand complex economic data.

 Python and R: Python and R are like magic tools for data scientists. They
use these languages to solve tricky problems. For example, Kaggle uses
them to predict things like house prices based on past data.

 Machine Learning Frameworks (e.g., TensorFlow): Machine learning
frameworks are the tools that make predictions. Airbnb uses TensorFlow to
predict which properties are most likely to be booked in certain areas. It
helps hosts make smart decisions about pricing and availability.

These tools and technologies are the building blocks of Big Data Analytics. They
help organizations gather, process, understand, and visualize data, making it easier
for them to make decisions based on information.

Benefits of Big Data Analytics

Big Data Analytics offers a host of real-world advantages. Let's understand them
with examples:

1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics
helps them make smart choices about what products to stock. This not only
reduces waste but also keeps customers happy and profits high.

2. Enhanced Customer Experiences: Think about Amazon. Big Data
Analytics is what makes those product suggestions so accurate. It's like
having a personal shopper who knows your taste and helps you find what
you want.

3. Fraud Detection: Credit card companies, like MasterCard, use Big Data
Analytics to catch and stop fraudulent transactions. It's like having a
guardian that watches over your money and keeps it safe.

4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to
deliver your packages faster and with less impact on the environment. It's
like taking the fastest route to your destination while also being kind to the
planet.

Challenges of Big data analytics

While Big Data Analytics offers incredible benefits, it also comes with its own set
of challenges:

 Data Overload: Consider Twitter, where approximately 6,000 tweets are
posted every second. The challenge is sifting through this avalanche of data
to find valuable insights.

 Data Quality: If the input data is inaccurate or incomplete, the insights
generated by Big Data Analytics can be flawed. For example, incorrect
sensor readings could lead to wrong conclusions in weather forecasting.

 Privacy Concerns: With the vast amount of personal data used, like in
Facebook's ad targeting, there's a fine line between providing personalized
experiences and infringing on privacy.

 Security Risks: With cyber threats increasing, safeguarding sensitive data
becomes crucial. For instance, banks use Big Data Analytics to detect
fraudulent activities, but they must also protect this information from
breaches.

 Costs: Implementing and maintaining Big Data Analytics systems can be
expensive. Airlines like Delta use analytics to optimize flight schedules, but
they need to ensure that the benefits outweigh the costs.

Usage of Big Data Analytics

Big Data Analytics has a significant impact in various sectors:

 Healthcare: It aids in precise diagnoses and disease prediction, elevating
patient care.

 Retail: Amazon's use of Big Data Analytics offers personalized product
recommendations based on your shopping history, creating a more tailored
and enjoyable shopping experience.

 Finance: Credit card companies such as Visa rely on Big Data Analytics to
swiftly identify and prevent fraudulent transactions, ensuring the safety of
your financial assets.

 Transportation: Companies like Uber use Big Data Analytics to optimize
drivers' routes and predict demand, reducing wait times and improving
overall transportation experiences.

 Agriculture: Farmers make informed decisions, boosting crop yields while
conserving resources.

 Manufacturing: Companies like General Electric (GE) use Big Data
Analytics to predict machinery maintenance needs, reducing downtime and
enhancing operational efficiency.

Conclusion

Big Data Analytics is a game-changer that's shaping a smarter future. From
improving healthcare and personalizing shopping to securing finances and
predicting demand, it's transforming various aspects of our lives. However,
challenges like managing overwhelming data and safeguarding privacy are real
concerns. In our world flooded with data, Big Data Analytics acts as a guiding
light. It helps us make smarter choices, offers personalized experiences, and
uncovers valuable insights. It's a powerful and stable tool that promises a better
and more efficient future for everyone.

What is Unstructured Data?

Unstructured data refers to information that does not have a predefined format or
structure. It is messy, unorganized and hard to sort. Unlike structured data, which is
organized into rows and columns (like an Excel sheet), unstructured data comes in
many different forms such as text documents, images, audio files, videos and social
media posts. Because this type of data does not follow a clear pattern, it’s harder to
store, process and search.
Characteristics of Unstructured Data

 Lack of Format: Unstructured data does not fit neatly into tables or
databases. It can be textual or non-textual, making it difficult to categorize
and organize.

 Variety: This type of data can include a wide range of formats, such as:

o Text documents (e.g., emails, reports, articles)

o Multimedia files (e.g., images, audio, video)

o Social media content (e.g., posts, comments, tweets)

o Web pages and blogs

 Volume: Unstructured data represents a significant portion of the data
generated today. It is often larger in volume compared to structured data.

 Diverse Sources: It can originate from various sources, including
user-generated content, sensor data, customer interactions and more.

Importance of unstructured Data

Even though unstructured data is harder to deal with, it is extremely valuable, for
the following reasons:

 It helps businesses understand their customers better. For example,
businesses can learn what customers think about their products by reading
reviews or social media posts.

 It contains real-world insights, like what people are talking about online or
what videos are trending.

 It's growing rapidly. More and more of the data being created today is
unstructured, like photos, tweets and videos.

Examples of Unstructured Data

Unstructured data can come in many different forms. Here are some examples:
 Social Media: Posts, tweets, comments and pictures on Facebook,
Instagram, or Twitter

 Emails: Your inbox full of messages, attachments and conversations

 Photos & Videos: Pictures on your phone or videos on YouTube

 Audio Files: Podcasts, voice messages, music files

 Documents: Reports, articles, PDFs, or Word files

 Websites & Blogs: Articles, reviews and posts online

Extracting Information from Unstructured Data

Unstructured data does not have any predefined structure, so it cannot easily be
interpreted by conventional algorithms. It is also difficult to tag and index, which
makes extracting information from it a tough job. However, there are ways to
organize and extract useful information from it:

 Tagging: We can label or tag data with keywords. For example, a photo of a
dog might be tagged with the words “dog,” “pet,” or “animal” so it can be
found easily later.

 Classifying Data: This is like organizing things into groups. For example,
grouping customer reviews into positive or negative feedback. This makes it
easier to search and analyze.

 Data Mining: This technique helps find patterns in unstructured data. For
example, analyzing customer reviews to see common complaints or finding
patterns in social media posts to predict trends.
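
A minimal sketch of tagging and classifying, assuming simple keyword matching. The word lists, function names and sample texts below are hypothetical; real systems typically use trained NLP models rather than fixed keyword sets.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "poor", "broken", "slow"}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tag_text(text, vocabulary):
    """Tagging: return the vocabulary keywords that occur in the text."""
    return sorted(set(tokenize(text)) & set(vocabulary))


def classify_review(text):
    """Classifying: label a review by counting positive and negative words."""
    words = tokenize(text)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tags = tag_text("My dog is a lovely pet", {"dog", "pet", "animal"})
label = classify_review("Great phone, I love it")
print(tags, label)
```

Once reviews carry labels like these, counting the most frequent words among the negative ones is a simple first step toward the data mining described in the last bullet.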

Storing Unstructured Data

 Unstructured data can be converted to more easily managed formats.

 A content-addressable storage (CAS) system can be used to store
unstructured data. It stores data along with its metadata and assigns a unique
name to every object stored in it. An object is retrieved based on its content,
not its location.

 Unstructured data can be stored in XML format.

 Unstructured data can be stored in an RDBMS that supports BLOBs (binary
large objects).
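
The idea behind content-addressable storage can be sketched as follows. This is a toy in-memory model, not a real CAS product: the class name is hypothetical, and deriving the unique name from a SHA-256 hash of the content is an assumption made for illustration.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory CAS: an object's name is derived from its content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Assumption: the unique name is the SHA-256 digest of the content.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        # Retrieval uses the content-derived name, not a file path or location.
        return self._objects[address]


store = ContentAddressableStore()
addr1 = store.put(b"voice memo bytes")
addr2 = store.put(b"voice memo bytes")  # same content, so same address
print(addr1 == addr2)
```

A side effect of this design is that identical objects are stored only once, because they map to the same name.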

Unstructured Data vs Structured Data

Structured data is neatly organized into rows and columns, much like a
spreadsheet or a database. For instance, a table listing people's names, ages and
addresses is structured data: it follows a clear format and is easy to search or
analyze.

Unstructured data, on the other hand, doesn’t follow a set structure. It includes
things like photos, videos, audio clips or tweets. There's no consistent format,
which makes it harder to organize or process.

Feature | Structured Data | Unstructured Data

Format | Organized in rows and columns (e.g., tables, spreadsheets). | No fixed format or predefined structure.

Examples | Names, ages and addresses in a database. | Photos, videos, emails, social media posts.

Storage | Stored in relational databases (e.g., SQL). | Stored in files, cloud storage, or NoSQL databases.

Ease of Analysis | Easy to search, sort and analyze with tools. | Requires advanced processing (e.g., NLP, image recognition).

Data Type | Text and numbers in a predictable format. | Mixed data types: text, audio, video, etc.

Real-World Analogy | A neatly arranged bookshelf with categorized books. | A scattered pile of books, photos, papers and sticky notes.

Applications

Unstructured data is already being used across industries:

 Healthcare: Doctors use unstructured patient records, lab notes and imaging
reports to diagnose and personalize treatment.

 Retail: Analyzing customer reviews and social media comments to improve
product quality and customer experience.

 Finance: Processing news feeds, analyst reports and customer emails to
manage risk and improve investment decisions.

 Legal: Automating document review and e-discovery in law firms through
text mining.

 Media & Entertainment: Recommending content based on viewing habits,
comments and user preferences.

Challenges with Unstructured Data

There are a few challenges with unstructured data that make it difficult to manage:

 Hard to Store: Since unstructured data comes in so many different formats
(like images or audio), it takes up a lot of storage space. You need big
storage systems to hold it all.

 Difficult to Search: Without labels or organization, it's hard to find specific
information in unstructured data. For example, if you have thousands of
tweets, finding one particular tweet can be tricky.

 Hard to Analyze: Unlike structured data, which is easy to analyze using
simple tools, unstructured data requires special software and complex
techniques to make sense of it.

What is Semi-structured data?

 Semi-structured data is data that does not reside in a traditional relational
database (like SQL) but still has some organizational properties, such as
tags or markers, that make it easier to analyze than completely unstructured
data.
 It doesn't follow a strict schema like structured data, but it still contains
elements like labels or keys that make the data identifiable and searchable.

Characteristics of Semi-Structured Data

1. Flexible Schema: The structure can vary from one entry to another. For
example, one JSON object may have five fields while another has only
three.

2. Human-Readable Format: Many types like XML or JSON are easy for
humans and machines to understand.

3. Scalable: Easily handled by modern NoSQL databases, making it great for
Big Data environments.

4. Metadata-Rich: Tags and attributes provide context that helps with sorting
and analysis.
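
The flexible schema can be illustrated with two JSON objects that carry different fields, matching the five-field and three-field case mentioned above. The product records and field names are made up for this sketch.

```python
import json

# Two hypothetical product records: one has five fields, the other three.
docs_json = """[
  {"id": 1, "name": "Laptop", "price": 900, "ram_gb": 16, "color": "grey"},
  {"id": 2, "name": "Mug", "price": 8}
]"""

docs = json.loads(docs_json)

# The keys act as tags, so the union of fields can be discovered from the data.
all_keys = sorted({key for doc in docs for key in doc})

# Flatten into uniform rows, using None where a record lacks a field.
rows = [{key: doc.get(key) for key in all_keys} for doc in docs]

print(all_keys)
print(rows[1])
```

Nothing had to be declared up front for the extra fields; the structure travels with each record.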

Importance of Semi-Structured Data

As data becomes more complex and varied, semi-structured formats offer a
balance between flexibility and manageability. They allow organizations to
store and process different types of information in one place, making it
easier to handle diverse data formats. Additionally, semi-structured data
enables quick adaptation to new data sources without the need to redesign
existing databases. This flexibility supports more efficient data analysis
and integration, especially when combining structured and unstructured
data, making it a valuable asset in modern data-driven environments.

Examples of Semi-Structured Data:

 JSON (JavaScript Object Notation)

 XML (eXtensible Markup Language)

 CSV files with inconsistent rows

 Emails (with structured headers and unstructured body text)

 Sensor data from IoT devices

 HTML web pages

Extracting Information from Semi-Structured Data

Semi-structured data varies in structure because its sources are
heterogeneous. Sometimes it contains no structure at all. This makes it
difficult to tag and index, so extracting information from it is a tough job.
Here are possible solutions:

 Graph-based models (e.g. OEM) can be used to index semi-structured data.

 The data modelling technique in OEM allows the data to be stored in a
graph-based model. Data in a graph-based model is easier to search and
index.

 XML allows data to be arranged in hierarchical order, which enables the
data to be indexed and searched.

 Various data mining tools can be used.
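
As a rough sketch of how XML's hierarchy supports indexing, the snippet below parses a small document with Python's standard xml.etree.ElementTree module and builds an index keyed by tag name. The patient records and element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient records; the second record has no <allergy> element.
xml_text = """
<patients>
  <patient id="p1"><name>Asha</name><allergy>penicillin</allergy></patient>
  <patient id="p2"><name>Ravi</name></patient>
</patients>
""".strip()

root = ET.fromstring(xml_text)

# Build an index: tag name -> {patient id -> text value}.
index = {}
for patient in root.findall("patient"):
    for field in patient:
        index.setdefault(field.tag, {})[patient.get("id")] = field.text

print(index["name"])
print(index.get("allergy"))
```

Records with missing fields simply do not appear under that tag, so the index tolerates the irregular structure.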

Semi-Structured Data Management

Unlike structured data, semi-structured data is best managed using NoSQL
databases or document stores. Popular technologies include:

 MongoDB: A document-based NoSQL database that works well with
JSON-like formats.

 Cassandra: Handles wide-column data with semi-structured schema
design.

 Elasticsearch: Can index and search through semi-structured log files and
documents.

 Cloud Storage (e.g. AWS S3, Azure Blob): Used to store large volumes
of semi-structured data like logs, emails, and telemetry data.

Applications

Semi-structured data is used across various industries:

 E-commerce: Product catalogs stored in JSON format, allowing flexibility
in item attributes.

 Healthcare: Patient forms and reports stored in XML with variable fields.

 IoT and Smart Devices: Sensor data captured in key-value formats.

 Web Development: HTML and JSON used to render dynamic content on
websites.

 Social Media Platforms: User activity and messages logged in
semi-structured logs.

Challenges

Despite its flexibility, semi-structured data comes with a few challenges:

 Complex Querying: Not as straightforward as SQL queries on structured
data.

 Data Cleaning: Irregular structure may lead to inconsistency and harder
integration.

 Tool Compatibility: Not all analytics tools support semi-structured
formats out of the box.

What is Hadoop?

Hadoop is an open-source framework written in Java that allows distributed
storage and processing of large datasets. Before Hadoop, traditional
systems were limited to processing structured data, mainly using RDBMS,
and couldn't handle the complexities of Big Data. In this section we will
learn how Hadoop offers a solution to handle Big Data.

Components of Hadoop

In this section, we will explore HDFS for distributed and fault-tolerant data
storage, the MapReduce programming model for data processing, and YARN
for resource management and job scheduling in a Hadoop cluster.

Hadoop - HDFS (Hadoop Distributed File System)

Before learning about HDFS (Hadoop Distributed File System), we should
know what a file system actually is. A file system is a data structure or
method that an operating system uses to manage files on disk space. It
allows the user to store, maintain, and retrieve data from the local disk.

Examples of Windows file systems are NTFS (New Technology File
System) and FAT32 (File Allocation Table 32). FAT32 was used in some
older versions of Windows but can also be used on all versions of Windows
XP. Similarly, Linux has file systems such as ext3 and ext4.

What is DFS?

DFS stands for distributed file system. It is a concept of storing a file
on multiple nodes in a distributed manner. A DFS provides the
abstraction of a single large system whose storage is equal to the sum of
the storage of the nodes in the cluster.

Let's understand this with an example. Suppose you have a DFS comprising
4 different machines, each of size 10TB. In that case you can store, say,
30TB across this DFS, as it provides you a combined machine of size
40TB. The 30TB of data is distributed among these nodes in the form of blocks.
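
The arithmetic in this example can be written out as a short sketch. It assumes equally sized nodes, an even spread of data, and no replication; the function names are hypothetical and this is not how a real DFS decides placement.

```python
def cluster_capacity_tb(node_sizes_tb):
    """Total capacity the DFS presents as one logical machine."""
    return sum(node_sizes_tb)


def spread_evenly(data_tb, node_sizes_tb):
    """Split the data evenly across nodes (simplifying assumption)."""
    if data_tb > cluster_capacity_tb(node_sizes_tb):
        raise ValueError("data does not fit in the cluster")
    share = data_tb / len(node_sizes_tb)
    return [share for _ in node_sizes_tb]


nodes = [10, 10, 10, 10]             # four machines of 10TB each
capacity = cluster_capacity_tb(nodes)
per_node = spread_evenly(30, nodes)  # 30TB of data across the cluster
print(capacity, per_node)
```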

Why We Need DFS?

You might be thinking that we could store a 30TB file on a single system,
so why do we need a DFS? The reason is that the disk capacity of a single
system can only be increased up to a point. Even if you somehow manage to
keep the data on a single system, you will face a processing problem:
processing large datasets on a single machine is not efficient.

Let's understand this with an example. Suppose you have a file of size
40TB to process. On a single machine, suppose it takes 4 hours to process
completely. With a DFS (Distributed File System), the 40TB file is
distributed among the 4 nodes in a cluster, with each node storing 10TB of
the file. As all these nodes work simultaneously, it takes only 1 hour to
process the file completely, which is much faster. That is why we need DFS.
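
The speed-up in this example follows from simple division, sketched below. The function name is hypothetical, and the model assumes perfectly even data distribution with no coordination or network overhead, which real clusters do not achieve.

```python
def ideal_parallel_hours(single_machine_hours, nodes):
    """Idealised processing time when work is split evenly across nodes."""
    if nodes < 1:
        raise ValueError("at least one node is required")
    return single_machine_hours / nodes


file_tb = 40
nodes = 4
tb_per_node = file_tb / nodes           # data held by each node
hours = ideal_parallel_hours(4, nodes)  # 4 hours on one machine
print(tb_per_node, hours)
```

In practice the gain is smaller than this ideal figure, because scheduling, data transfer and uneven workloads all add time.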

[Figure: Local File System Processing compared with Distributed File System Processing]

Overview - HDFS

Now that you are familiar with the term file system, let's begin with HDFS.
HDFS (Hadoop Distributed File System) is used for storage in a Hadoop cluster.
It is mainly designed to work on commodity hardware (inexpensive devices),
following a distributed file system design. HDFS is designed to favor storing
data in large blocks rather than in many small blocks. HDFS provides fault
tolerance and high availability to the storage layer and the other devices
present in the Hadoop cluster.

HDFS can handle data of high volume, velocity, and variety, which makes Hadoop
work more efficiently and reliably, with easy access to all its components. HDFS
stores data in blocks. The size of each block is 128MB by default, and it is
configurable: you can change it according to your requirement in the [Link] file
in your Hadoop directory.
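
To see what the block size means for a single file, the sketch below splits a file size into 128MB blocks. The 400MB file is a made-up example and the function name is hypothetical; the final block holds only the remainder rather than a full 128MB.

```python
BLOCK_SIZE_MB = 128  # block size stated above; configurable in a real cluster


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file of the given size is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # last block holds whatever is left over
    return sizes


blocks = split_into_blocks(400)  # hypothetical 400MB file
print(len(blocks), blocks)
```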

Some Important Features of HDFS(Hadoop Distributed File System)

 Files stored in HDFS are easy to access.

 HDFS provides high availability and fault tolerance.

 It provides scalability: nodes can be scaled up or down as required.

 Data is stored in a distributed manner, i.e. various DataNodes are
responsible for storing the data.

 HDFS provides replication, which guards against data loss.

 HDFS provides high reliability, as it can store data in the range of
petabytes.

 HDFS has built-in servers in the NameNode and DataNode that make it easy
to retrieve cluster information.

 It provides high throughput.

HDFS Storage Daemons

Just as MapReduce in Hadoop follows a master-slave architecture, HDFS has a
NameNode and DataNodes that work in a similar pattern.

1. NameNode(Master)
2. DataNode(Slave)

1. NameNode: The NameNode works as the master in a Hadoop cluster and guides
the DataNodes (slaves). It is mainly used for storing metadata, i.e. data about
the data. Metadata can be the transaction logs that keep track of user activity
in a Hadoop cluster.

Metadata can also be the name of a file, its size, and information about the
location (block numbers, block IDs) of the DataNodes, which the NameNode stores
so it can find the closest DataNode for faster communication. The NameNode
instructs the DataNodes to perform operations such as delete, create, replicate, etc.

Because the NameNode works as the master, it should have high RAM and
processing power in order to maintain and guide all the slaves in a Hadoop cluster.
The NameNode receives heartbeat signals and block reports from all the slaves, i.e.
the DataNodes.

2. DataNode: DataNodes work as slaves. They are mainly used for storing the data
in a Hadoop cluster. The number of DataNodes can range from 1 to 500 or even
more; the more DataNodes a Hadoop cluster has, the more data it can store.
DataNodes should therefore have high storage capacity to hold a large number of
file blocks. A DataNode performs operations such as creation, deletion, etc.
according to the instructions provided by the NameNode.
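
The division of labour between the NameNode and the DataNodes can be modelled with a toy sketch: a dictionary plays the role of the NameNode's block-location metadata. The round-robin placement, the node and block names, and the replication factor of 3 are assumptions for illustration; real HDFS placement also considers racks and node load.

```python
def place_blocks(block_ids, datanodes, replication=3):
    """Toy NameNode metadata: map each block to the DataNodes holding it."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    count = len(datanodes)
    placement = {}
    for i, block in enumerate(block_ids):
        # Round-robin choice of distinct nodes (simplifying assumption).
        placement[block] = [datanodes[(i + r) % count] for r in range(replication)]
    return placement


def blocks_lost(placement, failed_nodes):
    """Blocks whose every replica sits on a failed DataNode."""
    failed = set(failed_nodes)
    return [block for block, nodes in placement.items() if set(nodes) <= failed]


datanodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(["blk_0", "blk_1", "blk_2", "blk_3"], datanodes)
print(placement["blk_0"])
print(blocks_lost(placement, ["dn1"]))
```

With three replicas per block, losing a single DataNode leaves every block readable, which is the point of replication in the feature list above.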

Objectives and Assumptions Of HDFS

1. System Failure: A Hadoop cluster consists of many nodes built from
commodity hardware, so node failure is possible. A fundamental goal of HDFS is
therefore to detect such failures and recover from them.

2. Maintaining Large Dataset: HDFS handles files ranging in size from GB to
PB, so it has to be capable of dealing with these very large data sets on a
single cluster.

3. Moving Data is Costlier than Moving the Computation: If a computation is
performed near the location where the data resides, it runs faster. The overall
throughput of the system increases and network congestion is minimized.

4. Portable Across Various Platforms: HDFS possesses portability, which allows
it to move across diverse hardware and software platforms.

5. Simple Coherency Model: HDFS needs a write-once-read-many access model
for files. A file that has been written and closed should not be changed; data
can only be appended. This assumption helps to minimize data coherency issues.
MapReduce fits perfectly with this kind of file model.

6. Scalability: HDFS is designed to be scalable as the data storage requirements
increase over time. It can easily scale up or down by adding or removing nodes to
the cluster. This helps to ensure that the system can handle large amounts of data
without compromising performance.

7. Security: HDFS provides several security mechanisms to protect data stored on
the cluster. It supports authentication and authorization mechanisms to control
access to data, encryption of data in transit and at rest, and data integrity checks to
detect any tampering or corruption.

8. Data Locality: HDFS aims to move the computation to where the data resides
rather than moving the data to the computation. This approach minimizes network
traffic and enhances performance by processing data on local nodes.

9. Cost-Effective: HDFS can run on low-cost commodity hardware, which makes
it a cost-effective solution for large-scale data processing. Additionally, the ability
to scale up or down as required means that organizations can start small and
expand over time, reducing upfront costs.

10. Support for Various File Formats: HDFS is designed to support a wide range
of file formats, including structured, semi-structured, and unstructured data. This
makes it easier to store and process different types of data using a single system,
simplifying data management and reducing costs.

Map Reduce and its Phases with numerical example.

MapReduce is a framework for writing applications that process huge amounts of
data in parallel on large clusters of commodity hardware in a reliable manner.

Phases of MapReduce

The MapReduce model has three major phases and one optional phase.

1. Mapping

2. Shuffling and Sorting

3. Reducing

4. Combining (optional)

1) Mapping

It is the first phase of MapReduce programming. The Mapping phase accepts
key-value pairs (k, v) as input, where the key represents the address of each record
and the value represents the entire record. The output of the Mapping phase is
also in key-value format (k’, v’).

2) Shuffling and Sorting

The output of the various mappers (k’, v’) then goes into the Shuffling and Sorting
phase. The pairs are sorted by key, and all values that share the same key are
grouped together; no values are discarded. The output of the Shuffling and Sorting
phase is again key-value pairs, this time a key with an array of values (k’, v’[ ]).

3) Reducing

The output of the Shuffling and Sorting phase (k’, v’[ ]) is the input of the
Reducing phase. In this phase the reducer function's logic is executed and all the
values are collected against their corresponding keys. The reducer consolidates the
outputs of the various mappers and computes the final output.

4) Combining

It is an optional phase in MapReduce. The combiner is used to optimize
performance: it pre-aggregates each mapper's output locally, so less data has to
pass through the Shuffling and Sorting phase, which makes that phase quicker.
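
The three main phases can be imitated on one machine with plain Python. This is a sketch of the programming model only, not of Hadoop's API: the function names are hypothetical, and word counting is used as the example job with made-up input lines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key, value) record and collect its output."""
    pairs = []
    for key, value in records:
        pairs.extend(mapper(key, value))
    return pairs


def shuffle_and_sort(pairs):
    """Group values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped, reducer):
    """Apply the reducer to each (key, list of values) group."""
    return [reducer(key, values) for key, values in grouped]


def word_mapper(offset, line):
    # Emit (word, 1) for every word; the input key (line offset) is unused.
    return [(word, 1) for word in line.split()]


def sum_reducer(word, counts):
    return (word, sum(counts))


lines = [(0, "big data big"), (13, "data big")]  # (offset, record) pairs
mapped = map_phase(lines, word_mapper)
grouped = shuffle_and_sort(mapped)
result = reduce_phase(grouped, sum_reducer)
print(result)
```

A combiner would apply the same summing logic to each mapper's output before the shuffle, so that fewer pairs need to be moved between machines.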
Numerical Example

We will be using MovieLens Data.

USER_ID   MOVIE_ID   RATING   TIMESTAMP
196       242        3        881250949
186       302        3        891717742
196       377        1        878887116
244       51         2        880606923
166       346        1        886397596
186       474        4        884182806
186       265        2        881171488

Solution

Step 1: First we map the values; this happens in the first phase of the
MapReduce model. Each record is mapped to a USER_ID:MOVIE_ID pair.

196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ; 186:265

Step 2: After mapping, we shuffle and sort the values.

166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51

Step 3: After completing Step 1 and Step 2, we reduce each key's values.

Now, put all values together
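
The three steps can be reproduced with a few lines of Python using the rows from the table. The notes do not state which reduce function is applied in Step 3, so this sketch assumes the goal is to count how many movies each user rated; the variable names are illustrative only.

```python
from collections import defaultdict

# Rows from the MovieLens sample: (user_id, movie_id, rating, timestamp).
ratings = [
    (196, 242, 3, 881250949),
    (186, 302, 3, 891717742),
    (196, 377, 1, 878887116),
    (244, 51, 2, 880606923),
    (166, 346, 1, 886397596),
    (186, 474, 4, 884182806),
    (186, 265, 2, 881171488),
]

# Step 1 (map): emit a (user_id, movie_id) pair for every record.
mapped = [(user_id, movie_id) for user_id, movie_id, _, _ in ratings]

# Step 2 (shuffle and sort): group movie ids by user id, ordered by user id.
grouped = defaultdict(list)
for user_id, movie_id in mapped:
    grouped[user_id].append(movie_id)
shuffled = sorted(grouped.items())

# Step 3 (reduce): ASSUMED reduce function, counting movies per user.
movies_per_user = {user_id: len(movie_ids) for user_id, movie_ids in shuffled}

print(shuffled)
# [(166, [346]), (186, [302, 474, 265]), (196, [242, 377]), (244, [51])]
print(movies_per_user)
# {166: 1, 186: 3, 196: 2, 244: 1}
```

Under that assumption, user 186 rated three movies, user 196 rated two, and users 166 and 244 rated one each.
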

Common Use Cases of MapReduce

 Counting word frequency (as shown above)

 Log analysis

 Indexing web pages

 Processing large datasets for ETL (Extract, Transform, Load)

 Recommendation systems and data mining

Advantages of MapReduce

 Simple and easy abstraction for large-scale data processing

 Efficient for batch processing of massive datasets

 Fault-tolerant and scalable

 Integrates well with Hadoop Distributed File System (HDFS)

Limitations of MapReduce

 Not ideal for real-time processing

 Complex data workflows can be hard to express

 Debugging and testing are more challenging

 High latency due to intermediate disk I/O (especially in the shuffle phase)

Hadoop YARN Architecture

YARN stands for "Yet Another Resource Negotiator". It was introduced in
Hadoop 2.0 to remove the bottleneck of the Job Tracker, which was present in
Hadoop 1.0. YARN was described as a "Redesigned Resource Manager" at the time of
its launch, but it has since evolved into a large-scale distributed operating
system used for Big Data processing.

The YARN architecture separates the resource management layer from the
processing layer. The responsibilities that the Job Tracker held in Hadoop 1.0
are split in YARN between the Resource Manager and the per-application
Application Master.

YARN also allows different data processing engines like graph processing,
interactive processing, stream processing as well as batch processing to run and
process data stored in HDFS (Hadoop Distributed File System) thus making the
system much more efficient. Through its various components, it can dynamically
allocate various resources and schedule the application processing. For large
volume data processing, it is quite necessary to manage the available resources
properly so that every application can leverage them.

YARN Features

YARN gained popularity because of the following features:

1. Scalability: The scheduler in Resource manager of YARN architecture

allows Hadoop to extend and manage thousands of nodes and clusters.

2. Compatibility: YARN supports the existing map-reduce applications

without disruptions thus making it compatible with Hadoop 1.0 as well.
3. Cluster Utilization: YARN supports dynamic utilization of the cluster in
Hadoop, which enables optimized cluster utilization.

4. Multi-tenancy: It allows multiple engine access thus giving organizations a

benefit of multi-tenancy.

Hadoop YARN Architecture

The main components of YARN architecture include:

1) Client

The Client is the entity that initiates and submits the application (such as a
MapReduce job) to the YARN framework. It communicates with the Resource
Manager to request execution, monitors the job status, and can also interact with
the Application Master for progress updates. It essentially acts as the user's
interface to launch and manage applications on the Hadoop cluster.

2) Resource Manager

It is the master daemon of YARN and is responsible for resource assignment and
management among all the applications. Whenever it receives a processing request,
it forwards it to the corresponding node manager and allocates resources for the
completion of the request accordingly. It has two major components:

 Scheduler: It performs scheduling based on the allocated application and
available resources. It is a pure scheduler, meaning it does not perform other
tasks such as monitoring or tracking and does not guarantee a restart if a task
fails. The YARN scheduler supports plugins such as Capacity Scheduler and
Fair Scheduler to partition the cluster resources.

 Application manager: It is responsible for accepting the application and
negotiating the first container from the resource manager. It also restarts the
Application Master container if it fails.

3) Node Manager

The Node Manager takes care of an individual node in the Hadoop cluster and
manages the applications and workflow on that particular node. Its primary job
is to stay in sync with the Resource Manager. It registers with the Resource
Manager and sends heartbeats with the health status of the node. It monitors
resource usage, performs log management, and kills containers based on
directions from the Resource Manager. It is also responsible for creating the
container process and starting it at the request of the Application Master.

4) Application Master

An application is a single job submitted to a framework. The Application Master
is responsible for negotiating resources with the Resource Manager, tracking the
status, and monitoring the progress of a single application. The Application
Master requests the container from the Node Manager by sending a Container
Launch Context (CLC), which includes everything an application needs to run.
Once the application is started, it sends a health report to the Resource
Manager from time to time.

5) Container

A container is a collection of physical resources such as RAM, CPU cores, and
disk on a single node. Containers are invoked by a Container Launch Context
(CLC), which is a record that contains information such as environment
variables, security tokens, and dependencies.

Application workflow in Hadoop YARN:

1. The Client submits an application

2. The Resource Manager allocates a container to start the Application Master

3. The Application Master registers itself with the Resource Manager

4. The Application Master negotiates containers from the Resource Manager

5. The Application Master notifies the Node Manager to launch containers

6. Application code is executed in the container

7. The Client contacts the Resource Manager/Application Master to monitor the
application’s status

8. Once the processing is complete, the Application Master un-registers with
the Resource Manager
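
The workflow can be traced with a toy simulation. Everything below is a hypothetical stand-in written for illustration: these classes are not part of the Hadoop API, the round-robin allocation is a placeholder for a real scheduler, and step 7 (client monitoring) is left out. The coordinating role in steps 2 to 8 is modelled as `ApplicationMaster`, matching the Application Master component described earlier.

```python
class NodeManager:
    """Toy stand-in for the per-node daemon that launches containers."""

    def __init__(self, name):
        self.name = name

    def launch_container(self, task):
        # Step 6: the application code runs inside the container.
        return f"{task} ran on {self.name}"


class ResourceManager:
    """Toy stand-in for the cluster-wide master daemon."""

    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.registered_apps = set()

    def register(self, app_name):
        self.registered_apps.add(app_name)

    def unregister(self, app_name):
        self.registered_apps.discard(app_name)

    def allocate(self, count):
        # Round-robin placeholder; a real scheduler weighs queues and capacity.
        nodes = self.node_managers
        return [nodes[i % len(nodes)] for i in range(count)]


class ApplicationMaster:
    """Toy stand-in for the per-application coordinator."""

    def __init__(self, app_name, tasks, resource_manager):
        self.app_name = app_name
        self.tasks = tasks
        self.rm = resource_manager

    def run(self, log):
        self.rm.register(self.app_name)            # step 3
        log.append("registered")
        nodes = self.rm.allocate(len(self.tasks))  # step 4
        log.append("containers negotiated")
        results = [node.launch_container(task)     # steps 5 and 6
                   for node, task in zip(nodes, self.tasks)]
        log.append("containers launched")
        self.rm.unregister(self.app_name)          # step 8
        log.append("unregistered")
        return results


def submit_application(resource_manager, app_name, tasks):
    log = ["submitted"]                            # step 1
    log.append("application master started")       # step 2
    master = ApplicationMaster(app_name, tasks, resource_manager)
    results = master.run(log)
    return log, results


rm = ResourceManager([NodeManager("node-1"), NodeManager("node-2")])
log, results = submit_application(rm, "wordcount", ["map-0", "map-1", "reduce-0"])
print(log)
print(results)
```

The printed log lists the steps in the order they occur, and the results show which node ran each task.
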

Advantages

 Flexibility: YARN offers flexibility to run various types of distributed

processing systems such as Apache Spark, Apache Flink, Apache Storm, and
others. It allows multiple processing engines to run simultaneously on a
single Hadoop cluster.

 Resource Management: YARN provides an efficient way of managing

resources in the Hadoop cluster. It allows administrators to allocate and
monitor the resources required by each application in a cluster, such as CPU,
memory, and disk space.

 Scalability: YARN is designed to be highly scalable and can handle

thousands of nodes in a cluster. It can scale up or down based on the
requirements of the applications running on the cluster.

 Improved Performance: YARN offers better performance by providing a

centralized resource management system. It ensures that the resources are
optimally utilized, and applications are efficiently scheduled on the available
resources.

 Security: YARN provides robust security features such as Kerberos

authentication, Secure Shell (SSH) access, and secure data transmission. It
ensures that the data stored and processed on the Hadoop cluster is secure.

Disadvantages

 Complexity: YARN adds complexity to the Hadoop ecosystem. It requires

additional configurations and settings, which can be difficult for users who
are not familiar with YARN.
 Overhead: YARN introduces additional overhead, which can slow down the
performance of the Hadoop cluster. This overhead is required for managing
resources and scheduling applications.

 Latency: YARN introduces additional latency in the Hadoop ecosystem.

This latency can be caused by resource allocation, application scheduling,
and communication between components.

 Single Point of Failure: The Resource Manager can be a single point of
failure in the Hadoop cluster. If it fails, the cluster can no longer accept or
schedule applications. To avoid this, administrators need to configure a standby
Resource Manager for high availability.

 Limited Support: YARN has limited support for non-Java programming

languages. Although it supports multiple processing engines, some engines
have limited language support, which can limit the usability of YARN in
certain environments.
