BIG DATA ANALYTICS
Module-1
Big Data
Big Data refers to massive datasets that grow exponentially and come from a
variety of sources, presenting challenges in handling, processing, and analysis.
These datasets can be structured, unstructured, or semi-structured. To effectively
manage this data, Hadoop comes into the picture. Let's dive into Big Data and how
Hadoop revolutionizes data processing.
What is Big Data?
Big Data is characterized by its large volume, high velocity, and diverse variety,
making it difficult to process with traditional tools.
Big Data Analytics uses advanced analytical methods to extract important
business insights from bulk datasets. These datasets contain both structured
(organized) and unstructured (unorganized) data.
Its applications cover different industries such as healthcare, education,
insurance, AI, retail, and manufacturing.
What is Big-Data Analytics?
Big Data Analytics is all about crunching massive amounts of information to
uncover hidden trends, patterns, and relationships. It's like sifting through a giant
mountain of data to find the gold nuggets of insight.
Here's a breakdown of what it involves:
Collecting Data: Data comes from various sources such as social
media, web traffic, sensors, and customer reviews.
Cleaning the Data: Imagine having to assess a pile of rocks that included
some gold pieces in it. You would have to clean the dirt and the debris first.
When data is being cleaned, mistakes must be fixed, duplicates must be
removed and the data must be formatted properly.
Analyzing the Data: It is here that the wizardry takes place. Data analysts
employ powerful tools and techniques to discover patterns and trends. It is
the same thing as looking for a specific pattern in all those rocks that you
sorted through.
How does big data analytics work?
Big Data Analytics is a powerful tool that helps unlock the potential of large and
complex datasets. To understand it better, let's break it down into key steps:
Data Collection: Data is the core of Big Data Analytics. This step gathers
data from different sources such as customer comments, surveys,
sensors, and social media. The primary aim of data collection is to
compile as much accurate data as possible, because more accurate data
supports more insights.
Data Cleaning (Data Preprocessing): The next step is to process this
information. It often requires some cleaning. This entails the replacement of
missing data, the correction of inaccuracies, and the removal of duplicates. It
is like sifting through a treasure trove, separating the rocks and debris and
leaving only the valuable gems behind.
Data Processing: Next comes data processing. This step includes important
stages such as writing, structuring, and formatting the data so that it is
usable for analysis. It is like a chef gathering the ingredients before
cooking. Data processing turns the data into a format that analytics tools
can process.
Data Analysis: Statistical, mathematical, and machine learning methods are
applied to extract the most important findings from the processed data. For
example, analysis can uncover customer preferences, market trends, or
patterns in healthcare data.
Data Visualization: The results of data analysis are usually presented in
visual form, for example as charts, graphs, and interactive dashboards.
Visualizations simplify large amounts of data and allow decision makers to
quickly detect patterns and trends.
Data Storage and Management: Storing and managing the analyzed data
properly is of utmost importance. It is like digital scrapbooking: you may
want to go back to those lessons in the long run, so how you store them
matters. Moreover, data protection and adherence to regulations are key
issues to address during this stage.
Continuous Learning and Improvement: Big data analytics is a
continuous process of collecting, cleaning, and analyzing data to uncover
hidden insights. It helps businesses make better decisions and gain a
competitive edge.
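The cleaning step described above can be sketched in a few lines of Python.
This is only an illustrative sketch: the record fields (customer, rating), the
sample values, and the choice to fill a missing rating with the mean of the
known ratings are assumptions made for the example, not part of any specific
tool.

```python
from statistics import mean

# Hypothetical raw records: one exact duplicate and one missing rating.
raw_records = [
    {"customer": "A", "rating": 4},
    {"customer": "B", "rating": None},
    {"customer": "A", "rating": 4},  # duplicate of the first record
    {"customer": "C", "rating": 2},
]


def clean(records):
    """Remove exact duplicates, then fill missing ratings with the mean."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["customer"], rec["rating"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(rec))  # copy so the input is left untouched
    known = [r["rating"] for r in unique if r["rating"] is not None]
    fill = mean(known)  # assumes at least one rating is present
    for rec in unique:
        if rec["rating"] is None:
            rec["rating"] = fill
    return unique


cleaned = clean(raw_records)
print(cleaned)
```

Real pipelines usually need domain-specific rules for what counts as a
duplicate and how a missing value should be handled.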
Types of Big Data Analytics
Big Data Analytics comes in many different types, each serving a different
purpose:
1. Descriptive Analytics: This type helps us understand past events. In social
media, it shows performance metrics, like the number of likes on a post.
2. Diagnostic Analytics: Diagnostic analytics delves deeper to uncover the
reasons behind past events. In healthcare, it identifies the causes of high
patient re-admissions.
3. Predictive Analytics: Predictive analytics forecasts future events based on
past data. Weather forecasting, for example, predicts tomorrow's weather by
analyzing historical patterns.
4. Prescriptive Analytics: This category not only predicts results but
also recommends actions to achieve the best results. In e-commerce, it may
suggest the best price for a product to achieve the highest possible
profit.
5. Real-time Analytics: The key function of real-time analytics is
processing data as it arrives. It allows traders to make decisions swiftly
based on real-time market events.
6. Spatial Analytics: Spatial analytics is about location data. In urban
management, it optimizes traffic flow using data from sensors and
cameras to minimize traffic jams.
7. Text Analytics: Text analytics delves into the unstructured data of text. In
the hotel business, it can use the guest reviews to enhance services and guest
satisfaction.
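To make the difference between descriptive and predictive analytics concrete,
here is a small sketch. The daily like counts are invented for illustration,
and the straight-line (least-squares) fit is just one simple forecasting
choice assumed for this example.

```python
from statistics import mean

likes_per_day = [10, 12, 14, 16]  # hypothetical likes on days 0..3

# Descriptive analytics: summarize what already happened.
average_likes = mean(likes_per_day)

# Predictive analytics: fit a least-squares line and extrapolate one day ahead.
days = list(range(len(likes_per_day)))
x_bar, y_bar = mean(days), mean(likes_per_day)
slope = sum(
    (x - x_bar) * (y - y_bar) for x, y in zip(days, likes_per_day)
) / sum((x - x_bar) ** 2 for x in days)
intercept = y_bar - slope * x_bar
forecast_next_day = intercept + slope * len(likes_per_day)

print(average_likes, forecast_next_day)
```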
Big Data Analytics Technologies and Tools
Big Data Analytics relies on various technologies and tools that might sound
complex, so let's simplify them:
Hadoop: Imagine Hadoop as an enormous digital warehouse. It's used by
companies like Amazon to store tons of data efficiently. For instance, when
Amazon suggests products you might like, it's because Hadoop helps
manage your shopping history.
Spark: Think of Spark as the super-fast data chef. Netflix uses it to quickly
analyze what you watch and recommend your next binge-worthy show.
NoSQL Databases: NoSQL databases, like MongoDB, are like digital filing
cabinets that Airbnb uses to store your booking details and user data. These
databases are popular because they are quick and flexible, so the platform
can provide you with the right information when you need it.
Tableau: Tableau is like an artist that turns data into beautiful pictures. The
World Bank uses it to create interactive charts and graphs that help people
understand complex economic data.
Python and R: Python and R are like magic tools for data scientists. They
use these languages to solve tricky problems. For example, data scientists
on Kaggle use them to predict things like house prices based on past data.
Machine Learning Frameworks (e.g., TensorFlow): Machine
learning frameworks are the tools that make predictions. Airbnb
uses TensorFlow to predict which properties are most likely to be booked in
certain areas. It helps hosts make smart decisions about pricing and
availability.
These tools and technologies are the building blocks of Big Data Analytics. They
help organizations gather, process, understand, and visualize data, making it
easier for them to make decisions based on information.
Benefits of Big Data Analytics
Big Data Analytics offers a host of real-world advantages. Let's understand them
with examples:
1. Informed Decisions: Imagine a store like Walmart. Big Data Analytics
helps them make smart choices about what products to stock. This not only
reduces waste but also keeps customers happy and profits high.
2. Enhanced Customer Experiences: Think about Amazon. Big Data
Analytics is what makes those product suggestions so accurate. It's like
having a personal shopper who knows your taste and helps you find what
you want.
3. Fraud Detection: Credit card companies, like MasterCard, use Big Data
Analytics to catch and stop fraudulent transactions. It's like having a
guardian that watches over your money and keeps it safe.
4. Optimized Logistics: FedEx, for example, uses Big Data Analytics to
deliver your packages faster and with less impact on the environment. It's
like taking the fastest route to your destination while also being kind to the
planet.
Challenges of Big data analytics
While Big Data Analytics offers incredible benefits, it also comes with its set of
challenges:
Data Overload: Consider Twitter, where approximately 6,000 tweets are
posted every second. The challenge is sifting through this avalanche of data
to find valuable insights.
Data Quality: If the input data is inaccurate or incomplete, the insights
generated by Big Data Analytics can be flawed. For example, incorrect
sensor readings could lead to wrong conclusions in weather forecasting.
Privacy Concerns: With the vast amount of personal data used, like in
Facebook's ad targeting, there's a fine line between providing personalized
experiences and infringing on privacy.
Security Risks: With cyber threats increasing, safeguarding sensitive data
becomes crucial. For instance, banks use Big Data Analytics to detect
fraudulent activities, but they must also protect this information from
breaches.
Costs: Implementing and maintaining Big Data Analytics systems can be
expensive. Airlines like Delta use analytics to optimize flight schedules, but
they need to ensure that the benefits outweigh the costs.
Usage of Big Data Analytics
Big Data Analytics has a significant impact in various sectors:
Healthcare: It aids in precise diagnoses and disease prediction, elevating
patient care.
Retail: Amazon's use of Big Data Analytics offers personalized product
recommendations based on your shopping history, creating a more tailored
and enjoyable shopping experience.
Finance: Credit card companies such as Visa rely on Big Data Analytics to
swiftly identify and prevent fraudulent transactions, ensuring the safety of
your financial assets.
Transportation: Companies like Uber use Big Data Analytics to optimize
drivers' routes and predict demand, reducing wait times and improving
overall transportation experiences.
Agriculture: Farmers make informed decisions, boosting crop yields while
conserving resources.
Manufacturing: Companies like General Electric (GE) use Big Data
Analytics to predict machinery maintenance needs, reducing downtime and
enhancing operational efficiency.
Conclusion
Big Data Analytics is a game-changer that's shaping a smarter future. From
improving healthcare and personalizing shopping to securing finances and
predicting demand, it's transforming various aspects of our lives. However,
challenges like managing overwhelming data and safeguarding privacy are real
concerns. In our world flooded with data, Big Data Analytics acts as a guiding
light. It helps us make smarter choices, offers personalized experiences, and
uncovers valuable insights. It's a powerful and stable tool that promises a better
and more efficient future for everyone.
What is Unstructured Data?
Unstructured data refers to information that does not have a predefined format or
structure. It is messy, unorganized and hard to sort. Unlike structured data, which is
organized into rows and columns (like an Excel sheet), unstructured data comes in
many different forms such as text documents, images, audio files, videos and social
media posts. Because this type of data does not follow a clear pattern, it’s harder to
store, process and search.
Characteristics of Unstructured Data
Lack of Format: Unstructured data does not fit neatly into tables or
databases. It can be textual or non-textual, making it difficult to categorize
and organize.
Variety: This type of data can include a wide range of formats, such as:
o Text documents (e.g., emails, reports, articles)
o Multimedia files (e.g., images, audio, video)
o Social media content (e.g., posts, comments, tweets)
o Web pages and blogs
Volume: Unstructured data represents a significant portion of the data
generated today. It is often larger in volume compared to structured data.
Diverse Sources: It can originate from various sources, including user-
generated content, sensor data, customer interactions and more.
Importance of unstructured Data
Even though unstructured data is harder to deal with, it is extremely valuable.
The points below show why:
It helps businesses understand their customers better. For example,
businesses can learn what customers think about their products by reading
reviews or social media posts.
It contains real world insights, like what people are talking about online or
what videos are trending.
It’s growing rapidly. More and more of the data created today is
unstructured, such as photos, tweets and videos.
Examples of Unstructured Data
Unstructured data can come in many different forms. Here are some examples:
Social Media: Posts, tweets, comments and pictures on Facebook,
Instagram, or Twitter
Emails: Your inbox full of messages, attachments and conversations
Photos & Videos: Pictures on your phone or videos on YouTube
Audio Files: Podcasts, voice messages, music files
Documents: Reports, articles, PDFs, or Word files
Websites & Blogs: Articles, reviews and posts online
Extracting Information from Unstructured Data
Unstructured data does not have any structure, so it cannot easily be interpreted
by conventional algorithms. It is also difficult to tag and index. Extracting
information from it is therefore a tough job. However, there are ways to
organize and extract useful information from it:
Tagging: We can label or tag data with keywords. For example, a photo of a
dog might be tagged with the words “dog,” “pet,” or “animal” so it can be
found easily later.
Classifying Data: This is like organizing things into groups. For example,
grouping customer reviews into positive or negative feedback. This makes it
easier to search and analyze.
Data Mining: This technique helps find patterns in unstructured data. For
example, analyzing customer reviews to see common complaints or finding
patterns in social media posts to predict trends.
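The tagging and classifying ideas above can be sketched with simple keyword
matching. This is a deliberately naive illustration: the keyword lists, the
function names classify and tag, and the sample sentences are all assumptions
for the example. Practical systems typically rely on NLP models rather than
fixed word lists.

```python
import re

# Assumed keyword lists, chosen only for this illustration.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "poor", "broken", "late"}


def words_of(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def classify(review):
    """Group a review as positive, negative, or neutral by keyword counts."""
    words = words_of(review)
    pos = len(words & POSITIVE)
    neg = len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


def tag(text, vocabulary):
    """Return the vocabulary terms that occur in the text, sorted."""
    return sorted(words_of(text) & vocabulary)


print(classify("Great room, I love the view"))
print(tag("A photo of my dog, the best pet", {"dog", "pet", "animal"}))
```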
Storing Unstructured Data
Unstructured data can be stored in several ways:
It can be converted to easily manageable formats.
It can be stored in a content addressable storage (CAS) system. A CAS
stores data based on its metadata and assigns a unique name to every
object stored in it. An object is retrieved based on its content, not its
location.
It can be stored in XML format.
It can be stored in an RDBMS that supports BLOBs.
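The idea behind content addressable storage can be sketched as follows. This
is a minimal in-memory illustration, not a real CAS product: the class name
ContentAddressableStore is hypothetical, and using a SHA-256 digest of the
content as the object's unique name is an assumption made for the example.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store where the address is derived from the content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The unique name is the SHA-256 digest of the content itself.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]


store = ContentAddressableStore()
addr1 = store.put(b"holiday photo bytes")
addr2 = store.put(b"holiday photo bytes")  # same content, same address
print(addr1 == addr2)
```

Because the address depends only on the content, storing the same object twice
does not create a second copy.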
Unstructured Data vs Structured Data
Structured data is neatly organized into rows and columns, much like a
spreadsheet or a database. For instance, a table listing people's names, ages and
addresses is structured data: it follows a clear format and is easy to search or
analyze.
Unstructured data, on the other hand, doesn’t follow a set structure. It includes
things like photos, videos, audio clips or tweets. There's no consistent format,
which makes it harder to organize or process.
Feature | Structured Data | Unstructured Data
Format | Organized in rows and columns (e.g., tables, spreadsheets). | No fixed format or predefined structure.
Examples | Names, ages and addresses in a database. | Photos, videos, emails, social media posts.
Storage | Stored in relational databases (e.g., SQL). | Stored in files, cloud storage, or NoSQL databases.
Ease of Analysis | Easy to search, sort and analyze with tools. | Requires advanced processing (e.g., NLP, image recognition).
Data Type | Text and numbers in a predictable format. | Mixed data types: text, audio, video, etc.
Real-World Analogy | A neatly arranged bookshelf with categorized books. | A scattered pile of books, photos, papers and sticky notes.
Applications
Unstructured data is already being used across industries:
Healthcare: Doctors use unstructured patient records, lab notes and imaging
reports to diagnose and personalize treatment.
Retail: Analyzing customer reviews and social media comments to improve
product quality and customer experience.
Finance: Processing news feeds, analyst reports and customer emails to
manage risk and improve investment decisions.
Legal: Automating document review and e-discovery in law firms through
text mining.
Media & Entertainment: Recommending content based on viewing habits,
comments and user preferences.
Challenges with Unstructured Data
There are a few challenges with unstructured data that make it difficult to manage:
Hard to Store: Since unstructured data comes in so many different formats
(like images or audio), it takes up a lot of space to store. You need big
storage systems to hold it all.
Difficult to Search: Without labels or organization, it’s hard to find specific
information in unstructured data. For example, if you have thousands of
tweets, finding one tweet might be tricky.
Hard to Analyze: Unlike structured data, which is easy to analyze using
simple tools, unstructured data requires special software and complex
techniques to make sense of it.
What is Semi-structured data?
Semi-structured data is data that does not reside in a traditional relational
database (like SQL) but still has some organizational properties, such as
tags or markers, that make it easier to analyze than completely unstructured
data.
It doesn't follow a strict schema like structured data, but it still contains
elements like labels or keys that make the data identifiable and searchable.
Characteristics of Semi-Structured Data
1. Flexible Schema: The structure can vary from one entry to another. For
example, one JSON object may have five fields while another has only
three.
2. Human-Readable Format: Many types like XML or JSON are easy for
humans and machines to understand.
3. Scalable: Easily handled by modern NoSQL databases, making it great for
Big Data environments.
4. Metadata-Rich: Tags and attributes provide context that helps with sorting
and analysis.
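The flexible schema described above can be seen in a short JSON example. The
product records and field names below are invented for illustration. The first
object has five fields and the second has only three, yet both can be loaded
and queried in the same way.

```python
import json

# Hypothetical product records with different sets of fields.
raw = """
[
  {"id": 1, "name": "Laptop", "price": 900, "ram_gb": 16, "color": "grey"},
  {"id": 2, "name": "Notebook", "price": 3}
]
"""

products = json.loads(raw)

# The keys act as labels that make each value identifiable and searchable.
field_counts = [len(p) for p in products]
all_fields = sorted({key for p in products for key in p})

# A missing field is handled with a default instead of a schema change.
colors = [p.get("color", "unknown") for p in products]

print(field_counts, all_fields, colors)
```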
Importance of Semi-Structured Data
As data becomes more complex and varied, semi-structured formats offer a
balance between flexibility and manageability. They allow organizations to
store and process different types of information in one place, making it
easier to handle diverse data formats. Additionally, semi-structured data
enables quick adaptation to new data sources without the need to redesign
existing databases. This flexibility supports more efficient data analysis
and integration, especially when combining structured and unstructured
data, making it a valuable asset in modern data-driven environments.
Examples of Semi-Structured Data:
JSON (JavaScript Object Notation)
XML (eXtensible Markup Language)
CSV files with inconsistent rows
Emails (with structured headers and unstructured body text)
Sensor data from IoT devices
HTML web pages
Extracting Information from Semi-Structured Data
Semi-structured data from different sources has different structures
because the sources are heterogeneous. Sometimes the data does not contain
any structure at all. This makes it difficult to tag and index, so
extracting information from it is a tough job. Here are possible solutions:
Graph-based models (e.g., OEM) can be used to index semi-structured data.
The data modelling technique in OEM allows the data to be stored in a
graph-based model. Data in a graph-based model is easier to search and index.
XML allows data to be arranged in hierarchical order, which enables the
data to be indexed and searched.
Various data mining tools can be used.
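As an illustration of how the hierarchical structure of XML helps with
indexing, the sketch below builds a small lookup table from an XML document
using Python's standard xml.etree.ElementTree module. The patient records,
tag names, and ids are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: the second patient has no <allergy> element.
doc = """
<patients>
  <patient id="p1"><name>Asha</name><allergy>penicillin</allergy></patient>
  <patient id="p2"><name>Ravi</name></patient>
</patients>
"""

root = ET.fromstring(doc.strip())

# Build an index from patient id to that patient's fields.
index = {}
for patient in root.findall("patient"):
    fields = {child.tag: child.text for child in patient}
    index[patient.get("id")] = fields

print(index["p1"]["allergy"])
print("allergy" in index["p2"])
```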
Semi-Structured Data Management
Unlike structured data, semi-structured data is best managed using NoSQL
databases or document stores. Popular technologies include:
MongoDB: A document-based NoSQL database that works well with
JSON-like formats.
Cassandra: Handles wide-column data with semi-structured schema
design.
Elasticsearch: Can index and search through semi-structured log files and
documents.
Cloud Storage (e.g. AWS S3, Azure Blob): Used to store large volumes
of semi-structured data like logs, emails, and telemetry data.
Applications
Semi-structured data is used across various industries:
E-commerce: Product catalogs stored in JSON format, allowing flexibility
in item attributes.
Healthcare: Patient forms and reports stored in XML with variable fields.
IoT and Smart Devices: Sensor data captured in key-value formats.
Web Development: HTML and JSON used to render dynamic content on
websites.
Social Media Platforms: User activity and messages logged in semi-
structured logs.
Challenges
Despite its flexibility, semi-structured data comes with a few challenges:
Complex Querying: Not as straightforward as SQL queries on structured
data.
Data Cleaning: Irregular structure may lead to inconsistency and harder
integration.
Tool Compatibility: Not all analytics tools support semi-structured
formats out of the box.
What is Hadoop ?
Hadoop is an open-source framework written in Java that allows distributed
storage and processing of large datasets. Before Hadoop, traditional
systems were limited to processing structured data mainly using RDBMS
and couldn't handle the complexities of Big Data. In this section we will
learn how Hadoop offers a solution to handle Big Data.
Components of Hadoop
In this section, we will explore HDFS for distributed and fault-tolerant data
storage, MapReduce programming model for data processing and YARN
for resource management and job scheduling in a Hadoop cluster.
Hadoop - HDFS (Hadoop Distributed File System)
Before learning about HDFS (Hadoop Distributed File System), we should
know what a file system actually is. A file system is a data structure or
method that an operating system uses to manage files on disk space. It
allows the user to store, maintain, and retrieve data from the local disk.
Examples of Windows file systems are NTFS (New Technology File
System) and FAT32 (File Allocation Table 32). FAT32 is used in some
older versions of Windows but can also be used on all versions of Windows
XP. Similarly, Linux has file systems such as ext3 and ext4.
What is DFS?
DFS stands for distributed file system. It is the concept of storing a file
across multiple nodes in a distributed manner. A DFS provides the
abstraction of a single large system whose storage is equal to the sum of
the storage of the nodes in the cluster.
Let's understand this with an example. Suppose you have a DFS made up
of 4 different machines, each with 10TB of storage. In that case you can
store, say, 30TB across this DFS, because it presents a combined machine of
size 40TB. The 30TB of data is distributed among these nodes in the form of
blocks.
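The capacity arithmetic in this example can be sketched as follows. The node
names, the 1TB block size, and the round-robin placement are simplifying
assumptions chosen to keep the numbers readable. They are not how a
particular DFS places blocks.

```python
NODE_CAPACITY_TB = 10
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical node names

# The DFS presents the sum of the node capacities as one large store.
total_capacity_tb = NODE_CAPACITY_TB * len(nodes)


def distribute(data_tb, block_tb=1):
    """Spread fixed-size blocks across the nodes in round-robin order."""
    stored = {node: 0 for node in nodes}
    for i in range(data_tb // block_tb):
        stored[nodes[i % len(nodes)]] += block_tb
    return stored


stored_per_node = distribute(30)
print(total_capacity_tb, stored_per_node)
```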
Why We Need DFS?
You might think that a 30TB file could simply be stored on a single
system, so why do we need a DFS? The reason is that the disk capacity of a
single system can only be increased up to a limit. Even if you manage to
store the data on one system, you then face a processing problem:
processing large datasets on a single machine is not efficient.
Let's understand this with an example. Suppose you have a 40TB file to
process, and suppose a single machine takes 4 hours to process it
completely. With a DFS (Distributed File System), the 40TB file is
distributed among the 4 nodes in a cluster, and each node stores 10TB of
the file. Because all these nodes work simultaneously, processing takes
only about 1 hour in the ideal case, which is much faster. That is why we
need a DFS.
[Figure: Local File System Processing]
[Figure: Distributed File System Processing]
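The speed-up in this example can be sketched in code. The estimate below is
idealized: it assumes the work splits perfectly and ignores coordination and
network overhead. The list of numbers merely stands in for the 40TB file, and
threads stand in for the four nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def ideal_hours(single_machine_hours, node_count):
    """Idealized estimate: perfect split, no coordination overhead."""
    return single_machine_hours / node_count


def process_chunk(chunk):
    return sum(chunk)  # stand-in for the real per-node work


data = list(range(1, 41))                # stand-in for the 40TB file
chunks = [data[i::4] for i in range(4)]  # one chunk per node

# Each worker processes its own chunk at the same time as the others.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_chunk, chunks))

total = sum(partial_results)
print(ideal_hours(4, 4), total)
```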
Overview - HDFS
Now that the term file system is familiar, let's begin with HDFS. HDFS (Hadoop
Distributed File System) is used for storage in a Hadoop cluster. It is mainly
designed to run on commodity hardware (inexpensive devices) and follows a
distributed file system design. HDFS is designed to store data in a few large
blocks rather than in many small ones. HDFS provides fault tolerance and high
availability to the storage layer and to the other devices present in the
Hadoop cluster.
HDFS can handle large data with high volume, velocity, and variety, which makes
Hadoop work more efficiently and reliably, with easy access to all its
components. HDFS stores data in blocks, where each block is 128MB by default.
The size is configurable, which means you can change it according to your
requirement in the [Link] file in your Hadoop directory.
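As a quick illustration of block-based storage, the sketch below splits a file
into blocks of the 128MB size mentioned above. The function name
split_into_blocks and the 500MB file size are assumptions for the example.

```python
BLOCK_SIZE_MB = 128  # block size stated in the text


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks needed to hold a file."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only what is left
    return sizes


blocks = split_into_blocks(500)
print(len(blocks), blocks)
```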
Some Important Features of HDFS(Hadoop Distributed File System)
It's easy to access the files stored in HDFS.
HDFS also provides high availability and fault tolerance.
Provides scalability to scale nodes up or down as per requirement.
Data is stored in a distributed manner, i.e. various DataNodes are
responsible for storing the data.
HDFS provides replication, which protects against data loss.
HDFS provides high reliability, as it can store data in the range of
petabytes.
HDFS has built-in servers in the NameNode and DataNode that make it easy
to retrieve cluster information.
Provides high throughput.
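The link between replication and fault tolerance can be sketched as follows.
This is a toy model: the replication factor of 3, the DataNode names, and the
simple rotating placement are assumptions for illustration, not the actual
HDFS placement policy.

```python
REPLICATION = 3  # assumed replication factor for this sketch

datanodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names


def place_replicas(block_ids, nodes, replication=REPLICATION):
    """Put each block on `replication` distinct nodes, rotating the start."""
    return {
        block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i, block in enumerate(block_ids)
    }


def readable_blocks(replica_map, failed):
    """A block stays readable while at least one replica is on a live node."""
    return [
        block
        for block, holders in replica_map.items()
        if any(node not in failed for node in holders)
    ]


replica_map = place_replicas(["blk_1", "blk_2", "blk_3"], datanodes)
print(readable_blocks(replica_map, failed={"dn1", "dn2"}))
```

In this sketch every block remains readable even after two of the four
DataNodes fail, because each block has a copy on a surviving node.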
HDFS Storage Daemons
Hadoop MapReduce follows a master-slave architecture, and HDFS has a NameNode
and DataNodes that work in a similar pattern.
1. NameNode(Master)
2. DataNode(Slave)
1. NameNode: The NameNode works as the master in a Hadoop cluster and guides
the DataNodes (slaves). The NameNode is mainly used for storing metadata, i.e.
data about the data. Metadata can be the transaction logs that keep track of
user activity in the Hadoop cluster.
Metadata can also be the name of a file, its size, and information about the
location (block numbers, block IDs) of its blocks on the DataNodes, which the
NameNode stores so that it can find the closest DataNode for faster
communication. The NameNode instructs the DataNodes to perform operations such
as delete, create, and replicate.
Because the NameNode works as the master, it should have high RAM and
processing power in order to maintain and guide all the slaves in the Hadoop
cluster. The NameNode receives heartbeat signals and block reports from all the
slaves, i.e. the DataNodes.
2. DataNode: DataNodes work as slaves. They are mainly used for storing the
data in a Hadoop cluster. The number of DataNodes can range from 1 to 500 or
even more; the more DataNodes a Hadoop cluster has, the more data it can
store. It is therefore advised that each DataNode have a high storage capacity
to hold a large number of file blocks. A DataNode performs operations such as
creation and deletion according to the instructions provided by the
NameNode.
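The division of labour between the NameNode and the DataNodes can be sketched
with a toy model. This is not Hadoop code: the class below, its method names,
the file and block names, and the 30-second heartbeat timeout are all
hypothetical and exist only to illustrate metadata tracking and heartbeats.

```python
class NameNode:
    """Toy master that keeps metadata only; it stores no file contents."""

    def __init__(self, heartbeat_timeout=30):
        self.block_locations = {}  # file name -> list of (block id, DataNode)
        self.last_heartbeat = {}   # DataNode -> time of its last heartbeat
        self.heartbeat_timeout = heartbeat_timeout  # assumed value, seconds

    def register_block(self, file_name, block_id, datanode):
        self.block_locations.setdefault(file_name, []).append((block_id, datanode))

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def live_datanodes(self, now):
        """DataNodes whose last heartbeat is within the timeout."""
        return sorted(
            dn
            for dn, seen in self.last_heartbeat.items()
            if now - seen <= self.heartbeat_timeout
        )


nn = NameNode()
nn.register_block("sales.csv", "blk_1", "dn1")
nn.register_block("sales.csv", "blk_2", "dn2")
nn.heartbeat("dn1", now=0)
nn.heartbeat("dn2", now=0)
nn.heartbeat("dn1", now=40)  # dn2 has gone silent since time 0

print(nn.block_locations["sales.csv"])
print(nn.live_datanodes(now=45))
```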
Objectives and Assumptions Of HDFS
1. System Failure: A Hadoop cluster consists of many nodes built from
commodity hardware, so node failure is possible. A fundamental goal of HDFS is
therefore to detect such failures and recover from them.
2. Maintaining Large Dataset: HDFS handles files ranging in size from GB to
PB, so it has to be capable of dealing with these very large data sets on a
single cluster.
3. Moving Data is Costlier than Moving the Computation: If a computation is
performed near the location where the data is stored, it runs faster, the
overall throughput of the system increases, and network congestion is
minimized.
4. Portable Across Various Platforms: HDFS possesses portability, which allows
it to move across diverse hardware and software platforms.
5. Simple Coherency Model: HDFS needs a write-once-read-many access model for
files. A file that has been written and then closed should not be changed;
data can only be appended. This assumption helps minimize data coherency
issues. MapReduce fits well with this kind of file model.
6. Scalability: HDFS is designed to be scalable as the data storage requirements
increase over time. It can easily scale up or down by adding or removing nodes to
the cluster. This helps to ensure that the system can handle large amounts of data
without compromising performance.
7. Security: HDFS provides several security mechanisms to protect data stored on
the cluster. It supports authentication and authorization mechanisms to control
access to data, encryption of data in transit and at rest, and data integrity checks to
detect any tampering or corruption.
8. Data Locality: HDFS aims to move the computation to where the data resides
rather than moving the data to the computation. This approach minimizes network
traffic and enhances performance by processing data on local nodes.
9. Cost-Effective: HDFS can run on low-cost commodity hardware, which makes
it a cost-effective solution for large-scale data processing. Additionally, the ability
to scale up or down as required means that organizations can start small and
expand over time, reducing upfront costs.
10. Support for Various File Formats: HDFS is designed to support a wide range
of file formats, including structured, semi-structured, and unstructured data. This
makes it easier to store and process different types of data using a single system,
simplifying data management and reducing costs.
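The data locality objective in the list above can be sketched as a small
scheduling rule: run the task on a node that already holds the block if one is
free, and fall back to another node only otherwise. The function pick_node,
the node names, and the fallback rule are assumptions made for illustration.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Prefer a free node that already stores the block; else any free node.

    Assumes at least one node is free.
    """
    local = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local:
        return local[0], "local"       # computation moves to the data
    return sorted(free_nodes)[0], "remote"  # data must travel over the network


locations = {"blk_1": ["dn2", "dn3"]}  # hypothetical block placement
print(pick_node("blk_1", locations, {"dn1", "dn3"}))
print(pick_node("blk_1", locations, {"dn1", "dn4"}))
```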
Map Reduce and its Phases with numerical example.
MapReduce is a framework for writing applications that process huge amounts of
data in parallel on large clusters of commodity hardware in a reliable
manner.
Phases of MapReduce
The MapReduce model has three major phases and one optional phase.
1. Mapping
2. Shuffling and Sorting
3. Reducing
4. Combining (optional)
1) Mapping
It is the first phase of MapReduce programming. The Mapping phase accepts
key-value pairs (k, v) as input, where the key represents the address of each
record and the value represents the entire record. The output of the Mapping
phase is also in key-value format (k’, v’).
2) Shuffling and Sorting
The outputs of the various mappers (k’, v’) then go into the Shuffling and
Sorting phase. Duplicate keys are merged: all values that share the same key are
grouped together, and the keys are sorted. The output of the Shuffling and
Sorting phase is again key-value pairs, this time a key and an array of values
(k, v[ ]).
3) Reducing
The output of the Shuffling and Sorting phase (k, v[ ]) is the input of the
Reducing phase. In this phase the reducer function’s logic is executed on all
the values collected against each key. The reducer aggregates the outputs of
the various mappers and computes the final output.
4) Combining
It is an optional phase of MapReduce. The combiner is used to optimize the
performance of a MapReduce job: it aggregates each mapper's output locally
before it is passed on, which reduces the amount of data the Shuffling and
Sorting phase has to move and makes that phase quicker.
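The four phases can be sketched in plain Python with a word count. This is a
single-process illustration of the data flow, not Hadoop code: the function
names and the two input splits are assumptions for the example, and in a real
cluster each split would be handled by a separate mapper.

```python
from collections import defaultdict


def map_phase(line):
    """Mapping: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.lower().split()]


def combine_phase(pairs):
    """Combining (optional): pre-aggregate one mapper's output locally."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_and_sort(all_pairs):
    """Shuffling and Sorting: group values by key, then sort by key."""
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reducing: collapse the list of values for a key into one result."""
    return key, sum(values)


splits = ["big data big insights", "big data tools"]  # hypothetical input
mapped = [map_phase(s) for s in splits]
combined = [combine_phase(m) for m in mapped]
shuffled = shuffle_and_sort(pair for part in combined for pair in part)
word_counts = dict(reduce_phase(k, v) for k, v in shuffled)
print(word_counts)
```

With the combiner, the first mapper sends a single ("big", 2) pair instead of
two ("big", 1) pairs, so less data reaches the shuffle step.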
Numerical Example
We will be using MovieLens Data.
USER_ID MOVIE_ID RATING TIMESTAMP
196 242 3 881250949
186 302 3 891717742
196 377 1 878887116
244 51 2 880606923
166 346 1 886397596
186 474 4 884182806
186 265 2 881171488
Solution
Step 1: First we map the values; this happens in the first phase of the
MapReduce model. Each record is mapped to a USER_ID:MOVIE_ID pair.
196:242 ; 186:302 ; 196:377 ; 244:51 ; 166:346 ; 186:474 ; 186:265
Step 2: After mapping, we shuffle and sort the values.
166:346 ; 186:302,474,265 ; 196:242,377 ; 244:51
Step 3: After completing steps 1 and 2, we reduce each key's values,
combining all the values collected for a key into a single result.
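The three steps can be reproduced in Python using the rows of the table above.
The text does not state which reduce operation is applied in Step 3, so this
sketch assumes the reducer counts how many movies each user rated. That
choice, and the variable names, are assumptions for illustration.

```python
from collections import defaultdict

# Rows from the table: (USER_ID, MOVIE_ID, RATING, TIMESTAMP).
ratings = [
    (196, 242, 3, 881250949),
    (186, 302, 3, 891717742),
    (196, 377, 1, 878887116),
    (244, 51, 2, 880606923),
    (166, 346, 1, 886397596),
    (186, 474, 4, 884182806),
    (186, 265, 2, 881171488),
]

# Step 1: map each record to a (USER_ID, MOVIE_ID) pair.
mapped = [(user, movie) for user, movie, _rating, _timestamp in ratings]

# Step 2: shuffle and sort, grouping movie ids by user id.
grouped = defaultdict(list)
for user, movie in mapped:
    grouped[user].append(movie)
shuffled = sorted(grouped.items())

# Step 3 (assumed reducer): count the movies rated by each user.
movies_per_user = {user: len(movies) for user, movies in shuffled}

print(shuffled)
print(movies_per_user)
```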
Common Use Cases of MapReduce
Counting word frequency
Log analysis
Indexing web pages
Processing large datasets for ETL (Extract, Transform, Load)
Recommendation systems and data mining
Advantages of MapReduce
Simple and easy abstraction for large-scale data processing
Efficient for batch processing of massive datasets
Fault-tolerant and scalable
Integrates well with Hadoop Distributed File System (HDFS)
Limitations of MapReduce
Not ideal for real-time processing
Complex data workflows can be hard to express
Debugging and testing are more challenging
High latency due to intermediate disk I/O (especially in the shuffle phase)
Hadoop YARN Architecture
YARN stands for "Yet Another Resource Negotiator". It was introduced in
Hadoop 2.0 to remove the Job Tracker bottleneck that was present in Hadoop
1.0. YARN was described as a "Redesigned Resource Manager" at the time of its
launch, but it has since evolved into a large-scale distributed operating
system for Big Data processing.
The YARN architecture separates the resource management layer from the
processing layer. The responsibilities that the Job Tracker held in Hadoop 1.0
are split between the resource manager and the application manager.
YARN also allows different data processing engines like graph processing,
interactive processing, stream processing as well as batch processing to run and
process data stored in HDFS (Hadoop Distributed File System) thus making the
system much more efficient. Through its various components, it can dynamically
allocate various resources and schedule the application processing. For large
volume data processing, it is quite necessary to manage the available resources
properly so that every application can leverage them.
YARN Features
YARN gained popularity because of the following features:
1. Scalability: The scheduler in the Resource Manager of the YARN architecture
allows Hadoop to extend to and manage thousands of nodes and clusters.
2. Compatibility: YARN supports existing MapReduce applications without
disruption, which makes it compatible with Hadoop 1.0 as well.
3. Cluster Utilization: YARN supports dynamic utilization of the cluster in
Hadoop, which enables optimized cluster utilization.
4. Multi-tenancy: It allows multiple processing engines to access the cluster,
giving organizations the benefit of multi-tenancy.
Hadoop YARN Architecture
The main components of YARN architecture include:
1) Client
The Client is the entity that initiates and submits the application (such as a
MapReduce job) to the YARN framework. It communicates with the Resource
Manager to request execution, monitors the job status, and can also interact with
the Application Master for progress updates. It essentially acts as the user's
interface to launch and manage applications on the Hadoop cluster.
2) Resource Manager
It is the master daemon of YARN and is responsible for resource assignment and
management among all the applications. Whenever it receives a processing request,
it forwards it to the corresponding Node Manager and allocates resources for the
completion of the request accordingly. It has two major components:
Scheduler: It performs scheduling based on the requirements of the submitted
applications and the available resources. It is a pure scheduler, meaning
that it does not perform other tasks such as monitoring or tracking and does
not guarantee a restart if a task fails. The YARN scheduler supports plugins
such as the Capacity Scheduler and the Fair Scheduler to partition the
cluster resources.
Application manager: It is responsible for accepting the application and
negotiating the first container from the Resource Manager, in which the
Application Master runs. It also restarts the Application Master container
if it fails.
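The idea of partitioning cluster resources between queues can be illustrated
with a toy calculation. This sketch is not the Capacity Scheduler algorithm; it
only assumes that each queue is given a fixed percentage of the cluster's
memory. The queue names, the percentages, and the function partition_cluster
are all hypothetical.

```python
def partition_cluster(total_memory_mb, queue_capacities):
    """Split cluster memory between queues by their configured percentage."""
    if sum(queue_capacities.values()) != 100:
        raise ValueError("queue capacities must add up to 100 percent")
    # Integer division keeps the shares in whole megabytes.
    return {
        queue: total_memory_mb * percent // 100
        for queue, percent in queue_capacities.items()
    }

# Hypothetical cluster with 64,000 MB of memory and three queues.
shares = partition_cluster(64000, {"prod": 60, "dev": 30, "adhoc": 10})
print(shares)
```

A real scheduler also has to deal with queues borrowing idle capacity,
preemption, and CPU cores, all of which this sketch leaves out.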
3) Node Manager
It takes care of an individual node in the Hadoop cluster and manages the
applications and workflow on that particular node. Its primary job is to keep up
to date with the Resource Manager. It registers with the Resource Manager and
sends heartbeats with the health status of the node. It monitors resource usage,
performs log management, and kills containers based on directions from the
Resource Manager. It is also responsible for creating the container process and
starting it at the request of the Application Master.
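The registration and heartbeat mechanism can be pictured with a toy model. The
class below is a simplified stand-in written for these notes, not a YARN API: it
assumes the Resource Manager remembers the time of each node's last heartbeat
and treats a node as lost once no heartbeat has arrived within an expiry
interval. Times are passed in explicitly to keep the example deterministic.

```python
class ResourceManagerStub:
    """Toy model of how a Resource Manager might track Node Manager liveness."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.last_seen = {}  # node id -> (time of last heartbeat, healthy flag)

    def register(self, node_id, now):
        """A Node Manager announces itself to the Resource Manager."""
        self.last_seen[node_id] = (now, True)

    def heartbeat(self, node_id, now, healthy=True):
        """A registered Node Manager reports its health status."""
        if node_id not in self.last_seen:
            raise KeyError(f"{node_id} has not registered")
        self.last_seen[node_id] = (now, healthy)

    def live_nodes(self, now):
        """Nodes that are healthy and have reported within the expiry interval."""
        return sorted(
            node_id
            for node_id, (seen_at, healthy) in self.last_seen.items()
            if healthy and now - seen_at <= self.expiry_seconds
        )

rm = ResourceManagerStub(expiry_seconds=600)
rm.register("node1", now=0)
rm.register("node2", now=0)
rm.heartbeat("node1", now=500)   # node2 stays silent
print(rm.live_nodes(now=700))
```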
4) Application Master
An application is a single job submitted to a framework. The Application Master
is responsible for negotiating resources with the Resource Manager and for
tracking the status and monitoring the progress of a single application. The
Application Master requests the container from the Node Manager by sending a
Container Launch Context (CLC), which includes everything an application needs
to run. Once the application is started, it sends a health report to the
Resource Manager from time to time.
5) Container
It is a collection of physical resources such as RAM, CPU cores, and disk on a
single node. Containers are launched using a Container Launch Context (CLC),
which is a record that contains information such as environment variables,
security tokens, dependencies, etc.
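A container and its launch context can be pictured as two plain records. The
dataclasses below are a simplified model for illustration only; the field names
are assumptions chosen to match the description above and do not reproduce the
actual YARN classes.

```python
from dataclasses import dataclass, field

@dataclass
class ContainerLaunchContext:
    """Simplified CLC: what a process needs in order to start."""
    commands: list
    environment: dict = field(default_factory=dict)
    security_tokens: list = field(default_factory=list)
    dependencies: list = field(default_factory=list)

@dataclass
class Container:
    """Simplified container: a slice of resources on one node."""
    node: str
    memory_mb: int
    cpu_cores: int
    launch_context: ContainerLaunchContext

# Hypothetical values throughout.
clc = ContainerLaunchContext(
    commands=["java -Xmx1g MyTask"],
    environment={"JAVA_HOME": "/usr/lib/jvm/java"},
    dependencies=["my-task.jar"],
)
container = Container(node="node1", memory_mb=2048, cpu_cores=1, launch_context=clc)
print(container)
```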
Application workflow in Hadoop YARN:
1. The Client submits an application
2. The Resource Manager allocates a container to start the Application
Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. The Client contacts the Resource Manager/Application Master to monitor
the application's status
8. Once the processing is complete, the Application Master un-registers with
the Resource Manager
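The order of these steps can be traced with a short simulation. It performs no
real scheduling and calls no YARN API; it only records, as strings, the events
that the workflow describes for one application, using the per-application
Application Master component described earlier. The function name
run_application and the event wording are hypothetical.

```python
def run_application(app_name, containers_needed):
    """Return the ordered list of events in one application's life cycle."""
    events = [
        f"Client submits {app_name}",
        "Resource Manager allocates a container for the Application Master",
        "Application Master registers with the Resource Manager",
    ]
    containers = [f"container_{i}" for i in range(1, containers_needed + 1)]
    events.append(
        f"Application Master negotiates {len(containers)} containers "
        "from the Resource Manager"
    )
    for container in containers:
        events.append(f"Node Manager launches {container}")
        events.append(f"Application code runs in {container}")
    events.append("Client asks for the application's status")
    events.append("Application Master un-registers with the Resource Manager")
    return events

events = run_application("wordcount", containers_needed=2)
for number, event in enumerate(events, start=1):
    print(number, event)
```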
Advantages
Flexibility: YARN offers flexibility to run various types of distributed
processing systems such as Apache Spark, Apache Flink, Apache Storm, and
others. It allows multiple processing engines to run simultaneously on a
single Hadoop cluster.
Resource Management: YARN provides an efficient way of managing
resources in the Hadoop cluster. It allows administrators to allocate and
monitor the resources required by each application in a cluster, such as CPU,
memory, and disk space.
Scalability: YARN is designed to be highly scalable and can handle
thousands of nodes in a cluster. It can scale up or down based on the
requirements of the applications running on the cluster.
Improved Performance: YARN offers better performance by providing a
centralized resource management system. It ensures that the resources are
optimally utilized, and applications are efficiently scheduled on the available
resources.
Security: YARN provides robust security features such as Kerberos
authentication, Secure Shell (SSH) access, and secure data transmission. It
ensures that the data stored and processed on the Hadoop cluster is secure.
Disadvantages
Complexity: YARN adds complexity to the Hadoop ecosystem. It requires
additional configurations and settings, which can be difficult for users who
are not familiar with YARN.
Overhead: YARN introduces additional overhead, which can slow down the
performance of the Hadoop cluster. This overhead is required for managing
resources and scheduling applications.
Latency: YARN introduces additional latency in the Hadoop ecosystem.
This latency can be caused by resource allocation, application scheduling,
and communication between components.
Single Point of Failure: The Resource Manager can be a single point of
failure in the Hadoop cluster. If it fails, applications can no longer be
scheduled on the cluster. To avoid this, administrators need to set up a
standby Resource Manager for high availability.
Limited Support: YARN has limited support for non-Java programming
languages. Although it supports multiple processing engines, some engines
have limited language support, which can limit the usability of YARN in
certain environments.