BIG DATA ANALYTICS
MODULE - 2
Dr. Prasanna Lakshmi G S,
Dept. of ISE,
SVIT.
Module 2
Chapter -1
Introduction to Hadoop
Introducing Hadoop
• Big Data is a trending topic as enterprises recognize the vast amount of untapped
structured, semi-structured, and unstructured data available across
networks.
• Data Generated:
Why Hadoop?
Hadoop is a popular technology because it can
quickly handle large and diverse data.
Its key advantages include:
• Low Cost – Open-source and uses inexpensive
hardware.
• High Computing Power – Uses distributed
computing to process large data efficiently.
• Scalability – Easily expands by adding more
nodes.
• Storage Flexibility – Stores any type of data
without pre-processing.
• Data Protection – Automatically manages
failures and stores multiple copies of data.
Why not RDBMS?
• RDBMS is not ideal for handling
large files, images, and videos
because it is designed for
structured data.
• It also struggles with advanced
analytics, like machine learning.
• As data grows, using RDBMS
becomes expensive due to high
storage and processing costs.
RDBMS vs Hadoop
Parameter RDBMS Hadoop
System Structured, table-based system Node-based flat structure
Data Suitable for structured data Supports structured and unstructured data
Handling (XML, JSON, text files, etc.)
Processing Used for OLTP (Online Transaction Used for big data and analytical processing
Processing)
Choice Best when data requires consistent Best for big data processing without strict
relationships. relationships.
For example, in a banking system, customer For example, analyzing large sets of social
data, account details, and transactions must media posts, images, or sensor data, where
be linked correctly to maintain accuracy. relationships between data points are not
strictly required.
Hardware Requires expensive, high-end hardware for Works with low-cost hardware (simple
Requireme large data storage processor, network card, and hard drives)
nt
Cost $10,000 to $14,000 per terabyte $4,000 per terabyte
Distributed Computing Challenges
• In a distributed system, multiple servers (computers) work together over a network. Since many hardware
components are involved, there is always a chance that some may fail.
Why does hardware failure happen?
• Every computer has a hard disk to store data. Normally, a single hard disk may fail once in 3 years. Now,
imagine a system with 1000 hard disks—it is likely that a few of them will fail every day. This can lead to
data loss if there is no backup.
How does Hadoop solve this?
• Hadoop prevents data loss using Replication Factor (RF).
What is Replication Factor?
Replication Factor means that multiple copies of the same data are stored in different locations (servers)
within the system. If one hard disk fails, the same data can still be accessed from another copy stored elsewhere.
For example:
• If RF = 3, it means each data block is stored in three different servers.
• If one server fails, Hadoop can use the other two copies to continue operations without losing data.
How to process this gigantic store of data?
• In a distributed system, data is stored across multiple machines
instead of a single one. This creates a challenge—before
processing the data, we need to bring it together from different
machines.
• Hadoop solves this problem using MapReduce Programming.
Instead of moving all the data to one place for processing
(which would be slow and inefficient), MapReduce works in two steps:
• Map Phase: The data is divided into smaller chunks, and each chunk is
processed separately on different machines.
• Reduce Phase: The processed results from all machines are combined
to get the final output.
• This way, Hadoop ensures that large amounts of data are processed
efficiently without overloading a single machine.
History of Hadoop
• Hadoop was created by Doug Cutting, who also developed Apache Lucene, a
widely used text search library. It originated as part of Apache Nutch, an open-
source web search engine project at Yahoo, and later became a part of the
Apache Lucene project.
Why is it Called "Hadoop"?
• The name "Hadoop" is not an acronym or abbreviation. Doug Cutting named it
after his child's stuffed yellow elephant. He chose this name because it was:
➢ Short and easy to spell.
➢ Simple to pronounce.
➢ Unique and unrelated to any existing technology.
• This naming style continued in Hadoop's subprojects, which often have names
inspired by animals. For example, "Pig" is another component of Hadoop.
Hadoop Overview
❑ Hadoop is an open-source software framework that helps in storing and
processing huge amounts of data across multiple computers.
❑ It works on a cluster of machines (group of connected computers) using
commodity hardware (affordable, easily available computers).
What does Hadoop do?
Hadoop mainly does two things:
➢ Massive Data Storage – Instead of storing data on a single computer, Hadoop
splits the data into smaller parts and distributes them across many computers in
the cluster. This allows it to store huge amounts of data efficiently.
➢ Faster Data Processing – Hadoop processes data in parallel using multiple
machines at the same time. This means instead of one computer handling all the
work, many computers work together, making processing much faster.
Key Aspects of Hadoop
Hadoop Components
• Hadoop has two main components that handle
storage and processing:
1. HDFS (Hadoop Distributed File
System)
• Storage Component: Stores large amounts of
data across multiple machines.
• Distributes Data: Splits and distributes data
across different nodes (computers).
• Redundant Storage: Keeps multiple copies of
data to prevent loss in case of failure.
2. MapReduce
• Computational Framework: Helps in
processing large datasets.
• Task Distribution: Breaks down tasks into
smaller parts and assigns them to different
nodes.
• Parallel Processing: Performs computations
simultaneously to improve speed.
The Hadoop Ecosystem includes additional tools that enhance Hadoop’s functionality.
These tools help in data processing, storage, and management.
• HIVE – A SQL-like tool that helps users query and analyze large datasets stored in
Hadoop.
• PIG – A scripting language that simplifies data processing in Hadoop.
• SQOOP – Transfers data between Hadoop and relational databases (like MySQL,
PostgreSQL).
• HBASE – A NoSQL database that provides real-time access to Hadoop-stored data.
• FLUME – Collects and moves large amounts of streaming data (e.g., logs, social media
feeds) into Hadoop.
• OOZIE – Manages and schedules Hadoop jobs and workflows.
• MAHOUT – Provides machine learning capabilities within Hadoop for tasks like
clustering and recommendation.
These components work together to improve the performance, usability, and
efficiency of Hadoop in handling big data.
Hadoop Conceptual Layers
Hadoop is divided into two main layers:
• Data Storage Layer – This layer is responsible for storing
large amounts of data. It ensures that data is stored across
multiple machines in a distributed manner.
• Data Processing Layer – This layer processes the stored
data in parallel, meaning multiple parts of the data are
processed simultaneously to generate useful insights quickly.
Hadoop High-Level Architecture
Hadoop follows a Master-Slave Architecture, meaning there
is one central system (Master) that controls multiple worker
systems (Slaves).
• Master Node (NameNode) – The central node that
manages the storage and keeps track of where data is stored in
the cluster.
• Slave Nodes (DataNodes) – These nodes store actual data
and handle processing tasks.
In this setup, the NameNode (Master) directs DataNodes
(Slaves) on where to store data and how to process it. This
ensures efficient storage, quick data processing, and fault
tolerance (if one DataNode fails, another can take over).
Key Components of the Master Node in Hadoop
• The Master Node in Hadoop is responsible for managing both data storage and data
processing. It has two main components:
1. Master HDFS (Hadoop Distributed File System)
• Manages how data is stored in Hadoop.
• Working :
➢ It divides (partitions) large data files into smaller parts.
➢ These smaller parts are then stored across multiple Slave Nodes (also called
DataNodes).
➢ It keeps a record of where each part of the data is stored.
2. Master MapReduce
• Manages how data is processed in Hadoop.
• Working:
▫ It assigns data processing tasks to Slave Nodes based on where the data is stored.
▫ It ensures the tasks run efficiently by distributing the workload.
▫ It schedules the tasks and monitors their progress.
Use case of Hadoop
ClickStream Data
• ClickStream data (mouse clicks) helps businesses understand
customer purchasing behavior.
• ClickStream analysis using Hadoop offers three key benefits:
❖Data Integration – Hadoop combines ClickStream data with
other sources like customer demographics, sales, and advertising
data to provide deeper insights.
❖Scalability – Hadoop stores years of data at a low cost, enabling
long-term trend analysis.
❖Analytics Support – Tools like Apache Pig and Hive help
organize and refine ClickStream data for better visualization and
decision-making.
• This helps businesses optimize websites, promotions, and
marketing strategies effectively.
HDFS (Hadoop Distributed File System)
Some key points of HDFS are as follows:
• Storage Component: HDFS is the main storage system used in Hadoop to store large
amounts of data.
• Distributed System: It splits large files into smaller parts (blocks) and distributes
them across multiple computers (nodes).
• Inspired by Google File System: HDFS is designed based on Google's approach to
handling large-scale data storage.
• High Performance: It is optimized for handling big data efficiently by storing large
blocks of data and processing it close to where it is stored.
• Replication for Fault Tolerance: Each file is stored in multiple copies across
different nodes, making it resistant to software or hardware failures.
• Automatic Recovery: If a node fails, HDFS automatically replicates and restores
the lost data from other copies.
• Best for Large Files: HDFS is ideal for reading and writing large files (in gigabytes
or more).
• Works on Existing File Systems: HDFS runs on top of standard file
systems like ext3 (Third Extended Filesystem) and ext4 (Fourth
Extended Filesystem) (used in Linux).
• When data is written to HDFS, it is broken into blocks
and stored across multiple nodes in a distributed
environment.
• These blocks are stored as files in the underlying
Linux file system (ext3, ext4, or others).
• The Linux file system handles the actual disk-level
operations like reading, writing, and organizing data.
Windows file systems are NTFS(New Technology File
System), FAT32 (File Allocation Table 32-bit),
exFAT (Extended File Allocation Table), ReFS
(Resilient File System)
Figure 5.15 is a HDFS architecture.
Client application interacts with
namenode for metadata related
activities and communicates
with datanode to read and write
files. Datanodes converse with
each other for pipeline reads and
write.
Example: If a file named "[Link]" is 192 MB in size, HDFS will:
1. Split it into three blocks of 64 MB each (since the default block size is 64 MB).
2. Distribute these blocks across different nodes.
3. Replicate them based on the default replication factor (typically 3), ensuring data reliability.
HDFS Daemons
NameNode in HDFS
1. Manages File System: NameNode is the master of HDFS and keeps track of all files and
directories.
2. Breaks Large Files into Blocks: HDFS splits large files into blocks and distributes them
across multiple DataNodes.
3. Tracks DataNodes Using Racks:
▫ A rack is a group of DataNodes within a cluster.
▫ NameNode assigns rack IDs to DataNodes and tracks where each block is stored.
4. File System Namespace:
▫ The namespace includes a list of all files, directories, and block locations in HDFS.
▫ NameNode stores this information in a special file called FsImage.
5. Transaction Log (EditLog):
▫ Every file operation (create, delete, modify) is recorded in EditLog (a transaction log).
▫ When NameNode restarts, it reads FsImage and applies all transactions from EditLog to update the
file system state.
▫ After updates, it saves the new FsImage and clears the old EditLog.
6. Single NameNode Per Cluster:
▫ There is only one NameNode managing the entire HDFS cluster.
▫ It does not store actual file data—only metadata (file locations and properties).
➢ If you store a file in HDFS:
• NameNode breaks it into blocks and assigns them to DataNodes.
• It records file metadata (e.g., block locations, size, and permissions) in
FsImage.
• If a new file is added, the change is logged in EditLog for future
updates.
HDFS Daemons
Datanode in HDFS
➢ A Hadoop cluster has many
DataNodes to store and manage file
blocks.
➢ During file read and write, DataNodes
exchange data to ensure efficient
processing.
➢ Each DataNode sends a heartbeat
signal to the NameNode at regular
intervals. This confirms that the
DataNode is active and functioning.
➢ If a DataNode stops sending
heartbeats, the NameNode assumes it
has failed. The NameNode replicates
the lost data from other copies to
maintain fault tolerance.
HDFS Daemons
Secondarynode in HDFS
• The secondary namenode takes a snapshot of HDFS
metadata at intervals specified in the hadoop
configuration.
• Since the memory requirements of secondary
namenode are the same as namenode, it is better to run
namenode and secondary namenode on different
machines.
• In case of failure of the namenode, the secondary
namenode can be configured manually to bring up the
cluster.
• However the secondary namenode does not record any
real-time changes that happen to the HDFS metadata.
Anatomy of file read
1. The client calls open() on DistributedFileSystem to read a file.
2. The DistributedFileSystem contacts the NameNode to find out where the file's data blocks
are stored. The NameNode responds with the addresses of DataNodes storing the required
blocks. The client receives a stream (FSDataInputStream – It is responsible to read data in
chunks) to start reading the file.
3. The client calls read() on the DFSInputStream, which contains the addresses of the DataNodes.
It chooses the closest DataNode for the first block and starts reading.
4. The client continuously calls read() to fetch data from the DataNode.
5. When the end of a block is reached, the connection with the DataNode is closed. The client
finds the next best DataNode and continues reading the next block. This process repeats for all
remaining blocks.
6. After reading the full file, the client calls close() on FSDataInputStream to close the connection.
• FSDataInputStream is a special input stream in HDFS that allows the client to read a file block
by block.
• It helps the client retrieve data efficiently from different DataNodes without needing to manage
the process manually.
Anatomy of file write
1. The client calls create() on DistributedFileSystem to create a new file.
2. The DistributedFileSystem makes an RPC call (Remote Procedure Call) to
the NameNode. NameNode checks whether the file already exists. If valid,
it creates a file entry but does not yet assign data blocks. The system
returns FSDataOutputStream to the client for writing data.
3. The client writes data to the DFSOutputStream, which splits the data
into packets. These packets go into a queue called data queue. A process
called DataStreamer picks up packets and requests the NameNode to
assign DataNodes for storage. DataNodes are selected based on replication
factor (default = 3). The selected DataNodes form a pipeline to receive
and replicate the data.
4. DataStreamer sends packets to the first DataNode in the pipeline. The
first DataNode stores the packet and forwards it to the second
DataNode. The second DataNode stores the packet and forwards it to
the third DataNode. This process ensures replication across multiple
nodes.
5. Along with the data queue, DFSOutputStream maintains an ack
queue (acknowledgment queue). A packet remains in the ack queue
until all DataNodes in the pipeline confirm they have successfully
stored it.
6. When the client finishes writing, it calls close() on the stream. This
flushes all remaining packets to the DataNode pipeline. The system
waits for acknowledgments from all DataNodes. After confirmation, the
client informs the NameNode that the file creation is complete.
FaceBook Example
Generates billions of Photos and Videos daily
Namenode: Keeps track of which Datanode has which photo/video
block.
Datanodes: Store the actualy photo/video data.
Secondary Namenode: Ensures metadata doesn’t blow up by
regularly creating checkpoints.
So, when you open a Facebook photo:
1. The request first queries the NameNode to find block locations.
2. The DataNodes send the blocks to your device.
3. The system stay reliable because the SecondaryNode Keeps the
Metadata clean.
Hadoop default replica placement strategy
• Hadoop follows a smart strategy to store copies (replicas) of data across
different nodes and racks to ensure fault tolerance and reliability.
• When a file is saved in HDFS, three copies (replicas) of each block are
stored as follows:
1. First replica → Stored on the same node as the client (if the client
is inside the cluster).
2. Second replica → Placed on a different rack (a rack is a group of
nodes).
3. Third replica → Stored on the same rack as the second replica,
but on a different node.
• Advantages of this strategy:
1. High Availability: If one node fails, other copies exist on different
nodes/racks.
2. Better Fault Tolerance: Since at least one replica is on a different rack,
data is safe even if a whole rack fails.
3. Efficient Network Usage: Keeps one local copy for fast access while
balancing network load across racks.
• Once the replica locations are decided, a pipeline is built to send the data to
these nodes in order. The first node stores the data and forwards it to the
second, which then sends it to the third.
Working with HDFS commands
Special features of HDFS
• Data Replication – The client doesn't need to track file blocks. Instead,
it is automatically directed to the nearest data replica for better
performance.
• Data Pipeline – When writing data, the client sends the first block to a
DataNode, which then forwards it to the next node in the pipeline. This
continues until all blocks and their replicas are written to disk.
Processing Data with Hadoop
• MapReduce is a software framework for processing large datasets in parallel. It splits input data into independent
chunks, processes them using Map tasks, and stores intermediate results locally. The framework automatically shuffles,
sorts, and provides this output to Reduce tasks, which combine the results to generate the final output.
• The Hadoop Distributed File System (HDFS) and MapReduce framework run on the same nodes for data locality,
improving throughput.
• Two key components manage the process:
JobTracker (Master): Schedules, monitors, and re-executes failed tasks.
TaskTracker (Slave): Executes the assigned tasks.
• Job configuration includes MapReduce functions, input/output locations, and job parameters. The
JobTracker receives jobs from the Hadoop job client, schedules tasks, monitors progress, and updates the
client.
MapReduce Daemons
1) JobTracker (Master)
• It provides a connectivity between hadoop and your application.
• Acts as the manager of all jobs in a Hadoop cluster.
• When a mapreduce job is submitted, the JobTracker decides:
- Which task should be assigned to which node in the cluster.
- Monitors all running tasks.
- If a task fails, it reassigns the task to another node.
• Only one JobTracker exists per Hadoop cluster.
• It ensures the smooth execution of the entire MapReduce
process.
MapReduce Daemons
2) Task Tracker (Worker)
• This daemon is responsible for executing individual tasks that is
assigned by the job tracker.
• Works as the executor of tasks assigned by the JobTracker.
• Each worker node (slave) in the cluster has one TaskTracker.
• The TaskTracker can run multiple map and reduce tasks in
separate JVMs (Java Virtual Machines).
• It regularly updates the JobTracker by sending a heartbeat
signal.
• If the JobTracker does not receive a heartbeat, it assumes the
TaskTracker has failed and reassigns the task to another node.
TaskTracker TaskTracker
Heartbeat Heartbeat
Reduce signal signal map Reduce
map
JVM1 JVM2 JVM3
JVM1 JVM2 JVM3
Job Tracker
(Master) Node 3
Node 1
Cluster
Heartbeat
signal Heartbeat
signal
TaskTracker
TaskTracker
map Reduce Reduce
map
Node 2
Node 4
JVM1 JVM2 JVM3 JVM1 JVM2 JVM3
How does mapreduce work?
• In the given block diagram, there
are two mappers and one reducer.
Each mapper works on the partial
dataset that is stored on that node
and the reducer combines the
output from the mappers to
produce the reduced result set.
Working model of mapreduce programming
1. Splitting the Data: 4. Partitioning the Data:
• The input data is broken into smaller • The map worker uses a partitioner to
chunks so that multiple tasks can work decide which reducer should receive
on it at the same time. the output.
• This ensures that data related to the
2. Creating Master and Worker same key is sent to the same reducer.
Processes: 5. Shuffling and Sorting:
• A master process is created to control • After all map workers finish, the reduce
and assign tasks. workers collect the key-value pairs.
• Multiple worker processes are started • The data is shuffled (grouped by key)
and sorted for easier processing.
on different machines to handle the
6. Reducing Phase:
tasks.
• The reduce function processes each
3. Mapping Phase: unique key and aggregates results.
• Each worker (map task) reads its 7. Completion:
assigned chunk of data. • Once all reducers finish, the master
gives control back to the user
• The map function processes this data program.
and converts it into key-value pairs. • The user can now access the final
processed results.
Managing resources and applications with
Hadoop YARN (Yet Another Resource Negotiator)
What is hadoop YARN?
- YARN stands for Yet Another Resource Negotiator and is a part of Hadoop 2.x.
- It is a resource management framework that allows Hadoop to handle multiple types of
applications, not just MapReduce.
- Now, Hadoop can handle Batch Processing, Interactive Queries, Online Processing,
Streaming, and Graph Processing efficiently.
Limitations of Hadoop 1.0
- Single NameNode Problem
• Only one NameNode managed the entire Hadoop cluster.
• If this NameNode failed, the entire system could stop working (single point of failure).
- Limited to Batch Processing
• It could only handle batch jobs, meaning it processed large amounts of data in one go.
• It was slow for real-time tasks.
Managing resources and applications with
Hadoop YARN (Yet Another Resource Negotiator)
- Not Good for Quick Queries (Interactive Analysis)
• You couldn’t ask questions and get instant answers from large datasets.
- Not Suitable for Machine Learning & Graph Processing
• Machine learning and graph-based tasks need more memory and faster
processing. Hadoop 1.0 struggled with these workloads.
- Poor Resource Management
• The system had fixed slots for map tasks and reduce tasks.
• If map slots were full but reduce slots were empty, resources were
wasted.
• This led to inefficient resource utilization.
HDFS Limitation in Hadoop 1.0
• The NameNode in Hadoop stores file metadata (information about
files and directories) in main memory (RAM).
• If there are too many files, the NameNode’s memory gets full and
cannot handle more files.
• This slows down the system and may cause failures.
• Even though RAM is cheaper today, it still has a limit on how much
data it can store.
• If a single NameNode handles everything, it can become
overloaded when the system grows.
• In Hadoop 2.x, this is resolved with the help of HDFS Federation.
Hadoop 2: HDFS
Major Components of HDFS 2:
• Namespace Service – Manages file-related operations like creating, modifying,
and organizing files/directories.
• Block Storage Service – Manages data storage, handles DataNodes, and takes
care of replication for fault tolerance.
Features of HDFS 2:
1) Horizontal Scalability:
• Uses multiple NameNodes to handle more data efficiently.
• Each NameNode works independently without needing coordination.
• All DataNodes are shared among NameNodes.
2) High Availability:
• Hadoop 1.x had only one NameNode to manage the [Link] this
single NameNode failed, the entire Hadoop cluster would stop
working. This caused downtime and data access issues. Solution in
Hadoop 2.x – Active-Passive NameNode Setup. To ensure high
availability, Hadoop 2.x introduced the Active-Passive NameNode
architecture with automatic failover.
• Two NameNodes Exist:
a) Active NameNode → Manages all operations (reads, writes,
metadata updates).
b) Passive (Standby) NameNode → Remains idle but stays
updated.
• All namespace changes (file modifications, deletions, etc.) are recorded
in a shared NFS storage (Network File System). This ensures that both
Active and Passive NameNodes have the same updated metadata.
• If the Active NameNode fails, the Passive NameNode
automatically takes over. It becomes the new Active NameNode
and starts handling operations. The transition happens seamlessly,
ensuring that the system continues running without downtime.
• At any time, only one NameNode writes to storage to prevent
conflicts. The Passive NameNode only reads updates until it
becomes active.
Hadoop 2 YARN: Taking Hadoop beyond Batch
• YARN (Yet Another Resource Negotiator) is a key feature in Hadoop 2.x that manages
resources efficiently. It allows multiple applications to run on Hadoop, not just batch
processing like MapReduce.
• Hadoop 2.x introduced YARN, allowing better resource utilization and support for
different types of processing.
➢ Stores All Data in One Place → You don’t need separate systems for different types of
workloads.
➢ Supports Multiple Processing Models → Hadoop is no longer limited to batch jobs; it
can handle real-time, streaming, interactive, and graph-based processing.
➢ It efficiently allocates resources, ensuring smooth and optimized processing.
➢ Improves job scheduling and priority handling for better reliability and efficiency.
• Yahoo originally designed YARN to improve Hadoop’s performance and flexibility.
Fundamentals Idea
Daemons of YARN architecture are:
a) Global Resource Manager (Manages System-Wide Resources)
• Allocates resources among different applications.
• It has two main parts:
(a) Scheduler – Decides how much CPU, memory, and other
resources each application gets but does not track their status.
(b) Application Manager –
▫ Accepts job submissions.
▫ Allocates containers (a unit of resources like CPU and memory) for
applications.
▫ Restarts failed Application Masters.
b) NodeManager (Per-Machine Resource Monitor &
Executor)
• Runs on each machine in the cluster.
• Starts and monitors containers (where applications run).
• Tracks resource usage (CPU, memory, disk, etc.) and reports to the
Resource Manager.
c) ApplicationMaster (Manages a Specific Application)
• Each application gets its own ApplicationMaster.
• Requests resources from the Resource Manager.
• Works with NodeManager to run and monitor tasks.
Terminologies
YARN Architecture
1) Client submits the job → The user submits an application with all the details needed to
run it, including the instructions for launching the ApplicationMaster.
2) Resource Manager assigns a container → It selects a container (a unit of resources
like CPU & memory) to launch the ApplicationMaster.
3) ApplicationMaster registers with Resource Manager → This helps the client check
the job’s status directly with the Resource Manager.
4) ApplicationMaster requests more containers → It negotiates with the Resource
Manager to get the necessary resources (CPU, memory) for running the job.
5) Containers are assigned and launched → Once allocated, the ApplicationMaster
sends instructions to NodeManager, which starts the containers.
6) NodeManager runs the application → It executes the code inside the container and
sends progress updates to the ApplicationMaster.
7) Client communicates with ApplicationMaster → The user can check the status of
their job by talking to the ApplicationMaster directly.
8) ApplicationMaster completes and shuts down → After finishing the job, the
ApplicationMaster deregisters from the Resource Manager, freeing up resources.
Chapter 2
Introduction to mapreduce
programming
Introduction File 1 File 2 HDFS
• In mapreduce programming, jobs
(applications) are split into a set Jobs
of map tasks and reduce tasks. (Applications)
• These tasks are executed in a
distributed fashion on hadoop
cluster.
• Each task processes small subset
Map Map Map
of data that has been assigned to task 1 task 2 task 3
Reduce
taks 1
Reduce
task 2
Reduce
task 3
it. Data 1
Data
2
Data
3
Data
4
Data
5
Data
6
• This way, hadoop distributes the
Hadoop cluster
load across the cluster.
• Mapreduce job takes a set of files Tasks executed on hadoop cluster in a
distributed fashion
that is stored in HDFS (Hadoop
Distributed File System) as input.
• Map task takes care of loading, • The responsibility of reduce task is
parsing, transforming, and grouping and aggregating data
filtering. that is produced by map tasks to
❑ Loading:- Map task loads data from generate final output.
HDFS. It reads the input files line by ❑ Grouping:- Reduce task groups
line, row by row, or in another all values that have the same key.
format based on the file type.
❑ Aggregating:- The Reduce task
❑ Parsing:-After loading, the Map aggregates (summarizes) the grouped
task parses (breaks down) the data values. Aggregation means performing
into smaller components. If the input operations like sum, count,
is text, parsing means splitting the average, max, min, etc.
sentence into words
❑ Transforming:-Maptask
transforms the parsed data into
key-value pairs
❑ Filtering:-Map task filters out
unwanted data. It removes
unnecessary words, null values,
or unwanted entries.
• Each map task is broken down into the following phases:
a) RecordReader
b) Mapper
c) Combiner
d) Partitioner
• The output produced by map task is known as intermediate keys and
values. These intermediate keys and values are sent to reducer. The
reduce tasks are broken into the following phases:
a) Shuffle
b) Sort
c) Reducer
d) Output format
• Hadoop assigns map tasks to the datanode where the actual data to be
processed resides. This way hadoop ensures data locality.
(Data locality means that data is not moved over network; only computational code
is moved to process data which saves network bandwidth.)
Mapper
• The Mapper is the first stage in a MapReduce job. It processes input data and converts it
into intermediate key-value pairs, which will later be grouped and processed by the
Reducer.
1) RecordReader:-
• Reads raw data from HDFS and converts it into a structured key-value format.
• The key is usually the position of the record, and the value is the actual data.
2) Map Function:-
• Processes the key-value pairs and generates intermediate key-value pairs.
3) Combiner(optional):-
• Acts as a mini-reducer to optimize performance.
• It aggregates data locally before sending it to the reducer, reducing network traffic.
Mapper output Combiner output
4) Partitioner:-
• Decides which reducer will process a particular key-value pair.
• Ensures that keys with the same value go to the same reducer.
• The partitioned data is stored locally before being pulled by the
reducers.
Reducer
1) Shuffle and sort:-
• The Reducer collects data from all Mappers.
• It groups the same keys together and sorts them.
1 Mapper output
1 Combiner output
2 Mappers output after combiner
Output at shuffle phase
2) Reduce function:-
• The Reduce function applies operations like sum, average, filtering, etc.
• It processes one group of key-value pairs at a time.
3) Output Format
• The final output is formatted and saved to HDFS.
• By default, keys and values are separated by a tab.
Combiner
• It is an optimization technique for MapReduce Job.
Generally, the reducer class is set to be the combiner class.
• The difference between combiner class and reducer class is as
follows:
a) Output generated by the combiner is intermediate
data and it is passed to the reducer.
b) Output of the reducer is passed to the output file on
disk.
Mapper
[Link]
Combiner
Mapper 1
output
Combiner
Mapper 1
output
Reducer
Final output of
reducer
Hadoop automatically handles shuffling
and sorting internally between the
Combiner (or Mapper) and Reducer
stages.
Partitioner
• The partitioning phase happens after map phase and before
reduce phase.
• Usually the number of partitions are equal to the
number of reducers.
Purpose of using partitioner (explained by considering
word count example program) :
a) Distributes words efficiently across reducers.
b) Improves performance by reducing workload per reducer.
c) Reducers process alphabetically similar words together.
Searching
Mapper stage
Shuffle and sort
(automatically)
Reducer stage
Sorting • MapReduce automatically sorts by keys, and student
names (values) need to be sorted, we’ll swap the key
and value in the Mapper, so names become keys.
Mapper stage
Shuffle and sort
(automatically)
Reducer stage
Compression