MODULE 2
Introduction to Hadoop: Introducing hadoop, Why hadoop,
Why not RDBMS, RDBMS Vs Hadoop, History of
Hadoop, Hadoop overview, Use case of Hadoop, HDFS
(Hadoop Distributed File System),Processing data with
Hadoop, Managing resources and applications with Hadoop
YARN(Yet Another Resource Negotiator).
Introduction to Map Reduce Programming: Introduction,
Mapper, Reducer, Combiner, Partitioner, Searching,
Sorting, Compression.
Introducing hadoop
Today, Big Data is the Buzzword!
To process, analyze and make sense of these different kinds of data, we need a system
that scales and address the challenges like “ how to store terabytes of mounting data’,
“how to access the information quickly”, and “how to work with data that is very
diifernt”.
Why Hadoop ?
Ever wondered why Hadoop has been and is one of the most wanted technologies!!
The key consideration (the rationale behind its huge popularity) is:
Its capability to handle massive amounts of data, different categories of data – fairly quickly.
The other considerations are :
Why not RDBMS
RDBMS is not suitable for storing and processing large files, images, and videos.
RDBMS is not a good choice for advanced analytics involving machine learning.
In RDBMS cost increase with storage volume.
Need huge investment as the data volume increases.
RDBMS Vs Hadoop
Distributed Computing Challenges
• Hardware Failure – Replication factor (RF)
• How to Process This Gigantic Store of Data?
• MapReduce programming
History of Hadoop
Hadoop Overview
Key Aspects of Hadoop
Hadoop Components
Hadoop Components
Hadoop Core Components:
HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop High Level Architecture
Use case for Hadoop
ClickStream data (mouse clicks) helps you to understand the purchasing
behavior of customers. ClickStream analysis helps online marketers to
optimize their product web pages, promotional content, etc. to improve
their business.
ClickStream analysis using Hadoop provide 3 key benefits:
Hadoop helps to join ClickStream data with other data sources such as Customer relationship
Management Data.
Hadoop’s scalability property helps you to store years of data without ample incemental cost.
Business analysts can use Apache Pig or Apache Hive for website analysis.
Hadoop Distributors
The Following companies provide products that include Apache Hadoop ,
commercial support and / or toos and utilities related to Hadoop.
HDFS (Hadoop Distributed File System)
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size and
moves computation where data is stored).
5. You can replicate a file for a configured number of times, which is
tolerant in terms of both software and hardware.
6. Re-replicates data blocks automatically on nodes that have failed.
7. You can realize the power of HDFS when you perform read or write on
large files (gigabytes and larger).
8. Sits on top of native file system such as ext3 and ext4
HDFS Daemons
HDFS (Hadoop Distributed File System) consists of several daemons that work together to manage
data storage across a distributed cluster. These daemons include:
1. NameNode
• The master daemon that manages the HDFS namespace.
• Keeps track of metadata, such as file locations, directory structures, and access permissions.
• Does not store actual data; instead, it maintains a mapping of file blocks to DataNodes.
• Runs on a single node (with an optional standby NameNode for high availability).
2. Secondary NameNode
• Periodically copies the NameNode’s metadata and merges edits logs to reduce the size.
• Helps in checkpointing but does not provide failover functionality.
3. DataNode
• The worker daemon responsible for storing actual file blocks.
• Reports to the NameNode and sends periodic heartbeats to indicate it is alive.
• Reads and writes data based on client requests.
• Replicates data blocks across nodes for fault tolerance.
Anatomy of File Read
Anatomy of File Write
Replica Placement Strategy
As per the Hadoop Replica Placement Strategy, first replica is placed on the same node
as the client. Then it places second replica on a node that is present on different rack. It
places the third replica on the same rack as second, but on a different node in the rack.
Once replica locations have been set, a pipeline is built. This strategy provides good
reliability.
Benefits of This Strategy
1. Fault Tolerance: Even if a node or rack fails, data remains
accessible.
2. Load Balancing: Prevents overloading a single node or rack.
3. Efficient Read Performance: Reads are optimized by selecting the
closest available replica.
4. Reduced Network Traffic: Minimizes cross-rack data transfer
during writes.
Working with HDFS Commands
Here are some commonly used HDFS (Hadoop Distributed File System) commands for
managing files and directories in a Hadoop cluster:
Special Features of HDFS
Data Replication: There is absolutely no need for a client
application to track all blocks. It directs the client to the
nearest replica to ensure high performance.
Data Pipeline: A client application writes a block to the first
DataNode in the pipeline. Then this DataNode takes over and
forwards the data to the next node in the pipeline. This process
continues for all the data blocks, and subsequently all the
replicas are written to the disk.
Processing with Hadoop
What is MapReduce Programming?
MapReduce Programming is a software framework.
MapReduce Programming helps you to process massive amounts of data
in parallel
MapReduce Daemons
How MapReduce Programming Works
Working model of MapReduce Programming
Working model of MapReduce Programming
Following steps describes how MapReduce performs its task
1. Input dataset is split into multiple pieces of data ( small subsets)
2. The framework creates a master and several workers processes and executes the
workers processes remotely.
3. Several map task work simultaneously and read pieces of data that were assigned
to each map task. (uses map function)
4. Map worker uses partitioner function to divide data into regions. Partitioner
decides which reducer should get output of the specified mapper.
5. When the map workers complete their work, master instruct the reduce workers to
begin their work. Reduce workers contact their map worker to get the key value pair.
6. Then it calls reduce functions for every unique key. This function writes the output
to the file.
When the reduce worker completes its work, the master transfer the control to the
user program.
MapReduce Example:
WordCount example – count occurrences of similar word.
MapReduce Programming require 3 things
1. Driver class: Specifies Job configuration details
2. Mapper class: Overrides the Map function based on the problem statement
3. Reducer class: Overrides the Reduce function based on the problem statement
MapReduce – Word Count Example
MANAGING RESOURCES AND APPLICATIONS
WITH HADOOP - YARN
(YET ANOTHER RESOURCE NEGOTIATOR)
Limitations of Hadoop 1.0 Architecture
1. Single NameNode is responsible for managing entire namespace for
Hadoop Cluster.
2. It has a restricted processing model which is suitable for batch-oriented
MapReduce jobs.
3. Hadoop MapReduce is not suitable for interactive analysis.
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and
other memory intensive algorithms.
5. MapReduce is responsible for cluster resource management and data
processing.
Hadoop 2 YARN: Taking Hadoop beyond Batch
Hadoop 2 YARN: Taking Hadoop beyond Batch
The fundamental idea behind this architecture is splitting the JobTracker responsibility of
resource management and Job Scheduling/Monitoring into separate daemons. Daemons
that are part of YARN Architecture are described below.
A Global ResourceManager: Its main responsibility is to distribute resources among
various applications in the system. It has two main components:
NodeManager: This is a per-machine slave daemon. NodeManager responsibility is
launching the application containers for application execution. NodeManager monitors the
resource usage such as memory, CPU, disk, network, etc. It then reports the usage of
resources to the global ResourceManager.
Per-application ApplicationMaster: This is an application-specific entity. Its
responsibility is to negotiate required resources for execution from the ResourceManager.
It works along with the NodeManager for executing and monitoring component tasks.
Introduction to Map Reduce
Programming: Introduction, Mapper,
Reducer, Combiner, Partitioner,
Searching, Sorting, Compression.
Introduction to Map Reduce Programming
In MapReduce Programming, Jobs (Applications) are split into a
set of map tasks and reduce tasks. Then these tasks are executed in
a distributed fashion on Hadoop cluster.
Each task processes small subset of data that has been assigned to
it. This way, Hadoop distributes the load across the cluster.
MapReduce job takes a set of files that is stored in HDFS (Hadoop
Distributed File System) as input.
Map task takes care of loading, parsing, transforming and filtering.
The responsibility of reduce task is grouping and aggregating data
that is produced by map task to generate final output.
MAPPER
A mapper maps the input key−value pairs into a set of intermediate key–value
pairs. Maps are individual tasks that have the responsibility of transforming input
records into intermediate key–value pairs.
Mapper Consists of following phases:
➢ RecordReader
➢ Map
➢ Combiner
➢ Partitioner
MAPPER
RecordReader
It converts a byte-oriented view of input into record-oriented view and present it into mapper task.
It present the tasks with key and values.
Generally key is positional information and value is data that constitute the record.
Map
Map function works on key-value pair produced by the RecordReader and generates zero or more
intermediate key-value pairs.
Combiner
It is an optional function but provides high performance in terms of network bandwidth and disk space.
It takes intermediate key-value pairs provided by mapper and applies user specific aggregate function to
only that mapper.
It is also known as local reducer
Partitioner
It takes intermediate key-value pairs produced by the mapper , splits them into shard, and send them to
the particular reducer.
The key with same value goes to the same reducer.
The partitioned data of each map task is written to the local disk of that machine and pulled by the
respective reducer.
REDUCER
The primary chore of the Reducer is to reduce a set of
intermediate values (the ones that share a common key) to a
smaller set of values.
The Reducer has three primary phases:
Shuffle and Sort
Reduce
Output Format
REDUCER
Example: In the word count problem, the Reducer would sum all the values for the same word and
output the total count of that word.
The chores of Mapper, Combiner, Partitioner, and Reducer
Combiner
The Combiner is an optional optimization step that works like a mini-Reducer.
It is used to combine intermediate results before they are sent to the Reducer,
which reduces the amount of data transferred across the network.
It operates locally on the output of the Mapper and is typically used to
minimize the volume of data that needs to be shuffled across the network.
It is usually the same as the Reducer function but operates on a smaller dataset
The difference between combiner class and reducer class is as follows:
➢ Output generated by combiner is intermediate data and it is
passed to the reducer.
➢ Output of the reducer is passed to the output file on disk.
Partitioner
The partitioning phase happens after map phase and before reduce
phase.
Usually the number of partitions are equal to the number of reducers.
The default partitioner is hash partitioner.
Searching
Searching in MapReduce can be implemented by
processing the data in parallel with the Mapper and
applying a search algorithm.
The Mapper can identify the matching records based on a
search query, and the Reducer can aggregate or filter the
results.
Sorting
Sorting is essential in MapReduce, as the intermediate data needs to
be sorted by key before it is passed to the Reducer.
Sorting is typically done based on the key values. The Hadoop
MapReduce framework automatically handles the sorting of the keys
after the Mapper phase.
Custom sorting can also be implemented if necessary (e.g., sorting by
value or using a custom comparator).
Compression
In MapReduce programming, you can compress the MapReduce output file.
Compression provides two benefits as follows:
1. Reduces the space to store files.
2. Speeds up data transfer across the network.
You can specify compression format in the Driver Program as shown below:
[Link]("[Link]",true);
[Link]("[Link]",
[Link],[Link]);
Here, codec is the implementation of a compression and decompression
algorithm. GzipCodec is the compression algorithm for gzip. This compresses the
output file.
Word Count Example
1. Mapper
The Mapper processes input data line by line, splits words, and emits (word, 1)
as output.
Example Input (Text File Content):
Hello world
Hello Hadoop
Mapper Output (Key-Value Pairs):
("Hello", 1)
("world", 1)
("Hello", 1)
("Hadoop", 1)
Each word is emitted with a count of 1.
2. Reducer
The Reducer takes the intermediate key-value pairs, groups values by key,
and sums them to get the final count.
Reducer Input (Grouped by Key):
("Hello", [1, 1])
("world", [1])
("Hadoop", [1])
Reducer Output (Final Word Count):
("Hello", 2)
("world", 1)
("Hadoop", 1)
The word "Hello" appeared twice, so the Reducer sums its values (1+1=2).
3. Combiner
A Combiner is an optional optimization step that works like a mini-Reducer,
aggregating results before they are sent to the main Reducer.
• Instead of sending all (word, 1) pairs to the Reducer, the Combiner
performs local aggregation on the Mapper’s output.
• It helps reduce the amount of data transferred over the network.
• Combiner Output (Local Aggregation at Mapper Level):
• ("Hello", 2)
• ("world", 1)
• ("Hadoop", 1)
• Now, less data is sent to the Reducer, improving efficiency.
4. Partitioner
The Partitioner controls how Mapper output is distributed across multiple
Reducers.
• It ensures that all occurrences of the same word go to the same Reducer
for proper aggregation.
• The default partitioning is hash-based, meaning words with the same hash
are sent to the same Reducer.
For example, if there are two Reducers:
• Reducer 1 gets words starting with A-M (e.g., "Hadoop", "Hello")
• Reducer 2 gets words starting with N-Z (e.g., "world")
This ensures load balancing across Reducers.
5. Searching in MapReduce
Searching in MapReduce means filtering the dataset to find occurrences of
a word.
Example Query:
Find all lines containing the word "Hadoop".
• Mapper: Emits the line if it contains "Hadoop".
• Reducer: Merges results to get all matching lines.
Example Output:
("Hadoop", "Hello Hadoop")
This approach is useful for log analysis and text searching in large datasets.
6. Sorting in MapReduce
Sorting happens automatically in the shuffle and sort phase after the
Mapper step.
• The framework sorts keys before sending them to the Reducer.
• Custom sorting can be implemented using Comparator classes.
Example Sorted Output Before Reduce Phase:
("Hadoop", 1)
("Hello", 1)
("Hello", 1)
("world", 1)
Words are sorted alphabetically before reducing.
7. Compression in MapReduce
Compression helps reduce the size of data transferred between Mapper
and Reducer, improving efficiency.
• Input Compression: The input file can be compressed using formats like gzip
or bzip2.
• Intermediate Compression: Map output can be compressed before
shuffling.
• Output Compression: The final output can be stored in a compressed
format to save space.
Example:
If a 10GB dataset is compressed to 2GB, the system processes data faster
with reduced storage requirements