Bda Module 2

This document provides an introduction to Hadoop, its components, and the MapReduce programming model. It explains the advantages of Hadoop over traditional RDBMS for handling big data, detailing its architecture, including HDFS and YARN, and how data processing is achieved through mappers and reducers. The document also covers practical applications like ClickStream analysis and includes examples of MapReduce tasks such as word counting and searching.

Uploaded by

lahithareddy.k15

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

4 views51 pages

Bda Module 2

Uploaded by

lahithareddy.k15

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

MODULE 2

Introduction to Hadoop: Introducing hadoop, Why hadoop,

Why not RDBMS, RDBMS Vs Hadoop, History of
Hadoop, Hadoop overview, Use case of Hadoop, HDFS
(Hadoop Distributed File System),Processing data with
Hadoop, Managing resources and applications with Hadoop
YARN(Yet Another Resource Negotiator).
Introduction to Map Reduce Programming: Introduction,
Mapper, Reducer, Combiner, Partitioner, Searching,
Sorting, Compression.
Introducing hadoop
 Today, Big Data is the Buzzword!

 To process, analyze and make sense of these different kinds of data, we need a system
that scales and address the challenges like “ how to store terabytes of mounting data’,
“how to access the information quickly”, and “how to work with data that is very
diifernt”.
Why Hadoop ?
 Ever wondered why Hadoop has been and is one of the most wanted technologies!!
 The key consideration (the rationale behind its huge popularity) is:
 Its capability to handle massive amounts of data, different categories of data – fairly quickly.
 The other considerations are :
Why not RDBMS
 RDBMS is not suitable for storing and processing large files, images, and videos.
 RDBMS is not a good choice for advanced analytics involving machine learning.
 In RDBMS cost increase with storage volume.
 Need huge investment as the data volume increases.
RDBMS Vs Hadoop
Distributed Computing Challenges

• Hardware Failure – Replication factor (RF)

• How to Process This Gigantic Store of Data?

• MapReduce programming
History of Hadoop
Hadoop Overview
 Key Aspects of Hadoop
Hadoop Components
Hadoop Components
 Hadoop Core Components:

 HDFS:
 (a) Storage component.
 (b) Distributes data across several nodes.
 (c) Natively redundant.

 MapReduce:
 (a) Computational framework.
 (b) Splits a task across multiple nodes.
 (c) Processes data in parallel.
Hadoop High Level Architecture
Use case for Hadoop
 ClickStream data (mouse clicks) helps you to understand the purchasing
behavior of customers. ClickStream analysis helps online marketers to
optimize their product web pages, promotional content, etc. to improve
their business.

 ClickStream analysis using Hadoop provide 3 key benefits:

 Hadoop helps to join ClickStream data with other data sources such as Customer relationship
Management Data.
 Hadoop’s scalability property helps you to store years of data without ample incemental cost.
 Business analysts can use Apache Pig or Apache Hive for website analysis.
Hadoop Distributors
 The Following companies provide products that include Apache Hadoop ,
commercial support and / or toos and utilities related to Hadoop.
HDFS (Hadoop Distributed File System)
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size and
moves computation where data is stored).
5. You can replicate a file for a configured number of times, which is
tolerant in terms of both software and hardware.
6. Re-replicates data blocks automatically on nodes that have failed.
7. You can realize the power of HDFS when you perform read or write on
large files (gigabytes and larger).
8. Sits on top of native file system such as ext3 and ext4
HDFS Daemons
 HDFS (Hadoop Distributed File System) consists of several daemons that work together to manage
data storage across a distributed cluster. These daemons include:
 1. NameNode
• The master daemon that manages the HDFS namespace.
• Keeps track of metadata, such as file locations, directory structures, and access permissions.
• Does not store actual data; instead, it maintains a mapping of file blocks to DataNodes.
• Runs on a single node (with an optional standby NameNode for high availability).
 2. Secondary NameNode
• Periodically copies the NameNode’s metadata and merges edits logs to reduce the size.
• Helps in checkpointing but does not provide failover functionality.
 3. DataNode
• The worker daemon responsible for storing actual file blocks.
• Reports to the NameNode and sends periodic heartbeats to indicate it is alive.
• Reads and writes data based on client requests.
• Replicates data blocks across nodes for fault tolerance.
Anatomy of File Read
Anatomy of File Write
Replica Placement Strategy
 As per the Hadoop Replica Placement Strategy, first replica is placed on the same node
as the client. Then it places second replica on a node that is present on different rack. It
places the third replica on the same rack as second, but on a different node in the rack.
Once replica locations have been set, a pipeline is built. This strategy provides good
reliability.
 Benefits of This Strategy
1. Fault Tolerance: Even if a node or rack fails, data remains
accessible.
2. Load Balancing: Prevents overloading a single node or rack.
3. Efficient Read Performance: Reads are optimized by selecting the
closest available replica.
4. Reduced Network Traffic: Minimizes cross-rack data transfer
during writes.
Working with HDFS Commands
 Here are some commonly used HDFS (Hadoop Distributed File System) commands for
managing files and directories in a Hadoop cluster:
Special Features of HDFS
Data Replication: There is absolutely no need for a client
application to track all blocks. It directs the client to the
nearest replica to ensure high performance.
Data Pipeline: A client application writes a block to the first
DataNode in the pipeline. Then this DataNode takes over and
forwards the data to the next node in the pipeline. This process
continues for all the data blocks, and subsequently all the
replicas are written to the disk.
Processing with Hadoop
 What is MapReduce Programming?
 MapReduce Programming is a software framework.
 MapReduce Programming helps you to process massive amounts of data
in parallel
MapReduce Daemons
How MapReduce Programming Works
Working model of MapReduce Programming
Working model of MapReduce Programming
 Following steps describes how MapReduce performs its task
 1. Input dataset is split into multiple pieces of data ( small subsets)
 2. The framework creates a master and several workers processes and executes the
workers processes remotely.
 3. Several map task work simultaneously and read pieces of data that were assigned
to each map task. (uses map function)
 4. Map worker uses partitioner function to divide data into regions. Partitioner
decides which reducer should get output of the specified mapper.
 5. When the map workers complete their work, master instruct the reduce workers to
begin their work. Reduce workers contact their map worker to get the key value pair.
 6. Then it calls reduce functions for every unique key. This function writes the output
to the file.
 When the reduce worker completes its work, the master transfer the control to the
user program.
MapReduce Example:
 WordCount example – count occurrences of similar word.
 MapReduce Programming require 3 things
 1. Driver class: Specifies Job configuration details
 2. Mapper class: Overrides the Map function based on the problem statement
 3. Reducer class: Overrides the Reduce function based on the problem statement
MapReduce – Word Count Example
MANAGING RESOURCES AND APPLICATIONS
WITH HADOOP - YARN
(YET ANOTHER RESOURCE NEGOTIATOR)
Limitations of Hadoop 1.0 Architecture
1. Single NameNode is responsible for managing entire namespace for
Hadoop Cluster.
2. It has a restricted processing model which is suitable for batch-oriented
MapReduce jobs.
3. Hadoop MapReduce is not suitable for interactive analysis.
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and
other memory intensive algorithms.
5. MapReduce is responsible for cluster resource management and data
processing.
Hadoop 2 YARN: Taking Hadoop beyond Batch
Hadoop 2 YARN: Taking Hadoop beyond Batch
 The fundamental idea behind this architecture is splitting the JobTracker responsibility of
resource management and Job Scheduling/Monitoring into separate daemons. Daemons
that are part of YARN Architecture are described below.

 A Global ResourceManager: Its main responsibility is to distribute resources among

various applications in the system. It has two main components:

 NodeManager: This is a per-machine slave daemon. NodeManager responsibility is

launching the application containers for application execution. NodeManager monitors the
resource usage such as memory, CPU, disk, network, etc. It then reports the usage of
resources to the global ResourceManager.

 Per-application ApplicationMaster: This is an application-specific entity. Its

responsibility is to negotiate required resources for execution from the ResourceManager.
It works along with the NodeManager for executing and monitoring component tasks.
Introduction to Map Reduce
Programming: Introduction, Mapper,
Reducer, Combiner, Partitioner,
Searching, Sorting, Compression.
Introduction to Map Reduce Programming
In MapReduce Programming, Jobs (Applications) are split into a
set of map tasks and reduce tasks. Then these tasks are executed in
a distributed fashion on Hadoop cluster.
Each task processes small subset of data that has been assigned to
it. This way, Hadoop distributes the load across the cluster.
MapReduce job takes a set of files that is stored in HDFS (Hadoop
Distributed File System) as input.
Map task takes care of loading, parsing, transforming and filtering.
The responsibility of reduce task is grouping and aggregating data
that is produced by map task to generate final output.
MAPPER
 A mapper maps the input key−value pairs into a set of intermediate key–value
pairs. Maps are individual tasks that have the responsibility of transforming input
records into intermediate key–value pairs.
 Mapper Consists of following phases:
➢ RecordReader
➢ Map
➢ Combiner
➢ Partitioner
MAPPER
 RecordReader
 It converts a byte-oriented view of input into record-oriented view and present it into mapper task.
 It present the tasks with key and values.
 Generally key is positional information and value is data that constitute the record.
 Map
 Map function works on key-value pair produced by the RecordReader and generates zero or more
intermediate key-value pairs.
 Combiner
 It is an optional function but provides high performance in terms of network bandwidth and disk space.
 It takes intermediate key-value pairs provided by mapper and applies user specific aggregate function to
only that mapper.
 It is also known as local reducer
 Partitioner
 It takes intermediate key-value pairs produced by the mapper , splits them into shard, and send them to
the particular reducer.
 The key with same value goes to the same reducer.
 The partitioned data of each map task is written to the local disk of that machine and pulled by the
respective reducer.
REDUCER
The primary chore of the Reducer is to reduce a set of
intermediate values (the ones that share a common key) to a
smaller set of values.
The Reducer has three primary phases:
Shuffle and Sort
 Reduce
Output Format
REDUCER

Example: In the word count problem, the Reducer would sum all the values for the same word and
output the total count of that word.
The chores of Mapper, Combiner, Partitioner, and Reducer
Combiner
 The Combiner is an optional optimization step that works like a mini-Reducer.
 It is used to combine intermediate results before they are sent to the Reducer,
which reduces the amount of data transferred across the network.
 It operates locally on the output of the Mapper and is typically used to
minimize the volume of data that needs to be shuffled across the network.
 It is usually the same as the Reducer function but operates on a smaller dataset
 The difference between combiner class and reducer class is as follows:
➢ Output generated by combiner is intermediate data and it is
passed to the reducer.
➢ Output of the reducer is passed to the output file on disk.
Partitioner
 The partitioning phase happens after map phase and before reduce
phase.
 Usually the number of partitions are equal to the number of reducers.
 The default partitioner is hash partitioner.
Searching
Searching in MapReduce can be implemented by
processing the data in parallel with the Mapper and
applying a search algorithm.
The Mapper can identify the matching records based on a
search query, and the Reducer can aggregate or filter the
results.
Sorting
 Sorting is essential in MapReduce, as the intermediate data needs to
be sorted by key before it is passed to the Reducer.
 Sorting is typically done based on the key values. The Hadoop
MapReduce framework automatically handles the sorting of the keys
after the Mapper phase.
 Custom sorting can also be implemented if necessary (e.g., sorting by
value or using a custom comparator).
Compression
 In MapReduce programming, you can compress the MapReduce output file.
 Compression provides two benefits as follows:
1. Reduces the space to store files.
2. Speeds up data transfer across the network.
 You can specify compression format in the Driver Program as shown below:
 [Link]("[Link]",true);
[Link]("[Link]",
[Link],[Link]);
 Here, codec is the implementation of a compression and decompression
algorithm. GzipCodec is the compression algorithm for gzip. This compresses the
output file.
Word Count Example
 1. Mapper
 The Mapper processes input data line by line, splits words, and emits (word, 1)
as output.
 Example Input (Text File Content):
 Hello world
 Hello Hadoop
 Mapper Output (Key-Value Pairs):
 ("Hello", 1)
 ("world", 1)
 ("Hello", 1)
 ("Hadoop", 1)
 Each word is emitted with a count of 1.
 2. Reducer
 The Reducer takes the intermediate key-value pairs, groups values by key,
and sums them to get the final count.
 Reducer Input (Grouped by Key):
 ("Hello", [1, 1])
 ("world", [1])
 ("Hadoop", [1])
 Reducer Output (Final Word Count):
 ("Hello", 2)
 ("world", 1)
 ("Hadoop", 1)
 The word "Hello" appeared twice, so the Reducer sums its values (1+1=2).
 3. Combiner
 A Combiner is an optional optimization step that works like a mini-Reducer,
aggregating results before they are sent to the main Reducer.
• Instead of sending all (word, 1) pairs to the Reducer, the Combiner
performs local aggregation on the Mapper’s output.
• It helps reduce the amount of data transferred over the network.
• Combiner Output (Local Aggregation at Mapper Level):
• ("Hello", 2)
• ("world", 1)
• ("Hadoop", 1)
• Now, less data is sent to the Reducer, improving efficiency.
 4. Partitioner
 The Partitioner controls how Mapper output is distributed across multiple
Reducers.
• It ensures that all occurrences of the same word go to the same Reducer
for proper aggregation.
• The default partitioning is hash-based, meaning words with the same hash
are sent to the same Reducer.
 For example, if there are two Reducers:
• Reducer 1 gets words starting with A-M (e.g., "Hadoop", "Hello")
• Reducer 2 gets words starting with N-Z (e.g., "world")
 This ensures load balancing across Reducers.
 5. Searching in MapReduce
 Searching in MapReduce means filtering the dataset to find occurrences of
a word.
 Example Query:
 Find all lines containing the word "Hadoop".
• Mapper: Emits the line if it contains "Hadoop".
• Reducer: Merges results to get all matching lines.
 Example Output:
 ("Hadoop", "Hello Hadoop")
 This approach is useful for log analysis and text searching in large datasets.
 6. Sorting in MapReduce
 Sorting happens automatically in the shuffle and sort phase after the
Mapper step.
• The framework sorts keys before sending them to the Reducer.
• Custom sorting can be implemented using Comparator classes.
 Example Sorted Output Before Reduce Phase:
 ("Hadoop", 1)
 ("Hello", 1)
 ("Hello", 1)
 ("world", 1)
 Words are sorted alphabetically before reducing.
 7. Compression in MapReduce
 Compression helps reduce the size of data transferred between Mapper
and Reducer, improving efficiency.
• Input Compression: The input file can be compressed using formats like gzip
or bzip2.
• Intermediate Compression: Map output can be compressed before
shuffling.
• Output Compression: The final output can be stored in a compressed
format to save space.
 Example:
 If a 10GB dataset is compressed to 2GB, the system processes data faster
with reduced storage requirements

Introduction to Hadoop and MapReduce
No ratings yet
Introduction to Hadoop and MapReduce
44 pages
Introduction to Hadoop Overview
No ratings yet
Introduction to Hadoop Overview
37 pages
BDA Module2
No ratings yet
BDA Module2
37 pages
Big Data Analytics with Hadoop Overview
No ratings yet
Big Data Analytics with Hadoop Overview
22 pages
Introduction to Hadoop Framework Basics
No ratings yet
Introduction to Hadoop Framework Basics
56 pages
Understanding Hadoop Architecture and Components
No ratings yet
Understanding Hadoop Architecture and Components
35 pages
Overview of Hadoop Modules and Ecosystem
No ratings yet
Overview of Hadoop Modules and Ecosystem
33 pages
Introduction to Hadoop and Its Ecosystem
No ratings yet
Introduction to Hadoop and Its Ecosystem
84 pages
Intro Hadoop Ecosystem Components, Hadoop Ecosystem Tools
No ratings yet
Intro Hadoop Ecosystem Components, Hadoop Ecosystem Tools
15 pages
Understanding Hadoop: HDFS & MapReduce
No ratings yet
Understanding Hadoop: HDFS & MapReduce
25 pages
Big Data Analytics With Hadoop
No ratings yet
Big Data Analytics With Hadoop
22 pages
Big Data Processing with Hadoop & Cloud
No ratings yet
Big Data Processing with Hadoop & Cloud
39 pages
BDA - Module 2
No ratings yet
BDA - Module 2
81 pages
BDA - Module 2
No ratings yet
BDA - Module 2
81 pages
Overview of Hadoop Framework in Big Data
No ratings yet
Overview of Hadoop Framework in Big Data
32 pages
Introduction to Hadoop Architecture
No ratings yet
Introduction to Hadoop Architecture
101 pages
Introduction to Hadoop Architecture
No ratings yet
Introduction to Hadoop Architecture
46 pages
BDA3rdunit Final
No ratings yet
BDA3rdunit Final
41 pages
Understanding Hadoop for Big Data
No ratings yet
Understanding Hadoop for Big Data
38 pages
Introduction to Hadoop Framework Basics
No ratings yet
Introduction to Hadoop Framework Basics
14 pages
Introduction to Hadoop Framework
No ratings yet
Introduction to Hadoop Framework
6 pages
Introduction to Hadoop Framework
No ratings yet
Introduction to Hadoop Framework
59 pages
Big Data Processing with Apache Hadoop
No ratings yet
Big Data Processing with Apache Hadoop
40 pages
Introduction to Hadoop and Big Data Concepts
No ratings yet
Introduction to Hadoop and Big Data Concepts
62 pages
Big Data Analytics with Hadoop Overview
No ratings yet
Big Data Analytics with Hadoop Overview
32 pages
Overview of Hadoop Framework and HDFS
No ratings yet
Overview of Hadoop Framework and HDFS
8 pages
bdcc-2 2
No ratings yet
bdcc-2 2
12 pages
Base Principles of NoSQL Databases
100% (1)
Base Principles of NoSQL Databases
16 pages
Introduction to Hadoop Framework
No ratings yet
Introduction to Hadoop Framework
55 pages
Overview of Hadoop Components
No ratings yet
Overview of Hadoop Components
18 pages
Understanding Hadoop and Its Ecosystem
No ratings yet
Understanding Hadoop and Its Ecosystem
10 pages
Understanding Hadoop and Big Data Solutions
No ratings yet
Understanding Hadoop and Big Data Solutions
67 pages
Introduction to Hadoop and Big Data
No ratings yet
Introduction to Hadoop and Big Data
20 pages
Understanding Apache Hadoop Basics
No ratings yet
Understanding Apache Hadoop Basics
40 pages
Understanding Apache Hadoop Framework
No ratings yet
Understanding Apache Hadoop Framework
19 pages
Introduction to Hadoop Architecture
No ratings yet
Introduction to Hadoop Architecture
71 pages
Introduction to Hadoop Framework
No ratings yet
Introduction to Hadoop Framework
7 pages
U4@Hadoop Architecture
No ratings yet
U4@Hadoop Architecture
24 pages
Comprehensive Guide to Hadoop Basics
No ratings yet
Comprehensive Guide to Hadoop Basics
90 pages
Understanding Hadoop: Big Data Framework
No ratings yet
Understanding Hadoop: Big Data Framework
13 pages
Unit 3
No ratings yet
Unit 3
103 pages
Introduction to Hadoop Framework
No ratings yet
Introduction to Hadoop Framework
154 pages
Unit Iii
No ratings yet
Unit Iii
49 pages
Big Data Framework: Hadoop Overview
No ratings yet
Big Data Framework: Hadoop Overview
23 pages
Introduction to Hadoop and MapReduce
No ratings yet
Introduction to Hadoop and MapReduce
53 pages
Big Data Concepts and Hadoop Overview
No ratings yet
Big Data Concepts and Hadoop Overview
67 pages
Introduction to Hadoop and Big Data
No ratings yet
Introduction to Hadoop and Big Data
106 pages
Advanced Hadoop and MapReduce Guide
No ratings yet
Advanced Hadoop and MapReduce Guide
29 pages
Unit 2 Big Data
No ratings yet
Unit 2 Big Data
16 pages
Understanding Hadoop: Key Features & Benefits
No ratings yet
Understanding Hadoop: Key Features & Benefits
21 pages
CAP Theorem in Big Data and NoSQL
No ratings yet
CAP Theorem in Big Data and NoSQL
16 pages
Understanding Hadoop: Framework Overview
No ratings yet
Understanding Hadoop: Framework Overview
26 pages
Data Movement in Apache Hadoop
No ratings yet
Data Movement in Apache Hadoop
105 pages
Dnyanesh Badave
No ratings yet
Dnyanesh Badave
3 pages
Audio Equipment Rental Price List
No ratings yet
Audio Equipment Rental Price List
3 pages
Understanding Microcontrollers and Sensors
No ratings yet
Understanding Microcontrollers and Sensors
19 pages
Overview of SNMP and TR-069 Protocols
No ratings yet
Overview of SNMP and TR-069 Protocols
22 pages
WDG4/WDP4D Maintenance Troubleshooting Guide
No ratings yet
WDG4/WDP4D Maintenance Troubleshooting Guide
53 pages
Full Stack Developer & Open Source Contributor
No ratings yet
Full Stack Developer & Open Source Contributor
2 pages
Valid Palindromic Roman Numerals
No ratings yet
Valid Palindromic Roman Numerals
45 pages
Cryptography Concepts and Techniques
No ratings yet
Cryptography Concepts and Techniques
9 pages
The Role of Software in Product Evolution
No ratings yet
The Role of Software in Product Evolution
13 pages
Understanding Operating Systems Basics
No ratings yet
Understanding Operating Systems Basics
32 pages
Lab: Switch Security Configuration
No ratings yet
Lab: Switch Security Configuration
19 pages
Data Communication Technologies Assignment
No ratings yet
Data Communication Technologies Assignment
3 pages
Defecte ABS2
No ratings yet
Defecte ABS2
64 pages
Understanding Factors and Multiples
No ratings yet
Understanding Factors and Multiples
42 pages
PCR RealTime - LightCycler - 2-0-Instrument-Operators-Manual
No ratings yet
PCR RealTime - LightCycler - 2-0-Instrument-Operators-Manual
280 pages
AS DEALER: Cashback & Discounts Guide
No ratings yet
AS DEALER: Cashback & Discounts Guide
2 pages
Computer Applications Management Exam Paper
No ratings yet
Computer Applications Management Exam Paper
33 pages
AW-FP200 Fire Alarm Control Panel Manual
100% (5)
AW-FP200 Fire Alarm Control Panel Manual
33 pages
MLP for MNIST Handwritten Digit Classification
No ratings yet
MLP for MNIST Handwritten Digit Classification
11 pages
SPARK Matrix - Card Management System (CMS) - 2022 - BPC - Full Report - Quadrant Knowledge Solutions Jan 2023
100% (1)
SPARK Matrix - Card Management System (CMS) - 2022 - BPC - Full Report - Quadrant Knowledge Solutions Jan 2023
54 pages
Latitude E5420 Manual
No ratings yet
Latitude E5420 Manual
7 pages
Abhishek Singh: Data Analyst Profile
No ratings yet
Abhishek Singh: Data Analyst Profile
1 page
Damro Furniture Order Form Design
No ratings yet
Damro Furniture Order Form Design
5 pages
SMS Spam Techniques and Risks Guide
No ratings yet
SMS Spam Techniques and Risks Guide
10 pages
Multi-Robot Assembly Planning for Manufacturing
No ratings yet
Multi-Robot Assembly Planning for Manufacturing
22 pages
B.Tech Project Registration: Low Power VLSI
No ratings yet
B.Tech Project Registration: Low Power VLSI
2 pages
Electro 1: Basic Electricity Course Guide
No ratings yet
Electro 1: Basic Electricity Course Guide
5 pages
Machine Learning for ECG Arrhythmia Detection
No ratings yet
Machine Learning for ECG Arrhythmia Detection
21 pages
Hierarchical Report for VENUM Data
No ratings yet
Hierarchical Report for VENUM Data
3 pages
Take Home Exam: Computing & IT Insights
No ratings yet
Take Home Exam: Computing & IT Insights
10 pages

Bda Module 2

Uploaded by

Bda Module 2

Uploaded by

MODULE 2

Introduction to Hadoop: Introducing hadoop, Why hadoop,

• Hardware Failure – Replication factor (RF)

• How to Process This Gigantic Store of Data?

 ClickStream analysis using Hadoop provide 3 key benefits:

 A Global ResourceManager: Its main responsibility is to distribute resources among

 NodeManager: This is a per-machine slave daemon. NodeManager responsibility is

 Per-application ApplicationMaster: This is an application-specific entity. Its

You might also like