Introduction to Hadoop and MapReduce

The document provides the agenda for a lecture on Hadoop lasting between 120 and 150 minutes. The agenda introduces Hadoop and its key components, such as HDFS and MapReduce. It covers the architecture and workings of HDFS, including file reads, writes, and replication, as well as the MapReduce programming model and an example word count program. The lecture explains the limitations of the original Hadoop 1.0 architecture and how YARN was developed to overcome them. Finally, it discusses other tools in the Hadoop ecosystem: Pig, Hive, Sqoop, and HBase.

Uploaded by

Ponnusamy S Pichaimuthu

Chapter 5

Introduction to Hadoop
Learning Objectives and Learning Outcomes

Learning Objectives

1. To study the features of Hadoop.
2. To learn the basic concepts of HDFS and MapReduce Programming.
3. To study HDFS Architecture.
4. To study MapReduce Programming Model.
5. To study Hadoop Ecosystem.

Learning Outcomes

a) To comprehend the reasons behind the popularity of Hadoop.
b) To be able to perform HDFS operations.
c) To comprehend MapReduce framework.
d) To understand the read and write in HDFS.
e) To be able to understand Hadoop Ecosystem.
Session Plan

Lecture time 120 to 150 minutes

Q/A 15 minutes
Agenda
• Hadoop - An Introduction
• RDBMS versus Hadoop
• Distributed Computing Challenges
• History of Hadoop
• Hadoop Overview
• Key Aspects of Hadoop
• Hadoop Components
• High Level Architecture of Hadoop
• Use case for Hadoop
• ClickStream Data
• Hadoop Distributors
• HDFS
• HDFS Daemons
• Anatomy of File Read
• Anatomy of File Write
• Replica Placement Strategy
• Working with HDFS commands
• Special Features of HDFS
Agenda

• Processing Data with Hadoop
  • What is MapReduce Programming?
  • How does MapReduce Work?
  • MapReduce Word Count Example

• Managing Resources and Applications with Hadoop YARN
  • Limitations of Hadoop 1.0 Architecture
  • Hadoop 2 YARN: Taking Hadoop Beyond Batch

• Hadoop Ecosystem
  • Pig
  • Hive
  • Sqoop
  • HBase
Hadoop – An Introduction
What is Hadoop

Hadoop is:
Ever wondered why Hadoop has been, and still is, one of the most sought-after technologies?

The key consideration (the rationale behind its huge popularity) is its capability to handle massive amounts of data, of different categories, fairly quickly.

The other considerations are:

RDBMS versus HADOOP
Distributed Computing Challenges

• Hardware Failure

• How to Process This Gigantic Store of Data?

History of Hadoop
Hadoop Overview
Key Aspects of Hadoop
Hadoop Components

Hadoop Core Components:

HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.

MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
Hadoop High Level Architecture
Use case for Hadoop
ClickStream Data Analysis

ClickStream data (mouse clicks) helps you to understand the purchasing
behavior of customers. ClickStream analysis helps online marketers to
optimize their product web pages, promotional content, etc. to
improve their business.
Hadoop Distributors
HDFS
(HADOOP DISTRIBUTED FILE SYSTEM)
Hadoop Distributed File System
1. Storage component of Hadoop.

2. Distributed File System.

3. Modeled after Google File System.

4. Optimized for high throughput (HDFS uses a large block size and
moves computation to where the data is stored).

5. You can replicate a file a configured number of times, which makes it
tolerant of both software and hardware failures.

6. Automatically re-replicates the data blocks that were stored on failed nodes.

7. You can realize the power of HDFS when you read or write
large files (gigabytes and larger).

8. Sits on top of a native file system such as ext3 or ext4.
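
The large block size and configurable replication described above can be made concrete with a little arithmetic. The following is a minimal Python sketch, not Hadoop code: the 128 MB block size and the replication factor of 3 are assumed values (both are configurable per cluster), and `block_layout` is a hypothetical helper introduced only for illustration.

```python
import math

BLOCK_SIZE_MB = 128      # assumed block size; configurable in HDFS
REPLICATION_FACTOR = 3   # assumed replication factor; configurable in HDFS


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION_FACTOR):
    """Return (number of blocks, total block replicas, raw storage in MB)."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    total_replicas = num_blocks * replication
    # The last block only occupies as much space as the data it holds,
    # so raw storage is file size times replication.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, total_replicas, raw_storage_mb


if __name__ == "__main__":
    # A 1 GB file under the assumed settings.
    print(block_layout(1024))
```

Under these assumptions a 1 GB file is split into 8 blocks, stored as 24 block replicas across the cluster.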
HDFS Daemons

NameNode:

• Single NameNode per cluster.

• Keeps the metadata details

DataNode:

• Multiple DataNodes per cluster

• Read/Write operations

SecondaryNameNode:

• Housekeeping Daemon
Anatomy of File Read
Anatomy of File Write
Replica Placement Strategy

As per the Hadoop Replica Placement Strategy, the first replica is placed on the
same node as the client. The second replica is placed on a node on a
different rack. The third replica is placed on the same rack as the
second, but on a different node in that rack. Once the replica locations have been
set, a pipeline is built. This strategy provides good reliability.
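
The placement rules above can be sketched in a few lines of Python. This is an illustrative model only, not the NameNode's actual implementation: the cluster layout, the node names, and the `place_replicas` function are hypothetical, and the sketch assumes the client runs on a cluster node, that there are at least two racks, and that every rack has at least two nodes.

```python
import random


def place_replicas(cluster, client_node, seed=None):
    """Pick three nodes following the placement strategy described above.

    cluster maps a rack name to its list of node names (hypothetical layout).
    """
    rng = random.Random(seed)
    rack_of = {node: rack for rack, nodes in cluster.items() for node in nodes}

    # First replica: the node the client is running on.
    first = client_node

    # Second replica: any node on a rack different from the first replica's.
    other_racks = [rack for rack in cluster if rack != rack_of[first]]
    second_rack = rng.choice(other_racks)
    second = rng.choice(cluster[second_rack])

    # Third replica: same rack as the second, but a different node.
    third = rng.choice([n for n in cluster[second_rack] if n != second])

    # The returned order is also the order of the write pipeline.
    return [first, second, third]


if __name__ == "__main__":
    cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas(cluster, client_node="n1", seed=0))
```

Losing a single node or a whole rack still leaves at least one replica reachable, which is why the strategy is described as reliable.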
Working with HDFS Commands

Objective: To create a directory (say, sample) in HDFS.

Act:

hadoop fs -mkdir /sample

Objective: To copy a file from local file system to HDFS.

Act:

hadoop fs -put /root/sample/[Link] /sample/[Link]

Objective: To copy a file from HDFS to local file system.

Act:

hadoop fs -get /sample/[Link] /root/sample/[Link]
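
When these commands need to be issued from a script, they can be assembled programmatically. The sketch below is one possible approach in Python: `hdfs_command` and `run_if_available` are hypothetical helpers, `data.txt` is a placeholder file name that does not come from the slides, and only the three actions shown above (`-mkdir`, `-put`, `-get`) are covered.

```python
import shutil
import subprocess


def hdfs_command(action, *paths):
    """Build the argument list for one of the `hadoop fs` commands shown above."""
    allowed = {"mkdir", "put", "get"}
    if action not in allowed:
        raise ValueError(f"unsupported action: {action}")
    return ["hadoop", "fs", f"-{action}", *paths]


def run_if_available(args):
    """Run the command only if a `hadoop` executable is on PATH."""
    if shutil.which(args[0]) is None:
        return None  # no Hadoop client installed; nothing is executed
    return subprocess.run(args, check=False).returncode


if __name__ == "__main__":
    # /sample is the directory from the slide; data.txt is a placeholder name.
    print(hdfs_command("mkdir", "/sample"))
    print(hdfs_command("put", "/root/sample/data.txt", "/sample/data.txt"))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.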

Special Features of HDFS

Data Replication: There is no need for a client application to
track all blocks. HDFS directs the client to the nearest replica to ensure high
performance.

Data Pipeline: A client application writes a block to the first DataNode in
the pipeline. This DataNode then takes over and forwards the data to the
next node in the pipeline. This process continues for all the data blocks,
and subsequently all the replicas are written to disk.
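
The data pipeline can be illustrated with a small simulation. This is a toy model in Python, not the HDFS wire protocol: each DataNode is represented by a plain dictionary standing in for its disk, and `write_block` and `write_file` are hypothetical names.

```python
def write_block(block, pipeline):
    """Simulate writing one block through a DataNode pipeline.

    pipeline is an ordered list of dicts, each standing in for a DataNode's disk.
    Returns the number of DataNodes that stored the block.
    """
    if not pipeline:
        return 0
    head, rest = pipeline[0], pipeline[1:]
    block_id, data = block
    head[block_id] = data  # this DataNode stores the block...
    # ...and then forwards it to the next DataNode in the pipeline.
    return 1 + write_block(block, rest)


def write_file(blocks, pipeline):
    """The client hands each block only to the first DataNode."""
    return [write_block(block, pipeline) for block in blocks]


if __name__ == "__main__":
    datanodes = [{}, {}, {}]
    stored = write_file([("blk_1", b"aaa"), ("blk_2", b"bbb")], datanodes)
    print(stored)
    print(datanodes[0] == datanodes[1] == datanodes[2])
```

The client never contacts the second or third DataNode directly; each node in the chain is responsible for passing the block on.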
Processing with Hadoop
What is MapReduce Programming?

MapReduce Programming is a software framework. MapReduce
Programming helps you to process massive amounts of data in parallel.
How MapReduce Programming Works
MapReduce – Word Count Example
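
A minimal sketch of word count in the MapReduce style is given here in plain Python rather than with the Hadoop Java API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, and everything runs in one process, whereas Hadoop would run many mappers and reducers in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop is fast", "hadoop is big"]))
```

Because each mapper sees only its own lines and each reducer sees only the values for its own keys, both phases can be spread across many machines without coordination beyond the shuffle.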
MANAGING RESOURCES AND APPLICATIONS
WITH HADOOP - YARN

(YET ANOTHER RESOURCE NEGOTIATOR)

Limitations of Hadoop 1.0 Architecture

1. A single NameNode is responsible for managing the entire namespace for the Hadoop
cluster.

2. It has a restricted processing model, which is suitable only for batch-oriented
MapReduce jobs.

3. Hadoop MapReduce is not suitable for interactive analysis.

4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and
other memory-intensive algorithms.

5. MapReduce is responsible for both cluster resource management and data
processing.
Hadoop 2 YARN: Taking Hadoop beyond Batch

The fundamental idea behind this architecture is to split the
JobTracker's two responsibilities, resource management and job
scheduling/monitoring, into separate daemons. The daemons that are part of
the YARN architecture are described below.

A Global ResourceManager: Its main responsibility is to distribute
resources among the various applications in the system. It has two main
components: the Scheduler and the ApplicationsManager.

NodeManager: This is a per-machine slave daemon. The NodeManager's
responsibility is to launch the application containers for application
execution. The NodeManager monitors resource usage such as memory,
CPU, disk, and network. It then reports the usage of resources to the
global ResourceManager.

Per-application ApplicationMaster: This is an application-specific
entity. Its responsibility is to negotiate the resources required for execution
from the ResourceManager. It works along with the NodeManager to
execute and monitor component tasks.
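
The division of labour between the three daemons can be sketched as a toy model. The Python classes below borrow the daemon names from the text, but their methods (`allocate`, `launch_container`, `run_tasks`) and the first-fit allocation policy are assumptions made for illustration; they are not the YARN API or one of its real schedulers, and only memory is tracked.

```python
class NodeManager:
    """Per-machine daemon: launches containers and tracks free memory."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

    def launch_container(self, memory_mb):
        self.free_mb -= memory_mb
        return (self.name, memory_mb)


class ResourceManager:
    """Global daemon: distributes resources across all NodeManagers."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb):
        # Toy policy: first node with enough free memory.
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                return nm
        return None


class ApplicationMaster:
    """Per-application entity: negotiates containers for its tasks."""

    def __init__(self, resource_manager):
        self.rm = resource_manager

    def run_tasks(self, num_tasks, memory_mb):
        containers = []
        for _ in range(num_tasks):
            nm = self.rm.allocate(memory_mb)
            if nm is None:
                break  # cluster is full; remaining tasks would have to wait
            containers.append(nm.launch_container(memory_mb))
        return containers


if __name__ == "__main__":
    nodes = [NodeManager("node1", 2048), NodeManager("node2", 1024)]
    app = ApplicationMaster(ResourceManager(nodes))
    print(app.run_tasks(num_tasks=4, memory_mb=1024))
```

With 3 GB of memory in this toy cluster, only three of the four requested 1 GB containers are granted. The ResourceManager here knows nothing about what the tasks do, which is the point of separating resource management from application logic.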
Interacting with Hadoop Ecosystem
Pig: Pig is a data flow system for Hadoop. It uses Pig Latin to specify data
flow. Pig is an alternative to MapReduce Programming. It abstracts some
details and allows you to focus on data processing.

Hive: Hive is a data warehousing layer on top of Hadoop. Analysis and queries
can be done using an SQL-like language. Hive can be used for ad-hoc queries,
summarization, and data analysis. Figure 5.31 depicts Hive in the Hadoop
ecosystem.

Sqoop: Sqoop is a tool that helps to transfer data between Hadoop and
relational databases. With the help of Sqoop, you can import data from an RDBMS
to HDFS and vice versa. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.

HBase: HBase is a column-oriented NoSQL database for Hadoop. HBase is used
to store billions of rows and millions of columns. HBase provides random
read/write operations. It also supports record-level updates, which are not
possible using HDFS. HBase sits on top of HDFS.
Figure 5.33 depicts HBase in the Hadoop ecosystem.
Answer a few quick questions…
Match the columns

Column A                  Column B

HDFS                      DataNode
MapReduce Programming     NameNode
Master node               Processing Data
Slave node                Google File System and MapReduce
Hadoop Implementation     Storage
Match the columns

Column A                  Column B

JobTracker                Executes Task
MapReduce                 Schedules Task
TaskTracker               Programming Model
Job Configuration         Converts input into Key Value pair
Map                       Job Parameters
Thank You
