BDA - Module 2

The document provides an introduction to Hadoop, highlighting its significance in managing big data through its low-cost, scalable, and flexible storage solutions. It contrasts Hadoop with traditional RDBMS, emphasizing Hadoop's suitability for large files and advanced analytics. Additionally, it outlines the Hadoop ecosystem, including components like HDFS, MapReduce, and YARN, and discusses the architecture and challenges associated with distributed computing.

Uploaded by

soundaryasriram23

MODULE-2

Introduction to Hadoop
2.1: Introducing Hadoop
• Big Data is a buzzword.
• Enterprises, and the world at large, are realizing the huge volume of untapped data.
• Huge amounts of data are generated every day, every minute, every second.
Data: the treasure trove
Challenges in Big data
2.2 Why Hadoop
• Low cost: Hadoop is an open-source framework that uses commodity hardware to store large amounts of data.
• Computing power: Hadoop is based on a distributed computing model that processes very large amounts of data.
• Scalability: capacity grows by adding nodes to the cluster.
• Storage flexibility: data does not need to be pre-processed before storing. Hadoop provides the convenience of storing as much data as one needs and the flexibility of deciding later how to use the stored data. Unstructured data such as images and videos can also be stored.
Hadoop makes use of commodity hardware, a distributed file system (DFS) & distributed computing.
2.3 Why not RDBMS

• RDBMS is not suitable for storing & processing large files & images

• Not suitable for advanced analytics involving machine learning

• Requires huge investment

2.4 Difference between RDBMS & Hadoop
2.5 Distributed Computing Challenges
Distributed computing faces two major challenges:

• Hardware failure

• How to process the gigantic store of data?

2.6 History of Hadoop

Created by Doug Cutting, creator of Apache Lucene

Nutch – gathers data from the web & creates searchable indexes
2.7 Hadoop overview
Hadoop ecosystem
Hive → SQL-like querying engine for Hadoop, commonly used for data warehousing.

Pig → A high-level scripting language for analyzing large data sets.

Sqoop → Transfers data between Hadoop and relational databases.

HBase → A NoSQL database for real-time read/write access to large datasets.

Flume → Captures and moves large amounts of log data into Hadoop.

Oozie → A workflow scheduler for managing Hadoop jobs.

Mahout → A Java library that implements machine learning techniques for clustering, classification & recommendation.

Hadoop has a master-slave architecture

Master node – NameNode

Slave node – DataNode
2.8 Use case of Hadoop
2.9 HDFS
2.9.1 HDFS Daemons

HDFS breaks large files into smaller blocks

The NameNode uses a rack ID to identify the DataNodes in a rack

A rack is a collection of DataNodes within the cluster
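
As a rough illustration of the block splitting and rack grouping described above, the sketch below computes block boundaries for a file and groups DataNodes by rack ID. The 128 MB block size is an assumption (it is configurable in a real cluster), and the function names and the node and rack labels are hypothetical. This is a toy model, not HDFS code.

```python
from collections import defaultdict

# Assumed block size of 128 MB; a real cluster sets this in its configuration.
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def group_by_rack(datanodes):
    """Group DataNode names by rack ID, given a {datanode: rack_id} mapping."""
    racks = defaultdict(list)
    for node, rack_id in sorted(datanodes.items()):
        racks[rack_id].append(node)
    return dict(racks)


# A 300 MB file needs three blocks: two full ones and a 44 MB remainder.
blocks = split_into_blocks(300 * 1024 * 1024)

# Hypothetical cluster layout: two DataNodes on rack1, one on rack2.
racks = group_by_rack({"dn1": "rack1", "dn2": "rack1", "dn3": "rack2"})
```

Only the last block of a file can be shorter than the block size, which is why the remainder is handled with `min`.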

• On startup, the NameNode reads the FsImage and EditLog from disk
• The NameNode applies all the transactions from the EditLog to the in-memory representation of the FsImage
Secondary NameNode

The NameNode logs every change to the file system metadata in the edit log

The Secondary NameNode periodically:

• Downloads the current fsimage and edit log from the NameNode.

• Applies all changes from the edit log to the fsimage.

• Creates a new, updated fsimage.

• Sends this new fsimage back to the NameNode.
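
The checkpoint steps above can be sketched as replaying a log onto a snapshot. This is a minimal model under stated assumptions: the fsimage is represented as a plain dictionary from path to block IDs, and the edit log as a list of `(operation, path, value)` tuples. Both formats, the operation names, and `apply_edit_log` are hypothetical; the real files use Hadoop's own on-disk formats.

```python
def apply_edit_log(fsimage, edit_log):
    """Replay logged metadata changes onto a copy of the fsimage."""
    new_image = dict(fsimage)  # leave the downloaded image untouched
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "rename":
            # For a rename, value holds the new path.
            new_image[value] = new_image.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


# Hypothetical metadata: path -> list of block IDs.
fsimage = {"/data/a.txt": ["blk_1"], "/data/b.txt": ["blk_2", "blk_3"]}
edit_log = [
    ("create", "/data/c.txt", ["blk_4"]),
    ("delete", "/data/a.txt", None),
    ("rename", "/data/b.txt", "/data/old_b.txt"),
]

# The merged image is what would be sent back to the NameNode.
checkpoint = apply_edit_log(fsimage, edit_log)
```

Merging periodically keeps the edit log short, so a restarting NameNode has fewer transactions to replay.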

Anatomy of file read
Anatomy of file write
Replica placement strategy
2.10 Data processing with hadoop
JobTracker
• Provides connectivity between Hadoop & the application
• Decides which node each task is assigned to; there is a single JobTracker per cluster
• Monitors the running tasks
• If a task fails, automatically re-schedules it on a different node

TaskTracker
• Executes the tasks assigned by the JobTracker; there is a single TaskTracker per slave node
• Sends heartbeats to the JobTracker
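
To make the re-scheduling behaviour concrete, here is a small sketch of a tracker that assigns tasks to nodes and moves a failed task to a different node. It is a toy model, not Hadoop's implementation: the class name reuses the term from the notes, but the least-loaded placement rule, the method names, and the node and task labels are all assumptions made for illustration.

```python
class JobTracker:
    """Toy scheduler: assigns tasks to nodes and re-schedules failures."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.assignments = {}  # task -> node currently running it

    def assign(self, task, exclude=()):
        """Place a task on the least-loaded node that is not excluded."""
        candidates = [n for n in self.nodes if n not in exclude]
        if not candidates:
            raise RuntimeError("no node available for " + task)
        load = {n: 0 for n in candidates}
        for node in self.assignments.values():
            if node in load:
                load[node] += 1
        node = min(candidates, key=lambda n: load[n])
        self.assignments[task] = node
        return node

    def report_failure(self, task):
        """Re-schedule a failed task on a node other than the one that failed."""
        failed_node = self.assignments.pop(task)
        return self.assign(task, exclude=(failed_node,))


tracker = JobTracker(["node1", "node2", "node3"])
first = tracker.assign("map-0")
second = tracker.assign("map-1")
retry = tracker.report_failure("map-0")  # lands on a node other than `first`
```

Excluding the failed node on retry is the key point; how the failure is detected (for example, through missed heartbeats) is left out of this sketch.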
MapReduce programming workflow
Managing resources and applications with Hadoop
YARN
Limitations of Hadoop 1

HDFS limitation

- NameNode saves all its file metadata in main memory

- NameNode can quickly become overwhelmed as the load on the system increases
Hadoop 2: HDFS

Features

1. Horizontal scalability
2. High availability

HDFS federation uses multiple independent NameNodes for horizontal scalability.

All DataNodes in the cluster register with each NameNode in the cluster.
Fundamental Idea

Global Resource Manager

Scheduler
ApplicationsManager

NodeManager

Per-application ApplicationMaster

Basic Concepts

Application – Job submitted to the framework

MapReduce Job
Container
YARN architecture
Interacting with Hadoop ecosystem
Introduction to MapReduce
A job is split into a set of map tasks & reduce tasks

Tasks are executed in a distributed fashion on the Hadoop cluster

Map task – loading, parsing, transforming & filtering

Reduce task – grouping & aggregating data
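
The division of labour between map and reduce tasks can be illustrated with a word count simulated in plain Python. This is a single-process sketch, not code written against the Hadoop API: the function names, the punctuation handling, and the sample input are assumptions, and the `shuffle` step stands in for the grouping the framework performs between the two phases.

```python
from collections import defaultdict


def map_task(line):
    """Parse and filter one input line, emitting (word, 1) pairs."""
    for word in line.lower().split():
        word = word.strip(".,!?")
        if word:
            yield word, 1


def shuffle(pairs):
    """Group all values by key, as happens between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Aggregate the values collected for one key."""
    return key, sum(values)


lines = ["Hadoop stores big data.", "Hadoop processes big data!"]
mapped = [pair for line in lines for pair in map_task(line)]
counts = dict(reduce_task(k, v) for k, v in shuffle(mapped).items())
```

On a cluster, each call to `map_task` would run on the node holding that slice of the input, and each key group would be handled by one reduce task.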

Each map task is broken down into

Reducer tasks are broken down into

Mapper
Reducer
Combiner
Partitioner
Searching
Sorting
Compression
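
The Combiner and Partitioner stages listed above can be sketched in the same simulated style. This assumes a word-count job: the combiner pre-aggregates one mapper's output locally to cut the data sent over the network, and the partitioner picks a reducer for each key. The CRC32 hash is a stand-in chosen only because it is stable across Python runs; the function names and sample pairs are hypothetical.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output before it is shuffled."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: map a key to a reducer index using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Hypothetical output of a single mapper.
mapper_output = [("big", 1), ("data", 1), ("big", 1), ("hadoop", 1), ("big", 1)]
combined = combine(mapper_output)  # five pairs shrink to three

num_reducers = 2
buckets = {r: [] for r in range(num_reducers)}
for key, value in combined:
    buckets[partition(key, num_reducers)].append((key, value))
```

Because the partition depends only on the key, every pair with the same key reaches the same reducer regardless of which mapper produced it.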
