BDA - Module 2

The document provides an introduction to Hadoop, highlighting its significance in managing big data through its low-cost, scalable, and flexible storage solutions. It contrasts Hadoop with traditional RDBMS, emphasizing Hadoop's suitability for large files and advanced analytics. Additionally, it outlines the Hadoop ecosystem, including components like HDFS, MapReduce, and YARN, and discusses the architecture and challenges associated with distributed computing.
MODULE-2

Introduction to Hadoop
2.1: Introducing Hadoop
• Big Data is a buzzword.
• Enterprises, and the world at large, are realizing how much of their data remains untapped.
• Huge amounts of data are generated every day, every minute and every second.
Data: the treasure trove
Challenges in Big data
2.2 Why Hadoop
• Low cost: Hadoop is an open-source framework that uses commodity hardware to store large amounts of data.
• Computing power: Hadoop is based on a distributed computing model that processes very large amounts of data.
• Scalability: capacity can be increased by adding nodes to the cluster.
• Storage flexibility: data does not need to be pre-processed before it is stored. Hadoop provides the convenience of storing as much data as one needs and the flexibility of deciding later how to use the stored data. Unstructured data such as images and videos can also be stored.
Hadoop makes use of commodity hardware, a distributed file system (DFS) and distributed computing.
2.3 Why not RDBMS

• An RDBMS is not suitable for storing and processing large files and images.

• It is not suitable for advanced analytics that involve machine learning.

• It requires huge investment as the volume of data grows.


2.4 Difference between RDBMS & Hadoop
2.5 Distributed Computing Challenges
Distributed computing faces two major challenges:

Hardware Failure

How to process the Gigantic store of data?


2.6 History of Hadoop

Created by Doug Cutting, the creator of Apache Lucene


Nutch – gathers data from the web & creates searchable indexes
2.7 Hadoop overview
Hadoop ecosystem
Hive → SQL-like querying engine for Hadoop, commonly used for data
warehousing.

Pig → A high-level scripting language for analyzing large data sets.

Sqoop → Transfers data between Hadoop and relational databases.

HBase → A NoSQL database for real-time read/write access to large datasets.

Flume → Captures and moves large amounts of log data into Hadoop.

Oozie → A workflow scheduler for managing Hadoop jobs.

Mahout → A Java library that implements machine learning techniques for clustering, classification & recommendation.
Hadoop follows a master-slave architecture

Master node – NameNode

Slave node – DataNode
2.8 Use case of Hadoop
2.9 HDFS
2.9.1 HDFS Daemons

HDFS breaks large files into smaller blocks.

The NameNode uses a rack ID to identify the DataNodes in a rack.

A rack is a collection of DataNodes within the cluster.
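
As a rough illustration of block splitting, the sketch below computes how a file would be cut into fixed-size blocks. The 128 MB block size and the function name `split_into_blocks` are assumptions made for this example; in a real cluster the block size is a configuration setting.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (configurable in a real cluster)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that together cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A hypothetical 300 MB file: two full blocks plus one shorter final block.
blocks = split_into_blocks(300 * 1024 * 1024)
for offset, length in blocks:
    print(offset, length)
```

Only the final block is shorter than the block size; every other block is full.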

• When the NameNode starts up, it reads the FsImage and EditLog from disk.
• The NameNode then applies all the transactions from the EditLog to the in-memory representation of the FsImage.
Secondary NameNode

The NameNode logs every change to the file system metadata in the EditLog.

The Secondary NameNode periodically:

• Downloads the current FsImage and EditLog from the NameNode.

• Applies all changes from the EditLog to the FsImage.

• Creates a new, updated FsImage.

• Sends this new FsImage back to the NameNode.
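
The checkpoint steps above can be sketched in a few lines. This is a simplified model, not the on-disk format: the namespace image is assumed to be a plain dictionary from path to block list, and the `(operation, path, value)` tuples and the `apply_edits` function are hypothetical names chosen for the example.

```python
def apply_edits(fsimage, edit_log):
    """Replay logged metadata changes onto a copy of the namespace image."""
    image = dict(fsimage)  # work on a copy, as the checkpoint builds a new image
    for op, path, value in edit_log:
        if op == "create":
            image[path] = value              # value is the list of block IDs
        elif op == "delete":
            image.pop(path, None)
        elif op == "rename":
            image[value] = image.pop(path)   # value is the new path
    return image

# Hypothetical starting image and edit log.
fsimage = {"/logs/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.txt", ["blk_2", "blk_3"]),
    ("rename", "/logs/a.txt", "/archive/a.txt"),
    ("delete", "/logs/b.txt", None),
]

new_fsimage = apply_edits(fsimage, edit_log)
print(new_fsimage)
```

Once the merged image has been handed back, the log only needs to hold changes made after the checkpoint, which keeps NameNode restarts short.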


Anatomy of file read
Anatomy of file write
Replica placement strategy
2.10 Data processing with Hadoop
JobTracker
Provides connectivity between Hadoop and the application
Decides which node each task is assigned to; there is a single JobTracker per cluster
Monitors the running tasks
If a task fails, automatically reschedules it on a different node

TaskTracker
Executes the tasks assigned by the JobTracker; there is a single TaskTracker per slave node
Sends heartbeats to the JobTracker
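
The interplay of heartbeats and rescheduling can be illustrated with a toy model. This is a sketch under stated assumptions, not Hadoop's scheduler: the class below, its method names, the 10-second timeout and the least-loaded placement rule are all made up for the example.

```python
class JobTracker:
    """Toy model: tracks heartbeats and moves a failed task to another node."""

    def __init__(self, trackers, timeout=10.0):
        self.last_seen = {t: 0.0 for t in trackers}  # tracker -> time of last heartbeat
        self.assignment = {}                         # task -> tracker
        self.timeout = timeout                       # assumed heartbeat timeout (seconds)

    def heartbeat(self, tracker, now):
        self.last_seen[tracker] = now

    def live_trackers(self, now):
        return [t for t, seen in self.last_seen.items() if now - seen <= self.timeout]

    def assign(self, task, now, exclude=None):
        candidates = [t for t in self.live_trackers(now) if t != exclude]
        if not candidates:
            raise RuntimeError("no live TaskTracker available")
        # Assumed policy: pick the live tracker that currently holds the fewest tasks.
        def load(tracker):
            return sum(1 for t in self.assignment.values() if t == tracker)
        chosen = min(candidates, key=load)
        self.assignment[task] = chosen
        return chosen

    def reschedule_failed(self, task, now):
        # Never retry on the node where the task just failed.
        return self.assign(task, now, exclude=self.assignment[task])

jt = JobTracker(["node1", "node2", "node3"])
first = jt.assign("map-0", now=0.0)
jt.heartbeat("node2", 12.0)
jt.heartbeat("node3", 12.0)          # node1 stays silent
retry = jt.reschedule_failed("map-0", now=15.0)
print(first, retry)
```

In this run node1 stops sending heartbeats, so the task that started there is moved to one of the nodes that are still reporting.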
MapReduce programming workflow
Managing resources and applications with Hadoop
YARN
Limitation of Hadoop 1

HDFS limitation

- The NameNode keeps all of its file metadata in main memory

- The NameNode can quickly become overwhelmed as the load on the system increases
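
To get a feel for why in-memory metadata limits a single NameNode, here is a back-of-the-envelope estimate. The figure of about 150 bytes per file or block entry is a commonly quoted rule of thumb, used here purely as an assumption; actual usage depends on the Hadoop version and path lengths.

```python
BYTES_PER_OBJECT = 150  # assumed rule-of-thumb heap cost per file or block entry

def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode heap used by file and block metadata."""
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT

small = namenode_heap_bytes(1_000_000)     # one million single-block files
large = namenode_heap_bytes(100_000_000)   # one hundred million single-block files
print(small / 1e6, "MB")
print(large / 1e9, "GB")
```

Under this assumption the memory need grows linearly with the number of files, so many small files strain the NameNode long before the DataNodes run out of disk.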
Hadoop 2: HDFS

Features

1. Horizontal scalability
2. High availability

HDFS federation uses multiple independent NameNodes for horizontal scalability.

All DataNodes in the cluster register with each NameNode in the cluster.
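
A minimal sketch of the federation idea follows. It assumes the namespace is divided by path prefix, with each prefix owned by one NameNode; the `NameNode` class, the prefix table and `resolve` are hypothetical names for illustration only.

```python
class NameNode:
    """Toy NameNode that only remembers which DataNodes registered with it."""

    def __init__(self, name):
        self.name = name
        self.datanodes = set()

    def register(self, datanode):
        self.datanodes.add(datanode)

# Assumed split of the namespace: each path prefix is owned by one NameNode.
namenodes = {"/user": NameNode("nn1"), "/data": NameNode("nn2")}

# Every DataNode registers with every NameNode, so storage is shared.
for dn in ["dn1", "dn2", "dn3"]:
    for nn in namenodes.values():
        nn.register(dn)

def resolve(path):
    """Return the NameNode whose prefix is the longest match for path."""
    matches = [p for p in namenodes if path == p or path.startswith(p + "/")]
    if not matches:
        raise KeyError(path)
    return namenodes[max(matches, key=len)]

print(resolve("/user/alice/report.txt").name)
print(resolve("/data/2024/events.log").name)
```

Metadata load is spread across the NameNodes, while every NameNode can still place blocks on any DataNode.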
Fundamental Idea

Global ResourceManager

Scheduler
ApplicationsManager

NodeManager

Per-application ApplicationMaster

Basic Concepts

Application – a job submitted to the framework, for example a MapReduce job

Container – the basic unit of resource allocation (for example memory and CPU)
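
The relationship between these components can be sketched as a toy allocator. This is not the YARN API: the classes below, the memory-only resource model and the first-fit placement rule are assumptions made to keep the example short.

```python
class NodeManager:
    """Toy NodeManager that tracks the free memory on one node."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb

class ResourceManager:
    """Toy global ResourceManager with a first-fit scheduler."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        """Grant a container on the first node with enough free memory, else None."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return {"app": app_id, "node": node.name, "memory_mb": memory_mb}
        return None

rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])

# The first container of an application hosts its ApplicationMaster,
# which then requests further containers for the application's tasks.
am_container = rm.allocate("app-1", 1024)
task_containers = [rm.allocate("app-1", 2048) for _ in range(3)]
print(am_container)
print(task_containers)
```

The third task request is refused because no node has 2048 MB left, which is the kind of decision the scheduler makes on behalf of all running applications.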
YARN architecture
Interacting with Hadoop ecosystem
Introduction to MapReduce
Jobs are split into a set of map tasks and reduce tasks.

Tasks are executed in a distributed fashion on the Hadoop cluster.

Map task – loading, parsing, transforming & filtering data

Reduce task – grouping & aggregating data
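
The division of labour between map and reduce tasks can be illustrated with a word count that runs in a single process. It is a sketch of the programming model only, not of Hadoop's Java API; the function names and the two input lines are invented for the example.

```python
from collections import defaultdict

def map_task(line):
    """Map: parse one line and emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

lines = ["Hadoop stores data", "Hadoop processes data"]

mapped = [pair for line in lines for pair in map_task(line)]
counts = dict(reduce_task(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

On a cluster each input split would go to a separate map task and each group of keys to a separate reduce task; here the same three steps simply run one after another.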

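Each map task is broken down into phases: RecordReader, Mapper, Combiner and Partitioner

Each reduce task is broken down into phases: Shuffle, Sort, Reducer and Output Format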


Mapper
Reducer
Combiner
Partitioner
Searching
Sorting
Compression
