MODULE-2
Introduction to Hadoop
2.1: Introducing Hadoop
• Big Data is a buzzword.
• Enterprises and the world at large are realizing the huge volume
of untapped data.
• A vast amount of data is generated every day, every
minute, every second.
Data: the treasure trove
Challenges in Big data
2.2 Why Hadoop
• Low cost : Hadoop is an open-source framework that uses
commodity hardware to store large amounts of data.
• Computing power : based on a distributed computing model
which processes very large amounts of data
• Scalability : the cluster grows simply by adding more nodes
• Storage flexibility : data does not need to be pre-processed before storing
Provides the convenience of storing as much data
as one needs & the flexibility of deciding later
how to use the stored data.
Unstructured data such as images, videos etc.
can be stored
Hadoop makes use of commodity hardware, DFS & distributed computing
2.3 Why not RDBMS
• RDBMS is not suitable for storing & processing large files & images
• Not suitable for advanced analytics involving machine learning
• Requires huge investment
2.4 Difference between RDBMS & Hadoop
2.5 DISTRIBUTED COMPUTING CHALLENGES
Two major challenges:
• Hardware failure
• How to process the gigantic store of data?
2.6 History of Hadoop
Created by Doug Cutting, creator of Apache Lucene
Nutch – gathers data from the web & creates searchable indexes
2.7 Hadoop overview
Hadoop ecosystem
Hive → SQL-like querying engine for Hadoop, commonly used for data
warehousing.
Pig → A high-level scripting language for analyzing large data sets.
Sqoop → Transfers data between Hadoop and relational databases.
HBase → A NoSQL database for real-time read/write access to large datasets.
Flume → Captures and moves large amounts of log data into Hadoop.
Oozie → A workflow scheduler for managing Hadoop jobs.
Mahout → A Java library that implements machine learning techniques
for clustering, classification & recommendation.
Hadoop follows a master-slave architecture
Master node – NameNode
Slave node – DataNode
2.8 Use case of Hadoop
2.9 HDFS
2.9.1 HDFS Daemons
HDFS breaks large files into smaller blocks
The NameNode uses the rack ID to identify the DataNodes in a rack
A rack is a collection of DataNodes within the cluster
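The block splitting described above can be sketched in a few lines. This is only an illustration: the function name `split_into_blocks` is hypothetical, and the 128 MB block size is an assumption (the actual value is a cluster configuration setting).

```python
BLOCK_SIZE_MB = 128  # assumed block size; the real value is set in the cluster configuration


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size would occupy."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


print(split_into_blocks(300))  # [128, 128, 44]
```

Under these assumptions a 300 MB file occupies two full blocks and one partial block; each block can then be stored on a different DataNode.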
•On startup, the NameNode reads the FsImage and EditLog from disk
•The NameNode applies all the transactions from the EditLog to the in-memory
representation of the FsImage
Secondary NameNode
The NameNode logs every change to the file system metadata in the
edit log
The Secondary NameNode periodically:
• Downloads the current fsimage and edit log from the NameNode.
• Applies all changes from the edit log to the fsimage.
• Creates a new, updated fsimage.
• Sends this new fsimage back to the NameNode.
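The checkpoint steps above can be mimicked with a toy model. This is a sketch under stated assumptions, not how HDFS stores its metadata: the fsimage is modelled as a plain dict mapping a path to a block count, the edit log as a list of `(operation, path, value)` tuples, and the names `apply_edit_log` and `checkpoint` are hypothetical.

```python
def apply_edit_log(fsimage, edit_log):
    """Replay the logged operations on a copy of the namespace image."""
    new_image = dict(fsimage)  # leave the old image untouched
    for op, path, value in edit_log:
        if op == "create":
            new_image[path] = value
        elif op == "delete":
            new_image.pop(path, None)
        elif op == "rename":
            new_image[value] = new_image.pop(path)  # value is the new path
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


def checkpoint(fsimage, edit_log):
    """Merge image and log; return the updated image and an emptied log."""
    return apply_edit_log(fsimage, edit_log), []


fsimage = {"/data/a.txt": 3}
edit_log = [
    ("create", "/data/b.txt", 1),
    ("delete", "/data/a.txt", None),
    ("rename", "/data/b.txt", "/data/c.txt"),
]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image)  # {'/data/c.txt': 1}
```

The point of the merge is that the edit log does not grow without bound, so a NameNode restart has fewer transactions to replay.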
Anatomy of file read
Anatomy of file write
Replica placement strategy
2.10 Data processing with hadoop
JobTracker
Provides connectivity between Hadoop & the application
Decides which tasks to assign to which node; there is a single JobTracker
Monitors the running tasks
If a task fails, it automatically re-schedules the task on a different node
TaskTracker
Executes the tasks assigned by the JobTracker; there is a single TaskTracker per slave node
Sends heartbeats to the JobTracker
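The interplay of heartbeats and re-scheduling can be illustrated with a small simulation. It is a simplified sketch, not the JobTracker's actual scheduling logic: the function names, the node names, the 10-second timeout and the round-robin assignment are all assumptions made for the example.

```python
def live_nodes(last_heartbeat, now, timeout):
    """Nodes whose most recent heartbeat is no older than the timeout."""
    return [node for node, seen in sorted(last_heartbeat.items())
            if now - seen <= timeout]


def assign_tasks(tasks, last_heartbeat, now, timeout=10):
    """Spread tasks round-robin over the nodes that are still alive."""
    alive = live_nodes(last_heartbeat, now, timeout)
    if not alive:
        raise RuntimeError("no live node available")
    return {task: alive[i % len(alive)] for i, task in enumerate(tasks)}


# node-b last reported at t=70, so at t=100 it is treated as failed
heartbeats = {"node-a": 98, "node-b": 70, "node-c": 95}
print(assign_tasks(["map-0", "map-1", "map-2"], heartbeats, now=100))
# {'map-0': 'node-a', 'map-1': 'node-c', 'map-2': 'node-a'}
```

A task that had been running on `node-b` would simply be passed to `assign_tasks` again and land on one of the remaining nodes.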
MapReduce programming workflow
Managing resources and applications with Hadoop
YARN
Limitation of Hadoop 1
HDFS limitation
- NameNode saves all its file metadata in main memory
- NameNode can quickly become overwhelmed as the load on the system increases
Hadoop 2: HDFS
Features
1. Horizontal scalability
2. High availability
HDFS federation uses multiple independent NameNodes for
horizontal scalability.
All DataNodes in the cluster register with each NameNode in the
cluster.
Fundamental Idea
Global Resource Manager
Scheduler
ApplicationsManager
NodeManager
Per-application ApplicationMaster
Basic Concepts
Application – Job submitted to the framework
MapReduce Job
Container
YARN architecture
Interacting with Hadoop ecosystem
Introduction to MapReduce
Jobs are split into a set of map tasks & reduce tasks
Tasks are executed in a distributed fashion on the Hadoop cluster
Map task- loading, parsing, transforming & filtering data
Reduce task- grouping & aggregating data
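The split into map and reduce tasks can be illustrated with the classic word count, simulated here in a single Python process. It is a minimal sketch and does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are chosen for the example only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: parse one input line and emit a (word, 1) pair per word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


print(word_count(["Hadoop stores data", "Hadoop processes data"]))
# {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

On a real cluster each call to `map_phase` would run as a separate map task on the node holding that part of the input, and the groups would be distributed over several reduce tasks.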
Each map task is broken down into
Reducer tasks are broken down into
Mapper
Reducer
Combiner
Partitioner
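The roles of the combiner and the partitioner can be sketched as follows. This is an illustration under assumptions, not Hadoop code: `combine` and `partition` are hypothetical names, and `zlib.crc32` stands in for the key hash so that the result is stable across Python runs (Hadoop's default partitioner is likewise hash-based).

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: pre-aggregate (key, value) pairs on the mapper side
    so that less data is sent over the network to the reducers."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Partitioner: pick the reducer that receives every pair with this key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


mapper_output = [("data", 1), ("hadoop", 1), ("data", 1)]
combined = combine(mapper_output)
print(combined)  # [('data', 2), ('hadoop', 1)]

for key, value in combined:
    print(key, value, "-> reducer", partition(key, num_reducers=3))
```

Because the partition depends only on the key, all values for a given key end up at the same reducer, regardless of which mapper produced them.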
Searching
Sorting
Compression