BDA Module 2: Hadoop Core Components

The document provides an overview of Hadoop's core components and ecosystem, detailing its architecture, including HDFS, MapReduce, and YARN, along with their functionalities. It also describes HDFS features, its master/slave architecture, and the MapReduce framework for processing large datasets. Additionally, it introduces Apache Hive, a data warehousing tool that allows SQL-like querying of data stored in Hadoop.

Uploaded by

laxmishetti1

BDA MODULE 2

1. HADOOP – CORE COMPONENTS, ECO-SYSTEM COMPONENTS.


Hadoop Core Components

1. Hadoop Common
a. Purpose: Provides essential libraries and utilities for other modules in Hadoop.
b. Functions:
i. Contains components required for the operation of Hadoop.
ii. Manages general input/output, serialization, and Java RPC (Remote Procedure Call).
iii. Facilitates file-based data structures.
2. Hadoop Distributed File System (HDFS)
a. Purpose: A Java-based distributed file system for storing large datasets.
b. Functions:
i. Stores data across multiple nodes in a cluster.
ii. Handles structured, unstructured, and semi-structured data.
iii. Provides high fault tolerance and scalability by replicating data blocks.
3. MapReduce v1
a. Purpose: A programming model for processing large data sets in parallel.
b. Functions:
i. Divides tasks into "Map" and "Reduce" steps for parallel execution.
ii. Processes data in batch mode, suitable for large-scale data computations.
4. YARN (Yet Another Resource Negotiator)
a. Purpose: Manages resources for distributed computing.
b. Functions:
i. Allocates resources for application tasks or sub-tasks running on Hadoop.
ii. Schedules tasks in parallel and handles resource requests.
iii. Ensures distributed and efficient task execution.
5. MapReduce v2 (Hadoop 2 YARN)
a. Purpose: MapReduce re-implemented to run as an application on YARN, which takes over cluster resource management.
b. Functions:
i. Enables parallel processing of large datasets.
ii. Improves scalability and resource management compared to MapReduce v1.
iii. Optimized for distributed processing of application tasks.
Hadoop Ecosystem Components
Refers to a combination of technologies and tools that work together with Hadoop.
Supports storage, processing, access, analysis, governance, security, and operations for Big Data.

1. Core Hadoop Components


a. HDFS (Hadoop Distributed File System): For distributed storage of data.
b. MapReduce: Programming model for distributed data processing.
c. YARN (Yet Another Resource Negotiator): Manages resources and schedules tasks for
distributed computing.
2. Data Store System
a. Consists of clusters, racks, DataNodes, and blocks for storing and managing data.
b. Deploys HDFS for efficient distributed file storage.
3. Application Programming Model
a. Includes models like MapReduce for data processing and HBase for data storage.
b. HBase is a column-oriented (wide-column) NoSQL database built on top of HDFS; it suits fast random reads and writes on large tables rather than classic OLAP (Online Analytical Processing) workloads.
4. Support Layer Components
a. AVRO: Data serialization system for efficient communication between Hadoop components.
b. ZooKeeper: Coordination service for managing distributed applications.
5. Application Layer Components
a. Pig: A high-level scripting platform for data transformation and analysis.
b. Hive: A data warehousing tool for querying and managing large datasets using SQL-like queries.
c. Sqoop: Facilitates data transfer between Hadoop and relational databases.
d. Ambari: Provides a web-based interface for managing and monitoring Hadoop clusters.
e. Chukwa: For monitoring and analyzing log data in the Hadoop ecosystem.
f. Mahout: A machine learning library for building scalable algorithms.
g. Spark: A fast, general-purpose cluster computing framework for batch and near-real-time (streaming) data processing.
h. Flink: A stream-processing framework for distributed data processing.
i. Flume: Used for collecting, aggregating, and transporting log data.

2. HDFS – FEATURES, CORE COMPONENTS.


Features
1. Distributed Storage: HDFS stores data across multiple nodes in a distributed manner, ensuring high
scalability and fault tolerance.
2. Replication: Data is replicated across multiple nodes (default replication factor is 3) to ensure availability
and reliability even in case of node failures.
3. Fault Tolerance: HDFS can handle hardware failures by replicating data and recovering lost files from
replicas.
4. Scalability: HDFS is designed to handle large-scale data and can scale horizontally by adding more nodes
to the cluster.
5. Write Once, Read Many: Data in HDFS is written once and read multiple times, which simplifies
concurrency control and improves efficiency.
6. High Throughput: HDFS is optimized for large data transfers, ensuring high throughput for data access
and processing.
7. Block-Based Storage: Files are divided into large blocks (the default size is 128 MB in Hadoop 2 and later, 64 MB in Hadoop 1; larger sizes such as 256 MB can be configured) and distributed across the cluster, enabling efficient storage and processing.
8. Support for Large Files: HDFS is designed to handle files in the range of gigabytes to terabytes
efficiently.
9. Compatibility with Heterogeneous Hardware: HDFS works seamlessly with a variety of hardware and
operating systems, making it adaptable to diverse environments.
10. Built-In Redundancy: Data redundancy through replication ensures data integrity and minimizes the risk
of data loss.
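The block and replication figures above can be turned into simple arithmetic. The following is a minimal sketch, assuming a 128 MB block size and a replication factor of 3; the function name block_layout is hypothetical and is not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128      # assumed block size for this sketch
REPLICATION_FACTOR = 3   # default replication factor mentioned in the notes


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION_FACTOR):
    """Return (number of blocks, size of last block in MB, raw storage in MB)."""
    if file_size_mb == 0:
        return 0, 0, 0
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block only occupies the space it actually needs.
    last_block_mb = file_size_mb - (num_blocks - 1) * block_size_mb
    # Every block is stored `replication` times across the cluster.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, last_block_mb, raw_storage_mb


print(block_layout(1000))  # (8, 104, 3000)
```

A 1000 MB file therefore needs 8 blocks (7 full blocks plus one of 104 MB) and consumes 3000 MB of raw cluster storage once all replicas are written.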
Components

HDFS operates using a Master/Slave architecture consisting of the following key components:
1. NameNode (Master)
• Role:
o Manages the file system namespace and regulates client access to files.
o Stores and manages metadata, such as file structure, file locations, permissions, and mapping of
file blocks to DataNodes.
• Key Responsibilities:
o Handles file system namespace operations: opening, closing, and renaming files and directories.
o Determines the mapping of blocks to DataNodes.
o Monitors the health of DataNodes using heartbeat signals.
o Handles block creation, deletion, and replication.
o Ensures high availability of data by replicating blocks based on the replication factor.
• Data Flow:
o Does not store actual data but provides clients with metadata for interacting with DataNodes.
o Keeps all metadata in memory for fast access.
2. DataNodes (Slaves)
• Role:
o Store the actual data blocks of files.
o Serve read and write requests directly from/to clients.
• Key Responsibilities:
o Report block information to the NameNode through periodic heartbeats and block reports.
o Replicate data blocks as instructed by the NameNode.
o Handle actual data transfer between the client and storage.
• Data Replication:
o To ensure data reliability, the same block is replicated across multiple DataNodes.
o If multiple racks exist, the NameNode tries to replicate blocks across different racks to enhance
fault tolerance.
3. Client
• Role: Acts as the user-facing component that interacts with HDFS.
• Interaction Process:
o Requests file creation or retrieval from the NameNode.
o Receives metadata (e.g., block locations) from the NameNode.
o Interacts directly with DataNodes for data read/write operations.
4. Secondary NameNode (Checkpoint Node)
• Role:
o Assists the primary NameNode in managing metadata.
o Periodically merges the namespace image (fsimage) and the edit log to create a new namespace
image checkpoint.
• Key Characteristics:
o Not a backup or failover node.
o Does not replace the NameNode in case of failure.
o Keeps the edit log small, which shortens NameNode restart time and makes the metadata easier to manage.
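The checkpoint performed by the Secondary NameNode can be pictured as replaying the edit log on top of the last namespace image. The sketch below is a simplified simulation only: the dictionary layout, the operation names ("create", "add_block", "rename", "delete"), and the functions apply_edit and checkpoint are assumptions made for illustration and do not reflect the real fsimage or edit log formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (path -> block list)."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = []
    elif op == "add_block":
        namespace[path].append(edit[2])
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    elif op == "delete":
        namespace.pop(path, None)


def checkpoint(fsimage, edit_log):
    """Merge fsimage and edit log; return (new fsimage, emptied edit log)."""
    # Copy first so the old image stays intact until the new one is complete.
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


fsimage = {"/logs/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/logs/b.txt"),
    ("add_block", "/logs/b.txt", "blk_2"),
    ("rename", "/logs/a.txt", "/logs/old.txt"),
]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image)
print(new_log)
```

After the merge, the new image already contains every logged change, so the edit log can start empty again. This is why checkpointing keeps a NameNode restart short: there are fewer edits to replay.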
Data Flow in HDFS
1. Write Operation:
a. The client requests the NameNode to create a file.
b. The NameNode allocates the required blocks and provides the DataNodes for storage.
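c. The client writes each block to the first designated DataNode, which forwards it along a pipeline to the DataNodes holding the other replicas.
d. The NameNode tracks the replicas and schedules re-replication if a block falls below the replication factor.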
2. Read Operation:
a. The client requests a file from the NameNode.
b. The NameNode returns the DataNodes storing the file blocks.
c. The client reads data directly from the DataNodes.
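The write and read flows above can be sketched with a toy model in which the NameNode holds only metadata and the DataNodes hold only block contents. Everything here is hypothetical illustration code, not the Hadoop client API: the class and function names are invented, placement is plain round-robin instead of rack-aware, and the client copies each block to every replica itself instead of using a DataNode pipeline.

```python
class DataNode:
    """Stores block contents; knows nothing about files."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Holds metadata only: which blocks form a file and where they live."""
    def __init__(self, datanode_names, replication=3):
        self.datanode_names = list(datanode_names)
        self.replication = min(replication, len(self.datanode_names))
        self.files = {}  # path -> list of (block_id, [datanode names])
        self._next_block = 0

    def create_file(self, path):
        self.files[path] = []

    def allocate_block(self, path):
        block_id = f"blk_{self._next_block}"
        n = len(self.datanode_names)
        # Simplified round-robin placement (assumption; real HDFS is rack-aware).
        targets = [self.datanode_names[(self._next_block + i) % n]
                   for i in range(self.replication)]
        self._next_block += 1
        self.files[path].append((block_id, targets))
        return block_id, targets

    def get_block_locations(self, path):
        return self.files[path]


def write_file(namenode, datanodes, path, data, block_size):
    """Client write: ask the NameNode for targets, send data to DataNodes."""
    namenode.create_file(path)
    for offset in range(0, len(data), block_size):
        block_id, targets = namenode.allocate_block(path)
        for name in targets:
            datanodes[name].blocks[block_id] = data[offset:offset + block_size]


def read_file(namenode, datanodes, path):
    """Client read: get locations from the NameNode, fetch from DataNodes."""
    chunks = []
    for block_id, locations in namenode.get_block_locations(path):
        for name in locations:  # use the first replica that still has the block
            if block_id in datanodes[name].blocks:
                chunks.append(datanodes[name].blocks[block_id])
                break
    return b"".join(chunks)


datanodes = {n: DataNode(n) for n in ["dn1", "dn2", "dn3", "dn4"]}
namenode = NameNode(datanodes.keys(), replication=3)
write_file(namenode, datanodes, "/demo.txt", b"hello hdfs world", block_size=6)
print(read_file(namenode, datanodes, "/demo.txt"))
```

Note that file data never passes through the NameNode object: it only hands out block IDs and locations, which mirrors the point that the NameNode stores metadata rather than actual data.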
Fault Tolerance in HDFS
• Heartbeat Monitoring:
o The NameNode monitors DataNodes via heartbeats.
o Lack of a heartbeat indicates a DataNode failure.
• Re-Replication: The NameNode re-replicates blocks stored on the failed DataNode to maintain the
replication factor.
• Rack Awareness: Replication is designed to optimize fault tolerance by storing data across different racks.
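Heartbeat monitoring and re-replication can be sketched as two small steps: detect DataNodes whose last heartbeat is too old, then plan new copies for the blocks they held. This is an illustrative simulation under stated assumptions: the function names, the timestamps, and the timeout value are arbitrary and are not taken from Hadoop's configuration, and rack awareness is ignored.

```python
def find_dead_nodes(last_heartbeat, now, timeout):
    """Return the nodes whose last heartbeat is older than `timeout`."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def plan_re_replication(block_locations, dead_nodes, live_nodes, replication):
    """Return {block_id: [nodes that should receive a new copy]}."""
    plan = {}
    for block_id, holders in block_locations.items():
        alive = [n for n in holders if n not in dead_nodes]
        missing = replication - len(alive)
        if missing > 0 and alive:  # at least one replica must survive to copy from
            candidates = [n for n in live_nodes if n not in alive]
            plan[block_id] = candidates[:missing]
    return plan


# Arbitrary example values (seconds); dn3 has been silent for too long.
last_heartbeat = {"dn1": 100, "dn2": 95, "dn3": 40, "dn4": 99}
dead = find_dead_nodes(last_heartbeat, now=105, timeout=30)

block_locations = {
    "blk_0": ["dn1", "dn2", "dn3"],
    "blk_1": ["dn2", "dn3", "dn4"],
    "blk_2": ["dn1", "dn2", "dn4"],
}
live = sorted(set(last_heartbeat) - dead)
plan = plan_re_replication(block_locations, dead, live, replication=3)
print(dead)  # {'dn3'}
print(plan)  # {'blk_0': ['dn4'], 'blk_1': ['dn1']}
```

Only blk_0 and blk_1 lost a replica, so only they appear in the plan; blk_2 still has three live copies and is left alone.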

3. HDFS MAPREDUCE FRAMEWORK


MapReduce provides a programming framework for processing and analyzing large datasets in parallel across
distributed clusters of computers. The framework handles job scheduling, execution, and data movement
efficiently.

Key Features of MapReduce Framework

1. Automatic Parallelization and Distribution: Automates the division of computation tasks across several
processors, enabling efficient parallel processing.
2. Distributed Processing: Processes data stored on clusters of DataNodes distributed across racks.
3. Large-Scale Data Handling: Supports the processing of vast amounts of data simultaneously.
4. Scalability: Offers scalability by enabling the usage of numerous servers in the cluster.
5. Batch-Oriented Programming Model: Provides a batch-oriented model for processing in Hadoop
version 1.
6. Enhanced Processing in Hadoop 2 (YARN-Based): Additional modes of processing in YARN (Yet
Another Resource Negotiator) enable:
o Query processing
o Graph processing
o Real-time analytics
o Streaming data

These features align with the 3Vs of Big Data (Volume, Variety, Velocity).

Framework Architecture
The MapReduce framework handles two primary functions:
1. Job Distribution: Distributes the application tasks (jobs) to nodes within the cluster for parallel execution.
2. Data Aggregation: Collects and organizes intermediate results from nodes into a cohesive response.

Execution Process
1. JobTracker (Master Node)
a. A daemon (background process) responsible for:
i. Estimating resource requirements for tasks.
ii. Analyzing the states of slave nodes.
iii. Assigning map and reduce tasks to TaskTrackers, preferring nodes that already hold the input data.
iv. Monitoring the progress of tasks.
v. Handling failures by rescheduling tasks.
2. TaskTracker (Slave Nodes)
a. Executes the Map and Reduce tasks assigned by the JobTracker.
b. Reports the status of tasks back to the JobTracker.
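The JobTracker's preference for running a map task where its data already lives (data locality) can be sketched as below. This is a hypothetical, much-simplified scheduler: the function assign_map_tasks, the slot counts, and the node names are assumptions for illustration and do not reproduce the actual JobTracker logic.

```python
def assign_map_tasks(splits, free_slots):
    """Assign each input split to a node, preferring nodes that hold its block.

    splits: {split_id: [nodes holding the block]}
    free_slots: {node: number of free map slots}
    Returns {split_id: chosen node, or None if no slot is free}.
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for split_id, holders in splits.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        anywhere = [n for n, free in slots.items() if free > 0]
        if local:
            chosen = local[0]      # data-local: no block transfer needed
        elif anywhere:
            chosen = anywhere[0]   # remote: the block must be read over the network
        else:
            chosen = None          # task has to wait for a free slot
        if chosen is not None:
            slots[chosen] -= 1
        assignment[split_id] = chosen
    return assignment


splits = {"s0": ["dn1", "dn2"], "s1": ["dn1", "dn3"], "s2": ["dn1"], "s3": ["dn2"]}
free_slots = {"dn1": 1, "dn2": 1, "dn3": 1}
assignment = assign_map_tasks(splits, free_slots)
print(assignment)  # {'s0': 'dn1', 's1': 'dn3', 's2': 'dn2', 's3': None}
```

Here s0 and s1 run data-local, s2 falls back to a remote node because dn1 is already busy, and s3 waits because every slot is taken.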

Job Execution Process


1. Mapper Phase
a. Deploys map tasks on DataNodes where the application data is stored.
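b. Outputs intermediate key-value pairs, which are written to the local disk of the node running the map task (not to HDFS) and may be serialized using formats like AVRO.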
2. Reducer Phase
a. Receives data from the Mapper phase.
b. Performs computations and consolidates results into the final output.
3. Result Collection: The final output is written to HDFS once all reduce tasks have completed.
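The Mapper and Reducer phases are easiest to see on the classic word-count problem. The following is a minimal single-process sketch in plain Python, not Hadoop code: the function names map_phase, shuffle, reduce_phase, and run_job are hypothetical, and the shuffle step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(mapped_pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: consolidate all values for one key into a final result."""
    return key, sum(values)


def run_job(input_splits):
    """Run map, shuffle, and reduce over a list of input splits (lists of lines)."""
    mapped = [pair
              for split in input_splits
              for line in split
              for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


# Each inner list plays the role of one input split processed by one map task.
input_splits = [["big data big"], ["data node", "big cluster"]]
print(run_job(input_splits))  # {'big': 3, 'cluster': 1, 'data': 2, 'node': 1}
```

In a real cluster each split would be mapped on a different node in parallel, and each reducer would receive only the keys routed to it; the logic per key stays the same.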
Types of Processes in MapReduce
• JobTracker
o Manages jobs submitted by clients.
o Assigns tasks to TaskTrackers and monitors progress.
• TaskTracker
o Executes assigned tasks (Map and Reduce phases).
o Periodically reports status back to the JobTracker.
Daemon
• Refers to a background process that runs continuously to provide a Hadoop service.
• Examples:
o JobTracker: Coordinates jobs and resources.
o TaskTracker: Executes tasks on nodes.

4. APACHE HIVE
Apache Hive is a data warehouse infrastructure built on top of Hadoop that facilitates data summarization, ad
hoc queries, and analysis of large datasets using a SQL-like language called HiveQL. It provides an interface for
querying and managing data stored in Hadoop's distributed file system (HDFS) or other storage systems like
HBase.

Key Features of Apache Hive:

1. SQL-Like Interface (HiveQL)


a. Enables users familiar with SQL to work with Hadoop data.
b. Simplifies querying large datasets using declarative language.
2. ETL Support: Offers tools for data extraction, transformation, and loading (ETL).
3. Data Structure and Format Handling: Imposes structure on various data formats to enable querying.
4. HDFS and HBase Integration: Allows access to data stored in HDFS or HBase seamlessly.
5. Query Execution: Queries are executed via MapReduce or Tez (a DAG-based execution engine that runs on YARN and is usually faster than MapReduce).

Using Hive

1. Start Hive
a. Enter the Hive command to begin a session and access the hive> prompt.
b. Command: $ hive
2. Create a Table
a. Use the CREATE TABLE command to define a table structure.
b. Command: hive> CREATE TABLE pokes (foo INT, bar STRING);
3. List Tables
a. Use the SHOW TABLES command to display all existing tables.
b. Command: hive> SHOW TABLES;
4. Drop a Table
a. Use the DROP TABLE command to delete a specific table.
b. Command: hive> DROP TABLE pokes;
5. Commands End with Semicolon: All Hive commands must conclude with a semicolon (;).
Applications of Hive
• Interactive SQL queries on large datasets (petabytes).
• Data analysis and summarization.
• Integration with Hadoop for big data solutions.
