
UNIT-2 Cloud Computing Final PDF

The document covers Hadoop and Python, focusing on Hadoop's architecture, components, and job execution processes, including MapReduce and HDFS. It also discusses cloud application design principles and Python basics, highlighting its features, installation, and programming paradigms. The document serves as a comprehensive guide for understanding big data processing with Hadoop and programming with Python.

Uploaded by

padhu6121985

UNIT - II Hadoop and Python

Hadoop Map Reduce: Apache Hadoop, Hadoop Map Reduce Job Execution, Hadoop Schedulers,
Hadoop Cluster setup.
Cloud Application Design: Reference Architecture for Cloud Applications, Cloud Application Design
Methodologies, Data Storage Approaches.
Python Basics: Introduction, Installing Python, Python data Types & Data Structures, Control flow,
Function, Modules, Packages, File handling, Date/Time Operations, Classes

Apache Hadoop:

Introduction:

➢ Hadoop is an open-source software framework used for storing and processing Big Data in a
distributed manner on large clusters of commodity hardware. Hadoop is developed by the Apache Software
Foundation (ASF) and released under the Apache License.
➢ Hadoop is written in the Java programming language and is a top-level Apache project.
➢ Doug Cutting and Mike J. Cafarella developed Hadoop.
➢ Hadoop was inspired by technologies published by Google: the MapReduce programming
model and the Google File System (GFS).
➢ It is optimized to handle massive quantities of structured, semi-structured or unstructured data
using commodity hardware, that is, relatively inexpensive computers.
➢ It is designed to scale from a single server to thousands of machines, each offering local
computation and storage. It supports the processing of large data sets in a distributed computing
environment.
Hadoop ecosystem:

➢ Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework
which solves big data problems. You can consider it as a suite which encompasses a number of services
(ingesting, storing, analyzing and maintaining) inside it. Let us discuss and get a brief idea about how
the services work individually and in collaboration.
➢ The Hadoop ecosystem provides the furnishings that turn the framework into a comfortable home for
big data activity that reflects your specific needs and tastes.
➢ The Hadoop ecosystem includes both official Apache open source projects and a wide range of
commercial tools and solutions.

The following are the components of Hadoop ecosystem:


1. HDFS: Hadoop Distributed File System. It stores data files as close to their original form as
possible.
2. HBase: Hadoop’s distributed, column-oriented database. It supports structured data storage for large
tables.
3. Hive: Hadoop’s data warehouse. It enables analysis of large data sets using a language very similar
to SQL, so one can query data stored in a Hadoop cluster by using Hive.
4. Pig: An easy-to-understand data flow language. It helps with the analysis of large data sets, which
is a common task with Hadoop, without writing code directly in the MapReduce paradigm.
5. ZooKeeper: An open-source service that coordinates distributed systems by managing their
configuration and synchronization.
6. Oozie: A workflow scheduler system to manage Apache Hadoop jobs.
7. Mahout: A scalable machine learning and data mining library.
8. Chukwa: A data collection system for managing large distributed systems.
9. Sqoop: A tool used to transfer bulk data between Hadoop and structured data stores such as relational
databases.
10. Ambari: A web-based tool for provisioning, managing and monitoring Apache Hadoop clusters.

HDFS:
➢ The Hadoop Distributed File System is a Java-based distributed file system that allows you to store
large data across multiple nodes in a Hadoop cluster. So, if you install Hadoop, you get HDFS
as the underlying storage system for storing data in the distributed environment.
➢ HDFS was developed using a distributed file system design and runs on
commodity hardware. Unlike many other distributed file systems, HDFS is designed to be highly
fault tolerant while running on low-cost hardware.
➢ HDFS runs on top of the existing file systems on each node in a Hadoop cluster. It is a block-structured
file system where each file is divided into blocks of a pre-determined size. These blocks are stored across
a cluster of one or several machines.
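As a rough illustration of the block structure described above, the following Python sketch computes how a file of a given size would be divided into fixed-size blocks. The 128 MB block size and the function name split_into_blocks are assumptions made for this example only; the real block size is a cluster configuration setting.

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    """Return the sizes (in bytes) of the blocks a file would occupy.

    block_size is an assumed value used for illustration only.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    blocks = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        blocks.append(remainder)
    return blocks


MB = 1024 * 1024
# Number of blocks needed for a 300 MB file under the assumed block size.
print(len(split_into_blocks(300 * MB)))
```

Only the last block can be smaller than the configured size; every other block is full.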
Hadoop Distributed File System (HDFS)
HDFS is Hadoop’s primary storage system. It is designed to reliably store the vast amounts of data across
a cluster of machines.

Architecture and components:

The HDFS architecture revolves around a master-slave model. At the top sits the NameNode,
which manages metadata: essentially, the file system’s directory tree and information about
each file’s location. It doesn’t store the actual data.

The DataNodes are the workhorses. They manage storage attached to the nodes and serve
read/write requests from clients. Each DataNode regularly reports back to the NameNode with a
heartbeat and block reports to ensure consistent state tracking.

Finally, HDFS includes a Secondary NameNode, not to be confused with a failover node.
Instead, it periodically checkpoints the NameNode’s metadata to reduce startup time and
memory overhead.
NameNode:
The NameNode is the master server in a Hadoop Distributed File System (HDFS) cluster. It
acts as the central coordinator, managing the filesystem namespace and regulating client
access to data stored across the various DataNodes (slave nodes).
DataNode:
The DataNode is a program that runs on each slave machine. It stores data as instructed by the NameNode
and serves read/write requests from clients.

Secondary NameNode in HDFS:


The Secondary NameNode in Hadoop is a helper to the NameNode. It is not a backup NameNode server
that can quickly take over in case of NameNode failure.
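One way to picture the checkpointing performed by the Secondary NameNode is as merging a log of recent changes into a saved snapshot of the namespace. The Python sketch below models the snapshot as a dictionary and the log as a list of operations. The data layout and the names apply_edits and checkpoint are hypothetical and do not mirror Hadoop's actual on-disk formats.

```python
def apply_edits(image, edits):
    """Return a new namespace image with the logged edits applied."""
    new_image = dict(image)
    for op, path, blocks in edits:
        if op == "create":
            new_image[path] = blocks
        elif op == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


def checkpoint(image, edits):
    """Merge the edit log into the image and start a fresh, empty log."""
    return apply_edits(image, edits), []


# Hypothetical file paths and block names, for illustration only.
image = {"/data/a.txt": ["blk_1"]}
edits = [
    ("create", "/data/b.txt", ["blk_2", "blk_3"]),
    ("delete", "/data/a.txt", None),
]
image, edits = checkpoint(image, edits)
print(image, edits)
```

After a checkpoint the log is empty, so a restart has fewer logged changes to replay, which is why checkpointing reduces startup time.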

JobTracker:
The JobTracker is the master daemon responsible for the overall execution of a MapReduce job. It provides
connectivity between Hadoop and the application.

TaskTracker:
This daemon is responsible for executing the individual tasks that are assigned by the JobTracker.
The TaskTracker continuously sends heartbeat messages to the JobTracker. When the JobTracker fails to receive a
heartbeat message from a TaskTracker, it assumes that the TaskTracker has failed and
resubmits the task to another available node in the cluster.
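The heartbeat-based failure handling described above can be sketched in Python as follows. This is a toy model, not Hadoop's implementation: the timeout value, the round-robin reassignment policy, and the names find_failed_trackers and reassign_tasks are all assumptions chosen for illustration.

```python
def find_failed_trackers(last_heartbeat, now, timeout):
    """Return trackers whose last heartbeat is older than `timeout` seconds."""
    return sorted(t for t, seen in last_heartbeat.items() if now - seen > timeout)


def reassign_tasks(assignments, failed, healthy):
    """Move tasks from failed trackers to healthy ones in round-robin order."""
    if not healthy:
        raise RuntimeError("no healthy TaskTracker available")
    # Keep the assignments of trackers that are still alive.
    new_assignments = {
        t: list(tasks) for t, tasks in assignments.items() if t not in failed
    }
    for t in healthy:
        new_assignments.setdefault(t, [])
    # Collect the tasks that were running on failed trackers.
    orphaned = [task for t in failed for task in assignments.get(t, [])]
    for i, task in enumerate(orphaned):
        new_assignments[healthy[i % len(healthy)]].append(task)
    return new_assignments


# Hypothetical tracker names, timestamps (seconds) and task ids.
last_heartbeat = {"tt1": 100.0, "tt2": 40.0, "tt3": 98.0}
assignments = {"tt1": ["map-0"], "tt2": ["map-1", "map-2"], "tt3": []}

failed = find_failed_trackers(last_heartbeat, now=105.0, timeout=30.0)
healthy = [t for t in sorted(last_heartbeat) if t not in failed]
print(reassign_tasks(assignments, failed, healthy))
```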
➢ Hadoop MapReduce is the data processing layer. It processes huge amounts of structured and
unstructured data stored in HDFS.
➢ It processes data in parallel by dividing the job into a set of independent tasks. So, parallel
processing improves speed and reliability.

Hadoop MapReduce data processing takes place in two phases: the Map phase and the Reduce phase.

Map phase: It is the first phase of data processing. In this phase, we specify all the complex
logic/business rules/costly code.
Reduce phase: It is the second phase of processing. In this phase, we specify light-weight processing like
aggregation/summation.
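The two phases can be illustrated with the classic word-count problem. The sketch below is plain Python running in a single process, not Hadoop code; the function names map_phase, reduce_phase and word_count are chosen for this example. It only shows the shape of the computation: the map step emits key-value pairs, and the reduce step aggregates the values collected for each key.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_phase(word, counts):
    """Sum all counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Group the mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


print(word_count(["big data", "Big cluster"]))
```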
Steps of MapReduce Job Execution flow:
MapReduce processes the data in various phases with the help of different components. Let’s discuss the
steps of job execution in Hadoop.

1. Input Files
The data for a MapReduce job is stored in input files, which reside in HDFS. The input file format is
arbitrary: line-based log files and binary formats can both be used.

2. InputFormat
InputFormat defines how to split and read these input files. It selects the files or other objects
used for input and creates the InputSplits.

3. InputSplits
An InputSplit represents the data which will be processed by an individual Mapper. For each split, one map task is
created. Thus the number of map tasks is equal to the number of InputSplits. The framework divides each split into
records, which the mapper processes.

4. RecordReader
It communicates with the InputSplit and converts the data into key-value pairs suitable for reading
by the Mapper. With the default TextInputFormat, the RecordReader converts each line of input into a key-value pair.

5. Mapper
It processes each input record produced by the RecordReader and generates intermediate key-value pairs. The
intermediate output can be completely different from the input pair. The output of the mapper is the full
collection of key-value pairs.

6. Combiner
The Combiner is a mini-reducer which performs local aggregation on the mapper’s output. It minimizes the
data transfer between mapper and reducer. When the combiner completes, the framework
passes its output to the partitioner for further processing.

7. Partitioner
The Partitioner comes into play when we are working with more than one reducer. It takes the output of
the combiner and decides which reducer each key-value pair is sent to.

8. Shuffling and Sorting
After partitioning, the output is shuffled to the reducer nodes. Shuffling is the physical movement of the
data, which is done over the network. Once all the mappers have finished and their output has been shuffled
to the reducer nodes, the framework merges and sorts this intermediate output. It is then provided as input
to the reduce phase.

9. Reducer
The Reducer takes the set of intermediate key-value pairs produced by the mappers as its input and
runs a reducer function on each key and its values to generate the output.
The output of the reducer is the final output, which the framework stores on HDFS.

10. RecordWriter
It writes the output key-value pairs from the Reducer phase to the output files.

11. OutputFormat
OutputFormat defines how the RecordWriter writes these output key-value pairs to the output files. The
OutputFormat instances provided by Hadoop write files to HDFS, so the final
output of the reducer is written to HDFS.
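The job execution flow above can be mimicked end to end in a few lines of Python. This is a single-process simulation under simplifying assumptions, not Hadoop's API: splits are lists of lines, the partitioner uses a CRC32 hash (chosen here only because it is deterministic), and all function names are hypothetical.

```python
import zlib
from collections import defaultdict


def record_reader(split):
    """Turn a split (list of lines) into (offset, line) key-value pairs."""
    offset = 0
    for line in split:
        yield offset, line
        offset += len(line) + 1  # +1 for the newline separating lines


def mapper(_offset, line):
    """Emit (word, 1) for every word of one record."""
    for word in line.split():
        yield word, 1


def combiner(pairs):
    """Local aggregation of one map task's output (a mini-reducer)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partitioner(key, num_reducers):
    """Pick the reducer for a key; CRC32 is an assumption of this sketch."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    return key, sum(values)


def run_job(splits, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:  # one map task per split
        mapped = [
            pair
            for offset, line in record_reader(split)
            for pair in mapper(offset, line)
        ]
        for key, value in combiner(mapped):
            # Shuffle: route each pair to the reducer chosen by the partitioner.
            partitions[partitioner(key, num_reducers)][key].append(value)
    output = []
    for partition in partitions:
        # Sort: each reducer processes its keys in sorted order.
        for key in sorted(partition):
            output.append(reducer(key, partition[key]))
    return output


print(run_job([["a b a"], ["b c"]], num_reducers=2))
```

With a single reducer the output is globally sorted by key; with several reducers each reducer's portion is sorted on its own.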
Hadoop Schedulers:
Hadoop schedulers manage cluster resources for big data jobs. The key types are the
FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler. The latter two
aim to improve efficiency over basic queuing and address the needs of shared clusters by
balancing resource utilization, fairness, and performance, often using techniques like delay
scheduling to improve data locality in dynamic cloud environments.

1. FIFO Scheduler
First In First Out (FIFO) is the original default scheduling policy in Hadoop. The FIFO
Scheduler gives preference to applications that arrive first over those
that arrive later. It places the applications in a queue and executes them in the
order of their submission (first in, first out).
Here, irrespective of size and priority, the requests of the first application
in the queue are allocated first. Only once the first application's requests are satisfied
is the next application in the queue served.

Advantages:
• It is simple to understand and doesn’t need any configuration.
• Jobs are executed in the order of their submission.
Disadvantages:
• It is not suitable for shared clusters. If a large application arrives before
a shorter one, the large application will use all the resources in the
cluster, and the shorter application has to wait for its turn. This leads to
starvation.
• It does not take into account the balance of resource allocation between
long applications and short applications.
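The starvation problem of FIFO scheduling is easy to see in a small simulation. The sketch below assumes that every job is submitted at time 0, that each job occupies the whole cluster while it runs, and that durations are given in arbitrary time units; the function name fifo_schedule is hypothetical.

```python
def fifo_schedule(jobs):
    """Run jobs one at a time in submission order.

    jobs: list of (name, duration) tuples, all assumed submitted at time 0.
    Returns {name: (start_time, finish_time)}.
    """
    clock = 0
    timeline = {}
    for name, duration in jobs:
        timeline[name] = (clock, clock + duration)
        clock += duration
    return timeline


# A short job submitted just after a long one must wait for it to finish.
print(fifo_schedule([("large", 100), ("small", 5)]))
```

In this example the short job needs only 5 time units of work but cannot start until time 100, because the large job was submitted first.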
2. Capacity Scheduler
The CapacityScheduler allows multiple tenants to securely share a large
Hadoop cluster. It is designed to run Hadoop applications in a shared, multi-
tenant cluster while maximizing the throughput and the utilization of the
cluster.
It supports hierarchical queues to reflect the structure of the organizations or
groups that utilize the cluster resources. A queue hierarchy contains three
types of queues: root, parent, and leaf.

Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop
cluster.
• It provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations
utilizing the cluster.
Disadvantage:
• It is the most complex of the schedulers.
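To make the root/parent/leaf hierarchy concrete, the Python sketch below computes what share of the whole cluster each leaf queue is guaranteed when every queue's capacity is expressed as a percentage of its parent. The queue names, the dictionary layout and the function absolute_capacities are hypothetical; this illustrates the idea and is not the scheduler's actual configuration format.

```python
def absolute_capacities(queue, parent_share=100.0, path="root"):
    """Return {leaf_path: percent of the whole cluster} for a queue tree.

    queue: {"capacity": percent of parent, "children": {name: queue}}
    The root queue is treated as 100% of the cluster.
    """
    children = queue.get("children", {})
    if not children:
        return {path: parent_share}  # leaf queue
    total = sum(child["capacity"] for child in children.values())
    if abs(total - 100.0) > 1e-9:
        raise ValueError(f"children of {path} must sum to 100, got {total}")
    result = {}
    for name, child in children.items():
        share = parent_share * child["capacity"] / 100.0
        result.update(absolute_capacities(child, share, f"{path}.{name}"))
    return result


# Hypothetical organization: two departments, one with two teams.
tree = {
    "capacity": 100,
    "children": {
        "engineering": {
            "capacity": 60,
            "children": {"dev": {"capacity": 50}, "test": {"capacity": 50}},
        },
        "marketing": {"capacity": 40},
    },
}
print(absolute_capacities(tree))
```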

3. Fair Scheduler
FairScheduler allows YARN applications to fairly share resources in large
Hadoop clusters. With FairScheduler, there is no need to reserve a set
amount of capacity because it dynamically balances resources between all
running applications.

It assigns resources to applications in such a way that all applications get, on
average, an equal amount of resources over time.

By default, the FairScheduler bases its scheduling fairness decisions only on
memory. We can configure it to schedule with both memory and CPU.
Advantages:
• It provides a reasonable way to share the Hadoop cluster between a
number of users.
• Also, the FairScheduler can work with app priorities, where the priorities
are used as weights in determining the fraction of the total resources that
each application should get.
Disadvantage:
• It requires configuration.
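The use of priorities as weights can be sketched as a simple proportional split. The example below assumes a single resource (for instance, memory in GB) and that every application can use its full share; the function name fair_shares and the application names are hypothetical.

```python
def fair_shares(total_resources, weights):
    """Split total_resources among applications in proportion to their weights."""
    if not weights:
        return {}
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return {
        app: total_resources * weight / total_weight
        for app, weight in weights.items()
    }


# Equal weights give equal shares; a weight of 2 gives twice the share.
print(fair_shares(120, {"app_a": 1, "app_b": 1}))
print(fair_shares(120, {"app_a": 1, "app_b": 1, "app_c": 2}))
```

When a new application arrives, recomputing the shares shrinks every existing share, which is how resources are rebalanced dynamically without reserving capacity in advance.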

Hadoop Cluster setup:


Hadoop Cluster Setup process:
Setting up a Hadoop cluster involves configuring multiple machines to work together as a unified system
for processing and storing large datasets. Here's a basic outline of the process:
1. Hardware Setup:
Choose suitable hardware for the cluster, including servers with sufficient RAM, CPU, and storage
capacity. Ensure that all machines have a reliable network connection.
2. Operating System Installation:
Install a compatible operating system (e.g., Linux) on each machine in the cluster.
3. Java Installation:
Install Java Development Kit (JDK) on all machines, as Hadoop is built using Java.
4. Hadoop Installation:
Download the Hadoop distribution and extract it on all machines.
Configure the Hadoop environment variables, such as HADOOP_HOME and JAVA_HOME.
5. Configuration Files:
Modify the [Link], [Link], and [Link] configuration files to specify the cluster
settings, such as the Namenode and Datanode details.

6. SSH Setup:
Set up passwordless SSH between the master and slave nodes to enable secure communication.
7. Hadoop Daemons:
Start the Hadoop daemons, including the NameNode, DataNode, ResourceManager, and NodeManager,
on their respective machines.

8. Testing:
Verify the cluster setup by running sample MapReduce jobs and checking the Hadoop web interface for
cluster status.
9. Maintenance:
Regularly monitor the cluster for performance, resource utilization, and any potential issues.
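Before starting the daemons, it can help to confirm that the environment variables from step 4 are set on each machine. The Python helper below is a hypothetical pre-flight check written for this example, not part of Hadoop; it only inspects the environment and does not verify that the paths are valid installations.

```python
import os


def missing_hadoop_env(environ=None, required=("JAVA_HOME", "HADOOP_HOME")):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


missing = missing_hadoop_env()
if missing:
    print("Please set:", ", ".join(missing))
else:
    print("JAVA_HOME and HADOOP_HOME are set.")
```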
Cloud Application Design: Reference Architecture for Cloud Applications, Cloud Application Design
Methodologies, Data Storage Approaches.
1. Scalability
• Scalability is an important factor that drives application designers to move to cloud computing
environments. Building applications that can serve millions of users without taking a hit on their
performance has always been challenging. With the growth of cloud computing, application designers can
provision adequate resources to meet their workload levels.


2. Reliability & Availability


Reliability of a system is defined as the probability that a system will perform the intended functions
under stated conditions for a specified amount of time. Availability is the probability that a system will
perform a specified function under given conditions at a prescribed time.
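These two definitions are often turned into numbers using simple models. The Python sketch below uses two common textbook formulas that are not taken from these notes and should be read as assumptions: reliability under a constant failure rate, R(t) = exp(-t / MTBF), and steady-state availability, MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair.

```python
import math


def reliability(t_hours, mtbf_hours):
    """Probability of running t_hours without failure (constant failure rate)."""
    return math.exp(-t_hours / mtbf_hours)


def availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: a failure every 999 hours on average, 1 hour to repair.
print(round(availability(999, 1), 4))
print(round(reliability(24, 999), 4))
```

The sketch shows why the two measures differ: reliability depends on how long the system must run without interruption, while availability also depends on how quickly it is repaired.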

3. Security
Security is an important design consideration for cloud applications given the outsourced nature of cloud
computing environments.

4. Maintenance & Upgradation:


To achieve a rapid time-to-market, businesses typically launch their applications with a core set of
features ready and then incrementally add new features as and when they are complete.
5. Performance
Applications should be designed while keeping the performance requirements in mind.
Reference Architectures – e-Commerce, Business-to-Business, Banking and
Financial apps
Python Basics: Introduction, Installing Python, Python data Types & Data Structures, Control flow,
Function, Modules, Packages, File handling, Date/Time Operations, Classes
Python:
Python is a general-purpose programming language that is interpreted and high-level. It focuses on
readability and simple syntax. It has English-like syntax, and reading Python code is similar to reading
an English sentence.
Python's standard library and third-party packages cover most common programming tasks.
History of Python
In December 1989, Guido van Rossum was searching for a hobby project to keep him occupied around
Christmas week. Since he had already been planning to write a new scripting language descended from
ABC that would also appeal to Unix/C hackers, he ended up writing an interpreter for it. Being a big fan
of the British comedy troupe Monty Python, he chose to call the project ‘Python’ in an irreverent mood.

Key Characteristics

• Easy to Learn: Syntax is similar to English, reducing lines of code.

• Interpreted: Executes code line-by-line, allowing for rapid prototyping.

• High-Level: Manages complex tasks like memory automatically.

• Dynamically Typed: Type checking happens at runtime, not compile time.

• Cross-Platform: Runs on Windows, Mac, Linux, etc.

• Multi-Paradigm: Supports different programming styles (procedural, OOP, functional).
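Two of these characteristics, dynamic typing and multi-paradigm support, can be shown in a short example. The code below sums a list of numbers in three styles; the names total_procedural, Accumulator and total_functional are made up for this illustration.

```python
from functools import reduce

# Dynamically typed: the same name can refer to values of different types.
value = 42
value = "forty-two"


# Procedural style: explicit loop and a running total.
def total_procedural(numbers):
    total = 0
    for n in numbers:
        total += n
    return total


# Object-oriented style: state and behavior bundled in a class.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, n):
        self.total += n
        return self


# Functional style: fold the list with a function, no mutable state.
def total_functional(numbers):
    return reduce(lambda acc, n: acc + n, numbers, 0)


numbers = [1, 2, 3]
acc = Accumulator()
for n in numbers:
    acc.add(n)
print(total_procedural(numbers), acc.total, total_functional(numbers))
```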

Installation of Python:
