0% found this document useful (0 votes)
8 views70 pages

Cloud-Enabling Technologies Overview

Uploaded by

gaaurii.03
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
8 views70 pages

Cloud-Enabling Technologies Overview

Uploaded by

gaaurii.03
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd

Cloud Computing

Module 3
Cloud-Enabling Technologies
• It likely refers to the set of technologies that make cloud computing
possible.
• These are tools and methods that allow cloud services to work
smoothly.
• They help in building, running, and managing cloud systems.
1. Broadband Networks and Internet Architecture

2. Data center technology

3. Web technology

4. Multitenant technology

5. Service technology
Broadband Networks and Internet
Architecture
• Broadband networks and robust internet architecture are fundamental to cloud
computing.

• They provide the necessary connectivity and bandwidth to access cloud services from
anywhere in the world.

• High-speed internet connections ensure that data can be transferred quickly and
efficiently between users and cloud servers.

• Importance of Network connectivity in clouds


 Essential for remote provisioning of IT resource

 Supports ubiquitous network access

 Cloud platforms grow with advancements in internet connectivity


Internet Service Providers (ISPs)
• ISP stands for Internet Service Provider.
• It is a company that provides access to the internet and similar services such as Website
designing and virtual hosting.
• For example, when you connect to the Internet, the connection between your Internet-enabled
device and the internet is executed through a specific transmission technology that involves the
transfer of information packets through an Internet Protocol route.
• An ISP network interconnects to other ISP networks and various organizations.
• ISPs can freely deploy, operate, and manage their networks in addition to selecting partner ISPs
for interconnection.
• Worldwide connectivity is enabled through a hierarchical topology composed of Tiers 1, 2, and 3
• The core Tier 1 is made of large-scale, international cloud providers that oversee massive
interconnected global networks, which are connected to Tier 2’s large regional providers.
• The interconnected ISPs of Tier 2 connect with Tier 1 providers, as well as the local ISPs of Tier 3.
• Hence, Cloud consumers and cloud providers can connect directly using a Tier 1 provider, as any
operational ISP can enable Internet connection.
• Technical architecture of internetworking:
1. Connectionless Packet Switching
2. Router-based Interconnectivity

Fig: An abstraction of the internetworking


structure of the internet

Connectionless Packet Switching (Datagram Networks)

• End-to-end (sender-receiver pair) data flows are divided into packets of a limited size that
are received and processed through network switches and routers, then queued and
forwarded from one intermediary node to the next.
• Each packet carries the necessary location information, such as the Internet Protocol (IP)
or Media Access Control (MAC) address, to be processed and routed at every source,
intermediary, and destination node.
Router-Based Interconnectivity
• A router is a device that is connected to multiple networks through which it
forwards packets.
• Even when successive packets are part of the same data flow, routers process and
forward each packet individually while maintaining the network topology
information that locates the next node on the communication path between the
source and destination nodes.
• Routers manage network traffic and gauge the most efficient hop for packet
delivery, since they are privy to both the packet source and packet destination.
• The communication path that connects a cloud consumer with its cloud provider
may involve multiple ISP networks.
• The Internet’s mesh structure connects Internet hosts (endpoint systems) using
multiple alternative network routes that are determined at runtime.
• Communication can therefore be sustained even during simultaneous network
failures, although using multiple network paths can cause routing fluctuations
and latency.
Technical and Business Considerations
Connectivity Issues
• In traditional, on-premise deployment models, enterprise applications and
various IT solutions are commonly hosted on centralized servers and storage
devices residing in the organization’s own data center.
• End-user devices, such as smartphones and laptops, access the data center
through the corporate network, which provides uninterrupted Internet
connectivity.
• TCP/IP facilitates both Internet access and on-premise data exchange over LANs
• Organizations using this deployment model can directly access the network
traffic to and from the Internet and usually have complete control over and can
safeguard their corporate networks using firewalls and monitoring software.
• These organizations also assume the responsibility of deploying, operating, and
maintaining their IT resources and Internet connectivity.
Fig: The internetworking architecture of a private cloud.

Fig: The internetworking architecture of an Internet-based cloud


deployment model.
• End-user devices that are connected to the network through the Internet can be granted
continuous access to centralized servers and applications in the cloud.
• A salient cloud feature that applies to end-user functionality is how centralized IT
resources can be accessed using the same network protocols regardless of whether they
reside inside or outside of a corporate network.
• Whether IT resources are on-premise or Internet-based dictates how internal versus
external end- users access services, even if the end-users themselves are not concerned
with the physical location of cloud-based IT resources.

Table: A
comparison of on-
premise and cloud-
based
internetworking
Network Bandwidth and Latency Issues
• End-to-End bandwidth is determined by the transmission capacity of the shared data
links that connect intermediary nodes.
• This type of bandwidth is constantly increasing, as Web acceleration technologies, such
as dynamic caching, compression, and pre-fetching, continue to improve end-user
connectivity.
• Latency is the amount of time it takes a packet to travel from one data node to another.
• Latency increases with every intermediary node on the data packet’s path.
• Transmission queues in the network infrastructure can result in heavy load conditions
that also increase network latency.
• Packet networks with “best effort” quality-of-service (QoS) typically transmit packets
on a first- come/first serve basis.
• Data flows that use congested network paths suffer service-level degradation in the
form of bandwidth reduction, latency increase, or packet loss when traffic is not
prioritized.
• The nature of packet switching allows data packets to choose routes dynamically as
they travel through the Internet’s network infrastructure.
Cloud Carrier and Cloud Provider Selection

• The service levels of Internet connections between cloud consumers


and cloud providers are determined by their ISPs, which are usually
different and therefore include multiple ISP networks in their paths.
• QoS management across multiple ISPs is difficult to achieve in practice,
requiring collaboration of the cloud carriers on both sides to ensure
that their end-to-end service levels are sufficient for business
requirements.
• Cloud consumers and cloud providers may need to use multiple cloud
carriers in order to achieve the necessary level of connectivity and
reliability for their cloud applications, resulting in additional costs.
• Cloud adoption can therefore be easier for applications with more
relaxed latency and bandwidth requirements.
Data Center Technology
• Grouping IT resources in close proximity with one another, rather than having them
geographically dispersed, allows for power sharing, higher efficiency in shared IT
resource usage, and improved accessibility for IT personnel.
• Modern data centers exist as specialized IT infrastructure used to house centralized IT
resources, such as servers, databases, networking and telecommunication devices, and
software systems.
• Data centers are typically comprised of the following technologies and components:
Virtualization
• Data centers consist of both physical and virtualized IT resources.
• The physical IT resource layer refers to the facility infrastructure that
houses computing/networking systems and equipment, together with
hardware systems and their operating systems.
• The resource abstraction and control of the virtualization layer is
comprised of operational and management tools that are often based
on virtualization platforms that abstract the physical computing and
networking IT resources as virtualized components that are easier to
allocate, operate, release, monitor, and control.
Standardization and Modularity
• Data centers are built upon standardized commodity hardware and
designed with modular architectures, aggregating multiple identical
building blocks of facility infrastructure and equipment to support
scalability, growth, and speedy hardware replacements.
• Modularity and standardization are key requirements for reducing
investment and operational costs as they enable economies of scale for the
procurement, acquisition, deployment, operation, and maintenance
processes.

Automation
• Data centers have specialized platforms that automate tasks like
provisioning, configuration, patching, and monitoring without supervision.
• Advances in data center management platforms and tools leverage
autonomic computing technologies to enable self-configuration and self-
recovery.
Remote Operation and Management
• Most of the operational and administrative tasks of IT resources in data
centers are commanded through the network’s remote consoles and
management systems.
• Technical personnel are not required to visit the dedicated rooms that house
servers, except to perform highly specific tasks, such as equipment handling
and cabling or hardware-level installation and maintenance.

High Availability
• Since any form of data center outage significantly impacts business continuity
for the organizations that use their services, data centers are designed to
operate with increasingly higher levels of redundancy to sustain availability.
• Data centers usually have redundant, uninterruptable power supplies, cabling,
and environmental control subsystems in anticipation of system failure, along
with communication links and clustered hardware for load balancing.
Security-Aware Design, Operation, and
Management
• Requirements for security, such as physical and logical access controls and data
recovery strategies, need to be thorough and comprehensive for data centers,
since they are centralized structures that store and process business data.
Computing Hardware
• Much of the heavy processing in data centers is often executed by
standardized commodity servers that have substantial computing power and
storage capacity.
• Several computing hardware technologies are integrated into these modular
servers, such as:
 rackmount form factor server design composed of standardized racks with
interconnects for power, network, and internal cooling
 support for different hardware processing architectures, such as x86-32bits, x86-64,
and RISC
 a power-efficient multi-core CPU architecture that houses hundreds of processing cores
in a space as small as a single unit of standardized racks
 redundant and hot-swappable components, such as hard disks, power supplies,
network interfaces, and storage controller cards
Storage system
• Data centers have specialized storage systems that maintain enormous amounts of
digital information in order to fulfill considerable storage capacity needs.
• Storage systems usually involve the following technologies:
Hard Disk Arrays – These arrays inherently divide and replicate data among
multiple physical drives, and increase performance and redundancy by including
spare disks.
I/O Caching – This is generally performed through hard disk array controllers,
which enhance disk access times and performance by data caching.
Hot-Swappable Hard Disks – These can be safely removed from arrays without
requiring prior powering down.
Storage Virtualization – This is realized through the use of virtualized hard disks
and storage sharing.
Fast Data Replication Mechanisms – These include snapshotting, which is saving a
virtual machine’s memory into a hypervisor-readable file for future reloading, and
volume cloning, which is copying virtual or physical hard disk volumes and
partitions.
• Networked storage devices usually fall into one of the following categories:
Storage Area Network (SAN) – A high-speed, dedicated network that connects
servers to block-level storage devices.
Network-Attached Storage (NAS) – A file level storage device connected to a
regular IP network, accessible by multiple clients over the network. Appears
to the client as a shared folder rather than a local disk.
NAS, SAN, and other more advanced storage system options provide fault
tolerance in many components through controller redundancy, cooling
redundancy, and hard disk arrays that use RAID storage technology.

Network Hardware
• Data centers require extensive network hardware in order to enable multiple
levels of connectivity.
Carrier and External Networks
Interconnection
• How a data center connects to outside networks using carriers.
• It ensures that data centers can send and receive data from the internet,
other data centers and clients reliably and efficiently.
LAN Fabric
• The network infrastructure within a data center.
• General data traffic inside the data center.
SAN Fabric
• Dedicated storage network connecting servers to storage devices.
Web Technology
• Computers communicate with each others using markup languages and multimedia
packages.
• The World Wide Web is a system of interlinked IT resources that are accessed through
the Internet.
• The two basic components of the Web are the Web browser client and the Web server.
• Other components, such as proxies, caching services, gateways, and load balancers, are
used to improve Web application characteristics such as scalability and security.
• These additional components reside in a layered architecture that is positioned between
the client and the server.
• Three fundamental elements comprise the technology architecture of the
Web:
Uniform Resource Locator (URL) – It is the address of a resource on the
internet. A standard syntax used for creating identifiers that point to Web
based resources, the URL is often structured using a logical network location.
Hypertext Transfer Protocol (HTTP) – Communication protocol used between
the web browser and a web server to exchange information.
Markup Languages (HTML, XML) – Markup languages provide a lightweight
means of expressing Web- centric data and metadata. The two primary
markup languages are HTML (which is used to express the presentation of
Web pages) and XML (which allows for the definition of vocabularies used to
associate meaning to Web-based data via metadata).
• For example, a Web browser can request to execute an action like read,
write, update, or delete on a Web resource on the Internet, and
proceed to identify and locate the Web resource through its URL.
• The request is sent using HTTP to the resource host, which is also
identified by a URL.
• The Web server locates the Web resource and performs the requested
operation, which is followed by a response being sent back to the client.
• The response may be comprised of content that includes HTML and
XML statements.
• Web resources are represented as hypermedia as opposed to hypertext,
meaning media such as graphics, audio, video, plain text, and URLs can
be referenced collectively in a single document.
• Some types of hypermedia resources cannot be rendered without
additional software or Web browser plug-ins.
Web Applications
• A distributed application that uses Web-based technologies (and
generally relies on Web browsers for the presentation of user-
interfaces) is typically considered a Web application.
• These applications can be found in all kinds of cloud-based
environments due to their high accessibility.
• Figure presents a common architectural abstraction for Web
applications that is based on the basic three tier model.
• The first tier is called the presentation layer, which represents the
user-interface.
• The middle tier is the application layer that implements application
logic, while the third tier is the data layer that is comprised of
persistent data stores.
• The presentation layer has components on both the client and server-side. Web
servers receive client requests and retrieve requested resources directly as static
Web content and indirectly as dynamic Web content, which is generated
according to the application logic.
• Web servers interact with application servers in order to execute the requested
application logic, which then typically involves interaction with one or more
underlying databases.
• PaaS ready-made environments enable cloud consumers to develop and deploy
Web applications. Typical PaaS offerings have separate instances of the Web
server, application server, and data storage server environments.
Multitenant Technology

• The multitenant application design was created to enable multiple users


(tenants) to access the same application logic simultaneously.
• Each tenant has its own view of the application that it uses, administers, and
customizes as a dedicated instance of the software while remaining unaware of
other tenants that are using the same application.
• Multitenant applications ensure that tenants do not have access to data and
configuration information that is not their own.
• Tenants can individually customize features of the application, such as:
 User Interface – Tenants can define a specialized “look and feel” for their application
interface.
 Business Process – Tenants can customize the rules, logic, and workflows of the business
processes that are implemented in the application.
 Data Model – Tenants can extend the data schema of the application to include, exclude,
or rename fields in the application data structures.
 Access Control – Tenants can independently control the access rights for users and
groups.
Common characteristics of multitenant applications include:
Usage Isolation – The usage behavior of one tenant does not affect the
application availability and performance of other tenants.
Data Security – Tenants cannot access data that belongs to other tenants.
Recovery – Backup and restore procedures are separately executed for the data
of each tenant.
Application Upgrades – Tenants are not negatively affected by the synchronous
upgrading of shared software artifacts.
Scalability – The application can scale to accommodate increases in usage by
existing tenants and/or increases in the number of tenants.
Metered Usage – Tenants are charged only for the application processing and
features that are actually consumed.
Data Tier Isolation – Tenants can have individual databases, tables, and/or
schemas isolated from other tenants.
 A multitenant application that is being concurrently used by two different tenants is
illustrated in the figure.
Multitenancy vs. Virtualization
• Multitenancy is sometimes mistaken for virtualization because the
concept of multiple tenants is similar to the concept of virtualized
instances.
• The differences lie in what is multiplied within a physical server acting
as a host:
• With virtualization: Multiple virtual copies of the server environment
can be hosted by a single physical server. Each copy can be provided
to different users, can be configured independently, and can contain
its own operating systems and applications.
• With multitenancy: A physical or virtual server hosting an application
is designed to allow usage by multiple different users. Each user feels
as though they have exclusive usage of the application.
Service Technology
• Service technology is software that assists customer service teams in
achieving customer success that is to provide effective solutions to
customers.
• Delivering IT resources as services to users.
• Formed on the basis of as-a-service cloud delivery model.
• Service technologies used are:
Web Services

REST Services

Service agents

Service middleware
Web Services
• Web services include Web Service Description Language (WSDL), XML Schema Definition
Language (XML Schema), Simple Object Access Protocol (SOAP), Universal Description,
Discovery and Integration (UDDI).
• WSDL- describe web services. Acts like a contract between the service provider and the
client.
• XML Schema- describes the structure of an XML document. Messages exchanged by web
services must be expressed using XML
• SOAP- Protocol that allows applications to communicate over the internet using XML
messages.
• UDDI- Acts as a registry to publish and discover web services.

REST Services
• REST- Representational State Transfer
• It is an architectural style for designing web services.
• Use HTTP protocol to allow system to communicate.
• Six REST design constraints are – Client-Server, Stateless, Cache, Interface/Uniform
Service Agents
• Are event driven programs designed to intercept messages at runtime. There
are active and passive service agents.
• Active service agents perform an action upon intercepting and reading the
contents of a message passive on the other hand do not change message
contents

Service Middleware
• A bridge that connects different applications or services so they can work
together.
• It helps in integration- allows different systems to communicate and share
data smoothly.
• 2 common middleware platforms are Enterprise service bus (ESB) and
Orchestration platform
Resource Provisioning Techniques
• Physical resources can be assigned to the VMs using two types of provisioning
approaches like static and dynamic.
• In static approach, VMs are created with specific volume of resources and the
capacity of the VM does not change in its lifetime.
• In dynamic approach, the resource capacity per VM can be adjusted dynamically to
match work-load fluctuations.
Static Approach
• Static provisioning is suitable for applications which have predictable and generally
unchanging workload demands.
• In this approach, once a VM is created it is expected to run for long time without
incurring any further resource allocation decision overhead on the system.
• Resource-allocation decision is taken only once and that too at the beginning when
user’s application starts running.
• Thus, this approach provides room for a little more time to take decision regarding
resource allocation since that does not impact negatively on the performance of the
system.
• This provisioning approach fails to deal with un-anticipated changes in resource
demands.
• When resource demand crosses the limit specified in SLA document it causes trouble
for the consumers.
• Again from provider’s point of view, some resources remain unutilized forever since
provider arranges for sufficient volume of resources to avoid SLA violation.
• So this method has drawback from the viewpoint of both provider as well as for
consumer.
Dynamic Approach
• With dynamic provisioning, the resources are allocated and de-allocated as per
requirement during run-time.
• This on-demand resource provisioning provides elasticity to the system.
• Providers no more need to keep a certain volume of resources unutilized for each
and every system separately, rather they maintain a common resource pool and
allocate resources from that when it is required.
• Resources are removed from VMs when they are no more required and returned
to the pool.
• With this dynamic approach, the processes of billing also become as pay-per-
usage basis.
• Dynamic provisioning technique is more appropriate for cloud computing where
application’s demand for resources is most likely to change or vary during the
execution.
• But this provisioning approach needs the ability of integrating newly-acquired
resources into the existing infrastructure.
• This gives provisioning elasticity to the system.
• Dynamic provisioning allows system to adapt in changed conditions at the cost of
bearing run- time resource allocation decision overhead.
• This overhead leads some amount of delay in system but this can be minimized
by putting upper limit on the complexity of provisioning algorithms.
Comparison between static and
dynamic approaches
Open Cloud Services
• Open-source cloud community generally focusses on private cloud computing arena.

1. Eucalyptus
 Eucalyptus is an open-source Infrastructure-as-a-Service (IaaS) facility for building private or hybrid cloud
computing environment.
 It is a linux- based development that enables cloud features while having installed over distributed
computing resources.
 The name ‘Eucalyptus’ is an acronym for ‘Elastic Utility Computing Architecture for Linking Your Programs To
Useful Systems’.
 Eucalyptus started as a research project at the University of California, United States and the company
‘Eucalyptus Systems’ was formed in the year of 2009 in order to support the commercialization of the
Eucalyptus cloud.
 In the same year, the Ubuntu 9.04 distribution of Linux OS was included Eucalyptus software into it.
 Eucalyptus Systems went into an agreement with Amazon during March 2012, which allowed them to make
it compatible with Amazon Cloud.
 This permits transferring of instances between Eucalyptus private cloud and Amazon public cloud making
them a combination for building hybrid cloud environment.
 Such interoperable pairing allows application developers to maintain the private cloud part (deployed as
Eucalyptus) as a sandbox for executing prominent codes.
 Eucalyptus also offers a storage cloud API emulating as Amazon’s storage service (Amazon S3) API.
2. OpenNebula
• OpenNebula is an open-source Infrastructure-as–a-Service (IaaS) implementation for building public, private and
hybrid clouds.
• Nebula is a Latin word which means as ‘cloud’. OpenNebula started as a research project in the year of 2005 and its
first release was made during March 2008.
• By March 2010, the prime authors of OpenNebula founded C12G Labs with the aim of providing value-added
professional services to OpenNebula and the cloud is currently managed by them.
• OpenNebula is freely available, subject to the requirements of the Apache License version 2. Like Eucalyptus,
OpenNebula is also compatible with Amazon cloud.
• Consequently, the distributions of Ubuntu and Red Hat Enterprise later included OpenNebula integrating into
them.

3. Nimbus
• Nimbus is an open-source IaaS cloud solution compatible with Amazon’s cloud services.
• It was developed at University of Chicago in United States and implemented the Amazon cloud’s APIs.
• The solution was specifically developed to support the scientific community.
• The Nimbus project has been created by an international collaboration of open-source contributors and
institutions.
• Nimbus code is licensed under the terms of the Apache License version 2.
4. OpenStack
• OpenStack is another free and open-source IaaS solution.
• In July 2010, U.S.-based IaaS cloud service provider [Link] and NASA jointly launched
the initiative for an open-source cloud solution called ‘OpenStack’ to produce a ubiquitous IaaS
solution for public and private clouds.
• NASA donated some parts of the Nebula Cloud Platform technology that it developed. Since
then, more than 200 companies (including AT&T, AMD, Dell, Cisco, HP, IBM, Oracle, Red Hat)
have contributed in the project.
• The project was later taken over and promoted by the OpenStack Foundation, a non-profit
organization founded in 2012 for promoting OpenStack solution. All the code of OpenStack is
freely available under the Apache 2.0 license.

5. Apache CloudStack
• Apache CloudStack is another open-source IaaS cloud solution. CloudStack was initially
developed by [Link], a software company based in California (United States) which was
later acquired by Citrix Systems, another USA based software firm, during 2011.
• By next year, Citrix Systems handed it over to the Apache Software Foundation and soon after
this CloudStack made its first stable release.
• In addition to its own APIs, CloudStack also supported AWS (Amazon Web Services) APIs which
facilitated hybrid cloud deployment.
Eucalyptus Architecture
Components of Architecture
• Node Controller is the lifecycle of instances running
on each node. Interacts with the operating system,
hypervisor, and Cluster Controller. It controls the
working of VM instances on the host machine.
• Cluster Controller manages one or more Node
Controller and Cloud Controller simultaneously. It
gathers information and schedules VM execution.
• Storage Controller (Walrus) Allows the creation of
snapshots of volumes. Persistent block storage over
VM instances. Walrus Storage Controller is a simple
file storage system. It stores images and snapshots.
Stores and serves files using S3(Simple Storage
Service) APIs.
• Cloud Controller Front-end for the entire
architecture. It acts as a Complaint Web Services to
client tools on one side and interacts with the rest of
the components on the other side.
Important Features are:-
 Images: A good example is the Eucalyptus Operation Modes Of Eucalyptus
 Managed Mode: Numerous security groups to
Machine Image which is a module software
users as the network is large. Each security group
bundled and uploaded to the Cloud.
is assigned a set or a subset of IP addresses.
 Instances: When we run the picture and utilize it,
Ingress rules are applied through the security
it turns into an instance. groups specified by the user. The network is
 Networking: It can be further subdivided into isolated by VLAN between Cluster Controller and
three modes: Static mode(allocates IP address to Node Controller. Assigns two IP addresses on
instances), System mode (assigns a MAC address each virtual machine.
and imputes the instance’s network interface to  Managed (No VLAN) Node: The root user on the
the physical network via NC), and Managed mode virtual machine can snoop into other virtual
machines running on the same network layer. It
(achieves local network of instances).
 Access Control: It is utilized to give limitations to does not provide VM network isolation.
 System Mode: Simplest of all modes, least
clients. number of features. A MAC address is assigned to
 Elastic Block Storage: It gives block-level storage a virtual machine instance and attached to Node
volumes to connect to an instance. Controller's bridge Ethernet device.
 Auto-scaling and Load Adjusting: It is utilized to  Static Mode: Similar to system mode but has
make or obliterate cases or administrations more control over the assignment of IP address.
dependent on necessities. MAC address/IP address pair is mapped to static
entry within the DHCP server. The next set of
MAC/IP addresses is mapped.
Open-Nebula Architecture
Open-Nebula architecture
• To control a VM’s life cycle, the Open-Nebula core coordinates with the following three areas of
management:
1) Image and storage technologies — to prepare disk images
2) The network fabric — to provide the virtual network environment
3) Hypervisors — to create and control VMs

Components of Open-Nebula
• Based on the existing infrastructure, Open-Nebula provides various services and resources. You can view
the components in Figure 3.
• APIs and interfaces: These are used to manage and monitor Open-Nebula components. To manage
physical and virtual resources, they work as an interface.
• Users and groups: These support authentication, and authorize individual users and groups with the
individual permissions.
• Hosts and VM resources: These are a key aspect of a heterogeneous cloud that is managed and
monitored, e.g., Xen, VMware.
• Storage components: These are the basis for centralized or decentralized template repositories.
• Network components: These can be managed flexibly. Naturally, there is support for VLANs and Open
vSwitch.
Nimbus
• Workspace Service- Allows clients to manage and administer VMs.
• Workspace Resource Manager- Implements VM instance creation on a site and
management.
• Workspace Pilot- Provides virtualization with significant changes to the site
configurations.
• Workspace Control- Implements VM instance management such as start,stop and
pause VM. It also provides image management and setup networks and provide IP
assignment.
Parallel Computing and
Programming Paradigms
 Before taking a toll on Parallel Computing, first, let's take a look at the background of
computations of computer software and why it failed for the modern era.
 Computer software was written conventionally for serial computing. This meant
that to solve a problem, an algorithm divides the problem into smaller instructions.
 These discrete instructions are then executed on the Central Processing Unit of a
computer one by one.
 Only after one instruction is finished, next one starts.
Parallel Computing
• It is the use of multiple processing elements simultaneously for solving any
problem.
• Problems are broken down into instructions and are solved concurrently as
each resource that has been applied to work is working at the same time.
• Advantages of Parallel Computing over Serial Computing are as follows:
It saves time and money as many resources working together will reduce the
time and cut potential costs.
It can be impractical to solve larger problems on Serial Computing.
It can take advantage of non-local resources when the local resources are
finite.
Serial Computing 'wastes' the potential computing power, thus Parallel
Computing makes better work of the hardware.

Fig 1: Serial computing Fig 2: Parallel computing


• The system issues for running a typical typical parallel program in either a
parallel or a distributed manner would include the following
Partitioning
• This is applicable to both computation and data as follows:
• Computation partitioning This splits a given job or a program into smaller
tasks. Partitioning greatly depends on correctly identifying portions of the
job or program that can be performed concurrently.
• Data partitioning This splits the input or intermediate data into smaller
pieces. Similarly, upon identification of parallelism in the input data, it can
also be divided into pieces to be processed on different workers.
Mapping
• This assigns the either smaller parts of a program or the smaller pieces of
data to underlying resources.
• This process aims to appropriately assign such parts or pieces to be run
simultaneously on different workers and is usually handled by resource
allocators in the system.
Synchronization
• Because different workers may perform different tasks, synchronization and
coordination among workers is necessary so that race conditions are prevented and
data dependency among different workers is properly managed.
Communication
• Because data dependency is one of the main reasons for communication among
workers, communication is always triggered when the intermediate data is sent to
workers.
Scheduling
• For a job or program, when the number of computation parts (tasks) or data pieces
is more than the number of available workers, a scheduler selects a sequence of
tasks or data pieces to be assigned to the workers.
• It is worth noting that the resource allocator performs the actual mapping of the
computation or data pieces to workers, while the scheduler only picks the next part
from the queue of unassigned tasks based on a set of rules called the scheduling
policy.
Motivation for Programming Paradigms
1. Simplicity of writing parallel programs
• It should be easy for programmers to write programs that can run tasks at the same
time (parallel) without too much complexity.
2. Improve productivity of programmers
• Programmers should be able to write code faster and with less effort.
• The paradigm should reduce the chances of mistakes and make coding smoother.
3. Decrease programs’ time to market
• Applications can be built and delivered to users more quickly.
• Less development time means businesses can release software earlier.
4. Use resources more efficiently to increase system throughput
• The system should make the best use of available processors, memory, and network.
• This way, more tasks can be completed in less time (higher performance).
5. Support higher levels of abstraction
• Programmers should be able to focus on the big picture (what needs to be done)
instead of the small details (how every step happens).
MapReduce
• It is a data processing tool.
• Came into existence in order to overcome the disadvantages of traditional
computing.
• MapReduce eliminated the idea of centralized systems.
• In MapReduce instead of using a single system, tasks are distributed between
multiple systems and parallel these tasks can be updated, monitored, and processed.
• MapReduce is an algorithm that divides the task into smaller parts and assign them
to many computers and collects the results from them which when integrated
generates data sets.
• MapReduce is by far the most powerful realization of data-intensive cloud
computing programming.
• It is often advocated as an easier-to-use, efficient and reliable replacement for the
traditional data intensive programming model for cloud computing.
MapReduce Programming Model
• MapReduce is a programming model designed for processing and
generating large datasets in a distributed and parallel manner.
• It is a core component of the Hadoop ecosystem and is widely used
for big data processing.
• The model simplifies data processing by dividing tasks into smaller,
manageable chunks that can be executed across multiple nodes in a
cluster.
• The name "MapReduce" refers to its two primary phases: Map and
Reduce.
• MapReduce is a software framework for solving many large-scale computing
problems.
• The MapReduce abstraction is inspired by the Map and Reduce functions,
which are commonly used in functional languages such as Lisp.
• The MapReduce system allows users to easily express their computation as
map and reduce.
The map function, processes a key/value pair to generate a set of
intermediate key/value pairs:
map (key1, value1) -----> list (key2, value2)
The reduce function, merges all intermediate values associated with the
same intermediate key:
reduce (key2, list (value2)) ----> list (value3)
How MapReduce Works
• Map Phase: The input data is divided into smaller chunks and processed
by the mapper function. Each mapper transforms the data into
intermediate key-value pairs. For example, in a word count task, the
mapper might output pairs like <word, 1> for each word in the input.
• Shuffle and Sort: The intermediate key-value pairs are shuffled and
grouped by key. This ensures that all values associated with the same key
are sent to the same reducer.
• Reduce Phase: The reducer function processes the grouped data to
produce the final output. For instance, in the word count example, the
reducer would sum up the values for each word to calculate its total
occurrences.
• Output: The results are written to a distributed storage system, such as
Hadoop Distributed File System (HDFS).
MapReduce Features
• Data-Aware- When the MapReduce-Master node is scheduling the Map tasks for a
newly submitted job, it takes in consideration the data location information retrieved
from the GFS- Master node.
• Simplicity- As the MapReduce runtime is responsible for parallelization and
concurrency control, this allows programmers to easily design parallel and distributed
applications.
• Scalability- Increasing the number of nodes (data nodes) in the system will increase the
performance of the jobs with potentially only minor losses.
• Fault Tolerance and Reliability: The data in the GFS are distributed on clusters with
thousands of nodes. Thus, any nodes with hardware failures can be handled by simply
removing them and installing a new node in their place. Moreover, MapReduce, taking
advantage of the replication in GFS, can achieve high reliability by
rerunning all the tasks (completed or in progress) when a host node is going off-line
rerunning failed tasks on another node, and
launching backup tasks when these tasks are slowing down and causing a
bottleneck to the entire job.
MapReduce Execution Flow
• The MapReduce library in the user program first splits the input files into M pieces of typically
16 to 64 megabytes (MB) per piece.
• It then starts many copies of the program on a cluster. One is the “master” and the rest are
“workers.”
• The master is responsible for scheduling (assigns the map and reduce tasks to the worker) and
monitoring (monitors the task progress and the worker health).
• When map tasks arise, the master assigns the task to an idle worker, taking into account the
data locality.
• A worker reads the content of the corresponding input split and emits a key/value pairs to the
user-defined Map function.
• The intermediate key/value pairs produced by the Map function are first buffered in memory
and then periodically written to a local disk, partitioned into R sets by the partitioning function.
• The master passes the location of these stored pairs to the reduce worker, which reads the
buffered data from the map worker using remote procedure calls (RPC).
• It then sorts the intermediate keys so that all occurrences of the same key are grouped
together.
• For each key, the worker passes the corresponding intermediate value for its entire occurrence
to the Reduce function. Finally, the output is available in R output files (one per reduce task).
The Wordcount Example

• As a simple illustration of the Map and Reduce functions, Figure shows the pseudo-
code and the algorithm and illustrates the process steps using the widely used
“Wordcount” example.
• The Wordcount application counts the number of occurrences of each word in a
large collection of documents.
• The steps of the process are briefly described as follows: The input is read (typically
from a distributed file system) and broken up into key/value pairs (e.g., the Map
function emits a word and its associated count of occurrence, which is just “1”).
• The pairs are partitioned into groups for processing, and they are sorted according
to their key as they arrive for reduction.
• Finally, the key/value pairs are reduced, once for each unique key in the sorted list,
to produce a combined result (e.g., the Reduce function sums all the counts emitted
for a particular word).
Hadoop Library from Apache
• Hadoop is an open-source implementation of MapReduce coded in
Java by Apache.
• This implementation uses the Hadoop Distributed File System (HDFS)
as its underlying layer.
• It has two fundamental layers: the MapReduce engine and HDFS.
• The MapReduce engine is the computation engine running on top of
HDFS as its data storage manager.
• HDFS: It is a distributed file system that organizes files and stores their
data on a distributed computing system.
Hadoop’s Architecture

Hadoop Architecture Mainly consists of 4 components:


 MapReduce
 HDFS (Hadoop Distributed File System)
 YARN (Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common
MapReduce engine
• MapReduce is a data processing model in Hadoop that runs on YARN. It
enables fast, distributed and parallel processing by dividing tasks into
two phases Map and Reduce making it efficient for handling large-scale
data.

JobTracker & TaskTracker


• JobTracker splits up data into
smaller tasks(“Map”) and sends
it to the TaskTracker process in
each node.
• TaskTracker reports back to the
JobTracker node and reports on
job progress, sends data
(“Reduce”) or requests new jobs
Map Task Components:
• RecordReader: reads input data and converts it into key-value pairs, with keys as
location info and values as actual data.
• Mapper: processes each pair and outputs zero or more new key-value pairs.
• Combiner (optional): acts as a mini-reducer to group Mapper output and reduce
data transfer before shuffling.
• Partitioner: assigns key-value pairs to Reducers
Reduce Task Components:
• Shuffle and Sort: transfers intermediate key-value pairs from Mappers to
Reducers and sorts them by key. Shuffling begins as soon as some Mappers finish.
• Reducer: processes grouped key-value pairs, performing tasks like aggregation or
filtering based on logic.
• OutputFormat: writes final results to HDFS using a RecordWriter, typically storing
each record as a key-value line.
Hadoop- HDFS • HDFS has a master/slave architecture.
NameNode
• The NameNode is the master server that manages the
filesystem namespace and controls access to files by
clients.
• It performs operations such as opening, closing, and
renaming files and directories.
• Additionally, the NameNode maps file blocks to
DataNodes, maintaining the metadata and the overall
structure of the file system.
• This metadata is stored in memory for fast access and
persisted on disk for reliability.
• Maintaining the filesystem tree and metadata.
• Managing the mapping of file blocks to DataNodes.
• Ensuring data integrity and coordinating replication of
data blocks.
DataNode
• DataNodes are the worker nodes in HDFS, responsible for HDFS Client
storing and retrieving actual data blocks as instructed by
• The HDFS client is the interface through which
the NameNode.
users and applications interact with the HDFS.
• Each DataNode manages the storage attached to it and • It allows for file creation, deletion, reading, and
periodically reports the list of blocks it stores to the writing operations.
NameNode. • The client communicates with the NameNode to
• Storing data blocks and serving read/write requests from determine which DataNodes hold the blocks of a
clients. file and interacts directly with the DataNodes for
• Performing block creation, deletion, and replication upon actual data read/write operations.
instruction from the NameNode. • Facilitating interaction between the
user/application and HDFS.
• Periodically sending block reports and heartbeats to the
• Communicating with the NameNode for metadata
NameNode to confirm its status.
and with DataNodes for data access.
Secondary NameNode Block Structure
• The Secondary NameNode acts as a helper to the primary • HDFS stores files by dividing them into large
NameNode, primarily responsible for merging the EditLogs blocks, typically 128MB or 256MB in size.
with the current filesystem image (FsImage) to reduce the • Each block is stored independently across multiple
potential load on the NameNode. DataNodes, allowing for parallel processing and
• It creates checkpoints of the namespace to ensure that the fault tolerance.
filesystem metadata is up-to-date and can be recovered in • The NameNode keeps track of the block locations
case of a NameNode failure. and their replicas.
HDFS Fault Tolerance
• Since Hadoop is designed to be deployed on low-cost hardware by default, a hardware failure
in this system is considered to be common rather than an exception.
• Hadoop considers the following issues to fulfill reliability requirements of the file system:
Block replication
• HDFS stores a file as a set of blocks and each block is replicated and distributed across the
whole cluster.
• The replication factor is set by the user and is three by default.
Replica Placement
• The placement of replicas is another factor to fulfill the desired fault tolerance in HDFS.
• Although storing replicas on different nodes (DataNodes) located in different racks across the
whole cluster provides more reliability, it is sometimes ignored as the cost of communication
between two nodes in different racks is relatively high in comparison with that of different
nodes located in the same rack.
• Therefore, sometimes HDFS compromises its reliability to achieve lower communication
costs.
Heartbeat and Blockreport messages Writing to a file
• Heartbeats and Blockreports are periodic messages sent to • To write a file in HDFS, a user sends a “create”
the NameNode by each DataNode in a cluster. request to the NameNode to create a new file in
the file system namespace.
• Receipt of a Heartbeat implies that the DataNode is
• If the file does not exist, the NameNode notifies
functioning properly, while each Blockreport contains a list of
all blocks on a DataNode. the user and allows him to start writing data to
the file by calling the write function.
HDFS Operation • The first block of the file is written to an internal
• The control flow of the main operations of HDFS on files is queue termed the data queue while a data
the interaction between the user, the NameNode, and the streamer monitors its writing into a DataNode.
DataNodes. • Since each file block needs to be replicated by a
Reading a file predefined factor, the data streamer first sends a
request to the NameNode to get a list of suitable
• To read a file in HDFS, a user sends an “open” request to the
DataNodes to store replicas of the first block.
NameNode to get the location of file blocks.
• The procedure will repeat for all the blocks of a
• For each file block, the NameNode returns the address of a file.
set of DataNodes containing replica information for the
requested file.
• Upon receiving such information, the user calls the read
function to connect to the closest DataNode containing the
first block of the file.
• After the first block is streamed, the established connection
is terminated and the same process is repeated for all blocks
of the requested file.
YARN
• YARN (Yet Another Resource Negotiator) is resource management layer of
Hadoop, responsible for scheduling and allocating resources across the cluster. It
has three key components:
• ResourceManager: Allocates resources to various applications in the system.
• NodeManager: Manages resources (CPU, memory, etc.) on individual nodes and
reports to ResourceManager.
• ApplicationMaster: Acts as a bridge between ResourceManager and
NodeManager, handling resource negotiation for each application.
Hadoop Common (Common Utilities)
• Hadoop Common, also known as Common Utilities, includes core Java
libraries and scripts required by all components in a Hadoop
ecosystem such as HDFS, YARN and MapReduce.
• These libraries offer core functionalities such as:
File system and I/O operations
Configuration and logging
Security and authentication
Network communication
• Hadoop Common provides shared libraries and utilities that help all
Hadoop components work together.
• It handles hardware failures automatically and includes tools like
Hadoop Archive, native library support and RPC mechanisms.
MapReduce Program
Apache Spark
Usage of Spark
• Data integration: The data generated by systems are
• Apache Spark is an open-source cluster computing
framework. Its primary purpose is to handle the not consistent enough to combine for analysis. To
real- time generated data. fetch consistent data from systems we can use
processes like Extract, transform, and load (ETL).
Features of Apache Spark Spark is used to reduce the cost and time required for
• Fast - It provides high performance for both batch this ETL process.
and streaming data, using a state-of-the-art DAG • Stream processing: It is always difficult to handle the
scheduler, a query optimizer, and a physical real-time generated data such as log files. Spark is
execution engine. capable enough to operate streams of data and
• Easy to Use - It facilitates to write the application in refuses potentially fraudulent operations.
Java, Scala, Python, R, and SQL. It also provides • Machine learning: Machine learning approaches
more than 80 high-level operators. become more feasible and increasingly accurate due
• Generality - It provides a collection of libraries to enhancement in the volume of data. As spark is
including SQL and DataFrames, MLlib for machine capable of storing data in memory and can run
learning, GraphX, and Spark Streaming. repeated queries quickly, it makes it easy to work on
• Lightweight - It is a light unified analytics engine machine learning algorithms.
which is used for large scale data processing. • Interactive analytics: Spark is able to generate the
respond rapidly. So, instead of running pre-defined
• Runs Everywhere - It can easily run on Hadoop,
queries, we can handle the data interactively.
Apache Mesos, Kubernetes, standalone, or in the
cloud.

You might also like