Module 1
Grid computing pools geographically dispersed, often heterogeneous computer resources (such as
processing power and storage) over a network so that they function as a single, powerful virtual
supercomputer. It solves complex problems or analyzes massive datasets that a single machine cannot
handle by breaking tasks into smaller pieces and distributing them across the network. It allows
dynamic scalability and efficient resource use, which makes it valuable for scientific research,
finance, and large-scale simulations. It differs from cluster computing in that it connects diverse,
loosely coupled systems rather than tightly integrated ones.
How it Works
• Resource Pooling: Unused processing power and storage from many computers are gathered
into a virtual pool.
• Task Distribution: Large problems are broken down into smaller subtasks, and specialized
software assigns these subtasks to different nodes (computers) in the grid.
• Simultaneous Processing: The nodes work on their assigned subtasks in parallel.
• Aggregation: The results from each node are gathered and combined to form the final
solution.
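The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real grid: worker threads on one machine stand in for grid nodes, and the function names (split_into_subtasks, run_subtask, run_on_grid) and the sum-of-squares workload are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subtasks(numbers, chunk_size):
    """Task distribution: break one large problem into smaller subtasks."""
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]


def run_subtask(chunk):
    """Work done by one node: here, a sum of squares over its chunk."""
    return sum(x * x for x in chunk)


def run_on_grid(numbers, node_count=4, chunk_size=250):
    """Distribute subtasks to the pooled workers and aggregate the results."""
    subtasks = split_into_subtasks(numbers, chunk_size)
    # Resource pooling: each worker thread stands in for one grid node
    # (an assumption of this sketch; real nodes are separate machines).
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partial_results = list(pool.map(run_subtask, subtasks))
    # Aggregation: combine the partial results into the final solution.
    return sum(partial_results)


if __name__ == "__main__":
    print(run_on_grid(list(range(1000))))
```

In a real grid, the middleware would also handle node discovery, security, and failed subtasks, which this sketch leaves out.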
Key Characteristics
• Distributed: Resources are spread across different locations.
• Heterogeneous: Nodes can be different types of computers and operating systems.
• Virtual Supercomputer: Acts as one powerful system.
• Scalable: Easily add more machines as needed.
Use Cases
• Scientific Research: Protein folding, climate modeling, astronomical data analysis.
• Finance: Risk analysis, financial modeling.
• Data Analysis: Analyzing huge datasets.
Grid vs. Cluster Computing
• Grid: Geographically dispersed, heterogeneous nodes, often loosely connected, solving
complex problems.
• Cluster: Tightly coupled, homogeneous machines in one location, often for parallel
processing on a single application.
Grid computing is a distributed architecture that connects geographically dispersed, heterogeneous
computer resources to function as a single virtual supercomputer. Unlike traditional clusters, which
are usually centralized and identical, a grid pools unused processing power, memory, and storage
from diverse machines across multiple administrative domains to solve complex, large-scale
problems.
Core Components & Working
A grid operates through three primary types of nodes coordinated by specialized software
called middleware:
• Control Node: The administrative hub that manages resource allocation, security, and job
scheduling.
• Provider Node: The computers that contribute their idle resources (CPU, storage) to the
network.
• User Node: The machine requesting resources to perform a task.
• Middleware: Acts as an intermediary layer, breaking main tasks into subtasks, assigning them
to available nodes, and aggregating the final results.
Key Types of Grids
• Computational Grid: Aggregates high-performance processors for resource-intensive
mathematical calculations and simulations.
• Data Grid: Focuses on storing and managing massive datasets distributed across multiple
locations, making them appear as a single local system.
• Scavenging Grid: Also known as "cycle scavenging," it identifies and utilizes the idle
processing power of regular desktop computers.
• Collaborative Grid: Facilitates real-time work between dispersed individuals or institutions
by enabling shared access to data and resources.
Major Applications
Grid computing is primarily used for "Grand Challenge" problems in science and industry:
• Scientific Research: Powering the Large Hadron Collider (LHC) at CERN to analyze petabytes
of particle collision data.
• Healthcare: Projects like Folding@home and cancer research initiatives that use volunteer
computer power for data analysis.
• Financial Services: Used for complex risk analysis, portfolio optimization, and real-time
market forecasting.
• Engineering: Conducting high-fidelity simulations for aerospace, automotive design, and
weather modeling.
Grid vs. Cloud Computing
While both are distributed systems, they differ in their primary goals:
• Management: Grids are typically collaboratively managed by multiple organizations
(federated), whereas Cloud Computing is centrally managed by a single vendor.
• Scalability: Clouds offer near-instant, on-demand scaling via commercial agreements; grids
scale based on the available internal or partnered hardware.
• Focus: Grids are optimized for high-performance computing (HPC) and batch processing,
while clouds are designed for a broader range of general-purpose services like hosting and
web apps.
1. Basic Definition
Aspect | Cluster Computing | Grid Computing | Cloud Computing
Definition | Group of tightly coupled computers working as one | Loosely coupled resources from multiple locations | On-demand computing services over the internet
Location | Same location | Different geographic locations | Remote data centers
Ownership | Single organization | Multiple organizations | Cloud service provider
2. Architecture Comparison
Feature | Cluster | Grid | Cloud
Coupling | Tightly coupled | Loosely coupled | Loosely coupled
Network | High-speed LAN | Internet / WAN | Internet
Resource Control | Centralized | Distributed | Provider-managed
Virtualization | Usually not used | Rarely used | Heavily used
3. Resource Management
Feature | Cluster | Grid | Cloud
Resource Sharing | Within one organization | Across multiple organizations | Provided as a service
Scheduling | Central scheduler | Grid middleware | Automated by cloud platform
Fault Tolerance | Limited | Moderate | High
4. Scalability & Cost
Feature | Cluster | Grid | Cloud
Scalability | Limited | Moderate | Very high
Cost Model | High upfront cost | Shared cost | Pay-as-you-go
Maintenance | Organization-managed | Shared management | Provider-managed
5. Typical Applications
Technology | Applications
Cluster Computing | Scientific simulations, supercomputers, HPC
Grid Computing | Weather modeling, research collaboration, data-intensive science
Cloud Computing | Web apps, AI/ML, big data analytics, storage
6. Simple Real-Life Analogy
• Cluster Computing: One lab with many computers working together
• Grid Computing: Many labs across the world sharing resources
• Cloud Computing: Renting computing power like electricity
7. Key Differences at a Glance
Point | Cluster | Grid | Cloud
Geographic spread | No | Yes | Yes
Elastic scaling | No | Limited | Yes
Service-based | No | No | Yes
User-friendly | Low | Medium | High
8. Exam-Friendly Conclusion
• Cluster Computing is best for high-performance tasks in a single location
• Grid Computing is ideal for collaborative, large-scale scientific problems
• Cloud Computing is best for scalable, on-demand services
Data Grids and Computational Grids
In Grid Computing, resources are shared across multiple locations. Based on the type of resource
being shared, grids are mainly classified into Data Grids and Computational Grids.
1. Data Grids
Definition
A Data Grid is a grid computing system designed to store, manage, and share large volumes of data
that are distributed across different geographic locations.
Key Characteristics
• Focus on data storage and data access
• Data may be replicated at multiple sites
• Provides secure, reliable, and fast access to data
• Ensures data consistency and availability
Architecture (Conceptual)
• Distributed data repositories
• Metadata services for data discovery
• Data transfer and replication services
• Security and access control mechanisms
Examples
• Scientific experiments producing massive datasets
• Medical image databases shared among hospitals
• Climate and satellite data repositories
Example Scenario:
A physics experiment generates terabytes of data stored in different research centers worldwide.
Scientists access this data through a data grid without knowing where it is physically stored.
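The location transparency in this scenario can be illustrated with a small replica catalog. The sketch below assumes a catalog that maps a logical file name to its physical replicas and picks the one with the lowest latency. The class names, the latency field, and the example.org URLs are all hypothetical and do not come from any specific data grid product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str          # hypothetical site name
    url: str           # hypothetical physical location of the copy
    latency_ms: float  # assumed measured latency from the user to the site


class ReplicaCatalog:
    """Maps a logical file name to the physical replicas that hold it."""

    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, replica):
        """Record that a site holds a copy of the named dataset."""
        self._replicas.setdefault(logical_name, []).append(replica)

    def locate(self, logical_name):
        """Return the lowest-latency replica; the user never picks a site."""
        replicas = self._replicas.get(logical_name)
        if not replicas:
            raise KeyError(f"no replica registered for {logical_name!r}")
        return min(replicas, key=lambda r: r.latency_ms)


if __name__ == "__main__":
    catalog = ReplicaCatalog()
    catalog.register("run42.dat", Replica("site-a", "https://site-a.example.org/run42.dat", 120.0))
    catalog.register("run42.dat", Replica("site-b", "https://site-b.example.org/run42.dat", 35.0))
    print(catalog.locate("run42.dat").site)
```

The scientist asks only for "run42.dat"; which research center serves the file is decided by the catalog.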
Advantages
• Efficient handling of huge datasets
• High data availability through replication
• Supports collaborative research
Limitations
• Complex data management
• Network bandwidth dependency
2. Computational Grids
Definition
A Computational Grid is designed to provide high processing power by aggregating computing
resources such as CPUs and GPUs from multiple systems.
Key Characteristics
• Focus on CPU-intensive and compute-heavy tasks
• Tasks are divided into smaller jobs and executed in parallel
• Uses job scheduling and load balancing
• Suitable for long-running simulations
Architecture (Conceptual)
• Compute nodes (clusters, servers, desktops)
• Job schedulers
• Resource brokers
• Monitoring services
Examples
• Weather forecasting models
• Engineering simulations (CFD, FEA)
• Financial risk analysis
Example Scenario:
A weather simulation is split into thousands of small calculations, each executed on different
computers in the grid, reducing execution time from weeks to hours.
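A rough way to check a "weeks to hours" claim is Amdahl's law, which says that only the parallelizable part of a job speeds up as nodes are added. The function below is a back-of-the-envelope sketch. The 99% parallel fraction and the 1000-node grid in the usage example are assumed figures, and network latency and scheduling overhead are ignored.

```python
def estimated_runtime_hours(serial_hours, parallel_fraction, node_count):
    """Amdahl's law: only the parallel fraction shrinks as nodes are added."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Part of the job that must still run on a single machine.
    serial_part = serial_hours * (1.0 - parallel_fraction)
    # Part of the job that is divided evenly across the nodes.
    parallel_part = serial_hours * parallel_fraction / node_count
    return serial_part + parallel_part


if __name__ == "__main__":
    two_weeks = 14 * 24  # 336 hours on a single machine (assumed figure)
    print(estimated_runtime_hours(two_weeks, 0.99, 1000))
```

Under these assumptions a two-week job drops to roughly 3.7 hours, and almost all of that remaining time is the 1% that cannot be parallelized.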
Advantages
• Massive computational power
• Better utilization of idle resources
• Cost-effective alternative to supercomputers
Limitations
• Requires efficient scheduling
• Performance depends on network latency
3. Comparison Between Data Grid and Computational Grid
Aspect | Data Grid | Computational Grid
Primary focus | Data storage and access | Processing and computation
Main resource | Data | CPU/GPU
Typical workload | Data-intensive | Compute-intensive
Key challenge | Data consistency | Job scheduling
Example use case | Scientific data repositories | Simulations and modeling
4. Exam-Friendly Conclusion
• Data Grids are optimized for managing and sharing large distributed datasets.
• Computational Grids are optimized for high-performance parallel computation.
• Both together form the foundation of Grid Computing.
Grid Architecture and Its Relation to Various Distributed Technologies
Grid Architecture defines how heterogeneous, geographically distributed resources are organized,
managed, and accessed as a single virtual system.
1. Grid Architecture (Layered Model)
Grid computing generally follows a layered architecture proposed by Ian Foster, Carl Kesselman, and Steven Tuecke.
1. Fabric Layer
• Contains physical resources: computers, clusters, storage systems, networks
• Provides access to local resources
Example:
Servers, workstations, supercomputers in different organizations
2. Connectivity Layer
• Handles communication and security
• Provides authentication, authorization, and secure data transfer
Example:
Secure login, encrypted data exchange between grid sites
3. Resource Layer
• Manages individual resources
• Responsible for job submission, monitoring, and control
Example:
Allocating CPU time on a remote machine
4. Collective Layer
• Coordinates multiple resources
• Provides services like resource discovery, scheduling, and load balancing
Example:
Selecting the best machines across the grid to run a job
5. Application Layer
• End-user applications that run on the grid
• Uses grid services and APIs
Example:
Scientific simulations, weather forecasting software
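The Collective layer example ("selecting the best machines across the grid to run a job") can be sketched as a simple matchmaking function. This is a minimal illustration under assumed data: the resource dictionaries and their field names (free_cpus, free_memory_gb, load) are hypothetical and stand in for whatever a real resource discovery service would report.

```python
def select_resource(resources, cpus_needed, memory_gb_needed):
    """Return the name of the least-loaded resource that satisfies the request."""
    # Resource discovery result, filtered down to machines that can fit the job.
    candidates = [
        r for r in resources
        if r["free_cpus"] >= cpus_needed and r["free_memory_gb"] >= memory_gb_needed
    ]
    if not candidates:
        return None  # nothing in the grid can run this job right now
    # Scheduling and load balancing: prefer the machine with the lowest load.
    return min(candidates, key=lambda r: r["load"])["name"]


if __name__ == "__main__":
    grid = [
        {"name": "lab-a", "free_cpus": 8, "free_memory_gb": 32, "load": 0.7},
        {"name": "lab-b", "free_cpus": 16, "free_memory_gb": 64, "load": 0.2},
        {"name": "lab-c", "free_cpus": 2, "free_memory_gb": 4, "load": 0.1},
    ]
    print(select_resource(grid, cpus_needed=4, memory_gb_needed=16))
```

Once a machine is chosen, the Resource layer would handle the actual job submission on it.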
2. Relation to Various Distributed Technologies
Grid computing is closely related to other distributed computing technologies but differs in scope
and control.
A. Grid Computing vs Cluster Computing
Feature | Grid | Cluster
Location | Geographically distributed | Single location
Ownership | Multiple organizations | Single organization
Coupling | Loosely coupled | Tightly coupled
Network | WAN / Internet | High-speed LAN
Relation:
Clusters can act as building blocks (nodes) within a grid.
B. Grid Computing vs Cloud Computing
Feature | Grid | Cloud
Resource access | Shared collaboration | On-demand service
Virtualization | Limited | Extensive
Cost model | Shared cost | Pay-as-you-go
Resource control | User-managed | Provider-managed
Relation:
Cloud computing evolved from grid concepts but focuses on service delivery and scalability.
C. Grid Computing vs Distributed Systems
Feature | Grid | Distributed System
Scope | Large-scale resource sharing | General-purpose
Heterogeneity | High | Moderate
Autonomy | Resources remain autonomous | Usually centrally managed
Relation:
Grid computing is a specialized form of distributed system.
D. Grid Computing vs Peer-to-Peer (P2P) Systems
Feature | Grid | P2P
Control | Managed | Decentralized
Security | Strong | Weak/limited
Resource reliability | High | Variable
Relation:
Both share resources, but grids provide strong security and coordination.
3. Summary Table
Technology | Relationship to Grid Computing
Cluster | Forms building blocks of grid
Cloud | Commercial evolution of grid ideas
Distributed Systems | Grid is a subset
P2P Systems | Similar sharing but less control
4. Exam-Friendly Conclusion
• Grid Architecture organizes distributed resources using a layered model
• It enables secure, scalable, and coordinated resource sharing
• Grid computing bridges clusters, distributed systems, and cloud computing
Autonomic Computing
Autonomic Computing is a computing paradigm in which systems are self-managing, capable of
automatic decision-making and adaptation with minimal human intervention.
The concept was introduced by IBM to address the growing complexity of large-scale distributed
systems.
1. Need for Autonomic Computing
As computing systems (grids, clouds, data centers) grow in size and complexity:
• Manual management becomes impractical
• System failures increase
• Maintenance cost rises
Autonomic computing aims to reduce human effort and increase system reliability.
2. Key Characteristics (Self-CHOP)
1. Self-Configuring
• Automatically configures components
• Adapts to changes in workload
Example:
New servers automatically join a grid or cloud system.
2. Self-Healing
• Detects, diagnoses, and repairs failures
• Ensures continuous operation
Example:
If a grid node fails, tasks are automatically reassigned.
3. Self-Optimizing
• Monitors performance and improves efficiency
• Balances workload dynamically
Example:
Automatic load balancing during peak usage.
4. Self-Protecting
• Detects and prevents security threats
• Responds to attacks automatically
Example:
Blocking unauthorized access in real time.
3. Autonomic Architecture (MAPE-K Loop)
MAPE-K Components
• Monitor – Collects system data
• Analyze – Evaluates system behavior
• Plan – Decides corrective actions
• Execute – Applies changes
• Knowledge – Stores policies and system models
This loop continuously controls the system.
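One pass of the loop can be sketched as follows. This is an illustrative model only: the AutonomicManager class, the load threshold of 0.8, and the rule "move one job from each overloaded node to the least loaded node" are assumptions chosen for the example, not part of any IBM specification.

```python
class AutonomicManager:
    """One pass of a MAPE-K loop over a pool of nodes (illustrative only)."""

    def __init__(self, max_load=0.8):
        # Knowledge: the policy (load limit) and the last observed state.
        self.knowledge = {"max_load": max_load, "last_loads": {}}

    def monitor(self, nodes):
        """Monitor: collect the current load of every node."""
        loads = {name: node["load"] for name, node in nodes.items()}
        self.knowledge["last_loads"] = loads
        return loads

    def analyze(self, loads):
        """Analyze: find nodes whose load violates the policy."""
        limit = self.knowledge["max_load"]
        return [name for name, load in loads.items() if load > limit]

    def plan(self, overloaded, loads):
        """Plan: move one job from each overloaded node to the least loaded one."""
        if not overloaded:
            return []
        target = min(loads, key=loads.get)
        return [(source, target) for source in overloaded if source != target]

    def execute(self, actions, nodes, job_load=0.2):
        """Execute: apply the planned moves (each job is assumed to add 0.2 load)."""
        for source, target in actions:
            nodes[source]["load"] = round(nodes[source]["load"] - job_load, 2)
            nodes[target]["load"] = round(nodes[target]["load"] + job_load, 2)

    def run_once(self, nodes):
        loads = self.monitor(nodes)
        overloaded = self.analyze(loads)
        actions = self.plan(overloaded, loads)
        self.execute(actions, nodes)
        return actions


if __name__ == "__main__":
    nodes = {"node-a": {"load": 0.9}, "node-b": {"load": 0.1}}
    print(AutonomicManager().run_once(nodes), nodes)
```

A real autonomic manager would call run_once repeatedly and would draw on richer knowledge, such as failure history and system models.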
4. Role of Autonomic Computing in Grid and Cloud Systems
• Essential for managing large grids
• Enables fault tolerance and scalability
• Reduces operational cost
• Improves system availability
Example:
In a computational grid, autonomic managers handle job failures and resource reallocation without
administrator intervention.
5. Advantages
• Reduced human intervention
• Higher system reliability
• Better performance optimization
• Lower management cost
6. Limitations
• Complex design and implementation
• Security policy management challenges
• Initial deployment cost
7. Real-Life Analogy
Human autonomic nervous system controlling breathing and heartbeat without conscious effort.
8. Exam-Friendly Conclusion
Autonomic computing enables self-managing systems that are capable of self-configuration,
self-healing, self-optimization, and self-protection, making it essential for modern grid, cloud, and
distributed computing environments.
Examples of Grid Computing Efforts by IBM
IBM was one of the early pioneers of Grid Computing and played a major role in bringing grid
technologies from research labs into enterprise and commercial environments.
1. IBM Autonomic Computing Initiative
IBM introduced Autonomic Computing to manage the complexity of large-scale systems, including grids.
Purpose
• Reduce complexity in managing grids
• Enable self-managing behavior
Key Features
• Self-configuring
• Self-healing
• Self-optimizing
• Self-protecting
Example:
If a grid node fails, IBM autonomic managers automatically reassign tasks to other nodes.
2. IBM Grid Toolbox
IBM developed the Grid Toolbox to help organizations build and manage grid environments.
Functions
• Resource discovery
• Job scheduling
• Workload management
• Security services
Technologies Used
• Based on open grid standards, notably the Open Grid Services Architecture (OGSA)
• Integrated with Globus Toolkit
Example:
Research labs used IBM Grid Toolbox to share computing resources across departments.
3. IBM Enterprise Grid Computing
IBM applied grid computing to enterprise environments.
Key Idea
• Utilize idle CPUs across an organization
• Improve overall system utilization
Applications
• Financial risk analysis
• Portfolio simulations
• Business analytics
Example:
Banks used IBM enterprise grids to run overnight risk calculations using unused office computers.
4. IBM Grid Computing in Scientific Research
IBM collaborated with universities and research institutions.
Applications
• Weather modeling
• Life sciences and genomics
• High-energy physics
Example:
Large-scale scientific simulations were executed using IBM-supported grid infrastructures.
5. IBM Grid and Cloud Transition
IBM’s grid computing efforts influenced modern cloud computing.
Contributions
• Resource virtualization
• Automated provisioning
• Service-oriented architecture
Result:
Grid computing concepts later evolved into IBM cloud platforms.
6. Summary Table
IBM Grid Effort | Description
Autonomic Computing | Self-managing grid systems
Grid Toolbox | Tools for building grid infrastructure
Enterprise Grid | Business use of idle computing power
Research Grids | Scientific and academic collaboration
7. Exam-Friendly Conclusion
IBM’s grid computing efforts laid the foundation for autonomic systems and cloud computing,
enabling efficient resource sharing, fault tolerance, and large-scale collaboration.
Module 2
Cluster Computing at a Glance
1. Introduction to Cluster Computing
Cluster Computing is a computing technique in which multiple independent computers (nodes) are
connected through a high-speed local network and work together as a single integrated system.
• All nodes cooperate to execute tasks
• Appears to users as one powerful computer
• Mainly used for High Performance Computing (HPC) and high availability
2. A Cluster Computer
A cluster computer consists of several interconnected machines that function collectively.
Basic Components of a Cluster Computer
1. Nodes
o Individual computers (PCs/servers)
o Can be master node and worker nodes
2. High-Speed Network
o Typically an Ethernet or InfiniBand LAN
o Enables fast communication between nodes
3. Shared Storage (Optional)
o Common data access for all nodes
o Used in data-intensive applications
4. Cluster Middleware
o Software that manages the cluster
o Handles job scheduling, communication, and monitoring
Working of a Cluster Computer
1. User submits a job to the master node
2. Job is divided into smaller tasks
3. Tasks are distributed to worker nodes
4. Results are collected and combined
5. Final output is returned to the user
3. Types of Cluster Computing
• High Performance Clusters (HPC): Used for scientific and engineering computations
• High Availability Clusters: Used to minimize downtime and ensure reliability
• Load Balancing Clusters: Distribute workloads evenly across servers
4. Advantages of Cluster Computing
• High processing speed
• Cost-effective compared to supercomputers
• Scalability by adding more nodes
• Fault tolerance (limited)
5. Limitations
• Requires specialized setup and maintenance
• Limited scalability compared to cloud systems
• Depends heavily on network performance
6. Real-Life Example
A university connects multiple lab computers to form a cluster for running simulation programs,
reducing execution time significantly.
7. Exam-Friendly Conclusion
Cluster computing combines multiple computers in a single location to work as one system,
providing high performance, reliability, and scalability for compute-intensive tasks.
Cluster Computing: Architecture and Classifications
1. Architecture of Cluster Computing
The cluster architecture defines how multiple computers (nodes) are interconnected and managed
to work as a single system.
1.1 Basic Cluster Architecture
A typical cluster consists of:
1. Master (Head) Node
• Controls the entire cluster
• Accepts user jobs
• Distributes tasks to worker nodes
• Collects and combines results
2. Worker (Compute) Nodes
• Perform actual computation
• Execute tasks assigned by the master node
• Can be homogeneous or heterogeneous
3. High-Speed Interconnection Network
• Provides fast communication between nodes
• Usually an Ethernet or InfiniBand LAN
4. Shared Storage (Optional)
• Centralized data storage accessible by all nodes
• Used in data-intensive applications
5. Cluster Middleware
• Software layer that manages the cluster
• Handles:
o Job scheduling
o Resource allocation
o Load balancing
o Fault monitoring
Working of Cluster Architecture
1. User submits job to the master node
2. Middleware splits the job into tasks
3. Tasks are assigned to worker nodes
4. Nodes execute tasks in parallel
5. Results are merged and returned
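The master-worker flow above can be sketched with a shared task queue. In this minimal example, threads on one machine stand in for worker nodes and a simple sum stands in for the real computation. The function names and the None stop signal are choices made for the sketch, not the behavior of any particular cluster middleware.

```python
import queue
import threading


def worker(node_id, tasks, results):
    """Worker node: take tasks until the master sends a stop signal (None)."""
    while True:
        task = tasks.get()
        if task is None:
            break
        task_id, values = task
        results.put((task_id, node_id, sum(values)))


def master(job, worker_count=3, task_size=4):
    """Master node: split the job, hand out tasks, and merge the results."""
    tasks, results = queue.Queue(), queue.Queue()
    # Step 2: split the job into tasks.
    chunks = [job[i:i + task_size] for i in range(0, len(job), task_size)]
    threads = [
        threading.Thread(target=worker, args=(n, tasks, results))
        for n in range(worker_count)
    ]
    for t in threads:
        t.start()
    # Step 3: assign tasks; idle workers pull the next one from the queue.
    for task_id, chunk in enumerate(chunks):
        tasks.put((task_id, chunk))
    for _ in threads:
        tasks.put(None)  # one stop signal per worker
    # Step 4: workers execute in parallel until the queue is drained.
    for t in threads:
        t.join()
    # Step 5: merge the partial results and return the final output.
    partials = [results.get() for _ in chunks]
    return sum(value for _, _, value in partials)


if __name__ == "__main__":
    print(master(list(range(1, 21))))
```

Because workers pull tasks from the queue, a faster node naturally processes more tasks, which gives a basic form of load balancing.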
2. Cluster Classifications
Clusters can be classified based on purpose and functionality.
2.1 High Performance Computing (HPC) Clusters
Purpose
• Achieve maximum computational speed
Characteristics
• Tightly coupled nodes
• Parallel processing
• High-speed networks
Applications
• Scientific simulations
• Weather forecasting
• Engineering analysis
2.2 High Availability (HA) Clusters
Purpose
• Ensure continuous service and minimize downtime
Characteristics
• Redundant nodes
• Failover mechanisms
• Automatic recovery
Applications
• Banking systems
• Web servers
• Critical enterprise applications
2.3 Load Balancing Clusters
Purpose
• Distribute workload evenly across nodes
Characteristics
• Improves response time
• Prevents overloading of servers
Applications
• Web hosting
• E-commerce platforms
• Application servers
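Two common ways to distribute requests are round robin and least connections. The notes do not name a specific policy, so both are shown here purely as illustrations. The class names and the server labels are hypothetical.

```python
import itertools


class RoundRobinBalancer:
    """Hands requests to servers in a fixed rotating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Sends each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # On a tie, the server listed first is chosen.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a request finishes so the server's count drops again."""
        self.active[server] -= 1


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
    print([balancer.pick() for _ in range(4)])
```

Round robin is simplest when requests cost about the same; least connections adapts better when some requests take much longer than others.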
2.4 Storage Clusters
Purpose
• Provide high-capacity and reliable data storage
Characteristics
• Distributed storage systems
• Data replication and redundancy
Applications
• File servers
• Big data storage
• Backup systems
2.5 Heterogeneous vs Homogeneous Clusters
Type | Description
Homogeneous Cluster | All nodes have identical hardware and software
Heterogeneous Cluster | Nodes differ in hardware or operating systems
3. Summary Table
Classification | Main Objective | Example Use
HPC Cluster | High computation speed | Scientific research
HA Cluster | Fault tolerance | Banking systems
Load Balancing Cluster | Performance improvement | Web servers
Storage Cluster | Data reliability | Big data storage
4. Exam-Friendly Conclusion
• Cluster Architecture enables multiple computers to work as a single system
• Master–worker model is the most common architecture
• Clusters are classified based on performance, availability, load balancing, and storage needs
• Cluster computing is widely used in HPC and enterprise environments
Commodity Components for Clusters
Commodity components are low-cost, off-the-shelf hardware and software used to build cluster
computers instead of expensive, proprietary supercomputer parts.
Clusters built this way are often called commodity clusters (e.g., Beowulf clusters).
1. Commodity Nodes (Compute Nodes)
• Standard PCs or servers
• Use common processors (Intel/AMD)
• Each node has its own:
o CPU
o RAM
o Local disk
o Network interface card (NIC)
Example:
Desktop-class machines connected together to form an HPC cluster.
2. Commodity Processors
• General-purpose CPUs
• Multi-core processors commonly used
Advantages
• Easily available
• Low cost
• Easy replacement and upgrade
3. Commodity Memory (RAM)
• Standard DDR RAM modules
• Installed independently on each node
Benefit:
Scalable memory by adding more nodes.
4. Commodity Storage
A. Local Storage
• HDDs or SSDs on each node
• Used for temporary data and OS
B. Shared Storage (Optional)
• Network Attached Storage (NAS)
• Storage Area Network (SAN)
Purpose:
Provides common data access for all nodes.
5. Commodity Network Components
• Ethernet switches
• Standard LAN cables
• Network Interface Cards (NICs)
Common Networks Used
• Gigabit Ethernet
• 10/25/40 Gb Ethernet
Note:
High-speed networks improve performance but increase cost.
6. Commodity Operating Systems
• Open-source OS widely used:
o Linux distributions (Ubuntu, CentOS, Rocky Linux)
Reasons
• Free and flexible
• Strong networking and process support
• Ideal for parallel computing
7. Commodity Middleware and Software
• Cluster management software
• Parallel programming libraries
Examples
• MPI (Message Passing Interface)
• Job schedulers (SLURM, PBS)
Role
• Job scheduling
• Resource management
• Inter-node communication
8. Advantages of Using Commodity Components
• Low cost compared to supercomputers
• Easy scalability (add more nodes)
• Vendor independence
• Easy maintenance and replacement
9. Limitations
• Performance depends on network speed
• Higher power and space requirements
• More management effort than integrated systems
10. Exam-Friendly Conclusion
Commodity components enable the construction of cost-effective, scalable, and flexible cluster
computers using standard hardware and software, making cluster computing accessible to
universities, research labs, and enterprises.
Network Services / Communication Software in Cluster Computing
In cluster computing, network services and communication software enable efficient data
exchange, coordination, and control among multiple nodes so that the cluster behaves like a single
system.
1. Need for Network Services in Clusters
• Nodes must exchange data rapidly
• Tasks must be synchronized
• Resources must be discovered and managed
• Failures must be detected
Network services act as the backbone of cluster operation.
2. Communication Software
2.1 Message Passing Interface (MPI)
MPI is the most widely used communication standard in clusters.
Functions
• Point-to-point communication (send/receive)
• Collective communication (broadcast, reduce)
• Synchronization among nodes
Example:
In a weather simulation, each node computes part of the model and exchanges results using MPI.
2.2 Remote Procedure Call (RPC)
• Allows one node to execute a procedure on another node remotely
• Used for control and management tasks
Example:
Master node remotely starts or stops processes on worker nodes.
2.3 Sockets
• Low-level communication mechanism
• Uses TCP/UDP protocols
• Provides flexibility but requires more programming effort
Example:
Custom cluster applications using socket programming.
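A minimal TCP exchange shows the extra effort that sockets require compared with higher-level libraries. In this sketch the "worker" runs as a thread on the same machine and listens on 127.0.0.1. The function names and the comma-separated message format are assumptions made for the example.

```python
import socket
import threading


def worker_server(listener):
    """Worker side: accept one connection, read numbers, reply with their sum."""
    conn, _ = listener.accept()
    with conn:
        # A single recv is enough here only because the message is tiny;
        # real code must loop until the whole message has arrived.
        data = conn.recv(1024).decode()
        total = sum(int(n) for n in data.split(","))
        conn.sendall(str(total).encode())


def send_task(numbers):
    """Master side: send a comma-separated task and wait for the reply."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    thread = threading.Thread(target=worker_server, args=(listener,))
    thread.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(",".join(str(n) for n in numbers).encode())
        reply = client.recv(1024).decode()
    thread.join()
    listener.close()
    return int(reply)


if __name__ == "__main__":
    print(send_task([1, 2, 3]))
```

The programmer has to define the message format, framing, and error handling by hand, which is the "more programming effort" noted above.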
3. Network Services in Cluster Computing
3.1 Naming and Directory Services
• Identify nodes and services in the cluster
• Maintain resource information
Example:
DNS, LDAP-based services
3.2 Resource Discovery Services
• Locate available CPUs, memory, and storage
• Helps schedulers select suitable nodes
3.3 Time Synchronization Services
• Maintain consistent system time across nodes
Example:
NTP (Network Time Protocol)
3.4 Monitoring and Management Services
• Track node health and performance
• Detect failures
Example:
Heartbeat monitoring in HA clusters
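Heartbeat monitoring can be sketched as a table of "last seen" times checked against a timeout. The class below is a simplified illustration. The 3-second timeout is an assumed value, and timestamps are passed in explicitly (for example from time.monotonic()) so that the logic is easy to follow and test.

```python
class HeartbeatMonitor:
    """Marks a node as failed when no heartbeat arrives within the timeout."""

    def __init__(self, timeout_seconds=3.0):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def record_heartbeat(self, node, now):
        """Called whenever a heartbeat message from a node is received."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=3.0)
    monitor.record_heartbeat("node-1", now=0.0)
    monitor.record_heartbeat("node-2", now=2.5)
    print(monitor.failed_nodes(now=4.0))
```

In an HA cluster, the list of failed nodes would trigger failover, with the services of a failed node restarted on a healthy one.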
3.5 Security Services
• Authentication and authorization
• Secure communication between nodes
Example:
SSH, SSL/TLS-based communication
4. Communication Models Used in Clusters
Model | Description | Example Use
Message Passing | Explicit data exchange | MPI-based HPC
Shared Memory | Shared address space | SMP clusters
Hybrid Model | Combination of both | Large HPC systems
5. Role in Different Cluster Types
Cluster Type | Role of Communication Software
HPC Cluster | High-speed data exchange
HA Cluster | Failure detection and failover
Load Balancing Cluster | Request distribution
Storage Cluster | Data replication
6. Advantages
• Efficient parallel execution
• Scalability across nodes
• Fault detection and recovery
7. Exam-Friendly Conclusion
Network services and communication software enable coordination, data exchange,
synchronization, and fault tolerance in cluster computing, making parallel execution possible and
efficient.
Cluster Middleware and Single System Image (SSI)
In cluster computing, cluster middleware and Single System Image (SSI) work together to make a
collection of independent computers behave like one unified system.
1. Cluster Middleware
Definition
Cluster middleware is the software layer that sits between the hardware/operating system and user
applications, managing and coordinating all cluster resources.
Functions of Cluster Middleware
• Job scheduling and resource allocation
• Process creation and management
• Inter-node communication
• Load balancing
• Fault detection and recovery
• Monitoring and administration
Examples of Cluster Middleware
• MPI (Message Passing Interface)
• Job schedulers (SLURM, PBS)
• Cluster management tools
Role of Middleware in a Cluster
1. Accepts user jobs
2. Allocates suitable nodes
3. Manages execution
4. Handles node failures
5. Collects results
2. Single System Image (SSI)
Definition
Single System Image (SSI) is the illusion that a cluster of multiple computers appears to users and
applications as one single, unified computer system.
Objectives of SSI
• Simplify cluster usage
• Hide hardware complexity
• Improve usability and transparency
Key SSI Features
• Single login point
• Single file hierarchy
• Single process space
• Single job management system
• Unified resource management
Example
A user logs into the cluster once and runs an application without worrying about which node
executes it.
3. Relationship Between Cluster Middleware and SSI
Cluster Middleware | SSI
Implements resource management | Provides unified system view
Controls job execution | Hides distributed nature
Manages communication | Improves user transparency
Middleware enables SSI, and SSI is the goal achieved by middleware.
4. Advantages
Cluster Middleware
• Efficient resource utilization
• Fault tolerance
• Scalability
SSI
• Ease of use
• Improved productivity
• Simplified system administration
5. Limitations
• Complex design
• Overhead in maintaining transparency
• Difficult debugging
6. Exam-Friendly Conclusion
Cluster middleware provides the management and coordination mechanisms, while Single System
Image (SSI) provides the illusion of a single system, together making cluster computing powerful,
scalable, and user-friendly.
RMS – Resource Management System (in Cluster Computing)
In cluster computing, RMS (Resource Management System) is a crucial software component
responsible for allocating, scheduling, monitoring, and managing resources such as CPUs, memory,
storage, and network bandwidth across the cluster.
1. Definition
A Resource Management System (RMS) controls how cluster resources are shared among users and
applications to ensure:
• Efficient utilization
• Fair access
• High performance
• Fault tolerance
2. Objectives of RMS
• Allocate resources optimally
• Schedule jobs efficiently
• Balance load across nodes
• Monitor system health
• Handle node failures
3. Core Functions of RMS
3.1 Resource Discovery
• Identifies available nodes and their capabilities
• Tracks CPU load, memory, and storage status
3.2 Job Scheduling
• Decides when and where a job should run
• Uses scheduling policies like:
o FIFO
o Priority-based
o Fair-share
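The difference between the first two policies can be shown with two small queue classes. This is a sketch with hypothetical class names, not the implementation used by any real RMS. Fair-share scheduling is left out because it needs per-user usage accounting.

```python
import heapq
from collections import deque


class FifoScheduler:
    """Runs jobs strictly in submission order."""

    def __init__(self):
        self._queue = deque()

    def submit(self, job, priority=0):
        self._queue.append(job)  # priority is ignored under FIFO

    def next_job(self):
        return self._queue.popleft() if self._queue else None


class PriorityScheduler:
    """Runs the highest-priority job first; ties keep submission order."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # submission index, used to break ties

    def submit(self, job, priority=0):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, self._counter, job))
        self._counter += 1

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


if __name__ == "__main__":
    for scheduler in (FifoScheduler(), PriorityScheduler()):
        scheduler.submit("backup", priority=1)
        scheduler.submit("simulation", priority=5)
        print(type(scheduler).__name__, scheduler.next_job())
```

With the same submissions, the FIFO scheduler runs "backup" first because it arrived first, while the priority scheduler runs "simulation" first because it has the higher priority.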
3.3 Resource Allocation
• Assigns CPUs, memory, and nodes to jobs
• Prevents conflicts between jobs
3.4 Job Execution & Control
• Starts, pauses, migrates, or terminates jobs
• Monitors job progress
3.5 Load Balancing
• Distributes workload evenly
• Prevents node overloading
3.6 Fault Detection & Recovery
• Detects node or job failure
• Reschedules jobs on healthy nodes
4. Components of an RMS
Component | Function
Job Queue | Holds submitted jobs
Scheduler | Selects jobs for execution
Resource Monitor | Tracks node status
Dispatcher | Assigns jobs to nodes
Accounting Module | Records resource usage
5. Types of RMS in Clusters
1. Centralized RMS
• Single master controls all scheduling
• Simple but less scalable
2. Distributed RMS
• Multiple controllers
• Better scalability and fault tolerance
6. Examples of RMS (for understanding)
• SLURM
• PBS
• LSF
(Examples are for explanation; exams usually focus on concepts.)
7. Advantages
• Improves system throughput
• Maximizes resource utilization
• Supports multi-user environments
• Enhances reliability
8. Limitations
• Scheduling overhead
• Complexity in large clusters
• Requires proper configuration
9. Exam-Friendly Conclusion
The Resource Management System (RMS) is the backbone of cluster computing that controls
resource allocation, job scheduling, load balancing, and fault handling, ensuring efficient and
reliable cluster operation.
Programming Environments and Tools (in Cluster Computing)
In cluster computing, programming environments and tools provide the software support required
to develop, execute, debug, and optimize parallel applications that run across multiple nodes.
1. Introduction
A programming environment in clusters includes:
• Programming models
• Libraries
• Compilers
• Debugging and performance tools
These tools allow programmers to write parallel programs without dealing directly with low-level
hardware complexities.
2. Programming Models Used in Clusters
2.1 Message Passing Model
• Processes communicate by sending and receiving messages
• Most widely used model in clusters
Tool Used: MPI (Message Passing Interface)
Example:
Each node computes part of a matrix and exchanges results via messages.
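The pattern in this example (split the matrix, compute parts in parallel, exchange results as messages) can be sketched without an MPI installation. The code below is only an in-process stand-in: threads play the role of nodes and `queue.Queue` objects play the role of send and receive. It is not MPI, and the function names are made up for the illustration.

```python
import queue
import threading


def worker(rank, inbox, outbox):
    """Receive a block of rows and the vector, compute, send the partial result back."""
    rows, vector = inbox.get()                       # blocking receive
    partial = [sum(a * x for a, x in zip(row, vector)) for row in rows]
    outbox.put((rank, partial))                      # send to the root


def distributed_matvec(matrix, vector, n_workers=2):
    """Matrix-vector product with the rows split across n_workers 'nodes'."""
    inboxes = [queue.Queue() for _ in range(n_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(r, inboxes[r], results))
        for r in range(n_workers)
    ]
    for t in threads:
        t.start()
    # Scatter: contiguous blocks of rows, one block per worker.
    block = (len(matrix) + n_workers - 1) // n_workers
    for r in range(n_workers):
        inboxes[r].put((matrix[r * block:(r + 1) * block], vector))
    # Gather: collect the partial results and reassemble them in rank order.
    partials = dict(results.get() for _ in range(n_workers))
    for t in threads:
        t.join()
    return [y for r in range(n_workers) for y in partials[r]]


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, 1]
print(distributed_matvec(A, x))  # [3, 7, 11, 15]
```

No worker ever reads another worker's data directly. Everything it needs arrives in a message, and everything it produces leaves in a message, which is the defining property of the message passing model.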
2.2 Shared Memory Model
• Multiple threads or processes share a common memory space
• Mostly used within multi-core or SMP nodes of a cluster
Tool Used: OpenMP
2.3 Hybrid Programming Model
• Combines message passing and shared memory
• Used in large HPC clusters
Example:
MPI between nodes + OpenMP within a node.
3. Programming Tools and Libraries
3.1 MPI (Message Passing Interface)
• Standard library for parallel programming
• Supports point-to-point and collective communication
Functions include:
• Send / Receive
• Broadcast
• Reduce
Usage:
Scientific simulations, weather models, engineering applications.
3.2 OpenMP
• Compiler-based parallel programming
• Uses directives (#pragma)
• Easier to program than MPI
Usage:
Loop-level parallelism on shared-memory systems.
3.3 PVM (Parallel Virtual Machine)
• Older message-passing system
• Allows heterogeneous systems
(Mostly of historical importance)
4. Development Tools
4.1 Compilers
• Translate parallel programs into executable code
Examples:
• GCC
• Intel compilers
4.2 Debugging Tools
• Help detect logical and communication errors
Functions:
• Deadlock detection
• Process tracing
4.3 Performance Analysis Tools
• Measure execution time
• Identify bottlenecks
• Optimize parallel efficiency
5. Job Execution and Management Tools
• Interface with the Resource Management System (RMS)
• Control job submission and execution
Functions:
• Job submission
• Job monitoring
• Resource usage reporting
6. Integrated Programming Environment
A typical cluster programming workflow:
1. Write parallel program
2. Compile using parallel compiler
3. Submit job via RMS
4. Monitor execution
5. Analyze performance
7. Advantages
• Enables efficient parallel programming
• Improves application performance
• Simplifies development and debugging
• Supports scalability
8. Limitations
• Steep learning curve
• Debugging parallel programs is complex
• Performance tuning requires expertise
9. Exam-Friendly Conclusion
Programming environments and tools in cluster computing provide the necessary software
framework to design, execute, debug, and optimize parallel applications, making efficient use of
distributed cluster resources.
Cluster Applications
Cluster applications are software applications designed to run on cluster computing systems, where
multiple interconnected computers work together to deliver high performance, reliability, and
scalability.
1. Scientific and Engineering Applications
Description
Clusters are widely used for compute-intensive scientific problems that require massive parallel
processing.
Examples
• Weather forecasting
• Climate modeling
• Computational Fluid Dynamics (CFD)
• Molecular modeling
Why clusters?
Large problems are divided into smaller tasks and executed simultaneously on many nodes.
2. High Performance Computing (HPC) Applications
Description
HPC applications require extreme processing power and low-latency communication.
Examples
• Astrophysics simulations
• Nuclear research
• Space research
• Seismic data processing
Benefit:
Clusters provide supercomputer-like performance at lower cost.
3. Business and Enterprise Applications
Description
Clusters improve performance, availability, and scalability of business systems.
Examples
• Financial risk analysis
• Online transaction processing
• Data warehousing
Use Case:
Banks use clusters to process millions of transactions reliably.
4. Web and Internet Applications
Description
Clusters are used to handle large numbers of user requests.
Examples
• Web server clusters
• E-commerce platforms
• Content delivery systems
Technique Used:
Load balancing distributes requests across servers.
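Round-robin is probably the simplest load-balancing rule: requests are handed to the servers in rotation. The sketch below is a minimal illustration with hypothetical server names. It ignores server health and request cost, which production load balancers usually take into account.

```python
import itertools
from collections import Counter


def round_robin(servers):
    """Return a function that hands out servers in rotation."""
    cycle = itertools.cycle(servers)
    return lambda request: next(cycle)


pick = round_robin(["web1", "web2", "web3"])   # hypothetical server names
assignments = [pick(f"req{i}") for i in range(7)]
print(assignments)
print(Counter(assignments))  # web1 gets 3 requests, web2 and web3 get 2 each
```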
5. High Availability (HA) Applications
Description
HA clusters ensure continuous service even if a node fails.
Examples
• Banking systems
• Airline reservation systems
• Hospital management systems
Feature:
Automatic failover to backup nodes.
6. Database and Storage Applications
Description
Clusters support large-scale data storage and fast access.
Examples
• Distributed databases
• Big data analytics
• Backup and recovery systems
7. Multimedia and Graphics Applications
Description
Used for rendering and processing large multimedia content.
Examples
• Animation rendering
• Video processing
• Image analysis
Benefit:
Reduces rendering time from days to hours.
8. Machine Learning and AI Applications
Description
Clusters accelerate training of AI and ML models.
Examples
• Deep learning model training
• Natural language processing
• Image recognition
9. Summary Table
Application Area | Purpose
Scientific Computing | High-speed simulations
HPC | Massive parallel computation
Enterprise Systems | Reliability and scalability
Web Services | Load balancing
HA Systems | Fault tolerance
Databases | Large data handling
AI/ML | Fast model training
10. Exam-Friendly Conclusion
Cluster applications span science, engineering, business, web services, databases, and AI, leveraging
parallel processing, fault tolerance, and scalability to solve complex and large-scale problems
efficiently.
Lightweight Messaging Systems (LMS)
Introduction & Latency–Bandwidth Evaluation of Communication Performance
1. Introduction to Lightweight Messaging Systems
A Lightweight Messaging System (LMS) is a communication software layer used in cluster and
distributed computing to enable fast, low-overhead message exchange between nodes.
Why “Lightweight”?
• Minimal protocol overhead
• Small memory footprint
• Designed for high performance
• Avoids complex OS or network layers
LMS is especially important in HPC clusters, where communication speed directly affects
performance.
Role of LMS in Clusters
• Enables process-to-process communication
• Supports parallel programming models
• Reduces communication delay
• Improves scalability
Examples (Conceptual)
• Message passing libraries
• Low-level communication layers used under MPI
(In exams, focus on concept rather than specific products.)
2. Communication Performance Metrics
To evaluate the performance of a lightweight messaging system, two key metrics are used:
1. Latency
2. Bandwidth
3. Latency in Communication Performance
Definition
Latency is the time taken for a message to travel from the sender to the receiver.
Latency = Send time + Network delay + Receive time
Key Points
• Measured in microseconds (µs)
• Important for small message transfers
• Lower latency = better performance
Factors Affecting Latency
• Network hardware
• Protocol overhead
• Software stack
• Context switching
Example
If a small control message takes 5 µs to reach another node, the latency is 5 µs. This fixed cost is paid by every message, whatever its size.
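Latency is commonly estimated with a ping-pong test: a tiny message is bounced back and forth many times and the average round-trip time is halved. The sketch below does this over a local socket pair inside one machine, so the numbers reflect OS and interpreter overhead rather than a real cluster interconnect. The function names and default parameters are assumptions of the example.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo(sock, rounds, size):
    """The 'pong' side: send every message straight back."""
    for _ in range(rounds):
        sock.sendall(recv_exact(sock, size))


def ping_pong_latency(rounds=200, size=1):
    """Estimate one-way latency in seconds as half the mean round-trip time."""
    a, b = socket.socketpair()
    t = threading.Thread(target=echo, args=(b, rounds, size))
    t.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(rounds):
        a.sendall(payload)
        recv_exact(a, size)
    elapsed = time.perf_counter() - start
    t.join()
    a.close()
    b.close()
    return elapsed / rounds / 2


print(f"estimated one-way latency: {ping_pong_latency() * 1e6:.1f} µs")
```

Using a 1-byte payload keeps the transfer time negligible, so what remains is almost entirely the fixed per-message cost.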
4. Bandwidth in Communication Performance
Definition
Bandwidth is the amount of data transferred per unit time between nodes.
Bandwidth = Message Size / Transfer Time
Key Points
• Measured in MB/s or GB/s
• Important for large message transfers
• Higher bandwidth = better throughput
Factors Affecting Bandwidth
• Network speed
• Message size (small messages achieve less than the peak rate because latency dominates)
• Buffer size
• Communication protocol
Example
If 100 MB of data is transferred in 1 second,
Bandwidth = 100 MB/s
5. Latency vs Bandwidth (Key Difference)
Aspect | Latency | Bandwidth
Meaning | Delay in communication | Data transfer rate
Units | µs or ms | MB/s or GB/s
Affects | Small messages | Large messages
Goal | Minimize latency | Maximize bandwidth
6. Latency–Bandwidth Evaluation
Communication Time Model
$$T = L + \frac{M}{B}$$
Where:
• T = Total communication time
• L = Latency
• M = Message size
• B = Bandwidth
Interpretation
• For small messages → latency dominates
• For large messages → bandwidth dominates
Graph Interpretation (Exam Tip)
• X-axis: Message size
• Y-axis: Communication time
• Initial flat region → latency
• Sloped region → bandwidth limitation
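The model can be checked numerically. The sketch below reuses the 5 µs latency and 100 MB/s bandwidth from the earlier examples, taking 1 MB as 10^6 bytes, and evaluates T = L + M/B for several message sizes. The function names are chosen for illustration only.

```python
LATENCY_S = 5e-6        # L: 5 µs, the latency example used earlier
BANDWIDTH_BPS = 100e6   # B: 100 MB/s, the bandwidth example used earlier


def comm_time(message_bytes, latency_s=LATENCY_S, bandwidth_bps=BANDWIDTH_BPS):
    """Total communication time T = L + M / B, in seconds."""
    return latency_s + message_bytes / bandwidth_bps


def effective_bandwidth(message_bytes, latency_s=LATENCY_S, bandwidth_bps=BANDWIDTH_BPS):
    """Bandwidth actually achieved for one message: M / T, in bytes per second."""
    return message_bytes / comm_time(message_bytes, latency_s, bandwidth_bps)


# Message size at which the latency term equals the transfer term (L = M / B).
half_performance_bytes = LATENCY_S * BANDWIDTH_BPS   # about 500 bytes here

for size in (8, 500, 1_000_000, 100_000_000):
    t = comm_time(size)
    share = LATENCY_S / t
    print(f"{size:>11} B   T = {t:.6e} s   latency share = {share:.1%}")
```

With these numbers the crossover sits at about 500 bytes. Below that size most of T is latency (the flat region of the graph), and above it most of T is M/B (the sloped region). At exactly the crossover size the message achieves half of the peak bandwidth.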
7. Importance of LMS Performance Evaluation
• Determines scalability of cluster applications
• Helps optimize parallel algorithms
• Identifies communication bottlenecks
• Improves overall system throughput
8. Exam-Friendly Conclusion
Lightweight Messaging Systems provide efficient, low-overhead communication in cluster
computing.
Their performance is evaluated mainly using latency (for small messages) and bandwidth (for large
messages).
Optimizing both is essential for achieving high-performance parallel computing.
Traditional Communication Mechanisms for Clusters
Before the development of high-performance and lightweight messaging systems, traditional
communication mechanisms were used in cluster and distributed systems to enable interaction
between nodes. These mechanisms are simpler but introduce higher overhead and latency, making
them less suitable for modern HPC clusters.
1. Introduction
Traditional communication mechanisms rely on general-purpose operating system services and
network protocols.
They were originally designed for distributed systems, not specifically optimized for high-
performance cluster computing.
2. Types of Traditional Communication Mechanisms
2.1 Socket-Based Communication
Description
• Uses TCP or UDP sockets
• Based on client–server model
• Low-level communication interface
Working
• Sender sends data through a socket
• Receiver reads data from its socket
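A minimal sketch of these two steps, using a TCP socket on the loopback interface so that it runs on one machine. The receiver runs in a background thread purely to keep the example self-contained. On a cluster the two ends would be separate processes on different nodes.

```python
import socket
import threading


def receiver(server_sock, received):
    """Accept one connection and read until the sender closes it."""
    conn, _addr = server_sock.accept()
    with conn:
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:          # empty read: the sender closed its side
                break
            chunks.append(chunk)
        received.append(b"".join(chunks))


def send_over_tcp(message: bytes) -> bytes:
    """Send `message` through a TCP socket and return what the receiver read."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = []
    t = threading.Thread(target=receiver, args=(server, received))
    t.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(message)    # sender writes data into its socket
    t.join()
    server.close()
    return received[0]


print(send_over_tcp(b"partial result from node 3"))
```

Even this small example needs explicit connection setup, a receive loop, and end-of-stream handling, and every byte passes through the kernel's TCP/IP stack. That is the complexity and latency cost listed under Limitations.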
Advantages
• Flexible
• Widely supported
• Platform independent
Limitations
• High programming complexity
• Higher latency due to OS involvement
• Not efficient for fine-grained parallelism
Example:
Custom distributed applications using TCP/IP sockets.
2.2 Remote Procedure Call (RPC)
Description
• Allows a program to execute a procedure on a remote machine
• Hides communication details from the programmer
Working
• Client calls a function
• Function executes on remote server
• Result is returned
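These three steps can be tried with Python's standard-library XML-RPC modules. This is one RPC implementation chosen for convenience, not the mechanism any specific cluster uses. The `add` procedure and the loopback address are assumptions of the sketch, and the server runs in a background thread only so the example fits in one file.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def add(a, b):
    """The procedure that executes on the 'remote' side."""
    return a + b


def call_remote_add(a, b):
    """Start a local RPC server, call add() through it, and shut it down."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(add, "add")
    port = server.server_address[1]
    t = threading.Thread(target=server.serve_forever)
    t.start()
    try:
        with ServerProxy(f"http://127.0.0.1:{port}/") as proxy:
            # Looks like a local call, but the arguments are serialized,
            # sent over HTTP, executed by the server, and the result sent back.
            return proxy.add(a, b)
    finally:
        server.shutdown()
        server.server_close()
        t.join()


print(call_remote_add(2, 3))  # 5
```

The caller blocks inside `proxy.add` until the reply arrives, and the arguments are encoded as text on the way. Both effects are part of the overhead listed under Limitations.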
Advantages
• Easy to use
• Simplifies distributed programming
Limitations
• High overhead
• Typically blocking (synchronous) communication: the caller waits for the result
• Not suitable for large data transfers
Example:
Remote service invocation in distributed systems.
2.3 Message-Oriented Middleware (MOM)
Description
• Uses message queues for communication
• Asynchronous communication
Advantages
• Decouples sender and receiver
• Reliable message delivery
Limitations
• Higher latency
• Not optimized for HPC workloads
Example:
Queue-based distributed applications.
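The decoupling that a message queue provides can be illustrated in-process with `queue.Queue`. This is only an analogy for real message-oriented middleware, which runs as a separate broker service. The `None` sentinel and the upper-casing "work" are assumptions of the sketch.

```python
import queue
import threading


def consumer(q, handled):
    """Take messages off the queue until the end-of-stream sentinel arrives."""
    while True:
        msg = q.get()
        if msg is None:            # sentinel: no more messages
            break
        handled.append(msg.upper())


def run_demo(messages):
    q = queue.Queue()
    handled = []
    # The producer finishes enqueueing before the consumer even exists:
    # the queue holds the messages in the meantime (asynchronous communication).
    for m in messages:
        q.put(m)
    q.put(None)
    t = threading.Thread(target=consumer, args=(q, handled))
    t.start()
    t.join()
    return handled


print(run_demo(["job submitted", "job finished"]))  # ['JOB SUBMITTED', 'JOB FINISHED']
```

Sender and receiver never need to be active at the same moment. The price is that each message waits in the queue for some time, which is why this approach adds latency.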
2.4 Shared Memory Communication
Description
• Processes communicate via a common memory region
• Used in tightly coupled systems
Advantages
• Fast communication
• Low latency
Limitations
• Limited scalability
• Difficult synchronization
• Mostly restricted to single machine or SMP systems
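A common memory region can be demonstrated with `multiprocessing.shared_memory` from the Python standard library (Python 3.8 or later). To keep the sketch short, both handles are opened in the same process. Normally the second handle would be opened by a different process using the block's name. Synchronization is left out entirely, which is exactly the part that becomes difficult in practice.

```python
from multiprocessing import shared_memory


def shared_memory_roundtrip(payload: bytes) -> bytes:
    """Write through one handle to a shared block and read through another."""
    writer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        # A second handle attached by name, as another process would do.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            writer.buf[:len(payload)] = payload
            # No send or receive call: the data is simply already there.
            return bytes(reader.buf[:len(payload)])
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()            # free the block once nobody needs it


print(shared_memory_roundtrip(b"hello"))  # b'hello'
```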
2.5 File-Based Communication
Description
• Processes communicate by reading and writing files
• One process writes data; another reads it
Advantages
• Simple to implement
• Persistent data storage
Limitations
• Very slow
• High I/O overhead
• Not suitable for parallel applications
3. Comparison of Traditional Communication Mechanisms
Mechanism | Speed | Complexity | Scalability | Suitability for Clusters
Sockets | Medium | High | Moderate | Limited
RPC | Medium | Low | Low | Limited
MOM | Low | Medium | Moderate | Poor
Shared Memory | High | High | Low | Limited
File-based | Very Low | Low | Very Low | Poor
4. Why Traditional Mechanisms Are Inadequate for Clusters
• High latency
• Excessive OS and protocol overhead
• Poor scalability
• Inefficient for fine-grained parallel tasks
These limitations led to the development of MPI and lightweight messaging systems.
5. Exam-Friendly Conclusion
Traditional communication mechanisms such as sockets, RPC, message-oriented middleware, shared
memory, and file-based communication laid the foundation for cluster communication but are not
optimized for high-performance computing, prompting the evolution of specialized messaging
systems for modern clusters.
Lightweight Communication Mechanisms
In cluster computing, lightweight communication mechanisms are designed to provide fast, low-
latency, and low-overhead communication between nodes. They overcome the limitations of
traditional mechanisms (sockets, RPC) and are essential for high-performance parallel applications.
1. Introduction
Lightweight communication mechanisms:
• Minimize operating system involvement
• Reduce protocol overhead
• Enable direct, efficient data transfer between processes
Their main goal is to maximize communication performance in clusters and HPC systems.
2. Key Characteristics
• Low latency
• High bandwidth
• User-level communication
• Zero or minimal data copying
• Scalable for large clusters
3. Types of Lightweight Communication Mechanisms
3.1 Message Passing Interface (MPI – Lightweight Usage)
Although MPI is a standard, its implementations use lightweight mechanisms internally.
Features
• User-level messaging
• Efficient buffering
• Optimized collective operations
Example
• Parallel matrix multiplication
• Weather and climate simulations
3.2 User-Level Communication (ULC)
Description
• Communication handled at user space, bypassing kernel where possible
Advantages
• Reduces context switching
• Faster message delivery
Example
• Direct memory access between processes
3.3 Zero-Copy Communication
Description
• Data transferred directly from sender memory to receiver memory
• Avoids intermediate buffering
Advantages
• Reduces CPU overhead
• Improves bandwidth
Example
• Large scientific data transfers between cluster nodes
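The difference between copying and not copying can be seen inside a single Python process. The sketch below is only an analogy: real zero-copy transfers rely on network hardware and OS support, whereas here a `memoryview` simply plays the role of a buffer handed over without duplication. The buffer contents are made up.

```python
def copying_slice(buf: bytearray, start: int, end: int) -> bytes:
    """Conventional path: allocate a new buffer and copy the bytes into it."""
    return bytes(buf[start:end])


def zero_copy_slice(buf: bytearray, start: int, end: int) -> memoryview:
    """Zero-copy path: a window onto the same memory, nothing is duplicated."""
    return memoryview(buf)[start:end]


data = bytearray(b"header|payload")
copied = copying_slice(data, 7, 14)
view = zero_copy_slice(data, 7, 14)

data[7:14] = b"PAYLOAD"   # the original buffer changes in place

print(copied)        # b'payload': an independent copy, unaffected by the change
print(bytes(view))   # b'PAYLOAD': the view shares memory with the original
```

The view costs the same whether it covers 7 bytes or 7 gigabytes, while the copy costs CPU time and memory in proportion to the data size. That is the source of the CPU and bandwidth benefits listed above.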
3.4 Active Messages
Description
• Each message carries its data plus a reference to a handler (the code to run on arrival)
• Receiver executes the handler immediately on message arrival
Advantages
• Low latency
• Overlaps communication and computation
Example
• Fine-grained parallel algorithms
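The idea can be sketched as a table of handlers plus messages that name the handler to run. This is a single-process illustration with hypothetical handler names, not a real active-message layer. In a real system the handler reference travels in the message header and the handler runs as soon as the network delivers the message.

```python
HANDLERS = {}


def handler(name):
    """Register a function under a handler id that messages can refer to."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


class Node:
    def __init__(self):
        self.counters = {}
        self.log = []

    def deliver(self, message):
        """On arrival, run the named handler at once; nothing is queued for later."""
        handler_id, payload = message
        HANDLERS[handler_id](self, payload)


@handler("increment")
def on_increment(node, payload):
    key, amount = payload
    node.counters[key] = node.counters.get(key, 0) + amount


@handler("log")
def on_log(node, payload):
    node.log.append(payload)


node = Node()
for msg in [("increment", ("hits", 2)), ("log", "hello"), ("increment", ("hits", 3))]:
    node.deliver(msg)

print(node.counters)  # {'hits': 5}
print(node.log)       # ['hello']
```

The receiver needs no matching receive call and no intermediate buffering, because the message itself says what to do with the data. This is what keeps per-message overhead low for fine-grained algorithms.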
3.5 Lightweight Protocols
Description
• Simplified communication protocols compared to TCP/IP
Advantages
• Reduced protocol stack overhead
• Faster setup and teardown
Example
• Custom cluster interconnect protocols
4. Performance Evaluation Metrics
Lightweight communication mechanisms are evaluated mainly using:
Metric | Description
Latency | Time taken to send a message
Bandwidth | Data transferred per unit time
Overhead | CPU time spent on communication
5. Comparison: Traditional vs Lightweight Mechanisms
Aspect | Traditional Mechanisms | Lightweight Mechanisms
OS involvement | High | Minimal
Latency | High | Very low
Bandwidth | Moderate | High
Suitability for HPC | Poor | Excellent
6. Advantages
• Faster inter-node communication
• Better scalability
• Improved application performance
• Efficient use of network hardware
7. Limitations
• More complex to implement
• Hardware dependent in some cases
• Debugging can be difficult
8. Exam-Friendly Conclusion
Lightweight communication mechanisms provide efficient, low-latency, and high-bandwidth
communication in cluster computing by minimizing OS and protocol overhead. They form the
foundation of modern HPC communication systems and enable scalable parallel applications.