0% found this document useful (0 votes)
11 views49 pages

1 Introduction

Uploaded by

fa23-bcs-180
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views49 pages

1 Introduction

Uploaded by

fa23-bcs-180
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

CSC 334 – Parallel and Distributed Computing

Instructor: Dr. M. Hasan Jamal


Lecture# 01: Introduction

1
Motivating Parallelism
• Traditionally, software has been written for serial computation:
• To be run on a single computer having a single Central Processing Unit (CPU);
• A problem is broken into a discrete series of instructions that are executed sequentially.
• Only one instruction may execute at any moment in time.
• Performance is achieved by executing instructions concurrently through multitasking

• Uniprocessors are fast but:


• Some problems require too much computation
• Some problems use too much data
• Some problems have too many parameters to explore

• “For example:
• Weather simulations
• Gaming 2
• Web servers
• Code breaking
Motivating Parallelism
• In the simplest sense, parallel computing is the simultaneous use of multiple
compute resources to solve a computational problem
• To be run using multiple CPUs
• A problem is broken into discrete parts that can be solved in parallel
• Each part is further broken down to a series of instructions

• Instructions from each part execute simultaneously on different CPUs

3
Motivating Parallelism
• The role of parallelism in accelerating computing speeds has been recognized
for several decades

• Its role in providing multiplicity of datapaths and increased access to storage


elements has been significant in commercial applications.

• The scalable performance and lower cost of parallel platforms is reflected in the
wide variety of applications

4
Motivating Parallelism
• Developing parallel hardware and software has traditionally been time and effort
intensive.

• If one is to view this in the context of rapidly improving uniprocessor speeds,


one is tempted to question the need for parallel computing.

• Latest trends in hardware design indicate that uniprocessors may not be able to
sustain the rate of realizable performance increments in the future.

• This is the result of several fundamental physical and computational limitations.

• The emergence of standardized parallel programming environments, libraries,


and hardware have significantly reduced time to develop (parallel) solutions.
5
The Computational Power Argument
• Moore’s Law states (1965):

• Number of transistors incorporated on a microchip doubles approximately every


two years, while the cost of computers is halved.

6
The Computational Power Argument
• If one is to buy into Moore’s law, the question still remains – how does one
translate transistors into useful OPS (operations per second)?

• The logical recourse is to rely on parallelism, both implicit and explicit.

• Most serial (or seemingly serial) processors rely extensively on implicit parallelism.

• The focus of this class, for the most part, is on explicit parallelism.

7
The Computational Power Argument
• Why doubling the transistors does not double the speed?
• Increase in number of transistor per processor is due to multi-core CPUs
• It means, to follow Moore’s law, companies had to introduce ultra large-scale
integrations and multi-core processing era

• Will Moore’s law hold forever?


• Adding multiple cores on a single chip causes heat issues
• Furthermore, increasing the number of cores, may not be able to increase speeds
due to inter-process interactions
• Moreover, transistors would eventually reach the limit of miniaturization at atomic level

8
The Computational Power Argument
• So, we must look for efficient
parallel software solutions to fulfill
our future computational needs

• As stated earlier, number of cores


on a single chip also have some
restrictions

• Solution(s)?
• Need to find more scalable
distributed and hybrid solutions

9
The memory/Disk Speed Argument
• While clock rates of high-end processors have increased at roughly 40% per
year over the past decade, DRAM access times have only improved at the
rate of roughly 10% per year over this interval

• This mismatch in speeds causes significant performance bottlenecks

• Parallel platforms provide increased bandwidth to the memory systems

• Parallel platforms also provide higher aggregate caches

• Some of the fastest growing applications of parallel computing utilize not the
raw computational speeds, rather their ability to pump data to memory and
disk faster
10
The Data Communication Argument
• As the network evolves, the vision of the Internet as one large computing
platform has emerged

• In many applications (typically databases and data mining) the volume of data
is such that they cannot be moved

• Any analyses on this data must be performed over the network using parallel
techniques

11
Computing vs. Systems
Distributed Systems
• A collection of autonomous computers, connected through a network and
distribution middleware
• This enables computers to coordinate their activities and to share the resources of
the system
• The system is usually perceived as a single, integrated computing facility
• Mostly connected with the hardware-based accelerations

Distributed Computing
• A specific use of distributed systems, to split a large and complex processing
into subparts and execute them in parallel, to increase the productivity
• Computing mainly concerned with software-based accelerations (i.e., designing and
12
implementing algorithms)
Parallel and Distributed Computing
Parallel (shared-memory) Computing
• The term is usually used for developing concurrent solutions for following two
types of the systems:
• Multi-core Architecture
• Many-core Architectures (i.e., GPUs)

Distributed Computing
• This type of computing is mainly concerned with developing algorithms for the
distributed cluster systems
• Here, distributed means a geographical distance between the computers
without any shared-memory
13
Scope of Parallel Computing Applications
• Parallelism finds applications in very diverse application domains for different
motivating reasons

• These range from improved application performance to cost considerations

14
Scientific Applications
• Functional and structural characterization of genes and proteins

• Applications in astrophysics have explored the evolution of galaxies,


thermonuclear processes, and the analysis of extremely large datasets from
telescope

• Advances in computational physics and chemistry have explored new materials,


understanding of chemical pathways, and more efficient processes

• Bioinformatics and astrophysics also present some of the most challenging


problems with respect to analyzing extremely large datasets

• Weather modeling, mineral prospecting, flood prediction etc., are other 15


important applications
Commercial Applications
• Some of the largest parallel computers power Wall Street

• Data mining analysis for optimizing business and marketing decisions

• Large scale servers (mail and web servers) are often implemented using
parallel platforms

• Applications such as information retrieval and search are typically powered by


large clusters

16
Applications in Computer Systems
• Network intrusion detection: A large amount of data needs to be analyzed and
processed

• Cryptography (the art of writing or solving codes) employs parallel


infrastructures and algorithms to solve complex codes

• Graphic processing

• A modern automobile consists of tens of processors communicating to perform


complex tasks for optimizing handling and performance

17
Von Neumann Architecture
• Named after the Hungarian mathematician John von Neumann who first
authored the general requirements for an electronic computer in his 1945 papers.

• Also known as "stored-program computer" – both program instructions and data


are kept in electronic memory. Differs from earlier computers which were
programmed through "hard wiring".

• Since then, virtually all computers have followed this basic design.

• Comprised of four main components:


• Memory
• Control Unit
• Arithmetic Logic Unit
18
• Input/Output
Von Neumann Architecture
• Basic Design
• RD/WR, random access memory is used to store both program instructions and data
• Program instructions are coded data which tell the computer to do something
• Data is simply information to be used by the program
• Control unit fetches instructions/data from memory, decodes the instructions and then
sequentially coordinates operations to accomplish the programmed task.
• Arithmetic Unit performs basic arithmetic operations
• Input/Output is the interface to the human operator

• Parallel computers still follow this basic design, just multiplied in units. The
basic, fundamental architecture remains the same.
19
Flynn's Classical Taxonomy
• There are different ways to classify parallel computers. One of the more widely
used classifications, in use since 1966, is called Flynn's Taxonomy.

• Flynn's taxonomy distinguishes multi-processor computer architectures


according to how they can be classified along the two independent dimensions
of Instruction and Data. Each of these dimensions can have only one of two
possible states: Single or Multiple.

• The Flynn matrix below defines the 4 possible classifications:

SISD SIMD
Single Instruction, Single Data Single Instruction, Multiple Data
MISD MIMD
Multiple Instruction, Single Data Multiple Instruction, Multiple Data 20
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer

• Single instruction: only one instruction stream is


being acted on by the CPU during any one clock cycle

• Single data: only one data stream is being used as


input during any one clock cycle

• Deterministic execution

• This is the oldest and until recently, the most prevalent


form of computer

• Examples: most PCs, single CPU workstations and 21


mainframes
Single Instruction, Multiple Data (SIMD)
• A type of parallel computer

• Single instruction: All processing units execute the same instruction at any given
clock cycle

• Multiple data: Each processing unit can operate on a different data element

• This type of machine typically has an instruction dispatcher, a very high-bandwidth


internal network, and a very large array of very small-capacity instruction units.

• Best suited for specialized problems characterized by a high degree of regularity, such
as graphics/image processing.

• Synchronous (lockstep) and deterministic execution


22
• Two varieties: Processor Arrays and Vector Pipelines
Single Instruction, Multiple Data (SIMD)
• Examples:
• Processor Arrays: Thinking Machine CM-2, MasPar MP-1 & MP-2
• Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820

• Most modern computers, particularly those with graphics processor units (GPUs)
employ SIMD instructions and execution units. 23
Multiple Instruction, Single Data (MISD)
• A type of a parallel computer
• Multiple Instruction: Each processing unit operates on the data independently.
via independent instruction streams.
• Single Data: A single data stream is fed into multiple processing units.
• Few (if any) actual examples of this class of parallel computer have ever existed.
• Some conceivable uses might be:
• multiple frequency filters operating on a single signal stream
• multiple cryptography algorithms attempting to crack a single coded message.

24
Multiple Instruction, Multiple Data (MIMD)
• Currently, the most common type of parallel
computer. Most modern computers fall into this
category.

• Multiple Instruction: every processor may be


executing a different instruction stream

• Multiple Data: every processor may be working


with a different data stream

• Execution can be synchronous or


asynchronous, deterministic or non-deterministic

• Examples: most current supercomputers, 25


networked parallel computer "grids" and multi-
processor SMP computers, multi-core PCs.
General Parallel Computing Terminology
• CPU: Modern CPUs comprise one or more cores – a distinct execution unit with
its own instruction stream. Cores with a CPU may be organized into one or
more sockets - each socket with its own distinct memory . When a CPU
consists of two or more sockets, usually hardware infrastructure supports
memory sharing across sockets.

• Node: A standalone "computer in a box." Usually comprised of multiple


CPUs/processors/cores, memory, network interfaces, etc. Nodes are networked
together to comprise a supercomputer.

• Task: A logically discrete section of computational work. A task is typically a


program or program-like set of instructions that is executed by a processor. A
parallel program consists of multiple tasks running on multiple processors.

• Pipelining: Breaking a task into steps performed by different processor units, 26


with inputs streaming through, much like an assembly line; a type of parallel
computing.
General Parallel Computing Terminology
• Serial Execution: Execution of a program sequentially, one statement at a
time. In the simplest sense, this is what happens on a one processor machine.
However, virtually all parallel tasks will have sections of a parallel program that
must be executed serially.

• Parallel Execution: Execution of a program by more than one task, with each
task being able to execute the same or different statement at the same
moment in time.

• Shared Memory: From a strictly hardware point of view, describes a computer


architecture where all processors have direct (usually bus based) access to
common physical memory. In a programming sense, it describes a model
where parallel tasks all have the same "picture" of memory and can directly
address and access the same logical memory locations regardless of where 27
the physical memory actually exists.
General Parallel Computing Terminology
• Distributed Memory: In hardware, refers to network-based memory access for
physical memory that is not common. As a programming model, tasks can only
logically "see" local machine memory and must use communications to access
memory on other machines where other tasks are executing.

• Communications: Parallel tasks typically need to exchange data. There are


several ways this can be accomplished, such as through a shared memory bus
or over a network, however the actual event of data exchange is commonly
referred to as communications regardless of the method employed.

• Synchronization: The coordination of parallel tasks in real time, very often


associated with communications. Often implemented by establishing a
synchronization point within an application where a task may not proceed further
until another task(s) reaches the same or logically equivalent point. 28
Synchronization usually involves waiting by at least one task and can therefore
cause a parallel application's wall clock execution time to increase.
General Parallel Computing Terminology
• Computational Granularity: In parallel computing, granularity is a qualitative
measure of the ratio of computation to communication.
• Coarse: relatively large amounts of computational work are done between
communication events
• Fine: relatively small amounts of computational work are done between
communication events

• Observed Speedup: One of the simplest and most widely used indicators for
a parallel program's performance. Observed speedup of a code which has
been parallelized, defined as:
wall−clock time of serial execution
wall−clock time of parallel execution
29
General Parallel Computing Terminology
• Parallel Overhead: Required execution time that is unique to parallel tasks, as
opposed to that for doing useful work. Parallel overhead can include factors such as:
• Task start-up time
• Synchronizations
• Data communications
• Software overhead imposed by parallel languages, libraries, operating system, etc.
• Task termination time

• Massively Parallel: Refers to parallel system hardware, having many


processing elements. The meaning of "many" keeps increasing, but currently, the
largest parallel computers are comprised of processing elements numbering in
the hundreds of thousands to millions.
30
General Parallel Computing Terminology
• Embarrassingly (IDEALLY) Parallel: Solving many similar, but independent tasks
simultaneously; little to no need for coordination between the tasks.

• Scalability: Refers to a parallel system's (hardware and/or software) ability to


demonstrate a proportionate increase in parallel speedup with the addition of
more resources. Factors that contribute to scalability include:
• Hardware – particularly memory-CPU bandwidths and network communication
• Application algorithm
• Parallel overhead related
• Characteristics of your specific application and coding

31
Design Goals of Parallel and Distributed Computing
• Performance Improvement
• Speedup: Execute tasks faster by dividing the workload among multiple processors.
• High Throughput: Perform many tasks concurrently.
• Low Latency: Reduce time to complete individual tasks through parallel execution.

• Scalability
• Horizontal Scalability: Add more nodes/processors without significant reconfiguration.
• Vertical Scalability: Increase the power of existing nodes.
• Load Balancing: Efficiently distribute work across available resources to prevent
bottlenecks.

32
Design Goals of Parallel and Distributed Computing
• Reliability and Fault Tolerance
• Redundancy: Ensure backup systems are available in case of failure.
• Replication: Duplicate critical data and services across nodes.
• Recovery Mechanism: Enable quick recovery after failures through checkpointing
and rollback mechanisms.

• Resource Sharing and Utilization


• Efficient Resource Utilization: Optimize the use of computational resources like
CPU, memory, and network bandwidth.
• Multi-user Support: Allow multiple users or tasks to share and access resources
concurrently
33
Design Goals of Parallel and Distributed Computing
• Transparency
• Access Transparency: Users do not need to know the physical location of resources.
• Location Transparency: Resource location changes are hidden from users.
• Replication Transparency: Users are unaware of the presence of multiple copies.
• Concurrency Transparency: Multiple processes execute simultaneously without
conflict.
• Failure Transparency: Failures are masked from users to ensure uninterrupted
service.

• Modularity and Flexibility


• Modularity: Design systems as independent components that can be developed,
tested, and maintained separately.
• Flexibility: Easily upgrade, modify, or reconfigure the system without major 34
overhauls.
Design Goals of Parallel and Distributed Computing
• Heterogeneity Support:
• Support for diverse hardware, operating systems, and network protocols to ensure
seamless integration and communication.

• Cost Effectiveness
• Utilize cost-effective commodity hardware.
• Reduce operational costs through optimized parallel and distributed processing.

• Security and Privacy


• Implement robust security measures like encryption, authentication, and secure
communication to protect data in distributed environments.

35
Types of Parallel Systems
• Shared Memory Systems:
• Multiple processors share a single address space (RAM)
• Communication happen through shared variables
• Examples:
• symmetric multiprocessing (SMP)
• NUMA (non-uniform memory access)
• Pros: Low communication overhead, easy programming
• Cons: Scalability issues, memory access bottlenecks

• Distributed Memory Systems:


• Each processor has its own private memory
• Processes communicate using message passing (MPI, PVM)
• Examples:
• Cluster Computing
• Massively Parallel Processors (MPP) 36
• Pros: High scalability, cost-effective
• Cons: Complex programming, data consistency challenges
Types of Parallel Systems
• Hybrid Systems (Distributed Shared Memory):
• Combines shared memory and distributed memory models
• Allows processes to communicate either through shared memory or message passing
• Examples:
• Intel Xeon Phi, Cray supercomputers
• NUMA (non-uniform memory access)

• SIMD vs. MIMD Systems:


• SIMD (Single Instruction, Multiple Data) – One instruction operates on multiple data
streams (e.g., GPUs).
• MIMD (Multiple Instruction, Multiple Data) – Different processors execute different
instructions independently.

37
Types of Distributed Systems
• Cluster Computing
• Multiple computers (nodes) work together as a single logical unit.
• Typically uses high-speed local networks (LAN).
• Example: Beowulf Clusters, Apache Hadoop Clusters.

• Grid Computing
• Uses geographically distributed computers to perform large-scale computations.
• Typically, loosely coupled compared to clusters.

• Cloud Computing
• Uses virtualized distributed resources over the Internet.
• Provides services like IaaS, PaaS, SaaS.
• Example: AWS, Google Cloud, Microsoft Azure.
38
Types of Distributed Systems
• Peer-to-Peer (P2P) Systems
• No central authority; all nodes are equal and communicate directly.
• Used in file sharing, cryptocurrencies, decentralized networks.
• Example: BitTorrent, Ethereum blockchain.

• Internet of Things (IoT) Systems


• Distributed network of smart devices that process and share data.
• Uses Edge & Fog Computing to process data closer to the source.
• Example: Smart Cities, Industrial IoT.

39
Comparison of Parallel vs. Distributed Systems

Feature Parallel Systems Distributed Systems


Memory Sharing Shared or Distributed Completely Distributed
Communication Fast, Low Latency Uses Networks (LAN/WAN)
Scalability Limited Highly Scalable
Fault Tolerance Lower Higher
Use Cases HPC, Scientific Simulations Cloud, IoT, Big Data

40
Enabling Technologies for PDC
• Hardware:
• Multicore Processors
• CPUs with multiple cores for parallel execution (e.g., Intel Xeon, AMD Ryzen).

• GPUs (Graphics Processing Units)


• Parallel execution of thousands of threads (e.g., NVIDIA CUDA, AMD ROCm).

• FPGAs (Field-Programmable Gate Arrays)


• Hardware acceleration for parallel tasks (e.g., Xilinx, Intel FPGA).

• High-Speed Networks
• InfiniBand, Ethernet, RDMA for fast inter-node communication.
41
• Storage Technologies
• HDFS, NVMe, Object Storage for distributed data access.
Enabling Technologies for PDC
• Middleware and Communication:
• Message Passing Interface (MPI)
• Standard for distributed computing communication (e.g., OpenMPI, MPICH).

• Remote Procedure Call (RPC)


• Enables function calls between distributed systems (e.g., gRPC, Thrift).

• Distributed File Systems


• HDFS, Google File System (GFS), Ceph.

• Virtualization & Containerization


• Vmware, KVM, Docker, Kubernetes
42
Enabling Technologies for PDC
• Cloud & Edge Computing:
• Virtual Machines (VMs) & Hypervisors
• Xen, KVM, Hyper-V for cloud-based distributed computing.

• Serverless Computing
• AWS Lambda, Google Cloud Functions, Azure Functions

• Edge Computing Platforms


• NVIDIA Jetson, Intel OpenVINO, AWS Greengrass

43
Platforms for PDC
• High Performance Computing (HPC):
• Supercomputers – IBM Summit, Fugaku, Cray XC40.
• HPC Clusters – Linux-based Beowulf Clusters, Slurm Workload Manager.

• Cloud Computing:
• Amazon Web Services (AWS) – EC2, S3, Lambda, EMR.
• Microsoft Azure – Azure Virtual Machines, Kubernetes Service.
• Google Cloud Platform (GCP) – Compute Engine, Kubernetes, BigQuery.

• Big Data Analytics:


• Apache Hadoop – Distributed storage and processing (HDFS + MapReduce).
• Apache Spark – Fast in-memory distributed processing.
44
• Google BigQuery – Distributed cloud data warehouse.
Platforms for PDC
• Blockchain & Distributed Ledger:
• Ethereum – Smart contracts and decentralized applications.
• Hyperledger Fabric – Enterprise blockchain solutions.

• IoT and Edge Computing:


• AWS IoT, Azure IoT Hub – Cloud-based IoT data processing.
• Google TensorFlow Lite – AI inference on edge devices.

45
Software Environments for PDC
• Programming Models & Languages:
• Parallel Programming
• OpenMP (C, C++, Fortran) – Shared memory parallelism.
• MPI (C, C++, Python) – Message passing for distributed systems.
• CUDA (C, Python) – GPU programming for high-performance tasks.
• Distributed Programming
• Hadoop MapReduce (Java, Python) – Large-scale data processing.
• Apache Spark (Scala, Python) – Distributed in-memory computing.
• TensorFlow, PyTorch (Python) – Distributed deep learning.
• General-Purpose Languages
• Python (Dask, Ray) – Distributed computing frameworks.
• Rust – Safe and efficient parallel programming.
46
• Java, Scala – Used in cloud-based distributed systems.
Software Environments for PDC
• Workflow Management & Orchestration:
• Apache Airflow – Distributed task scheduling.
• Kubernetes – Container orchestration for cloud and edge.
• Apache Kafka – Distributed streaming and event processing.

• Distributed Databases:
• NoSQL Databases: Apache Cassandra, MongoDB, Google Bigtable.
• NewSQL Databases: Google Spanner, CockroachDB.

47
Summary: Technologies, Platforms, and Software

Category Examples
Hardware Multicore CPUs, GPUs (NVIDIA, AMD), FPGAs, InfiniBand
Middleware MPI, RPC, gRPC, Distributed File Systems (HDFS, Ceph)
Cloud Platform AWS, Azure, Google Cloud, OpenStack
Big Data Apache Hadoop, Spark, Google BigQuery
Edge Computing NVIDIA Jetson, AWS Greengrass, OpenVINO
Programming Models OpenMP, MPI, CUDA, Dask, Ray
Workflow Tools Apache Airflow, Kubernetes, Apache Kafka
Databases Apache Cassandra, Google Spanner, MongoDB
48
Limitations of Parallel Computing
It requires designing the proper communication and synchronization
mechanisms between the processes and sub-tasks

• Exploring the proper parallelism from a problem is a hectic process

• The program must have low coupling and high cohesion. But it’s difficult to
create such programs

• It needs relatively more technical skills to code a parallel program

49

You might also like