0% found this document useful (0 votes)

5 views57 pages

Cours Parallel Programming

The document discusses the fundamentals of parallel programming, contrasting it with sequential programming and highlighting the importance of utilizing multiple cores in modern processors due to physical limitations like the power wall. It covers key concepts such as the von Neumann architecture, multitasking, threading, caching, virtual memory, and various parallel hardware architectures, including SIMD and MIMD. Additionally, it emphasizes the role of software in parallel programming, outlining steps for writing efficient parallel programs and the challenges of cache coherence.

Uploaded by

chloe.mercier

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

5 views57 pages

Cours Parallel Programming

Uploaded by

chloe.mercier

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

INF-3201 - Parallel Programming

Sequential programming = the instructions are executed one by one and an instruction is executed at any point in
time

Parallel programming = the problem is broken into smaller ones and the instructions are executed, for each part,
in parallel

In theory, over the years, the density of transistors in a processor increases. Like that, the power by transistors
decreases. So, we can imagine that the core frequency can increase without problem and the power density will
stay stable. But in practice, the physical limits are reached and the heat dissipation is insufficient : we hit the power
wall !

processor = Central Processing Unit (CPU), it’s a physical component

core = computing unit inside the processor

Scaling beyond the power wall, the chip (puce) density is increased : the number of cores per chip increases and
the chips performance also. Moreover, the power management is better. We need parallel programmation to use
the different cores of one processor.

Nowadays, most architectures are parallel because there is no more single core CPU but mainly clusters and
massively parallel processing.

1
Parallel Hardware and Parallel Software

Goal : to present the basics of parallel hardware and software, starting from the von Neumann model, to arrive at the
design and execution of parallel programs.

Some background

About the serial hardware and software, a computer runs one program at a time, with input/output through dedicated
programs. A program reads data, processes it, and produces output, all while managing a single stream of
instructions.

The von Neumann architecture

Main memory = this is a collection of locations, each of which is capable of

storing both instructions and data. Every location consists of an address which
is used to access the location, and the contents of the location.

Central processing unit (CPU) = is divided into two parts : the control unit
which is responsible for deciding which instruction in a program should be
executed (the boss) and the arithmetic and logic unit (ALU) which is
responsible for executing the actual instructions (the worker).

Register = very fast storage, part of the CPU.

Program counter = stores address of the next instruction to be executed.

Bus = wires and hardware that connects the CPU and memory.

But the von Neumann architecture has a bottleneck : the bus limits the transfer speed between the CPU and
memory.

Process, multitasking and threading

An operating system “process” is an instance of a computer program that is being executed. The components of
a process are :
- the executable machine language program
- a block of memory
- descriptors of resources the OS has allocated to the process
- security information
- information about the state of the process

The multitasking gives the illusion that a single processor system is running multiple programs simultaneously. Each
process takes turns running (time slice). After its time is up, it waits until it has a turn again (blocks).

Concerning the threading, threads are contained within processes. They allow programmers to divide their programs
into (more or less) independent tasks. The hope is that when one thread blocks because it is waiting on a resource,
another will have work to do and can run.
1
A process and two threads :

Modifications to the von Neumann model

Basics of caching

A collection of memory locations that can be accessed in less

time than some other memory locations.
A CPU cache is typically located on the same chip, or one that
can be accessed much faster than ordinary memory.

The principle of locality is when accessing one location is

followed by an access of a nearby location.
Spatial locality = accessing a nearby location
Temporal locality = accessing in the near future

Levels of cache →

Cache hit = when the processor looks for data and it's already in the cache, it's called a hit. Access is therefore fast.
Cache miss = if the data isn't in the cache, it's a miss. The processor then has to retrieve it from main memory
(RAM), which is much slower.

But there are some issues with cache. When a CPU writes data to cache, the value in cache may be inconsistent
with the value in main memory.
Write-through caches handle this by updating the data in main memory at the time it is written to cache.
Write-back caches mark data in the cache as dirty, when the cache line is replaced by a new cache line from memory,
the dirty line is written to memory.

Cache mappings
Full associative = a new line can be placed at any location in the cache.
Direct mapped = each cache line has a unique location in the cache to which it will be assigned.
n-way set associative = each cache line can be placed in one of n different locations in the cache. When more than
one line in memory can be mapped to several different locations in cache we also need to be able to decide which
line should be replaced or evicted.

Virtual memory

If we run a very large program or a program that accesses very large

data sets, all of the instructions and data may not fit into main memory.
Virtual memory functions are used as a cache for secondary storage.
It exploits the principle of spatial and temporal locality. It only keeps
the active parts of running programs in main memory.

Swap space = those parts that are idle are kept in a block of secondary
storage

2
Pages = blocks of data and instructions. Usually these are relatively large. Most systems have a fixed page size that
currently ranges from 4 to 16 kilobytes.

Virtual page numbers

When a program is compiled, its pages are assigned virtual page numbers.
When the program is run, a table is created that maps the virtual page numbers to physical addresses.
A page table is used to translate the virtual address into a physical address.

Virtual Address divided into Virtual Page Number and Byte Offset

Translation-lookaside buffer (TLB), using a page table, has the potential to significantly increase each program’s
overall run-time. It’s a special address translation cache in the processor. It caches a small number of entries (typically
16-512) from the page table in very fast memory.
Page fault = attempting to access a valid physical address for a page in the page table but the page is only stored
on disk

Instruction Level Parallelism (ILP)

The ILP attempts to improve processor performance by having multiple processor components or functional units
simultaneously executing instructions.

Pipelining = functional units are arranged in stages.

for example :
float x[1000], y[1000], z[1000];
…
for (i = 0; i < 1000; i++)
z[i] = x[i] + y[i];
Assume each operation takes one nanosecond. This for loop takes about 7000 nanoseconds.
Divide the floating point adder into 7 separate pieces of hardware or functional units. The first unit fetches two
operands, the second unit compares exponents, etc. Output of one functional unit is input to the next. One floating
point addition still takes 7 nanoseconds but 1000 floating point additions now takes 1006 nanoseconds !

Multiple issue = multiple instructions can be simultaneously initiated.

Multiple issue processors replicate functional units and try to simultaneously execute different instructions in a
program.
static multiple issue = functional units are scheduled at compile time
dynamic multiple issue = functional units are scheduled at run-time (superscalar)

Speculation
In order to make use of multiple issue, the system must find instructions that can be executed simultaneously. In
speculation, the compiler or the processor makes a guess about an instruction, and then executes the instruction on
the basis of the guess.

Hardware multithreading
There aren’t always good opportunities for simultaneous execution of different threads. Hardware multithreading
provides a means for systems to continue doing useful work when the task being currently executed has stalled.

Fine-grained = the processor switches between threads after each instruction, skipping threads that are stalled.
Pros (advantages) : potential to avoid wasted machine time due to stalls
Cons (disadvantages) : a thread that’s ready to execute a long sequence of instructions may have to wait to execute
every instruction

3
Coarse-grained = only switches threads that are stalled waiting for a time - consuming operation to complete.
Pros : switching threads doesn’t need to be nearly instantaneous
Cons : the processor can be idled on shorter stalls, and thread switching will also cause delays

Simultaneous multithreading (SMT) = a variation on fine-grained multithreading, allows multiple threads to make
use of the multiple functional units.

Parallel hardware

Flynn's Taxonomy is a classification of processor architectures. It describes how computer systems process
instructions and data, depending on whether they manipulate one or several instructions simultaneously. It has 4
categories.

SIMD → Single instruction stream Multiple data stream & MISD → Multiple instruction stream Single data
stream = Not covered

SISD → Single instruction stream Single data stream

It is the classic von Neumann.

Parallelism achieved by dividing data among the processors. It applies the same instruction to multiple data items,
called data parallelism.

All ALUs are required to execute the same instruction, or remain idle. In classic design, they must also operate
synchronously. The ALUs have no instruction storage. It’s efficient for large data parallel problems, but not other
types of more complex parallel problems.

Vector processors operate on arrays or vectors of data while conventional CPU’s operate on individual data
elements or scalars. The vector registers are capable of storing a vector of operands and operating simultaneously
on their contents.
Vectorized and pipelined functional units apply the same operation to each element in the vector (or pairs of
elements).
Vector instructions operate on vectors rather than scalars.
Next, the idea is to use interleaved memory, multiple “banks” of memory, which can be accessed more or less
independently. Distributing elements of a vector across multiple banks, so it reduces or eliminates delay in
loading/storing successive elements.
The program accesses elements of a vector located at fixed intervals, it is strided memory access and hardware
scatter/gather.
Pros : fast, easy to use, vectorizing compilers are good at identifying code to exploit, compilers also can provide
information about code that cannot be vectorized (helps the programmer re-evaluate code), high memory bandwidth,
uses every item in a cache line
Cons : they don’t handle irregular data structures as well as other parallel architectures, a very finite limit to their
ability to handle ever larger problems (scalability)

The Graphics Processing Units (GPU) is used for graphics rendering via shaders (parallel functions).
Real time graphics application programming interfaces or API’s use points, lines, and triangles to internally represent
the surface of an object. A graphics processing pipeline converts the internal representation into an array of pixels
that can be sent to a computer screen. Several stages of this pipeline (called shader functions) are programmable,
typically just a few lines of C code. Shader functions are also implicitly parallel, since they can be applied to multiple
elements in the graphics stream. GPU’s can often optimize performance by using SIMD parallelism. The current
generation of GPU’s use SIMD parallelism, although they are not pure SIMD systems.

MIMD → Multiple instruction stream Multiple data stream

The MIMD supports multiple simultaneous instruction streams operating on multiple data streams. It typically consists
of a collection of fully independent processing units or cores, each of which has its own control unit and its own ALU.

Shared Memory System

4
A collection of autonomous processors is connected to a memory system via an interconnection network. Each
processor can access each memory location. The processors usually communicate implicitly by accessing shared
data structures. Most widely available shared memory systems use one or more multicore processors (multiple CPU’s
or cores on a single chip).

UMA multicore system NUMA multicore system

Time to access all the memory locations will be the A memory location a core is directly connected to can
same for all the cores. be accessed faster than a memory location that must
be accessed through another chip.

Clusters are the most popular distributed memory system. They are a collection of commodity systems connected
by a commodity interconnection network. Nodes of a cluster are individual computations units joined by a
communication network.

Interconnection networks affect performance of both distributed and shared memory systems.
Two categories :
● Shared memory interconnects
- Bus interconnect : a collection of parallel communication wires together with some hardware that
controls access to the bus. Communication wires are shared by the devices that are connected to it.
As the number of devices connected to the bus increases, contention for use of the bus increases,
and performance decreases.
- Switched interconnect : uses switches to control the routing of data among the connected devices.
Crossbar allows simultaneous communication among different devices. It is faster than buses but
the cost of the switches and links is relatively high.

● Distributed memory interconnects

- Direct interconnect : each switch is directly connected to a processor memory pair, and the switches
are connected to each other.
- Indirect interconnect : switches may not be directly connected to a processor.
5
The bisection width is a measure of “number of simultaneous communications” or “connectivity”. How many
simultaneous communications can take place “across the divide” between the halves (moitié) ?
Bandwidth = the rate at which a link can transmit data, usually given in megabits or megabytes per second / the rate
at which the destination receives data after it has started to receive the first byte
Bisection bandwidth = a measure of network quality, instead of counting the number of links joining the halves, it
sums the bandwidth of the links.

In a fully connected network, each switch is directly connected to every other switch (but impractical).

A hypercube is a type of interconnect network used in parallel architectures. It is a highly connected direct
interconnect, which facilitates fast communication between processors.
Built inductively :
- a one-dimensional hypercube is a fully-connected system with two processors
- a two-dimensional hypercube is built from two one-dimensional hypercubes by joining “corresponding”
switches
- similarly a three-dimensional hypercube is built from two two-dimensional hypercubes

Crossbar and Omega network are simple examples of indirect interconnected networks. Often shown with
unidirectional links and a collection of processors, each of which has an outgoing and an incoming link, and a
switching network.

Any time data is transmitted, we are interested in how long it will take for the data to reach its destination.

Latency = the time that elapses between the source’s beginning to transmit the data and the destination’s starting
to receive the first byte.

𝑙𝑒𝑛𝑔𝑡ℎ 𝑜𝑓 𝑚𝑒𝑠𝑠𝑎𝑔𝑒 (𝑏𝑦𝑡𝑒𝑠)

𝑀𝑒𝑠𝑠𝑎𝑔𝑒 𝑡𝑟𝑎𝑛𝑠𝑚𝑖𝑠𝑠𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 = 𝑙𝑎𝑡𝑒𝑛𝑐𝑦 (𝑠𝑒𝑐𝑜𝑛𝑑𝑠) +
𝑏𝑎𝑛𝑑𝑤𝑖𝑑𝑡ℎ (𝑏𝑦𝑡𝑒𝑠 𝑝𝑒𝑟 𝑠𝑒𝑐𝑜𝑛𝑑)

Cache coherence
Programmers have no control over caches and when they get updated.
About the snooping cache coherence, the cores share a bus. Any signal transmitted on the bus can be “seen” by
all cores connected to the bus. When core 0 updates the copy of x stored in its cache it also broadcasts this
information across the bus. If core 1 is “snooping” the bus, it will be that x has been updated and it can mark its copy
of x as invalid.
Directory based cache coherence uses a data structure called a directory that stores the status of each cache
line. When a variable is updated, the directory is consulted, and the cache controllers of the cores that have that
variable’s cache line in their caches are invalidated.
Parallel software

The burden is on software. Hardware and compilers can keep up the pace needed. From now on in shared memory
programs, start a single process and fork threads and threads carry out tasks. In distributed memory programs, start
multiple processes and processes carry out tasks.

A single program multiple data (SPMD) program consists of a single executable that can behave as if it were
multiple different programs through the use of conditional branches.

6
To write a parallel program, the steps to follow are :
1. divide the work among the processes/threads, so each process/thread gets roughly the same amount of
work and communication is minimized
2. arrange for the processes/threads to synchronize
3. arrange for communication among processes/threads

Shared Memory
With dynamic threads, the master thread waits for work, forks new threads, and when threads are done, they
terminate. The use of resources is efficient, but thread creation and termination is time consuming.
With static threads, a pool of threads is created and are allocated work, but do not terminate until cleanup. The
performance is better, but potential waste of system resources.

Non-determinism
The exact order in which the threads execute may vary. Race conditions can occur when multiple threads access
the same data simultaneously, and at least one of them modifies it. Critical section is the part of the code that
accesses a shared resource and that should only be executed by one thread at a time. Mutually exclusive means
that only one thread can enter the critical section at any given time. Mutual exclusion lock (mutex, or simply lock)
exists to manage that.
Busy-waiting : active wait loop.
Message passing : explicit communication (e.g., MPI: Send/Receive).

Partitioned global address space languages use a shared global address space, but it is divided into parts among
the different threads/processes. Each variable is shared, but each thread has a local portion (fast access) and can
also access remote portions (slower access).

Input and output

In distributed memory programs, only process 0 will access stdin. In shared memory programs, only the master
thread or thread 0 will access stdin.

In both distributed memory and shared memory programs all the processes/threads can access stdout and stderr.

However, because of the indeterminacy of the order of output to stdout, in most cases only a single process/thread
will be used for all output to stdout other than debugging output.

Debug output should always include the rank or id of the process/thread that’s generating the output.

Only a single process/thread will attempt to access any single file other than stdin, stdout, or stderr. So, for example,
each process/thread can open its own, private file for reading or writing, but no two processes/threads will open the
same file.

Performance

Number of cores = 𝑝
Serial run-time = 𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Parallel run-time = 𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙
𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Linear speedup = 𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙 =
𝑝
𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Speedup of a parallel program = 𝑆 =
𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙
𝑇
(𝑇 𝑠𝑒𝑟𝑖𝑎𝑙 )
𝑆 𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙 𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Efficiency of a parallel program = 𝐸 = = =
𝑝 𝑝 𝑝 ∗ 𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙

7
Speedup Efficiency

𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Effect of overhead = 𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙 =
𝑝 + 𝑇𝑜𝑣𝑒𝑟ℎ𝑒𝑎𝑑

The Amdahl’s Law is : unless virtually all of a serial program is parallelized, the possible speedup is going to be very
limited - regardless of the number of cores available.

Scalability
In general, a problem is scalable if it can handle (gérer) ever increasing problem sizes.
If we increase the number of processes/threads and keep the efficiency fixed without increasing problem size, the
problem is strongly scalable.
If we keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of
processes/threads, the problem is weakly scalable.

Taking timings : CPU time vs. wall clock; functions like MPI_Wtime or omp_get_wtime.

Parallel program design

Foster’s methodology
1. Partitioning : divide the computation to be performed and the data operated on by the computation into
small tasks. The focus here should be on identifying tasks that can be executed in parallel.
2. Communication : determine what communication needs to be carried out among the tasks identified in the
previous step.
3. Agglomeration or aggregation : combine tasks and communications identified in the first step into larger
tasks. For example, if task A must be executed before task B can be executed, it may make sense to
aggregate them into a single composite task.
4. Mapping : assign the composite tasks identified in the previous step to processes/threads. This should be
done so that communication is minimized, and each process/thread gets roughly the same amount of work.

Concluding Remarks

Serial systems = the standard model of computer hardware has been the von Neumann architecture

Parallel hardware = Flynn’s taxonomy

Parallel software = we focus on software for homogeneous MIMD systems, consisting of a single program that
obtains parallelism by branching ; SPMD programs

Input and output = we’ll write programs in which one process or thread can access stdin, and all processes can
access stdout and stderr ; however, because of nondeterminism, except for debug output we’ll usually have a single
process or thread accessing stdout

Performance = speedup ; efficiency ; Amdahl’s law ; scalability

Parallel program design = Foster’s methodology

8
MPI

First glance of MPI

We have a vector of values 𝑋 = [𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ]. For each value 𝑥𝑖 , we apply a computationally expensive function
𝑓(𝑥𝑖 ) which takes approximately one day to compute = heavy computations. On a single machine, this would take n
days in total, because the computations are performed sequentially : 𝑡 = 𝑡𝑓(𝑥1 )+ . . . + 𝑡𝑓(𝑥𝑛 ) ≈ 𝑛 𝑑𝑎𝑦𝑠
This time can be improved because the computations are independent of each other, so they can be distributed
across several machines. We can then introduce MPI to show how to leverage multiple machines to significantly
reduce execution time.

An introduction to MPI

MPI for Message Passing Interface

MPI is not a program, it is a specification ! a common language, a standard that defines how multiple processes
(on one or more machines) can communicate to perform a parallel computation.

Different implementations exist, these are programs who realise this standard :
- MPICH : one of the firsts, often used as an academic reference
- OpenMPI : very widespread, used in production on many clusters
- and more : for example Intel MPI, Microsoft MPI, etc.

The parallelism offered by MPI is not limited to computer science ; parallel applications are used in other fields such
as physics, biology, and mathematics.

MPI is primarily designed for distributed memory systems, where each process has its own memory and messages
are exchanged.

MPI uses SPMD for Single Program Multiple Data

MPI is based on this parallel programming model, which involves compiling ONE program. All processors execute
the same code, but each works on different data; calculations are performed in parallel. Each process does
“something” different. SPMD is based on Conditional Branching : through conditional branches based on the rank,
each process chooses its role and works on its own portion of the data.

To identify the MPI Processes, a common practice is to attribute non-negative integers called ranks. So for p
processes, we have 0, 1, …, p-1.

Input/Output in MPI

Concerning the output, each process just prints a message. If there are several processes, the order of the output
is unpredictable.

About the inputs, in most MPI implementations, only process 0 in MPI_COMM_WORLD access to stdin (lecture de
l’entrée standard). Process 0 reads the data (scanf) and sends the data to the other process.

1
Compilation Execution

MPI is written in C, uses the imports like stdio.h, string.h, etc. and needs to add the mpi.h header file.
MPI identifiers start with “MPI_” and the first letter following underscore is uppercase for function names and types
and to avoid confusion.

MPI Components :
● MPI_Init : tell MPI to setup → int MPI_Init(int* argc_p, char*** argv_p);
● MPI_Finalize : tell MPI to cleanup → int MPI_Finalize(void);

The basic outline

Point-to-point communications

Communicators = a reference to processes that can communicate together

MPI_Init creates one for us ! Called MPI_COMM_WORLD and it contains in all the processes (si on crée p processus
(my_rank_p = my rank = the process making this call), le communicateur regroupe les p rangs (comm_sz_p = nb of
processes in the communicator)).

int MPI_Comm_size(MPI_Comm comm /in/, int* comm_sz_p /out/);

int MPI_Comm_rank(MPI_Comm comm /*in*/, int* my_rank_p /*out*/);

int MPI_Send(void* msg_buf_p, int msg_size, MPI_Datatype msg_type, int dest, int tag,
MPI_Comm communicator);
int MPI_Recv(void* msg_buf_p, int buf_size, MPI_Datatype buf_type, int source, int
tag, MPI_Comm communicator, MPI_Status** status_p /*out*/);
We can see message matching between both MPI_Send and MPI_Recv.

A receiver can receive a message without knowing the message size, the sender → MPI_ANY_SOURCE or the tag
→ MPI_ANY_TAG.

In the function MPI_Recv, the last argument “status_p” has the type MPI_Status*. It’s mean MPI_SOURCE,
MPI_TAG or MPI_ERROR → for MPI_Status* status; status.MPI_SOURCE / status.MPI_TAG.

2
For the amount of data : int MPI_Get_count(MPI_Status* status_p, MPI_Datatype type, int*
count_p /*out*/);
MPI_Recv is always block (appel bloquant) and MPI_Send behaves differently according to buffer size (bloquant ou
non, cutoffs/blocking) depending on the implementation ! The solution is to know your implementation !

Safety in MPI programs

MPI_Send behave in two ways :

● Buffering : copy the data in the send buffer and return
● Blocking : block until a matching MPI_Recv call
A threshold is used to switch from buffering to blocking. Relatively small messages will be buffered by MPI_Send.
Larger messages will cause it to block.

If every process does a MPI_Send, the program will hang or deadlock since MPI_Recv is not reached. Each
process is blocked waiting for an event that will never happen.
A program is unsafe if it relies on MPI buffering to work. It works for various inputs and hangs for others. Program
unsafe if correct behavior depends on buffering of MPI_Send.

To check if a program is safe, use MPI_Ssend instead, “s” = synchronous, block until a matching MPI_Recv :
int MPI_Ssend(void* msg_buf_p, int msg_size, MPI_Datatype msg_type, int dest, int tag,
MPI_Comm communicator);

To make a program safe, using MPI_Sendrecv, scheduling handled by MPI, blocking send + receive, dest and source
can be equal :
int MPI_Sendrecv(void* send_buf_p, int send_buf_size, MPI_Datatype send_buf_type, int
dest, int send_tag, void* recv_buf_p /*out*/, int recv_buf_size, MPI_Datatype
recv_buf_type, int source, int recv_tag, MPI_Comm communicator, MPI_Status* status_p);

Collective communications

Tree-structured global sum → optimized Tree-structured global sum alternative

3
Collective communications involve ALL processes of a communicator.

int MPI_Reduce(void* input_data_p, void* output_data_p /out/, int count, MPI_Datatype

datatype, MPI_Op operator, int dest_process, MPI_Comm communicator);

In collective communications, ALL processes MUST call the SAME collective function. All arguments must BE
COMPATIBLE.
output_data_p only used on dest_process. All of the processes still need to pass in an actual argument
corresponding to output_data_p even if NULL.

The Point-To-Point (P2P) communications are matched using communicators and tags. The collective
communications don't use tags, they are matched using communicators and call order ! → All processes use the
same collective call order.

If the result should be available to all the processes :

int MPI_Allreduce(void* input_data_p, void* output_data_p /*out*/, int count,
MPI_Datatype datatype, MPI_Op operator, MPI_Comm comm);

Broadcast = one process send its data to all the others in a communicator
int MPI_Bcast(void* data_p /*in/out*/, int count, MPI_Datatype datatype, int
source_proc, MPI_Comm comm);

Data distribution
● Block partitioning = assign blocks of consecutive components to each process
● Cyclic partitioning = assign components in a round robin fashion
● Block-cyclic partitioning = use a cyclic distribution of blocks of components

MPI_Scatter allows to read an entire vector on a process and send the required components to each process:
int MPI_Scatter(void* send_buf_p, int send_count, MPI_Datatype send_type, void*
recv_buf_p /*out*/, int recv_count, MPI_Datatype recv_type, int src_proc, MPI_Comm
comm);
MPI_Gather allows from one process to collect data of all the other processes :
4
int MPI_Gather(void* send_buf_p, int send_count, MPI_Datatype send_type, void*
recv_buf_p /*out*/, int recv_count, MPI_Datatype recv_type, int dest_proc, MPI_Comm
comm);
MPI_Allgather allows to collect data from all the processes on all the processes :
int MPI_Allgather(void* send_buf_p, int send_count, MPI_Datatype send_type, void*
recv_buf_p /*out*/, int recv_count, MPI_Datatype recv_type, MPI_Comm comm);
Allgather concatenates the content of send_buf_p of each process and stores it in each process recv_buf_p.
recv_count is the amount of data being received from each process.

Derived datatypes

The derived datatypes allow to represent any collection of data items in memory by storing the types of the
items and their relative locations in memory. They also allow MPI communication functions to handle custom
user types properly. They work for both send and receive cases.
Derived Datatypes = sequence of basics MPI types
MPI_Type_create_struct builds a derived datatype that consists of individual elements that have different basic
types :
int MPI_Type_create_struct(int count, int array_of_blocklenghts[], MPI_Aint
array_of_displacements[], MPI_Datatype array_of_types[], MPI_Datatype* new_type_p
/*out*/);

Steps to create a derived datatype :

1. Found the right MPI types (MPI_DOUBLE, MPI_INT, …)
2. Define their dimensions (usually 1)
3. Compute their relative displacement
4. Now ready to call MPI_Type_create_struct
5. Commit your type using MPI Type commit
6. Use your type
7. FREE YOUR TYPE using MPI_Type_free

To compute displacement, MPI_Get_address returns the address of the memory location referenced by location_p:
int MPI_Get_address(void* location_p, MPI_Aint* address_p /*out*/);
MPI_Aint is a big enough type to store addresses.

Performance evaluation

Elapsed Parallel Time, to performance evaluation, we can return the number of seconds that have elapsed since
some time in the past. To measure performance, we use wall clock time.

double MPI_Wtime(void);
double start, finish;
...
start = MPI_Wtime();
/* Code to be timed */
...
finish = MPI_Wtime();
printf(“Proc %d > Elapsed time = %e seconds\n”, my_rank, finish-start);

To synchronize processes → use barriers

It ensures that no process will return from calling it until every process in the communicator has started calling it.
int MPI_Barrier(MPI_Comm comm);
All processors expect the barrier before continuing.

5
Pthread

A thread is a schedulable set of instructions, created by a process and shared virtual memory.
Pthread is a library for creating and managing threads. Pthread for POSIX thread ← interoperability

Thread overview

Usually, each thread has its local variables in a stack so, if there are several stack, each thread has its own local
variables in its stack.

For use Pthread

1. Creation with pthread_create()
2. Management and Termination with pthread_join()
3. That’s it

It is the main thread who calls other threads with the function pthread_create().
The compilation of a file with Pthread needs -lpthread in the CFLAGS of the makefile. Like -lm for the math functions.
The order of the output when there are several threads is unpredictable.
To define the task(s) of a thread, create a function with the set of instructions of the thread and add this when
pthread_create() is called (like an argument).

Several threads can execute the same set of instructions but beware about the shared variables (the global
variables). Their modifications need to be in a critical section. Different solutions exist :
1. The solution with a flag / boolean does not work because the threads can check the value of the flag at the
same time.

2. It is possible to use a locker :

pthread_mutex_lock(...);
// the instructions here are “atomics”
// in the sens or just one thread can execute at the time
...
pthread_mutex_unlock(...):

3. The use of semaphore :

sem_init(...,...); // with the nb of threads at a time
sem_wait(...);
sem_post(...);
sem_destroy(...);

4. The use of barrier :

pthread_barrier_init(...); // with the nb of threads
pthread_barrier_wait(...);
pthread_barrier_destroy(...);
Neither of them are optimal or better than the others.
Reminder : be careful not to have blocking instructions in the atomic code !

1
OpenMP

OpenMP is a programming interface with a set of directives, functions and clauses that are added to code to easily
write parallel programs. It works on machines where all threads share the same memory space (shared memory)
and it is threads-based.

Fork-join model

Pros :
● Easier synchronization
● Less error prone
● Built-in load balancer
● Portability
● Faster development
Cons :
● Limited control (compared to PThread)

Using OpenMP

OpenMP is implemented by the compiler (e.g gcc which recognizes and activates the directives).
It can be used directly in C or C++ code (and also Fortran).The API functions are OpenMP functions for controlling
or querying threads and parallelism is enabled via pragma directives in the code.
It is enabled using -fopenmp with gcc.
It uses an additional shared library : omp.h and lib<something>.so.

1
CUDA introduction

What is an accelerator ?

An accelerator is used typically for a special (or limited) purpose. It is very efficient at what it does, less so for generic
programs and it is often a plug-in card to a computer.

What is a GPU ? (card, plug-in (typically))

A Multicore CPU is focused on single-threaded A GPU (Graphics Processing Unit) is an add-on card
performance and general purpose computations. Each focused on graphics and specialized computations.
core uses techniques to optimize its thread(s). More general purpose programming on host CPU. The
Complicated cores consume chip area (branch computations are simple and can easily be parallelised.
prediction, control, caches, …) → the number of cores There are many simpler cores and the aggregated
on a chip is limited. performance is more important than single thread
performance.

Graphics, but lot of compute power

So graphics, but also a lot of compute power. GPUs are cheap, powerful and ubiquitous.

Before CUDA, it was necessary to transform the problem into a graphics problem to use the GPU, but NVIDIA created
CUDA to allow the GPU to be used directly as a parallel computing engine.

How do we program a GPU ?

CUDA = Compute Unified Device Architecture

CUDA is a program for more general purpose tasks and specifies a generic GPU
architecture and a programming model.

Overview of CUDA architecture

CUDA has developed over time with the support for more features added over time
(like Tensor cores). The registers, cache sizes ++ change also over time. It is
important for optimization.
1
Introduction to CUDA programming = Programming the GPU

exemple : C program with “device code” that runs on the GPU and “host code” that runs on the CPU
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C) {
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main() {
...
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
...
}

The function to run is called a “kernel”.

The key idea is to run a lot of tiny threads : Single Program Multiple Data Model
(SPMD). But how to organize and start threads ? With a hierarchical model.
- The threads are queued as a grid of thread blocks : use 1-, 2-, or 3-
dimensional grids of blocks, and map well to vector and matrix problems.
- A block is a set of threads : blocks are scheduled / assigned to Symmetric
Multiprocessors (Sms). Runs to completion on that SM. Can also be 1-,
or 3-dimensional.
- Organization is specified using the kernel start syntax :
kernel<<<numBlocks, threadsPerBlock>>>(parms)

A unique identifier or position for a thread can be computed for a thread using
“magic” variables. Typically used to identify which part of the problem this thread
should work on.

global void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {

int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < N && j < N) {
C[i][j] = A[i][j] + B[i][j];
}
}

The scalability is transparent. Hardware is free to schedule thread blocks on any processor. The programmer can
make no assumption on the execution order of thread blocks. Concerning digression, SMT and GPU threads are
2
scheduled by hardware, not by an operating system : lower overhead and software still controls which threads the
hardware can schedule.

To summarize, we start a grid of blocks on a GPU and each block has a number of threads.
The threads are executed in lock-step. That saves hardware resources compared to individual execution of threads.
Threads inside a block are grouped into “warps” for execution. For each cycle, the scheduler picks a warp that is
ready to run and issues 1 (or 2) instructions for that warp.

Warning about SIMT execution model (Single Instruction Multiple Threads is similar to SIMD model). SM executes
warps (32 threads).
There is one problem, divergent branches can cause performance problems. For example : in this execution, threads
with even threadId execute A, threads with odd threadId idle, then, the threads with odd threadId execute B, and
threads with even threadId idle.
if threadId is even :
do a
else :
do b

GPUs prior to Volta could deadlock if divergent threads in a warp tried to signal or synchronize using lock etc.
Any blocks with deadlock if divergent threads use barriers (like syncthreads()) / if some of the threads never call the
barrier.

GPU Memory hierarchy

The GPU operates on its own memory hierarchy. The local memory is important for
scaling (less contention on memory buses) : thread level, block level and global / GPU
level.

3
The threads inside a block are synchronized and __syncthreads() works as a barrier inside the block (and
__syncwarp() for in a warp).

Synchronization can be expensive (ex : locks stalling threads).

The atomic operations are the mechanism which atomically executes one read-modify-write operation on a word
(32 or 64 bits) in global or shared memory. No other thread can access this memory operation until it completes. But
that requires hardware support.

Other options for GPU programming :

● Higher level libraries and languages that automate some of the problems like Numpy, OpenMP have some
support, …
● OpenCL : Cross-platform (Nvidia, Intel, Amd, +++). Similar thinking to CUDA

4
CUDA optimization

The first step of optimization is “DON’T DO IT”. First, you need to focus on getting a simple implementation to
work. You will run into many problems if you start trying to design an optimal solution before you know the problem.
Sometimes, the simple implementation is good enough and will be simpler to maintain. Make sure you understand
the algorithm, data flow and system architecture. Then, we can optimise.

When you start optimising, an iterative process :

● Benchmark and compare with expectations, try to find hot spots. Tweak parameters.
● Consider redesigning algorithms vs tweaking existing implementation
● Some things to look for :
- Do you keep GPU multiprocessors busy ? Spend time computing, not waiting.
- Memory overhead (keep data close to where you use it if you need it more than once)
- Reducing synchronization and memory contention (accessing the same data)
- Optimise GPU ←→ CPU memory transfers : PCI memory transfers slower than on-device memory
transfers, keep memory on the GPU as long as possible (possibly between calling kernels)
● Make sure the program computes correctly after optimising
● Keep the unoptimized code : comparing performance and comparing correctness

The thread execution goal is to keep cores busy. The main reasons for stalling threads are memory access,
synchronization and any other shared resource (ex : limited registers / shared memory → fewer threads per block).

Occupancy (keeping cores busy) = is the ratio of the number of active warps per multiprocessor to the maximum
number of possible active warps

The kernel code can be compiled with --ptxas-options=-v (or --resource-usage) to determine the memory usage.
Dealing with memory concerns memory bottlenecks and memory capacity.

CUDA Memory Model

A warning about local memory, the local memory is automatically used by the compiler for thread local variables
when it runs out of registers ; even off-chip has the same speed as the global memory. The compiler can reveal this
(see best practices guide).
NB : trem : register spill
may be cached (depending on architecture)

The global memory access optimization resides in device memory, accessed via 32-, 64- and 128- byte aligned
memory transactions (aligned to size).
Compute Capability (CC) 2.x : concurrent memory access from threads coalesced into N cache line operations
- N = number of cache lines or transactions necessary for coalesced / combined memory accesses
5
- L1 has 128-byte lines, L2 has 32-byte segments
Compute Capability 3.x :
- global memory only cached in L2, L1 reserved for local memory
- Some exceptions (CC 3.5, 3.7, 5.2 → different generations)

The effects of misaligned memory access result in more transactions and therefore a significant drop in GPU
performance and the effects of strided (spaced) memory accesses prevent coalescence, causing more transactions
and a sharp drop in GPU performance.

A solution is optimizing host-device memory transfers : avoid transfers if possible (keep data on the device
between kernel calls) and consider overlapping transfers and computation.

Timing using CUDA events

CUDA calls are entered into a stream and queued for running by the run time system and are executed in order in a
stream. Some calls are blocking and will wait for all previous (ex : copy from device to host). To measure a kernel,
use cudaEventRecord.

6
OpenCL

What is OpenCL ?

OpenCL, for Open Computing Language, is an open industry standard for programming a heterogeneous collection
of CPUs, GPUs and other processors (AMD, Apple, IBM, Intel, Nvidia, ...).

The use of OpenCL allows portability because the same code can work on multiple architectures (CPU, GPU, Cell
BE, Xeon Phi, …). It is an efficient parallel programming model because it is a data- and task- parallel computing
model and it abstracts the specifics of underlying hardware.

OpenCL platform model

A host connected to one or more OpenCL devices. An OpenCL device consists of compute units (CU). A compute
unit consists of processing elements (PE). PE executes code as SIMD or SPMD.

OpenCL execution model

OpenCL to CUDA data parallelism model mapping :

OpenCL parallelism concept CUDA equivalent

Kernel Kernel

Host program Host program

NDRange (index space) Grid

Work group Block

Work item Thread

Two-dimensional index space :

- Total number of work items = Gx * Gy
- #work items in work group = Sx * Sy
- Global ID of a work item computed from its local ID (sx,sy) and work group ID (wx,wy)

1
Mapping OpenCL dimensions and indices to CUDA :

OpenCL API call Explanation CUDA equivalent

get_global_id(0) global index of the work item in the blockIdx.x * blockDim.x +

parameters : 0:x; 1:y; 2:z x dimension threadIdx.x

get_local_id(0) local index of the work item in the threadIdx.x

work group in the x dimension

get_global_size(0) size of NDRange in the x dimension gridDim.x * blockDim.x

get_local_size(0) size of each work group in the x blockDim.x

dimension

OpenCL memory model

Private memory : read/write access for work item

Local memory : read/write access for entire work groups
Global memory : read/write access for entire ND-range (all work items, all work groups)
Constant memory : read access for entire ND-range

Memory model : OpenCL vs CUDA

2
OpenCL Programming model

The data parallel programming model is the primary model driving the design of OpenCL. It applies a sequence of
instructions to multiple elements of a memory object. Typically, one-to-one mapping between work-item (thread in
CUDA) and the element.

The task parallel programming model executes a kernel on a compute unit (multiprocessor) with a work-group
(threadblock) containing a single work-item (thread). The parallelism is expressed by enqueuing multiple tasks.

3
GPU vs CPU - really 100-1000x ?

Lecture based on the paper :

Debunking the 100x GPU vs CPU Myth : An Evaluation of Throughput Computing on CPU and GPU

“Myths” vs the authors’ results, the published and otherwise reported results are :
- GPUs provide speedups of 100-100x compared to CPUs
Their experiments showed that the average performance advantage is 2.5x but sometimes, the CPU was faster.

Basic observations

Older work presented fantastic speedups, but some comparisons were made with unoptimized or even single-
threaded CPU code or were also made with mobile CPUs. Unbalanced comparisons may create unrealistic
expectations of performance.

About the paper’s experimental evaluation, all is optimized to maximize performance for both : CPU vs GPU. In this
way, 14 kernels /benchmarks covering common computations for high performance computing.

The GPU still came out ahead in most cases. Not as much as previous papers reported. In two cases, the CPU came
out ahead.

Why ?

Looking at theoretical performance numbers shows us that a large performance gap (100x - 1000x) indicates
inefficient use of the CPU.
Similarly, efficient use of both platforms should still give the GPU an edge based on the raw performance numbers.

Some optimization concerns

● What limits your speed ? CPU bound vs memory bound

● Memory access patterns : efficient uses of cache and coalesced memory access
● The characteristics of a kernel may change with increasing datasets ; cache sizes typically influence things
like this.
● CPU : hide memory latency through prefetching, prediction and use of caches
1
● GPU : hide memory latency through switching between many light-weight threads

What the platforms struggled with

GPUs encounter difficulties with the synchronization. Synchronizing many cores efficiently is a difficult problem, it’s
easier with a few cores. Also, multiple passes over datasets that were too large to fit in registers or shared memory.

For the CPUs, it is about the kernels that are truly memory or CPU bound (the GPU has higher memory BW (bande
passante) and more FLOPS). Also, the CPUs do not have native gather/scatter operations (non-contiguous).

Some problems are hard to optimize. Implementations easily end up using the CPU or GPU inefficiently compared
to the other platform. Inefficient use may still provide an improvement.

2
Communicating Sequential Processes (CSP)

This course focuses on Communicating Sequential Processes (CSP), a mathematical theory of concurrency
developed by Tony Hoare in 1978. It explains how to build multithreaded applications without common concurrency
problems (such as deadlocks or race conditions) by using sequential processes that communicate via channels. The
course emphasizes the simplicity, compositionality, and verifiability of concurrent programs. It uses examples in
PyCSP (a Python implementation of CSP) and compares it with other languages such as Go.

Introduction and motivation

Multithreading ? “If you can get away with it, avoid using threads. Threads can be difficult to use, and they make
programs harder to debug. In general, they just aren’t necessary for strictly GUI work, such as updating component
properties.”

Message passing → Deadlocks ? Buffering ? Coordination of sending and receiving ? Simple ?

Instead of threads, CSP offers message passing to avoid deadlocks, manage buffering and coordinate
sending/receiving in a simple way.

Object orientation programming can lead to unpredictable shared states (e.g., a value that changes unexpectedly
between executions). CSP avoids this by prohibiting global state sharing between processes.

CSP

Like this, CSP is the solution. “Communicating Sequential Processes (CSP) is a precise mathematical theory
of concurrency that can be used to build multithreaded applications that are guaranteed to be free of the
common problems of concurrency and (perhaps more importantly) can be proven to be so.”

Observations :
● The action of the assignment is familiar and well understood. Any change of internal state of a machine
executing a program can be modeled as an assignment.
● Operations of input and output, affecting the external environment, are not nearly so well understood.

“This paper (Tony Hoare, 1978, CSP paper) suggests that input and output are basic primitives of programming and
that parallel composition of communicating sequential processes is a fundamental program structuring method. When
combined with a development of Dijkstra’s guarded command, these concepts are surprisingly versatile.”

Concurrency vs parallel execution

Concurrency : create the illusion of simultaneous execution of multiple tasks at once and manage the execution of
these tasks.
● Interleaving on a single core
● Can also compute correctly in parallel
● Correctness, efficiency

Parallel : actual simultaneous execution of tasks.

● Adds execution on parallel resources
● Related : multi-core, vector instructions, and GPU.

CSP core functionality

Processes

In pure CSP, processes don’t share state (no global variables). Each process is a small sequential program.
Typical granularity : expression or function
They are compositional (can be combined). (cf. figure for Parallel)
1
In PyCSP :
@process
def foo():
pass

Channels

The channels are synchronous communication mechanisms between processes. The reads and writes are
synchronous : a process writes to B (B!v) and another read from A (A?v). There is no implicit buffering. The idea is
similar to assigning across processes.

In PyCSP :
@process
def foo1(cin, cout):
cout(42)
while 1:
v = cin()
cout(v+1)

@process
def foo2(cin, cout):
while 1:
v = cin()
print(“Got”, v)
cout(v+1)

A = Channel()
B = Channel()
Parallel(foo1([Link], [Link]), foo2([Link], [Link]))

Parallel

Each process in a Parallel() construct runs concurrently with the others. The multiple processes are executed
simultaneously. It also exists in Sequence() for sequential execution.

In PyCSP :
@process
def foo():
pass

@process
def bar():
Parallel(foo(), foo())

Alternative (Alt)

In CSP theory, you will encounter the term “external choice”. The idea is that this mechanism will wait for any event
and pick one if more than one is ready. Thus, it allows you to choose from several ready channels (external choice).
This is useful for waiting for one event among several.

2
In PyCSP :
@process
def read_any(A, B, C):
alt = Alternative(A, B, C)
while 1:
guard, ret = [Link]()
print(“Got value”, ret)

A = Channel()
B = Channel()
C = Channel()
Parallel(read_any([Link], [Link], [Link]), ...)

In this case, input channels are used as input guards.

Guards

The guards are conditions for triggering actions, such as timeouts.

In PyCSP :
@process
def read_any(A, B, C):
timeout = Timer(10)
alt = Alternative(A, B, C, timeout)
while 1:
guard, ret = [Link]()
if guard == timeout:
print(“Timed out”)
else:
print(“Got value”, ret)

A = Channel()
B = Channel()
C = Channel()
Parallel(read_any([Link], [Link], [Link]), ...)

Some implementations

CSP can be used in many languages thanks to libraries and implementations like : JCSP (JAVA library for CSP),
C++CSP, PyCSP, etc.

There exists CSP-based languages like Go, Rust or actors and actor-based systems and languages.

3
💻 Assignment 1
mainSequential.c

#include <stdio.h>, <stdlib.h>, <mpi.h>, <math.h>, "lock.h"

// Usage (after make) : ./mainSequential <lock_sequence_length>

int main(int argc, char *argv[]) {

// Check arguments
if (argc != 2) {
printf("Wrong usage: %s <lock_sequence_length>\n", argv[0]);
return 1;
}
int seqLength = atoi(argv[1]);

MPI_Init(NULL, NULL); // init for the time only

double MPI_Wtime(void);

lock_init(seqLength);

double start = MPI_Wtime();

// Try to open the lock with a sequence of length seqLength

for (int i = 0; i < pow(100, seqLength); i++) {
for (int j = 0; j < seqLength; j++) {
int n = (i / ((int)pow(100, j))) % 100;
lock_put(n);
printf("%d ", n);
}
if (lock_open()) {
double finish = MPI_Wtime();
printf("Wow it works!\n");
printf("Execution time = %e seconds\n", finish-start);
return 0;
}
else {
printf("Unfortunately, it did not work..\n");
}
}
}

mainParallel.c

#include <stdlib.h>, <stdio.h>, <mpi.h>, "lock.h"

// Usage (after make) : ./mainMPI <lock_sequence_length>

int main(int argc, char **argv) {

// Check arguments
if (argc != 2) {
printf("Wrong usage: %s <lock_sequence_length>\n", argv[0]);
exit(1);
}
int seqLength = atoi(argv[1]);

int world_rank, world_size;

MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
double MPI_Wtime(void);

lock_init(seqLength);

1
double start_time = MPI_Wtime();

// Try to open the lock

// printf("PID%d is running\n", world_rank);

int unlock = 0;
int unlock_global = 0;
int attempt[seqLength];

// nb total of possible combinaisons

unsigned long long total = 1;
for (int i = 0; i < seqLength; i++) {
total *= 100ULL;
}

for (unsigned long long k = world_rank; k < total; k += world_size) {

// integer slicing
unsigned long long n = k;
for (int i = 0; i < seqLength; i++) {
attempt[i] = n % 100;
n /= 100;
lock_put(attempt[i]);
// printf("%d ", attempt[i]);
}

if (lock_open()) {
unlock = 1;
double finish_time = MPI_Wtime();
printf("Wow it works!\n");

// print the result

printf("The right sequence is");
for (int r = 0; r < seqLength; r++) {
printf(" %d", attempt[r]);
}
printf(" !\n");
printf("Execution time = %e seconds\n", world_rank, finish_time-start_time);
}
else {
// printf("Unfortunately, it did not work...\n");
}

// check if a thread found the correct sequence of numbers

MPI_Allreduce(&unlock, &unlock_global, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

if (unlock_global) {
// printf("PID%d stopped\n", world_rank);
break;
}
}

MPI_Finalize();
return 0;
}

mainMPI.c

#include <stdlib.h>, <stdio.h>, <mpi.h>, "lock.h"

// Usage (after make) : ./mainMPI <lock_sequence_length>

int main(int argc, char **argv) {

// Check arguments
if (argc != 2) {
2
printf("Wrong usage: %s <lock_sequence_length>\n", argv[0]);
exit(1);
}
int seqLength = atoi(argv[1]);

int world_rank, world_size;

MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
double MPI_Wtime(void);

lock_init(seqLength);
double start_time = MPI_Wtime();

// Try to open the lock

// printf("PID%d is running\n", world_rank);

int unlock = 0;
int unlock_global = 0;
int attempt[seqLength];

// nb total of possible combinaisons

unsigned long long total = 1;
for (int i = 0; i < seqLength; i++) {
total *= 100ULL;
}

unsigned long long chunk = total / world_size;

unsigned long long start = world_rank * chunk;
unsigned long long end = (world_rank == world_size - 1) ? total : start + chunk;

for (unsigned long long k = start; k < end; k++) {

// integer slicing
unsigned long long n = k;
for (int i = 0; i < seqLength; i++) {
attempt[i] = n % 100;
n /= 100;
lock_put(attempt[i]);
// printf("%d ", attempt[i]);
}

if (lock_open()) {
//unlock = 1;
unlock_global = 1;
double finish_time = MPI_Wtime();
printf("Wow it works!\n");

MPI_Bcast(&unlock_global, 1, MPI_INT, 0, MPI_COMM_WORLD);

// print the result

printf("The right sequence is");
for (int r = 0; r < seqLength; r++) {
printf(" %d", attempt[r]);
}
printf(" !\n");
printf("Execution time = %e seconds\n", world_rank, finish_time-start_time);
}
else {
MPI_Bcast(&unlock_global, 1, MPI_INT, 0, MPI_COMM_WORLD);
// printf("Unfortunately, it did not work...\n");
}

// check if a thread found the correct sequence of numbers

3
/*if (k % 1000 == 0) {
MPI_Allreduce(&unlock, &unlock_global, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
}*/

if (unlock_global) {
// printf("PID%d stopped\n", world_rank);
break;
}
}

MPI_Finalize();
return 0;
}

makefile

CC=gcc
MPICC=$(shell which mpicc)
MPIRUN=$(shell which mpirun)

all: mainSequential mainMPI mainParallel

mainSequential: mainSequential.c lock.o

$(MPICC) $^ -o $@ -lm

mainMPI: mainMPI.c lock.o

$(MPICC) $^ -o $@

mainParallel: mainParallel.c lock.o

$(MPICC) $^ -o $@

run_parallel: mainParallel hostfile

$(MPIRUN) --hostfile hostfile -np 10 ./mainParallel 3

run_local_parallel: mainParallel
$(MPIRUN) -np 4 ./mainParallel 3

run: mainMPI hostfile

$(MPIRUN) --hostfile hostfile -np 10 ./mainMPI 3 # Launch 10 MPI processes on nodes from hostfile to brut force the lock
with a sequence of 3 numbers

run_local: mainMPI
$(MPIRUN) -np 4 ./mainMPI 3 # Launch 4 MPI processes locally to brut force the lock with a sequence of 3 numbers

clean:
@rm mainSequential mainMPI mainParallel

.PHONY: clean

4
💻 Assignment 2
main.c → serial

#include <stdio.h>, <stdlib.h>, "lib/bmp.h", <math.h>, <string.h>, <time.h>, <sys/time.h>, <sys/stat.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) < (b) ? (b) : (a))
#define GP(X, Y) GetPixel(img, width, height, (X), (Y))

// @brief Convert an image to grayscale

void ImageToGrayscale(RGB *img, const int width, const int height) {
for (int i = 0; i < width * height; i++) {
char grayscale = img[i].red * 0.3 + img[i].green * 0.59 + img[i].blue * 0.11;
img[i].red = grayscale;
img[i].green = grayscale;
img[i].blue = grayscale;
}
}

// @brief Return a pixel no matter the coordinate

RGB GetPixel(RGB *img, const int width, const int height, const int x, const int y) {
if (x < 0 || y < 0 || x >= width || y >= height) {
int approxX = MIN(MAX(x, 0), width - 1);
int approxY = MIN(MAX(y, 0), height - 1);
return (img[approxX + approxY * width]);
}
return (img[x + y * width]);
}

// @brief Apply the Sobel filter to an image

void ApplySobel(RGB *img, const int width, const int height) {
RGB *sobel_out = malloc(sizeof(RGB) * width * height);

// first matrice
int filterX[3][3] = {
{-1, 0, 1},
{-2, 0, 2},
{-1, 0, 1}};

// second matrice
int filterY[3][3] = {
{1, 2, 1},
{0, 0, 0},
{-1, -2, -1}};

// RGB image converted in grayscale image

ImageToGrayscale(img, width, height);

for (int y = 0; y < height; y++) {

for (int x = 0; x < width; x++) {
int Gx = 0;
int Gy = 0;

for (int j = -1; j <= 1; j++) {

for (int i = -1; i <= 1; i++) {
int limitx = x + i;
int limity = y + j;

// get the pixel

RGB p = GP(limitx, limity);

1
// convolution
Gx += [Link] * filterX[j + 1][i + 1];
Gy += [Link] * filterY[j + 1][i + 1];
}
}
// final value
int G = (int)(sqrt(Gx * Gx + Gy * Gy));

if (G > 255)
G = 255;
if (G < 0)
G = 0;

sobel_out[y * width + x].red = G;

sobel_out[y * width + x].green = G;
sobel_out[y * width + x].blue = G;
}
}
memcpy(img, sobel_out, sizeof(RGB) * width * height);
free(sobel_out);
}

// @brief Apply the Emboss filter to an image

void ApplyEmboss(RGB *img, const int width, const int height) {
RGB *emboss_out = malloc(sizeof(RGB) * width * height);

// matrice
int kernel[3][3] = {
{-2, -1, 0},
{-1, 1, 1},
{0, 1, 2}};

// RGB image converted in grayscale image

ImageToGrayscale(img, width, height);

for (int y = 0; y < height; y++) {

for (int x = 0; x < width; x++) {
int G = 0;

for (int j = -1; j <= 1; j++) {

for (int i = -1; i <= 1; i++) {
int limitx = x + i;
int limity = y + j;

// get the pixel

RGB p = GP(limitx, limity);

// convolution
G += [Link] * kernel[j + 1][i + 1];
}
}
if (G > 255)
G = 255;
if (G < 0)
G = 0;

emboss_out[y * width + x].red = G;

emboss_out[y * width + x].green = G;
emboss_out[y * width + x].blue = G;
}
}
memcpy(img, emboss_out, sizeof(RGB) * width * height);
free(emboss_out); }
2
// @brief Get current time in seconds with microsecond precision
double get_time() {
struct timeval tv;
gettimeofday(&tv, NULL);
return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main(int argc, char **argv) {

const char *img_env = getenv("IMG");
const char *marguerite = (argc > 1 && argv[1] && argv[1][0]) ? argv[1] : (img_env && img_env[0] ? img_env :
"../../images/[Link]");
const char *outsobel = "[Link]";
const char *outemboss = "[Link]";

printf("=== Serial Image Filtering ===\n");

double start_time = get_time();

// Get image dimensions

int width, height;
GetSize(marguerite, &width, &height);
printf("Image dimensions: %dx%d pixels\n", width, height);

// Init memory
RGB *sobel = malloc(sizeof(RGB) * width * height);
RGB *emboss = malloc(sizeof(RGB) * width * height);

double load_start = get_time();

// Load images
LoadRegion(marguerite, 0, 0, width, height, sobel);
LoadRegion(marguerite, 0, 0, width, height, emboss);
double load_time = get_time() - load_start;

printf("Image loading time: %.4f seconds\n", load_time);

// Apply filters
double sobel_start = get_time();
ApplySobel(sobel, width, height);
double sobel_time = get_time() - sobel_start;

double emboss_start = get_time();

ApplyEmboss(emboss, width, height);
double emboss_time = get_time() - emboss_start;

printf("Sobel filter time: %.4f seconds\n", sobel_time);

printf("Emboss filter time: %.4f seconds\n", emboss_time);

double save_start = get_time();

// Save images
CreateBMP(outsobel, width, height);
WriteRegion(outsobel, 0, 0, width, height, sobel);
CreateBMP(outemboss, width, height);
WriteRegion(outemboss, 0, 0, width, height, emboss);
double save_time = get_time() - save_start;

printf("Image saving time: %.4f seconds\n", save_time);

double total_time = get_time() - start_time;

double computation_time = sobel_time + emboss_time;

printf("Total computation time: %.4f seconds\n", computation_time);

printf("Total execution time: %.4f seconds\n", total_time);
printf("Computation efficiency: %.2f%%\n", (computation_time / total_time) * 100);
3
// CSV logging (append) for plotting: version,image,width,height,ranks,threads,load_gray,distribute,compute,gather,save,total
const char *csv_path = "../../doc/[Link]";
FILE *csv = fopen(csv_path, "r");
int csv_exists = (csv != NULL);
if (csv) fclose(csv);
csv = fopen(csv_path, "a");

if (csv) {
if (!csv_exists) {
fprintf(csv, "version,image,width,height,ranks,threads,load_gray,distribute,compute,gather,save,total\n");
}
// For serial: ranks=1, threads=1 (no OpenMP here), distribute=gather=0
fprintf(csv, "serial,%s,%d,%d,%d,%d,%.6f,%.6f,%.6f,%.6f,%.6f,%.6f\n",
marguerite, width, height, 1, 1,
load_time, 0.0, computation_time, 0.0, save_time, total_time);
fclose(csv);
}
// Free memory
free(sobel);
free(emboss);
return (0);
}

main.c → parallel

#include <stdio.h>,<stdlib.h>, "lib/bmp.h", <mpi.h>,<math.h>, <string.h>, <omp.h>, <sys/stat.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

#define MAX(a, b) ((a) < (b) ? (b) : (a))
#define GP(X, Y) GetPixel(img, width, height, (X), (Y))

// @brief Convert an image to grayscale

// @brief Return a pixel no matter the coordinate

// @brief Apply the Sobel filter to an image

void ApplySobel(RGB *img, const int width, const int height) {
RGB *sobel_out = malloc(sizeof(RGB) * width * height);

// first matrice
int filterX[3][3] = {
{-1, 0, 1},
{-2, 0, 2},
{-1, 0, 1}};

// second matrice
int filterY[3][3] = {
4
{1, 2, 1},
{0, 0, 0},
{-1, -2, -1}};

#pragma omp parallel for collapse(2)

for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int Gx = 0;
int Gy = 0;

for (int j = -1; j <= 1; j++) {

for (int i = -1; i <= 1; i++) {
int limitx = x + i;
int limity = y + j;

// get the pixel

RGB p = GP(limitx, limity);

// convolution
Gx += [Link] * filterX[j + 1][i + 1];
Gy += [Link] * filterY[j + 1][i + 1];
}
}
// final value
int G = (int)(sqrt(Gx * Gx + Gy * Gy));

if (G > 255)
G = 255;
if (G < 0)
G = 0;

sobel_out[y * width + x].red = G;

sobel_out[y * width + x].green = G;
sobel_out[y * width + x].blue = G;
}
}
memcpy(img, sobel_out, sizeof(RGB) * width * height);
free(sobel_out);
}

// @brief Apply the Emboss filter to an image

void ApplyEmboss(RGB *img, const int width, const int height) {
RGB *emboss_out = malloc(sizeof(RGB) * width * height);

// matrice
int kernel[3][3] = {
{-2, -1, 0},
{-1, 1, 1},
{0, 1, 2}};

#pragma omp parallel for collapse(2)

for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int G = 0;

for (int j = -1; j <= 1; j++) {

for (int i = -1; i <= 1; i++) {
int limitx = x + i;
int limity = y + j;

// get the pixel

RGB p = GP(limitx, limity);

5
// convolution
G += [Link] * kernel[j + 1][i + 1];
}
}
if (G > 255)
G = 255;
if (G < 0)
G = 0;

emboss_out[y * width + x].red = G;

emboss_out[y * width + x].green = G;
emboss_out[y * width + x].blue = G;
}
}
memcpy(img, emboss_out, sizeof(RGB) * width * height);
free(emboss_out);
}

int main(int argc, char **argv) {

omp_set_num_threads(4);
const char *img_env = getenv("IMG");
const char *marguerite = (argc > 1 && argv[1] && argv[1][0]) ? argv[1] : (img_env && img_env[0] ? img_env :
"../../images/[Link]");
const char *outsobel = "[Link]";
const char *outemboss = "[Link]";

int world_rank, world_size;

MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

const char *omp_env = getenv("OMP_NUM_THREADS");

int omp_threads = omp_env ? atoi(omp_env) : 4;
double t0_total = MPI_Wtime();

// Get image dimensions

int width, height;
RGB *sobel = NULL;
RGB *emboss = NULL;

if (world_rank == 0) { // execute just by one

double t0_io = MPI_Wtime();
GetSize(marguerite, &width, &height);

// Init memory
sobel = malloc(sizeof(RGB) * width * height);
emboss = malloc(sizeof(RGB) * width * height);

// Load images
LoadRegion(marguerite, 0, 0, width, height, sobel);
LoadRegion(marguerite, 0, 0, width, height, emboss);

// RGB image converted in grayscale image

ImageToGrayscale(sobel, width, height);
ImageToGrayscale(emboss, width, height);
double t1_io = MPI_Wtime();
printf("=== Parallel (manual MPI + OpenMP) ===\n");
printf("Image: %s, size: %dx%d\n", marguerite, width, height);
printf("MPI ranks: %d, OMP threads per rank: %d\n", world_size, omp_threads);
printf("Load + grayscale time (rank0): %.6f s\n", (t1_io - t0_io));
}

// share the size

6
MPI_Bcast(&width, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&height, 1, MPI_INT, 0, MPI_COMM_WORLD);

int nb_tiles = height / world_size;

int start_rank = world_rank * nb_tiles;
int end_rank = (world_rank == world_size - 1) ? height : start_rank + nb_tiles;

// need a line above and a line below

int start = (start_rank == 0) ? start_rank : start_rank - 1;
int end = (end_rank == height) ? end_rank : end_rank + 1;

RGB input_sobel = malloc(sizeof(RGB) width * (end - start));

RGB *input_emboss = malloc(sizeof(RGB) * width * (end - start));

// tiles' distribution by one process

double t0_dist = MPI_Wtime();
if (world_rank == 0) {
#pragma omp parallel for collapse(2)
for (int y = start; y < end; y++) {
for (int x = 0; x < width; x++) {
input_sobel[(y - start) * width + x] = sobel[y * width + x];
input_emboss[(y - start) * width + x] = emboss[y * width + x];
}
}

for (int rank = 1; rank < world_size; rank++) {

// world_rank 0 needs to calculate this again for each others processes
int Rstart = rank * nb_tiles;
int Rend = (rank == world_size - 1) ? height : Rstart + nb_tiles;
Rstart = (Rstart == 0) ? Rstart : Rstart - 1;
Rend = (Rend == height) ? Rend : Rend + 1;
MPI_Send(sobel + Rstart * width, (Rend - Rstart) * width * sizeof(RGB), MPI_BYTE, rank, 0, MPI_COMM_WORLD);
MPI_Send(emboss + Rstart * width, (Rend - Rstart) * width * sizeof(RGB), MPI_BYTE, rank, 0, MPI_COMM_WORLD);
}
}
else {
MPI_Recv(input_sobel, (end - start) * width * sizeof(RGB), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
MPI_Recv(input_emboss, (end - start) * width * sizeof(RGB), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
}
double t1_dist = MPI_Wtime();

int localstart = (world_rank == 0) ? 0 : 1;

int localend = (world_rank == world_size - 1) ? (end - start) : (end - start - 1);

// Apply filters
double t0_comp = MPI_Wtime();
ApplySobel(input_sobel, width, end - start);
ApplyEmboss(input_emboss, width, end - start);
double t1_comp = MPI_Wtime();

// tiles' recovery by one process

double t0_gath = MPI_Wtime();
if (world_rank == 0) {
#pragma omp parallel for collapse(2)
for (int y = 0; y < end_rank - start_rank; y++) {
for (int x = 0; x < width; x++) {
sobel[(start_rank + y) * width + x] = input_sobel[(localstart + y) * width + x];
emboss[(start_rank + y) * width + x] = input_emboss[(localstart + y) * width + x];
}
}

7
for (int rank = 1; rank < world_size; rank++) {
// interior (non-halo) region for this rank
int baseStart = rank * nb_tiles;
int baseEnd = (rank == world_size - 1) ? height : baseStart + nb_tiles;
int interiorRows = baseEnd - baseStart; // rows sent back by worker

MPI_Recv(sobel + baseStart * width, interiorRows * width * sizeof(RGB), MPI_BYTE, rank, 1, MPI_COMM_WORLD,

MPI_STATUS_IGNORE);
MPI_Recv(emboss + baseStart * width, interiorRows * width * sizeof(RGB), MPI_BYTE, rank, 1, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
}

// Save images
double t0_save = MPI_Wtime();
CreateBMP(outsobel, width, height);
WriteRegion(outsobel, 0, 0, width, height, sobel);
CreateBMP(outemboss, width, height);
WriteRegion(outemboss, 0, 0, width, height, emboss);
double t1_save = MPI_Wtime();
double t1_total = MPI_Wtime();
double t_dist = (t1_dist - t0_dist);
double t_comp = (t1_comp - t0_comp);
double t_gath = (t0_save - t0_gath);
double t_save = (t1_save - t0_save);
double t_tot = (t1_total - t0_total);
printf("Distribute time: %.6f s\n", t_dist);
printf("Compute time: %.6f s\n", t_comp);
printf("Gather time: %.6f s\n", t_gath);
printf("Save time: %.6f s\n", t_save);
printf("Total time: %.6f s\n", t_tot);

// CSV logging
const char *csv_path = "../../doc/[Link]";
FILE *csv = fopen(csv_path, "r");
int csv_exists = (csv != NULL);
if (csv) fclose(csv);
csv = fopen(csv_path, "a");

if (csv) {
if (!csv_exists) {
fprintf(csv, "version,image,width,height,ranks,threads,load_gray,distribute,compute,gather,save,total\n");
}
fprintf(csv, "parallel,%s,%d,%d,%d,%d,%.6f,%.6f,%.6f,%.6f,%.6f,%.6f\n",
marguerite, width, height, world_size, omp_threads,
0.0, t_dist, t_comp, t_gath, t_save, t_tot);
fclose(csv);
}
}
else {
// Send only the interior (non-halo) rows back to rank 0
MPI_Send(input_sobel + localstart * width, (localend - localstart) * width * sizeof(RGB), MPI_BYTE, 0, 1,
MPI_COMM_WORLD);
MPI_Send(input_emboss + localstart * width, (localend - localstart) * width * sizeof(RGB), MPI_BYTE, 0, 1,
MPI_COMM_WORLD);
}

// Free memory
free(input_sobel);
free(input_emboss);

if (world_rank == 0) {
free(sobel);
free(emboss);
8
}
MPI_Finalize();
return (0);
}

main.c → optimized

#include <stdio.h>, <stdlib.h>,"lib/bmp.h", <mpi.h>, <math.h>, <string.h>, <omp.h>, <sys/stat.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

#define MAX(a, b) ((a) < (b) ? (b) : (a))
#define GP(X, Y) GetPixel(img, width, height, (X), (Y))

// @brief Convert an image to grayscale

// @brief Return a pixel no matter the coordinate

// @brief Apply the Sobel filter to an image

void ApplySobel(RGB *img, const int width, const int height) {
RGB *sobel_out = malloc(sizeof(RGB) * width * height);

// first matrice
int filterX[3][3] = {
{-1, 0, 1},
{-2, 0, 2},
{-1, 0, 1}};

// second matrice
int filterY[3][3] = {
{1, 2, 1},
{0, 0, 0},
{-1, -2, -1}};

#pragma omp parallel for collapse(2)

for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int Gx = 0;
int Gy = 0;

for (int j = -1; j <= 1; j++) {

for (int i = -1; i <= 1; i++) {
int limitx = x + i;
int limity = y + j;

// get the pixel

RGB p = GP(limitx, limity);

9
// convolution
Gx += [Link] * filterX[j + 1][i + 1];
Gy += [Link] * filterY[j + 1][i + 1];
}
}
// final value
int G = (int)(sqrt(Gx * Gx + Gy * Gy));

if (G > 255)
G = 255;
if (G < 0)
G = 0;

sobel_out[y * width + x].red = G;

sobel_out[y * width + x].green = G;
sobel_out[y * width + x].blue = G;
}
}
memcpy(img, sobel_out, sizeof(RGB) * width * height);
free(sobel_out);
}

// @brief Apply the Emboss filter to an image

void ApplyEmboss(RGB *img, const int width, const int height) {
RGB *emboss_out = malloc(sizeof(RGB) * width * height);

// matrice
int kernel[3][3] = {
{-2, -1, 0},
{-1, 1, 1},
{0, 1, 2}};

#pragma omp parallel for collapse(2)

for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) {
int G = 0;

for (int j = -1; j <= 1; j++) {

for (int i = -1; i <= 1; i++) {
int limitx = x + i;
int limity = y + j;

// get the pixel

RGB p = GP(limitx, limity);

// convolution
G += [Link] * kernel[j + 1][i + 1];
}
}
if (G > 255)
G = 255;
if (G < 0)
G = 0;

emboss_out[y * width + x].red = G;

emboss_out[y * width + x].green = G;
emboss_out[y * width + x].blue = G;
}
}
memcpy(img, emboss_out, sizeof(RGB) * width * height);
free(emboss_out);
}

10
int main(int argc, char **argv) {
omp_set_num_threads(4);
const char *img_env = getenv("IMG");
const char *marguerite = (argc > 1 && argv[1] && argv[1][0]) ? argv[1] : (img_env && img_env[0] ? img_env :
"../../images/[Link]");
const char *outsobel = "[Link]";
const char *outemboss = "[Link]";

int world_rank, world_size;

MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
const char *omp_env = getenv("OMP_NUM_THREADS");
int omp_threads = omp_env ? atoi(omp_env) : 4;
double t0_total = MPI_Wtime();

// Get image dimensions

int width, height;
RGB *sobel, *emboss;

if (world_rank == 0) { // execute just by one

double t0_io = MPI_Wtime();
GetSize(marguerite, &width, &height);

// Init memory
sobel = malloc(sizeof(RGB) * width * height);
emboss = malloc(sizeof(RGB) * width * height);

// Load images
LoadRegion(marguerite, 0, 0, width, height, sobel);
LoadRegion(marguerite, 0, 0, width, height, emboss);

// RGB image converted in grayscale image

ImageToGrayscale(sobel, width, height);
ImageToGrayscale(emboss, width, height);
double t1_io = MPI_Wtime();
printf("=== Optimized (Scatterv/Gatherv + OpenMP) ===\n");
printf("Image: %s, size: %dx%d\n", marguerite, width, height);
printf("MPI ranks: %d, OMP threads per rank: %d\n", world_size, omp_threads);
printf("Load + grayscale time (rank0): %.6f s\n", (t1_io - t0_io));
}

// share the size

MPI_Bcast(&width, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&height, 1, MPI_INT, 0, MPI_COMM_WORLD);

int nb_tiles = height / world_size;

// for the amount of data and the offset to send at each process
int *data_sobel = malloc(world_size * sizeof(int));
int *offset_sobel = malloc(world_size * sizeof(int));
int *data_emboss = malloc(world_size * sizeof(int));
int *offset_emboss = malloc(world_size * sizeof(int));

int start = 0;
int end = 0;

#pragma omp parallel for

for (int rank = 0; rank < world_size; rank++) {
int Rstart = rank * nb_tiles;
int Rend = (rank == world_size - 1) ? height : Rstart + nb_tiles;
Rstart = (Rstart == 0) ? Rstart : Rstart - 1;
Rend = (Rend == height) ? Rend : Rend + 1;
11
data_sobel[rank] = (Rend - Rstart) * width * sizeof(RGB);
offset_sobel[rank] = Rstart * width * sizeof(RGB);

data_emboss[rank] = (Rend - Rstart) * width * sizeof(RGB);

offset_emboss[rank] = Rstart * width * sizeof(RGB);

if (rank == world_rank) {
start = Rstart;
end = Rend;
}
}

RGB input_sobel = malloc(sizeof(RGB) width * (end - start));

RGB *input_emboss = malloc(sizeof(RGB) * width * (end - start));

// use Scatterv and Gatherv because each process not receive the same amount of data (maybe less for the last one)

// tiles' distribution by one process

double t0_dist = MPI_Wtime();
MPI_Scatterv(sobel, data_sobel, offset_sobel, MPI_BYTE, input_sobel, data_sobel[world_rank], MPI_BYTE, 0,
MPI_COMM_WORLD);
MPI_Scatterv(emboss, data_emboss, offset_emboss, MPI_BYTE, input_emboss, data_emboss[world_rank], MPI_BYTE, 0,
MPI_COMM_WORLD);
double t1_dist = MPI_Wtime();

int localstart = (world_rank == 0) ? 0 : 1;

int localend = (world_rank == world_size - 1) ? (end - start) : (end - start - 1);

// Apply filters
double t0_comp = MPI_Wtime();
ApplySobel(input_sobel, width, end - start);
ApplyEmboss(input_emboss, width, end - start);
double t1_comp = MPI_Wtime();

// tiles' recovery by one process

double t0_gath = MPI_Wtime();
MPI_Gatherv(input_sobel + localstart * width, (localend - localstart) * width * sizeof(RGB), MPI_BYTE, sobel, data_sobel,
offset_sobel, MPI_BYTE, 0, MPI_COMM_WORLD);
MPI_Gatherv(input_emboss + localstart * width, (localend - localstart) * width * sizeof(RGB), MPI_BYTE, emboss,
data_emboss, offset_emboss, MPI_BYTE, 0, MPI_COMM_WORLD);
double t1_gath = MPI_Wtime();

if (world_rank == 0) {
// Save images
double t0_save = MPI_Wtime();
CreateBMP(outsobel, width, height);
WriteRegion(outsobel, 0, 0, width, height, sobel);
CreateBMP(outemboss, width, height);
WriteRegion(outemboss, 0, 0, width, height, emboss);
double t1_save = MPI_Wtime();
double t1_total = MPI_Wtime();
double t_dist = (t1_dist - t0_dist);
double t_comp = (t1_comp - t0_comp);
double t_gath = (t1_gath - t0_gath);
double t_save = (t1_save - t0_save);
double t_tot = (t1_total - t0_total);
printf("Distribute time: %.6f s\n", t_dist);
printf("Compute time: %.6f s\n", t_comp);
printf("Gather time: %.6f s\n", t_gath);
printf("Save time: %.6f s\n", t_save);
printf("Total time: %.6f s\n", t_tot);

// CSV logging
12
const char *csv_path = "../../doc/[Link]";
FILE *csv = fopen(csv_path, "r");
int csv_exists = (csv != NULL);
if (csv) fclose(csv);
csv = fopen(csv_path, "a");
if (csv) {
if (!csv_exists) {
fprintf(csv, "version,image,width,height,ranks,threads,load_gray,distribute,compute,gather,save,total\n");
}
fprintf(csv, "optimized,%s,%d,%d,%d,%d,%.6f,%.6f,%.6f,%.6f,%.6f,%.6f\n",
marguerite, width, height, world_size, omp_threads,
0.0, t_dist, t_comp, t_gath, t_save, t_tot);
fclose(csv);
}
}
// Free memory
free(input_sobel);
free(input_emboss);
free(data_sobel);
free(data_emboss);
free(offset_sobel);
free(offset_emboss);

if (world_rank == 0) {
free(sobel);
free(emboss);
}
MPI_Finalize();
return (0);
}

makefile

# Number of processes:
N?=6
# Optional hostfile path (set HOSTFILE=path/to/hosts when calling make)
HOSTFILE?=
# OpenMP threads per process (export OMP_NUM_THREADS if preferred)
OMP_NTHREADS?=4
# Change this as you like:
MPIEXEC_OPTS?=--oversubscribe

UNAME_S := $(shell uname -s)

ifeq ($(UNAME_S),Darwin)
OMP_CFLAGS := -Xpreprocessor -fopenmp -I/opt/homebrew/opt/libomp/include
OMP_LDFLAGS := -L/opt/homebrew/opt/libomp/lib -lomp
else
OMP_CFLAGS := -fopenmp
OMP_LDFLAGS :=
endif

main: main.c lib/libbmp/libbmp.c lib/bmp.c

mpicc $(OMP_CFLAGS) -o $@ $^ -lm $(OMP_LDFLAGS)

run: main
OMP_NUM_THREADS=$(OMP_NTHREADS) mpiexec $(if $(HOSTFILE),-hostfile $(HOSTFILE),) -n $(N)
$(MPIEXEC_OPTS) ./main

clean:
@rm -f main [Link] [Link]

.PHONY: clean

13
💻 Assignment 3
mainSerial.c

#include <stdio.h>, <time.h>, <stdlib.h>,<math.h>

// Function : that finds the maximum value in an array (sequential version)

int max_tab(int array[], int len) {
int max = array[0];

for (int i = 1; i < len; i++) {

if (array[i] > max)
max = array[i];
}
return max;
}

int main() {
int greatest1, greatest2, greatest3;
srand(time(NULL));

// for the measures of the time

struct timespec start1, end1, start2, end2, start3, end3;
double elapsed1, elapsed2, elapsed3;

// the three different dimensions

size_t size1 = 134217728; // 2^27
size_t size2 = 268435456; // 2^28
size_t size3 = 402653184; // 2^27 + 2^28

// init the array

int *array1 = malloc(size1 * sizeof(int));
int *array2 = malloc(size2 * sizeof(int));
int *array3 = malloc(size3 * sizeof(int));

for (size_t i = 0; i < size1; i++)

array1[i] = rand();

for (size_t j = 0; j < size2; j++)

array2[j] = rand();

for (size_t k = 0; k < size3; k++)

array3[k] = rand();

// init array for the time

double *time1 = malloc(30 * sizeof(double));
double *time2 = malloc(30 * sizeof(double));
double *time3 = malloc(30 * sizeof(double));

/////////////////////////////////////////////////////
// EXECUTION //
/////////////////////////////////////////////////////
for (int n = 0; n < 30; n++) {
// 0.5 GiB
clock_gettime(CLOCK_MONOTONIC, &start1);
greatest1 = max_tab(array1, size1);
clock_gettime(CLOCK_MONOTONIC, &end1);
elapsed1 = (end1.tv_sec - start1.tv_sec) + (end1.tv_nsec - start1.tv_nsec) / 1e9;
time1[n] = elapsed1;

// 1 GiB

1
clock_gettime(CLOCK_MONOTONIC, &start2);
greatest2 = max_tab(array2, size2);
clock_gettime(CLOCK_MONOTONIC, &end2);
elapsed2 = (end2.tv_sec - start2.tv_sec) + (end2.tv_nsec - start2.tv_nsec) / 1e9;
time2[n] = elapsed2;

// 1.5 GiB
clock_gettime(CLOCK_MONOTONIC, &start3);
greatest3 = max_tab(array3, size3);
clock_gettime(CLOCK_MONOTONIC, &end3);
elapsed3 = (end3.tv_sec - start3.tv_sec) + (end3.tv_nsec - start3.tv_nsec) / 1e9;
time3[n] = elapsed3;
}

/////////////////////////////////////////////////////
// RESULTS //
/////////////////////////////////////////////////////
// for the measures of the time
double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
double variance1 = 0.0, variance2 = 0.0, variance3 = 0.0;
double time_min1 = time1[0], time_min2 = time2[0], time_min3 = time3[0];

for (int i = 0; i < 30; i++) {

// search the minimal time
if (time1[i] < time_min1)
time_min1 = time1[i];
if (time2[i] < time_min2)
time_min2 = time2[i];
if (time3[i] < time_min3)
time_min3 = time3[i];

// calculate the sums

sum1 += time1[i];
sum2 += time2[i];
sum3 += time3[i];
}

// calculate the averages

double average1 = sum1 / 30;
double average2 = sum2 / 30;
double average3 = sum3 / 30;

// calculate the variances

for (int j = 0; j < 30; j++) {
variance1 += pow(time1[j] - average1, 2);
variance2 += pow(time2[j] - average2, 2);
variance3 += pow(time3[j] - average3, 2);
}
variance1 = variance1 / 30;
variance2 = variance2 / 30;
variance3 = variance3 / 30;

// calculate the standard deviations

double st_d1 = sqrt(variance1);
double st_d2 = sqrt(variance2);
double st_d3 = sqrt(variance3);

// print the results

printf("------------------ for size = 2^27 ------------------\n");
printf("Greatest : %i\n", greatest1);
printf("Minimum elapsed time : %.9f seconds\n", time_min1);
printf("Mean of the elapsed times : %.9f seconds\n", average1);
printf("Standard deviation : %.9f seconds\n", st_d1);
2
printf("------------------ for size = 2^28 ------------------\n");
printf("Greatest : %i\n", greatest2);
printf("Minimum elapsed time : %.9f seconds\n", time_min2);
printf("Mean of the elapsed times : %.9f seconds\n", average2);
printf("Standard deviation : %.9f seconds\n", st_d2);

printf("--------------- for size = 2^27 + 2^28 --------------\n");

printf("Greatest : %i\n", greatest3);
printf("Minimum elapsed time : %.9f seconds\n", time_min3);
printf("Mean of the elapsed times : %.9f seconds\n", average3);
printf("Standard deviation : %.9f seconds\n", st_d3);

printf("That's all !\n");

// release
free(array1);
free(array2);
free(array3);
free(time1);
free(time2);
free(time3);
return 0;
}

[Link]

#include <cuda.h>, <stdio.h>, <stdlib.h>, <math.h>, <time.h>, <cuda_runtime.h>

// Function : that finds the maximum value within a warp

__device__ int MaxWarp(int max) {
// warp reduction loop : offset is 16, then 8, then 4, then 2, then 1
for (int offset = 16; offset > 0; offset >>= 1) {
// read the value from another thread in the same warp
int value = __shfl_down_sync(0xffffffff, max, offset);
if (value > max)
max = value;
}
return max;
}

// Function : that finds the maximum value in an array (parallel version with CUDA)
__global__ void MaxTab(int array[], int len, int output[]) {
// share table to store the max of each warp
extern __shared__ int warps_max[];

int posBlock = threadIdx.x;

int posGrid = blockIdx.x * blockDim.x + posBlock;
int current_max = -1;

// each thread processes a portion of the array

for (int i = posGrid; i < len; i += blockDim.x * gridDim.x) {
if (array[i] > current_max)
current_max = array[i];
}

// reduction on the wrap

int max_warp = MaxWarp(current_max);
int posWarp = posBlock % 32;
int warpId = posBlock / 32;
if (posWarp == 0) // only the first thread of each block write on the tab
warps_max[warpId] = max_warp;
__syncthreads(); // synchronize all the threads

3
// reduction on the block
if (warpId == 0) {
int value = (posWarp < blockDim.x / 32) ? warps_max[posWarp] : -1;
int blockMax = MaxWarp(value);

if (posWarp == 0)
atomicMax(output, blockMax);
}
}

// Function : that finds the maximum value in an array (sequential version)

int max_tab(int array[], int len) {
int max = array[0];

for (int i = 1; i < len; i++) {

if (array[i] > max)
max = array[i];
}
return max;
}

int serialFunction(int array1[], size_t size1, int array2[], size_t size2, int array3[], size_t size3) {
int greatest1, greatest2, greatest3;
srand(time(NULL));

// for the measures of the time

struct timespec start1, end1, start2, end2, start3, end3;
double elapsed1, elapsed2, elapsed3;

// init array for the time

double *time1 = (double *)malloc(30 * sizeof(double));
double *time2 = (double *)malloc(30 * sizeof(double));
double *time3 = (double *)malloc(30 * sizeof(double));

// 1 GiB
clock_gettime(CLOCK_MONOTONIC, &start2);
greatest2 = max_tab(array2, size2);
clock_gettime(CLOCK_MONOTONIC, &end2);
elapsed2 = (end2.tv_sec - start2.tv_sec) + (end2.tv_nsec - start2.tv_nsec) / 1e9;
time2[n] = elapsed2;

/////////////////////////////////////////////////////
// RESULTS //
/////////////////////////////////////////////////////
4
// for the measures of the time
double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
double variance1 = 0.0, variance2 = 0.0, variance3 = 0.0;
double time_min1 = time1[0], time_min2 = time2[0], time_min3 = time3[0];

for (int i = 0; i < 30; i++) {

// search the minimal time
if (time1[i] < time_min1)
time_min1 = time1[i];
if (time2[i] < time_min2)
time_min2 = time2[i];
if (time3[i] < time_min3)
time_min3 = time3[i];

// calculate the sums

sum1 += time1[i];
sum2 += time2[i];
sum3 += time3[i];
}

// calculate the averages

double average1 = sum1 / 30;
double average2 = sum2 / 30;
double average3 = sum3 / 30;

// calculate the variances

// calculate the standard deviations

double st_d1 = sqrt(variance1);
double st_d2 = sqrt(variance2);
double st_d3 = sqrt(variance3);

// print the results

printf("------------------ SERIAL for size = 2^27 ------------------\n");
printf("Greatest : %i\n", greatest1);
printf("Minimum elapsed time : %.9f seconds\n", time_min1);
printf("Mean of the elapsed times : %.9f seconds\n", average1);
printf("Standard deviation : %.9f seconds\n", st_d1);

printf("------------------ SERIAL for size = 2^28 ------------------\n");

printf("Greatest : %i\n", greatest2);
printf("Minimum elapsed time : %.9f seconds\n", time_min2);
printf("Mean of the elapsed times : %.9f seconds\n", average2);
printf("Standard deviation : %.9f seconds\n", st_d2);

printf("--------------- SERIAL for size = 2^27 + 2^28 --------------\n");

// release
free(time1);
free(time2);
free(time3);
5
return 0;
}

int main() {
int greatest1, greatest2, greatest3;
int greatest1_h = -1, greatest2_h = -1, greatest3_h = -1; // host verification
srand(time(NULL));
int init = -1;
int threads = 256, blocks = 256;

// for the measures of the time

cudaEvent_t copy1, endcopy1, copy2, endcopy2, copy3, endcopy3;
cudaEvent_t copyB1, endcopyB1, copyB2, endcopyB2, copyB3, endcopyB3;
cudaEvent_t start1, end1, start2, end2, start3, end3;
float elapsed1 = 0.0f, elapsed2 = 0.0f, elapsed3 = 0.0f;
float elapcopy1 = 0.0f, elapcopy2 = 0.0f, elapcopy3 = 0.0f;
float elapcopyB1 = 0.0f, elapcopyB2 = 0.0f, elapcopyB3 = 0.0f;
float sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
float variance1 = 0.0f, variance2 = 0.0f, variance3 = 0.0f;

/////////////////////////////////////////////////////
// 0.5 GiB //
/////////////////////////////////////////////////////
size_t size1 = 134217728; // 2^27

// init the table

int *array1;
cudaMallocHost((void **)&array1, size1 * sizeof(int));
for (size_t i = 0; i < size1; i++)
array1[i] = rand();
// host verification on the same data
for (size_t i = 0; i < size1; i++)
if (array1[i] > greatest1_h) greatest1_h = array1[i];

// for the measures of the time

float *time1 = (float *)malloc(30 * sizeof(float));
cudaEventCreate(&copy1);
cudaEventCreate(&endcopy1);
cudaEventCreate(&start1);
cudaEventCreate(&end1);
cudaEventCreate(&copyB1);
cudaEventCreate(&endcopyB1);

// copy data CPU -> GPU

int *input1, *output1;
cudaMalloc((void **)&input1, size1 * sizeof(int));
cudaMalloc((void **)&output1, sizeof(int));
cudaEventRecord(copy1);
cudaMemcpy(input1, array1, size1 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(output1, &init, sizeof(int), cudaMemcpyHostToDevice);
cudaEventRecord(endcopy1);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapcopy1, copy1, endcopy1);

// execution
for (int n = 0; n < 30; n++) {
cudaEventRecord(start1);
MaxTab<<<blocks, threads, ((threads / 32) * sizeof(int))>>>(input1, size1, output1);
cudaEventRecord(end1);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapsed1, start1, end1);
time1[n] = elapsed1;
}
6
// copy data GPU -> CPU
cudaEventRecord(copyB1);
cudaMemcpy(&greatest1, output1, sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(endcopyB1);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapcopyB1, copyB1, endcopyB1);

// release
cudaFree(input1);
cudaFree(output1);

/////////////////////////////////////////////////////
// 1 GiB //
/////////////////////////////////////////////////////
size_t size2 = 268435456; // 2^28

// init the table

int *array2;
cudaMallocHost((void **)&array2, size2 * sizeof(int));
for (size_t i = 0; i < size2; i++)
array2[i] = rand();
// host verification on the same data
for (size_t i = 0; i < size2; i++)
if (array2[i] > greatest2_h) greatest2_h = array2[i];

// for the measures of the time

float *time2 = (float *)malloc(30 * sizeof(float));
cudaEventCreate(&copy2);
cudaEventCreate(&endcopy2);
cudaEventCreate(&start2);
cudaEventCreate(&end2);
cudaEventCreate(&copyB2);
cudaEventCreate(&endcopyB2);

// copy data CPU -> GPU

int *input2, *output2;
cudaMalloc((void **)&input2, size2 * sizeof(int));
cudaMalloc((void **)&output2, sizeof(int));
cudaEventRecord(copy2);
cudaMemcpy(input2, array2, size2 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(output2, &init, sizeof(int), cudaMemcpyHostToDevice);
cudaEventRecord(endcopy2);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapcopy2, copy2, endcopy2);

// execution
for (int n = 0; n < 30; n++) {
cudaEventRecord(start2);
MaxTab<<<blocks, threads, ((threads / 32) * sizeof(int))>>>(input2, size2, output2);
cudaEventRecord(end2);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapsed2, start2, end2);
time2[n] = elapsed2;
}

// copy data GPU -> CPU

cudaEventRecord(copyB2);
cudaMemcpy(&greatest2, output2, sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(endcopyB2);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapcopyB2, copyB2, endcopyB2);

7
// release
cudaFree(input2);
cudaFree(output2);

/////////////////////////////////////////////////////
// 1.5 GiB //
/////////////////////////////////////////////////////
size_t size3 = 402653184; // 2^27 + 2^28

// init the table

int *array3;
cudaMallocHost((void **)&array3, size3 * sizeof(int));
for (size_t i = 0; i < size3; i++)
array3[i] = rand();
// host verification on the same data
for (size_t i = 0; i < size3; i++)
if (array3[i] > greatest3_h) greatest3_h = array3[i];

// for the measures of the time

float *time3 = (float *)malloc(30 * sizeof(float));
cudaEventCreate(&copy3);
cudaEventCreate(&endcopy3);
cudaEventCreate(&start3);
cudaEventCreate(&end3);
cudaEventCreate(&copyB3);
cudaEventCreate(&endcopyB3);

// copy data CPU -> GPU

int *input3, *output3;
cudaMalloc((void **)&input3, size3 * sizeof(int));
cudaMalloc((void **)&output3, sizeof(int));
cudaEventRecord(copy3);
cudaMemcpy(input3, array3, size3 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(output3, &init, sizeof(int), cudaMemcpyHostToDevice);
cudaEventRecord(endcopy3);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapcopy3, copy3, endcopy3);

// execution
for (int n = 0; n < 30; n++) {
cudaEventRecord(start3);
MaxTab<<<blocks, threads, ((threads / 32) * sizeof(int))>>>(input3, size3, output3);
cudaEventRecord(end3);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapsed3, start3, end3);
time3[n] = elapsed3;
}

// copy data GPU -> CPU

cudaEventRecord(copyB3);
cudaMemcpy(&greatest3, output3, sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(endcopyB3);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapcopyB3, copyB3, endcopyB3);

// release
cudaFree(input3);
cudaFree(output3);

/////////////////////////////////////////////////////
// RESULTS //
/////////////////////////////////////////////////////
// execute the serial function
8
serialFunction(array1, size1, array2, size2, array3, size3);

float time_min1 = time1[0];

float time_min2 = time2[0];
float time_min3 = time3[0];

for (int n = 0; n < 30; n++) {

// search the minimal time
if (time1[n] < time_min1)
time_min1 = time1[n];
if (time2[n] < time_min2)
time_min2 = time2[n];
if (time3[n] < time_min3)
time_min3 = time3[n];

// calculate the sums

sum1 += time1[n];
sum2 += time2[n];
sum3 += time3[n];
}

// calculate the averages

double average1 = sum1 / 30;
double average2 = sum2 / 30;
double average3 = sum3 / 30;

// calculate the variances

// calculate the standard deviations

float st_d1 = sqrt(variance1);
float st_d2 = sqrt(variance2);
float st_d3 = sqrt(variance3);

// print the results

printf("------------------ CUDA for size = 2^27 ------------------\n");
printf("Greatest : %i\n", greatest1);
printf("Verification (host max) : %s\n", (greatest1 == greatest1_h ? "PASS" : "FAIL"));
printf("Time copying data : %f ms\n", elapcopy1 + elapcopyB1);
printf("Minimum elapsed running time : %f ms\n", time_min1);
printf("Mean of the elapsed running times : %f ms\n", average1);
printf("Total (time during data copy + average of running) : %f ms\n", elapcopy1 + elapcopyB1 + average1);
printf("Standard deviation : %f ms\n", st_d1);

printf("------------------ CUDA for size = 2^28 ------------------\n");

printf("Greatest : %i\n", greatest2);
printf("Verification (host max) : %s\n", (greatest2 == greatest2_h ? "PASS" : "FAIL"));
printf("Time copying data : %f ms\n", elapcopy2 + elapcopyB2);
printf("Minimum elapsed running time : %f ms\n", time_min2);
printf("Mean of the elapsed running times : %f ms\n", average2);
printf("Total (time during data copy + average of running) : %f ms\n", elapcopy2 + elapcopyB2 + average2);
printf("Standard deviation : %f ms\n", st_d2);

printf("--------------- CUDA for size = 2^27 + 2^28 --------------\n");

printf("Greatest : %i\n", greatest3);
printf("Verification (host max) : %s\n", (greatest3 == greatest3_h ? "PASS" : "FAIL"));
9
printf("Time copying data : %f ms\n", elapcopy3 + elapcopyB3);
printf("Minimum elapsed running time : %f ms\n", time_min3);
printf("Mean of the elapsed running times : %f ms\n", average3);
printf("Total (time during data copy + average of running) : %f ms\n", elapcopy3 + elapcopyB3 + average3);
printf("Standard deviation : %f ms\n", st_d3);

printf("That's all !\n");

// release
cudaFreeHost(array1);
cudaFreeHost(array2);
cudaFreeHost(array3);
free(time1);
free(time2);
free(time3);
return 0;
}

Makefile

all: mainSerial mainParallel

mainSerial: mainSerial.c
gcc mainSerial.c -o mainSerial -lm

mainParallel: [Link]
# Include PTX for compute_50 to enable JIT on newer GPUs
nvcc -gencode arch=compute_50,code=sm_50 -gencode arch=compute_50,code=compute_50 [Link] -o
mainParallel -lm

run_serial: mainSerial
./mainSerial

run_parallel: mainParallel
./mainParallel

clean:
rm -f mainSerial mainParallel

# Build the report PDF

report:
cd report && latexmk -pdf -interaction=nonstopmode -halt-on-error [Link]

report-clean:
cd report && latexmk -C

# Convenience: list CUDA nodes (cluster helper)

list-nodes:
./scripts/list_cuda_hosts.sh || true

# Convenience: run on a chosen CUDA node via SSH (shared FS by default)

# Usage: make run-on-node NODE=<node> [MODE=parallel|serial] [COPY=0|1]
run-on-node:
@NODE="$(NODE)" MODE="$(MODE)" COPY="$(COPY)" bash -c '\
if [ -n "$$NODE" ]; then NARG="-n $$NODE"; else NARG=""; fi; \
./scripts/run_on_cuda_node.sh $$NARG $$(test "$(MODE)" = "serial" && echo --serial || echo --parallel) $$(test "$(COPY)"
= "1" && echo --copy || true) \'

Introduction to Parallel Programming
No ratings yet
Introduction to Parallel Programming
143 pages
Parallel Hardware and Software Overview
No ratings yet
Parallel Hardware and Software Overview
143 pages
Introduction to Parallel Computing Concepts
No ratings yet
Introduction to Parallel Computing Concepts
40 pages
Introduction to Parallel Programming Basics
No ratings yet
Introduction to Parallel Programming Basics
32 pages
Overview of Parallel Hardware Concepts
No ratings yet
Overview of Parallel Hardware Concepts
60 pages
Lecture2 ParallelSoftwareHarware
No ratings yet
Lecture2 ParallelSoftwareHarware
142 pages
Chapter 2
No ratings yet
Chapter 2
143 pages
Chapter - 2 - Parallel Hardware and Parallel Software
No ratings yet
Chapter - 2 - Parallel Hardware and Parallel Software
143 pages
RISC Architecture Overview and Features
100% (1)
RISC Architecture Overview and Features
38 pages
Introduction to HPC and Parallel Models
No ratings yet
Introduction to HPC and Parallel Models
76 pages
A Level Computer Science Hardware Guide
No ratings yet
A Level Computer Science Hardware Guide
48 pages
Fundamentals of Modern Processor Design
100% (1)
Fundamentals of Modern Processor Design
52 pages
Understanding Memory Hierarchies in Computers
No ratings yet
Understanding Memory Hierarchies in Computers
20 pages
Modern Processor and Memory Technology
No ratings yet
Modern Processor and Memory Technology
17 pages
Lecture 03
No ratings yet
Lecture 03
24 pages
Introduction to Computer Organization
No ratings yet
Introduction to Computer Organization
32 pages
Parallel 0 Distributed Computing
No ratings yet
Parallel 0 Distributed Computing
30 pages
CPU Architecture and Instruction Set Overview
No ratings yet
CPU Architecture and Instruction Set Overview
7 pages
Parallel Computing Fundamentals Explained
No ratings yet
Parallel Computing Fundamentals Explained
23 pages
Parallel vs Distributed Computing Explained
No ratings yet
Parallel vs Distributed Computing Explained
31 pages
Introduction to Parallel Programming
No ratings yet
Introduction to Parallel Programming
71 pages
IAS CPU and MIPS Rate Overview
No ratings yet
IAS CPU and MIPS Rate Overview
42 pages
Understanding Computer Architecture Basics
No ratings yet
Understanding Computer Architecture Basics
33 pages
Hardwired Control in Computer Design
No ratings yet
Hardwired Control in Computer Design
50 pages
Understanding Computer Architecture Basics
No ratings yet
Understanding Computer Architecture Basics
65 pages
Understanding Parallel Computing Concepts
No ratings yet
Understanding Parallel Computing Concepts
113 pages
Memory Management Techniques Explained
No ratings yet
Memory Management Techniques Explained
63 pages
Design Principles of Computer Parallelism
No ratings yet
Design Principles of Computer Parallelism
5 pages
Understanding CPU Pipelining Techniques
No ratings yet
Understanding CPU Pipelining Techniques
33 pages
Understanding Computer Architecture Basics
No ratings yet
Understanding Computer Architecture Basics
5 pages
Coa Lect 6 & 11
No ratings yet
Coa Lect 6 & 11
6 pages
Introduction to Parallel Programming
No ratings yet
Introduction to Parallel Programming
34 pages
Parallel Computer Architecture Overview
No ratings yet
Parallel Computer Architecture Overview
7 pages
Computer Architecture and Evolution Guide
No ratings yet
Computer Architecture and Evolution Guide
49 pages
Understanding Parallel Processor Architectures
No ratings yet
Understanding Parallel Processor Architectures
117 pages
RISC: Advancements in Computer Architecture
No ratings yet
RISC: Advancements in Computer Architecture
14 pages
RISC Architecture: Key Features and Advances
No ratings yet
RISC Architecture: Key Features and Advances
14 pages
Advanced Processor Technologies Overview
50% (2)
Advanced Processor Technologies Overview
73 pages
Parallel and Distributed Programming Overview
No ratings yet
Parallel and Distributed Programming Overview
6 pages
Computer Design: Hardware/Software Interface
100% (3)
Computer Design: Hardware/Software Interface
16 pages
Parallel Processing in Computer Systems
No ratings yet
Parallel Processing in Computer Systems
36 pages
Lecture 1
No ratings yet
Lecture 1
37 pages
Computer System Overview: 1 Spring 2015
No ratings yet
Computer System Overview: 1 Spring 2015
48 pages
Advanced Computer Architecture Overview
No ratings yet
Advanced Computer Architecture Overview
20 pages
Basic Structure of Computer Systems
No ratings yet
Basic Structure of Computer Systems
59 pages
Brainkart Notes
No ratings yet
Brainkart Notes
14 pages
Enhancing Computer Performance Techniques
100% (1)
Enhancing Computer Performance Techniques
36 pages
Computer System Overview and Architecture
No ratings yet
Computer System Overview and Architecture
48 pages
Computer Architecture Overview
No ratings yet
Computer Architecture Overview
28 pages
Introduction To COA
No ratings yet
Introduction To COA
20 pages
Parallel Processing in Computer Design
No ratings yet
Parallel Processing in Computer Design
43 pages
Understanding Computer Organization Basics
No ratings yet
Understanding Computer Organization Basics
77 pages
RISC vs CISC: Processor Comparison Guide
No ratings yet
RISC vs CISC: Processor Comparison Guide
30 pages
Types of Processor Architectures
No ratings yet
Types of Processor Architectures
8 pages
RISC Architecture Overview and Features
No ratings yet
RISC Architecture Overview and Features
33 pages
RISC vs. CISC: Key Computer Architecture Insights
No ratings yet
RISC vs. CISC: Key Computer Architecture Insights
38 pages
Coa Mod 1
No ratings yet
Coa Mod 1
19 pages
Fundamentals of Computer Architecture
No ratings yet
Fundamentals of Computer Architecture
43 pages
Introduction to Parallel Computing
100% (1)
Introduction to Parallel Computing
40 pages
Implementing Linear and Absolute Layouts
100% (2)
Implementing Linear and Absolute Layouts
10 pages
XBT f011110
100% (1)
XBT f011110
2 pages
VLSI Design Overview and Processes
100% (1)
VLSI Design Overview and Processes
89 pages
Introduction to kubectl and Workloads
No ratings yet
Introduction to kubectl and Workloads
107 pages
Location-Based Review App Proposal
No ratings yet
Location-Based Review App Proposal
9 pages
Library Management System Overview
No ratings yet
Library Management System Overview
9 pages
PHP Session and Curl Errors Explained
No ratings yet
PHP Session and Curl Errors Explained
5 pages
LTE Initial Attach Procedure Overview
No ratings yet
LTE Initial Attach Procedure Overview
2 pages
Cyber Security and Cryptography Module
No ratings yet
Cyber Security and Cryptography Module
13 pages
CPU Organization and Functional Units
No ratings yet
CPU Organization and Functional Units
14 pages
Affluent Global Services Overview
No ratings yet
Affluent Global Services Overview
16 pages
Student Contact Information List
No ratings yet
Student Contact Information List
14 pages
Wendler's 5/3/1 Spreadsheet v1.3
No ratings yet
Wendler's 5/3/1 Spreadsheet v1.3
44 pages
Upload Website to AWS EC2 Guide
100% (1)
Upload Website to AWS EC2 Guide
64 pages
Maxent F90 Library User Manual
No ratings yet
Maxent F90 Library User Manual
16 pages
Ro-Ghoul Auto Farm Script Guide
No ratings yet
Ro-Ghoul Auto Farm Script Guide
28 pages
Efficient Max-Min Filter Algorithm
No ratings yet
Efficient Max-Min Filter Algorithm
12 pages
Grocery Store Management System Report
No ratings yet
Grocery Store Management System Report
22 pages
JavaScript in HTML: Class 8 Answers
No ratings yet
JavaScript in HTML: Class 8 Answers
6 pages
10 Shell Scripts for Database Automation
No ratings yet
10 Shell Scripts for Database Automation
14 pages
Full Stack Developer Profile Summary
No ratings yet
Full Stack Developer Profile Summary
2 pages
NUST SEECS Graduate Profile 2025
No ratings yet
NUST SEECS Graduate Profile 2025
46 pages
CNC Machine Overview and Applications
No ratings yet
CNC Machine Overview and Applications
13 pages
SQL Query Solutions for Salesman and Customers
No ratings yet
SQL Query Solutions for Salesman and Customers
27 pages
Xstream Oem Rfmodule v5
No ratings yet
Xstream Oem Rfmodule v5
64 pages
Network Design and Administration Guide
No ratings yet
Network Design and Administration Guide
38 pages
844A Datasheet
No ratings yet
844A Datasheet
1 page
Trojan Attack Micro-Project Report
No ratings yet
Trojan Attack Micro-Project Report
17 pages
Credit Memo Request Approval Workflow
No ratings yet
Credit Memo Request Approval Workflow
30 pages

Cours Parallel Programming

Uploaded by

Cours Parallel Programming

Uploaded by

INF-3201 - Parallel Programming

processor = Central Processing Unit (CPU), it’s a physical component

The von Neumann architecture

Main memory = this is a collection of locations, each of which is capable of

Register = very fast storage, part of the CPU.

Program counter = stores address of the next instruction to be executed.

Process, multitasking and threading

Modifications to the von Neumann model

A collection of memory locations that can be accessed in less

The principle of locality is when accessing one location is

If we run a very large program or a program that accesses very large

Virtual page numbers

Instruction Level Parallelism (ILP)

Pipelining = functional units are arranged in stages.

Multiple issue = multiple instructions can be simultaneously initiated.

SISD → Single instruction stream Single data stream

MIMD → Multiple instruction stream Multiple data stream

Shared Memory System

UMA multicore system NUMA multicore system

● Distributed memory interconnects

𝑙𝑒𝑛𝑔𝑡ℎ 𝑜𝑓 𝑚𝑒𝑠𝑠𝑎𝑔𝑒 (𝑏𝑦𝑡𝑒𝑠)

Input and output

Parallel program design

Parallel hardware = Flynn’s taxonomy

Performance = speedup ; efficiency ; Amdahl’s law ; scalability

Parallel program design = Foster’s methodology

First glance of MPI

MPI for Message Passing Interface

MPI uses SPMD for Single Program Multiple Data

The basic outline

Communicators = a reference to processes that can communicate together

int MPI_Comm_size(MPI_Comm comm /*in*/, int* comm_sz_p /*out*/);

Safety in MPI programs

MPI_Send behave in two ways :

Tree-structured global sum → optimized Tree-structured global sum alternative

int MPI_Reduce(void* input_data_p, void* output_data_p /*out*/, int count, MPI_Datatype

If the result should be available to all the processes :

Steps to create a derived datatype :

To synchronize processes → use barriers

For use Pthread

2. It is possible to use a locker :

3. The use of semaphore :

4. The use of barrier :

What is a GPU ? (card, plug-in (typically))

Graphics, but lot of compute power

How do we program a GPU ?

CUDA = Compute Unified Device Architecture

Overview of CUDA architecture

The function to run is called a “kernel”.

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {

GPU Memory hierarchy

Synchronization can be expensive (ex : locks stalling threads).

Other options for GPU programming :

When you start optimising, an iterative process :

CUDA Memory Model

Timing using CUDA events

OpenCL platform model

OpenCL execution model

OpenCL to CUDA data parallelism model mapping :

Host program Host program

NDRange (index space) Grid

Work group Block

Work item Thread

Two-dimensional index space :

OpenCL API call Explanation CUDA equivalent

get_global_id(0) global index of the work item in the blockIdx.x * blockDim.x +

get_local_id(0) local index of the work item in the threadIdx.x

get_global_size(0) size of NDRange in the x dimension gridDim.x * blockDim.x

get_local_size(0) size of each work group in the x blockDim.x

OpenCL memory model

Private memory : read/write access for work item

Memory model : OpenCL vs CUDA

Lecture based on the paper :

Some optimization concerns

● What limits your speed ? CPU bound vs memory bound

What the platforms struggled with

int MPI_Comm_size(MPI_Comm comm /in/, int* comm_sz_p /out/);

int MPI_Reduce(void* input_data_p, void* output_data_p /out/, int count, MPI_Datatype

global void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {