Cours Parallel Programming
Cours Parallel Programming
Sequential programming = the instructions are executed one by one and an instruction is executed at any point in
time
Parallel programming = the problem is broken into smaller ones and the instructions are executed, for each part,
in parallel
In theory, over the years, the density of transistors in a processor increases. Like that, the power by transistors
decreases. So, we can imagine that the core frequency can increase without problem and the power density will
stay stable. But in practice, the physical limits are reached and the heat dissipation is insufficient : we hit the power
wall !
Scaling beyond the power wall, the chip (puce) density is increased : the number of cores per chip increases and
the chips performance also. Moreover, the power management is better. We need parallel programmation to use
the different cores of one processor.
Nowadays, most architectures are parallel because there is no more single core CPU but mainly clusters and
massively parallel processing.
1
Parallel Hardware and Parallel Software
Goal : to present the basics of parallel hardware and software, starting from the von Neumann model, to arrive at the
design and execution of parallel programs.
Some background
About the serial hardware and software, a computer runs one program at a time, with input/output through dedicated
programs. A program reads data, processes it, and produces output, all while managing a single stream of
instructions.
Central processing unit (CPU) = is divided into two parts : the control unit
which is responsible for deciding which instruction in a program should be
executed (the boss) and the arithmetic and logic unit (ALU) which is
responsible for executing the actual instructions (the worker).
Bus = wires and hardware that connects the CPU and memory.
But the von Neumann architecture has a bottleneck : the bus limits the transfer speed between the CPU and
memory.
An operating system “process” is an instance of a computer program that is being executed. The components of
a process are :
- the executable machine language program
- a block of memory
- descriptors of resources the OS has allocated to the process
- security information
- information about the state of the process
The multitasking gives the illusion that a single processor system is running multiple programs simultaneously. Each
process takes turns running (time slice). After its time is up, it waits until it has a turn again (blocks).
Concerning the threading, threads are contained within processes. They allow programmers to divide their programs
into (more or less) independent tasks. The hope is that when one thread blocks because it is waiting on a resource,
another will have work to do and can run.
1
A process and two threads :
Basics of caching
Levels of cache →
Cache hit = when the processor looks for data and it's already in the cache, it's called a hit. Access is therefore fast.
Cache miss = if the data isn't in the cache, it's a miss. The processor then has to retrieve it from main memory
(RAM), which is much slower.
But there are some issues with cache. When a CPU writes data to cache, the value in cache may be inconsistent
with the value in main memory.
Write-through caches handle this by updating the data in main memory at the time it is written to cache.
Write-back caches mark data in the cache as dirty, when the cache line is replaced by a new cache line from memory,
the dirty line is written to memory.
Cache mappings
Full associative = a new line can be placed at any location in the cache.
Direct mapped = each cache line has a unique location in the cache to which it will be assigned.
n-way set associative = each cache line can be placed in one of n different locations in the cache. When more than
one line in memory can be mapped to several different locations in cache we also need to be able to decide which
line should be replaced or evicted.
Virtual memory
Swap space = those parts that are idle are kept in a block of secondary
storage
2
Pages = blocks of data and instructions. Usually these are relatively large. Most systems have a fixed page size that
currently ranges from 4 to 16 kilobytes.
Virtual Address divided into Virtual Page Number and Byte Offset
Translation-lookaside buffer (TLB), using a page table, has the potential to significantly increase each program’s
overall run-time. It’s a special address translation cache in the processor. It caches a small number of entries (typically
16-512) from the page table in very fast memory.
Page fault = attempting to access a valid physical address for a page in the page table but the page is only stored
on disk
The ILP attempts to improve processor performance by having multiple processor components or functional units
simultaneously executing instructions.
Speculation
In order to make use of multiple issue, the system must find instructions that can be executed simultaneously. In
speculation, the compiler or the processor makes a guess about an instruction, and then executes the instruction on
the basis of the guess.
Hardware multithreading
There aren’t always good opportunities for simultaneous execution of different threads. Hardware multithreading
provides a means for systems to continue doing useful work when the task being currently executed has stalled.
Fine-grained = the processor switches between threads after each instruction, skipping threads that are stalled.
Pros (advantages) : potential to avoid wasted machine time due to stalls
Cons (disadvantages) : a thread that’s ready to execute a long sequence of instructions may have to wait to execute
every instruction
3
Coarse-grained = only switches threads that are stalled waiting for a time - consuming operation to complete.
Pros : switching threads doesn’t need to be nearly instantaneous
Cons : the processor can be idled on shorter stalls, and thread switching will also cause delays
Simultaneous multithreading (SMT) = a variation on fine-grained multithreading, allows multiple threads to make
use of the multiple functional units.
Parallel hardware
Flynn's Taxonomy is a classification of processor architectures. It describes how computer systems process
instructions and data, depending on whether they manipulate one or several instructions simultaneously. It has 4
categories.
SIMD → Single instruction stream Multiple data stream & MISD → Multiple instruction stream Single data
stream = Not covered
Parallelism achieved by dividing data among the processors. It applies the same instruction to multiple data items,
called data parallelism.
All ALUs are required to execute the same instruction, or remain idle. In classic design, they must also operate
synchronously. The ALUs have no instruction storage. It’s efficient for large data parallel problems, but not other
types of more complex parallel problems.
Vector processors operate on arrays or vectors of data while conventional CPU’s operate on individual data
elements or scalars. The vector registers are capable of storing a vector of operands and operating simultaneously
on their contents.
Vectorized and pipelined functional units apply the same operation to each element in the vector (or pairs of
elements).
Vector instructions operate on vectors rather than scalars.
Next, the idea is to use interleaved memory, multiple “banks” of memory, which can be accessed more or less
independently. Distributing elements of a vector across multiple banks, so it reduces or eliminates delay in
loading/storing successive elements.
The program accesses elements of a vector located at fixed intervals, it is strided memory access and hardware
scatter/gather.
Pros : fast, easy to use, vectorizing compilers are good at identifying code to exploit, compilers also can provide
information about code that cannot be vectorized (helps the programmer re-evaluate code), high memory bandwidth,
uses every item in a cache line
Cons : they don’t handle irregular data structures as well as other parallel architectures, a very finite limit to their
ability to handle ever larger problems (scalability)
The Graphics Processing Units (GPU) is used for graphics rendering via shaders (parallel functions).
Real time graphics application programming interfaces or API’s use points, lines, and triangles to internally represent
the surface of an object. A graphics processing pipeline converts the internal representation into an array of pixels
that can be sent to a computer screen. Several stages of this pipeline (called shader functions) are programmable,
typically just a few lines of C code. Shader functions are also implicitly parallel, since they can be applied to multiple
elements in the graphics stream. GPU’s can often optimize performance by using SIMD parallelism. The current
generation of GPU’s use SIMD parallelism, although they are not pure SIMD systems.
Time to access all the memory locations will be the A memory location a core is directly connected to can
same for all the cores. be accessed faster than a memory location that must
be accessed through another chip.
Clusters are the most popular distributed memory system. They are a collection of commodity systems connected
by a commodity interconnection network. Nodes of a cluster are individual computations units joined by a
communication network.
Interconnection networks affect performance of both distributed and shared memory systems.
Two categories :
● Shared memory interconnects
- Bus interconnect : a collection of parallel communication wires together with some hardware that
controls access to the bus. Communication wires are shared by the devices that are connected to it.
As the number of devices connected to the bus increases, contention for use of the bus increases,
and performance decreases.
- Switched interconnect : uses switches to control the routing of data among the connected devices.
Crossbar allows simultaneous communication among different devices. It is faster than buses but
the cost of the switches and links is relatively high.
In a fully connected network, each switch is directly connected to every other switch (but impractical).
A hypercube is a type of interconnect network used in parallel architectures. It is a highly connected direct
interconnect, which facilitates fast communication between processors.
Built inductively :
- a one-dimensional hypercube is a fully-connected system with two processors
- a two-dimensional hypercube is built from two one-dimensional hypercubes by joining “corresponding”
switches
- similarly a three-dimensional hypercube is built from two two-dimensional hypercubes
Crossbar and Omega network are simple examples of indirect interconnected networks. Often shown with
unidirectional links and a collection of processors, each of which has an outgoing and an incoming link, and a
switching network.
Any time data is transmitted, we are interested in how long it will take for the data to reach its destination.
Latency = the time that elapses between the source’s beginning to transmit the data and the destination’s starting
to receive the first byte.
Cache coherence
Programmers have no control over caches and when they get updated.
About the snooping cache coherence, the cores share a bus. Any signal transmitted on the bus can be “seen” by
all cores connected to the bus. When core 0 updates the copy of x stored in its cache it also broadcasts this
information across the bus. If core 1 is “snooping” the bus, it will be that x has been updated and it can mark its copy
of x as invalid.
Directory based cache coherence uses a data structure called a directory that stores the status of each cache
line. When a variable is updated, the directory is consulted, and the cache controllers of the cores that have that
variable’s cache line in their caches are invalidated.
Parallel software
The burden is on software. Hardware and compilers can keep up the pace needed. From now on in shared memory
programs, start a single process and fork threads and threads carry out tasks. In distributed memory programs, start
multiple processes and processes carry out tasks.
A single program multiple data (SPMD) program consists of a single executable that can behave as if it were
multiple different programs through the use of conditional branches.
6
To write a parallel program, the steps to follow are :
1. divide the work among the processes/threads, so each process/thread gets roughly the same amount of
work and communication is minimized
2. arrange for the processes/threads to synchronize
3. arrange for communication among processes/threads
Shared Memory
With dynamic threads, the master thread waits for work, forks new threads, and when threads are done, they
terminate. The use of resources is efficient, but thread creation and termination is time consuming.
With static threads, a pool of threads is created and are allocated work, but do not terminate until cleanup. The
performance is better, but potential waste of system resources.
Non-determinism
The exact order in which the threads execute may vary. Race conditions can occur when multiple threads access
the same data simultaneously, and at least one of them modifies it. Critical section is the part of the code that
accesses a shared resource and that should only be executed by one thread at a time. Mutually exclusive means
that only one thread can enter the critical section at any given time. Mutual exclusion lock (mutex, or simply lock)
exists to manage that.
Busy-waiting : active wait loop.
Message passing : explicit communication (e.g., MPI: Send/Receive).
Partitioned global address space languages use a shared global address space, but it is divided into parts among
the different threads/processes. Each variable is shared, but each thread has a local portion (fast access) and can
also access remote portions (slower access).
In distributed memory programs, only process 0 will access stdin. In shared memory programs, only the master
thread or thread 0 will access stdin.
In both distributed memory and shared memory programs all the processes/threads can access stdout and stderr.
However, because of the indeterminacy of the order of output to stdout, in most cases only a single process/thread
will be used for all output to stdout other than debugging output.
Debug output should always include the rank or id of the process/thread that’s generating the output.
Only a single process/thread will attempt to access any single file other than stdin, stdout, or stderr. So, for example,
each process/thread can open its own, private file for reading or writing, but no two processes/threads will open the
same file.
Performance
Number of cores = 𝑝
Serial run-time = 𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Parallel run-time = 𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙
𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Linear speedup = 𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙 =
𝑝
𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Speedup of a parallel program = 𝑆 =
𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙
𝑇
(𝑇 𝑠𝑒𝑟𝑖𝑎𝑙 )
𝑆 𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙 𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Efficiency of a parallel program = 𝐸 = = =
𝑝 𝑝 𝑝 ∗ 𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙
7
Speedup Efficiency
𝑇𝑠𝑒𝑟𝑖𝑎𝑙
Effect of overhead = 𝑇𝑝𝑎𝑟𝑎𝑙𝑙𝑒𝑙 =
𝑝 + 𝑇𝑜𝑣𝑒𝑟ℎ𝑒𝑎𝑑
The Amdahl’s Law is : unless virtually all of a serial program is parallelized, the possible speedup is going to be very
limited - regardless of the number of cores available.
Scalability
In general, a problem is scalable if it can handle (gérer) ever increasing problem sizes.
If we increase the number of processes/threads and keep the efficiency fixed without increasing problem size, the
problem is strongly scalable.
If we keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of
processes/threads, the problem is weakly scalable.
Taking timings : CPU time vs. wall clock; functions like MPI_Wtime or omp_get_wtime.
Foster’s methodology
1. Partitioning : divide the computation to be performed and the data operated on by the computation into
small tasks. The focus here should be on identifying tasks that can be executed in parallel.
2. Communication : determine what communication needs to be carried out among the tasks identified in the
previous step.
3. Agglomeration or aggregation : combine tasks and communications identified in the first step into larger
tasks. For example, if task A must be executed before task B can be executed, it may make sense to
aggregate them into a single composite task.
4. Mapping : assign the composite tasks identified in the previous step to processes/threads. This should be
done so that communication is minimized, and each process/thread gets roughly the same amount of work.
Concluding Remarks
Serial systems = the standard model of computer hardware has been the von Neumann architecture
Parallel software = we focus on software for homogeneous MIMD systems, consisting of a single program that
obtains parallelism by branching ; SPMD programs
Input and output = we’ll write programs in which one process or thread can access stdin, and all processes can
access stdout and stderr ; however, because of nondeterminism, except for debug output we’ll usually have a single
process or thread accessing stdout
8
MPI
We have a vector of values 𝑋 = [𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ]. For each value 𝑥𝑖 , we apply a computationally expensive function
𝑓(𝑥𝑖 ) which takes approximately one day to compute = heavy computations. On a single machine, this would take n
days in total, because the computations are performed sequentially : 𝑡 = 𝑡𝑓(𝑥1 )+ . . . + 𝑡𝑓(𝑥𝑛 ) ≈ 𝑛 𝑑𝑎𝑦𝑠
This time can be improved because the computations are independent of each other, so they can be distributed
across several machines. We can then introduce MPI to show how to leverage multiple machines to significantly
reduce execution time.
An introduction to MPI
MPI is not a program, it is a specification ! a common language, a standard that defines how multiple processes
(on one or more machines) can communicate to perform a parallel computation.
Different implementations exist, these are programs who realise this standard :
- MPICH : one of the firsts, often used as an academic reference
- OpenMPI : very widespread, used in production on many clusters
- and more : for example Intel MPI, Microsoft MPI, etc.
The parallelism offered by MPI is not limited to computer science ; parallel applications are used in other fields such
as physics, biology, and mathematics.
MPI is primarily designed for distributed memory systems, where each process has its own memory and messages
are exchanged.
To identify the MPI Processes, a common practice is to attribute non-negative integers called ranks. So for p
processes, we have 0, 1, …, p-1.
Input/Output in MPI
Concerning the output, each process just prints a message. If there are several processes, the order of the output
is unpredictable.
About the inputs, in most MPI implementations, only process 0 in MPI_COMM_WORLD access to stdin (lecture de
l’entrée standard). Process 0 reads the data (scanf) and sends the data to the other process.
1
Compilation Execution
MPI is written in C, uses the imports like stdio.h, string.h, etc. and needs to add the mpi.h header file.
MPI identifiers start with “MPI_” and the first letter following underscore is uppercase for function names and types
and to avoid confusion.
MPI Components :
● MPI_Init : tell MPI to setup → int MPI_Init(int* argc_p, char*** argv_p);
● MPI_Finalize : tell MPI to cleanup → int MPI_Finalize(void);
Point-to-point communications
MPI_Init creates one for us ! Called MPI_COMM_WORLD and it contains in all the processes (si on crée p processus
(my_rank_p = my rank = the process making this call), le communicateur regroupe les p rangs (comm_sz_p = nb of
processes in the communicator)).
int MPI_Send(void* msg_buf_p, int msg_size, MPI_Datatype msg_type, int dest, int tag,
MPI_Comm communicator);
int MPI_Recv(void* msg_buf_p, int buf_size, MPI_Datatype buf_type, int source, int
tag, MPI_Comm communicator, MPI_Status** status_p /*out*/);
We can see message matching between both MPI_Send and MPI_Recv.
A receiver can receive a message without knowing the message size, the sender → MPI_ANY_SOURCE or the tag
→ MPI_ANY_TAG.
In the function MPI_Recv, the last argument “status_p” has the type MPI_Status*. It’s mean MPI_SOURCE,
MPI_TAG or MPI_ERROR → for MPI_Status* status; status.MPI_SOURCE / status.MPI_TAG.
2
For the amount of data : int MPI_Get_count(MPI_Status* status_p, MPI_Datatype type, int*
count_p /*out*/);
MPI_Recv is always block (appel bloquant) and MPI_Send behaves differently according to buffer size (bloquant ou
non, cutoffs/blocking) depending on the implementation ! The solution is to know your implementation !
If every process does a MPI_Send, the program will hang or deadlock since MPI_Recv is not reached. Each
process is blocked waiting for an event that will never happen.
A program is unsafe if it relies on MPI buffering to work. It works for various inputs and hangs for others. Program
unsafe if correct behavior depends on buffering of MPI_Send.
To check if a program is safe, use MPI_Ssend instead, “s” = synchronous, block until a matching MPI_Recv :
int MPI_Ssend(void* msg_buf_p, int msg_size, MPI_Datatype msg_type, int dest, int tag,
MPI_Comm communicator);
To make a program safe, using MPI_Sendrecv, scheduling handled by MPI, blocking send + receive, dest and source
can be equal :
int MPI_Sendrecv(void* send_buf_p, int send_buf_size, MPI_Datatype send_buf_type, int
dest, int send_tag, void* recv_buf_p /*out*/, int recv_buf_size, MPI_Datatype
recv_buf_type, int source, int recv_tag, MPI_Comm communicator, MPI_Status* status_p);
Collective communications
3
Collective communications involve ALL processes of a communicator.
In collective communications, ALL processes MUST call the SAME collective function. All arguments must BE
COMPATIBLE.
output_data_p only used on dest_process. All of the processes still need to pass in an actual argument
corresponding to output_data_p even if NULL.
The Point-To-Point (P2P) communications are matched using communicators and tags. The collective
communications don't use tags, they are matched using communicators and call order ! → All processes use the
same collective call order.
Broadcast = one process send its data to all the others in a communicator
int MPI_Bcast(void* data_p /*in/out*/, int count, MPI_Datatype datatype, int
source_proc, MPI_Comm comm);
Data distribution
● Block partitioning = assign blocks of consecutive components to each process
● Cyclic partitioning = assign components in a round robin fashion
● Block-cyclic partitioning = use a cyclic distribution of blocks of components
MPI_Scatter allows to read an entire vector on a process and send the required components to each process:
int MPI_Scatter(void* send_buf_p, int send_count, MPI_Datatype send_type, void*
recv_buf_p /*out*/, int recv_count, MPI_Datatype recv_type, int src_proc, MPI_Comm
comm);
MPI_Gather allows from one process to collect data of all the other processes :
4
int MPI_Gather(void* send_buf_p, int send_count, MPI_Datatype send_type, void*
recv_buf_p /*out*/, int recv_count, MPI_Datatype recv_type, int dest_proc, MPI_Comm
comm);
MPI_Allgather allows to collect data from all the processes on all the processes :
int MPI_Allgather(void* send_buf_p, int send_count, MPI_Datatype send_type, void*
recv_buf_p /*out*/, int recv_count, MPI_Datatype recv_type, MPI_Comm comm);
Allgather concatenates the content of send_buf_p of each process and stores it in each process recv_buf_p.
recv_count is the amount of data being received from each process.
Derived datatypes
The derived datatypes allow to represent any collection of data items in memory by storing the types of the
items and their relative locations in memory. They also allow MPI communication functions to handle custom
user types properly. They work for both send and receive cases.
Derived Datatypes = sequence of basics MPI types
MPI_Type_create_struct builds a derived datatype that consists of individual elements that have different basic
types :
int MPI_Type_create_struct(int count, int array_of_blocklenghts[], MPI_Aint
array_of_displacements[], MPI_Datatype array_of_types[], MPI_Datatype* new_type_p
/*out*/);
To compute displacement, MPI_Get_address returns the address of the memory location referenced by location_p:
int MPI_Get_address(void* location_p, MPI_Aint* address_p /*out*/);
MPI_Aint is a big enough type to store addresses.
Performance evaluation
Elapsed Parallel Time, to performance evaluation, we can return the number of seconds that have elapsed since
some time in the past. To measure performance, we use wall clock time.
double MPI_Wtime(void);
double start, finish;
...
start = MPI_Wtime();
/* Code to be timed */
...
finish = MPI_Wtime();
printf(“Proc %d > Elapsed time = %e seconds\n”, my_rank, finish-start);
5
Pthread
A thread is a schedulable set of instructions, created by a process and shared virtual memory.
Pthread is a library for creating and managing threads. Pthread for POSIX thread ← interoperability
Thread overview
Usually, each thread has its local variables in a stack so, if there are several stack, each thread has its own local
variables in its stack.
It is the main thread who calls other threads with the function pthread_create().
The compilation of a file with Pthread needs -lpthread in the CFLAGS of the makefile. Like -lm for the math functions.
The order of the output when there are several threads is unpredictable.
To define the task(s) of a thread, create a function with the set of instructions of the thread and add this when
pthread_create() is called (like an argument).
Several threads can execute the same set of instructions but beware about the shared variables (the global
variables). Their modifications need to be in a critical section. Different solutions exist :
1. The solution with a flag / boolean does not work because the threads can check the value of the flag at the
same time.
1
OpenMP
OpenMP is a programming interface with a set of directives, functions and clauses that are added to code to easily
write parallel programs. It works on machines where all threads share the same memory space (shared memory)
and it is threads-based.
Fork-join model
Pros :
● Easier synchronization
● Less error prone
● Built-in load balancer
● Portability
● Faster development
Cons :
● Limited control (compared to PThread)
Using OpenMP
OpenMP is implemented by the compiler (e.g gcc which recognizes and activates the directives).
It can be used directly in C or C++ code (and also Fortran).The API functions are OpenMP functions for controlling
or querying threads and parallelism is enabled via pragma directives in the code.
It is enabled using -fopenmp with gcc.
It uses an additional shared library : omp.h and lib<something>.so.
1
CUDA introduction
What is an accelerator ?
An accelerator is used typically for a special (or limited) purpose. It is very efficient at what it does, less so for generic
programs and it is often a plug-in card to a computer.
A Multicore CPU is focused on single-threaded A GPU (Graphics Processing Unit) is an add-on card
performance and general purpose computations. Each focused on graphics and specialized computations.
core uses techniques to optimize its thread(s). More general purpose programming on host CPU. The
Complicated cores consume chip area (branch computations are simple and can easily be parallelised.
prediction, control, caches, …) → the number of cores There are many simpler cores and the aggregated
on a chip is limited. performance is more important than single thread
performance.
So graphics, but also a lot of compute power. GPUs are cheap, powerful and ubiquitous.
Before CUDA, it was necessary to transform the problem into a graphics problem to use the GPU, but NVIDIA created
CUDA to allow the GPU to be used directly as a parallel computing engine.
CUDA is a program for more general purpose tasks and specifies a generic GPU
architecture and a programming model.
CUDA has developed over time with the support for more features added over time
(like Tensor cores). The registers, cache sizes ++ change also over time. It is
important for optimization.
1
Introduction to CUDA programming = Programming the GPU
exemple : C program with “device code” that runs on the GPU and “host code” that runs on the CPU
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C) {
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main() {
...
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
...
}
The key idea is to run a lot of tiny threads : Single Program Multiple Data Model
(SPMD). But how to organize and start threads ? With a hierarchical model.
- The threads are queued as a grid of thread blocks : use 1-, 2-, or 3-
dimensional grids of blocks, and map well to vector and matrix problems.
- A block is a set of threads : blocks are scheduled / assigned to Symmetric
Multiprocessors (Sms). Runs to completion on that SM. Can also be 1-,
or 3-dimensional.
- Organization is specified using the kernel start syntax :
kernel<<<numBlocks, threadsPerBlock>>>(parms)
A unique identifier or position for a thread can be computed for a thread using
“magic” variables. Typically used to identify which part of the problem this thread
should work on.
The scalability is transparent. Hardware is free to schedule thread blocks on any processor. The programmer can
make no assumption on the execution order of thread blocks. Concerning digression, SMT and GPU threads are
2
scheduled by hardware, not by an operating system : lower overhead and software still controls which threads the
hardware can schedule.
To summarize, we start a grid of blocks on a GPU and each block has a number of threads.
The threads are executed in lock-step. That saves hardware resources compared to individual execution of threads.
Threads inside a block are grouped into “warps” for execution. For each cycle, the scheduler picks a warp that is
ready to run and issues 1 (or 2) instructions for that warp.
Warning about SIMT execution model (Single Instruction Multiple Threads is similar to SIMD model). SM executes
warps (32 threads).
There is one problem, divergent branches can cause performance problems. For example : in this execution, threads
with even threadId execute A, threads with odd threadId idle, then, the threads with odd threadId execute B, and
threads with even threadId idle.
if threadId is even :
do a
else :
do b
GPUs prior to Volta could deadlock if divergent threads in a warp tried to signal or synchronize using lock etc.
Any blocks with deadlock if divergent threads use barriers (like syncthreads()) / if some of the threads never call the
barrier.
3
The threads inside a block are synchronized and __syncthreads() works as a barrier inside the block (and
__syncwarp() for in a warp).
4
CUDA optimization
The first step of optimization is “DON’T DO IT”. First, you need to focus on getting a simple implementation to
work. You will run into many problems if you start trying to design an optimal solution before you know the problem.
Sometimes, the simple implementation is good enough and will be simpler to maintain. Make sure you understand
the algorithm, data flow and system architecture. Then, we can optimise.
The thread execution goal is to keep cores busy. The main reasons for stalling threads are memory access,
synchronization and any other shared resource (ex : limited registers / shared memory → fewer threads per block).
Occupancy (keeping cores busy) = is the ratio of the number of active warps per multiprocessor to the maximum
number of possible active warps
The kernel code can be compiled with --ptxas-options=-v (or --resource-usage) to determine the memory usage.
Dealing with memory concerns memory bottlenecks and memory capacity.
A warning about local memory, the local memory is automatically used by the compiler for thread local variables
when it runs out of registers ; even off-chip has the same speed as the global memory. The compiler can reveal this
(see best practices guide).
NB : trem : register spill
may be cached (depending on architecture)
The global memory access optimization resides in device memory, accessed via 32-, 64- and 128- byte aligned
memory transactions (aligned to size).
Compute Capability (CC) 2.x : concurrent memory access from threads coalesced into N cache line operations
- N = number of cache lines or transactions necessary for coalesced / combined memory accesses
5
- L1 has 128-byte lines, L2 has 32-byte segments
Compute Capability 3.x :
- global memory only cached in L2, L1 reserved for local memory
- Some exceptions (CC 3.5, 3.7, 5.2 → different generations)
The effects of misaligned memory access result in more transactions and therefore a significant drop in GPU
performance and the effects of strided (spaced) memory accesses prevent coalescence, causing more transactions
and a sharp drop in GPU performance.
A solution is optimizing host-device memory transfers : avoid transfers if possible (keep data on the device
between kernel calls) and consider overlapping transfers and computation.
CUDA calls are entered into a stream and queued for running by the run time system and are executed in order in a
stream. Some calls are blocking and will wait for all previous (ex : copy from device to host). To measure a kernel,
use cudaEventRecord.
6
OpenCL
What is OpenCL ?
OpenCL, for Open Computing Language, is an open industry standard for programming a heterogeneous collection
of CPUs, GPUs and other processors (AMD, Apple, IBM, Intel, Nvidia, ...).
The use of OpenCL allows portability because the same code can work on multiple architectures (CPU, GPU, Cell
BE, Xeon Phi, …). It is an efficient parallel programming model because it is a data- and task- parallel computing
model and it abstracts the specifics of underlying hardware.
A host connected to one or more OpenCL devices. An OpenCL device consists of compute units (CU). A compute
unit consists of processing elements (PE). PE executes code as SIMD or SPMD.
Kernel Kernel
1
Mapping OpenCL dimensions and indices to CUDA :
The data parallel programming model is the primary model driving the design of OpenCL. It applies a sequence of
instructions to multiple elements of a memory object. Typically, one-to-one mapping between work-item (thread in
CUDA) and the element.
The task parallel programming model executes a kernel on a compute unit (multiprocessor) with a work-group
(threadblock) containing a single work-item (thread). The parallelism is expressed by enqueuing multiple tasks.
3
GPU vs CPU - really 100-1000x ?
“Myths” vs the authors’ results, the published and otherwise reported results are :
- GPUs provide speedups of 100-100x compared to CPUs
Their experiments showed that the average performance advantage is 2.5x but sometimes, the CPU was faster.
Basic observations
Older work presented fantastic speedups, but some comparisons were made with unoptimized or even single-
threaded CPU code or were also made with mobile CPUs. Unbalanced comparisons may create unrealistic
expectations of performance.
About the paper’s experimental evaluation, all is optimized to maximize performance for both : CPU vs GPU. In this
way, 14 kernels /benchmarks covering common computations for high performance computing.
The GPU still came out ahead in most cases. Not as much as previous papers reported. In two cases, the CPU came
out ahead.
Why ?
Looking at theoretical performance numbers shows us that a large performance gap (100x - 1000x) indicates
inefficient use of the CPU.
Similarly, efficient use of both platforms should still give the GPU an edge based on the raw performance numbers.
GPUs encounter difficulties with the synchronization. Synchronizing many cores efficiently is a difficult problem, it’s
easier with a few cores. Also, multiple passes over datasets that were too large to fit in registers or shared memory.
For the CPUs, it is about the kernels that are truly memory or CPU bound (the GPU has higher memory BW (bande
passante) and more FLOPS). Also, the CPUs do not have native gather/scatter operations (non-contiguous).
Some problems are hard to optimize. Implementations easily end up using the CPU or GPU inefficiently compared
to the other platform. Inefficient use may still provide an improvement.
2
Communicating Sequential Processes (CSP)
This course focuses on Communicating Sequential Processes (CSP), a mathematical theory of concurrency
developed by Tony Hoare in 1978. It explains how to build multithreaded applications without common concurrency
problems (such as deadlocks or race conditions) by using sequential processes that communicate via channels. The
course emphasizes the simplicity, compositionality, and verifiability of concurrent programs. It uses examples in
PyCSP (a Python implementation of CSP) and compares it with other languages such as Go.
Multithreading ? “If you can get away with it, avoid using threads. Threads can be difficult to use, and they make
programs harder to debug. In general, they just aren’t necessary for strictly GUI work, such as updating component
properties.”
Object orientation programming can lead to unpredictable shared states (e.g., a value that changes unexpectedly
between executions). CSP avoids this by prohibiting global state sharing between processes.
CSP
Like this, CSP is the solution. “Communicating Sequential Processes (CSP) is a precise mathematical theory
of concurrency that can be used to build multithreaded applications that are guaranteed to be free of the
common problems of concurrency and (perhaps more importantly) can be proven to be so.”
Observations :
● The action of the assignment is familiar and well understood. Any change of internal state of a machine
executing a program can be modeled as an assignment.
● Operations of input and output, affecting the external environment, are not nearly so well understood.
“This paper (Tony Hoare, 1978, CSP paper) suggests that input and output are basic primitives of programming and
that parallel composition of communicating sequential processes is a fundamental program structuring method. When
combined with a development of Dijkstra’s guarded command, these concepts are surprisingly versatile.”
Concurrency : create the illusion of simultaneous execution of multiple tasks at once and manage the execution of
these tasks.
● Interleaving on a single core
● Can also compute correctly in parallel
● Correctness, efficiency
Processes
In pure CSP, processes don’t share state (no global variables). Each process is a small sequential program.
Typical granularity : expression or function
They are compositional (can be combined). (cf. figure for Parallel)
1
In PyCSP :
@process
def foo():
pass
Channels
The channels are synchronous communication mechanisms between processes. The reads and writes are
synchronous : a process writes to B (B!v) and another read from A (A?v). There is no implicit buffering. The idea is
similar to assigning across processes.
In PyCSP :
@process
def foo1(cin, cout):
cout(42)
while 1:
v = cin()
cout(v+1)
@process
def foo2(cin, cout):
while 1:
v = cin()
print(“Got”, v)
cout(v+1)
A = Channel()
B = Channel()
Parallel(foo1([Link], [Link]), foo2([Link], [Link]))
Parallel
Each process in a Parallel() construct runs concurrently with the others. The multiple processes are executed
simultaneously. It also exists in Sequence() for sequential execution.
In PyCSP :
@process
def foo():
pass
@process
def bar():
Parallel(foo(), foo())
Alternative (Alt)
In CSP theory, you will encounter the term “external choice”. The idea is that this mechanism will wait for any event
and pick one if more than one is ready. Thus, it allows you to choose from several ready channels (external choice).
This is useful for waiting for one event among several.
2
In PyCSP :
@process
def read_any(A, B, C):
alt = Alternative(A, B, C)
while 1:
guard, ret = [Link]()
print(“Got value”, ret)
A = Channel()
B = Channel()
C = Channel()
Parallel(read_any([Link], [Link], [Link]), ...)
Guards
In PyCSP :
@process
def read_any(A, B, C):
timeout = Timer(10)
alt = Alternative(A, B, C, timeout)
while 1:
guard, ret = [Link]()
if guard == timeout:
print(“Timed out”)
else:
print(“Got value”, ret)
A = Channel()
B = Channel()
C = Channel()
Parallel(read_any([Link], [Link], [Link]), ...)
Some implementations
CSP can be used in many languages thanks to libraries and implementations like : JCSP (JAVA library for CSP),
C++CSP, PyCSP, etc.
There exists CSP-based languages like Go, Rust or actors and actor-based systems and languages.
3
💻 Assignment 1
mainSequential.c
lock_init(seqLength);
mainParallel.c
lock_init(seqLength);
1
double start_time = MPI_Wtime();
int unlock = 0;
int unlock_global = 0;
int attempt[seqLength];
if (lock_open()) {
unlock = 1;
double finish_time = MPI_Wtime();
printf("Wow it works!\n");
if (unlock_global) {
// printf("PID%d stopped\n", world_rank);
break;
}
}
MPI_Finalize();
return 0;
}
mainMPI.c
lock_init(seqLength);
double start_time = MPI_Wtime();
int unlock = 0;
int unlock_global = 0;
int attempt[seqLength];
if (lock_open()) {
//unlock = 1;
unlock_global = 1;
double finish_time = MPI_Wtime();
printf("Wow it works!\n");
if (unlock_global) {
// printf("PID%d stopped\n", world_rank);
break;
}
}
MPI_Finalize();
return 0;
}
makefile
CC=gcc
MPICC=$(shell which mpicc)
MPIRUN=$(shell which mpirun)
run_local_parallel: mainParallel
$(MPIRUN) -np 4 ./mainParallel 3
run_local: mainMPI
$(MPIRUN) -np 4 ./mainMPI 3 # Launch 4 MPI processes locally to brut force the lock with a sequence of 3 numbers
clean:
@rm mainSequential mainMPI mainParallel
.PHONY: clean
4
💻 Assignment 2
main.c → serial
// first matrice
int filterX[3][3] = {
{-1, 0, 1},
{-2, 0, 2},
{-1, 0, 1}};
// second matrice
int filterY[3][3] = {
{1, 2, 1},
{0, 0, 0},
{-1, -2, -1}};
1
// convolution
Gx += [Link] * filterX[j + 1][i + 1];
Gy += [Link] * filterY[j + 1][i + 1];
}
}
// final value
int G = (int)(sqrt(Gx * Gx + Gy * Gy));
if (G > 255)
G = 255;
if (G < 0)
G = 0;
// matrice
int kernel[3][3] = {
{-2, -1, 0},
{-1, 1, 1},
{0, 1, 2}};
// convolution
G += [Link] * kernel[j + 1][i + 1];
}
}
if (G > 255)
G = 255;
if (G < 0)
G = 0;
// Init memory
RGB *sobel = malloc(sizeof(RGB) * width * height);
RGB *emboss = malloc(sizeof(RGB) * width * height);
// Apply filters
double sobel_start = get_time();
ApplySobel(sobel, width, height);
double sobel_time = get_time() - sobel_start;
if (csv) {
if (!csv_exists) {
fprintf(csv, "version,image,width,height,ranks,threads,load_gray,distribute,compute,gather,save,total\n");
}
// For serial: ranks=1, threads=1 (no OpenMP here), distribute=gather=0
fprintf(csv, "serial,%s,%d,%d,%d,%d,%.6f,%.6f,%.6f,%.6f,%.6f,%.6f\n",
marguerite, width, height, 1, 1,
load_time, 0.0, computation_time, 0.0, save_time, total_time);
fclose(csv);
}
// Free memory
free(sobel);
free(emboss);
return (0);
}
main.c → parallel
// first matrice
int filterX[3][3] = {
{-1, 0, 1},
{-2, 0, 2},
{-1, 0, 1}};
// second matrice
int filterY[3][3] = {
4
{1, 2, 1},
{0, 0, 0},
{-1, -2, -1}};
// convolution
Gx += [Link] * filterX[j + 1][i + 1];
Gy += [Link] * filterY[j + 1][i + 1];
}
}
// final value
int G = (int)(sqrt(Gx * Gx + Gy * Gy));
if (G > 255)
G = 255;
if (G < 0)
G = 0;
// matrice
int kernel[3][3] = {
{-2, -1, 0},
{-1, 1, 1},
{0, 1, 2}};
5
// convolution
G += [Link] * kernel[j + 1][i + 1];
}
}
if (G > 255)
G = 255;
if (G < 0)
G = 0;
// Init memory
sobel = malloc(sizeof(RGB) * width * height);
emboss = malloc(sizeof(RGB) * width * height);
// Load images
LoadRegion(marguerite, 0, 0, width, height, sobel);
LoadRegion(marguerite, 0, 0, width, height, emboss);
// Apply filters
double t0_comp = MPI_Wtime();
ApplySobel(input_sobel, width, end - start);
ApplyEmboss(input_emboss, width, end - start);
double t1_comp = MPI_Wtime();
7
for (int rank = 1; rank < world_size; rank++) {
// interior (non-halo) region for this rank
int baseStart = rank * nb_tiles;
int baseEnd = (rank == world_size - 1) ? height : baseStart + nb_tiles;
int interiorRows = baseEnd - baseStart; // rows sent back by worker
// Save images
double t0_save = MPI_Wtime();
CreateBMP(outsobel, width, height);
WriteRegion(outsobel, 0, 0, width, height, sobel);
CreateBMP(outemboss, width, height);
WriteRegion(outemboss, 0, 0, width, height, emboss);
double t1_save = MPI_Wtime();
double t1_total = MPI_Wtime();
double t_dist = (t1_dist - t0_dist);
double t_comp = (t1_comp - t0_comp);
double t_gath = (t0_save - t0_gath);
double t_save = (t1_save - t0_save);
double t_tot = (t1_total - t0_total);
printf("Distribute time: %.6f s\n", t_dist);
printf("Compute time: %.6f s\n", t_comp);
printf("Gather time: %.6f s\n", t_gath);
printf("Save time: %.6f s\n", t_save);
printf("Total time: %.6f s\n", t_tot);
// CSV logging
const char *csv_path = "../../doc/[Link]";
FILE *csv = fopen(csv_path, "r");
int csv_exists = (csv != NULL);
if (csv) fclose(csv);
csv = fopen(csv_path, "a");
if (csv) {
if (!csv_exists) {
fprintf(csv, "version,image,width,height,ranks,threads,load_gray,distribute,compute,gather,save,total\n");
}
fprintf(csv, "parallel,%s,%d,%d,%d,%d,%.6f,%.6f,%.6f,%.6f,%.6f,%.6f\n",
marguerite, width, height, world_size, omp_threads,
0.0, t_dist, t_comp, t_gath, t_save, t_tot);
fclose(csv);
}
}
else {
// Send only the interior (non-halo) rows back to rank 0
MPI_Send(input_sobel + localstart * width, (localend - localstart) * width * sizeof(RGB), MPI_BYTE, 0, 1,
MPI_COMM_WORLD);
MPI_Send(input_emboss + localstart * width, (localend - localstart) * width * sizeof(RGB), MPI_BYTE, 0, 1,
MPI_COMM_WORLD);
}
// Free memory
free(input_sobel);
free(input_emboss);
if (world_rank == 0) {
free(sobel);
free(emboss);
8
}
MPI_Finalize();
return (0);
}
main.c → optimized
// first matrice
int filterX[3][3] = {
{-1, 0, 1},
{-2, 0, 2},
{-1, 0, 1}};
// second matrice
int filterY[3][3] = {
{1, 2, 1},
{0, 0, 0},
{-1, -2, -1}};
9
// convolution
Gx += [Link] * filterX[j + 1][i + 1];
Gy += [Link] * filterY[j + 1][i + 1];
}
}
// final value
int G = (int)(sqrt(Gx * Gx + Gy * Gy));
if (G > 255)
G = 255;
if (G < 0)
G = 0;
// matrice
int kernel[3][3] = {
{-2, -1, 0},
{-1, 1, 1},
{0, 1, 2}};
// convolution
G += [Link] * kernel[j + 1][i + 1];
}
}
if (G > 255)
G = 255;
if (G < 0)
G = 0;
10
int main(int argc, char **argv) {
omp_set_num_threads(4);
const char *img_env = getenv("IMG");
const char *marguerite = (argc > 1 && argv[1] && argv[1][0]) ? argv[1] : (img_env && img_env[0] ? img_env :
"../../images/[Link]");
const char *outsobel = "[Link]";
const char *outemboss = "[Link]";
// Init memory
sobel = malloc(sizeof(RGB) * width * height);
emboss = malloc(sizeof(RGB) * width * height);
// Load images
LoadRegion(marguerite, 0, 0, width, height, sobel);
LoadRegion(marguerite, 0, 0, width, height, emboss);
// for the amount of data and the offset to send at each process
int *data_sobel = malloc(world_size * sizeof(int));
int *offset_sobel = malloc(world_size * sizeof(int));
int *data_emboss = malloc(world_size * sizeof(int));
int *offset_emboss = malloc(world_size * sizeof(int));
int start = 0;
int end = 0;
if (rank == world_rank) {
start = Rstart;
end = Rend;
}
}
// use Scatterv and Gatherv because each process not receive the same amount of data (maybe less for the last one)
// Apply filters
double t0_comp = MPI_Wtime();
ApplySobel(input_sobel, width, end - start);
ApplyEmboss(input_emboss, width, end - start);
double t1_comp = MPI_Wtime();
if (world_rank == 0) {
// Save images
double t0_save = MPI_Wtime();
CreateBMP(outsobel, width, height);
WriteRegion(outsobel, 0, 0, width, height, sobel);
CreateBMP(outemboss, width, height);
WriteRegion(outemboss, 0, 0, width, height, emboss);
double t1_save = MPI_Wtime();
double t1_total = MPI_Wtime();
double t_dist = (t1_dist - t0_dist);
double t_comp = (t1_comp - t0_comp);
double t_gath = (t1_gath - t0_gath);
double t_save = (t1_save - t0_save);
double t_tot = (t1_total - t0_total);
printf("Distribute time: %.6f s\n", t_dist);
printf("Compute time: %.6f s\n", t_comp);
printf("Gather time: %.6f s\n", t_gath);
printf("Save time: %.6f s\n", t_save);
printf("Total time: %.6f s\n", t_tot);
// CSV logging
12
const char *csv_path = "../../doc/[Link]";
FILE *csv = fopen(csv_path, "r");
int csv_exists = (csv != NULL);
if (csv) fclose(csv);
csv = fopen(csv_path, "a");
if (csv) {
if (!csv_exists) {
fprintf(csv, "version,image,width,height,ranks,threads,load_gray,distribute,compute,gather,save,total\n");
}
fprintf(csv, "optimized,%s,%d,%d,%d,%d,%.6f,%.6f,%.6f,%.6f,%.6f,%.6f\n",
marguerite, width, height, world_size, omp_threads,
0.0, t_dist, t_comp, t_gath, t_save, t_tot);
fclose(csv);
}
}
// Free memory
free(input_sobel);
free(input_emboss);
free(data_sobel);
free(data_emboss);
free(offset_sobel);
free(offset_emboss);
if (world_rank == 0) {
free(sobel);
free(emboss);
}
MPI_Finalize();
return (0);
}
makefile
# Number of processes:
N?=6
# Optional hostfile path (set HOSTFILE=path/to/hosts when calling make)
HOSTFILE?=
# OpenMP threads per process (export OMP_NUM_THREADS if preferred)
OMP_NTHREADS?=4
# Change this as you like:
MPIEXEC_OPTS?=--oversubscribe
run: main
OMP_NUM_THREADS=$(OMP_NTHREADS) mpiexec $(if $(HOSTFILE),-hostfile $(HOSTFILE),) -n $(N)
$(MPIEXEC_OPTS) ./main
clean:
@rm -f main [Link] [Link]
.PHONY: clean
13
💻 Assignment 3
mainSerial.c
int main() {
int greatest1, greatest2, greatest3;
srand(time(NULL));
/////////////////////////////////////////////////////
// EXECUTION //
/////////////////////////////////////////////////////
for (int n = 0; n < 30; n++) {
// 0.5 GiB
clock_gettime(CLOCK_MONOTONIC, &start1);
greatest1 = max_tab(array1, size1);
clock_gettime(CLOCK_MONOTONIC, &end1);
elapsed1 = (end1.tv_sec - start1.tv_sec) + (end1.tv_nsec - start1.tv_nsec) / 1e9;
time1[n] = elapsed1;
// 1 GiB
1
clock_gettime(CLOCK_MONOTONIC, &start2);
greatest2 = max_tab(array2, size2);
clock_gettime(CLOCK_MONOTONIC, &end2);
elapsed2 = (end2.tv_sec - start2.tv_sec) + (end2.tv_nsec - start2.tv_nsec) / 1e9;
time2[n] = elapsed2;
// 1.5 GiB
clock_gettime(CLOCK_MONOTONIC, &start3);
greatest3 = max_tab(array3, size3);
clock_gettime(CLOCK_MONOTONIC, &end3);
elapsed3 = (end3.tv_sec - start3.tv_sec) + (end3.tv_nsec - start3.tv_nsec) / 1e9;
time3[n] = elapsed3;
}
/////////////////////////////////////////////////////
// RESULTS //
/////////////////////////////////////////////////////
// for the measures of the time
double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
double variance1 = 0.0, variance2 = 0.0, variance3 = 0.0;
double time_min1 = time1[0], time_min2 = time2[0], time_min3 = time3[0];
// release
free(array1);
free(array2);
free(array3);
free(time1);
free(time2);
free(time3);
return 0;
}
[Link]
// Function : that finds the maximum value in an array (parallel version with CUDA)
__global__ void MaxTab(int array[], int len, int output[]) {
// share table to store the max of each warp
extern __shared__ int warps_max[];
3
// reduction on the block
if (warpId == 0) {
int value = (posWarp < blockDim.x / 32) ? warps_max[posWarp] : -1;
int blockMax = MaxWarp(value);
if (posWarp == 0)
atomicMax(output, blockMax);
}
}
int serialFunction(int array1[], size_t size1, int array2[], size_t size2, int array3[], size_t size3) {
int greatest1, greatest2, greatest3;
srand(time(NULL));
/////////////////////////////////////////////////////
// EXECUTION //
/////////////////////////////////////////////////////
for (int n = 0; n < 30; n++) {
// 0.5 GiB
clock_gettime(CLOCK_MONOTONIC, &start1);
greatest1 = max_tab(array1, size1);
clock_gettime(CLOCK_MONOTONIC, &end1);
elapsed1 = (end1.tv_sec - start1.tv_sec) + (end1.tv_nsec - start1.tv_nsec) / 1e9;
time1[n] = elapsed1;
// 1 GiB
clock_gettime(CLOCK_MONOTONIC, &start2);
greatest2 = max_tab(array2, size2);
clock_gettime(CLOCK_MONOTONIC, &end2);
elapsed2 = (end2.tv_sec - start2.tv_sec) + (end2.tv_nsec - start2.tv_nsec) / 1e9;
time2[n] = elapsed2;
// 1.5 GiB
clock_gettime(CLOCK_MONOTONIC, &start3);
greatest3 = max_tab(array3, size3);
clock_gettime(CLOCK_MONOTONIC, &end3);
elapsed3 = (end3.tv_sec - start3.tv_sec) + (end3.tv_nsec - start3.tv_nsec) / 1e9;
time3[n] = elapsed3;
}
/////////////////////////////////////////////////////
// RESULTS //
/////////////////////////////////////////////////////
4
// for the measures of the time
double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
double variance1 = 0.0, variance2 = 0.0, variance3 = 0.0;
double time_min1 = time1[0], time_min2 = time2[0], time_min3 = time3[0];
// release
free(time1);
free(time2);
free(time3);
5
return 0;
}
int main() {
int greatest1, greatest2, greatest3;
int greatest1_h = -1, greatest2_h = -1, greatest3_h = -1; // host verification
srand(time(NULL));
int init = -1;
int threads = 256, blocks = 256;
/////////////////////////////////////////////////////
// 0.5 GiB //
/////////////////////////////////////////////////////
size_t size1 = 134217728; // 2^27
// execution
for (int n = 0; n < 30; n++) {
cudaEventRecord(start1);
MaxTab<<<blocks, threads, ((threads / 32) * sizeof(int))>>>(input1, size1, output1);
cudaEventRecord(end1);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapsed1, start1, end1);
time1[n] = elapsed1;
}
6
// copy data GPU -> CPU
cudaEventRecord(copyB1);
cudaMemcpy(&greatest1, output1, sizeof(int), cudaMemcpyDeviceToHost);
cudaEventRecord(endcopyB1);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapcopyB1, copyB1, endcopyB1);
// release
cudaFree(input1);
cudaFree(output1);
/////////////////////////////////////////////////////
// 1 GiB //
/////////////////////////////////////////////////////
size_t size2 = 268435456; // 2^28
// execution
for (int n = 0; n < 30; n++) {
cudaEventRecord(start2);
MaxTab<<<blocks, threads, ((threads / 32) * sizeof(int))>>>(input2, size2, output2);
cudaEventRecord(end2);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapsed2, start2, end2);
time2[n] = elapsed2;
}
7
// release
cudaFree(input2);
cudaFree(output2);
/////////////////////////////////////////////////////
// 1.5 GiB //
/////////////////////////////////////////////////////
size_t size3 = 402653184; // 2^27 + 2^28
// execution
for (int n = 0; n < 30; n++) {
cudaEventRecord(start3);
MaxTab<<<blocks, threads, ((threads / 32) * sizeof(int))>>>(input3, size3, output3);
cudaEventRecord(end3);
cudaDeviceSynchronize();
cudaEventElapsedTime(&elapsed3, start3, end3);
time3[n] = elapsed3;
}
// release
cudaFree(input3);
cudaFree(output3);
/////////////////////////////////////////////////////
// RESULTS //
/////////////////////////////////////////////////////
// execute the serial function
8
serialFunction(array1, size1, array2, size2, array3, size3);
// release
cudaFreeHost(array1);
cudaFreeHost(array2);
cudaFreeHost(array3);
free(time1);
free(time2);
free(time3);
return 0;
}
Makefile
mainSerial: mainSerial.c
gcc mainSerial.c -o mainSerial -lm
mainParallel: [Link]
# Include PTX for compute_50 to enable JIT on newer GPUs
nvcc -gencode arch=compute_50,code=sm_50 -gencode arch=compute_50,code=compute_50 [Link] -o
mainParallel -lm
run_serial: mainSerial
./mainSerial
run_parallel: mainParallel
./mainParallel
clean:
rm -f mainSerial mainParallel
report-clean:
cd report && latexmk -C
10