GPU Programming with CUDA Overview
GPU Programming with CUDA Overview
if ( x [ i ] >= 0 )
x [ i ] += 1 ;
else
x [ i ] −= 2 ;
In a typical SIMD system, each datapath carries out the test x[i] >= 0. Then the datapaths for
which the test is true execute x[i] += 1, while those for which x[i] < 0 are idle. Then the roles
of the datapaths are reversed: those for which x[i] >= 0 are idle while the other datapaths
execute x[i] = 2. See Table 6.1. −
Table 6.1 Execution of branch on a SIMD system.
A typical GPU can be thought of as being composed of one or more SIMD processors. Nvidia
GPUs are composed of Streaming Multiprocessors or SMs.1 One SM can have several
control units and many more datapaths. So an SM can be thought of as consisting of one or
more SIMD processors. The SMs, however, operate asynchronously: there is no penalty if
one branch of an if −else executes on one SM, and
the other executes on another SM. So in our preceding example, if all the threads with x[i] >=
0 were executing on one SM, and all the threads with x[i] < 0 were executing on another, the
execution of our if else example would−require only two stages. (See Table 6.2.)
Table 6.2 Execution of branch on a system with multiple SMs.
Time Datapaths with x[i] >= 0 (on Datapaths with x[i] < 0 (on
SM A) SM B)
1 Test x[i] >= 0 Test x[i] >= 0
2 x[i] += 1 x[i] -= 2
In Nvidia parlance, the datapaths are called cores, Streaming Processors, or SPs. Currently,2
one of the most powerful Nvidia processor has 82 SMs, and each SM has 128 SPs for a total
of 10,496 SPs. Since we use the term “core” to mean something else when we’re discussing
MIMD architectures, we’ll use SP to denote an Nvidia datapath. Also note that Nvidia uses
the term SIMT instead of SIMD. SIMT stands for Single Instruction Multiple Thread, and the
term is used because threads on an SM that are executing the same instruction may not
execute simultaneously: to hide memory access latency, some threads may block while
memory is accessed and other threads, that have already accessed the data, may proceed with
execution.
Each SM has a relatively small block of memory that is shared among its SPs. As we’ll see,
this memory can be accessed very quickly by the SPs. All of the SMs on a single chip also
have access to a much larger block of memory that is shared among all the SPs. Accessing
this memory is relatively slow. (See Fig. 6.1.)
The GPU and its associated memory are usually physically separate from the CPU and its
associated memory. In Nvidia documentation, the CPU together with its associated memory
is often called the host, and the GPU together with its memory is called the device. In earlier
systems the physical separation of host and device memories required that data was usually
explicitly transferred between CPU memory and GPU memory. That is, a function was called
that would transfer a block of data from host memory to device memory or vice versa. So, for
example, data read from a file by the CPU or output data generated by the GPU would have
to be transferred between the host and device with ≥an explicit function call. However, in more
recent Nvidia systems (those with compute capability 3.0), the explicit transfers in the source
code aren’t needed for correctness, although they may be able to improve overall
performance. (See Fig. 6.2.)
In this program we get both the number of thread blocks and the number of threads in each
block from the command line. Now the kernel call starts blk_ct thread blocks, each of which
contains th_per_blk threads.
When the kernel is started, each block is assigned to an SM, and the threads in the block are
then run on that SM. The output is similar to the output from the original program, except
that now we’re using two system-defined variables: threadIdx.x and blockIdx.x. As you’ve
probably guessed, threadIdx.x gives a thread’s rank or index in its block, and blockIdx.x
gives a block’s rank in the grid.
A grid is just the collection of thread blocks started by a kernel. So a thread block is
composed of threads, and a grid is composed of thread blocks.
There are several built-in variables that a thread can use to get information on the grid started
by the kernel. The following four variables are structs that are initialized in each thread’s
memory when a kernel begins execution:
• threadIdx: the rank or index of the thread in its thread block.
• blockDim: the dimensions, shape, or size of the thread blocks.
• blockIdx: the rank or index of the block within the grid.
• gridDim: the dimensions, shape, or size of the grid.
All of these structs have three fields, x, y, and z,5 and the fields all have unsigned integer
types. The fields are often convenient for applications. For example, an application that uses
graphics may find it convenient to assign a thread to a point in two- or three-dimensional
space, and the fields in threadIdx can be used to indicate the point’s position. An application
that makes extensive use of matrices may find it convenient to assign a thread to an element
of a matrix, and the fields in threadIdx can be used to indicate the column and row of the
element. When we call a kernel with something like
int blk_ct , th_per_blk ;
...
Hello <<<blk_ct , th_per_blk > > ( ) ;
the three-element structures gridDim and blockDim are initialized by assigning the values in
angle brackets to the x fields. So, effectively, the following assignments are made:
gridDim . x = blk_ct ;
blockDim . x = th_per_blk ;
The y and z fields are initialized to 1. If we want to use values other than 1 for the y and z
fields, we should declare two variables of type dim3, and pass them into the call to the kernel.
For example,
dim3 grid_dims , block_dims ;
grid_dims . x = 2 ; grid_dims . y = 3 ; grid_dims . z = 1 ; block_dims . x = 4 ; block_dims . y
= 4 ; block_dims . z = 4 ;
...
Kernel <<<grid_dims , block_dims >>> ( . . . ) ;
This should start a grid with 23 1 6 blocks, each of which has 43 64 threads.
× × = =
Note that all the blocks must have the same dimensions. More importantly, CUDA requires
that thread blocks be independent. So one thread block must be able to com- plete its
execution, regardless of the states of the other thread blocks: the thread blocks can be
executed sequentially in any order, or they can be executed in par- allel. This ensures that the
GPU can schedule a block to execute solely on the basis
of the state of that block: it doesn’t need to check on the state of any other block.6
We should note that Nvidia has a number of “product families” that can consist of anything
from an Nvidia-based graphics card to a “system on a chip,” which has the main hardware
components of a system, such as a mobile phone in a single integrated circuit.
Finally, note that there are a number of versions of the CUDA API, and they do
not correspond to the compute capabilities of the different GPUs.
In the kernel, we assign this global rank to my_elt and use this as the subscript for accessing
each thread’s elements of the arrays x, y, and z.
Note that we’ve allowed for the possibility that the total number of threads may not be
exactly the same as the number of components of the vectors. So before car- rying out the
addition,
z [ my_elt ] = x [ my_elt ] + y [ my_elt ];
we first check that my_elt < n. For example, if we=have n 997, and we want at least two
blocks with at least two threads per block, then, since 997 is prime, we can’t possibly have
exactly 997 threads. Since this kernel needs to be executed by at least n threads, we must start
more than 997. For example, we might use four blocks of 256 threads, and the last 27 threads
in the last block would skip the line
z [ my_elt ] = x [ my_elt ] + y [ my_elt ];
Note that if we needed to run our program on a system that didn’t support CUDA, we could
replace the kernel with a serial vector addition function. (See Program 6.4.)
So we can view the CUDA kernel as taking the serial for loop and assigning each iteration to
a different thread. This is often how we start the design process when we want to parallelize a
serial code for CUDA: assign the iterations of a loop to individual threads.
Also note that if we apply Foster’s method to parallelizing the serial vector sum, and we
make the tasks the additions of the individual components, then we don’t need to do anything
for the communication and aggregation phases, and the mapping phase simply assigns each
addition to a thread.
6.8.2 Get_args
After declaring the variables, the main function calls a Get_args function, which re- turns n,
the number of elements in the arrays, blk_ct, the number of thread blocks, and th_per_blk, the
number of threads in each block. It gets these from the command line. It also returns a char
i_g. This tells the program whether the user will input x and y or whether it should generate
them using a random number generator. If the user doesn’t enter the correct number of
command line arguments, the function prints a usage summary and terminates execution.
Also if n is greater than the total number of threads, it prints a message and terminates. (See
Program 6.5.) Note that Get_args is written in standard C, and it runs completely on the host.
6.8.3 Allocate_vectors and managed memory
After getting the command line arguments, the main function calls Allocate_vectors, which
allocates storage for four n-element arrays of float :
x , y , z , cz
The first three arrays are used on both the host and the device. The fourth array, cz, is only
used on the host: we use it to compute the vector sum with one core of the host. We do this so
that we can check the result computed on the device. (See Program 6.6.)
First note that since cz is only used on the host, we allocate its storage using the standard C
library function malloc. For the other three arrays, we allocate storage in Lines 9–11 using the
CUDA function
__host__ cudaError_t cudaMallocManaged (
void ∗∗ devPtr / ∗ out ∗ / , size_t size / ∗ in ∗ / , unsigned flags / ∗ in ∗ / );
The host qualifier is a CUDA addition to C, and it indicates that the function should be
called and run on the host. This is the default for functions in CUDA programs, so it can be
omitted when we’re writing our own functions, and they’ll only be run on the host.
The return value, which has type cudaError_t, allows the function to return an er- ror. Most
CUDA functions return a cudaError_t value, and if you’re having problems with your code, it
is a very good idea to check it. However, always checking it tends to clutter the code, and this
can distract us from the main purpose of a program. So in the code we discuss we’ll generally
ignore cudaError_t return values.
The first argument is a pointer to a pointer: it refers to the pointer that’s being allocated. The
second argument specifies the number of bytes that should be allo- cated. The flags argument
controls which kernels can access the allocated memory. It defaults to cudaMemAttachGlobal
and can be omitted.
The function cudaMallocManaged is one of several CUDA memory allocation func- tions. It
allocates memory that will be automatically managed by the “unified memory system.” This
is a relatively recent addition to CUDA,8 and it allows a programmer to write CUDA
programs as if the host and device shared a single memory: pointers referring to memory
allocated with cudaMallocManaged can be used on both the device and the host, even when
the host and the device have separate physical memories. As you can imagine this greatly
simplifies programming, but there are some cautions. Here are a few:
1. Unified memory requires a device with compute capability ≥ 3.0, and a 64-bit host
operating system.
2. On devices with compute capability < 6.0 memory allocated with cudaMallocManaged
cannot be simultaneously accessed by both the device and the host. When a kernel is
executing, it has exclusive access to memory allocated with cudaMallocManaged.
3. Kernels that use unified memory can be slower than kernels that treat device mem- ory as
separate from host memory.
The last caution has to do with the transfer of data between the host and the device. When a
program uses unified memory, it is up to the system to decide when to transfer from the host
to the device or vice versa. In programs that explicitly transfer data, it is up to the
programmer to include code that implements the transfers, and she may be able to exploit her
knowledge of the code to do things that reduce the cost of transfers, things such as omitting
some transfers or overlapping data transfer with computation.
At the end of this section we’ll briefly discuss the modifications required if you want to
explicitly handle the transfers between host and device.
The first thing to notice is that the kernel is unchanged: the arguments are x, y, z, and n. It
finds the thread’s global index, my_elt, and if this is less than n, it adds the elements of x and
y to get the corresponding element of z.
The basic structure of the main function is almost the same. However, since we’re assuming
unified memory is unavailable, pointers on the host aren’t valid on the de- vice, and vice
versa: an address on the host may be illegal on the device, or, even worse, it might refer to
memory that the device is using for some other purpose. Similar problems occur if we try to
use a device address on the host. So instead of declaring and allocating storage for three
arrays that are all valid on both the host and the device, we declare and allocate storage for
three arrays that are valid on the host hx, hy, and hz, and we declare and allocate storage for
three arrays that are valid on the device, dx, dy, and dz. The declarations are in Lines 15–16,
and the allocations are in the Allocate_vectors function called in Line 20. The function itself
is in Program 6.10.
Since unified memory isn’t available, instead of using cudaMallocManaged, we use the C
library function malloc for the host arrays, and the CUDA function cudaMalloc for the device
arrays:
__host__ __device__ cudaError_t cudaMalloc (
void ∗∗ dev_p / ∗ out ∗ / ,
size_t size / ∗ in ∗ / );
The first argument is a reference to a pointer that will be used on the device. The second
argument specifies the number of bytes to allocate on the device.
After we’ve initialized hx and hy on the host, we copy their contents over to the device,
storing the transferred contents in the memory allocated for dx and dy, respec- tively. The
copying is done in Lines 24–26 using the CUDA function cudaMemcpy:
__host__ cudaError_t cudaMemcpy (
void ∗ dest / ∗ out ∗ / , c o n s t void ∗ source / ∗ in ∗ / , size_t count / ∗ in ∗ / ,
cudaMemcpyKind kind / ∗ in ∗ / );
This copies count bytes from the memory referred to by source into the memory referred to
by dest. The type of the kind argument, cudaMemcpyKind, is an enumer- ated type defined
by CUDA that specifies where the source and dest pointers are located. For our purposes the
two values of interest are cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost. The
first indicates that we’re copying from the host to the device, and the second indicates that
we’re copying from the device to the host.
The call to the kernel in Line 28 uses the pointers dx, dy, and dz, because these are addresses
that are valid on the device.
After the call to the kernel, we copy the result of the vector addition from the device to the
host in Line 31 using cudaMemcpy again. A call to cudaMemcpy is syn- chronous, so it waits
for the kernel to finish executing before carrying out the transfer. So in this version of vector
addition we do not need to use cudaDeviceSynchronize to ensure that the kernel has
completed before proceeding.
After copying the result from the device back to the host, the program checks the result, frees
the memory allocated on the host and the device, and terminates. So for this part of the
program, the only difference from the original program is that we’re freeing seven pointers
instead of four. As before, the Free_vectors function frees the storage allocated on the host
with the C library function free. It uses cudaFree to free the storage allocated on the device.
If your system doesn’t support unified memory, the same idea will work, but the result will
have to be explicitly copied from the device to the host:
Note that in both the unified and non-unified memory settings, we’re returning a single value
from the device to the host.
If unified memory is available, another option is to use a global managed variable for the
sum:
The qualifier managed declares sum to be a managed int that is accessible to all the
functions, regardless of whether they run on the host or the device. Since it’s man- aged, the
same restrictions apply to it that apply to managed variables allocated with
cudaMallocManaged. So this option is unavailable on systems with compute capability < 3.0,
and on systems with compute capability < 6.0, sum can’t be accessed on the host while the
kernel is running. So after the call to Add has started, the host can’t access sum until after the
call to cudaDeviceSynchronize has completed.
Since this last approach uses a global variable, it has the usual problem of reduced modularity
associated with global variables.
So if we think of the length of the subinterval xi, xi+1 as the height of the ith trapezoid, and f
(xi) and f (xi 1) as the two base
+
lengths (see Fig. 3.4), then the area of the ith trapezoid is
However, it’s immediately obvious that there are several problems here:
1. We haven’t initialized h or trap.
2. The my_i value can be too large or too small: the serial loop ranges from 1 up to and
− smallest value for my_i is 0 and the largest is the total number of
including n 1. The
threads minus 1.
3. The variable trap must be shared among the threads. So the addition of my_trap forms a
race condition: when multiple threads try to update trap at roughly the same time, one
thread can overwrite another thread’s result, and the final value in trap may be wrong. (For
a discussion of race conditions, see Section 2.4.3.)
4. The variable trap in the serial code is returned by the function, and, as we’ve seen, kernels
must have void return type.
5. We see from the serial code that we need to multiply the total in trap by h after all of the
threads have added their results.
Program 6.12 shows how we might deal with these problems. In the following sections, we’ll
look at the rationales for the various choices we’ve made.
6.10.3 Initialization, return value, and final update
To deal with the initialization and the final update (Items 1 and 5), we could try to select a
single thread—say, thread 0 in block 0—to carry out the operations:
int my_i = blockDim . x blockIdx . x + threadIdx . x ;
∗
if ( my_i == 0 ) {
h = ( b−a )/ n ;
trap = 0 . 5 ∗ ( f ( a ) + f ( b ));
}
...
if ( my_i == 0 )
trap = trap ∗h ;
There are (at least) a couple of problems with these options: formal arguments to functions
are private to the executing thread and thread synchronization.
Kernel and function arguments are private to the executing thread.
Like the threads started in Pthreads and OpenMP, each CUDA thread has its own stack and,
and since formal arguments are allocated on the thread’s stack, each thread has its own
private variables h and trap. So, any changes made to one of these variables by one thread
won’t be visible to the other threads. We could have each thread initialize h, but we could
also just do the initialization once in the host. If we do this before the kernel is called, each
thread will get a copy of the value of h.
Things are more complicated with trap. Since it’s updated by multiple threads, it must be
shared among the threads. We can achieve the effect of sharing trap by allocating storage for
a memory location before the kernel is called. This allocated memory location will
correspond to what we’ve been calling trap. Now we can pass a pointer to the memory
location to the kernel. That is, we can do something like this:
When we do this, each thread will get its own copy of trap_p, but all of the copies of trap_p
will refer to the same memory location. So ∗trap_p will be shared.
Note that using a pointer instead of a simple float also solves the problem of returning the
value of trap in Item 4.
A wrapper function
If you look at the code in Program 6.12, you’ll see that we’ve placed most of the code we use
before and after calling the kernel in a wrapper function, Trap_wrapper. A wrapper function
is a function whose main purpose is to call another function. It can perform any preparation
needed for the call. It can also perform any additional work needed after the call.
6.10.4 Using the correct threads
We assume that the number of threads, blk_ct∗th_per_blk, is at least as large as the number of
trapezoids. Since the serial for loop iterates from 1 up to n−1, thread 0 and any thread with
−
my_i >n 1, shouldn’t execute the code in the body of the serial for loop. So we should
include a test before the main part of the kernel code
if (0 < my_i && my_i < n ) {
/* Compute x , f ( x ) , and add i n t o *trap_p */
...
}
See Line 11 in Program 6.12.
So rather than have each thread wait for its turn to do an addition into ∗trap_p, we can pair up
the threads so that half of the “active” threads add their partial sum to their partner’s partial
sum. This gives us a structure that resembles a tree (or, perhaps better, a shrub). See Fig. 6.4.
In our figures, we’ve gone from requiring a sequence of 8 consecutive additions to a
sequence of 4. More generally, if we double the number of threads and values (e.g., increase
from 8 to 16), we’ll double the length of the sequence of additions using the basic approach,
while we’ll only add one using the second, tree-structured approach. For example, if we
increase the number of threads and values from 8 to 16, the first approach requires a sequence
of 16 additions, but the tree-structured approach only requires 5., if there are t threads and t
values, the first approach requires a sequence of t additions, while In fact the tree-structured
approach requires log2(t )+ 1. For example, if we have 1000 threads and values, we’ll go
from 1000 communications and sums using the basic approach to 11 using the tree-structured
approach, and if we have 1,000,000, we’ll go from 1,000,000 to 21!
There are two standard implementations of a tree-structured sum in CUDA. One
implementation uses shared memory, and in devices with compute capability < 3 this is the
best implementation. However, in devices with compute capability >= 3 there are several
functions called warp shuffles, that allow a collection of threads within a warp to read
variables stored by other threads in the warp.
Access times also increase dramatically. It takes on the order of 1 cycle to copy a 4-byte int
from one register to another. Depending on the system it can take up to an order of magnitude
more time to copy from one shared memory location to another, and it can take from two to
three orders of magnitude more time to copy from one global memory location to another.
An obvious question here: what about local variables? How much storage is avail- able for
them? And how fast is it? This depends on total available memory and program memory
usage. If there is enough storage, local variables are stored in registers. However, if there isn’t
enough register storage, local variables are “spilled” to a region of global memory that’s
thread private, i.e., only the thread that owns the local variables can access them.
So as long as we have sufficient register storage, we expect the performance of a kernel to
improve if we increase our use of registers and reduce our use of shared and/or global
memory. The catch, of course, is that the storage available in registers is tiny compared to the
storage available in shared and global memory.
The mask argument indicates which threads are participating in the call. A bit, representing
the thread’s lane, must be set for each participating thread to ensure that all of the threads in
the call have converged—i.e., arrived at the call—before any thread begins executing the call
to s mask = 0 xffffffff ;
Recall that 0x denotes a hexadecimal (base 16) value and 0xf is 1510, which is 11112.11 So
this value of mask is 32 1’s in binary, and it indicates that every thread in the warp
participates in the call to shfl_down_sync. If the thread with lane l calls
shfl_down_sync, then the value stored in var on the thread with
lane = l + diff
is returned on thread l. Since diff has type unsigned, it is ≥ 0. So the value that’s returned is
from a higher-ranked thread. Hence the name “shuffle down”.
We’ll only use width = warpSize, and since its default value is warpSize, we’ll omit it from
our calls.
There are several possible issues:
• What happens if thread l calls shfl_down_sync but thread l+diff doesn’t? In this case, the
value returned by the call on thread l is undefined.
• What happens if thread l calls shfl_down_sync but l+ diff ≥ warpSize? In this case the call
will return the value in var already stored on thread l.
• What happens if thread l calls shfl_down_sync, and l +diff < warpSize, but l diff > largest
lane in the warp. In other words, because the thread block size is not a multiple of
warpSize, the last warp in the block has fewer than warpSize threads. Say there are w
threads in the last warp, where 0 <w < warpSize. Then if
l + diff ≥ w,
the value returned by the call is also undefined. So to avoid undefined results, it’s best if
• All the threads in the warp call shfl_down_sync, and
• All the warps have warpSize threads, or, equivalently, the thread block size (blockDim.x)
is a multiple of warpSize.
Fig. 6.5 shows how the function would operate if warpSize were 8. (The diagram would be
illegible if we used a warpSize of 32.) Perhaps the most confusing point in the behavior of
shfl_down_sync is that when the lane ID
l + diff ≥ warpSize,
the call returns the value in the caller’s var. In the diagram this is shown by having only one
arrow entering the oval with the sum, and it’s labeled with the value just calculated by the
thread carrying out the sum. In the row corresponding to diff = 4 (the first row of sums), the
threads with lane IDs l = 4, 5, 6, and 7 all = have l + 4 ≥ 8. So the call to
shfl_down_sync returns their current var values, 9, 11, 13, and 15, respectively, and these
values are doubled, because the return value of the call is added into the calling thread’s
variable var. Similar behavior occurs in the row corresponding to the sums for diff = 2 and
lane IDs l = 6 and 7, and in the last row when diff = 1 for the thread with lane ID l = 7.
From a practical standpoint, it’s important to remember that this implementation will only
return the correct sum on the thread with lane ID 0. If all of the threads need the result, we
can use an alternative warp shuffle function, shfl_xor. See Eercise 6.6.
This should be called by all the threads in a warp, and the array shared_vals should be stored
in the shared memory of the SM that’s running the warp. Since the threads in the warp are
operating in SIMD fashion, they effectively execute the code of the function in lockstep. So
there’s no race condition in the updates to shared_vals: all the threads read the values in
shared_vals[source] before any thread updates
shared_vals[my_lane].
Technically speaking, this isn’t a tree-structured sum. It’s sometimes called a dissemination
sum or dissemination reduction. Fig. 6.6 illustrates the copying and additions that take
place.
Unlike the earlier figures, this figure doesn’t show the di- rect contributions that a thread
makes to its sums: including these lines would have made the figure too difficult to read. Also
note that every thread reads a value from another thread in each pass through the for
statement. After all these values have been added in, every thread has the correct sum—not
just thread 0. Although we won’t need this for the trapezoidal rule, this can be useful in other
applications. Furthermore, in any cycle in which the threads in a warp are working, each
thread either executes the current instruction or it is idle. So the cost of having every thread
exe- cute the same instruction shouldn’t be any greater than having some of the threads
execute one instruction and the others idle.
An obvious question here is: how does Shared_mem_sum make use of Nvidia’s shared
memory? The answer is that it’s not required to use shared memory. The function’s argument,
the array shared_vals, could reside in either global memory or shared memory. In either case,
the function would return the sum of the elements of shared_vals.
However, to get the best performance, the argument shared_vals should be defined to be
shared in a kernel. For example, if we know that shared_vals will need to store at most 32
floats in each thread block, we can add this definition to our kernel:
__shared__ float shared_vals [32];
For each thread block this sets aside storage for a collection of 32 floats in the shared
memory of the SM to which the block is assigned.
Alternatively, if it isn’t known at compile time how much shared memory is needed, it can be
declared as
extern __shared__ float shared_vals [];
and when the kernel is called, a third argument can be included in the triple angle brackets
specifying the size in bytes of the block of shared memory. For example, if we were using
Shared_mem_sum in a trapezoidal rule program, we might call the kernel Dev_trap with
Dev_trap <<<blk_ct , th_per_blk , th_per_blk sizeof ∗( float )>>> ( . . . args to Dev_trap . . .
);
This would allocate storage for th_per_blk floats in the shared_vals array in each thread
block.
This would allocate storage for th_per_blk floats in the shared_vals array in each thread
block.
Note that the CUDA defined variable warpSize is not defined at compile-time. So our
program defines a preprocessor macro
# define WARPSZ 32
6.12.4 Performance
Of course, we want to see how the various implementations perform. (See Table 6.8.) The
problem is the same as the problem we ran earlier (see Table 6.5): we’re integrating f(x)=x2+1
on the interval [ -3, 3 ], and there are 220 = 1,048,576 trapezoids. However, since the thread
block size is 32, we’re using 32,768 thread blocks (32 × 32,768 = 1,048,576).
We see that on both systems and with both sum implementations, the new pro- grams do
significantly better than the original. For the GK20A, the warp shuffle version runs in about
70% of the time of the original, and the shared memory version runs in about 72% of the time
of the original. For the Titan X, the improvements are much more impressive: both versions
run in less than 7% of the time of the original. Perhaps most striking is the fact that on the
Titan X, the warp shuffle is, on average, slightly slower than the shared memory version.
6.13 CUDA trapezoidal rule III: blocks with more than one warp
Limiting ourselves to thread blocks with only 32 threads reduces the power and flex- ibility
of our CUDA programs. For example, devices with compute capability 2.0 can ≥ have blocks
with as many as 1024 threads or 32 warps, and CUDA provides a fast barrier that can be used
to synchronize all the threads in a block. So if we limited ourselves to only 32 threads in a
block, we wouldn’t be using one of the most useful features of CUDA: the ability to
efficiently synchronize large numbers of threads.
So what would a “block” sum look like if we allowed ourselves to use blocks with up to 1024
threads? We could use one of our existing warp sums to add the values computed by the
=
threads in each warp. Then we would have as many as 1024/32 = 32 warp sums, and we
could use one warp in the thread block to add the warp sums.
Since two threads belong to the same warp if their ranks in the block have the same quotient
when divided by warpSize, to add the warp sums, we can use warp 0, the threads with ranks
0, 1, ..., 31 in the block.
6.13.1 syncthreads
We might try to use the following pseudocode for finding the sum of the values com- puted
by all the threads in a block:
Each thread computes its contribution ;
Each warp adds its threads ’ contributions ; Warp 0 in block adds warp sums ;
However, there’s a race condition. Do you see it? When warp 0 tries to compute the total of
the warp sums in the block, it doesn’t know whether all the warps in the block have
completed their sums. For example, suppose we have two warps, warp 0 and warp 1, each of
which has 32 threads. Recall that the threads in a warp operate in SIMD fashion: no thread in
the warp proceeds to a new instruction until all the threads in the warp have completed (or
skipped) the current instruction. But the threads in warp 0 can operate independently of the
threads in warp 1. So if warp 0 finishes computing its sum before warp 1 computes its sum,
warp 0 could try to add warp 1’s sum to its sum before warp 1 has finished, and, in this case,
the block sum could be incorrect.
So we must make sure that warp 0 doesn’t start adding up the warp sums until all of the
warps in the block are done. We can do this by using CUDA’s fast barrier:
__device__ void __syncthreads ( void );
This will cause the threads in the thread block to wait in the call until all of the threads have
started the call. Using syncthreads, we can modify our pseudocode so that the race condition
is avoided:
Each thread computes its contribution ;
Each warp adds its threads ’ contributions ;
__syncthreads ();
Warp 0 in block adds warp sums ;
Now warp 0 won’t be able to add the warp sums until every warp in the block has completed
its sum.
There are a couple of important caveats when we use syncthreads. First, it’s critical that all
of the threads in the block execute the call. For example, if the block contains at least two
threads, and our code includes something like this:
int my_x = threadIdx . x ;
if ( my_x < blockDim . x /2)
__syncthreads ();
my_x ++;
then only half the threads in the block will call syncthreads, and these threads can’t proceed
until all the threads in the block have called syncthreads. So they will wait forever for the
other threads to call syncthreads.
The second caveat is that syncthreads only synchronizes the threads in a block. If a grid
contains at least two blocks, and if all the threads in the grid call
syncthreads then the threads in different blocks will continue to operate indepen- dently of
each other. So we can’t synchronize the threads in a general grid with
syncthreads.12
Table 6.9 illustrates the organization of thread_calcs. In the table, the columns are banks, and
the rows show the subscripts of consecutive elements of thread_calcs. So the 32 threads in a
warp can simultaneously access the 32 elements in any one of the rows, or, more generally, if
each thread access is to a different column.
When two or more threads access different elements in a single bank (or column in the table),
then those accesses must be serialized. So the problem with our approach to saving the warp
sums in elements 0, 32, 64, ..., 992 is that these are all in the same bank. So when we try to
execute them, the GPU will serialize access, e.g., element 0 will be written, then element 32,
then element 64, etc. So the writes will take something like 32 times as long as it would if the
32 elements were stored in different banks, e.g., a row of the table.
The details of bank access are a little complicated and some of the details depend on the
compute capability, but the main points are
• If each thread in a warp accesses a different bank, the accesses can happen simultaneously.
• If multiple threads access different memory locations in a single bank, the accesses must
be serialized.
• If multiple threads read the same memory location in a bank, the value read is broadcast to
the reading threads, and the reads are simultaneous.
The CUDA programming Guide [11] provides full details.
Thus we could exploit the use of the shared memory banks if we stored the results in a
contiguous subarray of shared memory. Since each thread block can use at least 16 Kbytes of
shared memory, and our “current” definition of shared_vals only uses at most 1024 floats or 4
Kbytes of shared memory, there is plenty of shared memory available for storing 32 more
floats.
So if we’re using shared memory warp sums, a simple solution is to declare two arrays of
shared memory: one for storing the computations made by each thread, and another for
storing the warp sums.
__shared__ float thread_calcs [ MAX_BLKSZ ];
__shared__ float warp_sum_arr [ WARPSZ ];
float∗ shared_vals = thread_calcs + my_warp∗warpSize ;
...
float my_result = Shared_mem_sum ( shared_vals );
if ( my_lane == 0 ) warp_sum_arr [ my_warp ] = my_result ;
__syncthreads ();
...
6.6.1 Finishing up
The remaining codes for the warp sum kernel and the shared memory sum kernel are very
similar. First warp 0 computes the sum of the elements in warp_sum_arr. Then thread 0 in the
block adds the block sum into the total across all the threads in the grid using atomicAdd.
Here’s the code for the shared memory sum:
if ( my_warp == 0 ) {
if ( threadIdx . x >= blockDim . x / warpSize )
warp_sum_arr [ threadIdx . x ] = 0.0;
blk_result = Shared_mem_sum ( warp_sum_arr );
}
if ( threadIdx . x == 0 ) atomicAdd ( trap_p , blk_result );
In the test threadIdx.x > blockDim.x/warpSize we’re checking to see if there are fewer than
32 warps in the block. If there are, then the final elements in warp_sum_arr won’t have been
initialized. For example, if there are 256 warps in the block, then
blockDim . x / warpSize = 256/32 = 8
So there are only 8 warps in a block and we’ll have only initialized elements 0, 1,..., 7 of
warp_sum_arr. But the warp sum function expects 32 values. So for the threads with
threadIdx.x >= 8, we assign
warp_sum_arr [ threadIdx . x ] = 0.0;
For the sake of completeness, Program 6.15 shows the kernel that uses shared memory. The
main differences between this kernel and the kernel that uses warp shuffles are that the
declaration of the first shared array isn’t needed in the warp shuffle version, and, of course,
the warp shuffle version calls Warp_sum instead of Shared_mem_sum.