Add CUDA, HIP and DPCPP batch bicgstab kernels#1443
Conversation
MarcelKoch
left a comment
There was a problem hiding this comment.
I think the kernels look good so far. I have mostly comments outside of those.
Here are some things to be tackled later:
- use dispatch instead of manual switch
- make reductions work with more than 1 warp
| // Compute norms of rhs | ||
| single_rhs_compute_norm2(subgroup, num_rows, b_global_entry, rhs_norm); | ||
| } | ||
| __syncthreads(); |
There was a problem hiding this comment.
Is this necessary? The above code writes only to the norm.
There was a problem hiding this comment.
Diverging paths between subwarps. To ensure consistency, I think it is good to synchronize them.
There was a problem hiding this comment.
Sure, they diverge, but I don't see how that would affect the following code. But I'm no expert on this, so I won't push anything here.
There was a problem hiding this comment.
Not requesting any changes, but I wanted to elaborate on this a bit. I agree here, I think we could take a page from CUB's book, where they ensure synchronization always happens inside functions that require it (i.e. SpMVs and reductions) and are entirely absent from the code otherwise.
To make this work, you need a "default" work assignment (like the default for (int iz = threadIdx.x; iz < num_rows; iz += blockDim.x) loop) and every time you read from values outside your own assigned set, you have a threadsync before, and if you write to values outside your set (also computing reductions), you have a threadsync after. This may even allow you to keep all values in registers most of the time, as long as you don't have huge blocks. But that is an optional detail.
Outside of this, there is also some potential for "kernel fusion" (i.e. removing the __syncthreads and computing directly on values in registers) by computing the dot product on the result of the SpMV, but I don't have a clear idea how large the runtime impact of that would be.
| } | ||
| __syncthreads(); | ||
|
|
||
| for (int iz = threadIdx.x; iz < num_rows; iz += blockDim.x) { |
There was a problem hiding this comment.
nit: in the other kernels you are using r as index variable.
|
|
||
| // template | ||
| // launch_apply_kernel<StopType, SIMDLEN, n_shared_total, sg_kernel_all> | ||
| if (num_rows <= 32 && n_shared_total == 10) |
There was a problem hiding this comment.
cuda/hip uses 9 vectors in shmem. Why does this check for 10? Also the kernel only checks until n_shared_total == 9
There was a problem hiding this comment.
the strategy is slightly different. Here the count includes the prec_shared vector. The number of shared vectors is always 9, so you can only check until 9. If it is greater than 9, then you know that the prec is also in shared memory.
There was a problem hiding this comment.
but isn't that what storage_config::prec_shared is there for?
There was a problem hiding this comment.
I think it is a bit easier with looking at n_shared as 10 vectors. Otherwise, prec_shared will need to be a template parameter as well. But I understand your point that it makes the cuda/dpcpp kernels more confusing to compare.
There was a problem hiding this comment.
I would prefer the additional template parameter then. But that might also be done later.
|
format! |
| if (sizeof(ValueType) == 4) { | ||
| cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte); | ||
| } else if (sizeof(ValueType) % 8 == 0) { | ||
| cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte); | ||
| } |
There was a problem hiding this comment.
do they have TwoByte? Otherwise, it may introduce some troubles when adding half
There was a problem hiding this comment.
No, I dont think that is necessary. Only a value of 8 is recommended for double to avoid bank conflicts. You can just set it to 4 for half I think .
There was a problem hiding this comment.
This is kind of problematic - it configures the entire device, but we only run on a single stream. At the very least, we need to revert it after the kernel finished, otherwise we interfere with other applications' performance
There was a problem hiding this comment.
I guess a scope guard similar to the one for the device id could work here.
| } | ||
| } | ||
| x.values[tidx * x.stride] = temp; | ||
| x[tidx] = temp; |
There was a problem hiding this comment.
I just use the plain pointers as arguments here. I guess technically we should have another stride parameter to the function, but I think that is unnecessary for now and we can add that when we support stride later.
| ValueType values[5]; | ||
| real_type reals[2]; | ||
| rho_old_sh = &values[0]; | ||
| rho_new_sh = &values[1]; | ||
| alpha_sh = &values[2]; | ||
| omega_sh = &values[3]; | ||
| temp_sh = &values[4]; | ||
| norms_rhs_sh = &reals[0]; | ||
| norms_res_sh = &reals[1]; |
There was a problem hiding this comment.
segfault.
values and reals will be destroies after else.
| { | ||
| using real_type = gko::remove_complex<value_type>; | ||
| const size_type num_batch_items = mat.num_batch_items; | ||
| constexpr int align_multiple = 8; |
There was a problem hiding this comment.
So, that alignment is only relevant if the vectors are stored in global memory, right?
yhmtsai
left a comment
There was a problem hiding this comment.
except for the shared_memory in dpcpp and storage computation (not reviewed yet), others LGTM
| __dpct_inline__ void initialize( | ||
| const int num_rows, const BatchMatrixType_entry& mat_global_entry, | ||
| const ValueType* const b_global_entry, | ||
| const ValueType* const x_global_entry, ValueType& rho_old, ValueType& omega, | ||
| ValueType& alpha, ValueType* const x_shared_entry, | ||
| ValueType* const r_shared_entry, ValueType* const r_hat_shared_entry, | ||
| ValueType* const p_shared_entry, ValueType* const v_shared_entry, | ||
| typename gko::remove_complex<ValueType>& rhs_norm, | ||
| typename gko::remove_complex<ValueType>& res_norm, | ||
| sycl::nd_item<3> item_ct1) |
There was a problem hiding this comment.
I think from CUDA, it will use __ldg() automatically if it is const __restrict__*. That's why we do not need to use __ldg
b8def5b to
b653d3b
Compare
| inline batch::matrix::ell::uniform_batch<const hip_type<ValueType>, | ||
| const IndexType> |
There was a problem hiding this comment.
I think the to_const usually face this issue.
Could you check the other const version also correct?
If all related to this issue are not in public interface, it are not urgent before release
b653d3b to
fb50eaf
Compare
8982811 to
28560a5
Compare
e21b275 to
2260c8f
Compare
fb50eaf to
d21d5fd
Compare
|
format! |
Co-authored-by: Pratik Nayak <pratikvn@pm.me>
Co-authored-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>
Co-authored-by: Yu-Hsiang Tsai <yhmtsai@gmail.com>
Co-authored-by: Pratik Nayak <pratikvn@pm.me>
Co-authored-by: Yu-Hsiang Tsai <yhmtsai@gmail.com>
f48179b to
f600023
Compare
Co-authored-by: Yu-Hsiang Tsai <yhmtsai@gmail.com>
f600023 to
79e68b3
Compare
|
format! |
Co-authored-by: Pratik Nayak <pratikvn@pm.me>
|
Turns out the |
|
Kudos, SonarCloud Quality Gate passed!
|









This PR adds the batch bicgstab solver kernels for CUDA, HIP and DPCPP backends. Some additional single rhs vector kernels are also added into the batch multivector kernels.
TODO