Add CUDA, HIP and DPCPP batch bicgstab kernels by pratikvn · Pull Request #1443 · ginkgo-project/ginkgo

pratikvn · 2023-10-26T22:10:23Z

This PR adds the batch bicgstab solver kernels for CUDA, HIP and DPCPP backends. Some additional single rhs vector kernels are also added into the batch multivector kernels.

TODO

Add DPCPP kernels

MarcelKoch

I think the kernels look good so far. I have mostly comments outside of those.

Here are some things to be tackled later:

use dispatch instead of manual switch
make reductions work with more than 1 warp

MarcelKoch · 2023-10-27T09:14:56Z

+        // Compute norms of rhs
+        single_rhs_compute_norm2(subgroup, num_rows, b_global_entry, rhs_norm);
+    }
+    __syncthreads();


Is this necessary? The above code writes only to the norm.

Diverging paths between subwarps. To ensure consistency, I think it is good to synchronize them.

Sure, they diverge, but I don't see how that would affect the following code. But I'm no expert on this, so I won't push anything here.

Not requesting any changes, but I wanted to elaborate on this a bit. I agree here, I think we could take a page from CUB's book, where they ensure synchronization always happens inside functions that require it (i.e. SpMVs and reductions) and are entirely absent from the code otherwise.
To make this work, you need a "default" work assignment (like the default for (int iz = threadIdx.x; iz < num_rows; iz += blockDim.x) loop) and every time you read from values outside your own assigned set, you have a threadsync before, and if you write to values outside your set (also computing reductions), you have a threadsync after. This may even allow you to keep all values in registers most of the time, as long as you don't have huge blocks. But that is an optional detail.

Outside of this, there is also some potential for "kernel fusion" (i.e. removing the __syncthreads and computing directly on values in registers) by computing the dot product on the result of the SpMV, but I don't have a clear idea how large the runtime impact of that would be.
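To make the ownership rule above concrete, here is a minimal host-side sketch in plain C++ (not device code, and not taken from the PR). The helper names `rows_of`, `owner_of` and `needs_sync` are hypothetical; `tid` and `block_size` stand in for `threadIdx.x` and `blockDim.x`. It only illustrates which rows a thread "owns" under the default strided loop, and therefore where a barrier would be needed.

```cpp
#include <vector>

// Hypothetical helper: rows visited by thread `tid` under the default
// strided loop `for (iz = tid; iz < num_rows; iz += block_size)`.
std::vector<int> rows_of(int tid, int block_size, int num_rows)
{
    std::vector<int> rows;
    for (int iz = tid; iz < num_rows; iz += block_size) {
        rows.push_back(iz);
    }
    return rows;
}

// Hypothetical helper: the thread that owns row `iz` under that assignment.
int owner_of(int iz, int block_size) { return iz % block_size; }

// True if thread `tid` touching row `iz` leaves its own assigned set, i.e.
// the access would need a barrier (before a read, after a write).
bool needs_sync(int tid, int iz, int block_size)
{
    return owner_of(iz, block_size) != tid;
}
```

Under this rule an elementwise update over a thread's own rows needs no barrier at all, while an SpMV or a reduction, which reads or writes rows owned by other threads, would carry the barrier inside the function.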

MarcelKoch · 2023-10-27T09:15:54Z

+    }
+    __syncthreads();
+
+    for (int iz = threadIdx.x; iz < num_rows; iz += blockDim.x) {


nit: in the other kernels you are using r as index variable.

MarcelKoch · 2023-10-27T14:09:14Z

+
+        // template
+        // launch_apply_kernel<StopType, SIMDLEN, n_shared_total, sg_kernel_all>
+        if (num_rows <= 32 && n_shared_total == 10)


cuda/hip uses 9 vectors in shmem. Why does this check for 10? Also the kernel only checks until n_shared_total == 9

the strategy is slightly different. Here the count includes the prec_shared vector. The number of shared vectors is always 9, so you can only check until 9. If it is greater than 9, then you know that the prec is also in shared memory.

but isn't that what storage_config::prec_shared is there for?

I think it is a bit easier with looking at n_shared as 10 vectors. Otherwise, prec_shared will need to be a template parameter as well. But I understand your point that it makes the cuda/dpcpp kernels more confusing to compare.

I would prefer the additional template parameter then. But that might also be done later.
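A rough sketch of what the additional template parameter could look like, assuming the runtime values are mapped to compile-time parameters by a small dispatcher. All names here (`launch_apply_kernel_sketch`, `dispatch_prec`, `dispatch`) are hypothetical, and the "launch" only reports the selected configuration instead of starting a kernel.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a kernel launch: it only reports which
// compile-time configuration was selected.
template <int n_shared, bool prec_shared>
std::string launch_apply_kernel_sketch()
{
    return std::to_string(n_shared) + (prec_shared ? "+prec" : "");
}

// Maps the runtime flag onto the bool template parameter.
template <int n_shared>
std::string dispatch_prec(bool prec_shared)
{
    return prec_shared ? launch_apply_kernel_sketch<n_shared, true>()
                       : launch_apply_kernel_sketch<n_shared, false>();
}

// Maps the runtime vector count (0 to 9) onto the int template parameter.
std::string dispatch(int n_shared, bool prec_shared)
{
    switch (n_shared) {
    case 0: return dispatch_prec<0>(prec_shared);
    case 1: return dispatch_prec<1>(prec_shared);
    case 2: return dispatch_prec<2>(prec_shared);
    case 3: return dispatch_prec<3>(prec_shared);
    case 4: return dispatch_prec<4>(prec_shared);
    case 5: return dispatch_prec<5>(prec_shared);
    case 6: return dispatch_prec<6>(prec_shared);
    case 7: return dispatch_prec<7>(prec_shared);
    case 8: return dispatch_prec<8>(prec_shared);
    case 9: return dispatch_prec<9>(prec_shared);
    default: throw std::invalid_argument("n_shared must be in [0, 9]");
    }
}
```

With this shape the vector count never exceeds 9 on any backend, at the cost of doubling the number of instantiations. Whether that trade-off is acceptable is exactly the open question in the thread.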

pratikvn · 2023-10-29T19:43:12Z

format!

yhmtsai · 2023-10-29T20:31:43Z

+    if (sizeof(ValueType) == 4) {
+        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);
+    } else if (sizeof(ValueType) % 8 == 0) {
+        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
+    }


do they have TwoByte? Otherwise, it may introduce some troubles when adding half

No, I don't think that is necessary. Only a value of 8 is recommended for double to avoid bank conflicts. You can just set it to 4 for half, I think.

This is kind of problematic - it configures the entire device, but we only run on a single stream. At the very least, we need to revert it after the kernel finished, otherwise we interfere with other applications' performance

I guess a scope guard similar to the one for the device id could work here.
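A minimal sketch of such a scope guard, written in plain C++ so it runs without a device. The device-wide setting is faked by a static variable; in real code the query and the update would presumably go through `cudaDeviceGetSharedMemConfig` and `cudaDeviceSetSharedMemConfig`. The names `bank_size`, `fake_device_config`, `scoped_bank_size` and `bank_size_for` are hypothetical and not the ones used in the PR.

```cpp
// Stand-in for the device-wide setting (assumption for this sketch); on
// CUDA this state lives in the runtime, not in a host variable.
enum class bank_size { four_byte, eight_byte };

inline bank_size& fake_device_config()
{
    static bank_size config = bank_size::four_byte;
    return config;
}

// Hypothetical RAII guard: applies a setting on construction and restores
// the previous one on destruction, even if the scope exits early.
class scoped_bank_size {
public:
    explicit scoped_bank_size(bank_size requested)
        : previous_{fake_device_config()}
    {
        fake_device_config() = requested;
    }
    ~scoped_bank_size() { fake_device_config() = previous_; }
    scoped_bank_size(const scoped_bank_size&) = delete;
    scoped_bank_size& operator=(const scoped_bank_size&) = delete;

private:
    bank_size previous_;
};

// Loosely follows the sizeof checks in the diff: eight-byte banks for
// types whose size is a multiple of 8, four-byte banks otherwise.
template <typename ValueType>
bank_size bank_size_for()
{
    return sizeof(ValueType) % 8 == 0 ? bank_size::eight_byte
                                      : bank_size::four_byte;
}
```

Usage would be a single line before the kernel launch, for example `scoped_bank_size guard{bank_size_for<double>()};`, after which the previous configuration comes back automatically at the end of the enclosing scope.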

yhmtsai · 2023-10-29T20:35:03Z

            }
        }
-        x.values[tidx * x.stride] = temp;
+        x[tidx] = temp;


delete stride?

I just use the plain pointers as arguments here. I guess technically we should have another stride parameter to the function, but I think that is unnecessary for now and we can add that when we support stride later.

yhmtsai · 2023-10-29T20:51:49Z

+        ValueType values[5];
+        real_type reals[2];
+        rho_old_sh = &values[0];
+        rho_new_sh = &values[1];
+        alpha_sh = &values[2];
+        omega_sh = &values[3];
+        temp_sh = &values[4];
+        norms_rhs_sh = &reals[0];
+        norms_res_sh = &reals[1];


segfault.
values and reals will be destroyed after the else block.
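One possible fix, sketched in plain C++ under the assumption that the arrays are only needed as fallback storage: declare them in the enclosing scope so that pointers taken inside the branch stay valid afterwards. The function name `sum_scalars_sketch`, the `scalars_in_slm` flag and the `slm_*` parameters are hypothetical, and `double` stands in for both `ValueType` and `real_type`.

```cpp
// Hypothetical sketch: pointers either alias externally provided storage or
// fall back to local arrays that outlive the if/else.
double sum_scalars_sketch(bool scalars_in_slm, double* slm_values,
                          double* slm_reals)
{
    // Declared before the branch, so their lifetime covers every later use.
    double values[5] = {};
    double reals[2] = {};
    double* rho_old_sh;
    double* norms_rhs_sh;
    if (scalars_in_slm) {
        rho_old_sh = &slm_values[0];
        norms_rhs_sh = &slm_reals[0];
    } else {
        // Safe: `values` and `reals` are still alive after this block.
        rho_old_sh = &values[0];
        norms_rhs_sh = &reals[0];
    }
    *rho_old_sh = 1.0;
    *norms_rhs_sh = 2.0;
    return *rho_old_sh + *norms_rhs_sh;
}
```

Had `values` and `reals` been declared inside the `else` block, the two writes after the branch would go through dangling pointers, which is the undefined behavior flagged above.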

MarcelKoch · 2023-10-30T08:15:18Z

+    {
+        using real_type = gko::remove_complex<value_type>;
+        const size_type num_batch_items = mat.num_batch_items;
+        constexpr int align_multiple = 8;


So, that alignment is only relevant if the vectors are stored in global memory, right?

yhmtsai

except for the shared_memory in dpcpp and storage computation (not reviewed yet), others LGTM

yhmtsai · 2023-10-30T10:30:07Z

+__dpct_inline__ void initialize(
+    const int num_rows, const BatchMatrixType_entry& mat_global_entry,
+    const ValueType* const b_global_entry,
+    const ValueType* const x_global_entry, ValueType& rho_old, ValueType& omega,
+    ValueType& alpha, ValueType* const x_shared_entry,
+    ValueType* const r_shared_entry, ValueType* const r_hat_shared_entry,
+    ValueType* const p_shared_entry, ValueType* const v_shared_entry,
+    typename gko::remove_complex<ValueType>& rhs_norm,
+    typename gko::remove_complex<ValueType>& res_norm,
+    sycl::nd_item<3> item_ct1)


I think CUDA will use __ldg() automatically if the pointer is const __restrict__. That's why we do not need to call __ldg explicitly.

yhmtsai · 2023-10-30T16:39:42Z

+inline batch::matrix::ell::uniform_batch<const hip_type<ValueType>,
+                                         const IndexType>


I think to_const usually runs into this issue.
Could you check that the other const versions are also correct?
If nothing related to this issue is in the public interface, it is not urgent before the release.

pratikvn · 2023-11-01T10:58:59Z

format!

pratikvn · 2023-11-05T16:20:26Z

format!

pratikvn · 2023-11-05T23:43:27Z

Turns out the no-circular-deps job is terribly slow. I verified (with the same config and flags as the job, inside the same image with a docker container) that it builds successfully with GINKGO_CHECK_CIRCULAR_DEPS=on, so I will go ahead and merge this.

sonarqubecloud · 2023-11-06T05:07:42Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
7 Code Smells

98.6% Coverage
17.7% Duplication

The version of Java (11.0.3) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

pratikvn added the 1:ST:WIP and type:batched-functionality labels Oct 26, 2023

pratikvn added this to the Release 1.7.0 milestone Oct 26, 2023

pratikvn self-assigned this Oct 26, 2023

ginkgo-bot added the reg:build, reg:testing, mod:core, mod:cuda, type:solver and mod:hip labels Oct 26, 2023

MarcelKoch reviewed Oct 27, 2023

yhmtsai reviewed Oct 29, 2023

MarcelKoch reviewed Oct 30, 2023

yhmtsai mentioned this pull request Oct 30, 2023

Add a batch::Bicgstab solver class, core, ref and omp kernels #1438

Merged

yhmtsai reviewed Oct 30, 2023

pratikvn force-pushed the batch-bicgstab-device branch from b8def5b to b653d3b Compare October 30, 2023 13:38

yhmtsai reviewed Oct 30, 2023

pratikvn force-pushed the batch-bicgstab-device branch from b653d3b to fb50eaf Compare October 30, 2023 21:37

pratikvn added the 1:ST:ready-for-review label and removed the 1:ST:WIP label Oct 31, 2023

pratikvn force-pushed the batch-bicgstab branch 3 times, most recently from 8982811 to 28560a5 Compare October 31, 2023 14:04

upsj reviewed Oct 31, 2023

Comment thread hip/base/exception.hip.hpp Outdated

pratikvn force-pushed the batch-bicgstab branch from e21b275 to 2260c8f Compare October 31, 2023 22:47

Base automatically changed from batch-bicgstab to develop November 1, 2023 09:06

yhmtsai reviewed Nov 1, 2023

Comment thread common/cuda_hip/stop/batch_criteria.hpp.inc Outdated

Comment thread core/base/batch_utilities.hpp Outdated

Comment thread core/device_hooks/common_kernels.inc.cpp

pratikvn force-pushed the batch-bicgstab-device branch from fb50eaf to d21d5fd Compare November 1, 2023 10:57

pratikvn and others added 16 commits November 5, 2023 17:10

Use synchronize for error handling

84be7dd

Format files

c6e9543

Co-authored-by: Pratik Nayak <pratikvn@pm.me>

Add scoped cuda shmem config

1054b7b

move max_shmem query to internal

cc22557

Update size_type in tests

501c4e7

Update contributors.txt

7b0ebfd

Co-authored-by: Phuong Nguyen <phuong.nguyen@icl.utk.edu>

review updates

f1babfd

Co-authored-by: Yu-Hsiang Tsai <yhmtsai@gmail.com>

Format files

221bba9

Co-authored-by: Pratik Nayak <pratikvn@pm.me>

dpcpp group size and doc fixes

aa026c1

use global_and_local barrier

79e5cad

Fix Intel2020 apply call issue

693d308

Fix diag_dominance and tol issue

705339e

Fix some include issues

6729f68

Review updates

eebc06a

Co-authored-by: Yu-Hsiang Tsai <yhmtsai@gmail.com>

use fence_space::global_and_local

498512c

Use updated deferred factory macros.

1bc6d83

pratikvn force-pushed the batch-bicgstab-device branch from f48179b to f600023 Compare November 5, 2023 16:11

Review updates

79e68b3

Co-authored-by: Yu-Hsiang Tsai <yhmtsai@gmail.com>

pratikvn force-pushed the batch-bicgstab-device branch from f600023 to 79e68b3 Compare November 5, 2023 16:17

Format files

a1b84d4

Co-authored-by: Pratik Nayak <pratikvn@pm.me>

yhmtsai approved these changes Nov 5, 2023

pratikvn merged commit 47b3267 into develop Nov 5, 2023

pratikvn deleted the batch-bicgstab-device branch November 5, 2023 23:44

tcojean mentioned this pull request Nov 6, 2023

Release 1.7.0 to master #1451

Merged

pratikvn mentioned this pull request Dec 18, 2023

Update contributors.txt #1399

Closed
