Use event to handle distributed row gatherer#1882
Merged
Merged
Conversation
2daf422 to
a3f291a
Compare
846d8d4 to
d65428b
Compare
d65428b to
85dfc7b
Compare
MarcelKoch
reviewed
Nov 13, 2025
| ptr_param<LinOp> x, | ||
| array<char>& workspace) const; | ||
|
|
||
| std::shared_ptr<const Event> apply_prepare(ptr_param<const LinOp> b, |
Member
There was a problem hiding this comment.
Note: make these protected and add the distributed matrix as a friend.
85dfc7b to
92f9715
Compare
MarcelKoch
requested changes
Nov 24, 2025
MarcelKoch
left a comment
Member
There was a problem hiding this comment.
I would suggest changing the arguments to apply_prepare.
Comment on lines
+377
to
+382
| auto ev = apply_prepare(this->rg.get(), b, x, workspace); | ||
| // we modify the workspace to all 0 | ||
| workspace.fill(static_cast<char>(0)); | ||
| this->exec->synchronize(); | ||
| auto req = apply_finalize(this->rg.get(), b, x, ev, workspace); | ||
| req.wait(); |
Member
There was a problem hiding this comment.
Just for my understanding, this tests that the send buffer can be overwritten after the apply_prepare, right?
Member
Author
There was a problem hiding this comment.
yes (but require on the same stream). It checks the apply finalize indeed uses the workspace
92f9715 to
f32c492
Compare
MarcelKoch
approved these changes
Nov 26, 2025
4bd5529 to
f13e720
Compare
f13e720 to
4fe9d4a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
With the current distributed row gatherer , we does the following:
local row gatherer (prepare data for mpi) -> synchronize (it is required by mpi) -> submit mpi ialltoallv -> submit local spmv
Ginkgo will do local spmv and mpi ialltoallv.
However, the local spmv is forced to wait for synchronize() and then submit its kernel, which lead the gap between gpu activity.
Another approach is introduced in this pr to use event
local row gatherer -> record event -> submit local spmv -> wait the event -> submit mpi ialltoallv.
the submission local spmv does not need to wait for local row gatherer to finish, but it will give additional overhead during record event.
For those does not support AsyncEvent, we just synchronize when creating it.
something from the profiler

The local spmv starts earlier (yellow box for gpu activity and gree for cpu). The MPI communication starts later now (pink box). It gives better performance because local spmv still covers the mpi communication. If mpi communication is not covered by the local spmv, this will give slower performance. I will say it is okay-ish now. This situation means it is network bandwidth bound now, which is slow compared to other bandwidth.
We can improve it by introducing thread but it will lead another discussion whether we allow thread in ReferenceExecutor.
There are two approach might also help this situation - stream-aware mpi or async execution.
stream-aware mpi needs more study like using isend/irecv implement all_to_all_v + row gatherer.
async execution I have done something for async schwarz which touching some design question on executor, so it lead larger and longer discussion. Thus, I do not touch it here.