Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cuda::device::memcpy_async_tx #405

Merged
merged 19 commits into from
Oct 13, 2023
Merged

Conversation

ahendriksen
Copy link
Contributor

@ahendriksen ahendriksen commented Sep 5, 2023

Description

Closes #404

Add cuda::device::memcpy_async_tx for global to shared memory loads.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link
Collaborator

@gonzalobg gonzalobg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial review, this is probably blocked on the other PRs about arrive_tx and the TMA intrinsics.

Copies `size` bytes from global memory `src` to shared memory `dest` and arrives
on a shared memory barrier `bar`, updating its transaction count by `size`
bytes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs should mention preconditions and effects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the preconditions. I really don't grok what should be documented under effects, so I have left that as is. The current memcpy_async docs didn't have anything I could use as inspiration.

libcudacxx/include/cuda/barrier Outdated Show resolved Hide resolved
Copy link
Collaborator

@griwes griwes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine. If we ever come up with the formulation for pipeline_tx we'll want to move the function into its own header and reuse much of the code between pipeline and barrier versions, but since that's very much a future thing, this is sufficient for now.

libcudacxx/include/cuda/std/detail/libcxx/include/barrier Outdated Show resolved Hide resolved
// unconditionally. This permits checking for the feature in host code. When
// __CUDA_MINIMUM_ARCH__ is available, we only enable the feature when the
// hardware supports it.
#if (!defined(__CUDA_MINIMUM_ARCH__)) || (defined(__CUDA_MINIMUM_ARCH__) && 900 <= __CUDA_MINIMUM_ARCH__)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: @griwes @wmaxey Do we want to move such conditions into __config and give it a proper name like _LIBCUDACXX_REQUIRES_SM_90

Copy link
Contributor Author

@ahendriksen ahendriksen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. I have moved the docs, added tests, and improved the docs.

The only thing that I am still unsure of is the what the effects section should list. @gonzalobg

Copies `size` bytes from global memory `src` to shared memory `dest` and arrives
on a shared memory barrier `bar`, updating its transaction count by `size`
bytes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did the preconditions. I really don't grok what should be documented under effects, so I have left that as is. The current memcpy_async docs didn't have anything I could use as inspiration.

@ahendriksen ahendriksen marked this pull request as ready for review September 20, 2023 07:58
@ahendriksen ahendriksen requested review from a team as code owners September 20, 2023 07:58
@ahendriksen ahendriksen requested review from wmaxey and ericniebler and removed request for a team September 20, 2023 07:58
Copy link
Contributor Author

@ahendriksen ahendriksen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Suggestions have been implemented and branch has been rebased on top of main.

@miscco miscco requested a review from gonzalobg October 12, 2023 06:42
ahendriksen and others added 2 commits October 12, 2023 12:11
…ync_tx.md

Co-authored-by: gonzalobg <65027571+gonzalobg@users.noreply.github.com>
@miscco miscco merged commit c7c46a3 into NVIDIA:main Oct 13, 2023
469 checks passed
@miscco
Copy link
Collaborator

miscco commented Oct 13, 2023

Thanks a lot for adding this new feature 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA]: memcpy_async_tx
5 participants