Implement `pad` as a CUDA kernel #860

Merged
Conversation
Commits 5d60959 to db04e9d
`Ops.pad` was a fairly slow operation on GPU: it iterates over all sequences and copies each one into the padded array, which results in a large number of kernel launches. In the biaffine parser, padding the inputs was more costly than applying the biaffine layers. This change optimizes the `pad` op with a custom CUDA kernel. The kernel receives an array of pointers to the CuPy arrays that are provided as a list, and fills the output array while parallelizing over the time steps. This provides a large amount of parallelism, since there are usually n_steps * hidden_size elements to parallelize over.
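A minimal NumPy sketch of what the op computes may help. This is not the PR's CUDA code; it models the semantics of `pad` on the CPU, and the `round_to` parameter and comments about kernel launches reflect my reading of the description above:

```python
import numpy as np

def pad(seqs, round_to=1):
    # Sketch of Ops.pad semantics: stack variable-length sequences into
    # one array, zero-padding each sequence to the longest length.
    max_len = max(len(seq) for seq in seqs)
    # Round the padded length up to a multiple of `round_to`.
    max_len = ((max_len + round_to - 1) // round_to) * round_to
    width = seqs[0].shape[1]
    out = np.zeros((len(seqs), max_len, width), dtype=seqs[0].dtype)
    for i, seq in enumerate(seqs):
        # The slow GPU path effectively launches one copy kernel per
        # sequence in this loop. The CUDA kernel in this PR instead takes
        # an array of device pointers to the sequences and fills `out` in
        # a single launch, parallelizing over time steps.
        out[i, : len(seq)] = seq
    return out
```

The single-launch kernel removes the per-sequence launch overhead that dominated padding cost in the biaffine parser.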
shadeMe reviewed Apr 17, 2023
shadeMe approved these changes Apr 19, 2023
@explosion-bot please test_slow_gpu

URL: https://buildkite.com/explosion-ai/thinc-slow-gpu-tests/builds/41
Description

Before / after benchmark output was attached to the original pull request.

Warning: please do not review yet! There are still some to-do items and I still want to be able to rebase the branch onto master.

Types of change

Checklist