Add b200 policies for cub.device.partition.flagged,if #3617
Conversation
bernhardmgruber commented on Jan 30, 2025 (edited)
- Merge before: Add b200 policies for device.select.if,flagged,unique #3545
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
🟨 CI finished in 1h 38m: Pass: 98%/89 | Total: 2d 13h | Avg: 41m 29s | Max: 1h 17m | Hits: 291%/10936

|   | Project |
|---|---------|
|   | CCCL Infrastructure |
|   | libcu++ |
| +/- | CUB |
|   | Thrust |
|   | CUDA Experimental |
|   | python |
|   | CCCL C Parallel Library |
|   | Catch2Helper |

Modifications in project or dependencies?

|   | Project |
|---|---------|
|   | CCCL Infrastructure |
|   | libcu++ |
| +/- | CUB |
| +/- | Thrust |
|   | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 89)

| # | Runner |
|---|--------|
| 65 | linux-amd64-cpu16 |
| 11 | linux-amd64-gpu-v100-latest-1 |
| 8 | windows-amd64-cpu16 |
| 4 | linux-arm64-cpu16 |
| 1 | linux-amd64-gpu-h100-latest-1 |
I discussed this PR and the necessary workarounds with @gonidelis, but we got stuck on the following question: Does DevicePartition also use the same streaming approach as DeviceSelect? For DeviceSelect we always use offset type int32 internally, so we only specify tunings for int32 (even if the API is called with int64). The CUB benchmark for DeviceSelect even has a corresponding comment: cccl/cub/benchmarks/bench/select/if.cu, lines 200 to 202 at 031efef.

From the implementation it seems DevicePartition also uses the same streaming approach, but the CUB benchmark does not have that comment and we now have tuning results for int32 and int64. This seems odd. PR #2400 is a strong indicator that DevicePartition does use the same streaming approach as DeviceSelect and we should therefore only add tunings for a single offset type (DeviceSelect benchmarks run only for int64). @elstehle can you confirm, please?
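For readers following along, here is a minimal sketch of what such a streaming approach means (hypothetical names, not the actual CUB dispatch code, and ignoring the bookkeeping the real implementation carries across chunks): the host-side dispatcher walks a 64-bit-sized input in chunks that each fit a 32-bit offset, so the device-side code, and therefore the tunings, only ever see int32 offsets.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical int32-only device path; in this sketch it stands in for the
// kernels that the tunings actually parameterize.
void partition_chunk(const int* d_in, int* d_out, std::int32_t num_items)
{
  // launch the int32-tuned kernels for this chunk here
  (void) d_in;
  (void) d_out;
  (void) num_items;
}

// Sketch of a "streaming" dispatch: even when the user passes a 64-bit problem
// size, each device invocation only ever receives a 32-bit item count.
void partition_streaming(const int* d_in, int* d_out, std::int64_t num_items)
{
  std::int64_t offset = 0;
  while (offset < num_items)
  {
    const auto chunk = static_cast<std::int32_t>(
      std::min<std::int64_t>(num_items - offset, INT32_MAX));
    partition_chunk(d_in + offset, d_out + offset, chunk);
    offset += chunk;
  }
}
```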
FWIW @elstehle, I just cross-checked all the partition tunings one-to-one. If no decision is made by tomorrow, I'll revert my changes so I don't break H100 and earlier performance, and scrap all the new tunings.
Thanks. I think what would be more interesting is to see if there's indeed a performance difference between the two tunings.

I think the decision is to use the fully specialized tunings, i.e., using both the 32-bit and the 64-bit offset-type specializations.

Agreed.
Force-pushed from f2ecc5c to 8fbe0f9
Force-pushed from b1db29d to 52f97c9
I applied some changes to how the offset type is passed around.
Then I diffed the SASS:
cmake --preset cub-benchmark -DCMAKE_CUDA_ARCHITECTURES="50;80;86;90" ...
ninja cub.bench.partition.if.base && cuobjdump -sass bin/cub.bench.partition.if.base > after.txt
git checkout upstream/main
ninja cub.bench.partition.if.base && cuobjdump -sass bin/cub.bench.partition.if.base > before.txt
diff before.txt after.txt
I see differences in kernel symbol names, because the template parameters of the policy hub changed, but otherwise no instruction changes.
The same holds for cub.bench.partition.flagged.base.
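As an aside, a toy illustration (hypothetical code, not taken from CUB) of why only the symbol names differ: the policy type is a template parameter of the kernel, so it participates in name mangling even when the constants it carries, and therefore the emitted SASS, stay the same.

```cpp
// Toy example: two policy tags that carry the same constants. Instantiating the
// kernel with either tag produces a differently mangled symbol but identical code.
struct policy_before { static constexpr int block_threads = 256; };
struct policy_after  { static constexpr int block_threads = 256; }; // re-parameterized hub

template <class Policy>
__global__ void partition_kernel(const int* in, int* out, int n)
{
  const int i = blockIdx.x * Policy::block_threads + threadIdx.x;
  if (i < n)
  {
    out[i] = in[i];
  }
}

// Explicit instantiations: cuobjdump -sass shows two kernels whose mangled names
// differ only in the policy argument, with the same instruction stream.
template __global__ void partition_kernel<policy_before>(const int*, int*, int);
template __global__ void partition_kernel<policy_after>(const int*, int*, int);
```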
Updated, @bernhardmgruber. Although there are no SLOWs at all, the perf speedup only happens for some rare problem sizes. So we have two options:

a) We keep it as is and find different tunings later on that will do better for large problem sizes (but that may as well regress the good 2^20 cases, and users might complain).

b) We scrap them all and do not provide tunings for these cases at all.
That is a valid result and I would just keep it as is.
Force-pushed from 021ea5a to cba8bcb
#if CUB_IS_INT128_ENABLED
// because we introduced cases for when offset is I64 this leads to regressions if not defaulted explicitly
template <distinct_partitions DistinctPartitions>
struct sm100_tuning<__int128_t,
                    flagged::no,
                    keep_rejects::yes,
                    offset_size::_8,
                    primitive::no,
                    input_size::_16,
                    may_alias::no,
                    DistinctPartitions>
    : sm90_tuning<__int128_t, flagged::no, keep_rejects::yes, offset_size::_4, primitive::no, input_size::_16>
// ^^^^^ this base is wrong and leads to regressions ^^^^^
{};
#endif // CUB_IS_INT128_ENABLED
@gonidelis I need more information on why this specialization is needed, since I cannot understand which other sm100_tuning could match an input_size::_16.

Also, what do we need to do about the "this base is wrong and leads to regressions" comment?

In my understanding, we can simply remove this tuning entirely. Can you test whether that works out fine?
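For context, a hedged sketch of the fallback pattern such tuning hubs commonly rely on (hypothetical names, not the actual CUB machinery): when no specialized tuning defines its members, the selector falls back to the previous architecture's policy, which is why simply deleting a questionable specialization is often the safer fix.

```cpp
#include <type_traits>

// Hypothetical stand-ins for sm90_tuning / sm100_tuning.
struct sm90_like { static constexpr int threads = 128; static constexpr int items = 7; };

template <class T>
struct sm100_like {}; // primary template is empty: "no sm100 tuning available"

// Detection idiom: use the newer tuning only if it actually defines its members,
// otherwise fall back to the older one.
template <class Tuning, class Fallback, class = void>
struct tuning_or_fallback { using type = Fallback; };

template <class Tuning, class Fallback>
struct tuning_or_fallback<Tuning, Fallback, std::void_t<decltype(Tuning::threads)>> { using type = Tuning; };

// With the empty primary template, the __int128 / 64-bit-offset case would simply
// pick the sm90-style tuning instead of inheriting from a mismatched base.
static_assert(tuning_or_fallback<sm100_like<long long>, sm90_like>::type::threads == 128,
              "no specialized tuning defined, so the fallback is used");
```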
Seems like it slipped through during the early stages of evaluation, before we had fixed any bugs. Looking into it right now.
🟨 CI finished in 1h 37m: Pass: 98%/90 | Total: 23h 48m | Avg: 15m 52s | Max: 1h 22m | Hits: 92%/131009

|   | Project |
|---|---------|
|   | CCCL Infrastructure |
|   | libcu++ |
| +/- | CUB |
|   | Thrust |
|   | CUDA Experimental |
|   | python |
|   | CCCL C Parallel Library |
|   | Catch2Helper |

Modifications in project or dependencies?

|   | Project |
|---|---------|
|   | CCCL Infrastructure |
|   | libcu++ |
| +/- | CUB |
| +/- | Thrust |
|   | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 90)

| # | Runner |
|---|--------|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
…ass template parameter to Nominal4BItemsToItems call
Force-pushed from bcf3ba4 to c62efcb
🟩 CI finished in 1h 42m: Pass: 100%/90 | Total: 2d 16h | Avg: 43m 05s | Max: 1h 16m | Hits: 74%/132225

|   | Project |
|---|---------|
|   | CCCL Infrastructure |
|   | libcu++ |
| +/- | CUB |
|   | Thrust |
|   | CUDA Experimental |
|   | python |
|   | CCCL C Parallel Library |
|   | Catch2Helper |

Modifications in project or dependencies?

|   | Project |
|---|---------|
|   | CCCL Infrastructure |
|   | libcu++ |
| +/- | CUB |
| +/- | Thrust |
|   | CUDA Experimental |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |

🏃 Runner counts (total jobs: 90)

| # | Runner |
|---|--------|
| 65 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 4 | linux-arm64-cpu16 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
| 1 | linux-amd64-gpu-h100-latest-1 |
I am happy with the results. Let's ask @elstehle for approval.
🟩 CI finished in 1h 42m: Pass: 100%/90 | Total: 2d 16h | Avg: 42m 59s | Max: 1h 17m | Hits: 74%/132225
Backport failed for branch/2.8.x. Please cherry-pick the changes locally and resolve any conflicts:

git fetch origin branch/2.8.x
git worktree add -d .worktree/backport-3617-to-branch/2.8.x origin/branch/2.8.x
cd .worktree/backport-3617-to-branch/2.8.x
git switch --create backport-3617-to-branch/2.8.x
git cherry-pick -x 8072358f8f0fedd940718979f995a75caf016d30
Co-authored-by: gonidelis <ggonidelis@nvidia.com>