Uses unsigned offset types in thrust's scan algorithms #3436

Merged
elstehle merged 1 commit into NVIDIA:main from enh/thrust-scan-offset-types
Jan 21, 2025

elstehle
Collaborator

Description

PR #2171 added support for a large number of items to `DeviceScan` by switching to unsigned offset types. This PR reflects that switch in Thrust's scan algorithms, so that Thrust can benefit from future tunings made for unsigned offset types.
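
To illustrate the motivation, here is a minimal host-only sketch. It is not the actual Thrust or CUB dispatch code: the names `choose_unsigned_offset_t`, `max_items`, and `fits` are hypothetical and exist only for this example. The sketch assumes that an unsigned 32-bit offset is chosen when the caller's size type is at most 4 bytes wide and an unsigned 64-bit offset otherwise, and it shows that an unsigned 32-bit offset can index about twice as many items as a signed one.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper (not the CUB/Thrust API): map the caller's size type
// to an unsigned offset type that is 32 or 64 bits wide.
template <typename NumItemsT>
using choose_unsigned_offset_t =
  typename std::conditional<(sizeof(NumItemsT) <= 4), std::uint32_t, std::uint64_t>::type;

// Largest item count that a given offset type can represent.
template <typename OffsetT>
constexpr std::uint64_t max_items()
{
  return static_cast<std::uint64_t>(std::numeric_limits<OffsetT>::max());
}

// True if num_items can be indexed with OffsetT without overflow.
template <typename OffsetT>
constexpr bool fits(std::uint64_t num_items)
{
  return num_items <= max_items<OffsetT>();
}

// A 32-bit size type maps to a 32-bit unsigned offset, a 64-bit one to 64 bits.
static_assert(std::is_same<choose_unsigned_offset_t<int>, std::uint32_t>::value,
              "32-bit sizes use a 32-bit unsigned offset");
static_assert(std::is_same<choose_unsigned_offset_t<std::int64_t>, std::uint64_t>::value,
              "64-bit sizes use a 64-bit unsigned offset");

// The unsigned 32-bit range is twice the signed range plus one.
static_assert(max_items<std::uint32_t>() == 2 * max_items<std::int32_t>() + 1,
              "unsigned offsets roughly double the addressable item count");
```

Under these assumptions, a scan over three billion items still fits a 32-bit unsigned offset, while a signed 32-bit offset would overflow: `fits<std::uint32_t>(3000000000ULL)` is true and `fits<std::int32_t>(3000000000ULL)` is false.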

elstehle requested a review from a team as a code owner, January 17, 2025 12:04
elstehle requested a review from gevtushenko, January 17, 2025 12:04
Contributor

🟩 CI finished in 1h 51m: Pass: 100%/78 | Total: 2d 03h | Avg: 39m 42s | Max: 1h 08m | Hits: 288%/12760
  • 🟩 cub: Pass: 100%/38 | Total: 1d 06h | Avg: 48m 45s | Max: 1h 08m | Hits: 377%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  1d 04h | Avg: 48m 19s | Max:  1h 08m | Hits: 377%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  1h 52m | Avg: 56m 25s | Max: 57m 32s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  4h 43m | Avg: 56m 43s | Max:  1h 08m | Hits: 377%/885   
      🟩 12.5               Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 08m
      🟩 12.6               Pass: 100%/31  | Total: 23h 58m | Avg: 46m 24s | Max:  1h 06m | Hits: 377%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 54m | Avg: 57m 14s | Max: 57m 39s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 43m | Avg: 56m 43s | Max:  1h 08m | Hits: 377%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 08m
      🟩 nvcc12.6           Pass: 100%/29  | Total: 22h 04m | Avg: 45m 40s | Max:  1h 06m | Hits: 377%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 54m | Avg: 57m 14s | Max: 57m 39s
      🟩 nvcc               Pass: 100%/36  | Total:  1d 04h | Avg: 48m 16s | Max:  1h 08m | Hits: 377%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 35m | Avg: 53m 45s | Max: 56m 29s
      🟩 Clang15            Pass: 100%/1   | Total: 57m 28s | Avg: 57m 28s | Max: 57m 28s
      🟩 Clang16            Pass: 100%/1   | Total: 52m 49s | Avg: 52m 49s | Max: 52m 49s
      🟩 Clang17            Pass: 100%/1   | Total: 54m 21s | Avg: 54m 21s | Max: 54m 21s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 29m | Avg: 47m 02s | Max: 57m 39s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 47m | Avg: 53m 48s | Max: 54m 38s
      🟩 GCC8               Pass: 100%/1   | Total: 57m 05s | Avg: 57m 05s | Max: 57m 05s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 49m | Avg: 54m 37s | Max: 54m 50s
      🟩 GCC10              Pass: 100%/1   | Total: 59m 21s | Avg: 59m 21s | Max: 59m 21s
      🟩 GCC11              Pass: 100%/1   | Total: 53m 49s | Avg: 53m 49s | Max: 53m 49s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 38m | Avg: 32m 42s | Max: 53m 42s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 23m | Avg: 32m 56s | Max: 55m 19s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 15m | Avg:  1h 07m | Max:  1h 08m | Hits: 377%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 05m | Hits: 377%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 08m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total: 11h 48m | Avg: 50m 38s | Max: 57m 39s
      🟩 GCC                Pass: 100%/18  | Total: 12h 28m | Avg: 41m 35s | Max: 59m 21s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 24m | Avg:  1h 06m | Max:  1h 08m | Hits: 377%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 08m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 44m 24s | Avg: 22m 12s | Max: 25m 02s
      🟩 v100               Pass: 100%/36  | Total:  1d 06h | Avg: 50m 13s | Max:  1h 08m | Hits: 377%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  1d 04h | Avg: 55m 05s | Max:  1h 08m | Hits: 377%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 18m 15s | Avg: 18m 15s | Max: 18m 15s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 10s | Avg: 15m 10s | Max: 15m 10s
      🟩 HostLaunch         Pass: 100%/3   | Total: 56m 04s | Avg: 18m 41s | Max: 19m 46s
      🟩 TestGPU            Pass: 100%/2   | Total: 55m 17s | Avg: 27m 38s | Max: 27m 54s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 44m 24s | Avg: 22m 12s | Max: 25m 02s
      🟩 90a                Pass: 100%/1   | Total: 22m 55s | Avg: 22m 55s | Max: 22m 55s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 13h 37m | Avg: 58m 24s | Max:  1h 08m | Hits: 377%/2655  
      🟩 20                 Pass: 100%/24  | Total: 17h 14m | Avg: 43m 07s | Max:  1h 05m | Hits: 376%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 19h 55m | Avg: 32m 19s | Max: 1h 07m | Hits: 253%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 37m 35s | Avg: 18m 47s | Max: 26m 17s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 18h 58m | Avg: 32m 32s | Max:  1h 07m | Hits: 253%/9220  
      🟩 arm64              Pass: 100%/2   | Total: 56m 51s | Avg: 28m 25s | Max: 29m 55s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 01m | Avg: 36m 17s | Max: 53m 20s | Hits: 225%/1844  
      🟩 12.5               Pass: 100%/2   | Total:  1h 48m | Avg: 54m 26s | Max: 56m 46s
      🟩 12.6               Pass: 100%/30  | Total: 15h 05m | Avg: 30m 10s | Max:  1h 07m | Hits: 261%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 52m 43s | Avg: 26m 21s | Max: 27m 38s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 01m | Avg: 36m 17s | Max: 53m 20s | Hits: 225%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 48m | Avg: 54m 26s | Max: 56m 46s
      🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 12m | Avg: 30m 27s | Max:  1h 07m | Hits: 261%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 52m 43s | Avg: 26m 21s | Max: 27m 38s
      🟩 nvcc               Pass: 100%/35  | Total: 19h 03m | Avg: 32m 39s | Max:  1h 07m | Hits: 253%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 09m | Avg: 32m 18s | Max: 33m 52s
      🟩 Clang15            Pass: 100%/1   | Total: 33m 50s | Avg: 33m 50s | Max: 33m 50s
      🟩 Clang16            Pass: 100%/1   | Total: 33m 07s | Avg: 33m 07s | Max: 33m 07s
      🟩 Clang17            Pass: 100%/1   | Total: 30m 26s | Avg: 30m 26s | Max: 30m 26s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 42m | Avg: 23m 15s | Max: 31m 42s
      🟩 GCC7               Pass: 100%/2   | Total: 59m 38s | Avg: 29m 49s | Max: 30m 00s
      🟩 GCC8               Pass: 100%/1   | Total: 30m 41s | Avg: 30m 41s | Max: 30m 41s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 04m | Avg: 32m 15s | Max: 33m 58s
      🟩 GCC10              Pass: 100%/1   | Total: 32m 37s | Avg: 32m 37s | Max: 32m 37s
      🟩 GCC11              Pass: 100%/1   | Total: 32m 29s | Avg: 32m 29s | Max: 32m 29s
      🟩 GCC12              Pass: 100%/1   | Total: 31m 58s | Avg: 31m 58s | Max: 31m 58s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 49m | Avg: 21m 09s | Max: 33m 39s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 31s | Max: 59m 43s | Hits: 226%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 43m | Avg: 54m 26s | Max:  1h 07m | Hits: 272%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 48m | Avg: 54m 26s | Max: 56m 46s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  6h 29m | Avg: 27m 49s | Max: 33m 52s
      🟩 GCC                Pass: 100%/16  | Total:  7h 01m | Avg: 26m 19s | Max: 33m 58s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 36m | Avg: 55m 16s | Max:  1h 07m | Hits: 253%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 48m | Avg: 54m 26s | Max: 56m 46s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 19h 55m | Avg: 32m 19s | Max:  1h 07m | Hits: 253%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 18h 26m | Avg: 35m 41s | Max:  1h 07m | Hits: 226%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 53m 44s | Avg: 17m 54s | Max: 38m 09s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 35m 38s | Avg: 11m 52s | Max: 13m 57s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 17m 21s | Avg: 17m 21s | Max: 17m 21s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  8h 54m | Avg: 38m 10s | Max: 59m 43s | Hits: 226%/5532  
      🟩 20                 Pass: 100%/21  | Total: 10h 23m | Avg: 29m 42s | Max:  1h 07m | Hits: 295%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 8m 49s | Avg: 4m 24s | Max: 6m 51s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 49s | Avg:  4m 24s | Max:  6m 51s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  8m 49s | Avg:  4m 24s | Max:  6m 51s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  8m 49s | Avg:  4m 24s | Max:  6m 51s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  8m 49s | Avg:  4m 24s | Max:  6m 51s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  8m 49s | Avg:  4m 24s | Max:  6m 51s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  8m 49s | Avg:  4m 24s | Max:  6m 51s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  8m 49s | Avg:  4m 24s | Max:  6m 51s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 58s | Avg:  1m 58s | Max:  1m 58s
      🟩 Test               Pass: 100%/1   | Total:  6m 51s | Avg:  6m 51s | Max:  6m 51s
    
  • 🟩 python: Pass: 100%/1 | Total: 40m 19s | Avg: 40m 19s | Max: 40m 19s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 40m 19s | Avg: 40m 19s | Max: 40m 19s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 40m 19s | Avg: 40m 19s | Max: 40m 19s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 40m 19s | Avg: 40m 19s | Max: 40m 19s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 40m 19s | Avg: 40m 19s | Max: 40m 19s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 40m 19s | Avg: 40m 19s | Max: 40m 19s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 40m 19s | Avg: 40m 19s | Max: 40m 19s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 40m 19s | Avg: 40m 19s | Max: 40m 19s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 40m 19s | Avg: 40m 19s | Max: 40m 19s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

elstehle force-pushed the enh/thrust-scan-offset-types branch from f5514c2 to 2eae8f1, January 20, 2025 13:41
Contributor

🟩 CI finished in 1h 51m: Pass: 100%/78 | Total: 2d 03h | Avg: 39m 58s | Max: 1h 09m | Hits: 289%/12720
  • 🟩 cub: Pass: 100%/38 | Total: 1d 07h | Avg: 49m 16s | Max: 1h 09m | Hits: 377%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  1d 05h | Avg: 48m 47s | Max:  1h 09m | Hits: 377%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  1h 55m | Avg: 57m 57s | Max: 59m 46s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  4h 48m | Avg: 57m 43s | Max:  1h 00m | Hits: 377%/885   
      🟩 12.5               Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 01m
      🟩 12.6               Pass: 100%/31  | Total:  1d 00h | Avg: 47m 06s | Max:  1h 09m | Hits: 377%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 01m
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 48m | Avg: 57m 43s | Max:  1h 00m | Hits: 377%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 01m
      🟩 nvcc12.6           Pass: 100%/29  | Total: 22h 19m | Avg: 46m 11s | Max:  1h 09m | Hits: 377%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 01m
      🟩 nvcc               Pass: 100%/36  | Total:  1d 05h | Avg: 48m 39s | Max:  1h 09m | Hits: 377%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 37m | Avg: 54m 24s | Max: 57m 35s
      🟩 Clang15            Pass: 100%/1   | Total: 52m 06s | Avg: 52m 06s | Max: 52m 06s
      🟩 Clang16            Pass: 100%/1   | Total: 52m 24s | Avg: 52m 24s | Max: 52m 24s
      🟩 Clang17            Pass: 100%/1   | Total: 50m 58s | Avg: 50m 58s | Max: 50m 58s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 29m | Avg: 47m 07s | Max:  1h 01m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 53m | Avg: 56m 57s | Max: 56m 59s
      🟩 GCC8               Pass: 100%/1   | Total: 55m 02s | Avg: 55m 02s | Max: 55m 02s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 51m | Avg: 55m 38s | Max: 59m 26s
      🟩 GCC10              Pass: 100%/1   | Total: 58m 56s | Avg: 58m 56s | Max: 58m 56s
      🟩 GCC11              Pass: 100%/1   | Total: 58m 28s | Avg: 58m 28s | Max: 58m 28s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 38m | Avg: 32m 43s | Max: 53m 38s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 48m | Avg: 36m 03s | Max: 59m 12s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 02m | Hits: 377%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 18m | Avg:  1h 09m | Max:  1h 09m | Hits: 377%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 01m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total: 11h 42m | Avg: 50m 12s | Max:  1h 01m
      🟩 GCC                Pass: 100%/18  | Total: 13h 04m | Avg: 43m 34s | Max: 59m 26s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 21m | Avg:  1h 05m | Max:  1h 09m | Hits: 377%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 03m | Avg:  1h 01m | Max:  1h 01m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 44m 33s | Avg: 22m 16s | Max: 25m 06s
      🟩 v100               Pass: 100%/36  | Total:  1d 06h | Avg: 50m 46s | Max:  1h 09m | Hits: 377%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  1d 04h | Avg: 55m 11s | Max:  1h 09m | Hits: 377%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 19m 48s | Avg: 19m 48s | Max: 19m 48s
      🟩 GraphCapture       Pass: 100%/1   | Total: 21m 04s | Avg: 21m 04s | Max: 21m 04s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 06m | Avg: 22m 02s | Max: 24m 50s
      🟩 TestGPU            Pass: 100%/2   | Total: 54m 36s | Avg: 27m 18s | Max: 30m 13s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 44m 33s | Avg: 22m 16s | Max: 25m 06s
      🟩 90a                Pass: 100%/1   | Total: 23m 50s | Avg: 23m 50s | Max: 23m 50s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 13h 24m | Avg: 57m 29s | Max:  1h 09m | Hits: 377%/2655  
      🟩 20                 Pass: 100%/24  | Total: 17h 47m | Avg: 44m 29s | Max:  1h 09m | Hits: 376%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 19h 54m | Avg: 32m 17s | Max: 1h 09m | Hits: 255%/9180

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 39m 40s | Avg: 19m 50s | Max: 27m 13s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 18h 57m | Avg: 32m 29s | Max:  1h 09m | Hits: 255%/9180  
      🟩 arm64              Pass: 100%/2   | Total: 57m 30s | Avg: 28m 45s | Max: 30m 37s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  2h 59m | Avg: 35m 57s | Max: 58m 28s | Hits: 227%/1836  
      🟩 12.5               Pass: 100%/2   | Total:  1h 45m | Avg: 52m 33s | Max: 52m 48s
      🟩 12.6               Pass: 100%/30  | Total: 15h 09m | Avg: 30m 19s | Max:  1h 09m | Hits: 262%/7344  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 50m 44s | Avg: 25m 22s | Max: 26m 48s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  2h 59m | Avg: 35m 57s | Max: 58m 28s | Hits: 227%/1836  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 45m | Avg: 52m 33s | Max: 52m 48s
      🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 19m | Avg: 30m 41s | Max:  1h 09m | Hits: 262%/7344  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 50m 44s | Avg: 25m 22s | Max: 26m 48s
      🟩 nvcc               Pass: 100%/35  | Total: 19h 04m | Avg: 32m 41s | Max:  1h 09m | Hits: 255%/9180  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  1h 58m | Avg: 29m 44s | Max: 30m 35s
      🟩 Clang15            Pass: 100%/1   | Total: 33m 09s | Avg: 33m 09s | Max: 33m 09s
      🟩 Clang16            Pass: 100%/1   | Total: 29m 06s | Avg: 29m 06s | Max: 29m 06s
      🟩 Clang17            Pass: 100%/1   | Total: 28m 58s | Avg: 28m 58s | Max: 28m 58s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 37m | Avg: 22m 28s | Max: 29m 40s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 03m | Avg: 31m 41s | Max: 33m 28s
      🟩 GCC8               Pass: 100%/1   | Total: 30m 26s | Avg: 30m 26s | Max: 30m 26s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 04m | Avg: 32m 08s | Max: 33m 00s
      🟩 GCC10              Pass: 100%/1   | Total: 34m 10s | Avg: 34m 10s | Max: 34m 10s
      🟩 GCC11              Pass: 100%/1   | Total: 32m 19s | Avg: 32m 19s | Max: 32m 19s
      🟩 GCC12              Pass: 100%/1   | Total: 34m 15s | Avg: 34m 15s | Max: 34m 15s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 56m | Avg: 22m 01s | Max: 32m 54s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 59m | Avg: 59m 42s | Max:  1h 00m | Hits: 227%/3672  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 47m | Avg: 55m 56s | Max:  1h 09m | Hits: 273%/5508  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 45m | Avg: 52m 33s | Max: 52m 48s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  6h 07m | Avg: 26m 15s | Max: 33m 09s
      🟩 GCC                Pass: 100%/16  | Total:  7h 14m | Avg: 27m 11s | Max: 34m 15s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 47m | Avg: 57m 26s | Max:  1h 09m | Hits: 255%/9180  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 45m | Avg: 52m 33s | Max: 52m 48s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 19h 54m | Avg: 32m 17s | Max:  1h 09m | Hits: 255%/9180  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 18h 19m | Avg: 35m 28s | Max:  1h 09m | Hits: 227%/7344  
      🟩 TestCPU            Pass: 100%/3   | Total: 53m 41s | Avg: 17m 53s | Max: 37m 03s | Hits: 365%/1836  
      🟩 TestGPU            Pass: 100%/3   | Total: 41m 42s | Avg: 13m 54s | Max: 15m 37s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 17m 43s | Avg: 17m 43s | Max: 17m 43s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  8h 56m | Avg: 38m 18s | Max:  1h 00m | Hits: 227%/5508  
      🟩 20                 Pass: 100%/21  | Total: 10h 18m | Avg: 29m 28s | Max:  1h 09m | Hits: 296%/3672  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 08s | Avg: 4m 34s | Max: 6m 53s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 08s | Avg:  4m 34s | Max:  6m 53s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 08s | Avg:  4m 34s | Max:  6m 53s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 08s | Avg:  4m 34s | Max:  6m 53s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 08s | Avg:  4m 34s | Max:  6m 53s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 08s | Avg:  4m 34s | Max:  6m 53s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 08s | Avg:  4m 34s | Max:  6m 53s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 08s | Avg:  4m 34s | Max:  6m 53s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 15s | Avg:  2m 15s | Max:  2m 15s
      🟩 Test               Pass: 100%/1   | Total:  6m 53s | Avg:  6m 53s | Max:  6m 53s
    
  • 🟩 python: Pass: 100%/1 | Total: 40m 46s | Avg: 40m 46s | Max: 40m 46s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 40m 46s | Avg: 40m 46s | Max: 40m 46s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 40m 46s | Avg: 40m 46s | Max: 40m 46s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 40m 46s | Avg: 40m 46s | Max: 40m 46s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 40m 46s | Avg: 40m 46s | Max: 40m 46s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 40m 46s | Avg: 40m 46s | Max: 40m 46s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 40m 46s | Avg: 40m 46s | Max: 40m 46s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 40m 46s | Avg: 40m 46s | Max: 40m 46s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 40m 46s | Avg: 40m 46s | Max: 40m 46s
    

elstehle merged commit e556a83 into NVIDIA:main, Jan 21, 2025
89 of 92 checks passed
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 22, 2025
update docs

update docs

add `memcmp`, `memmove` and `memchr` implementations

implement tests

Use cuda::std::min/max in Thrust (NVIDIA#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (NVIDIA#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (NVIDIA#2773)

Deprecate thrust::null_type (NVIDIA#3367)

Deprecate cub::DeviceSpmv (NVIDIA#3320)

Fixes: NVIDIA#896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (NVIDIA#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (NVIDIA#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (NVIDIA#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (NVIDIA#3385)

cuda.parallel: Support structured types as algorithm inputs (NVIDIA#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Deprecate thrust::async (NVIDIA#3324)

Fixes: NVIDIA#100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (NVIDIA#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (NVIDIA#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (NVIDIA#3366)

Replace `typedef` with `using` in libcu++ (NVIDIA#3368)

Deprecate thrust::optional (NVIDIA#3307)

Fixes: NVIDIA#3306

Upgrade to Catch2 3.8  (NVIDIA#3310)

Fixes: NVIDIA#1724

refactor `<cuda/std/cstdint>` (NVIDIA#3325)

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Update CODEOWNERS (NVIDIA#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (NVIDIA#3408)

Implement more cmath functions to be usable on host and device (NVIDIA#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (NVIDIA#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Fix assert definition for NVHPC due to constexpr issues (NVIDIA#3418)

NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes NVIDIA#3411

Extend CUB reduce benchmarks (NVIDIA#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: NVIDIA#3283

Update upload-pages-artifact to v3 (NVIDIA#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <shwina@users.noreply.github.com>

Replace and deprecate thrust::cuda_cub::terminate (NVIDIA#3421)

`std::linalg` accessors and `transposed_layout` (NVIDIA#2962)

Add round up/down to multiple (NVIDIA#3234)

[FEA]: Introduce Python module with CCCL headers (NVIDIA#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under NVIDIA#3201 (comment))

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* NVIDIA#3201 (comment)

* NVIDIA#3201 (comment)

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* NVIDIA#3201 (comment)

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d6.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (NVIDIA#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a21.

Error message: NVIDIA#3201 (comment)

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd.

* Implement suggestion by @shwina (NVIDIA#3201 (review))

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

cuda.parallel: Add optional stream argument to reduce_into() (NVIDIA#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test
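
The stream handling described in the bullets above can be sketched roughly as follows. This assumes that `__cuda_stream__` is exposed as an attribute yielding a `(version, handle)` pair; the function name `validate_and_get_stream` and the error messages are hypothetical and not taken from the `cuda.parallel` sources.

```python
def validate_and_get_stream(stream):
    """Hypothetical helper: return the raw stream handle, or None if no stream was given."""
    if stream is None:
        return None

    try:
        stream_property = stream.__cuda_stream__
    except AttributeError as e:
        raise TypeError(
            f"stream argument {stream!r} does not implement __cuda_stream__"
        ) from e

    try:
        # Unpack instead of indexing: a non-iterable raises TypeError,
        # a sequence of the wrong length raises ValueError.
        version, handle = stream_property
    except (TypeError, ValueError) as e:
        raise TypeError(
            f"could not unpack (version, handle) from __cuda_stream__ of {stream!r}"
        ) from e

    if not isinstance(handle, int):
        raise TypeError(f"stream handle must be an int, got {type(handle).__name__}")
    return handle
```

Unpacking rejects a malformed return value in a single step, which is why both `TypeError` and `ValueError` are caught and reported as one protocol error.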

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (NVIDIA#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404

Fix CI issues (NVIDIA#3443)

Remove deprecated `cub::min` (NVIDIA#3450)

* Remove deprecated `cub::{min,max}`

* Drop unused `thrust::remove_cvref` file

Fix typo in builtin (NVIDIA#3451)

Moves agents to `detail::<algorithm_name>` namespace (NVIDIA#3435)

uses unsigned offset types in thrust's scan dispatch (NVIDIA#3436)

Default transform_iterator's copy ctor (NVIDIA#3395)

Fixes: NVIDIA#2393

Turn C++ dialect warning into error (NVIDIA#3453)

Uses unsigned offset types in thrust's sort algorithm calling into `DispatchMergeSort` (NVIDIA#3437)

* uses thrust's dynamic dispatch for merge_sort

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Refactor allocator handling of contiguous_storage (NVIDIA#3050)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Drop thrust::detail::integer_traits (NVIDIA#3391)

Add cuda::is_floating_point supporting half and bfloat (NVIDIA#3379)

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Improve docs of std headers (NVIDIA#3416)

Drop C++11 and C++14 support for all of cccl (NVIDIA#3417)

* Drop C++11 and C++14 support for all of cccl

---------

Co-authored-by: Bernhard Manfred Gruber <bernhardmgruber@gmail.com>

Deprecate a few CUB macros (NVIDIA#3456)

Deprecate thrust universal iterator categories (NVIDIA#3461)

Fix launch args order (NVIDIA#3465)

Add `--extended-lambda` to the list of removed clangd flags (NVIDIA#3432)

add `_CCCL_HAS_NVFP8` macro (NVIDIA#3429)

Add `_CCCL_BUILTIN_PREFETCH` (NVIDIA#3433)

Drop universal iterator categories (NVIDIA#3474)

Ensure that headers in `<cuda/*>` can be built with a C++ only compiler (NVIDIA#3472)

Specialize __is_extended_floating_point for FP8 types (NVIDIA#3470)

Also ensure that we actually can enable FP8 due to FP16 and BF16 requirements

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>

Moves CUB kernel entry points to a detail namespace (NVIDIA#3468)

* moves emptykernel to detail ns

* second batch

* third batch

* fourth batch

* fixes cuda parallel

* concatenates nested namespaces

Deprecate block/warp algo specializations (NVIDIA#3455)

Fixes: NVIDIA#3409

Refactor CUB's util_debug (NVIDIA#3345)
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 29, 2025