
[FEA]: Proposal to Change MaxSmOccupancy Inline Specification for Enhanced Compatibility Across Shared Libraries Utilizing Thrust/CUB #1391

Closed · Fixed by #1395
eee4017 opened this issue Feb 16, 2024 · 3 comments
Labels
feature request New feature or request.

Comments


eee4017 commented Feb 16, 2024

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

I propose changing the MaxSmOccupancy function's inline specifier from plain inline to forced inlining (using the _CCCL_FORCEINLINE macro) here. This change improves compatibility for projects that link together multiple shared libraries, each of which incorporates Thrust/CUB.

When Thrust/CUB is used across several shared libraries, a cudaErrorInvalidDeviceFunction error can occur if the compiler does not inline the MaxSmOccupancy function. Our project links multiple libraries (e.g., libA.so and libB.so) that both use Thrust/CUB. We discovered that the out-of-line copy of MaxSmOccupancy was resolved to the one in libA.so. When a Thrust function is invoked in libB.so, the kernel pointer (kernel_ptr), which refers to a Thrust device function compiled into libB.so, is passed to and queried by the MaxSmOccupancy copy in libA.so. CUDA rejects this with cudaErrorInvalidDeviceFunction: the CUDA Runtime API does not allow a device function pointer from one library to be queried from another, because the CUDA Driver's CUlibrary structures are opaque and managed per-library by the CUDA Runtime.

To prevent the cudaErrorInvalidDeviceFunction, MaxSmOccupancy must be forcefully inlined. That guarantees cudaOccupancyMaxActiveBlocksPerMultiprocessor is invoked from within the same library that calls the Thrust function, avoiding the error described above.

Describe the solution you'd like

Change inline to _CCCL_FORCEINLINE here

Describe alternatives you've considered

No response

Additional context

No response

eee4017 added the feature request label on Feb 16, 2024
@jrhemstad (Collaborator)

Thanks for the excellent write-up! We've dealt with countless insidious issues that originate from the interplay between symbol visibility across shared libraries and how kernel registration works in the CUDA Runtime. It's been a nasty problem that we had hoped was finally put to rest. You can read about the saga in #443.

I think you may have just identified an area that we missed 🙁.

Similar to how in #443 we had to decorate the thrust::cuda_cub::launcher::triple_chevron kernel launch function with _LIBCUDACXX_HIDDEN (which is ultimately just __attribute__((visibility("hidden")))) to avoid symbol collisions across shared objects, it would seem we need to do the same thing with cub::MaxSmOccupancy. Using force inlining as you suggest would probably work too, but the symbol visibility annotation is the more targeted solution to the real root of the problem.

I'll need @gevtushenko to confirm that this is indeed the right fix and then we'll try and take care of that ASAP.

@gevtushenko (Collaborator)

@eee4017 thank you for reporting the issue! I agree with your analysis. Every function taking kernel pointers should be hidden, and I think it goes beyond the SM occupancy calculator and the triple chevron launcher. The CUB dispatch layer has the same issue, which was likely masked by force inlining; some places (segmented sort) missed the force-inline annotation, potentially leading to linkage issues. I've filed #1395, which hides all functions taking kernel pointers. Please take a look and see whether it addresses the issue for you.

@ZelboK (Contributor) commented Feb 19, 2024

I was going to update https://github.com/NVIDIA/cccl/pull/592, but I just noticed this issue. Should I hold off?
