
Run all stdgpu operations on a specified cuda stream #423

Closed
tanzby opened this issue Jun 21, 2024 · 4 comments
Comments

@tanzby
Contributor

tanzby commented Jun 21, 2024

I noticed that some functions like stdgpu::detail::memcpy are non-async and run on the DEFAULT CUDA stream. More details: stdgpu::detail::memcpy depends on dispatch_memcpy, which looks like this:

void dispatch_memcpy(void* destination,
                     const void* source,
                     index64_t bytes,
                     dynamic_memory_type destination_type,
                     dynamic_memory_type source_type)
{
    ...

    // uses the default stream here
    STDGPU_CUDA_SAFE_CALL(cudaMemcpy(destination, source, static_cast<std::size_t>(bytes), kind));
}

For example, if we use a CUDA graph and try to capture all operations on a stream, an error is raised because different streams (the default one and the user's) get mixed:

stdgpu : CUDA ERROR :
  Error     : operation would make the legacy stream depend on a capturing blocking stream
  File      : external/stdgpu/src/stdgpu/cuda/impl/memory.cpp:123
  Function  : void stdgpu::cuda::dispatch_memcpy(void *, const void *, stdgpu::index64_t, stdgpu::dynamic_memory_type, stdgpu::dynamic_memory_type)

So my request: run all stdgpu operations on a specified CUDA stream.

@stotko
Owner

stotko commented Jun 21, 2024

Most of the functionality should support custom CUDA streams by taking a respective execution_policy which wraps the stream, see #351. Part of the memory API is one notable exception, but the memcpy-like functions are not actually used in the containers. Could you provide some pointers to a particular function in stdgpu that triggers this error when called? Does it already happen when you only create a new container, e.g. auto c = stdgpu::vector<int>::createDeviceObject(1000);?
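
For illustration, roughly like this (a sketch only; it assumes the execution_policy overloads forward a Thrust CUDA policy such as thrust::cuda::par.on(stream), and the exact overload set may differ):

#include <stdgpu/unordered_set.cuh>
#include <thrust/system/cuda/execution_policy.h>

void build_on_stream(cudaStream_t stream)
{
    // Wrap the user stream in an execution policy and pass it to the
    // overloads that accept one, so the work is submitted to `stream`.
    auto policy = thrust::cuda::par.on(stream);

    auto set = stdgpu::unordered_set<int>::createDeviceObject(policy, 1000);
    auto range = set.device_range(policy);
    (void)range;

    stdgpu::unordered_set<int>::destroyDeviceObject(policy, set);
}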

@tanzby
Contributor Author

tanzby commented Jun 23, 2024

@stotko Such as stdgpu::unordered_map<>::device_range. It is non-async:

Function  : void stdgpu::cuda::dispatch_memcpy(void *, const void *, stdgpu::index64_t, stdgpu::dynamic_memory_type, stdgpu::dynamic_memory_type)
    @          0x52ff5ac stdgpu::cuda::safe_call()
    @          0x52ff476 stdgpu::cuda::dispatch_memcpy()
    @          0x52fd629 stdgpu::detail::dispatch_memcpy()
    @          0x52fd7c2 stdgpu::detail::memcpy()
    @          0x52e7ebf copyHost2DeviceArray<>()
    @          0x52e7e83 stdgpu::atomic_ref<>::store()
    @          0x52e56d9 stdgpu::atomic<>::store()
    @          0x52edf64 stdgpu::detail::unordered_base<>::device_range<>()
    @          0x52edf25 stdgpu::detail::unordered_base<>::device_range()
    @          0x52ed9be stdgpu::unordered_set<>::device_range()

The whole pipeline looks like:

insert_kernel<<<>>>(xxx);                               // on stream
auto block_range = block_indices().device_range();      // a synchronous, blocking operation
update_block_meta_kernel<<<>>>(xxx);                    // on stream

But I have to admit that this is difficult to express purely as stream-ordered operations, and I do not know whether it can be achieved.

@stotko
Owner

stotko commented Jun 25, 2024

Thanks. Even though the device_range() method comes with an overload that accepts an execution_policy, it internally needs an atomic whose load() and store() functions use non-async memcpy. We probably need respective overloads for these as well to get full stream support here.
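
Conceptually, the missing piece would be something along these lines (hypothetical names, not the actual stdgpu implementation): the copies behind load()/store() get issued on the caller's stream instead of the legacy default stream.

#include <cuda_runtime.h>

// Hypothetical helpers only, for illustration of the idea.
template <typename T>
T load_on_stream(const T* device_value, cudaStream_t stream)
{
    T host_value{};
    cudaMemcpyAsync(&host_value, device_value, sizeof(T), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream); // the host needs the value, so wait on this stream only
    return host_value;
}

template <typename T>
void store_on_stream(T* device_value, const T& host_value, cudaStream_t stream)
{
    cudaMemcpyAsync(device_value, &host_value, sizeof(T), cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream); // host_value may leave scope, so wait before returning
}

This avoids the implicit legacy-stream dependency, although a host-side readback like load() still cannot be captured into a graph.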

@stotko
Owner

stotko commented Nov 20, 2024

Sorry for the long delay. It took a larger refactoring to fill the gaps in the stream support, but with #450 this issue should be resolved.

@stotko stotko closed this as completed Nov 20, 2024