Use native ("USM") pointers for backing buffer allocations #162
Conversation
clang-tidy made some suggestions
Great work!
I especially like that this gets rid of some silly workarounds/hacks that were previously necessary.
clang-tidy made some suggestions
clang-tidy made some suggestions
Some nitpicks from my side, otherwise I would say this is tried and tested by now!
Re the ComputeCpp CI failure, which I'm also seeing in #163: The backend library doesn't set the
clang-tidy made some suggestions
Compare: 4502e6f to 6de986d
clang-tidy made some suggestions
Looks good to me now!
This introduces the new SKIP macro for skipping tests at runtime.
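A runtime skip macro might look roughly like the following. This is a hypothetical sketch, not the actual implementation: the real macro presumably integrates with the test framework's reporting, and the names `SKIP`/`test_cuda_copy` are used here purely for illustration.

```cpp
#include <cstdio>

// Hypothetical sketch: when a prerequisite (e.g. a required backend or
// device) is unavailable at runtime, print a message and return from the
// test body early instead of failing.
#define SKIP(msg)                                        \
	do {                                                 \
		std::printf("skipping test: %s\n", (msg));       \
		return;                                          \
	} while(0)

int g_body_executed = 0;

// Example test body that skips when no CUDA device is present.
void test_cuda_copy(bool have_cuda_device) {
	if(!have_cuda_device) SKIP("no CUDA device available");
	g_body_executed = 1; // only reached when the test was not skipped
}
```

The point of skipping at runtime (rather than compiling tests out) is that the same test binary can run on machines with and without the relevant hardware.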
Use native pointers (allocated using `sycl::malloc_device`) instead of relying on SYCL buffers for backing Celerity virtual buffers. This greatly simplifies various aspects of accessor and buffer management while enabling future optimizations. Furthermore, by using native pointers we completely circumvent any dataflow analysis performed by the SYCL runtime.
Since SYCL 2020 does not support multi-dimensional (rectangular) copies for USM pointers, we have to either do them in a loop (slow) or fall back to vendor-specific APIs. This introduces a new "backend" system that does the latter. Currently only "generic" (= SYCL, slow) and CUDA (when using OpenSYCL or DPC++) backends are supported. Since backends are configured at compile time, this additionally introduces a new integration testing mechanism for testing them. Because this requires Celerity to be built with different CMake options, the test is implemented as a Python script.
...instead of range and offset.
This transitions Celerity away from using SYCL buffers for backing buffer allocations to essentially device-native memory allocations through `sycl::malloc_device` and `sycl::free`, which are part of SYCL 2020's USM APIs. This has several advantages.
Unfortunately, there is also a major downside: SYCL's USM capabilities are very much one-dimensional at the moment.
This is reflected across all USM APIs; particularly problematic for us is that there is no way of doing 2D/3D rectangular (strided) copies. Dispatching a series of 1D copies in a loop is not a viable solution, as it is extremely slow. Instead, for now (until SYCL Next, hopefully), we have to fall back to doing copies manually using the underlying vendor APIs. This means we need specialized code paths for each backend that we want to support efficiently.
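For illustration, the loop-of-1D-copies fallback amounts to something like the following plain C++ sketch. `std::memcpy` stands in for a USM copy here, and the function name and signature are made up for illustration; with real USM this would be one `queue.memcpy()` per row, which is exactly why it is so slow for regions with many rows.

```cpp
#include <cstddef>
#include <cstring>

// Sketch of the generic fallback: copy a 2D rectangular region between two
// linearized buffers by issuing one contiguous (1D) copy per row. A
// specialized backend replaces this whole loop with a single vendor call
// that copies the rectangle in one operation.
void copy_region_2d(char* dst, std::size_t dst_pitch,       // bytes per dst row
                    const char* src, std::size_t src_pitch, // bytes per src row
                    std::size_t width_bytes,                // bytes per copied row
                    std::size_t height)                     // number of rows
{
	for(std::size_t row = 0; row < height; ++row) {
		std::memcpy(dst + row * dst_pitch, src + row * src_pitch, width_bytes);
	}
}
```

For CUDA, the equivalent single call would be `cudaMemcpy2D` (or `cudaMemcpy3D` for 3D regions), which takes the same pitch/width/height parameters.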
Currently I have only implemented a specialized backend for CUDA as well as a generic (and slow) fallback; however, adding new backends should be relatively straightforward. Backends are selected dynamically based on the device being used. Since SYCL does not yet officially support CUDA as a backend, we also need an implementation-specific mechanism for detecting whether a device is a CUDA device; this is currently supported for OpenSYCL and DPC++.
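The selection logic could be sketched as below. This is a hypothetical illustration, not Celerity's actual API: `backend_kind` and `select_backend` are made-up names, and the real mechanism queries the SYCL device through implementation-specific APIs rather than matching a vendor string.

```cpp
#include <string>

// Hypothetical sketch of dynamic backend selection: the specialized CUDA
// path is only compiled in when the corresponding CMake option is set, and
// at runtime the matching backend is picked per device.
enum class backend_kind { generic, cuda };

backend_kind select_backend(const std::string& device_vendor) {
#if defined(CELERITY_ENABLE_CUDA_BACKEND)
	// Stand-in for the implementation-specific CUDA device check.
	if(device_vendor.find("NVIDIA") != std::string::npos) return backend_kind::cuda;
#endif
	return backend_kind::generic; // always-available, slow fallback
}
```

The generic backend acts as the universal default, so a build without any specialized backend enabled still works on every device, just with slower strided copies.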
I wasn't quite sure what the best way of structuring the backend system would be. Loading specialized libraries dynamically at runtime (similar to what OpenSYCL does) seemed like overkill, so I've instead opted to make enabling/disabling backends a compile-time option (`-DCELERITY_ENABLE_CUDA_BACKEND`). This makes testing somewhat more difficult, though, as the behavior of test cases is now affected by how they are compiled, which SYCL implementation is used, and which hardware is available at runtime. I've added some unit tests that opportunistically test whatever is available; to cover more of these combinations in our CI setup, I've also created our first true integration test (we previously used our examples as integration tests, which was always suboptimal) in the form of a Python script that compiles and runs different backend configurations.