Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove multi-pass mechanism #173

Merged
merged 8 commits into from
May 24, 2023
Merged

Remove multi-pass mechanism #173

merged 8 commits into from
May 24, 2023

Conversation

psalz
Copy link
Member

@psalz psalz commented Apr 13, 2023

While the multi-pass approach to executing CGFs has served us well in getting the Celerity idea to work early on, it has also caused us a lot of problems in regards to reference captures and lifetimes. This PR removes the multi-pass execution and replaces it with an accessor "hydration" mechanism for an overall safer API that is more in line with SYCL.

Highlights:

  • Instead of storing the entire CGF, we now only store the inner "command function" lambda
  • Accessors captured into command function closures are "hydrated" before launching kernel
  • Buffers and host objects can (and should) now be captured by reference into CGFs; added deprecation warnings
  • Accessors, side-effects and reductions may now be created from non-const buffers
    • Deprecate allow_by_ref
  • Introduce new mechnaism to tie buffer and host object lifetimes to tasks (required as we no longer store captured copies inside the CGF)

Aditionally, this introduces a new "CGF diagnostics" utility type for catching common errors early. It currently provides two types of diagnostics:

  • Check whether accessor target matches kernel type (we had this before, but now it is checked during CGF submission and throws synchronously in the main thread).
  • Check whether all accessors (and side effects) are being copied into a kernel. Fewer accessors being copied than expected either means there are unused accessors (potential performance bug), or accessors are being captured by reference (dangling reference - very bad!).

image

@psalz psalz requested review from fknorr and PeterTh and removed request for fknorr April 13, 2023 12:53
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

examples/reduction/reduction.cc Show resolved Hide resolved
test/runtime_deprecation_tests.cc Show resolved Hide resolved
test/runtime_tests.cc Show resolved Hide resolved
test/runtime_tests.cc Show resolved Hide resolved
test/runtime_tests.cc Show resolved Hide resolved
Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful stuff!
I am both elated and saddened by getting rid of the prepass, it's the end of an era for Celerity ;)

One thing I noticed while reading through this is that we might be a bit inconsistent in our error handling. Mostly, it seems to follow the principle that implementation issues are checked using assert, and probably user errors are reported using exceptions (if it's not possible at compile time). But sometimes I think we assert things potentially caused by user errors.

include/accessor.h Outdated Show resolved Hide resolved
include/buffer.h Show resolved Hide resolved
test/graph_compaction_tests.cc Show resolved Hide resolved
@PeterTh
Copy link
Contributor

PeterTh commented Apr 13, 2023

Oh, and please do include updated benchmark results before merging (even though nothing much should happen).

Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're making really quick progress here, I love it!

I have a couple of concerns nonetheless.

  1. batch_sycl_reduction_maker seems both overly complicated and also not sane to me (the correct index assignment in .make(reductions...) depends on function argument evaluation order). It should be possible to implement this functionality much simpler without any helper class by using a matching size_t parameter pack like so:
template <typename KernelFlavor, typename KernelName, int Dims, typename Kernel, size_t... ReductionIndices, typename... Reductions>
auto make_device_kernel_launcher(const range<Dims>& global_range, const id<Dims>& global_offset,
    typename detail::kernel_flavor_traits<KernelFlavor, Dims>::local_size_type local_range, Kernel kernel,
    std::index_sequence<ReductionIndices...>, Reductions... reductions) {
    // ...
        detail::invoke_sycl_parallel_for<KernelName>(cgh, sycl_global_range,
            make_sycl_reduction(reductions, static_cast<DataT*>(m_ptrs[ReductionIndices])...,
            detail::bind_simple_kernel(hydrated_kernel, global_range, global_offset, detail::id_cast<Dims>(execution_sr.offset)));
    // ...
}

// call site:
auto launcher = make_device_kernel_launcher<KernelFlavor, KernelName, Dims>(global_range, global_offset, local_range, kernel,
        std::index_sequence_for<Reductions...>(), reductions...);
  1. By deprecating allow_by_ref, there are no backstops to dangling reference captures inside host task kernels. Can we somehow emit a warning when a user relies on ref-captures instead of side effects? Ideally we would disallow these altogether in the future, but it would be best not to break the interface here.

  2. Host kernels can capture arbitrary expensive-to-copy data. Should we maybe make accessor(accessor &&) hydrating as well to keep expensive kernel lambdas movable, or do we decide that this is not worth the effort (also a valid choice in my book!)

.clang-tidy Show resolved Hide resolved
examples/reduction/reduction.cc Show resolved Hide resolved
include/accessor.h Outdated Show resolved Hide resolved
include/accessor.h Show resolved Hide resolved
include/accessor.h Outdated Show resolved Hide resolved
include/host_object.h Outdated Show resolved Hide resolved
include/host_object.h Show resolved Hide resolved
include/task.h Outdated Show resolved Hide resolved
include/task.h Outdated Show resolved Hide resolved
test/accessor_tests.cc Show resolved Hide resolved
@psalz psalz force-pushed the remove-multipass branch 2 times, most recently from 62f59a9 to 11f522d Compare April 17, 2023 15:10
Copy link
Contributor

@fknorr fknorr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good now, thanks!

From live discussion with @psalz: We decided to abandon the allow_by_ref safeguard altogether, which means that ref-capturing anything into host kernels will lead to UB without any reasonable way for us to detect it. Even if we could somehow forward the information that submit was called with allow_by_ref, using std::is_standard_layout on the host lambda would mean that the user cannot value-capture any type that contains references, which seemed overly restrictive to us.

About hydration-on-move, this appears difficult to achieve since we could not report user errors though exceptions out of the noexcept move constructors. Since we need to copy lambdas for hydration at least on oversubscripted tasks, the use of this is probably very limited anyway.

@facuMH facuMH mentioned this pull request May 3, 2023
@facuMH facuMH force-pushed the remove-multipass branch from f2e4d9c to ef428dc Compare May 3, 2023 13:11
include/closure_hydrator.h Outdated Show resolved Hide resolved
psalz added 2 commits May 23, 2023 15:11
This is now implemented by the minimum version of DPC++ we require.
…exity

...until our CI setup supports overriding configurations on a per-folder
basis (this check generates too many false positives for our unit tests).
@psalz psalz force-pushed the remove-multipass branch from ef428dc to da141e4 Compare May 23, 2023 13:16
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

include/handler.h Show resolved Hide resolved
psalz added 2 commits May 23, 2023 18:21
- Instead of storing the entire CGF, we now only store the inner "command function" lambda
- Accessors captured into command function closures are "hydrated" before launching kernel
- Buffers and host objects can (and should) now be captured by reference into CGFs; added deprecation warnings
   - Accessors, side-effects and reductions may now be created from non-const buffers
- Deprecate allow_by_ref
- Introduce new mechnaism to tie buffer and host object lifetimes to
  tasks (required as we no longer store captured copies inside the CGF)

Other changes:
- Fix a bug in test "horizons correctly deal with antidependencies",
  which relied on fixed order of dependencies
- Change output of "command graph printing is unchanged" smoke test
  (again due to change in ordering)
It currently provides two types of diagnostics:
 - Check whether accessor target matches kernel type (we had this
   before, but now it is checked during CGF submission and throws
   synchronously in the main thread).
 - Check whether all accessors (and side effects) are being copied into
   a kernel. Fewer accessors being copied than expected either means
   there are unused accessors (potential performance bug), or accessors
   are being captured by reference (dangling reference - very bad!).
@psalz psalz force-pushed the remove-multipass branch from a172676 to 21870bb Compare May 23, 2023 16:21
Copy link
Contributor

@PeterTh PeterTh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me aside from some minor missed documentation updates: docs/overview.md still has 2 mentions of "Prepass".

Also, performance in the overhead benchmark set is basically unchanged in overall metrics (less than 1‰ diff) which is good and expected.

psalz added 4 commits May 24, 2023 16:09
Use unified terminology in accessors and closure hydrator to clarify
which ranges and offsets refer to backing buffers, and which ones refer
to virtual buffer coordinates.
@psalz psalz force-pushed the remove-multipass branch from 21870bb to 37cf095 Compare May 24, 2023 14:09
@psalz psalz merged commit 00b12cf into master May 24, 2023
@psalz psalz deleted the remove-multipass branch May 24, 2023 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants