-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove multi-pass mechanism #173
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonderful stuff!
I am both elated and saddened by getting rid of the prepass, it's the end of an era for Celerity ;)
One thing I noticed while reading through this is that we might be a bit inconsistent in our error handling. Mostly, it seems to follow the principle that implementation issues are checked using assert
, and probably user errors are reported using exceptions (if it's not possible at compile time). But sometimes I think we assert
things potentially caused by user errors.
Oh, and please do include updated benchmark results before merging (even though nothing much should happen). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're making really quick progress here, I love it!
I have a couple of concerns nonetheless.
batch_sycl_reduction_maker
seems both overly complicated and also not sane to me (the correct index assignment in.make(reductions...)
depends on function argument evaluation order). It should be possible to implement this functionality much simpler without any helper class by using a matchingsize_t
parameter pack like so:
template <typename KernelFlavor, typename KernelName, int Dims, typename Kernel, size_t... ReductionIndices, typename... Reductions>
auto make_device_kernel_launcher(const range<Dims>& global_range, const id<Dims>& global_offset,
typename detail::kernel_flavor_traits<KernelFlavor, Dims>::local_size_type local_range, Kernel kernel,
std::index_sequence<ReductionIndices...>, Reductions... reductions) {
// ...
detail::invoke_sycl_parallel_for<KernelName>(cgh, sycl_global_range,
make_sycl_reduction(reductions, static_cast<DataT*>(m_ptrs[ReductionIndices])...,
detail::bind_simple_kernel(hydrated_kernel, global_range, global_offset, detail::id_cast<Dims>(execution_sr.offset)));
// ...
}
// call site:
auto launcher = make_device_kernel_launcher<KernelFlavor, KernelName, Dims>(global_range, global_offset, local_range, kernel,
std::index_sequence_for<Reductions...>(), reductions...);
-
By deprecating
allow_by_ref
, there are no backstops to dangling reference captures inside host task kernels. Can we somehow emit a warning when a user relies on ref-captures instead of side effects? Ideally we would disallow these altogether in the future, but it would be best not to break the interface here. -
Host kernels can capture arbitrary expensive-to-copy data. Should we maybe make
accessor(accessor &&)
hydrating as well to keep expensive kernel lambdas movable, or do we decide that this is not worth the effort (also a valid choice in my book!)
62f59a9
to
11f522d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good now, thanks!
From live discussion with @psalz: We decided to abandon the allow_by_ref
safeguard altogether, which means that ref-capturing anything into host kernels will lead to UB without any reasonable way for us to detect it. Even if we could somehow forward the information that submit
was called with allow_by_ref
, using std::is_standard_layout
on the host lambda would mean that the user cannot value-capture any type that contains references, which seemed overly restrictive to us.
About hydration-on-move, this appears difficult to achieve since we could not report user errors though exceptions out of the noexcept
move constructors. Since we need to copy lambdas for hydration at least on oversubscripted tasks, the use of this is probably very limited anyway.
This is now implemented by the minimum version of DPC++ we require.
…exity ...until our CI setup supports overriding configurations on a per-folder basis (this check generates too many false positives for our unit tests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-tidy made some suggestions
- Instead of storing the entire CGF, we now only store the inner "command function" lambda - Accessors captured into command function closures are "hydrated" before launching kernel - Buffers and host objects can (and should) now be captured by reference into CGFs; added deprecation warnings - Accessors, side-effects and reductions may now be created from non-const buffers - Deprecate allow_by_ref - Introduce new mechnaism to tie buffer and host object lifetimes to tasks (required as we no longer store captured copies inside the CGF) Other changes: - Fix a bug in test "horizons correctly deal with antidependencies", which relied on fixed order of dependencies - Change output of "command graph printing is unchanged" smoke test (again due to change in ordering)
It currently provides two types of diagnostics: - Check whether accessor target matches kernel type (we had this before, but now it is checked during CGF submission and throws synchronously in the main thread). - Check whether all accessors (and side effects) are being copied into a kernel. Fewer accessors being copied than expected either means there are unused accessors (potential performance bug), or accessors are being captured by reference (dangling reference - very bad!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me aside from some minor missed documentation updates: docs/overview.md
still has 2 mentions of "Prepass".
Also, performance in the overhead benchmark set is basically unchanged in overall metrics (less than 1‰ diff) which is good and expected.
Use unified terminology in accessors and closure hydrator to clarify which ranges and offsets refer to backing buffers, and which ones refer to virtual buffer coordinates.
While the multi-pass approach to executing CGFs has served us well in getting the Celerity idea to work early on, it has also caused us a lot of problems in regards to reference captures and lifetimes. This PR removes the multi-pass execution and replaces it with an accessor "hydration" mechanism for an overall safer API that is more in line with SYCL.
Highlights:
allow_by_ref
Aditionally, this introduces a new "CGF diagnostics" utility type for catching common errors early. It currently provides two types of diagnostics: