[L0][OpenCL] Emulate Fill with copy when patternSize is not a power of 2 #1412
Codecov Report

All modified and coverable lines are covered by tests ✅

Coverage diff:

| Metric | main | #1412 | +/- |
| --- | --- | --- | --- |
| Coverage | 14.82% | 12.51% | -2.32% |
| Files | 250 | 239 | -11 |
| Lines | 36220 | 35949 | -271 |
| Branches | 4094 | 4076 | -18 |
| Hits | 5369 | 4498 | -871 |
| Misses | 30800 | 31447 | +647 |
| Partials | 51 | 4 | -47 |
Friendly ping @oneapi-src/unified-runtime-level-zero-write @oneapi-src/unified-runtime-opencl-write, could I get a review on this please?
LGTM
…f 2 (#12912) oneapi-src/unified-runtime#1412 --------- Co-authored-by: Kenneth Benzie (Benie) <k.benzie@codeplay.com>
This PR changes the `queue.fill()` implementation to use the native fill functions of each backend. It also unifies that implementation with the one for memset, since memset is just a fill with an 8-bit pattern.

In the CUDA case, both memset and fill now call `urEnqueueUSMFill`, which, depending on the size of the fill pattern, calls `cuMemsetD8Async`, `cuMemsetD16Async`, `cuMemsetD32Async` or `commonMemSetLargePattern`. Before this patch, memset already took the same path, but always set `patternSize` to 1 byte first, which resulted in a call to `cuMemsetD8Async`. In other backends the behaviour is analogous. The fill method, by contrast, just invoked a `parallel_for` to fill the memory with the pattern, which made the operation quite slow.

This PR depends on:
- oneapi-src/unified-runtime#1395
- oneapi-src/unified-runtime#1412
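The dispatch described above can be sketched as a small selection function. This is only an illustration, not the adapter's code: `FillPath` and `selectFillPath` are hypothetical names, and the mapping assumes that the D8, D16 and D32 memset variants correspond to 1-, 2- and 4-byte patterns, with every other size falling through to the large-pattern path.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical labels for the four paths named in the description.
enum class FillPath {
  Memset8,      // stands for cuMemsetD8Async
  Memset16,     // stands for cuMemsetD16Async
  Memset32,     // stands for cuMemsetD32Async
  LargePattern  // stands for commonMemSetLargePattern
};

// Hypothetical helper: pick a fill path from the pattern size in bytes.
// Assumption: 1, 2 and 4 bytes map to the native memset variants and
// everything else takes the large-pattern path.
FillPath selectFillPath(std::size_t patternSize) {
  assert(patternSize > 0 && "a fill pattern must not be empty");
  switch (patternSize) {
  case 1:
    return FillPath::Memset8;
  case 2:
    return FillPath::Memset16;
  case 4:
    return FillPath::Memset32;
  default:
    return FillPath::LargePattern;
  }
}
```

Under this sketch, memset is the special case `selectFillPath(1)`, which is why a single fill entry point can serve both operations.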
LevelZero changes:
- Change `enqueueMemFillHelper` to allow calling it with pattern sizes which are not powers of 2. In those cases filling is emulated with copying.

OpenCL changes:
- Add `isPowerOf2` to the USM fill function.

Those changes are necessary for the PR intel/llvm#12702, which refactors `queue.fill()` to make use of `urEnqueueUSMFill`.

intel/llvm CI: intel/llvm#12912
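A host-only sketch of the idea, assuming nothing about the adapters' actual code: `needsCopyEmulation` and `fillByCopy` are hypothetical names, and `std::memcpy` stands in for whatever device copy command a backend would enqueue. The `isPowerOf2` helper here is a local definition for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// True when n has exactly one bit set (1, 2, 4, 8, ...).
constexpr bool isPowerOf2(std::size_t n) {
  return n != 0 && (n & (n - 1)) == 0;
}

// Hypothetical check: patterns whose size is not a power of 2 are
// assumed to be unsupported by the native fill and need emulation.
bool needsCopyEmulation(std::size_t patternSize) {
  return !isPowerOf2(patternSize);
}

// Hypothetical emulation: write the pattern into the destination one
// copy at a time. std::memcpy stands in for a device copy command.
void fillByCopy(void *dst, const void *pattern, std::size_t patternSize,
                std::size_t size) {
  assert(patternSize != 0 && size % patternSize == 0);
  auto *out = static_cast<unsigned char *>(dst);
  for (std::size_t offset = 0; offset < size; offset += patternSize)
    std::memcpy(out + offset, pattern, patternSize);
}
```

Issuing one copy per pattern repetition is the simplest form of the emulation; a real backend may batch or restructure the copies differently.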