
[CUDA] ProfilingTag/profiling_queue.cpp failing on unrelated changes #14053

Open
sarnex opened this issue Jun 5, 2024 · 4 comments
Labels
bug (Something isn't working) · confirmed · cuda (CUDA back-end)

Comments

@sarnex
Contributor

sarnex commented Jun 5, 2024

Describe the bug

https://github.com/intel/llvm/actions/runs/9374542664/job/25810825578

# RUN: at line 3
/__w/llvm/llvm/toolchain/bin//clang++   -fsycl -fsycl-targets=nvptx64-nvidia-cuda  /__w/llvm/llvm/llvm/sycl/test-e2e/ProfilingTag/profiling_queue.cpp -o /__w/llvm/llvm/build-e2e/ProfilingTag/Output/profiling_queue.cpp.tmp.out
# executed command: /__w/llvm/llvm/toolchain/bin//clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda /__w/llvm/llvm/llvm/sycl/test-e2e/ProfilingTag/profiling_queue.cpp -o /__w/llvm/llvm/build-e2e/ProfilingTag/Output/profiling_queue.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 4
env SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT=1 ONEAPI_DEVICE_SELECTOR=cuda:gpu  /__w/llvm/llvm/build-e2e/ProfilingTag/Output/profiling_queue.cpp.tmp.out
# executed command: env SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT=1 ONEAPI_DEVICE_SELECTOR=cuda:gpu /__w/llvm/llvm/build-e2e/ProfilingTag/Output/profiling_queue.cpp.tmp.out
# .---command stdout------------
# | StartTagSubmit: 84403198
# | StartTagStart: 84461570
# | StartTagEnd: 84463615
# | EndTagSubmit: 84465667
# | EndTagStart: 84515838
# | EndTagEnd: 84518913
# | E1Start: 84467712
# | E1End: 84513793
# | E2Start: 84459518
# | E2End: 84503555
# | StartTagEnd <= E2Start Failed!
# `-----------------------------
# error: command failed with exit status: 1

This seems to also affect other ProfilingTag tests, e.g. in_order_queue.cpp in https://github.com/intel/llvm/actions/runs/9348977102/job/25729306575.
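For context, the check these tests perform is roughly the following (a simplified sketch, not the actual test source; it assumes the sycl_ext_oneapi_profiling_tag extension's submit_profiling_tag entry point and a queue with profiling enabled, and the variable names are illustrative): profiling tags are submitted around a kernel, and their submit/start/end timestamps must be ordered consistently with the kernel's own timestamps.

```cpp
// Simplified sketch of the ordering check (not the actual test source);
// assumes the sycl_ext_oneapi_profiling_tag extension and a queue with
// profiling enabled. Variable names are illustrative.
#include <sycl/sycl.hpp>
#include <cassert>
#include <cstdint>

namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q{sycl::property_list{sycl::property::queue::in_order{},
                                    sycl::property::queue::enable_profiling{}}};

  sycl::event start_tag = syclex::submit_profiling_tag(q);
  sycl::event e1 = q.single_task([]() { /* timed workload */ });
  sycl::event end_tag = syclex::submit_profiling_tag(q);
  q.wait();

  uint64_t start_tag_end =
      start_tag.get_profiling_info<sycl::info::event_profiling::command_end>();
  uint64_t e1_start =
      e1.get_profiling_info<sycl::info::event_profiling::command_start>();
  uint64_t e1_end =
      e1.get_profiling_info<sycl::info::event_profiling::command_end>();
  uint64_t end_tag_start =
      end_tag.get_profiling_info<sycl::info::event_profiling::command_start>();

  // The failures in the log are violations of checks of this form: the start
  // tag must finish before the bracketed work starts, and the end tag must
  // not start before that work finishes.
  assert(start_tag_end <= e1_start && "StartTagEnd <= E1Start Failed!");
  assert(e1_end <= end_tag_start && "E1End <= EndTagStart Failed!");
  return 0;
}
```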

To reproduce

No response

Environment

No response

Additional context

No response

@steffenlarsen
Contributor

Edited the description as it looks like this affects other ProfilingTag tests as well.

steffenlarsen added a commit to steffenlarsen/llvm that referenced this issue Jun 6, 2024
CUDA backend is currently failing the profiling tag tests due to
sporadically returning times that do not correspond with the timings of
relative time queries (e.g. start happening before submission) or times
that are before previous events finish. This commit disables these tests
while intel#14053 is being addressed.

Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
steffenlarsen added a commit that referenced this issue Jun 10, 2024

ianayl pushed a commit to ianayl/sycl that referenced this issue Jun 13, 2024
@konradkusiak97
Contributor

konradkusiak97 commented Jun 24, 2024

This surprisingly still fails on CUDA with the latest changes to the timing events: oneapi-src/unified-runtime#1634

The only tests that I am able to trigger to fail are default_queue.cpp and in_order_queue.cpp. In both cases the tests randomly fail with

StartTagSubmit <= StartTagEnd Failed! 
StartTagSubmit <= StartTagStart Failed!

They only fail when other tests are run concurrently, i.e. running default_queue.cpp 1000 times in a loop passes, but if other tests are triggered at the same time, it fails randomly.

Since we now use an extra stream to record the submission time (the HostSubmitTimeStream), my guess is that this stream is sometimes simply not ready to record before the main workload stream is (the one that records the start and end tags). Moving the creation of the HostSubmitTimeStream to urQueueCreate, or even earlier when the context is created, also doesn't solve the issue, so we might need to think of a different way to record the host submission time. I might be wrong, though.
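To illustrate the suspected race in plain CUDA terms (a minimal host-side sketch under the assumption that the submit timestamp is recorded on a separate stream; the stream and event names below are made up, not the adapter's actual code): events recorded on two independent streams have no ordering relationship, so the "submit" timestamp can legitimately land after the "start" timestamp.

```cpp
// Hypothetical illustration: events recorded on two independent CUDA streams
// are not ordered relative to each other, so a "submit" timestamp taken on a
// dedicated stream can end up later than a "start" timestamp taken on the
// work stream. Names are made up for the sketch.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  cudaStream_t hostSubmitStream, workStream;
  cudaStreamCreate(&hostSubmitStream);
  cudaStreamCreate(&workStream);

  cudaEvent_t submitEv, startEv;
  cudaEventCreate(&submitEv);
  cudaEventCreate(&startEv);

  // "Submit" timestamp on one stream, "start" timestamp on the other.
  // Nothing forces the first recording to complete before the second.
  cudaEventRecord(submitEv, hostSubmitStream);
  cudaEventRecord(startEv, workStream);

  cudaEventSynchronize(submitEv);
  cudaEventSynchronize(startEv);

  // Elapsed time from submitEv to startEv; with no cross-stream ordering it
  // can come out negative, i.e. "start" before "submit".
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, submitEv, startEv);
  printf("start - submit = %f ms\n", ms);

  cudaEventDestroy(submitEv);
  cudaEventDestroy(startEv);
  cudaStreamDestroy(hostSubmitStream);
  cudaStreamDestroy(workStream);
  return 0;
}
```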

@steffenlarsen
Contributor

Without having had a closer look, the fact that it only happens for default_queue.cpp and in_order_queue.cpp could point towards some differing behavior w.r.t. when profiling is enabled on the queue. This could come from either the runtime or the UR adapter, though given that it only happens for CUDA, I would assume the latter.

That said, it could also be timing-based, so this could be a red herring.
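For reference, the difference being speculated about is whether the queue under test was constructed with the profiling property (an illustrative snippet, not the tests' actual code):

```cpp
// Illustrative only: the two queue flavours the speculation distinguishes.
#include <sycl/sycl.hpp>

int main() {
  // A plain queue, as a test named default_queue.cpp presumably uses.
  sycl::queue plain_q;

  // A queue with profiling explicitly enabled, which changes when the
  // backend starts collecting timing information.
  sycl::queue profiling_q{sycl::property::queue::enable_profiling{}};
  return 0;
}
```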

@konradkusiak97
Contributor

I think the reason the other two tests are not failing could be that, when profiling is enabled on the queue, the HostSubmitTimeStream is already created and used by the very first kernel, before StartTagE. I checked that if we move StartTagE before that kernel, those tests also fail. Following this logic, I tried creating the HostSubmitTimeStream regardless of whether any profiling is enabled and using it to record EvBase as well. This gave the best results, with the tests only failing on the 10th try.

I think this could be timing-based, as you say. In the end, we don't enforce any dependency between the submit-time event and the start-time event, so maybe we can't guarantee they will be recorded in that order.
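If the conclusion is that this ordering has to be enforced, one way to express the missing dependency in CUDA (a sketch with made-up names, not a statement about how the UR adapter does or should implement it) is to make the work stream wait on the submit-time event before the start tag is recorded:

```cpp
// Sketch of enforcing the missing dependency: the work stream waits for the
// submit-time recording before the start tag is recorded on it. Stream,
// event, and function names are illustrative.
#include <cuda_runtime.h>

void recordTagsWithOrdering(cudaStream_t hostSubmitStream,
                            cudaStream_t workStream,
                            cudaEvent_t submitEv, cudaEvent_t startEv) {
  // Record the host-submission timestamp on its dedicated stream.
  cudaEventRecord(submitEv, hostSubmitStream);

  // Make the work stream wait until submitEv has completed, so the start tag
  // cannot receive an earlier timestamp than the submit tag.
  cudaStreamWaitEvent(workStream, submitEv, 0);

  // Only now record the start tag on the work stream.
  cudaEventRecord(startEv, workStream);
}

int main() {
  cudaStream_t hostSubmitStream, workStream;
  cudaStreamCreate(&hostSubmitStream);
  cudaStreamCreate(&workStream);

  cudaEvent_t submitEv, startEv;
  cudaEventCreate(&submitEv);
  cudaEventCreate(&startEv);

  recordTagsWithOrdering(hostSubmitStream, workStream, submitEv, startEv);
  cudaEventSynchronize(startEv);

  cudaEventDestroy(submitEv);
  cudaEventDestroy(startEv);
  cudaStreamDestroy(hostSubmitStream);
  cudaStreamDestroy(workStream);
  return 0;
}
```

This serializes the two recordings at the cost of an extra cross-stream synchronization per profiling tag.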
