
[CUDA] ProfilingTag/profiling_queue.cpp failing on unrelated changes #14053

Open
sarnex opened this issue Jun 5, 2024 · 4 comments
Labels
bug (Something isn't working) · confirmed · cuda (CUDA back-end)

Comments

@sarnex
Contributor

sarnex commented Jun 5, 2024

Describe the bug

https://github.com/intel/llvm/actions/runs/9374542664/job/25810825578

# RUN: at line 3
/__w/llvm/llvm/toolchain/bin//clang++   -fsycl -fsycl-targets=nvptx64-nvidia-cuda  /__w/llvm/llvm/llvm/sycl/test-e2e/ProfilingTag/profiling_queue.cpp -o /__w/llvm/llvm/build-e2e/ProfilingTag/Output/profiling_queue.cpp.tmp.out
# executed command: /__w/llvm/llvm/toolchain/bin//clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda /__w/llvm/llvm/llvm/sycl/test-e2e/ProfilingTag/profiling_queue.cpp -o /__w/llvm/llvm/build-e2e/ProfilingTag/Output/profiling_queue.cpp.tmp.out
# note: command had no output on stdout or stderr
# RUN: at line 4
env SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT=1 ONEAPI_DEVICE_SELECTOR=cuda:gpu  /__w/llvm/llvm/build-e2e/ProfilingTag/Output/profiling_queue.cpp.tmp.out
# executed command: env SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT=1 ONEAPI_DEVICE_SELECTOR=cuda:gpu /__w/llvm/llvm/build-e2e/ProfilingTag/Output/profiling_queue.cpp.tmp.out
# .---command stdout------------
# | StartTagSubmit: 84403198
# | StartTagStart: 84461570
# | StartTagEnd: 84463615
# | EndTagSubmit: 84465667
# | EndTagStart: 84515838
# | EndTagEnd: 84518913
# | E1Start: 84467712
# | E1End: 84513793
# | E2Start: 84459518
# | E2End: 84503555
# | StartTagEnd <= E2Start Failed!
# `-----------------------------
# error: command failed with exit status: 1

This seems to also affect other ProfilingTag tests, e.g. in_order_queue.cpp in https://github.com/intel/llvm/actions/runs/9348977102/job/25729306575.
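For context, the check these tests perform is roughly the following (a simplified sketch, not the actual test source; it assumes the sycl_ext_oneapi_profiling_tag extension's submit_profiling_tag entry point and a queue with profiling enabled, and the variable names are illustrative): profiling tags are submitted around a kernel, and their submit/start/end timestamps must be ordered consistently with the kernel's own timestamps.

```cpp
// Simplified sketch of the ordering check (not the actual test source);
// assumes the sycl_ext_oneapi_profiling_tag extension and a queue with
// profiling enabled. Variable names are illustrative.
#include <sycl/sycl.hpp>
#include <cassert>
#include <cstdint>

namespace syclex = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q{sycl::property_list{sycl::property::queue::in_order{},
                                    sycl::property::queue::enable_profiling{}}};

  sycl::event start_tag = syclex::submit_profiling_tag(q);
  sycl::event e1 = q.single_task([]() { /* timed workload */ });
  sycl::event end_tag = syclex::submit_profiling_tag(q);
  q.wait();

  uint64_t start_tag_end =
      start_tag.get_profiling_info<sycl::info::event_profiling::command_end>();
  uint64_t e1_start =
      e1.get_profiling_info<sycl::info::event_profiling::command_start>();
  uint64_t e1_end =
      e1.get_profiling_info<sycl::info::event_profiling::command_end>();
  uint64_t end_tag_start =
      end_tag.get_profiling_info<sycl::info::event_profiling::command_start>();

  // The failures in the log are violations of checks of this form: the start
  // tag must finish before the bracketed work starts, and the end tag must
  // not start before that work finishes.
  assert(start_tag_end <= e1_start && "StartTagEnd <= E1Start Failed!");
  assert(e1_end <= end_tag_start && "E1End <= EndTagStart Failed!");
  return 0;
}
```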

To reproduce

No response

Environment

No response

Additional context

No response

@steffenlarsen
Contributor

Edited the description as it looks like this affects other ProfilingTag tests as well.

steffenlarsen added a commit to steffenlarsen/llvm that referenced this issue Jun 6, 2024
CUDA backend is currently failing the profiling tag tests due to
sporadically returning times that do not correspond with the timings of
relative time queries (e.g. start happening before submission) or times
that are before previous events finish. This commit disables these tests
while intel#14053 is being addressed.

Signed-off-by: Larsen, Steffen <steffen.larsen@intel.com>
steffenlarsen added a commit that referenced this issue Jun 10, 2024

ianayl pushed a commit to ianayl/sycl that referenced this issue Jun 13, 2024
@konradkusiak97
Contributor

konradkusiak97 commented Jun 24, 2024

This surprisingly still fails on CUDA with the latest changes to the timing events: oneapi-src/unified-runtime#1634

The only tests that I am able to trigger to fail are default_queue.cpp and in_order_queue.cpp. In both cases the tests randomly fail with

StartTagSubmit <= StartTagEnd Failed! 
StartTagSubmit <= StartTagStart Failed!

They only fail when other tests are run concurrently, i.e. running default_queue.cpp 1000 times in a loop passes, but if other tests are triggered at the same time, it fails randomly.

Since we now use an extra stream to record the submission time (the HostSubmitTimeStream), my guess is that this stream is sometimes simply not ready to record before the main workload stream is (the one that records the start and end tags). Moving the creation of the HostSubmitTimeStream to urQueueCreate, or even earlier when the context is created, also doesn't solve the issue, so we might need to think of a different way to record the host submission time. I might be wrong, though.
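To illustrate the suspected race in plain CUDA terms (a minimal host-side sketch under the assumption that the submit timestamp is recorded on a separate stream; the stream and event names below are made up, not the adapter's actual code): events recorded on two independent streams have no ordering relationship, so the "submit" timestamp can legitimately land after the "start" timestamp.

```cpp
// Hypothetical illustration: events recorded on two independent CUDA streams
// are not ordered relative to each other, so a "submit" timestamp taken on a
// dedicated stream can end up later than a "start" timestamp taken on the
// work stream. Names are made up for the sketch.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  cudaStream_t hostSubmitStream, workStream;
  cudaStreamCreate(&hostSubmitStream);
  cudaStreamCreate(&workStream);

  cudaEvent_t submitEv, startEv;
  cudaEventCreate(&submitEv);
  cudaEventCreate(&startEv);

  // "Submit" timestamp on one stream, "start" timestamp on the other.
  // Nothing forces the first recording to complete before the second.
  cudaEventRecord(submitEv, hostSubmitStream);
  cudaEventRecord(startEv, workStream);

  cudaEventSynchronize(submitEv);
  cudaEventSynchronize(startEv);

  // Elapsed time from submitEv to startEv; with no cross-stream ordering it
  // can come out negative, i.e. "start" before "submit".
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, submitEv, startEv);
  printf("start - submit = %f ms\n", ms);

  cudaEventDestroy(submitEv);
  cudaEventDestroy(startEv);
  cudaStreamDestroy(hostSubmitStream);
  cudaStreamDestroy(workStream);
  return 0;
}
```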

@steffenlarsen
Contributor

Without having had a closer look, the fact that it only happens for default_queue.cpp and in_order_queue.cpp could point towards some differing behavior w.r.t. when profiling is enabled on the queue. This could come from either the runtime or the UR adapter, though given that it only happens for CUDA, I would assume the latter.

That said, it could also be timing-based, so this could be a red herring.
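For reference, the difference being speculated about is whether the queue under test was constructed with the profiling property (an illustrative snippet, not the tests' actual code):

```cpp
// Illustrative only: the two queue flavours the speculation distinguishes.
#include <sycl/sycl.hpp>

int main() {
  // A plain queue, as a test named default_queue.cpp presumably uses.
  sycl::queue plain_q;

  // A queue with profiling explicitly enabled, which changes when the
  // backend starts collecting timing information.
  sycl::queue profiling_q{sycl::property::queue::enable_profiling{}};
  return 0;
}
```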

@konradkusiak97
Contributor

I think the reason the other two tests are not failing could be that, when profiling is enabled on the queue, the HostSubmitTimeStream is already created and used by the very first kernel, before StartTagE. I checked that if we move StartTagE before that kernel, those tests also fail. Following this logic, I tried creating the HostSubmitTimeStream regardless of whether any profiling is enabled and using it to record EvBase as well. This gave the best results, with the tests only failing on the 10th try.

I think this could be timing-based, as you say. In the end, we don't enforce any dependency between the submit-time event and the start-time event, so maybe we can't guarantee they will be recorded in that order.
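If the conclusion is that this ordering has to be enforced, one way to express the missing dependency in CUDA (a sketch with made-up names, not a statement about how the UR adapter does or should implement it) is to make the work stream wait on the submit-time event before the start tag is recorded:

```cpp
// Sketch of enforcing the missing dependency: the work stream waits for the
// submit-time recording before the start tag is recorded on it. Stream,
// event, and function names are illustrative.
#include <cuda_runtime.h>

void recordTagsWithOrdering(cudaStream_t hostSubmitStream,
                            cudaStream_t workStream,
                            cudaEvent_t submitEv, cudaEvent_t startEv) {
  // Record the host-submission timestamp on its dedicated stream.
  cudaEventRecord(submitEv, hostSubmitStream);

  // Make the work stream wait until submitEv has completed, so the start tag
  // cannot receive an earlier timestamp than the submit tag.
  cudaStreamWaitEvent(workStream, submitEv, 0);

  // Only now record the start tag on the work stream.
  cudaEventRecord(startEv, workStream);
}

int main() {
  cudaStream_t hostSubmitStream, workStream;
  cudaStreamCreate(&hostSubmitStream);
  cudaStreamCreate(&workStream);

  cudaEvent_t submitEv, startEv;
  cudaEventCreate(&submitEv);
  cudaEventCreate(&startEv);

  recordTagsWithOrdering(hostSubmitStream, workStream, submitEv, startEv);
  cudaEventSynchronize(startEv);

  cudaEventDestroy(submitEv);
  cudaEventDestroy(startEv);
  cudaStreamDestroy(hostSubmitStream);
  cudaStreamDestroy(workStream);
  return 0;
}
```

This serializes the two recordings at the cost of an extra cross-stream synchronization per profiling tag.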
