Ignore errors from alpaka::enqueue() in CachingAllocator::free() #44730
Conversation
cms-bot internal usage |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44730/39941
|
A new Pull Request was created by @makortel for master. It involves the following packages:
@fwyzard, @cmsbuild, @makortel can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
enable gpu |
@cmsbuild, please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3906d3/38824/summary.html
Comparison Summary:
GPU Comparison Summary:
|
hi Matti, I'm just trying to follow what should happen:
is it correct? Then, with these changes
However, when the block goes out of scope, it should result in a CUDA call to free the memory. |
Correct.
The deleter used by the Alpaka CUDA/HIP backend in the Alpaka buffer does not throw an exception, but prints an error message if the underlying API call fails. The destructors of Alpaka's Queue and Event follow the same pattern, so they too can be left to be destructed without special attention. |
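(For illustration: a minimal sketch of this pattern, with a hypothetical deleter name rather than the actual alpaka code. The release path runs from destructors, possibly during stack unwinding, so it reports errors instead of throwing.)

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical deleter illustrating the pattern described above: a release
// function that may run from a destructor must not throw, so any error from
// the underlying API (here the CUDA runtime) is reported as a printout.
struct DeviceFreeDeleter {
  void operator()(void* ptr) const noexcept {
    cudaError_t err = cudaFree(ptr);
    if (err != cudaSuccess) {
      // print instead of throwing: this may execute while another exception
      // is already propagating, where throwing would terminate the program
      std::fprintf(stderr, "cudaFree() failed: %s\n", cudaGetErrorString(err));
    }
  }
};
```

Used as the deleter of e.g. a std::unique_ptr, letting the owner go out of scope then performs the CUDA free call without any risk of a second exception.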
I'm thinking of adding a unit test |
ok for me :) |
Force-pushed from 0a5eef6 to eb061c6
Added the test. Without this PR the test fails on the CUDA backend. |
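(For illustration only: the real test runs against the alpaka backends, but its key property can be sketched in a Catch2-style test with invented stand-in names: no exception from the enqueue step may escape free().)

```cpp
#define CATCH_CONFIG_MAIN
#include "catch.hpp"

#include <stdexcept>

// Invented stand-in for CachingAllocator<T>::free(): returns true if the
// block was recached, false if the enqueue failed and the block was freed.
bool freeBlock(bool enqueueFails) {
  try {
    if (enqueueFails)
      throw std::runtime_error("enqueue failed");  // simulated alpaka::enqueue() failure
    return true;   // event recorded, block goes back into the cache
  } catch (std::exception const&) {
    return false;  // error swallowed, block treated as freed
  }
}

TEST_CASE("free() ignores errors from the enqueue step") {
  REQUIRE_NOTHROW(freeBlock(true));  // no exception may escape free()
  CHECK(freeBlock(false));           // normal path still recaches
  CHECK_FALSE(freeBlock(true));      // failing path frees instead
}
```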
Force-pushed from eb061c6 to 65d51bf
+heterogeneous |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44730/39975
|
Pull request #44730 was updated. Can you please check and sign again. |
@cmsbuild, please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3906d3/38877/summary.html
Comparison Summary:
GPU Comparison Summary:
|
+1 |
PR description:
#44634 reported an HLT job failure caused by an illegal memory access on a GPU. The failure was reported as a crash instead of a caught exception because of a second exception being thrown from CachingAllocator<T>::free() by alpaka::enqueue() when objects using cached allocations were being deleted as part of the stack unwinding of the original exception. The alpaka::enqueue() is used in CachingAllocator<T>::free() to "record" the alpaka Event in the Queue when the freed memory block is supposed to be recached. This PR changes the behavior such that if alpaka::enqueue() throws an exception, the memory block is treated as freed instead of recached.

I checked that the destructors of the alpaka Buffers, Queues, and Events do not throw exceptions, but report any errors from the underlying APIs as printouts.
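(A minimal self-contained sketch of the changed logic, with invented names; the real CachingAllocator is templated over the alpaka device, queue, and event types.)

```cpp
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>

// Sketch of the changed error handling, with invented names: free() records
// a "ready" event in the queue before recaching a block; if that enqueue
// throws, the block is counted as freed instead and the exception is not
// allowed to propagate out of free().
struct CachingAllocatorSketch {
  std::function<void()> enqueueReadyEvent;  // stand-in for alpaka::enqueue(queue, event)
  int recached = 0;
  int freed = 0;

  void free() {
    try {
      enqueueReadyEvent();
      ++recached;  // success: the block can be reused once the event completes
    } catch (std::exception const& e) {
      // the enqueue failed, e.g. because the device is in an error state
      // during the stack unwinding of an earlier exception
      std::cerr << "enqueue failed, treating the block as freed: " << e.what() << '\n';
      ++freed;
    }
  }
};

int main() {
  CachingAllocatorSketch alloc;
  alloc.enqueueReadyEvent = [] { throw std::runtime_error("cudaErrorIllegalAddress"); };
  alloc.free();  // must not throw, even though the enqueue fails
  std::cout << "recached=" << alloc.recached << " freed=" << alloc.freed << '\n';
}
```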
PR validation:
I tested the reproducer from #44634 on a GPU node with CUDA_LAUNCH_BLOCKING=1 cmsRun ..., and now the job reports the exception in a useful way.

Afterwards the job still crashes, but in direct CUDA code (cms::cuda::abortOnCudaError() in SiPixelGainCalibrationForHLTGPU::~SiPixelGainCalibrationForHLTGPU()); that is probably not worth addressing at this time, given that the direct CUDA code is expected to be removed later on.

Without CUDA_LAUNCH_BLOCKING=1 the reported exception message is no longer useful, but at least the job log contains printouts from the Alpaka code that include the cudaErrorIllegalAddress error name. So while not ideal, the log contains more useful information than before this PR.

The added unit test succeeds on the Serial and CUDA backends (without the change of this PR the unit test fails on the CUDA backend and succeeds on the Serial backend).
If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:
To be backported to 14_0_X