Ignore errors from alpaka::enqueue() in CachingAllocator::free() #44730

Merged


makortel
Contributor

@makortel makortel commented Apr 12, 2024

PR description:

#44634 reported an HLT job failure caused by an illegal memory access on a GPU. The failure surfaced as a crash instead of a caught exception because alpaka::enqueue() threw a second exception from CachingAllocator<T>::free() while objects using cached allocations were being destroyed during the stack unwinding of the original exception.

CachingAllocator<T>::free() uses alpaka::enqueue() to "record" the alpaka Event in the Queue when the freed memory block is supposed to be recached. This PR changes the behavior so that if alpaka::enqueue() throws an exception, the memory block is treated as freed instead of recached.
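A minimal sketch of that behavior, not the actual CMSSW or alpaka code: `MockQueue`, `mockEnqueue()`, `MockCachingAllocator`, `MockBuffer`, and `originalExceptionSurvivesUnwinding()` are hypothetical stand-ins invented for illustration, the blocks are host memory of a fixed size, and catching `std::exception` is an assumption about which exception type to handle.

```cpp
#include <cstddef>
#include <exception>
#include <new>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a device queue; `broken` mimics a runtime left in an error state.
struct MockQueue {
  bool broken = false;
};

// Hypothetical stand-in for alpaka::enqueue(queue, event): throws if the runtime is in error.
inline void mockEnqueue(MockQueue const& queue) {
  if (queue.broken)
    throw std::runtime_error("simulated cudaErrorIllegalAddress");
}

// Hypothetical caching allocator handing out fixed-size host blocks.
class MockCachingAllocator {
public:
  static constexpr std::size_t kBlockSize = 256;

  MockCachingAllocator() = default;
  MockCachingAllocator(MockCachingAllocator const&) = delete;
  MockCachingAllocator& operator=(MockCachingAllocator const&) = delete;
  ~MockCachingAllocator() {
    for (void* p : cached_)
      ::operator delete(p);
  }

  void* allocate() {
    if (!cached_.empty()) {  // reuse a cached block when one is available
      void* p = cached_.back();
      cached_.pop_back();
      return p;
    }
    return ::operator new(kBlockSize);
  }

  void free(void* ptr, MockQueue const& queue) {
    bool recache = true;
    try {
      mockEnqueue(queue);  // "record" the event that marks when the block is reusable
    } catch (std::exception const&) {
      recache = false;  // recording failed: do not put the block back into the cache
    }
    if (recache) {
      cached_.push_back(ptr);
    } else {
      ::operator delete(ptr);  // treat the block as freed
      ++released_;
    }
  }

  std::size_t cachedBlocks() const { return cached_.size(); }
  std::size_t releasedBlocks() const { return released_; }

private:
  std::vector<void*> cached_;
  std::size_t released_ = 0;
};

// Hypothetical RAII buffer that returns its block to the allocator on destruction.
class MockBuffer {
public:
  MockBuffer(MockCachingAllocator& alloc, MockQueue& queue)
      : alloc_(alloc), queue_(queue), ptr_(alloc.allocate()) {}
  MockBuffer(MockBuffer const&) = delete;
  MockBuffer& operator=(MockBuffer const&) = delete;
  ~MockBuffer() { alloc_.free(ptr_, queue_); }

private:
  MockCachingAllocator& alloc_;
  MockQueue& queue_;
  void* ptr_;
};

// Throws an "original" exception while a buffer is alive and the queue is broken.
// Returns true if that original exception reaches the handler.
bool originalExceptionSurvivesUnwinding(MockCachingAllocator& alloc) {
  MockQueue queue;
  try {
    MockBuffer buffer(alloc, queue);
    queue.broken = true;  // e.g. a failed kernel left the runtime in an error state
    throw std::runtime_error("original failure");
  } catch (std::runtime_error const& e) {
    return std::string(e.what()) == "original failure";
  }
  return false;
}
```

Because `free()` swallows the failure from `mockEnqueue()`, the destructor of `MockBuffer` does not throw during stack unwinding, so the original exception can still be caught and reported.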

I checked that the destructors of the alpaka Buffers, Queues, and Events do not throw exceptions, but instead report any errors from the underlying APIs as printouts.

PR validation:

I tested the reproducer in #44634 on a GPU node with CUDA_LAUNCH_BLOCKING=1 cmsRun ..., and the job now reports the exception in a useful way:

----- Begin Fatal Exception 05-Apr-2024 20:44:47 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 378940 lumi: 21 event: 5339574 stream: 0
   [1] Running path 'DST_PFScouting_JetHT_v1'
   [2] Calling method for module PFClusterSoAProducer@alpaka/'hltParticleFlowClusterHBHESoA'
Exception Message:
A std::exception was thrown.
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-c6af69ddd6f2ee5be4f2b069590bae19/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp(259) 'TApi::setDevice(queue.m_spQueueImpl->m_dev.getNativeHandle())' A previous API call (not this one) set the error  : 'cudaErrorIllegalAddress': 'an illegal memory access was encountered'!
----- End Fatal Exception -------------------------------------------------

Afterwards the job still crashes, this time in direct CUDA code (cms::cuda::abortOnCudaError() in SiPixelGainCalibrationForHLTGPU::~SiPixelGainCalibrationForHLTGPU()). That is probably not worth addressing at this time, since the direct CUDA code is expected to be removed later on.

Without CUDA_LAUNCH_BLOCKING=1 the reported exception message is no longer useful, but at least the job log contains printouts from Alpaka code that include the cudaErrorIllegalAddress error name. So while not ideal, the log contains more useful information than before this PR.

The added unit test succeeds on the Serial and CUDA backends (without the change in this PR, the unit test fails on the CUDA backend and succeeds on the Serial backend).

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to 14_0_X

@cmsbuild
Contributor

cmsbuild commented Apr 12, 2024

cms-bot internal usage

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44730/39941

  • This PR adds an extra 16KB to repository

  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Contributor

A new Pull Request was created by @makortel for master.

It involves the following packages:

  • HeterogeneousCore/AlpakaInterface (heterogeneous)

@fwyzard, @cmsbuild, @makortel can you please review it and eventually sign? Thanks.
@missirol, @rovere this is something you requested to watch as well.
@sextonkennedy, @antoniovilela, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor Author

enable gpu

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3906d3/38824/summary.html
COMMIT: 0a5eef6
CMSSW: CMSSW_14_1_X_2024-04-12-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44730/38824/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 10 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 455
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39285
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@fwyzard
Contributor

fwyzard commented Apr 12, 2024

hi Matti, I'm just trying to follow what should happen:

  • a CUDA error occurs, which causes an exception
  • while unwinding the stack, a GPU alpaka buffer is freed
  • CachingAllocator::free() records an event in the current queue
  • since the CUDA runtime is in an error state, this fails, and raises another exception, which results in a call to terminate()

is it correct ?

Then, with these changes

  • the second exception raised by recording the event is caught
  • the block is not put back into the allocator pool

However, when the block goes out of scope, it should result in a CUDA call to free the memory.
Shouldn't this also cause a second exception, and thus a call to terminate() ?

@makortel
Contributor Author

> • a CUDA error occurs, which causes an exception
> • while unwinding the stack, a GPU alpaka buffer is freed
> • CachingAllocator::free() records an event in the current queue
> • since the CUDA runtime is in an error state, this fails, and raises another exception, which results in a call to terminate()
>
> is it correct ?

Correct.

> Then, with these changes
>
> • the second exception raised by recording the event is caught
> • the block is not put back into the allocator pool
>
> However, when the block goes out of scope, it should result in a CUDA call to free the memory. Shouldn't this also cause a second exception, and thus a call to terminate() ?

The deleter used by the Alpaka CUDA/HIP backend in the Alpaka buffer does not throw an exception if cudaFree() returns an error; it prints an error message instead:
https://github.com/alpaka-group/alpaka/blob/a4142d3feb7686d803e1ec5f25d7b2278337f455/include/alpaka/mem/buf/BufUniformCudaHipRt.hpp#L266
https://github.com/alpaka-group/alpaka/blob/a4142d3feb7686d803e1ec5f25d7b2278337f455/include/alpaka/core/UniformCudaHip.hpp#L110-L111

The destructors of Alpaka's Queue and Event follow the same pattern, so they too can be left to be destroyed without special attention.
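To illustrate the pattern (a rough sketch under stated assumptions, not alpaka's implementation): `MockError`, `mockFree()`, `ReportingDeleter`, and `unwindWithBrokenRuntime()` are hypothetical names made up here, `mockFree()` stands in for a C-style call such as cudaFree() that returns an error code, and the message is written to a caller-supplied stream rather than to the real log.

```cpp
#include <memory>
#include <new>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical error code standing in for the native API's error type.
enum class MockError { Success, IllegalAddress };

// Hypothetical stand-in for a C-style free call: reports failure via its return value.
inline MockError mockFree(void* ptr, bool runtimeBroken) {
  ::operator delete(ptr);  // the host block is released either way in this mock
  return runtimeBroken ? MockError::IllegalAddress : MockError::Success;
}

// Deleter that turns an error code into a printout instead of an exception.
struct ReportingDeleter {
  bool const* runtimeBroken;  // points at the simulated runtime state
  std::ostream* log;          // where the error message is printed

  void operator()(void* ptr) const noexcept {
    if (mockFree(ptr, *runtimeBroken) != MockError::Success) {
      *log << "mockFree failed: simulated illegal memory access\n";
    }
  }
};

// Destroys a buffer during stack unwinding while the mock runtime is broken.
// Returns what the deleter printed; the original exception is caught normally.
std::string unwindWithBrokenRuntime() {
  std::ostringstream log;  // declared before the buffer, so it outlives it
  bool broken = false;
  try {
    std::unique_ptr<void, ReportingDeleter> buffer(::operator new(64),
                                                   ReportingDeleter{&broken, &log});
    broken = true;  // the runtime enters an error state while the buffer is alive
    throw std::runtime_error("original failure");
  } catch (std::runtime_error const&) {
    // Reached because the deleter did not throw a second exception.
  }
  return log.str();
}
```

The failure to free is still visible in the output, but it cannot turn stack unwinding into a call to terminate().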

@makortel
Contributor Author

I'm thinking to add a unit test

@fwyzard
Contributor

fwyzard commented Apr 15, 2024

> I'm thinking to add a unit test

ok for me :)

@makortel
Contributor Author

Added the test. Without this PR the test fails on CUDA backend.

@fwyzard
Contributor

fwyzard commented Apr 16, 2024

+heterogeneous

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44730/39975

@cmsbuild
Contributor

Pull request #44730 was updated. Can you please check and sign again?

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3906d3/38877/summary.html
COMMIT: 65d51bf
CMSSW: CMSSW_14_1_X_2024-04-16-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/44730/38877/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 18 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 787
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 38953
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@rappoccio
Contributor

+1

@cmsbuild cmsbuild merged commit f81a842 into cms-sw:master Apr 18, 2024
14 checks passed
@makortel makortel deleted the alpakaCachingAllocatorFreeException branch April 22, 2024 14:55