
Fix uploading the EventSetup conditions to multiple CUDA devices #34948

Merged

Conversation

@fwyzard (Contributor) commented Aug 19, 2021

PR description:

When multiple CUDA devices are available, each CUDA EventSetup object uses a different CUDA event per device to keep track of whether the conditions have been uploaded to that device. However, all the events were associated with the default device instead of the correct one, leading to a runtime error when multiple devices were used.

These changes associate the events used to track whether the conditions have been transferred with the correct CUDA device.

PR validation:

Without this PR, running any GPU-enabled job on a system with multiple GPUs would fail with the error

----- Begin Fatal Exception 19-Aug-2021 10:35:57 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 6 stream: 1
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module PrimaryVertexMonitor/'pixelPVMonitor'
   [3] Prefetching for module PixelVertexProducerFromSoA/'pixelVertices'
   [4] Prefetching for module PixelVertexSoAFromCUDA/'pixelVerticesSoA@cuda'
   [5] Prefetching for module PixelVertexProducerCUDA/'pixelVerticesCUDA'
   [6] Prefetching for module CAHitNtupletCUDA/'pixelTracksCUDA'
   [7] Prefetching for module SiPixelRecHitCUDA/'siPixelRecHitsPreSplittingCUDA'
   [8] Calling method for module SiPixelRawToClusterCUDA/'siPixelClustersPreSplittingCUDA'
Exception Message:
A std::exception was thrown.

/build/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/28bc67141038e9b7c17a74564c8e6173/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_1_X_2021-08-15-0000/src/HeterogeneousCore/CUDACore/interface/ESProduct.h, line 80:
cudaCheck(cudaEventRecord(data.m_event.get(), cudaStream));
cudaErrorInvalidResourceHandle: invalid resource handle
----- End Fatal Exception -------------------------------------------------

With this PR, these jobs complete correctly.
For example, step 3 from runTheMatrix.py -l 10824.502 -e -t4 --what gpu:

$ cudaComputeCapabilities 
   0     6.1    GeForce GTX 1080 Ti
   1     6.1    GeForce GTX 1080 Ti
   2     6.1    GeForce GTX 1080 Ti
   3     6.1    GeForce GTX 1080 Ti
   4     6.1    GeForce GTX 1080 Ti
   5     6.1    GeForce GTX 1080 Ti
   6     6.1    GeForce GTX 1080 Ti
   7     6.1    GeForce GTX 1080 Ti

$ cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM.py
%MSG-i ThreadStreamSetup:  (NoModuleName) 19-Aug-2021 10:59:19 CEST pre-events
setting # threads 4
setting # streams 4
%MSG
19-Aug-2021 10:59:32 CEST  Initiating request to open file file:step2.root
19-Aug-2021 10:59:33 CEST  Successfully opened file file:step2.root
Begin processing the 1st record. Run 1, Event 6, LumiSection 1 on stream 0 at 19-Aug-2021 10:59:40.893 CEST
Begin processing the 2nd record. Run 1, Event 2, LumiSection 1 on stream 1 at 19-Aug-2021 10:59:40.918 CEST
Begin processing the 3rd record. Run 1, Event 3, LumiSection 1 on stream 3 at 19-Aug-2021 10:59:40.918 CEST
Begin processing the 4th record. Run 1, Event 4, LumiSection 1 on stream 2 at 19-Aug-2021 10:59:40.937 CEST
Begin processing the 5th record. Run 1, Event 1, LumiSection 1 on stream 2 at 19-Aug-2021 10:59:46.703 CEST
Begin processing the 6th record. Run 1, Event 5, LumiSection 1 on stream 1 at 19-Aug-2021 10:59:46.813 CEST
Begin processing the 7th record. Run 1, Event 8, LumiSection 1 on stream 0 at 19-Aug-2021 10:59:46.851 CEST
Begin processing the 8th record. Run 1, Event 9, LumiSection 1 on stream 3 at 19-Aug-2021 10:59:46.947 CEST
Begin processing the 9th record. Run 1, Event 7, LumiSection 1 on stream 2 at 19-Aug-2021 10:59:47.269 CEST
Begin processing the 10th record. Run 1, Event 10, LumiSection 1 on stream 3 at 19-Aug-2021 10:59:47.469 CEST
19-Aug-2021 10:59:48 CEST  Closed file file:step2.root

If this PR is a backport please specify the original PR and why you need to backport that PR:

These changes should be backported to CMSSW_12_0_X.
These changes should be backported to CMSSW_11_3_X if there are plans to use that release cycle for more online tests.

@fwyzard (Contributor, Author) commented Aug 19, 2021

@makortel, what do you think of the changes?

Some other possibilities could be:

  • do not make changes to cms::cuda::ScopedSetDevice, and set/reset the device manually in cms::cuda::ESProduct;
  • only add the default constructor to cms::cuda::ScopedSetDevice, and set the device manually in cms::cuda::ESProduct;
  • add an overload of cms::cuda::EventCache::get() that takes the desired device, cms::cuda::EventCache::get(int device), and uses that device instead of the current one.

@fwyzard (Contributor, Author) commented Aug 19, 2021

please test

@cmsbuild (Contributor)

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34948/24769

  • This PR adds an extra 20KB to repository

Code checks have found code style and quality issues which could be resolved by applying the following patch(es)

@fwyzard (Contributor, Author) commented Aug 19, 2021

For what it's worth, I explicitly disagree that these changes required by clang-format

diff --git a/HeterogeneousCore/CUDAUtilities/interface/ScopedSetDevice.h b/HeterogeneousCore/CUDAUtilities/interface/ScopedSetDevice.h
index d11baa0191d4..0cf740af8f71 100644
--- a/HeterogeneousCore/CUDAUtilities/interface/ScopedSetDevice.h
+++ b/HeterogeneousCore/CUDAUtilities/interface/ScopedSetDevice.h
@@ -10,14 +10,10 @@ namespace cms {
     class ScopedSetDevice {
     public:
       // Store the original device, without setting a new one
-      ScopedSetDevice() {
-        cudaCheck(cudaGetDevice(&originalDevice_));
-      }
+      ScopedSetDevice() { cudaCheck(cudaGetDevice(&originalDevice_)); }
 
       // Store the original device, and set a new current device
-      explicit ScopedSetDevice(int device) : ScopedSetDevice() {
-        set(device);
-      }
+      explicit ScopedSetDevice(int device) : ScopedSetDevice() { set(device); }
 
       // Restore the original device
       ~ScopedSetDevice() {
@@ -29,9 +25,7 @@ namespace cms {
 
       // Set a new current device, without changing the original device
       // that will be restored when this object is destroyed
-      void set(int device) {
-        cudaCheck(cudaSetDevice(device));
-      }
+      void set(int device) { cudaCheck(cudaSetDevice(device)); }
 
     private:
       int originalDevice_;

make the code more readable and maintainable.

Quite the opposite, I would say.

Commit messages:

Add a default constructor that stores the current device, without changing it.

Add a set() method to change the current device, without affecting the stored value for the original device.

Associate the events used to track whether the conditions have been transferred to each device with the correct CUDA device.

@fwyzard force-pushed the CUDA_fix_concurrent_EventSetup_filling branch from 65d2aac to 097fe4a on August 19, 2021 at 09:45
@fwyzard (Contributor, Author) commented Aug 19, 2021

please test

@cmsbuild (Contributor)

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34948/24770

  • This PR adds an extra 20KB to repository

@cmsbuild (Contributor)

A new Pull Request was created by @fwyzard (Andrea Bocci) for master.

It involves the following packages:

  • HeterogeneousCore/CUDACore (heterogeneous)
  • HeterogeneousCore/CUDAUtilities (heterogeneous)

@makortel, @fwyzard can you please review it and eventually sign? Thanks.
@makortel, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@fwyzard (Contributor, Author) commented Aug 19, 2021

type bugfix

@cmsbuild (Contributor)

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86065e/17884/summary.html
COMMIT: 097fe4a
CMSSW: CMSSW_12_1_X_2021-08-18-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34948/17884/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3000352
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3000330
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 38 files compared)
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@makortel (Contributor)

Thanks @fwyzard for the fix!

what do you think of the changes?

I agree with the changes, but let me think out loud below.

Some other possibilities could be:

  • do not make changes to cms::cuda::ScopedSetDevice, and set/reset the device manually in cms::cuda::ESProduct;

I think it is better for the ESProduct constructor to set the current device back to the original. I'd expect the majority of the CUDA ESProducers not to make any CUDA API calls explicitly in their produce() functions (I didn't check, so I could easily be wrong), but if they did, I'd certainly expect the construction of the ESProduct not to change the current device.

With that requirement, I think it is better to use ScopedSetDevice than to call cudaGetDevice()/cudaSetDevice() explicitly.

  • only add the default constructor to cms::cuda::ScopedSetDevice, and set the device manually in cms::cuda::ESProduct;

I prefer the device setting to go through ScopedSetDevice.

  • add an overload of cms::cuda::EventCache::get() that takes the desired device, cms::cuda::EventCache::get(int device), and uses that device instead of the current one.

That would mean that EventCache::get() would internally use a ScopedSetDevice, which would lead to more calls to cudaSetDevice() (via the ScopedSetDevice constructor). Probably not a big deal in practice, though.

The ScopedContext classes are the other users of EventCache, and they rely on the "current device" idiom already (although in principle I'd like to pass around the device explicitly).

@makortel (Contributor)

enable gpu

@makortel (Contributor)

@cmsbuild, please test

@fwyzard (Contributor, Author) commented Aug 19, 2021

what do you think of the changes?

I agree with the changes, but let me think out loud below.

👍
I agree with your conclusions.

@fwyzard (Contributor, Author) commented Aug 19, 2021

+heterogeneous

@cmsbuild (Contributor)

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild (Contributor)

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86065e/17893/summary.html
COMMIT: 097fe4a
CMSSW: CMSSW_12_1_X_2021-08-19-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34948/17893/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19735
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19729
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3000352
  • DQMHistoTests: Total failures: 12
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3000318
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 38 files compared)
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@perrotta (Contributor)

+1

Review comment on HeterogeneousCore/CUDACore/interface/ESProduct.h:

  @@ -19,23 +20,24 @@ namespace cms {
      class ESProduct {
      public:
        ESProduct() : gpuDataPerDevice_(numberOfDevices()) {
          cms::cuda::ScopedSetDevice scopedDevice;

This needs to be protected for 0 devices, I'll prepare a fix.

On the other hand, I need to re-educate myself why exactly this happens.

On the other hand, I need to re-educate myself why exactly this happens.

It seems that the PixelCPEFast ESProduct is used for both GPU modules and CPU modules, and it holds a cms::cuda::ESProduct<...> as a member. This should really be refactored, but it might be most efficient to leave that until we have a pattern for issuing the CUDA operations from the ESProducer instead of the ESProduct (hopefully later this fall).
