-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix uploading the EventSetup conditions to multiple CUDA devices #34948
Fix uploading the EventSetup conditions to multiple CUDA devices #34948
Conversation
@makortel, what do you think of the changes ? Some other possibilities could be:
|
please test |
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34948/24769
Code check has found code style and quality issues which could be resolved by applying following patch(s)
|
For what is worth, I explicitly disagree that these changes required by diff --git a/HeterogeneousCore/CUDAUtilities/interface/ScopedSetDevice.h b/HeterogeneousCore/CUDAUtilities/interface/ScopedSetDevice.h
index d11baa0191d4..0cf740af8f71 100644
--- a/HeterogeneousCore/CUDAUtilities/interface/ScopedSetDevice.h
+++ b/HeterogeneousCore/CUDAUtilities/interface/ScopedSetDevice.h
@@ -10,14 +10,10 @@ namespace cms {
class ScopedSetDevice {
public:
// Store the original device, without setting a new one
- ScopedSetDevice() {
- cudaCheck(cudaGetDevice(&originalDevice_));
- }
+ ScopedSetDevice() { cudaCheck(cudaGetDevice(&originalDevice_)); }
// Store the original device, and set a new current device
- explicit ScopedSetDevice(int device) : ScopedSetDevice() {
- set(device);
- }
+ explicit ScopedSetDevice(int device) : ScopedSetDevice() { set(device); }
// Restore the original device
~ScopedSetDevice() {
@@ -29,9 +25,7 @@ namespace cms {
// Set a new current device, without changing the original device
// that will be restored when this object is destroyed
- void set(int device) {
- cudaCheck(cudaSetDevice(device));
- }
+ void set(int device) { cudaCheck(cudaSetDevice(device)); }
private:
int originalDevice_; make the code more readable and maintainable. Quite the opposite I would say. |
Add a default constructor that stores the current devices, without changing it. Add a set() method to change the current device, without affecting the stored value for the original device.
Associate to the correct CUDA device the events used to track if the conditions have been transferred to each device.
65d2aac
to
097fe4a
Compare
please test |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34948/24770
|
A new Pull Request was created by @fwyzard (Andrea Bocci) for master. It involves the following packages:
@makortel, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
type bugfix |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86065e/17884/summary.html Comparison SummarySummary:
|
Thanks @fwyzard for the fix!
I agree with the changes, but let me think out loud below.
I think it is better for the With that requirement I think it is better to use the
I think I like more the device setting going through
That would mean that the The |
enable gpu |
@cmsbuild, please test |
👍 |
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2) |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-86065e/17893/summary.html GPU Comparison SummarySummary:
Comparison SummarySummary:
|
+1 |
@@ -19,23 +20,24 @@ namespace cms { | |||
class ESProduct { | |||
public: | |||
ESProduct() : gpuDataPerDevice_(numberOfDevices()) { | |||
cms::cuda::ScopedSetDevice scopedDevice; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be protected for 0 devices, I'll prepare a fix.
On the other hand, I need to re-educate myself why exactly this happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, I need to re-educate myself why exactly this happens.
Seems that the PixelCPEFast
ESProduct is used for both GPU modules and CPU modules, and it holds cms::cuda::ESProduct<...>
as a member. This should really be refactored, but it might be most efficient to leave that to the time we have a pattern for issuing the CUDA operations from the ESProducer instead of the ESProduct (hopefully later this fall).
PR description:
When multiple CUDA devices are available, each CUDA EventSetup object uses a different CUDA event to keep track if the conditions have been uploaded to each device. However, all events were associated to the default device, instead of the correct one, leading to a runtime error when multiple devices were used.
These changes associate to the correct CUDA device the events used to track if the conditions have been transferred.
PR validation:
Without this PR, running any GPU-enabled job on a system with multiple GPUs would fail with the error
With this PR, these jobs complete correctly.
For example, step 3 from
runTheMatrix.py -l 10824.502 -e -t4 --what gpu
:If this PR is a backport please specify the original PR and why you need to backport that PR:
These changes should be backported to CMSSW_12_0_X.
These changes should be backported to CMSSW_11_3_X if there are plans to use that release cycle for more online tests.