Unable to choose current device because CUDAService is disabled. #32428

Open
silviodonato opened this issue Dec 9, 2020 · 29 comments

Comments

@silviodonato
Contributor

In CMSSW_11_3_X_2020-12-08-2300, workflows 136.885522, 136.888522, 10824.522 and 11634.522 fail with:

----- Begin Fatal Exception 09-Dec-2020 10:02:40 CET-----------------------
An exception of category 'CUDAError' occurred while
   [0] Processing  Event run: 320822 lumi: 40 event: 64112784 stream: 2
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module RecHitTask/'recHitPreRecoTask'
   [3] Prefetching for module HcalCPURecHitsProducer/'hbheprereco'
   [4] Prefetching for module HBHERecHitProducerGPU/'hbheRecHitProducerGPU'
   [5] Calling method for module HcalDigisProducerGPU/'hcalDigisGPU'
   [6] Calling cms::cuda::chooseDevice()
Exception Message:
Unable to choose current device because CUDAService is disabled. If CUDAService was not explicitly
disabled in the configuration, the probable cause is that there is no GPU or there is some problem
in the CUDA runtime or drivers.
----- End Fatal Exception -------------------------------------------------

This seems related to #31720. It looks like an expected error, caused by the missing GPUs on the IB test machines.

@cmsbuild
Contributor

cmsbuild commented Dec 9, 2020

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@silviodonato
Contributor Author

assign heterogeneous, core

@cmsbuild
Contributor

cmsbuild commented Dec 9, 2020

New categories assigned: heterogeneous,core

@Dr15Jones,@smuzaffar,@makortel,@makortel,@fwyzard you have been requested to review this Pull request/Issue and eventually sign? Thanks

@fwyzard
Contributor

fwyzard commented Dec 9, 2020

The workflows 136.885522, 136.888522, 10824.522, 11634.522 are explicitly running the HCAL reconstruction on GPU - so it looks like they are behaving as expected.

How would you like them to behave when there are no GPUs ?

@fwyzard
Contributor

fwyzard commented Dec 9, 2020

Here are some options:

  1. keep the GPU-only failing as they do now
  2. implement a way to mark the GPU-only workflows as "expected failures"
  3. implement a way to avoid running the GPU-only workflows
  4. change the workflows to run on GPU if available, and fall back to CPU otherwise

@fwyzard
Contributor

fwyzard commented Dec 9, 2020

I would guess that

  • 2. requires changes to the testing and reporting infrastructures
  • 3. could be done either in the test infrastructure, or in cmsDriver / runTheMatrix ?
  • 4. is doable in CMSSW; the plan was to add an automatic GPU/CPU workflow, instead of replacing these ones, but we could also replace them instead
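A minimal sketch of how the four options differ for a GPU-only workflow on a machine without a GPU. The `Policy` labels and the `outcome` function are hypothetical names chosen for illustration; this is not CMSSW or cms-bot code.

```python
from enum import Enum


class Policy(Enum):
    FAIL = 1              # option 1: keep failing as now
    EXPECTED_FAILURE = 2  # option 2: report the failure as expected
    SKIP = 3              # option 3: do not run GPU-only workflows
    FALLBACK = 4          # option 4: run on CPU when no GPU is present


def outcome(policy: Policy, gpu_available: bool) -> str:
    """Return how a GPU-only workflow would be reported under a policy."""
    if gpu_available:
        return "ran on GPU"
    if policy is Policy.FAIL:
        return "failed"
    if policy is Policy.EXPECTED_FAILURE:
        return "known failure"
    if policy is Policy.SKIP:
        return "not run"
    return "ran on CPU"


# Compare the four options on a machine without a GPU.
for policy in Policy:
    print(policy.name, "->", outcome(policy, gpu_available=False))
```

With a GPU present all four policies behave the same, so the choice only matters for the CPU-only test machines.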

@makortel
Contributor

makortel commented Dec 9, 2020

I think 4. would in principle be a viable alternative, but I also think we should find a way to run matrix workflows on GPU resources. Once that is done, 4. risks not catching certain kinds of problems, because the "problematic workflow" could fall back to CPU and succeed.

Therefore I think we should look into 3., extending it to running GPU-only workflows on GPU resources (similarly to the "GPU Unit Tests" we have in IBs).

If solving 3. is not quick, it could be useful to do 2.

@smuzaffar, what do you think?
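The risk with option 4 can be illustrated with a toy model. The function `run_workflow` and its parameters are hypothetical and only model success or failure; they do not correspond to any CMSSW interface.

```python
def run_workflow(gpu_available: bool, gpu_code_broken: bool,
                 allow_fallback: bool) -> bool:
    """Return True if the workflow succeeds in this toy model."""
    if gpu_available:
        # The GPU path is exercised, so a bug in it makes the job fail.
        return not gpu_code_broken
    # Without a GPU only the CPU path can run, and only if fallback is allowed.
    return allow_fallback


# Broken GPU code, CPU-only machine, fallback enabled: the job still succeeds.
print(run_workflow(gpu_available=False, gpu_code_broken=True, allow_fallback=True))
# The same broken code is caught only when the job really runs on a GPU.
print(run_workflow(gpu_available=True, gpu_code_broken=True, allow_fallback=True))
```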

@makortel
Contributor

makortel commented Dec 9, 2020

A possible straightforward way to implement 3. would be to define a separate -w set for GPU-only workflows. Then the GPU-only workflows would not be run by default, and it would be easy to run only them with runTheMatrix.py -w gpu.
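A rough sketch of what a separate `-w gpu` set could mean for matrix selection. The table and the function `select_matrices` are assumptions made for illustration; the set names mirror the relval_* files, but this is not the actual MatrixReader code.

```python
# Hypothetical table: matrix name -> whether it is part of the default run.
DEFAULT_MATRICES = {
    "standard": True,
    "highstats": True,
    "upgrade": False,
    "gpu": False,  # assumption: the GPU-only set is excluded from the default
}


def select_matrices(requested=None):
    """Return the matrix names to process.

    With no -w argument only the default sets are processed; with an
    explicit comma-separated list only the requested sets are processed.
    """
    if requested is None:
        return [name for name, in_default in DEFAULT_MATRICES.items() if in_default]
    wanted = requested.split(",")
    unknown = [name for name in wanted if name not in DEFAULT_MATRICES]
    if unknown:
        raise ValueError("Unknown matrix: " + ", ".join(unknown))
    return wanted


print(select_matrices())       # default run: the gpu set is left out
print(select_matrices("gpu"))  # counterpart of `runTheMatrix.py -w gpu`
```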

@smuzaffar
Contributor

+1 for -w gpu

@silviodonato
Contributor Author

assign pdmv
because this change will affect runTheMatrix.py

@cmsbuild
Contributor

New categories assigned: pdmv

@chayanit,@wajidalikhan,@jordan-martins you have been requested to review this Pull request/Issue and eventually sign? Thanks

@srimanob
Contributor

Hi,

Just for my education on how IBs work: why don't we see e.g. 10824.512 (ECAL GPU-only) failing in the same IB?
https://cms-sw.github.io/relvalLogDetail.html#slc7_amd64_gcc900;CMSSW_11_3_X_2020-12-08-2300
random resource?

@fwyzard
Contributor

fwyzard commented Dec 15, 2020

Because until #31719 gets merged, there is no ECAL-only gpu workflow - 10824.512 is identical to 10824.511 and runs on cpu.

@silviodonato
Contributor Author

@fwyzard @makortel @smuzaffar I've made #32547 , I hope this is what you asked for

@silviodonato
Contributor Author

@makortel @smuzaffar with #32547 I created the -w gpu option in runTheMatrix.py. At the moment, the gpu matrix is still included in the standard matrix. We need two things:

  • IB tests: run the gpu workflows using GPU enabled
  • PR tests: preserve the possibility to run the gpu workflows

I think we need to update Jenkins to run specific matrices (e.g. gpu and update), and then remove the gpu matrix from the standard matrix.

@smuzaffar
Contributor

@silviodonato , I am working on improving the GPU PR tests. Currently, when GPU tests are enabled, the bot runs two jobs:

  1. Run standard PR tests (compilation of externals and cmssw, then unit tests, addon tests and relvals)
  2. Run special PR tests (compilation of externals and cmssw, then gpu relvals)

cms-sw/cms-bot#1459 should allow running the GPU tests as an additional test within the standard PR test. This will avoid compiling externals and cmssw on GPU machines. Once the cms-bot changes are merged, I can include -w gpu for the GPU relval tests.

About the IB tests, I will add an extra GPU relval test which will run runTheMatrix with the -w gpu option.

@smuzaffar
Contributor

@silviodonato , for the IB tests, the easiest solution is to create a new IB queue, e.g. GPU_X, and run its tests on a GPU node. This uses the existing build and reporting system, and the results are available via the usual IB pages. Last night I ran a test GPU_X IB, and you can already see the results here:

https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/ib/CMSSW_11_3_X
unit test: https://cmssdt.cern.ch/SDT/cgi-bin/showBuildLogs.py/slc7_amd64_gcc900/www/thu/11.3.GPU-thu-23/CMSSW_11_3_GPU_X_2021-01-07-2300?utests
RelVals: https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_11_3/2021-01-07-2300?selectedArchs=slc7_amd64_gcc900&selectedFlavors=GPU_X&selectedStatus=failed&selectedStatus=known_failed&selectedStatus=passed

If this looks good, I suggest keeping this special GPU IB alive.

@silviodonato
Contributor Author

Thanks @smuzaffar, it looks perfect to me.
About PR tests, I see from #31719 (comment) that the workflows used as GPU tests are defined here https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config#L2 .

Can I remove -w gpu from the standard matrix in order to remove the failing workflows from the IB tests? Does this break anything?

@smuzaffar
Contributor

It looks like not all the tests under https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config#L2 belong to -w gpu. E.g. if I set gpu to False here https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/python/MatrixReader.py#L82 then runTheMatrix fails to find some of the workflows we run for the GPU PR tests:

> runTheMatrix.py -n -l 10824.501,10824.502,10824.511,10824.512
processing relval_standard
processing relval_highstats
processing relval_pileup
processing relval_generator
processing relval_extendedgen
processing relval_production
processing relval_ged
ignoring relval_upgrade from default matrix
ignoring relval_gpu from default matrix
processing relval_2017
processing relval_2026
ignoring relval_identity from default matrix
processing relval_machine
processing relval_premix
Traceback (most recent call last):
  File "/tmp/muzaffar/CMSSW_11_3_X_2021-01-08-1100/bin/slc7_amd64_gcc900/runTheMatrix.py", line 370, in <module>
    ret = runSelected(opt)
  File "/tmp/muzaffar/CMSSW_11_3_X_2021-01-08-1100/bin/slc7_amd64_gcc900/runTheMatrix.py", line 30, in runSelected
    if len(undefSet)>0: raise ValueError('Undefined workflows: '+', '.join(map(str,list(undefSet))))
ValueError: Undefined workflows: 10824.512, 10824.502

> runTheMatrix.py -n -l 10824.501,10824.502,10824.511,10824.512 -w gpu
ignoring non-requested file relval_standard
ignoring non-requested file relval_highstats
ignoring non-requested file relval_pileup
ignoring non-requested file relval_generator
ignoring non-requested file relval_extendedgen
ignoring non-requested file relval_production
ignoring non-requested file relval_ged
ignoring non-requested file relval_upgrade
processing relval_gpu
ignoring non-requested file relval_2017
ignoring non-requested file relval_2026
ignoring non-requested file relval_identity
ignoring non-requested file relval_machine
ignoring non-requested file relval_premix
Traceback (most recent call last):
  File "/tmp/muzaffar/CMSSW_11_3_X_2021-01-08-1100/bin/slc7_amd64_gcc900/runTheMatrix.py", line 370, in <module>
    ret = runSelected(opt)
  File "/tmp/muzaffar/CMSSW_11_3_X_2021-01-08-1100/bin/slc7_amd64_gcc900/runTheMatrix.py", line 30, in runSelected
    if len(undefSet)>0: raise ValueError('Undefined workflows: '+', '.join(map(str,list(undefSet))))
ValueError: Undefined workflows: 10824.511, 10824.501

Before dropping gpu from standard, we need to understand which workflows should go in the GPU PR tests.
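The two failures above can be reproduced with a small model of the undefined-workflow check. The `WORKFLOWS` table is an assumption inferred from the two runs, not read from the real relval_* files, and `check_selected` is a hypothetical stand-in for the check in runSelected.

```python
# Hypothetical tables: which of the four requested IDs each selection defines.
WORKFLOWS = {
    "default": {"10824.501", "10824.511"},  # default matrix with gpu disabled
    "gpu": {"10824.502", "10824.512"},      # only relval_gpu, as with -w gpu
}


def check_selected(requested, matrix):
    """Return the sorted requested IDs, or raise if any are not defined."""
    undefined = set(requested) - WORKFLOWS[matrix]
    if undefined:
        raise ValueError("Undefined workflows: " + ", ".join(sorted(undefined)))
    return sorted(requested)


requested = ["10824.501", "10824.502", "10824.511", "10824.512"]
for matrix in ("default", "gpu"):
    try:
        check_selected(requested, matrix)
    except ValueError as err:
        print(matrix, "->", err)
```

In this model neither selection alone defines all four workflows, which is why a list mixing CPU and GPU workflows fails in both runs.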

@smuzaffar
Contributor

@fwyzard , currently the PR tests for GPU run the workflows listed in https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config#L2 , which are a combination of -w gpu and -w standard (see #32428 (comment)). Can we update this list to run only a few of the -w gpu tests? Once we have this, we can disable gpu for the default runTheMatrix.

@fwyzard
Contributor

fwyzard commented Jan 11, 2021

@smuzaffar , the GPU workflows are only those from the -w gpu option.
The .501 and .511 are meant to give the same results as the GPU workflows .502 and .512, while running on the CPU.

So I think we can keep the .5?1 workflows in the standard tests, and run only the .5?2 in the GPU tests.
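The .5?1 / .5?2 shorthand can be read as a shell-style pattern, where `?` stands for a single character. A small sketch of splitting a workflow list that way follows; the helper name `split_cpu_gpu` is hypothetical, and the IDs are handled as strings.

```python
from fnmatch import fnmatchcase


def split_cpu_gpu(workflow_ids):
    """Split workflow IDs into CPU-reference (.5?1) and GPU (.5?2) lists."""
    cpu = [wf for wf in workflow_ids if fnmatchcase(wf, "*.5?1")]
    gpu = [wf for wf in workflow_ids if fnmatchcase(wf, "*.5?2")]
    return cpu, gpu


cpu, gpu = split_cpu_gpu(
    ["10824.501", "10824.502", "10824.511", "10824.512", "10824.1"]
)
print("standard tests:", cpu)
print("GPU tests:", gpu)
```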

@smuzaffar
Contributor

@fwyzard , cms-sw/cms-bot#1463 should now run the .5?2 workflows for GPU tests. What should we do about the .5?1 CPU tests? Should we always run them as part of the normal PR relval tests (this might increase the PR test time a bit)? Or do we want to run an additional relval/cpu test when GPU tests are enabled?

@fwyzard
Contributor

fwyzard commented Jan 11, 2021

Eventually we should have a single GPU workflow, and a single CPU-equivalent workflow, and I think we should run both of them for all PRs that potentially affect those workflows (i.e. probably all PRs that affect ECAL, HCAL, Pixel, Tracking, PF - so a good fraction of those that require a RECO signature).

O&C has decided that changes to the GPU reconstruction are signed only by @cms-sw/reconstruction-l2 , not by @cms-sw/heterogeneous-l2 , so I guess it's really up to them to speak up on how they prefer the tests to be organised.

@slava77
Contributor

slava77 commented Jan 11, 2021

O&C has decided that changes to the GPU reconstruction are signed only by @cms-sw/reconstruction-l2 , not by @cms-sw/heterogeneous-l2 , so I guess it's really up to them to speak up on how they prefer the tests to be organised.

IIRC the workflow definitions are not signed by reco either, it's in PDMV hands.

For the short matrix it's probably more practical to add a data workflow which does not also rerun the HLT, since none of the GPU-related reco relies on it; this is mainly for time/cost optimization of the workflow.

@silviodonato
Contributor Author

Issue solved by #32650.
The GPU tests currently run 10824.502 and 10824.512: https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config#L2

@fwyzard
Contributor

fwyzard commented Jan 19, 2021

Now that #31719 has been merged, we should add the .522 workflow to the GPU tests.

I'm kind of lost about the technical details: do we add it to the matrix, or to the bot configuration ?

@fwyzard
Contributor

fwyzard commented Mar 26, 2025

+heterogeneous

I believe this was fixed a long time ago for the CUDA workflows.

@cmsbuild
Contributor

cmsbuild commented Mar 26, 2025

cms-bot internal usage

@makortel
Contributor

+core

I believe this was fixed a long time ago for the CUDA workflows.

I agree. I think the root cause has been addressed in several ways by now.


7 participants