Unable to choose current device because CUDAService is disabled. #32428

silviodonato · 2020-12-09T11:00:04Z

In CMSSW_11_3_X_2020-12-08-2300, workflows 136.885522, 136.888522, 10824.522 and 11634.522 fail with:

----- Begin Fatal Exception 09-Dec-2020 10:02:40 CET-----------------------
An exception of category 'CUDAError' occurred while
   [0] Processing  Event run: 320822 lumi: 40 event: 64112784 stream: 2
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module RecHitTask/'recHitPreRecoTask'
   [3] Prefetching for module HcalCPURecHitsProducer/'hbheprereco'
   [4] Prefetching for module HBHERecHitProducerGPU/'hbheRecHitProducerGPU'
   [5] Calling method for module HcalDigisProducerGPU/'hcalDigisGPU'
   [6] Calling cms::cuda::chooseDevice()
Exception Message:
Unable to choose current device because CUDAService is disabled. If CUDAService was not explicitly
disabled in the configuration, the probable cause is that there is no GPU or there is some problem
in the CUDA runtime or drivers.
----- End Fatal Exception -------------------------------------------------

It seems related to #31720. It looks like an expected error, caused by the missing GPU on the IB test machines.


cmsbuild · 2020-12-09T11:00:20Z

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

silviodonato · 2020-12-09T11:00:31Z

assign heterogeneous, core

cmsbuild · 2020-12-09T11:00:41Z

New categories assigned: heterogeneous,core

@Dr15Jones,@smuzaffar,@makortel,@makortel,@fwyzard you have been requested to review this Pull request/Issue and eventually sign? Thanks

fwyzard · 2020-12-09T11:21:44Z

The workflows 136.885522, 136.888522, 10824.522, 11634.522 are explicitly running the HCAL reconstruction on GPU - so it looks like they are behaving as expected.

How would you like them to behave when there are no GPUs ?

fwyzard · 2020-12-09T11:23:38Z

Here are some options:

1. keep the GPU-only workflows failing as they do now
2. implement a way to mark the GPU-only workflows as "expected failures"
3. implement a way to avoid running the GPU-only workflows
4. change the workflows to run on GPU if available, and fall back to CPU otherwise

fwyzard · 2020-12-09T11:25:23Z

I would guess that

2. requires changes to the testing and reporting infrastructures
3. could be done either in the test infrastructure, or in cmsDriver / runTheMatrix ?
4. is doable in CMSSW; the plan was to add an automatic GPU/CPU workflow, instead of replacing these ones, but we could also replace them instead
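
To make the difference between options 1 and 4 concrete, here is a minimal sketch of the two policies. The names `choose_backend` and `NoGpuError` are hypothetical and not part of CMSSW; in practice the switch would live in the CMSSW configuration rather than in a helper like this.

```python
class NoGpuError(RuntimeError):
    """Raised when a GPU-only workflow runs on a machine without a usable GPU."""


def choose_backend(gpu_available, allow_cpu_fallback):
    """Return "gpu" or "cpu" for a workflow step (hypothetical helper)."""
    if gpu_available:
        return "gpu"
    if allow_cpu_fallback:
        # Option 4: run on the CPU when no GPU is found.
        return "cpu"
    # Option 1: GPU-only workflows fail on machines without a GPU.
    raise NoGpuError("no usable GPU and CPU fallback is disabled")
```

With `allow_cpu_fallback=True`, a run on a machine without a GPU succeeds, which is convenient but means the GPU code was not exercised at all.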

makortel · 2020-12-09T14:43:55Z

I think 4. would in principle be a viable alternative, but I also think we should find a way to run matrix workflows on GPU resources. Once that is done, 4. risks not catching certain kinds of problems, because the "problematic workflow" could fall back to CPU and succeed.

Therefore I think we should look into 3., extending it to running GPU-only workflows on GPU resources (similarly to the "GPU Unit Tests" we have in IBs).

If solving 3. is not quick, it could be useful to do 2.

@smuzaffar, what do you think?

makortel · 2020-12-09T14:48:02Z

A possible straightforward way to implement 3. would be to define a separate -w set for GPU-only workflows. Then the GPU-only workflows would not be run by default, and it would be easy to run only them with runTheMatrix.py -w gpu.
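
A rough sketch of the idea, assuming the reader keeps a per-matrix flag saying whether that matrix belongs to the default set, and that `-w` takes a comma-separated list. The dictionary contents and the `select_matrices` helper are illustrative only and do not reproduce the actual runTheMatrix code.

```python
# Hypothetical per-matrix flags: True means "part of the default set".
DEFAULT_MATRICES = {
    "standard": True,
    "highstats": True,
    "upgrade": False,
    "gpu": False,  # GPU-only workflows are opt-in
}


def select_matrices(what=None):
    """Return the matrix names to process.

    `what` mimics a comma-separated -w argument; None means "use the defaults".
    """
    if what is None:
        return [name for name, in_default in DEFAULT_MATRICES.items() if in_default]
    requested = [name.strip() for name in what.split(",") if name.strip()]
    unknown = [name for name in requested if name not in DEFAULT_MATRICES]
    if unknown:
        raise ValueError("Unknown matrix: " + ", ".join(unknown))
    return requested
```

Under these assumptions, `select_matrices()` skips the GPU matrix, while `select_matrices("gpu")` returns only the GPU matrix.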

smuzaffar · 2020-12-09T15:06:56Z

+1 for -w gpu

silviodonato · 2020-12-10T08:19:31Z

assign pdmv
because this change will affect runTheMatrix.py

cmsbuild · 2020-12-10T08:19:52Z

New categories assigned: pdmv

@chayanit,@wajidalikhan,@jordan-martins you have been requested to review this Pull request/Issue and eventually sign? Thanks

srimanob · 2020-12-15T16:44:13Z

Hi,

Just for my education on how IBs work: why don't we see e.g. 10824.512 (ECAL-only GPU) failing in the same IB?
https://cms-sw.github.io/relvalLogDetail.html#slc7_amd64_gcc900;CMSSW_11_3_X_2020-12-08-2300
random resource?

fwyzard · 2020-12-15T17:38:57Z

Because until #31719 gets merged, there is no ECAL-only gpu workflow - 10824.512 is identical to 10824.511 and runs on cpu.

silviodonato · 2020-12-18T16:07:26Z

@fwyzard @makortel @smuzaffar I've made #32547; I hope this is what you asked for.

silviodonato · 2021-01-07T09:44:35Z

@makortel @smuzaffar with #32547 I created the -w gpu option in runTheMatrix.py. At the moment, the gpu matrix is still included in the standard matrix. We need two things:

IB tests: run the gpu workflows with GPU enabled
PR tests: preserve the possibility to run the gpu workflows

I think we need to update Jenkins to run specific matrices (e.g. gpu and update), and then remove the gpu matrix from the standard matrix.

smuzaffar · 2021-01-07T11:11:03Z

@silviodonato, I am working on improving the GPU PR tests. Currently, when we enable GPU tests, the bot runs two jobs:

Run standard PR tests (compilation of externals and cmssw, then unit tests, addon tests and relvals)
Run special PR tests (compilation of externals and cmssw, then gpu relvals)

cms-sw/cms-bot#1459 should allow running the GPU tests as an additional test within the standard PR test. This will avoid compiling externals and cmssw on the GPU machines. Once the cms-bot changes are merged, I can include -w gpu for the GPU relval tests.

As for the IB tests, I will add an extra GPU relval test which will run runTheMatrix with the -w gpu option.

smuzaffar · 2021-01-08T10:41:20Z

@silviodonato, for the IB tests, the easiest solution is to create a new IB queue, e.g. GPU_X, and run its tests on a GPU node. This will use the existing build and reporting system, and the results will be available via the usual IB pages. Last night I ran a test GPU_X IB and you can already see the results here:

https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/ib/CMSSW_11_3_X
unit test: https://cmssdt.cern.ch/SDT/cgi-bin/showBuildLogs.py/slc7_amd64_gcc900/www/thu/11.3.GPU-thu-23/CMSSW_11_3_GPU_X_2021-01-07-2300?utests
RelVals: https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_11_3/2021-01-07-2300?selectedArchs=slc7_amd64_gcc900&selectedFlavors=GPU_X&selectedStatus=failed&selectedStatus=known_failed&selectedStatus=passed

If this looks good, then I suggest keeping this special GPU IB alive.

silviodonato · 2021-01-08T11:09:45Z

Thanks @smuzaffar, it looks perfect to me.
About PR tests, I see from #31719 (comment) that the workflows used as GPU tests are defined here https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config#L2 .

Can I remove -w gpu from the standard matrix in order to remove the failing workflows from the IB tests? Does this break anything?

smuzaffar · 2021-01-08T13:33:02Z

It looks like not all the tests under https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config#L2 belong to -w gpu. E.g., if I set gpu to False here https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/python/MatrixReader.py#L82 then runTheMatrix fails to find the workflows we run for the GPU PR tests:

> runTheMatrix.py -n -l 10824.501,10824.502,10824.511,10824.512
processing relval_standard
processing relval_highstats
processing relval_pileup
processing relval_generator
processing relval_extendedgen
processing relval_production
processing relval_ged
ignoring relval_upgrade from default matrix
ignoring relval_gpu from default matrix
processing relval_2017
processing relval_2026
ignoring relval_identity from default matrix
processing relval_machine
processing relval_premix
Traceback (most recent call last):
  File "/tmp/muzaffar/CMSSW_11_3_X_2021-01-08-1100/bin/slc7_amd64_gcc900/runTheMatrix.py", line 370, in <module>
    ret = runSelected(opt)
  File "/tmp/muzaffar/CMSSW_11_3_X_2021-01-08-1100/bin/slc7_amd64_gcc900/runTheMatrix.py", line 30, in runSelected
    if len(undefSet)>0: raise ValueError('Undefined workflows: '+', '.join(map(str,list(undefSet))))
ValueError: Undefined workflows: 10824.512, 10824.502

> runTheMatrix.py -n -l 10824.501,10824.502,10824.511,10824.512 -w gpu
ignoring non-requested file relval_standard
ignoring non-requested file relval_highstats
ignoring non-requested file relval_pileup
ignoring non-requested file relval_generator
ignoring non-requested file relval_extendedgen
ignoring non-requested file relval_production
ignoring non-requested file relval_ged
ignoring non-requested file relval_upgrade
processing relval_gpu
ignoring non-requested file relval_2017
ignoring non-requested file relval_2026
ignoring non-requested file relval_identity
ignoring non-requested file relval_machine
ignoring non-requested file relval_premix
Traceback (most recent call last):
  File "/tmp/muzaffar/CMSSW_11_3_X_2021-01-08-1100/bin/slc7_amd64_gcc900/runTheMatrix.py", line 370, in <module>
    ret = runSelected(opt)
  File "/tmp/muzaffar/CMSSW_11_3_X_2021-01-08-1100/bin/slc7_amd64_gcc900/runTheMatrix.py", line 30, in runSelected
    if len(undefSet)>0: raise ValueError('Undefined workflows: '+', '.join(map(str,list(undefSet))))
ValueError: Undefined workflows: 10824.511, 10824.501

Before dropping gpu from standard, we need to understand which wf should go in the GPU PR tests
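
The two failures above follow from the same kind of check: every requested workflow has to be defined in one of the matrices that were actually read. A simplified sketch of that check follows. The matrix contents are made up from the workflow numbers in the logs, "default" stands in for all matrices of the default set, and `find_undefined` is a hypothetical helper rather than the real runTheMatrix code.

```python
# Illustrative contents only, built from the workflow numbers in the logs above.
MATRICES = {
    "default": {"10824.501", "10824.511"},
    "gpu": {"10824.502", "10824.512"},
}


def find_undefined(requested, selected_matrices):
    """Return the requested workflows not defined in any selected matrix, sorted."""
    defined = set()
    for name in selected_matrices:
        defined |= MATRICES[name]
    return sorted(set(requested) - defined)


requested = ["10824.501", "10824.502", "10824.511", "10824.512"]
print(find_undefined(requested, ["default"]))  # GPU matrix excluded
print(find_undefined(requested, ["gpu"]))      # only the GPU matrix
print(find_undefined(requested, ["default", "gpu"]))  # both: nothing undefined
```

In this sketch, a list that mixes CPU and GPU workflows only resolves when both sets of matrices are read, which is why the PR test list has to be settled before the gpu matrix is dropped from the default.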

smuzaffar · 2021-01-11T08:49:23Z

@fwyzard, currently the PR tests for GPU run the workflows listed in https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config#L2, which are a combination of -w gpu and -w standard (see #32428 (comment)). Can we update this list to run only a few of the -w gpu tests? Once we have this, we can disable gpu for the default runTheMatrix.

fwyzard · 2021-01-11T09:33:12Z

@smuzaffar , the GPU workflows are only those from the -w gpu option.
The .501 and .511 are meant to give the same results as the GPU workflows .502 and .512, while running on the CPU.

So I think we can keep the .5?1 workflows in the standard tests, and run only the .5?2 in the GPU tests.
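
The `.5?1` / `.5?2` shorthand can be read as a shell-style wildcard, where `?` stands for any single character. A small sketch using Python's `fnmatch` module shows the split. The workflow list is just an example built from numbers mentioned in this thread, and `split_workflows` is a hypothetical helper, not something the bot provides.

```python
from fnmatch import fnmatch


def split_workflows(workflows):
    """Split workflow numbers into CPU-reference (.5?1) and GPU (.5?2) lists."""
    cpu = [wf for wf in workflows if fnmatch(wf, "*.5?1")]
    gpu = [wf for wf in workflows if fnmatch(wf, "*.5?2")]
    return cpu, gpu


workflows = ["10824.501", "10824.502", "10824.511", "10824.512", "10824.522"]
cpu, gpu = split_workflows(workflows)
print(cpu)  # CPU-equivalent workflows, for the standard tests
print(gpu)  # GPU workflows, for the GPU tests
```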

smuzaffar · 2021-01-11T10:46:51Z

@fwyzard, cms-sw/cms-bot#1463 should run .5?2 for GPU tests now. About the .5?1 CPU tests, what should we do? Should we always run them as part of the normal PR relval tests (this might increase the PR test time a bit)? Or do we want to run an additional relval/cpu if gpu tests are enabled?

fwyzard · 2021-01-11T11:25:05Z

Eventually we should have a single GPU workflow, and a single CPU-equivalent workflow, and I think we should run both of them for all PRs that potentially affect those workflows (i.e. probably all PRs that affect ECAL, HCAL, Pixel, Tracking, PF - so a good fraction of those that require a RECO signature).

O&C has decided that changes to the GPU reconstruction are signed only by @cms-sw/reconstruction-l2, not by @cms-sw/heterogeneous-l2, so I guess it's really up to them to speak up on how they prefer the tests to be organised.

slava77 · 2021-01-11T13:47:20Z

> O&C has decided that changes to the GPU reconstruction are signed only by @cms-sw/reconstruction-l2, not by @cms-sw/heterogeneous-l2, so I guess it's really up to them to speak up on how they prefer the tests to be organised.

IIRC the workflow definitions are not signed by reco either, it's in PDMV hands.

For the short matrix it's probably more practical to add a data workflow which does not also rerun the HLT, since none of the GPU-related reco relies on it; this is mainly for time/cost optimization of the workflow.

silviodonato · 2021-01-19T11:44:09Z

Issue solved by #32650.
The GPU tests currently run 10824.502 and 10824.512 (see https://github.com/cms-sw/cms-bot/blob/master/cmssw-pr-test-config#L2).

fwyzard · 2021-01-19T13:54:52Z

Now that #31719 has been merged, we should add the .522 workflow to the GPU tests.

I'm kind of lost about the technical details: do we add it to the matrix, or to the bot configuration ?

fwyzard · 2025-03-26T00:17:33Z

+heterogeneous

I believe this was fixed a long time ago for the CUDA workflows.

cmsbuild · 2025-03-26T00:17:57Z

cms-bot internal usage

makortel · 2025-03-26T13:22:58Z

+core

> I believe this was fixed a long time ago for the CUDA workflows.

I agree. I think the root cause has been addressed in several ways by now.

cmsbuild added the pending-assignment label Dec 9, 2020

cmsbuild added core-pending heterogeneous-pending pending-signatures and removed pending-assignment labels Dec 9, 2020

cmsbuild added the pdmv-pending label Dec 10, 2020

silviodonato mentioned this issue Dec 18, 2020

Move GPU workflows to a specific GPU matrix #32547

Merged

cmsbuild added heterogeneous-approved and removed heterogeneous-pending labels Mar 26, 2025

cmsbuild added core-approved and removed core-pending labels Mar 26, 2025
