
CUDAService verbosity #35117

Merged: 6 commits into cms-sw:master on Sep 20, 2021

Conversation

@fwyzard (Contributor) commented Sep 2, 2021

PR description:

  • Add the NVIDIA driver, CUDA driver, and CUDA runtime library versions to the CUDAService message.
  • Make the CUDAService less verbose by default, with an option to display the full messages, and enable it in the MessageLogger by default.
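The new verbosity option might be enabled in a job configuration along these lines. This is a sketch only: the parameter name `verbose` and the `_cfi` file path are assumptions, not confirmed by this thread.

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("Demo")

# Load the CUDAService (hypothetical cfi path, shown for illustration)
process.load("HeterogeneousCore.CUDAServices.CUDAService_cfi")

# Hypothetical switch from the compact summary to the full per-device report
process.CUDAService.verbose = cms.untracked.bool(True)
```

With the option left at its default, only the compact three-line summary shown below is printed.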

PR validation:

The default, compact message on a machine with two GPUs now looks like

CUDA runtime version 11.2, driver version 11.4, NVIDIA driver version 470.57.02
CUDA device 0: Tesla T4 (sm_75)
CUDA device 1: Tesla T4 (sm_75)
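The `11.2`-style version strings in this message come from the integers returned by `cudaRuntimeGetVersion()` and `cudaDriverGetVersion()`, which CUDA encodes as `1000*major + 10*minor`. A minimal sketch of the decoding (the helper name is illustrative, not from the PR):

```python
def decode_cuda_version(version: int) -> str:
    """Decode a CUDA version integer (1000*major + 10*minor) into 'major.minor'."""
    major = version // 1000
    minor = (version % 1000) // 10
    return f"{major}.{minor}"

# cudaRuntimeGetVersion() reports 11020 for CUDA 11.2
print(decode_cuda_version(11020))  # → 11.2
print(decode_cuda_version(11040))  # → 11.4
```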

The full, verbose message on the same machine now looks like

NVIDIA driver:    470.57.02
CUDA driver API:  11.4 (compiled with 11.2)
CUDA runtime API: 11.2 (compiled with 11.2)
CUDA runtime successfully initialised, found 2 compute devices.

CUDA device 0: Tesla T4
  compute capability:          7.5 (sm_75)
  streaming multiprocessors:            40
  CUDA cores:                         2560
  single to double performance:       32:1
  compute mode:           default (shared)
  memory:  15009 MB free /  15109 MB total
  constant memory:                   64 kB
  L2 cache size:                   4096 kB
  L1 cache mode:   local and global memory

Other capabilities
  can map host memory into the CUDA address space for use with cudaHostAlloc()/cudaHostGetDevicePointer()
  does not support coherently accessing pageable memory without calling cudaHostRegister() on it
  cannot access pageable memory via the host's page tables
  can access host registered memory at the same virtual address as the host
  shares a unified address space with the host
  supports allocating managed memory on this system
  can coherently access managed memory concurrently with the host
  the host cannot directly access managed memory on the device without migration
  supports launching cooperative kernels via cudaLaunchCooperativeKernel()
  supports launching cooperative kernels via cudaLaunchCooperativeKernelMultiDevice()

CUDA flags
  thread policy:                   default
  pinned host memory allocations:  enabled
  kernel host memory reuse:       disabled

CUDA limits
  printf buffer size:                 1 MB
  stack size:                         1 kB
  malloc heap size:                   8 MB
  runtime sync depth:                    2
  runtime pending launch count:       2048

CUDA device 1: Tesla T4
  compute capability:          7.5 (sm_75)
  streaming multiprocessors:            40
  CUDA cores:                         2560
  single to double performance:       32:1
  compute mode:           default (shared)
  memory:  15009 MB free /  15109 MB total
  constant memory:                   64 kB
  L2 cache size:                   4096 kB
  L1 cache mode:   local and global memory

Other capabilities
  can map host memory into the CUDA address space for use with cudaHostAlloc()/cudaHostGetDevicePointer()
  does not support coherently accessing pageable memory without calling cudaHostRegister() on it
  cannot access pageable memory via the host's page tables
  can access host registered memory at the same virtual address as the host
  shares a unified address space with the host
  supports allocating managed memory on this system
  can coherently access managed memory concurrently with the host
  the host cannot directly access managed memory on the device without migration
  supports launching cooperative kernels via cudaLaunchCooperativeKernel()
  supports launching cooperative kernels via cudaLaunchCooperativeKernelMultiDevice()

CUDA flags
  thread policy:                   default
  pinned host memory allocations:  enabled
  kernel host memory reuse:       disabled

CUDA limits
  printf buffer size:                 1 MB
  stack size:                         1 kB
  malloc heap size:                   8 MB
  runtime sync depth:                    2
  runtime pending launch count:       2048

CUDAService fully initialized
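As a consistency check on the report above, the "CUDA cores" line is the product of the streaming-multiprocessor count and the FP32 cores per SM for the device's compute capability (64 for sm_75/Turing, so 40 SMs gives 2560). A sketch of that arithmetic; the table values come from NVIDIA's architecture documentation, and the function name is illustrative:

```python
# FP32 CUDA cores per streaming multiprocessor, by compute capability
CORES_PER_SM = {
    (7, 0): 64,   # Volta  (e.g. Tesla V100)
    (7, 5): 64,   # Turing (e.g. Tesla T4)
    (8, 0): 64,   # Ampere GA100
    (8, 6): 128,  # Ampere GA10x
}

def cuda_cores(major: int, minor: int, sm_count: int) -> int:
    """Total FP32 CUDA cores = SM count times cores-per-SM for the capability."""
    return sm_count * CORES_PER_SM[(major, minor)]

print(cuda_cores(7, 5, 40))  # Tesla T4: 40 SMs × 64 → 2560
```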

@cmsbuild (Contributor) commented Sep 2, 2021

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35117/25012

  • This PR adds an extra 28KB to repository

Code checks found code style and quality issues that could be resolved by applying the following patch(es)

@fwyzard (Contributor, Author) commented Sep 2, 2021

please test

@fwyzard (Contributor, Author) commented Sep 2, 2021

please test

@cmsbuild (Contributor) commented Sep 2, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35117/25018

  • This PR adds an extra 28KB to repository

@cmsbuild (Contributor) commented Sep 2, 2021

A new Pull Request was created by @fwyzard (Andrea Bocci) for master.

It involves the following packages:

  • Configuration/StandardSequences (operations)
  • HLTrigger/Configuration (hlt)
  • HeterogeneousCore/CUDAServices (heterogeneous)

@perrotta, @makortel, @Martin-Grunewald, @fwyzard, @qliphy, @fabiocos, @davidlange6 can you please review it and eventually sign? Thanks.
@fabiocos, @makortel, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @Martin-Grunewald, @lecriste, @mtosi, @ebrondol, @mmusich, @dgulhan, @slomeo this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@fwyzard (Contributor, Author) commented Sep 17, 2021

Finally everything looks good:

%MSG-i CUDAService:  (NoModuleName) 17-Sep-2021 07:48:11 UTC pre-events
CUDA runtime version 11.4, driver version 11.2, NVIDIA driver version 460.27.04
CUDA device 0: Tesla V100S-PCIE-32GB (sm_70)
%MSG

@fwyzard (Contributor, Author) commented Sep 17, 2021

+heterogeneous

@Martin-Grunewald (Contributor)

+1

@perrotta (Contributor)

@fwyzard @makortel do I understand correctly that this PR depends on #35298, which would then have to be merged first?
Otherwise, we can test it without #35298 and merge this one first, provided the tests report no issues.

@fwyzard (Contributor, Author) commented Sep 19, 2021 via email

@fwyzard (Contributor, Author) commented Sep 19, 2021 via email

@fwyzard (Contributor, Author) commented Sep 19, 2021

Once #35298 is merged, we can remove

process.load("FWCore.MessageService.MessageLogger_cfi")

from Configuration/StandardSequences/python/Services_cff.py

@cmsbuild (Contributor)

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39073b/18741/summary.html
COMMIT: 1a5a345
CMSSW: CMSSW_12_1_X_2021-09-19-0000/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/35117/18741/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19735
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19729
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 7 differences found in the comparisons
  • DQMHistoTests: Total files compared: 40
  • DQMHistoTests: Total histograms compared: 3211080
  • DQMHistoTests: Total failures: 11
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3211046
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 39 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 169 log files, 37 edm output root files, 40 DQM output files
  • TriggerResults: no differences found

@perrotta (Contributor)

+1

@cmsbuild (Contributor)

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will be automatically merged.

@cmsbuild cmsbuild merged commit 8e41bcd into cms-sw:master Sep 20, 2021
@fwyzard fwyzard deleted the CUDAService_verbosity branch July 31, 2022 13:47