Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of TF CUDA tests for 14_1_X #44376

Merged
merged 2 commits into from
Mar 18, 2024

Conversation

valsdav
Copy link
Contributor

@valsdav valsdav commented Mar 12, 2024

PR description:

This PR improves the handling of CUDA unit tests for the TensorFlow package, using a new tf_cuda_support tool from scram , which checks if the GPU support is enabled in TensorFlow compilation.

The PR also makes the TF cuda tests more strict by checking explicitely if a CUDA device is visible to TF and not only to cmssw.

The test testTFVisibleDevicesCUDA is in fact run by the framework as a CUDA device is registered, but then TF does not recognize the device and the test fails. The other testTF*CUDA tests were passing silently using the CPU to run the test. After this PR all the TF sessions using tf::backend::cuda , but not finding a GPU will fail explicitly.

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 12, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39431

  • This PR adds an extra 12KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @valsdav for master.

It involves the following packages:

  • PhysicsTools/TensorFlow (ml)

@cmsbuild, @valsdav, @wpmccormack can you please review it and eventually sign? Thanks.
@makortel, @riga this is something you requested to watch as well.
@rappoccio, @antoniovilela, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@smuzaffar
Copy link
Contributor

@smuzaffar
Copy link
Contributor

@valsdav , I was thinking to add tf_cuda_support toolfile so that we can have something like

<iftool name="tf_cuda_support">
  all tests which requires TF to be build with GPU support
</iftool>

@valsdav
Copy link
Contributor Author

valsdav commented Mar 12, 2024

@valsdav , any idea why only testTFVisibleDevicesCUDA fails for GPU IBs (https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_14_1_GPU_X_2024-03-11-2300/unitTestLogs/PhysicsTools/TensorFlow#/89-89 ) ?

Hi @smuzaffar I realized this morning testing locally that only that test is failing because it is the only one explicitly checking the list of devices visible to TF.

The other tests that are run when a CUDA device is visible to the cmssw framework are silently fall-backing to CPU running in TF. I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.

@valsdav
Copy link
Contributor Author

valsdav commented Mar 12, 2024

all tests which requires TF to be build with GPU support

That's a good idea! Thanks!

Comment on lines 8 to 24
<!-- <iftool name="cuda"> -->
<!-- <bin name="testTFHelloWorldCUDA" file="testRunner.cpp,testHelloWorldCUDA.cc"> -->
<!-- <use name="boost_filesystem"/> -->
<!-- <use name="catch2"/> -->
<!-- <use name="cppunit"/> -->
<!-- <use name="cuda"/> -->
<!-- <use name="tensorflow-cc"/> -->
<!-- <use name="FWCore/ParameterSet"/> -->
<!-- <use name="FWCore/ParameterSetReader"/> -->
<!-- <use name="FWCore/PluginManager"/> -->
<!-- <use name="FWCore/ServiceRegistry"/> -->
<!-- <use name="FWCore/Utilities"/> -->
<!-- <use name="HeterogeneousCore/CUDAServices"/> -->
<!-- <use name="HeterogeneousCore/CUDAUtilities"/> -->
<!-- <use name="PhysicsTools/TensorFlow"/> -->
<!-- </bin> -->
<!-- </iftool> -->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you want to comment out a whole test, you can just do

Suggested change
<!-- <iftool name="cuda"> -->
<!-- <bin name="testTFHelloWorldCUDA" file="testRunner.cpp,testHelloWorldCUDA.cc"> -->
<!-- <use name="boost_filesystem"/> -->
<!-- <use name="catch2"/> -->
<!-- <use name="cppunit"/> -->
<!-- <use name="cuda"/> -->
<!-- <use name="tensorflow-cc"/> -->
<!-- <use name="FWCore/ParameterSet"/> -->
<!-- <use name="FWCore/ParameterSetReader"/> -->
<!-- <use name="FWCore/PluginManager"/> -->
<!-- <use name="FWCore/ServiceRegistry"/> -->
<!-- <use name="FWCore/Utilities"/> -->
<!-- <use name="HeterogeneousCore/CUDAServices"/> -->
<!-- <use name="HeterogeneousCore/CUDAUtilities"/> -->
<!-- <use name="PhysicsTools/TensorFlow"/> -->
<!-- </bin> -->
<!-- </iftool> -->
<!--
<iftool name="cuda">
<bin name="testTFHelloWorldCUDA" file="testRunner.cpp,testHelloWorldCUDA.cc">
<use name="boost_filesystem"/>
<use name="catch2"/>
<use name="cppunit"/>
<use name="cuda"/>
<use name="tensorflow-cc"/>
<use name="FWCore/ParameterSet"/>
<use name="FWCore/ParameterSetReader"/>
<use name="FWCore/PluginManager"/>
<use name="FWCore/ServiceRegistry"/>
<use name="FWCore/Utilities"/>
<use name="HeterogeneousCore/CUDAServices"/>
<use name="HeterogeneousCore/CUDAUtilities"/>
<use name="PhysicsTools/TensorFlow"/>
</bin>
</iftool>
-->

@fwyzard
Copy link
Contributor

fwyzard commented Mar 12, 2024

The other testTF*CUDA tests are instead silently using the CPU to run the test.

I think this part should be fixed: if we want the tests to run with CUDA, they should not fall back to CPU.

@fwyzard
Copy link
Contributor

fwyzard commented Mar 12, 2024

I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.

In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs ?

@antoniovilela
Copy link
Contributor

hold

  • As discussed at ORP.

@cmsbuild
Copy link
Contributor

Pull request has been put on hold by @antoniovilela
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@cmsbuild cmsbuild added the hold label Mar 12, 2024
@smuzaffar
Copy link
Contributor

I will add tf_cuda_support tool and will redo this PR

@valsdav
Copy link
Contributor Author

valsdav commented Mar 12, 2024

I can add the checks of the list of devices also in those tests to make them check CUDA usage explicitly.

In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs ?

I have a solution for this, @smuzaffar do you want me to push it here so that it can be included in the new PR? Thanks

@smuzaffar
Copy link
Contributor

@valsdav , I have opened cms-sw/cmsdist#9066 which adds the suggested new tool tf_cuda_support . Feel free to update this PR with

  • Changes needed for In addition to a check that there is a CUDA device available, can we actually force TF to run on the GPUs
  • Move TF GPU unit tests in <iftool name="tf_cuda_support"> </iftool> block

@valsdav valsdav changed the title Disabling TF CUDA tests for 14_1_X Improve handling of TF CUDA tests for 14_1_X Mar 12, 2024
@valsdav valsdav force-pushed the tensorflow-disable-gpu-tests_14_1_X branch from b59bfdd to 52f14fb Compare March 13, 2024 01:47
@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38108/summary.html
COMMIT: b9b41e4
CMSSW: CMSSW_14_1_X_2024-03-13-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44376/38108/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 1 lines to the logs
  • Reco comparison results: 50 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3297383
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3297357
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

@smuzaffar
Copy link
Contributor

@cms-sw/ml-l2 , can you please review this?
@cms-sw/orp-l2 , this now has all the required changes. you can unhold it now

@valsdav
Copy link
Contributor Author

valsdav commented Mar 18, 2024

+1

@smuzaffar
Copy link
Contributor

@valsdav , can you please backport it for 14.0.X ?

@antoniovilela
Copy link
Contributor

unhold

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@antoniovilela
Copy link
Contributor

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants