Add GPU backend option for TensorFlow session #40551

Merged 7 commits on Feb 13, 2023

Conversation

valsdav
Contributor

@valsdav valsdav commented Jan 17, 2023

PR description:

This PR introduces a GPU backend option for TensorFlow sessions. The interface is similar to the one used for ONNX GPU support (#36963).

User code can activate the GPU session by requesting tensorflow::Backend::cuda when creating the session.
This is needed to compile TF with GPU support in the cmssw-dist (see cms-sw/cmsdist#7648).

By default the CPU backend is used for all the standard workflows.
Moreover, TensorFlow has been set up to allocate only the GPU memory it needs instead of occupying all the CUDA memory of a device (allow_growth=true option).
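
The backend choice described above can be illustrated with a minimal Python sketch. It is not the CMSSW C++ interface: `Backend`, `session_settings` and `to_config_proto` are hypothetical names, and the exact device counts are an assumption. Only `tf.compat.v1.ConfigProto`, its `device_count` argument and `gpu_options.allow_growth` are real TensorFlow settings.

```python
from enum import Enum


class Backend(Enum):
    """Hypothetical Python stand-in for the tensorflow::Backend choice."""
    cpu = "cpu"
    cuda = "cuda"


def session_settings(backend=Backend.cpu):
    """Return the session settings for a backend as a plain dict.

    Assumption: the CPU backend hides every GPU, while the CUDA backend
    exposes one GPU and lets its memory grow on demand (allow_growth)
    instead of reserving all of it up front.
    """
    if backend is Backend.cpu:
        return {"device_count": {"GPU": 0}, "allow_growth": False}
    if backend is Backend.cuda:
        return {"device_count": {"GPU": 1}, "allow_growth": True}
    raise ValueError(f"unsupported backend: {backend!r}")


def to_config_proto(settings):
    """Translate the dict into a TF1-style ConfigProto (needs TensorFlow)."""
    import tensorflow as tf  # imported lazily so the sketch loads without TF

    config = tf.compat.v1.ConfigProto(device_count=settings["device_count"])
    config.gpu_options.allow_growth = settings["allow_growth"]
    return config


print(session_settings())              # default: CPU backend
print(session_settings(Backend.cuda))  # explicit GPU request
```

Keeping the CPU backend as the default argument mirrors the statement that standard workflows are unaffected unless user code asks for the GPU.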

Tests have been modified to run both a CPU and a GPU version, if a device is available.
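
A test layout of this kind, where the CPU variant always runs and the GPU variant runs only when a device is present, could look roughly like the sketch below. The device check is a hypothetical heuristic (looking for an `nvidia-smi` binary), and `run_inference` is a placeholder for the real test body. Neither is taken from the PR.

```python
import shutil
import unittest


def gpu_available():
    """Hypothetical heuristic: an nvidia-smi binary on PATH means a device exists."""
    return shutil.which("nvidia-smi") is not None


def run_inference(backend):
    """Placeholder for the real session-based test body."""
    return f"ran on {backend}"


class BackendTests(unittest.TestCase):
    def test_cpu(self):
        # The CPU variant runs unconditionally.
        self.assertEqual(run_inference("cpu"), "ran on cpu")

    @unittest.skipUnless(gpu_available(), "no GPU device detected")
    def test_cuda(self):
        # The GPU variant is skipped, not failed, on machines without a device.
        self.assertEqual(run_inference("cuda"), "ran on cuda")


def run_suite():
    """Run both variants and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(BackendTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_suite()
print("success:", result.wasSuccessful(), "skipped:", len(result.skipped))
```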

PR validation

We are working to provide a test with the GPU backend active in a reconstruction sequence.

@yongbinfeng @riga @tvami @smuzaffar

@tvami
Contributor

tvami commented Jan 17, 2023

test parameters:

  • enable_test = gpu,threading,profiling

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-40551/33781

  • This PR adds an extra 12KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @valsdav (Davide Valsecchi) for master.

It involves the following packages:

  • PhysicsTools/TensorFlow (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@makortel, @riga this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@tvami
Contributor

tvami commented Jan 17, 2023

@cmsbuild , please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39a9f4/30044/summary.html
COMMIT: 2fa5675
CMSSW: CMSSW_13_0_X_2023-01-17-1100/el8_amd64_gcc11
Additional Tests: GPU,THREADING,PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/40551/30044/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

There are some workflows for which there are errors in the baseline:
11634.15 step 3
The results for the comparisons for these workflows could be incomplete
This most likely means that the IB is having errors in the relvals. The error does NOT come from this pull request.

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555538
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555513
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 211 log files, 162 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19862
  • DQMHistoTests: Total failures: 32
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19830
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor

@valsdav @tvami, there are a few unit tests that failed on ppc64le (where we have GPUs available). See all the testTF logs here. Can you please make the same change in https://github.com/cms-sw/cmssw/tree/39394a2c91eba54f396c5e454c87403b89797d52/PhysicsTools/TensorFlow/test

@smuzaffar
Contributor

Also, for the testTheano failure, maybe we can set CUDA_VISIBLE_DEVICES=0 in https://github.com/cms-sw/cmssw/blob/39394a2c91eba54f396c5e454c87403b89797d52/PhysicsTools/PythonAnalysis/test/testTheano.sh

@valsdav
Contributor Author

valsdav commented Jan 18, 2023

> @valsdav @tvami, there are a few unit tests that failed on ppc64le (where we have GPUs available). See all the testTF logs here. Can you please make the same change in https://github.com/cms-sw/cmssw/tree/39394a2c91eba54f396c5e454c87403b89797d52/PhysicsTools/TensorFlow/test

Hi @smuzaffar, I checked and the tests are already using the same options. The problem seems to be a CUDA driver version mismatch:

Running .2023-01-18 00:46:58.906070: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
2023-01-18 00:46:58.906200: E tensorflow/c/c_api.cc:2193] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
  File "/scratch/cmsbuild/jenkins_b/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-01-16-2300/src/PhysicsTools/TensorFlow/test/createconstantgraph.py", line 37, in <module>
    sess = tf.Session()
  File "/scratch/cmsbuild/jenkins_b/workspace/ib-run-pr-tests/testBuildDir/el8_ppc64le_gcc11/external/py3-tensorflow/2.6.4-b4a2a02538720942cedb33347acf877b/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1601, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/scratch/cmsbuild/jenkins_b/workspace/ib-run-pr-tests/testBuildDir/el8_ppc64le_gcc11/external/py3-tensorflow/2.6.4-b4a2a02538720942cedb33347acf877b/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 711, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

@smuzaffar
Contributor

smuzaffar commented Jan 18, 2023

@fwyzard, on our PowerPC nodes, the scram hook that checks the compatibility between the CUDA driver and runtime [a] reports that they are compatible, so it adds $CUDA_BASE/drivers, but TensorFlow disagrees. No idea why TF is still checking for a CUDA device when it is explicitly set not to use the GPU. Any idea how to avoid it?

[a]

LD_LIBRARY_PATH=$CUDA_BASE/drivers:$LD_LIBRARY_PATH /cvmfs/cms-ci.cern.ch/week0/PR_d4f6df80/el8_ppc64le_gcc11/external/cuda-compatible-runtime/1.0-c494f8374f9c5297a98f1d2cb2c28cdf/test/cuda-compatible-runtime -k
11.5
Singularity> echo $?
0

@valsdav
Contributor Author

valsdav commented Jan 18, 2023

> @valsdav @tvami, there are a few unit tests that failed on ppc64le (where we have GPUs available). See all the testTF logs here. Can you please make the same change in https://github.com/cms-sw/cmssw/tree/39394a2c91eba54f396c5e454c87403b89797d52/PhysicsTools/TensorFlow/test

In a couple of tests the Session was created directly; I have added the hotfix there as well.
For the Theano test I have added export CUDA_VISIBLE_DEVICES="" to disable GPU use.
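
As an illustration of the `CUDA_VISIBLE_DEVICES=""` approach, here is a small Python sketch that launches a child process with every CUDA device hidden. The helper names are hypothetical, and the child command is only a stand-in for the real test script. Note that an empty value hides all devices, whereas a value of `0` would leave the first device visible.

```python
import os
import subprocess
import sys


def env_without_gpus(base=None):
    """Copy an environment and hide every CUDA device from it."""
    env = dict(os.environ if base is None else base)
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty list: no device is visible
    return env


def run_cpu_only(args):
    """Run a command in a child process that cannot see any GPU."""
    return subprocess.run(
        args, env=env_without_gpus(), capture_output=True, text=True, check=True
    )


# Stand-in for the real test script: the child only reports the variable it sees.
child = [
    sys.executable,
    "-c",
    "import os; print(repr(os.environ['CUDA_VISIBLE_DEVICES']))",
]
print(run_cpu_only(child).stdout.strip())
```

Setting the variable in the environment of the launched process, rather than inside the test itself, ensures it is in place before any CUDA library is initialized.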

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-40551/33788

  • This PR adds an extra 16KB to repository

@cmsbuild
Contributor

cmsbuild commented Feb 7, 2023

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-40551/34102

  • This PR adds an extra 16KB to repository

@cmsbuild
Contributor

cmsbuild commented Feb 7, 2023

Pull request #40551 was updated. @cmsbuild, @mandrenguyen, @clacaputo can you please check and sign again.

@mandrenguyen
Contributor

please test

@cmsbuild
Contributor

cmsbuild commented Feb 8, 2023

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39a9f4/30487/summary.html
COMMIT: c85d7d8
CMSSW: CMSSW_13_0_X_2023-02-07-2300/el8_amd64_gcc11
Additional Tests: GPU,THREADING,PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/40551/30487/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39a9f4/30487/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-39a9f4/30487/git-merge-result

Comparison Summary

Summary:

  • You potentially added 140 lines to the logs
  • Reco comparison results: 3 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555852
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555827
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19862
  • DQMHistoTests: Total failures: 258
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19604
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: found differences in 3 / 3 workflows

@valsdav
Contributor Author

valsdav commented Feb 8, 2023

We still see problems in the tests run on el8_ppc64le_gcc11 where GPUs are present.

I think the problem is that some module is still not properly set up through the new setBackend interface. I'm investigating...

@valsdav
Contributor Author

valsdav commented Feb 9, 2023

Tests on el8_ppc64le_gcc11 were failing because they did not include the changes from this PR. Tests have been restarted in cms-sw/cmsdist#7648 and run fine.

@smuzaffar smuzaffar modified the milestones: CMSSW_13_0_X, CMSSW_13_1_X Feb 11, 2023
@valsdav
Contributor Author

valsdav commented Feb 13, 2023

I am planning to introduce some improvements in the organization of the TF options we are exposing to the users, but I think that can be done in a separate PR.

The changes in this PR allow all the tests to pass (with backend::cpu by default). We will provide a runTheMatrix test with a GPU-activated workflow.

Do you think we can merge this one? Thanks!

@mandrenguyen
Contributor

+1
resign

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@rappoccio
Contributor

+1

@cmsbuild cmsbuild merged commit 39853ed into cms-sw:master Feb 13, 2023
@valsdav
Contributor Author

valsdav commented Feb 13, 2023

@rappoccio just a kind reminder that this PR should be merged along with cms-sw/cmsdist#7648
Thanks.

10 participants