Update UCX to version 1.12.1 #7809

fwyzard
Contributor

@fwyzard fwyzard commented Apr 20, 2022

Add the xpmem library from the HEAD of the master branch as of 2022.03.08, corresponding to the commit 61c39efdea943ac863037d7e35b236145904e64d.
Based on v2.6.3 with updates for Linux kernel up to 5.17.

Enable additional libraries in UCX:

  • enable the use of xpmem for intra-node communication;
  • enable the use of ROCm for AMD GPUs;
  • remove the ROCm GDR module, which is not compatible with GDRCopy v2.x.

Update UCX to version 1.12.1.

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_12_4_X/master.

@cmsbuild, @smuzaffar, @aandvalenzuela, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @qliphy you are the release manager for this.
cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Apr 20, 2022

@smuzaffar, do I need to add xpmem to cmssw-tool-conf.spec?

@fwyzard
Contributor Author

fwyzard commented Apr 20, 2022

please test

@cmsbuild
Contributor

Pull request #7809 was updated.

@fwyzard
Contributor Author

fwyzard commented Apr 20, 2022

This PR includes #7795 to ease testing. It can be rebased once that is merged.

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24064/summary.html
COMMIT: bc787d8
CMSSW: CMSSW_12_4_X_2022-04-20-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7809/24064/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

Requested to quit.
* The action "build-external+ucx+1.12.1-871f2c8f3832a729236a3a4b83fb7b49" was not completed successfully because Failed to build ucx. Log file in /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/slc7_amd64_gcc10/external/ucx/1.12.1-871f2c8f3832a729236a3a4b83fb7b49/log. Final lines of the log file:
67 |         ret = gdr_copy_from_bar(buffer, (void *)remote_addr, length);
|               ^~~~~~~~~~~~~~~~~
|               gdr_copy_from_mapping
rocm_gdr_ep.c:67:15: error: nested extern declaration of 'gdr_copy_from_bar' [-Werror=nested-externs]
cc1: all warnings being treated as errors
make[4]: *** [libuct_rocm_gdr_la-rocm_gdr_ep.lo] Error 1
make[4]: *** Waiting for unfinished jobs....
make[4]: Leaving directory `/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/slc7_amd64_gcc10/external/ucx/1.12.1-871f2c8f3832a729236a3a4b83fb7b49/ucx-1.12.1/src/uct/rocm/gdr'
make[3]: *** [all-recursive] Error 1


@smuzaffar
Contributor

> @smuzaffar, do I need to add xpmem to cmssw-tool-conf.spec?

No, there is no need to explicitly add it in cmssw-tool-conf.

@cmsbuild
Contributor

Pull request #7809 was updated.

@fwyzard force-pushed the IB/CMSSW_12_4_X/master_UCX_updates branch from fb64181 to ef20ef1 on April 20, 2022 22:47

@fwyzard
Contributor Author

fwyzard commented Apr 20, 2022

please test

@cmsbuild
Contributor

Pull request #7809 was updated.

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24079/summary.html
COMMIT: ef20ef1
CMSSW: CMSSW_12_4_X_2022-04-20-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7809/24079/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found a compilation error when building:

+ sed '-e/SUBDIRS/s/ *\//' -i src/uct/rocm/Makefile.am
+ sed '-e/src\/uct\/rocm\/gdr\/configure\.m4/d' -i src/uct/rocm/configure.m4
+ rm -rf src/uct/rocm/gdr
+ ./autogen.sh
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.Saz1OQ: line 48: ./autogen.sh: No such file or directory
error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.Saz1OQ (%prep)


RPM build errors:
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.Saz1OQ (%prep)
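
The log above stops because `./autogen.sh` is not present in the directory where `%prep` runs. One defensive way to write such a step is to check for the script before calling it. The following is a minimal sketch, not the change made in this PR: the helper name `choose_regen` is hypothetical, and falling back to `autoreconf -fi` is an assumption about what would be appropriate for a tree that carries no `autogen.sh`.

```shell
# Decide how to regenerate the build system inside a source tree.
# choose_regen is a hypothetical helper; it only prints the command
# that would be run, it does not execute anything.
choose_regen() {
    srcdir="$1"
    if [ -x "$srcdir/autogen.sh" ]; then
        # The tree ships its own bootstrap script: use it.
        printf '%s\n' "./autogen.sh"
    else
        # Assumed fallback when no autogen.sh is available.
        printf '%s\n' "autoreconf -fi"
    fi
}

# Show which command would be chosen for the current directory.
choose_regen .
```

Printing the command instead of running it keeps the sketch free of side effects; a real build step would execute the chosen command inside the source directory.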



@cmsbuild
Contributor

Pull request #7809 was updated.

Requires: rdma-core
Requires: rocm
Contributor

@fwyzard, this will package rocm for all archs. I think we should include and configure rocm only for x86_64 archs.

Contributor Author

You are right, of course.
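
To illustrate the point raised in this review thread, here is a minimal shell sketch of selecting the ROCm-related configure option by machine architecture. It is not the code from this PR: the helper `ucx_gpu_flags` is hypothetical, `/opt/rocm` is only an assumed default install location, and `--with-rocm` / `--without-rocm` are assumed to be the relevant UCX configure switches.

```shell
# Print the ROCm-related configure option for a given machine architecture.
# ucx_gpu_flags is a hypothetical helper, not part of cmsdist or UCX.
ucx_gpu_flags() {
    arch="$1"
    rocm_root="${2:-/opt/rocm}"   # assumed default location
    case "$arch" in
        x86_64) printf '%s\n' "--with-rocm=$rocm_root" ;;
        *)      printf '%s\n' "--without-rocm" ;;
    esac
}

# Example: options for the machine this runs on.
ucx_gpu_flags "$(uname -m)"
```

The same idea applies to the package dependency itself: the `Requires: rocm` line would only be emitted for the x86_64 case.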

Add the xpmem library from the HEAD of the master branch as of 2022.03.08,
corresponding to the commit 61c39efdea943ac863037d7e35b236145904e64d.

Based on xpmem v2.6.3 with updates for Linux kernel up to 5.17.
Enable additional libraries in UCX:
  - enable the use of xpmem for intra-node communication;
  - enable the use of ROCm for AMD GPUs (only for x86_64);
  - remove the ROCm GDR module, which is not compatible with GDRCopy v2.x.

Update UCX to version 1.12.1:
  - change the default for UCX_MEM_CUDA_HOOK_MODE from "reloc" to "bistro";
  - various bug fixes for CUDA and ROCm;
  - see https://github.com/openucx/ucx/releases/tag/v1.12.1 for the full
    change log.
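
Because this update changes the default of `UCX_MEM_CUDA_HOOK_MODE`, a job that relies on the previous behaviour could pin the value explicitly. A minimal sketch, assuming the variable is read from the process environment; the helper `effective_hook_mode` is hypothetical and only shows which value a process would see.

```shell
# Report the hook mode a process would see: the exported value if set,
# otherwise the default that the commit message gives for UCX 1.12.1.
effective_hook_mode() {
    printf '%s\n' "${UCX_MEM_CUDA_HOOK_MODE:-bistro}"
}

# Value seen with the current environment.
effective_hook_mode

# Pin the previous default inside a subshell so the setting does not leak.
( UCX_MEM_CUDA_HOOK_MODE=reloc; export UCX_MEM_CUDA_HOOK_MODE; effective_hook_mode )
```
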
@fwyzard force-pushed the IB/CMSSW_12_4_X/master_UCX_updates branch from 539a478 to 431af70 on April 21, 2022 22:06

@fwyzard
Contributor Author

fwyzard commented Apr 21, 2022

please test

@cmsbuild
Contributor

Pull request #7809 was updated.

@fwyzard
Contributor Author

fwyzard commented Apr 21, 2022

@cmsbuild, please test for el8_amd64_gcc10

@fwyzard
Contributor Author

fwyzard commented Apr 21, 2022

@cmsbuild, please test for el9_amd64_gcc11

@fwyzard
Contributor Author

fwyzard commented Apr 21, 2022

@cmsbuild, please test for el8_ppc64le_gcc10

@fwyzard
Contributor Author

fwyzard commented Apr 21, 2022

@cmsbuild, please test for el8_aarch64_gcc10
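
The test requests above name each target with a single string such as `el8_amd64_gcc10`. A minimal sketch of splitting such a string into its three underscore-separated fields follows; the function name `split_arch` and the labels `os`, `cpu`, and `compiler` are illustrative assumptions, not cms-bot terminology.

```shell
# Split a string of the form <os>_<cpu>_<compiler> into its fields.
# split_arch is a hypothetical helper for illustration only.
split_arch() {
    os="${1%%_*}"          # text before the first underscore
    rest="${1#*_}"         # text after the first underscore
    cpu="${rest%%_*}"      # middle field
    compiler="${rest#*_}"  # last field
    printf 'os=%s cpu=%s compiler=%s\n' "$os" "$cpu" "$compiler"
}

# The four targets requested in this conversation.
for a in el8_amd64_gcc10 el9_amd64_gcc11 el8_ppc64le_gcc10 el8_aarch64_gcc10; do
    split_arch "$a"
done
```
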

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24116/summary.html
COMMIT: 431af70
CMSSW: CMSSW_12_4_X_2022-04-20-2300/el8_ppc64le_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7809/24116/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24116/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24116/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test testFWCoreUtilities had ERRORS
---> test DRNTest had ERRORS

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24118/summary.html
COMMIT: 431af70
CMSSW: CMSSW_12_4_X_2022-04-20-2300/el8_aarch64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7809/24118/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24118/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24118/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test TestFWCoreServicesDriver had ERRORS
---> test testFWCoreUtilities had ERRORS
---> test DRNTest had ERRORS

@smuzaffar
Contributor

please test with cms-sw/cms-bot#1751

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24114/summary.html
COMMIT: 431af70
CMSSW: CMSSW_12_4_X_2022-04-20-2300/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7809/24114/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24114/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24114/git-merge-result

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /pool/condor/dir_117278/jenkins/workspace/compare-root-files-short-matrix/data/PR-ee27b4/39434.75_TTbar_14TeV+2026D88_HLT75e33+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HLT75e33+HARVESTGlobal

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 62740 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3695434
  • DQMHistoTests: Total failures: 594594
  • DQMHistoTests: Total nulls: 380
  • DQMHistoTests: Total successes: 3100438
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.31499999999999995 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 10224.0 ): 0.063 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 11834.0 ): 2.372 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 250202.181 ): 0.006 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 25202.0 ): -0.117 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 7.3 ): -2.009 KiB SiStrip/MechanicalView
  • Checked 205 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: found differences in 14 / 48 workflows

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24128/summary.html
COMMIT: 431af70
CMSSW: CMSSW_12_4_X_2022-04-21-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/7809/24128/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24128/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ee27b4/24128/git-merge-result

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-ee27b4/39434.75_TTbar_14TeV+2026D88_HLT75e33+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HLT75e33+HARVESTGlobal

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 12 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3695434
  • DQMHistoTests: Total failures: 19
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3695392
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 205 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Contributor

+externals

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_12_4_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
