cmsTriton bug fixes and improvements #43814

Merged (19 commits) on Feb 11, 2024

kpedro88 (Contributor) commented Jan 30, 2024

PR description:

Rollup of several pending bug fixes and improvements, primarily aimed at the cmsTriton server management script:

  1. Some relval jobs have been seeing timeouts when starting the local fallback server if the server image (hosted on /cvmfs/unpacked.cern.ch) is not already loaded in the worker node's cache. It does not appear to be possible for a non-privileged user to check whether something is in the cvmfs cache, so the default server start timeout is simply extended to minimize these failures.
  2. Thread control settings for the local fallback server are only used in the CPU case, not the GPU case.
  3. The OSG apptainer version is used by default if available. This avoids the need to propagate updated apptainer versions through a slew of different containers.
  4. The SONIC special workflows are extended to 14 TeV processes for Run 3.
  5. The TritonService will print the local fallback server log if the process is terminated because of an exception that could be related to the local fallback server. (This is denoted by a third argument in the TRITON_THROW_IF_ERROR macro.) During inference requests, which can be retried, an exception does not necessarily lead to process termination, so the notification to the TritonService can be undone in that case.
  6. The unit of the inference request timeout can now be selected. (This was useful for testing the above feature.)
  7. More granular verbosity control in the unit test script is now available; clients, the fallback server, and the TritonService itself can all be targeted separately.
  8. The unit test script is migrated from VarParsing to argparse, now that the latter works properly with cmsRun.
  9. The GPU driver check is skipped if the local fallback server is not using GPU.
  10. The handling of symbolic links in the checksum and versioncheck modes of cmsTritonConfigTool is fixed.
  11. Exceptions are no longer thrown in TritonMemResource destructors (addresses "TritonService throwing multiple exceptions", #38260).
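
The notify-and-undo behavior in item 5 can be illustrated with a minimal Python sketch. This is not the CMSSW implementation (which is C++ and built around the TRITON_THROW_IF_ERROR macro); the names `FallbackNotifier`, `run_with_retries`, and `make_flaky` are hypothetical and exist only to show the idea: a failed request flags the fallback server as a possible culprit, and a successful retry clears that flag because the process will not terminate.

```python
class FallbackNotifier:
    """Hypothetical stand-in for the service-side bookkeeping in item 5."""

    def __init__(self):
        self.server_suspected = False

    def notify(self):
        # An exception that could be related to the fallback server occurred.
        self.server_suspected = True

    def undo(self):
        # The request eventually succeeded, so the server log is not needed.
        self.server_suspected = False


def run_with_retries(request, notifier, allowed_tries):
    """Call request() up to allowed_tries times.

    Each failure sets the notifier flag; a later success clears it.
    If every try fails, the last error is re-raised with the flag still set,
    which is the case where the server log would be printed.
    """
    if allowed_tries < 1:
        raise ValueError("allowed_tries must be at least 1")
    last_error = None
    for _ in range(allowed_tries):
        try:
            result = request()
        except RuntimeError as err:
            notifier.notify()
            last_error = err
            continue
        notifier.undo()
        return result
    raise last_error


def make_flaky(failures):
    """Return a request that fails `failures` times and then succeeds."""
    state = {"left": failures}

    def request():
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("inference request timed out")
        return "ok"

    return request
```

With `make_flaky(1)` and three allowed tries, the call returns normally and the flag ends up cleared; with more failures than allowed tries, the error propagates and the flag stays set.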

PR validation:

Unit tests pass.

The following command was used to test item 5:

cmsRun test/tritonTest_cfg.py --modules TritonGraphProducer --maxEvents 10 --timeout 1 --timeoutUnit microseconds --verboseService --device CPU --noShm
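
For readers unfamiliar with the argparse migration (item 8) and the selectable timeout unit (item 6), here is a rough sketch of how the options in the command above could be declared. The option names are taken from that command; everything else (defaults, the set of accepted units, the helper `timeout_in_microseconds`) is an assumption for illustration and may differ from the real test/tritonTest_cfg.py.

```python
import argparse

# Assumed unit table; only "microseconds" appears in the command above.
MICROSECONDS_PER_UNIT = {
    "seconds": 1_000_000,
    "milliseconds": 1_000,
    "microseconds": 1,
}

# The argument list from the validation command, split into tokens.
EXAMPLE_ARGV = (
    "--modules TritonGraphProducer --maxEvents 10 --timeout 1 "
    "--timeoutUnit microseconds --verboseService --device CPU --noShm"
).split()


def build_parser():
    """Declare an illustrative subset of the test options (defaults are guesses)."""
    parser = argparse.ArgumentParser(description="illustrative Triton test options")
    parser.add_argument("--modules", nargs="+", default=[])
    parser.add_argument("--maxEvents", type=int, default=-1)
    parser.add_argument("--timeout", type=int, default=30)
    parser.add_argument(
        "--timeoutUnit",
        choices=sorted(MICROSECONDS_PER_UNIT),
        default="seconds",
    )
    parser.add_argument("--verboseService", action="store_true")
    parser.add_argument("--device", default="auto")
    parser.add_argument("--noShm", action="store_true")
    return parser


def timeout_in_microseconds(args):
    """Convert the (value, unit) pair into a single integer number of microseconds."""
    return args.timeout * MICROSECONDS_PER_UNIT[args.timeoutUnit]
```

Parsing `EXAMPLE_ARGV` with `build_parser()` yields a one-microsecond timeout, which is short enough to force the inference request to fail and so exercise the log-printing path from item 5.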

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Not intended to be backported.

cmsbuild (Contributor) commented Jan 30, 2024

cms-bot internal usage

cmsbuild (Contributor) commented

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-43814/38622

Code check has found code style and quality issues which could be resolved by applying following patch(s)

kpedro88 (Contributor, Author) commented Feb 6, 2024

please test

cmsbuild (Contributor) commented Feb 7, 2024

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-4df92d/37258/summary.html
COMMIT: ec8dda1
CMSSW: CMSSW_14_1_X_2024-02-06-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/43814/37258/install.sh to create a dev area with all the needed externals and cmssw changes.

makortel (Contributor) commented Feb 7, 2024

+heterogeneous

srimanob (Contributor) commented Feb 7, 2024

+Upgrade

kpedro88 (Contributor, Author) commented Feb 8, 2024

@cms-sw/pdmv-l2 please sign (from your side, just expanding a special workflow to another fragment)

AdrianoDee (Contributor) commented

+pdmv

cmsbuild (Contributor) commented Feb 9, 2024

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

antoniovilela (Contributor) commented

+1

cmsbuild merged commit 86e33f4 into cms-sw:master on Feb 11, 2024
12 checks passed