ONNXRuntime-based implementation of DeepJet, DeepAK8 and DeepDoubleX #28112
Conversation
The code-checks are being triggered in jenkins.
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28112/12128
Code check has found code style and quality issues which could be resolved by applying the following patch(es).

The code-checks are being triggered in jenkins.
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28112/12129
A new Pull Request was created by @hqucms (Huilin Qu) for master. It involves the following packages: PhysicsTools/NanoAOD. The following packages do not have a category yet: PhysicsTools/ONNXRuntime. @perrotta, @cmsbuild, @fgolf, @slava77, @santocch, @peruzzim can you please review it and eventually sign? Thanks. cms-bot commands are listed here.
Just a bunch of basic questions.
// Returns: a std::vector<std::vector<float>>, with the order matched to `output_names`.
// When `output_names` is empty, will return all outputs ordered as in `getOutputNames()`.
FloatArrays run(const std::vector<std::string>& input_names,
                FloatArrays input_values,
Is this vector of vectors intentionally passed by value (leading to copying)?
Yes, but thinking again it may not be the best way -- the problem is that `Value::CreateTensor` needs a `T *` rather than a `const T *`, so here we would have to pass a non-const `FloatArrays &`, which does not seem very nice to me. Still, I changed it to passing by reference, as this object is typically fairly large.
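For illustration, here is a minimal standalone sketch (the helper name `makeTensor` and the includes are ours, not code from this PR) of the constraint being discussed: `Ort::Value::CreateTensor<float>()` wraps an existing buffer without copying, but it takes a mutable `float*`, which is what forces the non-const reference.

// Sketch only: shows why the wrapped input buffer cannot be const.
#include <onnxruntime_cxx_api.h>
#include <cstdint>
#include <vector>

using FloatArrays = std::vector<std::vector<float>>;

// Hypothetical helper: wraps one input group's buffer as an ONNXRuntime tensor.
Ort::Value makeTensor(std::vector<float>& values, const std::vector<int64_t>& shape) {
  auto memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  // CreateTensor demands a non-const float*; calling values.data() on a
  // const vector would not compile here.
  return Ort::Value::CreateTensor<float>(
      memory_info, values.data(), values.size(), shape.data(), shape.size());
}

Passing `FloatArrays&` rather than by value then avoids copying a typically large object while still satisfying this non-const requirement.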
std::vector<std::vector<unsigned int>> input_shapes_;              // shapes of each input group
std::unordered_map<std::string, PreprocessParams> prep_info_map_;  // preprocessing info for each input group

FloatArrays data_;
How big is the `data_`?
~5k floats.
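As a hedged sketch of what such a reusable buffer might look like (the names `makeBuffers` and `input_shapes` are ours, not taken from the PR), preallocating each input group once from its shape keeps those ~5k floats from being reallocated on every inference call:

#include <functional>
#include <numeric>
#include <vector>

using FloatArrays = std::vector<std::vector<float>>;

// Hypothetical: size one flat float buffer per input group from its shape.
FloatArrays makeBuffers(const std::vector<std::vector<unsigned int>>& input_shapes) {
  FloatArrays data;
  data.reserve(input_shapes.size());
  for (const auto& shape : input_shapes) {
    // Number of floats in this group = product of its dimensions.
    const unsigned int len =
        std::accumulate(shape.begin(), shape.end(), 1u, std::multiplies<unsigned int>());
    data.emplace_back(len, 0.f);
  }
  return data;
}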
+1

+1
@peruzzim when do you expect we may finally have scrutiny of this update on the xpog side?
+xpog |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)
+1
PR description:
This PR adds a preliminary ONNXRuntime-based implementation of DeepJet and DeepAK8, following the proposal in #27458. The runtime of the DeepJet (DeepFlavour) b-tagger is reduced by ~7x, while for DeepAK8 the runtime is reduced by a few percent.
Timing benchmark based on 1000 JetHT Run2017F events (running the standard NanoAOD sequence), measured on a lxplus7 node:
[before]
[after]
(Similar speedup is also observed on an AMD CPU.)
Currently we set MLAS_DYNAMIC_CPU_ARCH=0 (introduced in hqucms/onnxruntime@7222aea) so that ONNXRuntime does not attempt to use AVX/AVX2-based kernels even if they are available on the machine. Allowing ONNXRuntime to dynamically switch to AVX/AVX2-based kernels on CPUs supporting these instructions can bring another speed-up of 1.5x ~ 2x (but then the results are not bitwise reproducible across different CPU architectures).
Overview of the speedups
(DeepTauID implementation is not included in this PR.)
More details in https://indico.cern.ch/event/855787/contributions/3601398/attachments/1929206/3194789/ML_inference_ONNXRuntime_RECOAT_20191018_H_Qu.pdf.
PR Dependencies:
PR validation:
`pstree` shows no additional threads being created during the job run.
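For reference, single-threaded behavior like this can be requested through session options; below is a minimal sketch using the public ONNXRuntime C++ API (the model path is a placeholder, and this is not necessarily the exact configuration used in the PR):

#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "demo");
  Ort::SessionOptions options;
  options.SetIntraOpNumThreads(1);  // no intra-op worker threads
  options.SetInterOpNumThreads(1);  // no inter-op worker threads
  Ort::Session session(env, "model.onnx", options);  // "model.onnx" is a placeholder
  return 0;
}

With such settings, tools like pstree should show only the calling thread during inference.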