
reduce ML inference time in b-tag and related jet taggers: focus on ParticleNet #32883

Closed · 1 of 5 tasks · slava77 opened this issue Feb 12, 2021 · 25 comments

@slava77 (Contributor) commented Feb 12, 2021

This is a replacement/refresh of #25230, where ML jet taggers in total accounted for 20% of the miniAOD time.

In a recent variant of reminiAOD (now the 2018 UL remini workflow 136.88811), jet tagging inference takes 15% of the miniAOD processing time, as measured in CMSSW_11_3_0_pre2:

   0.39 pfDeepCSVJetTagsAK8PFPuppiSoftDropSubjets          DeepFlavourJetTagsProducer
   0.51            pfDeepCSVJetTagsAK8Puppi          DeepFlavourJetTagsProducer
   1.06               pfDeepCSVJetTagsPuppi          DeepFlavourJetTagsProducer
   1.51 pfMassIndependentDeepDoubleCvLV2JetTagsSlimmedAK8DeepTags      DeepDoubleXONNXJetTagsProducer
   1.52 pfMassIndependentDeepDoubleCvBV2JetTagsSlimmedAK8DeepTags      DeepDoubleXONNXJetTagsProducer
   1.61 pfMassIndependentDeepDoubleBvLV2JetTagsSlimmedAK8DeepTags      DeepDoubleXONNXJetTagsProducer
   7.96 pfMassDecorrelatedDeepBoostedJetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
   8.23 pfDeepBoostedJetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
  16.56 pfDeepFlavourJetTagsSlimmedDeepFlavour      DeepFlavourONNXJetTagsProducer
  17.62 pfHiggsInteractionNetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
  65.43 pfMassDecorrelatedParticleNetJetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
  65.53 pfParticleNetJetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
  66.13 pfParticleNetAK4JetTagsSlimmedDeepFlavour       BoostedJetONNXJetTagsProducer
Total of the above: 254.06 ms/ev
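
(For context: per-module timings like the table above can be obtained with the CMSSW Timing service. The issue does not state which tool produced these numbers, so the fragment below is a minimal sketch of one way to get comparable output, assuming an existing `process`.)

```python
import FWCore.ParameterSet.Config as cms

# Minimal sketch: enable per-module timing in a cmsRun job.
# "process" is assumed to be the existing miniAOD workflow process.
process.Timing = cms.Service(
    "Timing",
    summaryOnly=cms.untracked.bool(True),   # print only the end-of-job summary
    useJobReport=cms.untracked.bool(True),  # also record timings in the job report
)
```
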
@cmsbuild (Contributor)

A new Issue was created by @slava77 Slava Krutelyov.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@slava77 (Contributor Author) commented Feb 12, 2021

assign reconstruction

@cmsbuild (Contributor)

New categories assigned: reconstruction

@slava77,@perrotta,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@kpedro88 (Contributor)

@mialiu149 @jmduarte

@andrzejnovak (Contributor)

Analyzing the inputs with LRP/Integrated Gradients and removing roughly the 40% lowest-scoring variables reduced the inference time for DDX V2 by half (not counting model load/initialization).
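
(For illustration, a minimal integrated-gradients sketch in PyTorch for ranking input features by attribution; the model and shapes are hypothetical, and this is not the actual DDX training code.)

```python
import torch

def integrated_gradients(model, x, baseline=None, steps=50):
    """Attribution per input feature, approximating the path integral of
    gradients along the straight line from a baseline to the input x."""
    if baseline is None:
        baseline = torch.zeros_like(x)          # "feature absent" reference
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)
    path = baseline + alphas * (x - baseline)   # (steps, n_features)
    path.requires_grad_(True)
    # Score of the predicted class, summed over path points so a single
    # backward pass yields the gradient at every interpolation step.
    scores = model(path).max(dim=1).values.sum()
    scores.backward()
    avg_grad = path.grad.mean(dim=0)
    return (x - baseline) * avg_grad            # one attribution per feature

# Features whose mean |attribution| over a validation sample is smallest
# are the candidates for pruning before retraining.
```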

@slava77 (Contributor Author) commented Feb 12, 2021

> Analyzing the inputs with LRP/Integrated Gradients and removing roughly the 40% lowest-scoring variables reduced the inference time for DDX V2 by half (not counting model load/initialization).

Please clarify whether V2 already has this reduction or whether it is a possible further improvement.
Thank you.

@slava77 (Contributor Author) commented Feb 12, 2021

To rule out improvements already made in the last few months, I updated the timing values to 11_3_0_pre2 (instead of 11_2_0_pre9). The results did not change appreciably.

@andrzejnovak (Contributor)

> > Analyzing the inputs with LRP/Integrated Gradients and removing roughly the 40% lowest-scoring variables reduced the inference time for DDX V2 by half (not counting model load/initialization).
>
> Please clarify whether V2 already has this reduction or whether it is a possible further improvement.
> Thank you.

V2 already has this reduction. V2 is similar in time to V1 (after the ONNX update) even though it considers more inputs.

@riga (Contributor) commented Feb 18, 2021

ONNXRuntime was updated from 1.3.0 to 1.6.0 yesterday (cms-sw/cmsdist#6649) and should be available with IB CMSSW_11_3_X_2021-02-17-2300. Judging from a few of the commits, the update should also bring some performance improvements, so it may be worth checking the impact on the miniAOD time again.
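
(The per-model effect of such an update can also be estimated outside the full miniAOD job with a standalone micro-benchmark; a sketch using the onnxruntime Python API, with a hypothetical model file name and random inputs.)

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical model file; the real taggers ship as .onnx files in the release.
sess = ort.InferenceSession("particlenet.onnx", providers=["CPUExecutionProvider"])

# Random feeds for every declared input, with symbolic dims set to 1.
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feeds[inp.name] = np.random.rand(*shape).astype(np.float32)

n = 200
start = time.perf_counter()
for _ in range(n):
    sess.run(None, feeds)
print(f"{(time.perf_counter() - start) / n * 1e3:.2f} ms/inference")
```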

@slava77 (Contributor Author) commented Feb 18, 2021

11_3_0_pre2 -> CMSSW_11_3_X_2021-02-17-2300, in ms/ev:

   0.39 -> 0.35 pfDeepCSVJetTagsAK8PFPuppiSoftDropSubjets          DeepFlavourJetTagsProducer
   0.51 -> 0.46          pfDeepCSVJetTagsAK8Puppi          DeepFlavourJetTagsProducer
   1.06 -> 0.99             pfDeepCSVJetTagsPuppi          DeepFlavourJetTagsProducer
   1.51 -> 1.48 pfMassIndependentDeepDoubleCvLV2JetTagsSlimmedAK8DeepTags      DeepDoubleXONNXJetTagsProducer
   1.52 -> 1.47 pfMassIndependentDeepDoubleCvBV2JetTagsSlimmedAK8DeepTags      DeepDoubleXONNXJetTagsProducer
   1.61 -> 1.54 pfMassIndependentDeepDoubleBvLV2JetTagsSlimmedAK8DeepTags      DeepDoubleXONNXJetTagsProducer
   7.96 -> 7.67 pfMassDecorrelatedDeepBoostedJetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
   8.23 -> 8.03 pfDeepBoostedJetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
  16.56 -> 13.92 pfDeepFlavourJetTagsSlimmedDeepFlavour      DeepFlavourONNXJetTagsProducer
  17.62 -> 16.04 pfHiggsInteractionNetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
  65.43 -> 64.01 pfMassDecorrelatedParticleNetJetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
  65.53 -> 63.97 pfParticleNetJetTagsSlimmedAK8DeepTags       BoostedJetONNXJetTagsProducer
  66.13 -> 62.91 pfParticleNetAK4JetTagsSlimmedDeepFlavour       BoostedJetONNXJetTagsProducer
Total of the above: 254.06 ms/ev -> 242.84 ms/ev. 

There is about a 5% reduction, which looks correlated with the use of ONNX rather than with the job generally running faster or other changes between the releases.

I would not consider the 5% reduction a significant enough effect to resolve this issue.

@hqucms (Contributor) commented Feb 18, 2021

Do we plan to enable AVX/AVX2 support in ONNXRuntime at some point, either explicitly or implicitly via the MLAS_DYNAMIC_CPU_ARCH flag? It will speed things up quite a lot (e.g., ~2x for ParticleNet).

@slava77 (Contributor Author) commented Apr 28, 2021

> Do we plan to enable AVX/AVX2 support in ONNXRuntime at some point, either explicitly or implicitly via the MLAS_DYNAMIC_CPU_ARCH flag? It will speed things up quite a lot (e.g., ~2x for ParticleNet).

What is the range of the "Dynamic"? Is it smart enough to stay with AVX2 or will it push for AVX512 wherever available regardless of possible frequency scaling implications?

Considering that I recently found out that we are effectively using dynamic dispatch in TF (#33442) and that operationally things were OK, I think it's reasonable to try it more widely.
@hqucms would you be available to make a PR to test the feature?
Let's try it.
Thanks.

@hqucms (Contributor) commented Apr 28, 2021

@slava77

Yes, the level of "dynamic" can be controlled:

  • MLAS_DYNAMIC_CPU_ARCH=0: no AVX
  • MLAS_DYNAMIC_CPU_ARCH=1: up to AVX
  • MLAS_DYNAMIC_CPU_ARCH=2: up to AVX2
  • MLAS_DYNAMIC_CPU_ARCH>2: up to AVX512

Sure, I can open a PR. Any suggestion on which level we want to use?
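
(To see which of these levels a given machine could actually exploit, a quick check of the CPU flags the host advertises is enough; a small Linux-only sketch, independent of ONNXRuntime itself.)

```python
# Check which vector extensions the host CPU advertises (Linux-only),
# i.e. the ceiling a dynamic-dispatch build could reach at runtime.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for isa in ("avx", "avx2", "avx512f"):
    print(f"{isa}: {'yes' if isa in flags else 'no'}")
```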

@slava77 (Contributor Author) commented Apr 28, 2021

> Sure, I can open a PR. Any suggestion on which level we want to use?

=2 looks reasonable; I'm not sure if we'd need to "regress" to =1.

@hqucms (Contributor) commented Apr 28, 2021

@slava77 OK I made the PR: cms-sw/cmsdist#6855.
What kind of tests do we want to do with it?

@slava77 (Contributor Author) commented Apr 28, 2021

> @slava77 OK I made the PR: cms-sw/cmsdist#6855.
> What kind of tests do we want to do with it?

Jenkins tests with timing monitored in miniAOD should be enough to confirm the benefits.

I guess that there will be small differences between the =0 and =2 versions.
So, merging may require a bit of a leap of faith that the differences will go away, given that the available infrastructure already has AVX2.
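
(One hypothetical way to quantify those differences offline: dump the tagger scores from two otherwise identical jobs built with =0 and =2 and compare them; the file names below are made up.)

```python
import numpy as np

# Tagger scores dumped from two otherwise identical jobs (hypothetical files).
ref = np.load("scores_arch0.npy")
new = np.load("scores_arch2.npy")

diff = np.abs(ref - new)
print(f"max abs diff: {diff.max():.3e}, mean abs diff: {diff.mean():.3e}")
# Differences at the level of float32 rounding would point to a change of
# SIMD kernels rather than a real regression.
print("within tolerance:", np.allclose(ref, new, rtol=1e-4, atol=1e-6))
```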

@hqucms (Contributor) commented Apr 28, 2021

> I guess that there will be small differences between the =0 and =2 versions.

@slava77 You are referring to the small numerical differences of the outputs, right?

@slava77 (Contributor Author) commented Apr 28, 2021

> @slava77 You are referring to the small numerical differences of the outputs, right?

Yes.

@jpata (Contributor) commented Apr 11, 2022

@emilbols please take note of this performance issue, and let us know the plans to address this.

@cms-sw/btv-pog-l2

@emilbols (Contributor)

> @emilbols please take note of this performance issue, and let us know the plans to address this.
>
> @cms-sw/btv-pog-l2

I believe after PR cms-sw/cmsdist#6855 there was a reduction for all the ONNX modules (see cms-sw/cmsdist#6855 (comment)). If I'm not mistaken, the table referenced here is from before that.

A simple thing that might be useful is to make sure each tagger only runs on the phase space where it is needed. For instance, I believe DeepJet runs on jets beyond |eta| = 2.5 and below pT = 20 GeV even though it is not used in that phase space (a configuration sketch follows this comment). On the actual ML inference side, we have to investigate further how to improve the situation. I will bring it up with the BTV conveners.

@hqucms @ademoor @riga @jmduarte @andrzejnovak
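
(A minimal CMSSW configuration sketch of the phase-space restriction suggested above; the selector label and exact cut are hypothetical, and a real change would have to preserve the collection and value-map wiring expected downstream.)

```python
import FWCore.ParameterSet.Config as cms

# "process" is assumed to be the existing miniAOD workflow process.
# Hypothetical pre-selection: only feed DeepJet the jets it is actually used on.
process.jetsForDeepJet = cms.EDFilter(
    "CandPtrSelector",
    src=cms.InputTag("slimmedJets"),
    cut=cms.string("pt > 20 && abs(eta) < 2.5"),
)

# The tagger would then consume the reduced collection instead of the full one
# (shown schematically; the real producer takes more inputs than just the jets).
process.pfDeepFlavourJetTagsSlimmedDeepFlavour.jets = cms.InputTag("jetsForDeepJet")
```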

@jpata (Contributor) commented Apr 13, 2022

Thanks for confirming.

Comparing 11_3_0 (which I think already includes the AVX2 fix) and 12_4_0_pre2 in Run 3 MINIAOD, 400 events, on the exact same machine:

  • BoostedJetONNXJetTagsProducer: 78ms (11%) -> 37ms (6%)
  • DeepFlavourONNXJetTagsProducer: 11ms (1.5%) -> 5ms (1%)

Additional improvements (e.g. not doing inference in unused phase space) would be useful.

@jpata (Contributor) commented May 5, 2022

+reconstruction

@jpata (Contributor) commented May 5, 2022

@cmsbuild please close

@cmsbuild (Contributor) commented May 5, 2022

This issue is fully signed and ready to be closed.
