reduce ML inference time in b-tag related jet taggers #25230

Closed
slava77 opened this issue Nov 14, 2018 · 31 comments

@slava77
Contributor

slava77 commented Nov 14, 2018

With #24918 included, b-tag-related ML inference accounts for around 20% of the miniAOD processing time.

Here are the averages (excluding peak [initialization?] time) from 1K events in workflow 136.8311, running with this PR on top of 10_4_0_pre1:

   ms/ev  module                                                     plugin
    0.33  pfDeepCSVJetTagsAK8PFPuppiSoftDropSubjets                  DeepFlavourJetTagsProducer
    0.41  pfDeepCSVJetTagsAK8Puppi                                   DeepFlavourJetTagsProducer
    0.94  pfDeepCSVJetTagsPuppi                                      DeepFlavourJetTagsProducer
   16.01  pfDeepDoubleCvLJetTagsSlimmedAK8DeepTags                   DeepDoubleXTFJetTagsProducer
   16.09  pfDeepDoubleBvLJetTagsSlimmedAK8DeepTags                   DeepDoubleXTFJetTagsProducer
   16.10  pfDeepDoubleCvBJetTagsSlimmedAK8DeepTags                   DeepDoubleXTFJetTagsProducer
   16.20  pfMassIndependentDeepDoubleCvLJetTagsSlimmedAK8DeepTags    DeepDoubleXTFJetTagsProducer
   16.58  pfMassIndependentDeepDoubleCvBJetTagsSlimmedAK8DeepTags    DeepDoubleXTFJetTagsProducer
   28.86  pfDeepBoostedJetTagsSlimmedAK8DeepTags                     DeepBoostedJetTagsProducer
   29.72  pfMassDecorrelatedDeepBoostedJetTagsSlimmedAK8DeepTags     DeepBoostedJetTagsProducer
  153.85  pfDeepFlavourJetTagsSlimmedDeepFlavour                     DeepFlavourTFJetTagsProducer

Total from above: 295.09 ms/ev

This issue is to keep track of the progress, aiming for a reduction.
The next BTV-related jet-tag update should not increase the total time.

@ferencek @mverzett @kskovpen @andrzejnovak @jmduarte @hqucms

@slava77
Contributor Author

slava77 commented Nov 14, 2018

assign reconstruction

@cmsbuild
Contributor

New categories assigned: reconstruction

@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @slava77 Slava Krutelyov.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@kpedro88
Contributor

For miniAOD, one way to reduce the total time used by ML tagging algorithms is to run them only on the jets for which the results will actually be used/stored. This is relevant for AK8 jets, since the low-pT ones are not stored. My understanding is that DeepAK8 has an enforced pT cut in the tagger, while DeepDoubleB does not, but maybe this is out of date.

Tagging a few more people: @gouskos @rappoccio

@rappoccio
Contributor

That's a good point. The same preselection could be used: run this only on the constituents of a preclustered sample with a pT cut on the jet.
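
For illustration, a minimal sketch of such a preselection as a CMSSW python config fragment; the module label and the 170 GeV threshold are assumptions for illustration, not the actual miniAOD sequence:

    import FWCore.ParameterSet.Config as cms

    # Hypothetical preselection: keep only AK8 jets above the miniAOD
    # storage threshold, so the expensive ML taggers run on fewer jets.
    # The label and the 170 GeV cut are illustrative assumptions.
    slimmedJetsAK8HighPt = cms.EDFilter("PATJetSelector",
        src = cms.InputTag("slimmedJetsAK8"),
        cut = cms.string("pt > 170."),
    )

    # Downstream tagger inputs would then point at slimmedJetsAK8HighPt
    # instead of slimmedJetsAK8.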

@kpedro88
Contributor

Also, @violatingcp found that for CNN inference, TensorFlow v1.10 is 2.4x faster than TensorFlow v1.6 (which we're currently using).
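
For reference, version-to-version comparisons like this can be done with a small standalone micro-benchmark; a sketch using the TF 1.x API (the toy CNN below is a placeholder, not the actual tagger graph):

    import time
    import numpy as np
    import tensorflow as tf  # TF 1.x API, as used in CMSSW at the time

    # Toy two-layer CNN; a stand-in for the real model, only meant to
    # compare inference speed across TF builds/versions.
    x = tf.placeholder(tf.float32, shape=(1, 64, 64, 3))
    y = tf.layers.conv2d(x, filters=32, kernel_size=3, activation=tf.nn.relu)
    y = tf.layers.conv2d(y, filters=32, kernel_size=3, activation=tf.nn.relu)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        data = np.random.rand(1, 64, 64, 3).astype(np.float32)
        sess.run(y, {x: data})  # warm-up
        start = time.time()
        for _ in range(100):
            sess.run(y, {x: data})
        print((time.time() - start) / 100, "s per inference")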

@slava77
Contributor Author

slava77 commented Nov 15, 2018 via email

@violatingcp
Contributor

Ah, so the TF v1.6 numbers used the CMSSW standard version. Indeed, it looks like the TF version I am using was built with AVX, but not AVX2. I can disable that and try to run.
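
As a quick check (a sketch assuming a Linux host), the SIMD extensions advertised by the CPU can be listed and compared against what the TF build was compiled for:

    # List the AVX variants the CPU advertises (Linux only); the TF build
    # can then be compared against this list.
    with open("/proc/cpuinfo") as f:
        flags = next(line for line in f if line.startswith("flags")).split()
    print(sorted(flag for flag in flags if flag.startswith("avx")))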

@slava77
Contributor Author

slava77 commented May 24, 2019

@ferencek @mverzett @kskovpen @andrzejnovak @jmduarte @hqucms
Was there any significant progress in recent months on speeding up the ML b-tags?
Please comment.
Thank you.

@mverzett
Contributor

Not really; we are looking into ways to reduce the input size and eventually the graph size, but we are a long way from a usable solution.

@hqucms
Contributor

hqucms commented Jun 27, 2019

@slava77
In some recent studies, we found a significant difference (4-5x) in inference speed between standalone MXNet and the MXNet in CMSSW. After some investigation, the cause seems to be the linking with the BLAS library. Standalone MXNet is statically linked to OpenBLAS. The one in CMSSW, despite being dynamically linked to OpenBLAS [0], does not seem to call into OpenBLAS when running inside cmsRun.

A simple test to show this:

[Test setup]
Using CMSSW_10_6_1 on lxplus7
Python cfg: RecoBTag/MXNet/test/test_deep_boosted_jet_cfg.py
Input file: /store/mc/RunIIFall17MiniAODv2/ZprimeToTT_M1000_W10_TuneCP2_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14_ext1-v1/90000/0C1D564B-DB28-E911-912C-0CC47A4D76D0.root (copied to local)

> cmsRun test_deep_boosted_jet_cfg.py inputFiles=file:TT-M1000.root maxEvents=1000 >& orig.log &                                                                                                            
TimeReport   0.020148     0.020148     0.020148  pfDeepBoostedJetTags
TimeReport   0.018684     0.018684     0.018684  pfMassDecorrelatedDeepBoostedJetTags

> LD_PRELOAD=/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/OpenBLAS/0.3.5/lib/libopenblas.so cmsRun test_deep_boosted_jet_cfg.py inputFiles=file:TT-M1000.root maxEvents=1000 >& preload_openblas_cmssw.log &
TimeReport   0.005091     0.005091     0.005091  pfDeepBoostedJetTags
TimeReport   0.004962     0.004962     0.004962  pfMassDecorrelatedDeepBoostedJetTags

With LD_PRELOAD we see a speedup of ~4x, now roughly consistent with standalone MXNet.

(BTW, this still leaves a ~20-30% difference, likely because the CMSSW OpenBLAS is compiled with DYNAMIC_ARCH=0 and is therefore unable to exploit newer instructions, e.g. AVX, available on the machine -- maybe we should also consider enabling DYNAMIC_ARCH?)

A search in the external libraries reveals two other libraries that provide the same cblas functions as OpenBLAS:

/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libcblas.so
0000000000007c44 T cblas_sgemm

/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libgslcblas.so
0000000000002c10 T cblas_sgemm

/cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libopenblas.so
000000000008e010 T cblas_sgemm

It seems that one of them is loaded before OpenBLAS and therefore hides the OpenBLAS symbols. Any suggestion on how to fix this?

Thank you!

[0]

> ldd /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/mxnet-predict/1.2.1-pafccj2/lib/libmxnetpredict.so 
        linux-vdso.so.1 =>  (0x00007ffd9af8a000)
        libopenblas.so.0 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_1/external/slc7_amd64_gcc700/lib/libopenblas.so.0 (0x00007f9a63d0e000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f9a63b06000)
        libstdc++.so.6 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libstdc++.so.6 (0x00007f9a674df000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f9a63804000)
        libgcc_s.so.1 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libgcc_s.so.1 (0x00007f9a674c6000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f9a635e8000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f9a6321b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9a67471000)
        libgfortran.so.4 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libgfortran.so.4 (0x00007f9a63048000)
        libquadmath.so.0 => /cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/gcc/7.0.0-pafccj/lib64/libquadmath.so.0 (0x00007f9a63008000)
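
As an aside, the load-order shadowing described above can be reproduced outside cmsRun with a small ctypes sketch (an illustration assuming a CMSSW environment, so that both libraries resolve on the loader path):

    import ctypes

    # Load gslcblas first with global symbol visibility, then openblas,
    # mimicking the load order inside cmsRun.
    gsl = ctypes.CDLL("libgslcblas.so", mode=ctypes.RTLD_GLOBAL)
    openblas = ctypes.CDLL("libopenblas.so", mode=ctypes.RTLD_GLOBAL)

    def addr(lib):
        return ctypes.cast(lib.cblas_sgemm, ctypes.c_void_p).value

    # Resolving through the whole-process namespace (dlopen(NULL)) picks
    # the symbol from whichever global library was loaded first, so
    # gslcblas shadows openblas here.
    whole_process = ctypes.CDLL(None)
    print(addr(whole_process) == addr(gsl))       # True
    print(addr(whole_process) == addr(openblas))  # False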

@slava77
Contributor Author

slava77 commented Jun 27, 2019

@mrodozov @smuzaffar @fabiocos
please take a look and comment on #25230 (comment)
It looks like we have multiple libraries providing the symbols; perhaps libopenblas.so should replace the rest.

@smuzaffar
Contributor

@slava77, it looks like the ROOT MathMore library (which was loaded first) is linked against gslcblas. We are looking into the ROOT configuration to see if it can use openblas.

@mrodozov
Contributor

Input file: /store/mc/RunIIFall17MiniAODv2/ZprimeToTT_M1000_W10_TuneCP2_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14_ext1-v1/90000/0C1D564B-DB28-E911-912C-0CC47A4D76D0.root

@hqucms
The file is not there anymore. Could you put it somewhere on AFS, or anywhere else we can copy it from, so we can test with CMSSW?
Thanks

@hqucms
Contributor

hqucms commented Jun 28, 2019

Thank you for looking into this!
@mrodozov I copied the file to my EOS space:
/eos/cms/store/user/hqu/forTesting/TT-M1000.root

@smuzaffar
Contributor

cms-sw/cmsdist#5063 should allow us to link against openblas instead of gslcblas. Once it is integrated in an IB, we will inform you so that you can re-run the tests.

@smuzaffar
Contributor

@slava77 and @hqucms, running this test with cms-sw/cmsdist#5063 I get these numbers:

TimeReport   0.003751     0.003751     0.003751  pfDeepBoostedJetTags
TimeReport   0.003788     0.003788     0.003788  pfMassDecorrelatedDeepBoostedJetTags

@slava77, if you need to test it, I can include it in the GCC 8 IBs.

@kpedro88
Contributor

kpedro88 commented Jul 1, 2019

It would be great to backport this to 10_6_X so UL reco and analysis can profit from the speed improvement.

@slava77
Contributor Author

slava77 commented Jul 2, 2019

Tests in cms-sw/cmsdist#5063 showed a small regression in some b-tag discriminants.
We need to understand the origin of that before merging to the master branch or backporting.

@hqucms
Contributor

hqucms commented Jul 2, 2019

@slava77 and @hqucms, running this test with cms-sw/cmsdist#5063 I get these numbers:

TimeReport   0.003751     0.003751     0.003751  pfDeepBoostedJetTags
TimeReport   0.003788     0.003788     0.003788  pfMassDecorrelatedDeepBoostedJetTags

Thank you @smuzaffar !
These numbers are consistent with my tests -- they are better than the ones I quoted in #25230 (comment), which I think is due to enabling DYNAMIC_ARCH.

@fabiocos
Contributor

fabiocos commented Jul 2, 2019

@kpedro88 @slava77 as the production for the 2017 UL is about to start, if this is not merely technical, we need to understand to what extent we can afford to backport things.

@slava77
Contributor Author

slava77 commented Jul 2, 2019

@kpedro88 @slava77 as the production for the 2017 UL is about to start, if this is not merely technical, we need to understand to what extent we can afford to backport things.

The impact of cms-sw/cmsdist#5063 is small for the UL.
The backports may be more important for the releases where these taggers have to run in nano.

@hqucms
Contributor

hqucms commented Jul 7, 2019

Another thing: I recently came across a fairly new ML inference engine called ONNX Runtime and did some tests with it. It seems it can bring another ~1.5x speed-up over MXNet+OpenBLAS for DeepAK8. More interestingly, it seems to bring a ~3-5x speed-up for the AK4 DeepJet model compared to TensorFlow. It might be interesting to get this into CMSSW.
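
For context, standalone ONNX Runtime inference in python is only a few lines; a sketch (the model path and input shape are placeholders, not the actual DeepAK8/DeepJet graphs):

    import numpy as np
    import onnxruntime as ort

    # Load an exported model and run a dummy batch through it; useful
    # for standalone timing comparisons against TF/MXNet.
    session = ort.InferenceSession("model.onnx")  # placeholder path
    input_name = session.get_inputs()[0].name
    dummy = np.zeros((1, 100, 16), dtype=np.float32)  # assumed shape
    outputs = session.run(None, {input_name: dummy})
    print(outputs[0].shape)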

@fabiocos
Contributor

fabiocos commented Jul 8, 2019

@hqucms I think that we are mixing different things in this thread:

  • effect of the revised BLAS implementation used by MXNet: what is the outcome? Do we understand the regression? Is a backport safe in this respect?

  • possible alternative engines to MXNet: this is a new development that may of course be tested upon agreement with the RECO group, but I see it as a separate issue

@davidlange6
Contributor

davidlange6 commented Jul 8, 2019 via email

@hqucms
Contributor

hqucms commented Jul 8, 2019

@davidlange6 I can make a PR to cmsdist, since I have already tested it a bit.
Also, per @fabiocos's request, I opened a new issue (#27458) regarding ONNX Runtime to avoid mixing up the two fronts.

@davidlange6
Contributor

davidlange6 commented Jul 8, 2019 via email

@slava77
Contributor Author

slava77 commented Jul 8, 2019

@hqucms I think that we are mixing different things in this thread:

@fabiocos
this is a generic thread, opened in November last year. Actually, back then there was some promising development, unrelated to either OpenBLAS or ONNX, which did not materialize into a solution.

@fabiocos
Contributor

fabiocos commented Jul 9, 2019

@slava77 you are correct, but at some point the discussion was about the regression due to cms-sw/cmsdist#5063, and I believe it would be good to keep that issue distinct from whatever other development, that is all.

@slava77
Contributor Author

slava77 commented Jul 11, 2019

@slava77 you are correct, but at some point the discussion was about the regression due to cms-sw/cmsdist#5063, and I believe it would be good to keep that issue distinct from whatever other development, that is all.

So, it looks like this regression is an indication of random changes in the algorithm, also seen in #27488 (comment) and #27428 (comment).

I created #27504 to keep track of the problem

@slava77
Contributor Author

slava77 commented Feb 12, 2021

This could partly be closed now, after ONNX inference was implemented (#28112; via #27458).
However, the total cost is still fairly large after other additions.

Since new inference engines have shown up since late 2018, when this issue was opened, I will open a fresh one (#32883) and close this one.

slava77 closed this as completed Feb 12, 2021