Non-reproducibility in JetMET/{Jet,MET}Validation histograms in phase2 workflows #39754

Open
makortel opened this issue Oct 18, 2022 · 23 comments

Comments

@makortel
Contributor

It seems that we have non-reproducibility in some JetMET/{Jet,MET}Validation histograms that are visible in PR tests. So far seen (at least) in

@cmsbuild
Contributor

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor Author

assign dqm

FYI @cms-sw/jetmet-pog-l2

@cmsbuild
Contributor

New categories assigned: dqm

@jfernan2,@ahmad3213,@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@smuzaffar
Contributor

smuzaffar commented Oct 18, 2022

These were first seen in #39699 (see logs), so could the change in #39699 be responsible for this?

@makortel
Contributor Author

An earlier test in #39699 (comment) reports only 6 DQM histograms with comparison differences, which suggests that #39699 is not responsible for the differences (or at least that the answer is less clear).

On the other hand, these differences seem to occur randomly and not very frequently, so the PR responsible for them could well have had clean comparisons in its own tests.

@perrotta
Contributor

This got somehow fixed, since the same histos now reproduce nicely.
Can this get closed?

@makortel
Contributor Author

Sure

@makortel
Contributor Author

Seems that we are again seeing these

@makortel
Contributor Author

Documenting here: in #41019 (comment), workflow 20834.0 shows differences in

JetMET/METValidation/slimmedMETsPuppi/{METResolution_GenMETTrue_InMETBins, METUnc_ElectronEnDown, METUnc_ElectronEnUp}
JetMET/METValidation/PfMetT0pcT1/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/PfMetT1/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/pfMet/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/pfMetT0pc/METResolution_GenMETTrue_InMETBins
JetMET/METValidation/slimmedMETs/METResolution_GenMETTrue_InMETBins
JetMET/Jet/CleanedslimmedJetsAK8/Pt_profile
ParticleFlow/slimmedMETValidation/CompWithPFMET/{profileRMS_delta_set_VS_set_,profile_delta_set_VS_set_}

Workflows 20834.75, 20834.76, 20896.0, 20900.0, 21034.999, and 23234.0 also show differences.

(#41016 (comment) may also be related.)

@makortel
Contributor Author

#41328 (comment) also shows many differences across many JetMET folders in workflows 23234.0, 23634.0, 23634.911, 23696.0, 23700.0, 23834.999.

Curiously, the baseline was run on Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz (Broadwell) and the PR test on Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (Cascade Lake).

@makortel
Contributor Author

assign upgrade

@cmsbuild
Contributor

New categories assigned: upgrade

@AdrianoDee,@srimanob you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor Author

assign reconstruction, simulation

@cmsbuild
Contributor

New categories assigned: reconstruction,simulation

@mdhildreth,@mandrenguyen,@clacaputo,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor Author

I looked in a bit more detail at the differences in #42123 (comment). I noticed that in this case

  • the baseline tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
  • the PR tests were run on Intel(R) Xeon(R) Gold 5218 (Cascade Lake)

Could some TensorFlow / ONNX ML model be sensitive to the use of AVX-512 instructions? (We have seen similar behavior with some ML models before.)
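
As a rough way to check this hypothesis on a given test node, one could look at which AVX-512 feature flags the CPU advertises (Cascade Lake parts report them, Broadwell parts do not). The sketch below assumes a Linux x86 machine where `/proc/cpuinfo` has a `flags` line; the helper name `avx512_flags` is made up for illustration and is not part of CMSSW or cms-bot.

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted AVX-512 feature flags found in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "flags" entries list CPU feature names (Linux x86 assumption).
        if sep and key.strip() == "flags":
            flags.update(f for f in value.split() if f.startswith("avx512"))
    return sorted(flags)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            found = avx512_flags(f.read())
        print(found if found else "no AVX-512 flags reported")
    except OSError:
        print("/proc/cpuinfo not available on this system")
```

An empty result on the Broadwell nodes and a non-empty one on the Cascade Lake nodes would be consistent with the hypothesis, but would not by itself show that AVX-512 code paths cause the histogram differences.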

@makortel
Contributor Author

makortel commented Aug 9, 2023

In #42507 (comment)

  • the baseline tests were run on Intel(R) Xeon(R) Gold 5218 (Cascade Lake)
  • the PR tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)

@missirol
Contributor

missirol commented Aug 11, 2023

#42540 (comment) and #42534 (comment) are probably examples of this issue (I do not know how to find the specs of the machines used for the tests).

@mmusich
Contributor

mmusich commented Aug 11, 2023

> (I do not know how to find the specs of the machines used for the tests).

I would really be interested to know how to do that as well!

@makortel
Contributor Author

> (I do not know how to find the specs of the machines used for the tests).

> I would really be interested to know how to do that as well!

You can look at the end of the framework job report XML file (JobReport<N>.xml) of e.g. any step of any matrix workflow (as they are all run on the same machine, it doesn't matter which one). There is something along the lines of

<PerformanceSummary Metric="SystemCPU">
  <Metric Name="CPUModels" Value="Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz"/>

that tells the CPU model.
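
For anyone who wants to script this lookup, here is a minimal Python sketch that pulls the `CPUModels` value out of a job report. It relies only on the two elements quoted above; the root element name and the surrounding nesting in `SAMPLE` are assumptions made so the example is self-contained, and `cpu_models_from_report` is a hypothetical helper, not an existing tool.

```python
import xml.etree.ElementTree as ET


def cpu_models_from_report(xml_text):
    """Return all CPUModels values found in a job report XML string."""
    root = ET.fromstring(xml_text)
    # Match the <Metric Name="CPUModels"> inside the SystemCPU summary,
    # wherever that summary sits below the root element.
    path = ".//PerformanceSummary[@Metric='SystemCPU']/Metric[@Name='CPUModels']"
    return [metric.get("Value") for metric in root.findall(path)]


# Hypothetical, heavily trimmed report; real files contain much more.
SAMPLE = """<FrameworkJobReport>
  <PerformanceReport>
    <PerformanceSummary Metric="SystemCPU">
      <Metric Name="CPUModels" Value="Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz"/>
    </PerformanceSummary>
  </PerformanceReport>
</FrameworkJobReport>"""

if __name__ == "__main__":
    print(cpu_models_from_report(SAMPLE))
```

To use it on a real file, read the contents of a JobReport<N>.xml into a string and pass it to the function; running it once on the baseline report and once on the PR test report gives the two CPU models to compare.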

In #42540 (comment)

  • PR tests were run on Intel(R) Xeon(R) Gold 5218 CPU (Cascade Lake)
  • baseline tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)

In #42534 (comment)

  • PR tests were run on Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)
  • baseline tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)

@missirol
Contributor

Another example in #42554 (comment).

  • PR tests were run on Intel(R) Xeon(R) Gold 5218 CPU (Cascade Lake).
  • baseline tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell).

@missirol
Contributor

Another example in #42512 (comment).

  • PR tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell).
  • baseline tests were run on Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake).

@missirol
Contributor

Another example in #42610 (comment) :

  • the baseline tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
  • the PR tests were run on Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)

@missirol
Contributor

missirol commented Sep 2, 2023

Another example in #42707 (comment) :

  • the baseline tests were run on Intel(R) Xeon(R) Silver 4216 CPU (Cascade Lake)
  • the PR tests were run on Intel(R) Xeon(R) CPU E5-2683 v4 (Broadwell)
