
IB Failure from Workflow 537 (DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7) #34531

Closed · qliphy opened this issue Jul 17, 2021 · 42 comments

@qliphy (Contributor) commented Jul 17, 2021

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_0_DEVEL_X_2021-07-16-1100/pyRelValMatrixLogs/run/537.0_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN/step1_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN.log#/972-972

%MSG-w ExternalLHEProducer: ExternalLHEProducer:externalLHEProducer@beginRun 16-Jul-2021 20:37:26 CEST Run: 1
mergeLHE.py is not a relative path. Run it as a shell command.
%MSG
[INFO] >>> launch mergeLHE.py in /data/cmsbld/jenkins/workspace/ib-run-relvals/CMSSW_12_0_DEVEL_X_2021-07-16-1100/pyRelval/537.0_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN
[INFO] >>> Merge 4 files: [thread0/cmsgrid_final.lhe, thread1/cmsgrid_final.lhe, thread2/cmsgrid_final.lhe, thread3/cmsgrid_final.lhe]
[INFO] >>> Write to output: cmsgrid_final.lhe
Traceback (most recent call last):
File "/cvmfs/cms-ib.cern.ch/nweek-02689/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_DEVEL_X_2021-07-16-1100/bin/slc7_amd64_gcc900/mergeLHE.py", line 418, in <module>
main()
File "/cvmfs/cms-ib.cern.ch/nweek-02689/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_DEVEL_X_2021-07-16-1100/bin/slc7_amd64_gcc900/mergeLHE.py", line 414, in main
lhe_merger.merge()
File "/cvmfs/cms-ib.cern.ch/nweek-02689/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_DEVEL_X_2021-07-16-1100/bin/slc7_amd64_gcc900/mergeLHE.py", line 201, in merge
line = next(self._f[i])
StopIteration
----- Begin Fatal Exception 16-Jul-2021 20:37:26 CEST-----------------------
An exception of category 'ExternalLHEProducer' occurred while
[0] Processing global begin Run run: 1
[1] Calling method for module ExternalLHEProducer/'externalLHEProducer'
Exception Message:
Child failed with exit code 1.
----- End Fatal Exception -------------------------------------------------
Another exception was caught while trying to clean up runs after the primary fatal exception.
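The StopIteration in the traceback above comes from mergeLHE.py advancing its open input files with next(). A minimal sketch of that failure mode, with hypothetical file contents and a hypothetical helper (not the actual script):

```python
import io

# Two stand-in LHE inputs: the first has a complete <init> block, the
# second (like the broken thread0 output) ends before </init> appears.
files = [io.StringIO("<init>\ndata\n</init>\n"),
         io.StringIO("<init>\n")]

def read_init_block(f):
    """Collect lines until </init>; next() raises StopIteration at EOF."""
    block = []
    while True:
        line = next(f)
        block.append(line)
        if line.startswith("</init>"):
            return block

print(len(read_init_block(files[0])))  # -> 3
try:
    read_init_block(files[1])
except StopIteration:
    print("input ended before </init> was found")
```

The uncaught StopIteration then propagates out of merge(), exactly as in the log above.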

@qliphy (Contributor, Author) commented Jul 17, 2021

assign generators

@cmsbuild (Contributor)

New categories assigned: generators

@Saptaparna,@mkirsano,@SiewYan,@alberto-sanchez,@agrohsje,@GurpreetSinghChahal you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild (Contributor)

A new Issue was created by @qliphy Qiang Li.

@Dr15Jones, @perrotta, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@qliphy (Contributor, Author) commented Jul 17, 2021

related to #34463

@theofil @colizz @Dominic-Stafford please have a look.

@qliphy (Contributor, Author) commented Jul 17, 2021

You can reproduce the error locally:
runTheMatrix.py -l 537.0 --job-reports -t 4 --ibeos

yet it works well with a single thread:
runTheMatrix.py -l 537.0 --job-reports
(Note the PR test is also run with a single thread.)

@colizz (Contributor) commented Jul 17, 2021

Thanks @qliphy,
it seems this is due to some MG issue (triggered by too few events requested). For one of the four LHE routines, MG reports an error in calculating the systematics:
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc900/CMSSW_12_0_DEVEL_X_2021-07-16-1100/pyRelValMatrixLogs/run/537.0_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN/step1_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN.log#/942

This LHE hence does not include any systematics weights. Therefore we hit the error when merging this LHE with the other three LHEs.

P.S.
I found the MG issue can be reproduced by untarring the gridpack /cvmfs/cms.cern.ch/phys_generator/gridpacks/2017/13TeV/madgraph/V5_2.6.1/DYellell012j_5f_NLO_FXFX/dyellell012j_5f_NLO_FXFX_slc7_amd64_gcc700_CMSSW_10_6_4_tarball.tar.xz

and running 2 events: ./runcmsgrid.sh 2 234567 1
However, running 10 events (./runcmsgrid.sh 10 234567 1) avoids this.

Maybe we are running too few events (10 events split as 2, 2, 3, 3 across the four threads), and this triggers the error?
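The 2/2/3/3 split described above can be sketched as follows, with a hypothetical helper that mirrors an even split with remainder (not the actual ExternalLHEProducer code):

```python
def split_events(n_events, n_threads):
    # Each thread gets the integer average; the remainder is spread
    # one extra event at a time over the first few threads.
    base, rest = divmod(n_events, n_threads)
    return [base + (1 if i < rest else 0) for i in range(n_threads)]

print(split_events(10, 4))  # -> [3, 3, 2, 2]: some threads get only 2 events
print(split_events(10, 2))  # -> [5, 5]: the working two-thread case
```

With 4 threads, two of the four jobs end up with exactly 2 events, which is the problematic count discussed below.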

@qliphy (Contributor, Author) commented Jul 17, 2021

@colizz Thanks! I think you are right. With 2 threads, the events are split 5+5 between the 2 jobs and it works well

runTheMatrix.py -l 537 --job-reports -t 2 --ibeos

@cms-sw/pdmv-l2 @smuzaffar Is there some way to set larger events or smaller threads for this specific workflow 537?

Or should we set generateConcurrently to false? @theofil
https://github.com/cms-sw/cmssw/blob/master/Configuration/Generator/python/DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7_cff.py#L42

@theofil (Contributor) commented Jul 17, 2021

@qliphy setting generateConcurrently to false could be an acceptable way to proceed for now, unless we find a better way to make it work, e.g. by your suggestion to request more events per thread.

@theofil (Contributor) commented Jul 18, 2021

thanks to @colizz and @qliphy for analyzing the various symptoms and figuring out the details.

To summarize: I can confirm that running with 1 thread is OK, with 4 threads we get an error, and with 2 threads it is again fine. Therefore, although the issue surfaces when merging the multiple LHEs, the problem seems to reside in the files we try to merge when statistics are low, rather than in the mergeLHE.py script per se.

We can proceed either by:

1. increasing the number of events requested, so each thread gets enough statistics, or
2. setting generateConcurrently to False for this workflow.

To my mind, the second solution could be easily implemented, but the first would be preferable.

Including @cms-sw/pdmv-l2 @smuzaffar @Dominic-Stafford for reference and commenting

@qliphy (Contributor, Author) commented Jul 21, 2021

@smuzaffar Is there some way to increase event number for a specific workflow in IB test? Somewhere in cms-bot?

@davidlange6 (Contributor) commented Jul 21, 2021 via email

@smuzaffar (Contributor)

@qliphy, yes, there are ways to do that, e.g. via Configuration/PyReleaseValidation/python where the workflow is defined. There are many workflows which override the default event counts. But I agree with @davidlange6 that we should understand the reason and fix the code to behave properly if an unsuitable number of events is given

@Dominic-Stafford (Contributor) commented Jul 21, 2021

I think this issue is more a physics issue than a bug: asking madgraph for only 2 events at NLO could result in the xs being zero or something similar, which causes the issue with calculating the systematics. Maybe @colizz could comment on whether it would be easy to modify mergeLHE.py to skip LHEs which have problematic systematics, but otherwise I'm not sure what the proper behaviour would be. In practice I think generating 20 events should be enough to avoid the problem, and I would expect any users would want to generate a lot more events than this, so could we just increase the number of events in the test to 20, please?

@colizz (Contributor) commented Jul 21, 2021

> I think this issue is more a physics issue rather than a bug, that asking madgraph for only 2 events at NLO could result in the xs being zero or something similar, which causes the issue with calculating the systematics. Maybe @colizz could comment if it would be easy to modify mergeLHE.py to skip LHEs which have problematic systematics, but otherwise I'm not sure what the proper behaviour would be. In practise I think generating 20 events should be enough to avoid the problem, and I would expect any users would want to generate a lot more events than this, so could we just increase the number of events in the test to 20, please?

yes, I agree this is more a physics issue of madgraph, and that maybe extending the event number can be a good solution.

I think the message is that we cannot merely request 2 events from this gridpack. Even if we run in one thread, for 2 events, the LHE production still fails (though silently). The LHE is then problematic and will cause a buggy EDM-format output. Therefore I would not suggest modifying mergeLHE.py, as that would make the error silent again.
As @smuzaffar mentioned, there are other workflows that override the event number, so maybe we can do the same here.

@davidlange6 (Contributor)

At least for me, running 1000 events fails too. So there is a problem to fix that has nothing to do with a small number of processed events.

The first error looks to be caused by the thread0 LHE file having its <init> tag always preceded by some whitespace, which the merging script does not expect. Always thread0, and only thread0. That's easy to fix in the parsing. Then the next error is that the header line counts are different. That's really true (by 1000+ lines of difference). Again, thread0 is different from the rest.

Anyway easy to reproduce

@agrohsje

@Dominic-Stafford is right. It is not a bug; we made an unfortunate choice. The systematics module of MG5 prints a summary of the cross-section change. Now if at NLO one event has weight +w and the other has weight -w, the cross section is 0.0 and the summary calculation involves dividing by 0. If we pick 3 events per thread the problem is solved. One per thread is also OK; just 2 is bad.
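A toy illustration of this failure mode, with invented weights and a hypothetical summary function (the real summary is printed by MG5's systematics module):

```python
def summary_ratio(weights, scale=1.05):
    # A relative-change summary divides the varied cross section by the
    # nominal one; with exactly +w and -w the nominal sum is zero.
    nominal = sum(weights)
    varied = sum(w * scale for w in weights)
    return varied / nominal  # ZeroDivisionError when nominal == 0

print(summary_ratio([+1.7, -1.7, +2.1]))  # three events: no exact cancellation
try:
    summary_ratio([+1.7, -1.7])           # two events with canceling weights
except ZeroDivisionError:
    print("summary fails: nominal cross section is exactly 0")
```

This is why an odd number of events per thread sidesteps the problem: two weights of equal magnitude and opposite sign can cancel exactly, but an odd count cannot sum to zero that way.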

@davidlange6 (Contributor)

so how many events does it take to get three per thread?

@davidlange6 (Contributor)

(with 4 threads)

@davidlange6 (Contributor)

one might think that 4*3+some margin would be sufficient....

@agrohsje

Hi @davidlange6, it is really the cancellation for two events. With an odd number of events per thread this will not happen, so 4*3 is perfectly fine.

@davidlange6 (Contributor)

So is it really just me for whom this workflow crashes with 1000 events and 4 threads, or is this a speculative statement? I'd be happy to have screwed up the configuration somehow.

@Dr15Jones (Contributor)

> it is really the canceling for two events. with an odd number of events per thread this will not happen. so 4*3 is perfectly fine.

Running 4 threads (really 4 streams) over 12 events in no way guarantees 3 events per thread. The events are assigned on a first-come, first-served basis, so it is very easy for the number of events to be different per stream, especially for small numbers of events.

@agrohsje

@davidlange6 do you still have the output line?
The crash above is from 2 events exactly canceling in the sum of weights, resulting in 0 cross section. The same could happen if a single thread has a very low, even number of events. @Dr15Jones, taking into account what you said, is there a way we can avoid this, or do we need to run 1 thread? Is there a way to make sure each thread has sufficient events?

@agrohsje

An alternative to a single thread would be modifying the python code in the corresponding gridpack and adding a patch to genproductions for systematics.py.

@davidlange6 (Contributor)

Hmm, I started over and can't reproduce my problem. Though, that said, this does not give exit code 0:

runTheMatrix.py -l 537 -t 4 --command "-n 1000"

@colizz (Contributor) commented Jul 27, 2021

> it is really the canceling for two events. with an odd number of events per thread this will not happen. so 4*3 is perfectly fine.

> Running 4 threads (really 4 streams) over 12 threads in no way guarantees 3 events per thread. The events are assigned on a first come first serve basis so it is very each for the number of events to be different per stream, especially for small numbers of events.

Thanks, but I was confused. Reading the code:

auto const nEventsAve = nEvents_ / nThreads_;

the number of LHE events assigned per thread is obtained by averaging the total requested event number.

Ah, I now understand: you mean that in the later processing of events the first-come-first-served principle is followed. Since here we are generating LHE events from scratch, the number of events assigned per thread is given by hand (and here, by average).

@agrohsje

@davidlange6
cat runall-report-step123-.log
2021; exit: 0 0
1 1 tests passed, 0 0 failed

@Dr15Jones @colizz in this case there is no problem, as the systematics module runs in the LHE step, so 3 events per thread avoid the problem.

However, about first come, first served: I see in the logs:

EGPtr::shoot() thrown an exception, event skipped: More events requested than available in FxFxReader FxFxLHReader

@Dominic-Stafford @theofil won't this cause trouble at some point?

@Dominic-Stafford (Contributor)

> EGPtr::shoot() thrown an exception, event skipped: More events requested than available in FxFxReader FxFxLHReader
> @Dominic-Stafford @theofil this will not cause trouble at some point ?

This is just because Herwig silently skips events which do not pass the merging, so at the end it throws this exception because it has reached the end of the LHE file before generating all the events CMSSW asked for. It is related to the fact that the LHE event information doesn't line up with the generated events, and so is on our to-do list, but it is not relevant to this issue.

@agrohsje

@Dominic-Stafford ok, that was my understanding.
@davidlange6 any additional feedback on your job output? My test worked well.
If there is no further follow-up, I suggest going with an odd number of events per thread.

@theofil (Contributor) commented Aug 3, 2021

@davidlange6 @Dr15Jones @smuzaffar @colizz @agrohsje @Dominic-Stafford

Would you change the number of events per thread to an odd number, or how do we proceed?

@agrohsje commented Aug 3, 2021

+generators

@cmsbuild (Contributor) commented Aug 3, 2021

This issue is fully signed and ready to be closed.

@smuzaffar (Contributor) commented Aug 3, 2021

Looks like the mergeLHE.py script searches for <init> and </init> at the start of the line:
https://github.com/cms-sw/cmssw/blob/master/GeneratorInterface/LHEInterface/scripts/mergeLHE.py#L199
https://github.com/cms-sw/cmssw/blob/master/GeneratorInterface/LHEInterface/scripts/mergeLHE.py#L210
while in the thread*/cmsgrid_final.lhe files I see that there are lines with leading spaces, e.g.:

>grep '<\(\/\|\)init>' thread*/cmsgrid_final.lhe
thread0/cmsgrid_final.lhe:  <init>
thread0/cmsgrid_final.lhe:  </init>
thread1/cmsgrid_final.lhe:<init>
thread1/cmsgrid_final.lhe:</init>
thread2/cmsgrid_final.lhe:<init>
thread2/cmsgrid_final.lhe:</init>
thread3/cmsgrid_final.lhe:<init>
thread3/cmsgrid_final.lhe:</init>

could this be causing mergeLHE.py to fail?
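A strict start-of-line comparison would indeed miss the indented tags shown in the grep output above. A small sketch of the difference (illustrative only, not the actual mergeLHE.py code):

```python
lines = ["  <init>\n", "<init>\n"]  # thread0-style (indented) vs normal

strict = [l for l in lines if l.startswith("<init>")]          # misses the indented tag
lenient = [l for l in lines if l.strip().startswith("<init>")]  # finds both

print(len(strict), len(lenient))  # -> 1 2
```

Stripping leading whitespace before comparing would make the merge tolerant of both file variants.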

@smuzaffar (Contributor)

by the way you can download the thread*/cmsgrid_final.lhe files from https://cern.ch/muzaffar/537_step1.tgz

@colizz (Contributor) commented Aug 3, 2021

> Looks like mergeLHE.py script search for <init> and </init> at the start of the line
> https://github.com/cms-sw/cmssw/blob/master/GeneratorInterface/LHEInterface/scripts/mergeLHE.py#L199
> https://github.com/cms-sw/cmssw/blob/master/GeneratorInterface/LHEInterface/scripts/mergeLHE.py#L210
> while in the thread*/cmsgrid_final.lhe files I see that there are line with starting spaces in them e.g.
>
> >grep '<\(\/\|\)init>' thread*/cmsgrid_final.lhe
> thread0/cmsgrid_final.lhe:  <init>
> thread0/cmsgrid_final.lhe:  </init>
> thread1/cmsgrid_final.lhe:<init>
> thread1/cmsgrid_final.lhe:</init>
> thread2/cmsgrid_final.lhe:<init>
> thread2/cmsgrid_final.lhe:</init>
> thread3/cmsgrid_final.lhe:<init>
> thread3/cmsgrid_final.lhe:</init>
>
> could this be causing mergeLHE.py to fail?

Hi @smuzaffar, thanks for noticing this! Looking into the code, I agree that the real reason for the failure is the extra spaces you mention.
The extra spaces before <init> are tricky; we haven't met this issue in the other situations we test (we validate over all LHE cases, collected in this repo: https://github.com/colizz/mergelhe_validate).
As a further investigation: the LHE before running systematics (https://github.com/cms-sw/genproductions/blob/5b8972e680e5bc0d5ce72e590e3674258ce59389/bin/MadGraph5_aMCatNLO/runcmsgrid_NLO.sh#L137) DOES have these extra spaces before <init> and </init>. After this step, the new LHE has no spaces. Since thread0 had a problem in the systematics step, its resulting LHE file kept these spaces.

Though these spaces won't trigger problems in any normal routine, I think for safety we should adjust the code.

@agrohsje commented Aug 3, 2021

@smuzaffar: let me add: MC@NLO generates a file called cmsgrid_final.lhe. The systematics module crashes when printing the summary statistics (1/(+w - w) = 1/0). So instead of four files with proper systematics weights, mergeLHE.py gets the output from systematics three times and the raw MC@NLO output once; this is what causes the difference you are reporting.
@colizz still, it might make sense to make the parsing more flexible. In general, danyer perez is implementing proper exit codes so that in cases like the above we get a proper crash, with no output that gets used by mistake.

@qliphy closed this as completed Aug 19, 2021
@mrodozov (Contributor) commented Aug 19, 2021

Dears, I was trying to figure out why 538.0 is failing in gcc10, then (today) I noticed the failure for 537.0 in gcc10 is the same as for 538.0.

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc10/CMSSW_12_1_X_2021-08-18-2300/pyRelValMatrixLogs/run/537.0_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN/step1_DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+DYToLL012Jets_5FS_TuneCH3_13TeV_amcatnloFxFx_herwig7+HARVESTGEN.log#/

Also, I cannot find where/when 538.0 was added, and there is no record for it before July 16

https://tinyurl.com/vh594azb

I have two questions: do you expect the 537.0 failure in gcc10 to be fixed by #34861, and
do the gridpacks used require an update for slc7, since the ones used now are for slc6_amd64_gcc630, which I see work fine for slc7_amd64_gcc900?

@Dominic-Stafford (Contributor)

Workflows 537 and 538 were added here: #34463. The issue you've reported looks like it comes from ThePEG failing to read in the LHE files, so I wouldn't expect it to be fixed by #34861, though it may just be because we're using old gridpacks as you suggest; I'll look into this.

@mrodozov (Contributor) commented Aug 19, 2021

> Workflows 537 and 538 were added here: #34463. The issue you've reported looks like it comes from ThePEG failing to read in the lhe files, so I wouldn't expect it to be fixed by 34861,

Using older gridpacks begs the question of how the program runs when compiled with gcc900 but not with gcc10.
And yes, it looks like ThePEG is failing on line 115, reading something through an uninitialized (or already released, with the same outcome) pointer.

> though it may just be because we're using old gridpacks as you suggest- I'll look into this.

Thank you. For any external change (assuming you need to patch ThePEG), do not hesitate to tell me, although I noticed you know how to patch and test it yourself.

@Dominic-Stafford (Contributor)

I've done some tests, and I think this is simply a bug with the pointer not being created in ThePEG for gcc10, so the old gridpacks aren't the problem. I'll email the Herwig authors about this and ask if they can provide a patch.
