Exception in ExternalLHEProducer after merging #40939 #41230

Open
perrotta opened this issue Mar 30, 2023 · 12 comments
Comments

@perrotta
Contributor

Since #40939 was merged, several workflows have been crashing in the IBs with the following error message (e.g. from wf 512.0):

[INFO] MG5 LO LHE with event_norm = sum detected. Will recalculate weights in each event block.
Unit weight: +8.6690076E+02
Traceback (most recent call last):
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-29-2300/bin/el8_amd64_gcc11/mergeLHE.py", line 429, in <module>
    main()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-29-2300/bin/el8_amd64_gcc11/mergeLHE.py", line 425, in main
    lhe_merger.merge()
  File "/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw/CMSSW_13_1_X_2023-03-29-2300/bin/el8_amd64_gcc11/mergeLHE.py", line 264, in merge
    orig_wgt = float(line.split()[2])
IndexError: list index out of range
%MSG-e ExcessiveTime:  ExternalLHEProducer:externalLHEProducer@beginRun  30-Mar-2023 05:25:07 CEST Run: 1
ExcessiveTime: Module used 1978.84 seconds of time which exceeds the error threshold configured in the Timing Service of 600 seconds.
%MSG
----- Begin Fatal Exception 30-Mar-2023 05:25:07 CEST-----------------------
An exception of category 'ExternalLHEProducer' occurred while
   [0] Processing global begin Run run: 1
   [1] Calling method for module ExternalLHEProducer/'externalLHEProducer'
Exception Message:
Child failed with exit code 1.
----- End Fatal Exception -------------------------------------------------

Please note that the very same error appears regardless of whether cms-sw/cmsdist#8409 is merged: I forgot it for CMSSW_13_1_X_2023-03-29-1100 and merged it only later, for CMSSW_13_1_X_2023-03-29-2300, yet both IBs crash with the same error message.

Author @Dominic-Stafford has been informed.

A fix is needed; otherwise we will eventually be forced to revert PR #40939 and the accompanying cmsdist one.

@perrotta
Contributor Author

assign generators

@cmsbuild
Contributor

A new Issue was created by @perrotta Andrea Perrotta.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@cmsbuild
Contributor

New categories assigned: generators

@mkirsano,@menglu21,@alberto-sanchez,@SiewYan,@GurpreetSinghChahal,@Saptaparna you have been requested to review this Pull request/Issue and eventually sign? Thanks

@Dominic-Stafford
Contributor

Hi, sorry. This issue appears to be caused by the regex on this line: https://github.com/cms-sw/cmssw/blob/master/GeneratorInterface/LHEInterface/scripts/mergeLHE.py#L259. It also catches the new LHE <event_num> tag, so the script expects the line after this tag to be a new event. It seems we didn't catch this because my local tests, and the tests we did before merging the PR, only ran on one core. I'll make a PR to fix this.
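
To illustrate the failure mode described in this comment, here is a minimal sketch. It is not the actual code in mergeLHE.py: the sample event, the layout of the `<event_num>` line, both regex patterns, and the helper `header_weights` are assumptions made up for illustration. It only shows how a pattern that is too permissive can treat `<event_num>` as the start of an event, so that `split()[2]` on the following line raises the `IndexError` seen in the traceback.

```python
import re

# Hypothetical LHE fragment. The exact layout of the <event_num> line
# is an assumption, not taken from real generator output.
SAMPLE = """<event>
 5 1 +8.6690076e+02 1.0e+02 7.5e-03 1.2e-01
<event_num> 42 </event_num>
</event>
"""

# Assumed patterns, for illustration only.
LOOSE = re.compile(r"<event")         # also matches "<event_num>"
STRICT = re.compile(r"<event[\s>]")   # matches "<event>" or "<event attr=...>" only


def header_weights(text, pattern):
    """Return the weight (third field) of the line following each match."""
    lines = text.splitlines()
    weights = []
    for i, line in enumerate(lines[:-1]):
        if pattern.match(line.strip()):
            # The line after an event tag is assumed to be the event header.
            weights.append(float(lines[i + 1].split()[2]))
    return weights


if __name__ == "__main__":
    try:
        header_weights(SAMPLE, LOOSE)
    except IndexError as err:
        print("loose pattern fails:", err)
    print("strict pattern weights:", header_weights(SAMPLE, STRICT))
```

With the loose pattern, the line after `<event_num>` is `</event>`, which has a single field, so indexing field 2 fails. Whether a stricter pattern like the one above is the right fix for mergeLHE.py depends on how the real script handles the tag.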

@Dominic-Stafford
Contributor

Ah, it seems there is a further issue only for the Herwig workflows that the order in which the lhe numbers are read is different than I would expect. I'll need to explore it some more, but I think this could take a little while to resolve

@perrotta
Contributor Author

> Ah, it seems there is a further issue only for the Herwig workflows that the order in which the lhe numbers are read is different than I would expect. I'll need to explore it some more, but I think this could take a little while to resolve

Thank you Dominic.
Would you suggest reverting the already merged PR, so that you can provide a fixed one once it is ready? Or do you think you can provide a fix within a few days, and would prefer to leave the buggy PR in the IBs (bearing in mind that the Herwig-related workflows will keep failing in the IB tests in the meantime)?

@Dominic-Stafford
Contributor

Yes, since I don't have an estimate for how long this will take to fix, it's probably best to revert the PR, if that's relatively easy to do on your side. I'll then make a new PR once I have a fix.

@perrotta
Contributor Author

I have prepared two PRs to revert #40939 (reverted by #41237) and cms-sw/cmsdist#8349 (reverted by cms-sw/cmsdist#8417). They can be merged once tested successfully in the IB.

@Dominic-Stafford
Contributor

Thank you, and sorry for not catching these earlier

@perrotta
Contributor Author

PRs reverted for next CMSSW_13_1_X_2023-03-31-1100

@sunilUIET
Contributor

Hi @perrotta

A similar issue is also observed for Summer22 production (with CMSSW_12_4_11_patch3) for the MCFM samples i.e.

https://cms-unified.web.cern.ch/cms-unified/showlog/?search=task_HIG-Run3Summer22EEwmLHEGS-00173

@makortel
Contributor

> A similar issue is also observed for Summer22 production (with CMSSW_12_4_11_patch3) for the MCFM samples i.e.
>
> https://cms-unified.web.cern.ch/cms-unified/showlog/?search=task_HIG-Run3Summer22EEwmLHEGS-00173

@sunilUIET Please open a new issue as the cause is very likely different (ExternalLHEProducer failing with "child failed with exit code 1" is a very generic error message).
