Unexpected behavior when exception thrown at begin Run #42501

Dr15Jones · 2023-08-07T19:15:54Z

A crash in the IB occurred because module A threw an exception in begin Run and module B depended on the data product from A. Because of the exception, the framework did not run module B for the begin Run. As part of the shutdown from the exception, the framework ran the end Run transition. As part of that transition, it ran the end Run for module B, even though begin Run was never called for that module. The module expected end Run to be called only if begin Run had been called for it, and because that didn't happen, the module had a segmentation fault.

The intent of the framework was to not call an 'end' transition of a module that never saw the associated 'begin' transition. This is clearly not happening.

There are three possible behaviors:

1. Make no change and say it is possible for 'end' to be called without 'begin', and all modules must be OK with that.
2. If an exception happens during beginRun/beginLuminosityBlock in a module (module A above), then all dependent modules (module B above) should still be run.
3. The framework should guarantee that a module will be run in 'end' if and only if it ran in 'begin'.
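To make the failure mode concrete, here is a minimal sketch of what the first option would require of module authors. `RunSummaryModule` and its methods are hypothetical stand-ins for a module like module B; they are not CMSSW classes, and the real module's state is not known from this report. The point is only that end Run has to check for state that begin Run would normally have created.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for "module B"; not a real CMSSW module type.
class RunSummaryModule {
public:
  // Allocates the per-run state. In the scenario above this call is skipped.
  void beginRun() { values_ = std::make_unique<std::vector<int>>(); }

  // Records a value only if per-run state exists.
  void record(int v) {
    if (values_) {
      values_->push_back(v);
    }
  }

  // Returns false instead of dereferencing a null pointer when beginRun
  // never ran for this run. Without the check, the dereference below would
  // be the kind of access that ends in a segmentation fault.
  bool endRun() {
    if (!values_) {
      return false;
    }
    lastCount_ = values_->size();
    values_.reset();
    return true;
  }

  std::size_t lastCount() const { return lastCount_; }

private:
  std::unique_ptr<std::vector<int>> values_;
  std::size_t lastCount_ = 0;
};

// Usage: end without begin is tolerated; the normal sequence still works.
inline void exampleUsage() {
  RunSummaryModule module;
  assert(!module.endRun());  // begin never ran, so end is a no-op
  module.beginRun();
  module.record(7);
  assert(module.endRun());
  assert(module.lastCount() == 1);
}
```

Every module holding per-run state would need a guard like this under the first option, which is the cost the third option avoids by moving the check into the framework.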

Dr15Jones · 2023-08-07T19:16:00Z

assign core

cmsbuild · 2023-08-07T19:16:13Z

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2023-08-07T19:16:15Z

A new Issue was created by @Dr15Jones Chris Jones.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel · 2023-08-07T20:07:28Z

On a cursory thought, option 3 would feel most natural to me.

I would not do option 2, as then the Run/Lumi behavior for dependent modules (module B in the description) would be different from Event behavior.

makortel · 2023-08-07T20:07:41Z

@wddgit Any thoughts?

wddgit · 2023-08-08T20:53:53Z

Chris's explanation sounds correct. It is consistent with the current Framework code.

I rewrote this part of the code when I implemented concurrent runs, but I think in this case I simply copied the behavior from what existed before. Here is what I have in my notes: "If globalBeginRun fails, then writeRunAsync should not run, streamEndRun should not run, but globalEndRun is attempted!" And I think this referred to the entire transition for all modules, not separately for each module.

There is not and never was any mechanism to track whether each particular module successfully completed globalBeginRun. If any module throws an exception in global begin run, there is a single bool for all modules that records that an exception occurred in that transition, not one per module.
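As a rough illustration of the behavior described in these notes, the decision can be written as a function of a single flag. This is a sketch of my reading of the quoted note, not the Framework code; `RunCleanupPlan` and `planAfterGlobalBeginRun` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical summary of which run-level steps are attempted after
// globalBeginRun. The field names mirror the steps named in the note.
struct RunCleanupPlan {
  bool streamEndRun;
  bool writeRun;
  bool globalEndRun;
};

// One flag covers the whole transition: it records that some module threw,
// not which one. The plan therefore cannot differ from module to module.
inline RunCleanupPlan planAfterGlobalBeginRun(bool anyModuleThrew) {
  RunCleanupPlan plan{};
  plan.streamEndRun = !anyModuleThrew;
  plan.writeRun = !anyModuleThrew;
  plan.globalEndRun = true;  // attempted regardless, for all modules
  return plan;
}

// Usage: a failed globalBeginRun still leads to globalEndRun.
inline void exampleUsage() {
  RunCleanupPlan plan = planAfterGlobalBeginRun(true);
  assert(plan.globalEndRun);
  assert(!plan.streamEndRun);
  assert(!plan.writeRun);
}
```

Because the input is one bool for the transition, a module that never saw globalBeginRun is treated the same as one that completed it.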

If there was any intent to handle this differently, I don't think it was ever implemented. I don't think this is a situation where there is a bug and the code is not behaving as intended.

That said, we could change the behavior. I don't have a strong opinion one way or the other. We could handle this situation in the module or the Framework. If it happens often, it would be easier to handle in the Framework instead of reimplementing in multiple modules. On the other hand, this is the first time I remember seeing this problem...

makortel · 2023-08-14T19:11:20Z

Summarizing a discussion between myself, @Dr15Jones, @wddgit, and @dan131riley

We agreed we would change the behavior to option 3
All transitions should behave consistently
Maybe a bool in the Worker or something could do the job?
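A minimal sketch of the "bool in the Worker" idea, under stated assumptions: the `Worker` struct below is a simplified, hypothetical stand-in and not the framework's Worker class; workers run sequentially, with list order standing in for data dependencies; and a begin that throws is counted as not having run.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a framework Worker.
struct Worker {
  Worker(std::string l, std::function<void()> b, std::function<void()> e)
      : label(std::move(l)), begin(std::move(b)), end(std::move(e)) {}

  std::string label;
  std::function<void()> begin;
  std::function<void()> end;
  bool beginSucceeded = false;  // the per-module flag
};

// Runs begin for each worker in order and stops at the first exception,
// so later workers never see begin. Returns true if every begin completed.
inline bool runBeginTransition(std::vector<Worker>& workers) {
  for (auto& w : workers) {
    w.beginSucceeded = false;  // clear any state left from an earlier run
  }
  for (auto& w : workers) {
    try {
      w.begin();
      w.beginSucceeded = true;
    } catch (std::exception const&) {
      return false;
    }
  }
  return true;
}

// Runs end only for workers whose begin completed. Returns the labels of
// the workers that were run, in order.
inline std::vector<std::string> runEndTransition(std::vector<Worker>& workers) {
  std::vector<std::string> ran;
  for (auto& w : workers) {
    if (!w.beginSucceeded) {
      continue;
    }
    w.end();
    w.beginSucceeded = false;
    ran.push_back(w.label);
  }
  return ran;
}

// Usage: A throws in begin, so B never sees begin and is skipped in end.
inline void exampleUsage() {
  std::vector<Worker> workers;
  workers.emplace_back("A", [] { throw std::runtime_error("begin failed"); }, [] {});
  workers.emplace_back("B", [] {}, [] {});
  assert(!runBeginTransition(workers));
  assert(runEndTransition(workers).empty());
}
```

Keeping the flag on the worker means the same check can be reused for each begin/end pair, which fits the goal that all transitions behave consistently. A real implementation would also have to deal with concurrent transitions, which this sketch ignores.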

cmsbuild added core-pending pending-signatures labels Aug 7, 2023

makortel mentioned this issue Aug 14, 2023

Make the framework guarantee that an end-transition function of a module is run if and only if the corresponding begin-transition function has run cms-sw/framework-team#637

Closed

makortel mentioned this issue Mar 4, 2024

Improve exception handling in endStream #43831

Open

wddgit mentioned this issue Apr 4, 2024

Improve behavior after exception in begin/end stream lumi #44624

Merged

wddgit mentioned this issue Apr 24, 2024

Improve behavior after exception in begin/end global lumi #44840

Merged

wddgit mentioned this issue May 22, 2024

Improve behavior after exception in begin/end run transitions #45017

Merged

wddgit mentioned this issue Jul 11, 2024

Improve Framework behavior after exceptions in begin/end transitions (Job, Stream, ProcessBlock) #45434

Merged

makortel mentioned this issue Jul 19, 2024

Improve exception behavior in endStream, and wrt end transitions cms-sw/framework-team#844

Closed

5 tasks
