Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behavior when exception thrown at begin Run #42501

Open
Dr15Jones opened this issue Aug 7, 2023 · 7 comments
Open

Unexpected behavior when exception thrown at begin Run #42501

Dr15Jones opened this issue Aug 7, 2023 · 7 comments

Comments

@Dr15Jones
Copy link
Contributor

A crash in the IB was caused because a module A threw an exception in begin Run and module B depended on the data product from A. Because of the exception, the framework did not run module A for the begin Run. As part of the shutdown from the exception, the framework ran the end Run transition. As part of the transition, it ran the end Run for module B, even though begin Run was never called for that module. The module expected end Run to only be called if begin Run had been called for the module and because that didn't happen, the module had a segmentation fault.

The intent of the framework was to not call an 'end' transition of a module never saw the associated 'begin' transition. This is clearly not happening.

There are three possible behaviors

  1. make no change and say it is possible for 'end' to be called without 'begin' and all modules must be OK with that.
  2. if an exception happens during beginRun/beingLuminosityBlock in a module (module A above) than all dependent modules (module B above) should still be run
  3. the framework should guarantee that if and only if a module ran in 'begin' will it be run in 'end'.
@Dr15Jones
Copy link
Contributor Author

assign core

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 7, 2023

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 7, 2023

A new Issue was created by @Dr15Jones Chris Jones.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

makortel commented Aug 7, 2023

On a cursory thought the option 3 would feel most natural to me.

I would not do option 2, as then the Run/Lumi behavior for dependent (module B in the description) modules would be different from Event behavior.

@makortel
Copy link
Contributor

makortel commented Aug 7, 2023

@wddgit Any thoughts?

@wddgit
Copy link
Contributor

wddgit commented Aug 8, 2023

Chris's explanation sounds correct. It is consistent with the current Framework code.

I rewrote this part of the code when I implemented concurrent runs, but I think in this case I simply copied the behavior from what existed before. Here is what I have in my notes: "If globalBeginRun fails, then writeRunAsync should not run, streamEndRun should not run, but globalEndRun is attempted!" And I think this referred to the entire transition for all modules, not separately for each module.

There is not and never was any mechanism to track whether each particular module successfully completed globalBeginRun. If any module throws an exception in global begin run, there is a single bool for all modules that records that an exception occurred in that transition, not one per module.

If there was there was any intent to handle this differently, I don't think it was ever implemented. I don't think this is a situation where there is a bug and the code is not behaving as intended.

That said, we could change the behavior. I don't have a strong opinion one way or the other. We could handle this situation in the module or the Framework. If it happens often, it would be easier to handle in the Framework instead of reimplementing in multiple modules. On the other hand, this is the first time I remembering seeing this problem...

@makortel
Copy link
Contributor

Summarizing a discussion between myself, @Dr15Jones, @wddgit, and @dan131riley

  • We agreed we would change the behavior to the option 3
  • All transitions should behave consistently
  • Maybe a bool in the Worker or something could do the job?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants