
Depending on jobs that may fail, but that's expected. #713

Open
bloodearnest opened this issue Mar 7, 2024 · 4 comments

Comments

@bloodearnest
Member

The folks from Bristol have a common pattern where they run a lot of parameterised models in parallel. They expect some of these models to fail, either because there are not enough events or because the model fails to converge. In their case this is actually fine: it's part of what they are researching.

They have a final action that consumes all the outputs of the models, but if any of them has failed, this action won't run. So after running the parameterised model actions, they manually check which actions succeeded, and modify and commit project.yaml so that the final action depends only on the ones that succeeded, allowing them to submit and run it.

This manual step is laborious, especially as there can be many actions, and visually inspecting their statuses is time-consuming.

Potential solutions

They have requested a structured way to interrogate a workspace for the latest status of its actions, so they can automate the modification of the dependencies in project.yaml (e.g. a downloadable JSON file, or similar, of actions and their success/failure). Whilst this would make the manual step much easier, it doesn't remove the need to modify the final action's dependencies.
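As an illustration only, here is a minimal sketch of how such a file might be consumed to regenerate the final action's `needs` list. The filename `workspace_action_statuses.json`, the JSON schema, and the `model_` prefix on action names are all hypothetical assumptions; no such file exists today.

```python
# Hypothetical sketch: consume a downloaded per-workspace status file and
# emit a `needs` list for the final action. The filename, the schema and
# the "model_" action-name prefix are assumptions, not an existing API.
import json
from pathlib import Path

statuses = json.loads(Path("workspace_action_statuses.json").read_text())
# e.g. {"model_1": "succeeded", "model_2": "failed", ...}

succeeded = sorted(
    name for name, status in statuses.items()
    if name.startswith("model_") and status == "succeeded"
)

# Print a `needs` block to paste into the final action in project.yaml.
print("    needs:")
for name in succeeded:
    print(f"      - {name}")
```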

As a workaround, Will suggested they may be able to wrap their model code in error handling so that the action always succeeds, and use the contents of the output file to indicate whether the model materially succeeded or not. This means they wouldn't need to modify their project.yaml, but it does obscure whether a particular model succeeded, both in production and locally.
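For concreteness, here is a minimal sketch of that kind of wrapper, assuming a placeholder fit_model() and an illustrative output path; the real model code might equally be R or Stata, and the shape of the output file is entirely up to the researchers.

```python
# Sketch of Will's workaround: catch fitting errors so the action always
# exits 0, and record the real outcome inside the output file instead.
import json
from pathlib import Path

OUTPUT = Path("output/model_results.json")  # illustrative path


def fit_model():
    """Placeholder for the real model-fitting code, which may raise when
    there are too few events or the model fails to converge."""
    raise RuntimeError("model failed to converge")


def main():
    try:
        results = fit_model()
        payload = {"status": "converged", "results": results}
    except Exception as exc:
        # Swallow the failure so the action still succeeds from the
        # pipeline's point of view; the status field records what happened.
        payload = {"status": "failed", "error": str(exc)}
    OUTPUT.parent.mkdir(parents=True, exist_ok=True)
    OUTPUT.write_text(json.dumps(payload))


if __name__ == "__main__":
    main()
```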

Another option would be to provide a way to mark an action as being allowed to fail, i.e. its failure would not block dependent actions from running, but it would produce no outputs. Downstream actions could then use the presence or absence of the expected output files as a signal of whether the model was successful. This is a more general solution that may have other use cases too.
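Under that scheme, a downstream action might look something like the sketch below, assuming (purely for illustration) that each model action writes output/model_&lt;n&gt;.csv when it succeeds and that there are 20 of them.

```python
# Sketch of a downstream action that tolerates missing inputs: a missing
# file signals that the corresponding allowed-to-fail model action failed.
# The paths and the 20-model range are illustrative assumptions.
from pathlib import Path

import pandas as pd

available = []
for n in range(1, 21):
    path = Path(f"output/model_{n}.csv")
    if path.exists():
        available.append(pd.read_csv(path).assign(model=n))

# Assumes at least one model succeeded.
combined = pd.concat(available, ignore_index=True)
combined.to_csv("output/combined_results.csv", index=False)
```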

@evansd
Contributor

evansd commented Mar 7, 2024

This is fascinating, from the point of view of people finding unexpected ways to use a system!

My feeling (at the moment) is quite strongly that Will's suggestion is the correct one here: these jobs did not fail – they were successfully completed experiments as to whether a given model would converge. So I don't think we should be twisting the semantics of success/failure here.

In fact, in a funny way this is actually exploiting the theoretical one-bit-per-job side-channel data exfiltration route that we've discussed. You shouldn't really be able to learn anything meaningful about the results of your analysis without logging into L4. (I'm not suggesting there's anything actually wrong with what they're doing here – just noting the amusing parallel.)

I'm sure, though, there are things we can do here to improve the ergonomics and surface these statuses more easily, both locally and possibly through Airlock.

@LFISHER7
Contributor

LFISHER7 commented Mar 7, 2024

I agree with Dave that Will's suggestion is the right way to do things. These jobs failed because the script didn't handle errors that you might expect when fitting these models, though I accept that handling those errors is a little tedious and deviates from current ways of working.

What we need is better experimentation tracking. This would require some rethinking of how outputs are captured in OpenSAFELY. DVC experiment pipelines seem like the kind of thing we'd want.

@evansd
Contributor

evansd commented Mar 7, 2024

Ah, super useful to see prior art on this. Thanks Louis!

@iaindillingham
Member

Thanks for writing this up, @bloodearnest, and for your comments @evansd and @LFISHER7. We discussed the issue in this week's planning meeting and decided to move it from "triage" to "later". It's unrelated to our current initiative, and, having spoken to @CLStables, we feel there's some discovery work to do around experiments. I'll let Bristol know in the #opensafely-users channel.1

Footnotes

  1. https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1709812104313469
