
Depending on jobs that may fail, but that's expected. #713

Open
bloodearnest opened this issue Mar 7, 2024 · 4 comments

Comments

@bloodearnest
Member

The folks from Bristol have a common pattern where they run a lot of parameterised models in parallel. They expect some of these models to fail, either because there are not enough events or because the model fails to converge. In their case this is actually fine: it's part of what they are researching.

They have a final action that consumes all the outputs of the models, but if any of them has failed, this action won't run. So after running the parameterised model actions, they manually check which actions succeeded, and modify and commit project.yaml so that the final action depends only on the ones that succeeded, allowing them to submit and run it.

This manual step is laborious, especially as there can be many actions, and visually inspecting their statuses is time-consuming.

Potential solutions

They have requested a structured way to interrogate a workspace for the latest status of its actions, so they can automate the modification of the dependencies in project.yaml (e.g. a downloadable JSON file, or similar, of actions and their success/failure). Whilst this would make the manual step much easier, it doesn't remove the need to modify the final action's dependencies.
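As an illustration only, here is a minimal sketch of how such a file might be consumed to regenerate the final action's `needs` list. The filename `workspace_action_statuses.json`, the JSON schema, and the `model_` prefix on action names are all hypothetical assumptions; no such file exists today.

```python
# Hypothetical sketch: consume a downloaded per-workspace status file and
# emit a `needs` list for the final action. The filename, the schema and
# the "model_" action-name prefix are assumptions, not an existing API.
import json
from pathlib import Path

statuses = json.loads(Path("workspace_action_statuses.json").read_text())
# e.g. {"model_1": "succeeded", "model_2": "failed", ...}

succeeded = sorted(
    name for name, status in statuses.items()
    if name.startswith("model_") and status == "succeeded"
)

# Print a `needs` block to paste into the final action in project.yaml.
print("    needs:")
for name in succeeded:
    print(f"      - {name}")
```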

As a workaround, Will suggested they may be able to wrap their model code in error handling so that the action always succeeds, and use the contents of the output file to indicate whether the model materially succeeded or not. This means they wouldn't need to modify their project.yaml, but it does obscure whether a particular model succeeded, both in production and locally.
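For concreteness, here is a minimal sketch of that kind of wrapper, assuming a placeholder fit_model() and an illustrative output path; the real model code might equally be R or Stata, and the shape of the output file is entirely up to the researchers.

```python
# Sketch of Will's workaround: catch fitting errors so the action always
# exits 0, and record the real outcome inside the output file instead.
import json
from pathlib import Path

OUTPUT = Path("output/model_results.json")  # illustrative path


def fit_model():
    """Placeholder for the real model-fitting code, which may raise when
    there are too few events or the model fails to converge."""
    raise RuntimeError("model failed to converge")


def main():
    try:
        results = fit_model()
        payload = {"status": "converged", "results": results}
    except Exception as exc:
        # Swallow the failure so the action still succeeds from the
        # pipeline's point of view; the status field records what happened.
        payload = {"status": "failed", "error": str(exc)}
    OUTPUT.parent.mkdir(parents=True, exist_ok=True)
    OUTPUT.write_text(json.dumps(payload))


if __name__ == "__main__":
    main()
```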

Another option would be to provide a way to mark an action as being allowed to fail, i.e. its failure would not block dependent actions from running, but it would produce no outputs. Downstream actions could then use the presence or absence of the expected output files as a signal of whether the model was successful. This is a more general solution that may have other use cases too.
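Under that scheme, a downstream action might look something like the sketch below, assuming (purely for illustration) that each model action writes output/model_&lt;n&gt;.csv when it succeeds and that there are 20 of them.

```python
# Sketch of a downstream action that tolerates missing inputs: a missing
# file signals that the corresponding allowed-to-fail model action failed.
# The paths and the 20-model range are illustrative assumptions.
from pathlib import Path

import pandas as pd

available = []
for n in range(1, 21):
    path = Path(f"output/model_{n}.csv")
    if path.exists():
        available.append(pd.read_csv(path).assign(model=n))

# Assumes at least one model succeeded.
combined = pd.concat(available, ignore_index=True)
combined.to_csv("output/combined_results.csv", index=False)
```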

@evansd
Contributor

evansd commented Mar 7, 2024

This is fascinating, from the point of view of people finding unexpected ways to use a system!

My feeling (at the moment) is quite strongly that Will's suggestion is the correct one here: these jobs did not fail – they were successfully completed experiments as to whether a given model would converge. So I don't think we should be twisting the semantics of success/failure here.

In fact, in a funny way this is actually exploiting the theoretical one-bit-per-job side-channel data exfiltration route that we've discussed. You shouldn't really be able to learn anything meaningful about the results of your analysis without logging into L4. (I'm not suggesting there's anything actually wrong with what they're doing here – just noting the amusing parallel.)

I'm sure, though, there are things we can do here to improve the ergonomics and surface these statuses more easily, both locally and possibly through Airlock.

@LFISHER7
Contributor

LFISHER7 commented Mar 7, 2024

I agree with Dave that Will's suggestion is the right way to do things. These jobs failed because the script didn't handle errors that you might expect when fitting these models, though I accept that handling those errors is a little tedious and deviates from current ways of working.

What we need is better experimentation tracking. This would require some rethinking of how outputs are captured in OpenSAFELY. DVC experiment pipelines seem like the kind of thing we'd want.

@evansd
Contributor

evansd commented Mar 7, 2024

Ah, super useful to see prior art on this. Thanks Louis!

@iaindillingham
Member

Thanks for writing this up, @bloodearnest, and for your comments @evansd and @LFISHER7. We discussed the issue in this week's planning meeting and decided to move it from "triage" to "later". It's unrelated to our current initiative, and, having spoken to @CLStables, we feel there's some discovery work to do around experiments. I'll let Bristol know in the #opensafely-users channel.1

Footnotes

  1. https://bennettoxford.slack.com/archives/C01D7H9LYKB/p1709812104313469
