-
Hi @awav, the Hydra docs on the experimental re-run feature are related.
One idea would be to look at the log file produced by the multirun job.
-
Oops, it looks like the
The Let me demonstrate using the rerun example app. First, I'll use
Hydra doesn't have support for wildcards.

    $ echo multirun/2022-12-24/10-54-26/*/.hydra/config.pickle
    multirun/2022-12-24/10-54-26/0/.hydra/config.pickle multirun/2022-12-24/10-54-26/1/.hydra/config.pickle

You could use this shell feature to re-run all the pickle files from a previous multirun sweep:

    $ for pickle in multirun/2022-12-24/10-54-26/*/.hydra/config.pickle; do
        python my_app.py --experimental-rerun $pickle;
    done

You could even combine such shell looping with some way to filter out which previous runs were successful:

    $ for job_dir in multirun/2022-12-24/10-54-26/*; do
        if ! run_was_successful $job_dir; then
            python my_app.py --experimental-rerun $job_dir/.hydra/config.pickle;
        fi
    done

Here
Actually I think there may be a more elegant solution than inspecting log files. The

    >>> import pickle
    >>> job_return = pickle.load(open("multirun/2022-12-24/10-54-26/1/.hydra/job_return.pickle", 'rb'))
    >>> job_return.status
    <JobStatus.COMPLETED: 1>

If the job raised an exception, you should get
EDIT: note that you will not be able to inspect
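To make the filtering idea concrete, here is a minimal Python sketch of a `run_was_successful` check built on the `job_return.pickle` inspection shown above. The function names `run_was_successful` and `pickles_to_rerun` are hypothetical helpers, not part of Hydra. Treating a missing or unreadable `job_return.pickle` as "not successful" is an assumption on my part. Loading a real `job_return.pickle` presumably requires Hydra to be importable in the environment where this runs.

```python
import pickle
from pathlib import Path


def run_was_successful(job_dir):
    """Return True only if <job_dir>/.hydra/job_return.pickle reports COMPLETED.

    Hypothetical helper (not a Hydra API). A missing or unreadable pickle is
    treated as "not successful" -- that choice is an assumption.
    """
    path = Path(job_dir) / ".hydra" / "job_return.pickle"
    try:
        with path.open("rb") as f:
            job_return = pickle.load(f)
    except Exception:
        # No file (job never finished) or the file could not be unpickled.
        return False
    status = getattr(job_return, "status", None)
    # Compare by enum member name so this module does not need to import Hydra itself.
    return getattr(status, "name", None) == "COMPLETED"


def pickles_to_rerun(sweep_dir):
    """List config.pickle paths of jobs in sweep_dir that did not complete."""
    job_dirs = sorted(p for p in Path(sweep_dir).iterdir() if p.is_dir())
    result = []
    for job_dir in job_dirs:
        config_pickle = job_dir / ".hydra" / "config.pickle"
        # Only jobs that left a config.pickle behind can be re-run at all.
        if config_pickle.exists() and not run_was_successful(job_dir):
            result.append(config_pickle)
    return result
```

Each path returned by `pickles_to_rerun("multirun/2022-12-24/10-54-26")` could then be passed to `python my_app.py --experimental-rerun`, one invocation per path, just like the shell loop does. Note that the job directories are sorted as strings, so `10` sorts before `2`.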
Launching a multirun sweep and then using on_job_start to cancel those jobs that have completed successfully in the past? That's a good idea (and I think that idea might work even without using the
By the way: if you use
Anyway, sorry for the long reply. The re-run feature is experimental and best practices are not yet established. I'd be interested to hear about what ends up working for you / what techniques you decide to adopt.
-
Hello all,
I'm interested in the following use case: rerunning unsuccessful or unfinished multi-run jobs. In some situations, rerunning the same multi-run configuration is a useful and necessary feature, e.g. when the multi-run failed because of issues in the code for some configuration settings, or when a user interrupted execution but later decided to continue running the remaining tasks.
I would appreciate it if someone could help me figure out how to rerun only unfinished and unsuccessful jobs with Hydra.