How to rerun unsuccessful or unfinished jobs from multi run? #2518

awav · 2022-12-23T12:13:23Z

Hello all,

I'm interested in the following use case: rerunning unsuccessful or unfinished multi-run jobs. Rerunning the same multi-run configuration is sometimes useful or even necessary, e.g. when the multi-run failed because of bugs in the code for some configuration settings, or when a user interrupted execution and later decided to continue running the remaining tasks.

I would appreciate it if someone could help to figure out how to rerun only unfinished and unsuccessful jobs with hydra.

Jasha10 (Collaborator) · 2022-12-24T10:57:00Z

Hi @awav,

Related are the Hydra docs on the experimental re-run feature.

I would appreciate it if someone could help to figure out how to rerun only unfinished and unsuccessful jobs with hydra.

One idea would be to look at the log file produced by the multirun job.
Perhaps you could log something like "Job finished successfully" after the job is over, and parse those log files to see which jobs did not finish successfully.
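
A minimal Python sketch of that idea. It assumes the app itself logs the line "Job finished successfully" as its last step, that each job writes a *.log file directly inside its job directory, and that job directories have numeric names (0, 1, ...). The function names are made up for illustration; none of this is a Hydra API.

```python
from pathlib import Path
from typing import List

# Assumed marker: the application has to log this line itself when it finishes.
SUCCESS_MARKER = "Job finished successfully"


def log_reports_success(job_dir: Path, marker: str = SUCCESS_MARKER) -> bool:
    """Return True if any *.log file directly inside job_dir contains the marker."""
    for log_file in sorted(job_dir.glob("*.log")):
        if marker in log_file.read_text(errors="replace"):
            return True
    return False


def jobs_without_success(sweep_dir: Path) -> List[Path]:
    """List the job directories of a sweep whose logs lack the marker."""
    # Assumption: job directories are the numerically named subdirectories.
    job_dirs = [p for p in sweep_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    job_dirs.sort(key=lambda p: int(p.name))
    return [d for d in job_dirs if not log_reports_success(d)]


# Usage (hypothetical sweep directory):
# for job_dir in jobs_without_success(Path("multirun/2022-12-24/10-54-26")):
#     print(job_dir)
```

A job directory with no log file at all is reported as not successful, which also covers jobs that never started.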

awav Dec 24, 2022
Author

Hi @Jasha10, thanks for your reply!

Related are the Hydra docs on the experimental re-run feature.

Does the re-run feature work with multi-run, or can I specify which experiments to run using wildcards, e.g. python my_app.py --experimental-rerun $OUTPUT_DIR/*/*/config.pickle? Some people might care about complete reproducibility, including the order of execution with a launcher, but for me the order of execution does not matter.
Btw, I'm getting an error when I try to use the re-run feature:

  File "/Users/.../miniforge3/envs/py38/lib/python3.8/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 139, in _resolve_target
    raise InstantiationException(msg) from e
hydra.errors.InstantiationException: Error locating target 'hydra.experimental.pickle_job_info_callback.PickleJobInfoCallback', set env var HYDRA_FULL_ERROR=1 to see chained exception.
full_key: hydra.callbacks.save_job_info

and

❯ pip freeze | grep hydra
hydra-core==1.3.1
hydra-ray-launcher==1.2.0

One idea would be to look at the log file produced by the multirun job.
Perhaps you could log something like "Job finished successfully" after the job is over, and parse those log files to see which jobs did not finish successfully.

Do you mean to use callback on_job_start to decide which job to run and not to run?

Jasha10 (Collaborator) · 2022-12-24T17:36:31Z

Btw, I'm getting an error when I'm trying to use re-run feature:

Oops, it looks like the _target_ used in the docs is incorrect.
It should be hydra.experimental.callbacks.PickleJobInfoCallback, not hydra.experimental.pickle_job_info_callback.PickleJobInfoCallback. I've opened issue #2521 to track this.

Does re-run feature work with multi-run

The PickleJobInfoCallback callback does work with multirun; it will successfully save a config.pickle file even in multirun mode (one pickle for each job in the sweep). The --experimental-rerun flag cannot be combined with the --multirun flag; you can only call --experimental-rerun on one pickle file at a time.

Let me demonstrate using the rerun example app. First, I'll use --multirun to sweep over two different config values (foo=bar and foo=baz). Then, I'll use the --experimental-rerun flag to re-run the foo=baz job.

$ cd examples/experimental/rerun
$ python my_app.py --multirun foo=bar,baz  # two jobs
[2022-12-24 10:54:26,627][HYDRA] Launching 2 jobs locally
[2022-12-24 10:54:26,627][HYDRA]        #0 : foo=bar
[2022-12-24 10:54:26,714][hydra.experimental.callbacks.PickleJobInfoCallback][INFO] - Saving job configs in /Users/jasha10/dev/hydra/examples/experimental/rerun/multirun/2022-12-24/10-54-26/0/.hydra/config.pickle
[2022-12-24 10:54:26,714][__main__][INFO] - Output_dir=/Users/jasha10/dev/hydra/examples/experimental/rerun/multirun/2022-12-24/10-54-26/0
[2022-12-24 10:54:26,715][__main__][INFO] - cfg.foo=bar
[2022-12-24 10:54:26,716][hydra.experimental.callbacks.PickleJobInfoCallback][INFO] - Saving job_return in /Users/jasha10/dev/hydra/examples/experimental/rerun/multirun/2022-12-24/10-54-26/0/.hydra/job_return.pickle
[2022-12-24 10:54:26,716][HYDRA]        #1 : foo=baz
[2022-12-24 10:54:26,806][hydra.experimental.callbacks.PickleJobInfoCallback][INFO] - Saving job configs in /Users/jasha10/dev/hydra/examples/experimental/rerun/multirun/2022-12-24/10-54-26/1/.hydra/config.pickle
[2022-12-24 10:54:26,806][__main__][INFO] - Output_dir=/Users/jasha10/dev/hydra/examples/experimental/rerun/multirun/2022-12-24/10-54-26/1
[2022-12-24 10:54:26,806][__main__][INFO] - cfg.foo=baz
[2022-12-24 10:54:26,807][hydra.experimental.callbacks.PickleJobInfoCallback][INFO] - Saving job_return in /Users/jasha10/dev/hydra/examples/experimental/rerun/multirun/2022-12-24/10-54-26/1/.hydra/job_return.pickle

$ python my_app.py --experimental-rerun multirun/2022-12-24/10-54-26/1/.hydra/config.pickle
/Users/jasha10/dev/hydra/hydra/main.py:24: UserWarning: Experimental rerun CLI option, other command line args are ignored.
  warnings.warn(msg, UserWarning)
[2022-12-24 10:58:34,159][__main__][INFO] - Output_dir=/Users/jasha10/dev/hydra/examples/experimental/rerun/multirun/2022-12-24/10-54-26/1
[2022-12-24 10:58:34,160][__main__][INFO] - cfg.foo=baz

can I specify experiments which to run using wildcards

Hydra doesn't have support for wildcards.
The bash shell does support wildcards, however.

$ echo multirun/2022-12-24/10-54-26/*/.hydra/config.pickle
multirun/2022-12-24/10-54-26/0/.hydra/config.pickle multirun/2022-12-24/10-54-26/1/.hydra/config.pickle

You could use this shell feature to re-run all the pickle files from a previous multirun sweep:

$ for pickle in multirun/2022-12-24/10-54-26/*/.hydra/config.pickle; do
    python my_app.py --experimental-rerun "$pickle"
done

You could even combine such shell looping with some way to filter out which previous runs were successful:

$ for job_dir in multirun/2022-12-24/10-54-26/*; do
    if ! run_was_successful "$job_dir"; then
        python my_app.py --experimental-rerun "$job_dir/.hydra/config.pickle"
    fi
done

Here run_was_successful is some executable file that reads the contents of the $job_dir directory to see whether the job completed successfully.
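
The same loop can be sketched in Python, which makes it easy to plug in whatever success check you end up using. This is only a sketch: rerun_commands, rerun_all and the was_successful predicate are hypothetical names, and the only Hydra-specific part is the --experimental-rerun invocation shown above.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, List


def rerun_commands(
    sweep_dir: Path,
    was_successful: Callable[[Path], bool],
    app: str = "my_app.py",
) -> List[List[str]]:
    """Build one re-run command per job directory that did not succeed."""
    commands = []
    for job_dir in sorted(sweep_dir.iterdir()):
        pickle_path = job_dir / ".hydra" / "config.pickle"
        if not pickle_path.is_file():
            # Not a job directory, or the job never started (nothing to re-run from).
            continue
        if was_successful(job_dir):
            continue
        commands.append([sys.executable, app, "--experimental-rerun", str(pickle_path)])
    return commands


def rerun_all(commands: List[List[str]]) -> None:
    """Run the commands one after another; a failing re-run does not stop the loop."""
    for cmd in commands:
        subprocess.run(cmd, check=False)


# Usage (my_check is a placeholder for your own success test):
# rerun_all(rerun_commands(Path("multirun/2022-12-24/10-54-26"), my_check))
```

Note that a job with no config.pickle cannot be re-run this way, so jobs that never started would still need to be launched normally.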

One idea would be to look at the log file produced by the multirun job.

Actually I think there may be a more elegant solution than inspecting log files.

The PickleJobInfoCallback callback actually saves two pickle files: .hydra/config.pickle is saved by on_job_start, and .hydra/job_return.pickle is saved by on_job_end (see the callback implementation). Loading the job_return.pickle file gives you an instance of the JobReturn dataclass, which you can inspect to see whether the job completed successfully.

>>> import pickle
>>> with open("multirun/2022-12-24/10-54-26/1/.hydra/job_return.pickle", "rb") as f:
...     job_return = pickle.load(f)
...
>>> job_return.status
<JobStatus.COMPLETED: 1>

If the job raised an exception, you should get <JobStatus.FAILED: 2>.

EDIT: note that you will not be able to inspect job_return.pickle to see if the re-run job succeeded; you can only inspect job_return.pickle to see if the original job succeeded. No new pickle file will be saved for the re-run job.
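
As a rough sketch, the check could be wrapped in a small helper that also treats a missing job_return.pickle as "not finished", since that file is only written by on_job_end. The helper name original_job_succeeded is made up here. Loading a real job_return.pickle requires Hydra to be importable, and the comparison relies on the status enum member being named COMPLETED, as in the REPL output above.

```python
import pickle
from pathlib import Path


def original_job_succeeded(job_dir: Path) -> bool:
    """Return True only if the ORIGINAL job in job_dir finished with status COMPLETED."""
    return_file = job_dir / ".hydra" / "job_return.pickle"
    if not return_file.is_file():
        # on_job_end never ran: the job was interrupted or never started.
        return False
    with return_file.open("rb") as f:
        # Only unpickle files you trust; real files need hydra installed.
        job_return = pickle.load(f)
    # Compare by enum member name so this helper does not need to import hydra.
    return getattr(job_return.status, "name", None) == "COMPLETED"


# Usage (hypothetical):
# original_job_succeeded(Path("multirun/2022-12-24/10-54-26/1"))
```

Because of the caveat in the EDIT above, this only reflects the outcome of the original run, never that of a later re-run.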

Do you mean to use callback on_job_start to decide which job to run and not to run?

Launching a multirun sweep and then using on_job_start to cancel those jobs that have completed successfully in the past? That's a good idea (and I think that idea might work even without using the --experimental-rerun flag).

By the way: if you use --experimental-rerun, the re-run job will have the same value for hydra.runtime.output_dir.

Because of the way python logging works, the log files (e.g. multirun/2022-12-24/10-54-26/1/my_app.log) will be appended to (not overwritten).
~~The previous files config.pickle and job_return.pickle will be overwritten by the files from the re-run job.~~ Edit: The docs say that "Callbacks are not called" when using the --experimental-rerun flag, which means the *.pickle files are not updated when the re-run happens. So, after using the --experimental-rerun flag, inspecting the job_return.pickle file will not tell you whether the re-run job succeeded; it will only tell you whether the original job succeeded. (To determine whether the re-run job succeeded, inspecting log files may be the better solution after all.)
It seems that Hydra behaves as if hydra.job.chdir==False when re-running the job, even if chdir was set to True for the original job. I'm not sure if this is a bug or if it was an intentional design decision.

Anyway, sorry for the long reply. The re-run feature is experimental and best practices are not yet established. I'd be interested to hear about what ends up working for you / what techniques you decide to adopt.
Merry Christmas 🎅🏼🎄
