Catch interrupts in phyperopt, return current data #68

albheim · 2021-10-12T11:49:29Z

In @hypteropt it will catch interrupts and return the data generated so far. I wanted the same functionality for long running distributed jobs, and implemented pretty much the same thing for @phyperopt (and @thyperopt also I just realized, have not tried this though and might need to adapt to function well with that) since I didn't see any immediate reasons it would create problems.

Was there any specific reason this didn't exist before, something problematic that I didn't see?

If you want this, should I create some tests? Not sure of a good way to generate external interrupts, but should be the same as throwing a InterruptedException after some time in the optimization function right?

baggepinnen · 2021-10-12T13:24:22Z

Awesome! I don't think there was any particular reason. I very rarely use the Distributed mode myself so probably just never got around to implementing it.

albheim · 2021-10-12T20:13:06Z

Realised it did not work perfectly as is, there were some problems when you actually tested it a bit more. And have not tested for threading either. So I will come back to this at some point.

albheim · 2022-01-28T16:11:02Z

A bit unclear how to test this well, added some basic runs for interruptions (should handle gracefully) and some with exceptions (should error). Problem is the distributed one errors in the interruptions here, but I can use it on my cluster fine.

albheim · 2022-01-29T09:35:35Z

src/Hyperopt.jl

+            try
+                # We don't care about the results of the `pmap` since we update the hyperoptimizer
+                # inside the loop.
+                $(esc(pmap))(1:ho.iterations) do i
+                    isdone = take!(done_channel)
+                    put!(done_channel, isdone)
+                    isdone && return
+


I think I realised why my tests behaved different from what I expected.

When generating my errors/interrupts I do them in the code block run on the workers, so the exception will happen on the worker.
When I press ctrl-C to interrupt jobs on my cluster I do that on the main process so it is caught by the try statement.

It seems that an exception on a worker can generate an exception back to the main process, but maybe not the same? Currently I have

Worker 2 terminated. Interrupt handling: Error During Test at /home/runner/work/Hyperopt.jl/Hyperopt.jl/test/runtests.jl:109 Got exception outside of a @test On worker 2: InterruptException: ...

which seems like it does generate the expected interrupt exception on the worker, but the worker can't handle it and stops which generates an error on the main process? I think this might be how you'd want it to work, it just makes it harder to test.

albheim · 2022-01-30T18:01:27Z

So as it is now I just added so that an exception (either interrupt or error) will interrupt running jobs and skip remaining ones.

In case of interrupt it will return a hyperopt object containing the finished runs, and for exception it will rethrow.

I tried making some tests for interrupt, but found it tricky to do it in any meaningful way so didn't add anything for now.

baggepinnen · 2022-01-31T11:36:07Z

Thanks for the updates :)
I guess you have used this a bit and know that the catch path works as expected? If so, I'll just go ahead and merge it

albheim · 2022-01-31T13:40:31Z

Yeah, have only really used @phyperopt and that has worked nicely for me. Tested @thyperopt quickly and it seems it terminates nicely in the same manner. Generating an error seems to rethrow it correctly also.

One thing I thought about is if you want to save the ho state on an error also, but I think that would prevent you from properly rethrowing the error since you would need to properly return so the value can be assigned. So maybe only save on interrupt is reasonable.

Also thought if length(ho) should be changed to reflect the number of actual steps performed instead of intended steps. No big deal for me, just tried to do it and expected the other result so it felt intuitive to me.

albheim · 2022-01-31T13:45:46Z

Testing for error could be easier and I could try to add something basic for both phyper and thyper, so at least one of the branches in the try is tested.

And if you have any suggestion on how to do the interrupt testing in some good way I'm all for that, but I couldn't figure it out and from what I saw you didn't have any existing test for that in the hyperopt code?

baggepinnen · 2022-01-31T13:56:03Z

I don't really have a good idea about the testing either :/ In any case, this PR should leave us in a better state than before when we didn't handle interrupts at all, so let's merge and of you come up with a strategy at a later time, we can add it then :)

Catch interrupts in phyperopt, return current data

d720fef

albheim marked this pull request as draft October 12, 2021 20:11

albheim changed the title ~~Catch interrupts in phyperopt, return current data~~ WIP: Catch interrupts in phyperopt, return current data Oct 12, 2021

albheim added 2 commits January 28, 2022 13:26

add a done_channel to keep track of when stopped

d5d7440

try to fix test

8ded31a

albheim commented Jan 29, 2022

View reviewed changes

remove interrupt tests (not working)

0152ee8

albheim changed the title ~~WIP: Catch interrupts in phyperopt, return current data~~ Catch interrupts in phyperopt, return current data Jan 30, 2022

albheim marked this pull request as ready for review January 30, 2022 18:01

baggepinnen merged commit e207cec into baggepinnen:master Jan 31, 2022

albheim deleted the albheim/add_phyper_interrupt branch February 1, 2022 10:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Catch interrupts in phyperopt, return current data #68

Catch interrupts in phyperopt, return current data #68

albheim commented Oct 12, 2021 •

edited

Loading

baggepinnen commented Oct 12, 2021

albheim commented Oct 12, 2021

albheim commented Jan 28, 2022

albheim Jan 29, 2022

albheim commented Jan 30, 2022

baggepinnen commented Jan 31, 2022

albheim commented Jan 31, 2022

albheim commented Jan 31, 2022

baggepinnen commented Jan 31, 2022

Catch interrupts in phyperopt, return current data #68

Catch interrupts in phyperopt, return current data #68

Conversation

albheim commented Oct 12, 2021 • edited Loading

baggepinnen commented Oct 12, 2021

albheim commented Oct 12, 2021

albheim commented Jan 28, 2022

albheim Jan 29, 2022

Choose a reason for hiding this comment

albheim commented Jan 30, 2022

baggepinnen commented Jan 31, 2022

albheim commented Jan 31, 2022

albheim commented Jan 31, 2022

baggepinnen commented Jan 31, 2022

albheim commented Oct 12, 2021 •

edited

Loading