-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Catch interrupts in phyperopt, return current data #68
Catch interrupts in phyperopt, return current data #68
Conversation
Awesome! I don't think there was any particular reason. I very rarely use the Distributed mode myself so probably just never got around to implementing it. |
Realised it did not work perfectly as is, there were some problems when you actually tested it a bit more. And have not tested for threading either. So I will come back to this at some point. |
A bit unclear how to test this well, added some basic runs for interruptions (should handle gracefully) and some with exceptions (should error). Problem is the distributed one errors in the interruptions here, but I can use it on my cluster fine. |
try | ||
# We don't care about the results of the `pmap` since we update the hyperoptimizer | ||
# inside the loop. | ||
$(esc(pmap))(1:ho.iterations) do i | ||
isdone = take!(done_channel) | ||
put!(done_channel, isdone) | ||
isdone && return | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I realised why my tests behaved different from what I expected.
When generating my errors/interrupts I do them in the code block run on the workers, so the exception will happen on the worker.
When I press ctrl-C to interrupt jobs on my cluster I do that on the main process so it is caught by the try statement.
It seems that an exception on a worker can generate an exception back to the main process, but maybe not the same? Currently I have
Worker 2 terminated.
Interrupt handling: Error During Test at /home/runner/work/Hyperopt.jl/Hyperopt.jl/test/runtests.jl:109
Got exception outside of a @test
On worker 2:
InterruptException:
...
which seems like it does generate the expected interrupt exception on the worker, but the worker can't handle it and stops which generates an error on the main process? I think this might be how you'd want it to work, it just makes it harder to test.
So as it is now I just added so that an exception (either interrupt or error) will interrupt running jobs and skip remaining ones. In case of interrupt it will return a hyperopt object containing the finished runs, and for exception it will rethrow. I tried making some tests for interrupt, but found it tricky to do it in any meaningful way so didn't add anything for now. |
Thanks for the updates :) |
Yeah, have only really used One thing I thought about is if you want to save the ho state on an error also, but I think that would prevent you from properly rethrowing the error since you would need to properly return so the value can be assigned. So maybe only save on interrupt is reasonable. Also thought if |
Testing for error could be easier and I could try to add something basic for both phyper and thyper, so at least one of the branches in the try is tested. And if you have any suggestion on how to do the interrupt testing in some good way I'm all for that, but I couldn't figure it out and from what I saw you didn't have any existing test for that in the hyperopt code? |
I don't really have a good idea about the testing either :/ In any case, this PR should leave us in a better state than before when we didn't handle interrupts at all, so let's merge and of you come up with a strategy at a later time, we can add it then :) |
In
@hypteropt
it will catch interrupts and return the data generated so far. I wanted the same functionality for long running distributed jobs, and implemented pretty much the same thing for@phyperopt
(and@thyperopt
also I just realized, have not tried this though and might need to adapt to function well with that) since I didn't see any immediate reasons it would create problems.Was there any specific reason this didn't exist before, something problematic that I didn't see?
If you want this, should I create some tests? Not sure of a good way to generate external interrupts, but should be the same as throwing a
InterruptedException
after some time in the optimization function right?