Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch interrupts in phyperopt, return current data #68

Merged
merged 4 commits into from
Jan 31, 2022

Conversation

albheim
Copy link
Contributor

@albheim albheim commented Oct 12, 2021

In @hypteropt it will catch interrupts and return the data generated so far. I wanted the same functionality for long running distributed jobs, and implemented pretty much the same thing for @phyperopt (and @thyperopt also I just realized, have not tried this though and might need to adapt to function well with that) since I didn't see any immediate reasons it would create problems.

Was there any specific reason this didn't exist before, something problematic that I didn't see?

If you want this, should I create some tests? Not sure of a good way to generate external interrupts, but should be the same as throwing a InterruptedException after some time in the optimization function right?

@baggepinnen
Copy link
Owner

Awesome! I don't think there was any particular reason. I very rarely use the Distributed mode myself so probably just never got around to implementing it.

@albheim albheim marked this pull request as draft October 12, 2021 20:11
@albheim albheim changed the title Catch interrupts in phyperopt, return current data WIP: Catch interrupts in phyperopt, return current data Oct 12, 2021
@albheim
Copy link
Contributor Author

albheim commented Oct 12, 2021

Realised it did not work perfectly as is, there were some problems when you actually tested it a bit more. And have not tested for threading either. So I will come back to this at some point.

@albheim
Copy link
Contributor Author

albheim commented Jan 28, 2022

A bit unclear how to test this well, added some basic runs for interruptions (should handle gracefully) and some with exceptions (should error). Problem is the distributed one errors in the interruptions here, but I can use it on my cluster fine.

Comment on lines +221 to +228
try
# We don't care about the results of the `pmap` since we update the hyperoptimizer
# inside the loop.
$(esc(pmap))(1:ho.iterations) do i
isdone = take!(done_channel)
put!(done_channel, isdone)
isdone && return

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I realised why my tests behaved different from what I expected.

When generating my errors/interrupts I do them in the code block run on the workers, so the exception will happen on the worker.
When I press ctrl-C to interrupt jobs on my cluster I do that on the main process so it is caught by the try statement.

It seems that an exception on a worker can generate an exception back to the main process, but maybe not the same? Currently I have

Worker 2 terminated.
Interrupt handling: Error During Test at /home/runner/work/Hyperopt.jl/Hyperopt.jl/test/runtests.jl:109
  Got exception outside of a @test
  On worker 2:
  InterruptException:
...

which seems like it does generate the expected interrupt exception on the worker, but the worker can't handle it and stops which generates an error on the main process? I think this might be how you'd want it to work, it just makes it harder to test.

@albheim albheim changed the title WIP: Catch interrupts in phyperopt, return current data Catch interrupts in phyperopt, return current data Jan 30, 2022
@albheim
Copy link
Contributor Author

albheim commented Jan 30, 2022

So as it is now I just added so that an exception (either interrupt or error) will interrupt running jobs and skip remaining ones.

In case of interrupt it will return a hyperopt object containing the finished runs, and for exception it will rethrow.

I tried making some tests for interrupt, but found it tricky to do it in any meaningful way so didn't add anything for now.

@albheim albheim marked this pull request as ready for review January 30, 2022 18:01
@baggepinnen
Copy link
Owner

Thanks for the updates :)
I guess you have used this a bit and know that the catch path works as expected? If so, I'll just go ahead and merge it

@albheim
Copy link
Contributor Author

albheim commented Jan 31, 2022

Yeah, have only really used @phyperopt and that has worked nicely for me. Tested @thyperopt quickly and it seems it terminates nicely in the same manner. Generating an error seems to rethrow it correctly also.

One thing I thought about is if you want to save the ho state on an error also, but I think that would prevent you from properly rethrowing the error since you would need to properly return so the value can be assigned. So maybe only save on interrupt is reasonable.

Also thought if length(ho) should be changed to reflect the number of actual steps performed instead of intended steps. No big deal for me, just tried to do it and expected the other result so it felt intuitive to me.

@albheim
Copy link
Contributor Author

albheim commented Jan 31, 2022

Testing for error could be easier and I could try to add something basic for both phyper and thyper, so at least one of the branches in the try is tested.

And if you have any suggestion on how to do the interrupt testing in some good way I'm all for that, but I couldn't figure it out and from what I saw you didn't have any existing test for that in the hyperopt code?

@baggepinnen
Copy link
Owner

I don't really have a good idea about the testing either :/ In any case, this PR should leave us in a better state than before when we didn't handle interrupts at all, so let's merge and of you come up with a strategy at a later time, we can add it then :)

@baggepinnen baggepinnen merged commit e207cec into baggepinnen:master Jan 31, 2022
@albheim albheim deleted the albheim/add_phyper_interrupt branch February 1, 2022 10:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants