
Initial supervision tests #43

Closed · goodboy opened this issue Nov 22, 2018 · 2 comments · Fixed by #91


goodboy commented Nov 22, 2018

#42 brings in proper trio.MultiError support and more deterministic cancellation semantics.
More tests are needed to ensure this system is rock solid before moving on to adding different supervision strategies as in #22.

Some I can think of off-hand that aren't yet in the test suite (a rough sketch of the first case follows after the list):

  • n actors spawned with run_in_actor(), x with start_actor()
    • local error causes all to cancel
      • n don't error but complete quickly
      • n don't error but complete slowly
      • n all error
    • internal error in start_actor() and run_in_actor()
      • n all error
      • n some error
  • propagation of MultiError up subactor nursery trees

More to come...
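
To make that first bullet concrete, here is a minimal sketch of the "local error causes all to cancel" case. It assumes a nursery API roughly along the lines of `tractor.open_nursery()` / `nursery.run_in_actor()` with a `tractor.run()` entry point; exact signatures may differ, so treat it as an illustration rather than the actual test code:

```python
import pytest
import trio
import tractor


async def sleep_forever():
    # subactor task that never completes on its own; it should be
    # cancelled when the nursery tears down
    await trio.sleep_forever()


async def main():
    async with tractor.open_nursery() as nursery:
        # spawn a handful of actors that would otherwise run forever
        for i in range(3):
            await nursery.run_in_actor(sleep_forever, name=f'sleeper_{i}')

        # a local error inside the nursery block should cancel and reap
        # every subactor before propagating
        raise ValueError("local task error")


def test_local_error_cancels_all_subactors():
    # depending on how cancelled-subactor results surface, the error may
    # arrive bare or wrapped in a ``trio.MultiError``
    with pytest.raises((ValueError, trio.MultiError)):
        tractor.run(main)
```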

@goodboy goodboy added this to the 0.1.0.a0 milestone Oct 17, 2019

goodboy commented Oct 22, 2019

Hmm here's another weird case I thought of:

A start_actor() (daemon) actor errors on a call while we're already blocked on some long-running run_in_actor() result, i.e. a Portal.result() call. What should happen? The result-waiter task won't know about the daemon actor's failure because that portal hasn't been "checkpointed", but should the nursery/supervisor cancel the actor group regardless?

Wait, just thinking it through; never mind. If you make a call to a daemon actor, the only way this could happen is if you were to stop iterating a remote streaming function (any other type of call should error at the Portal.run() checkpoint). So you'd have to stop iterating some Portal.run() call to a start_actor() actor (aka a daemon actor) mid-iteration. But if you did that from an async for you'd (hopefully, assuming you're aware of the precariousness involved - #57) trigger the cross-process task cancellation machinery (remember aclose() needs to be called when breaking out of an async for over a stream). This whole situation could, however, happen more naturally using the context api, which in hindsight probably also needs a context manager wrapping it, much like async_generator.aclosing(), to ensure proper cross-process stream termination; see the sketch below.
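
As a rough illustration of that last point, here is what the consumer side could look like with an aclosing-style wrapper, assuming Portal.run() hands back an async generator for a remote streaming function (`stream_forever` and the exact call shape are made up for this example):

```python
from async_generator import aclosing  # stdlib contextlib.aclosing on 3.10+


async def stream_forever():
    # hypothetical remote streaming function running in the daemon actor
    i = 0
    while True:
        yield i
        i += 1


async def consume_a_few(portal):
    # assume ``portal.run()`` returns an async generator streaming values
    # from the daemon actor
    agen = await portal.run(stream_forever)
    async with aclosing(agen) as stream:
        async for value in stream:
            if value > 10:
                # breaking early is now safe: ``aclosing()`` guarantees
                # ``aclose()`` runs, which should trigger the cross-process
                # task cancellation machinery instead of leaving the remote
                # task running
                break
```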

Probably needs tests for all of this right now lol!

See #87 for further discussion on how this might be solved. Either way we need tests to verify the current behavior.

goodboy added a commit that referenced this issue Oct 25, 2019
In an effort towards #43. This completes the first major bullet's worth of tests described in that issue.
goodboy added a commit that referenced this issue Oct 26, 2019
In an effort towards #43. This completes the first major bullet's worth of tests
described in that issue.

goodboy commented Oct 26, 2019

Aha! Catching more undefined behavior with this endeavour :D

Working on the nested MultiError test case I figured out that we don't have rules in place for what happens if a nursery is waited on when one actor has errored but not all have yet spawned (since spawning is async).

I've created #88 to address this.
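
For reference, a test for that race might look something like the sketch below, where actors that error immediately are spawned while their siblings are still being launched. The `assert_err` helper, the spawn count, and the exception asserted on are illustrative assumptions, not the actual test code:

```python
import pytest
import tractor


async def assert_err():
    # errors as soon as the subactor starts up
    assert 0


async def main():
    async with tractor.open_nursery() as nursery:
        # each spawn is itself async, so the first actor's error can land
        # while later siblings are still being launched -- exactly the
        # ordering #88 is about
        for i in range(4):
            await nursery.run_in_actor(assert_err, name=f'errorer_{i}')


def test_error_during_spawn():
    # what *should* be raised here is part of what #88 needs to define
    with pytest.raises(BaseException):
        tractor.run(main)
```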

goodboy added a commit that referenced this issue Oct 26, 2019
This exemplifies the undefined behaviour in #88 and begins to test for
the last bullet in #43.
goodboy changed the title from "More thorough supervision tests" to "Initial supervision tests" on Oct 30, 2019
goodboy added a commit that referenced this issue Oct 30, 2019
Add a test to verify that `trio.MultiError`s are properly propagated up
a simple actor nursery tree. We don't have any exception marshalling
between processes (yet) so we can't validate much more than a simple
2-depth tree. This satisfies the final bullet in #43.

Note I've limited the number of subactors per layer to around 5 since
any more than this seems to break the `multiprocessing` forkserver;
zombie subprocesses seem to be blocking teardown somehow...

Also add a single depth fast fail test just to verify that it's the
nested spawning that triggers this forkserver bug.
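
For a sense of what that commit's test shape might look like, here is a hedged sketch of a 2-deep erroring actor tree. The helper names, the breadth per layer, and the exact exception surfacing at the root are assumptions (especially with no cross-process exception marshalling in place yet):

```python
import pytest
import trio
import tractor

SUBACTORS_PER_LAYER = 3  # kept small; larger counts stressed the forkserver


async def assert_err():
    # grandchild actor task that errors immediately
    assert 0


async def spawn_error_layer():
    # middle layer: its own nursery collects the grandchildren's errors
    # into a ``trio.MultiError`` which should propagate up to the root
    async with tractor.open_nursery() as nursery:
        for i in range(SUBACTORS_PER_LAYER):
            await nursery.run_in_actor(assert_err, name=f'grandchild_{i}')


async def main():
    async with tractor.open_nursery() as nursery:
        for i in range(SUBACTORS_PER_LAYER):
            await nursery.run_in_actor(spawn_error_layer, name=f'child_{i}')


def test_nested_multierror_propagation():
    with pytest.raises(trio.MultiError):
        tractor.run(main)
```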