Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a few Supervisor bugs #3972

Merged
merged 17 commits into from
Jan 27, 2024
Merged

Fix a few Supervisor bugs #3972

merged 17 commits into from
Jan 27, 2024

Conversation

durban
Copy link
Contributor

@durban durban commented Jan 26, 2024

This PR fixes a few bugs in Supervisor (and creates an unknown number of new ones ;-).

Fixed bugs (at least the ones I remember spotting; I tried to write tests for them, couldn't for all of them):

  • Closed Supervisor strangeness #3394
  • With checkRestart, cancel allowed and didn't wait for one last restart.
  • supervise/close race (although, due to Closed Supervisor strangeness #3394, close wasn't effective anyway).
  • With checkRestart, if cancel cancelled the working fiber before it did anything, the finalizer (which removes the fiber from the map) did not run. (This was effectively a memory leak.)

@durban durban linked an issue Jan 26, 2024 that may be closed by this pull request
Copy link
Member

@djspiewak djspiewak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch this is hairy. Nice work. :) Some minor nits.

class SupervisorSpec extends BaseSpec {
class SupervisorSpec extends BaseSpec with DetectPlatform {

sequential
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With lowering the iteration counts, I think we will be able to avoid this.


sequential

override def executionTimeout = 60.seconds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oooof. Can we lower iteration counts instead?

Comment on lines 253 to 255
case ex: IllegalStateException
if ex.getMessage == "supervisor already shutdown" =>
None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a bit easier to diagnose an eventual failure here if we convert this into a side-effecting test assertion.

Suggested change
case ex: IllegalStateException
if ex.getMessage == "supervisor already shutdown" =>
None
case ex: IllegalStateException =>
ex.getMessage mustEqual "supervisor already shutdown"
None

IO.sleep(100.millis) *> adaptedFiber.cancel *> adaptedFiber.join *> (
(counter.get, IO.sleep(100.millis) *> counter.get).flatMapN {
case (v1, v2) =>
IO((v1: Int) mustEqual (v2: Int))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the type ascriptions needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I don't even remember why I did those 🙂

}
}

tsk.parReplicateA_(if (isJVM) 10000 else 5).as(ok)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to replicate it at all on non-JVM platforms?

Also, I'm assuming this is the test that takes most of the time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point; I think I copied the if (isJVM) ... else 5 part from somewhere.

For me, it was usually supervise / finalize race with checkRestart that timed out.


private[this] val allFibers: F[List[Fiber[F, Throwable, _]]] =
stateRef.get.map(_.values.toList)
stateRef.getAndSet(null).map(_.values.toList)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So null is the signal that we've decided to kill ourselves? Basically, the invariant here is that joinAll/cancelAll can only be called at the end of the world, when the Supervisor is shutting down. Can we toss that in a comment so we don't forget it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a comment in abstract class State about joinAll/cancelAll, but I'll note it here too.

@djspiewak
Copy link
Member

Also it occurs to me that, with this, we might not need the double-check on closure state in #3923

@durban
Copy link
Contributor Author

durban commented Jan 27, 2024

FTR there was a Dispatcher failure (https://github.com/typelevel/cats-effect/actions/runs/7678308491/job/20928046255#step:24:4779), but seemed unrelated to this PR, so I just restarted the CI jobs.

Copy link
Member

@djspiewak djspiewak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@djspiewak djspiewak merged commit 25acc4a into typelevel:series/3.5.x Jan 27, 2024
28 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Closed Supervisor strangeness
2 participants