Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(controller): Adjust controller -> scheduler state recreation upon scheduler disconnect. #5944

Merged
merged 40 commits into from
Oct 4, 2024

Conversation

sakoush
Copy link
Member

@sakoush sakoush commented Sep 30, 2024

What this PR does / why we need it:

This PR changes the way we are resetting scheduler / controller state upon connection drops. Previously the state reconstruction was tightly couples with stream handling and this had a few issues:

  • bluring of concerns as establishing streams didnt overlap really with state reconstruction

The solution is to separate out the logic and trigger state reconstruction on a separate stream. These operations are idempotent and therefore it is fine to send them anyway.

The main logic is now in handleStateOnReconnect

We are also introducing a new grpc stream between scheduler<->operator that would trigger this logic ControlPlaneSubscription. Essentially when the scheduler gets a subscribe request on this stream it replies to controller to initiate the state reconstruction logic, this process is done on each connection reset.

Which issue(s) this PR fixes:

Fixes #

INFRA-1198 (internal).

Special notes for your reviewer:

  • I have tidied up some of the interfaces and added test coverage along the way for SubscribeModelEvents.
  • I also change the logic for LoadModel and UnloadModel to not to block on scheduling. This is safe as we rely on the state transition events to report status and from the cli perspective it is not required to wait for scheduling.
  • Wait for scheduler to be ready before establishing model/pipeline/experiment status streams so that the controller doesn't get statues that could be wrong while the scheduler is restarting.

@sakoush sakoush requested a review from lc525 as a code owner September 30, 2024 15:15
@sakoush sakoush added the v2 label Sep 30, 2024
@lc525 lc525 changed the title fix(controller): Adjust controller -> scheduler state recreation upton scheduler disconnect. fix(controller): Adjust controller -> scheduler state recreation upon scheduler disconnect. Sep 30, 2024
Copy link
Member

@lc525 lc525 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this improves things considerably from our current state, and hope all the testing doesn't show any issues.

I do have a question regarding the ServerBasedSynchronizer for a particular edge case, and a couple of other nits, but otherwise all looks good!

@sakoush sakoush merged commit 6c3410d into SeldonIO:v2 Oct 4, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants