Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graceful shutdown support #689

Merged
merged 1 commit into from
Nov 18, 2019
Merged

Conversation

terencefan
Copy link
Member

@terencefan terencefan commented Oct 17, 2019

The target of this PR is to ensure all incoming and outgoing packages will be processed before a scheduled shutdown.

UPDATE 11/13

  1. Add test cases for HubDispatcher and MultiEndpointContainer.

UPDATE 11/12

  1. Lift the ShutdownAsync method up to the HubDispatcher level.

UPDATE 11/06

  1. Lift the ShutdownAsync method up to the MultiEndpointServiceConnectionContainer level.
  2. Add OfflineAsync method to IServiceConnectionContainer.

UPDATE 11/05

  1. Move CloseAsync method to ServiceConnectionContainer level.

UPDATE 11/01

  1. Replace CloseConnectionMessage with PingMessage.
  2. Wait on TCS instead of connection outgoing tasks.
  3. Prevent rebalanced (ondemand) connection from starting in the shutdown process.

UPDATE 10/31

Add test cases for corner cases.

UPDATE 10/29

Add test cases for common scenarios.

UPDATE 10/28

Reimplement CloseAsync process by using TaskCompletionSource which makes the code more readable.

UPDATE 10/24

After discussed thoroughly with @KKhurin and @vicancy, we decided NOT do anything with the messages during the shutdown process in this scenario. We just remove server connections from the routing table to prevent new client connections to be assigned to this server and wait until either client or server to close client connections by themselves.

Since the server-side may not have the ability to notify the client to reconnect, maybe we will introduce a mechanism between server and ASRS to let the server has this ability in the future.

And there will also a TIMEOUT (by default it's 30s).
We simply drop every connection after reaching the time limit.

The CloseAsync (for each server connection) contains the following 6 steps:

  1. Try acquiring closeConnectionLock to ensure at most 1 closeAsync in progress.
  2. Send a CloseConnectionMessage with a FIN to ASRS
    • Once ASRS receives a FIN, it will remove the server connection immediately from its route table (to prevent new client connections from assigning to this server connection)
  3. Wait until ASRS has sent back a CloseConnectionMessage with a FINACK.
  4. Wait until all client connections have stopped.
    • Our ASRS will handle the client connections and find a proper time to close them, or in some cases we just let our customer close them by themselves. So we intentionally do nothing on the SDK side during this step.
  5. Complete incoming pipe
  6. Wait until the connection (StartAsync) has stopped safely.

image

@terencefan terencefan force-pushed the graceful-shutdown branch 3 times, most recently from 5a4a72d to 6efe9af Compare October 21, 2019 02:56
@terencefan terencefan changed the title [WIP] Graceful shutdown support Graceful shutdown support Oct 21, 2019
@davidfowl
Copy link
Member

davidfowl commented Oct 21, 2019

Spoke to @terencefan offline:

  • There should be no changes to the route API. Instead there should be a TimeSpan that can be set on ServiceOptions that let the user control how long to wait for shutdown.
  • By default we should Inject the IApplicationLifetime and call ShutdownAsync on all connections ourselves.
  • There should be no locks, CloseAsync on ServiceConnectionBase should Write the Fin and wait on task returned from ProcessIncomingAsync with the timeout configured in ServiceOptions.

Copy link
Contributor

@KKhurin KKhurin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we drain the outgoing pipe as the first step of the whole shutdown process or at least after we dispose the hub and close client connections but before we send the fin?
The reason is the messages queued in the pipe (and the implicit queue inside the semaphore lock) could require acks and it feels wrong to even have a possibility to send any messages after sending fin.

@terencefan
Copy link
Member Author

terencefan commented Oct 22, 2019

@KKhurin Actually we shouldn't block the outgoing pipe even after sending the FIN.

The FIN here is to notify the ASRS Runtime that the app server wants to close its connection, not already closed. There are still messages in both incoming and outgoing pipes and need to be processed at the moment.

Once our app server receives a FINACK, we could make sure no more incoming messages will be sent from ASRS Runtime, and after we close the incoming pipe gracefully, we could drain the outgoing pipe and close the connection.

And once our ASRS Runtime receives a FIN, it will block the upcoming messages from the client/browser, not the server. All pending messages (even the messages triggered by the unprocessed messages in the incoming pipe) in the outgoing pipe will be sent in my proposal.

A typical disconnecting process would contain 4 steps, we reduce it into 2 (FIN + FINACK) just because we are not expecting a graceful shutdown on the client-side.

@terencefan terencefan force-pushed the graceful-shutdown branch 4 times, most recently from baa06f0 to 6d55f2f Compare November 13, 2019 05:32
@terencefan terencefan force-pushed the graceful-shutdown branch 3 times, most recently from 09b3d9f to 490d5e0 Compare November 13, 2019 07:53
@terencefan terencefan force-pushed the graceful-shutdown branch 3 times, most recently from d920c87 to b9cabd6 Compare November 15, 2019 05:23

protected async Task WriteFinAsync(IServiceConnection c)
{
var ping = new PingMessage()
Copy link
Member

@vicancy vicancy Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to local static readonly field, serialized?


if (task != c.ConnectionOfflineTask)
{
// log
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why didn't this get fixed in the PR?


if (actual != expected)
{
// TODO log timeout.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well.

@@ -17,7 +16,7 @@ internal interface IServiceConnectionManager<THub> where THub : Hub

Task StopAsync();

Task ShutdownAsync(TimeSpan timeout);
Task OfflineAsync();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OfflineAsync isn't a verb. MarkOfflineAsync? MakeOfflineAsync?

JialinXin pushed a commit to JialinXin/azure-signalr that referenced this pull request Dec 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants