
RFC: server: add mechanism for graceful shutdown. #3307

Closed · 3 commits

Conversation

@ggreenway (Contributor)

This is useful for cases when the hot-restart-version is incompatible
and the old server needs to stop accepting new connections, drain
established connections, and close all shared resources that are
needed to start a new Envoy.

Signed-off-by: Greg Greenway ggreenway@apple.com

Risk Level: Medium

Testing: TODO

Docs Changes: TODO

Release Notes: TODO

@ggreenway (Contributor, author)

I'd like to get some feedback on this before I spend any time polishing it.

Should there be a separate admin call for each of the operations that are performed here, so they can be composed differently?

There have been various requests for something along these lines. Does this help with those? If not, could it help via tweaks or small changes to this?

@mattklein123 (Member)

@ggreenway for my understanding, why would someone use this vs /healthcheck/fail? It seems like in general there needs to be a LB in front that can stop sending new traffic to this Envoy?

@ggreenway (Contributor, author)

@mattklein123 The goal isn't just to drain the traffic. The goal is to be able to start the new envoy while the old one is draining, so all the shared-resources need to be closed.

@mattklein123 (Member)

I see. Since /healthcheck/fail already begins draining, what if we make closing listeners an additional operation instead of a replacement? E.g., a user might do:

  1. /healthcheck/fail (begin draining and fail Envoy out of load balancers if applicable)
  2. Wait some period of time
  3. /graceful_shutdown to actually shut down the listeners (I would maybe rename it /close_listeners or something, but we can debate that later)

?

@htuch (Member) commented May 7, 2018

@ggreenway I think this could be used to help with #2802 (and the general case of having gRPC connections sourced at Envoy terminate with a graceful GOAWAY).

@ggreenway (Contributor, author)

@mattklein123 using /healthcheck/fail has slightly different semantics because it immediately makes all connections want to drain, whereas this calls startDrainSequence(), which slowly drains everything. I was trying to mirror the hot-restart behavior as much as possible. But I don't feel strongly about which is better. I think either would work for my use-case.

There's some convenience in making the entire thing fire-and-forget, as that will simplify init-scripts. But I could just as easily launch bash with api_call_1; sleep $drain_time; api_call_2, so I don't feel too strongly about that one either.

I'm tempted to make each of those operations a separate admin command. Some users may want to close the admin port before starting the new envoy, whereas some may not be starting a new envoy but still want a graceful shutdown, so they don't want to close admin.
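
For concreteness, here is a rough, self-contained sketch of what the fire-and-forget sequence amounts to (the function names and values are placeholders, not Envoy's API; the PR's handler itself uses the dispatcher and parentShutdownTime() rather than sleeping):

    #include <chrono>
    #include <iostream>
    #include <thread>

    // Placeholder stand-ins for the operations discussed above.
    void startGradualDrain() { std::cout << "drain sequence started\n"; }     // ~startDrainSequence()
    void closeListenerSockets() { std::cout << "listener sockets closed\n"; } // ~closeListeners()
    void shutDownServer() { std::cout << "server exiting\n"; }                // ~server_.shutdown()

    int main() {
      // Assumed value for illustration; the PR reads this from --parent-shutdown-time-s.
      const std::chrono::seconds parent_shutdown_time(900);

      startGradualDrain();       // existing connections bleed off gradually
      closeListenerSockets();    // frees shared resources so a new Envoy can start
      std::this_thread::sleep_for(parent_shutdown_time); // the PR uses a dispatcher timer instead
      shutDownServer();
    }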

@htuch removed their assignment May 7, 2018
@mattklein123 (Member)

> There's some convenience in making the entire thing fire-and-forget, as that will simplify init-scripts. But I could just as easily launch bash with api_call_1; sleep $drain_time; api_call_2, so I don't feel too strongly about that one either.

FWIW, this is basically what we do. We have wrapper run scripts that do various things.

> I'm tempted to make each of those operations a separate admin command. Some users may want to close the admin port before starting the new envoy, whereas some may not be starting a new envoy but still want a graceful shutdown, so they don't want to close admin.

I think my main concern is confusion over which admin commands to use. It seems reasonable to provide the requested shutdown functionality, but it's not clear to me if they should be different commands, params to /healthcheck/fail, etc. Thoughts?

@ggreenway (Contributor, author)

I don't think this should be a param to /healthcheck/fail because one of the operations is closing listeners, and that's not easily reversible, whereas /healthcheck/ok should completely undo /healthcheck/fail.

But we could add an option to /healthcheck/fail to start the draining process using the gentle drain vs. the all-at-once drain.

Also, I just realized one more quirk: because this shuts down the admin port, a timer-based kill is needed to stop the server at the end, since we can't reach /killkillkill anymore.

I think I might make shutting down the admin port optional, defaulting to false, because if you do that accidentally you've shot yourself in the foot pretty badly.
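
As a tiny illustrative sketch of that opt-in idea (the close_admin parameter name and the string parsing are hypothetical, not part of this PR):

    #include <iostream>
    #include <string>

    // Hypothetical: default to keeping the admin port open; require an explicit
    // opt-in before closing it, since closing admin cuts off the kill endpoint.
    bool shouldCloseAdmin(const std::string& query_string) {
      return query_string.find("close_admin=true") != std::string::npos;
    }

    int main() {
      std::cout << shouldCloseAdmin("") << "\n";                  // 0: admin stays up (default)
      std::cout << shouldCloseAdmin("close_admin=true") << "\n";  // 1: explicit opt-in
    }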

@ggreenway (Contributor, author)

I think we should also disallow adding new listeners via LDS after we've closed listeners, so the management server doesn't need to be aware of the shutdown status. Thoughts?

@mattklein123 (Member)

@ggreenway I think the thing I'm confused about is why you would want the admin server to be shut down. I.e., why wouldn't you just hot restart in that case? I get why someone might want to do a listener close operation depending on how their deployment works re: health checking or additional load balancers, but the full restart while the other one keeps running is kind of confusing to me.

@mattklein123 (Member)

> I think we should also disallow adding new listeners via LDS after we've closed listeners, so the management server doesn't need to be aware of the shutdown status. Thoughts?

Yeah, I would probably shut down LDS at that point. I guess the question is if you want to accept changes to existing listeners while not adding new ones, though that sounds complicated.

@ggreenway (Contributor, author)

> I think the thing I'm confused about is why you would want the admin server to be shut down. I.e., why wouldn't you just hot restart in that case?

This feature helps deal with the case when the hot-restart-version has changed. I want to stop old-envoy and start new-envoy. To start new-envoy, old-envoy needs to release all shared resources.

@mattklein123 (Member)

> This feature helps deal with the case when the hot-restart-version has changed. I want to stop old-envoy and start new-envoy. To start new-envoy, old-envoy needs to release all shared resources.

I see. The way we deal with this at Lyft is just to operationalize around it. Basically we do a rolling restart in which we:

  1. /healthcheck/fail
  2. sleep N
  3. term/restart

IMO letting the old process keep going without monitoring it is potentially not great from an ops perspective and I wouldn't personally recommend it. I'm also not sure if it's worth the engineering effort to build the full close (including admin) into Envoy itself. With that said, I don't feel that strongly about it.

(Another thing to point out is that the hot restart version changes so infrequently that IMO it's not a big deal if you drop a few extra connections that you wouldn't otherwise drop.)

@ggreenway (Contributor, author)

Some of the things I deal with involve very long-lived connections, so I want the "sleep N" to be a really long time. But I don't want to wait that long to start the new envoy. I agree that an un-monitored Envoy isn't a great idea, but given that this is very rare (due to the infrequently changing hot-restart-version), this is the best tradeoff I've found for my operational needs.

My other idea on the admin port is to use a file-based UDS, and just move the old one to a different path before starting new-envoy.

@mattklein123 (Member)

> this is the best tradeoff I've found for my operational needs.

Yup, fair enough. I think if we implement this, we just need to be super clear in the docs about what to use when from an ops perspective, and hopefully have as little overlap as possible between the commands.

@wrowe (Contributor) commented May 9, 2018

Having some problems understanding your logic here:

  • server_.drainListeners();
  • server_.listenerManager().closeListeners();

If during drainListeners() we are pulling backlog queued connections off of the listener, and that listener continues to fill with new connections, how will we reach closeListeners()?

This doesn't seem to solve the underlying issue in #2920 of leaving listeners open and accepting connections in the three cases that claim to "stop listening" (under the hood, libevent makes this claim about evconnlistener_disable(), but it certainly continues listening and simply stops accepting connections; connections up to the backlog value had been established and were accepting some data).

@ggreenway (Contributor, author)

@wrowe drainListeners() does not synchronously drain listeners; it starts the process documented at https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/draining.html.

The goal of this PR wasn't to fix #2920; I'm figuring out if that can also be fixed though. The description at the top of the PR describes the problem I'm trying to address.
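
To illustrate the gradual behaviour that the linked draining doc describes, here is a simplified sketch (not DrainManagerImpl's actual code): each time a connection hits a drain decision point, it is asked to close with a probability that grows from 0 to 1 over the drain window, so established connections bleed off gradually rather than all at once.

    #include <algorithm>
    #include <chrono>
    #include <random>

    // Simplified illustration only: the close probability ramps up over the drain window.
    bool shouldDrainClose(std::chrono::steady_clock::time_point drain_start,
                          std::chrono::seconds drain_window) {
      static std::mt19937 rng{std::random_device{}()};
      std::uniform_real_distribution<double> dist(0.0, 1.0);
      const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - drain_start;
      const double fraction =
          std::min(1.0, elapsed.count() / std::chrono::duration<double>(drain_window).count());
      return dist(rng) < fraction;  // more likely to close as the window elapses
    }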

@jmarantz (Contributor) left a comment:

Just a couple of paranoid comments; it's possible the design prevents the concerns I was fabricating :)

const std::chrono::seconds shutdown_time(server_.options().parentShutdownTime());
ENVOY_LOG(info, "Starting graceful shutdown. Server will exit in {} seconds.",
          shutdown_time.count());
graceful_shutdown_timer_ = server_.dispatcher().createTimer([this]() { server_.shutdown(); });
Review comment (Contributor):

It looks like if AdminImpl is destructed before the timeout is reached, the call to server_.shutdown() will crash. Can we cancel the timer in the destructor if non-null, and null the timer here?

Or is this not necessary due to the way the server_.dispatcher() works?

@ggreenway (Contributor, author) replied:

The timer is held in a unique_ptr. If it is destructed, the timer is cancelled/unscheduled.
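
As a minimal illustration of why that is safe (a generic sketch with made-up types, not Envoy's actual Timer/Dispatcher API): when the owning object is destroyed, its unique_ptr member destroys the timer first, which drops the pending callback, so nothing can fire against a dead `this`.

    #include <functional>
    #include <iostream>
    #include <memory>

    // Generic stand-in for a dispatcher timer: destroying it cancels the callback.
    class FakeTimer {
    public:
      explicit FakeTimer(std::function<void()> cb) : cb_(std::move(cb)) {}
      ~FakeTimer() { cb_ = nullptr; }           // cancellation on destruction
      void fire() { if (cb_) cb_(); }           // only runs if still armed
    private:
      std::function<void()> cb_;
    };

    class AdminLike {
    public:
      void scheduleShutdown() {
        timer_ = std::make_unique<FakeTimer>([this]() { std::cout << "shutdown\n"; });
      }
    private:
      std::unique_ptr<FakeTimer> timer_;  // destroyed with the owner, cancelling the timer
    };

    int main() {
      AdminLike admin;
      admin.scheduleShutdown();
    }  // admin (and timer_) destroyed here; the lambda capturing `this` can no longer run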

@@ -622,6 +622,20 @@ void ListenerManagerImpl::stopListeners() {
  }
}

void ListenerManagerImpl::closeListeners() {
  for (auto& listener : active_listeners_) {
    listener->close();
Review comment (Contributor):

does this ultimately mutate active_listeners_? When does a previously active listener get removed from the collection?

@ggreenway (Contributor, author) replied:

Listeners are only removed when LDS removes them. This call does not mutate the vector (or list?), only the elements inside.

@ggreenway (Contributor, author):

close() maybe should be renamed to closeSocket() to make it clearer.

@mattklein123 (Member)

@ggreenway FYI in #3340 I moved LDS API into the listener manager, which might make it easier for you to shut down the LDS provider as part of your change.

@jmarantz (Contributor) left a comment:

This seems reasonable to me, code-wise; I learned more about the awesome layers of SW in Envoy's codebase.

However I don't have a good feel for the operational tradeoffs as discussed. Happy to review further if desired; I assume the next step is to get CI green?

@ggreenway (Contributor, author)

Yeah, I got busy with other things. I'll try to get back to this sometime soon.

Still need to write tests for this.

@ggreenway (Contributor, author)

I've gotten busy on other things. Closing this until I have time to re-visit.

@lizan (Member) commented Jan 31, 2019

@ggreenway What is the status of this PR? It is getting higher priority on the Istio side; do you mind if I take over?

@ggreenway (Contributor, author)

@lizan Yes that's fine if you want to take it over.
