internal/grpc: properly handle Envoy ACK/NACK protocol #1176
Updates projectcontour#499. Updates projectcontour#273. Updates projectcontour#1176.

The xDS spec says that Envoy always initiates a stream with a discovery request and expects the management server to respond with only one discovery response. Envoy then initiates another discovery request containing an ACK or a NACK for the previous response. Currently Contour ignores the ACK/NACK (this is projectcontour#1176), but after inspecting the current code it is evident that we're also not waiting for Envoy to send the next discovery request.

This PR removes the inner `for {}` loop that would continue to reuse the initial discovery request until the client disconnected. The previous code was written at a time when we'd just implemented filtering and it was possible for the filter to return no results, so the inner loop was -- incorrectly -- trying to loop until there was a result to return.

Huge thanks to @lrouquette who pointed this out.

Signed-off-by: Dave Cheney <dave@cheney.net>
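For illustration, here is a minimal Go sketch of the corrected stream shape this PR describes, using simplified stand-in types rather than the real Envoy protobufs or Contour's actual code (the type and function names are assumptions). The key point is that every response is preceded by a fresh `Recv()`, so the next request, which carries the ACK or NACK, is never ignored.

```go
package xdsstream

import "io"

// discoveryRequest and discoveryResponse are simplified stand-ins for the
// Envoy xDS protobuf messages; the real types live in go-control-plane.
type discoveryRequest struct {
	VersionInfo   string
	ResponseNonce string
	ErrorDetail   error // non-nil means the previous response was NACKed
}

type discoveryResponse struct {
	VersionInfo string
	Nonce       string
}

// stream stands in for the bidirectional gRPC stream.
type stream interface {
	Recv() (*discoveryRequest, error)
	Send(*discoveryResponse) error
}

func handleStream(st stream, build func(*discoveryRequest) *discoveryResponse) error {
	for {
		// Wait for the next request; after the first one, each request
		// carries the ACK or NACK for the previous response.
		req, err := st.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}

		// Send exactly one response per request; do not loop on the
		// same request until results appear.
		if err := st.Send(build(req)); err != nil {
			return err
		}
	}
}
```

With the inner loop removed, an empty filter result simply produces an empty response, and the server then waits for Envoy's next request instead of spinning on the old one.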
NACKs happen (at least) when config validation fails. Envoy erroneously throws exceptions in its config constructors (example), which will leak memory for anything allocated in the constructor body (destructors aren't run if exceptions are thrown in the ctor body; see Item 10 in More Effective C++ for more detail). In the worst case Contour continues to send the same resources and gets NACKs back while Envoy leaks memory until it's OOM-killed. Whether memory leaks, and how much, depends on which object in the deeply nested config classes throws an exception and where. Contour needs to mitigate this for xDS with resource hints by not sending the resource any more once it's NACKed. I don't know what mitigation it could do for LDS/CDS other than keeping the previous set of resources around so they can be sent if the most recent version is NACKed.
Updates #1176.

The xDS spec says that if Envoy NACKs an update we shouldn't try to send it again. Given that the update came from Kubernetes and was translated by Contour, either there's a bug in our translation or a mistake in the data that a user entered into Kubernetes.

For 0.15, log the NACK at ERROR level in the hope that this will give us some visibility into what kind of invalid configuration we're transmitting to Envoy and the frequency of those mistakes.

Ideally we'd be able to trace a NACK all the way back to the originating record and cordon it. However, given that records like RDS and LDS are gestalts, cordoning the invalid part of the configuration is rather complicated. Perhaps the better solution would be to strengthen our validation to prevent these errors reaching Envoy.

Signed-off-by: Dave Cheney <dave@cheney.net>
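As a rough sketch of what logging the NACK at ERROR level could look like, assuming simplified stand-in types rather than Contour's or go-control-plane's actual API: per the xDS protocol, a DiscoveryRequest whose `error_detail` field is set is a NACK of the previously sent version.

```go
package nacklog

import "log"

// discoveryRequest is a simplified stand-in for the Envoy xDS protobuf type.
type discoveryRequest struct {
	TypeUrl       string
	VersionInfo   string
	ResponseNonce string
	ErrorDetail   *string // non-nil carries Envoy's rejection message
}

// logNACK records rejected configuration at error level so operators can see
// which updates Envoy refused to apply; requests without error_detail are ACKs.
func logNACK(req *discoveryRequest) {
	if req.ErrorDetail == nil {
		return // an ACK: the previous response was accepted
	}
	log.Printf("ERROR: Envoy NACKed %s version %q (nonce %q): %s",
		req.TypeUrl, req.VersionInfo, req.ResponseNonce, *req.ErrorDetail)
}
```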
For 0.15 I've landed #1350 and I plan to leave this issue here for the moment. Sending invalid updates to Envoy is a problem -- we shouldn't do that -- but I don't know what to do with the signal once we have sent an invalid update. Logging is about the least worst thing we can do. Trying to cordon the offending update is complicated; it depends on which table it came through. Crashing Contour is a non-starter, so logging is about the best I can think to do at the moment. Hopefully logging at error level will give us a signal from the field about which updates are causing the problem, and we'll know whether it's something we can systematically filter out on the Contour side or whether we need to build a framework to trace NACK'd responses from Envoy back to their original k8s object.
Okay, I'm having a look into this, and I'm going to start by checking that it's still a problem. To do this, I'm going to do something like this:
Okay, I ran the above process with both the "contour" and "envoy" xDS engines, and got the following results:
In both cases, I did not need to restart Envoy to have it pick up new changes. So, this means that this process is a lot more forgiving than I thought. Which is great! But I think there's still more to investigate here.
This one's waiting on @wrowe to have some bandwidth. Moving to parking lot 1.
#4276 says that if we want to support it, we must solve this issue first, but after reading all the issues about this one, I have a question:
so when the config becomes valid again, restarting Contour should no longer be needed. After reading all the links above and going through Contour's source code, I don't know what the real problem here is.
We have stalled a bit on working on the ACK/NACK problem because, as you say, we've found that we pretty much validate all user input anyway, so we don't send invalid config to Envoy and it is very difficult to generate a NACK at all at the moment.

An ACK is maybe more interesting to figure out: it could allow us to set some status on HTTPProxy/Route etc. that tells users when their config is actually in effect. However, figuring out when all the nodes have ACKed configuration, and making sure updates do not cross each other and are processed sequentially, is not necessarily super straightforward.
@sunjayBhatia sorry, I don't quite understand what you mean. Could you please explain it in more detail and more plainly? Thanks.
When we send a configuration update to a fleet of Envoys, they may return an ACK or NACK to signify whether the update was accepted. Because we heavily validate user input, the configuration we generate in practice is never NACKed at the moment, so we have held off making much change here; it would add complication for not much benefit.

Figuring out whether a change was ACKed would be useful so we can signal to the user when exactly their configuration has taken effect (useful for something like Knative etc.). However, in the future, if we support WASM modules, Lua scripting, or something else that we cannot fully validate up front and that Envoy could return a NACK for, we will need a way to signal to users that their resources are invalid. We've again held off on adding these sorts of features and "solving the NACK problem" because of the below:

Figuring out whether a specific change was ACKed or NACKed is not actually straightforward because of how we generate snapshots of xDS config to send to Envoy. Changes in k8s resources basically get stacked on top of each other and sent asynchronously. For example, if a change to HTTPProxy A generates a new snapshot-1 that is NACKed, and then a change to HTTPProxy B generates snapshot-2 that is also NACKed, we don't currently have a good way to decipher whether the changes to HTTPProxy A and B were each valid or not, since snapshot-2 includes the changes to both resources. Getting this to work is a larger piece of work to modify how we process k8s resource updates and send them along to Envoy, likely introducing some sort of blocking or a complex accounting system, which is a change we have held off on making so far.
I don't quite understand what you mean; according to the ACK and NACK semantics summary:
Maybe we can cache the latest valid snapshot, and when the new one is misconfigured, restore it.
This doesn't have anything in particular to do with the xDS protocol, but rather with how Contour generates the snapshots. We take into account the current state of all the k8s resources in the cluster. In the example above:
With how we generate snapshots today, the changes from HTTPProxy A and B will both be applied to the DAG to generate snapshot-2, so a NACK will be for the aggregate of the changes, not an individual one.
This would involve caching the last good state of the DAG so we can apply the k8s resource changes on top of it, rather than caching the xDS config snapshot. In general, isolating which k8s resource change causes a NACK will involve a significant addition of accounting logic, caching of the DAG, etc., and most likely blocking on rebuilding the DAG and sending resource updates until we get ACKs/NACKs, which we do not do today. A separate issue is what we should do on startup: if there are invalid resources in the cluster when Contour restarts, we have no prior valid configuration to fall back to, so we may end up incorrectly dropping traffic when we shouldn't.
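To make the "cache the latest valid snapshot and restore it" idea above concrete, here is a rough sketch of the bookkeeping it implies; every type and method name here is hypothetical and does not reflect Contour's actual snapshot or DAG code.

```go
package rollback

// snapshot is a hypothetical stand-in for a generated xDS config snapshot.
type snapshot struct {
	Version   string
	Resources map[string][]byte // type URL -> serialized resources
}

// cache tracks the last snapshot Envoy accepted and the one awaiting a verdict.
type cache struct {
	lastGood *snapshot // most recent snapshot that Envoy ACKed
	pending  *snapshot // snapshot sent but not yet ACKed or NACKed
}

// Send records a newly generated snapshot as pending.
func (c *cache) Send(s *snapshot) { c.pending = s }

// OnACK promotes the pending snapshot to the last known good one.
func (c *cache) OnACK(version string) {
	if c.pending != nil && c.pending.Version == version {
		c.lastGood, c.pending = c.pending, nil
	}
}

// OnNACK discards the rejected snapshot and returns the last good one so it
// can be re-served, rather than repeatedly resending the bad version.
func (c *cache) OnNACK(version string) *snapshot {
	if c.pending != nil && c.pending.Version == version {
		c.pending = nil
	}
	return c.lastGood
}
```

Note this sketch only covers a single Envoy; with a fleet, the cache would also need per-node tracking of which version each node has ACKed, which is part of the accounting complexity described in the comments above.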
I/we would like to design and implement this, but I think we need a community meeting to discuss the whole matter first.
We're planning on starting up community meetings again soon, with an ad-hoc meeting coming next week or the week after. We will announce it on Slack and the community mailing list: https://groups.google.com/g/project-contour
Glad to hear that.
The Contour project currently lacks enough contributors to adequately respond to all Issues. This bot triages Issues according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack
Envoy's stream protocol states that after each DiscoveryResponse message, the subsequent DiscoveryRequest will contain an ACK or NACK for the previous response. Contour currently doesn't pay attention to this ACK/NACK protocol, but for better compliance with the xDS spec we should.
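As a reference point, here is a small sketch of how the subsequent DiscoveryRequest distinguishes an ACK from a NACK, using simplified stand-ins for the Envoy protobuf types (not the real go-control-plane API): an ACK echoes the nonce of the previous response and advances version_info to the new version, while a NACK echoes the nonce but keeps the old version_info and sets error_detail.

```go
package ack

// discoveryRequest is a simplified stand-in for the Envoy xDS protobuf type.
type discoveryRequest struct {
	VersionInfo   string  // last version Envoy successfully applied
	ResponseNonce string  // nonce of the response being acknowledged
	ErrorDetail   *string // set only when the response was rejected
}

// classify reports whether req acknowledges or rejects the response that was
// sent with sentVersion and sentNonce.
func classify(req *discoveryRequest, sentVersion, sentNonce string) string {
	switch {
	case req.ResponseNonce != sentNonce:
		return "stale" // not a reply to the response we just sent
	case req.ErrorDetail != nil:
		return "NACK" // Envoy rejected the config; VersionInfo stays at the old version
	case req.VersionInfo == sentVersion:
		return "ACK" // Envoy applied the new version
	default:
		return "unknown"
	}
}
```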