-
Notifications
You must be signed in to change notification settings - Fork 689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unrelated endpoint changes trigger ClusterLoadAssignment change and push #3782
Comments
setup.txt |
Right now contour doesn’t implement incremental xDS so we do send the state of the world on changes to resources |
Related: #1176 |
Thanks for the detailed report @Zsolt-LazarZsolt. I agree that Contour should ideally not be sending endpoint updates when the updates haven't changed. I'm not sure if we will be able to do much about it until we've fixed #1176, but one of us will see if we can replicate and understand the cause of the problem. |
Hello Everyone. Thank you for the quick assessment. I have trouble seeing the connection between the two tickets, let me elaborate on the issue. Additionally @tsaarni created the following, so far untested change. I plan to check it in my own environment. |
I don't think that #1176 has anything to do with this. The scenario you showed @Zsolt-LazarZsolt should not have cause any xDS changes, so no updates should have happened. We have logic to check if a service that changed causes an update. So in your example the update of the When the endpoint for that service changes, then the endpointstranslator should not find the matching cluster and should not cause the eps to rebuild. So all that logic is good (but found some places where we can add some tests). I played around with this and I think what's happening is that when an endpoint is changed, the So ideally we'd need to look at if the change was caused by an endpoint that's relevant before calling recalculate. Additionally, I think we'd benefit from a backoff timer which was added to the DAG rebuild a while back. This would delay the number of updates that happen to Envoy. We'd need to play with the timing, but if you have a lot of endpoint changes across the cluster, we're going to be bombarding Envoy with xDS updates. |
haha that timing @Zsolt-LazarZsolt! 😀 |
I'd need to test as well, but at a first glance, the patch from @tsaarni looks like it's on the same path that I came down in my response. I think if we combined that with a backoff timer, that would really help out! |
ah i misread, i thought the HTTPProxy etc. also referenced the irrelevant service, but it does not 👀 |
Maybe it is also good to add, that in case a new unrelated service is added. (clusterIP service without any endpoints) The manifest looks as follows:
Although when the same service is removed. I can see the following in the contour debug log:
Sot this also triggers the same kind of behaviour, on delete. |
@skriss If you have time to work with it, it would be perfect! I could have worked with it after the other PRs that I have open, but would be grateful if you can work with it for 1.18.0. Just the other day we were chatting with @Zsolt-LazarZsolt about this issue. It seems even a simple fix without backoff timer could help lowering some CPU consumption on a busy cluster quite nicely. |
In the EndpointsTranslator, do not recompute ClusterLoadAssignments or notify cache observers if the Endpoints being added/updated/deleted is not used by a ServiceCluster in the DAG. Closes projectcontour#3782.
In the EndpointsTranslator, do not recompute ClusterLoadAssignments or notify cache observers if the Endpoints being added/updated/deleted is not used by a ServiceCluster in the DAG. Closes projectcontour#3782.
In the EndpointsTranslator, do not recompute ClusterLoadAssignments or notify cache observers if the Endpoints being added/updated/deleted is not used by a ServiceCluster in the DAG. Closes projectcontour#3782. Signed-off-by: Steve Kriss <krisss@vmware.com>
In the EndpointsTranslator, do not recompute ClusterLoadAssignments or notify cache observers if the Endpoints being added/updated/deleted is not used by a ServiceCluster in the DAG. Closes #3782. Signed-off-by: Steve Kriss <krisss@vmware.com>
internal/xdscache: ignore irrelevant Endpoints (projectcontour#3852) In the EndpointsTranslator, do not recompute ClusterLoadAssignments or notify cache observers if the Endpoints being added/updated/deleted is not used by a ServiceCluster in the DAG. Closes projectcontour#3782. Signed-off-by: Steve Kriss <krisss@vmware.com>
What steps did you take and what happened:
I used the 1.15 contour example.yaml to reproduce the issue, the only changed thing is adding --debug to the contour serve process, and changing "--log-level debug" for envoy.
I installed 2 services to the cluster both tied with a separate deployment with one pod.
and created the simplest httpproxy I could think of sending traffic of example.com to the relevant-service.
So at this point the setup is simple and contour in theory only cares about the relevant-service, traffic is fine.
Next I follow the logs of one contour pod with kubectl logs -f and at the same time I do
kubectl edit deployment other-deployment, where I change the replica count to 2.
This creates a new pod, and a new endpoint on a service that in theory contour does not have anything to do with.
To my surprise I see the following in the contour logs:
time="2021-06-08T08:17:11Z" level=debug msg="handling v3 xDS resource request" connection=12 context=xds node_id=envoy-z8wxd node_version=v1.18.3 resource_names="[default/relevant-service/http]" response_nonce=18 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=18
time="2021-06-08T08:17:11Z" level=debug msg="handling v3 xDS resource request" connection=9 context=xds node_id=envoy-48brv node_version=v1.18.3 resource_names="[default/relevant-service/http]" response_nonce=18 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=18
time="2021-06-08T08:17:11Z" level=debug msg="handling v3 xDS resource request" connection=10 context=xds node_id=envoy-r495t node_version=v1.18.3 resource_names="[default/relevant-service/http]" response_nonce=18 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=18
Contour triggers and sends the endpoint list to all connected envoy pods.
This on a busy cluster does happen a lot of times, as this is triggered by any endpoint in any namespace especially when most pods do have failing readiness probes.
This multiplies by the dinamyc clusters in contour so for every backend service, and also multiplies by the connected envoys.
This consumes network traffic, and resource usage in both contour and envoy.
What did you expect to happen:
I would expect no endpoint push triggered by unrelated endpoint changes.
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
kubectl version
): 1.21/etc/os-release
):The text was updated successfully, but these errors were encountered: