
Unrelated endpoint changes trigger ClusterLoadAssignment change and push #3782

Closed
Zsolt-LazarZsolt opened this issue Jun 8, 2021 · 12 comments · Fixed by #3852
Labels
kind/bug Categorizes issue or PR as related to a bug.

@Zsolt-LazarZsolt

What steps did you take and what happened:
I used the 1.15 contour example.yaml to reproduce the issue; the only changes are adding --debug to the contour serve process and setting "--log-level debug" for envoy.

I installed 2 services in the cluster, each tied to a separate deployment with one pod:

  • relevant-service
  • other-service

and created the simplest HTTPProxy I could think of, routing traffic for example.com to relevant-service.
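For reference, a minimal HTTPProxy for this setup might look like the following sketch. This is an assumed reconstruction, not the attached manifests; the service port (80) is illustrative:

```yaml
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: relevant-proxy
  namespace: default
spec:
  virtualhost:
    fqdn: example.com
  routes:
  - services:
    - name: relevant-service
      port: 80
```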

So at this point the setup is simple and contour, in theory, only cares about relevant-service; traffic is fine.
Next I follow the logs of one contour pod with kubectl logs -f, and at the same time I run
kubectl edit deployment other-deployment, where I change the replica count to 2.

This creates a new pod, and a new endpoint, on a service that contour in theory has nothing to do with.

To my surprise I see the following in the contour logs:
time="2021-06-08T08:17:11Z" level=debug msg="handling v3 xDS resource request" connection=12 context=xds node_id=envoy-z8wxd node_version=v1.18.3 resource_names="[default/relevant-service/http]" response_nonce=18 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=18
time="2021-06-08T08:17:11Z" level=debug msg="handling v3 xDS resource request" connection=9 context=xds node_id=envoy-48brv node_version=v1.18.3 resource_names="[default/relevant-service/http]" response_nonce=18 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=18
time="2021-06-08T08:17:11Z" level=debug msg="handling v3 xDS resource request" connection=10 context=xds node_id=envoy-r495t node_version=v1.18.3 resource_names="[default/relevant-service/http]" response_nonce=18 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=18

Contour triggers and sends the endpoint list to all connected envoy pods.

On a busy cluster this happens very often, since it is triggered by any endpoint change in any namespace, especially when many pods have failing readiness probes.
The cost multiplies by the number of dynamic clusters in contour (one per backend service), and again by the number of connected envoys.
This consumes network traffic and resources in both contour and envoy.

What did you expect to happen:
I would expect no endpoint push triggered by unrelated endpoint changes.


Environment:

  • Contour version: 1.15.1
  • Kubernetes version: 1.21
@Zsolt-LazarZsolt Zsolt-LazarZsolt added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Jun 8, 2021
@Zsolt-LazarZsolt
Author

setup.txt
Forgot to add my file for the setup
Also attaching the contour manifests with my changes. (Switching on debug logging, and removing envoy probes.)

contour.txt

@sunjayBhatia
Member

Right now contour doesn't implement incremental xDS, so we do send the state of the world on changes to resources.
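To illustrate the difference being described: with state-of-the-world (SotW) xDS, every change re-sends the full resource set, while incremental (delta) xDS would send only what changed. A toy sketch (illustrative Python, not Contour's actual Go implementation; the function names are hypothetical):

```python
# Toy comparison of state-of-the-world (SotW) vs incremental xDS responses.
# Contour currently implements SotW, so every change re-sends everything.

def sotw_response(all_resources, changed):
    # SotW: the full current resource set goes out on every update.
    return dict(all_resources)

def delta_response(all_resources, changed):
    # Incremental: only resources that actually changed go out.
    return {name: all_resources[name] for name in changed}

resources = {"default/relevant-service/http": "v2",
             "default/other-service/http": "v5"}
changed = {"default/other-service/http"}

print(len(sotw_response(resources, changed)))   # 2 (everything)
print(len(delta_response(resources, changed)))  # 1 (just the change)
```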

@sunjayBhatia
Member

Related: #1176

@youngnick
Member

Thanks for the detailed report @Zsolt-LazarZsolt. I agree that Contour should ideally not be sending endpoint updates when the endpoints haven't changed. I'm not sure if we will be able to do much about it until we've fixed #1176, but one of us will see if we can replicate and understand the cause of the problem.

@Zsolt-LazarZsolt
Author

Hello Everyone. Thank you for the quick assessment.

I have trouble seeing the connection between the two tickets, so let me elaborate on the issue.
My issue is not with the type of the update; my issue is that an update toward the envoy pods happens at all.
So basically any endpoint change in the cluster will trigger the communication, even if the endpoint is not part of any configured upstream.

Additionally @tsaarni created the following, so far untested change. I plan to check it in my own environment.
What do you think, can it have unforeseen consequences?
Nordix@0be2438

@stevesloka
Member

I don't think that #1176 has anything to do with this. The scenario you showed @Zsolt-LazarZsolt should not have caused any xDS changes, so no updates should have happened.

We have logic to check if a service that changed causes an update. So in your example the update of the other-service shouldn't get added to the clusters that Contour cares about.

When the endpoint for that service changes, then the endpointstranslator should not find the matching cluster and should not cause the eps to rebuild.

So all that logic is good (but found some places where we can add some tests).

I played around with this, and I think what's happening is that when an endpoint changes, Recalculate is always called, which builds a new endpoints cache, causing that change to be passed on to Envoy.

So ideally we'd need to check whether the change was caused by a relevant endpoint before calling Recalculate. Additionally, I think we'd benefit from a backoff timer like the one added to the DAG rebuild a while back. This would reduce the number of updates that hit Envoy. We'd need to tune the timing, but if you have a lot of endpoint changes across the cluster, we're going to be bombarding Envoy with xDS updates.
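The relevance check described here can be sketched roughly as follows. This is illustrative Python, not Contour's actual Go code; the class and method names (`EndpointsTranslator`, `on_endpoints_change`, `recalculate`) are hypothetical stand-ins for the real internals:

```python
# Illustrative sketch: only rebuild the ClusterLoadAssignment cache
# (and so trigger an xDS push) when the changed Endpoints object is
# actually referenced by a ServiceCluster in the DAG.
# All names here are hypothetical, not Contour's real identifiers.

class EndpointsTranslator:
    def __init__(self, service_clusters):
        # service_clusters: set of (namespace, name) Services the DAG references
        self.service_clusters = set(service_clusters)
        self.recalculations = 0

    def on_endpoints_change(self, namespace, name):
        # Skip the expensive rebuild for Endpoints no ServiceCluster cares about.
        if (namespace, name) not in self.service_clusters:
            return False
        self.recalculate()
        return True

    def recalculate(self):
        # Placeholder for rebuilding ClusterLoadAssignments and
        # notifying xDS cache observers.
        self.recalculations += 1

t = EndpointsTranslator({("default", "relevant-service")})
t.on_endpoints_change("default", "other-service")     # ignored, no rebuild
t.on_endpoints_change("default", "relevant-service")  # triggers a rebuild
print(t.recalculations)  # 1
```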

@stevesloka
Member

haha that timing @Zsolt-LazarZsolt! 😀

@stevesloka
Member

I'd need to test as well, but at first glance, the patch from @tsaarni looks like it's on the same path I went down in my response.

I think if we combined that with a backoff timer, that would really help out!
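A backoff timer of the kind mentioned could look roughly like the following debounce sketch. This is illustrative Python with an explicit fake clock for determinism; Contour's actual DAG-rebuild holdoff is implemented differently, in Go:

```python
# Illustrative debounce sketch: coalesce a burst of endpoint-change
# events into a single xDS push once a holdoff window elapses.
# Times are passed in explicitly (a fake clock) so behaviour is deterministic.

class Debouncer:
    def __init__(self, holdoff_seconds):
        self.holdoff = holdoff_seconds
        self.pending = 0       # changes accumulated since the last push
        self.deadline = None   # when the current holdoff window ends
        self.pushes = 0        # coalesced xDS pushes actually sent

    def event(self, now):
        # Record a change; start the holdoff window on the first one.
        self.pending += 1
        if self.deadline is None:
            self.deadline = now + self.holdoff

    def tick(self, now):
        # Called periodically; fires one push for the whole burst.
        if self.deadline is not None and now >= self.deadline:
            self.pushes += 1
            self.pending = 0
            self.deadline = None

d = Debouncer(holdoff_seconds=0.5)
for t in (0.0, 0.1, 0.2, 0.3):   # four rapid endpoint changes
    d.event(t)
d.tick(0.4)                       # window not yet elapsed: no push
d.tick(0.6)                       # one coalesced push for all four changes
print(d.pushes)  # 1
```

The trade-off is latency: endpoints briefly lag behind reality by up to the holdoff window, which is why the timing would need tuning.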

@sunjayBhatia
Member

ah i misread, i thought the HTTPProxy etc. also referenced the irrelevant service, but it does not 👀

@Zsolt-LazarZsolt
Author

Zsolt-LazarZsolt commented Jun 9, 2021

Maybe it is also good to add that when a new unrelated service is added (a ClusterIP service without any endpoints),
nothing seems to appear in the debug log (which is good).

The manifest looks as follows:

kind: Service
apiVersion: v1
metadata:
  name: dummy-endpoint
spec:
  type: ClusterIP
  ports:
  - port: 9191
    targetPort: 9191

Although when the same service is removed, I can see the following in the contour debug log:

time="2021-06-09T16:16:39Z" level=info msg="performing delayed update" context=contourEventHandler last_update=28.282869247s outstanding=1
time="2021-06-09T16:16:39Z" level=debug msg="added ServiceCluster "default/relevant-service/http" from DAG" context=endpointstranslator
time="2021-06-09T16:16:39Z" level=debug msg="cluster load assignments did not change" context=endpointstranslator
time="2021-06-09T16:16:39Z" level=debug msg="received a status update" context=StatusUpdateHandler name=relevant-proxy namespace=default
time="2021-06-09T16:16:39Z" level=debug msg="update was a no-op" context=StatusUpdateHandler name=relevant-proxy namespace=default
time="2021-06-09T16:16:39Z" level=debug msg="handling v3 xDS resource request" connection=2 context=xds node_id=envoy-gcf8g node_version=v1.18.3 resource_names="[]" response_nonce=4 type_url=type.googleapis.com/envoy.config.cluster.v3.Cluster version_info=4
time="2021-06-09T16:16:39Z" level=debug msg="handling v3 xDS resource request" connection=1 context=xds node_id=envoy-gcf8g node_version=v1.18.3 resource_names="[]" response_nonce=4 type_url=type.googleapis.com/envoy.config.listener.v3.Listener version_info=4
time="2021-06-09T16:16:39Z" level=debug msg="handling v3 xDS resource request" connection=3 context=xds node_id=envoy-gcf8g node_version=v1.18.3 resource_names="[ingress_http]" response_nonce=4 type_url=type.googleapis.com/envoy.config.route.v3.RouteConfiguration version_info=4

So this also triggers the same kind of behaviour, on delete.

@skriss
Member

skriss commented Jun 29, 2021

I'm gonna add this to the 1.18 milestone, I think we should get this fixed. I also tested out a similar patch to @tsaarni's and it looked good in my initial testing. @tsaarni did you want to get your patch up as a PR or would you rather someone else take this on?

@skriss skriss added this to the 1.18.0 milestone Jun 29, 2021
@tsaarni
Member

tsaarni commented Jun 30, 2021

@skriss If you have time to work on it, that would be perfect! I could have picked it up after the other PRs I have open, but I'd be grateful if you can take it for 1.18.0.

Just the other day we were chatting with @Zsolt-LazarZsolt about this issue. It seems even a simple fix without the backoff timer could help lower CPU consumption on a busy cluster quite nicely.

@skriss skriss self-assigned this Jun 30, 2021
@skriss skriss removed the lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. label Jun 30, 2021
skriss added a commit to skriss/contour that referenced this issue Jun 30, 2021
In the EndpointsTranslator, do not recompute ClusterLoadAssignments
or notify cache observers if the Endpoints being added/updated/deleted
is not used by a ServiceCluster in the DAG.

Closes projectcontour#3782.
@skriss skriss modified the milestones: 1.18.0, 1.17.0 Jun 30, 2021
skriss added a commit that referenced this issue Jul 1, 2021
In the EndpointsTranslator, do not recompute ClusterLoadAssignments
or notify cache observers if the Endpoints being added/updated/deleted
is not used by a ServiceCluster in the DAG.

Closes #3782.

Signed-off-by: Steve Kriss <krisss@vmware.com>
tsaarni added a commit to Nordix/contour that referenced this issue Sep 20, 2021
internal/xdscache: ignore irrelevant Endpoints (projectcontour#3852)

In the EndpointsTranslator, do not recompute ClusterLoadAssignments
or notify cache observers if the Endpoints being added/updated/deleted
is not used by a ServiceCluster in the DAG.

Closes projectcontour#3782.

Signed-off-by: Steve Kriss <krisss@vmware.com>