
Unrelated endpoint changes trigger ClusterLoadAssignment change and push #3782

Closed
Zsolt-LazarZsolt opened this issue Jun 8, 2021 · 12 comments · Fixed by #3852
Labels
kind/bug Categorizes issue or PR as related to a bug.

@Zsolt-LazarZsolt

What steps did you take and what happened:
I used the 1.15 contour example.yaml to reproduce the issue; the only changes are adding --debug to the contour serve process and setting "--log-level debug" for envoy.

I installed 2 services in the cluster, each tied to a separate deployment with one pod:

  • relevant-service
  • other-service

and created the simplest HTTPProxy I could think of, routing traffic for example.com to relevant-service.
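For reference, a minimal HTTPProxy for this setup might look like the following sketch. This is an assumed reconstruction, not the attached manifests; the service port (80) is illustrative:

```yaml
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: relevant-proxy
  namespace: default
spec:
  virtualhost:
    fqdn: example.com
  routes:
  - services:
    - name: relevant-service
      port: 80
```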

So at this point the setup is simple and contour, in theory, only cares about relevant-service; traffic is fine.
Next I follow the logs of one contour pod with kubectl logs -f, and at the same time I run
kubectl edit deployment other-deployment, where I change the replica count to 2.

This creates a new pod, and a new endpoint, on a service that contour in theory has nothing to do with.

To my surprise I see the following in the contour logs:
time="2021-06-08T08:17:11Z" level=debug msg="handling v3 xDS resource request" connection=12 context=xds node_id=envoy-z8wxd node_version=v1.18.3 resource_names="[default/relevant-service/http]" response_nonce=18 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=18
time="2021-06-08T08:17:11Z" level=debug msg="handling v3 xDS resource request" connection=9 context=xds node_id=envoy-48brv node_version=v1.18.3 resource_names="[default/relevant-service/http]" response_nonce=18 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=18
time="2021-06-08T08:17:11Z" level=debug msg="handling v3 xDS resource request" connection=10 context=xds node_id=envoy-r495t node_version=v1.18.3 resource_names="[default/relevant-service/http]" response_nonce=18 type_url=type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment version_info=18

Contour triggers and sends the endpoint list to all connected envoy pods.

On a busy cluster this happens very often, since it is triggered by any endpoint change in any namespace, especially when many pods have failing readiness probes.
The cost multiplies by the number of dynamic clusters in contour (one per backend service), and again by the number of connected envoys.
This consumes network traffic and resources in both contour and envoy.

What did you expect to happen:
I would expect no endpoint push triggered by unrelated endpoint changes.


Environment:

  • Contour version: 1.15.1
  • Kubernetes version: 1.21
@Zsolt-LazarZsolt Zsolt-LazarZsolt added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Jun 8, 2021
@Zsolt-LazarZsolt
Author

setup.txt
Forgot to add my file for the setup
Also attaching the contour manifests with my changes. (Switching on debug logging, and removing envoy probes.)

contour.txt

@sunjayBhatia
Member

Right now contour doesn't implement incremental xDS, so we do send the state of the world on changes to resources.
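To illustrate the difference being described: with state-of-the-world (SotW) xDS, every change re-sends the full resource set, while incremental (delta) xDS would send only what changed. A toy sketch (illustrative Python, not Contour's actual Go implementation; the function names are hypothetical):

```python
# Toy comparison of state-of-the-world (SotW) vs incremental xDS responses.
# Contour currently implements SotW, so every change re-sends everything.

def sotw_response(all_resources, changed):
    # SotW: the full current resource set goes out on every update.
    return dict(all_resources)

def delta_response(all_resources, changed):
    # Incremental: only resources that actually changed go out.
    return {name: all_resources[name] for name in changed}

resources = {"default/relevant-service/http": "v2",
             "default/other-service/http": "v5"}
changed = {"default/other-service/http"}

print(len(sotw_response(resources, changed)))   # 2 (everything)
print(len(delta_response(resources, changed)))  # 1 (just the change)
```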

@sunjayBhatia
Member

Related: #1176

@youngnick
Member

Thanks for the detailed report @Zsolt-LazarZsolt. I agree that Contour should ideally not be sending endpoint updates when the endpoints haven't changed. I'm not sure if we will be able to do much about it until we've fixed #1176, but one of us will see if we can replicate and understand the cause of the problem.

@Zsolt-LazarZsolt
Author

Hello Everyone. Thank you for the quick assessment.

I have trouble seeing the connection between the two tickets, so let me elaborate on the issue.
My issue is not with the type of the update; my issue is that an update toward the envoy pods happens at all.
So basically any endpoint change in the cluster will trigger the communication, even if the endpoint is not part of any configured upstream.

Additionally @tsaarni created the following, so far untested change. I plan to check it in my own environment.
What do you think, can it have unforeseen consequences?
Nordix@0be2438

@stevesloka
Member

I don't think that #1176 has anything to do with this. The scenario you showed @Zsolt-LazarZsolt should not have caused any xDS changes, so no updates should have happened.

We have logic to check if a service that changed causes an update. So in your example the update of the other-service shouldn't get added to the clusters that Contour cares about.

When the endpoint for that service changes, then the endpointstranslator should not find the matching cluster and should not cause the eps to rebuild.

So all that logic is good (but found some places where we can add some tests).

I played around with this, and I think what's happening is that when an endpoint changes, Recalculate is always called, which builds a new endpoints cache, causing that change to be passed on to Envoy.

So ideally we'd need to check whether the change was caused by a relevant endpoint before calling Recalculate. Additionally, I think we'd benefit from a backoff timer like the one added to the DAG rebuild a while back. This would reduce the number of updates that hit Envoy. We'd need to tune the timing, but if you have a lot of endpoint changes across the cluster, we're going to be bombarding Envoy with xDS updates.
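The relevance check described here can be sketched roughly as follows. This is illustrative Python, not Contour's actual Go code; the class and method names (`EndpointsTranslator`, `on_endpoints_change`, `recalculate`) are hypothetical stand-ins for the real internals:

```python
# Illustrative sketch: only rebuild the ClusterLoadAssignment cache
# (and so trigger an xDS push) when the changed Endpoints object is
# actually referenced by a ServiceCluster in the DAG.
# All names here are hypothetical, not Contour's real identifiers.

class EndpointsTranslator:
    def __init__(self, service_clusters):
        # service_clusters: set of (namespace, name) Services the DAG references
        self.service_clusters = set(service_clusters)
        self.recalculations = 0

    def on_endpoints_change(self, namespace, name):
        # Skip the expensive rebuild for Endpoints no ServiceCluster cares about.
        if (namespace, name) not in self.service_clusters:
            return False
        self.recalculate()
        return True

    def recalculate(self):
        # Placeholder for rebuilding ClusterLoadAssignments and
        # notifying xDS cache observers.
        self.recalculations += 1

t = EndpointsTranslator({("default", "relevant-service")})
t.on_endpoints_change("default", "other-service")     # ignored, no rebuild
t.on_endpoints_change("default", "relevant-service")  # triggers a rebuild
print(t.recalculations)  # 1
```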

@stevesloka
Member

haha that timing @Zsolt-LazarZsolt! 😀

@stevesloka
Member

I'd need to test as well, but at first glance, the patch from @tsaarni looks like it's on the same path I went down in my response.

I think if we combined that with a backoff timer, that would really help out!
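A backoff timer of the kind mentioned could look roughly like the following debounce sketch. This is illustrative Python with an explicit fake clock for determinism; Contour's actual DAG-rebuild holdoff is implemented differently, in Go:

```python
# Illustrative debounce sketch: coalesce a burst of endpoint-change
# events into a single xDS push once a holdoff window elapses.
# Times are passed in explicitly (a fake clock) so behaviour is deterministic.

class Debouncer:
    def __init__(self, holdoff_seconds):
        self.holdoff = holdoff_seconds
        self.pending = 0       # changes accumulated since the last push
        self.deadline = None   # when the current holdoff window ends
        self.pushes = 0        # coalesced xDS pushes actually sent

    def event(self, now):
        # Record a change; start the holdoff window on the first one.
        self.pending += 1
        if self.deadline is None:
            self.deadline = now + self.holdoff

    def tick(self, now):
        # Called periodically; fires one push for the whole burst.
        if self.deadline is not None and now >= self.deadline:
            self.pushes += 1
            self.pending = 0
            self.deadline = None

d = Debouncer(holdoff_seconds=0.5)
for t in (0.0, 0.1, 0.2, 0.3):   # four rapid endpoint changes
    d.event(t)
d.tick(0.4)                       # window not yet elapsed: no push
d.tick(0.6)                       # one coalesced push for all four changes
print(d.pushes)  # 1
```

The trade-off is latency: endpoints briefly lag behind reality by up to the holdoff window, which is why the timing would need tuning.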

@sunjayBhatia
Member

ah i misread, i thought the HTTPProxy etc. also referenced the irrelevant service, but it does not 👀

@Zsolt-LazarZsolt
Author

Zsolt-LazarZsolt commented Jun 9, 2021

Maybe it is also good to add that when a new unrelated service is added (a ClusterIP service without any endpoints),
nothing seems to appear in the debug log (which is good).

The manifest looks as follows:

kind: Service
apiVersion: v1
metadata:
  name: dummy-endpoint
spec:
  type: ClusterIP
  ports:
  - port: 9191
    targetPort: 9191

Although when the same service is removed, I can see the following in the contour debug log:

time="2021-06-09T16:16:39Z" level=info msg="performing delayed update" context=contourEventHandler last_update=28.282869247s outstanding=1
time="2021-06-09T16:16:39Z" level=debug msg="added ServiceCluster "default/relevant-service/http" from DAG" context=endpointstranslator
time="2021-06-09T16:16:39Z" level=debug msg="cluster load assignments did not change" context=endpointstranslator
time="2021-06-09T16:16:39Z" level=debug msg="received a status update" context=StatusUpdateHandler name=relevant-proxy namespace=default
time="2021-06-09T16:16:39Z" level=debug msg="update was a no-op" context=StatusUpdateHandler name=relevant-proxy namespace=default
time="2021-06-09T16:16:39Z" level=debug msg="handling v3 xDS resource request" connection=2 context=xds node_id=envoy-gcf8g node_version=v1.18.3 resource_names="[]" response_nonce=4 type_url=type.googleapis.com/envoy.config.cluster.v3.Cluster version_info=4
time="2021-06-09T16:16:39Z" level=debug msg="handling v3 xDS resource request" connection=1 context=xds node_id=envoy-gcf8g node_version=v1.18.3 resource_names="[]" response_nonce=4 type_url=type.googleapis.com/envoy.config.listener.v3.Listener version_info=4
time="2021-06-09T16:16:39Z" level=debug msg="handling v3 xDS resource request" connection=3 context=xds node_id=envoy-gcf8g node_version=v1.18.3 resource_names="[ingress_http]" response_nonce=4 type_url=type.googleapis.com/envoy.config.route.v3.RouteConfiguration version_info=4

So this also triggers the same kind of behaviour, on delete.

@skriss
Member

skriss commented Jun 29, 2021

I'm gonna add this to the 1.18 milestone, I think we should get this fixed. I also tested out a similar patch to @tsaarni's and it looked good in my initial testing. @tsaarni did you want to get your patch up as a PR or would you rather someone else take this on?

@skriss skriss added this to the 1.18.0 milestone Jun 29, 2021
@tsaarni
Member

tsaarni commented Jun 30, 2021

@skriss If you have time to work on it, that would be perfect! I could have picked it up after the other PRs I have open, but I'd be grateful if you can take it for 1.18.0.

Just the other day we were chatting with @Zsolt-LazarZsolt about this issue. It seems even a simple fix without the backoff timer could help lower CPU consumption on a busy cluster quite nicely.

@skriss skriss self-assigned this Jun 30, 2021
@skriss skriss removed the lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. label Jun 30, 2021
skriss added a commit to skriss/contour that referenced this issue Jun 30, 2021
In the EndpointsTranslator, do not recompute ClusterLoadAssignments
or notify cache observers if the Endpoints being added/updated/deleted
is not used by a ServiceCluster in the DAG.

Closes projectcontour#3782.
@skriss skriss modified the milestones: 1.18.0, 1.17.0 Jun 30, 2021
skriss added a commit that referenced this issue Jul 1, 2021
In the EndpointsTranslator, do not recompute ClusterLoadAssignments
or notify cache observers if the Endpoints being added/updated/deleted
is not used by a ServiceCluster in the DAG.

Closes #3782.

Signed-off-by: Steve Kriss <krisss@vmware.com>
tsaarni added a commit to Nordix/contour that referenced this issue Sep 20, 2021
internal/xdscache: ignore irrelevant Endpoints (projectcontour#3852)

In the EndpointsTranslator, do not recompute ClusterLoadAssignments
or notify cache observers if the Endpoints being added/updated/deleted
is not used by a ServiceCluster in the DAG.

Closes projectcontour#3782.

Signed-off-by: Steve Kriss <krisss@vmware.com>