
Unrelated Secret changes trigger Contour-Envoy xDS communication and status updates on HTTPProxies #4386

Closed
Zsolt-LazarZsolt opened this issue Mar 3, 2022 · 5 comments · Fixed by #4792
Labels: kind/bug · priority/important-soon

@Zsolt-LazarZsolt

When I run Contour with --debug and --kubernetes-debug=10 and have any valid HTTPProxy in the system (regardless of whether it uses any secrets), a change to an unrelated secret triggers Contour-Envoy communication and also a status update on all HTTPProxy instances, even if that secret has nothing to do with ingress.

Interestingly, this does not trigger on generic secrets.

Ways to reproduce it:
Have two certificate/key pairs ready: secret1/secret.key, secret1/secret.crt, secret2/secret.key, secret2/secret.crt.
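If you need test certificates, one way to generate the two pairs (assuming openssl is available; the subject names are arbitrary):

mkdir -p secret1 secret2

openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj "/CN=test1.example.com" -keyout secret1/secret.key -out secret1/secret.crt

openssl req -x509 -newkey rsa:2048 -nodes -days 365 -subj "/CN=test2.example.com" -keyout secret2/secret.key -out secret2/secret.crt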

Update the secret by alternating between the following two commands:

kubectl create secret tls unrelated-secret --key=secret1/secret.key --cert=secret1/secret.crt --save-config --dry-run -o yaml | kubectl apply -f -

kubectl create secret tls unrelated-secret --key=secret2/secret.key --cert=secret2/secret.crt --save-config --dry-run -o yaml | kubectl apply -f -

While watching the Contour log, you will see:

time="2022-03-03T17:00:56Z" level=debug msg="handling v3 xDS resource request" connection=4 messages toward connected envoys and httpProxy status check is also triggered.

This can generate unwanted pressure on the xDS communication and also on the kube API.
Secret creation is an exception, but secret changes and deletions do trigger the behavior.

Similar to this issue: #3782

Related issue:
I also noticed that --kubernetes-debug=10 does not seem to have an effect on the secrets informer; at least I don't see any secret-related API activity in the logs.

Another related issue:
Using the same setup, create a secret the following way:

kubectl create secret generic my-secret --from-literal=ca.crt=supersecret --dry-run -o yaml | kubectl apply -f -

Then update it with the following command:

kubectl create secret generic my-secret --from-literal=ca.crt=supersecret1 --dry-run -o yaml | kubectl apply -f -

I see the following error from Contour, about a secret it should not care about.

time="2022-03-03T17:08:05Z" level=error msg="invalid CA certificate bundle: failed to locate certificate" context=KubernetesCache kind=Secret name=my-secret namespace=default version=v1

The problem is only seen on updates to unrelated secrets.

  • Contour version: 1.20
  • Kubernetes version: 1.21.1
@Zsolt-LazarZsolt Zsolt-LazarZsolt added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Mar 3, 2022
@skriss
Member

skriss commented Mar 4, 2022

Yeah, I can generally see why this is happening: these unrelated secrets are stored in the DAG cache even though they're not referenced, and updates or deletes trigger the remove function (in the case of an update, Contour does a delete followed by an insert), which then triggers a DAG rebuild and so on, as long as the secret was in the cache.

I think you'd probably see the same thing with updates/deletes on unrelated Services too.

Right now, we have to cache all Services and Secrets because otherwise they'd be unavailable to the DAG if/when they are eventually referenced by an HTTPProxy or Ingress.
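A minimal sketch of the behavior being described, using illustrative names only rather than Contour's actual types:

```go
package dag

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// Illustrative cache only; field and method names are assumptions, not
// Contour's actual implementation.
type cache struct {
	secrets map[types.NamespacedName]*v1.Secret
}

// An update is handled as a remove followed by an insert. remove reports
// "rebuild needed" whenever the Secret was present in the cache, regardless
// of whether any HTTPProxy or Ingress references it, so every update or
// delete of a cached Secret triggers a DAG rebuild, an xDS push, and a
// status update.
func (c *cache) remove(s *v1.Secret) bool {
	key := types.NamespacedName{Namespace: s.Namespace, Name: s.Name}
	_, cached := c.secrets[key]
	delete(c.secrets, key)
	return cached
}
```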

This logic is definitely tricky to get exactly right while still being performant, especially given the way the DAG cache currently works. I think now that we're using controller-runtime more broadly, we might have some options for leveraging its Cache to do some refactoring here and make these problems go away, but it would be a pretty big change and we'd have to ensure no functional or performance regressions.

There might be a more tactical fix here around checking for usage of the Secret/Service before validating it, and also when processing a delete, that could reduce the amount of churn. We'd have to make sure that we are properly validating things when they do start being referenced by an HTTPProxy or Ingress, though.

Do you have a sense of the impact of this issue for you? Urgent to fix, or minor annoyance?

@skriss
Member

skriss commented Mar 4, 2022

Looked at this some more and I have a few thoughts on fixes we can make ~now:

  • I'm not sure that we need to validate secrets as they go into the cache, since it looks like they're always validated as they're read from the cache by the processors anyway. If we drop the pre-insertion validation, that avoids validation errors on unused secrets. Just need to make sure the validation on reading from the cache matches what currently happens up-front.
  • I believe we can add a check to the secretTriggersRebuild function in the remove logic, so that we only return true if the secret was referenced somewhere. This should help avoid the churn on unused secret update/delete.
  • We can do a similar thing to the previous bullet for services too.

I think we can probably also get more efficient with our tracking/checking of whether secrets and services are used -- right now it's fairly inefficient for larger sets of data. One idea would be to do some reference tracking as HTTPProxies/Ingresses/Routes go into the cache, so we keep a running list of which secrets/services are referenced, and then use that as a basis for determining whether a given secret/service event needs to trigger a rebuild or not. This is more of a nice-to-have performance improvement, though, not directly related to the issues you're encountering.
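A minimal sketch of the direction described above: only report a rebuild on Secret removal if something references the Secret, with references tracked as HTTPProxies and Ingresses are inserted. Names and signatures are illustrative, not Contour's actual code:

```go
package dag

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// Illustrative only; not Contour's actual types or signatures.
type cache struct {
	secrets map[types.NamespacedName]*v1.Secret
	// referencedSecrets would be maintained as HTTPProxies, Ingresses, and
	// Gateways are inserted, per the reference-tracking idea above.
	referencedSecrets map[types.NamespacedName]bool
}

// secretTriggersRebuild reports whether anything that contributes to the DAG
// (Ingress, HTTPProxy, Gateway, or global config) references the Secret.
func (c *cache) secretTriggersRebuild(key types.NamespacedName) bool {
	return c.referencedSecrets[key]
}

// remove deletes a Secret from the cache and only reports that a DAG rebuild
// is needed when the Secret was cached AND is referenced somewhere.
func (c *cache) remove(s *v1.Secret) bool {
	key := types.NamespacedName{Namespace: s.Namespace, Name: s.Name}
	_, cached := c.secrets[key]
	delete(c.secrets, key)
	return cached && c.secretTriggersRebuild(key)
}
```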

@Zsolt-LazarZsolt
Author

I see your point on the DAG referencing.
So in theory, all secrets and services are stored in the DAG cache in case something references them in the future.
I guess that also has memory usage implications on huge clusters, and is maybe also problematic from a security perspective?
Maybe the best way to combat that for now is using root namespaces.

I guess keeping a list of the relevant resources could be a great way of avoiding unwanted rebuilds, status updates, and Envoy communication.
Maybe that could produce race conditions between incoming updates and the list being written.

So far the issue itself has not triggered any faults; for now I can only foresee extra CPU and network usage as a result of this.
One thing I can imagine is that a resent endpoint list for an upstream service could break routing if strategy: Cookie is used.

Any input on the --kubernetes-debug=10 vs. secret informer verbosity question? That could be low-hanging fruit to start with.

@sunjayBhatia
Member

informers come from controller-runtime manager caches, so we can maybe look at whether the controller-runtime manager needs the logger to be hooked up properly; currently we just use the "default": https://github.com/kubernetes-sigs/controller-runtime/blob/9ee63fc65a9720568f74df9901b16883621636b4/pkg/manager/manager.go#L136-L138
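For reference, a minimal sketch of passing an explicit logger into the controller-runtime manager so its caches/informers don't fall back to the default logger. The zap logger here is just for illustration; Contour would presumably adapt its own logrus logger instead:

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// Passing a Logger into the manager Options means controller-runtime (and
	// the caches/informers it creates) won't default to log.Log as in the
	// snippet linked above.
	logger := zap.New(zap.UseDevMode(true))

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Logger: logger,
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // mgr.Start(ctx) would follow in real code
}
```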

@sunjayBhatia
Member

sunjayBhatia commented Mar 7, 2022

See #4391 for an attempt to fix the logging part

@skriss skriss added this to Contour Sep 22, 2022
@sunjayBhatia sunjayBhatia added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Sep 26, 2022
@skriss skriss added this to the 1.24.0 milestone Oct 12, 2022
@skriss skriss moved this to Todo in Contour Oct 12, 2022
@skriss skriss self-assigned this Oct 12, 2022
@skriss skriss moved this from Todo to In Progress in Contour Oct 12, 2022
skriss added a commit to skriss/contour that referenced this issue Oct 13, 2022
On Secret updates and deletes, only
trigger DAG rebuilds if the Secret
is referenced by an Ingress, HTTPProxy,
Gateway or global config.

Closes projectcontour#4386.

Signed-off-by: Steve Kriss <krisss@vmware.com>
skriss added a commit to skriss/contour that referenced this issue Oct 21, 2022
skriss added a commit that referenced this issue Oct 24, 2022
Repository owner moved this from In Progress to Done in Contour Oct 24, 2022
moeyui1 pushed a commit to moeyui1/contour that referenced this issue Oct 26, 2022