
VirtualGateways should not break with a missing TLS certificate #6957

Closed
day0ops opened this issue Aug 16, 2022 · 11 comments
Labels
Area: Stability, Geo: APAC, Needs Investigation, Prioritized, Priority: High, release/1.17, release/1.18, scope/1.18, Type: Enhancement, zendesk

Comments

@day0ops
Contributor

day0ops commented Aug 16, 2022

Gloo Edge Version

1.11.x (latest stable)

Kubernetes Version

1.23.x

Describe the bug

At times when certificate issuance in cert-manager is slow (for example, because the DNS challenge isn't solved yet) and the TLS secret is therefore not available, Gloo Edge will go into a bad state when mounting the TLS secret. This is especially true if the proxy restarts, as this causes the HTTPS listener to disappear and results in an outage.

The listener should keep functioning for all connections where a valid certificate exists.

Steps to reproduce the bug

Observed when deleting a referenced certificate.

Expected Behavior

Expect the proxy to continue serving all other connections that have a valid certificate.

Additional Context

No response


@day0ops day0ops added the Type: Bug Something isn't working label Aug 16, 2022
@chrisgaun chrisgaun added Geo: APAC Type: Enhancement New feature or request and removed Type: Bug Something isn't working labels Aug 31, 2022
@dinukxx

dinukxx commented Sep 9, 2022

This issue actually caused two production incidents last week, so it is pretty high priority for Xero. It would be great if this could be looked at sooner.

@AkshayAdsul AkshayAdsul added the Priority: High Required in next 3 months to make progress, bugs that affect multiple users, or very bad UX label Jun 7, 2023
@day0ops
Contributor Author

day0ops commented Jun 8, 2023

@SantoDE SantoDE changed the title from "Introduce a fall back TLS when cert-manager is slow to sign certificate with an external CA" to "VirtualGateways should not break with a missing TLS certificate" Jun 8, 2023
@sam-heilbron
Contributor

There are a few dimensions to this issue, and I want to make sure we break them down so we can better understand the challenges and then come up with a desired solution. To me there are a few aspects at play:

  • Delayed certificate issuance
  • Deleting secrets that are valid and referenced in a VirtualService for server-side TLS
  • Proxy restarts causing differing behavior in the new pod from the previous one
  • The listener should keep functioning for all connections where a valid certificate exists.

Delayed Certificate Issuance

Can you help me understand how this plays a role here? I would imagine that cert-manager (as an example) doesn't delete the secret, but instead modifies it when there is an update. I'm happy to learn more about the internals of what takes place, but can you share what the expected order of events is and where in that sequence Secrets are being deleted?

Deleting Secrets That are Valid

I agree that we should not support this. There are a few cases that I think are worth considering:

  1. Prevent the deletion through a validation API (Gloo-EE Validating Webhook for Secrets Doesn't Work #8001)
  2. What should the behavior or contract of Gloo Edge be in the case where validation is disabled (see the sketch after this list)? Do you have thoughts on this?
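
For concreteness, here is a minimal sketch of what "validation disabled" means in practice, using the Settings patch that also appears later in this thread (it assumes the default Settings object in gloo-system):

# Accept all resources regardless of validation outcome
kubectl patch settings -n gloo-system default --type merge \
  -p '{"spec":{"gateway":{"validation":{"alwaysAccept":true}}}}'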

Proxy restarts causing differing behavior in the new pod from the previous one

I observed a similar behavior when configuring a single VS. I believe this is because if you invalidate the VS (and thus the filter chain), the listener is no longer valid and is not updated. As a result, the "old", previously working tls config is still stored in memory in the proxy. When a new pod starts up, it requests the config from gloo and doesn't pick up any listeners, because the xds cache has no valid listeners.

However, I noticed that when I had multiple virtual services, invalidating one resulted in the listener being updated by removing the now-invalid filter chain. Since at least one valid filter chain remained, the listener kept functioning. This feels like the proper behavior to me, but the open question is how we want to handle the case where a listener has 0 valid virtual hosts (because each has an invalid TLS config).
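
For anyone trying to confirm this locally, a sketch of how to see which listeners and filter chains survived a given update via the Envoy admin interface (the admin port 19000 matches the output shared further down in this thread):

# Port-forward the Envoy admin interface of the gateway-proxy
kubectl port-forward -n gloo-system deploy/gateway-proxy 19000 &

# List the currently active listeners
curl -s localhost:19000/listeners

# Dump the full config to inspect the filter chains attached to each listener
curl -s localhost:19000/config_dump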

The listener should keep functioning for all connections where a valid certificate exists

To my understanding, this is the current behavior. It feels like the open questions are:

  • What should happen when a Secret containing a certificate is deleted?
  • What is causing that Secret to be deleted?
  • Can we prevent it from occurring?

@nil-scan

nil-scan commented Nov 1, 2023

I feel there is some context missing here: you mention a secret being deleted, which I don't see in this issue (maybe it's part of the linked Zendesk ticket, which I don't have access to), so our situation might be a bit different, but I think this is similar to an issue we're having. Let me know if you'd prefer that I create a separate issue.

Delayed Certificate Issuance

AFAIK this is only an issue when creating a new VirtualService and its associated cert, not on deletion or updates.
What happens is that Gloo processes the VirtualServices almost immediately, while cert-manager needs a certain amount of time to fetch a certificate (it can take more than an hour in case of rate limiting, or the fetch can fail entirely if there are issues answering the challenge).
Cert-manager only creates the Kubernetes secret once the certificate has been properly issued, so there is almost always a window during which Gloo sees new VirtualServices as invalid.
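
For what it's worth, the issuance pipeline on the cert-manager side can be followed with standard kubectl commands (a sketch, assuming an ACME issuer so that Order and Challenge resources exist; "my-cert" is a placeholder name):

# Certificate -> CertificateRequest -> Order -> Challenge; the Secret only appears at the end
kubectl get certificate,certificaterequest,order,challenge -n gloo-system
kubectl describe certificate my-cert -n gloo-system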

The way nginx-ingress solves this problem is by serving a default self-signed certificate if the secret cannot be found.
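
For illustration, the fallback-certificate approach boils down to something like this (a rough sketch with placeholder names, not a Gloo feature today):

# Generate a self-signed certificate and store it as a TLS secret a gateway could fall back to
openssl req -x509 -nodes -newkey rsa:2048 -days 365 \
  -keyout tls.key -out tls.crt -subj "/CN=fallback.invalid"
kubectl create secret tls fallback-cert -n gloo-system --cert=tls.crt --key=tls.key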

Deleting Secrets That are Valid

I haven't had this issue; it seems reasonable to have a validation webhook on secret deletion, but, as you mentioned, with a flag to disable it for backwards compatibility.

One issue I have with the validation-webhook approach is that it introduces a notion of dependency (the VirtualService has to be deleted first, then the secret). This can make simple workflows using kubectl apply/delete break or become flaky, as there's no way to guarantee the order in which objects will be created or deleted.

Note that this is also true at creation time with some of the existing validation hooks. Here's an example of applying a single YAML file that contains a VirtualService and a cert-manager Certificate:

$ kubectl apply -f vs-test.yaml

Error from server: error when creating "test.mydomain.com.yaml": admission webhook "gloo.gloo-system.svc" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.VirtualService failed: validating *v1.VirtualService name:"test.mydomain.com"  namespace:"gloo-test": failed to validate Proxy with Gloo validation server: Listener Error: SSLConfigError. Reason: SSL secret not found: list did not find secret test.mydomain.com]

To work around this one would have to either:

  • Retry later (once the cert has been fetched)
  • Use another tool such as terraform to manage the dependency
  • Set validation.alwaysAcceptResources to true, but that means that we will also accept truly malformed VirtualServices.

Serving a default cert when the secret cannot be found would be a simple way to solve this issue. The validation webhook could maybe just throw a warning instead of failing?
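
For reference, a sketch of the Helm values that control this behavior today, based on the flags used later in this thread (alwaysAcceptResources admits everything, including truly malformed VirtualServices, while allowWarnings governs whether resources that only produce warnings are admitted):

helm upgrade -n gloo-system gloo gloo/gloo \
  --set gateway.validation.alwaysAcceptResources=true \
  --set gateway.validation.allowWarnings=true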

Proxy restarts causing differing behavior in the new pod from the previous one

The listener should keep work functioning for all connections where a proper cert is existing

Merging these two questions, as I'm not sure which one this falls into, but what happened to us was that the Gloo gateway was in an error state (because the certificate couldn't be fetched and thus the TLS secret couldn't be found).
Everything still worked fine until the gloo pod restarted (not gateway-proxy), after which the new pod decided to drop the whole listener (stripping invalid listener from proxy):

$ curl localhost:19000/listeners

prometheus_listener::0.0.0.0:8081
listener-::-8080::[::]:8080

(This was tested on gloo OSS 1.13.30)

This can be easily reproduced by creating a VirtualService referencing a secret that doesn't exist (with validation.alwaysAcceptResources set to true). Then restart the gloo pod.
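
A minimal sketch of such a VirtualService (all names are placeholders; the referenced secret intentionally does not exist):

kubectl apply -f - << EOF
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: broken-tls-vs
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - broken.example.com
    routes:
      - matchers:
          - prefix: /
        directResponseAction:
          status: 200
          body: unreachable
  sslConfig:
    secretRef:
      name: does-not-exist    # secret intentionally missing
      namespace: gloo-system
EOF

# Then restart the gloo pod and watch the HTTPS listener disappear
kubectl rollout restart deployment/gloo -n gloo-system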

I do have other HTTPS VirtualServices with a good config, so it's not that we ended up with 0 valid routes.
This is, IMO, the critical part of the issue: a single badly configured VS can take everything down with it.

Here are the logs on gateway-proxy when this happens (I removed the deprecation warnings):

[2023-11-01 00:31:15.180][7][info][upstream] [external/envoy/source/server/lds_api.cc:82] lds: add/update listener 'listener-::-8080'
[2023-11-01 00:31:15.222][7][info][upstream] [external/envoy/source/server/lds_api.cc:82] lds: add/update listener 'listener-::-8443'
[2023-11-01 00:31:15.228][7][info][config] [external/envoy/source/server/listener_manager_impl.cc:841] all dependencies initialized. starting workers
[2023-11-01 00:40:27.700][7][warning][config] [external/envoy/source/common/config/grpc_stream.h:160] StreamAggregatedResources gRPC config stream closed: 13, 
[2023-11-01 00:40:28.058][7][info][upstream] [external/envoy/source/server/lds_api.cc:63] lds: remove listener 'listener-::-8443'
[2023-11-01 00:46:15.228][7][info][main] [external/envoy/source/server/drain_manager_impl.cc:171] shutting down parent after drain

Logs for the gloo pod (note that the timestamps won't match, as I had to restart the pod again to capture the logs):

gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.468Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:92","msg":"begin sync 5563496949759613014 (56 virtual services, 6 gateways, 30 route tables)","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.470Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:132","msg":"desired proxy name:\"gateway-proxy\" namespace:\"gloo-system\"","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.476Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"validation/server.go:189","msg":"received proxy validation request","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.493Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer.proxy-validator.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8080","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.496Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer.proxy-validator.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8443","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"warn","ts":"2023-11-01T01:14:13.548Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"reconciler/proxy_reconciler.go:164","msg":"stripping invalid listener from proxy","version":"1.13.30","proxy":"name:\"gateway-proxy\" namespace:\"gloo-system\"","listener":"listener-::-8443"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.549Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:104","msg":"end sync 5563496949759613014","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.570Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:80","msg":"begin sync 11661502467246133202 (1 proxies, 471 upstreams, 579 endpoints, 197 secrets, 355 artifacts, 0 auth configs, 0 rate limit configs, 0 graphql apis)","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.583Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8080","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.597Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:165","msg":"Setting xDS Snapshot","version":"1.13.30","key":"gloo-system~gateway-proxy","clusters":469,"listeners":1,"routes":1,"endpoints":451}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.597Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:175","msg":"end sync 11661502467246133202","version":"1.13.30"}

@nil-scan

nil-scan commented Nov 8, 2023

I just found out that cert-manager is able to issue a temporary certificate by setting the cert-manager.io/issue-temporary-certificate annotation to "true" on the cert.

This works nicely as a workaround, and it actually makes a lot more sense for cert-manager to do that rather than gloo.
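
For anyone else hitting this, a sketch of what that looks like on a Certificate (names are placeholders; the annotation is the important part):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-cert
  namespace: gloo-system
  annotations:
    # Ask cert-manager to issue a temporary self-signed certificate
    # into the Secret while the real certificate is still being issued
    cert-manager.io/issue-temporary-certificate: "true"
spec:
  secretName: my-cert
  dnsNames:
    - my-app.example.com
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer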

I think there's still work to do on the gloo side:

  • Stripping a whole listener because of a bad VS configuration shouldn't happen. At most it should remove the VS itself, and ideally keep the last known good config whenever possible.
  • We need to set alwaysAccept to true to create VirtualServices with cert-manager; I would rather keep it set to false to catch other misconfiguration issues.
  • There's still the underlying possibility that the secret won't be created for whatever reason, leading to a full outage.

@nfuden nfuden self-assigned this Mar 5, 2024
@htpvu htpvu added the Prioritized Indicating issue prioritized to be worked on in RFE stream label Mar 6, 2024
@kcbabo kcbabo added the Area: Stability Issues related to stability of the product, engineering, tech debt label Apr 2, 2024
@nfuden nfuden removed their assignment Apr 15, 2024
@AkshayAdsul

Another related issue
https://solo-io.zendesk.com/agent/tickets/3753

@nrjpoddar

@sam-heilbron to look at this issue for scope/estimation.

@soloio-bot

Zendesk ticket #3753 has been linked to this issue.

@jbohanon
Contributor

Reproduction

I have been able to reproduce this with the following steps (assuming a running cluster):

  1. Create the cert-manager installation
kubectl create namespace cert-manager
kubectl apply --validate=false -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
  2. Create the gloo installation
helm repo add gloo https://storage.googleapis.com/solo-public-helm
helm repo update
helm install -n gloo-system gloo gloo/gloo --set gateway.persistProxySpec=true --set gateway.validation.alwaysAcceptResources=false --set gateway.validation.allowWarnings=false
  3. Create the cert-manager pre-requisites
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-selfsigned-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: my-selfsigned-ca
  secretName: root-secret
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
    group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: my-ca-issuer
spec:
  ca:
    secretName: root-secret
EOF
  4. Create a Certificate and a VirtualService which consumes its resultant secret
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-1.solo.io
  namespace: gloo-system
spec:
  secretName: test-1.solo.io
  dnsNames:
  - test-1.solo.io
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: direct-response-ssl-1
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - test-1.solo.io
      - test-1.solo.io:8443
    routes:
    - matchers:
       - prefix: /
      directResponseAction:
        status: 200
        body: you did it, number 1!
  sslConfig:
    secretRef:
      name: test-1.solo.io
      namespace: gloo-system
    oneWayTls: true
    sniDomains: [ "test-1.solo.io" ]
EOF
  5. At this point, note that we get a validation error since cert-manager was not given time to create the secret before Gloo attempted to reconcile the VS. Running the exact same command again will succeed if and only if the cert (and its secret) were created first, which they should have been given the ordering of the resources in the command.
certificate.cert-manager.io/test-1.solo.io created
Error from server: error when creating "STDIN": admission webhook "gloo.gloo-system.svc" denied the request: resource incompatible with current Gloo snapshot: [Validating *v1.VirtualService failed: 1 error occurred:
        * Validating *v1.VirtualService failed: validating *v1.VirtualService name:"direct-response-ssl-1" namespace:"gloo-system": 1 error occurred:
        * failed to validate Proxy [namespace: gloo-system, name: gateway-proxy] with gloo validation: Listener Error: SSLConfigError. Reason: SSL secret not found: list did not find secret gloo-system.test-1.solo.io
  6. Port-forward the gateway-proxy in a separate terminal
kubectl port-forward -n gloo-system deploy/gateway-proxy 8443
  7. Add the following lines to your hosts file for the hosts we will be using
127.0.0.1 test-1.solo.io
127.0.0.1 test-2.solo.io
  8. Execute a curl request showing that the proxy is responding as expected
curl -vk https://test-1.solo.io:8443
  9. Delete the Certificate resource and secret
    NOTE: this requires disabling validation. Currently, validation settings should prevent this particular failure mode, and when used in conjunction with the cert-manager configuration to enable temporary certs during issuance, the problem of a missing secret should be nullified.
    ...unless validation has to be disabled for cert-manager to work properly...
kubectl patch settings -n gloo-system default --type merge -p '{"spec":{"gateway":{"validation":{"alwaysAccept":true}}}}'
sleep 1
kubectl delete Certificate -n gloo-system test-1.solo.io && kubectl delete secret -n gloo-system test-1.solo.io
  10. Watch for the proxy to eventually stop serving
watch curl -vk https://test-1.solo.io:8443
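
To confirm that the 8443 listener itself has been stripped (rather than the backend merely failing), check the Envoy admin interface on the gateway-proxy (port and listener names as seen earlier in this thread):

kubectl port-forward -n gloo-system deploy/gateway-proxy 19000 &
curl -s localhost:19000/listeners
# listener-::-8443 should be missing from the output once it has been stripped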

An odd state

(screenshot of resource statuses omitted)
We can see here that the VirtualService, where the actual configuration issue is occurring, shows Accepted with a subresource status for the Proxy showing Accepted; the Gateway which consumes the VirtualService shows Rejected with a subresource status for the Proxy showing Accepted; and the Proxy shows Rejected. At the very least this is inconsistent status reporting that misleads as to where the error is actually occurring.
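
To inspect those statuses directly (a sketch; glooctl check is the usual starting point):

# Summarize the health of Gloo resources, including rejected gateways and proxies
glooctl check

# Or read the raw status blocks on the resources themselves
kubectl get virtualservices.gateway.solo.io,gateways.gateway.solo.io,proxies.gloo.solo.io -n gloo-system -o yaml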

@jbohanon
Contributor

jbohanon commented Sep 3, 2024

This work will be released in the following versions:

Gloo Edge:
v1.18.0-beta16
v1.17.5
v1.16.20
v1.15.31

Gloo Edge Enterprise:
v1.18.0-beta1
v1.17.2
v1.16.15
v1.15.22

This work will not be available in 1.14 versions of Gloo Edge because the validating webhook behavior was significantly different, and I was unable to reproduce the issue on 1.14.

@jbohanon
Contributor

This work is merged into all relevant versions as detailed in the comment above.
