
VirtualGateways should not break with a missing TLS certificate #6957

Closed
day0ops opened this issue Aug 16, 2022 · 11 comments
Labels
Area: Stability, Geo: APAC, Needs Investigation, Prioritized, Priority: High, release/1.17, release/1.18, scope/1.18, Type: Enhancement, zendesk

Comments

@day0ops
Contributor

day0ops commented Aug 16, 2022

Gloo Edge Version

1.11.x (latest stable)

Kubernetes Version

1.23.x

Describe the bug

At times when certificate issuance in cert-manager is slow (for example, because the DNS challenge isn't solved yet) and the TLS secret is therefore not available, Gloo Edge will go into a bad state when mounting the TLS secret. This is especially true if the proxy restarts, as this causes the HTTPS listener to disappear and results in an outage.

The listener should keep functioning for all connections where a valid certificate exists.

Steps to reproduce the bug

Observed when deleting a referenced certificate.

Expected Behavior

Expect the proxy to continue serving all other connections that have a valid certificate.

Additional Context

No response


@day0ops day0ops added the Type: Bug Something isn't working label Aug 16, 2022
@chrisgaun chrisgaun added Geo: APAC Type: Enhancement New feature or request and removed Type: Bug Something isn't working labels Aug 31, 2022
@dinukxx

dinukxx commented Sep 9, 2022

This issue actually caused two production incidents last week, so it is pretty high priority for Xero. It would be great if this could be looked at sooner.

@AkshayAdsul AkshayAdsul added the Priority: High Required in next 3 months to make progress, bugs that affect multiple users, or very bad UX label Jun 7, 2023
@day0ops
Contributor Author

day0ops commented Jun 8, 2023

@SantoDE SantoDE changed the title from "Introduce a fall back TLS when cert-manager is slow to sign certificate with an external CA" to "VirtualGateways should not break with a missing TLS certificate" Jun 8, 2023
@sam-heilbron
Contributor

There are a few dimensions to this issue, and I want to make sure we break them down so we can better understand the challenges and then come up with a desired solution. To me there are a few aspects at play:

  • Delayed certificate issuance
  • Deleting secrets that are valid and referenced in a VirtualService for server-side TLS
  • Proxy restarts causing differing behavior in the new pod from the previous one
  • The listener should keep functioning for all connections where a valid certificate exists.

Delayed Certificate Issuance

Can you help me understand how this plays a role here? I would imagine that cert-manager (as an example) doesn't delete the secret, but instead modifies it when there is an update. I'm happy to learn more about the internals of what takes place, but can you share what the expected order of events is and where in that sequence Secrets are being deleted?

Deleting Secrets That are Valid

I agree that we should not support this. There are a few cases that I think are worth considering:

  1. Prevent the deletion through a validation API (Gloo-EE Validating Webhook for Secrets Doesn't Work #8001)
  2. What should the behavior or contract of Gloo Edge be in the case where validation is disabled (see the sketch after this list)? Do you have thoughts on this?
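
For concreteness, here is a minimal sketch of what "validation disabled" means in practice, using the Settings patch that also appears later in this thread (it assumes the default Settings object in gloo-system):

# Accept all resources regardless of validation outcome
kubectl patch settings -n gloo-system default --type merge \
  -p '{"spec":{"gateway":{"validation":{"alwaysAccept":true}}}}'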

Proxy restarts causing differing behavior in the new pod from the previous one

I observed a similar behavior when configuring a single VS. I believe this is because if you invalidate the VS (and thus the filter chain), the listener is no longer valid and is not updated. As a result, the "old", previously working tls config is still stored in memory in the proxy. When a new pod starts up, it requests the config from gloo and doesn't pick up any listeners, because the xds cache has no valid listeners.

However, I noticed that when I had multiple virtual services, invalidating one resulted in the listener being updated by removing the now-invalid filter chain. Since at least one valid filter chain remained, the listener kept functioning. This feels like the proper behavior to me, but the open question is how we want to handle the case where a listener has 0 valid virtual hosts (because each has an invalid TLS config).
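
For anyone trying to confirm this locally, a sketch of how to see which listeners and filter chains survived a given update via the Envoy admin interface (the admin port 19000 matches the output shared further down in this thread):

# Port-forward the Envoy admin interface of the gateway-proxy
kubectl port-forward -n gloo-system deploy/gateway-proxy 19000 &

# List the currently active listeners
curl -s localhost:19000/listeners

# Dump the full config to inspect the filter chains attached to each listener
curl -s localhost:19000/config_dump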

The listener should keep functioning for all connections where a valid certificate exists

To my understanding, this is the current behavior. It feels like the open questions are:

  • What should happen when a Secret containing a certificate is deleted?
  • What is causing that Secret to be deleted?
  • Can we prevent it from occurring?

@nil-scan

nil-scan commented Nov 1, 2023

I feel there is some context missing here: you mention a secret being deleted, which I don't see in this issue (maybe it's part of the linked Zendesk ticket, which I don't have access to), so our situation might be a bit different, but I think this is similar to an issue we're having. Let me know if you'd prefer that I create a separate issue.

Delayed Certificate Issuance

AFAIK this is only an issue when creating a new VirtualService and its associated cert, not on deletion or updates.
What happens is that Gloo processes the VirtualServices almost immediately, while cert-manager needs a certain amount of time to fetch a certificate (it can take more than an hour in case of rate limiting, or the fetch can fail entirely if there are issues answering the challenge).
Cert-manager only creates the Kubernetes secret once the certificate has been properly issued, so there is almost always a window during which Gloo sees new VirtualServices as invalid.
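
For what it's worth, the issuance pipeline on the cert-manager side can be followed with standard kubectl commands (a sketch, assuming an ACME issuer so that Order and Challenge resources exist; "my-cert" is a placeholder name):

# Certificate -> CertificateRequest -> Order -> Challenge; the Secret only appears at the end
kubectl get certificate,certificaterequest,order,challenge -n gloo-system
kubectl describe certificate my-cert -n gloo-system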

The way nginx-ingress solves this problem is by serving a default self-signed certificate if the secret cannot be found.
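
For illustration, the fallback-certificate approach boils down to something like this (a rough sketch with placeholder names, not a Gloo feature today):

# Generate a self-signed certificate and store it as a TLS secret a gateway could fall back to
openssl req -x509 -nodes -newkey rsa:2048 -days 365 \
  -keyout tls.key -out tls.crt -subj "/CN=fallback.invalid"
kubectl create secret tls fallback-cert -n gloo-system --cert=tls.crt --key=tls.key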

Deleting Secrets That are Valid

I haven't had this issue; it seems reasonable to have a validation webhook on secret deletion, but, as you mentioned, with a flag to disable it for backwards compatibility.

One issue I have with the validation-webhook approach is that it introduces a notion of dependency (the VirtualService has to be deleted first, then the secret). This can make simple workflows using kubectl apply/delete break or become flaky, as there's no way to guarantee the order in which objects will be created or deleted.

Note that this is also true at creation time with some of the existing validation hooks. Here's an example of applying a single YAML file that contains a VirtualService and a cert-manager Certificate:

$ kubectl apply -f vs-test.yaml

Error from server: error when creating "test.mydomain.com.yaml": admission webhook "gloo.gloo-system.svc" denied the request: resource incompatible with current Gloo snapshot: [Validating v1.VirtualService failed: validating *v1.VirtualService name:"test.mydomain.com"  namespace:"gloo-test": failed to validate Proxy with Gloo validation server: Listener Error: SSLConfigError. Reason: SSL secret not found: list did not find secret test.mydomain.com]

To work around this one would have to either:

  • Retry later (once the cert has been fetched)
  • Use another tool such as terraform to manage the dependency
  • Set validation.alwaysAcceptResources to true, but that means that we will also accept truly malformed VirtualServices.

Serving a default cert when the secret cannot be found would be a simple way to solve this issue. The validation webhook could maybe just throw a warning instead of failing?
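
For reference, a sketch of the Helm values that control this behavior today, based on the flags used later in this thread (alwaysAcceptResources admits everything, including truly malformed VirtualServices, while allowWarnings governs whether resources that only produce warnings are admitted):

helm upgrade -n gloo-system gloo gloo/gloo \
  --set gateway.validation.alwaysAcceptResources=true \
  --set gateway.validation.allowWarnings=true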

Proxy restarts causing differing behavior in the new pod from the previous one

The listener should keep work functioning for all connections where a proper cert is existing

Merging these two questions, as I'm not sure which one this falls into, but what happened to us was that the Gloo gateway was in an error state (because the certificate couldn't be fetched and thus the TLS secret couldn't be found).
Everything still worked fine until the gloo pod restarted (not gateway-proxy), after which the new pod decided to drop the whole listener (stripping invalid listener from proxy):

$ curl localhost:19000/listeners

prometheus_listener::0.0.0.0:8081
listener-::-8080::[::]:8080

(This was tested on gloo OSS 1.13.30)

This can be easily reproduced by creating a VirtualService referencing a secret that doesn't exist (with validation.alwaysAcceptResources set to true). Then restart the gloo pod.
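
A minimal sketch of such a VirtualService (all names are placeholders; the referenced secret intentionally does not exist):

kubectl apply -f - << EOF
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: broken-tls-vs
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - broken.example.com
    routes:
      - matchers:
          - prefix: /
        directResponseAction:
          status: 200
          body: unreachable
  sslConfig:
    secretRef:
      name: does-not-exist    # secret intentionally missing
      namespace: gloo-system
EOF

# Then restart the gloo pod and watch the HTTPS listener disappear
kubectl rollout restart deployment/gloo -n gloo-system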

I do have other HTTPS VirtualServices with a good config, so it's not that we ended up with 0 valid routes.
This is, IMO, the critical part of the issue: a single badly configured VS can take everything down with it.

Here are the logs on gateway-proxy when this happens (I removed the deprecation warnings):

[2023-11-01 00:31:15.180][7][info][upstream] [external/envoy/source/server/lds_api.cc:82] lds: add/update listener 'listener-::-8080'
[2023-11-01 00:31:15.222][7][info][upstream] [external/envoy/source/server/lds_api.cc:82] lds: add/update listener 'listener-::-8443'
[2023-11-01 00:31:15.228][7][info][config] [external/envoy/source/server/listener_manager_impl.cc:841] all dependencies initialized. starting workers
[2023-11-01 00:40:27.700][7][warning][config] [external/envoy/source/common/config/grpc_stream.h:160] StreamAggregatedResources gRPC config stream closed: 13, 
[2023-11-01 00:40:28.058][7][info][upstream] [external/envoy/source/server/lds_api.cc:63] lds: remove listener 'listener-::-8443'
[2023-11-01 00:46:15.228][7][info][main] [external/envoy/source/server/drain_manager_impl.cc:171] shutting down parent after drain

Logs for the gloo pod (note that the timestamps won't match, as I had to restart the pod again to capture the logs):

gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.468Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:92","msg":"begin sync 5563496949759613014 (56 virtual services, 6 gateways, 30 route tables)","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.470Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:132","msg":"desired proxy name:\"gateway-proxy\" namespace:\"gloo-system\"","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.476Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"validation/server.go:189","msg":"received proxy validation request","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.493Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer.proxy-validator.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8080","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.496Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer.proxy-validator.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8443","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"warn","ts":"2023-11-01T01:14:13.548Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"reconciler/proxy_reconciler.go:164","msg":"stripping invalid listener from proxy","version":"1.13.30","proxy":"name:\"gateway-proxy\" namespace:\"gloo-system\"","listener":"listener-::-8443"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.549Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.TranslatorSyncer","caller":"syncer/translator_syncer.go:104","msg":"end sync 5563496949759613014","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.570Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:80","msg":"begin sync 11661502467246133202 (1 proxies, 471 upstreams, 579 endpoints, 197 secrets, 355 artifacts, 0 auth configs, 0 rate limit configs, 0 graphql apis)","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.583Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer.translator","caller":"translator/translator.go:223","msg":"computing envoy resources for listener: listener-::-8080","version":"1.13.30"}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.597Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:165","msg":"Setting xDS Snapshot","version":"1.13.30","key":"gloo-system~gateway-proxy","clusters":469,"listeners":1,"routes":1,"endpoints":451}
gloo-system/gloo-c45787569-mgpr6[gloo]: {"level":"info","ts":"2023-11-01T01:14:13.597Z","logger":"gloo.v1.event_loop.setup.gloosnapshot.event_loop.envoyTranslatorSyncer","caller":"syncer/envoy_translator_syncer.go:175","msg":"end sync 11661502467246133202","version":"1.13.30"}

@nil-scan

nil-scan commented Nov 8, 2023

I just found out that cert-manager is able to issue a temporary certificate by setting the cert-manager.io/issue-temporary-certificate annotation to "true" on the cert.

This works nicely as a workaround, and it actually makes a lot more sense for cert-manager to do that rather than gloo.
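
For anyone else hitting this, a sketch of what that looks like on a Certificate (names are placeholders; the annotation is the important part):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-cert
  namespace: gloo-system
  annotations:
    # Ask cert-manager to issue a temporary self-signed certificate
    # into the Secret while the real certificate is still being issued
    cert-manager.io/issue-temporary-certificate: "true"
spec:
  secretName: my-cert
  dnsNames:
    - my-app.example.com
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer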

I think there's still work to do on the gloo side:

  • Stripping a whole listener because of a bad VS configuration shouldn't happen. At most it should remove the VS itself, and ideally keep the last known good config whenever possible.
  • We need to set alwaysAccept to true to create VirtualServices with cert-manager; I would rather keep it set to false to catch other misconfiguration issues.
  • There's still the underlying possibility that the secret won't be created for whatever reason, leading to a full outage.

@nfuden nfuden self-assigned this Mar 5, 2024
@htpvu htpvu added the Prioritized Indicating issue prioritized to be worked on in RFE stream label Mar 6, 2024
@kcbabo kcbabo added the Area: Stability Issues related to stability of the product, engineering, tech debt label Apr 2, 2024
@nfuden nfuden removed their assignment Apr 15, 2024
@AkshayAdsul

Another related issue
https://solo-io.zendesk.com/agent/tickets/3753

@nrjpoddar

@sam-heilbron to look at this issue for scope/estimation.

@soloio-bot

Zendesk ticket #3753 has been linked to this issue.

@jbohanon
Contributor

Reproduction

I have been able to reproduce this with the following steps (assuming a running cluster):

  1. Create the cert-manager installation
kubectl create namespace cert-manager
kubectl apply --validate=false -f https://github.com/cert-manager/cert-manager/releases/download/v1.11.0/cert-manager.yaml
  2. Create the gloo installation
helm repo add gloo https://storage.googleapis.com/solo-public-helm
helm repo update
helm install -n gloo-system gloo gloo/gloo --set gateway.persistProxySpec=true --set gateway.validation.alwaysAcceptResources=false --set gateway.validation.allowWarnings=false
  3. Create the cert-manager pre-requisites
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-selfsigned-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: my-selfsigned-ca
  secretName: root-secret
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
    group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: my-ca-issuer
spec:
  ca:
    secretName: root-secret
EOF
  4. Create a Certificate and a VirtualService which consumes its resultant secret
kubectl apply -f - << EOF
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-1.solo.io
  namespace: gloo-system
spec:
  secretName: test-1.solo.io
  dnsNames:
  - test-1.solo.io
  issuerRef:
    name: my-ca-issuer
    kind: ClusterIssuer
---
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: direct-response-ssl-1
  namespace: gloo-system
spec:
  virtualHost:
    domains:
      - test-1.solo.io
      - test-1.solo.io:8443
    routes:
    - matchers:
       - prefix: /
      directResponseAction:
        status: 200
        body: you did it, number 1!
  sslConfig:
    secretRef:
      name: test-1.solo.io
      namespace: gloo-system
    oneWayTls: true
    sniDomains: [ "test-1.solo.io" ]
EOF
  5. At this point, note that we get a validation error since cert-manager was not given time to create the secret before Gloo attempted to reconcile the VS. Running the exact same command again will succeed if and only if the cert (and its secret) were created first, which they should have been given the ordering of the resources in the command.
certificate.cert-manager.io/test-1.solo.io created
Error from server: error when creating "STDIN": admission webhook "gloo.gloo-system.svc" denied the request: resource incompatible with current Gloo snapshot: [Validating *v1.VirtualService failed: 1 error occurred:
        * Validating *v1.VirtualService failed: validating *v1.VirtualService name:"direct-response-ssl-1" namespace:"gloo-system": 1 error occurred:
        * failed to validate Proxy [namespace: gloo-system, name: gateway-proxy] with gloo validation: Listener Error: SSLConfigError. Reason: SSL secret not found: list did not find secret gloo-system.test-1.solo.io
  6. Port-forward the gateway-proxy in a separate terminal
kubectl port-forward -n gloo-system deploy/gateway-proxy 8443
  7. Add the following lines to your hosts file for the hosts we will be using
127.0.0.1 test-1.solo.io
127.0.0.1 test-2.solo.io
  8. Execute a curl request showing that the proxy is responding as expected
curl -vk https://test-1.solo.io:8443
  9. Delete the Certificate resource and secret
    NOTE: this requires disabling validation. Currently, validation settings should prevent this particular failure mode, and when used in conjunction with the cert-manager configuration to enable temporary certs during issuance, the problem of a missing secret should be nullified.
    ...unless validation has to be disabled for cert-manager to work properly...
kubectl patch settings -n gloo-system default --type merge -p '{"spec":{"gateway":{"validation":{"alwaysAccept":true}}}}'
sleep 1
kubectl delete Certificate -n gloo-system test-1.solo.io && kubectl delete secret -n gloo-system test-1.solo.io
  10. Watch for the proxy to eventually stop serving
watch curl -vk https://test-1.solo.io:8443
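
To confirm that the 8443 listener itself has been stripped (rather than the backend merely failing), check the Envoy admin interface on the gateway-proxy (port and listener names as seen earlier in this thread):

kubectl port-forward -n gloo-system deploy/gateway-proxy 19000 &
curl -s localhost:19000/listeners
# listener-::-8443 should be missing from the output once it has been stripped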

An odd state

(screenshot of resource statuses omitted)
We can see here that the VirtualService, where the actual configuration issue is occurring, shows Accepted with a subresource status for the Proxy showing Accepted; the Gateway which consumes the VirtualService shows Rejected with a subresource status for the Proxy showing Accepted; and the Proxy shows Rejected. At the very least this is inconsistent status reporting that misleads as to where the error is actually occurring.
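
To inspect those statuses directly (a sketch; glooctl check is the usual starting point):

# Summarize the health of Gloo resources, including rejected gateways and proxies
glooctl check

# Or read the raw status blocks on the resources themselves
kubectl get virtualservices.gateway.solo.io,gateways.gateway.solo.io,proxies.gloo.solo.io -n gloo-system -o yaml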

@jbohanon
Contributor

jbohanon commented Sep 3, 2024

This work will be released in the following versions:

Gloo Edge:
v1.18.0-beta16
v1.17.5
v1.16.20
v1.15.31

Gloo Edge Enterprise:
v1.18.0-beta1
v1.17.2
v1.16.15
v1.15.22

This work will not be available in 1.14 versions of Gloo Edge because the validating webhook behavior was significantly different, and I was unable to reproduce the issue on 1.14.

@jbohanon
Contributor

This work is merged into all relevant versions as detailed in the comment above.
