
[cert-manager] invalid peer certificate: UnknownIssuer #509

Closed
plaisted opened this issue Sep 12, 2023 · 22 comments · Fixed by #540 or #601
Labels
bug Something isn't working

Comments

@plaisted

I have a cluster with mtls set up as shown below. I originally had the certificate durations much shorter but updated to the length below.

  mTLS:
    certificatesDuration:
      clientCertificates: 720h0m0s
      frontendCertificate: 24h0m0s
      intermediateCAsCertificates: 1440h0m0s
      internodeCertificate: 24h0m0s
      rootCACertificate: 2160h0m0s
    frontend:
      enabled: true
    internode:
      enabled: true
    provider: cert-manager
    refreshInterval: 1h0m0s

Later I attempted to connect via a client using a certificate from the secret created via TemporalClusterClient and got invalid peer certificate: UnknownIssuer. I then:

  • deleted and recreated the TemporalClusterClient and used the new secret; still had the same issue.
  • deleted the frontend deployment for Temporal and waited for it to be recreated; still had the same issue.
  • deleted the frontend intermediate CA secret, the frontend certificate secret, and the frontend deployment; after that, the connection worked.

I'm not sure if this was due to me updating the certificate durations on the TemporalCluster or some other issue, but somehow the certificates were not properly renewed. I'll keep experimenting with this but opened the issue in case others have similar problems.

alexandrevilain added the bug label on Sep 21, 2023
@plaisted
Author

A similar issue occurred again with clientCertificates. They are set for 30-day rotation; at the start of the 30th day, clients using the secret began crashing with BadCertificate when attempting to connect to Temporal.

I deleted the secret and deleted / recreated the TemporalClusterClient and connection worked fine.

I have multiple TemporalClusterClients in different namespaces. This one is not in the same namespace as the cluster itself. The TemporalClusterClient in the same namespace appears to have been updated fine, although that could be a coincidence (or cert-manager doing it, since the secret is referenced by the Certificate).

@alexandrevilain Any suggestions on how to troubleshoot this? I'm not familiar with how the operator handles these client certs.

@plaisted
Author

Some additional details: for one of the certs that was not rotated, I added an annotation to the TemporalClusterClient and verified that the operator did run reconciliation (with no errors). I checked the secret and it was not updated to the latest TLS cert.

I then deleted the secret and changed the annotation again; the operator reconciled again (with no errors), and this time the secret was recreated with the new, correct TLS cert.

Digging in (warning: I haven't touched Go in a long time), I see that the reconciler is using SecretCopier to copy the secret. SecretCopier calls controllerutil.CreateOrUpdate on the secret. Looking at the docs for controllerutil.CreateOrUpdate, I see:

The object's desired state must be reconciled with the existing state inside the passed in callback MutateFn.

In this case an empty function is passed as the callback, so based on the docs it probably never updates the secret if it already exists.
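
For illustration, here is a minimal sketch of what a non-empty mutate callback could look like. This is not the operator's actual SecretCopier code; the package, function, and parameter names are made up. The point is that copying the source secret's data inside the callback makes controllerutil.CreateOrUpdate refresh an already existing destination secret instead of leaving it untouched.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// copySecret creates or updates a secret named dstName in dstNamespace with
// the type and data of src. Because the copy happens inside the MutateFn,
// it runs on updates as well, not only when the destination is first created.
func copySecret(ctx context.Context, c client.Client, src *corev1.Secret, dstNamespace, dstName string) error {
	dst := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: dstName, Namespace: dstNamespace},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, c, dst, func() error {
		// Reconcile the desired state with the existing state, as the docs require.
		dst.Type = src.Type
		dst.Data = src.Data
		return nil
	})
	return err
}

With an empty MutateFn, by contrast, an existing secret is left exactly as it is, which matches the stale-secret behaviour described above.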

On a second note, I'm not sure when the reconciler runs for the TemporalClusterClient; if it only runs on changes (and not periodically as well), reconciliation wouldn't be triggered on a secret rotation, so that may be an issue too.

@nchen622

I am also facing the exact same issue: the secret in the client namespace does not get updated when a cert is renewed in the cluster namespace. Can you share an expected timeline for fixing this bug? We tried this operator out and plan to use it, but that won't work unless this bug is fixed.

If it won't be fixed soon, can you please share some details on the following so that we can try to fix it ourselves while waiting for the official fix?

  1. How does the TemporalClusterClient reconciler get triggered? It seems to only get triggered when first deployed.
  2. It seems the create works when the TemporalClusterClient gets deployed, but the update does not, so the secret in the client namespace becomes stale. Which code handles this? It seems that the same code is called for both create and update. If so, does that mean the problem is on the triggering side?
  3. Is there any workaround?

@alexandrevilain
Owner

Hi !

Thanks for reporting this issue!
Your insights are valuable to find what's going wrong with TemporalClusterClient.

I'm digging a bit more into the issue to find what's causing it.

@plaisted
Author

  3. Is there any workaround?

I've got a cron job set up to copy the secret between namespaces for now. Not elegant but works as long as you have a decent overlap between when the cert is rotated and when it expires.
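
For anyone who wants to script a similar workaround in Go instead of shelling out to kubectl, here is a minimal sketch using client-go that a scheduled job could run. The namespaces, names, and the function itself are placeholders for illustration, not part of the operator.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// syncSecret copies the secret named name from srcNS into dstNS,
// creating it if missing and overwriting its data if it already exists.
func syncSecret(ctx context.Context, cs kubernetes.Interface, srcNS, dstNS, name string) error {
	src, err := cs.CoreV1().Secrets(srcNS).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	existing, err := cs.CoreV1().Secrets(dstNS).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// Not there yet: create a fresh copy in the destination namespace.
		dst := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: dstNS},
			Type:       src.Type,
			Data:       src.Data,
		}
		_, err = cs.CoreV1().Secrets(dstNS).Create(ctx, dst, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}
	// Already there: overwrite the data so a renewed cert propagates.
	existing.Data = src.Data
	_, err = cs.CoreV1().Secrets(dstNS).Update(ctx, existing, metav1.UpdateOptions{})
	return err
}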

@nchen622

  3. Is there any workaround?

I've got a cron job set up to copy the secret between namespaces for now. Not elegant but works as long as you have a decent overlap between when the cert is rotated and when it expires.

Thanks, @plaisted ! We are also looking for potential work around. And will share what our solution is later once it's working for us.

@alexandrevilain
Owner

Hi @nchen622 @plaisted !

I submitted a PR that fixes the secret update when the certificate is renewed. This should be part of the 0.16.0 release :)

@nchen622

While testing the mTLS feature, we actually found another issue. When mTLS is enabled with cert-manager as the provider, everything still works fine after the worker certificate and frontend-intermediate-ca-certificate are renewed, but the system worker stops working right after the root-ca-certificate is renewed. I double-checked the logs and the renewal time: the failure started right after the root-ca-certificate was renewed. Right after the failure, even restarting the worker pod does not fix it; the only thing that has recovered it so far is deleting the certificates and having them reissued. Sometimes, after a long time, restarting the worker pod fixes it. I believe the admintool pod sees a similar issue.

@plaisted are you seeing the same?

@alexandrevilain please let me know if you need more information or logs. It seems pretty easy to reproduce; it happens consistently every time.

The configuration for certificate renewal is as below

mTLS:
  internode:
    enabled: true
  frontend:
    enabled: true
  certificatesDuration:
    rootCACertificate: 4h
    intermediateCAsCertificates: 2h
    clientCertificates: 1h
    frontendCertificate: 1h
    internodeCertificate: 1h
  refreshInterval: 5m

The logs show that the worker fails:

{"level":"warn","ts":"2023-10-28T01:54:45.515Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-history-scanner-taskqueue-0","WorkerID":"1@temporaldevmtls-worker-b94f46888-q6wpn@","WorkerType":"WorkflowWorker","Error":"last connection error: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA certificate\")"","logging-call-at":"internal_worker_base.go:308"}

@plaisted
Author

@nchen622 Yeah, I've noticed the same (or similar) issues. I haven't gotten a chance to dig in yet. For root CA rotation to work, I believe you'd need something like:

  • Root CA rotates; a new intermediate is issued; new client/server certs are issued.
  • Client/server components pick up the new root CA but keep the old CA and old certs. Servers accept requests with certs from either CA, and clients accept server certs from either CA.
  • Once all pods have both CAs, clients/servers update to the new certs.
  • Once all pods have the new certs, the old CA is removed from clients/servers.

That would obviously take a decent amount of coordination. I do see that https://pkg.go.dev/crypto/x509#CertPool.AppendCertsFromPEM accepts multiple CAs in the PEM, so that piece should work, but you'd need a way to combine the old and new roots into the CA files; I don't know if cert-manager provides a way to do this.
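
To illustrate that last point, here is a minimal sketch of a client-side tls.Config that trusts both the old and the new root CA during a rotation window. The file paths and the function are placeholders; this is not something the operator or cert-manager produces today.

package example

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// tlsConfigWithBothRoots builds a TLS config whose root pool contains every
// CA certificate found in the given PEM files, so server certs chaining to
// either the old or the new root keep verifying during the overlap period.
func tlsConfigWithBothRoots(oldRootPEM, newRootPEM, certFile, keyFile string) (*tls.Config, error) {
	pool := x509.NewCertPool()
	for _, path := range []string{oldRootPEM, newRootPEM} {
		pem, err := os.ReadFile(path)
		if err != nil {
			return nil, err
		}
		// AppendCertsFromPEM accepts a bundle with multiple CA certificates.
		if !pool.AppendCertsFromPEM(pem) {
			return nil, fmt.Errorf("no certificates found in %s", path)
		}
	}
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		RootCAs:      pool,
		Certificates: []tls.Certificate{cert},
	}, nil
}

The missing piece, as noted above, is producing a bundle containing both roots in the first place and distributing it to every pod.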

@nchen622

@plaisted Thanks for the notes. I thought the operator and cert-manager would handle this automatically. Maybe not fully? The strange thing is that all the internode communication still seemed fine after the root-ca-certificate was renewed. I wonder what the differences are, other than that they have their own intermediate-ca-certificates.

@alexandrevilain
Owner

Reopening as this issue has more than one subject.

@plaisted
Author

plaisted commented Nov 1, 2023

@nchen622 I thought so as well, but it seems cert-manager may not handle it (see here, which suggests a similar approach to what I mentioned, but it is not implemented in cert-manager). I haven't read up on it much, so maybe there is a way, but it seems that in the current setup, if you truly rotate the root CA / secret, you'll have downtime. It may be best to disable root CA rotation, or at least add a warning about downtime, etc.

I think I'll disable the root CA rotation and manually trigger rotation in a maintenance window when appropriate, unless someone has an automated solution.

@nchen622

nchen622 commented Nov 8, 2023

@plaisted Thanks for the suggestion. We also decided to make the root CA rotation cycle long and put a manual process around it.

@nchen622

@alexandrevilain Do you have an estimate for when the 0.16.0 release will be available? We want to use the feature where the client secret is auto-synced.

@alexandrevilain
Owner

Hi @nchen622 !
This is in progress: #559 :)

@vinimona

vinimona commented Dec 20, 2023

@alexandrevilain I tested with the 0.16.1 operator version and observed that the secret created by the TemporalClusterClient is not synced in the worker namespace after the certificate is renewed. The two secrets, i.e. the one in the temporal cluster namespace and the one in the worker namespace, differ in content, and we observe bad/invalid certificate errors. Could you please check?

Below is the manifest that I used -

apiVersion: temporal.io/v1beta1
kind: TemporalClusterClient
metadata:
  name: my-worker
  namespace: my-worker-ns
spec:
  clusterRef:
    name: temporal-cluster-dev
    namespace: temporal-cluster-ns

Below is the mTLS configuration in the temporal cluster manifest. We have set small durations to test the secret sync on certificate expiration and renewal.

mTLS:
  certificatesDuration:
    clientCertificates: 1h30m
    frontendCertificate: 1h30m
    intermediateCAsCertificates: 1h30m
    internodeCertificate: 1h30m
    rootCACertificate: 17520h
  frontend:
    enabled: true
  internode:
    enabled: true
  refreshInterval: 30m
  renewBefore: 5m

Our worker logs once the certs expire (and the secret is not updated on renewal) -

Caused by: javax.net.ssl.SSLException: error:10000412:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_CERTIFICATE
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.shutdownWithError(ReferenceCountedOpenSslEngine.java:1083)
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.sslReadErrorResult(ReferenceCountedOpenSslEngine.java:1377)
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1317)
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1404)
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1447)
	at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:222)
	at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1343)
	at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1236)
	at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1285)
	at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
	at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
	... 15 common frames omitted

Also, I don't see any errors in the temporal-operator logs.

@vinimona

vinimona commented Jan 3, 2024

@alexandrevilain I tested with the 0.16.1 operator version and observed that the secret created by the TemporalClusterClient is not synced in the worker namespace after the certificate is renewed (full details, manifest, and logs in my previous comment above).

@alexandrevilain Could you please take a look at the issue? Is there anything that I am missing? Thanks.

@alexandrevilain
Owner

Hi @vinimona !

Thanks for reporting this.
Could you please delete the operator pod to ensure a new reconciliation and check if the secret got updated ?
I need to find out if it's a watch issue or a sync issue.

@vinimona

vinimona commented Jan 4, 2024

@alexandrevilain Thanks for responding. I deleted the operator pod; it then reconciled the cluster client and the secret in the worker namespace was synced. When the certs were renewed after the given duration, it went back to the same broken state. I again deleted the operator pod and it synced the secrets. So it appears to be a watch issue.

Could you please prioritize this?

@alexandrevilain
Owner

Could you please prioritize this?

Please note that I'm maintaining this in my free time.

I deleted the operator pod; it then reconciled the cluster client and the secret in the worker namespace was synced.

Thanks for the information, I found the issue. The operator was watching Certificates and trying to reconcile the TemporalClusterClient owning each Certificate, but the actual owner is the TemporalCluster.
This is now fixed in #601
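
For anyone curious about the general pattern (a hedged sketch only, not the actual change in #601; the annotation keys below are purely hypothetical), the idea is to map each watched Certificate event to the resource that should be reconciled and plug that mapper into handler.EnqueueRequestsFromMapFunc when building the controller. The exact builder signature depends on the controller-runtime version; this assumes v0.15+.

package example

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// mapCertificateToClusterClient decides which TemporalClusterClient to
// reconcile when a watched cert-manager Certificate changes (e.g. on renewal).
// Here a hypothetical annotation carries the target's namespace and name;
// the real operator derives the target differently.
func mapCertificateToClusterClient(ctx context.Context, obj client.Object) []reconcile.Request {
	annotations := obj.GetAnnotations()
	ns, okNS := annotations["example.io/cluster-client-namespace"]
	name, okName := annotations["example.io/cluster-client-name"]
	if !okNS || !okName {
		return nil
	}
	return []reconcile.Request{
		{NamespacedName: types.NamespacedName{Namespace: ns, Name: name}},
	}
}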

@ElanHasson
Contributor

@alexandrevilain I still see the issue where the certs are rotated and everything basically crash-loops or fails requests due to expired certs. I have to delete the cert secrets and also the deployments to let the operator reconcile them properly.

Is this one of the issues covered here? It looks like this one was only about CAs; in my case it seems to affect all certs.

I can open a new issue if this isn't the same problem.

@alexandrevilain
Owner

Hi @ElanHasson !

Let's open another issue, it will be easier to follow :)

Thanks !
