
[cert-manager] invalid peer certificate: UnknownIssuer #509

Closed
plaisted opened this issue Sep 12, 2023 · 22 comments · Fixed by #540 or #601
Labels
bug Something isn't working

Comments

@plaisted

I have a cluster with mtls set up as shown below. I originally had the certificate durations much shorter but updated to the length below.

  mTLS:
    certificatesDuration:
      clientCertificates: 720h0m0s
      frontendCertificate: 24h0m0s
      intermediateCAsCertificates: 1440h0m0s
      internodeCertificate: 24h0m0s
      rootCACertificate: 2160h0m0s
    frontend:
      enabled: true
    internode:
      enabled: true
    provider: cert-manager
    refreshInterval: 1h0m0s

Later I attempted to connect via a client using a certificate from the secret created via TemporalClusterClient and got invalid peer certificate: UnknownIssuer. I then:

  • deleted and recreated the TemporalClusterClient and used the new secret; still had the same issue.
  • deleted the frontend deployment for Temporal and waited for it to be recreated; still had the same issue.
  • deleted the frontend intermediate CA secret, the frontend certificate secret, and the frontend deployment; after that, the connection worked.

I'm not sure if this was due to me updating the certificate durations on the TemporalCluster or some other issue, but somehow the certificates were not properly renewed. I'll keep experimenting with this but opened the issue in case others have similar problems.

alexandrevilain added the bug label on Sep 21, 2023
@plaisted
Author

A similar issue occurred again with clientCertificates. They are set for 30-day rotation; at the start of the 30th day, clients using the secret began crashing with BadCertificate when attempting to connect to Temporal.

I deleted the secret and deleted / recreated the TemporalClusterClient and connection worked fine.

I have multiple TemporalClusterClients in different namespaces. This one is not in the same namespace as the cluster itself. The TemporalClusterClient in the same namespace appears to have been updated fine, although that could be a coincidence (or cert-manager doing it, since the secret is referenced by the Certificate).

@alexandrevilain Any suggestions on how to troubleshoot this? I'm not familiar with how the operator handles these client certs.

@plaisted
Author

Some additional details: for one of the certs that was not rotated, I added an annotation to the TemporalClusterClient and verified that the operator did run reconciliation (with no errors). I checked the secret and it was not updated to the latest TLS cert.

I then deleted the secret and changed the annotation again; the operator reconciled again (with no errors), and this time the secret was recreated with the new, correct TLS cert.

Digging in (warning: I haven't touched Go in a long time), I see that the reconciler is using SecretCopier to copy the secret. SecretCopier calls controllerutil.CreateOrUpdate on the secret. Looking at the docs for controllerutil.CreateOrUpdate, I see:

The object's desired state must be reconciled with the existing state inside the passed in callback MutateFn.

In this case an empty function is passed as the callback, so based on the docs it probably never updates the secret if it already exists.
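
For illustration, here is a minimal sketch of what a non-empty mutate callback could look like. This is not the operator's actual SecretCopier code; the package, function, and parameter names are made up. The point is that copying the source secret's data inside the callback makes controllerutil.CreateOrUpdate refresh an already existing destination secret instead of leaving it untouched.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// copySecret creates or updates a secret named dstName in dstNamespace with
// the type and data of src. Because the copy happens inside the MutateFn,
// it runs on updates as well, not only when the destination is first created.
func copySecret(ctx context.Context, c client.Client, src *corev1.Secret, dstNamespace, dstName string) error {
	dst := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: dstName, Namespace: dstNamespace},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, c, dst, func() error {
		// Reconcile the desired state with the existing state, as the docs require.
		dst.Type = src.Type
		dst.Data = src.Data
		return nil
	})
	return err
}

With an empty MutateFn, by contrast, an existing secret is left exactly as it is, which matches the stale-secret behaviour described above.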

On a second note, I'm not sure when the reconciler runs for the TemporalClusterClient; if it only runs on changes (and not periodically as well), reconciliation wouldn't be triggered on a secret rotation, so that may be an issue too.

@nchen622

I am also facing the exact same issue: the secret in the client namespace does not get updated when a cert is renewed in the cluster namespace. Can you share an expected timeline for fixing this bug? We tried this operator out and plan to use it, but that won't work unless this bug is fixed.

If it won't be fixed soon, can you please share some details on the following so that we can try to fix it ourselves while waiting for the official fix?

  1. How does the TemporalClusterClient reconciler get triggered? It seems to only get triggered when first deployed.
  2. It seems the create works when the TemporalClusterClient gets deployed, but the update does not, so the secret in the client namespace becomes stale. Which code handles this? It seems that the same code is called for both create and update. If so, does that mean the problem is on the triggering side?
  3. Is there any workaround?

@alexandrevilain
Owner

Hi !

Thanks for reporting this issue!
Your insights are valuable to find what's going wrong with TemporalClusterClient.

I'm digging a bit more into the issue to find what's causing it.

@plaisted
Author

  3. Is there any workaround?

I've got a cron job set up to copy the secret between namespaces for now. Not elegant but works as long as you have a decent overlap between when the cert is rotated and when it expires.
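
For anyone who wants to script a similar workaround in Go instead of shelling out to kubectl, here is a minimal sketch using client-go that a scheduled job could run. The namespaces, names, and the function itself are placeholders for illustration, not part of the operator.

package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// syncSecret copies the secret named name from srcNS into dstNS,
// creating it if missing and overwriting its data if it already exists.
func syncSecret(ctx context.Context, cs kubernetes.Interface, srcNS, dstNS, name string) error {
	src, err := cs.CoreV1().Secrets(srcNS).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	existing, err := cs.CoreV1().Secrets(dstNS).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// Not there yet: create a fresh copy in the destination namespace.
		dst := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: dstNS},
			Type:       src.Type,
			Data:       src.Data,
		}
		_, err = cs.CoreV1().Secrets(dstNS).Create(ctx, dst, metav1.CreateOptions{})
		return err
	}
	if err != nil {
		return err
	}
	// Already there: overwrite the data so a renewed cert propagates.
	existing.Data = src.Data
	_, err = cs.CoreV1().Secrets(dstNS).Update(ctx, existing, metav1.UpdateOptions{})
	return err
}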

@nchen622

  3. Is there any workaround?

I've got a cron job set up to copy the secret between namespaces for now. Not elegant but works as long as you have a decent overlap between when the cert is rotated and when it expires.

Thanks, @plaisted ! We are also looking for potential work around. And will share what our solution is later once it's working for us.

@alexandrevilain
Owner

Hi @nchen622 @plaisted !

I submitted a PR that fixes the secret update when the certificate is renewed. This should be part of the 0.16.0 release :)

@nchen622

While testing the mTLS feature, we actually found another issue. When mTLS is enabled with cert-manager as the provider, everything still works fine after the worker certificate and frontend-intermediate-ca-certificate are renewed, but the system worker stops working right after the root-ca-certificate is renewed. I double-checked the logs and the renewal time: the failure started right after the root-ca-certificate was renewed. Right after the failure, even restarting the worker pod does not fix it; the only thing that has recovered it so far is deleting the certificates and having them reissued. Sometimes, after a long time, restarting the worker pod fixes it. I believe the admintool pod sees a similar issue.

@plaisted are you seeing the same?

@alexandrevilain please let me know if you need more information or logs. It seems pretty easy to reproduce; it happens consistently every time.

The configuration for certificate renewal is as below

mTLS:
  internode:
    enabled: true
  frontend:
    enabled: true
  certificatesDuration:
    rootCACertificate: 4h
    intermediateCAsCertificates: 2h
    clientCertificates: 1h
    frontendCertificate: 1h
    internodeCertificate: 1h
  refreshInterval: 5m

The logs show that the worker fails:

{"level":"warn","ts":"2023-10-28T01:54:45.515Z","msg":"Failed to poll for task.","service":"worker","Namespace":"temporal-system","TaskQueue":"temporal-sys-history-scanner-taskqueue-0","WorkerID":"1@temporaldevmtls-worker-b94f46888-q6wpn@","WorkerType":"WorkflowWorker","Error":"last connection error: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA certificate\")"","logging-call-at":"internal_worker_base.go:308"}

@plaisted
Author

@nchen622 Yeah, I've noticed the same (or similar) issues. I haven't gotten a chance to dig in yet. For root CA rotation to work, I believe you'd need something like:

  • Root CA rotates; a new intermediate is issued; new client/server certs are issued.
  • Client/server components pick up the new root CA but keep the old CA and old certs. Servers accept requests with certs from either CA, and clients accept server certs from either CA.
  • Once all pods have both CAs, clients/servers update to the new certs.
  • Once all pods have the new certs, the old CA is removed from clients/servers.

That would obviously take a decent amount of coordination. I do see that https://pkg.go.dev/crypto/x509#CertPool.AppendCertsFromPEM accepts multiple CAs in the PEM, so that piece should work, but you'd need a way to combine the old and new roots into the CA files; I don't know if cert-manager provides a way to do this.
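
To illustrate that last point, here is a minimal sketch of a client-side tls.Config that trusts both the old and the new root CA during a rotation window. The file paths and the function are placeholders; this is not something the operator or cert-manager produces today.

package example

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// tlsConfigWithBothRoots builds a TLS config whose root pool contains every
// CA certificate found in the given PEM files, so server certs chaining to
// either the old or the new root keep verifying during the overlap period.
func tlsConfigWithBothRoots(oldRootPEM, newRootPEM, certFile, keyFile string) (*tls.Config, error) {
	pool := x509.NewCertPool()
	for _, path := range []string{oldRootPEM, newRootPEM} {
		pem, err := os.ReadFile(path)
		if err != nil {
			return nil, err
		}
		// AppendCertsFromPEM accepts a bundle with multiple CA certificates.
		if !pool.AppendCertsFromPEM(pem) {
			return nil, fmt.Errorf("no certificates found in %s", path)
		}
	}
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		RootCAs:      pool,
		Certificates: []tls.Certificate{cert},
	}, nil
}

The missing piece, as noted above, is producing a bundle containing both roots in the first place and distributing it to every pod.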

@nchen622

@plaisted Thanks for the notes. I thought the operator and cert-manager would handle this automatically. Maybe not fully? The strange thing is that all the internode communication still seemed fine after the root-ca-certificate was renewed. I wonder what the differences are, other than that they have their own intermediate-ca-certificates.

@alexandrevilain
Owner

Reopening as this issue has more than one subject.

@plaisted
Author

plaisted commented Nov 1, 2023

@nchen622 I thought so as well, but it seems cert-manager may not handle it (see here, which suggests a similar approach to what I mentioned, but it is not implemented in cert-manager). I haven't read up on it much, so maybe there is a way, but it seems that in the current setup, if you truly rotate the root CA / secret, you'll have downtime. It may be best to disable root CA rotation, or at least add a warning about downtime, etc.

I think I'll disable the root CA rotation and manually trigger rotation in a maintenance window when appropriate, unless someone has an automated solution.

@nchen622

nchen622 commented Nov 8, 2023

@plaisted Thanks for the suggestion. We also decided to make the root CA rotation cycle long and put a manual process around it.

@nchen622

@alexandrevilain Do you have an estimate for when the 0.16.0 release will be available? We want to use the feature where the client secret is auto-synced.

@alexandrevilain
Owner

Hi @nchen622 !
This is in progress: #559 :)

@vinimona

vinimona commented Dec 20, 2023

@alexandrevilain I tested with the 0.16.1 operator version and observed that the secret created by the TemporalClusterClient is not synced in the worker namespace after the certificate is renewed. The two secrets, i.e. the one in the temporal cluster namespace and the one in the worker namespace, differ in content, and we observe bad/invalid certificate errors. Could you please check?

Below is the manifest that I used -

apiVersion: temporal.io/v1beta1
kind: TemporalClusterClient
metadata:
  name: my-worker
  namespace: my-worker-ns
spec:
  clusterRef:
    name: temporal-cluster-dev
    namespace: temporal-cluster-ns

Below is the mTLS configuration in the temporal cluster manifest. We have set small durations to test the secret sync on certificate expiration and renewal.

mTLS:
  certificatesDuration:
    clientCertificates: 1h30m
    frontendCertificate: 1h30m
    intermediateCAsCertificates: 1h30m
    internodeCertificate: 1h30m
    rootCACertificate: 17520h
  frontend:
    enabled: true
  internode:
    enabled: true
  refreshInterval: 30m
  renewBefore: 5m

Our worker logs once the certs expire (and the secret is not updated on renewal) -

Caused by: javax.net.ssl.SSLException: error:10000412:SSL routines:OPENSSL_internal:SSLV3_ALERT_BAD_CERTIFICATE
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.shutdownWithError(ReferenceCountedOpenSslEngine.java:1083)
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.sslReadErrorResult(ReferenceCountedOpenSslEngine.java:1377)
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1317)
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1404)
	at io.grpc.netty.shaded.io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1447)
	at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:222)
	at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1343)
	at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1236)
	at io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1285)
	at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
	at io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
	... 15 common frames omitted

Also, I don't see any errors in the temporal-operator logs.

@vinimona

vinimona commented Jan 3, 2024

@alexandrevilain I tested with the 0.16.1 operator version and observed that the secret created by the TemporalClusterClient is not synced in the worker namespace after the certificate is renewed (full details, manifest, and logs in my previous comment above).

@alexandrevilain Could you please take a look at the issue? Is there anything that I am missing? Thanks.

@alexandrevilain
Owner

Hi @vinimona !

Thanks for reporting this.
Could you please delete the operator pod to ensure a new reconciliation and check if the secret got updated ?
I need to find out if it's a watch issue or a sync issue.

@vinimona

vinimona commented Jan 4, 2024

@alexandrevilain Thanks for responding. I deleted the operator pod; it then reconciled the cluster client and the secret in the worker namespace was synced. When the certs were renewed after the given duration, it went back to the same broken state. I again deleted the operator pod and it synced the secrets. So it appears to be a watch issue.

Could you please prioritize this?

@alexandrevilain
Owner

Could you please prioritize this?

Please note that I'm maintaining this in my free time.

I deleted the operator pod; it then reconciled the cluster client and the secret in the worker namespace was synced.

Thanks for the information, I found the issue. The operator was watching Certificates and trying to reconcile the TemporalClusterClient owning each Certificate, but the actual owner is the TemporalCluster.
This is now fixed in #601
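
For anyone curious about the general pattern (a hedged sketch only, not the actual change in #601; the annotation keys below are purely hypothetical), the idea is to map each watched Certificate event to the resource that should be reconciled and plug that mapper into handler.EnqueueRequestsFromMapFunc when building the controller. The exact builder signature depends on the controller-runtime version; this assumes v0.15+.

package example

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// mapCertificateToClusterClient decides which TemporalClusterClient to
// reconcile when a watched cert-manager Certificate changes (e.g. on renewal).
// Here a hypothetical annotation carries the target's namespace and name;
// the real operator derives the target differently.
func mapCertificateToClusterClient(ctx context.Context, obj client.Object) []reconcile.Request {
	annotations := obj.GetAnnotations()
	ns, okNS := annotations["example.io/cluster-client-namespace"]
	name, okName := annotations["example.io/cluster-client-name"]
	if !okNS || !okName {
		return nil
	}
	return []reconcile.Request{
		{NamespacedName: types.NamespacedName{Namespace: ns, Name: name}},
	}
}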

@ElanHasson
Contributor

@alexandrevilain I still see the issue where the certs are rotated and everything basically crash-loops or fails requests due to expired certs. I have to delete the cert secrets and also the deployments to let the operator reconcile them properly.

Is this one of the issues covered here? It looks like this one was only about CAs; in my case it seems to affect all certs.

I can open a new issue if this isn't the same problem.

@alexandrevilain
Owner

Hi @ElanHasson !

Let's open another issue, it will be easier to follow :)

Thanks !
