TLS connection error on the transport protocol during rolling upgrades #2823
Comments
May be related to #2659 where we introduced |
I managed to reproduce this locally by triggering a few rolling upgrades on a 3-node cluster. Pod
Pasting this certificate into https://www.sslshopper.com/certificate-decoder.html reports:
IP Address Inspecting the content of secret
It reports IP So it looks like ES did not reload its transport cert file and still has the "old" one loaded in memory? |
At this point, if I manually delete the Pod certificate entry from the secret, which causes a new one to be created automatically, the situation is unblocked. The ES process loads the new certificate file and we move on with the rolling upgrade. |
I'm wondering if the following may happen when Elasticsearch starts (after a rolling upgrade):
I'm looking into the ES code to figure it out. |
I'm assuming it also works if you kill the pod and let it restart? That might be an easier repro step. |
I discussed this offline with ES devs, who confirmed there's a time window (~5 sec) between loading the certs for the first time and starting to watch the filesystem for changes, during which any update to the file is ignored. As a result, the correct cert is not served if the file is updated within that window. I created an issue in the ES repo. Until this gets resolved, I think we can improve the init container startup script accordingly. |
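To make the window concrete, here is a minimal Python sketch of the race. It is not Elasticsearch code: `CertHolder`, `simulate`, the file name, and the digest-based watcher are all hypothetical stand-ins. The watcher only notices changes relative to the file state it saw when it started, so an update that lands between the initial load and the start of the watch is lost unless the holder re-checks once at that point.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256(data: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


class CertHolder:
    """Hypothetical in-memory holder for a certificate file (not ES code)."""

    def __init__(self, path: Path):
        self.path = path
        self.loaded = path.read_bytes()  # initial load
        self.seen_digest = None          # watcher baseline, unset until watching

    def start_watching(self, recheck: bool) -> None:
        """Start the watch; the baseline is the file state at this moment."""
        on_disk = self.path.read_bytes()
        self.seen_digest = sha256(on_disk)
        # Optional mitigation: compare what is on disk now with what was
        # loaded earlier, and reload if the file changed inside the window.
        if recheck and sha256(self.loaded) != self.seen_digest:
            self.loaded = on_disk

    def poll(self) -> bool:
        """Reload only if the file changed since the watcher last saw it."""
        on_disk = self.path.read_bytes()
        current = sha256(on_disk)
        if current == self.seen_digest:
            return False
        self.seen_digest = current
        self.loaded = on_disk
        return True


def simulate(recheck: bool) -> bytes:
    """Rewrite the file between the initial load and the start of the watch."""
    with tempfile.TemporaryDirectory() as tmp:
        cert = Path(tmp) / "transport.crt"  # hypothetical file name
        cert.write_bytes(b"cert for old IP")
        holder = CertHolder(cert)
        cert.write_bytes(b"cert for new IP")  # lands inside the window
        holder.start_watching(recheck=recheck)
        holder.poll()  # the watcher sees nothing new relative to its baseline
        return holder.loaded
```

Under these assumptions, `simulate(recheck=False)` keeps the stale bytes in memory, which matches the symptom described in this issue, while `simulate(recheck=True)` ends up with the updated file.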
We discussed this with @nkvoll and @pebrc and came up with the following:
Upgrading existing clusters (this gets more complicated): We must pay extra attention to existing clusters out there when we decide to re-enable full TLS verification. It will trigger a rolling upgrade of the cluster. Once a Pod restarts with full TLS verification enabled, it may no longer be able to contact other Pods in the cluster if they have been impacted by the bug described in this issue, in which case the rolling upgrade will never complete. So we probably need to make sure we only switch a cluster to full TLS verification if it could not have been impacted by the bug: either because it's running a fixed Elasticsearch version, or because it has been created/upgraded by a fixed ECK version (with the init container fix).
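The rule above could be sketched as a simple predicate. This is only an illustration in Python, not ECK code: the function names and parameters are hypothetical, and the version thresholds are passed in as arguments because the fixed Elasticsearch and ECK versions are not known in this thread.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '7.1.1' into (7, 1, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def can_enable_full_verification(
    es_version: str,
    managed_by_eck_version: str,
    fixed_es_version: str,
    fixed_eck_version: str,
) -> bool:
    """Return True if the cluster could not have been hit by the bug.

    Either the cluster runs an Elasticsearch version that contains the fix,
    or it was created/upgraded by an ECK version that ships the init
    container fix. Both thresholds are assumptions supplied by the caller.
    """
    es_fixed = parse_version(es_version) >= parse_version(fixed_es_version)
    eck_fixed = parse_version(managed_by_eck_version) >= parse_version(fixed_eck_version)
    return es_fixed or eck_fixed
```

Clusters for which this returns False would stay on the relaxed verification mode until they are upgraded by a fixed ECK version or move to a fixed Elasticsearch version.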
|
As discussed out of band, long term we may also want to stop advertising IP addresses ( |
I created a bunch of issues to eventually handle this situation correctly. |
I observed several E2E tests failing during a rolling upgrade (example: https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-stack-versions/38/).
I think it randomly impacts all rolling upgrades where node IPs are replaced.
Symptoms:
ES 7.1.1 logs:
ES 6.8.5 logs: