TestUpdateESSecureSettings is failing #2380
Not sure if this is relevant but I noticed
in the logs which I haven't seen before. |
I reproduced it with the operator running locally and no webhook set up. Edit: my dev env is running on AKS, where Pods take 2 minutes to start up. I'm double checking the timings again on GKE, then I'll probably open a PR to fix the test timeout once I'm 100% confident this is causing the issue. |
On my 1.15 GKE cluster, the rolling upgrade takes about 200 secs for the three nodes, so it's well below the 5 min threshold. |
@pebrc you're right, my slow rolling upgrade was just coming from using AKS. Switched back to GKE and I also get below the 5 min threshold. Sorry for the noise! |
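For reference, a minimal sketch of what such a readiness wait with a configurable timeout could look like in Go; this is not the actual e2e framework code, the `upgradeDone` condition and the timeout value are illustrative only:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// rollingUpgradeTimeout must stay above the slowest expected upgrade:
// ~200s for three nodes on GKE, noticeably longer on AKS where Pods
// take ~2 minutes to start.
const rollingUpgradeTimeout = 10 * time.Minute

// waitForRollingUpgrade polls the given condition until it reports done
// or the timeout expires.
func waitForRollingUpgrade(upgradeDone wait.ConditionFunc) error {
	return wait.PollImmediate(5*time.Second, rollingUpgradeTimeout, upgradeDone)
}

func main() {
	start := time.Now()
	// Stand-in condition; a real test would compare StatefulSet revisions
	// and Pod readiness through the Kubernetes API.
	err := waitForRollingUpgrade(func() (bool, error) {
		return time.Since(start) > time.Second, nil
	})
	fmt.Println("rolling upgrade finished, err:", err)
}
```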
I'm investigating this build. Comparing the status of the unready pod in the test logs:
vs. status of the Pod from the support archive:
22 seconds after the test ended (5min timeout), the container was ready and the test would probably have succeeded. Looking at the operator logs:
Then we get That could be related to the error @pebrc pointed out:
Same investigation on build 179:
But then we see I can also see a bunch of Same thing on build 146 (stack-versions test). Since we added a 30sec wait to the Pod deletion, I think we can expect to see
Based on the logs, the keystore test seems to be the first "real" rolling upgrade test executed. I guess this failure would also happen for other rolling upgrade tests. |
I am wondering whether we have some variability in the pre-stop hook as well. If the endpoint removal from the service is slow, we might have longer pre-stop hook run times, worst case 50 secs IIUC. |
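To make the pre-stop concern concrete, here is a rough sketch (not the operator's actual Pod template) of a preStop hook whose wait time is charged against terminationGracePeriodSeconds; the `sleep 50` stands in for the worst-case wait for the endpoint to be removed from the Service, and the grace period and image version are illustrative values:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func podSpecWithPreStopWait() corev1.PodSpec {
	grace := int64(180) // hypothetical budget: must cover the pre-stop wait plus ES shutdown
	return corev1.PodSpec{
		TerminationGracePeriodSeconds: &grace,
		Containers: []corev1.Container{{
			Name:  "elasticsearch",
			Image: "docker.elastic.co/elasticsearch/elasticsearch:7.5.1", // illustrative version
			Lifecycle: &corev1.Lifecycle{
				// On client-go versions before v0.23 this type is corev1.Handler.
				PreStop: &corev1.LifecycleHandler{
					Exec: &corev1.ExecAction{
						// Placeholder for the real hook, which waits for the Pod's
						// endpoint to disappear from the Service (up to ~50s).
						Command: []string{"sh", "-c", "sleep 50"},
					},
				},
			},
		}},
	}
}

func main() {
	spec := podSpecWithPreStopWait()
	fmt.Printf("grace period: %ds, containers: %d\n",
		*spec.TerminationGracePeriodSeconds, len(spec.Containers))
}
```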
Just to confirm that something random happens when the container is killed/stopped: each time the test fails, the container kill/stop takes more than two minutes.
# https://devops-ci.elastic.co/view/cloud-on-k8s/job/cloud-on-k8s-versions-gke/179/1
2020-01-09T01:43:46Z Killing container with id docker://elasticsearch:Need to kill Pod test-es-keystore-q6kh-es-masterdata-0
2020-01-09T01:45:50Z Successfully assigned e2e-j2ibb-mercury/test-es-keystore-q6kh-es-masterdata-0 to gke-eck-gke12-179-e2e-default-pool-9085c232-svxp
# https://devops-ci.elastic.co/view/cloud-on-k8s/job/cloud-on-k8s-versions-gke/179/2
2020-01-09T01:25:15Z Stopping container elasticsearch
2020-01-09T01:27:59Z Successfully assigned e2e-gw6m8-mercury/test-es-keystore-xkg8-es-masterdata-0 to gke-eck-gke14-179-e2e-default-pool-ff5f4e13-j29p
# https://devops-ci.elastic.co/view/cloud-on-k8s/job/cloud-on-k8s-stack/146/3
2020-01-09T01:27:07Z Killing container with id docker://elasticsearch:Need to kill Pod test-es-keystore-vnrl-es-masterdata-0
2020-01-09T01:29:12Z Successfully assigned e2e-sife0-mercury/test-es-keystore-vnrl-es-masterdata-0 to gke-eck-73-146-e2e-default-pool-6f60a53a-69hr
When the test succeeds, it takes a few seconds:
# https://devops-ci.elastic.co/view/cloud-on-k8s/job/cloud-on-k8s-versions-gke/119/1
2019-12-11T01:11:24Z Killing container with id docker://elasticsearch:Need to kill Pod test-es-keystore-2drl-es-masterdata-1
2019-12-11T01:11:33Z Successfully assigned e2e-77ioh-mercury/test-es-keystore-2drl-es-masterdata-1 to gke-eck-68-119-e2e-default-pool-7e724caa-8012
2019-12-11T01:12:17Z Killing container with id docker://elasticsearch:Need to kill Pod test-es-keystore-2drl-es-masterdata-0
2019-12-11T01:12:18Z Successfully assigned e2e-77ioh-mercury/test-es-keystore-2drl-es-masterdata-0 to gke-eck-68-119-e2e-default-pool-2ee4afc7-dkq1 |
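A tiny helper (illustrative only) to turn the event timestamps quoted above into the kill-to-reschedule delay being compared across builds:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Timestamps copied from the events above: container kill/stop vs. Pod reassignment.
	cases := []struct {
		build            string
		killed, assigned string
	}{
		{"versions-gke/179/1", "2020-01-09T01:43:46Z", "2020-01-09T01:45:50Z"},
		{"versions-gke/179/2", "2020-01-09T01:25:15Z", "2020-01-09T01:27:59Z"},
		{"stack/146/3", "2020-01-09T01:27:07Z", "2020-01-09T01:29:12Z"},
		{"versions-gke/119/1 (passing)", "2019-12-11T01:12:17Z", "2019-12-11T01:12:18Z"},
	}
	for _, c := range cases {
		killed, _ := time.Parse(time.RFC3339, c.killed)
		assigned, _ := time.Parse(time.RFC3339, c.assigned)
		fmt.Printf("%s: kill -> reassignment took %s\n", c.build, assigned.Sub(killed))
	}
}
```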
Just did a quick test while looking at the state of the Docker container on the host. The container is stopped and deleted at 12:18:34:
But the Pod is still in the Terminating state until 12:20:40:
Edit: here are the Kubelet logs (most recent messages first):
|
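For anyone who wants to reproduce this measurement from the API-server side, here is a sketch that polls the Pod until the object is gone and reports how long it stayed Terminating; the namespace and Pod name are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)
	ns, name := "e2e-mercury", "test-es-keystore-xxxx-es-masterdata-0" // placeholders

	start := time.Now()
	for {
		pod, err := clientset.CoreV1().Pods(ns).Get(context.Background(), name, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			fmt.Println("Pod object gone after", time.Since(start))
			return
		}
		if err != nil {
			panic(err)
		}
		if pod.DeletionTimestamp != nil {
			fmt.Println("still Terminating, deletionTimestamp:", pod.DeletionTimestamp.Time)
		}
		time.Sleep(5 * time.Second)
	}
}
```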
Reopening. This is flaky again :(
|
I'm looking at https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-stack-versions/38/. In the tests, we remove one of the two referenced secure settings secrets, then expect a rolling upgrade to happen.
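Roughly, that test step amounts to a spec mutation like the sketch below. I'm assuming the ECK v1 Go types here (esv1.Elasticsearch with Spec.SecureSettings as a list of commonv1.SecretSource); the import paths, type names, and secret names are unverified assumptions, not copied from the repo:

```go
package main

import (
	"fmt"

	commonv1 "github.com/elastic/cloud-on-k8s/pkg/apis/common/v1"
	esv1 "github.com/elastic/cloud-on-k8s/pkg/apis/elasticsearch/v1"
)

func main() {
	var es esv1.Elasticsearch
	// Initial state of the test: two secure settings Secrets referenced (names are placeholders).
	es.Spec.SecureSettings = []commonv1.SecretSource{
		{SecretName: "test-secure-settings-1"},
		{SecretName: "test-secure-settings-2"},
	}
	// Test step: drop one reference. The operator must rebuild the ES keystore,
	// which is expected to trigger a rolling upgrade of the Pods.
	es.Spec.SecureSettings = es.Spec.SecureSettings[:1]
	fmt.Println("remaining secure settings sources:", len(es.Spec.SecureSettings))
}
```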
Elasticsearch logs report a problem with TLS certificates, preventing the node from joining the cluster:
Logs of other ES nodes seem to indicate something wrong with the TLS SANs:
|
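A quick way to verify the SAN hypothesis is to dial the node and dump the SANs of the certificate it actually presents; a minimal sketch, where the service address is a placeholder:

```go
package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	// Placeholder address; point this at the ES HTTP service of the affected cluster.
	conn, err := tls.Dial("tcp", "test-es-keystore-es-http.default.svc:9200", &tls.Config{
		InsecureSkipVerify: true, // we only want to inspect the cert, not trust it
	})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	cert := conn.ConnectionState().PeerCertificates[0]
	fmt.Println("subject:", cert.Subject)
	fmt.Println("DNS SANs:", cert.DNSNames)
	fmt.Println("IP SANs:", cert.IPAddresses)
}
```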
Other E2E tests fail for the same reason; I think all rolling upgrade tests are impacted. I'm opening a dedicated issue: #2823. |
#2831 should fix this. |
TestUpdateESSecureSettings has failed several times on the last release candidate for 1.0.0 (rc5):