k8s versions e2e tests are failing since Oct 15th #2134
TL;DR: I did not find the root cause but I think I spotted something strange.

I picked one build that produced a dump and was not related to an endpoint not yet removed (#1927): https://devops-ci.elastic.co/blue/organizations/jenkins/cloud-on-k8s-versions-gke/detail/cloud-on-k8s-versions-gke/129/pipeline. The failure of this build happened in the TestUpdateESSecureSettings test.

The e2e test creates/updates/deletes some secure settings, which restarts a 3-node ES cluster. It seems that the last restart took too much time (the timeout is set to 5 min). I correlated the build logs with the e2e test steps, the k8s pod events and the ES logs to (try to) highlight what's going on for each restart.

Commands:

```sh
# Extract the e2e test steps (timestamp + test name) from the Jenkins build log
curl -s "https://devops-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/cloud-on-k8s-versions-gke/runs/129/nodes/33/log/?start=0" \
  | grep TestUpdateESSecureSettings \
  | sed "s/.*{/{/" \
  | jq -c '{"t":.Time,"a":.Test}' > e2e-test-steps.json

# Extract the events of the masterdata pods from the dump
cat e2e-m9cld/e2e-m9cld-mercury/events.json \
  | jq -c '.items[]
      | select(.involvedObject.name | contains("test-es-keystore-drf6-es-masterdata"))
      | {"t":.firstTimestamp,"m":.message,"n":.involvedObject.name,"z":.involvedObject.fieldPath}' \
  | sort > pods-events.json

# Extract the ES log lines mentioning usable_space or "started" from the dump
grep -hEr '(usable_space|"started")' e2e-m9cld \
  | jq -c '{"t":.timestamp,"m":.message,"n":.["node.name"]}' > es-logs.json

# Merge everything, sort by timestamp and mark the beginning of each restart
cat e2e-test-steps.json pods-events.json es-logs.json \
  | sort \
  | sed '/^.*keystore#/a # -- Restart' > all.json
```

Result:
We observe the abnormal duration of the third restart (>5 min) compared to the first two (2min24 and 2min27).
Why? Is it the time to schedule a pod? Is it related to the volume attachment?
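One way to dig further (a sketch against the same dump, not something I have run): keep only the `Scheduled` and `SuccessfulAttachVolume` events of the restarted pods and compare their timestamps around each restart. The field names mirror the commands above; the event reasons are the standard ones emitted by the scheduler and the attach/detach controller.

```sh
# Compare pod scheduling vs. volume attachment times for the restarted pods
cat e2e-m9cld/e2e-m9cld-mercury/events.json \
  | jq -c '.items[]
      | select(.involvedObject.name | contains("test-es-keystore-drf6-es-masterdata"))
      | select(.reason == "Scheduled" or .reason == "SuccessfulAttachVolume")
      | {"t":.firstTimestamp,"r":.reason,"n":.involvedObject.name}' \
  | sort
```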
I went through the last 14 builds. 10 failed because we broke the job config so nothing to see there :(. Summary of "interesting" failures:
- Build https://devops-ci.elastic.co/job/cloud-on-k8s-versions-gke/137/: FAIL: TestMutationHTTPToHTTPS/All_expected_Pods_should_eventually_be_ready (300.00s)
- Build https://devops-ci.elastic.co/job/cloud-on-k8s-versions-gke/136/: FAIL: TestMutationHTTPToHTTPS/All_expected_Pods_should_eventually_be_ready (300.00s)
- Builds https://devops-ci.elastic.co/job/cloud-on-k8s-versions-gke/135/ to 130: job config was broken.
- Build https://devops-ci.elastic.co/job/cloud-on-k8s-versions-gke/129/: FAIL: TestUpdateESSecureSettings/Elasticsearch_secure_settings_should_eventually_be_set_in_all_nodes_keystore#03 (300.00s)
- Build https://devops-ci.elastic.co/job/cloud-on-k8s-versions-gke/128/: FAIL: TestMutationHTTPToHTTPS/All_expected_Pods_should_eventually_be_ready (300.00s)
- Builds https://devops-ci.elastic.co/job/cloud-on-k8s-versions-gke/127/ to 124: already resolved issue that was related to a misconfiguration of the job for the E2E license test.
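For future triage, a rough sketch of how the last runs and their results could be listed without clicking through Jenkins, using the Blue Ocean REST API (the same API as the log endpoint above; the exact query parameters and whether authentication is needed are assumptions on my side):

```sh
# List the id, result and start time of the last 14 runs of the job
curl -s "https://devops-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/cloud-on-k8s-versions-gke/runs/?start=0&limit=14" \
  | jq -c '.[] | {"id": .id, "result": .result, "start": .startTime}'
```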
I'm closing this because the job has been successful multiple times since this issue was created.
https://devops-ci.elastic.co/view/cloud-on-k8s/job/cloud-on-k8s-versions-gke/