[test-operator] Fix handling of failed job #1112

Conversation

lpiwowar
Contributor

@lpiwowar lpiwowar commented Feb 6, 2024

[test-operator] Fix handling of failed job

There are two issues in the way the role handles a failed job:

  • Currently, the role waits for the Job that starts the execution of
    the test pod to reach the Completed state, but this state is only
    reached when the test execution was successful. If there is an
    error, the job finishes in the Failed state.

  • Logs are not collected when the test pod times out.

This patch updates the role so that it correctly waits for job
completion (either success or failure) and ensures that we collect
the logs (the output of oc logs) even when the test pod times out.
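
A minimal sketch of the intended wait-and-collect flow, expressed with plain oc commands (the openstack namespace, the tempest-tests job name, and the output file are made up for illustration; the actual change lives in the role's Ansible tasks):

# Wait until the Job reports either the Complete or the Failed condition,
# instead of waiting for Complete only.
until oc -n openstack get job/tempest-tests \
      -o jsonpath='{.status.conditions[?(@.status=="True")].type}' \
      | grep -qE 'Complete|Failed'; do
    sleep 10
done

# Collect the pod logs regardless of how the job ended.
oc -n openstack logs job/tempest-tests > tempest-job.log || true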

As the pull request owner and reviewers, we checked that:

  • Appropriate testing is done and actually running
  • Appropriate documentation exists and/or is up-to-date:
    • README in the role
    • Content of docs/source reflects the changes

@lpiwowar lpiwowar changed the title [test-operator] Fix waiting for failed job [test-operator] Fix handling of failed job Feb 7, 2024

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/4b8bce35c3c54e6d8b5533f485af5dac

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 25m 51s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 09m 06s
✔️ noop SUCCESS in 0s
cifmw-molecule-test_operator FAILURE in 3m 47s

Contributor

@kopecmartin kopecmartin left a comment


lgtm


- name: Start test-operator-logs-pod
  when:
    - not cifmw_test_operator_dry_run | bool
Contributor


would this mean there will be no logs collected in case it hung/timed out?

just wild guessing, but if it's maybe due to access to persistent storage or such, could there be some read-only parallel access, or maybe force-terminate the execution on timeout to get access to the logs?

Contributor Author


I wanted to skip these because there are no logs to be collected in case of a timeout [1]. The logs are moved to the directory where the PV is mounted at the end of the test execution.

However, it is possible that there might be an issue with parallel access to the logs.

[1] https://github.com/openstack-k8s-operators/tcib/blob/106264ab0a0b18be278d69b0664dad6a4f9be29a/container-images/tcib/base/os/tempest/run_tempest.sh#L265

Contributor


that is not correct

the "generate logs" step is post-processing the tempest logs.

the tempest logs are stored in a binary format by stestr while tempest is running, in parallel with the execution of the tests

this https://github.com/openstack-k8s-operators/tcib/blob/106264ab0a0b18be278d69b0664dad6a4f9be29a/container-images/tcib/base/os/tempest/run_tempest.sh#L226C5-L226C66

stestr last --subunit > ${TEMPEST_PATH}testrepository.subunit

is just copying the last log to a new location.

however what we should be doing is making sure that the default .stestr folder
is bind-mounted to a log volume so that the logs are written directly to a PVC and we can post-process them even when a timeout happens.

alternatively we should be passing the timeout to the tempest run invocation:

timeout ${TEMPEST_TIMEOUT} tempest run ${TEMPEST_ARGS}

https://github.com/openstack-k8s-operators/tcib/blob/106264ab0a0b18be278d69b0664dad6a4f9be29a/container-images/tcib/base/os/tempest/run_tempest.sh#L215

then the job will finish in a roughly fixed time and it will generate the logs properly, so we don't need to wait for a fixed period of time here in this script.
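
A rough sketch of that second option (it reuses the TEMPEST_TIMEOUT, TEMPEST_ARGS, and TEMPEST_PATH variables from the snippets above, with a made-up default; this is not the actual run_tempest.sh change):

# bound the tempest run itself so the post-processing steps still execute
# even when the tests hang; GNU timeout exits with status 124 on timeout
TEMPEST_TIMEOUT=${TEMPEST_TIMEOUT:-7200}
timeout "${TEMPEST_TIMEOUT}" tempest run ${TEMPEST_ARGS}
rc=$?
if [ "${rc}" -eq 124 ]; then
    echo "tempest run exceeded ${TEMPEST_TIMEOUT}s and was terminated"
fi

# the subunit export and log generation then run unconditionally and the
# results land on the mounted volume
stestr last --subunit > "${TEMPEST_PATH}testrepository.subunit" || true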

Contributor


making the run_tempest.sh script enforce the timeout when invoking tempest is obviously the simpler option

@lpiwowar lpiwowar requested review from arxcruz and queria and removed request for dasm and rachael-george February 7, 2024 12:59
Contributor

@marios marios left a comment


as discussed yesterday, this won't affect any currently running job until we wire it up, so let's merge to unblock work here

@SeanMooney
Contributor

I used Depends-On to include it in openstack-k8s-operators/nova-operator#665
and it seemed to function fine. That last run did not time out, but the logs are present:
https://logserver.rdoproject.org/65/665/8a814593c6acbbf86b6ad3022bb565bb23d73247/github-check/nova-operator-tempest-multinode/ac65ef4/controller/ci-framework-data/tests/test_operator/stestr_results.html
so this is at least not breaking the existing behavior

Contributor

openshift-ci bot commented Feb 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 1d97e47 into openstack-k8s-operators:main Feb 8, 2024
7 checks passed