[test-operator] Fix handling of failed job #1112

Conversation

lpiwowar
Contributor

@lpiwowar lpiwowar commented Feb 6, 2024

[test-operator] Fix handling of failed job

There are two issues in the way the role handles a failed job:

  • Currently, the role waits for the Job that starts the execution of
    the test pod to reach the Completed state, but this state is only
    reached when the test execution was successful. If there is an
    error, the job finishes in the Failed state.

  • Logs are not collected when the test pod times out.

This patch updates the role so that it correctly waits for job
completion (either success or failure) and ensures that we collect
the logs (the output of oc logs) even when the test pod times out.
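
A minimal sketch of the intended wait-and-collect flow, expressed with plain oc commands (the openstack namespace, the tempest-tests job name, and the output file are made up for illustration; the actual change lives in the role's Ansible tasks):

# Wait until the Job reports either the Complete or the Failed condition,
# instead of waiting for Complete only.
until oc -n openstack get job/tempest-tests \
      -o jsonpath='{.status.conditions[?(@.status=="True")].type}' \
      | grep -qE 'Complete|Failed'; do
    sleep 10
done

# Collect the pod logs regardless of how the job ended.
oc -n openstack logs job/tempest-tests > tempest-job.log || true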

As the pull request owner and reviewers, we checked that:

  • Appropriate testing is done and actually running
  • Appropriate documentation exists and/or is up-to-date:
    • README in the role
    • Content of docs/source reflects the changes

@lpiwowar lpiwowar changed the title [test-operator] Fix waiting for failed job [test-operator] Fix handling of failed job Feb 7, 2024

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/4b8bce35c3c54e6d8b5533f485af5dac

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 25m 51s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 09m 06s
✔️ noop SUCCESS in 0s
cifmw-molecule-test_operator FAILURE in 3m 47s

Contributor

@kopecmartin kopecmartin left a comment


lgtm


- name: Start test-operator-logs-pod
  when:
    - not cifmw_test_operator_dry_run | bool
Contributor


would this mean there will be no logs collected in case it hung/timed out?

just wild guessing, but if it's maybe due to access to persistent storage or such, could there be some read-only parallel access, or maybe force-terminate the execution on timeout to get access to the logs?

Contributor Author


I wanted to skip these because there are no logs to be collected in case of a timeout [1]. The logs are moved to the directory where the PV is mounted at the end of the test execution.

However, it is possible that there might be an issue with parallel access to the logs.

[1] https://github.com/openstack-k8s-operators/tcib/blob/106264ab0a0b18be278d69b0664dad6a4f9be29a/container-images/tcib/base/os/tempest/run_tempest.sh#L265

Contributor


that is not correct

the "generate logs" step is post-processing the tempest logs.

the tempest logs are stored in a binary format by stestr while tempest is running, in parallel with the execution of the tests

this https://github.com/openstack-k8s-operators/tcib/blob/106264ab0a0b18be278d69b0664dad6a4f9be29a/container-images/tcib/base/os/tempest/run_tempest.sh#L226C5-L226C66

stestr last --subunit > ${TEMPEST_PATH}testrepository.subunit

is just copying the last log to a new location.

however what we should be doing is making sure that the default .stestr folder
is bind-mounted to a log volume so that the logs are written directly to a PVC and we can post-process them even when a timeout happens.

alternatively we should be passing the timeout to the tempest run invocation:

timeout ${TEMPEST_TIMEOUT} tempest run ${TEMPEST_ARGS}

https://github.com/openstack-k8s-operators/tcib/blob/106264ab0a0b18be278d69b0664dad6a4f9be29a/container-images/tcib/base/os/tempest/run_tempest.sh#L215

then the job will finish in a roughly fixed time and it will generate the logs properly, so we don't need to wait for a fixed period of time here in this script.
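
A rough sketch of that second option (it reuses the TEMPEST_TIMEOUT, TEMPEST_ARGS, and TEMPEST_PATH variables from the snippets above, with a made-up default; this is not the actual run_tempest.sh change):

# bound the tempest run itself so the post-processing steps still execute
# even when the tests hang; GNU timeout exits with status 124 on timeout
TEMPEST_TIMEOUT=${TEMPEST_TIMEOUT:-7200}
timeout "${TEMPEST_TIMEOUT}" tempest run ${TEMPEST_ARGS}
rc=$?
if [ "${rc}" -eq 124 ]; then
    echo "tempest run exceeded ${TEMPEST_TIMEOUT}s and was terminated"
fi

# the subunit export and log generation then run unconditionally and the
# results land on the mounted volume
stestr last --subunit > "${TEMPEST_PATH}testrepository.subunit" || true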

Contributor


making the run_tempest.sh script enforce the timeout when invoking tempest is obviously the simpler option

@lpiwowar lpiwowar requested review from arxcruz and queria and removed request for dasm and rachael-george February 7, 2024 12:59
Contributor

@marios marios left a comment


as discussed yesterday, this won't affect any currently running job until we wire it up, so let's merge to unblock work here

@SeanMooney
Contributor

I used Depends-On to include it in openstack-k8s-operators/nova-operator#665
and it seemed to function fine. That last run did not time out, but the logs are present:
https://logserver.rdoproject.org/65/665/8a814593c6acbbf86b6ad3022bb565bb23d73247/github-check/nova-operator-tempest-multinode/ac65ef4/controller/ci-framework-data/tests/test_operator/stestr_results.html
so this is at least not breaking the existing behavior

Contributor

openshift-ci bot commented Feb 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 1d97e47 into openstack-k8s-operators:main Feb 8, 2024
7 checks passed