[test-operator] Fix handling of failed job #1112
Conversation
dad92e2 to fc0db1a Compare
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/4b8bce35c3c54e6d8b5533f485af5dac ✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 25m 51s
There are two issues in how the role handles a failed job:
- Currently, the role waits for the Job that starts the execution of the test pod to reach the Completed state, but this state is only reached when the test execution was successful. If there is an error, the job finishes in the Failed state.
- Logs are not collected when the test pod times out.
This patch updates the role so that it correctly waits for job completion (either with success or failure) and ensures that we collect the logs (the output of oc logs) even when the test pod times out.
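For illustration, a minimal shell sketch of the wait-for-either-outcome logic described above; the job name test-operator-job, the namespace openstack, and the polling parameters are assumptions, not the role's actual values:

# Minimal sketch, not the role's implementation: poll the Job until it
# reports either the Complete or the Failed condition.
for _ in $(seq 1 360); do
    status=$(oc get job test-operator-job -n openstack \
        -o jsonpath='{.status.conditions[?(@.status=="True")].type}')
    case "$status" in
        *Complete*|*Failed*) break ;;
    esac
    sleep 10
done
# Collect the pod output in both cases, so a Failed run is still debuggable.
oc logs job/test-operator-job -n openstack > test-operator-job.log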
fc0db1a to 7a8d4ae Compare
lgtm
- name: Start test-operator-logs-pod
  when:
    - not cifmw_test_operator_dry_run | bool
Would this mean there will be no logs collected in case it hangs or times out?
Just wild guessing, but if it's maybe due to access to persistent storage or such, could there be some read-only parallel access, or maybe force-terminate the execution on timeout to get access to the logs?
I wanted to skip these because there are no logs to be collected in case of a timeout [1]. The logs are moved to the directory where the PV is mounted at the end of the test execution.
However, it is possible that there might be an issue with parallel access to the logs.
That is not correct.
Generate logs is post-processing the tempest logs. The tempest logs are stored in a binary format by stestr while tempest runs the tests in parallel.
stestr last --subunit > ${TEMPEST_PATH}testrepository.subunit
is just copying the last log to a new location.
However, what we should be doing is making sure that the default .stestr folder is bind-mounted to a log volume, so that the logs are written directly to a PVC and we can post-process them even when a timeout happens.
Alternatively, we should be passing the timeout to the tempest run invocation:
timeout ${TEMPEST_TIMEOUT} tempest run ${TEMPEST_ARGS}
https://github.com/openstack-k8s-operators/tcib/blob/106264ab0a0b18be278d69b0664dad6a4f9be29a/container-images/tcib/base/os/tempest/run_tempest.sh#L215
Then the job will finish in a roughly fixed time and will generate the logs properly, so we don't need to wait for a fixed period of time here in this script.
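A hedged sketch of that second alternative, assuming the TEMPEST_TIMEOUT and TEMPEST_ARGS variables from run_tempest.sh; the exit-status handling is illustrative, not the script's actual code:

# Illustrative only: bound the tempest run itself so the job finishes in a
# roughly fixed time. timeout(1) kills the command once the limit is hit.
if ! timeout "${TEMPEST_TIMEOUT}" tempest run ${TEMPEST_ARGS}; then
    echo "tempest run failed or hit the ${TEMPEST_TIMEOUT} timeout" >&2
fi
# stestr keeps raw results under .stestr; exporting the last run still works
# after a timeout, so the logs can be post-processed as usual.
stestr last --subunit > "${TEMPEST_PATH}testrepository.subunit"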
making the run_tempest.sh script enforce the timeout when invoking tempest is obviously the simpler option
As discussed yesterday, this won't affect any currently running job until we wire it up, so let's merge to unblock work here.
I used Depends-On to include it in openstack-k8s-operators/nova-operator#665.
[APPROVALNOTIFIER] This PR is APPROVED. Approval requirements bypassed by manually added approval. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
1d97e47 into openstack-k8s-operators:main
[test-operator] Fix handling of failed job
There are two issues in how the role handles a failed job:
Currently, the role waits for the Job that starts the execution of
the test pod to reach the Completed state, but this state is only
reached when the test execution was successful. If there is an
error, the job finishes in the Failed state.
Logs are not collected when the test pod times out.
This patch updates the role so that it correctly waits for the job
completion (either with success or failure) and ensures that we collect
the logs (the output of oc logs) even when the test pod times out.
As a pull request owner and reviewer, I checked that: