[zuul] Update the tempest job to use test_operator #665

Conversation

@elfiesmelfie (Contributor) commented Feb 2, 2024

Previously, the tempest role from ci_framework ran the tempest container via podman on the controller. test_operator uses the same image but runs tempest in a pod on the OCP cluster.

Depends-On: openstack-k8s-operators/openstack-operator#659
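
For orientation, a rough sketch of the kind of resource the test_operator consumes to run tempest in a pod; the field names, image, and values below are illustrative assumptions, not this job's actual configuration:

```yaml
# Hypothetical sketch only: a Tempest custom resource of the sort the
# test_operator reconciles into a tempest pod on the OCP cluster.
# Field names, the image, and the values are assumptions for illustration.
apiVersion: test.openstack.org/v1beta1
kind: Tempest
metadata:
  name: tempest-tests
  namespace: openstack
spec:
  containerImage: quay.io/podified-antelope-centos9/openstack-tempest:current-podified
  tempestRun:
    includeList: |
      tempest.api.compute
```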


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/3761837e16fb4d7fa8587259956ddd11

✔️ nova-operator-content-provider SUCCESS in 55m 56s
✔️ nova-operator-kuttl SUCCESS in 36m 48s
nova-operator-tempest-multinode RETRY_LIMIT in 4s

@gibizer (Contributor) commented Feb 2, 2024

recheck

@SeanMooney (Contributor) commented:

something broke in zuul but the job actually passed.

https://logserver.rdoproject.org/65/665/796bce2a61df96939f273e23905f29684659f55f/github-check/nova-operator-tempest-multinode/39fd880/controller/ci-framework-data/tests/test_operator/stestr_results.html

however the allowed and excluded lists are not propagating, so it only ran the default set

@elfiesmelfie (Contributor, Author) commented:

> however the allowed and excluded lists are not propagating, so it only ran the default set

That's my bad; I forgot to update the names of the vars used to pass in the include_list and exclude_list. The update is pushed.

@elfiesmelfie (Contributor, Author) commented:

recheck

The Depends-On link was not correct.

Previously, the tempest role from ci_framework ran the tempest container
via podman on the controller.
test_operator uses the same image but runs tempest in a pod on the OCP
cluster.

Depends-On: openstack-k8s-operators/ci-framework#1065
cifmw_tempest_tests_allowed -> cifmw_test_operator_tempest_include_list
cifmw_tempest_tests_skipped -> cifmw_test_operator_tempest_exclude_list
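
As a minimal sketch of how the renamed variables could be wired into the job; the .zuul.yaml layout and the example test patterns here are assumptions, only the variable names come from this change:

```yaml
# Illustrative only: the variable names are the renamed ones above;
# the surrounding job layout and the test patterns are assumptions.
- job:
    name: nova-operator-tempest-multinode
    vars:
      cifmw_test_operator_tempest_include_list: |
        tempest.api.compute
      cifmw_test_operator_tempest_exclude_list: |
        tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_mtu_sized_frames
```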

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/947d0966917a461d91da7308ba39f656

✔️ nova-operator-content-provider SUCCESS in 2h 32m 43s
✔️ nova-operator-kuttl SUCCESS in 36m 38s
nova-operator-tempest-multinode FAILURE in 2h 14m 20s

@SeanMooney (Contributor) commented:

rdo-check

@SeanMooney (Contributor) commented:

check-github

@SeanMooney (Contributor) commented:

recheck

There are a lot of rabbitmq and mysql connection errors in the failed tempest run which are unrelated to how we ran tempest, so that could explain many of the test failures.

Review thread on .zuul.yaml (resolved)

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/f27a2eb76f4447368ea23a662d4b34d8

✔️ nova-operator-content-provider SUCCESS in 2h 34m 41s
✔️ nova-operator-kuttl SUCCESS in 43m 41s
nova-operator-tempest-multinode FAILURE in 2h 16m 37s

@gibizer (Contributor) commented Feb 5, 2024

Tempest logs are here now: https://review.rdoproject.org/zuul/build/b6577dec951b4bb3a7f746d47b5fae85/log/controller/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/namespaces/openstack/pods/tempest-tests-6ksc4/logs/tempest-tests-tests-runner.log

The execution was interrupted by the job timeout. But I see a lot of failures in the executed tests that make the execution slower, e.g.:

{7} tempest.api.compute.security_groups.test_security_group_rules_negative.SecurityGroupRulesNegativeTestJSON.test_create_security_group_rule_with_invalid_port_range [67.534572s] ... FAILED
{6} tempest.api.compute.servers.test_delete_server.DeleteServersTestJSON.test_delete_server_while_in_building_state [98.352015s] ... FAILED

I think the main reason is too much parallelism. In the current podman-based executor we run tempest in 4 processes
https://logserver.rdoproject.org/66/666/d9aa648cb7613bb124fd9d878fe97ab00d77664d/github-check/nova-operator-tempest-multinode/12ca46c/controller/ci-framework-data/tests/tempest/podman_tempest.log
but it seems the test operator uses 8.

@SeanMooney (Contributor) commented:

ya we likely need to drop it down to 3-4

I'll check with Emma in the morning and see if she wants us to take over this patch.

one downside to the test-operator is it only copies the full tempest logs to the logs folder if the tempest execution does not time out.

Where it runs properly it provides the HTML report and the raw tempest logs in a separate tests log dir, but where waiting for the job to complete fails it does not collect those logs properly.

@lpiwowar commented Feb 6, 2024

> one downside to the test-operator is it only copies the full tempest logs to the logs folder if the tempest execution does not time out.

I'm going to take a look at this ^^.

Also, about the concurrency: it would probably be good to drop it down to 4. I'm going to propose a patch; we've encountered similar issues related to it.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/075656d74dcc4638b0ce17d392ef0008

✔️ nova-operator-content-provider SUCCESS in 3h 27m 06s
✔️ nova-operator-kuttl SUCCESS in 36m 38s
nova-operator-tempest-multinode FAILURE in 3h 09m 10s

@SeanMooney (Contributor) commented:

rdo-check

@SeanMooney (Contributor) commented:

check-github

@SeanMooney (Contributor) commented:

check-rdo is what I wanted

@SeanMooney (Contributor) commented:

Looking at the failed tempest tests, the 500 errors from keystone correspond to mysql exceptions caused by the galera cluster not being writeable.

Looking at the mysql pod events we can see it was restarted at least 4 times.

Warning Unhealthy 152m kubelet Startup probe failed: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (111)
Normal Pulled 151m (x4 over 152m) kubelet Container image "quay.io/podified-antelope-centos9/openstack-mariadb@sha256:095e75c0a028bf5ba83af90882ce1836e00fc198038f776ee1104f6b1232da93" already present on machine
Normal Started 151m (x4 over 152m) kubelet Started container galera
Normal Created 151m (x4 over 152m) kubelet Created container galera
Warning BackOff 151m (x8 over 152m) kubelet Back-off restarting failed container
Warning Unhealthy 57s (x4 over 102m) kubelet Liveness probe failed: command timed out
Warning Unhealthy 57s (x4 over 3m38s) kubelet Readiness probe failed: command timed out

So I don't think the deployed db was stable and functional when tempest ran.

@SeanMooney (Contributor) commented:

I can see in the pod logs that nova and keystone were unable to connect to the db for a protracted period of time. I believe we have a tracker for a similar issue in other ci jobs, so I think the test failures are directly a result of the db being inaccessible.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/4d723d57417b42c583616b847a0e40c2

✔️ nova-operator-content-provider SUCCESS in 2h 14m 39s
nova-operator-kuttl RETRY_LIMIT in 50m 53s
nova-operator-tempest-multinode FAILURE in 1h 56m 25s

@SeanMooney (Contributor) commented:

Much, much better: only one test failed this time, test_mtu_sized_frames, and I'm not sure we configure tempest to run that properly as we have to take into account the lower MTU in CI.

this change reduces the concurrency to 4
and extends the tempest timeout to 7200
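
A hedged sketch of what that change could look like in the job vars; the values come from the commit message above, but the exact variable names are assumptions and may differ in ci-framework:

```yaml
# Illustrative only: the variable names are assumed, not verified against
# ci-framework; the values (4 workers, 7200 s timeout) match the commit
# message above.
- job:
    name: nova-operator-tempest-multinode
    vars:
      cifmw_test_operator_concurrency: 4
      cifmw_test_operator_timeout: 7200
```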

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/008a6023b5b04df880b8ec49928d0404

✔️ nova-operator-content-provider SUCCESS in 2h 15m 21s
nova-operator-kuttl RETRY_LIMIT in 50m 53s
✔️ nova-operator-tempest-multinode SUCCESS in 1h 55m 01s

@gibizer (Contributor) commented Feb 8, 2024

These are the lost test cases with this PR:

473,480d472
< tempest.api.compute.volumes.test_volumes_negative.VolumesNegativeTest.test_create_volume_with_invalid_size
< tempest.api.compute.volumes.test_volumes_negative.VolumesNegativeTest.test_create_volume_without_passing_size
< tempest.api.compute.volumes.test_volumes_negative.VolumesNegativeTest.test_create_volume_with_size_zero
< tempest.api.compute.volumes.test_volumes_negative.VolumesNegativeTest.test_delete_invalid_volume_id
< tempest.api.compute.volumes.test_volumes_negative.VolumesNegativeTest.test_delete_volume_without_passing_volume_id
< tempest.api.compute.volumes.test_volumes_negative.VolumesNegativeTest.test_get_volume_without_passing_volume_id
< tempest.api.compute.volumes.test_volumes_negative.VolumesNegativeTest.test_volume_delete_nonexistent_volume_id
< tempest.api.compute.volumes.test_volumes_negative.VolumesNegativeTest.test_volume_get_nonexistent_volume_id
491d482
< tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_mtu_sized_frames

I'm OK with not running them here.

openshift-ci bot commented Feb 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elfiesmelfie, gibizer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Feb 8, 2024
@SeanMooney (Contributor) commented:

check-rdo

openshift-merge-bot bot merged commit 0610489 into openstack-k8s-operators:main Feb 9, 2024
7 checks passed