tests/e2e: Libvirt Env tests are unstable #1831

Open
stevenhorsman opened this issue May 3, 2024 · 8 comments

Comments

@stevenhorsman
Member

stevenhorsman commented May 3, 2024

We see occasional failures (anecdotally <20% of the time) on the libvirt nightly CI. So far they have always passed on re-run, but we've now seen one on a PR test as well, so it's becoming more of an obstacle and we should investigate it when we get the chance.

=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly/EnvVariablePeerPodWithImageOnly_test
    assessment_runner.go:262: timed out waiting for the condition
--- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly (600.10s)
    --- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly/EnvVariablePeerPodWithImageOnly_test (600.10s)
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment/EnvVariablePeerPodWithBoth_test
    assessment_runner.go:262: timed out waiting for the condition
--- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment (600.04s)
    --- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment/EnvVariablePeerPodWithBoth_test (600.04s)
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly
=== RUN   TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly/EnvVariablePeerPodWithDeploymentOnly_test
    assessment_runner.go:262: timed out waiting for the condition
--- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly (600.06s)
    --- FAIL: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly/EnvVariablePeerPodWithDeploymentOnly_test (600.06s)
=== RUN   TestLibvirtCreatePeerPodAndCheckWorkDirLogs
=== RUN   TestLibvirtCreatePeerPodAndCheckWorkDirLogs/WorkDirPeerPod_test
    assessment_runner.go:262: timed out waiting for the condition
--- FAIL: TestLibvirtCreatePeerPodAndCheckWorkDirLogs (600.16s)
    --- FAIL: TestLibvirtCreatePeerPodAndCheckWorkDirLogs/WorkDirPeerPod_test (600.16s)
@stevenhorsman
Member Author

This is getting worse and we are hitting it multiple times on each PR now. I've tried running this test locally and in about 8 re-runs it worked every time, so I'm not sure of the cause of the failure. In the short term I think we need to skip it in the CI to stop it blocking PRs.
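
A minimal sketch of one way to skip a test only in CI while still running it locally, where it passes. This is not taken from the commits referencing this issue, which may do it differently: the `skipper` interface, the `unstableTests` registry, `skipIfUnstable` and `recorder` are all hypothetical names, and the check on the `CI` environment variable assumes a CI system that sets it (GitHub Actions does).

```go
package main

import (
	"fmt"
	"os"
)

// skipper is the one method of *testing.T this sketch relies on; declaring it
// as an interface lets the helper run outside `go test` too.
type skipper interface {
	Skipf(format string, args ...any)
}

// unstableTests is a hypothetical registry: test name -> tracking issue.
var unstableTests = map[string]string{
	"TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly": "#1831",
}

// skipIfUnstable skips a registered test only when inCI is true, so local
// runs still exercise it. It reports whether the test was skipped.
func skipIfUnstable(t skipper, name string, inCI bool) bool {
	issue, listed := unstableTests[name]
	if !listed || !inCI {
		return false
	}
	t.Skipf("%s is unstable in CI, see %s", name, issue)
	return true
}

// recorder is a stand-in for *testing.T that just stores skip messages.
type recorder struct{ messages []string }

func (r *recorder) Skipf(format string, args ...any) {
	r.messages = append(r.messages, fmt.Sprintf(format, args...))
}

func main() {
	// Assumption: the CI system exports a non-empty CI variable.
	inCI := os.Getenv("CI") != ""
	r := &recorder{}
	name := "TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly"
	fmt.Println("skipped:", skipIfUnstable(r, name, inCI), r.messages)
}
```

In a real test the call would be something like `skipIfUnstable(t, t.Name(), os.Getenv("CI") != "")` as the first statement, since `*testing.T` already provides `Skipf`.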

stevenhorsman added a commit to stevenhorsman/cloud-api-adaptor that referenced this issue May 9, 2024
The TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly
test is failing semi-regularly on the CI, but seems to run okay
locally, so skip it until we have a chance to debug.
See confidential-containers#1831

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
stevenhorsman added a commit to stevenhorsman/cloud-api-adaptor that referenced this issue May 9, 2024
The TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly
and TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment
tests are failing semi-regularly on the CI, but seems to run okay
locally, so skip it until we have a chance to debug.
See confidential-containers#1831

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
wainersm pushed a commit that referenced this issue May 14, 2024
The TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly
and TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment
tests are failing semi-regularly on the CI, but seems to run okay
locally, so skip it until we have a chance to debug.
See #1831

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
beraldoleal pushed a commit to beraldoleal/cloud-api-adaptor that referenced this issue May 27, 2024
The TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly
and TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment
tests are failing semi-regularly on the CI, but seems to run okay
locally, so skip it until we have a chance to debug.
See confidential-containers#1831

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
@stevenhorsman
Member Author

It is possible that this is related to the image-pull changes, as Chengyu is touching the config merge code in kata-containers/kata-containers#9695, so once that is merged we should try re-testing this.

stevenhorsman changed the title from "tests/e2e: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly test unstable" to "tests/e2e: Libvirt Env tests are unstable" on Jun 12, 2024
@stevenhorsman
Member Author

Hmm, this is suspicious: now that the env-related e2e tests are skipped, I've seen

=== RUN   TestLibvirtCreatePeerPodAndCheckWorkDirLogs
=== RUN   TestLibvirtCreatePeerPodAndCheckWorkDirLogs/WorkDirPeerPod_test
    assessment_runner.go:262: timed out waiting for the condition
--- FAIL: TestLibvirtCreatePeerPodAndCheckWorkDirLogs (600.16s)
    --- FAIL: TestLibvirtCreatePeerPodAndCheckWorkDirLogs/WorkDirPeerPod_test (600.16s)

start failing, so maybe it's related to something before now being cleaned up, or the workdir has the same issue?

@stevenhorsman
Member Author

> start failing, so maybe it's related to something before now being cleaned up, or the workdir has the same issue?

This has failed the last three nightlies, so I will raise a PR to skip this for now

stevenhorsman added a commit to stevenhorsman/cloud-api-adaptor that referenced this issue Jul 10, 2024
The TestLibvirtCreatePeerPodAndCheckWorkDirLogs test
has failed on a few PRs and the last three nightly test runs,
so skip it until we have a chance to debug.
See confidential-containers#1831

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
@wainersm
Member

@stevenhorsman yesterday I ran TestLibvirtCreatePeerPodAndCheckWorkDirLogs a couple of times locally with the hope of reproducing the error but it always passed!

Then I started working on a golang equivalent of kubectl describe so we could print more info on CI, but ran out of time...

@stevenhorsman
Member Author

> @stevenhorsman yesterday I ran TestLibvirtCreatePeerPodAndCheckWorkDirLogs a couple of times locally with the hope of reproducing the error but it always passed!

Yeah - I have this experience with the other tests too. My hope is that a new version of the kata-agent and image-rs might have addressed some of these, so I will re-test after they've been bumped

stevenhorsman added a commit that referenced this issue Jul 11, 2024
The TestLibvirtCreatePeerPodAndCheckWorkDirLogs test
has failed on a few PRs and the last three nightly test runs,
so skip it until we have a chance to debug.
See #1831

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
@stevenhorsman
Member Author

In #2183 I've tried re-enabling all the tests and it seems that only

  • TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageAndDeployment
  • TestLibvirtCreatePeerPodWithLargeImage

are still failing, but we'll need to keep a close eye on the stability of the rest if/when it gets merged.

@stevenhorsman
Member Author

stevenhorsman commented Dec 13, 2024

Tracking of unstable tests:

  • 12/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly failed on the crio, packer runs
  • 13/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly, TestLibvirtCreatePeerPodAndCheckWorkDirLogs and TestLibvirtCreatePeerPodWithJob failed on the crio, packer test runs across various repeats
  • 17/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly failed on the containerd, packer run
  • 18/12/24 nightly: TestLibvirtCreatePeerPodWithJob failed on one of the crio, packer runs
    • On the crio re-run TestLibvirtCreatePeerPodAndCheckWorkDirLogs failed
  • 19/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly failed on the containerd, packer test, TestLibvirtCreatePeerPodWithJob failed on the crio, packer test run.
  • 20/12/24 nightly: TestLibvirtCreatePeerPodWithJob failed on the crio, packer test run.
  • 21/12/24 nightly: TestLibvirtCreatePeerPodAndCheckWorkDirLogs failed on the containerd, packer run, TestLibvirtCreatePeerPodWithJob failed on the crio, packer runs
  • 22/12/24 nightly: containerd tests all passed, TestLibvirtCreatePeerPodAndCheckWorkDirLogs and TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly failed on the crio, packer run
  • 23/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly failed on the containerd, x86, mkosi test, no packer test failures
  • 24/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly failed on a crio, packer test
  • 25/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly failed on a crio, packer test
  • 26/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly failed on a crio, packer test
  • 27/12/24 nightly: TestLibvirtCreatePeerPodAndCheckWorkDirLogs failed on a containerd, packer test, TestLibvirtKbsKeyRelease failed on a crio, packer test
  • 28/12/24 nightly: TestLibvirtCreatePeerPodAndCheckWorkDirLogs failed on a containerd, packer test
  • 29/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly failed on a containerd, packer test, TestLibvirtCreatePeerPodWithJob failed on a crio, packer test
  • 30/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly failed on a containerd, packer test, TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly failed on a crio, packer test
  • 31/12/24 nightly: TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly failed on a containerd, mkosi test, TestLibvirtCreatePeerPodAndCheckWorkDirLogs failed on a crio, packer test
  • 1/1/25 nightly: TestLibvirtCreatePeerPodAndCheckWorkDirLogs failed in a containerd, packer test, and a crio, packer test

stevenhorsman added a commit to stevenhorsman/cloud-api-adaptor that referenced this issue Jan 3, 2025
Based on the test analysis done for 18 days of nightly tests in
confidential-containers#1831 (comment)
the only containerd test failures we saw were:
- `TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly` - three times
- `TestLibvirtCreatePeerPodAndCheckWorkDirLogs` - four times
- `TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly` - twice

Although the chances of failure for each of these tests is < 25%, we want to reduce
the re-runs required, so if we skip these we should have more stable CI tests.
It should also be noted that most of the failures were seen on the packer built images.
This is probably just chance, but might indicate that the peer pod boot speed is related
and we should re-evaluate again once we can remove the packer podvm images.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
mkulke pushed a commit that referenced this issue Jan 7, 2025
Based on the test analysis done for 18 days of nightly tests in
#1831 (comment)
the only containerd test failures we saw were:
- `TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly` - three times
- `TestLibvirtCreatePeerPodAndCheckWorkDirLogs` - four times
- `TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly` - twice

Although the chances of failure for each of these tests is < 25%, we want to reduce
the re-runs required, so if we skip these we should have more stable CI tests.
It should also be noted that most of the failures were seen on the packer built images.
This is probably just chance, but might indicate that the peer pod boot speed is related
and we should re-evaluate again once we can remove the packer podvm images.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
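
The "< 25%" figure in the commit message above can be checked with a little arithmetic. The sketch below takes the counts as stated there (three, four and two failures over 18 nightlies) and treats each nightly as one trial per test. The combined figure also assumes the three tests fail independently, which is an assumption of this sketch and not something the thread establishes. `failureRate` and `anyFailure` are hypothetical helper names.

```go
package main

import "fmt"

// failureRate returns failures/runs as a fraction between 0 and 1.
func failureRate(failures, runs int) float64 {
	return float64(failures) / float64(runs)
}

// anyFailure returns the probability that at least one test fails in a run,
// ASSUMING the tests fail independently of each other.
func anyFailure(rates []float64) float64 {
	allPass := 1.0
	for _, r := range rates {
		allPass *= 1 - r
	}
	return 1 - allPass
}

func main() {
	const nightlies = 18 // from the commit message
	tests := []struct {
		name     string
		failures int // containerd failure counts as stated in the commit message
	}{
		{"TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithImageOnly", 3},
		{"TestLibvirtCreatePeerPodAndCheckWorkDirLogs", 4},
		{"TestLibvirtCreatePeerPodAndCheckEnvVariableLogsWithDeploymentOnly", 2},
	}
	rates := make([]float64, 0, len(tests))
	for _, t := range tests {
		r := failureRate(t.failures, nightlies)
		rates = append(rates, r)
		fmt.Printf("%s: %.1f%%\n", t.name, 100*r)
	}
	fmt.Printf("at least one of the three fails: %.1f%%\n", 100*anyFailure(rates))
}
```

Each individual rate is below 25% (the highest is 4/18, about 22%), but under the independence assumption at least one of the three fails in roughly 42% of runs, which is consistent with the motivation of reducing the number of re-runs.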