new pasta flake: timeout #17598

Closed
edsantiago opened this issue Feb 21, 2023 · 16 comments
Labels
flakes (Flakes from Continuous Integration), locked - please file new issue/PR (assist humans wanting to comment on an old issue or PR with locked comments), pasta (pasta(1) bugs or features), rootless

Comments

@edsantiago (Member)

Seen twice today, on different PRs:

$ podman run --net=pasta -p [10.128.0.63]:5561:5561/tcp quay.io/libpod/testimage:20221018 sh -c for port in $(seq 5561 5561); do                              socat -u TCP4-LISTEN:${port},bind=[10.128.0.63] STDOUT &                          done; wait
timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
timeout: sending signal KILL to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman’
[ rc=137 (** EXPECTED 0 **) ]

In one case, only one subtest failed. In the other, subsequent tests also failed, and the entire test run timed out.
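
For anyone trying this outside CI, here is a rough sketch of what the failing test does, using the address and port from the log above; the host address, the 2-minute timeout, and socat being installed on the host are all assumptions to adjust locally:

    addr=10.128.0.63    # assumption: replace with an address on the host under test
    port=5561
    # Listener inside a pasta-networked container, mirroring the failing test.
    timeout 120 podman run --rm --net=pasta -p "[$addr]:$port:$port/tcp" \
        quay.io/libpod/testimage:20221018 \
        sh -c "socat -u TCP4-LISTEN:$port,bind=[$addr] STDOUT & wait" &
    sleep 2                                        # give the listener time to start
    printf x | socat -u STDIN "TCP4:$addr:$port"   # single byte, then close
    wait                                           # hangs (until the timeout) when the bug triggers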

edsantiago added the flakes, rootless, and pasta labels on Feb 21, 2023
@edsantiago (Member, Author)

And yet another one, the repeated-timeout scenario.

@edsantiago (Member, Author)

@sbrivio-rh PTAL

@sbrivio-rh (Collaborator)

@edsantiago is there a way to check what version of the passt package is installed here?

I released a new version (packages available in testing for Fedora 37, "stable" for Fedora 38, currently unstable for Debian and Ubuntu) fixing occasional TCP stalls on transfers with small receive buffers, which might be the case here.
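
For reference, a quick way to check on a given VM (generic commands, not tied to the test harness; whether this particular build answers --version is an assumption):

    rpm -q passt       # e.g. passt-0^20230216.g4663ccc-1.fc37.x86_64
    passt --version    # version string from the binary itself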

@edsantiago (Member, Author)

Yes! At the top of each test log is a dump of important versions. Here I see passt-0^20230216.g4663ccc-1.fc37-x86_64 in the many-timeout failures, and 0^20221116.gace074c-1.fc37-x86_64 in the one-timeout log.

@edsantiago (Member, Author)

Well, this is unpleasant. The many-failures one, with passt-0^20230216.g4663ccc-1.fc37-x86_64, is a hard failure: it's not a flake. Tests fail repeatedly despite multiple reruns.

@sbrivio-rh (Collaborator)

Would it be better to skip all the TCP forwarding tests (via tap device only) for the time being, while I'm investigating this?

I can't reproduce this at the moment, and we have very similar tests in passt's CI which are consistently passing, so it might take a bit -- it's nothing obvious (to me).

@sbrivio-rh (Collaborator)

Issue finally reproduced, on Fedora 37 only, but reliably -- I guess it depends on some specific timing, and probably on socat sending a single byte right away after the SYN, ACK segment.

The container sends a FIN, ACK segment, but pasta fails to send an ACK segment back and the connection doesn't close. This appears to be caused by https://passt.top/passt/commit/?id=cc6d8286d1043d04eb8518e39cebcb9e086dca17

I'll debug this further and try to release fixed packages later on Wednesday.

Let me know if I should meanwhile send a pull request to skip those tests for the time being. Thanks for the report, and sorry for the mess.
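
For anyone who wants to confirm the behaviour on an affected host, one possible sketch is to have pasta write a tap-side capture; the --pcap option and the podman 4.4+ pass-through syntax (--net=pasta:<options>) are assumptions to double-check against pasta(1) and podman-run(1):

    addr=10.128.0.63    # as in the earlier sketch; replace with a local address
    port=5561
    # Assumed syntax: options after "pasta:" are passed through to pasta itself.
    timeout 120 podman run --rm --net=pasta:--pcap,/tmp/pasta.pcap \
        -p "[$addr]:$port:$port/tcp" quay.io/libpod/testimage:20221018 \
        sh -c "socat -u TCP4-LISTEN:$port,bind=[$addr] STDOUT" &
    sleep 2
    printf x | socat -u STDIN "TCP4:$addr:$port"
    wait
    tcpdump -nr /tmp/pasta.pcap "tcp port $port"
    # With the broken build, the capture should end with the container's
    # FIN, ACK and no ACK back from pasta.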

@edsantiago (Member, Author)

Oh, wow, you've had a long evening (or long few hours in your time zone). Thank you for looking into this.

I don't speak for the team, but my personal preference right now would be for #17305 to merge. That's a complex PR that has taken many weeks of arduous effort, and keeps running into setbacks. If you can precisely identify the subset of tests that should be skipped, and could post a list or diffs or a patch on that PR, I think that'd be the best use of everyone's time. Merging that PR will then unfreeze other work.

Thanks again for looking into this so promptly.

@sbrivio-rh (Collaborator)

An updated Fedora package fixing this is pending the testing phase here:

https://bodhi.fedoraproject.org/updates/FEDORA-2023-dc03f3fc08

and a dnf command for the specific upgrade should be available at that page in a few minutes. I need to be offline for a couple of hours. If this doesn't work for you, I'll send a patch on that PR skipping the tests.

If you're curious: https://passt.top/passt/commit/?id=4ddbcb9c0c555838b123c018a9ebc9b7e14a87e5

@sbrivio-rh (Collaborator)

For some reason, the updated package, passt-0^20230222.g4ddbcb9-1.fc37, is not yet available from the testing repository -- the push to testing is still pending. This usually takes minutes, but today it's taking hours. There might be some workflow rule I'm not aware of.

Let me know if I should just go ahead with a patch temporarily disabling those tests at this point.

@edsantiago (Member, Author)

@sbrivio-rh Bodhi has been very slow the last few times I've checked over the last two weeks. O(days).

In the meantime, #17305 has merged (YAY!) but with all pasta tests disabled. ALL pasta tests, not just a subset.

If you wish to submit a PR to reenable tests, skipping only the ones that break in passt-0^20230216.g4663ccc-1.fc37, feel free to do so. This step is optional.

Once 0^20230222.g4ddbcb9-1.fc37 makes it into updates, please coordinate with @cevich to build new VMs and reenable the full suite of pasta tests.

Thank you!

@sbrivio-rh (Collaborator)

> Oh, wow, you've had a long evening (or long few hours in your time zone). Thank you for looking into this.

Right, yes, thanks for noticing -- it was a perfect storm, also with libvirt and KubeVirt integration issues coming up at the same time.

> In the meantime, #17305 has merged (YAY!) but with all pasta tests disabled. ALL pasta tests, not just a subset.

Absolutely reasonable, thanks @cevich for the patch.

> If you wish to submit a PR to reenable tests, skipping only the ones that break in passt-0^20230216.g4663ccc-1.fc37, feel free to do so. This step is optional.

I'd skip this.

> Once 0^20230222.g4ddbcb9-1.fc37 makes it into updates, please coordinate with @cevich to build new VMs and reenable the full suite of pasta tests.


@cevich, passt-0^20230222.g4ddbcb9-1.fc37 is in updates-testing now and can be installed with:

sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2023-dc03f3fc08

For Debian Sid, version 0.0~git20230216.4663ccc-1 is available. Debian packages are not built with -flto, so the present issue doesn't apply to them. Would it be possible to add this package to Debian Sid images as well? I already ran tests for pasta with Podman Debian package 4.4.0+ds1-1 on my local setup with no failures. Thanks.
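
For the Debian images, the pinned install would look roughly like the following; the package name and its availability in sid at image-build time are assumptions to verify:

    apt-get update
    apt-get install -y passt=0.0~git20230216.4663ccc-1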

@cevich (Member) commented Feb 23, 2023

> @cevich, passt-0^20230222.g4ddbcb9-1.fc37 is in updates-testing now.

The F37 CI VM images are configured to use updates-testing by default, so all I need to do is start a new build. Should be fairly painless...

> Would it be possible to add this package to Debian Sid images as well?

Yep, np.

cevich added a commit to cevich/automation_images that referenced this issue Feb 23, 2023
Ref: containers/podman#17598

Signed-off-by: Chris Evich <cevich@redhat.com>
@cevich (Member) commented Feb 23, 2023

@sbrivio-rh with the c20230223t153813z-f37f36d12 images built, just swap that value in for IMAGE_SUFFIX in .cirrus.yml (in this PR) and it'll use the new images.
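
A rough sketch of that edit in a Podman checkout; the exact key name and indentation in .cirrus.yml are assumed:

    # Point CI at the freshly built VM images, then review the diff before pushing.
    sed -i 's/^\( *IMAGE_SUFFIX:\).*$/\1 "c20230223t153813z-f37f36d12"/' .cirrus.yml
    git diff .cirrus.yml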

@sbrivio-rh (Collaborator)

This is now fixed in #17650.

rhatdan closed this as completed on Feb 28, 2023
@edsantiago (Member, Author)

Seen again yesterday: f37 remote, on a PR against main, with (presumably) the same fixed VM images and a fixed version of pasta.

I'm reluctant to reopen based solely on one instance in one month... but will leave this here for now.

github-actions bot added the locked - please file new issue/PR label on Aug 29, 2023
github-actions bot locked as resolved and limited conversation to collaborators on Aug 29, 2023