podman-remote: there's a hang somewhere #7241

Closed
edsantiago opened this issue Aug 5, 2020 · 10 comments
Labels
flakes: Flakes from Continuous Integration
kind/bug: Categorizes issue or PR as related to a bug.
kind/test-flake: Categorizes issue or PR as related to test flakes.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
remote: Problem is in podman-remote
stale-issue

Comments

@edsantiago
Member

Another semi-useless report with no reproducer nor actual details.

There's a hang somewhere in podman-remote. It's causing flakes in CI all over the place. It does not seem to be deterministic; it's happening in lots of different situations.

Examples:

Sorry for sparse details, I'm trying very hard to be done for the week and am failing dismally. Please add examples here as you see them, I think this is going to be an ugly one.

@edsantiago added the flakes, kind/bug, and remote labels Aug 5, 2020
@edsantiago
Member Author

Another wait https://cirrus-ci.com/task/6571684087988224

@edsantiago
Member Author

This script will reproduce it, although not quickly: you have to run it in a loop and wait, possibly for hours:

$ cat >7241.sh <<'EOF'
set -e
./bin/podman-remote run -d --name foo alpine sh -c 'touch /foo;while test -e /foo; do sleep 1;done'
cid=foo
./bin/podman-remote logs $cid
./bin/podman-remote exec $cid /etc             || true
./bin/podman-remote exec $cid no-such-command  || true
./bin/podman-remote exec $cid rm /foo
timeout -v 15 ./bin/podman-remote wait $cid
./bin/podman-remote rm $cid
EOF

$ while bash -x 7241.sh;do echo;echo;done
...copious output... eventually dies
+ timeout -v 15 ./bin/podman-remote wait foo
timeout: sending signal TERM to command ‘./bin/podman-remote’

$ podman ps -a
CONTAINER ID  IMAGE                            COMMAND               CREATED         STATUS                     PORTS   NAMES
c87ff4b0836f  docker.io/library/alpine:latest  sh -c touch /foo;...  15 minutes ago  Exited (0) 10 minutes ago          foo

I find it interesting that CREATED was 15 minutes ago, but it exited 10 minutes ago, implying that it took about 5 minutes between the exec rm /foo and the time the container exited.
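
The relative times in `podman ps` are rounded, so one way to pin down that gap is to record absolute timestamps at the moment the wait times out. The following is only a sketch: `wait_or_dump`, `PODMAN`, `PODMAN_REMOTE` and `WAIT_TIMEOUT` are hypothetical names, with defaults that mirror the paths and the 15-second limit in 7241.sh. It assumes the local `podman inspect` exposes `.State.Status`, `.State.StartedAt` and `.State.FinishedAt` in its format template.

```shell
#!/usr/bin/env bash
# wait_or_dump is a hypothetical helper, not part of podman.
# PODMAN, PODMAN_REMOTE and WAIT_TIMEOUT are assumed variable names;
# the defaults mirror the paths and the 15-second limit used in 7241.sh.
PODMAN=${PODMAN:-./bin/podman}
PODMAN_REMOTE=${PODMAN_REMOTE:-./bin/podman-remote}
WAIT_TIMEOUT=${WAIT_TIMEOUT:-15}

wait_or_dump() {
    local cid=$1
    local rc=0
    timeout "$WAIT_TIMEOUT" "$PODMAN_REMOTE" wait "$cid" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # 124 is the status timeout(1) returns when it had to kill the command.
        echo "wait timed out at $(date -u +%Y-%m-%dT%H:%M:%SZ)" >&2
        # Ask the local (non-remote) podman for absolute container timestamps.
        "$PODMAN" inspect --format \
            'status={{.State.Status}} started={{.State.StartedAt}} finished={{.State.FinishedAt}}' \
            "$cid" >&2 || true
    fi
    return "$rc"
}
```

In the reproducer, `wait_or_dump $cid` would stand in for the `timeout -v 15 ./bin/podman-remote wait $cid` line; comparing the logged timeout time with the `finished=` value shows whether the container was still running when the wait gave up.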

I'm heading out for the day, but am leaving the container as it is in case anyone has suggestions for me to inspect or check logs or anything.

@zhangguanzhang
Collaborator

I see the same hang as you in "podman-remote wait $cid", in https://storage.googleapis.com/cirrus-ci-6707778565701632-fcae48/artifacts/containers/podman/5815345191583744/html/system_test.log.html.

# # podman-remote --url unix:/tmp/podman.6CurjS exec 3b0aee1ff91b0c0440cccc3963843b935fe767c05c48cca7216eac05935f54e2 rm -f /Q5gFVgzznJnlfcj6mZzL
# # podman-remote --url unix:/tmp/podman.6CurjS wait 3b0aee1ff91b0c0440cccc3963843b935fe767c05c48cca7216eac05935f54e2
# timeout: sending signal TERM to command ‘/var/tmp/go/src/github.com/containers/podman/bin/podman-remote’
# [ rc=124 (** EXPECTED 0 **) ]

The exit code is 124, so it seems to have been killed by something, but when I test this on my local machine it works fine.
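
For context, 124 is the status GNU `timeout` itself returns when it has to kill a command that ran past its deadline, so the rc=124 in the log comes from the test's `timeout` wrapper rather than from podman-remote. A small sketch of telling the two cases apart (`run_with_deadline` is a hypothetical helper name, not something from the podman test suite):

```shell
#!/usr/bin/env bash
# run_with_deadline is a hypothetical helper, not part of podman.
# It runs a command under timeout(1) and says so on stderr when the
# deadline, rather than the command itself, ended the run.
run_with_deadline() {
    local seconds=$1
    shift
    local rc=0
    timeout "$seconds" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${seconds}s: $*" >&2
    fi
    # Any other status is the command's own exit status.
    return "$rc"
}
```

For example, `run_with_deadline 15 podman-remote wait "$cid"` would return the normal status of `wait` when it finishes in time, and 124 only when the 15-second limit is hit.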

edsantiago added a commit to edsantiago/libpod that referenced this issue Aug 10, 2020
- new sanity checks for podman-remote:
  - first, confirm that when PODMAN is "-remote",
    we actually talk to a server (validated by
    presence of "Server:" string in "podman version").
  - second, add test for containers#7212, in which we run
    "podman --remote" (podman with --remote flag,
    not podman-remote command) and make sure --remote
    is allowed both as the first option and also
    with other flag options preceding.

- new test for "podman image tree" (piggybacking on
  top of a "podman build" test, because that gives
  us lots of layers).

- skip "podman exec - basic test" when remote. It is consistently
  causing CI failures, breaking all of CI, due to containers#7241.

Signed-off-by: Ed Santiago <santiago@redhat.com>
Luap99 pushed a commit to Luap99/libpod that referenced this issue Aug 30, 2020
- new sanity checks for podman-remote:
  - first, confirm that when PODMAN is "-remote",
    we actually talk to a server (validated by
    presence of "Server:" string in "podman version").
  - second, add test for containers#7212, in which we run
    "podman --remote" (podman with --remote flag,
    not podman-remote command) and make sure --remote
    is allowed both as the first option and also
    with other flag options preceding.

- new test for "podman image tree" (piggybacking on
  top of a "podman build" test, because that gives
  us lots of layers).

- skip "podman exec - basic test" when remote. It is consistently
  causing CI failures, breaking all of CI, due to containers#7241.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions

github-actions bot commented Sep 8, 2020

A friendly reminder that this issue had no activity for 30 days.

@mheon
Member

mheon commented Sep 8, 2020

@edsantiago Still seeing this one?

@edsantiago
Member Author

Sorry, yes. Issue still present in master @ 54a61e3. It took 35 minutes of looping on my f32 laptop, but:

...
+ timeout -v 15 ./bin/podman-remote wait foo
timeout: sending signal TERM to command ‘./bin/podman-remote’
$ ./bin/podman ps -a
CONTAINER ID  IMAGE                            COMMAND               CREATED        STATUS                         PORTS   NAMES
316df2501ed6  docker.io/library/alpine:latest  sh -c touch /foo;...  6 minutes ago  Exited (0) About a minute ago          foo
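
Because the time to failure varies so much between runs (hours in the first report, 35 minutes here), it can help to have the loop itself report how long it took. A minimal sketch, where `loop_until_failure` is a hypothetical helper and not part of podman or its test suite:

```shell
#!/usr/bin/env bash
# loop_until_failure is a hypothetical helper, not part of podman.
# It reruns the given command until it fails, then reports how many
# iterations and how many seconds that took.
loop_until_failure() {
    local count=0
    local rc=0
    local start
    start=$(date +%s)
    while true; do
        count=$((count + 1))
        "$@" || { rc=$?; break; }
    done
    echo "failed on iteration $count after $(( $(date +%s) - start ))s (rc=$rc)"
    return "$rc"
}
```

It would be invoked as `loop_until_failure bash -x 7241.sh`, in place of the `while bash -x 7241.sh; do ...; done` loop.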

@rhatdan added the kind/test-flake label Oct 7, 2020
@github-actions

github-actions bot commented Nov 7, 2020

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Nov 7, 2020

@edsantiago the original report said that there were hangs all over the place, but I guess this is very rare now?

@edsantiago
Member Author

I skipped the failing test, which is one reason why we're not seeing the failure in CI any more.

I tried running the 7241.sh reproducer just now and saw no failures in one hour. I will try removing the skip and we'll see what happens in CI.

edsantiago added a commit to edsantiago/libpod that referenced this issue Nov 9, 2020
It was 'skip'ped due to frequent flakes (containers#7241). I just tried
running the 7241 reproducer on my laptop for one hour, and
saw no failures, so let's reenable this in CI and see if it
comes back.

I really hate problems that "go away" on their own without
being explicitly acknowledged and fixed.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@rhatdan
Member

rhatdan commented Dec 24, 2020

I believe this is fixed now. @edsantiago, reopen if I am mistaken.

@rhatdan rhatdan closed this as completed Dec 24, 2020
edsantiago added a commit to edsantiago/libpod that referenced this issue Jan 4, 2021
Test was disabled August 2020 due to containers#7241, a hang. That issue
has been closed, so let's see if it's really fixed.

Signed-off-by: Ed Santiago <santiago@redhat.com>
@github-actions bot added the "locked - please file new issue/PR" label Sep 22, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Sep 22, 2023