Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rawhide is broken #624

Merged
merged 5 commits into from
Dec 17, 2020
Merged

Rawhide is broken #624

merged 5 commits into from
Dec 17, 2020

Conversation

marusak
Copy link
Member

@marusak marusak commented Dec 15, 2020

No description provided.

@marusak
Copy link
Member Author

marusak commented Dec 15, 2020

I saw almost the same failure in downstream: https://src.fedoraproject.org/rpms/cockpit-podman/pull-request/14

@marusak
Copy link
Member Author

marusak commented Dec 16, 2020

I booted F-32/rawhide prepared VM and:

$  systemctl --user start podman.socket

$ systemctl --user status podman
● podman.service - Podman API Service
     Loaded: loaded (/usr/lib/systemd/user/podman.service; static; vendor preset: disabled)
     Active: inactive (dead)
TriggeredBy: ● podman.socket
       Docs: man:podman-system-service(1)

[admin@m1 ~]$ time  curl --unix-socket /run/user/1000/podman/podman.sock http://d/v1.24/events
curl: (18) transfer closed with outstanding read data remaining

real	1m35.054s
user	0m0.013s
sys	0m0.017s

$ systemctl --user status podman
● podman.service - Podman API Service
     Loaded: loaded (/usr/lib/systemd/user/podman.service; static; vendor preset: disabled)
     Active: failed (Result: timeout) since Wed 2020-12-16 11:59:48 EST; 3s ago
TriggeredBy: ● podman.socket
       Docs: man:podman-system-service(1)
    Process: 17474 ExecStart=/usr/bin/podman $LOGGING system service (code=exited, status=1/FAILURE)
   Main PID: 17474 (code=exited, status=1/FAILURE)
        CPU: 131ms

Dec 16 11:58:13 m1.cockpit.lan systemd[17434]: podman.service: Got notification message from PID 17519, but reception only permitted for main PID 17474
Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="APIHandler(d146316c-7340-4d8a-9d52-8185cbe7792f) -- GET /v1.24/events BEGIN"
Dec 16 11:59:43 m1.cockpit.lan systemd[17434]: podman.service: start operation timed out. Terminating.
Dec 16 11:59:43 m1.cockpit.lan podman[17474]: time="2020-12-16T11:59:43-05:00" level=info msg="Received shutdown signal terminated, terminating!"
Dec 16 11:59:43 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:43-05:00" level=info msg="Received shutdown signal terminated, terminating!"
Dec 16 11:59:43 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:43-05:00" level=info msg="Invoking shutdown handler server"
Dec 16 11:59:48 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:48-05:00" level=info msg="Invoking shutdown handler libpod"
Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: podman.service: Main process exited, code=exited, status=1/FAILURE
Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: podman.service: Failed with result 'timeout'.
Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: Failed to start Podman API Service.

A bit more logs:

$ journalctl -r
-- Logs begin at Fri 2020-12-04 21:12:04 EST, end at Wed 2020-12-16 11:59:48 EST. --
Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: Failed to start Podman API Service.
Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: podman.service: Failed with result 'timeout'.
Dec 16 11:59:48 m1.cockpit.lan systemd[17434]: podman.service: Main process exited, code=exited, status=1/FAILURE
Dec 16 11:59:48 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:48-05:00" level=info msg="Invoking shutdown handler libpod"
Dec 16 11:59:43 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:43-05:00" level=info msg="Invoking shutdown handler server"
Dec 16 11:59:43 m1.cockpit.lan podman[17519]: time="2020-12-16T11:59:43-05:00" level=info msg="Received shutdown signal terminated, terminating!"
Dec 16 11:59:43 m1.cockpit.lan podman[17474]: time="2020-12-16T11:59:43-05:00" level=info msg="Received shutdown signal terminated, terminating!"
Dec 16 11:59:43 m1.cockpit.lan systemd[17434]: podman.service: start operation timed out. Terminating.
Dec 16 11:59:41 m1.cockpit.lan systemd[17434]: Finished Mark boot as successful.
Dec 16 11:59:41 m1.cockpit.lan systemd[17434]: grub-boot-success.service: Succeeded.
Dec 16 11:59:41 m1.cockpit.lan systemd[17434]: Starting Mark boot as successful...
Dec 16 11:59:29 m1.cockpit.lan kernel: audit: type=1131 audit(1608137968.994:331): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=libvirtd comm="syste>
Dec 16 11:59:28 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=libvirtd comm="systemd" exe="/usr/lib/systemd/sys>
Dec 16 11:59:28 m1.cockpit.lan systemd[1]: libvirtd.service: Succeeded.
Dec 16 11:59:21 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/s>
Dec 16 11:59:21 m1.cockpit.lan kernel: audit: type=1131 audit(1608137961.238:330): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="sys>
Dec 16 11:59:21 m1.cockpit.lan kernel: audit: type=1130 audit(1608137961.238:329): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="sys>
Dec 16 11:59:21 m1.cockpit.lan audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/>
Dec 16 11:59:21 m1.cockpit.lan systemd[1]: Finished Check PMIE instances are running.
Dec 16 11:59:21 m1.cockpit.lan systemd[1]: pmie_check.service: Succeeded.
Dec 16 11:59:21 m1.cockpit.lan kernel: audit: type=1131 audit(1608137961.230:328): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmlogger_check comm=>
Dec 16 11:59:21 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmlogger_check comm="systemd" exe="/usr/lib/syste>
Dec 16 11:59:21 m1.cockpit.lan audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmlogger_check comm="systemd" exe="/usr/lib/syst>
Dec 16 11:59:21 m1.cockpit.lan kernel: audit: type=1130 audit(1608137961.230:327): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmlogger_check comm=>
Dec 16 11:59:21 m1.cockpit.lan systemd[1]: Finished Check pmlogger instances are running.
Dec 16 11:59:21 m1.cockpit.lan systemd[1]: pmlogger_check.service: Succeeded.
Dec 16 11:59:20 m1.cockpit.lan systemd[1]: Starting Check pmlogger instances are running...
Dec 16 11:59:20 m1.cockpit.lan systemd[1]: Starting Check PMIE instances are running...
Dec 16 11:58:39 m1.cockpit.lan chronyd[607]: Selected source 147.251.48.140
Dec 16 11:58:13 m1.cockpit.lan kernel: audit: type=1131 audit(1608137893.708:326): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="sys>
Dec 16 11:58:13 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/s>
Dec 16 11:58:13 m1.cockpit.lan audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="systemd" exe="/usr/lib/systemd/>
Dec 16 11:58:13 m1.cockpit.lan kernel: audit: type=1130 audit(1608137893.708:325): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pmie_check comm="sys>
Dec 16 11:58:13 m1.cockpit.lan systemd[1]: Finished Check PMIE instances are running.
Dec 16 11:58:13 m1.cockpit.lan systemd[1]: pmie_check.service: Succeeded.
Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="APIHandler(d146316c-7340-4d8a-9d52-8185cbe7792f) -- GET /v1.24/events BEGIN"
Dec 16 11:58:13 m1.cockpit.lan systemd[17434]: podman.service: Got notification message from PID 17519, but reception only permitted for main PID 17474
Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="API server listening on \"/run/user/1000/podman/podman.sock\""
Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="using API endpoint: ''"
Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="using systemd socket activation to determine API endpoint"
Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="Setting parallel job count to 4"
Dec 16 11:58:13 m1.cockpit.lan podman[17519]: 2020-12-16 11:58:13.636386071 -0500 EST m=+0.049761667 system refresh
Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=warning msg="Error initializing configured OCI runtime kata: no valid executable found for OCI runtime k>
Dec 16 11:58:13 m1.cockpit.lan podman[17519]: time="2020-12-16T11:58:13-05:00" level=info msg="/usr/bin/podman filtering at log level info"
Dec 16 11:58:13 m1.cockpit.lan podman[17474]: time="2020-12-16T11:58:13-05:00" level=warning msg="Error initializing configured OCI runtime kata: no valid executable found for OCI runtime k>
Dec 16 11:58:13 m1.cockpit.lan podman[17474]: time="2020-12-16T11:58:13-05:00" level=info msg="/usr/bin/podman filtering at log level info"
Dec 16 11:58:13 m1.cockpit.lan systemd[1]: Starting Check PMIE instances are running...
Dec 16 11:58:13 m1.cockpit.lan systemd[17434]: Starting Podman API Service...
Dec 16 11:57:58 m1.cockpit.lan audit: BPF prog-id=43 op=UNLOAD
Dec 16 11:57:58 m1.cockpit.lan kernel: audit: type=1334 audit(1608137878.525:324): prog-id=43 op=UNLOAD
Dec 16 11:57:58 m1.cockpit.lan kernel: audit: type=1334 audit(1608137878.525:323): prog-id=44 op=UNLOAD
Dec 16 11:57:58 m1.cockpit.lan audit: BPF prog-id=44 op=UNLOAD
Dec 16 11:57:58 m1.cockpit.lan audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed comm="systemd" exe="/usr/lib/sy>
Dec 16 11:57:58 m1.cockpit.lan kernel: audit: type=1131 audit(1608137878.504:322): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-hostnamed co>
Dec 16 11:57:58 m1.cockpit.lan kernel: kauditd_printk_skb: 21 callbacks suppressed
Dec 16 11:57:58 m1.cockpit.lan systemd[1]: systemd-hostnamed.service: Succeeded.
Dec 16 11:57:55 m1.cockpit.lan systemd[17434]: Listening on Podman API Socket.

Interestingly, when I boot VM, first time events time out after ~1.5 minute, when I try it again it never times out, any other subsequent call times out in ~1.5minute.

@marusak
Copy link
Member Author

marusak commented Dec 16, 2020

Sent containers/podman#8751

martinpitt
martinpitt previously approved these changes Dec 16, 2020
Copy link
Member

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! That's in fact the very fix in containers/podman#8753 .

@marusak
Copy link
Member Author

marusak commented Dec 16, 2020

bah, does not seem to work :/ waiting for the result to see if the same problem or something different

@marusak
Copy link
Member Author

marusak commented Dec 16, 2020

so still some failed user services but also now container.Image does not contain tag. Before it would return fedora:32 now just fedora. I need to check if this is regression or the API was "improved".

@marusak
Copy link
Member Author

marusak commented Dec 16, 2020

Oh and that sed is overwritten by the rawhide scenario being applied (with dnf update podman after the sed).
This needs a bit more work, but I guess tomorrow :)

@martinpitt
Copy link
Member

@marusak: Quick idea: copy the unit to /etc or /run, they should both supersede units in /usr.

martinpitt
martinpitt previously approved these changes Dec 17, 2020
Copy link
Member

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving to unblock image refreshes. I'm hot entirely happy about this, but this can be cleaned up in a follow-up. Thanks!

test/vm.install Outdated
Comment on lines 54 to 55
sed -i "s/Type=notify/Type=exec/" /etc/systemd/system/podman.service
sed -i "s/Type=notify/Type=exec/" /etc/systemd/user/podman.service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important at all, just in general for next time: Have sed output to the new file instead of copy/in-place edit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh duh, I was not thinking straight when I wrote this. If I have to repush I'll do that this way.

test/check-application Show resolved Hide resolved
@martinpitt martinpitt dismissed their stale review December 17, 2020 09:24

Meh, it's still kaputt 😿

@marusak
Copy link
Member Author

marusak commented Dec 17, 2020

aargh, it is still failing. Can it be related to the fact, that our nond tests are touching etc? (When I prepare rawhide VM locally it never times out).

@martinpitt
Copy link
Member

On my way out, sorry. My go-to method for this is to have this helper in my bashrc:

export RUNC="--machine 127.0.0.2:2201 --browser 127.0.0.2:9091"

Then prepare a f32/rawhide VM, bots/vm-run fedora-32, and run test/check-application -stv $RUNC.

@marusak
Copy link
Member Author

marusak commented Dec 17, 2020

I can reproduce this just fine if I run two tests in a row. So there is some issue with cleanup.

@martinpitt
Copy link
Member

our nond tests are touching etc

Not specifically, no. Also, they'd just restore what was on the original image, and the hack already happens in image-customize. In the worst case, move the hack into the RAWHIDE= bits in Makefile?

@marusak
Copy link
Member Author

marusak commented Dec 17, 2020

This is "fun". So it seems that after our cleanup when we try to again use the socket it just hangs with reply. I am trying to figure out why it is doing that.

Now podman shows `image` for given container as string, which was used
when the container was created. So when we expect `busybox:latest` we
need to create it from that.
@marusak marusak force-pushed the rawhide branch 2 times, most recently from cb94e8a to 05f3baa Compare December 17, 2020 13:09
@marusak
Copy link
Member Author

marusak commented Dec 17, 2020

aaalmost there. One more flake and one more issue. The last one will need naughty, working on it now.

We removed images, but `wait_visible` for `alpine:latest` could still
sometimes see also the user ones, thus failing with ambiguous selectors.
@martinpitt
Copy link
Member

this failure looks almost the same? May need an extension of the pattern?

@marusak
Copy link
Member Author

marusak commented Dec 17, 2020

this failure looks almost the same? May need an extension of the pattern?

It is the same, it just ran with the current bots, so it does not see the new naughty yet. fedora-32/rawhide@bots#1477 succeeded. It is the same scenario, just with bots, which include that new pattern.

@marusak
Copy link
Member Author

marusak commented Dec 17, 2020

This was fun. 3 reports to podman later and this is ready for review. (+1 report from Marius to podman last week, not great podman, not great)

Copy link
Member

@martinpitt martinpitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a mess.. Thanks for wading through this!

@@ -165,6 +165,10 @@ $(VM_IMAGE): $(VM_DEP) bots
rm -f $(VM_IMAGE) $(VM_IMAGE).qcow2
bots/image-customize -v $(VM_PACKAGE) -s $(CURDIR)/test/vm.install $(TEST_OS)
$(RAWHIDE)
# HACK: systemd kills the services after 90s
# See https://github.com/containers/podman/issues/8751
bots/image-customize -r "sed -i 's/Type=notify/Type=exec/' /usr/lib/systemd/system/podman.service" -r "sed -i 's/Type=notify/Type=exec/' /usr/lib/systemd/user/podman.service" $(TEST_OS)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked above to put this into the RAWHIDE= definition, but indeed you said in the upstream report that it already landed in F32/33, and Debian also has 2.2.1 now.

@marusak marusak merged commit f3ab2bc into cockpit-project:master Dec 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants