This repository has been archived by the owner on May 6, 2020. It is now read-only.

ci: metrics: metrics ci fails for networking tests #960

Open
jcvenegas opened this issue Mar 29, 2018 · 9 comments

@jcvenegas
Contributor

The logs show the following

12:39:53 ===== starting test [iperf3 tests] =====
12:39:53 command: docker: yes
12:39:53 docker pull'ing: gabyct/network
12:39:53 Using default tag: latest
12:39:55 latest: Pulling from gabyct/network
12:39:55 0c62fc2b46a9: Pulling fs layer
12:40:04 0c62fc2b46a9: Verifying Checksum
12:40:04 0c62fc2b46a9: Download complete
12:40:18 0c62fc2b46a9: Pull complete
12:40:18 Digest: sha256:c7abad113ea5f3829c3fcdb7b4886d02e28e3dbc392da3b442bb36ce0dedfc76
12:40:18 Status: Downloaded newer image for gabyct/network:latest
12:40:18 docker pull'd: gabyct/network
12:40:18 Iteration 1
12:40:24 ERROR: iperf server init fails
12:40:24 ERROR: result argument not supplied
12:40:24 
12:40:24 ===== starting test [storage IO random read bs 16k] =====

/cc @grahamwhaley @sboeuf

@grahamwhaley
Contributor

Let's start adding a little context (it would be great if we knew, for instance, what the first PR that showed this was). For reference then, we see this over on: clearcontainers/runtime#1091

It does not happen with a very simple hand run at my desk, so I'm thinking:

  • I'll update all components locally and see if it happens
  • If I can't make that happen I'll run it directly on one of the CI machines
  • Once we've identified the issue, we'll see if we can make the test more verbose in its failure to help track down any future similar issues.

@grahamwhaley
Contributor

OK, I updated to the latest runtime/proxy/shim - and now it fails locally for me. Looks like we may have broken something. If I had to guess, then it is most likely around the 9p/tmpfs workaround: https://github.com/clearcontainers/tests/blob/master/metrics/network/network-metrics-iperf3.sh#L63-L68

@grahamwhaley
Contributor

Update: Looks like 3.0.22 works, but 3.0.23 fails. It looks like it fails with some sort of tmpfile access on the /dev/shm mount, but a quick look at the mount in the working and failing case shows them to superficially look the same.
Current suspect could be around kata-containers/runtime#123
/cc @sboeuf

@grahamwhaley
Contributor

OK - update - in 3.0.23 it looks like we have two related mounts in the container (I only spotted the first before...):

root@ef1c57ea478e:/# mount | fgrep shm
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
hyperShared on /dev/shm type 9p (rw,nodev,relatime,sync,dirsync,access=client,trans=virtio)

That second hyperShared mount is the one not present in 3.0.22. It is a 9p mount, and is showing the classic 9p 'unlink' symptoms with iperf3 that we normally see.
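The 'unlink' symptom can be reproduced without iperf3 at all. A minimal sketch (run from a shell inside the container) of the open-then-unlink pattern that iperf3 relies on:

```shell
# Open a temp file, unlink its name, then keep writing through the
# still-open descriptor. This is POSIX-legal and works on tmpfs/ext4,
# but is exactly the pattern that breaks on plain 9p mounts.
tmp=$(mktemp)          # honours $TMPDIR, so this lands on the suspect mount
exec 3<>"$tmp"         # hold the file open on fd 3
rm -- "$tmp"           # unlink the name; the inode lives on via fd 3
echo "still writable" >&3   # on 9p, I/O via the unlinked file fails here
exec 3>&-              # close; the data is reclaimed
```

Running this with `TMPDIR=/dev/shm` on 3.0.23 should reproduce the failure if the 9p mount is the culprit.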

@sboeuf @amshinde - thoughts?

@amshinde
Contributor

amshinde commented Apr 3, 2018

@grahamwhaley What is the workaround you use for /dev/shm for iperf?
I think with the change introduced with kata-containers/runtime@08909b2,
/dev/shm is being passed as a 9p mount.
Maybe skipping your workaround for shm would fix the issue.

@grahamwhaley
Contributor

@amshinde /dev/shm was the second workaround we've had :-) The basic problem is that iperf3 tries to use an unlinked tmpfile, which does not work on 9p. Thus, we cannot use the default 9p-backed /tmp as the destination for tmpfiles.

  • our first workaround was to do an in-container tmpfs mount over /tmp - which we cannot now do as we removed mount privs
  • our second workaround was to set TMPDIR to point at /dev/shm, which was an existing shm mount.
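For reference, the two workarounds above amount to something like the following (a paraphrased sketch, not the exact test-script contents):

```shell
# Workaround 1: give the container a real tmpfs over /tmp. This needs
# mount privileges inside the container, which the tests no longer grant:
#   mount -t tmpfs tmpfs /tmp

# Workaround 2: leave /tmp alone and steer temp-file creation at the
# existing shm tmpfs instead (iperf3 picks up TMPDIR for its tmpfile,
# which is what the second workaround relies on):
export TMPDIR=/dev/shm
```

Workaround 2 only helps, of course, while /dev/shm is genuinely a tmpfs inside the container, which is exactly what broke here.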

What we have at present feels wrong to me, or at the very least we need to explain what we do and why. What we now have is an in-VM tmpfs that is then (afaict) overlaid with a 9p mount. Either...

  • maybe we are mapping in a host-side shm so it can be shared across containers?
  • maybe we need to special-case tmpfs/ramfs mounts and do them in the container via the agent in the VM, rather than map them through as 9p mounts?
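Either way, a quick in-container check of which filesystem actually ends up backing /dev/shm (the later mount shadows the earlier one) would confirm which case we're hitting. A sketch, assuming GNU coreutils `stat` is available in the image:

```shell
# Report the filesystem type that actually backs /dev/shm. With the
# stacked mounts shown earlier, the 9p mount shadows the tmpfs one, so
# this should report a 9p-ish type on 3.0.23 and tmpfs on 3.0.22.
stat -f -c 'type=%T' /dev/shm
```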

@amshinde
Contributor

amshinde commented Apr 5, 2018

@grahamwhaley Just did a quick check. With @sboeuf's change, all our bind mounts are passed through 9p, including those in /dev, and /dev/shm happens to be a bind mount. That's why you are seeing the 2 mounts for /dev/shm. We need special handling for /dev/shm to avoid this. I'll look into the change for this.

@grahamwhaley
Contributor

great, thanks @amshinde !

@grahamwhaley
Contributor

Related: kata-containers/runtime#191
