This repository has been archived by the owner on May 6, 2020. It is now read-only.

ci: metrics: metrics ci fails for networking tests #960

Open
jcvenegas opened this issue Mar 29, 2018 · 9 comments

@jcvenegas
Contributor

The logs show the following

12:39:53 ===== starting test [iperf3 tests] =====
12:39:53 command: docker: yes
12:39:53 docker pull'ing: gabyct/network
12:39:53 Using default tag: latest
12:39:55 latest: Pulling from gabyct/network
12:39:55 0c62fc2b46a9: Pulling fs layer
12:40:04 0c62fc2b46a9: Verifying Checksum
12:40:04 0c62fc2b46a9: Download complete
12:40:18 0c62fc2b46a9: Pull complete
12:40:18 Digest: sha256:c7abad113ea5f3829c3fcdb7b4886d02e28e3dbc392da3b442bb36ce0dedfc76
12:40:18 Status: Downloaded newer image for gabyct/network:latest
12:40:18 docker pull'd: gabyct/network
12:40:18 Iteration 1
12:40:24 ERROR: iperf server init fails
12:40:24 ERROR: result argument not supplied
12:40:24 
12:40:24 ===== starting test [storage IO random read bs 16k] =====

/cc @grahamwhaley @sboeuf

@grahamwhaley
Contributor

Let's start adding a little context (it would be great if we knew, for instance, what the first PR that showed this was). For reference then, we see this over on: clearcontainers/runtime#1091

It does not happen with a very simple hand run at my desk, so I'm thinking:

  • I'll update all components locally and see if it happens
  • If I can't make that happen I'll run it directly on one of the CI machines
  • Once we've identified the issue, we'll see if we can make the test more verbose in its failure to help track down any future similar issues.

@grahamwhaley
Contributor

OK, I updated to the latest runtime/proxy/shim - and now it fails locally for me. Looks like we may have broken something. If I had to guess, then it is most likely around the 9p/tmpfs workaround: https://github.com/clearcontainers/tests/blob/master/metrics/network/network-metrics-iperf3.sh#L63-L68

@grahamwhaley
Contributor

Update: Looks like 3.0.22 works, but 3.0.23 fails. It looks like it fails with some sort of tmpfile access on the /dev/shm mount, but a quick look at the mount in the working and failing case shows them to superficially look the same.
Current suspect could be around kata-containers/runtime#123
/cc @sboeuf

@grahamwhaley
Contributor

OK - update - in 3.0.23 it looks like we have two related mounts in the container (I only spotted the first before...):

root@ef1c57ea478e:/# mount | fgrep shm
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
hyperShared on /dev/shm type 9p (rw,nodev,relatime,sync,dirsync,access=client,trans=virtio)

That second hyperShared mount is the one not present in 3.0.22. It is a 9p mount, and is showing the classic 9p 'unlink' symptoms with iperf3 that we normally see.
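The 'unlink' symptom can be reproduced without iperf3 at all. A minimal sketch (run from a shell inside the container) of the open-then-unlink pattern that iperf3 relies on:

```shell
# Open a temp file, unlink its name, then keep writing through the
# still-open descriptor. This is POSIX-legal and works on tmpfs/ext4,
# but is exactly the pattern that breaks on plain 9p mounts.
tmp=$(mktemp)          # honours $TMPDIR, so this lands on the suspect mount
exec 3<>"$tmp"         # hold the file open on fd 3
rm -- "$tmp"           # unlink the name; the inode lives on via fd 3
echo "still writable" >&3   # on 9p, I/O via the unlinked file fails here
exec 3>&-              # close; the data is reclaimed
```

Running this with `TMPDIR=/dev/shm` on 3.0.23 should reproduce the failure if the 9p mount is the culprit.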

@sboeuf @amshinde - thoughts?

@amshinde
Contributor

amshinde commented Apr 3, 2018

@grahamwhaley What is the workaround you use for /dev/shm for iperf?
I think with the change introduced with kata-containers/runtime@08909b2,
/dev/shm is being passed as a 9p mount.
Maybe skipping your workaround for shm would fix the issue.

@grahamwhaley
Contributor

@amshinde /dev/shm was the second workaround we've had :-) The basic problem is that iperf3 tries to use an unlinked tmpfile, which does not work on 9p. Thus, we cannot use the default 9p-backed /tmp as the destination for tmpfiles.

  • our first workaround was to do an in-container tmpfs mount over /tmp - which we cannot now do as we removed mount privs
  • our second workaround was to set TMPDIR to point at /dev/shm, which was an existing shm mount.
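For reference, the two workarounds above amount to something like the following (a paraphrased sketch, not the exact test-script contents):

```shell
# Workaround 1: give the container a real tmpfs over /tmp. This needs
# mount privileges inside the container, which the tests no longer grant:
#   mount -t tmpfs tmpfs /tmp

# Workaround 2: leave /tmp alone and steer temp-file creation at the
# existing shm tmpfs instead (iperf3 picks up TMPDIR for its tmpfile,
# which is what the second workaround relies on):
export TMPDIR=/dev/shm
```

Workaround 2 only helps, of course, while /dev/shm is genuinely a tmpfs inside the container, which is exactly what broke here.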

What we have at present feels wrong to me, or at the very least we need to explain what we do and why. What we now have is an in-VM tmpfs that is then (afaict) overlaid with a 9p mount. Either...

  • maybe we are mapping in a host-side shm so it can be shared across containers?
  • maybe we need to special-case tmpfs/ramfs mounts and do them in the container via the agent in the VM, rather than map them through as 9p mounts?
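Either way, a quick in-container check of which filesystem actually ends up backing /dev/shm (the later mount shadows the earlier one) would confirm which case we're hitting. A sketch, assuming GNU coreutils `stat` is available in the image:

```shell
# Report the filesystem type that actually backs /dev/shm. With the
# stacked mounts shown earlier, the 9p mount shadows the tmpfs one, so
# this should report a 9p-ish type on 3.0.23 and tmpfs on 3.0.22.
stat -f -c 'type=%T' /dev/shm
```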

@amshinde
Contributor

amshinde commented Apr 5, 2018

@grahamwhaley Just did a quick check. With @sboeuf's change, all our bind mounts are passed through 9p, including those in /dev, and /dev/shm happens to be a bind mount. That's why you are seeing the 2 mounts for /dev/shm. We need special handling for /dev/shm to avoid this. I'll look into the change for this.

@grahamwhaley
Contributor

great, thanks @amshinde !

@grahamwhaley
Contributor

Related: kata-containers/runtime#191
