Tiny improvements #392

Merged: 1 commit merged into snabbco:master on Mar 12, 2015

Conversation

lukego (Member) commented Mar 10, 2015

This branch contains tiny improvements; initially, an update of .gitignore so that git status can be clean after a build.

These are build-related files.
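
In other words, the goal, expressed as a hypothetical check:

make && git status --porcelain   # prints nothing when the tree is clean after a build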
lukego (Member, Author) commented Mar 10, 2015

@eugeneia SnabbBot failure?

eugeneia (Member) commented:

@lukego Seems to have been a one-off. Hmmm. We should probably investigate how taskset -c 1 was able to reliably trigger this bug and understand what it's really about, since the issue clearly hasn't just disappeared.

eugeneia (Member) commented:

@lukego A quick further investigation led to something: it seems to depend on the CPU core in use. When I use taskset to force a CPU on a different NUMA node than the NIC, it fails; likewise, I can force the task onto the NIC's own node and it works. Removing the taskset altogether led to random failures. I now explicitly pin the snabb_bot process to CPU 6, which is on the same node as the NIC used by snabb_bot. Let's see if my theory holds and we won't see any more ping failures.
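
For reference, the check and the pin look roughly like this (the PCI address and script name are placeholders for our setup):

# which NUMA node hosts the NIC?
cat /sys/bus/pci/devices/0000:01:00.0/numa_node
# which CPUs belong to that node?
cat /sys/devices/system/node/node1/cpulist
# pin the CI process to a core on that node
taskset -c 6 ./snabb_bot.sh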

lukego (Member, Author) commented Mar 10, 2015

Cool! Let us get to the bottom of this.

I am guessing the problem will ultimately be failure to allocate a HugeTLB page. Snabb Switch and QEMU both need these and they are allocated from per-NUMA-node pools.

Do you have any of this information?

  1. Which process is causing the test to fail, QEMU or Snabb Switch?
  2. How many huge pages are available on each node? (Should be in a /proc or /sys file; see the sketch after this list.)
  3. Does the Snabb core.memory selftest pass on both nodes (selecting with taskset)? That will allocate huge pages.
  4. If the Snabb process is failing, does strace show a system call causing this? (Likely a shm call from memory.c.)
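
One possible way to gather 2–4 (the selftest invocation, core numbers, and program arguments are assumptions, not our exact setup):

# 2. hugepage availability per NUMA node
cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
# 3. run the core.memory selftest pinned to a core on each node in turn
sudo taskset -c 0 ./snabb snsh -t core.memory
sudo taskset -c 6 ./snabb snsh -t core.memory
# 4. trace only the shm* system calls made by the Snabb process
sudo strace -f -e trace=shmget,shmat,shmdt,shmctl ./snabb <program> <args>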

Or I could be barking up the wrong tree entirely and maybe it is about the relationship between the node of Snabb Switch and the node of the NIC and/or QEMU.

eugeneia (Member) commented:

@lukego Snabb does not crash. Is there a scenario in which Snabb endures a "failure to allocate a HugeTLB page" without at least throwing an error?

Regarding 3.: core.memory passes.

Regarding 4.:

shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843677701
shmat(843677701, 0, 0)                  = 0x2aaaaac00000
shmat(843677701, 0x50042fe00000, 0)     = 0x50042fe00000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843677701, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843710470
shmat(843710470, 0, 0)                  = 0x2aaaaac00000
shmat(843710470, 0x50042fc00000, 0)     = 0x50042fc00000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843710470, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843743239
shmat(843743239, 0, 0)                  = 0x2aaaaac00000
shmat(843743239, 0x50042fa00000, 0)     = 0x50042fa00000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843743239, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843776008
shmat(843776008, 0, 0)                  = 0x2aaaaac00000
shmat(843776008, 0x50042f800000, 0)     = 0x50042f800000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843776008, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843808777
shmat(843808777, 0, 0)                  = 0x2aaaaac00000
shmat(843808777, 0x50042f600000, 0)     = 0x50042f600000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843808777, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843841546
shmat(843841546, 0, 0)                  = 0x2aaaaac00000
shmat(843841546, 0x50042f400000, 0)     = 0x50042f400000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843841546, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843874315
shmat(843874315, 0, 0)                  = 0x2aaaaac00000
shmat(843874315, 0x50042f200000, 0)     = 0x50042f200000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843874315, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843907084
shmat(843907084, 0, 0)                  = 0x2aaaaac00000
shmat(843907084, 0x50042f000000, 0)     = 0x50042f000000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843907084, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843939853
shmat(843939853, 0, 0)                  = 0x2aaaaac00000
shmat(843939853, 0x50042ee00000, 0)     = 0x50042ee00000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843939853, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843972622
shmat(843972622, 0, 0)                  = 0x2aaaaac00000
shmat(843972622, 0x50042ec00000, 0)     = 0x50042ec00000
shmdt(0x2aaaaac00000)                   = 0
shmctl(843972622, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 844005391
shmat(844005391, 0, 0)                  = 0x2aaaaac00000
shmat(844005391, 0x50042ea00000, 0)     = 0x50042ea00000
shmdt(0x2aaaaac00000)                   = 0
shmctl(844005391, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 844038160
shmat(844038160, 0, 0)                  = 0x2aaaaac00000
shmat(844038160, 0x50042e800000, 0)     = 0x50042e800000
shmdt(0x2aaaaac00000)                   = 0
shmctl(844038160, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 844070929
shmat(844070929, 0, 0)                  = 0x2aaaaac00000
shmat(844070929, 0x50042e600000, 0)     = 0x50042e600000
shmdt(0x2aaaaac00000)                   = 0
shmctl(844070929, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 844103698
shmat(844103698, 0, 0)                  = 0x2aaaaac00000
shmat(844103698, 0x50042e400000, 0)     = 0x50042e400000
shmdt(0x2aaaaac00000)                   = 0
shmctl(844103698, IPC_RMID, 0)          = 0
shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 844136467
shmat(844136467, 0, 0)                  = 0x2aaaaac00000
shmat(844136467, 0x50042e200000, 0)     = 0x50042e200000
shmdt(0x2aaaaac00000)                   = 0
shmctl(844136467, IPC_RMID, 0)          = 0
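
To spell out the pattern, here is one round from the trace above, annotated (the comments are my interpretation of memory.c's hugepage allocator, not strace output):

shmget(IPC_PRIVATE, 2097152, IPC_CREAT|SHM_HUGETLB|0600) = 843677701  # allocate one 2MB hugepage-backed segment
shmat(843677701, 0, 0)                  = 0x2aaaaac00000              # attach at a kernel-chosen address first
shmat(843677701, 0x50042fe00000, 0)     = 0x50042fe00000              # re-attach at a fixed virtual address
shmdt(0x2aaaaac00000)                   = 0                           # drop the temporary mapping
shmctl(843677701, IPC_RMID, 0)          = 0                           # segment is freed once the last mapping goes

Note that every shmget in the trace succeeds, so no hugepage allocation failure is visible at this level.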

I do suspect it has something to do with the NIC, simply because the symptom is packets not arriving. There is no crash or anything, just an absence of I/O.

lukego (Member, Author) commented Mar 11, 2015

Cool problem.

Could be NIC-related, but the evidence seems weak to me. "PING failed" is also what you see if the VM fails to boot, right? Also, we have tested various NUMA combinations between NIC/Snabb/QEMU and never seen any bugs of this kind before; the worst case has been a ~33% performance impact. So it is possible, but I would like to audit logs and eliminate other possible causes too.

Thinking of the future: it would be nice if SnabbBot attached more log files to the gist. I would quite like to see the output of the snabb process, the two QEMU processes on the host, and the ping/iperf/etc. processes inside the VMs.

I still suspect the issue is related to hugetlb allocation. Here are more thoughts:

  1. Which processes does the taskset apply to? Is it forcing everything (Snabb Switch + both QEMU) to run on a single core? (If so, that is probably not what we want... We could use numactl -N to pick a whole node rather than a single core, but then we probably need to remove the isolcpus kernel parameter, because I don't think numactl plays well with that.)
  2. How many HugeTLB pages are available on each node? (How many are reserved by the grub kernel parameter? How many are actually allocated now? See the commands after this list.)
  3. How many HugeTLB pages are required for the test? In the strace above it looks like the 2MB page size is used and that Snabb Switch allocates around 16 pages. The QEMU VMs will use HugeTLB pages for guest memory, and if that is (say) 1GB per VM, then the VMs would need 512 pages each. So an estimate of our HugeTLB requirement is around 1100 pages, based on those assumptions.
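
One way to check 2, and to bind a whole process tree per the suggestion in 1 (the script name is a placeholder):

# kernel-reserved totals and current per-node counts
grep -i huge /proc/meminfo
cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
# bind a whole process tree (CPUs and memory) to one NUMA node
numactl -N 1 -m 1 ./run_test.sh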

eugeneia (Member) commented:

"PING failed" is also what you see if the VM fails to boot, right?

No, since for a VM to be considered "up", the caller needs to successfully telnet into the VM and have it ping itself.
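
Schematically, the check amounts to something like this (the console port is a placeholder; the real logic lives in the test scripts):

# drive the guest over its telnet console and have it ping itself
{ sleep 1; echo "ping -c 1 127.0.0.1"; sleep 5; } | telnet 127.0.0.1 5000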

> Thinking of the future: it would be nice if SnabbBot attached more log files to the gist. I would quite like to see the output of the snabb process, the two QEMU processes on the host, and the ping/iperf/etc. processes inside the VMs.

OK. Have increased verbosity in my ci-updates branch.

> Which processes does the taskset apply to?

Good question. The man page only talks about the target process, not the children spawned by it. The taskset seems superfluous since bench_env prepends everything with a numactl call anyway. Maybe they are stepping on each other's toes? To be honest, I've never questioned the variables controlling numactl (NODE_BIND[0-9]+).
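
(For what it's worth, CPU affinity is inherited across fork/exec, so a child spawned under taskset reports the same mask; a quick check:)

# both lines should report the same affinity mask (2, i.e. CPU 1)
taskset -c 1 bash -c 'taskset -p $$; bash -c "taskset -p \$\$"'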

Regarding isolcpus: as far as I can tell we don't do that on davos (or chur), because it broke numactl, which is used quite heavily in bench_env:

$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.13.0-43-generic root=/dev/mapper/davos--vg-root ro

eugeneia (Member) commented:

@lukego numactl -N 0|1 triggers the bug as well. E.g., when binding the test script to node 0, PING fails (the NIC is on node 1).

lukego (Member, Author) commented Mar 11, 2015

Check this out:

$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
1024
$ cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
1024

Looks like each NUMA node has 1024 hugepages available. QEMU 1 wants 512, QEMU 2 wants 512, Snabb Switch wants ~16... there won't be enough if all three processes are running on the same node.

Can you try this?

echo 8192 | sudo tee /proc/sys/vm/nr_hugepages

Then hopefully we have 4096 huge pages (=8GB) available on each node.

Could also edit the grub config to add this kernel parameter:

hugepages=8192

so that it happens automatically on boot. (It can happen that the kernel fails to allocate hugepages long after boot, due to fragmentation, when it cannot find enough contiguous memory regions.)
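
Alternatively, hugepages can be reserved per node explicitly, which makes the per-node split deterministic (the counts here are examples):

echo 4096 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 4096 | sudo tee /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages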

eugeneia (Member) commented:

> Looks like each NUMA node has 1024 hugepages available. QEMU 1 wants 512, QEMU 2 wants 512, Snabb Switch wants ~16... there won't be enough if all three processes are running on the same node.

But isn't our failure case happening in the opposite scenario, when all three processes are not on the same node?

I tried anyway, and increasing the number of hugepages does not fix the issue.

lukego (Member, Author) commented Mar 11, 2015

Thanks for checking on the hugepages. I'm surprised that was not the issue.

On interlaken I can run snabbnfv on a different node from the NIC, and guests can connect to that switch and ping each other. So it is not a black-and-white issue of the traffic process having to be on the same node as the NIC.

How about the intel_app selftest: does that work on both nodes for the same NIC? (Can we reproduce this problem in a simpler way?)
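
For example (the selftest invocation, PCI address, and core numbers are assumptions about this setup):

sudo SNABB_PCI0=0000:01:00.0 taskset -c 0 ./snabb snsh -t apps.intel.intel_app
sudo SNABB_PCI0=0000:01:00.0 taskset -c 6 ./snabb snsh -t apps.intel.intel_app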

I'll see if I wake up with new ideas..

eugeneia (Member) commented:

I guess the most curious thing is that you actually have to taskset/numactl the parent command (e.g. SnabbBot's test task), rather than the make invocation, to trigger the bug.

lukego added a commit that referenced this pull request Mar 12, 2015
lukego merged commit 964eb27 into snabbco:master Mar 12, 2015
lukego deleted the tidy branch June 10, 2015
dpino added a commit to dpino/snabb that referenced this pull request Aug 22, 2016