Execsnoop can miss some events in parallel environments #1250
Comments
I've built a much simpler version of the program, but now you have to do some maths:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Re-exec ourselves with count - 1 until count reaches 0. */
    void execbomb(int count, const char *filename) {
        char s_count[12];

        if (count <= 0)
            return;
        snprintf(s_count, sizeof(s_count) - 1, "%d", count - 1);
        /* "-1" as <forks> tells the new image not to fork again. */
        execl(filename, filename, s_count, "-1", NULL);
    }

    int main(int argc, char *argv[]) {
        int children;

        if (argc != 3) {
            printf("Usage: %s <count> <forks>\n", argv[0]);
            return -1;
        }
        children = atoi(argv[2]);
        while (children-- > 0) {
            if (!fork()) {
                children = -1; /* child: stop forking, start an exec chain */
                break;
            }
        }
        /* The post-decrement also leaves children == -1 in the parent,
         * so the parent starts an exec chain as well. */
        if (children < 0)
            execbomb(atoi(argv[1]), argv[0]);
        wait(NULL);
        return 0;
    }

Use it like this: |
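The "maths" for the simplified program can be sketched as follows. This is my own reading of the C code above, not something stated in the thread: each of the `forks` children and the parent itself runs a chain of `count` execs, plus one initial execve() when the shell starts the program. The function name `expected_execs` is hypothetical, and fork() failures are ignored.

```python
def expected_execs(count, forks):
    """Expected number of execve() calls for `./execbomb <count> <forks>`.

    Assumption: derived from reading the simplified C program; every
    fork() is assumed to succeed.
    """
    initial = 1                      # the shell exec'ing the program itself
    if count <= 0:
        return initial               # execbomb() returns without exec'ing
    chains = max(forks, 0) + 1       # one chain per child, plus the parent
    return initial + chains * count


# 16 parallel chains of 250 execs each, plus the initial one.
print(expected_execs(250, 15))
```

Under this reading, 4000 chained execs on 16 parallel processes would correspond to `./execbomb 250 15`.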
execsnoop basically kprobes sys_execve. I did a quick experiment: I added an array of two elements, one incremented at the entry point and the other at the exit point. The entry count is exactly 4001 (one is the program itself), but the exit count is around 3500, so about 500 are lost. So the issue is not BPF or the perf output dropping samples, but the kretprobe sometimes not being triggered. I'm not sure what exactly is happening here. |
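The experiment described in this comment could look roughly like the sketch below. It is not the commenter's actual patch: the map name `counts`, the probe function names, and the helper names are my own, and it assumes a bcc installation that provides `get_syscall_fnname` and root privileges to attach the probes.

```python
import time

# BPF C program. Names are illustrative, not taken from the thread.
BPF_TEXT = r"""
#include <uapi/linux/ptrace.h>

// counts[0] = kprobe (entry) hits, counts[1] = kretprobe (exit) hits
BPF_ARRAY(counts, u64, 2);

int on_entry(struct pt_regs *ctx) {
    int key = 0;
    u64 *val = counts.lookup(&key);
    if (val)
        lock_xadd(val, 1);   // atomic add, safe across CPUs
    return 0;
}

int on_return(struct pt_regs *ctx) {
    int key = 1;
    u64 *val = counts.lookup(&key);
    if (val)
        lock_xadd(val, 1);
    return 0;
}
"""


def missed_returns(entries, returns):
    """Number of execve entries with no matching kretprobe hit."""
    return entries - returns


def count_execve_probes(seconds=30):
    """Attach both probes, wait, then return (entries, returns).

    Needs root and bcc; run the reproducer in another terminal meanwhile.
    """
    from bcc import BPF  # imported lazily so the helper above works without bcc

    b = BPF(text=BPF_TEXT)
    fn = b.get_syscall_fnname("execve")
    b.attach_kprobe(event=fn, fn_name="on_entry")
    b.attach_kretprobe(event=fn, fn_name="on_return")
    time.sleep(seconds)
    counts = b["counts"]
    return counts[0].value, counts[1].value
```

If the two counters diverge while no "Possibly lost X samples" message appears, the loss is happening at the kretprobe rather than in the perf buffer, which is the conclusion drawn in the comment.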
This appears to be a known issue. There is no simple solution, as the controlling kernel variable/field is not configurable.
|
Actually, maxactive is tunable. Basically, in bcc, we can generate an event like "r50:...". This will set maxactive to 50. I tried this example: twice the number of online CPUs is not enough, some samples are still missed. We could package something like kretprobe_maxactive=# in a parameter like |
Apparently the maxactive tunable feature is new in 4.12, so this would need to be documented as well. |
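For documentation purposes, the "r50:..." form mentioned in the thread follows the kernel's kprobe_events syntax, where a number right after `r` sets maxactive for that kretprobe. The sketch below only builds such a definition string. The function name `kretprobe_event`, the default group `kprobes`, and the `r_` event-name prefix are my own choices for illustration, and the 4096 upper bound is the maximum quoted elsewhere in this thread.

```python
KRETPROBE_MAXACTIVE_MAX = 4096  # maximum value quoted in this thread


def kretprobe_event(symbol, maxactive=0, group="kprobes", event=None):
    """Build a kprobe_events definition line for a kretprobe.

    maxactive == 0 omits the number, leaving the kernel default in place.
    """
    if not 0 <= maxactive <= KRETPROBE_MAXACTIVE_MAX:
        raise ValueError("maxactive must be between 0 and %d"
                         % KRETPROBE_MAXACTIVE_MAX)
    if event is None:
        event = "r_" + symbol        # hypothetical naming scheme
    prefix = "r" if maxactive == 0 else "r%d" % maxactive
    return "%s:%s/%s %s" % (prefix, group, event, symbol)


print(kretprobe_event("sys_execve", maxactive=50))
```

A line like this would be appended, as root, to the tracing `kprobe_events` file. Per the comment above, the maxactive part only works on kernels from 4.12 onwards.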
This was also discussed in #1072 and we added support for maxactive in gobpf in iovisor/gobpf#39. |
I suggest we close this issue, and open another one for adding the MAXACTIVE feature to the docs, and another one for adding MAXACTIVE support to the bcc API. |
#2224 fixes this. |
@anisse Is this solved? |
It still happens with the Python version of execsnoop (even when modified to raise maxactive to its maximum (4096) in the kretprobe), but I can't reproduce it with the C version in libbpf-tools. I think we can close this and advise people to use the libbpf-tools C version instead. |
Is it possible that I ran into this issue here? |
It turns out even execsnoop can miss some short-lived processes when they run in a highly parallel environment. I ran into this while building a tree version of execsnoop (for flamegraphs). I have a build system that has quite a few short-lived processes, and sometimes I'd miss a few processes, completely breaking the tree chain (which I've rewritten to get the ppid in-kernel).
I've built a simple program to illustrate the issue. It will run a given number of execve() calls over a given number of processors (it does this by forking).
Say you want to run this on a single CPU and generate 4000 execve() calls:
./execbomb 0 4000 0
In this case, it seems execsnoop.py -n execbomb catches all events. Now, run 17 parallel processes at a given time:
./execbomb 0 4000 17
I then lose about 274 out of 4000 events. This might need tuning on your hardware, and you can easily verify that the program works properly, like this:
(first execve() is the program itself).
Running it through strace slows it down enough that execsnoop catches all events.
Important point: in these examples, I never see the perf message "Possibly lost X samples". You could use this program to trigger it, though. Also, I always wait for execsnoop.py to finish its processing (by casually watching top). I have disabled the get_ppid() userland search, and also tried using a bigger perf buffer (page_cnt=512).
If you analyse the log, you might also see events where the arg submission logic fails, with either an empty command line or multiple command lines concatenated.
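As a rough way to analyse such a log, the sketch below counts captured events for one command name and flags those with an empty command line. It is not the verification command from the original report. It assumes execsnoop's default column layout (PCOMM, PID, PPID, RET, ARGS), and the function name `summarize` is hypothetical. Detecting concatenated command lines is left out because it depends on what the traced program's arguments look like.

```python
def summarize(lines, comm="execbomb"):
    """Return (captured, empty_args) for events whose PCOMM equals `comm`.

    Assumes whitespace-separated columns: PCOMM PID PPID RET [ARGS...].
    """
    captured = 0
    empty_args = 0
    for line in lines:
        fields = line.split(None, 4)   # keep ARGS as a single field
        if len(fields) < 4 or fields[0] != comm:
            continue                   # header, other commands, noise
        captured += 1
        if len(fields) == 4:           # no ARGS column at all
            empty_args += 1
    return captured, empty_args


sample = [
    "PCOMM            PID    PPID   RET ARGS",
    "execbomb         4242   4241     0 ./execbomb 0 4000 17",
    "execbomb         4243   4241     0",
    "ls               4300   1000     0 /bin/ls",
]
print(summarize(sample))
```

Comparing the captured count against the number of execve() calls the reproducer is expected to make (plus one for the program itself) gives the number of missed events.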