
eBPF profiling

Animesh Trivedi edited this page Sep 19, 2024 · 20 revisions

code and bookmarks

Ubuntu: sudo apt-get install bpfcc-tools (this installs all the *-bpfcc tools, so a tool XXX from https://github.com/iovisor/bcc/tree/master/examples/tracing becomes XXX-bpfcc once installed).


Compile bpftools from source for a particular kernel version

fatal error: readline/readline.h

sudo apt-get install libreadline-dev 
#Then 
  DESCEND runqslower
Couldn't find kernel BTF; set VMLINUX_BTF to specify its location.
make[1]: *** [Makefile:77: /home/animesh.trivedi/src/linux/tools/bpf/runqslower/.output//vmlinux.h] Error 1
make: *** [Makefile:122: runqslower] Error 2
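
The error message names the fix: the build needs a vmlinux that carries BTF and takes its path through VMLINUX_BTF. A minimal sketch, assuming the running kernel was built with CONFIG_DEBUG_INFO_BTF=y (which exposes /sys/kernel/btf/vmlinux). The helper name first_readable is made up for illustration, and the candidate paths passed to it are assumptions about where a BTF-bearing vmlinux may live on this machine:

```shell
# first_readable: hypothetical helper, prints the first argument that is
# a readable file and returns non-zero if none of them is.
first_readable() {
  for candidate in "$@"; do
    if [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}
```

Possible use, rerunning the same make with the variable set: make VMLINUX_BTF="$(first_readable /sys/kernel/btf/vmlinux "/boot/vmlinux-$(uname -r)")"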

Remove the pre-installed packages: https://github.com/iovisor/bcc/issues/3993#issuecomment-1228217609

apt purge bpfcc-tools libbpfcc python3-bpfcc
wget https://github.com/iovisor/bcc/releases/download/v0.25.0/bcc-src-with-submodule.tar.gz
tar xf bcc-src-with-submodule.tar.gz
cd bcc/
apt install -y python-is-python3
apt install -y bison build-essential cmake flex git libedit-dev   libllvm11 llvm-11-dev libclang-11-dev zlib1g-dev libelf-dev libfl-dev python3-distutils
apt install -y checkinstall
# From here on you can also follow the instructions linked below
mkdir build
cd build/
cmake -DCMAKE_INSTALL_PREFIX=/usr -DPYTHON_CMD=python3 ..
make
checkinstall

https://github.com/iovisor/bcc/blob/master/INSTALL.md#ubuntu---source

On Ubuntu 24 (make sure to use LLVM 18)

sudo apt install -y zip bison build-essential cmake flex git libedit-dev \
  libllvm18 llvm-18-dev libclang-18-dev python3 zlib1g-dev libelf-dev libfl-dev python3-setuptools \
  liblzma-dev libdebuginfod-dev arping netperf iperf libpolly-18-dev python-is-python3

Then clone and install

git clone https://github.com/iovisor/bcc.git
mkdir bcc/build; cd bcc/build
cmake ..
make
sudo make install

Tracing framework (choices, formats)

https://blogs.oracle.com/linux/post/taming-tracepoints-in-the-linux-kernel

# show available events 
sudo cat  /sys/kernel/debug/tracing/available_events
atr@cordova:~$ sudo ls -l  /sys/kernel/debug/tracing/events/ | wc -l 
151
atr@cordova:~$ sudo cat  /sys/kernel/debug/tracing/available_events | wc -l 
2618
# The two counts differ because events/ is a directory tree: events are grouped
# into one directory per subsystem, and each event has its own subdirectory with a format file
sudo cat  /sys/kernel/debug/tracing/events/xhci-hcd/xhci_setup_device/format
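
Since available_events lists one subsystem:event pair per line, the grouping can be checked by counting events per subsystem. A small sketch that reads such a listing from stdin; count_events_by_subsystem is a name made up here, not an existing tool:

```shell
# Count events per subsystem from "subsystem:event" lines on stdin.
# Output: "<count> <subsystem>", largest count first.
count_events_by_subsystem() {
  awk -F: 'NF == 2 { n[$1]++ } END { for (s in n) printf "%d %s\n", n[s], s }' |
    sort -k1,1nr -k2,2
}
```

Possible use: sudo cat /sys/kernel/debug/tracing/available_events | count_events_by_subsystem | head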

Some hints on how to compile the C/eBPF program directly

https://github.com/anakryiko/bpf-ringbuf-examples/tree/main

Get function "anything" histogram

I am taking the size as an example: see the bitehist.py file in the bcc GitHub repository. https://github.com/iovisor/bcc/blob/master/examples/tracing/bitehist.py

Get function execution time distribution

atr@f20u24:~/src/ebpf-probes-traces$ sudo /usr/share/bcc/tools//funclatency -d 10 memset_probe2
Tracing 1 functions for "memset_probe2"... Hit Ctrl-C to end.

     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 849777   |****                                    |
       512 -> 1023       : 7156535  |****************************************|
      1024 -> 2047       : 12235    |                                        |
      2048 -> 4095       : 69       |                                        |
      4096 -> 8191       : 1394     |                                        |
      8192 -> 16383      : 801      |                                        |
     16384 -> 32767      : 74       |                                        |
     32768 -> 65535      : 20       |                                        |
     65536 -> 131071     : 12       |                                        |
    131072 -> 262143     : 1        |                                        |
    262144 -> 524287     : 2        |                                        |
    524288 -> 1048575    : 3        |                                        |

avg = 572 nsecs, total: 4590811412 nsecs, count: 8021354

Detaching...
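
To read the tail out of such a power-of-two histogram, the counts of all buckets at or above a threshold can be added up. A rough sketch, assuming the output was saved or piped in and keeps the "low -> high : count" layout shown above; hist_count_at_least is a hypothetical helper, not a bcc tool:

```shell
# Sum the counts of all histogram buckets whose lower bound is >= $1.
# Expects lines shaped like: "  4096 -> 8191  : 1394  |***  |"
hist_count_at_least() {
  awk -v min="$1" '
    $2 == "->" && $4 == ":" && $1 + 0 >= min + 0 { sum += $5 }
    END { print sum + 0 }'
}
```

For the run above, the buckets from 4096 nsecs upward add up to 2307 calls.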

Kernel symbols that are not traceable or probe-able

List the kernel symbols, which are the candidates for probing:

less /proc/kallsyms

Some function names carry a .constprop suffix or a __pfx prefix. What these symbols mean: https://people.redhat.com/~jolawren/klp-compiler-notes/livepatch/compiler-considerations.html

What to do about them? https://github.com/iovisor/bcc/issues/4261

If the function is still not probe-able, changing its definition from static inline void to void would resolve this.

On my own OOT nullblk, this did work.
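
To see how many such compiler-generated names a given kernel has, /proc/kallsyms can be filtered on its symbol-name column. A small sketch; list_variant_symbols is a hypothetical helper name, and it only looks for the two patterns mentioned above (.constprop suffixes and __pfx_ prefixes):

```shell
# Print symbol names that contain ".constprop." or start with "__pfx_".
# Input lines follow the kallsyms layout: address type name [module]
list_variant_symbols() {
  awk '$3 ~ /\.constprop\./ || $3 ~ /^__pfx_/ { print $3 }'
}
```

Possible use: list_variant_symbols < /proc/kallsyms | wc -l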

[July 2024] Example sessions refresher

Tracking io_uring performance on tmpfs

dump CPU profiles with fio

sudo profile-bpfcc -p `pidof -d, fio` -F 99 10 &> fast_stack

How to get off-CPU time histograms and stacks

Histogram:

atr@u24clean:~/tmp$ sudo cpudist-bpfcc -O -p 6271 10 1 2>/dev/null
Tracing off-CPU time... Hit Ctrl-C to end.

     usecs               : count     distribution
         0 -> 1          : 100      |                                        |
         2 -> 3          : 112      |                                        |
         4 -> 7          : 20752    |                                        |
         8 -> 15         : 1342784  |****************************************|
        16 -> 31         : 12664    |                                        |
        32 -> 63         : 454      |                                        |
        64 -> 127        : 143      |                                        |
       128 -> 255        : 83       |                                        |
       256 -> 511        : 3        |                                        |
       512 -> 1023       : 1        |                                        |
atr@u24clean:~/tmp$ sudo cpudist-bpfcc -O -p 6290 10 1 2>/dev/null
Tracing off-CPU time... Hit Ctrl-C to end.

     usecs               : count     distribution
         0 -> 1          : 1298518  |****************                        |
         2 -> 3          : 3098098  |****************************************|
         4 -> 7          : 34802    |                                        |
         8 -> 15         : 7021     |                                        |
        16 -> 31         : 564      |                                        |
        32 -> 63         : 36       |                                        |
        64 -> 127        : 8        |                                        |
       128 -> 255        : 6        |                                        |
       256 -> 511        : 11       |                                        |
       512 -> 1023       : 1        |                                        |

CPU stack histograms

Here is an example for the fio process. The -d, option makes pidof use ',' as the delimiter between PIDs.

sudo profile-bpfcc -p `pidof -d, fio` -F 99 10 &> fast_stack

Workqueue dump

/usr/src/linux-6.9.0-atr-2024-07-05/tools/workqueue$ ./wq_dump.py 

A collection of system tools to benchmark Linux with eBPF

It seems that when perf is compiled from source it does not include eBPF tracepoint events. https://www.brendangregg.com/eBPF/linux_ebpf_support.png

Showing all supported tracepoint events

On node2, 5.17.59. Note that sudo gives a different list than a normal user.

zebin@node2:~$ sudo perf list sched:*

List of pre-defined events (to be used in -e):

  sched:sched_kthread_stop                           [Tracepoint event]
  sched:sched_kthread_stop_ret                       [Tracepoint event]
  sched:sched_kthread_work_execute_end               [Tracepoint event]
  sched:sched_kthread_work_execute_start             [Tracepoint event]
  sched:sched_kthread_work_queue_work                [Tracepoint event]
  sched:sched_migrate_task                           [Tracepoint event]
  sched:sched_move_numa                              [Tracepoint event]
  sched:sched_pi_setprio                             [Tracepoint event]
  sched:sched_process_exec                           [Tracepoint event]
  sched:sched_process_exit                           [Tracepoint event]
  sched:sched_process_fork                           [Tracepoint event]
  sched:sched_process_free                           [Tracepoint event]
  sched:sched_process_hang                           [Tracepoint event]
  sched:sched_process_wait                           [Tracepoint event]
  sched:sched_stat_blocked                           [Tracepoint event]
  sched:sched_stat_iowait                            [Tracepoint event]
  sched:sched_stat_runtime                           [Tracepoint event]
  sched:sched_stat_sleep                             [Tracepoint event]
  sched:sched_stat_wait                              [Tracepoint event]
  sched:sched_stick_numa                             [Tracepoint event]
  sched:sched_swap_numa                              [Tracepoint event]
  sched:sched_switch                                 [Tracepoint event]
  sched:sched_wait_task                              [Tracepoint event]
  sched:sched_wake_idle_without_ipi                  [Tracepoint event]
  sched:sched_wakeup                                 [Tracepoint event]
  sched:sched_wakeup_new                             [Tracepoint event]
  sched:sched_waking                                 [Tracepoint event]
zebin@node2:~$ sudo perf list syscalls:*

List of pre-defined events (to be used in -e):

  syscalls:sys_enter_accept                          [Tracepoint event]
  syscalls:sys_enter_accept4                         [Tracepoint event]
  syscalls:sys_enter_access                          [Tracepoint event]
  syscalls:sys_enter_acct                            [Tracepoint event]
  syscalls:sys_enter_add_key                         [Tracepoint event]
  syscalls:sys_enter_adjtimex                        [Tracepoint event]
  syscalls:sys_enter_alarm                           [Tracepoint event]
  syscalls:sys_enter_arch_prctl                      [Tracepoint event]
  syscalls:sys_enter_bind                            [Tracepoint event]
  syscalls:sys_enter_bpf                             [Tracepoint event]
  syscalls:sys_enter_brk                             [Tracepoint event]
  syscalls:sys_enter_capget                          [Tracepoint event]
...

Counting number of system calls per second

https://www.brendangregg.com/blog/2014-07-03/perf-counting.html

zebin@node2:~$ sudo perf stat -e 'syscalls:sys_enter_*' -a sleep 5 | awk '{sum+=$1}; END {print sum}'

 Performance counter stats for 'system wide':

                 3      syscalls:sys_enter_socket                                   
                 0      syscalls:sys_enter_socketpair                                   
                 0      syscalls:sys_enter_bind                                     
                 0      syscalls:sys_enter_listen                                   
                 1      syscalls:sys_enter_accept4                                   
                 0      syscalls:sys_enter_accept                                   
                 3      syscalls:sys_enter_connect                                   
                 0      syscalls:sys_enter_getsockname                                   
                 0      syscalls:sys_enter_getpeername                                   

It generates the output but does not summarize it: perf stat prints its counters to stderr, so the awk stage on stdout receives nothing.
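
A sketch of a summarizing step that works once stderr is merged into the pipe with 2>&1. It assumes the default perf stat text layout with the count in the first column; sum_perf_counts is a hypothetical helper name:

```shell
# Sum the event counts from perf stat text output on stdin.
# Thousands separators are dropped from the first column, and only lines
# whose first column is a plain integer are added up.
sum_perf_counts() {
  awk '{ gsub(/,/, "", $1) }
       $1 ~ /^[0-9]+$/ { sum += $1 }
       END { print sum + 0 }'
}
```

Possible use: sudo perf stat -e 'syscalls:sys_enter_*' -a sleep 5 2>&1 | sum_perf_counts
Dividing the result by the sleep duration (5 here) gives a per-second rate.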

https://kubernetes.io/blog/2017/12/using-ebpf-in-kubernetes/

https://lwn.net/Articles/740157/

System instrumentation

While reading bpftrace:

Setup
  • make sure headers are installed. The 5.12 kernel I compiled is missing headers.
atr@node1:~$ sudo bpftrace --version 
bpftrace v0.9.4
atr@node1:~$ which bpftrace
/usr/bin/bpftrace
atr@node1:~$ 

Example small run:

sudo bpftrace -e 'tracepoint:syscalls:sys_enter_nanosleep { printf("%s is sleeping.\n", comm); }'

The -e flag gives the program to execute. Programs follow an awk-like model: probe /filter/ { action }.

How to look for a probe
bpftrace -l '*sleep*'
How to look for the tracepoint signature?
atr@node1:~$ sudo bpftrace -lv tracepoint:syscalls:sys_enter_nanosleep  
tracepoint:syscalls:sys_enter_nanosleep
    int __syscall_nr;
    struct __kernel_timespec * rqtp;
    struct __kernel_timespec * rmtp;
atr@node1:~$ 

Question: where does comm come from? It is one of the bpftrace builtins, see: https://github.com/iovisor/bpftrace/blob/master/man/adoc/bpftrace.adoc#builtins

So the tracepoints have a clear signature and are well maintained. kprobes are not. There you need to look into the function signature and use that.

Including headers
bpftrace --include ./header.h 
bpftrace -I ./folder/
Filtering example with kprobe

Filter for file reads of a given size ("X" bytes), here exactly 512:

bpftrace -e 'kprobe:vfs_read /arg2 == 512/ { printf("%s small read: %d byte buffer\n", comm, arg2); }'

vfs_read signature for v5.12 kernel: https://elixir.bootlin.com/linux/v5.12.19/source/fs/read_write.c#L476

The third argument, count, is arg2 because argument numbering starts from 0; this is what the filter tests.

Now I want to filter on the process name, use the builtin comm name:

bpftrace -e 'kprobe:vfs_read /comm == "my_name"/ { printf("%s small read: %d byte buffer\n", comm, arg2); }'
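
The two filters can also be combined in one predicate with &&. A sketch that builds such a program from shell arguments so it can be reused for different processes and sizes; vfs_read_prog is a hypothetical helper, and the >= comparison is an illustrative choice rather than something taken from the runs above:

```shell
# Print a bpftrace program that reports vfs_read calls made by the process
# named $1 with a requested size (arg2) of at least $2 bytes.
vfs_read_prog() {
  comm_name=$1
  min_bytes=$2
  # Unquoted heredoc: shell variables expand, \n and %s stay literal.
  cat <<EOF
kprobe:vfs_read /comm == "$comm_name" && arg2 >= $min_bytes/ { printf("%s read: %d byte buffer\n", comm, arg2); }
EOF
}
```

Possible use: sudo bpftrace -e "$(vfs_read_prog fio 4096)"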

With tracepoints how to reference arguments

Use the args-> construct.

root@node1:/home/atr# bpftrace -lv tracepoint:syscalls:sys_enter_openat
tracepoint:syscalls:sys_enter_openat
    int __syscall_nr;
    int dfd;
    const char * filename;
    int flags;
    umode_t mode;
root@node1:/home/atr# 
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
Attaching 1 probe...
snmpd /proc/diskstats
snmpd /proc/stat
snmpd /proc/vmstat

Navigating structs as arguments

Include the header file

# cat path.bt
#include <linux/path.h>
#include <linux/dcache.h>

kprobe:vfs_open
{
	printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name));
}

# bpftrace path.bt
Attaching 1 probe...
open path: dev
open path: if_inet6
open path: retrans_time_ms
[...]

Links

Questions

  • What is the difference between bpftrace and bpftool? bpftool is missing on node1, don't know why.