
Bpftrace fixes for s390x #1241

Merged: 5 commits merged into master on Apr 13, 2020
Conversation

sumanthkorikkar
Contributor

This patch series adds register support for s390x. It also fixes various issues that can occur on big-endian architectures.

  1. bpftrace: Add s390x register support
  2. bpftrace: Fix sarg support for s390x and other architectures
  3. bpftrace: Fix type conversion in printf() builtin
  4. bpftrace: Fix histogram output & stat output for big-endian systems
  5. bpftrace: Add architecture specific testcase support
  6. bpftrace: Introduce htonl, htons, ntohl, ntohs functions
  7. bpftrace: Fix ntop builtin function
  8. bpftrace: Fix variable assignment of 32 bit integers

@sumanthkorikkar
Contributor Author

Please review; requesting to merge this into master.

Thank you.

@sumanthkorikkar
Contributor Author

Hello @fbs , @brendangregg @yonghong-song,

The last commit, "Fix variable assignment of 32 bit integers", causes a failure in tools-parsing-test.sh.

The other commits pass the tests.

A few questions:

  1. Should I create a new pull request without the last commit?
  2. Or should I create a pull request per commit, so that it can be reviewed individually?

Any suggestions would be great. Thank you.

Best Regards
Sumanth

@danobi
Member

danobi commented Apr 3, 2020

Hi @sumanthkorikkar, thanks for the PR. I'm skimming it and it seems reasonable (will look more closely on Monday).

Should I create a new pull request without the last commit?

It looks like it's hitting some kind of assertion. Do you think you can figure out what's going on?

If you wanna do partial merges of s390x support we can also go with a build flag that's off by default and fix the abort in another PR.

Or should I create a pull request per commit, so that it can be reviewed individually?

Usually we do it all in one PR, especially if the commits depend on each other. But see above if you wanna get most of the code in first so you don't have to keep fixing merge conflicts.

Other stuff:

  • can you look at the clang-format build job failures and apply the suggestions there?
  • might be a good time to decide what direction we wanna take static binaries. See inline comment

@sumanthkorikkar
Contributor Author

Hi @danobi

Hi @sumanthkorikkar, thanks for the PR. I'm skimming it and it seems reasonable (will look more closely on Monday).

OK. Thank you.

Should I create a new pull request without the last commit?

It looks like it's hitting some kind of assertion. Do you think you can figure out what's going on?

Sure. I will give it a try.

If you wanna do partial merges of s390x support we can also go with a build flag that's off by default and fix the abort in another PR.

OK. I will separate the s390 support and fix the abort in another PR.

Or should I create a pull request per commit, so that it can be reviewed individually?

Usually we do it all in one PR, especially if the commits depend on each other. But see above if you wanna get most of the code in first so you don't have to keep fixing merge conflicts.

Other stuff:

  • can you look at the clang-format build job failures and apply the suggestions there?

OK, sure. I will make it conform to clang-format and resend. Thanks.

  • might be a good time to decide what direction we wanna take static binaries. See inline comment

Ok.

@fbs
Contributor

fbs commented Apr 4, 2020

Thanks for working on this! :)

Haven't looked closely yet either, but just some remarks:


As for the helpers: having both 'short/i16' and 'long/i32' versions is a bit different from the way most helpers currently work, which usually look at the input argument and derive things from it. Instead of having 4 helpers, can't we have one that just flips the byte order? Looking at the change, that seems possible to do.

We're not 1.0 yet, but I'd rather not add any helpers that we don't need. If a good use case pops up we can always add more helpers (calls).


Re the current compiler issue:

bpftrace is not as strict with LLVM typing as LLVM wants us to be. While it "mostly works", you can run into weird issues like the crashes you're seeing now. A simple way to test that is running bpftrace -dd: if it shows the pre-optimization code but crashes during optimization, it is likely a typing issue.

In this case it is caused by:

    $dport = $sk->__sk_common.skc_dport;
    $dport = ntohs($dport);

Which compiles down to

  %"$dport" = alloca i16
...
  %100 = load i16, i16* %"struct sock_common.skc_dport"
...
  %111 = or i16 %108, %110
  %112 = zext i16 %111 to i64
  store i64 %112, i16* %"$dport"

Doing an i64 store into an i16* causes the issue in this case. As a fix:

]$ diff /vagrant/tools/tcpconnect.bt tcpconnect.bt
46c46
<     $dport = $sk->__sk_common.skc_dport;
---
>     $dport = (uint64) $sk->__sk_common.skc_dport;

Lately I've been fixing most of these (e.g #1189) with #1231 and #1194 still open but there probably still are some left.

In this case the issue is mostly with the semantic analyser, however. Your newly added functions have call.type set to i64, which apparently doesn't get checked properly during variable assignment, as it only appears to check the type:

1481:  else if (search->second.type != assignment.expr->type.type) {

I want to start working on improving on this (#878) once the current codegen issues are fixed but that will take a while.

For now it will probably work if you make the return type of your new helpers match the input type, instead of casting it to i64.


Can you add codegen tests for the new helpers (after we've decided what to do with them, see the first part of this comment)?


What are your thoughts about testing this and making sure it keeps working? Are there any CI platforms that do s390? At least being able to do semi-regular (daily, weekly) builds would be a huge help.

@fbs
Contributor

fbs commented Apr 4, 2020

As for splitting up, having the first 4 or 5 commits go in first would be easier.

@sumanthkorikkar
Contributor Author

Hi @fbs,

Thanks for reviewing.

Thanks for working on this! :)

Haven't looked closely yet either, but just some remarks:

As for the helpers: having both 'short/i16' and 'long/i32' versions is a bit different from the way most helpers currently work, which usually look at the input argument and derive things from it. Instead of having 4 helpers, can't we have one that just flips the byte order? Looking at the change, that seems possible to do.

Something like bswap() instead of hton*(), based on the input type? Let me try that.
I added these functions for user readability.
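A width-aware swap could look roughly like this Python sketch (illustrative only; the helper name bswap and its dispatch on the input width are assumptions, not the eventual bpftrace implementation):

```python
import struct

def bswap(value: int, width: int) -> int:
    """Reverse the byte order of `value`; `width` is 2, 4, or 8 bytes.

    One helper like this can stand in for htons/htonl/ntohs/ntohl:
    the swap is its own inverse, so hton* and ntoh* collapse into a
    single operation chosen by the input's size.
    """
    fmt = {2: "H", 4: "I", 8: "Q"}[width]
    # Pack as big-endian, unpack as little-endian to flip the bytes.
    return struct.unpack("<" + fmt, struct.pack(">" + fmt, value))[0]

print(hex(bswap(0x1234, 2)))      # 0x3412
print(hex(bswap(0xDEADBEEF, 4)))  # 0xefbeadde
```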

We're not 1.0 yet, but I'd rather not add any helpers that we don't need. If a good use case pops up we can always add more helpers (calls).

Re the current compiler issue:

bpftrace is not as strict with LLVM typing as LLVM wants us to be. While it "mostly works", you can run into weird issues like the crashes you're seeing now. A simple way to test that is running bpftrace -dd: if it shows the pre-optimization code but crashes during optimization, it is likely a typing issue.

In this case it is caused by:

    $dport = $sk->__sk_common.skc_dport;
    $dport = ntohs($dport);

Which compiles down to

  %"$dport" = alloca i16
...
  %100 = load i16, i16* %"struct sock_common.skc_dport"
...
  %111 = or i16 %108, %110
  %112 = zext i16 %111 to i64
  store i64 %112, i16* %"$dport"

Doing an i64 store into an i16* causes the issue in this case. As a fix:

]$ diff /vagrant/tools/tcpconnect.bt tcpconnect.bt
46c46
<     $dport = $sk->__sk_common.skc_dport;
---
>     $dport = (uint64) $sk->__sk_common.skc_dport;

Ok. Thanks for mentioning this.

Lately I've been fixing most of these (e.g #1189) with #1231 and #1194 still open but there probably still are some left.

In this case the issue is mostly with the semantic analyser, however. Your newly added functions have call.type set to i64, which apparently doesn't get checked properly during variable assignment, as it only appears to check the type:

1481:  else if (search->second.type != assignment.expr->type.type) {

I want to start working on improving on this (#878) once the current codegen issues are fixed but that will take a while.

For now it will probably work if you make the return type of your new helpers match the input type, instead of casting it to i64.

Ok. I will check.

Can you add codegen tests for the new helpers (after we've decided what to do with them, see the first part of this comment)?

I will add these. But the current codegen tests are specific to little endian: the [llvm/*] datalayout is different for big endian. I will think about this and see how we can improve the codegen tests for both architectures.

What are your thoughts about testing this and making sure it keeps working? Are there any CI platforms that do s390? At least being able to do semi-regular (daily, weekly) builds would be a huge help.

OK. I need to check with my colleagues whether there are CI platforms for s390.

Thank you

@sumanthkorikkar
Contributor Author

Hi @fbs,

What are your thoughts about testing this and making sure it keeps working? Are there any CI platforms that do s390? At least being able to do semi-regular (daily, weekly) builds would be a huge help.

I have added bpftrace runtime-tests to our internal CI environment as of now. So, we run daily builds and runtime-tests.

@sumanthkorikkar
Contributor Author

Hi @fbs, @danobi

I have made changes to the initial 5 commits. The other commits (hton* and later) will be handled in another PR. Please merge this into master if it looks good. Thank you.

src/arch/ppc64.cpp
@@ -3,6 +3,11 @@
#include <algorithm>
#include <array>

// For little endian 64 bit, sp + 32 + 8 regs save area + argX
#define ARG0_STACK_LE 96
Member

I can't seem to find where in the spec LE and BE are different for stack frame setup. Can you point me to it?

Contributor Author

PowerPC64, little endian: see the Power Architecture 64-Bit ELF V2 ABI Specification
https://members.openpowerfoundation.org/document/dl/576

  • See parameter save area
    • When the caller allocates the Parameter Save Area, it will always be automatically quadword aligned because it must always start at SP + 32. It shall be at least 8 doublewords in length. If a function needs to pass more than 8 doublewords of arguments, the Parameter Save Area shall be large enough to spill all register-based parameters and to contain the arguments that the caller stores in it.
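The 96-byte ARG0_STACK_LE constant follows directly from that layout; a quick sketch of the arithmetic, with the values taken from the ABI text quoted above:

```python
# ELFv2 (ppc64le) layout per the quoted ABI text: the Parameter Save
# Area starts at SP + 32 and holds at least 8 doublewords for spilled
# register-based parameters, so the first stack argument sits past it.
SP_TO_PARAM_SAVE_AREA = 32
DOUBLEWORD = 8
REG_SAVE_DOUBLEWORDS = 8

ARG0_STACK_LE = SP_TO_PARAM_SAVE_AREA + REG_SAVE_DOUBLEWORDS * DOUBLEWORD
print(ARG0_STACK_LE)  # 96
```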

Member

@danobi danobi Apr 9, 2020

Hm, I'm a little confused. It seems like the ABI changed starting at Power ISA 2.07B. So why is only LE using the new ABI?

Won't block this PR on these questions. Fixing the sarg issue is a big win already.

src/arch/s390.cpp
tests/runtime/engine/parser.py (outdated)
@danobi
Member

danobi commented Apr 8, 2020

I have added bpftrace runtime-tests to our internal CI environment as of now. So, we run daily builds and runtime-tests.

Sounds good. If you can find a way to set it up in the open source CI, it would help us prevent future breakages at merge time. Not sure if it would be less work in the long run (depends on how much of our stuff implicitly depends on endianness, I guess).

@sumanthkorikkar
Contributor Author

sumanthkorikkar commented Apr 9, 2020

I have added bpftrace runtime-tests to our internal CI environment as of now. So, we run daily builds and runtime-tests.

Sounds good. If you can find a way to set it up in the open source CI, it would help us prevent future breakages at merge time. Not sure if it would be less work in the long run (depends on how much of our stuff implicitly depends on endianness, I guess).

Hi @danobi , @fbs

Yes, that would be great for the community. One of my colleagues referred me to this, so it should be possible:
https://blog.travis-ci.com/2019-11-12-multi-cpu-architecture-ibm-power-ibm-z

I can try this. But I would need some input from you about the requirements (OS requirements, services, scripts) to run it:

  • Some Docker scripts and the Travis YAML file which are currently used to run bpftrace
  • The approach above seems to use LXD containers instead of Docker; we need to verify whether we can run bpftrace as root.

OK. I will have a look at the build scripts available here. Thanks.

@danobi
Member

danobi commented Apr 9, 2020

But I would need some input from you about the requirements (OS requirements, services, scripts) to run it:

I don't know if we have any hard requirements other than it functioning as expected. I don't think any of us are too picky about how tests get run, just that they run somehow. We have GitHub Actions and Travis CI at our disposal.

The approach above seems to use LXD containers instead of Docker; we need to verify whether we can run bpftrace as root.

This seems like a pain. We also have GitHub Actions set up. I looked around and found this: https://github.com/marketplace/actions/run-on-architecture . It looks much easier to set up than LXD. It seems to be powered by https://github.com/multiarch/qemu-user-static , which looks quite powerful.

@fbs
Contributor

fbs commented Apr 9, 2020

Looks mostly good to me. I'm not that sure about "bpftrace: Fix type conversion in printf() builtin". Using cast_type probably works OK in most cases, but structs are casts and casts are casts (obviously), and not all types are 8 bytes; e.g. cpid is 4 bytes.

I think the better way to fix this is to get rid of that resize the semantic analyser does and instead make sure everything is properly stored, which is what I want to do with #1231.

I think the best we can do is to accept it for now and revert it later (e.g. in #1231).

@sumanthkorikkar
Contributor Author

But I would need some input from you about the requirements (OS requirements, services, scripts) to run it:

I don't know if we have any hard requirements other than it functioning as expected. I don't think any of us are too picky about how tests get run, just that they run somehow. We have GitHub Actions and Travis CI at our disposal.

The approach above seems to use LXD containers instead of Docker; we need to verify whether we can run bpftrace as root.

This seems like a pain. We also have GitHub Actions set up. I looked around and found this: https://github.com/marketplace/actions/run-on-architecture . It looks much easier to set up than LXD. It seems to be powered by https://github.com/multiarch/qemu-user-static , which looks quite powerful.

Hi @danobi,

My findings:

  1. I gave qemu-user-static a try. I tried running the bpftrace command on s390x/fedora with a privileged-mode docker run.
     • Run: docker run -t -v /usr/src:/usr/src:ro -v /lib/modules/:/lib/modules:ro -v /sys/kernel/debug/:/sys/kernel/debug:rw --net=host --pid=host --privileged -t test-demo
     • The Dockerfile contains the setup to run a sys_open_read() probe
     • Starting the script:
       Error creating printf map: Function not implemented
       Creation of the required BPF maps has failed.
       Make sure you have all the required permissions and are not confined (e.g. like
       snapcraft does). dmesg will likely have useful output for further troubleshooting
       ##[error]Process completed with exit code 1.
     • Compilation is quite slow and may time out when running the runtime tests.
     • Related issue: multiarch/qemu-user-static#17. Tried this, but got the same error message.

  2. However, the same works without qemu-user-static on x86.

  3. Using LXD containers may not be an option, as they do not provide access to /sys/kernel/debug:rw, which bpftrace needs.

  4. We should discuss other options with the team, but this may need some time.

Add s390x-specific registers. This is needed for bpftrace builtins like argX,
regs(), retval, etc. This commit provides various functions to perform proper
offset calculation from the pt_regs context. The builtin functions of
bpftrace use these offset functions to generate the proper bytecode for s390x.

Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Introduce an arg_stack_offset() function and define it for each
architecture. When the number of arguments passed to a function is
greater than the number of available argument registers, the
additional parameters are stored on the stack. This function calculates
the argument stack offset for each architecture. This should fix the
sarg builtin support for s390x/powerpc64/aarch64. It should also be
noted that only 64-bit architectures are supported.

Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
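The idea behind arg_stack_offset() can be sketched as follows (a hypothetical illustration: the 8 argument registers and 96-byte base are example values in the spirit of ppc64le, not the exact constants from the patch):

```python
def arg_stack_offset(arg_num: int, num_arg_regs: int, arg0_stack: int,
                     slot_size: int = 8) -> int:
    """Stack offset of 0-based argument `arg_num` on a 64-bit ABI.

    Once the register-passed arguments run out, each further argument
    occupies one 8-byte slot past the architecture's base stack offset.
    """
    if arg_num < num_arg_regs:
        raise ValueError("argument is passed in a register, not on the stack")
    return arg0_stack + (arg_num - num_arg_regs) * slot_size

# e.g. the 10th argument (index 9) with 8 register arguments and a
# 96-byte base offset:
print(arg_stack_offset(9, 8, 96))  # 104
```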
LLVM might change the size used to store the printf arguments on the
stack according to BPF alignment requirements. This resulted in the
bpftrace printf() builtin interpreting the values incorrectly. To
avoid this, in commit df51b11 ("promote scalar printf args to 64-bits"),
all printf integer args were promoted to 64 bits. However, this
causes issues when performing type conversion on big-endian systems.

On big-endian systems:

1. Current state:
bpftrace -e 'BEGIN { printf("%d", (uint16)0x1234); exit() }'
Returns 0

Expected result:
0x1234

2. Current state: On big-endian machines, this is interpreted as:
Address: Value
0x0 : 12
0x1 : 34
---------
0x2 : 00
0x3 : 00
0x4 : 00
0x5 : 00
0x6 : 00
0x7 : 00

format() will interpret it from 0x6-0x7 and print the result as 0. Hence
the result is always zero.

Expected state: On big-endian machines:
Address: Value
0x0 : 00
0x1 : 00
0x2 : 00
0x3 : 00
0x4 : 00
0x5 : 00
0x6 : 00
0x7 : 12
0x8 : 34

format() will print 0x1234.

Now let's see the bytecode:

0: (bf) r6 = r1
1: (b7) r1 = 0
2: (7b) *(u64 *)(r10 -8) = r1
last_idx 2 first_idx 0
regs=2 stack=0 before 1: (b7) r1 = 0
3: (b7) r2 = 4660
4: (6b) *(u16 *)(r10 -8) = r2   <<<<<<< The uppermost bytes 0x0-0x1 are
set to 0x1234, which is OK for little endian but not for big endian.
5: (7b) *(u64 *)(r10 -16) = r1
6: (18) r7 = 0xab6d0c00
8: (85) call bpf_get_smp_processor_id#8
9: (bf) r4 = r10
10: (07) r4 += -16   <<<< arg_data.offset for calls + arg_data args.
11: (bf) r1 = r6
12: (bf) r2 = r7
13: (bf) r3 = r0
14: (b7) r5 = 16
15: (85) call bpf_perf_event_output#25
last_idx 15 first_idx 0
regs=20 stack=0 before 14: (b7) r5 = 16
16: (b7) r0 = 0
17: (95) exit

bpftrace also maintains arg.type.cast_type information. Utilize this
information to correctly print the arguments on big-endian
architectures.

Test result:
./bpftrace  -e  'BEGIN { printf("%x", (uint8)-1); exit()  }'
Attaching 1 probe...
ff

Fixes: bpftrace#47
Fixes: df51b11 ("promote scalar printf args to 64-bits")

Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
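The byte layouts described in this commit message can be reproduced on the host with Python's struct module (a sketch of the effect only, not the BPF code itself):

```python
import struct

val = 0x1234  # the (uint16) literal from the example above

# Buggy pattern: a u16 store into the start of a zeroed 8-byte slot.
# On a big-endian host the two meaningful bytes land in the *high*
# end of the u64, so reading the slot as a u64 yields the wrong value.
buggy_be = struct.pack(">H", val) + b"\x00" * 6
print(hex(struct.unpack(">Q", buggy_be)[0]))  # 0x1234000000000000

# Fixed pattern: promote to u64 before the store, so the value occupies
# the low-order bytes of the slot regardless of endianness.
fixed_be = struct.pack(">Q", val)
print(hex(struct.unpack(">Q", fixed_be)[0]))  # 0x1234
```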
1. bpftrace -e 'kretprobe:vfs_read { @=hist(512); exit(); }'
2. The bpftrace histogram adds the value 512 to the wrong bucket.
- On big-endian systems, the output value is always added to the 0th
bucket instead of the log2(512) bucket. The output could be wrong
even on little-endian systems, when the bucket number goes beyond 255.
Avoid this by properly reading 8 bytes of data instead of a 1-byte char.
- Also avoid 32-bit overflow of the sum.

All histogram- and stats-related testcases are now OK.

Test result:
./bpftrace -e 'kretprobe:vfs_read { @[1]=hist(512); @[1]=hist(512);
@[1]=hist(1024); exit(); }'
Attaching 1 probe...

@[1]:
[512, 1K)              2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@|

Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
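The 1-byte-vs-8-byte read can also be sketched on the host (the key layout here is an assumption for illustration, not the exact bpftrace map structures):

```python
import struct

def log2_bucket(value: int) -> int:
    """Histogram bucket index: floor(log2(value)), with 0 mapped to bucket 0."""
    return value.bit_length() - 1 if value > 0 else 0

bucket = log2_bucket(512)           # 512 belongs in bucket 9
key_be = struct.pack(">Q", bucket)  # bucket index as a u64 on a big-endian host

broken = key_be[0]                        # old 1-byte read: first byte is 0 on BE
correct = struct.unpack(">Q", key_be)[0]  # full 8-byte read

print(bucket, broken, correct)  # 9 0 9
```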
- Add ARCH support for the test tool
- Fix architecture specific testcases

Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
@sumanthkorikkar
Contributor Author

Looks mostly good to me. I'm not that sure about "bpftrace: Fix type conversion in printf() builtin". Using cast_type probably works OK in most cases, but structs are casts and casts are casts (obviously), and not all types are 8 bytes; e.g. cpid is 4 bytes.

I think the better way to fix this is to get rid of that resize the semantic analyser does and instead make sure everything is properly stored, which is what I want to do with #1231.

I think the best we can do is to accept it for now and revert it later (e.g. in #1231).

Ok. Thanks

Member

@danobi danobi left a comment

We can keep the CI conversation going in parallel. I think this is good to merge. I'll merge tomorrow if there aren't any new comments.

@fbs
Contributor

fbs commented Apr 13, 2020

+1. Let's get the 0.10 release out first and merge this first thing after.

@fbs fbs added this to the 0.11.0 milestone Apr 13, 2020
@danobi danobi merged commit 48d4051 into bpftrace:master Apr 13, 2020
@danobi danobi mentioned this pull request Jun 10, 2021