Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cranelift: Add Bswap instruction (#1092) #5147

Merged
merged 1 commit into from
Oct 31, 2022

Conversation

11evan
Copy link
Contributor

@11evan 11evan commented Oct 27, 2022

Adds Bswap to the Cranelift IR. Implements the Bswap instruction in the x64 and aarch64 codegen backends. Cranelift users can now:

builder.ins().bswap(value)

to get a native byteswap instruction.

  • x64: implements the 32- and 64-bit bswap instruction, following the pattern set by similar unary instrutions (Neg and Not) - it only operates on a dst register, but is parameterized with both a src and dst which are expected to be the same register.

As x64 bswap instruction is only for 32- or 64-bit registers, the 16-bit swap is implemented as a rotate left by 8.

  • aarch64: Bswap gets emitted as aarch64 rev16, rev32, or rev64 instruction as appropriate.

  • s390x: Bswap not implemented

  • For completeness, added bswap to the interpreter as well.

@11evan 11evan mentioned this pull request Oct 27, 2022
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. labels Oct 27, 2022
Copy link
Member

@cfallin cfallin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks -- this is really nice to have!

I have a few style comments below but nothing crucial.

At the top level, a little more testing might be nice too. Can we have at least a runtest (cranelift/filetests/filetests/runtests/bswap.clif maybe) with some interesting value cases, and the appropriate targets (test run with x86-64, aarch64, and test interpret)?

// expect:
// 1101_1010_110 | 0_0000_0000_01 | 01_011 | 0_0010
// which is little endian:
// 0110_0010_0000_0101_1100_0000_1101_1010
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The little-endian bit literal here is kind of confusing (it's little-endian but MSB first); if we need an explanatory comment can we convert the above to a u32 (0xdac00562) or maybe just remove this comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments were just to help me out when writing the test cases; I've removed them now

// expect:
// 1101_1010_110 | 0_0000_0000_11 | 01_010 | 0_0001
// which is little endian:
// 0100_0001_0000_1101_1100_0000_1101_1010
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here as above


// BSWAP reg32 is (REX.W==0) 0F C8
// BSWAP reg64 is (REX.W==1) 0F C8
let mut rex = match size {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use RexFlags here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good idea, changed to use RexFlags in update, and gave it a new emit_one_op fn to go along with the existing emit_two_op and emit_three_op

@@ -86,6 +86,9 @@ pub trait Value: Clone + From<DataValue> {
fn leading_zeros(self) -> ValueResult<Self>;
fn trailing_zeros(self) -> ValueResult<Self>;
fn reverse_bits(self) -> ValueResult<Self>;

// Byteswap
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for the blank line and comment here, I think...

Copy link
Contributor

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! This is impressively thorough.

I notice there aren't any lowerings for 128-bit bswap. They'd be relatively easy to write (bswap each 64-bit half, and also swap the two registers), but it's also okay to not support that yet.

It would be best to also add tests in cranelift/filetests/filetests/isa/ and cranelift/filetests/filetests/runtests/ that cover this new instruction. But again, I think we could merge this without those.

"#,
&formats.unary,
)
.operands_in(vec![x])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should exclude 8-bit integers for this instruction. There's no sensible implementation of byte-swapping when there's only one byte.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, excluded 8-bit and also 128-bit since the latter is unimplemented for now

@uweigand
Copy link
Member

s390x: Bswap not implemented

We do have bswap instructions, and they're already implemented in the backend: for $I32 and $I64, bswap_reg can be used; for $I16, a 32-bit bswap followed by a 16-bit right shift is generally the most efficient solution.

See e.g. the use as part of the bitrev implementation:

(rule (bitrev_bytes $I16 x) (lshr_imm $I32 (bswap_reg $I32 x) 16))
(rule (bitrev_bytes $I32 x) (bswap_reg $I32 x))
(rule (bitrev_bytes $I64 x) (bswap_reg $I64 x))

@afonso360
Copy link
Contributor

This is awesome! Thanks! 🎉

I added this to our fuzzer (Feel free to add that to this PR if you want!) and It's been running for over an hour on aarch64 without any issues, however on x86 it reported an interesting failure.

I've tried to minimize this as much as I could but it sometimes throws weird results.

test interpret
test run
target x86_64

function %a() -> i8, i32, i64 {
block0:
    v5 = iconst.i64 0x9903_5204_d05f_abab
    v6 = bswap v5
    v7 = iconst.i8 0
    v8 = iconst.i32 0
    return v7, v8, v6
}

; run: %a() == [0, 0, -6076657925176032359]

You can run this testcase from the cranelift directory with cargo run -- test ./the-above.clif

Removing the v7 or v8 makes the test pass, which is interesting... but otherwise the test segfaults on my machine, most of the time but not always.

Here is the disassembly, which you can get with cargo run -- compile -D --target x86_64 ./the-above.clif:

Disassembly of 29 bytes:
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   49 bb ab ab 5f d0 04 52 03 99
                                movabs  r11, 0x99035204d05fabab
   e:   4c 0f cb                bswap   rbx
  11:   31 c0                   xor     eax, eax
  13:   31 d2                   xor     edx, edx
  15:   4c 89 1f                mov     qword ptr [rdi], r11
  18:   48 89 ec                mov     rsp, rbp
  1b:   5d                      pop     rbp
  1c:   c3                      ret
Here's a more reliable example that *always* fails with a wrong result, but never segfaults.
test interpret
test run
target x86_64

function %a(f32, f64, i32, i32, f64) -> i8, i32, i64 {
block0(v0: f32, v1: f64, v2: i32, v3: i32, v4: f64):
    v5 = iconst.i64 0x9903_5204_d05f_abab
    v6 = bswap v5
    v7 = iconst.i8 0
    v8 = iconst.i32 0
    return v7, v8, v6
}

; run: %a(0.0, 0.0, 0, 0, 0.0) == [0, 0, -6076657925176032359]

Here's the disassembly of the above function

Disassembly of 32 bytes:
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   49 89 d1                mov     r9, rdx
   7:   49 b8 ab ab 5f d0 04 52 03 99
                                movabs  r8, 0x99035204d05fabab
  11:   4c 0f c8                bswap   rax
  14:   31 c0                   xor     eax, eax
  16:   31 d2                   xor     edx, edx
  18:   4d 89 01                mov     qword ptr [r9], r8
  1b:   48 89 ec                mov     rsp, rbp
  1e:   5d                      pop     rbp
  1f:   c3                      ret

@11evan
Copy link
Contributor Author

11evan commented Oct 27, 2022

Thanks for the feedback everyone! I'll work on addressing everything, including adding runtests and investigating the fuzzer failure

@11evan
Copy link
Contributor Author

11evan commented Oct 28, 2022

The x86_64 issue the fuzzer found was an incorrect REX encoding - I misunderstood the manual and was using REX.R instead of REX.B when encoding bswap access to r8-r15. Fixed locally and added those cases to the runtests, they pass now.

I'll have an update out sometime in the next couple days once I address the other feedback and add s390x support

@11evan 11evan force-pushed the bswap branch 2 times, most recently from 66833f7 to 7f2a823 Compare October 28, 2022 23:32
@11evan
Copy link
Contributor Author

11evan commented Oct 28, 2022

PR updated:

  • fixed REX encoding bug found by fuzzer
  • added runtests
  • added s390x support
  • other cleanups from feedback (remove extraneous comments, exclude 8-bit integers, use x64 RexFlags type, add bswap to fuzzgen)

let iSwappable = &TypeVar::new(
"iSwappable",
"A multi byte scalar integer type",
TypeSetBuilder::new().ints(16..64).build(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for removing the 8-bit case! We've generally kept support for types which make sense even if there's no backend support, though, so I would put the 128-bit case back in. If nothing else, that lets us have a runtests/i128-bswap.clif that validates the interpreter (test interpret) even though we can't yet do test run on any target.

There are quite a few instructions that don't have 128-bit support on some backends yet, but it's good to have things in place to show what those instructions are expected to do for whenever somebody gets around to implementing them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sounds good! Added back 128-bit and created an interpreter test. I didn't actually implement any 128-bit swaps in the backends, left that for the future.

Copy link
Contributor

@afonso360 afonso360 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with the 128 version added!

I ran this through the fuzzer this morning and no further issues popped up! 🥳

Edit: To clarify, I don't think we need the i128 version now, we can add it in a later PR but we should legalize the instruction.

@uweigand
Copy link
Member

s390x parts LGTM, thanks for adding this. (Seeing the discussion above, if consensus is to add I128 support, you can likewise copy the 128-bit byteswap implementation via vector permute from bitrev.)

Adds Bswap to the Cranelift IR. Implements the Bswap instruction
in the x64 and aarch64 codegen backends. Cranelift users can now:
```
builder.ins().bswap(value)
```
to get a native byteswap instruction.

* x64: implements the 32- and 64-bit bswap instruction, following
the pattern set by similar unary instrutions (Neg and Not) - it
only operates on a dst register, but is parameterized with both
a src and dst which are expected to be the same register.

As x64 bswap instruction is only for 32- or 64-bit registers,
the 16-bit swap is implemented as a rotate left by 8.

Updated x64 RexFlags type to support emitting for single-operand
instructions like bswap

* aarch64: Bswap gets emitted as aarch64 rev16, rev32,
or rev64 instruction as appropriate.

* s390x: Bswap was already supported in backend, just had to add
a bit of plumbing

* For completeness, added bswap to the interpreter as well.

* added filetests and runtests for each ISA

* added bswap to fuzzgen, thanks to afonso360 for the code there

* 128-bit swaps are not yet implemented, that can be done later
@11evan
Copy link
Contributor Author

11evan commented Oct 31, 2022

PR Updated - legalized 128-bit bswap, added runtest for interpreter. Didn't actually implement 128-bit swap in any of the backends yet.

Copy link
Contributor

@jameysharp jameysharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! Thanks so much for this.

@jameysharp jameysharp enabled auto-merge (squash) October 31, 2022 17:28
@11evan
Copy link
Contributor Author

11evan commented Oct 31, 2022

Looks like the CI run failed at verify-publish with:

Run cd /opt/hostedtoolcache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
[...]
100   212  100   212    0     0     40      0  0:00:05  0:00:05 --:--:--    54

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Error: Process completed with exit code 2.

It looks like an issue with the tooling rather than an issue with my commit - what should I do here, does it just need a retry?

@jameysharp jameysharp merged commit 4ca9e82 into bytecodealliance:main Oct 31, 2022
@jameysharp
Copy link
Contributor

Apparently it did just need a retry. Merged now! 🎉

@11evan 11evan deleted the bswap branch November 4, 2022 00:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants