Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize memset and memclr for ARM #164

Merged
merged 2 commits into from
Jul 1, 2017
Merged

optimize memset and memclr for ARM #164

merged 2 commits into from
Jul 1, 2017

Conversation

japaric
Copy link
Member

@japaric japaric commented May 26, 2017

This commit optimizes those routines by rewriting them in assembly and
performing the memory copying in 32-bit chunks, rather than in 8-bit chunks
as it was done before this commit. This assembly implementation is
compatible with the ARMv6 and ARMv7 architectures.

This change results in a reduction of runtime of about 40-70% in all cases
that matter (the compiler will never use these intrinsics for sizes smaller
than 4 bytes). See data below:

Bytes HEAD this PR diff
0 6 14 +133.3333%
1 10 13 +30%
2 14 13 -7.1429%
3 18 13 -27.77%
4 24 21 -12.5%
16 70 36 -48.5714%
64 263 97 -63.1179%
256 1031 337 -67.3133%
1024 4103 1297 -68.389%

All times are in clock cycles. The measurements were done on a Cortex-M3
processor running at 8 MHz using the technique described here.


For relevance all pure Rust programs for Cortex-M microcontrollers use memclr to
zero the .bss during startup so this change results in a quicker boot time.

Some questions / comments:

  • the original code (it had a bug) comes from this repo and it's licensed
    under the ICS license. I have preserved the copyright and license text in the
    source code. IANAL, is that OK?
    no longer applies. The intrinsics are written in Rust now.

  • I don't know whether this ARM implementation works for ARMv4 or ARMv5.
    @FenrirWolf and @Uvekilledkenny may want to take look at it first.
    no longer applies. The intrinsics are written in Rust now.

  • No idea whether this implementation works on processors that have no thumb
    instruction set. The current implementation uses 16-bit thumb instructions.
    no longer applies. The intrinsics are written in Rust now.

  • The loop code can be rewritten in less instructions but using 32-bit thumb
    instructions. That 32-bit version would only work on ARMv7 though. I have yet
    to check whether that makes any difference in the runtime of the intrinsic.
    no longer applies. The intrinsics are written in Rust now.

  • I'll look into memcpy4 next. done

@japaric
Copy link
Member Author

japaric commented May 26, 2017

The loop code can be rewritten in less instructions but using 32-bit thumb
instructions.

I changed this:

1:  @ word-wise copy loop
    str  r2, [r0]
    adds r0, #4
    subs r1, #4
    bhs  1b

into this

1:  @ word-wise copy loop
    str r2, [r0], #4
    subs r1, #4
    bhs  1b

It made no difference whatsoever. I get the same number of cycles in the 1024 and 256 bytes cases.


#[test]
fn memclr4() {
let mut xs = [0u8; 8];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ABI requires these to be aligned to 4 bytes, but your u8 array may be unaligned. Some with all the others instances in the other tests.

@Amanieu
Copy link
Member

Amanieu commented May 26, 2017

We need proper support for target_features on ARM so that we can specialize these for the target architecture. For example, non-Thumb ARMv5 onwards should probably make use of the ldrd and strd instructions.

@japaric
Copy link
Member Author

japaric commented May 26, 2017

Pushed a memcpy implementation. Speedups below:

Bytes HEAD this PR diff
0 6 10 +66.6667%
1 12 16 +33.3333%
2 18 24 +33.3333%
3 24 32 +33.3333%
4 30 20 -33.3333%
16 102 50 -50.9804%
64 390 170 -56.4103%
256 1796 650 -63.8085%
1024 7172 2570 -64.1662%

Times are in clock cycles. Smaller is better.

@Amanieu
Copy link
Member

Amanieu commented May 26, 2017

I don't know whether this ARM implementation works for ARMv4 or ARMv5.

It should work fine for ARMv5 and ARMv4T. It won't work on plain ARMv4 because it doesn't support the bx instruction.

src/arm.rs Outdated
str r2, [r0]
adds r0, #4
subs r1, #4
bhs 1b

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saw your RFC on IRC earlier, I think you can make this a little tighter for cores that support postfix operation. I wrote this ( https://gist.github.com/ppannuto/672328eb8184abdb9559 ) a little while back for a tight init loop, comparing:

str r2, [r0] (2 cycle)
adds r0, #4 (1 cycle)
subs r1, #4 (1 cycle)
bhs 1b (~1 cycle)

-> 5 cycles / loop

with stmia:

WriteZeroToBssStart:
  ldr r0, =_sbss
  ldr r1, =_ebss
  movs r2, #0
  b WriteZeroToBssEnterLoop
WriteZeroToBssLoop:
  stmia r0!, {r2}  <-- 2 cycles
WriteZeroToBssEnterLoop:
  cmp r0, r1  <-- 1 cycle
  bcc WriteZeroToBssLoop  <-- 1 cycle

-> 4 cycles / loop

Using cycle counts from 3.3.1 in http://infocenter.arm.com/help/topic/com.arm.doc.ddi0439b/DDI0439B_cortex_m4_r0p0_trm.pdf rather than measured, but I think this should save a cycle

@japaric
Copy link
Member Author

japaric commented May 27, 2017

I tested cross compiling a small program that uses this memset4 for ARMv4 and ARMv5 and it seems to link fine:

thumbv4t-none-eabi

0000801c <__aeabi_memclr4>:
    801c:       e0322002        eors    r2, r2, r2

00008020 <__aeabi_memset4>:
    8020:       e2511004        subs    r1, r1, #4
    8024:       3a000008        bcc     804c <__aeabi_memset4+0x2c>
    8028:       e1b02c02        lsls    r2, r2, #24
    802c:       e1b03422        lsrs    r3, r2, #8
    8030:       e1922003        orrs    r2, r2, r3
    8034:       e1b03822        lsrs    r3, r2, #16

armv5te-none-eabi

0000801c <__aeabi_memclr4>:
    801c:       e0322002        eors    r2, r2, r2

00008020 <__aeabi_memset4>:
    8020:       e2511004        subs    r1, r1, #4
    8024:       3a000008        bcc     804c <__aeabi_memset4+0x2c>
    8028:       e1b02c02        lsls    r2, r2, #24
    802c:       e1b03422        lsrs    r3, r2, #8
    8030:       e1922003        orrs    r2, r2, r3
    8034:       e1b03822        lsrs    r3, r2, #16

thumbv6m-none-eabi (for comparison)

00008010 <__aeabi_memclr4>:
    8010:       4052            eors    r2, r2

00008012 <__aeabi_memset4>:
    8012:       3904            subs    r1, #4
    8014:       d308            bcc.n   8028 <__aeabi_memset4+0x16>
    8016:       0612            lsls    r2, r2, #24
    8018:       0a13            lsrs    r3, r2, #8
    801a:       431a            orrs    r2, r3
    801c:       0c13            lsrs    r3, r2, #16

ARMv4 and ARMv5 end up using 32-bit (ARM?) instructions instead of 16-bit instructions but other than that the disassemblies look the same.

@ppannuto Nice, I'll test that but the implementation we pick here should work from ARMv4 all the way to ARMv7 as there's no cfg mechanism for subarchitectures / architecture versions.

@japaric
Copy link
Member Author

japaric commented May 28, 2017

@ppannuto I tested you approach but I get the same numbers: 5 cycles per loop. There's ... something weird about the measurement though. If I execute the code instruction by instruction and look at CYCCNT then I see:

This PR

0x08000514 ? str        r2, [r0, #0] ; 1 cycle
0x08000516 ? adds       r0, #4       ; 1 cycle
0x08000518 ? subs       r1, #4       ; 1 cycle
0x0800051a ? bcs.n      0x8000514    ; 1 cycle

Your approach

0x0800051e ? stmia      r0!, {r2} ; 2 cycles
0x08000520 ? cmp        r0, r3    ; 1 cycle
0x08000522 ? blt.n      0x800051e ; 1 cycle

However, if I use breakpoint and look at CYCCNT then I see:

This PR

0x08000514 ? str        r2, [r0, #0] ; <-- breakpoint
0x08000516 ? adds       r0, #4       ;
0x08000518 ? subs       r1, #4       ;
0x0800051a ? bcs.n      0x8000514    ;

Change in CYCCNT across breakpoints: 5 cycles

You approach:

0x0800051e ? stmia      r0!, {r2} ; <-- breakpoint
0x08000520 ? cmp        r0, r3
0x08000522 ? blt.n      0x800051e

Change in CYCCNT across breakpoints: 5 cycles

If I mesaure the whole thing:

This PR

 800048e:       be00            bkpt    0x0000
 8000490:       f6cd 62ad       movt    r2, #57005      ; 0xdead
 8000494:       f000 f837       bl      8000506 <__aeabi_memset4>

4096 bytes -> 5137 cycles. Or about 5 cycles per loop.

Your approach

 800048e:       be00            bkpt    0x0000
 8000490:       f6cd 62ad       movt    r2, #57005      ; 0xdead
 8000494:       f000 f837       bl      8000506 <__aeabi_memset4>
 8000498:       be00            bkpt    0x0000

4096 bytes -> 5139 cycles. Or about 5 cycles per loop.

I have no idea why there's a 4 vs 5 difference in the micro (per instruction) vs "mini" (per loop) measurement, but the macro (per memset4 call) measurement reports the same numbers for both approaches.

@japaric
Copy link
Member Author

japaric commented May 28, 2017

I have changed the mem{clr,set}{4,8} to look more like the memcpy implementations. I have also added an assembly implementation for mem{clr,set} to have the former tail call the latter to reduce code size.

Performance is pretty much unchanged.

@ppannuto
Copy link

ppannuto commented May 29, 2017 via email

src/arm.rs Outdated
// calling convention which can't be implemented using a normal Rust function
// NOTE This function and the ones below are implemented using assembly
// because they using a custom calling convention which can't be implemented
// using a normal Rust function
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it unrelated to this PR, but I couldn't help but notice these functions could be implemented using a normal rust function like

pub extern "C" fn __aeabi_uidivmod(n: u32, d: u32) -> u64 {
    let mut rem = 0;
    let quot = unsafe {__udivmodsi4(n, d, Option::Some(&mut rem))};
    ((rem as u64) << 32) | quot as u64
}

using a u64 as a 'in regs' struct.

@parched
Copy link
Contributor

parched commented May 30, 2017

@japaric How is the performance compared to optimised versions written in rust which would be more portable? I.e., something like

#[no_mangle]
pub unsafe extern "C" fn memcpy4(dest: *mut u32,
                                src: *const u32,
                                n: usize)
                                -> *mut u32 {
    let mut i = 0;
    while i < n / 4 {
        *dest.offset(i as isize) = *src.offset(i as isize);
        i += 1;
    }
    dest
}

@bors
Copy link
Contributor

bors commented Jun 25, 2017

☔ The latest upstream changes (presumably #166) made this pull request unmergeable. Please resolve the merge conflicts.

this reduces the execution time of all these routines by 40-70%
@japaric
Copy link
Member Author

japaric commented Jun 30, 2017

@parched Surprisingly (or maybe not), memcpy is around 20% faster (on Cortex-M3) when written in Rust. Everything else is slightly slower by a constant factor. I can't no longer read the generated assembly and there's no tail chaining of routines when written in Rust so the code size of the routines has doubled or so (They are still small though: 72 bytes for memcpy4). But if memcpy is faster then I'm sold.

I have changed everything to Rust now.

r? @alexcrichton TL;DR optimized memcpy et al for 32-bit aligned buffers (ARM has intrinsics for these operations) and reduced execution time for meaningful inputs by 40-70%.

The debatable bit may be that I changed all the aeabi_mem* intrinsics to weak linkage and that they are always exported regardless of whether the "mem" feature is enabled. That way this crate can be linked to binaries that link to libc and to binaries that don't link to libc without having to enable / disable Cargo features for either case. I think this is OK because all our ARM targets use the ELF format which supports weak linkage.

Alternatively we could not use weak linkage and export the aeabi_mem* intrinsics only when the "mem" feature is enabled. In this scenario if you link this crate to a program that also links to libc you'll get a "multiple definition" error if the "mem" feature was enabled.

let mut src = src as *mut u32;

while n >= 4 {
ptr::write(dest, ptr::read(src));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, the 4 in the name is "aligned to 4 bytes", right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. That's why we can optimize the routine by doing these aligned 4-byte reads / writes here.

@alexcrichton
Copy link
Member

I think this is OK because all our ARM targets use the ELF format which supports weak linkage.

I think we have iOS which may not respect this?

@japaric
Copy link
Member Author

japaric commented Jun 30, 2017

I think we have iOS which may not respect this?

Er, I forgot about iOS. (I was looking at rustc --print target-list and iOS doesn't show up there on Linux.). I don't know if it supports weak linkage.

@japaric
Copy link
Member Author

japaric commented Jun 30, 2017

error[E0560]: struct __test::test::TestDesc has no field named allow_fail

Did the real test crate change recently?

@japaric
Copy link
Member Author

japaric commented Jun 30, 2017

I have fixed the utest related error.

@alexcrichton would you prefer to go down the "mem" route, instead of using weak linkage, for now? If we want to eventually make binary releases of rust-std for the thumb targets that rust-std package will have to include a libcompiler_builtins crate with weak aeabi_mem* symbols or it won't be usable in bare metal contexts where you don't link to libc.

@alexcrichton
Copy link
Member

@japaric oh enabling weak symbols on ELF targets makes sense to me, I'm fine doing that by default (starting off on thumb and maybe moving outwards slowly). Just wanted to point out that this may not work for iOS I think? (unsure)

@japaric
Copy link
Member Author

japaric commented Jun 30, 2017

@alexcrichton I checked the contents of the libcompiler_builtins.rlib that ships with in rust-std component for the arm-unknown-linux-gnueabi and armv7-apple-ios targets. The Linux one contains files like aeabi_memcpy.o where the aeabi_mempcy* symbols are defined, whereas the iOS one didn't contain those files. So it's probably better to not expose the aeabi_mem* symbols on this crate when compiling for iOS.

After checking that I was wondering why there was no "multiple definition" error when linking a program for arm-unknown-linux-gnueabi given that both libc and libcompiler_builtins provide the same aeabi_mem* symbols and neither library exposes them as weak symbols whereas I get that error when linking a thumbv7m-none-eabi program. The difference seems to be that in the linux program libc is linked dynamically and it appears as -lc at the end of the linker invocation. Whereas in the thumb case libc is linked statically and appears as -Wl,--whole-archive -lc -Wl,--no-whole-archive in the linker invocation.

So to maintain the current behavior the aeabi_mem* symbols should not be exposed on iOS, be exposed as weak symbols on thumb, and be exposed as normal symbols elsewhere (Linux).

@alexcrichton
Copy link
Member

@bors: r+

@bors
Copy link
Contributor

bors commented Jul 1, 2017

📌 Commit b8a6620 has been approved by alexcrichton

@bors
Copy link
Contributor

bors commented Jul 1, 2017

⌛ Testing commit b8a6620 with merge b0300b1...

bors added a commit that referenced this pull request Jul 1, 2017
optimize memset and memclr for ARM

This commit optimizes those routines by rewriting them in assembly and
performing the memory copying in 32-bit chunks, rather than in 8-bit chunks
as it was done before this commit. This assembly implementation is
compatible with the ARMv6 and ARMv7 architectures.

This change results in a reduction of runtime of about 40-70% in all cases
that matter (the compiler will never use these intrinsics for sizes smaller
than 4 bytes). See data below:

| Bytes | HEAD | this PR | diff       |
| ----- | ---- | ------- | ---------- |
| 0     | 6    | 14      | +133.3333% |
| 1     | 10   | 13      | +30%       |
| 2     | 14   | 13      | -7.1429%   |
| 3     | 18   | 13      | -27.77%    |
| 4     | 24   | 21      | -12.5%     |
| 16    | 70   | 36      | -48.5714%  |
| 64    | 263  | 97      | -63.1179%  |
| 256   | 1031 | 337     | -67.3133%  |
| 1024  | 4103 | 1297    | -68.389%   |

All times are in clock cycles. The measurements were done on a Cortex-M3
processor running at 8 MHz using the technique described [here].

[here]: http://blog.japaric.io/rtfm-overhead

---

For relevance all pure Rust programs for Cortex-M microcontrollers use memclr to
zero the .bss during startup so this change results in a quicker boot time.

Some questions / comments:

- ~~the original code (it had a bug) comes from this [repo] and it's licensed
  under the ICS license. I have preserved the copyright and license text in the
  source code. IANAL, is that OK?~~ no longer applies. The intrinsics are written in Rust now.

- ~~I don't know whether this ARM implementation works for ARMv4 or ARMv5.
  @FenrirWolf and @Uvekilledkenny may want to take look at it first.~~ no longer applies. The intrinsics are written in Rust now.

- ~~No idea whether this implementation works on processors that have no thumb
  instruction set. The current implementation uses 16-bit thumb instructions.~~ no longer applies. The intrinsics are written in Rust now.

- ~~The loop code can be rewritten in less instructions but using 32-bit thumb
  instructions. That 32-bit version would only work on ARMv7 though. I have yet
  to check whether that makes any difference in the runtime of the intrinsic.~~ no longer applies. The intrinsics are written in Rust now.

- ~~I'll look into memcpy4 next.~~ done

[repo]: https://github.com/bobbl/libaeabi-cortexm0
@bors
Copy link
Contributor

bors commented Jul 1, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing b0300b1 to master...

@bors bors merged commit b8a6620 into master Jul 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants