optimize memset and memclr for ARM #164

japaric · 2017-05-26T17:08:26Z

This commit optimizes those routines by rewriting them in assembly and
performing the memory copying in 32-bit chunks, rather than in 8-bit chunks
as it was done before this commit. This assembly implementation is
compatible with the ARMv6 and ARMv7 architectures.

This change results in a reduction of runtime of about 40-70% in all cases
that matter (the compiler will never use these intrinsics for sizes smaller
than 4 bytes). See data below:

Bytes	HEAD	this PR	diff
0	6	14	+133.3333%
1	10	13	+30%
2	14	13	-7.1429%
3	18	13	-27.77%
4	24	21	-12.5%
16	70	36	-48.5714%
64	263	97	-63.1179%
256	1031	337	-67.3133%
1024	4103	1297	-68.389%

All times are in clock cycles. The measurements were done on a Cortex-M3
processor running at 8 MHz using the technique described here.

For relevance all pure Rust programs for Cortex-M microcontrollers use memclr to
zero the .bss during startup so this change results in a quicker boot time.

Some questions / comments:

the original code (it had a bug) comes from this repo and it's licensed
under the ICS license. I have preserved the copyright and license text in the
source code. IANAL, is that OK? no longer applies. The intrinsics are written in Rust now.
I don't know whether this ARM implementation works for ARMv4 or ARMv5.
@FenrirWolf and @Uvekilledkenny may want to take look at it first. no longer applies. The intrinsics are written in Rust now.
No idea whether this implementation works on processors that have no thumb
instruction set. The current implementation uses 16-bit thumb instructions. no longer applies. The intrinsics are written in Rust now.
The loop code can be rewritten in less instructions but using 32-bit thumb
instructions. That 32-bit version would only work on ARMv7 though. I have yet
to check whether that makes any difference in the runtime of the intrinsic. no longer applies. The intrinsics are written in Rust now.
~~I'll look into memcpy4 next.~~ done

japaric · 2017-05-26T17:36:00Z

The loop code can be rewritten in less instructions but using 32-bit thumb
instructions.

I changed this:

1:  @ word-wise copy loop
    str  r2, [r0]
    adds r0, #4
    subs r1, #4
    bhs  1b

into this

1:  @ word-wise copy loop
    str r2, [r0], #4
    subs r1, #4
    bhs  1b

It made no difference whatsoever. I get the same number of cycles in the 1024 and 256 bytes cases.

Amanieu · 2017-05-26T17:50:21Z

tests/aeabi_memclr.rs

+
+#[test]
+fn memclr4() {
+    let mut xs = [0u8; 8];


The ABI requires these to be aligned to 4 bytes, but your u8 array may be unaligned. Some with all the others instances in the other tests.

Amanieu · 2017-05-26T17:52:41Z

We need proper support for target_features on ARM so that we can specialize these for the target architecture. For example, non-Thumb ARMv5 onwards should probably make use of the ldrd and strd instructions.

japaric · 2017-05-26T19:41:32Z

Pushed a memcpy implementation. Speedups below:

Bytes	HEAD	this PR	diff
0	6	10	+66.6667%
1	12	16	+33.3333%
2	18	24	+33.3333%
3	24	32	+33.3333%
4	30	20	-33.3333%
16	102	50	-50.9804%
64	390	170	-56.4103%
256	1796	650	-63.8085%
1024	7172	2570	-64.1662%

Times are in clock cycles. Smaller is better.

Amanieu · 2017-05-26T19:49:39Z

I don't know whether this ARM implementation works for ARMv4 or ARMv5.

It should work fine for ARMv5 and ARMv4T. It won't work on plain ARMv4 because it doesn't support the bx instruction.

ppannuto · 2017-05-27T00:58:36Z

src/arm.rs

+    str  r2, [r0]
+    adds r0, #4
+    subs r1, #4
+    bhs  1b


Saw your RFC on IRC earlier, I think you can make this a little tighter for cores that support postfix operation. I wrote this ( https://gist.github.com/ppannuto/672328eb8184abdb9559 ) a little while back for a tight init loop, comparing:

str r2, [r0] (2 cycle) adds r0, #4 (1 cycle) subs r1, #4 (1 cycle) bhs 1b (~1 cycle)

-> 5 cycles / loop

with stmia:

WriteZeroToBssStart: ldr r0, =_sbss ldr r1, =_ebss movs r2, #0 b WriteZeroToBssEnterLoop WriteZeroToBssLoop: stmia r0!, {r2} <-- 2 cycles WriteZeroToBssEnterLoop: cmp r0, r1 <-- 1 cycle bcc WriteZeroToBssLoop <-- 1 cycle

-> 4 cycles / loop

Using cycle counts from 3.3.1 in http://infocenter.arm.com/help/topic/com.arm.doc.ddi0439b/DDI0439B_cortex_m4_r0p0_trm.pdf rather than measured, but I think this should save a cycle

japaric · 2017-05-27T23:16:06Z

I tested cross compiling a small program that uses this memset4 for ARMv4 and ARMv5 and it seems to link fine:

thumbv4t-none-eabi

0000801c <__aeabi_memclr4>:
    801c:       e0322002        eors    r2, r2, r2

00008020 <__aeabi_memset4>:
    8020:       e2511004        subs    r1, r1, #4
    8024:       3a000008        bcc     804c <__aeabi_memset4+0x2c>
    8028:       e1b02c02        lsls    r2, r2, #24
    802c:       e1b03422        lsrs    r3, r2, #8
    8030:       e1922003        orrs    r2, r2, r3
    8034:       e1b03822        lsrs    r3, r2, #16

armv5te-none-eabi

0000801c <__aeabi_memclr4>:
    801c:       e0322002        eors    r2, r2, r2

00008020 <__aeabi_memset4>:
    8020:       e2511004        subs    r1, r1, #4
    8024:       3a000008        bcc     804c <__aeabi_memset4+0x2c>
    8028:       e1b02c02        lsls    r2, r2, #24
    802c:       e1b03422        lsrs    r3, r2, #8
    8030:       e1922003        orrs    r2, r2, r3
    8034:       e1b03822        lsrs    r3, r2, #16

thumbv6m-none-eabi (for comparison)

00008010 <__aeabi_memclr4>:
    8010:       4052            eors    r2, r2

00008012 <__aeabi_memset4>:
    8012:       3904            subs    r1, #4
    8014:       d308            bcc.n   8028 <__aeabi_memset4+0x16>
    8016:       0612            lsls    r2, r2, #24
    8018:       0a13            lsrs    r3, r2, #8
    801a:       431a            orrs    r2, r3
    801c:       0c13            lsrs    r3, r2, #16

ARMv4 and ARMv5 end up using 32-bit (ARM?) instructions instead of 16-bit instructions but other than that the disassemblies look the same.

@ppannuto Nice, I'll test that but the implementation we pick here should work from ARMv4 all the way to ARMv7 as there's no cfg mechanism for subarchitectures / architecture versions.

japaric · 2017-05-28T02:57:17Z

@ppannuto I tested you approach but I get the same numbers: 5 cycles per loop. There's ... something weird about the measurement though. If I execute the code instruction by instruction and look at CYCCNT then I see:

This PR

0x08000514 ? str        r2, [r0, #0] ; 1 cycle
0x08000516 ? adds       r0, #4       ; 1 cycle
0x08000518 ? subs       r1, #4       ; 1 cycle
0x0800051a ? bcs.n      0x8000514    ; 1 cycle

Your approach

0x0800051e ? stmia      r0!, {r2} ; 2 cycles
0x08000520 ? cmp        r0, r3    ; 1 cycle
0x08000522 ? blt.n      0x800051e ; 1 cycle

However, if I use breakpoint and look at CYCCNT then I see:

This PR

0x08000514 ? str        r2, [r0, #0] ; <-- breakpoint
0x08000516 ? adds       r0, #4       ;
0x08000518 ? subs       r1, #4       ;
0x0800051a ? bcs.n      0x8000514    ;

Change in CYCCNT across breakpoints: 5 cycles

You approach:

0x0800051e ? stmia      r0!, {r2} ; <-- breakpoint
0x08000520 ? cmp        r0, r3
0x08000522 ? blt.n      0x800051e

Change in CYCCNT across breakpoints: 5 cycles

If I mesaure the whole thing:

This PR

 800048e:       be00            bkpt    0x0000
 8000490:       f6cd 62ad       movt    r2, #57005      ; 0xdead
 8000494:       f000 f837       bl      8000506 <__aeabi_memset4>

4096 bytes -> 5137 cycles. Or about 5 cycles per loop.

Your approach

 800048e:       be00            bkpt    0x0000
 8000490:       f6cd 62ad       movt    r2, #57005      ; 0xdead
 8000494:       f000 f837       bl      8000506 <__aeabi_memset4>
 8000498:       be00            bkpt    0x0000

4096 bytes -> 5139 cycles. Or about 5 cycles per loop.

I have no idea why there's a 4 vs 5 difference in the micro (per instruction) vs "mini" (per loop) measurement, but the macro (per memset4 call) measurement reports the same numbers for both approaches.

japaric · 2017-05-28T04:10:51Z

I have changed the mem{clr,set}{4,8} to look more like the memcpy implementations. I have also added an assembly implementation for mem{clr,set} to have the former tail call the latter to reduce code size.

Performance is pretty much unchanged.

ppannuto · 2017-05-29T21:38:20Z

Interesting -- I didn't think that Cortex-M's had a load-store queue, but that (or something like it) would explain the identical performance and how the `str` gets down to 1 cycle. Thanks for testing on hardware, sorry it didn't prove to actually be faster :/

…

On Sat, May 27, 2017 at 10:57 PM Jorge Aparicio ***@***.***> wrote: @ppannuto <https://github.com/ppannuto> I tested you approach but I get the same numbers: 5 cycles per loop. There's ... something weird about the measurement though. If I execute the code instruction by instruction and look at CYCCNT then I see: This PR 0x08000514 ? str r2, [r0, #0] ; 1 cycle 0x08000516 ? adds r0, #4 ; 1 cycle 0x08000518 ? subs r1, #4 ; 1 cycle 0x0800051a ? bcs.n 0x8000514 ; 1 cycle Your approach 0x0800051e ? stmia r0!, {r2} ; 2 cycles 0x08000520 ? cmp r0, r3 ; 1 cycle 0x08000522 ? blt.n 0x800051e ; 1 cycle However, if I use breakpoint and look at CYCCNT then I see: This PR 0x08000514 ? str r2, [r0, #0] ; <-- breakpoint 0x08000516 ? adds r0, #4 ; 0x08000518 ? subs r1, #4 ; 0x0800051a ? bcs.n 0x8000514 ; Change in CYCCNT across breakpoints: 5 cycles You approach: 0x0800051e ? stmia r0!, {r2} ; <-- breakpoint 0x08000520 ? cmp r0, r3 0x08000522 ? blt.n 0x800051e Change in CYCCNT across breakpoints: 5 cycles If I mesaure the whole thing: This PR 800048e: be00 bkpt 0x0000 8000490: f6cd 62ad movt r2, #57005 ; 0xdead 8000494: f000 f837 bl 8000506 <__aeabi_memset4> 4096 bytes -> 5137 cycles. Or about 5 cycles per loop. Your approach 800048e: be00 bkpt 0x0000 8000490: f6cd 62ad movt r2, #57005 ; 0xdead 8000494: f000 f837 bl 8000506 <__aeabi_memset4> 8000498: be00 bkpt 0x0000 4096 bytes -> 5139 cycles. Or about 5 cycles per loop. I have no idea why there's a 4 vs 5 difference in the micro (per instruction) vs "mini" (per loop) measurement, but the macro (per memset4 call) measurement reports the same numbers for both approaches. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#164 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAUt3lru1F0B4-MigmvP8q15UKS2FSR6ks5r-OKOgaJpZM4Nn1-q> .

parched · 2017-05-30T20:04:07Z

src/arm.rs

-// calling convention which can't be implemented using a normal Rust function
+// NOTE This function and the ones below are implemented using assembly
+// because they using a custom calling convention which can't be implemented
+// using a normal Rust function


I know it unrelated to this PR, but I couldn't help but notice these functions could be implemented using a normal rust function like

pub extern "C" fn __aeabi_uidivmod(n: u32, d: u32) -> u64 { let mut rem = 0; let quot = unsafe {__udivmodsi4(n, d, Option::Some(&mut rem))}; ((rem as u64) << 32) | quot as u64 }

using a u64 as a 'in regs' struct.

parched · 2017-05-30T20:20:58Z

@japaric How is the performance compared to optimised versions written in rust which would be more portable? I.e., something like

#[no_mangle]
pub unsafe extern "C" fn memcpy4(dest: *mut u32,
                                src: *const u32,
                                n: usize)
                                -> *mut u32 {
    let mut i = 0;
    while i < n / 4 {
        *dest.offset(i as isize) = *src.offset(i as isize);
        i += 1;
    }
    dest
}

bors · 2017-06-25T18:07:15Z

☔ The latest upstream changes (presumably #166) made this pull request unmergeable. Please resolve the merge conflicts.

this reduces the execution time of all these routines by 40-70%

japaric · 2017-06-30T04:00:54Z

@parched Surprisingly (or maybe not), memcpy is around 20% faster (on Cortex-M3) when written in Rust. Everything else is slightly slower by a constant factor. I can't no longer read the generated assembly and there's no tail chaining of routines when written in Rust so the code size of the routines has doubled or so (They are still small though: 72 bytes for memcpy4). But if memcpy is faster then I'm sold.

I have changed everything to Rust now.

r? @alexcrichton TL;DR optimized memcpy et al for 32-bit aligned buffers (ARM has intrinsics for these operations) and reduced execution time for meaningful inputs by 40-70%.

The debatable bit may be that I changed all the aeabi_mem* intrinsics to weak linkage and that they are always exported regardless of whether the "mem" feature is enabled. That way this crate can be linked to binaries that link to libc and to binaries that don't link to libc without having to enable / disable Cargo features for either case. I think this is OK because all our ARM targets use the ELF format which supports weak linkage.

Alternatively we could not use weak linkage and export the aeabi_mem* intrinsics only when the "mem" feature is enabled. In this scenario if you link this crate to a program that also links to libc you'll get a "multiple definition" error if the "mem" feature was enabled.

alexcrichton · 2017-06-30T05:32:40Z

src/arm.rs

+    let mut src = src as *mut u32;
+
+    while n >= 4 {
+        ptr::write(dest, ptr::read(src));


Just to confirm, the 4 in the name is "aligned to 4 bytes", right?

Yes. That's why we can optimize the routine by doing these aligned 4-byte reads / writes here.

alexcrichton · 2017-06-30T05:34:24Z

I think this is OK because all our ARM targets use the ELF format which supports weak linkage.

I think we have iOS which may not respect this?

japaric · 2017-06-30T16:44:12Z

I think we have iOS which may not respect this?

Er, I forgot about iOS. (I was looking at rustc --print target-list and iOS doesn't show up there on Linux.). I don't know if it supports weak linkage.

japaric · 2017-06-30T16:46:19Z

error[E0560]: struct __test::test::TestDesc has no field named allow_fail

Did the real test crate change recently?

japaric · 2017-06-30T22:21:43Z

I have fixed the utest related error.

@alexcrichton would you prefer to go down the "mem" route, instead of using weak linkage, for now? If we want to eventually make binary releases of rust-std for the thumb targets that rust-std package will have to include a libcompiler_builtins crate with weak aeabi_mem* symbols or it won't be usable in bare metal contexts where you don't link to libc.

alexcrichton · 2017-06-30T22:26:52Z

@japaric oh enabling weak symbols on ELF targets makes sense to me, I'm fine doing that by default (starting off on thumb and maybe moving outwards slowly). Just wanted to point out that this may not work for iOS I think? (unsure)

japaric · 2017-06-30T22:58:05Z

@alexcrichton I checked the contents of the libcompiler_builtins.rlib that ships with in rust-std component for the arm-unknown-linux-gnueabi and armv7-apple-ios targets. The Linux one contains files like aeabi_memcpy.o where the aeabi_mempcy* symbols are defined, whereas the iOS one didn't contain those files. So it's probably better to not expose the aeabi_mem* symbols on this crate when compiling for iOS.

After checking that I was wondering why there was no "multiple definition" error when linking a program for arm-unknown-linux-gnueabi given that both libc and libcompiler_builtins provide the same aeabi_mem* symbols and neither library exposes them as weak symbols whereas I get that error when linking a thumbv7m-none-eabi program. The difference seems to be that in the linux program libc is linked dynamically and it appears as -lc at the end of the linker invocation. Whereas in the thumb case libc is linked statically and appears as -Wl,--whole-archive -lc -Wl,--no-whole-archive in the linker invocation.

So to maintain the current behavior the aeabi_mem* symbols should not be exposed on iOS, be exposed as weak symbols on thumb, and be exposed as normal symbols elsewhere (Linux).

…lsewhere

alexcrichton · 2017-07-01T07:27:52Z

@bors: r+

bors · 2017-07-01T07:27:53Z

📌 Commit b8a6620 has been approved by alexcrichton

bors · 2017-07-01T07:27:59Z

⌛ Testing commit b8a6620 with merge b0300b1...

@FenrirWolf

optimize memset and memclr for ARM This commit optimizes those routines by rewriting them in assembly and performing the memory copying in 32-bit chunks, rather than in 8-bit chunks as it was done before this commit. This assembly implementation is compatible with the ARMv6 and ARMv7 architectures. This change results in a reduction of runtime of about 40-70% in all cases that matter (the compiler will never use these intrinsics for sizes smaller than 4 bytes). See data below: | Bytes | HEAD | this PR | diff | | ----- | ---- | ------- | ---------- | | 0 | 6 | 14 | +133.3333% | | 1 | 10 | 13 | +30% | | 2 | 14 | 13 | -7.1429% | | 3 | 18 | 13 | -27.77% | | 4 | 24 | 21 | -12.5% | | 16 | 70 | 36 | -48.5714% | | 64 | 263 | 97 | -63.1179% | | 256 | 1031 | 337 | -67.3133% | | 1024 | 4103 | 1297 | -68.389% | All times are in clock cycles. The measurements were done on a Cortex-M3 processor running at 8 MHz using the technique described [here]. [here]: http://blog.japaric.io/rtfm-overhead --- For relevance all pure Rust programs for Cortex-M microcontrollers use memclr to zero the .bss during startup so this change results in a quicker boot time. Some questions / comments: - ~~the original code (it had a bug) comes from this [repo] and it's licensed under the ICS license. I have preserved the copyright and license text in the source code. IANAL, is that OK?~~ no longer applies. The intrinsics are written in Rust now. - ~~I don't know whether this ARM implementation works for ARMv4 or ARMv5. @FenrirWolf and @Uvekilledkenny may want to take look at it first.~~ no longer applies. The intrinsics are written in Rust now. - ~~No idea whether this implementation works on processors that have no thumb instruction set. The current implementation uses 16-bit thumb instructions.~~ no longer applies. The intrinsics are written in Rust now. - ~~The loop code can be rewritten in less instructions but using 32-bit thumb instructions. That 32-bit version would only work on ARMv7 though. I have yet to check whether that makes any difference in the runtime of the intrinsic.~~ no longer applies. The intrinsics are written in Rust now. - ~~I'll look into memcpy4 next.~~ done [repo]: https://github.com/bobbl/libaeabi-cortexm0

bors · 2017-07-01T08:04:46Z

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing b0300b1 to master...

Amanieu reviewed May 26, 2017

View reviewed changes

ppannuto reviewed May 27, 2017

View reviewed changes

parched reviewed May 30, 2017

View reviewed changes

optimize 32-bit aligned mem{cpy,clr,set} intrinsics for ARM

75c6ccc

this reduces the execution time of all these routines by 40-70%

japaric force-pushed the memclr branch from f84b3ba to 75c6ccc Compare June 30, 2017 04:00

alexcrichton reviewed Jun 30, 2017

View reviewed changes

no aeabi_mem* symbols on iOS, weak symbols on thumb, normal symbols e…

b8a6620

…lsewhere

bors merged commit b8a6620 into master Jul 1, 2017

josephlr mentioned this pull request Jul 8, 2020

Use REP MOVSQ/STOSQ on x86_64 #365

Merged

pca006132 mentioned this pull request Sep 29, 2020

Weak linkage for aeabi memory functions #378

Closed

pca006132 mentioned this pull request Oct 13, 2020

Use weak linkage for aeabi memory functions #385

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize memset and memclr for ARM #164

optimize memset and memclr for ARM #164

japaric commented May 26, 2017 •

edited

Loading

japaric commented May 26, 2017

Amanieu May 26, 2017

Amanieu commented May 26, 2017

japaric commented May 26, 2017 •

edited

Loading

Amanieu commented May 26, 2017

ppannuto May 27, 2017

japaric commented May 27, 2017

japaric commented May 28, 2017

japaric commented May 28, 2017

ppannuto commented May 29, 2017 via email

parched May 30, 2017

parched commented May 30, 2017 •

edited

Loading

bors commented Jun 25, 2017

japaric commented Jun 30, 2017

alexcrichton Jun 30, 2017

japaric Jun 30, 2017

alexcrichton commented Jun 30, 2017

japaric commented Jun 30, 2017

japaric commented Jun 30, 2017

japaric commented Jun 30, 2017

alexcrichton commented Jun 30, 2017

japaric commented Jun 30, 2017

alexcrichton commented Jul 1, 2017

bors commented Jul 1, 2017

bors commented Jul 1, 2017

bors commented Jul 1, 2017

optimize memset and memclr for ARM #164

optimize memset and memclr for ARM #164

Conversation

japaric commented May 26, 2017 • edited Loading

japaric commented May 26, 2017

Amanieu May 26, 2017

Choose a reason for hiding this comment

Amanieu commented May 26, 2017

japaric commented May 26, 2017 • edited Loading

Amanieu commented May 26, 2017

ppannuto May 27, 2017

Choose a reason for hiding this comment

japaric commented May 27, 2017

japaric commented May 28, 2017

japaric commented May 28, 2017

ppannuto commented May 29, 2017 via email

parched May 30, 2017

Choose a reason for hiding this comment

parched commented May 30, 2017 • edited Loading

bors commented Jun 25, 2017

japaric commented Jun 30, 2017

alexcrichton Jun 30, 2017

Choose a reason for hiding this comment

japaric Jun 30, 2017

Choose a reason for hiding this comment

alexcrichton commented Jun 30, 2017

japaric commented Jun 30, 2017

japaric commented Jun 30, 2017

japaric commented Jun 30, 2017

alexcrichton commented Jun 30, 2017

japaric commented Jun 30, 2017

alexcrichton commented Jul 1, 2017

bors commented Jul 1, 2017

bors commented Jul 1, 2017

bors commented Jul 1, 2017

japaric commented May 26, 2017 •

edited

Loading

japaric commented May 26, 2017 •

edited

Loading

parched commented May 30, 2017 •

edited

Loading