optimize memset and memclr for ARM #164
Conversation
I changed this:

into this:

It made no difference whatsoever. I get the same number of cycles in the 1024- and 256-byte cases.
tests/aeabi_memclr.rs (outdated)

```rust
#[test]
fn memclr4() {
    let mut xs = [0u8; 8];
```
The ABI requires these to be aligned to 4 bytes, but your u8 array may be unaligned. Same with all the other instances in the other tests. We need proper support for …
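For illustration, a minimal sketch of one way to guarantee the required alignment in such a test is to back the byte buffer with a `u32` array. The test name and the assumed `(dest, n)` signature below are illustrative, not the code from this PR:

```rust
extern "C" {
    // Assumed signature, matching the (dest, n) argument order used by the
    // __aeabi_memclr* routines.
    fn __aeabi_memclr4(dest: *mut u8, n: usize);
}

#[test]
fn memclr4_aligned() {
    // Backing the buffer with u32 guarantees 4-byte alignment, which the
    // `*4` variants of these intrinsics require.
    let mut xs = [0xffff_ffffu32; 2]; // 8 bytes, aligned to 4
    unsafe { __aeabi_memclr4(xs.as_mut_ptr() as *mut u8, 8) };
    assert_eq!(xs, [0, 0]);
}
```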
Pushed a memcpy implementation. Speedups below:

Times are in clock cycles. Smaller is better.
It should work fine for ARMv5 and ARMv4T. It won't work on plain ARMv4 because it doesn't support the …
src/arm.rs (outdated)

```asm
str  r2, [r0]
adds r0, #4
subs r1, #4
bhs  1b
```
Saw your RFC on IRC earlier; I think you can make this a little tighter for cores that support postfix operation. I wrote this (https://gist.github.com/ppannuto/672328eb8184abdb9559) a little while back for a tight init loop, comparing:

```asm
str  r2, [r0]   @ 2 cycles
adds r0, #4     @ 1 cycle
subs r1, #4     @ 1 cycle
bhs  1b         @ ~1 cycle
```

-> 5 cycles / loop

with stmia:

```asm
WriteZeroToBssStart:
    ldr   r0, =_sbss
    ldr   r1, =_ebss
    movs  r2, #0
    b     WriteZeroToBssEnterLoop
WriteZeroToBssLoop:
    stmia r0!, {r2}          @ 2 cycles
WriteZeroToBssEnterLoop:
    cmp   r0, r1             @ 1 cycle
    bcc   WriteZeroToBssLoop @ 1 cycle
```

-> 4 cycles / loop

Using cycle counts from section 3.3.1 of http://infocenter.arm.com/help/topic/com.arm.doc.ddi0439b/DDI0439B_cortex_m4_r0p0_trm.pdf rather than measured numbers, but I think this should save a cycle.
I tested cross compiling a small program that uses this memset4 for ARMv4 and ARMv5 and it seems to link fine. ARMv4 and ARMv5 end up using 32-bit (ARM?) instructions instead of 16-bit instructions, but other than that the disassemblies look the same.

@ppannuto Nice, I'll test that, but the implementation we pick here should work from ARMv4 all the way to ARMv7, as there's no cfg mechanism for subarchitectures / architecture versions.
@ppannuto I tested your approach but I get the same numbers: 5 cycles per loop. There's ... something weird about the measurement though. If I execute the code instruction by instruction and look at CYCCNT then I see:

This PR

```
0x08000514  str r2, [r0, #0]   ; 1 cycle
0x08000516  adds r0, #4        ; 1 cycle
0x08000518  subs r1, #4        ; 1 cycle
0x0800051a  bcs.n 0x8000514    ; 1 cycle
```

Your approach

```
0x0800051e  stmia r0!, {r2}    ; 2 cycles
0x08000520  cmp r0, r3         ; 1 cycle
0x08000522  blt.n 0x800051e    ; 1 cycle
```

However, if I use breakpoints and look at CYCCNT then I see:

This PR

```
0x08000514  str r2, [r0, #0]   ; <-- breakpoint
0x08000516  adds r0, #4
0x08000518  subs r1, #4
0x0800051a  bcs.n 0x8000514
```

Change in CYCCNT across breakpoints: 5 cycles

Your approach

```
0x0800051e  stmia r0!, {r2}    ; <-- breakpoint
0x08000520  cmp r0, r3
0x08000522  blt.n 0x800051e
```

Change in CYCCNT across breakpoints: 5 cycles

If I measure the whole thing:

This PR

```
800048e: be00      bkpt 0x0000
8000490: f6cd 62ad movt r2, #57005 ; 0xdead
8000494: f000 f837 bl 8000506 <__aeabi_memset4>
```

4096 bytes -> 5137 cycles, or about 5 cycles per loop.

Your approach

```
800048e: be00      bkpt 0x0000
8000490: f6cd 62ad movt r2, #57005 ; 0xdead
8000494: f000 f837 bl 8000506 <__aeabi_memset4>
8000498: be00      bkpt 0x0000
```

4096 bytes -> 5139 cycles, or about 5 cycles per loop.

I have no idea why there's a 4 vs 5 difference in the micro (per instruction) vs "mini" (per loop) measurement, but the macro (per memset4 call) measurement reports the same numbers for both approaches.
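(For readers unfamiliar with CYCCNT: the cycle counts above come from the DWT cycle counter that ARMv7-M cores expose. A minimal sketch of reading it around a call is shown below; this is not the exact measurement harness used in the PR, just an illustration, and the register addresses are the standard architectural ones.)

```rust
use core::ptr::{read_volatile, write_volatile};

// ARMv7-M debug registers at their fixed architectural addresses.
const DEMCR: *mut u32 = 0xE000_EDFC as *mut u32;     // TRCENA is bit 24
const DWT_CTRL: *mut u32 = 0xE000_1000 as *mut u32;  // CYCCNTENA is bit 0
const DWT_CYCCNT: *mut u32 = 0xE000_1004 as *mut u32;

/// Rough cycle count for `f`, including call overhead and the two reads.
unsafe fn cycles(f: impl FnOnce()) -> u32 {
    write_volatile(DEMCR, read_volatile(DEMCR) | (1 << 24)); // enable the DWT
    write_volatile(DWT_CTRL, read_volatile(DWT_CTRL) | 1);   // start CYCCNT
    let before = read_volatile(DWT_CYCCNT);
    f();
    read_volatile(DWT_CYCCNT).wrapping_sub(before)
}
```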
I have changed the mem{clr,set}{4,8} routines to look more like the memcpy implementations. I have also added an assembly implementation for mem{clr,set}, with the former tail calling the latter to reduce code size. Performance is pretty much unchanged.
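(In Rust terms, the tail-call arrangement described above amounts to something like the sketch below; the signatures are assumed to follow AEABI's (dest, n[, c]) order, and the real code at this point in the PR was assembly.)

```rust
extern "C" {
    // Provided elsewhere in the crate; note AEABI's (dest, n, c) argument order.
    fn __aeabi_memset4(dest: *mut u8, n: usize, c: i32);
}

#[no_mangle]
pub unsafe extern "C" fn __aeabi_memclr4(dest: *mut u8, n: usize) {
    // memclr is just memset with a zero fill value; forwarding keeps the
    // code size down (the assembly version made this a literal tail call).
    __aeabi_memset4(dest, n, 0)
}

#[no_mangle]
pub unsafe extern "C" fn __aeabi_memclr8(dest: *mut u8, n: usize) {
    // 8-byte alignment implies 4-byte alignment, so the same path is reused.
    __aeabi_memset4(dest, n, 0)
}
```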
Interesting -- I didn't think that Cortex-Ms had a load-store queue, but that (or something like it) would explain the identical performance and how the `str` gets down to 1 cycle. Thanks for testing on hardware; sorry it didn't prove to actually be faster :/
src/arm.rs (outdated)

```rust
// calling convention which can't be implemented using a normal Rust function
// NOTE This function and the ones below are implemented using assembly
// because they use a custom calling convention which can't be implemented
// using a normal Rust function
```
I know it's unrelated to this PR, but I couldn't help but notice these functions could be implemented using a normal Rust function like

```rust
pub extern "C" fn __aeabi_uidivmod(n: u32, d: u32) -> u64 {
    let mut rem = 0;
    let quot = unsafe { __udivmodsi4(n, d, Option::Some(&mut rem)) };
    ((rem as u64) << 32) | quot as u64
}
```

using a u64 as an 'in regs' struct.
@japaric How is the performance compared to optimised versions written in Rust, which would be more portable? I.e., something like …
☔ The latest upstream changes (presumably #166) made this pull request unmergeable. Please resolve the merge conflicts.
this reduces the execution time of all these routines by 40-70%
@parched Surprisingly (or maybe not), memcpy is around 20% faster (on a Cortex-M3) when written in Rust. Everything else is slightly slower by a constant factor. I can no longer read the generated assembly, and there's no tail chaining of routines when written in Rust, so the code size of the routines has doubled or so (they are still small though: 72 bytes for memcpy4). But if memcpy is faster then I'm sold. I have changed everything to Rust now.

r? @alexcrichton

TL;DR optimized memcpy et al. for 32-bit aligned buffers (ARM has intrinsics for these operations) and reduced execution time for meaningful inputs by 40-70%.

The debatable bit may be that I changed all the aeabi_mem* intrinsics to weak linkage and that they are always exported regardless of whether the "mem" feature is enabled. That way this crate can be linked to binaries that link to libc and to binaries that don't link to libc, without having to enable / disable Cargo features for either case. I think this is OK because all our ARM targets use the ELF format, which supports weak linkage.

Alternatively, we could not use weak linkage and export the aeabi_mem* intrinsics only when the "mem" feature is enabled. In this scenario, if you link this crate to a program that also links to libc, you'll get a "multiple definition" error whenever the "mem" feature is enabled.
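(For concreteness, the feature-gated alternative described above would look roughly like the sketch below. The PR itself went with weak linkage instead; how weak symbols are actually emitted is outside this sketch, and the trivial byte-wise body is only there to keep the example self-contained.)

```rust
// Only export the AEABI memory intrinsics when the "mem" Cargo feature is
// enabled, so binaries that also link libc never see duplicate definitions.
#[cfg(feature = "mem")]
#[no_mangle]
pub unsafe extern "C" fn __aeabi_memclr(dest: *mut u8, n: usize) {
    // Naive byte-wise clear, just for the sketch; the real routine
    // dispatches to the word-wise implementations where it can.
    for i in 0..n {
        core::ptr::write(dest.add(i), 0);
    }
}
```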
```rust
let mut src = src as *mut u32;

while n >= 4 {
    ptr::write(dest, ptr::read(src));
```
Just to confirm, the `4` in the name is "aligned to 4 bytes", right?
Yes. That's why we can optimize the routine by doing these aligned 4-byte reads / writes here.
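(Filled out, the word-wise copy under review might look like the following sketch; the `(dest, src, n)` signature and the byte-wise tail handling are assumptions here, not a quote of the PR's code.)

```rust
use core::ptr;

pub unsafe extern "C" fn __aeabi_memcpy4(dest: *mut u8, src: *const u8, mut n: usize) {
    // Both pointers are 4-byte aligned (that's what the `4` suffix means),
    // so the bulk of the copy can be done a word at a time.
    let mut dest = dest as *mut u32;
    let mut src = src as *const u32;
    while n >= 4 {
        ptr::write(dest, ptr::read(src));
        dest = dest.offset(1);
        src = src.offset(1);
        n -= 4;
    }
    // Copy any remaining 1-3 bytes individually.
    let mut dest = dest as *mut u8;
    let mut src = src as *const u8;
    while n > 0 {
        ptr::write(dest, ptr::read(src));
        dest = dest.offset(1);
        src = src.offset(1);
        n -= 1;
    }
}
```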
I think we have iOS which may not respect this?
Er, I forgot about iOS. (I was looking at …
Did the real test crate change recently?
I have fixed the utest-related error. @alexcrichton, would you prefer to go down the "mem" route, instead of using weak linkage, for now? If we want to eventually make binary releases of rust-std for the thumb targets, that rust-std package will have to include a libcompiler_builtins crate with weak aeabi_mem* symbols, or it won't be usable in bare metal contexts where you don't link to libc.
@japaric oh enabling weak symbols on ELF targets makes sense to me, I'm fine doing that by default (starting off on thumb and maybe moving outwards slowly). Just wanted to point out that this may not work for iOS I think? (unsure)
@alexcrichton I checked the contents of the libcompiler_builtins.rlib that ships in the rust-std component for the arm-unknown-linux-gnueabi and armv7-apple-ios targets. The Linux one contains files like …

After checking that, I was wondering why there was no "multiple definition" error when linking a program for arm-unknown-linux-gnueabi, given that both libc and libcompiler_builtins provide the same aeabi_mem* symbols and neither library exposes them as weak symbols, whereas I do get that error when linking a thumbv7m-none-eabi program. The difference seems to be that in the Linux program libc is linked dynamically and appears as -lc at the end of the linker invocation, whereas in the thumb case libc is linked statically and appears as -Wl,--whole-archive -lc -Wl,--no-whole-archive in the linker invocation.

So to maintain the current behavior, the aeabi_mem* symbols should not be exposed on iOS, should be exposed as weak symbols on thumb, and should be exposed as normal symbols elsewhere (Linux).
@bors: r+ |
📌 Commit b8a6620 has been approved by |
optimize memset and memclr for ARM

This commit optimizes those routines by rewriting them in assembly and performing the memory copying in 32-bit chunks, rather than in 8-bit chunks as it was done before this commit. This assembly implementation is compatible with the ARMv6 and ARMv7 architectures.

This change results in a reduction of runtime of about 40-70% in all cases that matter (the compiler will never use these intrinsics for sizes smaller than 4 bytes). See data below:

| Bytes | HEAD | this PR | diff       |
| ----- | ---- | ------- | ---------- |
| 0     | 6    | 14      | +133.3333% |
| 1     | 10   | 13      | +30%       |
| 2     | 14   | 13      | -7.1429%   |
| 3     | 18   | 13      | -27.77%    |
| 4     | 24   | 21      | -12.5%     |
| 16    | 70   | 36      | -48.5714%  |
| 64    | 263  | 97      | -63.1179%  |
| 256   | 1031 | 337     | -67.3133%  |
| 1024  | 4103 | 1297    | -68.389%   |

All times are in clock cycles. The measurements were done on a Cortex-M3 processor running at 8 MHz using the technique described [here].

[here]: http://blog.japaric.io/rtfm-overhead

---

For relevance: all pure Rust programs for Cortex-M microcontrollers use memclr to zero the .bss during startup, so this change results in a quicker boot time.

Some questions / comments:

- ~~the original code (it had a bug) comes from this [repo] and it's licensed under the ISC license. I have preserved the copyright and license text in the source code. IANAL, is that OK?~~ no longer applies. The intrinsics are written in Rust now.
- ~~I don't know whether this ARM implementation works for ARMv4 or ARMv5. @FenrirWolf and @Uvekilledkenny may want to take a look at it first.~~ no longer applies. The intrinsics are written in Rust now.
- ~~No idea whether this implementation works on processors that have no thumb instruction set. The current implementation uses 16-bit thumb instructions.~~ no longer applies. The intrinsics are written in Rust now.
- ~~The loop code can be rewritten in fewer instructions by using 32-bit thumb instructions. That 32-bit version would only work on ARMv7 though. I have yet to check whether that makes any difference in the runtime of the intrinsic.~~ no longer applies. The intrinsics are written in Rust now.
- ~~I'll look into memcpy4 next.~~ done

[repo]: https://github.com/bobbl/libaeabi-cortexm0
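(As an aside on the boot-time claim above: the .bss zeroing is the loop a Cortex-M reset handler runs before `main`. A minimal sketch is shown below; the `__sbss`/`__ebss` symbol names come from a typical linker script and are assumptions here.)

```rust
use core::ptr::{addr_of_mut, write_bytes};

extern "C" {
    // Start and end of .bss as placed by the linker script; the exact
    // symbol names vary between linker scripts.
    static mut __sbss: u8;
    static mut __ebss: u8;
}

/// What a Cortex-M reset handler does before `main`: zero .bss.
unsafe fn zero_bss() {
    let start = addr_of_mut!(__sbss);
    let len = addr_of_mut!(__ebss) as usize - start as usize;
    // LLVM typically lowers this into a memset/memclr call, which is why a
    // faster __aeabi_memclr4 directly shortens boot time.
    write_bytes(start, 0, len);
}
```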
☀️ Test successful - status-appveyor, status-travis