Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using newer compiler for ARM (13.2.Rel1) #2455

Closed
fanoush opened this issue Jan 26, 2024 · 20 comments
Closed

using newer compiler for ARM (13.2.Rel1) #2455

fanoush opened this issue Jan 26, 2024 · 20 comments

Comments

@fanoush
Copy link
Contributor

fanoush commented Jan 26, 2024

Creating this issue to track more issues and discuss rationale.

Current 8-2018-q4-major 8.2.1 compiler downloaded by provision script is quite old but produced shorter code than newer ones. However when testing latest one 13.2.rel1 gcc 13.2.1 from
https://developer.arm.com/downloads/-/arm-gnu-toolchain-downloads I see that it produces smaller code both for bootloader and main firmware. However there are some issues:

arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi/bin/../lib/gcc/arm-none-eabi/13.2.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/softfp/libc_nano.a(libc_a-closer.o): in function `_close_r':
closer.c:(.text._close_r+0xc): warning: _close is not implemented and will always fail

Discussed here https://community.arm.com/support-forums/f/compilers-and-libraries-forum/53519/arm-gcc-11-3-rel1-newlib-nano-linker-warning-_read-is-not-implemented-and-will-always-fail
Looks like it works fine and the code calling this is not present in final binary but it is a bit strange. I did not succeed to make it go away when messing with nosys specs as mentioned in various places. However I don't see this in bootloader build (maybe something from stdio is not pulled in there at all, in main FW it may be there but optimized out).

  • For BANGLE2 I have tested both bootloader and main firmware and in general it seems to work for me - Espruino works, I see no issue with any apps I use. Bootloader can start Nordic DFU, can also flash update from storage file however there is one issue - the screen output/font rendering in bootloader is somehow garbled most probably when printing space character (?), font letters can be read but there are random pixels set between that. I briefly checked font rendering but don't see how space is rendered, any tips? Can be different layout after linking or alignment issue or some timing issue however other font letters are OK and screen works perfectly fine in main firmware.
@fanoush
Copy link
Contributor Author

fanoush commented Jan 26, 2024

Forgot one more reason. The old compiler has very slow linker, in typical case this is longest part of whole build. Newer compilers are much faster in last LTO linking phase (like ~10 seconds instead of > 1 minute on slower computers).

linking speed rm bin/*elf ; time make BOARD=BANGLEJS RELEASE=1

linking 8.2.1 (on i9 12900H - pretty fast computer, typically I use much slower machines)
real 0m24.178s
user 0m23.537s
sys 0m0.649s
linking 13.2.1 with -flto (sigle threaded, warning: using serial compilation of 11 LTRANS jobs)
real 0m10.796s
user 0m10.348s
sys 0m0.459s
linking 13.2.1 with -flto=auto
real 0m1.777s
user 0m13.339s
sys 0m0.495s

The size reduction is not that much but it is there, here is Bangle 1 build - saves about 1.1KB
BANGLE1
gcc version 13.2.1
CODE: 0x1f000 -> 0x757e8 (354100 bytes)
Code area Fits before FS Area (2072b free)
gcc version 8.2.1
CODE: 0x1f000 -> 0x75c34 (355180 bytes)
Code area Fits before FS Area (972b free)

also the bootloader

BANGLE1 bootloader with 8.2.1 - build with untested spi flashing - only 8 bytes free

0007DFE0 00 10 00 00 00 30 08 00 41 A5 07 00 00 00 00 00 |.....0..A.......|
0007DFF0 25 C2 07 00 D1 C0 07 00 -- -- -- -- -- -- -- -- |%....... |
0007E000 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | |

with gcc version 13.2.1 there is 244 bytes saving, now 252 bytes free

0007DEE0 01 00 FF FF 00 90 D0 03 10 00 00 00 00 10 00 00 |................|
0007DEF0 00 30 08 00 71 A9 07 00 00 00 00 00 61 C1 07 00 |.0..q.......a...|
0007DF00 09 CD 07 00 -- -- -- -- -- -- -- -- -- -- -- -- |.... |
0007DF10 -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- | |

@gfwilliams
Copy link
Member

Newer compilers are much faster in last LTO linking phase

Where do I sign up? :) Even after a very expensive PC upgrade that is still frustrating!

Things like this make me think I really do need to finally make a proper test harness that will test out the firmwares automatically on real hardware. It would make merging PRs like this seem a lot less risky.

font in bootloader

Very strange - it's handled in https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L839-L840

It should see space, realise it's less than char 0 and just ignore it (stepping on X by 4)

Do you have a picture of it? It should only be capable of setting pixels, and ONLY on X+0/1/2 - so if you see pixels being set at 4*X+3 (3/7/11/etc) then something is going wrong maybe in lcd_pixel

@fanoush
Copy link
Contributor Author

fanoush commented Jan 26, 2024

Added PR just to see what I did with the code in ram fix if you want to test it.
Don't have SWD for BANGLE2 here, will make a photo with bootloader font later.

@fanoush
Copy link
Contributor Author

fanoush commented Jan 26, 2024

here are photos https://ibb.co/album/hLbQ4M , just changing compiler back and rebuilding from same source fixes it

@fanoush
Copy link
Contributor Author

fanoush commented Jan 26, 2024

Where do I sign up? :) Even after a very expensive PC upgrade that is still frustrating!

my second fastest machine (Dell Optiplex 3000 thin client - 120EUR refurbished fanless thing from ebay, 3.3GHz Pentium Silver N6005)

gcc version 8.2.1 8-2018-q4-major - linking takes 57s out of 1m28.715s

make BOARD=BANGLEJS2 clean ; time make -j BOARD=BANGLEJS2 RELEASE=1
real 1m28.715s
user 2m47.380s
sys 0m8.753s

rm bin/*.elf ; time make -j BOARD=BANGLEJS2 RELEASE=1
real 0m57.828s
user 0m56.636s
sys 0m1.198s

gcc version 13.2.1 linking takes 9s our of 23s

make BOARD=BANGLEJS2 clean ; time make -j BOARD=BANGLEJS2 RELEASE=1
real 0m23.845s
user 1m14.575s
sys 0m8.153s

rm bin/*.elf ; time make -j BOARD=BANGLEJS2 RELEASE=1
real 0m9.345s
user 0m26.890s
sys 0m1.446s

this is with -flto=auto (=use more threads, autodetect # of CPU cores like make -j)

diff --git a/make/common/ARM.make b/make/common/ARM.make
index 335c1834d..c27139760 100644
--- a/make/common/ARM.make
+++ b/make/common/ARM.make
@@ -13,7 +13,7 @@ OPTIMIZEFLAGS += -fno-common -fno-exceptions -fdata-sections -ffunction-sections
 # Given we've left 10k for it, there's no real reason to enable LTO anyway.
 ifndef BOOTLOADER
 # Enable link-time optimisations (inlining across files)
-OPTIMIZEFLAGS += -flto -fno-fat-lto-objects -Wl,--allow-multiple-definition
+OPTIMIZEFLAGS += -flto=auto -fno-fat-lto-objects -Wl,--allow-multiple-definition
 DEFINES += -DLINK_TIME_OPTIMISATION
 endif

and for bootloader possibly also

diff --git a/make/family/NRF52.make b/make/family/NRF52.make
index b6000d6ea..a3d8080fa 100644
--- a/make/family/NRF52.make
+++ b/make/family/NRF52.make
@@ -150,7 +150,7 @@ endif # USE_BOOTLOADER
 ifdef BOOTLOADER
 # we're trying to compile the bootloader itself
 LINKER_FILE = $(LINKER_BOOTLOADER)
-OPTIMIZEFLAGS=-Os -flto -fno-fat-lto-objects -Wl,--allow-multiple-definition # try to reduce bootloader size
+OPTIMIZEFLAGS=-Os -flto=auto -fno-fat-lto-objects -Wl,--allow-multiple-definition # try to reduce bootloader size
 else # not BOOTLOADER - compiling something to run under a bootloader
 LINKER_FILE = $(LINKER_ESPRUINO)
 INCLUDE += -I$(NRF5X_SDK_PATH)/nrf52_config

@fanoush
Copy link
Contributor Author

fanoush commented Jan 27, 2024

Good news is that the bug is in the bootloader LCD code :-)

There is mix of LCD_DATA_HEIGHT and LCD_HEIGHT

LCD_HEIGHT
https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L892
LCD_DATA_HEIGHT
https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L863

I guess LCD_DATA_HEIGHT should be used everywhere as that is what is in the lcd.h file so it is wrong in lcd_clear() and also in lcd_flip() in several places. When fixing it in lcd_clear() there is no garbage on screen on bootloader startup.

EDIT: for bangle 2 it looks like sensible fix but when I tried to replace it everywhere I am not sure about these places
https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L449-L450
https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L453
https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L485
hard to guess what is the idea here and why it is different from other cases

I would guess that when x,y is used to access data in lcd_data it should not be multiplied by 2 (or use LCD_WIDTH/HEIGHT) or it may go past the array(?)

@gfwilliams
Copy link
Member

gfwilliams commented Jan 29, 2024

Great spot - thanks!

Just a bit of background - to make it readable on most devices, we have an offscreen buffer and pixel-double. ymin/max should be the y values in that offscreen buffer (at half the size of the main one).

Those other bits look fine - although https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L485 could change (but it doesn't matter really as all it wants to do is make ymin>ymax).

For the other places, it's setting ymin/ymax to double because then it scans out all the pixels and uses y>>1 when indexing into the data array. And after it resets ymin/ymax

Do you want to do a PR for that fix, or shall I stick it in?


Just to add, on GCC 8, I get:

make BOARD=BANGLEJS2 clean ; time make -j BOARD=BANGLEJS2 RELEASE=1
real	0m21.997s
user	0m53.521s
sys	0m4.007s

rm bin/*.elf ; time make -j BOARD=BANGLEJS2 RELEASE=1
real	0m19.008s
user	0m18.511s
sys	0m0.496s

So it's a lot faster, but compiling everything else takes 2 seconds, and the final step takes 19 seconds - so updating the compiler would still be a great help.

@fanoush
Copy link
Contributor Author

fanoush commented Jan 29, 2024

Do you want to do a PR for that fix, or shall I stick it in?

Please do without PR if possible. If you think everything else is OK then at least this causes the garbage (too much data copied)
https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L892
and this is confusing but does not hurt as it gets fixed when first pixel is drawn
https://github.com/espruino/Espruino/blob/master/targets/nrf5x_dfu/lcd.c#L810

gfwilliams added a commit that referenced this issue Jan 29, 2024
@gfwilliams
Copy link
Member

Just done :)

gfwilliams added a commit that referenced this issue Jan 31, 2024
…rrent 8.2.1 compiler - should work faster with 13.2.1
@gfwilliams
Copy link
Member

I just added flto=auto and it doesn't appear to change the binary on 8.2.1

Am I right in thinking that there's actually not anything else that needs to change now before we swap compiler?

I'm tempted to make a 2v21 firmware release before we do the swap, just in case anything comes up from it that we didn't notice. Then there'll be several weeks to notice any issues before 2v22 :)

@fanoush
Copy link
Contributor Author

fanoush commented Jan 31, 2024

Yes, nothing else. There are those stdio warnings but should be harmless. Maybe debug builds could have more of that linked in? I do run Bangle2 build now in my watch and also started to use it for other builds but so far only flashed it to nrf52840 dongle which does not test more than Bangle2. EDIT: I also tried nrf52832 board with SDK12 and bootloader and it works. Will try stm32f103 and nrf51 in next days.

@fanoush
Copy link
Contributor Author

fanoush commented Feb 2, 2024

Tried STM32F103 build for STLINK V2 USB dongle and it works fine in device however in this case new compiler produces larger code
13.2.1 PASS - size of 120304 is under 120832 bytes
8.2.1 PASS - size of 119400 is under 120832 bytes
This is interesting, maybe some Cortex M3 related thing or some libraries?

tried to build MICROBIT1 and that one is slightly smaller with newer compiler
13.2.1 CODE: 0x1b000 -> 0x3e934 (145716 bytes)
8.2.1 CODE: 0x1b000 -> 0x3ea0c (145932 bytes)

Tried also PICO_R1_3 build and that one is smaller with new compiler too
13.2.1 PASS - size of 326088 is under 327680 bytes
8.2.1 PASS - size of 327448 is under 327680 bytes

and ESPRUINOBOARD, slightly bigger
13.2.1 PASS - size of 217804 is under 231424 bytes
8.2.1 PASS - size of 217712 is under 231424 bytes

So looks like Cortex M3 thing(?)

@gfwilliams
Copy link
Member

Interesting! Thanks for all that testing.

It does look like an M3 thing (maybe they're putting more emphasis on M4 code generation and the M3 output is just M4 code with unsupported instructions replaced). It's a shame really as the M3 devices tend to have less flash so would really benefit.

It feels like a 1.5k+ saving on all modern platforms should outweigh 1-200 bytes extra on old platforms though, so I'm still all for this

@gfwilliams
Copy link
Member

Just looking into this a bit more - it's looking good. Interestingly the MICROBIT1 build doesn't have the _close warnings, and commenting:

     'DEFINES+=-DSAVE_ON_FLASH_EXTREME',
     'DEFINES+=-DESPR_NO_DAYLIGHT_SAVING',
     'DEFINES+=-DJSVAR_FORCE_NO_INLINE=1',
     'CFLAGS += -ffreestanding', # needed for SAVE_ON_FLASH_EXTREME (jswrap_math, __aeabi_dsub)
     'ASFLAGS += -D__STARTUP_CLEAR_BSS -D__START=main',
     'LDFLAGS += -nostartfiles',

Makes them come back.

Doing the builds the same way (using the compiler in a zip in GitHub) won't work as it's now too big (even when I remove all the targets we don't need), so I've done it direct from ARM for now.

Last time it made the builds slower and unreliable as the 'external' server would sometimes fail, but we'll have to see what happens.

@gfwilliams
Copy link
Member

It's in - and looks good: https://github.com/espruino/Espruino/actions/runs/7833136232

@fanoush
Copy link
Contributor Author

fanoush commented Feb 9, 2024

using the compiler in a zip in GitHub) won't work as it's now too big (even when I remove all the targets we don't need)

what is the limit? got it down from 170MB to 60MB by removing targets, stripping symbols from binaries and compressing into tar.zx (XZ_OPT=-8eT0 tar -Jcvf 13.2.tar.xz arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi)

@fanoush
Copy link
Contributor Author

fanoush commented Feb 9, 2024

I see it described as 100MB here https://github.com/espruino/EspruinoBuildTools/tree/master/arm
For that you can just strip binaries

tar -xvf arm-gnu-toolchain-13.2.rel1-x86_64-arm-none-eabi.tar.xz
cd arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi/
strip  bin/* arm-none-eabi/bin/* libexec/gcc/arm-none-eabi/13.2.1/{cc*,f*,lto1}
cd ..
XZ_OPT=-8eT0 tar -Jcf 13.2.1-stripped.tar.xz arm-gnu-toolchain-13.2.Rel1-x86_64-arm-none-eabi/

gives file size 83056216
EDIT:
oh I see the trick is mainly in compression flags, just stripped binaries with default xz compression options is 141691464 vs 83054996 with -8e (and -9e gives 67400860).

@gfwilliams
Copy link
Member

gfwilliams commented Feb 9, 2024

That's great! The limit is 100 so that would be perfect.

It must be the XZ_OPT - I tried here and I deleted everything apart from:

  • nRF51 (micro:bit 1) : v6-m/nofp
  • nRF52: Cortex M4 v7e-m+fp/softfp
  • Original Espruino: Cortex M3 v7-m/nofp

And it still was at 120 with the default options.

Is there any side-effect of stripping the files? Does it hurt debugability? I just made the mistake of trying to google the man page for strip...

@fanoush
Copy link
Contributor Author

fanoush commented Feb 9, 2024

Does it hurt debugability?

Debuggability of gcc itself when it crashes or miscompiles probably yes. This strips just compiler executables, not target libraries.

@gfwilliams
Copy link
Member

This strips just compiler executables, not target libraries.

Oh wow, ok - that's perfect then. Thanks!

I'm not going to get around to this before my break so I'll reopen for now, and close when I get the compiler in GitHub

@gfwilliams gfwilliams reopened this Feb 9, 2024
gfwilliams added a commit to espruino/EspruinoBuildTools that referenced this issue Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants