Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLVED: 20% and up to 32% faster fp32 radix-2 FFT and up to 12% and up to 27% faster fp32 radix-4 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) #1325

Open
f4lc0n-asm opened this issue Dec 9, 2024 · 0 comments

Comments

@f4lc0n-asm
Copy link

f4lc0n-asm commented Dec 9, 2024

Hello,

here are the 20% and up to 32% faster ASM drop-in replacements for dsps_fft2r_fc32_ae32_.S and dsps_fft2r_fc32_aes3_.S from ESP-DSP repository.

I achieved this by:

  • Replacing slli / add.n pair with addx8 on 3 occasions in radix-2 routines.
  • Carefully reordering instructions for much fewer pipeline stalls and preventing the assembler to automatically insert a NOP in front of a hw loop so that the first instruction in the hw loop and branch target instructions entirely fit within a naturally aligned four byte region. Sometines the assembler fails to achieve this, which incurs 1-cycle penalty (the instruction fetch takes 2 cycles instead of 1). Consult Xtensa ISA Summary (LOOPNEZ description) for more info - it's vital for achieving the maximum loop/branch/jump speed!
  • Removing some superfluous instructions (forgotten debugging stuff).

Note 1: Passing the 3rd argument float* dsps_fft_w_table_fc32 to FFT functions is actually a good idea. Because when you do parallel computations of say 2 FFTs with different Ns, you need 2 different table pointers for them. The table pointer cannot be thus hardcoded in the FFT functions, as attempted.

Note 2: Optimized radix-2 ESP32-S3 ASM version is just 11-15% faster than optimized ESP32 ASM version because of a pipeline stall, which cannot be avoided.

Note 3: Optimized radix-4 ESP32-S3 ASM version is just 17-27% faster than the original ESP32 ASM version because only 4 fp32 complex reads and 4 fp32 complex writes can be sped up with 64-bit load/store instructions.

Note 4: When int is used for non-negative numbers only, the GCC does not know this and produces more complex code, which correctly handles negative values, which never happen - it just slows things down! Just have a look at what assembly GCC produces for the for cycle in dsps_fft4r_fc32_ae32.c when using int or uint for the j counter and m variables…

Note 5: As for an integer FFT, even some 30 years ago, when I programmed a full MP3 decoder in pure assembly (including a custom made DCT matrix decomposition - basically a custom made FDCT for maximum performance), I used a 2MHz fp32 DSP, and mostly never looked back since! As for the new ESP32-P4, I still wait for it to overcome its teething problems!

Happy Holiday!

f4lc0n signing off with the saying: "Est rerum omnium magister usus."

fft_fp32_xtensa_v2.2.zip (7-Zip)
Improved: Deleted 4 superfluous assembly instructions in radix-4 FFTs, which are no longer needed for branch target alignment adjustments. Every branch target is yet properly aligned.
Fixed: 1 wrong comment in radix-4 FFTs.
Improved: More in-depth comments on arithmetic computations in radix-4 FFTs so that people can easily follow the computations - due to popular demand.
Improved: Faster checking for N to be a power of 4 in radix-4 FFTs + branch/jump target alignment optimizations.
Added: Optimized radix-4 ESP32 ASM version of the original radix-4 ESP32 ASM version. Just comment out / disable the original FFT ASM source in dsps_fft4r_fc32_ae32.c and add dsps_fft4r_fc32_ae32_.S to your project - it will be called instead.
Added: Optimized radix-4 ESP32-S3 ASM version. Just add dsps_fft4r_fc32_aes3_.S to your project, add its prototype and #define dsps_fft4r_fc32_aes3(data, N) dsps_fft4r_fc32_aes3_(data, N, dsps_fft4r_w_table_fc32, dsps_fft4r_w_table_size) to dsps_fft4r.h and #define dsps_fft4r_fc32_aes3_enabled 1 to dsps_fft4r_platform.h and you are basically set.
Improved: Further speed optimizations of dsps_fft2r_fc32_aes3_.S. Now it is up to 32% faster (N=4096)!

The new performance table:
(R2 stands for radix-2, R4 for radix-4, and O for the new optimized version. The measurements are in CPU cycles.)

   N ‖   ANSI |  ESP32 |  ESP32 |ESP32-S3|ESP32-S3‖   ANSI |  ESP32 |  ESP32 |ESP32-S3
     ‖    R2  |    R2  |  O R2  |    R2  |  O R2  ‖    R4  |    R4  |  O R4  |  O R4
  16 ‖   1277 |    978 |    818 |    931 |    737 ‖    979 |    613 |    557 |    496
  32 ‖   2972 |   2319 |   1935 |   2208 |   1726 ‖        |        |        |
  64 ‖   6835 |   5380 |   4484 |   5125 |   3971 ‖   4928 |   3070 |   2984 |   2603
 128 ‖  15514 |  12265 |  10217 |  11690 |   9000 ‖        |        |        |
 256 ‖  34785 |  27566 |  22958 |  26287 |  20141 ‖  24828 |  15427 |  15263 |  13218
 512 ‖  77160 |  61237 |  50995 |  58420 |  44594 ‖        |        |        |
1024 ‖ 169583 | 134712 | 112184 | 128569 |  97847 ‖ 121192 |  75160 |  74726 |  64489
2048 ‖ 369785 | 293949 | 244801 | 280638 | 213058 ‖        |        |        |
4096 ‖ 800893 | 636994 | 530498 | 608323 | 460865 ‖ 574098 | 355501 | 354029 | 304880
@github-actions github-actions bot changed the title SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) Dec 9, 2024
@f4lc0n-asm f4lc0n-asm changed the title SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) SOLVED: 20% and up to 32% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) Dec 13, 2024
@f4lc0n-asm f4lc0n-asm changed the title SOLVED: 20% and up to 32% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) SOLVED: 20% and up to 32% faster fp32 radix-2 FFT and up to 12% and up to 27% faster fp32 radix-4 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911) Dec 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant