You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Replacing slli / add.n pair with addx8 on 3 occasions in radix-2 routines.
Carefully reordering instructions for much fewer pipeline stalls and preventing the assembler to automatically insert a NOP in front of a hw loop so that the first instruction in the hw loop and branch target instructions entirely fit within a naturally aligned four byte region. Sometines the assembler fails to achieve this, which incurs 1-cycle penalty (the instruction fetch takes 2 cycles instead of 1). Consult Xtensa ISA Summary (LOOPNEZ description) for more info - it's vital for achieving the maximum loop/branch/jump speed!
Removing some superfluous instructions (forgotten debugging stuff).
Note 1: Passing the 3rd argument float* dsps_fft_w_table_fc32 to FFT functions is actually a good idea. Because when you do parallel computations of say 2 FFTs with different Ns, you need 2 different table pointers for them. The table pointer cannot be thus hardcoded in the FFT functions, as attempted.
Note 2: Optimized radix-2 ESP32-S3 ASM version is just 11-15% faster than optimized ESP32 ASM version because of a pipeline stall, which cannot be avoided.
Note 3: Optimized radix-4 ESP32-S3 ASM version is just 17-27% faster than the original ESP32 ASM version because only 4 fp32 complex reads and 4 fp32 complex writes can be sped up with 64-bit load/store instructions.
Note 4: When int is used for non-negative numbers only, the GCC does not know this and produces more complex code, which correctly handles negative values, which never happen - it just slows things down! Just have a look at what assembly GCC produces for the for cycle in dsps_fft4r_fc32_ae32.c when using int or uint for the j counter and m variables…
Note 5: As for an integer FFT, even some 30 years ago, when I programmed a full MP3 decoder in pure assembly (including a custom made DCT matrix decomposition - basically a custom made FDCT for maximum performance), I used a 2MHz fp32 DSP, and mostly never looked back since! As for the new ESP32-P4, I still wait for it to overcome its teething problems!
Happy Holiday!
f4lc0n signing off with the saying: "Est rerum omnium magister usus."
fft_fp32_xtensa_v2.2.zip (7-Zip) Improved: Deleted 4 superfluous assembly instructions in radix-4 FFTs, which are no longer needed for branch target alignment adjustments. Every branch target is yet properly aligned. Fixed: 1 wrong comment in radix-4 FFTs. Improved: More in-depth comments on arithmetic computations in radix-4 FFTs so that people can easily follow the computations - due to popular demand. Improved: Faster checking for N to be a power of 4 in radix-4 FFTs + branch/jump target alignment optimizations. Added:Optimized radix-4 ESP32 ASM version of the original radix-4 ESP32 ASM version. Just comment out / disable the original FFT ASM source in dsps_fft4r_fc32_ae32.c and add dsps_fft4r_fc32_ae32_.S to your project - it will be called instead. Added:Optimized radix-4 ESP32-S3 ASM version. Just add dsps_fft4r_fc32_aes3_.S to your project, add its prototype and #define dsps_fft4r_fc32_aes3(data, N) dsps_fft4r_fc32_aes3_(data, N, dsps_fft4r_w_table_fc32, dsps_fft4r_w_table_size) to dsps_fft4r.h and #define dsps_fft4r_fc32_aes3_enabled 1 to dsps_fft4r_platform.h and you are basically set. Improved: Further speed optimizations of dsps_fft2r_fc32_aes3_.S. Now it is up to 32% faster (N=4096)!
The new performance table:
(R2 stands for radix-2, R4 for radix-4, and O for the new optimized version. The measurements are in CPU cycles.)
The text was updated successfully, but these errors were encountered:
github-actionsbot
changed the title
SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3
SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)
Dec 9, 2024
f4lc0n-asm
changed the title
SOLVED: 20% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)
SOLVED: 20% and up to 32% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)
Dec 13, 2024
f4lc0n-asm
changed the title
SOLVED: 20% and up to 32% faster fp32 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)
SOLVED: 20% and up to 32% faster fp32 radix-2 FFT and up to 12% and up to 27% faster fp32 radix-4 FFT in ASM for ESP32 and ESP32-S3 (AUD-5911)
Dec 25, 2024
Hello,
here are the 20% and up to 32% faster ASM drop-in replacements for dsps_fft2r_fc32_ae32_.S and dsps_fft2r_fc32_aes3_.S from ESP-DSP repository.
I achieved this by:
slli / add.n
pair withaddx8
on 3 occasions in radix-2 routines.LOOPNEZ
description) for more info - it's vital for achieving the maximum loop/branch/jump speed!Note 1: Passing the 3rd argument
float* dsps_fft_w_table_fc32
to FFT functions is actually a good idea. Because when you do parallel computations of say 2 FFTs with different Ns, you need 2 different table pointers for them. The table pointer cannot be thus hardcoded in the FFT functions, as attempted.Note 2: Optimized radix-2 ESP32-S3 ASM version is just 11-15% faster than optimized ESP32 ASM version because of a pipeline stall, which cannot be avoided.
Note 3: Optimized radix-4 ESP32-S3 ASM version is just 17-27% faster than the original ESP32 ASM version because only 4 fp32 complex reads and 4 fp32 complex writes can be sped up with 64-bit load/store instructions.
Note 4: When
int
is used for non-negative numbers only, the GCC does not know this and produces more complex code, which correctly handles negative values, which never happen - it just slows things down! Just have a look at what assembly GCC produces for thefor
cycle in dsps_fft4r_fc32_ae32.c when usingint
oruint
for thej
counter andm
variables…Note 5: As for an integer FFT, even some 30 years ago, when I programmed a full MP3 decoder in pure assembly (including a custom made DCT matrix decomposition - basically a custom made FDCT for maximum performance), I used a 2MHz fp32 DSP, and mostly never looked back since! As for the new ESP32-P4, I still wait for it to overcome its teething problems!
Happy Holiday!
f4lc0n signing off with the saying: "Est rerum omnium magister usus."
fft_fp32_xtensa_v2.2.zip (7-Zip)
Improved: Deleted 4 superfluous assembly instructions in radix-4 FFTs, which are no longer needed for branch target alignment adjustments. Every branch target is yet properly aligned.
Fixed: 1 wrong comment in radix-4 FFTs.
Improved: More in-depth comments on arithmetic computations in radix-4 FFTs so that people can easily follow the computations - due to popular demand.
Improved: Faster checking for N to be a power of 4 in radix-4 FFTs + branch/jump target alignment optimizations.
Added: Optimized radix-4 ESP32 ASM version of the original radix-4 ESP32 ASM version. Just comment out / disable the original FFT ASM source in dsps_fft4r_fc32_ae32.c and add dsps_fft4r_fc32_ae32_.S to your project - it will be called instead.
Added: Optimized radix-4 ESP32-S3 ASM version. Just add dsps_fft4r_fc32_aes3_.S to your project, add its prototype and
#define dsps_fft4r_fc32_aes3(data, N) dsps_fft4r_fc32_aes3_(data, N, dsps_fft4r_w_table_fc32, dsps_fft4r_w_table_size)
to dsps_fft4r.h and#define dsps_fft4r_fc32_aes3_enabled 1
to dsps_fft4r_platform.h and you are basically set.Improved: Further speed optimizations of dsps_fft2r_fc32_aes3_.S. Now it is up to 32% faster (N=4096)!
The new performance table:
(R2 stands for radix-2, R4 for radix-4, and O for the new optimized version. The measurements are in CPU cycles.)
The text was updated successfully, but these errors were encountered: