Possible improvement to the benchmark #2

While reading the code, I spotted 3 potential issues:

1. Your code should be placed in IRAM (using the `IRAM_ATTR` [macro](https://docs.espressif.com/projects/esp-idf/en/stable/esp32s2/api-guides/memory-types.html#how-to-place-code-in-iram)) so it isn't evicted from the cache while PSRAM data is being loaded. Otherwise the timed code may also be measuring slow SPI flash reads.
2. I'm not sure you're testing what you expect to test. The IRAM->IRAM test isn't doing what you think, since you've allocated the buffer with `MALLOC_CAP_DMA` and not `MALLOC_CAP_32BIT`. It's only with the latter that you *may* get an IRAM buffer (see [here](https://docs.espressif.com/projects/esp-idf/en/stable/esp32s2/api-reference/system/mem_alloc.html#dma-capable-memory)). The other flags won't give you IRAM (they might give you DRAM, so you're measuring the SRAM speed here). Please note that the memory bus of the ESP32-S3 runs at 240MHz and is 32 bits wide, so the maximum theoretical bandwidth is 960MB/s. A copy halves that bandwidth, so at best you *should* get 480MB/s. I don't understand how you could get more than that in the results you've shown. The ESP32 doesn't do any DDR-like behavior on the SRAM; only the PSRAM supports it.
If your results are confirmed, then this means that the DSP/PIE can access the SRAM in units wider than 32 bits (and the SRAM supports it), which would be excellent.
3. The PSRAM tests are likely wrong. You need to evict the cache before starting the test, otherwise you'll measure the time it takes to move data from the cache to the cache. The default cache size in the ESP32-S3 is 32KB (see [here](https://www.espressif.com/sites/default/files/documentation/esp32-s3_technical_reference_manual_en.pdf), p400 and later). ESP-IDF usually configures it to 64KB. So a large part of the 100KB test is already served from the cache. You can use `esp_cache_msync` to evict it, or simply read 64KB from somewhere else first to flush the cache, and then run the test. The size must also be larger than 128KB so that, when the test finishes, the result is actually stored in the PSRAM and not in the cache. Otherwise, what you'll see is like this:
```
Source PSRAM buffer
 32KB   |  32KB |  28KB  
[   A   |   B   |   C   ]

Dest PSRAM buffer  
 32KB   |  32KB |  28KB  
[   D   |   E   |   F   ]

Cache
[   x   |   x   ]

After warmup, you'll have this in the cache: (state diagram by time) 

|
|   S1    S2  S3    S4
+---*-----*----*----*--------------------> time

S1: Start of benchmark, cache contains [x|x]
S2: Read from PSRAM to cache and write to cache, cache contains [A|D] (actually measuring the PSRAM=>SRAM bandwidth for reading A + SRAM=>SRAM copying)
S3: End of reading A, starts reading B: doesn't fit. Cache must evict its data to PSRAM (actually measuring SRAM=>PSRAM bandwidth for transferring D to PSRAM, then PSRAM=>SRAM for reading B)
S4: End of reading C. Benchmark stops, yet the data F isn't in PSRAM yet, since you haven't msync'ed yet.
```
The only part that is relevant for the benchmark is around state S3, since it's only there that the PSRAM is both read and written. It happens after you've exhausted the cache and before lingering data is left in the cache, so from 32KB into the source data until 32KB before its end. So in a 100KB test, you only experience this in the 36KB between both ends (100KB - 2 × 32KB). A 192KB test would measure it in the inner 128KB and leave only the two 32KB ends unrepresentative.
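
To make the three points above concrete, here is a minimal sketch of a timed copy, not a drop-in replacement for your benchmark. The names `steady_state_bytes`, `now_us`, `flush_and_invalidate` and `timed_copy_us` are made up for illustration. It assumes ESP-IDF v5.x (where `esp_cache_msync` is available), cache-line aligned PSRAM buffers, and a cache split evenly between source and destination as in the diagram. The `#else` branches are only a host fallback so the sketch compiles off-target; timings taken there mean nothing.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef ESP_PLATFORM
#include "esp_attr.h"   /* IRAM_ATTR */
#include "esp_cache.h"  /* esp_cache_msync() */
#include "esp_timer.h"  /* esp_timer_get_time() */
#else
/* Host fallback (assumption): compile-only stand-ins, not real measurements. */
#include <time.h>
#define IRAM_ATTR
#endif

/* Bytes of a copy that hit PSRAM on both the read and the write side,
 * assuming the cache is shared evenly between source and destination. */
size_t steady_state_bytes(size_t copy_bytes, size_t cache_bytes)
{
    return copy_bytes > cache_bytes ? copy_bytes - cache_bytes : 0;
}

static int64_t now_us(void)
{
#ifdef ESP_PLATFORM
    return esp_timer_get_time();
#else
    return (int64_t)clock() * 1000000 / CLOCKS_PER_SEC;
#endif
}

/* Write dirty lines back to PSRAM, then drop them from the cache.
 * Assumes buf and len are cache-line aligned and buf points into PSRAM. */
static void flush_and_invalidate(void *buf, size_t len)
{
#ifdef ESP_PLATFORM
    (void)esp_cache_msync(buf, len,
                          ESP_CACHE_MSYNC_FLAG_DIR_C2M |
                          ESP_CACHE_MSYNC_FLAG_INVALIDATE);
#else
    (void)buf;
    (void)len;
#endif
}

/* The copy loop lives in IRAM so instruction fetches do not compete
 * with the PSRAM traffic for the cache. Returns elapsed microseconds. */
int64_t IRAM_ATTR timed_copy_us(uint32_t *dst, const uint32_t *src, size_t bytes)
{
    /* Start cold: neither buffer should be sitting in the cache. */
    flush_and_invalidate((void *)src, bytes);
    flush_and_invalidate(dst, bytes);

    int64_t start = now_us();
    for (size_t i = 0; i < bytes / sizeof(uint32_t); i++) {
        dst[i] = src[i];
    }
    /* Count the final write-back, so the tail of dst really reaches PSRAM. */
    flush_and_invalidate(dst, bytes);
    return now_us() - start;
}
```

With a 64KB cache, `steady_state_bytes(192 * 1024, 64 * 1024)` gives the inner 128KB mentioned above. Flushing before the timer starts and writing back before it stops is one way to avoid the S1/S2 and S4 effects; reading an unrelated 64KB block first would be another.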

Please note that you can boost the PSRAM speed. On the ESP32-S3-WROOM-1U-N8R8, the PSRAM is connected over Octal SPI, and ESP-IDF builds it by default in MSPI DDR mode running at 80MHz (see [here](https://docs.espressif.com/projects/esp-idf/en/stable/esp32s3/api-guides/flash_psram_config.html)). So the best bandwidth you could expect is 8 bits * 80MHz * 2 = 160MB/s (in practice a lot less, because of the SPI command and address overhead). Again, a copy operation will halve that (or worse, since you need to send the source and destination addresses for each transaction). So getting more than 50-60MB/s would be excellent.

However, you can configure the MSPI to run at 120MHz (under `CONFIG_IDF_EXPERIMENTAL_FEATURES`), with many caveats, and in that case you could expect a theoretical bandwidth of 240MB/s, or maybe ~100MB/s for a copy.
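
For reference, the ceilings quoted in this issue all come from the same back-of-the-envelope formula. The sketch below just spells it out; the helper names `peak_mb_per_s` and `copy_mb_per_s` are made up for illustration, and the numbers ignore every protocol overhead, so they are upper bounds, not predictions.

```c
#include <stdint.h>

/* Peak bus bandwidth in MB/s:
 * bytes per transfer x clock in MHz x transfers per clock (2 for DDR). */
uint32_t peak_mb_per_s(uint32_t bus_bits, uint32_t clock_mhz,
                       uint32_t transfers_per_clock)
{
    return bus_bits / 8u * clock_mhz * transfers_per_clock;
}

/* A copy moves every byte twice over the same bus (one read, one write),
 * so it can reach at most half of the peak. */
uint32_t copy_mb_per_s(uint32_t peak)
{
    return peak / 2u;
}
```

Plugging in the values from the text: `peak_mb_per_s(32, 240, 1)` is 960 for the SRAM bus (480 for a copy), `peak_mb_per_s(8, 80, 2)` is 160 for Octal PSRAM at 80MHz DDR (80 for a copy), and `peak_mb_per_s(8, 120, 2)` is 240 at 120MHz (120 for a copy). The realistic copy figures above (50-60MB/s and ~100MB/s) sit below those halves because of the command and address overhead.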

Anyway, thank you for your repository, it's excellent and I'm sure it can be even better. I've had a hard time finding actual numbers for the DSP and SIMD stuff.
