
SDCard corruption of files and directory/folders #5077

Closed
1 task done
frankiearzu opened this issue May 28, 2024 · 59 comments · Fixed by #5148
Labels: bug 🪲 Something isn't working · triage Bug report awaiting review / sorting

Comments

@frankiearzu (Contributor) commented May 28, 2024

Is there an existing issue for this problem?

  • I have searched the existing issues

What part of EdgeTX is the focus of this bug?

Transmitter firmware

Current Behavior

After some time of using the radio, the SD card seems to develop corruption of files and directories/folders.

Some scenarios:

  1. While using the DSM Forward Programming tool, it starts giving errors related to reading and writing files on the SD card. The messages displayed in the GUI are stored externally and are read into memory at startup; on some low-memory radios the script reads the messages in chunks later on. After those errors, and after exiting the script, the radio cannot read anything from the SD card. If you go to SYS->SDCard, it shows empty. Can't turn off the radio. After removing the battery and starting again, the SD card seems back to normal. The script is also writing a log while it is running. The Lua script has been stable with no modifications in a while, and did not have this problem on 2.9.x.

  2. I don't know if this is related to the previous scenario or independent, but after some time the files/directories/folders seem corrupted. The SD card has directories/folders with garbage names, and some images and sound files don't display/play correctly. The only way to recover is to reformat the SD card and rebuild it from a backup or start clean.

Expected Behavior

The SD card should not get corrupted.

Steps To Reproduce

I haven't found a way to reproduce it consistently. When I want it to fail, it does not, and when it's not expected, I start to see corrupted files/folders.

Version

2.10.0

Transmitter

FrSky Q X7 / Q X7S (ACCESS), RadioMaster TX16S / TX16SMK2

Operating System (OS)

Windows

OS Version

No response

Anything else?

No response

frankiearzu added the bug 🪲 Something isn't working and triage Bug report awaiting review / sorting labels on May 28, 2024
@frankiearzu (Contributor Author)

Similar problem reported by @offer-shmuely on Discord:
image

@frankiearzu (Contributor Author) commented May 28, 2024

Example from DSM Forward Programming: the tool initially validates that the file is there, and after that it reads chunks (open, read, close) without much validation. In this case, it could not re-open a file that had already been read during the initial start of the script.
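For clarity, a rough sketch of that chunked-read pattern (illustrative only; the file path, chunk size, and function name are assumptions, using the EdgeTX Lua io API):

-- Illustrative chunked read: open, seek to the chunk, read, close, repeat later.
local function readChunk(path, offset, size)
  local f = io.open(path, "r")
  if f == nil then
    return nil            -- this is the re-open that started failing after the errors
  end
  io.seek(f, offset)
  local data = io.read(f, size)
  io.close(f)
  return data
end

-- e.g. read the next 512 bytes of a (hypothetical) message file
local chunk = readChunk("/SCRIPTS/TOOLS/messages.txt", 0, 512)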
(image courtesy of Shane DIY who was having similar problems)
image

On a TX16, a similar situation:
After the script failed, the SD card cannot be read by the radio: Sys->Tools empty, Sys->SDCard empty.

image
image
After restart (by removing the battery)
image

@raphaelcoeffic (Member)

Can't turn off the radio

@frankiearzu what does this mean exactly?

@frankiearzu (Contributor Author)

@raphaelcoeffic After the weird behavior in scripts/tools was noticed, trying to turn off the radio using the switch got into the shutdown/off screen and the radio froze there. Had to remove the battery to restart.

It was initially reported to me by Shane; I played with the scripts and the radio a few times and was able to duplicate the error myself.
Yesterday I tried about 30 times and finally was able to reproduce it once, but after a restart (disconnecting the battery) the SD card seemed OK.

Trying to find a consistent way to reproduce it.

@frankiearzu (Contributor Author) commented May 31, 2024

Oh well, I corrupted the SD card again. The garbage names of the directories are actually data that the script was writing to dsm_log.txt. Today I was using a second radio with a newly formatted SD card (a good-quality one).

image

I connected the radio again, and Windows said there was a problem with the drive. I let it fix the problems, and it actually deleted everything inside LOGS.

I ran the script 7 times, connected the radio to the PC, checked the LOGS directory, and it was corrupted again. It seems like Lua scripts writing to LOGS corrupt it, and the corruption eventually propagates to other directories.
Will try with another SD card, created new from scratch.

image

@raphaelcoeffic (Member)

trying to turn off the radio using the switch got into the shutdown/off screen and the radio froze there. Had to remove the battery to restart.

This is possibly a good hint at what is happening. Normally, as the watchdog cannot be turned off (by any means), the radio should have rebooted. So this limits the possibilities a lot: the watchdog is still being refreshed (i.e. the mixer is still ON and running, i.e. no hard fault). However, something is preventing the radio from shutting down. On shutdown we try to flush the radio/model files, so if the file system mutex is still locked, it would behave exactly this way.

@raphaelcoeffic (Member)

The garbage names of the directories are actually data that the script was writing to dsm_log.txt.

Ok, so a block of data found its way into the FAT (file allocation table; basically the list of files).

@frankiearzu (Contributor Author) commented May 31, 2024

Finally able to reproduce it again within 10 tries consistently. Just have to run my DSM Forward Programming script and then use the RX simulation about 10 times (exiting the script completely on each run). Connect to the PC, and LOGS is corrupted. On each run of the simulator, the dsm_log.txt file is opened for overwrite (starting from the beginning) and then written/appended to. Probably any Lua script that writes will do the trick.
Will create a simple Lua script that does the same LOGS-writing behavior.

Looks like at some point the writing of the file affects the SD card directory structure, since the names of the strange files match the content of the file:

image
image

@blopa1961 (comment marked as off-topic)

@frankiearzu (Contributor Author) commented May 31, 2024

I'm starting to suspect that it is related to leaving log files open by aborting the script using long RTN to exit.
I created a log-file-writer test and on purpose left the log without closing it, just opening the file again. Companion seems to properly garbage collect and close the files. On the real radio, I got corruption of the script source file itself, and have now fully corrupted the SD card to the point that the radio does not start. I will reformat and rebuild the SD card and try to reproduce.

In this version, I am on purpose not closing the files, just re-opening. After 5-6 tests I can't read the source file from "/SCRIPTS/TOOLS" using Notepad, and after running it about 10 times with long RTN in the middle of the script, the script starts to fail, a file error shows, and most likely the logs are corrupted.

The file error is a nil parameter to io.write() on a file that was created/opened in the previous step, so the open must have failed.
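For reference, a minimal sketch of that open-without-close pattern (illustrative only, not the attached LOG_Test.lua; the path and names are assumptions):

-- Illustrative only: the log is re-opened for write without closing the previous handle.
local logFile

local function openLog()
  -- The previous handle is not closed here, only overwritten, so FatFs
  -- can end up with the same file open twice for writing.
  logFile = io.open("/LOGS/log_test.txt", "w")
end

local function init()
  openLog()
end

local function run(event)
  openLog()                          -- re-opened on every cycle, never io.close()d
  io.write(logFile, "test line\n")   -- throws the nil error once io.open() starts failing
  return 0                           -- 0 = keep running; abort with long RTN
end

return { init = init, run = run }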

I ran the Windows disk check tool and repair; it deleted the LOGS dir. I recreated it and was able to reproduce on the 7th run of the script.

image

LOG_Test.zip

@frankiearzu (Contributor Author) commented May 31, 2024

My steps to reproduce:

  1. Run LOG_Test.lua (/SCRIPTS/TOOLS) about 5-6 times.
  2. Connect the radio to the computer and try to open LOG_Test.lua with Notepad; for me, it fails.
  3. Check if your LOGS directory is corrupted (most of the time it is not, but eventually it will be).
  4. Disconnect the radio from the computer.
  5. Reconnect the radio to the computer; Windows will say it needs to check and repair the SD card. Proceed with the repair.
  6. Disconnect the radio from the computer.
  7. Run LOG_Test again a few times (about 10), starting it and aborting with long RTN; eventually the script starts having problems writing to a file that was just created.
  8. The radio will start behaving strangely and sometimes say that it needs "storage preparation"; do the calibration again.
  9. Run LOG_Test again; it will also fail within 10 runs, more consistently.

@pfeerick (Member) commented Jun 1, 2024

Now that we are on the subject of SD card corruption; could you add a startup check for SD supported formats? I don't know if cluster size is also an issue or not, but I've seen problems related to ExFat formatted 64Gb+ cards instead of FAT32.

@blopa1961 I think the cluster size generally only affects performance. But the ExFAT issue is simple... ExFAT is not supported, period. Anyway, this is not related to this issue.

@richardclli (Collaborator) commented Jun 1, 2024

I conducted an SD card test via the USB connection to my TX16S flashed with the obsolete_stdperiph branch; the SD card is a 16GB SanDisk. It has been running for 2 hours with 3 threads: one creates files, one reads and verifies files, and one deletes files at random to keep the card from filling up. Everything is fine right now; the log shows the test has already created ~20000 files of 64kB each. I will leave it running for at least a day. If no errors occur, we can assume that the hardware driver part is completely stable.

The problem may reside in FatFs: from the observation of how it is corrupted, it is very likely the FAT is getting messed up, maybe due to a concurrent file write and FAT update. Somebody needs to do a trace to see if there is any locking problem.

@pfeerick (Member) commented Jun 1, 2024

Please keep in mind, Richard, that this is in relation to 2.10 ;) The tests you are doing will certainly reassure us as to the stability of the updates to the hardware drivers for 2.11, though :)

@richardclli (Collaborator)

I am thinking: can this be repeated on a PL18/PL18EV? It does not have an SD card and uses NOR flash. If this is about the FatFs part, it should be reproducible there as well.

@pfeerick (Member) commented Jun 1, 2024

I can give that a try later today ... the script Frankie is using to break this is here ;) #5077 (comment) with the reproduction steps in the next comment. It will also be interesting to try T20, as that takes any colorlcd specific differences out of the equation.

@raphaelcoeffic (Member)

@frankiearzu It seems your test program is doing basically exactly what FatFs does not support in the configuration we are using: opening the same file multiple times with write access.

When destroying the file object with:

logFile = nil

what really happens here is that the object is marked for garbage collection. However, it is not specified when this will happen. So, if you re-open the file for writing in the meantime, you are in uncharted territory.
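As a hedged illustration (path and variable name assumed), the difference between dropping the handle and closing it explicitly:

-- Unsafe: the old handle is only collected "eventually", so the file
-- may still be open for write when it is opened again.
logFile = nil
logFile = io.open("/LOGS/dsm_log.txt", "w")

-- Safer: close the previous handle first, so FatFs never sees the
-- same file open twice for writing.
if logFile ~= nil then
  io.close(logFile)
  logFile = nil
end
logFile = io.open("/LOGS/dsm_log.txt", "w")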

We could prevent this by using file locking (FF_FS_LOCK option) as described by FatFs:

The option FF_FS_LOCK switches file lock function to control duplicated file open
and illegal operation to open objects. This option must be 0 when FF_FS_READONLY
is 1.

  0:  Disable file lock function. To avoid volume corruption, application program
      should avoid illegal open, remove and rename to the open objects.

  >0: Enable file lock function. The value defines how many files/sub-directories
      can be opened simultaneously under file lock control. Note that the file
      lock control is independent of re-entrancy.

This would come at the cost of 16 bytes per maximum number of simultaneously open files, plus an additional FreeRTOS mutex per volume. The additional mutex and internal locking in FatFs will also have some impact on performance.

So, now the question is if we have any evidence of SD card corruption caused by normal usage patterns outside of LUA scripts violating the safe usage as implemented by FatFs with FF_FS_LOCK == 0.

@raphaelcoeffic (Member)

Maybe something worth mentioning: when killing a script with [Long RTN], I checked that the garbage collecting done afterwards effectively closes the file. Thus this is not the issue itself, but really re-opening a file while the script is still running, without assurance that the previous instance has been effectively closed.

@gagarinlg (Member)

There were reports of issues when users just had logging turned on.

@pfeerick (Member) commented Jun 1, 2024

Oh dear... I ran Frankie's test script about 25-30 times on a PL18EV with v2.10.1, with a reboot or two of the radio and some force quits of the script rather than letting it run to completion, and had zero issues; the file system seems intact. The TX16S (also v2.10.1) pooped itself after about 6-7 complete runs of the script: Lua broke with bad argument #1 to 'write' (FILE* expected, got nil) and I was unable to browse the SD card on the radio. Attempting to power off the radio results in a hang, requiring a hard power off. Did this twice, with maybe two fewer runs of the script the second time. I didn't let the computer repair the file system, but a dirty volume was detected.

And the punchline? Repeat testing with the same radio and same SD card, but on 2.9.4, is now up to 10... 20... (edit: 30) repeat runs of the script with no apparent issue... 😖

@offer-shmuely (Contributor)

Maybe something worth mentioning: when killing a script with [Long RTN], I checked that the garbage collecting done afterwards effectively closes the file. Thus this is not the issue itself, but really re-opening a file while the script is still running, without assurance that the previous instance has been effectively closed.

If it's worth anything, in the simulator it definitely crashes more often if I reload the OS (F9) during a running script, instead of first exiting the script with RTN before pressing F9.

@raphaelcoeffic (Member)

Ok, so there is something with which you can test if you are using illegal operations on open objects (as specified by FatFs):

diff --git a/radio/src/thirdparty/FatFs/ffconf.h b/radio/src/thirdparty/FatFs/ffconf.h
index fda61e9edc..07115ea81b 100644
--- a/radio/src/thirdparty/FatFs/ffconf.h
+++ b/radio/src/thirdparty/FatFs/ffconf.h
@@ -260,8 +260,11 @@
 /  FF_NORTC_MDAY and FF_NORTC_YEAR have no effect.
 /  These options have no effect at read-only configuration (FF_FS_READONLY = 1). */
 
-
-#define FF_FS_LOCK             0
+#if FF_FS_READONLY > 0
+  #define FF_FS_LOCK 0
+#else
+  #define FF_FS_LOCK 8
+#endif
 /* The option FF_FS_LOCK switches file lock function to control duplicated file open
 /  and illegal operation to open objects. This option must be 0 when FF_FS_READONLY
 /  is 1.

When building with this, LOG_Test fails exactly as expected: once the file has been discarded by writing nil into the variable, subsequent open() fails and returns nil, which logically makes the next call to write() fail.

Please note that this works the same way on both 2.9 and 2.10. So, we should probably check that we are using the correct usage patterns, and do not mess with files open for writing under any circumstances. IMHO, what is being exposed by the LOG_Test script is not an issue with the firmware, but with the script.

@frankiearzu (Contributor Author)

@raphaelcoeffic The "production" code does close the file before opening it. This was a way I found to accelerate the behavior. With the production code, sometimes have to run 70-80 times to start having the behavior.. so maybe the LONG RTN exit might still leaving something not closed there.

In the script, just uncomment the lines that close the files and it will be the same as in production, but then the problem is much harder to reproduce.

@raphaelcoeffic (Member)

so maybe the long RTN exit might still be leaving something not closed there

I could not find anything in that direction, so far.

@frankiearzu (Contributor Author) commented Jun 1, 2024

Once you cannot open the files, the way it fails (the nil error on the write) is expected, but the SD card should not get corrupted. I know that the SD card corruption is much more difficult to reproduce, but it does eventually happen with the production script that has the "proper" closing of the files before opening.

This new version (below) closes the files properly, with the exception of the forced exit (long RTN), and matches how the production code works (but with the proper RTN exit handling). It just takes longer to replicate the error.

LOG_Test_2.zip

In 2.9.4, even with the script that does not close the file, there is still no bad behavior and no SD card corruption.

So I think there is still a gremlin in 2.10, just much more difficult to reproduce. If you look at the very first screenshot, which came from what @offer-shmuely reported on Discord, the corruption seems to be data that was probably meant to be in a log file.

This kind of bug is frustrating because it cannot be easily reproduced. When running the script, when I want it to fail, it does not, and when least expected, it happens on the 2nd run! Murphy's law, I guess. In the meantime, I will release a Forward Programming script with the writing to the logs turned off.

@raphaelcoeffic (Member)

@frankiearzu could you please try with the patch I posted? This will prevent the conditions which, according to FatFs, would lead to a potentially corrupted file system. By the way, the same patch works both for 2.9 and 2.10. In my tests, it exposes the same behaviour. Whether or not the file system corruption effectively happens, when illegal operations are issued, is not deterministic. We could just be lucky in 2.9 due to timings or whatever else.

@raphaelcoeffic (Member)

If it's worth anything, in the simulator it definitely crashes more often if I reload the OS (F9) during a running script, instead of first exiting the script with RTN before pressing F9.

That is probably yet another issue, this time with the simulator. By "crashes", do you mean the simulator itself crashes, or does something else happen inside the simulation?

@frankiearzu (Contributor Author) commented Jun 2, 2024

@raphaelcoeffic Tested the patch with the change to FF_FS_LOCK = 8 on the 2.10 branch (2.10.1). The original LOG_TEST.lua (the one not closing the file properly before reopening) failed right away when trying to reopen the file, as expected.

LOG_TEST2.lua (the one properly closing the files) ran OK, but after about 32 runs with aborts (long RTN in the middle), it reported an error opening the file at startup (the very first open for write). Looked at the SD card, and the /LOGS directory was corrupted; the open failed because /LOGS was really corrupted. Again, the garbage names of directories/folders/files match the data the script was writing.

Repaired the drive (with Windows).

Ran it about 20 times with no problem; /LOGS seems OK.

Ran "DSM Forward Programming v0.56" (using the RX simulator, so no real RX is needed); it failed on the 4th run, but on the initial opening of the file ("w" option, overwrite). Connected the radio to the computer, and LOGS was corrupted again.

Don't know why, but the DSM Lua script seems to make it fail faster. It uses a lot more memory, which I tried to replicate in LOG_TEST2 by creating a large table.

Modified the script to run forever (open, write, close, repeat): no errors or corruption after running for about 30 minutes, twice. Wondering if the problem is in the startup or shutdown of Lua scripts corrupting something in memory that later affects the filesystem. Just speculating.

Will continue testing to see if I find a way to make it easier to replicate.

@pfeerick (Member) commented Jun 4, 2024

Just built 2.10.1 with DISK_CACHE=OFF ... let's see what happens 🤷

@pfeerick (Member) commented Jun 4, 2024

Took a little longer than I intended, but... I initially had a crash (this could have been a leftover from before). Restarted the radio, rescanned the file system, and then restarted again, and was able to do 30 successive runs of our favourite breaker script with no failures. I then tried opening a few other Lua tools, no issues. Went back to the log_test lua, started randomly force quitting it about another dozen times, still no issues. Was able to restart radio with no errors. 🤔

@philmoz (Collaborator) commented Jun 4, 2024

Took a little longer than I intended, but... I initially had a crash (this could have been a leftover from before). Restarted the radio, rescanned the file system, and then restarted again, and was able to do 30 successive runs of our favourite breaker script with no failures. I then tried opening a few other Lua tools, no issues. Went back to the log_test lua, started randomly force quitting it about another dozen times, still no issues. Was able to restart radio with no errors. 🤔

Ok, so the disk cache may be a problem on color radios. I wonder if the PL18 would have the problem if built with disk caching on?

I've been running Log_Test_2 for about 1/2 hour on T14 without any errors so far.

@3djc (Collaborator) commented Jun 4, 2024

I think we need to be very cautious not to mix issues there. There is an identified colorlcd issue at work here.

B&W SD is another beast altogether (hardware and software); my understanding is that we do NOT have confirmation that this specific issue also manifests itself on B&W radios (as opposed to random SD incompatibilities).

@philmoz (Collaborator) commented Jun 4, 2024

@pfeerick Can you please try the attached patch on your TX16S - it adds a mutex and locking to the cache to make it thread safe(r).

I have not (yet) been able to crash mine with these changes.

cache_mutex.diff.txt

@frankiearzu (Contributor Author) commented Jun 4, 2024

Will do some tests too, thank you guys! (Work interfering with the hobby :-)
It happens in B&W too: @qsiguy (Shane's DIY) replicated it on a Zorro as well. Shane, can you confirm?

@frankiearzu (Contributor Author) commented Jun 4, 2024

After applying the disk cache patch, it took me about 60 runs of LOG_TEST2 to corrupt my LOGS dir.

But it seems that DSM Forward Prog 0.56 is failing quicker now (in about 5 runs). If you have DSM Forward Prog 0.56, even if you have no Spektrum RX, run it, select "RX Simulator", then pick any of the RX models there and you are inside the sim; then long RTN to abort.
After 5 runs, go to Sys->SDCard, go into /LOGS, and see if there is garbage in the directory; that saves the time of connecting it to the PC to check.

If not corrupted yet, do 5 more runs.
All this with two different TX16S radios, on 2.10.1 with the disk cache patch; I removed the FF_FS_LOCK patch, but will try with both.

Here is a link to the DSM tool:
https://github.com/frankiearzu/DSMTools/tree/main/DSM%20Forward%20Programming/Color_480x272. The ZIP has the same SD card structure, so it is just a matter of copying the files after unzipping.

@frankiearzu (Contributor Author) commented Jun 5, 2024

Tried a bit more, and it seems that I can duplicate it easily using DSM Forward Prog 0.56. It is opening and closing the logs properly, and that's the only file write it does while running the RX simulator. 5-10 runs, and /LOGS is corrupted.
About the same with official 2.10.1 or with the cache mutex change.

With LOG_TEST2 it still happens, but it takes much longer. From a file-operation perspective both are similar, but DSM FP uses significantly more memory and does some "read" operations on the message file at initialization.

Here is a video of how easy it is to replicate with official 2.10.1 and DSM Forward Prog 0.56. The disk was checked as correct on the computer before each run, and the radio was freshly restarted.

This one corrupted at the 6th time:
https://www.youtube.com/watch?v=xQilp-u13CA
This one on the 3rd:
https://www.youtube.com/watch?v=23Owxm8a6B0

@philmoz (Collaborator) commented Jun 5, 2024

I think I've tracked this down to PR #3182.
I can't reproduce it on the prior commit - 3342d21.

@frankiearzu & @pfeerick - can you please try and replicate my finding regarding building with commit 3342d21 not showing the error.
I find building with -DDISK_CACHE=NO triggers the problem quicker on my TX16S.

@pfeerick (Member) commented Jun 5, 2024

I'll add that to my TODO list for tomorrow (or tonight depending on how bored I get).

@frankiearzu (Contributor Author)

Tried to build commit 3342d21, but it failed:
image

@pfeerick (Member) commented Jun 5, 2024 via email

@frankiearzu (Contributor Author) commented Jun 6, 2024

@philmoz You are right, that's exactly the point where it starts failing.
Just before PR #3182 it was working properly and I could not duplicate the error.
After that PR, the error can be replicated.

@pfeerick (Member) commented Jun 6, 2024

Same TX16S and SD, 42 cycles of original log test lua, with 3342d21, stock options (so DISK_CACHE ON) and no corruption or errors. Because I'm lazy, I then tried DSM Forward Prog v0.55a Color as that is what I had on the SD card. 5 runs, poking around in some of the emulator screens each time. LOGS folder looks fine. Another run of log test lua for good measure, and another five runs of DSM Forward Prog, still no problems... not looking promising for #3182!

btw, https://tallycount.app/ is handy for tests like this 🤭

@philmoz (Collaborator) commented Jun 6, 2024

@frankiearzu There are a couple of things you might want to fix in the script:

  • LogLib.LOG_close should set logFile to nil to prevent using the closed file handle (a rough sketch of this follows below).
  • GUI_SwitchToSIM closes the log file, but does not re-open it, so any future debug writes do nothing.
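A hedged sketch of the first suggestion (the real LogLib internals are not shown in this thread, so the structure and names below are assumed):

-- Hypothetical LogLib sketch; only the "set logFile to nil after close" idea is from the suggestion above.
local LogLib = { logFile = nil }

function LogLib.LOG_open(path)
  LogLib.logFile = io.open(path, "w")
end

function LogLib.LOG_write(text)
  if LogLib.logFile ~= nil then
    io.write(LogLib.logFile, text)
  end
end

function LogLib.LOG_close()
  if LogLib.logFile ~= nil then
    io.close(LogLib.logFile)
  end
  LogLib.logFile = nil  -- drop the handle so it cannot be used after close
end

return LogLib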

@frankiearzu (Contributor Author) commented Jun 6, 2024

@philmoz I made the changes, but the corruption still happens.

I did a lot of reading of the code and made some modifications, but they did not solve the problem. I am using the code at the point just after the SD HAL change, not the very latest 2.10. Maybe the problem is some kind of concurrency issue, which is difficult to debug. I don't know if reads/writes are mutually exclusive at the FatFs level; if not, that's somewhere else where I will try to add a mutex.

This is what I have tried so far (I still have the problem):

  1. Longer timeouts. In the original code, the wait for reading/writing had different timeouts: for reads it originally did SD_WaitReadOperation (200 count), and for writes SD_WaitWriteOperation (500 count). The new code does it via completion callbacks (triggered by an interrupt) that set a flag to exit the waiting loops, so I changed the timeout for the wait. NO EFFECT.

  2. Uncommented SD_WaitReadOperation/SD_WaitWriteOperation instead of interrupts, but they don't work; they just make everything slower, which means they are timing out. What is interesting is that even with the code timing out and reporting errors, besides the slowness, I did not see a change in behavior on the radio. SLOWER RADIO RESPONSE.

  3. Changed the logic of waiting for the SD transfer-complete check to be similar to how it was before. Something to notice is that after waiting for a read/write to complete (successfully or not), the old code always entered the loop that checks for SD_TRANSFER completion even when the read/write failed, while the new code bypasses it. The original code is an infinite loop until the transfer either succeeds or fails (no timeout). My new code, based on the pre-change code, is on the right.
    image
    NO EFFECT.

  4. Added SD callbacks for error and abort SD operations, besides the read/write completion, and added code to check them during read/write completion. NO CHANGE.
    image

  5. This version still has 2 things that were removed in the latest 2.10:
    5.a. Read retry loop (3 tries).
    5.b. If a multi-block read fails, read the individual blocks one at a time.

Other modifications from the SD HAL change to official 2.10:

  1. Improvements to "diskio_sdio.cpp" to bypass calls to functions that just call other functions (transitivity improvements).
  2. Removed the 3-times retry for read operations and the single-block reads when multi-block read operations failed.
  3. It seems we are missing code to check for transfer completion when reading blocks that are not aligned; the write path and the original code do have it. It is probably better to move this transfer check inside "__read_dma" and "_write_dma".
    image
    The call below in red does check for the SD_TRANSFER operation to complete (they are equivalent).
    image
    It did not change the corruption behavior, but I was curious. For some of my temporary testing, I moved that code inside _read_dma and _write_dma.

@frankiearzu (Contributor Author)

Also tested the nightly (Jun 6) 2.11, and can reproduce the corruption of /LOGS with DSM_FP 0.56. I think 2.11 has an updated FatFs.

@philmoz (Collaborator) commented Jun 9, 2024

@frankiearzu Can you try this please:

  • edit radio/src/thirdparty/FatFs/ffconf.h
  • change line 238 to '#define FF_FS_TINY 1'

@frankiearzu (Contributor Author) commented Jun 9, 2024

@philmoz: GOOD NEWS! I made the change you suggested, and after 50 tries everything is good so far: no /LOGS corruption. Back on the original code, I can replicate it within 10 tries consistently with DSM Forward Prog.

From what I can understand from the code, the file system object FATFS has a sector buffer that is used for disk access related to directories. When FF_FS_TINY=1, that buffer is also used as the buffer for file reads/writes (the suggested change that seems to be working fine now).

When FF_FS_TINY=0 (the default), each file has its own sector buffer independent from the FATFS one, which is more efficient.
So for a directory to end up with data that was written to a file, somewhere in the code where it should have used the file buffer, it actually used the FATFS buffer instead. Maybe one of the #if FF_FS_TINY conditions is not right. Will research more.

Another interesting (unrelated) thing: I had an original SD card (the not-so-good-quality one), and the radio was not recognizing the SD card with 2.10, but it worked with 2.9.4. After bringing back some read-retry logic and a longer read timeout, the old card is working fine.

@pfeerick (Member) commented Jun 9, 2024

Another interesting (unrelated) thing: I had an original SD card (the not-so-good-quality one), and the radio was not recognizing the SD card with 2.10, but it worked with 2.9.4. After bringing back some read-retry logic and a longer read timeout, the old card is working fine.

There was some work to resolve this back in #4450. But, because this is indeed an entirely different issue, to do with SD cards that are slow to respond to initialisation, it might be worth doing a PR with your initialisation changes to see if it resolves the other "my SD card isn't working with 2.10" issues.

@richardclli (Collaborator)

Could you give it a try with the current 2.10.1 and change this line in hal.h

#define SDIO_CLK_DIV(24000000)

to

#define SDIO_CLK_DIV(16000000)

and see if this helps. This reduces the bus speed when accessing the card.

@frankiearzu (Contributor Author) commented Jun 14, 2024

Just did a test changing #define SDIO_CLK_DIV(16000000) in src/targets/horus/hal.h (testing on a TX16S).
Same corruption.

So far it seems like FF_FS_TINY=1 is the only thing that solves it. I think it is concurrency: FF_FS_TINY=1 forces the use of a single buffer for file and directory SD access, and that buffer becomes the "mutex" for sector reads/writes when we are not reading/writing full sectors. If writing partially (less than a sector of data), we still need to load the full sector from the SD card into memory, append the data in memory, and later write the full sector back to the SD card.

The weird part is that the FatFs code is the same as in 2.9.x, and we did not have this problem until SD read/write was moved to the new method.

I was trying to add an RTOS mutex around dma_read/dma_write, but it does not compile; some RTOS.h include is probably not pointing to the right place when compiling that part of the code.

@raphaelcoeffic (Member)

So far it seems like FF_FS_TINY=1 is the only thing that solves it. I think it is concurrency: FF_FS_TINY=1 forces the use of a single buffer for file and directory SD access, and that buffer becomes the "mutex" for sector reads/writes when we are not reading/writing full sectors. If writing partially (less than a sector of data), we still need to load the full sector from the SD card into memory, append the data in memory, and later write the full sector back to the SD card.

We may seek guidance from the FatFs author, but as a matter of fact, I think different files do not use the same sectors, at least that is my assumption. Also, I don't think any sector shares a chunk of FAT and a chunk of a file. I may be wrong though, and would need to read more deeply into the FAT documentation / specs to have more insight.

@frankiearzu (Contributor Author)

Spent many hours today trying to figure out the problem. It seems that it is mostly Lua scripts that are close to the upper memory-usage limit where some of the file I/O writes can get corrupted, so it is difficult to tell. In some cases, calling the garbage collector before some file operations (write and close) hides the problem.

Probably enough time has been spent; I will revisit this when we have other reports. As for the Forward Programming script, I released a version with writing to the LOGs turned off, and also added some calls to the garbage collector before closing the files.
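A hedged sketch of that workaround (illustrative names; assuming collectgarbage is available to scripts, as in standard Lua):

-- Illustrative workaround: force a GC pass before closing the log,
-- reducing memory pressure around the file operation.
local function safeCloseLog(logFile)
  collectgarbage("collect")
  if logFile ~= nil then
    io.close(logFile)
  end
  return nil   -- caller assigns this back to its handle variable, e.g. logFile = safeCloseLog(logFile)
end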

@3djc (Collaborator) commented Jun 16, 2024

Have you been able to replicate it with FF_FS_TINY=1, or were you trying without it?

@frankiearzu (Contributor Author) commented Jun 16, 2024

Have you been able to replicate it with FF_FS_TINY=1, or were you trying without it?

With FF_FS_TINY=1 I cannot replicate the problem. The default is FF_FS_TINY=0, and with that I can replicate it consistently with the DSM Forward Programming Lua script (a high-memory-usage Lua). I am not a FatFs expert, but from what I understand it completely changes the logic of how the in-memory sector buffer is used.

With FF_FS_TINY=1, a single buffer is used to cache all reads (and sometimes writes) of SD card data; it is shared for both directory and file data access, and is very short term. Here, this buffer becomes the bottleneck and serializes access to the SD card read/write operations.

On the other hand, when FF_FS_TINY=0, the buffer for directory access is separate from the file-data access buffer; the file-data buffer is independent for each open file and is longer term (for example, with multiple small writes, it only flushes to the SD card when the buffer size is exceeded). More buffers eliminate the bottleneck when multiple files are open at the same time.

Since the corrupted "/LOGS" directory structure contains the data that was intended to be written to a file, it seems like a bug where the wrong buffer was picked when writing data (writing to the directory instead of to the file). The weird thing is that so far it only happens with high-memory-usage Lua scripts that write to files (logs).

My next step would be to capture some disk-access logs via VCP to try to understand it. I have tried many different things in the Lua IO API, where we are ignoring errors (like failing to close a file), but so far they have not changed the behavior.
