SDCard corruption of files and directory/folders #5077
Comments
Similar problem reported by @offer-shmuely on Discord: …
@frankiearzu what does this mean exactly?
@raphaelcoeffic After the weird behavior on scripts/tools was noted, trying to turn off the radio using the power switch got to the shutdown/off screen and the radio froze there; I had to remove the battery to restart. This was initially reported to me by Shane; I played with scripts and the radio a few times and was able to duplicate the error myself. Trying to find a consistent way to reproduce it.
Oh well, corrupted the SD card again. The garbage names of the directories are actually data that the script was writing to dsm_log.txt. Today I was using a second radio with a newly formatted SD card (a good quality one). Connected the radio again, and Windows said there was a problem with the drive; I let it fix the problems, and it actually deleted everything inside LOGS. Ran the script 7 times, connected the radio to the PC, checked the LOGS directory, and it was corrupted again. It seems like Lua scripts writing to LOGS corrupt it, and the corruption eventually propagates to other directories.
This is possibly a good hint at what is happening. Normally, and as the watchdog cannot be turned off (by any means), the radio should have rebooted. So this limits the possibilities a lot: the watchdog is still being refreshed (i.e. the mixer is still ON and running, so no hard fault). However, something is preventing the radio from shutting down. On shutdown we try to flush the radio/model files, so if the file system mutex is still locked, it would behave exactly this way.
Ok, so a block of data found its way into the FAT (File Allocation Table; basically the list of files).
Finally able to reproduce it again, consistently within 10 tries... Just have to run my DSM Forward Programming script and then use the RX simulation about 10 times (exiting the script completely on each run). Connect to the PC, and LOGS is corrupted. On each run of the simulator, the dsm_log.txt file is opened for overwrite (start from the beginning) and then written/appended to. Probably any Lua script that writes will do the trick. Looks like at some point the writing of the file affects the SD card directory structure, since the names of the strange files match the content of the file: …
I'm starting to suspect that it is related to leaving log files open when aborting the script with a long RTN press. In this version, the files are on purpose not closed, just re-opened. After 5-6 tests, I can't read the source file from "/SCRIPTS/TOOLS" using Notepad, and after running it about 10 times with a long RTN in the middle of the script, the script starts to fail: a file error shows up, and most likely the logs are corrupted. The file error is a nil parameter to io.write() on a file that was created/opened in the previous step, so it failed to open. Ran the Windows disk check tool and repaired; it deleted the LOGS dir and recreated it, and I was able to reproduce the corruption on the 7th run of the script.
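For reference, the accelerated test pattern described above looks roughly like this (a hypothetical sketch, not the actual LOG_TEST.lua; the path and messages are made up):

```lua
-- Hypothetical sketch: the log is opened for overwrite on every launch
-- and never closed, so a forced exit (long RTN) always abandons the handle.
local logFile = nil

local function init()
  logFile = io.open("/LOGS/dsm_log.txt", "w")   -- "w" truncates on each run
end

local function run(event)
  if logFile ~= nil then
    io.write(logFile, "tick " .. getTime() .. "\n")
  end
  -- no io.close() anywhere: exiting or killing the tool leaves the file open
  return 0
end

return { init = init, run = run }
```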
My steps to reproduce: …
@blopa1961 I think the cluster size generally only affects performance. But the ExFAT issue is simple... ExFAT is not supported, period. Anyway, this is not related to this issue.
I conducted an SD card test via the USB connection to my TX16S flashed with the obsolete_stdperiph branch; the SD card is a 16GB SanDisk. The test has been running for 2 hours with 3 threads: one creates files, one reads and verifies files, and one deletes files at random to prevent the card from getting full. Everything is fine right now; the log shows the test has already created ~20000 files of 64kB each. Will continue to leave it running for at least a day. If no errors occur, we can assume that the hardware driver part is completely stable and that the problem resides in FatFs, where, judging from how it gets corrupted, it is very likely the FAT that is messed up, maybe due to a concurrent file write and FAT update. Somebody needs to do a trace to see if there are any locking problems.
Please keep in mind, Richard, that this is in relation to 2.10 ;) The tests you are doing will certainly reassure as to the stability of the updates to the hardware drivers for 2.11, though :)
I am wondering, can this be reproduced on a PL18/PL18EV? It does not have an SD card and uses NOR flash. If this is about the FatFs part, it should be reproducible there as well.
I can give that a try later today... the script Frankie is using to break this is here ;) #5077 (comment), with the reproduction steps in the next comment. It will also be interesting to try a T20, as that takes any colorlcd-specific differences out of the equation.
@frankiearzu It seems your test program is doing basically exactly what FatFs does not support in the configuration we are using: opening the same file multiple times with write access. When destroying the file object with logFile = nil, what really happens is that the object is marked for garbage collection. However, it is not specified when this will happen. So, in the meantime, if you re-open the file for writing, you are in uncharted territory. We could prevent this by using file locking (FF_FS_LOCK).
This would come at the cost of 16 bytes per maximum number of simultaneously open files, plus an additional FreeRTOS mutex per volume. The additional mutex and internal locking in FatFs will also have some impact on performance. So, now the question is whether we have any evidence of SD card corruption caused by normal usage patterns, outside of Lua scripts violating the safe usage as implemented by FatFs with FF_FS_LOCK.
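To illustrate the unsafe pattern described above (dropping the reference with logFile = nil instead of closing), here is a rough, hypothetical sketch, not the actual test script:

```lua
-- Hypothetical sketch of the pattern described above.
local logFile = io.open("/LOGS/dsm_log.txt", "w")
io.write(logFile, "first run\n")

-- Dropping the reference does NOT close the file; the handle is only
-- released whenever the Lua garbage collector eventually runs.
logFile = nil

-- Re-opening the same file for writing here can mean two write handles
-- exist at once, which FatFs does not guard against with FF_FS_LOCK = 0.
logFile = io.open("/LOGS/dsm_log.txt", "w")

-- Safe variant: close explicitly before re-opening.
-- io.close(logFile)
-- logFile = io.open("/LOGS/dsm_log.txt", "w")
```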
Maybe something worth mentioning: when killing a script with …
There were reports of issues when users just had logging on …
Oh, er... I ran Frankie's test script like 25-30 times on a PL18EV with v2.10.1, with a reboot or two of the radio, and some force quits of the script rather than letting it run to completion, and zero issues; the file system seems intact. The TX16S (also v2.10.1) pooped itself after about 6-7 complete runs of the script... Lua broke with … And the punchline? Repeat testing with the same radio and same SD card, but 2.9.4, is now up to …
If it's worth anything, in the simulator it definitely crashes more often if I reload the OS (F9) during a running script, instead of first exiting the script with RTN before pressing F9.
Ok, so there is something with which you can test if you are using …
diff --git a/radio/src/thirdparty/FatFs/ffconf.h b/radio/src/thirdparty/FatFs/ffconf.h
index fda61e9edc..07115ea81b 100644
--- a/radio/src/thirdparty/FatFs/ffconf.h
+++ b/radio/src/thirdparty/FatFs/ffconf.h
@@ -260,8 +260,11 @@
/ FF_NORTC_MDAY and FF_NORTC_YEAR have no effect.
/ These options have no effect at read-only configuration (FF_FS_READONLY = 1). */
-
-#define FF_FS_LOCK 0
+#if FF_FS_READONLY > 0
+ #define FF_FS_LOCK 0
+#else
+ #define FF_FS_LOCK 8
+#endif
/* The option FF_FS_LOCK switches file lock function to control duplicated file open
/ and illegal operation to open objects. This option must be 0 when FF_FS_READONLY
/ is 1.
When building with this, please note that this works the same way both on 2.9 and 2.10. So, we should probably check that we are using the correct usage patterns, and do not mess with files open for writing under any circumstances. IMHO, what is being exposed by …
@raphaelcoeffic The "production" code does close the file before opening it; this was a way I found to accelerate the behavior. With the production code, I sometimes have to run it 70-80 times before the behavior starts, so maybe the long RTN exit might still be leaving something not closed. In the script, just remove the comments on the lines that close the files, and it will be the same as production, but the problem is much harder to reproduce.
I could not find anything in that direction, so far.
Once you cannot open the files, the way it fails is expected (the nil error on the write), but the SD card should not get corrupted. I know that the SD card corruption is much harder to reproduce, but eventually it does happen, even with the production script that "properly" closes the files before opening them. This new version (below) closes the files properly, with the exception of the forced exit (long RTN), and matches how the production code works (but with proper RTN exit handling); it just takes longer to replicate the error. In 2.9.4, even with the script that does not close the file, there is still no bad behavior after closing, and no SD card corruption. So I think there is still a gremlin in 2.10, just much more difficult to reproduce. If you look at the very first screenshot, which came from what @offer-shmuely reported on Discord, the corruption seems to be data that was probably meant to be in a log file. These kinds of bugs are frustrating because they cannot be easily reproduced: when I want the script to fail, it does not, and when least expected, it happens on the 2nd run! Murphy's law, I guess. In the meantime, I will release a Forward Programming script with the writing to the logs turned off.
@frankiearzu could you please try with the patch I posted? This will prevent the conditions which, according to FatFs, would lead to a potentially corrupted file system. By the way, the same patch works both for 2.9 and 2.10. In my tests, it exposes the same behaviour. Whether or not the file system corruption effectively happens, when illegal operations are issued, is not deterministic. We could just be lucky in 2.9 due to timings or whatever else.
That is probably yet another issue, this time with the simulator. By "crashes", do you mean the simulator crashes, or that something else happens inside the simulation?
@raphaelcoeffic Tested the patch with the "FF_FS_LOCK = 8" change on the 2.10 branch (2.10.1). The original LOG_TEST.lua (the one not closing the file properly before reopening) failed right away when trying to reopen the file, as expected. LOG_TEST2.lua (the one properly closing the files) ran OK, but after about 32 runs with aborts (long RTN in the middle), it reported an error opening the file at startup (the very first open for write). Looked at the SD card, and the /LOGS directory was corrupted; the open failed because /LOGS really was corrupted. Again, the garbage names of the directories/folders/files match the data the script was writing. Repaired the drive (with Windows), ran it about 20 more times, and no problem; /LOGS seems OK. Ran "DSM Forward Programming v0.56" (using the RX simulator, so no real RX needed); it failed on the 4th run, but on the initial opening of the file (the "w" option, overwrite). Connected the radio to the computer, and LOGS was corrupted again. Don't know why, but the DSM Lua seems to make it fail faster. It uses a lot more memory, and I tried to replicate that in LOG_TEST2 by creating a large table. I also modified the script to run forever (open, write, close, repeat): no errors or corruption after running it for about 30 minutes, twice. Wondering if the problem is in the startup or shutdown of Lua scripts, corrupting something in memory that later affects the filesystem. Just speculating. Will continue testing to see if I find a way to make it easier to replicate.
Just built 2.10.1 with …
Took a little longer than I intended, but... I initially had a crash (this could have been a leftover from before). Restarted the radio, rescanned the file system, and then restarted again, and was able to do 30 successive runs of our favourite breaker script with no failures. I then tried opening a few other Lua tools, no issues. Went back to the log_test Lua, started randomly force-quitting it about another dozen times, still no issues. Was able to restart the radio with no errors. 🤔
Ok, so the disk cache may be a problem on color radios. I wonder if the PL18 would have the problem if built with disk caching on? I've been running Log_Test_2 for about half an hour on a T14 without any errors so far.
I think we need to be very cautious not to mix issues here. There is an identified colorlcd issue at work. The BW SD side is another beast altogether (hw and sw); my understanding is that we do NOT have confirmation that this specific issue also manifests itself on BW radios (as opposed to random SD incompatibilities).
@pfeerick Can you please try the attached patch on your TX16S - it adds a mutex and locking to the cache to make it thread-safe(r). I have not (yet) been able to crash mine with these changes.
Will do some tests too, thank you guys! (Work interfering with the hobby :-)
After applying the disk cache patch, it took me about 60 runs with LOG_TEST2 to corrupt my LOGS dir. But it seems that DSM Forward Prog 0.56 is failing quicker now (in about 5 runs). If you have DSM Forward Prog 0.56, even if you have no Spektrum RX, run it, select "RX Simulator", then pick any of the RX models there and you are inside the sim; then long RTN to abort. If it is not corrupted yet, do 5 more runs. Here is a link to the DSM tool: …
Tried a bit more, and it seems that I can duplicate it easily using DSM Forward Prog 0.56. It opens and closes the logs properly, and that is the only file writing it does while running the RX simulator. 5-10 runs, and /LOGS is corrupted. With LOG_TEST2 it still happens, but it takes much longer. From the file operation perspective, both are similar, but DSM FP uses significantly more memory and does some read operations on the message file at initialization. Here is a video of how easy it is to replicate with official 2.10.1 and DSM Forward Prog 0.56; the disk was checked as correct on the computer before each run, and the radio freshly restarted. This one corrupted on the 6th time: …
I think I've tracked this down to PR #3182. @frankiearzu & @pfeerick - can you please try to replicate my finding that building with commit 3342d21 does not show the error.
I'll add that to my TODO list for tomorrow (or tonight depending on how bored I get).
Tried to build commit 3342d21 but failed: …
Replace the radio/util/find_clang.py with the one from main.
Same TX16S and SD card, 42 cycles of the original log test Lua with 3342d21, stock options (so DISK_CACHE ON), and no corruption or errors. Because I'm lazy, I then tried DSM Forward Prog v0.55a Color, as that is what I had on the SD card: 5 runs, poking around in some of the emulator screens each time, and the LOGS folder looks fine. Another run of the log test Lua for good measure, and another five runs of DSM Forward Prog, still no problems... not looking promising for #3182! btw, https://tallycount.app/ is handy for tests like this 🤭
@frankiearzu There are a couple of things you might want to fix in the script: …
@philmoz Made the changes, but the corruption still happens. Did a lot of reading of the code and made some modifications, but that did not solve the problem. Using the code at the point right after the SD HAL change, not the very latest 2.10. Maybe the problem is some kind of concurrency issue, which is difficult to debug. Don't know if reads/writes are mutually exclusive at the FatFs level, but if not, that's something else for which I will try to add a mutex. This is what I have tried so far (still have the problem): …
**Other modifications from the SD HAL change to official 2.10:**
Also tested the nightly (Jun 6) 2.11, and can reproduce the corruption of /LOGS with DSM_FP 0.56. I think 2.11 has an updated FatFs.
@frankiearzu Can you try this please: …
@philmoz: GOOD NEWS! Made the change you suggested, and after 50 tries everything is good so far, no /LOGS corruption. Back on the original code, I can replicate it within 10 tries consistently with the DSM FD prog. From what I can understand from the code, the file system object FATFS has a sector buffer that is used for disk access related to directories. With FF_FS_TINY=1 (the suggested change that seems to be working fine now), that buffer is also used as the buffer for file reads/writes. With FF_FS_TINY=0 (the default), each file has its own sector buffer, independent from the FATFS one, which is more efficient. Another interesting (unrelated) thing: I had the original SD card (the not-so-good-quality one) that the radio was not recognizing with 2.10 but which worked with 2.9.4; after bringing back some read-retry logic and a longer read timeout, the old card is working fine.
There was some work to resolve this back in #4450. But, because this is indeed an entirely different issue, to do with SD cards that are slow to respond to initialisation, it might be worth doing a PR with your initialisation changes to see if it resolves the other "my SD card isn't working with 2.10" issues.
Could you give it a try: use current 2.10.1, change the line in hal.h from #define SDIO_CLK_DIV(24000000) to #define SDIO_CLK_DIV(16000000), and see if this helps. This reduces the bus speed when accessing the card.
Just did a test changing #define SDIO_CLK_DIV(16000000) in src/targets/horus/hal.h (testing on a TX16S). So far it seems like FF_FS_TINY=1 is the only thing that solves it. I think it is concurrency: FF_FS_TINY=1 forces a single buffer to be used for file and directory SD access, and that effectively becomes the "mutex" for sector reads/writes when we are not reading/writing full sectors. When writing a partial sector (less than a sector of data), we still need to load the full sector from the SD card into memory, append the data in memory, and later write the full sector back to the SD card. The weird part is that the FatFs code is the same as in 2.9.x, and we did not have this problem until SD reads/writes moved to the new method. I was trying to add an RTOS mutex on dma_read/dma_write, but it does not compile; some RTOS.h include is probably not pointing to the right place when compiling that part of the code.
We may seek guidance from the FatFs author, but as a matter of fact, I think different files do not use the same sectors, at least that is my assumption. Also, I don't think any sector shares a chunk of FAT and a chunk of a file. I may be wrong though, and would need to read more deeply into the FAT documentation / specs to have more insight.
Spent many hours today trying to figure out the problem. It seems that it is mostly with Lua scripts that are close to the upper edge of memory usage that some of the file I/O writes can get corrupted, but it is difficult to tell. In some cases, calling the garbage collector before some file operations (write and close) hides the problem. Probably enough time spent; will revisit when we have other reports. As for the Forward Programming script, I released a version with writing to the logs turned off, and also added some calls to the garbage collector before closing the files.
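For what it's worth, the workaround mentioned above amounts to something like this (a hedged sketch; the helper name and log message are made up):

```lua
-- Hypothetical sketch of the workaround: force a garbage-collection pass
-- before the final write/close so abandoned objects are finalised first.
local function closeLog(logFile)
  collectgarbage("collect")
  if logFile ~= nil then
    io.write(logFile, "closing log\n")
    io.close(logFile)
  end
end
```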
Have you been able to replicate it with FF_FS_TINY=1, or were you trying without it?
With FF_FS_TINY=1 I cannot replicate the problem. The default is FF_FS_TINY=0, and with that I can replicate it consistently with the Lua DSM Forward Prog (a high-memory-usage Lua script). I'm not a FatFs expert, but from what I understand, it completely changes the logic of how the in-memory sector buffer is used. With FF_FS_TINY=1, a single buffer is used to cache reads (and sometimes writes) of all SD card data; it is shared for both directory and file data access, and it is very short-lived. That buffer becomes the bottleneck and serializes access to the SD card read/write operations. On the other hand, with FF_FS_TINY=0, the buffer for directory access is separate from the file data buffer, the file data buffer is independent for each open file, and it is longer-lived (for example, multiple small writes are only flushed to the SD card once they exceed the buffer size). More buffers eliminate the bottleneck when multiple files are open at the same time. Since the corrupted "/LOGS" directory structure contains the data that was intended to be written to a file, it looks like a bug where the wrong buffer was picked when writing data (writing to the directory instead of the file). The weird thing is that, so far, this only happens with high-memory-usage Lua scripts that write to files (logs). My next step would be to capture a disk access log via VCP to try to understand it. I made many tries at different things in the Lua IO API where we are ignoring errors (like a failure to close a file), but nothing has changed the behavior so far.
Is there an existing issue for this problem?
What part of EdgeTX is the focus of this bug?
Transmitter firmware
Current Behavior
After some time of using the radio, the SD card seems to have corrupted files and directories/folders.
Some scenarios:
While using the DSM Forward Programming tool, it starts giving errors related to reading and writing files on the SD card. The messages displayed in the GUI are stored externally and are read into memory at start; on some low-memory radios, the script reads the messages in chunks later on (see the sketch after these scenarios). After these errors, and after exiting the script, the radio cannot read anything from the SD card: if you go to SYS->SD Card, it shows empty, and the radio can't be turned off. After removing the battery and starting again, the SD card seems back to normal. The script also writes a log while it is running. The Lua script has been stable with no modifications for a while, and did not have a problem on 2.9.x.
Don't know if this is related to the previous description or independent, but after some time, the files/directories/folders seem corrupted. The SD card has directories/folders with garbage names, and some images and sound files don't display or play correctly. The only way to recover is to format the SD card again and rebuild it from a backup, or start clean.
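For context, the chunked message-file reading mentioned in the first scenario follows roughly this pattern (an illustrative sketch only; the function name, chunk size, and parsing are not the actual DSM script's):

```lua
-- Illustrative sketch of reading an external text file in chunks
-- (to keep memory usage low on low-memory radios).
local function loadMessages(path)
  local f = io.open(path, "r")
  if f == nil then return nil end
  local lines, buffer = {}, ""
  while true do
    local chunk = io.read(f, 512)          -- read up to 512 bytes
    if chunk == "" then break end          -- zero-length chunk means EOF
    buffer = buffer .. chunk
    for line in string.gmatch(buffer, "(.-)\n") do
      lines[#lines + 1] = line             -- collect completed lines
    end
    buffer = string.match(buffer, "[^\n]*$") or ""
  end
  if buffer ~= "" then lines[#lines + 1] = buffer end
  io.close(f)
  return lines
end
```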
Expected Behavior
The SD card should not get corrupted.
Steps To Reproduce
Haven't found a way to reproduce it consistently. When I want it to fail, it does not, and when not expected, I start to see corrupted files/folders.
Version
2.10.0
Transmitter
FrSky Q X7 / Q X7S (ACCESS), RadioMaster TX16S / TX16SMK2
Operating System (OS)
Windows
OS Version
No response
Anything else?
No response