Freeze/crash in About menu since 9f15fd1 #327
Comments
What exactly have you changed in the commit? |
No change, I just did a
What is weird is that the crash happens in the About menu (SystemInfo.cpp), which has nothing to do with the motor controller change in this commit. So indeed, I also initially thought that the problem might be that the memory for the stack was at its limit and the additional So then I thought that it might be a memory leak or memory corruption in LVGL, and that this is why opening the same menu multiple times resulted in a crash. But then it makes no sense that I cannot reproduce this crash on the previous commit (d141888), which contains exactly the same LVGL version and the same menus. So without SWD access, I really don't think I can solve this issue. If no one has time to test this, I'll see if I can re-order a dev kit when they are back in stock. |
Potential memory leak? On another note: Something like Valgrind but for remotely debugged MCUs would be so nice. If anyone knows tools like that that work over a gdb remote, please let me know. |
Is the vibration motor hitting some kind of limit? I'm not really sure, as it is weird that it only happens around the third time, as @nlfx said. Is it maybe a glitch in the watchdog? |
The vibration motor could cause a brownout event when the battery has a very high internal resistance... but someone with that specific watch would have to actually measure and test this theory. It's really just reading tea leaves when the issue isn't reproducible. |
The crash happens without ever activating the motor. As explained and shown in the bug description above, just opening and closing the About menu 3 times and then trying to go to its 2nd screen is enough to crash the watch.
Is it not? Did anyone try to reproduce the steps I've described in my bug report with commit 9f15fd1 or any commit more recent? On my watch it reliably crashes 100% of the time if I follow the steps I've described. |
Oh, thanks for explaining. This seems really weird, especially the fact that it only happens on the third try. Can you try this again, but access the second menu on the fourth try instead of the third? |
It also happens on the 4th, 5th, etc. |
What intrigued me was that it only happens on the third try. |
I am not sure whether this will fix the error, but can you try replacing the "components/motor/MotorController.cpp" file with this one (I modified it a bit): Do rename it back to MotorController.cpp from MotorController.cpp.txt |
The behavior you're describing makes me think of a memory fragmentation issue: after a few executions, the memory (from LVGL, FreeRTOS, ...) is too fragmented and the allocator cannot find a memory segment big enough for its needs. However, I've just tried to reproduce the issue on the latest commit on develop with no success for now. It doesn't mean that the issue was automagically fixed, but it probably "moved" elsewhere in the code and execution paths :/ EDIT: it happened to me too on this last commit: I just need to open/close the app multiple times and then scroll down to display the next page, and it crashes. |
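To make the fragmentation hypothesis above concrete, here is a minimal sketch of how a heap can fail an allocation even though enough memory is free in total. It is a toy model only: `ToyHeap` and `FragmentationDemo` are hypothetical names, and this is not the allocator used by LVGL or FreeRTOS.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy heap made of fixed-size slots; true means the slot is in use.
// Hypothetical model for illustration, not the LVGL or FreeRTOS allocator.
class ToyHeap {
public:
  explicit ToyHeap(std::size_t slots) : used(slots, false) {}

  // First-fit allocation of `count` contiguous slots.
  // Returns the index of the first slot, or -1 if no free run is large enough.
  long Allocate(std::size_t count) {
    if (count == 0) {
      return -1;
    }
    std::size_t run = 0;
    for (std::size_t i = 0; i < used.size(); ++i) {
      run = used[i] ? 0 : run + 1;
      if (run == count) {
        const long start = static_cast<long>(i + 1 - count);
        std::fill(used.begin() + start, used.begin() + start + static_cast<long>(count), true);
        return start;
      }
    }
    return -1;
  }

  // Releases `count` slots starting at `start`.
  void Free(long start, std::size_t count) {
    std::fill(used.begin() + start, used.begin() + start + static_cast<long>(count), false);
  }

  // Number of free slots, contiguous or not.
  std::size_t TotalFree() const {
    return static_cast<std::size_t>(std::count(used.begin(), used.end(), false));
  }

private:
  std::vector<bool> used;
};

// Fills an 8-slot heap with four 2-slot blocks, frees every other block,
// then asks for 3 contiguous slots. Returns true if that request fails
// even though 4 slots are free in total.
bool FragmentationDemo() {
  ToyHeap heap(8);
  const long a = heap.Allocate(2);
  heap.Allocate(2);
  const long c = heap.Allocate(2);
  heap.Allocate(2);
  heap.Free(a, 2);
  heap.Free(c, 2);
  return heap.TotalFree() == 4 && heap.Allocate(3) == -1;
}
```

In this model the two free holes are 2 slots each, so a 3-slot request cannot be satisfied. A real embedded heap behaves similarly when small long-lived allocations are interleaved with larger short-lived ones.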
This sounds really weird. What happens if we start removing info from the screen? |
I've tried to remove the content of the info screen completely, so that it would only display a black screen... and it still crashed! And I have no clue why it's hardfaulting that way. Next experiments: run the whole Screen/ScreenList/Settings/SystemInfo classes on a computer to check with memcheck that everything is OK, check IRQ priorities in FreeRTOS, disable some code until the problem disappears, ... |
Well, good luck 👍 |
This makes no sense:
This bug seems to be "moving" with each commit: it's present in 9f15fd1 and ff00873 (branch move-heap-to-static) but not in current develop (79f0fcb) or in this random commit in between: 13e3463. I've identified 2 consecutive commits: All I can say for now is that this issue is not caused by compiler optimizations (-O3), because it also happens with -Og. Soooo... This bug has probably been present for some time, but it probably corrupts memory that is sometimes critical, sometimes not. I still have no idea how to debug this... any help is welcome! EDIT: even weirder:
PLEAAAASE help! Don't make me commit this! :D |
Are any bits set in any of the fault status registers when the problem occurs? |
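For readers following along, the question above refers to the Cortex-M3/M4 Configurable Fault Status Register (CFSR, memory-mapped at 0xE000ED28). Below is a minimal host-side sketch that turns a raw CFSR value into the names of the set bits. The bit positions follow the ARMv7-M architecture; `FaultBit`, `kCfsrBits` and `DecodeCfsr` are hypothetical helper names, and the sketch says nothing about the values actually observed in this issue.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One named bit of the CFSR. Hypothetical helper type for this sketch.
struct FaultBit {
  std::uint32_t mask;
  const char* name;
};

// CFSR layout: MemManage status in bits 0-7, BusFault status in bits 8-15,
// UsageFault status in bits 16-31.
constexpr FaultBit kCfsrBits[] = {
    {1u << 0, "IACCVIOL"},      // MemManage: instruction access violation
    {1u << 1, "DACCVIOL"},      // MemManage: data access violation
    {1u << 3, "MUNSTKERR"},     // MemManage: fault while unstacking
    {1u << 4, "MSTKERR"},       // MemManage: fault while stacking
    {1u << 7, "MMARVALID"},     // MMFAR holds a valid fault address
    {1u << 8, "IBUSERR"},       // BusFault: instruction bus error
    {1u << 9, "PRECISERR"},     // BusFault: precise data bus error
    {1u << 10, "IMPRECISERR"},  // BusFault: imprecise data bus error
    {1u << 11, "UNSTKERR"},     // BusFault: fault while unstacking
    {1u << 12, "STKERR"},       // BusFault: fault while stacking
    {1u << 15, "BFARVALID"},    // BFAR holds a valid fault address
    {1u << 16, "UNDEFINSTR"},   // UsageFault: undefined instruction
    {1u << 17, "INVSTATE"},     // UsageFault: invalid execution state
    {1u << 18, "INVPC"},        // UsageFault: invalid PC load
    {1u << 19, "NOCP"},         // UsageFault: no coprocessor
    {1u << 24, "UNALIGNED"},    // UsageFault: unaligned access
    {1u << 25, "DIVBYZERO"},    // UsageFault: divide by zero
};

// Returns the names of all bits set in `cfsr`, in ascending bit order.
std::vector<std::string> DecodeCfsr(std::uint32_t cfsr) {
  std::vector<std::string> names;
  for (const FaultBit& bit : kCfsrBits) {
    if ((cfsr & bit.mask) != 0) {
      names.push_back(bit.name);
    }
  }
  return names;
}
```

On the target, the value would typically be read from 0xE000ED28 inside the HardFault handler or with a debugger, then passed to a decoder like this one.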
@jonvmey |
Logs from NRF SDK:
What does it mean? I don't know! |
I'm pretty sure someone already found this, but does this help (see the answer): https://stackoverflow.com/questions/53253652/debugging-a-hard-fault-in-arm-cortex-m4 |
More info from: https://interrupt.memfault.com/blog/cortex-m-fault-debug
So, if we look at MSP (Main Stack Pointer): R0 = 0x20010000, -> This is the main stack, where the scheduler of FreeRTOS was called.
And the corresponding instruction:
PSP (Process Stack Pointer) : R0 = 0x200080e8 <ucHeap+14016> According to the map file, 0x2600 is an instruction from
And the corresponding instruction :
What is it? In my opinion, it makes more sense to analyze the PSP, as the display task is running at that time and the PC corresponds exactly to what the firmware was supposed to do at that time. Now, why does this instruction at 0x2600 crash the CPU? |
0x2600 doesn't contain an instruction; it contains data, so it makes sense that executing it would cause a fault. The real question is: how is 0x2600 getting into the PC? |
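As background for the MSP/PSP analysis above: on exception entry, a Cortex-M core pushes eight words (R0-R3, R12, LR, PC, xPSR) onto whichever stack was active, and bit 2 of the EXC_RETURN value in LR tells the handler which one that was. Here is a minimal sketch of decoding such a frame from a raw word dump. The names `ExceptionFrame`, `ReadFrame` and `FrameIsOnPsp` are hypothetical, and the sketch assumes the basic frame without FPU state.

```cpp
#include <cstdint>

// Basic frame pushed by a Cortex-M core on exception entry (no FPU state),
// in ascending address order starting at the stack pointer.
struct ExceptionFrame {
  std::uint32_t r0;
  std::uint32_t r1;
  std::uint32_t r2;
  std::uint32_t r3;
  std::uint32_t r12;
  std::uint32_t lr;    // return address of the interrupted function
  std::uint32_t pc;    // address of the instruction that was executing
  std::uint32_t xpsr;  // program status register at the time of the fault
};

// Builds a frame from 8 consecutive words read at the stacked SP.
ExceptionFrame ReadFrame(const std::uint32_t* sp) {
  return ExceptionFrame{sp[0], sp[1], sp[2], sp[3], sp[4], sp[5], sp[6], sp[7]};
}

// Bit 2 of EXC_RETURN: 1 means the frame was pushed on the process stack
// (PSP), 0 means it was pushed on the main stack (MSP).
bool FrameIsOnPsp(std::uint32_t excReturn) {
  return (excReturn & (1u << 2)) != 0;
}
```

With a frame decoded this way, the stacked `pc` is the value to look up in the map file, and the stacked `lr` hints at the caller that led there.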
Ok, I've dug into freertos, irq priorities, task priorities,... everything looks good to me. And then, I looked at the code... for the 1000th time... and here's what I found:
You know what's scary? ApplicationList and Settings have worked this way since 1.0! |
Oh... Yeah, that's bad. Definitely use-after-free issues in some (all?) of the "nested" application constructs. Would it make sense for |
Yes, I think so. This pattern is used by ApplicationList and Settings to load an app when the entry is selected. We have a design issue here: we have a unique_ptr AND a raw pointer pointing to the app:
In this case, we are calling the app using the raw pointer, it calls DisplayApp which destroys the app using the unique_ptr... It's "the snake that bites its tail"... |
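The "snake that bites its tail" pattern described above can be sketched as follows. This is a minimal illustration with hypothetical classes (`Owner`, `App`, `eventLog`), not InfiniTime's real code: an object is called through a raw pointer, and during that call it asks its owner to reset the `unique_ptr` that keeps it alive. The sketch deliberately avoids touching any member after the destruction, so it stays well-defined and only records the order of events.

```cpp
#include <memory>
#include <string>
#include <vector>

// Records the order of events. Hypothetical helper for this sketch.
std::vector<std::string> eventLog;

class Owner;

// Stand-in for a screen/app that can ask its owner to load another app.
class App {
public:
  explicit App(Owner& owner) : owner(owner) {}
  ~App() { eventLog.push_back("app destroyed"); }
  void OnSelect();

private:
  Owner& owner;
};

// Stand-in for the component that owns the current app.
class Owner {
public:
  void Load() { current = std::make_unique<App>(*this); }  // C++14
  App* Raw() { return current.get(); }
  void StartNext() { current.reset(); }  // destroys the currently running app

private:
  std::unique_ptr<App> current;
};

void App::OnSelect() {
  eventLog.push_back("OnSelect entered");
  owner.StartNext();  // `this` is destroyed inside this call
  // We are still executing a member function of a destroyed object.
  // Reading or writing any member here (for example `owner`) would be a
  // use-after-free; the real bug did the equivalent of that.
  eventLog.push_back("OnSelect still running");
}

// Runs the scenario and returns the recorded event order.
std::vector<std::string> RunDemo() {
  eventLog.clear();
  Owner owner;
  owner.Load();
  owner.Raw()->OnSelect();  // call through the raw pointer
  return eventLog;
}
```

The log shows the destructor running in the middle of `OnSelect`. One possible way to avoid this, stated as an assumption rather than as what the fix actually does, is to have the app only request the switch and let the owner destroy it after the call has returned.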
That's an idea. I tried to avoid nested apps to keep the memory usage under control and to keep the code simple, but we can think about that. |
Should be fixed in #415 |
For those interested, here is a simple project that reproduces the issue that was fixed: A.h:
B.h:
B.cpp:
main.cpp:
Valgrind is not happy with that code:
|
@JF002 Wow, that was some nasty bug, and must have been a pain to troubleshoot. Thank you so much for tracking this down, finding the root cause, and fixing it :-) |
I'll use the opportunity to reference https://liberapay.com/JF002, there's a nice way to support the project and its lead maintainer. |
Oooh yeah, I spent way too much time on this issue! But I'm happy I found the cause and fixed it! Thanks for your support! |
I've had frequent crashes since yesterday's merges. Here is how to reliably reproduce the issue:
infinitime-about-menu-crash.mp4
I've bisected the problem to commit 9f15fd1. However given its content, I don't understand how it could create this crash. Maybe it's a problem with some completely unrelated part of the code which only becomes visible after this commit due to some timing or memory position constraint?
Could someone with a dev kit and SWD access please see if they manage to get more info about what is creating the above crash?
In case this bug is memory-location dependent and the strings with file paths in the firmware change the behavior, here is my build of the above commit.
Thanks a lot for your help!