Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Freeze/crash in About menu since 9f15fd1 #327

Closed
nlfx opened this issue May 3, 2021 · 31 comments
Closed

Freeze/crash in About menu since 9f15fd1 #327

nlfx opened this issue May 3, 2021 · 31 comments
Labels
bug Something isn't working
Milestone

Comments

@nlfx
Copy link
Contributor

nlfx commented May 3, 2021

I've had frequent crashes since yesterday merges. Here is how to reliably reproduce the issue:

  • Open and close About menu twice
  • Open About menu a third time
  • Swipe up to try to access screen 2
  • Watch freezes and is reset by watchdog a few seconds later
infinitime-about-menu-crash.mp4

I've bisected the problem to commit 9f15fd1. However given its content, I don't understand how it could create this crash. Maybe it's a problem with some completely unrelated part of the code which only becomes visible after this commit due to some timing or memory position constraint?

Could someone with a dev kit and SWD access please see if they manage to get more info about what is creating the above crash?

In case this bug is memory location dependent and strings with file path in the firmware change the behavior, here is my version of the above commit.

Thanks a lot for your help!

@ObiKeahloa
Copy link
Contributor

What exactly have you changed in the commit?
My first response too was "memory allocation maybe?" but there seems to be enough memory for doing a simple vibration.
It might be something else triggering this maybe the driver?

@nlfx
Copy link
Contributor Author

nlfx commented May 5, 2021

What exactly have you changed in the commit?

No change, I just did a git checkout --recurse-submodules 9f15fd1820e620e42fb717fb5d692d44a3f0f70d, compiled it and flashed it.

My first response too was "memory allocation maybe?" but there seems to be enough memory for doing a simple vibration.

What is weird, is that the the crash happens on the About menu (SystemInfo.cpp) which has nothing to do with the motor controller change in this commit.

So indeed, I also initially though that the problem might be that the memory for the stack was at its limit and the additional motorController argument passed down the call stack was enough to start overwriting other memory region. But the fact that the crash happen only when accessing the About menu for the 3rd time would make no sense if that was the real problem. Unless I'm missing something?

So then I though that it might be a memory leak or memory corruption in the LVGL, and that's why opening the same menu multiple times resulted in a crash. But then it makes no sense that I cannot reproduce this crash on the previous commit (d141888) which contains exactly the same LVGL version and same menus.

So without SWD access, I really don't think I can solve this issue. If no one has time to test this, I'll see if I can re-order a dev kit when they'll be back in stock.

@Avamander
Copy link
Collaborator

@JF002

But the fact that the crash happen only when accessing the About menu for the 3rd time would make no sense if that was the real problem.

Potential memory leak?

On another note: Something like Valgrind but for remotely debugged MCUs would be so nice. If anyone knows tools like that that work over a gdb remote, please let me know.

@ObiKeahloa
Copy link
Contributor

Is the vibration motor causing some kind of limit? Not really sure as it happening only around the third time as @nlfx told is weird.

Is it maybe a glitch in the Watchdog?

@Avamander
Copy link
Collaborator

Avamander commented May 5, 2021

Vibration motor could cause a brownout event when the battery has a very high internal resistance... but someone with that specific watch would have to actually measure and test out this theory. It's really just predicting on tea leaves when the issue isn't reproducible.

@nlfx
Copy link
Contributor Author

nlfx commented May 5, 2021

Is the vibration motor causing some kind of limit?

The crash happens without ever activating the motor. As explained / shown in the above bug description, just opening / closing the About menu 3 time and trying to go to its 2nd screen is enough to crash the watch.

It's really just predicting on tea leaves when the issue isn't reproducible.

Is it not? Did anyone try to reproduce the steps I've described in my bug report with commit 9f15fd1 or any commit more recent? On my watch it reliably crashes 100% of the time if I follow the steps I've described.

@ObiKeahloa
Copy link
Contributor

Is the vibration motor causing some kind of limit?

The crash happens without ever activating the motor. As explained / shown in the above bug description, just opening / closing the About menu 3 time and trying to go to its 2nd screen is enough to crash the watch.

It's really just predicting on tea leaves when the issue isn't reproducible.

Is it not? Did anyone try to reproduce the steps I've described in my bug report with commit 9f15fd1 or any commit more recent? On my watch it reliably crashes 100% of the time if I follow the steps I've described.

Oh thanks for explaining , this seems really weird and the fact that it only happens after three tries (On the third one) , can you try this again but try to access the second menu on the fourth try instead of the third?

@nlfx
Copy link
Contributor Author

nlfx commented May 5, 2021

Oh thanks for explaining , this seems really weird and the fact that it only happens after three tries (On the third one) , can you try this again but try to access the second menu on the fourth try instead of the third?

It also happens on the 4th, 5th, etc.
If the Screen1 of the About menu has been opened 3 times or more, the watch crashes when trying to access the Screen2.

@ObiKeahloa
Copy link
Contributor

Oh thanks for explaining , this seems really weird and the fact that it only happens after three tries (On the third one) , can you try this again but try to access the second menu on the fourth try instead of the third?

It also happens on the 4th, 5th, etc.
If the Screen1 of the About menu has been opened 3 times or more, the watch crashes when trying to access the Screen2.

What intrigued me was that it only happens at the third time.

@ObiKeahloa
Copy link
Contributor

I am not sure if this will fix the error or not , can you try replacing the "components/motor/MotorController.cpp" file with this one(Modified it a bit , not sure if it will fix it or not):

MotorController.cpp.txt

Do rename it back to MotorController.cpp from MotorController.cpp.txt

@JF002 JF002 added the bug Something isn't working label May 7, 2021
@JF002
Copy link
Collaborator

JF002 commented May 16, 2021

The behavior you're describing makes me think of a memory fragmentation issue : after a few executions, the memory (from lvgl, freertos,...) is too fragmented and it cannot find a memory segment big enough for its needs.

However, I've just tried to reproduce the issue on the latest commit on develop with no success for now. It doesn't mean that the issue was auto-magically fixed, but it probably "moved" elsewhere in the code and execution paths :/

EDIT : it happened to me too on this last commit : just need to open/close the app multiple time, and then to scroll down to display the next page, it crashes.

@JF002
Copy link
Collaborator

JF002 commented Jun 6, 2021

I can easily reproduce this issue : open/close SystemInfo 2-3 times, then, scroll to the 2nd screen of displayInfo aaand... crash!

I spent far too much hours trying to figure out the cause of this crash, with no luck :/ It looks like the stack of the Display task gets corrupted when it crashes and falls into HardFault_handler:
image

Here's the stacktrace when the screen is displayed correctly:
image

I've tried many things : analyze heap, stack and freertos heap/stack usage, disable all the code inside CreateScreen1() and CreateScreen2(), logging, breakpoints,.. but I cannot understand what's happening!

@ObiKeahloa
Copy link
Contributor

This sounds really weird , what happens of we were to start removing info from the screen?

@JF002
Copy link
Collaborator

JF002 commented Jun 6, 2021

I've tried to remove the content of the info screen completely, it would only display a black screen... and it still crashed!

And I have no clue why its hardfaulting that way. Next experiments: run the whole Screen/ScreenList/Settings/SystemInfo classes on a computer to check with memcheck that everything is ok, check IRQ priority in FreeRTOS, disable some code until the problem disappears,...

@ObiKeahloa
Copy link
Contributor

I've tried to remove the content of the info screen completely, it would only display a black screen... and it still crashed!

And I have no clue why its hardfaulting that way. Next experiments: run the whole Screen/ScreenList/Settings/SystemInfo classes on a computer to check with memcheck that everything is ok, check IRQ priority in FreeRTOS, disable some code until the problem disappears,...

Well , Good Luck👍

@JF002
Copy link
Collaborator

JF002 commented Jun 7, 2021

This makes no sense:

  • checkout and build develop (79f0fcb) -> No error
  • then checkout 9f15fd1 (the commit referenced in this issue) -> No error
  • then clean the build folder and try again -> crash

This bug seems to be "moving" on each commit : it's present in 9f15fd1 and ff00873 (branch move-heap-to-static) but not on current develop (79f0fcb) and this random commit in-between : 13e3463.

I've identified 2 consecutive commits:

All I can say for now is that this issue is not caused by compiler optimizations (-O3) because it also happens in -Og.

Soooo... This bug is probably present for some time, but it probably corrupts memory that is sometimes critical, sometimes not.

I still have no idea how to debug this... any help is welcome!

EDIT: even more weird:

  • on ff00873, add a 50B buffer in Screen -> No crash!
class Screen {
public:
  explicit Screen(DisplayApp* app) : app {app} {
    std::fill(bumper.begin(), bumper.end(), 0xaa);
  }

//...
private:
  std::array<uint8_t, 50> bumper;

PLEAAAASE help! Don't make me commit this! :D

@jonvmey
Copy link
Contributor

jonvmey commented Jun 7, 2021

Are any bits set in any of the fault status registers when the problem occurs?

@Raupinger
Copy link
Contributor

Raupinger commented Jun 7, 2021

@jonvmey
CFSR is 0x20000 (INVSTATE) when HardFault_Handler is called
HFSR is 0x40000000 (FORCED)

@JF002
Copy link
Collaborator

JF002 commented Jun 8, 2021

Logs from NRF SDK:

<error> hardfault: HARD FAULT at 0x00010E2E
<error> hardfault:   R0:  0x20010000  R1:  0x40011008  R2:  0xE000EF30  R3:  0x00000010
<error> hardfault:   R12: 0x20001F74  LR:  0x0001113D  PSR: 0x61000000
<error> hardfault: Cause: The processor has attempted to execute an instruction that makes illegal use of the EPSR.

What does it mean? I don't know!

@Itai-Nelken
Copy link
Contributor

Itai-Nelken commented Jun 8, 2021

I'm pretty sure someone already found this, but does this help (see the answer): https://stackoverflow.com/questions/53253652/debugging-a-hard-fault-in-arm-cortex-m4

@JF002
Copy link
Collaborator

JF002 commented Jun 8, 2021

More info from : https://interrupt.memfault.com/blog/cortex-m-fault-debug

(gdb) p/a *(uint32_t[8] *)$msp
$2 = {0x20010000, 0x40011008, 0xe000ef30, 0x10 <_lv_gc_clear_roots+16>, 0x20001f74 <systemTask+1788>, 0x1113d <xPortStartScheduler+156>, 0x10e2e <vPortStartFirstTask+22>, 0x61000000}

(gdb) p/a *(uint32_t[8] *)$psp
$3 = {0x200080e8 <ucHeap+14016>, 0x2000e4ac, 0x2000db48, 0x2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408>, 0x0 <_lv_gc_clear_roots>, 0xec7 <std::function<std::unique_ptr<Pinetime::Applications::Screens::Screen, std::default_delete<Pinetime::Applications::Screens::Screen> > ()>::operator()() const+12>, 0x2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408>, 0x0 <_lv_gc_clear_roots>}

So, if we look at MSP (Main Stack Pointer):

R0 = 0x20010000,
R1 = 0x40011008,
R2 = 0xe000ef30,
R3 = 0x10 <_lv_gc_clear_roots+16>,
R12 = 0x20001f74 <systemTask+1788>,
LR = 0x1113d <xPortStartScheduler+156>,
PC = 0x10e2e <vPortStartFirstTask+22>,
xPSR = 0x61000000

-> This is the main stack, where the scheduler of FreeRTOS was called.
Disassembly:

vPortStartFirstTask():
  10e18:	4805      	ldr	r0, [pc, #20]	; (10e30 <vPortStartFirstTask+0x18>)
  10e1a:	6800      	ldr	r0, [r0, #0]
  10e1c:	f380 8808 	msr	MSP, r0
  10e20:	b662      	cpsie	i
  10e22:	b661      	cpsie	f
  10e24:	f3bf 8f4f 	dsb	sy
  10e28:	f3bf 8f6f 	isb	sy
  10e2c:	df00      	svc	0
  10e2e:	bf00      	nop
  10e30:	00000000 	.word	0x00000000

And the coresponding instruction:

   10e2e:	bf00      	nop 

PSP (Process Stack Pointer) :

R0 = 0x200080e8 <ucHeap+14016>
R1 = 0x2000e4ac,
R2 = 0x2000db48,
R3 = 0x2600 Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408
R12 = 0x0 <_lv_gc_clear_roots>,
LR = 0xec7 <std::function<std::unique_ptr<Pinetime::Applications::Screens::Screen, std::default_deletePinetime::Applications::Screens::Screen > ()>::operator()() const+12>,
PC = 0x2600 Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408,
xPSR = 0x0 <_lv_gc_clear_roots>

According to the map file, 0x2600 is an instruction from Pinetime::Applications::Screens::SystemInfo::CreateScreen2() which is the method called to create the 2nd screen of system info, exactly what I was trying to do at that time. Here is the dissassembly:

Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    2468:	e92d 4ff0 	stmdb	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
    246c:	ed2d 8b02 	vpush	{d8}
    2470:	b099      	sub	sp, #100	; 0x64
    2472:	4681      	mov	r9, r0
    2474:	460c      	mov	r4, r1
    2476:	690d      	ldr	r5, [r1, #16]
_ZNKSt5arrayIiLj5EE4dataEv():
    2478:	1d2b      	adds	r3, r5, #4
_ZNKSt5arrayIiLj5EE3endEv():
    247a:	f105 0018 	add.w	r0, r5, #24
    247e:	2200      	movs	r2, #0
_ZSt10accumulateIPKiiET0_T_S3_S2_():
    2480:	4298      	cmp	r0, r3
    2482:	d003      	beq.n	248c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x24>
    2484:	f853 1b04 	ldr.w	r1, [r3], #4
    2488:	440a      	add	r2, r1
    248a:	e7f9      	b.n	2480 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x18>
_ZNK8Pinetime11Controllers10CircBufferILi5EE10GetAverageEv():
    248c:	7e2b      	ldrb	r3, [r5, #24]
    248e:	fb92 f3f3 	sdiv	r3, r2, r3
    2492:	9314      	str	r3, [sp, #80]	; 0x50
_ZNK8Pinetime11Controllers7Battery7VoltageEv():
    2494:	edd5 7a07 	vldr	s15, [r5, #28]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    2498:	69e3      	ldr	r3, [r4, #28]
_ZNK8Pinetime7Drivers12WatchdogView11ResetReasonEv():
    249a:	681b      	ldr	r3, [r3, #0]
_ZNK8Pinetime7Drivers8Watchdog11ResetReasonEv():
    249c:	681b      	ldr	r3, [r3, #0]
    249e:	2b08      	cmp	r3, #8
    24a0:	f200 80a8 	bhi.w	25f4 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x18c>
    24a4:	e8df f003 	tbb	[pc, r3]
    24a8:	979405a3 	.word	0x979405a3
    24ac:	91a09d9a 	.word	0x91a09d9a
    24b0:	a9          	.byte	0xa9
    24b1:	00          	.byte	0x00
    24b2:	4b53      	ldr	r3, [pc, #332]	; (2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x198>)
    24b4:	9315      	str	r3, [sp, #84]	; 0x54
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    24b6:	68e3      	ldr	r3, [r4, #12]
_ZNK8Pinetime11Controllers8DateTime6UptimeEv():
    24b8:	699a      	ldr	r2, [r3, #24]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    24ba:	4e52      	ldr	r6, [pc, #328]	; (2604 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x19c>)
    24bc:	fba6 3602 	umull	r3, r6, r6, r2
    24c0:	0c36      	lsrs	r6, r6, #16
    24c2:	4b51      	ldr	r3, [pc, #324]	; (2608 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1a0>)
    24c4:	fb03 2216 	mls	r2, r3, r6, r2
    24c8:	4f50      	ldr	r7, [pc, #320]	; (260c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1a4>)
    24ca:	fba7 3702 	umull	r3, r7, r7, r2
    24ce:	0aff      	lsrs	r7, r7, #11
    24d0:	f44f 6361 	mov.w	r3, #3600	; 0xe10
    24d4:	fb03 2317 	mls	r3, r3, r7, r2
    24d8:	4d4d      	ldr	r5, [pc, #308]	; (2610 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1a8>)
    24da:	fba5 2503 	umull	r2, r5, r5, r3
    24de:	096d      	lsrs	r5, r5, #5
    24e0:	ebc5 1205 	rsb	r2, r5, r5, lsl #4
    24e4:	eba3 0382 	sub.w	r3, r3, r2, lsl #2
    24e8:	9311      	str	r3, [sp, #68]	; 0x44
    24ea:	eebc 7ae7 	vcvt.u32.f32	s14, s15
    24ee:	ed8d 7a0e 	vstr	s14, [sp, #56]	; 0x38
    24f2:	f89d 2038 	ldrb.w	r2, [sp, #56]	; 0x38
    24f6:	ee08 2a10 	vmov	s16, r2
    24fa:	eeb8 7ac8 	vcvt.f32.s32	s14, s16
    24fe:	ee77 7ac7 	vsub.f32	s15, s15, s14
    2502:	ed9f 7a44 	vldr	s14, [pc, #272]	; 2614 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1ac>
    2506:	ee67 7a87 	vmul.f32	s15, s15, s14
    250a:	eefc 7ae7 	vcvt.u32.f32	s15, s15
    250e:	edcd 7a0e 	vstr	s15, [sp, #56]	; 0x38
    2512:	f89d 2038 	ldrb.w	r2, [sp, #56]	; 0x38
    2516:	920e      	str	r2, [sp, #56]	; 0x38
    2518:	f037 f8b6 	bl	39688 <lv_disp_get_default>
lv_scr_act():
    251c:	f02c fb0a 	bl	2eb34 <lv_disp_get_scr_act>
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    2520:	2100      	movs	r1, #0
    2522:	f040 f95b 	bl	427dc <lv_label_create>
    2526:	4680      	mov	r8, r0
    2528:	2101      	movs	r1, #1
    252a:	f040 f945 	bl	427b8 <lv_label_set_recolor>
    252e:	68e3      	ldr	r3, [r4, #12]
_ZNK8Pinetime11Controllers8DateTime3DayEv():
    2530:	78da      	ldrb	r2, [r3, #3]
    2532:	9213      	str	r2, [sp, #76]	; 0x4c
_ZNK8Pinetime11Controllers8DateTime5MonthEv():
    2534:	7899      	ldrb	r1, [r3, #2]
    2536:	9112      	str	r1, [sp, #72]	; 0x48
_ZNK8Pinetime11Controllers8DateTime4YearEv():
    2538:	8818      	ldrh	r0, [r3, #0]
    253a:	900f      	str	r0, [sp, #60]	; 0x3c
_ZNK8Pinetime11Controllers8DateTime5HoursEv():
    253c:	f893 c005 	ldrb.w	ip, [r3, #5]
    2540:	f8cd c040 	str.w	ip, [sp, #64]	; 0x40
_ZNK8Pinetime11Controllers8DateTime7MinutesEv():
    2544:	f893 a006 	ldrb.w	sl, [r3, #6]
_ZNK8Pinetime11Controllers8DateTime7SecondsEv():
    2548:	f893 b007 	ldrb.w	fp, [r3, #7]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    254c:	6960      	ldr	r0, [r4, #20]
    254e:	f00b fb01 	bl	db54 <Pinetime::Controllers::BrightnessController::ToString()>
    2552:	9b15      	ldr	r3, [sp, #84]	; 0x54
    2554:	930c      	str	r3, [sp, #48]	; 0x30
    2556:	900b      	str	r0, [sp, #44]	; 0x2c
    2558:	980e      	ldr	r0, [sp, #56]	; 0x38
    255a:	900a      	str	r0, [sp, #40]	; 0x28
    255c:	ed8d 8a09 	vstr	s16, [sp, #36]	; 0x24
    2560:	f89d 3050 	ldrb.w	r3, [sp, #80]	; 0x50
    2564:	9308      	str	r3, [sp, #32]
    2566:	9911      	ldr	r1, [sp, #68]	; 0x44
    2568:	9107      	str	r1, [sp, #28]
    256a:	9506      	str	r5, [sp, #24]
    256c:	9705      	str	r7, [sp, #20]
    256e:	9604      	str	r6, [sp, #16]
    2570:	f8cd b00c 	str.w	fp, [sp, #12]
    2574:	f8cd a008 	str.w	sl, [sp, #8]
    2578:	9d10      	ldr	r5, [sp, #64]	; 0x40
    257a:	9501      	str	r5, [sp, #4]
    257c:	980f      	ldr	r0, [sp, #60]	; 0x3c
    257e:	9000      	str	r0, [sp, #0]
    2580:	9b12      	ldr	r3, [sp, #72]	; 0x48
    2582:	9a13      	ldr	r2, [sp, #76]	; 0x4c
    2584:	4924      	ldr	r1, [pc, #144]	; (2618 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1b0>)
    2586:	4640      	mov	r0, r8
    2588:	f040 f88a 	bl	426a0 <lv_label_set_text_fmt>
    258c:	f037 f87c 	bl	39688 <lv_disp_get_default>
lv_scr_act():
    2590:	f02c fad0 	bl	2eb34 <lv_disp_get_scr_act>
    2594:	4601      	mov	r1, r0
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    2596:	2200      	movs	r2, #0
    2598:	9200      	str	r2, [sp, #0]
    259a:	4613      	mov	r3, r2
    259c:	4640      	mov	r0, r8
    259e:	f02d f86d 	bl	2f67c <lv_obj_align>
_ZSt11make_uniqueIN8Pinetime12Applications7Screens5LabelEJiiRPNS1_10DisplayAppERP9_lv_obj_tEENSt9_MakeUniqIT_E15__single_objectEDpOT0_():
    25a2:	6865      	ldr	r5, [r4, #4]
    25a4:	2028      	movs	r0, #40	; 0x28
    25a6:	f044 fd39 	bl	4701c <operator new(unsigned int)>
    25aa:	4604      	mov	r4, r0
    25ac:	f8cd 8000 	str.w	r8, [sp]
    25b0:	462b      	mov	r3, r5
    25b2:	2204      	movs	r2, #4
    25b4:	2101      	movs	r1, #1
    25b6:	f000 ffb9 	bl	352c <Pinetime::Applications::Screens::Label::Label(unsigned char, unsigned char, Pinetime::Applications::DisplayApp*, _lv_obj_t*)>
_ZNSt10_Head_baseILj0EPN8Pinetime12Applications7Screens6ScreenELb0EEC4IRS4_EEOT_():
    25ba:	f8c9 4000 	str.w	r4, [r9]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    25be:	4648      	mov	r0, r9
    25c0:	b019      	add	sp, #100	; 0x64
    25c2:	ecbd 8b02 	vpop	{d8}
    25c6:	e8bd 8ff0 	ldmia.w	sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}
operator()():
    25ca:	4b14      	ldr	r3, [pc, #80]	; (261c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1b4>)
    25cc:	9315      	str	r3, [sp, #84]	; 0x54
    25ce:	e772      	b.n	24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25d0:	4b13      	ldr	r3, [pc, #76]	; (2620 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1b8>)
    25d2:	9315      	str	r3, [sp, #84]	; 0x54
    25d4:	e76f      	b.n	24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25d6:	4b13      	ldr	r3, [pc, #76]	; (2624 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1bc>)
    25d8:	9315      	str	r3, [sp, #84]	; 0x54
    25da:	e76c      	b.n	24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25dc:	4b12      	ldr	r3, [pc, #72]	; (2628 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1c0>)
    25de:	9315      	str	r3, [sp, #84]	; 0x54
    25e0:	e769      	b.n	24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25e2:	4b12      	ldr	r3, [pc, #72]	; (262c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1c4>)
    25e4:	9315      	str	r3, [sp, #84]	; 0x54
    25e6:	e766      	b.n	24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25e8:	4b11      	ldr	r3, [pc, #68]	; (2630 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1c8>)
    25ea:	9315      	str	r3, [sp, #84]	; 0x54
    25ec:	e763      	b.n	24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25ee:	4b11      	ldr	r3, [pc, #68]	; (2634 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1cc>)
    25f0:	9315      	str	r3, [sp, #84]	; 0x54
    25f2:	e760      	b.n	24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25f4:	4b10      	ldr	r3, [pc, #64]	; (2638 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1d0>)
    25f6:	9315      	str	r3, [sp, #84]	; 0x54
    25f8:	e75d      	b.n	24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25fa:	4b10      	ldr	r3, [pc, #64]	; (263c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1d4>)
    25fc:	9315      	str	r3, [sp, #84]	; 0x54
    25fe:	e75a      	b.n	24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    2600:	0004c778 	.word	0x0004c778
    2604:	c22e4507 	.word	0xc22e4507
    2608:	00015180 	.word	0x00015180
    260c:	91a2b3c5 	.word	0x91a2b3c5
    2610:	88888889 	.word	0x88888889
    2614:	42c80000 	.word	0x42c80000
    2618:	0004c780 	.word	0x0004c780
    261c:	0004c750 	.word	0x0004c750
    2620:	0004c754 	.word	0x0004c754
    2624:	0004c75c 	.word	0x0004c75c
    2628:	0004c764 	.word	0x0004c764
    262c:	0004c768 	.word	0x0004c768
    2630:	0004c770 	.word	0x0004c770
    2634:	0004c774 	.word	0x0004c774
    2638:	0004c744 	.word	0x0004c744
    263c:	0004c748 	.word	0x0004c748

And the corresponding instruction :

    2600:	0004c778 	.word	0x0004c778

What is it?

In my opinion, it makes more sense to analyze the PSP, as the display task is running at that time and PC correspond exactly to what the firmware was supposed to do at that time. Now, why does this instruction at 0x2600 crashes the CPU?

@jonvmey
Copy link
Contributor

jonvmey commented Jun 8, 2021

0x2600 doesn't contain an instruction it contains data, so it makes sense that executing it would cause a fault. The real question is how is 0x2600 getting into the PC?

@JF002
Copy link
Collaborator

JF002 commented Jun 9, 2021

The real question is how is 0x2600 getting into the PC?
The answer might be 42 ;-D

Ok, I've dug into freertos, irq priorities, task priorities,... everything looks good to me.

And then, I looked at the code... for the 1000th time... and here's what I found:

  • DisplayApp::currentScreen points to the current screen, allocated on the heap.
  • When Settings is running, DisplayApp::currentScreen points to that instance of Settings
  • Settings has a list of List as member variable
  • When you tap on an entry from this list, DisplayApp::StartApp() is called
  • This method destroys DisplayApp::currentScreen to create a new app (the one you selected in the menu)
  • Wait... let's have a look at the callstack:
    • DisplayApp::lv_taskl_handler
    • ...
    • List::OnButtonEvent()
    • DisplayApp::StartApp()
    • DisplayAp::LoadApp()
    • currentScreen.reset(nullptr);
  • When currentScreen is reset, the dtor() of the object (Settings, in this case) is called. The dtor() of Settings will destroy all member variables of Settings, List included! It means that the code is destroying the object that called this method... What happens to the stack in this case? Probably nothing good.

You know what's scary? ApplicationList and Settings work this way since 1.0!

@jonvmey
Copy link
Contributor

jonvmey commented Jun 9, 2021

Oh... Yeah, that's bad. Definitely use-after-free issues in some (all?) of the "nested" application constructs.

Would it make sense for DisplayApp to have something like a stack of applications for situations like this? Then entering a nested application from another would push the new app onto the stack and make it the currentScreen and exiting a nested application would pop it off of the stack and restore the parent (now the top of the stack) to the currentScreen.

@JF002
Copy link
Collaborator

JF002 commented Jun 9, 2021

Yes, I think so. This pattern is used by ApplicationList and Settings to load an app when the entry is selected.

We have a design issue here : we have a unique_ptr AND a raw ptr pointing to the app:

  • DisplayApp maintains the current app in currentScreen (unique_ptr). Ideally, that unique_ptr should be the only ptr to the app.
  • The ptr to this that is stored in user_data of the button when creating the app. In fact, the app is leaking a ptr to itself in the constructor

In this case, we are calling the app using the raw pointer, it calls DisplayApp which destroys the app using the unique_ptr... It's "the snake that bites its tail"...

@JF002
Copy link
Collaborator

JF002 commented Jun 9, 2021

Would it make sense for DisplayApp to have something like a stack of applications for situations like this?

That's an idea. I tried to avoid nested app to keep the memory usage under control and to keep the code simple, but we can think about that.

@JF002
Copy link
Collaborator

JF002 commented Jun 10, 2021

Should be fixed in #415

@JF002 JF002 added this to the Version 1.2 milestone Jun 12, 2021
@JF002
Copy link
Collaborator

JF002 commented Jun 12, 2021

For those interested, here is a simple project that reproduces the issue that was fixed:

A.h:

#pragma once
#include <memory>
#include "B.h"

class A {
  public:
    A() {
      ptr = std::make_unique<B>(this);
    }

    void Test() {
      std::cout << "A::Test" << std::endl;
      ptr->Test();
    }

    void Delete() {
      std::cout << "A::Delete()" << std::endl;
      ptr.reset(nullptr);
    }

  private:
    std::unique_ptr<B> ptr;
};

B.h:

#pragma once

class A;
class B {
  public:
    B(A* a);

    ~B();

    void Test();

  private:
    A* ptr;
    int tt = 0;
};

B.cpp:

#include <iostream>
#include "B.h"
#include "A.h"
B::B(A* a) : ptr{a} {

}

void B::Test() {
  std::cout << "B::Test" << std::endl;
  ptr->Delete();

  std::cout << "Is B alive?" << std::endl;
  tt++;
}

B::~B() {
  std::cout << "B::~B() " << std::to_string(tt) << std::endl;
}

main.cpp:

#include <iostream>
#include "A.h"

int main() {
  std::cout << "Hello, World!" << std::endl;

  A a;
  a.Test();
  return 0;
}

Valgrind is not happy with that code:

$ valgrind ./untitled1 
==24823== Memcheck, a memory error detector
==24823== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==24823== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==24823== Command: ./untitled1
==24823== 
Hello, World!
A::Test
B::Test
A::Delete()
B::~B() 0
Is B alive?
==24823== Invalid read of size 4
==24823==    at 0x109A25: B::Test() (B.cpp:13)
==24823==    by 0x10943D: A::Test() (A.h:13)
==24823==    by 0x109260: main (main.cpp:8)
==24823==  Address 0x4d9c0c8 is 8 bytes inside a block of size 16 free'd
==24823==    at 0x48429AB: operator delete(void*, unsigned long) (vg_replace_malloc.c:814)
==24823==    by 0x1096B4: std::default_delete<B>::operator()(B*) const (unique_ptr.h:85)
==24823==    by 0x109770: std::__uniq_ptr_impl<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:182)
==24823==    by 0x109E10: std::unique_ptr<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:456)
==24823==    by 0x109CF6: A::Delete() (A.h:18)
==24823==    by 0x1099F8: B::Test() (B.cpp:10)
==24823==    by 0x10943D: A::Test() (A.h:13)
==24823==    by 0x109260: main (main.cpp:8)
==24823==  Block was alloc'd at
==24823==    at 0x483FF3F: operator new(unsigned long) (vg_replace_malloc.c:417)
==24823==    by 0x109565: std::_MakeUniq<B>::__single_object std::make_unique<B, A*>(A*&&) (unique_ptr.h:962)
==24823==    by 0x10939E: A::A() (A.h:8)
==24823==    by 0x109254: main (main.cpp:7)
==24823== 
==24823== Invalid write of size 4
==24823==    at 0x109A2F: B::Test() (B.cpp:13)
==24823==    by 0x10943D: A::Test() (A.h:13)
==24823==    by 0x109260: main (main.cpp:8)
==24823==  Address 0x4d9c0c8 is 8 bytes inside a block of size 16 free'd
==24823==    at 0x48429AB: operator delete(void*, unsigned long) (vg_replace_malloc.c:814)
==24823==    by 0x1096B4: std::default_delete<B>::operator()(B*) const (unique_ptr.h:85)
==24823==    by 0x109770: std::__uniq_ptr_impl<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:182)
==24823==    by 0x109E10: std::unique_ptr<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:456)
==24823==    by 0x109CF6: A::Delete() (A.h:18)
==24823==    by 0x1099F8: B::Test() (B.cpp:10)
==24823==    by 0x10943D: A::Test() (A.h:13)
==24823==    by 0x109260: main (main.cpp:8)
==24823==  Block was alloc'd at
==24823==    at 0x483FF3F: operator new(unsigned long) (vg_replace_malloc.c:417)
==24823==    by 0x109565: std::_MakeUniq<B>::__single_object std::make_unique<B, A*>(A*&&) (unique_ptr.h:962)
==24823==    by 0x10939E: A::A() (A.h:8)
==24823==    by 0x109254: main (main.cpp:7)
==24823== 
==24823== 
==24823== HEAP SUMMARY:
==24823==     in use at exit: 0 bytes in 0 blocks
==24823==   total heap usage: 3 allocs, 3 frees, 73,744 bytes allocated
==24823== 
==24823== All heap blocks were freed -- no leaks are possible
==24823== 
==24823== For lists of detected and suppressed errors, rerun with: -s
==24823== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

@JF002 JF002 closed this as completed Jun 12, 2021
@nlfx
Copy link
Contributor Author

nlfx commented Jun 12, 2021

@JF002 Wow, that was some nasty bug, and must have been a pain to troubleshoot. Thank you so much for tracking this down, finding the root cause, and fixing it :-)

@Avamander
Copy link
Collaborator

I'll use the opportunity to reference https://liberapay.com/JF002, there's a nice way to support the project and its lead maintainer.

@JF002
Copy link
Collaborator

JF002 commented Jun 12, 2021

@JF002 Wow, that was some nasty bug, and must have been a pain to troubleshoot. Thank you so much for tracking this down, finding the root cause, and fixing it :-)

Oooh yeah, I spent way too much time on this issue! But I'm happy I found the issue and fixed it! Thanks for your support !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants