Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootloader does not start properly after a watchdog reset/incoherent SPI state #24

Closed
JF002 opened this issue Sep 13, 2020 · 15 comments
Closed

Comments

@JF002
Copy link
Contributor

JF002 commented Sep 13, 2020

I'm working on this issue in InfiniTime : race conditions occurs during sleep/wakeup and something put the devices (spi) into an incoherent state (I think it is disabled while a transaction is running). This bug must be fixed in InfiniTime, of course.

When this bug happens, it looks like the watch is totally frozen, because the screen stays black. This is strange because the watchdog should reset the MCU and the bootloader should be able to start again.

I did some checks and i'm positive that the watchdog works as intended and resets the MCU. But... the bootloader does not start. I cannot debug it right now, but I guess it is also stuck somewhere because of this incoherent state of the SPI device (weird, because the MCU has just been reset...).

I'm not 100% sure of my analysis, but I think we really should have a look at this in order to ensure that the bootloader is as robust as possible.

@JF002
Copy link
Contributor Author

JF002 commented Sep 14, 2020

Using InfiniTime 0.8.1 RC, I noticed that the bootloader won't run properly if the watchdog reset is triggered while the display is off (meaning that the cpu is in deep sleep, all devices are disabled,...).

I don't know if this is caused by a bug in InfiniTime in sleep mode or in the bootloader... Anyway, I think the bootloader should be able to run no matter what happens in the application firmware!

@lupyuen
Copy link
Owner

lupyuen commented Sep 16, 2020

Looks like I need to do a full reset or something. Lemme do some research.

How can I reproduce this?

Once I can reproduce this, I'll isolate the problem...

  1. Disable SPI access to display controller

  2. Display SPI access to external flash

  3. Check whether we need to power on the external flash

Also is there some way for me to test whether SPI is in incoherent state?

Once I detect that SPI is incoherent, I can do a full reset maybe.

I can debug the bootloader without SPI by toggling the GPIO pins to change the screen brightness, so that it flashes various patterns to show the bootloader status.

@lupyuen
Copy link
Owner

lupyuen commented Sep 16, 2020

I'm seeing this intermittent problem with the SPI flash, maybe because it's powered down. Will investigate...

target halted due to breakpoint, current mode: Thread 
xPSR: 0x21000000 pc: 0x00000b5e msp: 0x2000ff90, semihosting

That's an assertion failure...

int
spiflash_identify(struct spiflash_dev *dev)
{
    int i;
    int j;
    uint8_t manufacturer = 0;
    uint8_t memory_type = 0;
    uint8_t capacity = 0;
    int release_power_down_count = 0;
    /* Table with unique release power down functions */
    void (*rpd[3])(struct spiflash_dev *);
    int rc = 0;

    /* List of supported spi flash chips can be found in:
     * hw/drivers/flash/spiflash/chips/sysconfig.yml
     */
    static_assert((sizeof(supported_chips) / sizeof(supported_chips[0])) > 1,
        "At lease one spiflash chip must be specified in sysconfig with SPIFLASH_<chipid>:1");

    spiflash_lock(dev);

    /* Only one chip specified, no need for search*/
    if ((sizeof(supported_chips) / sizeof(supported_chips[0])) == 2) {
        supported_chips[0].fc_release_power_down(dev);
        spiflash_read_jedec_id(dev, &manufacturer, &memory_type, &capacity);
        /* If BSP defined SpiFlash manufacturer or memory type does not
         * match SpiFlash is most likely not connected, connected to
         * different pins, or of different type.
         * It is unlikely that flash depended packaged will work correctly.
         */
        assert(manufacturer == supported_chips[0].fc_jedec_id.ji_manufacturer &&
               memory_type == supported_chips[0].fc_jedec_id.ji_type &&
               capacity == supported_chips[0].fc_jedec_id.ji_capacity);  //  Stops here

@lupyuen
Copy link
Owner

lupyuen commented Sep 16, 2020

On 0.8.0: When pressing the buttons to make it hang, we don't get the the assertion failure above. It just hangs.

With the new bootloader that blinks, nothing blinks during the hang.

When I run with gdb and force break during the hang, gdb says that device is stuck at address 0xc488

According to the .map file, 0xc488 is in Pinetime::Drivers::SpiMaster::Write...

 .text._ZN8Pinetime7Drivers9SpiMaster5WriteEhPKhj
                0x000000000000c3b0      0x184 CMakeFiles/pinetime-mcuboot-app.dir/drivers/SpiMaster.cpp.o
                0x000000000000c3b0                Pinetime::Drivers::SpiMaster::Write(unsigned char, unsigned char const*, unsigned int)
 .text._ZN8Pinetime7Drivers9SpiMaster4ReadEhPhjS2_j
                0x000000000000c534      0x134 CMakeFiles/pinetime-mcuboot-app.dir/drivers/SpiMaster.cpp.o
                0x000000000000c534                Pinetime::Drivers::SpiMaster::Read(unsigned char, unsigned char*, unsigned int, unsigned char*, unsigned int)

So it's hanging while writing to SPI?

Screenshot 2020-09-16 at 8 03 03 PM

@lupyuen
Copy link
Owner

lupyuen commented Sep 16, 2020

While the watch was hanging, I waited 1 minute, forced break in gdb. It stopped in a different location: 0x10f7a, which is in Pinetime::Drivers::TwiMaster::Write...

 .text._ZN8Pinetime7Drivers9TwiMaster5WriteEhPKhjb
                0x0000000000010f50       0xa4 CMakeFiles/pinetime-mcuboot-app.dir/drivers/TwiMaster.cpp.o
                0x0000000000010f50                Pinetime::Drivers::TwiMaster::Write(unsigned char, unsigned char const*, unsigned int, bool)

According to the call stack, the caller of Pinetime::Drivers::TwiMaster::Write is 0x11020, which is Pinetime::Drivers::TwiMaster::Wakeup...

 .text._ZN8Pinetime7Drivers9TwiMaster6WakeupEv
                0x0000000000011118       0xa4 CMakeFiles/pinetime-mcuboot-app.dir/drivers/TwiMaster.cpp.o
                0x0000000000011118                Pinetime::Drivers::TwiMaster::Wakeup()

Screenshot 2020-09-16 at 9 35 41 PM

@JF002
Copy link
Contributor Author

JF002 commented Sep 16, 2020

Thanks for this torough analysis @lupyuen !

Here are some more details: In InfiniTime 0.8.0, there are race conditions on between the Display task (which writes on the SPI bus to the display and reads on the TWI bus from the touch controller) and the System task (which put the watch to sleep and wake it up when the button in pressed) : sometimes, system task disables the SPI and TWI devices while it is still in use by the Display task. When that happens, Display task waits for the end of a transaction, and that will never happen because the peripheral is disabled.

When I debugged this issue, I noticed 2 behaviors:

  • System task is locked into an infinite loop waiting for the SPI event, but System task is still running. InfiniTime seems to be frozen, but in fact, only the Display task is locked. System Task is still running and feeds the watchdog. In that case, a long press on the button should prevent the watchdog from beeing fed, and it should reset the CPU.
  • System task is frozen because of the race condition on the SPI/TWI bus, it doesn't feed the watchdog, and the watchdog resets the CPU.

In both cases, I'm pretty sure that the watchdog did reset the MCU, but nothing appears on the display, as if the bootloader was not running properly. That's why I created this issue.

In your case when you waited for 1 minute before attaching GDB, I think you were in the first case: Display task is locked into an infinite loop waiting for the SPI/TWI peripheral. That is the bug I fixed in InfiniTime 0.8.1.
When that happen, could you try to push the button for 7-10s, confirm that nothing happens on the screen and attach the debugger. If I'm right, GDB should tell you it is locked somewhere in the bootloader code.

Using InfiniTime 0.8.1, where this race condition is fixed, I noticed that the bootloader would not run properly when the watchdog reset occurs when the display is OFF (meaning that everything is disabled/put to sleep).

I saw in the chat that you were wondering what happens when the button is pushed:

  • An IRQ is triggered and the function nrfx_gpiote_evt_handler() is called. It starts the debounce timer.
  • DebounceTimerCallback() is called when the debouncing is over.
  • It calls SystemTask::OnButtonPushed()
  • It enables/disabled (Wakeup()/Sleep()) all devices and sends message to display task and system task to let them know they should restart processing or stop them to go to sleep.

Sorry for this long post, but I doubt I'll be able to give it a lot of time before the week-end, but hopefully some of these information will help your analysis. If not, I'll do my best to help you during the evenings in my timezone or during the weekend :)

Thanks again!

@lupyuen
Copy link
Owner

lupyuen commented Sep 17, 2020

Cool thanks and congrats :-) Here's what happened when I tried 0.8.1...

You're right I don't see any more hanging in InfiniTime. The bootloader is now hanging because of the assertion failure in spiflash_identify I mentioned earlier. So lemme now fix this.

The SPI bus is probably messed up, causing this assertion failure.

If I bypass the assertion failure and call NVIC_SystemReset, it seems to fix the problem.

Right now the assertion failure is stuck in an NMI Handler that loops forever. Lemme change the NMI Handler to call NVIC_SystemReset...

Screenshot 2020-09-17 at 11 52 03 AM

@lupyuen
Copy link
Owner

lupyuen commented Sep 17, 2020

OK NMI Handler now calls NVIC_SystemReset, seems to fix the hanging problem! The test binaries are here (with fixed manual firmware rollback checking): https://github.com/lupyuen/pinetime-rust-mynewt/releases/tag/v5.0.3

@JF002
Copy link
Contributor Author

JF002 commented Sep 17, 2020

Wow, thanks for this fix, I'll try it asap!

I admit that, at first, I was a bit surprised by your fix : override NMI_Handler to call NVIC_SystemReset(). But yeah, what else can you do? It looks like that many default implementation just enter into an infinite loop and does nothing (probably what happens here).

I searched how the NRF SDK handles that case :

  • A failed assert calls assert_nrf_callback()
  • which calls app_error_fault_handler()
  • If it is running in DEBUG mode, it logs a lot of useful info about the failed assert and enter into an infinite loop, and if it not running in DEBUG, it calls... NVIC_SystemReset();

So, yeah, congratulations, you did as good as NRF engineers can do 👍

image
image

Now, I'm still wondering why the SPI bus does not work correctly after a watchdog reset... Reset should put all the registers in their default state and everything should work reliably after the reset...
Or, maybe the SPI flash is "frozen" and needs a bit of time or a different init sequence to run properly after InfinTime messed the spi bus?

@JF002
Copy link
Contributor Author

JF002 commented Sep 17, 2020

I've just done some tests :

  • using InfiniTime 0.8.0 : quickly push the button or wdt reset while in sleep mode : the bootloader reboots correctly, with 4 rapid flashes on the backlight.
  • using InfiniTime 0.8.1 : quickly push the button does not crash the firmware, and reset while in sleep mode reset to the bootloader with 4 rapid flashes.
  • OTA from 0.8.0 to 0.8.1 succeeded 3 times
  • auto revert when the image is not validated works
  • manual revert when the image is validated but the button is pushed while the bootloader is running works too!

Everything looks fine to me! I think we can do some other test (like OTA a not working firmware) to ensure that edge cases are covered too, but I won't have the time to do that this evening :)

Thanks a lot for your analysis, work and fix, I think the bootloader is much more reliable now!

@lupyuen
Copy link
Owner

lupyuen commented Sep 18, 2020

Cool thanks for testing! I have merged to master branch and released it here:

https://github.com/lupyuen/pinetime-rust-mynewt/releases/tag/v5.0.4

@rafacouto
Copy link

To review:

  • syscfg.MCU_DEBUG_IGNORE_BKPT (default 0): When enabled, asm(bkpt) will be ignored. If not set, it will hit the breakpoint wherever it gets called, For example, reset and crash.
  • syscfg.HAL_SYSTEM_RESET_CB (default 0): If set, hal system reset callback gets called inside hal_system_reset().

@rafacouto
Copy link

What about disabling SPI master on bootloader at early stage (before newt's hal_init_*)? SPIM.ENABLE register <-- 0

@rafacouto
Copy link

See InfiniTimeOrg/InfiniTime#79 (comment)

In this case, example in 3.8.1 SPIM - TX workaround subsection of Nordic's Anomaly 109 for nRF52832 could be good.

@lupyuen
Copy link
Owner

lupyuen commented Oct 19, 2020

Sorry I'm not planning to make any more changes to MCUBoot, please submit a tested PR for updates.

@lupyuen lupyuen closed this as completed Oct 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants