
Random firmware reset #206

Closed
xoxys opened this issue May 2, 2019 · 370 comments

Comments

@xoxys

xoxys commented May 2, 2019

Hi, I'm running Valetudo 0.3.1, and today the robot was not accessible, so I restarted it. After that it was back with the default AP. No SSH connection possible and no Valetudo available on port 80.

So maybe this is not fixed?

@Hypfer
Owner

Hypfer commented May 2, 2019

There were some commits for the upstart config which could possibly fix this.

You might want to try those.

@xoxys
Author

xoxys commented May 2, 2019

How does the upstart config fix firmware resets?

@xoxys
Author

xoxys commented May 2, 2019

Do you mean 6fd1be6?

@Hypfer
Owner

Hypfer commented May 2, 2019

There is some kind of error counter somewhere in the roborock software which reverts to the previous firmware if it reaches a certain value.

I have no idea where it is and what causes it to increment, but I assume that going OOM is one thing that might do it.

9108819 contains some mitigations against memory leakage which causes the player process to be killed.
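For reference, a minimal sketch of what such an upstart memory limit plus respawn setup could look like. This is illustrative only, not the actual commit; the file path, job name, and limit value are assumptions:

```conf
# /etc/init/valetudo.conf -- hypothetical sketch, not the real commit.
description "Valetudo"

start on started networking
stop on shutdown

# Cap the address space (soft/hard limit, in bytes) so a leaking process
# gets killed by the kernel instead of driving the whole system into OOM...
limit as 100000000 100000000

# ...and have upstart bring the process back automatically when that happens,
# at most 10 times per 60 seconds.
respawn
respawn limit 10 60

exec /usr/local/bin/valetudo
```

The idea is that an OOM-killed and respawned Valetudo is much less harmful than an OOM condition that takes down the vendor's player process.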

@xoxys
Author

xoxys commented May 2, 2019

Thanks for the clarification :) I'll give it a try. First I have to re-flash the firmware to regain control of the robot...

@xoxys
Author

xoxys commented May 3, 2019

The robot is back to life, let's see what happens.

@xoxys
Author

xoxys commented May 4, 2019

This morning the robot was not reachable and the WLAN LED was off, so I decided to reboot. After that, the robot reconnected to WiFi and was NOT reset. So there seems to be another problem. The Valetudo log contains:

Loading configuration file: /mnt/data/valetudo/config.json
Dummycloud is spoofing 203.0.113.1:8053 on 127.0.0.1:8053
Webserver running on port 80
events.js:183
      throw er; // Unhandled 'error' event
      ^

Error: getaddrinfo EAI_AGAIN mymqtt.example.com:8883
    at Object._errnoException (util.js:992:11)
    at errnoException (dns.js:55:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:92:26)
Loading configuration file: /mnt/data/valetudo/config.json
Dummycloud is spoofing 203.0.113.1:8053 on 127.0.0.1:8053
Webserver running on port 80
events.js:183
      throw er; // Unhandled 'error' event
      ^

Error: getaddrinfo EAI_AGAIN mymqtt.example.com:8883
    at Object._errnoException (util.js:992:11)
    at errnoException (dns.js:55:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:92:26)

and boot.log:

 * Stopping flush early job output to logs        [ OK ]
 * Starting configure virtual network devices     [ OK ]
 * Stopping System V initialisation compatibility [ OK ]
 * Starting system logging daemon                 [ OK ]

dnsmasq: unknown interface wlan0

@Hypfer
Owner

Hypfer commented May 4, 2019

dnsmasq: unknown interface wlan0

That sounds very broken

@xoxys
Author

xoxys commented May 4, 2019

... I flashed a fresh firmware yesterday. It worked fine for a whole day, and it also works after a reboot. No idea what's wrong.

@xoxys
Author

xoxys commented May 5, 2019

@Hypfer does the reboot command over SSH work for you? Running reboot from an SSH session kills my current session but does not bring the robot back until I hard-power it off with the physical button and turn it on again.

@Hypfer
Owner

Hypfer commented May 5, 2019

Yup, works here.

@xoxys
Author

xoxys commented May 5, 2019

OK, so here's what I've done:
I built a fresh firmware image from v1810 and from v1820; both have a dnsmasq error in the boot log, even without adding Valetudo to the image. So I decided to remove dnsmasq from upstart. After a reboot everything seems to work: AP mode and provisioning work fine. I also noticed an occasional network error in my Valetudo log. It seems the upstart script does not respect the network-start directive... Maybe there is a small gap between "network interface is up" and the interface actually being able to reach my MQTT server. To fix this I've added a 30s sleep to the pre-start section in the upstart config.

For now, no more errors in the Valetudo log, and reboot works too. Let's see what happens tomorrow :D
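An alternative to a fixed 30-second sleep would be to poll until the MQTT host actually resolves. A hedged sketch of such a pre-start stanza; the hostname is taken from the log above, the timeout is arbitrary, and this assumes `getent` exists on the robot's userland:

```conf
pre-start script
    # Wait up to 30 s for DNS to actually work instead of
    # sleeping unconditionally.
    n=0
    while [ $n -lt 30 ]; do
        getent hosts mymqtt.example.com >/dev/null 2>&1 && exit 0
        n=$((n + 1))
        sleep 1
    done
end script
```

This exits the pre-start block as soon as the name resolves, so a fast-connecting robot does not pay the full delay.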

@xoxys
Author

xoxys commented May 8, 2019

The robot has now been up and running for 3 days, but: I lost the map today...

@tadly

tadly commented May 8, 2019

Because my robot also reset on me 2 days ago, I was searching for related issues.

For me, it took about a week before the robot reset itself (I think; I didn't pay too much attention).

Regarding the error counter: could it be that they keep it in memory?
I was thinking that restarting the player process once a day (via cron) could be worth a shot until a real solution is found.
Unless you guys think I'm way off base :)

@matthiasharrer
Contributor

There is already a memory limit in the upstart config on master which should prevent the resets, I think.

@xoxys
Author

xoxys commented May 8, 2019

It's currently very hard for me to get a stable setup. For me, the memory limit does not ultimately solve the reset issue.

@matthiasharrer
Contributor

Do you already use the memory limit? Well, then I'm afraid I cannot help :( I do think, however, that restarting player is not the right direction to go towards... maybe just downgrade Valetudo to a more stable version.

@xoxys
Author

xoxys commented May 8, 2019

Yep, using the limit :) Don't worry, I know it's hard to develop for a "closed source" device. Random resets and my current map issue are hard to debug.

@desq42

desq42 commented May 13, 2019

…same here: memory limit set, reboot & firmware reset this night at 3:52 am.
(My Gen1 "lives" in the bedroom; my wife was not amused… ;-))

@xoxys
Author

xoxys commented May 13, 2019

For me the robot has been stable for more than a week with my changes. The map was back after a reboot; no idea why it wasn't displayed.

@tadly

tadly commented May 13, 2019

@xoxys would you mind telling me where exactly to remove dnsmasq from?

Also, what might be interesting is how many times the vacuum has been active.
I'd expect this to influence when the vacuum resets. (I'm actually down to 2x a week.)

@dugite-code

@tadly @xoxys I had the error dnsmasq: unknown interface wlan0 in my boot log as well. Changing the option bind-interfaces in /etc/dnsmasq.conf to bind-dynamic cleared that error in boot.log.

So far this doesn't appear to have caused any issues; it's probably better than disabling dnsmasq altogether.
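For anyone else hitting this, the relevant /etc/dnsmasq.conf change would look roughly like this (a sketch; note that bind-dynamic requires dnsmasq 2.63 or newer):

```conf
# bind-interfaces fails hard if wlan0 does not exist yet when dnsmasq
# starts; bind-dynamic picks interfaces up dynamically as they appear.
#bind-interfaces
bind-dynamic
```

Since wlan0 only comes up after provisioning, binding dynamically avoids the race at boot without removing dnsmasq entirely.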

@xoxys
Author

xoxys commented May 13, 2019

@dugite-code good catch :) I'm not sure whether dnsmasq is really required, so I decided to remove it from my setup. But yes, you should be careful with these changes...

@tadly you should maybe follow @dugite-code's instructions first, but if you're looking for the file, I think it is under /etc/upstart.

@samtimes1

Hi everyone, last night my robot also reset its firmware and woke us up in the middle of the night.
I didn't have much time before work today, so I just tried the same command I used last time to push the firmware to the robot (I still had it in my history). It didn't work because the token was wrong. I started discovery and received a different token. With that new token it kinda worked, but the robot told me to charge it before updating, so I just left for work.
Is it possible that the token changed? I thought it always stays the same.

@posixx

posixx commented May 15, 2019

I also have the same problem: after some days the rockrobo loses its config. After rebooting the rockrobo I can connect to the internal WiFi, but there is no SSH or GUI access. So it seems it is completely reset. Very annoying. I hope this is solved soon...

@xoxys
Author

xoxys commented May 19, 2019

@posixx It's not that simple. If you have any suggestions, please share them with us :)

@xoxys
Author

xoxys commented May 19, 2019

@rassaei As far as I know, the token is randomly created at first boot. So after a full reset the token will be regenerated.

@dugite-code

@posixx @xoxys Found something new: on the German roboter-forum.com there was a thread suggesting they stopped seeing this issue once they added the missing /mnt/default/roborock.conf file. If you are missing it as well, you can create one by following the CCC to CE conversion guide.

@josch

josch commented May 20, 2019

After every reset I'm flashing the same firmware image (based on 1792) on my gen2. That image contains /mnt/default/roborock.conf:

root@rockrobo:~# cat /mnt/default/roborock.conf
language=en
name=custom_A.03.0005_CE
bom=A.03.0005
location=de
wifiplan=
timezone=Europe/Berlin
logserver=awsde0.fds.api.xiaomi.com

But I'm still seeing reboots once in a while. I did not otherwise modify that image, so maybe multiple conditions have to be met.

@josch

josch commented May 4, 2020

If you use /bin/sh, the echos do not work as intended.

The script should not use echo -n (which is not portable) but printf instead...

@dgiese

dgiese commented May 4, 2020

So I implemented a flag-cleaner in my dustbuilder and tested it against the v1 firmware (4004 and 4007). It works so far for me. I created init scripts for runlevel 0 (halt) and runlevel 6 (reboot) for the Ubuntu-based firmware. The script checks for the flag "04" and resets it to "01" if necessary. The check runs at boot time and at restart. I think that's the cleanest solution.

Here is the code I use: https://dustbuilder.xvm.mit.edu/resetfix_maybe/
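For illustration, here is a hedged sketch of the "check flag 04, reset to 01" idea, demoed against a scratch file rather than the real flag storage. The file name and byte offset are placeholders, not what the linked script actually uses:

```shell
# Demo of the flag-cleaner logic on a scratch file.
# FLAG_FILE and OFFSET are placeholders, NOT the real device/offset.
FLAG_FILE=demo_flag.bin
OFFSET=0

printf '\004' > "$FLAG_FILE"   # simulate a device flagged for factory reset

# Read one byte at the flag offset as a two-digit hex string.
flag=$(dd if="$FLAG_FILE" bs=1 skip="$OFFSET" count=1 2>/dev/null \
       | od -An -tx1 | tr -d ' ')
if [ "$flag" = "04" ]; then
    # printf with an octal escape is portable across ash/dash/bash,
    # unlike 'echo -n -e'.
    printf '\001' | dd of="$FLAG_FILE" bs=1 seek="$OFFSET" count=1 \
                       conv=notrunc 2>/dev/null
fi

od -An -tx1 "$FLAG_FILE" | tr -d ' '   # prints 01
```

The same read-compare-write pattern applies when the target is the real flag location instead of a scratch file.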

@AlexanderS

@dgiese Thanks for the script. You can replace echo -n -e with printf and even replace this:

echo -n $(date) >> "$_LOG_FILE"
echo -n " " >> "$_LOG_FILE"
echo -n $((offset)) >> "$_LOG_FILE"
echo -n $_SHALL >> "$_LOG_FILE"
echo -n " " >> "$_LOG_FILE"
echo -n $actual >> "$_LOG_FILE"
echo -n " " >> "$_LOG_FILE"
echo " - bad partition flag detected for systemA" >> "$_LOG_FILE"

with something like this:

printf "%s %s %s %s - bad partition flag detected for systemA\n" "$(date)" "$offset" "$_SHALL" "$actual" >> "$_LOG_FILE"

By the way $offset is undefined in cleanflags.sh.

@dgiese

dgiese commented May 5, 2020

Thanks for the hint. I mean, for now it works. @sareyko's original script was cleaner (please don't sue me), but I wanted to make sure that it runs in every environment (bash, ash). I somehow cannot print the hex stuff in ash. Also, I now write to /mnt/reserve, as that partition is persistent across factory resets.

@josch

josch commented May 5, 2020

If you want to be most compatible, don't try to use hex characters but use octal instead. From the POSIX manual of printf: "Hexadecimal character constants as defined in the ISO C standard are not recognized in the format". The following worked fine for me under ash version 0.5.10.2:

$ printf '\001' | xxd
00000000: 01                                       .
$ printf '%b' '\0001' | xxd
00000000: 01                                       

This also worked under dash, bash, zsh and ksh.

@diabl0w

diabl0w commented May 5, 2020

Will using this script cause damage? I mean: whatever is causing the device to factory reset, if we prevent the reset, could some damage occur that the reset would have prevented?

@Hypfer
Owner

Hypfer commented May 5, 2020

I doubt it but of course there's no warranty here. Everything you're doing is your own risk.

Personally I'd say it's plausible that roborock simply added some kind of hidden root detection which messes with the user just enough that they get nudged back to use the cloud. Especially since that started happening only after they became aware of Valetudo which is a viable alternative.

I can't think of any permanent damage that could be caused by this. It's much safer than disabling reboots imo

@diabl0w

diabl0w commented May 5, 2020

I doubt it but of course there's no warranty here. Everything you're doing is your own risk.

Personally I'd say it's plausible that roborock simply added some kind of hidden root detection which messes with the user just enough that they get nudged back to use the cloud. Especially since that started happening only after they became aware of Valetudo which is a viable alternative.

I can't think of any permanent damage that could be caused by this. It's much safer than disabling reboots imo

Thanks, that seems fair... I guess it would still be nice to eventually find the actual cause of the reset flags being set, but it's a step in the right direction at least!

@mathiasrabe

Personally I'd say it's plausible that roborock simply added some kind of hidden root detection which messes with the user just enough that they get nudged back to use the cloud.

As I understand it, the resets only occur when you use a custom firmware made with Dustbuilder and add Valetudo. I've never heard of the resets occurring with custom firmware from Vacuumz, or did I miss anything?

If this only happens with Dustbuilder images, there might be other possibilities than a root detection. Maybe it's just a bug in Dustbuilder which is triggered by Valetudo?

Nevertheless, thanks for all your effort to dig deeper in this topic :D

@sareyko

sareyko commented May 5, 2020

I noticed that the script is broken. If you use /bin/sh, the echos do not work as intended. Both partitions will be flagged with "2D" instead of "01".... which will cause a factory reset at the next boot.

Oh shoot! I forgot that the -n flag to echo is a bashism. But at least on my bot it works just fine in ash (BusyBox 1.24.1).

So I implemented a flag-cleaner in my dustbuilder and tested it against the v1-fw (4004 and 4007). [...] That is checked at boot-time [...]

Why the boot-time check? I don't think that's necessary and might actually be a bad idea in certain circumstances.

will using this script cause damage? I mean in that whatever is causing the device to factory reset, if we prevent the reset, then some sort of damage would happen that resetting would have prevented

I totally agree here. As long as the source of the resets is unknown, preventing them may actually brick the devices. For all we know, the resets might actually be needed to recover the device from some kind of filesystem failure.

thanks, that seems fair... i guess it would still be nice to eventually find the actual cause of the reset flags being set, but a step in the right direction at least!

The flags get set by WatchDoge when a certain message is received via IPC from another process. So far I've not been able to find the process sending the message, and thus the cause of the reset. I'm still looking around, but it seems like the actual source of the message is missing from the firmware I'm using and looking at.
Maybe I'll have a look at another version when I find some more free time.
If anybody wants to help with this: a list of running processes from an RR running a firmware known to reset itself would be a good start.

@Hypfer
Owner

Hypfer commented May 5, 2020

What is triggering the factory reset when you're doing it via the hardware buttons?

https://github.com/dgiese/dustcloud/wiki/Xiaomi-Vacuum-Robots-Factory-Reset

If that is a hardware feature, it should be possible to always recover 🤔

@diabl0w

diabl0w commented May 6, 2020

I'm still looking around but it seems like the actual source of the message is missing in the firmware I'm using and looking at.
Maybe I'll have a look at another version when I find some more free time.
If anybody wants to help with this: A list of running processes from a RR running a firmware known to reset itself would be a good start.

Someone else can chime in if they have experienced something different, but in my experience:

  • https://vacuumz.info/download/gen2/

    • built with a modified imagebuilder
    • uses Valetudo RE
    • I have only used one version, so it's not thorough testing, but I have never had a reset
  • https://dustbuilder.xvm.mit.edu/

    • uses the original imagebuilder
    • uses original Valetudo (an option for Valetudo RE does exist, but I haven't used it)
    • I have had probably half a dozen or more resets with these images (or ones I built myself) in the past

Edited to also indicate differences in Valetudo versions, as pointed out by @dgiese.

@dgiese

dgiese commented May 6, 2020

So from my experience the flags are safe, as in the worst case the vacuum does a factory reset via U-Boot. As long as you don't mess up the recovery copy of the OS you should be fine. From what I saw, the flag "0x4" should never occur under normal circumstances. The other flags (1-3) are normal or are set during an update.

About the differences: Vacuumz has prebuilt images. If no resets have occurred there, then there are two theories: it could be the Valetudo version (RE vs. vanilla), or their images set something special. Technically the images out of Dustbuilder should not really differ in configuration, but maybe there is something weird. However, resets existed before Dustbuilder, so it must be something with Valetudo or some configuration...

@rlka

rlka commented May 11, 2020

I didn't get it: can I just build the new 0.5.1 firmware for my Gen1 and this fix will be included, or do I need to use Dustbuilder with the "experimental feature" somehow?

@jdus

jdus commented May 21, 2020

I did not get it, can i just build the new 0.5.1 firmware for my gen1 and this fix will be included, or i need to use DustBuilder with "experimental feature" somehow?

In the release notes of 0.5.1, @Hypfer states that you can either apply the fix yourself for local firmware builds, or just use dustbuilder (https://builder.dontvacuum.me/). When using dustbuilder, just don't forget to check the box under 'experimental features':
dustbuilder

@Poeschl
Contributor

Poeschl commented May 21, 2020

When building locally with vacuum, the flag --fix-reset applies the fix.

@2relativ

Do I still need to do something manually with 0.5.2? Or is this fix now canon?

@Hypfer
Owner

Hypfer commented May 30, 2020

@2relativ you will need to build a new firmware image with the mitigation enabled. Just replacing the valetudo binary is not enough

@2relativ

@Hypfer thanks! So I don't need to set a flag or anything else, just build a new image with 0.5.2?

@Hypfer
Owner

Hypfer commented May 30, 2020

If you follow the updated guide in the docs everything should be fine

@exetico

exetico commented Jun 14, 2020

#206 (comment)

Our vacuum just reset itself again. This time after ~4 months or so.

I'll use the new solution, and hopefully it'll stay valetudoed 😁

@Hypfer
Owner

Hypfer commented Jun 15, 2020

@exetico Should be solved now. Just make sure to enable the mitigation when building a new firmware

@exetico

exetico commented Jun 16, 2020

Hi @Hypfer

Thanks for the reply. It was not my intention to disturb you :-) I just wanted to report my latest issue, to have the timestamp somewhere; after that I just wanted to find a bit of time and reflash it.

I've now grabbed the latest version from Dustbuilder, including the reset fix, and everything is "back to normal".

Fingers crossed :-)

@WolfspiritM

WolfspiritM commented Jun 23, 2020

I updated a few weeks ago as well, using the fix, and it hasn't reset so far.
However, a few days ago my zones were suddenly gone and Valetudo seemed to be reset to its defaults.
Everything including authentication was lost, but things like the map and cleaning history were still there.
That happened with the daily reboot.
I can only assume that the filesystem where config.json is stored wasn't available when Valetudo started (or that it had some other issue reading the file after the reboot), so it created a new one, but I'm not sure. There even is a "config.json.backup", but that is an empty config, too. Maybe it would be good to use a different location if there already is a backup there.

I don't really think this has anything to do with this issue in particular, but as I have no way to reproduce it and it's some kind of reset, I thought I'd mention it here.

@Schattenruf

Schattenruf commented Jun 30, 2020

Same for me. I updated in March with the fix script:

I think you are most likely referring to my script.
https://github.com/MadJoker0815/roborock_nologs

Up to now no reset, and the filesystem looks good as well:
(screenshot)

Note: I don't have any zones defined.

@Hypfer
Owner

Hypfer commented Jun 30, 2020

Since the mitigation does seem to work fine, this issue will now be closed and hopefully never reopened again.

@Hypfer closed this as completed Jun 30, 2020
@xobs

xobs commented Jul 12, 2020

I've just now discovered this thread. I'm going to give the fix a try. This is mostly just some thoughts I came up with when reading the thread history.

From what I gather, the most likely source of problems stems from WatchDoge. In a recent post, someone also mentioned /dev/watchdog.

If WatchDoge is, in fact, using /dev/watchdog then it's likely using the SoC WDT that will do a hard-reset if the timer expires. If it does that, then things could be in a bad state. This could happen if the process is killed due to OOM or if the filesystem fills up and a write fails.

They also could do something like reset voltage regulators during boot, which could cause a momentary brownout on eMMC. If it was in the process of writing, that could corrupt data. Or maybe they didn't wire up the eMMC reset line. Or maybe this eMMC doesn't work so well under reset.

Given some of the early reports, it certainly sounds like a WDT, especially since there aren't any logs. Maybe this SoC's WDT reports the current count, and we can read that value back.

If this happens again, we should look closer at the hardware watchdog timer as a source of these reboots and filesystem corruption. Thanks to all who investigated it.

Update: I did some poking and it looks like WatchDoge does set up the hardware watchdog timer to reboot if it hasn't been touched for 16 seconds. Based on the reference manual at http://dl.linux-sunxi.org/A23/A23%20User%20Manual%20V1.0%2020130830.pdf:

Read the current status:

[root@rockrobo ~]# ./devmem2 0x01C20Cb8
/dev/mem opened.
Memory mapped at address 0xb6f5f000.
Value at address 0x1c20cb8 (0xb6f5fcb8): 0x000000b1
[root@rockrobo ~]#

Status 0xbx means 512000 cycles (16 seconds), and 0xx1 means "Enable the watchdog".

The watchdog is set to restart the chip immediately if it doesn't get fed every 16 seconds:

[root@rockrobo ~]# ./devmem2 0x01C20Cb4
/dev/mem opened.
Memory mapped at address 0xb6fc8000.
Value at address 0x1c20cb4 (0xb6fc8cb4): 0x00000001
[root@rockrobo ~]# 

If this happens again, one thing we can try is setting it to issue an interrupt rather than restarting (this can be done by running devmem2 0x01C20Cb4 w 2). We can also try just disabling the watchdog timer entirely: this device doesn't seem to have a trapdoor function, and you can actually just set 0x01C20Cb8 to 0x00 to disable it.

But again, that's only if this problem hasn't already been solved.
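The mode-register value read back above can be decoded with plain shell arithmetic. A small sketch, assuming (per the description in this comment) that bit 0 is the enable bit and bits 4-7 select the timeout interval:

```shell
# Decode the WDOG mode value 0xb1 read back via devmem2 above.
# The bit layout is inferred from the thread's reading of the A23 manual,
# not verified against hardware here.
val=0xb1
enabled=$(( val & 0x1 ))          # bit 0: watchdog enable
interval=$(( (val >> 4) & 0xf ))  # bits 4-7: 0xb selects the 16 s timeout
echo "enabled=$enabled interval=$interval"   # prints: enabled=1 interval=11
```

So the register readout 0x000000b1 matches the description: watchdog enabled, interval select 0xb (16 seconds).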

@Hypfer
Owner

Hypfer commented Aug 9, 2020

As pointed out in the dustcloud group by @bsdice, these resets seem to be related to memory usage. WatchDoge allegedly monitors RAM usage to detect memory leaks and increases the failure counter, which then leads to resets.

For firmware 1720, the defaults in the WatchDoge binary should be
#MEMORY_WARN_SIZE: 230000 (VmRSS total)
#MEMORY_TOTAL_SIZE: 250000 (Max VmSize)

and they can also be overridden by setting larger values as environment variables in /opt/rockrobo/watchdog/rrwatchdoge.conf.
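Assuming rrwatchdoge.conf is sourced as a shell fragment (which I have not verified), raising those thresholds could look roughly like this; the values here are arbitrary examples, not recommendations:

```conf
# /opt/rockrobo/watchdog/rrwatchdoge.conf -- sketch only, example values.
# Raise the leak-detection thresholds above the FW 1720 defaults
# (230000 / 250000) so Valetudo's footprint doesn't trip the counter.
export MEMORY_WARN_SIZE=330000
export MEMORY_TOTAL_SIZE=350000
```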

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 20, 2022