
Random firmware reset #206

Closed
xoxys opened this issue May 2, 2019 · 370 comments

Comments

@xoxys

xoxys commented May 2, 2019

Hi, I'm running Valetudo 0.3.1, and today the robot was not accessible, so I restarted it. After that it was back with the default AP. No SSH connection possible and no Valetudo available on port 80.

So maybe this is not fixed?

@Hypfer
Owner

Hypfer commented May 2, 2019

There were some commits for the upstart config which could possibly fix this.

You might want to try those.

@xoxys
Author

xoxys commented May 2, 2019

How does the upstart config fix firmware resets?

@xoxys
Author

xoxys commented May 2, 2019

Do you mean 6fd1be6?

@Hypfer
Owner

Hypfer commented May 2, 2019

There is some kind of error counter somewhere in the roborock software which reverts to the previous firmware if it reaches a certain value.

I have no idea where it is and what causes it to increment, but I assume that going OOM is one thing that might do it.

9108819 contains some mitigations against memory leakage which causes the player process to be killed.
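For reference, a minimal sketch of what such an upstart memory limit plus respawn setup could look like. This is illustrative only, not the actual commit; the file path, job name, and limit value are assumptions:

```conf
# /etc/init/valetudo.conf -- hypothetical sketch, not the real commit.
description "Valetudo"

start on started networking
stop on shutdown

# Cap the address space (soft/hard limit, in bytes) so a leaking process
# gets killed by the kernel instead of driving the whole system into OOM...
limit as 100000000 100000000

# ...and have upstart bring the process back automatically when that happens,
# at most 10 times per 60 seconds.
respawn
respawn limit 10 60

exec /usr/local/bin/valetudo
```

The idea is that an OOM-killed and respawned Valetudo is much less harmful than an OOM condition that takes down the vendor's player process.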

@xoxys
Author

xoxys commented May 2, 2019

Thanks for the clarification :) I'll give it a try. First I have to re-flash the firmware to regain control of the robot...

@xoxys
Author

xoxys commented May 3, 2019

The robot is back to life, let's see what happens.

@xoxys
Author

xoxys commented May 4, 2019

This morning the robot was not reachable and the WLAN LED was off, so I decided to reboot. After that, the robot reconnected to WiFi and was NOT reset. So there seems to be another problem. The Valetudo log contains:

Loading configuration file: /mnt/data/valetudo/config.json
Dummycloud is spoofing 203.0.113.1:8053 on 127.0.0.1:8053
Webserver running on port 80
events.js:183
      throw er; // Unhandled 'error' event
      ^

Error: getaddrinfo EAI_AGAIN mymqtt.example.com:8883
    at Object._errnoException (util.js:992:11)
    at errnoException (dns.js:55:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:92:26)
Loading configuration file: /mnt/data/valetudo/config.json
Dummycloud is spoofing 203.0.113.1:8053 on 127.0.0.1:8053
Webserver running on port 80
events.js:183
      throw er; // Unhandled 'error' event
      ^

Error: getaddrinfo EAI_AGAIN mymqtt.example.com:8883
    at Object._errnoException (util.js:992:11)
    at errnoException (dns.js:55:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:92:26)

and boot.log:

 * Stopping flush early job output to logs        [ OK ]
 * Starting configure virtual network devices     [ OK ]
 * Stopping System V initialisation compatibility [ OK ]
 * Starting system logging daemon                 [ OK ]

dnsmasq: unknown interface wlan0

@Hypfer
Owner

Hypfer commented May 4, 2019

dnsmasq: unknown interface wlan0

That sounds very broken

@xoxys
Author

xoxys commented May 4, 2019

... I flashed a fresh firmware yesterday. It worked fine for a whole day, and it also works after a reboot. No idea what's wrong.

@xoxys
Author

xoxys commented May 5, 2019

@Hypfer does the reboot command over SSH work for you? Running reboot from an SSH session kills my current session but does not bring the robot back until I hard-power it off with the physical button and turn it on again.

@Hypfer
Owner

Hypfer commented May 5, 2019

Yup, works here.

@xoxys
Author

xoxys commented May 5, 2019

OK, so here's what I've done:
I built a fresh firmware image from v1810 and from v1820; both have a dnsmasq error in the boot log, even without adding Valetudo to the image. So I decided to remove dnsmasq from upstart. After a reboot everything seems to work: AP mode and provisioning work fine. I also noticed an occasional network error in my Valetudo log. It seems the upstart script does not respect the network-start directive... Maybe there is a small gap between "network interface is up" and the interface actually being able to reach my MQTT server. To fix this I've added a 30s sleep to the pre-start section in the upstart config.

For now, no more errors in the Valetudo log, and reboot works too. Let's see what happens tomorrow :D
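An alternative to a fixed 30-second sleep would be to poll until the MQTT host actually resolves. A hedged sketch of such a pre-start stanza; the hostname is taken from the log above, the timeout is arbitrary, and this assumes `getent` exists on the robot's userland:

```conf
pre-start script
    # Wait up to 30 s for DNS to actually work instead of
    # sleeping unconditionally.
    n=0
    while [ $n -lt 30 ]; do
        getent hosts mymqtt.example.com >/dev/null 2>&1 && exit 0
        n=$((n + 1))
        sleep 1
    done
end script
```

This exits the pre-start block as soon as the name resolves, so a fast-connecting robot does not pay the full delay.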

@xoxys
Author

xoxys commented May 8, 2019

The robot has now been up and running for 3 days, but: I lost the map today...

@tadly

tadly commented May 8, 2019

Because my robot also reset on me 2 days ago, I was searching for related issues.

For me, it took about a week before the robot reset itself (I think; I didn't pay too much attention).

Regarding the error counter: could it be that they keep it in memory?
I was thinking that restarting the player process once a day (via cron) could be worth a shot until a real solution is found.
Unless you guys think I'm way off base :)

@matthiasharrer
Contributor

There is already a memory limit in the upstart config on master which should prevent the resets, I think.

@xoxys
Author

xoxys commented May 8, 2019

It's currently very hard for me to get a stable setup. For me, the memory limit does not ultimately solve the reset issue.

@matthiasharrer
Contributor

Do you already use the memory limit? Well, then I'm afraid I cannot help :( I do think, however, that restarting player is not the right direction to go towards... maybe just downgrade Valetudo to a more stable version.

@xoxys
Author

xoxys commented May 8, 2019

Yep, using the limit :) Don't worry, I know it's hard to develop for a "closed source" device. Random resets and my current map issue are hard to debug.

@desq42

desq42 commented May 13, 2019

…same here: memory limit set, reboot & firmware reset this night at 3:52 am.
(My Gen1 "lives" in the bedroom; my wife was not amused… ;-))

@xoxys
Author

xoxys commented May 13, 2019

For me the robot has been stable for more than a week with my changes. The map was back after a reboot; no idea why it wasn't displayed.

@tadly

tadly commented May 13, 2019

@xoxys would you mind telling me where exactly to remove dnsmasq from?

Also, what might be interesting is how many times the vacuum has been active.
I'd expect this to influence when the vacuum resets. (I'm actually down to 2x a week.)

@dugite-code

@tadly @xoxys I had the error dnsmasq: unknown interface wlan0 in my boot log as well. Changing the option bind-interfaces in /etc/dnsmasq.conf to bind-dynamic cleared that error in boot.log.

So far this doesn't appear to have caused any issues; it's probably better than disabling dnsmasq altogether.
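For anyone else hitting this, the relevant /etc/dnsmasq.conf change would look roughly like this (a sketch; note that bind-dynamic requires dnsmasq 2.63 or newer):

```conf
# bind-interfaces fails hard if wlan0 does not exist yet when dnsmasq
# starts; bind-dynamic picks interfaces up dynamically as they appear.
#bind-interfaces
bind-dynamic
```

Since wlan0 only comes up after provisioning, binding dynamically avoids the race at boot without removing dnsmasq entirely.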

@xoxys
Author

xoxys commented May 13, 2019

@dugite-code good catch :) I'm not sure whether dnsmasq is really required, so I decided to remove it from my setup. But yes, you should be careful with these changes...

@tadly you should maybe follow @dugite-code's instructions first, but if you're looking for the file, I think it is under /etc/upstart.

@samtimes1

Hi everyone, last night my robot also reset its firmware and woke us up in the middle of the night.
I didn't have much time before work today, so I just tried the same command I used last time to push the firmware to the robot (I still had it in my history). It didn't work because the token was wrong. I started discovery and received a different token. With that new token it kinda worked, but the robot told me to charge it before updating, so I just left for work.
Is it possible that the token changed? I thought it always stays the same.

@posixx

posixx commented May 15, 2019

I also have the same problem: after some days the rockrobo loses its config. After rebooting the rockrobo I can connect to the internal WiFi, but there is no SSH or GUI access. So it seems it is completely reset. Very annoying. I hope this is solved soon...

@xoxys
Author

xoxys commented May 19, 2019

@posixx It's not that simple. If you have any suggestions, please share them with us :)

@xoxys
Author

xoxys commented May 19, 2019

@rassaei As far as I know, the token is randomly created at first boot. So after a full reset the token will be regenerated.

@dugite-code

@posixx @xoxys Found something new: on the German roboter-forum.com there was a thread suggesting they stopped seeing this issue once they added the missing /mnt/default/roborock.conf file. If you are missing it as well, you can create one by following the CCC to CE conversion guide.

@josch

josch commented May 20, 2019

After every reset I'm flashing the same firmware image (based on 1792) on my gen2. That image contains /mnt/default/roborock.conf:

root@rockrobo:~# cat /mnt/default/roborock.conf
language=en
name=custom_A.03.0005_CE
bom=A.03.0005
location=de
wifiplan=
timezone=Europe/Berlin
logserver=awsde0.fds.api.xiaomi.com

But I'm still seeing reboots once in a while. I did not otherwise modify that image, so maybe multiple conditions have to be met.

@josch

josch commented May 4, 2020

If you use /bin/sh, the echos do not work as intended.

The script should not use echo -n (which is not portable) but printf instead...

@dgiese

dgiese commented May 4, 2020

So I implemented a flag-cleaner in my dustbuilder and tested it against the v1 firmware (4004 and 4007). It works so far for me. I created init scripts for runlevel 0 (halt) and runlevel 6 (reboot) for the Ubuntu-based firmware. The script checks for the flag "04" and resets it to "01" if necessary. The check runs at boot time and at restart. I think that's the cleanest solution.

Here is the code I use: https://dustbuilder.xvm.mit.edu/resetfix_maybe/
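For illustration, here is a hedged sketch of the "check flag 04, reset to 01" idea, demoed against a scratch file rather than the real flag storage. The file name and byte offset are placeholders, not what the linked script actually uses:

```shell
# Demo of the flag-cleaner logic on a scratch file.
# FLAG_FILE and OFFSET are placeholders, NOT the real device/offset.
FLAG_FILE=demo_flag.bin
OFFSET=0

printf '\004' > "$FLAG_FILE"   # simulate a device flagged for factory reset

# Read one byte at the flag offset as a two-digit hex string.
flag=$(dd if="$FLAG_FILE" bs=1 skip="$OFFSET" count=1 2>/dev/null \
       | od -An -tx1 | tr -d ' ')
if [ "$flag" = "04" ]; then
    # printf with an octal escape is portable across ash/dash/bash,
    # unlike 'echo -n -e'.
    printf '\001' | dd of="$FLAG_FILE" bs=1 seek="$OFFSET" count=1 \
                       conv=notrunc 2>/dev/null
fi

od -An -tx1 "$FLAG_FILE" | tr -d ' '   # prints 01
```

The same read-compare-write pattern applies when the target is the real flag location instead of a scratch file.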

@AlexanderS

@dgiese Thanks for the script. You can replace echo -n -e with printf and even replace this:

echo -n $(date) >> "$_LOG_FILE"
echo -n " " >> "$_LOG_FILE"
echo -n $((offset)) >> "$_LOG_FILE"
echo -n $_SHALL >> "$_LOG_FILE"
echo -n " " >> "$_LOG_FILE"
echo -n $actual >> "$_LOG_FILE"
echo -n " " >> "$_LOG_FILE"
echo " - bad partition flag detected for systemA" >> "$_LOG_FILE"

with something like this:

printf "%s %s %s %s - bad partition flag detected for systemA\n" "$(date)" "$offset" "$_SHALL" "$actual" >> "$_LOG_FILE"

By the way $offset is undefined in cleanflags.sh.

@dgiese

dgiese commented May 5, 2020

Thanks for the hint. I mean, for now it works. @sareyko's original script was cleaner (please don't sue me), but I wanted to make sure that it runs in every environment (bash, ash). I somehow cannot print the hex stuff in ash. Also, I now write to /mnt/reserve, as that partition is persistent across factory resets.

@josch

josch commented May 5, 2020

If you want to be most compatible, don't try to use hex characters but use octal instead. From the POSIX manual of printf: "Hexadecimal character constants as defined in the ISO C standard are not recognized in the format". The following worked fine for me under ash version 0.5.10.2:

$ printf '\001' | xxd
00000000: 01                                       .
$ printf '%b' '\0001' | xxd
00000000: 01                                       

This also worked under dash, bash, zsh and ksh.

@diabl0w

diabl0w commented May 5, 2020

Will using this script cause damage? I mean: whatever is causing the device to factory reset, if we prevent the reset, could some damage occur that the reset would have prevented?

@Hypfer
Owner

Hypfer commented May 5, 2020

I doubt it but of course there's no warranty here. Everything you're doing is your own risk.

Personally I'd say it's plausible that roborock simply added some kind of hidden root detection which messes with the user just enough that they get nudged back to use the cloud. Especially since that started happening only after they became aware of Valetudo which is a viable alternative.

I can't think of any permanent damage that could be caused by this. It's much safer than disabling reboots imo

@diabl0w

diabl0w commented May 5, 2020

I doubt it but of course there's no warranty here. Everything you're doing is your own risk.

Personally I'd say it's plausible that roborock simply added some kind of hidden root detection which messes with the user just enough that they get nudged back to use the cloud. Especially since that started happening only after they became aware of Valetudo which is a viable alternative.

I can't think of any permanent damage that could be caused by this. It's much safer than disabling reboots imo

Thanks, that seems fair... I guess it would still be nice to eventually find the actual cause of the reset flags being set, but it's a step in the right direction at least!

@mathiasrabe

Personally I'd say it's plausible that roborock simply added some kind of hidden root detection which messes with the user just enough that they get nudged back to use the cloud.

As I understand it, the resets only occur when you use a custom firmware made with Dustbuilder and add Valetudo. I've never heard of the resets occurring with custom firmware from Vacuumz, or did I miss anything?

If this only happens with Dustbuilder images, there might be other possibilities than a root detection. Maybe it's just a bug in Dustbuilder which is triggered by Valetudo?

Nevertheless, thanks for all your effort to dig deeper in this topic :D

@sareyko

sareyko commented May 5, 2020

I noticed that the script is broken. If you use /bin/sh, the echos do not work as intended. Both partitions will be flagged with "2D" instead of "01".... which will cause a factory reset at the next boot.

Oh shoot! I forgot that the -n flag to echo is a bashism. But at least on my bot it works just fine in ash (BusyBox 1.24.1).

So I implemented a flag-cleaner in my dustbuilder and tested it against the v1-fw (4004 and 4007). [...] That is checked at boot-time [...]

Why the boot-time check? I don't think that's necessary and might actually be a bad idea in certain circumstances.

will using this script cause damage? I mean in that whatever is causing the device to factory reset, if we prevent the reset, then some sort of damage would happen that resetting would have prevented

I totally agree here. As long as the source of the resets is unknown, preventing them may actually brick the devices. For all we know, the resets might actually be needed to recover the device from some kind of filesystem failure.

thanks, that seems fair... i guess it would still be nice to eventually find the actual cause of the reset flags being set, but a step in the right direction at least!

The flags get set by WatchDoge when a certain message is received via IPC from another process. So far I've not been able to find the process sending the message, and thus the cause of the reset. I'm still looking around, but it seems like the actual source of the message is missing from the firmware I'm using and looking at.
Maybe I'll have a look at another version when I find some more free time.
If anybody wants to help with this: a list of running processes from an RR running a firmware known to reset itself would be a good start.

@Hypfer
Owner

Hypfer commented May 5, 2020

What is triggering the factory reset when you're doing it via the hardware buttons?

https://github.com/dgiese/dustcloud/wiki/Xiaomi-Vacuum-Robots-Factory-Reset

If that is a hardware feature, it should be possible to always recover 🤔

@diabl0w

diabl0w commented May 6, 2020

I'm still looking around but it seems like the actual source of the message is missing in the firmware I'm using and looking at.
Maybe I'll have a look at another version when I find some more free time.
If anybody wants to help with this: A list of running processes from a RR running a firmware known to reset itself would be a good start.

Someone else can chime in if they have experienced something different, but in my experience:

  • https://vacuumz.info/download/gen2/

    • built with a modified imagebuilder
    • uses Valetudo RE
    • I have only used one version, so it's not thorough testing, but I have never had a reset
  • https://dustbuilder.xvm.mit.edu/

    • uses the original imagebuilder
    • uses original Valetudo (an option for Valetudo RE does exist, but I haven't used it)
    • I have had probably half a dozen or more resets with these images (or ones I built myself) in the past

Edited to also indicate differences in Valetudo versions, as pointed out by @dgiese.

@dgiese

dgiese commented May 6, 2020

So from my experience the flags are safe, as in the worst case the vacuum does a factory reset via U-Boot. As long as you don't mess up the recovery copy of the OS you should be fine. From what I saw, the flag "0x4" should never occur under normal circumstances. The other flags (1-3) are normal or are set during an update.

About the differences: Vacuumz has prebuilt images. If no resets have occurred there, then there are two theories: it could be the Valetudo version (RE vs. vanilla), or their images set something special. Technically the images out of Dustbuilder should not really differ in configuration, but maybe there is something weird. However, resets existed before Dustbuilder, so it must be something with Valetudo or some configuration...

@rlka

rlka commented May 11, 2020

I didn't get it: can I just build the new 0.5.1 firmware for my Gen1 and this fix will be included, or do I need to use Dustbuilder with the "experimental feature" somehow?

@jdus

jdus commented May 21, 2020

I did not get it, can i just build the new 0.5.1 firmware for my gen1 and this fix will be included, or i need to use DustBuilder with "experimental feature" somehow?

In the release notes of 0.5.1, @Hypfer states that you can either apply the fix yourself for local firmware builds, or just use dustbuilder (https://builder.dontvacuum.me/). When using dustbuilder, just don't forget to check the box under 'experimental features':
dustbuilder

@Poeschl
Contributor

Poeschl commented May 21, 2020

When building locally with vacuum, the flag --fix-reset applies the fix.

@2relativ

Do I still need to do something manually with 0.5.2? Or is this fix now canon?

@Hypfer
Owner

Hypfer commented May 30, 2020

@2relativ you will need to build a new firmware image with the mitigation enabled. Just replacing the valetudo binary is not enough

@2relativ

@Hypfer thanks! So I don't need to set a flag or anything else, just build a new image with 0.5.2?

@Hypfer
Owner

Hypfer commented May 30, 2020

If you follow the updated guide in the docs everything should be fine

@exetico

exetico commented Jun 14, 2020

#206 (comment)

Our vacuum just reset itself again. This time after ~4 months or so.

I'll use the new solution, and hopefully it'll stay valetudoed 😁

@Hypfer
Owner

Hypfer commented Jun 15, 2020

@exetico Should be solved now. Just make sure to enable the mitigation when building a new firmware

@exetico

exetico commented Jun 16, 2020

Hi @Hypfer

Thanks for the reply. It was not my intention to disturb you :-) I just wanted to report my latest issue, to have the timestamp somewhere; after that I just wanted to find a bit of time and reflash it.

I've now grabbed the latest version from Dustbuilder, including the reset fix, and everything is "back to normal".

Fingers crossed :-)

@WolfspiritM

WolfspiritM commented Jun 23, 2020

I updated a few weeks ago as well, using the fix, and it hasn't reset so far.
However, a few days ago my zones were suddenly gone and Valetudo seemed to be reset to its defaults.
Everything including authentication was lost, but things like the map and cleaning history were still there.
That happened with the daily reboot.
I can only assume that the filesystem where config.json is stored wasn't available when Valetudo started (or that it had some other issue reading the file after the reboot), so it created a new one, but I'm not sure. There even is a "config.json.backup", but that is an empty config, too. Maybe it would be good to use a different location if there already is a backup there.

I don't really think this has anything to do with this issue in particular, but as I have no way to reproduce it and it's some kind of reset, I thought I'd mention it here.

@Schattenruf

Schattenruf commented Jun 30, 2020

Same for me. I updated in March with the fix script:

I think you are most likely referring to my script.
https://github.com/MadJoker0815/roborock_nologs

Up to now no reset, and the filesystem looks good as well:
(screenshot)

Note: I don't have any zones defined.

@Hypfer
Owner

Hypfer commented Jun 30, 2020

Since the mitigation does seem to work fine, this issue will now be closed and hopefully never reopened again.

@Hypfer closed this as completed Jun 30, 2020
@xobs

xobs commented Jul 12, 2020

I've just now discovered this thread. I'm going to give the fix a try. This is mostly just some thoughts I came up with when reading the thread history.

From what I gather, the most likely source of problems stems from WatchDoge. In a recent post, someone also mentioned /dev/watchdog.

If WatchDoge is, in fact, using /dev/watchdog then it's likely using the SoC WDT that will do a hard-reset if the timer expires. If it does that, then things could be in a bad state. This could happen if the process is killed due to OOM or if the filesystem fills up and a write fails.

They also could do something like reset voltage regulators during boot, which could cause a momentary brownout on eMMC. If it was in the process of writing, that could corrupt data. Or maybe they didn't wire up the eMMC reset line. Or maybe this eMMC doesn't work so well under reset.

Given some of the early reports, it certainly sounds like a WDT, especially since there aren't any logs. Maybe this SoC's WDT reports the current count, and we can read that value back.

If this happens again, we should look closer at the hardware watchdog timer as a source of these reboots and filesystem corruption. Thanks to all who investigated it.

Update: I did some poking and it looks like WatchDoge does set up the hardware watchdog timer to reboot if it hasn't been touched for 16 seconds. Based on the reference manual at http://dl.linux-sunxi.org/A23/A23%20User%20Manual%20V1.0%2020130830.pdf:

Read the current status:

[root@rockrobo ~]# ./devmem2 0x01C20Cb8
/dev/mem opened.
Memory mapped at address 0xb6f5f000.
Value at address 0x1c20cb8 (0xb6f5fcb8): 0x000000b1
[root@rockrobo ~]#

Status 0xbx means 512000 cycles (16 seconds), and 0xx1 means "Enable the watchdog".

The watchdog is set to restart the chip immediately if it doesn't get fed every 16 seconds:

[root@rockrobo ~]# ./devmem2 0x01C20Cb4
/dev/mem opened.
Memory mapped at address 0xb6fc8000.
Value at address 0x1c20cb4 (0xb6fc8cb4): 0x00000001
[root@rockrobo ~]# 

If this happens again, one thing we can try is setting it to issue an interrupt rather than restarting (this can be done by running devmem2 0x01C20Cb4 w 2). We can also try just disabling the watchdog timer entirely: this device doesn't seem to have a trapdoor function, and you can actually just set 0x01C20Cb8 to 0x00 to disable it.

But again, that's only if this problem hasn't already been solved.
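The mode-register value read back above can be decoded with plain shell arithmetic. A small sketch, assuming (per the description in this comment) that bit 0 is the enable bit and bits 4-7 select the timeout interval:

```shell
# Decode the WDOG mode value 0xb1 read back via devmem2 above.
# The bit layout is inferred from the thread's reading of the A23 manual,
# not verified against hardware here.
val=0xb1
enabled=$(( val & 0x1 ))          # bit 0: watchdog enable
interval=$(( (val >> 4) & 0xf ))  # bits 4-7: 0xb selects the 16 s timeout
echo "enabled=$enabled interval=$interval"   # prints: enabled=1 interval=11
```

So the register readout 0x000000b1 matches the description: watchdog enabled, interval select 0xb (16 seconds).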

@Hypfer
Owner

Hypfer commented Aug 9, 2020

As pointed out in the dustcloud group by @bsdice, these resets seem to be related to memory usage. WatchDoge allegedly monitors RAM usage to detect memory leaks and increases the failure counter, which then leads to resets.

For firmware 1720, the defaults in the WatchDoge binary should be
#MEMORY_WARN_SIZE: 230000 (VmRSS total)
#MEMORY_TOTAL_SIZE: 250000 (Max VmSize)

and they can also be overridden by setting larger values as environment variables in /opt/rockrobo/watchdog/rrwatchdoge.conf.
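Assuming rrwatchdoge.conf is sourced as a shell fragment (which I have not verified), raising those thresholds could look roughly like this; the values here are arbitrary examples, not recommendations:

```conf
# /opt/rockrobo/watchdog/rrwatchdoge.conf -- sketch only, example values.
# Raise the leak-detection thresholds above the FW 1720 defaults
# (230000 / 250000) so Valetudo's footprint doesn't trip the counter.
export MEMORY_WARN_SIZE=330000
export MEMORY_TOTAL_SIZE=350000
```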

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 20, 2022