
storaged: if a crashed (ctree failed) btrfs cannot be mounted, storaged retries indefinitely #232

Closed
delandtj opened this issue Sep 2, 2019 · 10 comments
Labels: type_bug Something isn't working

@delandtj
Contributor

delandtj commented Sep 2, 2019

If there are problems with the SSD, nothing works, so there is no way to try to fix the node without going into some form of rescue mode.

@zaibon zaibon added this to the later milestone Sep 2, 2019
@zaibon zaibon added the type_bug Something isn't working label Sep 2, 2019
@muhamadazmy
Member

Maybe, as a "rescue" solution, storaged could set a "rescue" flag somewhere (which should be respected by the other daemons) and continue. This rescue flag would then make the daemons run in some sort of rescue mode (for example, no provisioning is accepted, minimal networking, etc.) to give an admin a chance to log in to the node (not sure how) and check the problem.
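
A minimal sketch of what such a flag could look like, assuming a marker file under /run (the path and package are hypothetical, not existing zos code):

```go
package rescue

import "os"

// flagPath is a hypothetical location for the rescue marker; /run is a
// tmpfs, so the flag survives a broken disk but not a reboot.
const flagPath = "/run/zos-rescue"

// Set marks the node as being in rescue mode; storaged would call this
// when it cannot bring up its storage pools, then continue instead of
// crashing.
func Set(reason string) error {
	return os.WriteFile(flagPath, []byte(reason), 0o644)
}

// Active would be checked by the other daemons before accepting
// provisioning work or bringing up full networking.
func Active() bool {
	_, err := os.Stat(flagPath)
	return err == nil
}
```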

@zaibon zaibon modified the milestones: later, 0.1.1 Oct 2, 2019
@delandtj
Contributor Author

Andreas had the case with an old disk -> boom, no boot, as it waited indefinitely for the disk to come up.
Maybe put all mounts in a goroutine and continue booting, or something like that.
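
A minimal sketch of that idea, assuming a hypothetical mountPool helper (the real storaged code differs):

```go
package storage

import (
	"fmt"
	"time"
)

// mountPool is a hypothetical stand-in for storaged's real mount call.
func mountPool(dev string) error {
	// the real code would invoke the btrfs mount logic here
	return nil
}

// tryMount runs the mount in a goroutine and gives up after a timeout,
// so a dead or hung disk no longer blocks boot indefinitely. Note the
// background goroutine may linger until the kernel call returns.
func tryMount(dev string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- mountPool(dev) }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("mounting %s timed out after %s", dev, timeout)
	}
}
```

A timeout like this only bounds boot time; the node would still need to track which pools failed, as discussed further down.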

@zaibon
Contributor

zaibon commented Nov 20, 2019

What we could do is at least reach the point where we get a network connection and somehow be able to register the node and mark it as faulty. That way the farmer could see there is something wrong with this particular node.

The issue is that without a disk, all the binaries will be downloaded into memory. I guess this is fine for just sending a warning message and then stopping...

@LeeSmet LeeSmet self-assigned this Dec 11, 2019
@LeeSmet
Contributor

LeeSmet commented Dec 12, 2019

So there are 2 main ways we can have issues:

  • The (btrfs) filesystem on the node is broken/corrupted/...
  • The disk itself is dead or malfunctioning

Right now storaged returns immediately when it encounters an error. To fix this, we can separate the corrupt storage pools and broken disks from the functioning ones in the initialization method (rather than returning when we encounter an error), and expose a function over zbus for other modules (e.g. capacityd) to check which disks are not working (if any).
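
Roughly, the initialization could take this shape (a sketch; the type and method names are illustrative, not the actual zos API):

```go
package storage

// Illustrative types; the real zos types differ.
type Device struct{ Path string }
type Pool struct{ Mountpoint string }

// mountPool is a hypothetical stand-in for the real pool mount logic.
func mountPool(d Device) (Pool, error) { return Pool{}, nil }

// Module keeps working and (presumably) faulty hardware in separate
// lists instead of aborting initialization on the first error.
type Module struct {
	pools         []Pool
	devices       []Device
	brokenDevices []Device
}

func (m *Module) initialize(candidates []Device) {
	for _, dev := range candidates {
		pool, err := mountPool(dev)
		if err != nil {
			// record the failure instead of returning it
			m.brokenDevices = append(m.brokenDevices, dev)
			continue
		}
		m.pools = append(m.pools, pool)
		m.devices = append(m.devices, dev)
	}
}

// BrokenDevices would be exposed over zbus so other modules (e.g.
// capacityd) can check for faulty hardware and notify the farmer.
func (m *Module) BrokenDevices() []Device {
	return m.brokenDevices
}
```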

@zaibon
Contributor

zaibon commented Dec 13, 2019

Right, so indeed I think there are multiple things to put in place to work around this kind of problem.

  • Allow storaged to operate even when one or more disks are dead/disabled
  • Capacityd already uses storaged to know how much HRU and SRU are available on the node. We should update the Total method to take dead disks into account (see the sketch after this list).
  • Put something in place so the node can report dead disks to the farmer. To do so, we should add an extra method to the directory actor of the explorer for the node to use when it needs to report disk status. I'm not sure yet which daemon would be responsible for sending this information, though.
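
A rough sketch of that capacity accounting, independent of the sketch above and with its own illustrative types:

```go
package storage

// DeviceType distinguishes rotational (HRU) from solid-state (SRU) capacity.
type DeviceType int

const (
	HDD DeviceType = iota // counts toward HRU
	SSD                   // counts toward SRU
)

type Device struct {
	Path string
	Type DeviceType
	Size uint64 // bytes
}

type Module struct {
	devices []Device
	broken  map[string]bool // device path -> flagged broken at init
}

// Total sums usable capacity of the given type, skipping devices that
// were flagged as broken during initialization.
func (m *Module) Total(kind DeviceType) uint64 {
	var total uint64
	for _, dev := range m.devices {
		if dev.Type == kind && !m.broken[dev.Path] {
			total += dev.Size
		}
	}
	return total
}
```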

@muhamadazmy
Member

We should also take into account the following situations:

  • The cache disk is broken, hence the node does not have a persisted ID. Maybe identityd should generate a unique ID that is associated with the node's identity/hardware rather than a randomly generated ID.
  • Storaged is only considered running once /var/cache is mounted, hence lots of essential services will not start unless this happens. Should storaged mount a tmpfs in its place to allow other services (including capacity) to start so the node can report itself as faulty?
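
For the second point, a minimal sketch of a tmpfs fallback (the /var/cache path is the real one; the fallback logic itself is only a proposal):

```go
package storage

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// mountCacheFallback mounts a tmpfs on /var/cache when no working disk
// is available, so dependent services can still start and the node can
// report itself as faulty. Anything written to it is lost on reboot.
func mountCacheFallback() error {
	const target = "/var/cache"
	if err := unix.Mount("tmpfs", target, "tmpfs", 0, "size=256M"); err != nil {
		return fmt.Errorf("mount tmpfs on %s: %w", target, err)
	}
	return nil
}
```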

LeeSmet added a commit that referenced this issue Dec 13, 2019
Rather than returning an error while initializing the storage module,
which ultimately crashes the module, maintain separate lists of
(presumably) faulty devices and storagepools. This allows storaged to
finish initializing with all the known working devices and storagepools in
the system. Also expose these lists over the zbus interface, so other
modules have the ability to check if there is faulty hardware and if so,
take action (e.g. notify farmer).
@zaibon
Contributor

zaibon commented Dec 13, 2019

The cache disk is broken, hence the node does not have a persisted ID. Maybe identityd should generate a unique ID that is associated with the node's identity/hardware rather than a randomly generated ID.

We have already tried that. We never found anything that could be used to do this, so this is not an option. Instead this solution has been chosen.

Storaged is only considered running once /var/cache is mounted, hence lots of essential services will not start unless this happens. Should storaged mount a tmpfs in its place to allow other services (including capacity) to start so the node can report itself as faulty?

So here you are talking about the possibility that the disk used for cache dies. This is a special case that has not been taken into account yet. I think for this case, we could simply prepare another working disk to take over. As long as we still have the seed for the identity available, we should be good.

@zaibon
Contributor

zaibon commented Dec 13, 2019

Coming back to this: Put something in place so the node can report dead disks to the farmer.

I think we should consider a more generic approach to hardware failure reporting. In a node, anything can break, not only disks. When this happens, we should always be able to report the failure to the farmer.

To do so, we should have some extra method on the explorer directory actor where a node can report any hardware failure.
The daemon implementing such a reporting system should be able to work with the bare minimum, so that as long as it has a network connection it can send reports to the explorer.

@muhamadazmy @LeeSmet what do you think?
I can create a new issue to start designing such a system.
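
Purely as an illustration of the shape such reporting could take, a sketch with a made-up explorer endpoint and payload (no such API exists yet):

```go
package storage

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// FailureReport is a hypothetical payload for reporting broken hardware
// to the explorer's directory actor.
type FailureReport struct {
	NodeID    string    `json:"node_id"`
	Component string    `json:"component"` // e.g. "disk", "memory"
	Device    string    `json:"device,omitempty"`
	Message   string    `json:"message"`
	Timestamp time.Time `json:"timestamp"`
}

// report POSTs the failure to a hypothetical explorer endpoint.
func report(explorerURL string, r FailureReport) error {
	body, err := json.Marshal(r)
	if err != nil {
		return err
	}
	resp, err := http.Post(explorerURL+"/nodes/"+r.NodeID+"/failures",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("explorer returned %s", resp.Status)
	}
	return nil
}
```

The important property is that this only needs a network connection and memory, nothing from the (possibly broken) disks.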

@muhamadazmy
Member

Since the entire system runs in memory from tmpfs, it's fairly easy to introduce a new daemon (from tf-zos-bins) that starts on machine boot, and doesn't depend on any of the other services. This daemon will keep running as long as memory is not corrupted.

The difficult part is actually detecting hardware failure without being intrusive. I think a lot of this information is available via the kernel; I would be happy to do some research on that.
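
For example, one non-intrusive source is the kernel log. A sketch that tails /dev/kmsg and flags lines that look like disk I/O errors (the pattern matching is deliberately naive):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// watchKmsg tails the kernel log and reports lines that look like disk
// I/O errors. Reading /dev/kmsg requires root.
func watchKmsg(report func(line string)) error {
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "I/O error") ||
			strings.Contains(line, "blk_update_request") {
			report(line)
		}
	}
	return scanner.Err()
}

func main() {
	err := watchKmsg(func(l string) {
		fmt.Println("possible disk failure:", l)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```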

@zaibon zaibon modified the milestones: 0.1.3, 0.1.4 Dec 13, 2019
@LeeSmet
Contributor

LeeSmet commented Dec 13, 2019

A separate error reporting daemon is probably a good idea. Some research is needed to find the best way to detect errors though, since the daemons will likely not be able to detect everything. E.g. once a filesystem is mounted, storaged will not know if a consumer of said FS experiences read/write errors or similar.
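
For the mounted-filesystem case, one option could be polling btrfs's per-device error counters: `btrfs device stats -c` exits non-zero when any counter is set. A sketch (the interval and handling are illustrative):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// pollBtrfsStats periodically runs `btrfs device stats -c <mountpoint>`,
// which exits non-zero when any per-device error counter is non-zero.
// Note a non-zero exit could also mean the btrfs binary is missing.
func pollBtrfsStats(mountpoint string, interval time.Duration, report func(out string)) {
	for range time.Tick(interval) {
		out, err := exec.Command("btrfs", "device", "stats", "-c", mountpoint).CombinedOutput()
		if err != nil {
			report(string(out))
		}
	}
}

func main() {
	go pollBtrfsStats("/mnt/pool", time.Minute, func(out string) {
		fmt.Println("btrfs errors detected:\n" + out)
	})
	select {} // block forever; a real daemon would integrate its event loop
}
```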

LeeSmet added a commit that referenced this issue Dec 13, 2019
LeeSmet added a commit that referenced this issue Dec 17, 2019
@zaibon zaibon closed this as completed in 93465e0 Dec 17, 2019
zaibon added a commit that referenced this issue Dec 17, 2019
Fix #232: Ensure storaged does not crash on boot