storaged : if impossible to mount crashed (ctree failed) btrfs storaged tries indefinitely #232
Comments
Maybe, as a "rescue" solution, storaged can set a "rescue" flag somewhere (which should be respected by the other daemons) and continue. This rescue flag would then make the daemons run in some sort of rescue mode (for example, no provisioning is accepted, minimal networking, etc.) to give an admin a chance to log in to the node (not sure how) and check the problem.
Andreas hit this case with an old disk -> boom, no boot, as the node waited indefinitely for the disk to come up.
What we could do is at least reach the point where we get a network connection, so the node can somehow register itself and be marked as faulty. That way the farmer could see there is something wrong with this particular node. The issue is that without a disk, all the binaries will be downloaded into memory. I guess this is fine for just sending a warning message and then stopping...
So there are 2 main ways we can have issues:
Right now storaged returns immediately when it encounters an error. To fix this, we can separate the corrupt storage pools and broken disks from the functioning ones in the initialization method (rather than returning when we encounter an error), and expose a function over zbus so other modules (e.g. capacityd) can check which disks are not working (if any).
Right, so indeed I think there are multiple things to put in place to work around this kind of problem.
We should also take into account the following situations:
Rather than returning an error while initializing the storage module, which ultimately crashes the module, maintain separate lists of (presumably) faulty devices and storage pools. This allows storaged to finish initializing with all the known working devices and storage pools in the system. Also expose these lists over the zbus interface, so other modules can check whether there is faulty hardware and, if so, take action (e.g. notify the farmer).
We have already tried that. We never found anything that could be used to do this, so it is not an option. Instead, this solution has been chosen.
So here you are talking about the possibility that the disk used for cache dies. This is a special case that has not been taken into account yet. I think for this case we could simply prepare another working disk to take over. As long as the seed for the identity is still available, we should be good.
Coming back to this: I think we should consider a more generic approach to hardware failure reporting. In a node, anything can break, not only disks. When this happens, we should always be able to report the failure to the farmer. To do so, we should have an extra method on the explorer directory actor through which a node can report any hardware failure. @muhamadazmy @LeeSmet what do you think?
Since the entire system runs in memory from tmpfs, it's fairly easy to introduce a new daemon (from tf-zos-bins) that starts on machine boot and doesn't depend on any of the other services. This daemon will keep running as long as memory is not corrupted. The difficult part is actually detecting hardware failure without being intrusive. I think a lot of this information is available via the kernel; I would be happy to do some research on that.
A separate error-reporting daemon is probably a good idea. There should probably be some research into the best way to detect errors though, since the daemons themselves will likely not be able to detect all of them. E.g. once a filesystem is mounted, storaged will not know if a consumer of said FS experiences read/write errors or similar.
Fix #232: Ensure storaged does not crash on boot
If there are problems with the SSD, nothing works, so there is not even a possibility to try to fix the node without going into some form of rescue mode.