
storaged: if a crashed (ctree failed) btrfs cannot be mounted, storaged retries indefinitely #232

Closed
delandtj opened this issue Sep 2, 2019 · 10 comments
Labels: type_bug Something isn't working

@delandtj
Contributor

delandtj commented Sep 2, 2019

If there are problems with the SSD, nothing works, so there is no way to try to fix the node without going into some form of rescue mode.

@zaibon zaibon added this to the later milestone Sep 2, 2019
@zaibon zaibon added the type_bug Something isn't working label Sep 2, 2019
@muhamadazmy
Member

Maybe, as a "rescue" solution, storaged could set a "rescue" flag somewhere (which should be respected by the other daemons) and continue. This rescue flag would then make the daemons run in some sort of rescue mode (for example, no provisioning is accepted, minimal networking, etc.) to give an admin a chance to log in to the node (not sure how) and check the problem.
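
A minimal sketch of what such a flag could look like, assuming a marker file under /run (the path and package are hypothetical, not existing zos code):

```go
package rescue

import "os"

// flagPath is a hypothetical location for the rescue marker; /run is a
// tmpfs, so the flag survives a broken disk but not a reboot.
const flagPath = "/run/zos-rescue"

// Set marks the node as being in rescue mode; storaged would call this
// when it cannot bring up its storage pools, then continue instead of
// crashing.
func Set(reason string) error {
	return os.WriteFile(flagPath, []byte(reason), 0o644)
}

// Active would be checked by the other daemons before accepting
// provisioning work or bringing up full networking.
func Active() bool {
	_, err := os.Stat(flagPath)
	return err == nil
}
```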

@zaibon zaibon modified the milestones: later, 0.1.1 Oct 2, 2019
@delandtj
Contributor Author

Andreas had the case with an old disk -> boom, no boot, as it waited indefinitely for the disk to come up.
Maybe put all mounts in a goroutine and continue booting, or something like that.
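
A minimal sketch of that idea, assuming a hypothetical mountPool helper (the real storaged code differs):

```go
package storage

import (
	"fmt"
	"time"
)

// mountPool is a hypothetical stand-in for storaged's real mount call.
func mountPool(dev string) error {
	// the real code would invoke the btrfs mount logic here
	return nil
}

// tryMount runs the mount in a goroutine and gives up after a timeout,
// so a dead or hung disk no longer blocks boot indefinitely. Note the
// background goroutine may linger until the kernel call returns.
func tryMount(dev string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- mountPool(dev) }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return fmt.Errorf("mounting %s timed out after %s", dev, timeout)
	}
}
```

A timeout like this only bounds boot time; the node would still need to track which pools failed, as discussed further down.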

@zaibon
Contributor

zaibon commented Nov 20, 2019

What we could do is at least reach the point where we get a network connection and somehow be able to register the node and mark it as faulty. That way the farmer could see there is something wrong with this particular node.

The issue is that without a disk, all the binaries will be downloaded into memory. I guess this is fine for just sending a warning message and then stopping...

@LeeSmet LeeSmet self-assigned this Dec 11, 2019
@LeeSmet
Contributor

LeeSmet commented Dec 12, 2019

So there are 2 main ways we can have issues:

  • The (btrfs) filesystem on the node is broken/corrupted/...
  • The disk itself is dead or malfunctioning

Right now storaged returns immediately when it encounters an error. To fix this, we can separate the corrupt storage pools and broken disks from the functioning ones in the initialization method (rather than returning when we encounter an error), and expose a function over zbus for other modules (e.g. capacityd) to check which disks are not working (if any).
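
Roughly, the initialization could take this shape (a sketch; the type and method names are illustrative, not the actual zos API):

```go
package storage

// Illustrative types; the real zos types differ.
type Device struct{ Path string }
type Pool struct{ Mountpoint string }

// mountPool is a hypothetical stand-in for the real pool mount logic.
func mountPool(d Device) (Pool, error) { return Pool{}, nil }

// Module keeps working and (presumably) faulty hardware in separate
// lists instead of aborting initialization on the first error.
type Module struct {
	pools         []Pool
	devices       []Device
	brokenDevices []Device
}

func (m *Module) initialize(candidates []Device) {
	for _, dev := range candidates {
		pool, err := mountPool(dev)
		if err != nil {
			// record the failure instead of returning it
			m.brokenDevices = append(m.brokenDevices, dev)
			continue
		}
		m.pools = append(m.pools, pool)
		m.devices = append(m.devices, dev)
	}
}

// BrokenDevices would be exposed over zbus so other modules (e.g.
// capacityd) can check for faulty hardware and notify the farmer.
func (m *Module) BrokenDevices() []Device {
	return m.brokenDevices
}
```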

@zaibon
Contributor

zaibon commented Dec 13, 2019

Right, so indeed I think there are multiple things to put in place to work around this kind of problem.

  • Allow storaged to operate even when one or more disks are dead/disabled
  • Capacityd already uses storaged to know how much HRU and SRU are available on the node. We should update the Total method to take dead disks into account (see the sketch after this list).
  • Put something in place so the node can report dead disks to the farmer. To do so, we should add an extra method to the directory actor of the explorer for the node to use when it needs to report disk status. I'm not sure yet which daemon would be responsible for sending this information, though.
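
A rough sketch of that capacity accounting, independent of the sketch above and with its own illustrative types:

```go
package storage

// DeviceType distinguishes rotational (HRU) from solid-state (SRU) capacity.
type DeviceType int

const (
	HDD DeviceType = iota // counts toward HRU
	SSD                   // counts toward SRU
)

type Device struct {
	Path string
	Type DeviceType
	Size uint64 // bytes
}

type Module struct {
	devices []Device
	broken  map[string]bool // device path -> flagged broken at init
}

// Total sums usable capacity of the given type, skipping devices that
// were flagged as broken during initialization.
func (m *Module) Total(kind DeviceType) uint64 {
	var total uint64
	for _, dev := range m.devices {
		if dev.Type == kind && !m.broken[dev.Path] {
			total += dev.Size
		}
	}
	return total
}
```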

@muhamadazmy
Member

We should also take into account the following situations:

  • The cache disk is broken, hence the node does not have a persisted ID. Maybe identityd should generate a unique ID that is associated with the node's identity/hardware rather than a randomly generated ID.
  • Storaged is only considered running once /var/cache is mounted, hence lots of essential services will not start unless this happens. Should storaged mount a tmpfs in its place to allow other services (including capacity) to start so the node can report itself as faulty?
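
For the second point, a minimal sketch of a tmpfs fallback (the /var/cache path is the real one; the fallback logic itself is only a proposal):

```go
package storage

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// mountCacheFallback mounts a tmpfs on /var/cache when no working disk
// is available, so dependent services can still start and the node can
// report itself as faulty. Anything written to it is lost on reboot.
func mountCacheFallback() error {
	const target = "/var/cache"
	if err := unix.Mount("tmpfs", target, "tmpfs", 0, "size=256M"); err != nil {
		return fmt.Errorf("mount tmpfs on %s: %w", target, err)
	}
	return nil
}
```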

LeeSmet added a commit that referenced this issue Dec 13, 2019
Rather than returning an error while initializing the storage module,
which ultimately crashes the module, maintain separate lists of
(presumably) faulty devices and storagepools. This allows storaged to
finish initializing with all the known working devices and storagepools in
the system. Also expose these lists over the zbus interface, so other
modules have the ability to check if there is faulty hardware and if so,
take action (e.g. notify farmer).
@zaibon
Contributor

zaibon commented Dec 13, 2019

The cache disk is broken, hence the node does not have a persisted ID. Maybe identityd should generate a unique ID that is associated with the node's identity/hardware rather than a randomly generated ID.

We have already tried that. We never found anything that could be used to do this, so this is not an option. Instead this solution has been chosen.

Storaged is only considered running once /var/cache is mounted, hence lots of essential services will not start unless this happens. Should storaged mount a tmpfs in its place to allow other services (including capacity) to start so the node can report itself as faulty?

So here you are talking about the possibility that the disk used for cache dies. This is a special case that has not been taken into account yet. I think for this case, we could simply prepare another working disk to take over. As long as we still have the seed for the identity available, we should be good.

@zaibon
Contributor

zaibon commented Dec 13, 2019

Coming back to this: Put something in place so the node can report dead disks to the farmer.

I think we should consider a more generic approach to hardware failure reporting. In a node, anything can break, not only disks. When this happens, we should always be able to report the failure to the farmer.

To do so, we should have some extra method on the explorer directory actor where a node can report any hardware failure.
The daemon implementing such a reporting system should be able to work with the bare minimum, so that as long as it has a network connection it can send reports to the explorer.

@muhamadazmy @LeeSmet what do you think?
I can create a new issue to start designing such a system.
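
Purely as an illustration of the shape such reporting could take, a sketch with a made-up explorer endpoint and payload (no such API exists yet):

```go
package storage

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// FailureReport is a hypothetical payload for reporting broken hardware
// to the explorer's directory actor.
type FailureReport struct {
	NodeID    string    `json:"node_id"`
	Component string    `json:"component"` // e.g. "disk", "memory"
	Device    string    `json:"device,omitempty"`
	Message   string    `json:"message"`
	Timestamp time.Time `json:"timestamp"`
}

// report POSTs the failure to a hypothetical explorer endpoint.
func report(explorerURL string, r FailureReport) error {
	body, err := json.Marshal(r)
	if err != nil {
		return err
	}
	resp, err := http.Post(explorerURL+"/nodes/"+r.NodeID+"/failures",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("explorer returned %s", resp.Status)
	}
	return nil
}
```

The important property is that this only needs a network connection and memory, nothing from the (possibly broken) disks.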

@muhamadazmy
Member

Since the entire system runs in memory from tmpfs, it's fairly easy to introduce a new daemon (from tf-zos-bins) that starts on machine boot, and doesn't depend on any of the other services. This daemon will keep running as long as memory is not corrupted.

The difficult part is actually detecting hardware failure without being intrusive. I think a lot of this information is available via the kernel; I would be happy to do some research on that.
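
For example, one non-intrusive source is the kernel log. A sketch that tails /dev/kmsg and flags lines that look like disk I/O errors (the pattern matching is deliberately naive):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// watchKmsg tails the kernel log and reports lines that look like disk
// I/O errors. Reading /dev/kmsg requires root.
func watchKmsg(report func(line string)) error {
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "I/O error") ||
			strings.Contains(line, "blk_update_request") {
			report(line)
		}
	}
	return scanner.Err()
}

func main() {
	err := watchKmsg(func(l string) {
		fmt.Println("possible disk failure:", l)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```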

@zaibon zaibon modified the milestones: 0.1.3, 0.1.4 Dec 13, 2019
@LeeSmet
Contributor

LeeSmet commented Dec 13, 2019

A separate error reporting daemon is probably a good idea. Some research is needed to find the best way to detect errors though, since the daemons will likely not be able to detect everything. E.g. once a filesystem is mounted, storaged will not know if a consumer of said FS experiences read/write errors or similar.
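
For the mounted-filesystem case, one option could be polling btrfs's per-device error counters: `btrfs device stats -c` exits non-zero when any counter is set. A sketch (the interval and handling are illustrative):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// pollBtrfsStats periodically runs `btrfs device stats -c <mountpoint>`,
// which exits non-zero when any per-device error counter is non-zero.
// Note a non-zero exit could also mean the btrfs binary is missing.
func pollBtrfsStats(mountpoint string, interval time.Duration, report func(out string)) {
	for range time.Tick(interval) {
		out, err := exec.Command("btrfs", "device", "stats", "-c", mountpoint).CombinedOutput()
		if err != nil {
			report(string(out))
		}
	}
}

func main() {
	go pollBtrfsStats("/mnt/pool", time.Minute, func(out string) {
		fmt.Println("btrfs errors detected:\n" + out)
	})
	select {} // block forever; a real daemon would integrate its event loop
}
```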

LeeSmet added a commit that referenced this issue Dec 13, 2019
LeeSmet added a commit that referenced this issue Dec 17, 2019
@zaibon zaibon closed this as completed in 93465e0 Dec 17, 2019
zaibon added a commit that referenced this issue Dec 17, 2019
Fix #232: Ensure storaged does not crash on boot