DAOS-16276 doc: Address engine unavailability

Add a section on handling unavailable engines.

Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
daos-stack · Nov 6, 2024 · 9f6dd50
1 parent 7059aa5
commit 9f6dd50

Showing 2 changed files with 18 additions and 8 deletions.
diff --git a/docs/admin/pool_operations.md b/docs/admin/pool_operations.md
@@ -1120,15 +1120,10 @@ Administrator can set the default pool redundancy factor by environment variable
 dead and the number of failed fault domain exceeds or is going to exceed the pool
 redundancy factor, it will not change pool map immediately. Instead, it will give
 critical log message:
+```
 intolerable unavailability: engine rank x
-In this case, the system administrator should check and try to recover those
-failed engines and bring them back with:
-dmg system start --ranks=x
-one by one. A reintegrate call is not needed.
-
-For true unrecoverable failures, the administrator can still exclude engines.
-However, data loss is expected as the number of unrecoverable failures exceeds
-the pool redundancy factor.
+```
+To recover, see [Servers or engines become unavailable](troubleshooting.md#engines-become-unavailable).
 
 ## Recovering Container Ownership
 

diff --git a/docs/admin/troubleshooting.md b/docs/admin/troubleshooting.md
@@ -554,6 +554,21 @@ To resolve the issue:
 
 Alternately, the administrator may erase and re-format the DAOS system to start over fresh using the new addresses.
 
+### Engines become unavailable
+
+Engines may become unavailable due to server power losses and reboots, network switch failures, etc. After staying unavailable for a certain period of time, these engines may become "excluded" or "errored" in `dmg system query` output. Once the states of all engines stabilize (see [`CRT_EVENT_DELAY`](env_variables.md)), each pool will check whether there is enough redundancy (see [Pool RF](pool_operations.md#pool-redundancy-factor)) to tolerate the unavailability of the "excluded" or "errored" engines. If there is enough redundancy, these engines will be excluded from the pool ("disabled ranks" in `dmg pool query --health-only` output); otherwise, the pool will perform no exclusion and may become temporarily unavailable. Similarly, after engines become available again, once the states of all engines stabilize, each pool will repeat this check for any engines that are still unavailable.
+
+To restore availability as well as capacity and performance, try to start all "excluded" or "errored" engines. Starting all of them at the same time minimizes the chance of triggering rebuild jobs. In many cases, the following command suffices:
+```
+$ dmg system start
+```
+If some pools remain unavailable (e.g., `dmg pool list` keeps timing out) after the previous step, restart the whole system:
+```
+$ dmg system stop --force
+$ dmg system start
+```
+If some engines have been excluded from certain pools and are available again, reintegrate them into those pools.
+
 ## Diagnostic and Recovery Tools
 
 !!! WARNING : Please be careful and use this tool under supervision of DAOS support team.
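
The recovery sequence added under "Engines become unavailable" can be sketched as a small shell function. This is only an illustration, not part of the commit: the function name `recover_engines` and the `POOL_LIST_TIMEOUT` variable are hypothetical, the 60-second default is an assumed threshold, and the sketch treats a failed or timed-out `dmg pool list` (wrapped in coreutils `timeout`) as the signal that pools are still unavailable. It uses only the `dmg` invocations shown in the diff.

```shell
#!/bin/sh
# Sketch of the documented recovery steps. Names below are illustrative
# assumptions, not DAOS-provided settings.
POOL_LIST_TIMEOUT="${POOL_LIST_TIMEOUT:-60}"   # seconds; assumed threshold

recover_engines() {
    # Step 1: start all engines together, which the docs say minimizes
    # the chance of triggering rebuild jobs.
    dmg system start || echo "warning: 'dmg system start' reported an error" >&2

    # Step 2: use 'dmg pool list' as the availability probe; a non-zero
    # exit or a timeout is treated as "pools still unavailable".
    if timeout "$POOL_LIST_TIMEOUT" dmg pool list >/dev/null 2>&1; then
        echo "pools reachable after system start"
        return 0
    fi

    # Step 3: fall back to restarting the whole system.
    echo "pools still unavailable; restarting the whole system"
    dmg system stop --force || return 1
    dmg system start || return 1

    # Report the final probe result as the function's exit status.
    timeout "$POOL_LIST_TIMEOUT" dmg pool list >/dev/null 2>&1
}
```

Calling `recover_engines` from an admin node would run the steps in order and return non-zero if pools are still unreachable after the full restart. Reintegrating engines that were excluded from pools is left as a manual follow-up, as in the documented procedure.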