DAOS-16276 doc: Address engine unavailability
Add a section on handling unavailable engines.

Signed-off-by: Li Wei <wei.g.li@intel.com>
Required-githooks: true
liw committed Nov 6, 2024
1 parent 7059aa5 commit 9f6dd50
Showing 2 changed files with 18 additions and 8 deletions.
11 changes: 3 additions & 8 deletions docs/admin/pool_operations.md
@@ -1120,15 +1120,10 @@ Administrator can set the default pool redundancy factor by environment variable
dead and the number of failed fault domains exceeds or is about to exceed the pool
redundancy factor, it will not change the pool map immediately. Instead, it will log a
critical message:
```
intolerable unavailability: engine rank x
In this case, the system administrator should check and try to recover those
failed engines and bring them back with:
dmg system start --ranks=x
one by one. A reintegrate call is not needed.

For true unrecoverable failures, the administrator can still exclude engines.
However, data loss is expected as the number of unrecoverable failures exceeds
the pool redundancy factor.
```
To recover, see [Servers or engines become unavailable](troubleshooting.md#engines-become-unavailable).
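
The redundancy rule described above can be illustrated with a minimal sketch. It only restates the comparison from the text (failed fault domains versus the pool redundancy factor); it is not how DAOS implements the check, and the helper name `can_exclude` is hypothetical.

```shell
#!/bin/sh
# can_exclude FAILED_DOMAINS RF
# Illustration only (hypothetical helper, not part of DAOS).
# Succeeds when the number of failed fault domains does not exceed the
# pool redundancy factor, i.e. the pool could tolerate excluding them.
can_exclude() {
    [ "$1" -le "$2" ]
}

# Report the outcome for a given failure count and redundancy factor.
describe_check() {
    if can_exclude "$1" "$2"; then
        echo "tolerable: engines can be excluded from the pool"
    else
        echo "intolerable unavailability: pool map is left unchanged"
    fi
}
```

For example, `describe_check 3 2` takes the second branch, because three failed fault domains exceed a redundancy factor of 2.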

## Recovering Container Ownership

15 changes: 15 additions & 0 deletions docs/admin/troubleshooting.md
@@ -554,6 +554,21 @@ To resolve the issue:

Alternately, the administrator may erase and re-format the DAOS system to start over fresh using the new addresses.

### Engines become unavailable

Engines may become unavailable due to server power losses and reboots, network switch failures, etc. After staying unavailable for a certain period of time, these engines may become "excluded" or "errored" in `dmg system query` output. Once the states of all engines stabilize (see [`CRT_EVENT_DELAY`](env_variables.md)), each pool will check whether there is enough redundancy (see [Pool RF](pool_operations.md#pool-redundancy-factor)) to tolerate the unavailability of the "excluded" or "errored" engines. If there is enough redundancy, these engines will be excluded from the pool ("disabled ranks" in `dmg pool query --health-only` output); otherwise, the pool will perform no exclusion and may become temporarily unavailable. Similarly, when engines become available again, each pool will repeat this check for any still-unavailable engines once the states of all engines stabilize.

To restore availability as well as capacity and performance, try to start all "excluded" or "errored" engines. Starting all of them at the same time minimizes the chance of triggering rebuild jobs. In many cases, the following command suffices:
```
$ dmg system start
```
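
To see which engines still need attention before or after the start command, the `dmg system query` output can be filtered for the two states named in this section. The following is a minimal sketch, assuming each state name appears as a whole word on its output line (the exact table layout may differ between DAOS versions); the helper name `unavailable_engines` is hypothetical.

```shell
#!/bin/sh
# Sketch: keep only the lines of `dmg system query` output (read from stdin)
# that mention the "excluded" or "errored" state.
# Assumption: the state name appears as a whole word on each line, so the
# match is case-insensitive and word-based rather than column-based.
unavailable_engines() {
    # `|| true` keeps the exit status at 0 when nothing matches,
    # which simply means no engine is reported as unavailable.
    grep -iwE 'excluded|errored' || true
}

# Usage (requires a running DAOS control plane):
#   dmg system query | unavailable_engines
```

An empty result suggests that no engine is currently reported as "excluded" or "errored".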
If some pools remain unavailable (e.g., `dmg pool list` keeps timing out) after the previous step, restart the whole system:
```
$ dmg system stop --force
$ dmg system start
```
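
Before resorting to a full restart, it may help to make "keeps timing out" concrete by retrying `dmg pool list` a few times. The sketch below is a generic retry helper, not a DAOS feature; the name `retry` and the attempt count and delay in the usage comment are arbitrary assumptions.

```shell
#!/bin/sh
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to ATTEMPTS times, sleeping
# DELAY_SECONDS between failed attempts. Returns 0 on the first success,
# 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        "$@" && return 0
        # Sleep only if another attempt will follow.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Usage (values are illustrative):
#   if ! retry 5 30 dmg pool list; then
#       dmg system stop --force
#       dmg system start
#   fi
```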
If some engines have been excluded from certain pools and are available again, reintegrate them into those pools.
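
When several pools are affected, it can be convenient to generate the reintegration commands for review before running them. The sketch below only prints commands and executes nothing. The pool labels and rank list are hypothetical inputs, and the `--ranks` flag is an assumption: some `dmg` releases spell it `--rank`, so check `dmg pool reintegrate --help` for the installed version.

```shell
#!/bin/sh
# print_reintegrate_cmds RANKS POOL [POOL...]
# Hypothetical helper: prints (does not run) one reintegrate command per pool
# so that an administrator can review them first.
# Assumption: this dmg release accepts a rank list via --ranks; older
# releases may use --rank instead.
print_reintegrate_cmds() {
    ranks=$1
    shift
    for pool in "$@"; do
        printf 'dmg pool reintegrate %s --ranks=%s\n' "$pool" "$ranks"
    done
}

# Usage (pool labels and ranks are examples only):
#   print_reintegrate_cmds 1,2 pool_a pool_b
```

The disabled ranks for each pool can be read from `dmg pool query --health-only` output, as described above.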

## Diagnostic and Recovery Tools

!!! WARNING : Please be careful and use this tool under supervision of DAOS support team.
