DAOS-16276 doc: Address engine unavailability #15456
Ticket title is 'Document how to bring DAOS cluster online after many servers failed'
Add a section on handling unavailable engines. Signed-off-by: Li Wei <wei.g.li@intel.com> Required-githooks: true
Sorry, a minor change: Added "that remain" at the end of a paragraph.
docs/admin/troubleshooting.md
Outdated
@@ -554,6 +554,21 @@ To resolve the issue:

Alternately, the administrator may erase and re-format the DAOS system to start over fresh using the new addresses.

### Engines become unavailable

Engines may become unavailable due to server power losses and reboots, network switch failures, etc. After staying unavailable for a certain period of time, these engines may become "excluded" or "errored" in `dmg system query` output. Once the states of all engines stablize (see [`CRT_EVENT_DELAY`](env_variables.md)), each pool will check whether there is enough redundancy (see [Pool RF](pool_operations.md#pool-redundancy-factor)) to tolerate the unavailability of the "excluded" or "errored" engines. If there is enough redundancy, these engines will be excluded from the pool ("disabled ranks" in `dmg pool query --health-only` output); otherwise, the pool will perform no exclusion and may become temporarily unavailable. Similarly, when engines become available, whenever the states of all engines stablize, each pool will perform the aforementioned check for any unavailable engines that remain.
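
As a reading aid for the paragraph above, here is a minimal Python sketch of the per-pool check it describes. It is a toy model, not DAOS code: the function `check_pool`, the `PoolDecision` type, and the rule "enough redundancy means the number of unavailable ranks does not exceed the pool RF" are all assumptions made for illustration. The real check may count fault domains rather than individual ranks.

```python
from dataclasses import dataclass

# Engine states that the paragraph treats as unavailable.
UNAVAILABLE_STATES = {"excluded", "errored"}


@dataclass
class PoolDecision:
    exclude: list    # ranks the pool would exclude ("disabled ranks")
    available: bool  # False models "may become temporarily unavailable"


def check_pool(pool_ranks, engine_states, rf):
    """Hypothetical model of the check run once engine states have stabilized.

    ASSUMPTION: redundancy is sufficient when the number of unavailable
    ranks in the pool is at most the pool's redundancy factor (rf).
    """
    unavailable = sorted(
        r for r in pool_ranks if engine_states.get(r) in UNAVAILABLE_STATES
    )
    if not unavailable:
        return PoolDecision(exclude=[], available=True)
    if len(unavailable) <= rf:
        # Enough redundancy: exclude the unavailable ranks from the pool.
        return PoolDecision(exclude=unavailable, available=True)
    # Not enough redundancy: perform no exclusion at all.
    return PoolDecision(exclude=[], available=False)


states = {0: "joined", 1: "excluded", 2: "errored", 3: "joined"}
tolerated = check_pool([0, 1, 2, 3], states, rf=2)
not_tolerated = check_pool([0, 1, 2, 3], states, rf=1)
```

Under these assumptions, `tolerated` excludes ranks 1 and 2 and the pool stays available, while `not_tolerated` excludes nothing and marks the pool as possibly unavailable until enough engines return.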
suggest adding to the end of the sentence:
... "and may become temporarily unavailable (as seen by timeouts of certain commands, for example, `dmg pool query` and `dmg pool list`)"
I asked myself this too when writing these words. Added.
Also fixed "stablize" -> "stabilize".
Signed-off-by: Li Wei <wei.g.li@intel.com> Required-githooks: true
Add a section on handling unavailable engines.
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: