From f1c27a7ba5d0da3f10b50369584390e2a3c772b1 Mon Sep 17 00:00:00 2001 From: Ilia Shakhov Date: Fri, 3 May 2024 10:24:13 +0000 Subject: [PATCH 01/10] Add english version of CMS docs --- .../manual/maintenance-without-downtime.md | 98 +++++++++++++++++++ ydb/docs/en/core/maintenance/toc_i.yaml | 2 + .../maintenance-without-downtime.md} | 14 +-- .../maintenance/manual/node_restarting.md | 2 + ydb/docs/ru/core/maintenance/toc_i.yaml | 2 +- 5 files changed, 110 insertions(+), 8 deletions(-) create mode 100644 ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md rename ydb/docs/ru/core/maintenance/{maintenance-without-outages.md => manual/maintenance-without-downtime.md} (84%) diff --git a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md new file mode 100644 index 000000000000..620890883009 --- /dev/null +++ b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md @@ -0,0 +1,98 @@ +# Maintenance without downtime + +Periodically, the {{ ydb-short-name }} cluster needs to be maintained, such as upgrading its version or replacing broken disks. Maintenance can cause the cluster or its databases to become unavailable due to: +- Exceeding the failure model of the affected [storage groups](../../concepts/databases.md#storage-groups). +- Exceeding the [State Storage](../../deploy/configuration/config.md#domains-state) failure model. +- Lack of computational resources due to stopping too many [dynamic nodes](../../concepts/cluster/common_scheme_ydb.md#nodes). + +To avoid such situations, {{ ydb-short-name }} has a system [tablet](./concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster - the *Cluster Management System (CMS)*. The CMS allows you to answer the question of whether a {{ ydb-short-name }} node or host running {{ ydb-short-name }} nodes can be safely taken out for maintenance. 
To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it to take exclusive locks on the nodes or hosts that will be involved in the maintenance. The cluster components on which the locks are taken are considered unavailable from the CMS perspective and can be safely maintained. The CMS will [check](#checking-algorithm) the current state of the cluster and take locks only if the maintenance comply with the [availability mode](#availability-mode) and [unavailable node limits](#unavailable-node-limits). + +{% note warning "Faults during maintenance" %} + +During maintenance activities whose safety is guaranteed by the CMS, faults unrelated to those activities may occur in the cluster. If the faults threaten the availability of the cluster, urgent completion of the maintenance can help mitigate the risk of loss of availability. + +{% endnote %} + +## Maintenance task {#maintenance-task} + +A *maintenance task* is a set of *actions* that the user asks the CMS to perform for safe maintenance. + +Supported actions: +- Taking an exclusive lock on a cluster component - node or host. + +In a task, actions are divided into groups. Actions from the same group are performed atomically. Currently, groups can consist of only one action. + +If it's not possible to perform an action at the time of the request, the CMS informs you of the reason and the time when it is worth *refreshing* the task, and sets the action status to *pending*. When the task is refreshed, the CMS attempts to perform the pending actions again. + +*Performed* actions have a deadline after which they are considered *completed* and stop having an effect on the cluster. For example, an exclusive lock is removed. An action can be completed early. + +{% note info "Protracted maintenance" %} + +If maintenance continues after the actions that were performed to make it safe have been completed, this is considered a fault in the cluster. 
+ +{% endnote %} + +Completed actions are automatically removed from the task. + +### Availability mode {#availability-mode} + +In a maintenance task, you need to specify the availability mode of the cluster to be complied when checking whether actions can be performed. The following modes are supported: +- **Strong** - a mode that minimizes the risk of availability loss. + - No more than one unavailable [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) is allowed in each affected storage group. + - No more than one unavailable State Storage ring is allowed. +- **Weak** - a mode that does not allow exceeding the failure model. + - No more than two unavailable VDisks are allowed for affected storage groups with the [block-4-2](../../administration/production-storage-config.md#reliability) scheme. + - No more than four unavailable VDisks, three of which must be in the same data center, are allowed for affected storage groups with the [mirror-3-dc](../../administration/production-storage-config.md#reliability) scheme. + - No more than `(nto_select - 1) / 2` unavailable State Storage rings are allowed. +- **Force** - forced mode, the failure model is ignored. Not recommended for use. + +### Priority {#priority} + +You can specify the priority of a maintenance task. A lower value means a higher priority. + +The actions of the task cannot be performed until all conflicting actions from tasks with a higher priority are completed. Tasks with the same priority have no advantage over each other. + +## Unavailable node limits {#unavailable-node-limits} + +In the CMS configuration, you can configure limits on the number of unavailable nodes for a database (tenant) or for the cluster as a whole. Relative and absolute limits are supported. + +By default, no more than 10% of unavailable nodes are allowed for each database and the cluster as a whole. 
+ +## Checking algorithm {#checking-algorithm} + +To check if the actions of a maintenance task can be performed, the CMS sequentially goes through each action group in the task and checks the action from the group: +- If the object of the action is a host, the CMS checks whether the action can be performed with all nodes running on the host. +- If the object of the action is a node, the CMS checks: + - Whether there is a lock on the node. + - Whether it's possible to lock the node according to the limits of unavailable nodes. + - Whether it's possible to lock all VDisks of the node according to the availability mode. + - Whether it's possible to lock the State Storage ring of the node according to the availability mode. + - Whether it's possible to lock the node according to the limit of unavailable nodes on which cluster system tablets can run. + +If the checks are successful, the action can be performed and a temporary locks are taken on the checked nodes. The CMS then considers the next group of actions. Temporary locks help to understand whether the actions requested in different groups conflict with each other. Once the check is complete, the temporary locks are removed. + +## Examples {#examples} + +The [ydbops](https://github.com/ydb-platform/ydbops) utility tool uses CMS for cluster maintenance without downtime. You can also use the CMS directly through the [gRPC API](https://github.com/ydb-platform/ydb/blob/main/ydb/public/api/grpc/draft/ydb_maintenance_v1.proto). + +#### Take out a node for maintenance {#node-maintenance} + +{% note info "Functionality in development" %} + +Functionality is expected in upcoming versions of ydbops. + +{% endnote %} + +To take out a node for maintenance, you can use the command: +``` +$ ydbops node maintenance --host +``` +When executing this command, ydbops will take an exclusive lock on the node in CMS. 
+
+### Rolling restart {##rolling-restart}
+
+To perform a rolling restart of the entire cluster you can use the command:
+```
+$ ydbops restart --endpoint grpc:// --availability-mode strong
+```
+The ydbops utility will automatically create a maintenance task to restart the entire cluster using the given availability mode. As it progresses, ydbops will refresh the maintenance task and take exclusive locks on the nodes in the CMS until all nodes are restarted.
diff --git a/ydb/docs/en/core/maintenance/toc_i.yaml b/ydb/docs/en/core/maintenance/toc_i.yaml
index 4ce3544597ea..2f19919177bc 100644
--- a/ydb/docs/en/core/maintenance/toc_i.yaml
+++ b/ydb/docs/en/core/maintenance/toc_i.yaml
@@ -11,5 +11,7 @@ items:
 include: { mode: link, path: manual/toc_p.yaml }
 - name: Changing an actor system's configuration
 href: manual/change_actorsystem_configs.md
+ - name: Maintenance without downtine
+ href: manual/maintenance-without-downtime.md
 - name: Updating configurations via CMS
 href: manual/cms.md
diff --git a/ydb/docs/ru/core/maintenance/maintenance-without-outages.md b/ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md
similarity index 84%
rename from ydb/docs/ru/core/maintenance/maintenance-without-outages.md
rename to ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md
index cbc8a99d7d61..6fb45954fcfe 100644
--- a/ydb/docs/ru/core/maintenance/maintenance-without-outages.md
+++ b/ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md
@@ -1,11 +1,11 @@
 # Обслуживание кластера без потери доступности
 Периодически кластер {{ ydb-short-name }} необходимо обслуживать, например, обновлять его версию или заменять сломавшиеся диски. Работы по обслуживанию могут привести к недоступности кластера или имеющихся баз данных из-за:
-- Превышения модели отказа затронутых [групп хранения](../concepts/databases.md#storage-groups).
-- Превышения модели отказа [State Storage](../deploy/configuration/config.md#domains-state).
-- Недостатка вычислительных ресурсов вследствие остановки слишком большого количества [динамических узлов](../concepts/cluster/common_scheme_ydb.md#nodes). +- Превышения модели отказа затронутых [групп хранения](../../concepts/databases.md#storage-groups). +- Превышения модели отказа [State Storage](../../deploy/configuration/config.md#domains-state). +- Недостатка вычислительных ресурсов вследствие остановки слишком большого количества [динамических узлов](../../concepts/cluster/common_scheme_ydb.md#nodes). -Для избежания таких ситуаций в {{ ydb-short-name }} есть системная [таблетка](../concepts/cluster/common_scheme_ydb.md#tablets), которая следит за состоянием кластера — *Cluster Management System (CMS)*. CMS позволяет ответить на вопрос можно ли безопасно вывести в обслуживание узел {{ ydb-short-name }} или хост, на котором работают узлы {{ ydb-short-name }}. Для этого необходимо создать [задачу обслуживания](#maintenance-task) в CMS и указать в ней взятие эксклюзивных блокировок на узлы или хосты, которые будут задействованы в обслуживании. Компоненты кластера, на которые взяты блокировки, считаются недоступными с точки зрения CMS, и их можно безопасно обслуживать. CMS [проверит](#check-task-actions-algorithm) текущее состояние кластера и возьмет блокировки, только если работы по обслуживанию соответствуют ограничениям [режима доступности](#availability-mode) и [лимитам недоступных узлов](#unavailable-node-limits). +Для избежания таких ситуаций в {{ ydb-short-name }} есть системная [таблетка](../../concepts/cluster/common_scheme_ydb.md#tablets), которая следит за состоянием кластера — *Cluster Management System (CMS)*. CMS позволяет ответить на вопрос можно ли безопасно вывести в обслуживание узел {{ ydb-short-name }} или хост, на котором работают узлы {{ ydb-short-name }}. Для этого необходимо создать [задачу обслуживания](#maintenance-task) в CMS и указать в ней взятие эксклюзивных блокировок на узлы или хосты, которые будут задействованы в обслуживании. 
Компоненты кластера, на которые взяты блокировки, считаются недоступными с точки зрения CMS, и их можно безопасно обслуживать. CMS [проверит](#check-task-actions-algorithm) текущее состояние кластера и возьмет блокировки, только если работы по обслуживанию соответствуют ограничениям [режима доступности](#availability-mode) и [лимитам недоступных узлов](#unavailable-node-limits). {% note warning "Поломки во время проведения работ" %} @@ -38,11 +38,11 @@ В задаче обслуживания необходимо указать режим доступности кластера, который должен соблюдаться при проверке возможности выполнения действий. Поддерживаются следующие режимы: - **Strong** — режим, минимизирующий риск потери доступности. - - Допускается не более одного недоступного [VDisk](../concepts/cluster/distributed_storage.md#storage-groups) в каждой из затрагиваемых групп хранения. + - Допускается не более одного недоступного [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) в каждой из затрагиваемых групп хранения. - Допускается не более одного недоступного кольца State Storage. - **Weak** — режим, не позволяющий превысить модель отказа. - - Допускается не более двух недоступных VDisk-ов для затрагиваемых групп хранения со схемой [block-4-2](../administration/production-storage-config.md#reliability). - - Допускается не более четырех недоступных VDisk-ов, три из которых должны находиться в одном датацентре, для затрагиваемых групп хранения со схемой [mirror-3-dc](../administration/production-storage-config.md#reliability). + - Допускается не более двух недоступных VDisk-ов для затрагиваемых групп хранения со схемой [block-4-2](../../administration/production-storage-config.md#reliability). + - Допускается не более четырех недоступных VDisk-ов, три из которых должны находиться в одном датацентре, для затрагиваемых групп хранения со схемой [mirror-3-dc](../../administration/production-storage-config.md#reliability). 
- Допускается не более `(nto_select - 1) / 2` недоступных колец State Storage. - **Force** — принудительный режим, модель отказа игнорируется. Не рекомендуется к использованию. diff --git a/ydb/docs/ru/core/maintenance/manual/node_restarting.md b/ydb/docs/ru/core/maintenance/manual/node_restarting.md index f81b118bc977..b88382e93a51 100644 --- a/ydb/docs/ru/core/maintenance/manual/node_restarting.md +++ b/ydb/docs/ru/core/maintenance/manual/node_restarting.md @@ -1,3 +1,5 @@ +# Безопасный рестарт и выключение узлов + ## Остановка/рестарт процесса ydb на узле {#restart_process} Чтобы убедиться, что процесс можно остановить, надо выполнить следующие шаги. diff --git a/ydb/docs/ru/core/maintenance/toc_i.yaml b/ydb/docs/ru/core/maintenance/toc_i.yaml index c187b0077f92..3849318c3747 100644 --- a/ydb/docs/ru/core/maintenance/toc_i.yaml +++ b/ydb/docs/ru/core/maintenance/toc_i.yaml @@ -12,7 +12,7 @@ items: - name: Изменение конфигурации актор-системы href: manual/change_actorsystem_configs.md - name: Обслуживание кластера без потери доступности - href: maintenance-without-outages.md + href: manual/maintenance-without-downtime.md - name: Управление конфигурацией кластера items: - name: Обзор конфигурации From 4a7c4cc80a7f6e6e135f03a9e587d45d27efcb32 Mon Sep 17 00:00:00 2001 From: Ilia Shakhov Date: Fri, 3 May 2024 10:35:28 +0000 Subject: [PATCH 02/10] Fix toc --- ydb/docs/en/core/devops/manual/toc_p.yaml | 2 ++ ydb/docs/en/core/maintenance/toc_i.yaml | 2 +- ydb/docs/ru/core/devops/manual/toc_p.yaml | 2 ++ 3 files changed, 5 insertions(+), 1 deletion(-) diff --git a/ydb/docs/en/core/devops/manual/toc_p.yaml b/ydb/docs/en/core/devops/manual/toc_p.yaml index 2ca9b34a3d6b..ce00df7455fb 100644 --- a/ydb/docs/en/core/devops/manual/toc_p.yaml +++ b/ydb/docs/en/core/devops/manual/toc_p.yaml @@ -23,4 +23,6 @@ items: href: ../../maintenance/manual/cms.md - name: System views href: system-views.md +- name: Maintenance without downtime + href: 
../../manual/maintenance-without-downtime.md
diff --git a/ydb/docs/en/core/maintenance/toc_i.yaml b/ydb/docs/en/core/maintenance/toc_i.yaml
index 2f19919177bc..87976af7afd1 100644
--- a/ydb/docs/en/core/maintenance/toc_i.yaml
+++ b/ydb/docs/en/core/maintenance/toc_i.yaml
@@ -11,7 +11,7 @@ items:
 include: { mode: link, path: manual/toc_p.yaml }
 - name: Changing an actor system's configuration
 href: manual/change_actorsystem_configs.md
- - name: Maintenance without downtine
+ - name: Maintenance without downtime
 href: manual/maintenance-without-downtime.md
 - name: Updating configurations via CMS
 href: manual/cms.md
diff --git a/ydb/docs/ru/core/devops/manual/toc_p.yaml b/ydb/docs/ru/core/devops/manual/toc_p.yaml
index d8d598087882..24d4b2e0df10 100644
--- a/ydb/docs/ru/core/devops/manual/toc_p.yaml
+++ b/ydb/docs/ru/core/devops/manual/toc_p.yaml
@@ -35,3 +35,5 @@ items:
 href: ../../maintenance/manual/cms.md
 - name: Системные таблицы
 href: system-views.md
+- name: Обслуживание без потери доступности
+ href: ../..manual/maintenance-without-downtime.md
From 3bd0198eb35b76086873e3498aa79ffb4fd3fb1b Mon Sep 17 00:00:00 2001
From: Ilia Shakhov
Date: Fri, 3 May 2024 10:43:16 +0000
Subject: [PATCH 03/10] Fix toc
---
 ydb/docs/en/core/devops/manual/toc_p.yaml | 2 +-
 ydb/docs/ru/core/devops/manual/toc_p.yaml | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/ydb/docs/en/core/devops/manual/toc_p.yaml b/ydb/docs/en/core/devops/manual/toc_p.yaml
index ce00df7455fb..dcd79b7672bd 100644
--- a/ydb/docs/en/core/devops/manual/toc_p.yaml
+++ b/ydb/docs/en/core/devops/manual/toc_p.yaml
@@ -24,5 +24,5 @@ items:
 - name: System views
 href: system-views.md
 - name: Maintenance without downtime
- href: ../../manual/maintenance-without-downtime.md
+ href: ../../maintenance/manual/maintenance-without-downtime.md
diff --git a/ydb/docs/ru/core/devops/manual/toc_p.yaml b/ydb/docs/ru/core/devops/manual/toc_p.yaml
index 24d4b2e0df10..bfe689591fbd 100644
--- a/ydb/docs/ru/core/devops/manual/toc_p.yaml
+++ b/ydb/docs/ru/core/devops/manual/toc_p.yaml
@@ -36,4 +36,4 @@ items:
 - name: Системные таблицы
 href: system-views.md
 - name: Обслуживание без потери доступности
- href: ../..manual/maintenance-without-downtime.md
+ href: ../../maintenance/manual/maintenance-without-downtime.md
From a918e971ef585ea25d58644d4196ac8c0807b2a8 Mon Sep 17 00:00:00 2001
From: Ilia Shakhov
Date: Fri, 3 May 2024 10:53:56 +0000
Subject: [PATCH 04/10] Fix toc
---
 .../maintenance/manual/maintenance-without-downtime.md | 6 +++---
 .../maintenance/manual/maintenance-without-downtime.md | 8 ++++----
 2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md
index 620890883009..ea46629d2741 100644
--- a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md
+++ b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md
@@ -5,7 +5,7 @@ Periodically, the {{ ydb-short-name }} cluster needs to be maintained, such as u
 - Exceeding the [State Storage](../../deploy/configuration/config.md#domains-state) failure model.
 - Lack of computational resources due to stopping too many [dynamic nodes](../../concepts/cluster/common_scheme_ydb.md#nodes).
-To avoid such situations, {{ ydb-short-name }} has a system [tablet](./concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster - the *Cluster Management System (CMS)*. The CMS allows you to answer the question of whether a {{ ydb-short-name }} node or host running {{ ydb-short-name }} nodes can be safely taken out for maintenance. To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it to take exclusive locks on the nodes or hosts that will be involved in the maintenance. 
The cluster components on which the locks are taken are considered unavailable from the CMS perspective and can be safely maintained. The CMS will [check](#checking-algorithm) the current state of the cluster and take locks only if the maintenance comply with the [availability mode](#availability-mode) and [unavailable node limits](#unavailable-node-limits). +To avoid such situations, {{ ydb-short-name }} has a system [tablet](../../concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster - the *Cluster Management System (CMS)*. The CMS allows you to answer the question of whether a {{ ydb-short-name }} node or host running {{ ydb-short-name }} nodes can be safely taken out for maintenance. To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it to take exclusive locks on the nodes or hosts that will be involved in the maintenance. The cluster components on which the locks are taken are considered unavailable from the CMS perspective and can be safely maintained. The CMS will [check](#checking-algorithm) the current state of the cluster and take locks only if the maintenance comply with the [availability mode](#availability-mode) and [unavailable node limits](#unavailable-node-limits). {% note warning "Faults during maintenance" %} @@ -41,8 +41,8 @@ In a maintenance task, you need to specify the availability mode of the cluster - No more than one unavailable [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) is allowed in each affected storage group. - No more than one unavailable State Storage ring is allowed. - **Weak** - a mode that does not allow exceeding the failure model. - - No more than two unavailable VDisks are allowed for affected storage groups with the [block-4-2](../../administration/production-storage-config.md#reliability) scheme. 
- - No more than four unavailable VDisks, three of which must be in the same data center, are allowed for affected storage groups with the [mirror-3-dc](../../administration/production-storage-config.md#reliability) scheme. + - No more than two unavailable VDisks are allowed for affected storage groups with the [block-4-2](../../deploy/configuration/config.md#reliability) scheme. + - No more than four unavailable VDisks, three of which must be in the same data center, are allowed for affected storage groups with the [mirror-3-dc](../../deploy/configuration/config.md#reliability) scheme. - No more than `(nto_select - 1) / 2` unavailable State Storage rings are allowed. - **Force** - forced mode, the failure model is ignored. Not recommended for use. diff --git a/ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md b/ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md index 6fb45954fcfe..7c9b3ffa5730 100644 --- a/ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md +++ b/ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md @@ -5,7 +5,7 @@ - Превышения модели отказа [State Storage](../../deploy/configuration/config.md#domains-state). - Недостатка вычислительных ресурсов вследствие остановки слишком большого количества [динамических узлов](../../concepts/cluster/common_scheme_ydb.md#nodes). -Для избежания таких ситуаций в {{ ydb-short-name }} есть системная [таблетка](../../concepts/cluster/common_scheme_ydb.md#tablets), которая следит за состоянием кластера — *Cluster Management System (CMS)*. CMS позволяет ответить на вопрос можно ли безопасно вывести в обслуживание узел {{ ydb-short-name }} или хост, на котором работают узлы {{ ydb-short-name }}. Для этого необходимо создать [задачу обслуживания](#maintenance-task) в CMS и указать в ней взятие эксклюзивных блокировок на узлы или хосты, которые будут задействованы в обслуживании. 
Компоненты кластера, на которые взяты блокировки, считаются недоступными с точки зрения CMS, и их можно безопасно обслуживать. CMS [проверит](#check-task-actions-algorithm) текущее состояние кластера и возьмет блокировки, только если работы по обслуживанию соответствуют ограничениям [режима доступности](#availability-mode) и [лимитам недоступных узлов](#unavailable-node-limits). +Для избежания таких ситуаций в {{ ydb-short-name }} есть системная [таблетка](../../concepts/cluster/common_scheme_ydb.md#tablets), которая следит за состоянием кластера — *Cluster Management System (CMS)*. CMS позволяет ответить на вопрос можно ли безопасно вывести в обслуживание узел {{ ydb-short-name }} или хост, на котором работают узлы {{ ydb-short-name }}. Для этого необходимо создать [задачу обслуживания](#maintenance-task) в CMS и указать в ней взятие эксклюзивных блокировок на узлы или хосты, которые будут задействованы в обслуживании. Компоненты кластера, на которые взяты блокировки, считаются недоступными с точки зрения CMS, и их можно безопасно обслуживать. CMS [проверит](#checking-algorithm) текущее состояние кластера и возьмет блокировки, только если работы по обслуживанию соответствуют ограничениям [режима доступности](#availability-mode) и [лимитам недоступных узлов](#unavailable-node-limits). {% note warning "Поломки во время проведения работ" %} @@ -41,8 +41,8 @@ - Допускается не более одного недоступного [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) в каждой из затрагиваемых групп хранения. - Допускается не более одного недоступного кольца State Storage. - **Weak** — режим, не позволяющий превысить модель отказа. - - Допускается не более двух недоступных VDisk-ов для затрагиваемых групп хранения со схемой [block-4-2](../../administration/production-storage-config.md#reliability). 
- - Допускается не более четырех недоступных VDisk-ов, три из которых должны находиться в одном датацентре, для затрагиваемых групп хранения со схемой [mirror-3-dc](../../administration/production-storage-config.md#reliability). + - Допускается не более двух недоступных VDisk-ов для затрагиваемых групп хранения со схемой [block-4-2](../../deploy/configuration/config.md#reliability). + - Допускается не более четырех недоступных VDisk-ов, три из которых должны находиться в одном датацентре, для затрагиваемых групп хранения со схемой [mirror-3-dc](../../deploy/configuration/config.md#reliability). - Допускается не более `(nto_select - 1) / 2` недоступных колец State Storage. - **Force** — принудительный режим, модель отказа игнорируется. Не рекомендуется к использованию. @@ -58,7 +58,7 @@ По умолчанию допускается не более 10% недоступных узлов для каждой базы данных и кластера в целом. -## Алгоритм проверки действий задачи {#check-task-actions-algorithm} +## Алгоритм проверки {#checking-algorithm} Для того, чтобы проверить можно ли выполнить действия задачи обслуживания, CMS последовательно идет по каждой группе действий в задаче и проверяет действие из группы: - Если объектом действия является хост, то CMS проверяет можно ли выполнить действие со всеми узлами, запущенными на хосте. 
From 1fafadf3a058268876fc778cbe9c22d27578cfb0 Mon Sep 17 00:00:00 2001
From: Ilia Shakhov
Date: Fri, 3 May 2024 11:03:20 +0000
Subject: [PATCH 05/10] Fix header
---
 .../en/core/maintenance/manual/maintenance-without-downtime.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md
index ea46629d2741..bbf7d1a6b579 100644
--- a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md
+++ b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md
@@ -75,7 +75,7 @@ If the checks are successful, the action can be performed and a temporary locks
 The [ydbops](https://github.com/ydb-platform/ydbops) utility tool uses CMS for cluster maintenance without downtime. You can also use the CMS directly through the [gRPC API](https://github.com/ydb-platform/ydb/blob/main/ydb/public/api/grpc/draft/ydb_maintenance_v1.proto).
-#### Take out a node for maintenance {#node-maintenance}
+### Take out a node for maintenance {#node-maintenance}
 {% note info "Functionality in development" %}
From 683d08222f8f7e64f44728fd36731abc5cf37e5d Mon Sep 17 00:00:00 2001
From: Ilia Shakhov
Date: Fri, 3 May 2024 13:01:57 +0000
Subject: [PATCH 06/10] Micro fix
---
 .../en/core/maintenance/manual/maintenance-without-downtime.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md
index bbf7d1a6b579..3439607dcda6 100644
--- a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md
+++ b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md
@@ -18,7 +18,7 @@ During maintenance activities whose safety is guaranteed by the CMS, faults unre
 A *maintenance task* is a set of *actions* that the user asks the CMS to perform for safe maintenance.
Supported actions: -- Taking an exclusive lock on a cluster component - node or host. +- Taking an exclusive lock on a cluster component — node or host. In a task, actions are divided into groups. Actions from the same group are performed atomically. Currently, groups can consist of only one action. From 8983c4789c9cc1857fc60bd0617799aa78e04391 Mon Sep 17 00:00:00 2001 From: Ilia Shakhov Date: Mon, 6 May 2024 10:39:52 +0000 Subject: [PATCH 07/10] Replace 'take' to 'acquire' --- .../manual/maintenance-without-downtime.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md index 3439607dcda6..dcb99dfb853d 100644 --- a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md +++ b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md @@ -5,7 +5,7 @@ Periodically, the {{ ydb-short-name }} cluster needs to be maintained, such as u - Exceeding the [State Storage](../../deploy/configuration/config.md#domains-state) failure model. - Lack of computational resources due to stopping too many [dynamic nodes](../../concepts/cluster/common_scheme_ydb.md#nodes). -To avoid such situations, {{ ydb-short-name }} has a system [tablet](../../concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster - the *Cluster Management System (CMS)*. The CMS allows you to answer the question of whether a {{ ydb-short-name }} node or host running {{ ydb-short-name }} nodes can be safely taken out for maintenance. To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it to take exclusive locks on the nodes or hosts that will be involved in the maintenance. The cluster components on which the locks are taken are considered unavailable from the CMS perspective and can be safely maintained. 
The CMS will [check](#checking-algorithm) the current state of the cluster and take locks only if the maintenance comply with the [availability mode](#availability-mode) and [unavailable node limits](#unavailable-node-limits). +To avoid such situations, {{ ydb-short-name }} has a system [tablet](../../concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster - the *Cluster Management System (CMS)*. The CMS allows you to answer the question of whether a {{ ydb-short-name }} node or host running {{ ydb-short-name }} nodes can be safely taken out for maintenance. To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it to acquire exclusive locks on the nodes or hosts that will be involved in the maintenance. The cluster components on which the locks are acquired are considered unavailable from the CMS perspective and can be safely maintained. The CMS will [check](#checking-algorithm) the current state of the cluster and acquire locks only if the maintenance comply with the [availability mode](#availability-mode) and [unavailable node limits](#unavailable-node-limits). {% note warning "Faults during maintenance" %} @@ -18,13 +18,13 @@ During maintenance activities whose safety is guaranteed by the CMS, faults unre A *maintenance task* is a set of *actions* that the user asks the CMS to perform for safe maintenance. Supported actions: -- Taking an exclusive lock on a cluster component — node or host. +- Acquiring an exclusive lock on a cluster component — node or host. In a task, actions are divided into groups. Actions from the same group are performed atomically. Currently, groups can consist of only one action. If it's not possible to perform an action at the time of the request, the CMS informs you of the reason and the time when it is worth *refreshing* the task, and sets the action status to *pending*. When the task is refreshed, the CMS attempts to perform the pending actions again. 
-*Performed* actions have a deadline after which they are considered *completed* and stop having an effect on the cluster. For example, an exclusive lock is removed. An action can be completed early. +*Performed* actions have a deadline after which they are considered *completed* and stop having an effect on the cluster. For example, an exclusive lock is released. An action can be completed early. {% note info "Protracted maintenance" %} @@ -69,7 +69,7 @@ To check if the actions of a maintenance task can be performed, the CMS sequenti - Whether it's possible to lock the State Storage ring of the node according to the availability mode. - Whether it's possible to lock the node according to the limit of unavailable nodes on which cluster system tablets can run. -If the checks are successful, the action can be performed and a temporary locks are taken on the checked nodes. The CMS then considers the next group of actions. Temporary locks help to understand whether the actions requested in different groups conflict with each other. Once the check is complete, the temporary locks are removed. +If the checks are successful, the action can be performed and a temporary locks are acquired on the checked nodes. The CMS then considers the next group of actions. Temporary locks help to understand whether the actions requested in different groups conflict with each other. Once the check is complete, the temporary locks are released. ## Examples {#examples} @@ -87,7 +87,7 @@ To take out a node for maintenance, you can use the command: ``` $ ydbops node maintenance --host ``` -When executing this command, ydbops will take an exclusive lock on the node in CMS. +When executing this command, ydbops will acquire an exclusive lock on the node in CMS. 
### Rolling restart {##rolling-restart} @@ -95,4 +95,4 @@ To perform a rolling restart of the entire cluster you can use the command: ``` $ ydbops restart --endpoint grpc:// --availability-mode strong ``` -The ydbops utility will automatically create a maintenance task to restart the entire cluster using the given availability mode. As it progresses, ydbops will refresh the maintenance task and take exclusive locks on the nodes in the CMS until all nodes are restarted. +The ydbops utility will automatically create a maintenance task to restart the entire cluster using the given availability mode. As it progresses, ydbops will refresh the maintenance task and acquire exclusive locks on the nodes in the CMS until all nodes are restarted. From e1dd75adf42bbb39195979148d20eb13fb6ced54 Mon Sep 17 00:00:00 2001 From: Ilia Shakhov Date: Mon, 13 May 2024 13:45:06 +0000 Subject: [PATCH 08/10] Fix --- .../manual/maintenance-without-downtime.md | 100 ++++++++++++++++++ ydb/docs/en/core/devops/manual/toc_p.yaml | 2 +- .../manual/maintenance-without-downtime.md | 98 ----------------- .../manual/maintenance-without-downtime.md | 18 ++-- ydb/docs/ru/core/devops/manual/toc_p.yaml | 2 +- ydb/docs/ru/core/maintenance/toc_i.yaml | 2 - 6 files changed, 112 insertions(+), 110 deletions(-) create mode 100644 ydb/docs/en/core/devops/manual/maintenance-without-downtime.md delete mode 100644 ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md rename ydb/docs/ru/core/{maintenance => devops}/manual/maintenance-without-downtime.md (89%) diff --git a/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md b/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md new file mode 100644 index 000000000000..d1e6554e5fc8 --- /dev/null +++ b/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md @@ -0,0 +1,100 @@ +# Maintenance without downtime + +A {{ ydb-short-name }} cluster periodically needs maintenance, such as upgrading its version or replacing broken 
disks. Maintenance can cause a cluster or its databases to become unavailable due to: +- Going beyond the expectations of the affected [storage groups](../../concepts/databases.md#storage-groups) failure model. +- Going beyond the expectations of the [State Storage](../../deploy/configuration/config.md#domains-state) failure model. +- Lack of computational resources due to stopping too many [dynamic nodes](../../concepts/cluster/common_scheme_ydb.md#nodes). + +To avoid such situations, {{ ydb-short-name }} has a system [tablet](../../concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster - the *Cluster Management System (CMS)*. The CMS allows you to answer the question of whether a {{ ydb-short-name }} node or host running {{ ydb-short-name }} nodes can be safely taken out for maintenance. To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it to acquire exclusive locks on the nodes or hosts that will be involved in the maintenance. The cluster components on which the locks are acquired are considered unavailable from the CMS perspective and can be safely engaged in maintenance. The CMS will [check](#checking-algorithm) the current state of the cluster and acquire locks only if the maintenance complies with the [availability mode](#availability-mode) and [unavailable node limits](#unavailable-node-limits). + +{% note warning "Failures during maintenance" %} + +During maintenance activities whose safety is guaranteed by the CMS, failures unrelated to those activities may occur in the cluster. If the failures threaten the cluster's availability, urgently aborting the maintenance can help mitigate the risk of cluster downtime. + +{% endnote %} + +## Maintenance task {#maintenance-task} + +A *maintenance task* is a set of *actions* that the user asks the CMS to perform for safe maintenance. + +Supported actions: +- Acquiring an exclusive lock on a cluster component (node or host). 
+ +Actions in a task are divided into groups. Actions from the same group are performed atomically. Currently, groups can consist of only one action. + +If an action cannot be performed at the time of the request, the CMS informs you of the reason and time it is worth *refreshing* the task and sets the action status to *pending*. When the task is refreshed, the CMS attempts to perform the pending actions again. + +*Performed* actions have a deadline after which they are considered *completed* and stop affecting the cluster. For example, an exclusive lock is released. An action can be completed early. + +{% note info "Protracted maintenance" %} + +If maintenance continues after the actions performed to make it safe have been completed, this is considered a failure in the cluster. + +{% endnote %} + +Completed actions are automatically removed from the task. + +### Availability mode {#availability-mode} + +In a maintenance task, you need to specify the cluster's availability mode to comply with when checking whether actions can be performed. The following modes are supported: +- **Strong**: a mode that minimizes the risk of availability loss. + - No more than one unavailable [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) is allowed in each affected storage group. + - No more than one unavailable State Storage ring is allowed. +- **Weak**: a mode that does not allow exceeding the failure model. + - For affected storage groups with the [block-4-2](../../deploy/configuration/config.md#reliability) scheme, no more than two unavailable VDisks are allowed. + - For affected storage groups with the [mirror-3-dc](../../deploy/configuration/config.md#reliability) scheme, up to four unavailable VDisks are allowed, three of which must be in the same data center. + - No more than `(nto_select - 1) / 2` unavailable State Storage rings are allowed. +- **Force**: a forced mode, the failure model is ignored. 
*Not recommended for use.* + +### Priority {#priority} + +You can specify the priority of a maintenance task. A lower value means a higher priority. + +The task's actions cannot be performed until all conflicting actions from tasks with a higher priority are completed. Tasks with the same priority have no advantage over each other. + +## Unavailable node limits {#unavailable-node-limits} + +In the CMS configuration, you can configure limits on the number of unavailable nodes for a database (tenant) or the cluster as a whole. Relative and absolute limits are supported. + +By default, each database and the cluster as a whole are allowed to have no more than 10% unavailable nodes. + +## Checking algorithm {#checking-algorithm} + +To check if the actions of a maintenance task can be performed, the CMS sequentially goes through each action group in the task and checks the action from the group: +- If the action's object is a host, the CMS checks whether the action can be performed with all nodes running on the host. +- If the action's object is a node, the CMS checks: + - Whether there is a lock on the node. + - Whether it's possible to lock the node according to the limits of unavailable nodes. + - Whether it's possible to lock all VDisks of the node according to the availability mode. + - Whether it's possible to lock the State Storage ring of the node according to the availability mode. + - Whether it's possible to lock the node according to the limit of unavailable nodes on which cluster system tablets can run. + +The action can be performed if the checks are successful, and temporary locks are acquired on the checked nodes. The CMS then considers the next group of actions. Temporary locks help to understand whether the actions requested in different groups conflict with each other. Once the check is complete, the temporary locks are released. 
+ +## Examples {#examples} + +The [ydbops](https://github.com/ydb-platform/ydbops) utility tool uses CMS for cluster maintenance without downtime. You can also use the CMS directly through the [gRPC API](https://github.com/ydb-platform/ydb/blob/main/ydb/public/api/grpc/draft/ydb_maintenance_v1.proto). + +### Take out a node for maintenance {#node-maintenance} + +{% note info "Functionality in development" %} + +Functionality is expected in upcoming versions of `ydbops`. + +{% endnote %} + +To take out a node for maintenance, you can use the command: +``` +$ ydbops node maintenance --host +``` +When executing this command, `ydbops` will acquire an exclusive lock on the node in CMS. + +### Rolling restart {##rolling-restart} + +To perform a rolling restart of the entire cluster, you can use the command: +``` +$ ydbops restart --endpoint grpc:// --availability-mode strong +``` +If your systemd unit name is different from the default one, you may need to override it with `--systemd-unit` flag. + +The `ydbops` utility will automatically create a maintenance task to restart the entire cluster using the given availability mode. As it progresses, `ydbops` will refresh the maintenance task and acquire exclusive locks on the nodes in the CMS until all nodes are restarted. 
diff --git a/ydb/docs/en/core/devops/manual/toc_p.yaml b/ydb/docs/en/core/devops/manual/toc_p.yaml index dcd79b7672bd..efec05c8ce6f 100644 --- a/ydb/docs/en/core/devops/manual/toc_p.yaml +++ b/ydb/docs/en/core/devops/manual/toc_p.yaml @@ -24,5 +24,5 @@ items: - name: System views href: system-views.md - name: Maintenance without downtime - href: ../../maintenance/manual/maintenance-without-downtime.md + href: maintenance-without-downtime.md diff --git a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md b/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md deleted file mode 100644 index dcb99dfb853d..000000000000 --- a/ydb/docs/en/core/maintenance/manual/maintenance-without-downtime.md +++ /dev/null @@ -1,98 +0,0 @@ -# Maintenance without downtime - -Periodically, the {{ ydb-short-name }} cluster needs to be maintained, such as upgrading its version or replacing broken disks. Maintenance can cause the cluster or its databases to become unavailable due to: -- Exceeding the failure model of the affected [storage groups](../../concepts/databases.md#storage-groups). -- Exceeding the [State Storage](../../deploy/configuration/config.md#domains-state) failure model. -- Lack of computational resources due to stopping too many [dynamic nodes](../../concepts/cluster/common_scheme_ydb.md#nodes). - -To avoid such situations, {{ ydb-short-name }} has a system [tablet](../../concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster - the *Cluster Management System (CMS)*. The CMS allows you to answer the question of whether a {{ ydb-short-name }} node or host running {{ ydb-short-name }} nodes can be safely taken out for maintenance. To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it to acquire exclusive locks on the nodes or hosts that will be involved in the maintenance. 
The cluster components on which the locks are acquired are considered unavailable from the CMS perspective and can be safely maintained. The CMS will [check](#checking-algorithm) the current state of the cluster and acquire locks only if the maintenance comply with the [availability mode](#availability-mode) and [unavailable node limits](#unavailable-node-limits). - -{% note warning "Faults during maintenance" %} - -During maintenance activities whose safety is guaranteed by the CMS, faults unrelated to those activities may occur in the cluster. If the faults threaten the availability of the cluster, urgent completion of the maintenance can help mitigate the risk of loss of availability. - -{% endnote %} - -## Maintenance task {#maintenance-task} - -A *maintenance task* is a set of *actions* that the user asks the CMS to perform for safe maintenance. - -Supported actions: -- Acquiring an exclusive lock on a cluster component — node or host. - -In a task, actions are divided into groups. Actions from the same group are performed atomically. Currently, groups can consist of only one action. - -If it's not possible to perform an action at the time of the request, the CMS informs you of the reason and the time when it is worth *refreshing* the task, and sets the action status to *pending*. When the task is refreshed, the CMS attempts to perform the pending actions again. - -*Performed* actions have a deadline after which they are considered *completed* and stop having an effect on the cluster. For example, an exclusive lock is released. An action can be completed early. - -{% note info "Protracted maintenance" %} - -If maintenance continues after the actions that were performed to make it safe have been completed, this is considered a fault in the cluster. - -{% endnote %} - -Completed actions are automatically removed from the task. 
- -### Availability mode {#availability-mode} - -In a maintenance task, you need to specify the availability mode of the cluster to be complied when checking whether actions can be performed. The following modes are supported: -- **Strong** - a mode that minimizes the risk of availability loss. - - No more than one unavailable [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) is allowed in each affected storage group. - - No more than one unavailable State Storage ring is allowed. -- **Weak** - a mode that does not allow exceeding the failure model. - - No more than two unavailable VDisks are allowed for affected storage groups with the [block-4-2](../../deploy/configuration/config.md#reliability) scheme. - - No more than four unavailable VDisks, three of which must be in the same data center, are allowed for affected storage groups with the [mirror-3-dc](../../deploy/configuration/config.md#reliability) scheme. - - No more than `(nto_select - 1) / 2` unavailable State Storage rings are allowed. -- **Force** - forced mode, the failure model is ignored. Not recommended for use. - -### Priority {#priority} - -You can specify the priority of a maintenance task. A lower value means a higher priority. - -The actions of the task cannot be performed until all conflicting actions from tasks with a higher priority are completed. Tasks with the same priority have no advantage over each other. - -## Unavailable node limits {#unavailable-node-limits} - -In the CMS configuration, you can configure limits on the number of unavailable nodes for a database (tenant) or for the cluster as a whole. Relative and absolute limits are supported. - -By default, no more than 10% of unavailable nodes are allowed for each database and the cluster as a whole. 
- -## Checking algorithm {#checking-algorithm} - -To check if the actions of a maintenance task can be performed, the CMS sequentially goes through each action group in the task and checks the action from the group: -- If the object of the action is a host, the CMS checks whether the action can be performed with all nodes running on the host. -- If the object of the action is a node, the CMS checks: - - Whether there is a lock on the node. - - Whether it's possible to lock the node according to the limits of unavailable nodes. - - Whether it's possible to lock all VDisks of the node according to the availability mode. - - Whether it's possible to lock the State Storage ring of the node according to the availability mode. - - Whether it's possible to lock the node according to the limit of unavailable nodes on which cluster system tablets can run. - -If the checks are successful, the action can be performed and a temporary locks are acquired on the checked nodes. The CMS then considers the next group of actions. Temporary locks help to understand whether the actions requested in different groups conflict with each other. Once the check is complete, the temporary locks are released. - -## Examples {#examples} - -The [ydbops](https://github.com/ydb-platform/ydbops) utility tool uses CMS for cluster maintenance without downtime. You can also use the CMS directly through the [gRPC API](https://github.com/ydb-platform/ydb/blob/main/ydb/public/api/grpc/draft/ydb_maintenance_v1.proto). - -### Take out a node for maintenance {#node-maintenance} - -{% note info "Functionality in development" %} - -Functionality is expected in upcoming versions of ydbops. - -{% endnote %} - -To take out a node for maintenance, you can use the command: -``` -$ ydbops node maintenance --host -``` -When executing this command, ydbops will acquire an exclusive lock on the node in CMS. 
- -### Rolling restart {##rolling-restart} - -To perform a rolling restart of the entire cluster you can use the command: -``` -$ ydbops restart --endpoint grpc:// --availability-mode strong -``` -The ydbops utility will automatically create a maintenance task to restart the entire cluster using the given availability mode. As it progresses, ydbops will refresh the maintenance task and acquire exclusive locks on the nodes in the CMS until all nodes are restarted. diff --git a/ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md b/ydb/docs/ru/core/devops/manual/maintenance-without-downtime.md similarity index 89% rename from ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md rename to ydb/docs/ru/core/devops/manual/maintenance-without-downtime.md index 7c9b3ffa5730..f373b7aba449 100644 --- a/ydb/docs/ru/core/maintenance/manual/maintenance-without-downtime.md +++ b/ydb/docs/ru/core/devops/manual/maintenance-without-downtime.md @@ -18,7 +18,7 @@ *Задача обслуживания* представляет собой набор *действий*, которые пользователь просит выполнить CMS для возможности проведения безопасного обслуживания. Поддерживаемые действия: -- Взятие эксклюзивной блокировки на компонент кластера — узел или хост. +- Взятие эксклюзивной блокировки на компонент кластера (узел или хост). В задаче действия делятся на группы. Действия из одной группы выполняются атомарно. На данный момент группы могут состоять только из одного действия. @@ -37,14 +37,14 @@ ### Режим доступности {#availability-mode} В задаче обслуживания необходимо указать режим доступности кластера, который должен соблюдаться при проверке возможности выполнения действий. Поддерживаются следующие режимы: -- **Strong** — режим, минимизирующий риск потери доступности. +- **Strong**: режим, минимизирующий риск потери доступности. - Допускается не более одного недоступного [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) в каждой из затрагиваемых групп хранения. 
- Допускается не более одного недоступного кольца State Storage. -- **Weak** — режим, не позволяющий превысить модель отказа. +- **Weak**: режим, не позволяющий превысить модель отказа. - Допускается не более двух недоступных VDisk-ов для затрагиваемых групп хранения со схемой [block-4-2](../../deploy/configuration/config.md#reliability). - Допускается не более четырех недоступных VDisk-ов, три из которых должны находиться в одном датацентре, для затрагиваемых групп хранения со схемой [mirror-3-dc](../../deploy/configuration/config.md#reliability). - Допускается не более `(nto_select - 1) / 2` недоступных колец State Storage. -- **Force** — принудительный режим, модель отказа игнорируется. Не рекомендуется к использованию. +- **Force**: принудительный режим, модель отказа игнорируется. *Не рекомендуется к использованию*. ### Приоритет {#priority} @@ -79,15 +79,15 @@ {% note info "Функциональность в разработке" %} -Функциональность ожидается в ближайших версиях ydbops. +Функциональность ожидается в ближайших версиях `ydbops`. {% endnote %} Для выведения узла для обслуживания можно воспользоваться командой: ``` -$ ydbops node maintenance --host +$ ydbops node maintenance --host ``` -При выполнении этой команды ydbops возьмет эксклюзивную блокировку на узел в CMS. +При выполнении этой команды `ydbops` возьмет эксклюзивную блокировку на узел в CMS. ### Rolling restart {#rolling-restart} @@ -95,4 +95,6 @@ $ ydbops node maintenance --host ``` $ ydbops restart --endpoint grpc:// --availability-mode strong ``` -Утилита ydbops автоматически создаст задачу обслуживания на рестарт всего кластера, используя указанный режим доступности. По ходу продвижения ydbops будет обновлять задачу обслуживания и получать эксклюзивные блокировки на узлы в CMS, пока все узлы не будут перезапущены. +Если используемое имя systemd unit отличается от стандартного, его можно переопределить с помощью флага `--systemd-unit`. 
+ +Утилита `ydbops` автоматически создаст задачу обслуживания на рестарт всего кластера, используя указанный режим доступности. По ходу продвижения `ydbops` будет обновлять задачу обслуживания и получать эксклюзивные блокировки на узлы в CMS, пока все узлы не будут перезапущены. diff --git a/ydb/docs/ru/core/devops/manual/toc_p.yaml b/ydb/docs/ru/core/devops/manual/toc_p.yaml index bfe689591fbd..a547b991c7b3 100644 --- a/ydb/docs/ru/core/devops/manual/toc_p.yaml +++ b/ydb/docs/ru/core/devops/manual/toc_p.yaml @@ -36,4 +36,4 @@ items: - name: Системные таблицы href: system-views.md - name: Обслуживание без потери доступности - href: ../../maintenance/manual/maintenance-without-downtime.md + href: maintenance-without-downtime.md diff --git a/ydb/docs/ru/core/maintenance/toc_i.yaml b/ydb/docs/ru/core/maintenance/toc_i.yaml index 3849318c3747..aaa90502b1c8 100644 --- a/ydb/docs/ru/core/maintenance/toc_i.yaml +++ b/ydb/docs/ru/core/maintenance/toc_i.yaml @@ -11,8 +11,6 @@ items: include: { mode: link, path: manual/toc_p.yaml } - name: Изменение конфигурации актор-системы href: manual/change_actorsystem_configs.md - - name: Обслуживание кластера без потери доступности - href: manual/maintenance-without-downtime.md - name: Управление конфигурацией кластера items: - name: Обзор конфигурации From 1ff8d19adabca93e37266d6bb2afc828b3cb96de Mon Sep 17 00:00:00 2001 From: Ilia Shakhov Date: Wed, 15 May 2024 08:46:18 +0000 Subject: [PATCH 09/10] Fix --- .../manual/maintenance-without-downtime.md | 26 ++++++++----------- .../manual/maintenance-without-downtime.md | 24 +++++++---------- 2 files changed, 21 insertions(+), 29 deletions(-) diff --git a/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md b/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md index d1e6554e5fc8..8dadaaf4eb14 100644 --- a/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md +++ b/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md @@ -75,20 +75,6 @@ The action 
can be performed if the checks are successful, and temporary locks ar The [ydbops](https://github.com/ydb-platform/ydbops) utility tool uses CMS for cluster maintenance without downtime. You can also use the CMS directly through the [gRPC API](https://github.com/ydb-platform/ydb/blob/main/ydb/public/api/grpc/draft/ydb_maintenance_v1.proto). -### Take out a node for maintenance {#node-maintenance} - -{% note info "Functionality in development" %} - -Functionality is expected in upcoming versions of `ydbops`. - -{% endnote %} - -To take out a node for maintenance, you can use the command: -``` -$ ydbops node maintenance --host -``` -When executing this command, `ydbops` will acquire an exclusive lock on the node in CMS. - ### Rolling restart {##rolling-restart} To perform a rolling restart of the entire cluster, you can use the command: @@ -97,4 +83,14 @@ $ ydbops restart --endpoint grpc:// --availability-mode strong ``` If your systemd unit name is different from the default one, you may need to override it with `--systemd-unit` flag. -The `ydbops` utility will automatically create a maintenance task to restart the entire cluster using the given availability mode. As it progresses, `ydbops` will refresh the maintenance task and acquire exclusive locks on the nodes in the CMS until all nodes are restarted. +The `ydbops` utility will automatically create a maintenance task to restart the entire cluster using the given availability mode. As it progresses, the `ydbops` will refresh the maintenance task and acquire exclusive locks on the nodes in the CMS until all nodes are restarted. + +### Take out a node for maintenance {#node-maintenance} + +{% note info "Functionality in development" %} + +Functionality is expected in upcoming versions of the `ydbops`. + +{% endnote %} + +To take out a node for maintenance, you can use the `ydbops` utility. When taking a node out, `ydbops` will acquire an exclusive lock on this node in CMS. 
diff --git a/ydb/docs/ru/core/devops/manual/maintenance-without-downtime.md b/ydb/docs/ru/core/devops/manual/maintenance-without-downtime.md index f373b7aba449..adbd757f9970 100644 --- a/ydb/docs/ru/core/devops/manual/maintenance-without-downtime.md +++ b/ydb/docs/ru/core/devops/manual/maintenance-without-downtime.md @@ -75,20 +75,6 @@ Утилита [ydbops](https://github.com/ydb-platform/ydbops) использует CMS для проведения обслуживания кластера без потери доступности. Также CMS можно использовать напрямую через [gRPC API](https://github.com/ydb-platform/ydb/blob/main/ydb/public/api/grpc/draft/ydb_maintenance_v1.proto). -### Вывести узел для обслуживания {#node-maintenance} - -{% note info "Функциональность в разработке" %} - -Функциональность ожидается в ближайших версиях `ydbops`. - -{% endnote %} - -Для выведения узла для обслуживания можно воспользоваться командой: -``` -$ ydbops node maintenance --host -``` -При выполнении этой команды `ydbops` возьмет эксклюзивную блокировку на узел в CMS. - ### Rolling restart {#rolling-restart} Чтобы выполнить rolling restart всего кластера можно воспользоваться командой: @@ -98,3 +84,13 @@ $ ydbops restart --endpoint grpc:// --availability-mode strong Если используемое имя systemd unit отличается от стандартного, его можно переопределить с помощью флага `--systemd-unit`. Утилита `ydbops` автоматически создаст задачу обслуживания на рестарт всего кластера, используя указанный режим доступности. По ходу продвижения `ydbops` будет обновлять задачу обслуживания и получать эксклюзивные блокировки на узлы в CMS, пока все узлы не будут перезапущены. + +### Вывести узел для обслуживания {#node-maintenance} + +{% note info "Функциональность в разработке" %} + +Функциональность ожидается в ближайших версиях `ydbops`. + +{% endnote %} + +Чтобы вывести узел для обслуживания можно воспользоваться утилитой `ydbops`. При выведении узла `ydbops` возьмет эксклюзивную блокировку на этот узел в CMS. 
From 4caed88de55aa4cb8089493c59aa58b6eb26ab4e Mon Sep 17 00:00:00 2001
From: Ilia Shakhov
Date: Wed, 15 May 2024 08:46:33 +0000
Subject: [PATCH 10/10] Fix

---
 ydb/docs/en/core/devops/manual/maintenance-without-downtime.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md b/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md
index 8dadaaf4eb14..893074978975 100644
--- a/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md
+++ b/ydb/docs/en/core/devops/manual/maintenance-without-downtime.md
@@ -93,4 +93,4 @@ Functionality is expected in upcoming versions of the `ydbops`.
 
 {% endnote %}
 
-To take out a node for maintenance, you can use the `ydbops` utility. When taking a node out, `ydbops` will acquire an exclusive lock on this node in CMS.
+To take out a node for maintenance, you can use the `ydbops` utility. When taking a node out, the `ydbops` will acquire an exclusive lock on this node in CMS.