Add English version of CMS docs #4272

Merged · 10 commits · May 15, 2024

Changes from 7 commits
2 changes: 2 additions & 0 deletions ydb/docs/en/core/devops/manual/toc_p.yaml
@@ -23,4 +23,6 @@ items:
href: ../../maintenance/manual/cms.md
- name: System views
href: system-views.md
- name: Maintenance without downtime
href: ../../maintenance/manual/maintenance-without-downtime.md

@@ -0,0 +1,98 @@
# Maintenance without downtime

Periodically, the {{ ydb-short-name }} cluster needs maintenance, such as upgrading its version or replacing failed disks. Maintenance can cause the cluster or its databases to become unavailable due to:
- Exceeding the failure model of the affected [storage groups](../../concepts/databases.md#storage-groups).
- Exceeding the [State Storage](../../deploy/configuration/config.md#domains-state) failure model.
- Lack of computational resources due to stopping too many [dynamic nodes](../../concepts/cluster/common_scheme_ydb.md#nodes).

To avoid such situations, {{ ydb-short-name }} has a system [tablet](../../concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster: the *Cluster Management System (CMS)*. The CMS answers the question of whether a {{ ydb-short-name }} node, or a host running {{ ydb-short-name }} nodes, can be safely taken out for maintenance. To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it the exclusive locks to acquire on the nodes or hosts involved in the maintenance. The cluster components on which the locks are acquired are considered unavailable from the CMS perspective and can be safely maintained. The CMS will [check](#checking-algorithm) the current state of the cluster and acquire the locks only if the maintenance complies with the [availability mode](#availability-mode) and the [unavailable node limits](#unavailable-node-limits).

{% note warning "Faults during maintenance" %}

During maintenance activities whose safety is guaranteed by the CMS, faults unrelated to those activities may still occur in the cluster. If such faults threaten the availability of the cluster, completing the maintenance urgently can help mitigate the risk of losing availability.

{% endnote %}

## Maintenance task {#maintenance-task}

A *maintenance task* is a set of *actions* that the user asks the CMS to perform for safe maintenance.

Supported actions:
- Acquiring an exclusive lock on a cluster component (a node or a host).

In a task, actions are divided into groups. Actions from the same group are performed atomically. Currently, a group can contain only one action.

If it's not possible to perform an action at the time of the request, the CMS reports the reason and the time at which it is worth *refreshing* the task, and sets the action status to *pending*. When the task is refreshed, the CMS attempts to perform the pending actions again.

*Performed* actions have a deadline after which they are considered *completed* and stop affecting the cluster; for example, an exclusive lock is released. An action can also be completed early.

{% note info "Protracted maintenance" %}

If maintenance continues after the actions that were performed to make it safe have been completed, this is considered a fault in the cluster.

{% endnote %}

Completed actions are automatically removed from the task.

### Availability mode {#availability-mode}

In a maintenance task, you need to specify the cluster availability mode that must be complied with when checking whether actions can be performed. The following modes are supported:
- **Strong** - a mode that minimizes the risk of availability loss.
  - No more than one unavailable [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) is allowed in each affected storage group.
  - No more than one unavailable State Storage ring is allowed.
- **Weak** - a mode that does not allow exceeding the failure model.
  - No more than two unavailable VDisks are allowed for affected storage groups with the [block-4-2](../../deploy/configuration/config.md#reliability) scheme.
  - No more than four unavailable VDisks, three of which must be in the same data center, are allowed for affected storage groups with the [mirror-3-dc](../../deploy/configuration/config.md#reliability) scheme.
  - No more than `(nto_select - 1) / 2` unavailable State Storage rings are allowed (see the worked example after this list).
- **Force** - forced mode; the failure model is ignored. Not recommended for use.

### Priority {#priority}

You can specify the priority of a maintenance task. A lower value means a higher priority.

The actions of a task cannot be performed until all conflicting actions from higher-priority tasks are completed. Tasks with the same priority have no advantage over each other.

## Unavailable node limits {#unavailable-node-limits}

In the CMS configuration, you can set limits on the number of unavailable nodes, either per database (tenant) or for the cluster as a whole. Both relative and absolute limits are supported.

By default, no more than 10% of nodes may be unavailable in each database and in the cluster as a whole.

## Checking algorithm {#checking-algorithm}

To check whether the actions of a maintenance task can be performed, the CMS sequentially goes through each action group in the task and checks the action from the group:
- If the object of the action is a host, the CMS checks whether the action can be performed with all nodes running on the host.
- If the object of the action is a node, the CMS checks:
  - Whether there is already a lock on the node.
  - Whether it's possible to lock the node according to the unavailable node limits.
  - Whether it's possible to lock all VDisks of the node according to the availability mode.
  - Whether it's possible to lock the State Storage ring of the node according to the availability mode.
  - Whether it's possible to lock the node according to the limit of unavailable nodes on which cluster system tablets can run.

If the checks are successful, the action can be performed, and temporary locks are acquired on the checked nodes. The CMS then considers the next group of actions. Temporary locks help to determine whether the actions requested in different groups conflict with each other. Once the check is complete, the temporary locks are released.

## Examples {#examples}

The [ydbops](https://github.com/ydb-platform/ydbops) utility uses the CMS for cluster maintenance without downtime. You can also use the CMS directly through the [gRPC API](https://github.com/ydb-platform/ydb/blob/main/ydb/public/api/grpc/draft/ydb_maintenance_v1.proto).

### Take out a node for maintenance {#node-maintenance}

{% note info "Functionality in development" %}

This functionality is expected in upcoming versions of ydbops.

{% endnote %}

To take out a node for maintenance, you can use the command:
```
$ ydbops node maintenance --host <node_fqdn>
```
When executing this command, ydbops acquires an exclusive lock on the node in the CMS.
Member: This paragraph sounds weird: typically, multiple YDB nodes are on a single host, and such a command probably needs to acquire locks for all of them. Or maybe this implies only static/storage nodes, but that would need to be explicitly clarified, too.

Member Author: Changed the parameter to `<node_id>` to make it clearer. I just took this command from ydb-platform/ydbops#2. It is under active development; when the final variant of the command is chosen, the text in this article will be changed.

Member: @pixcc I don't think this change fixes the problem: in this article, we shouldn't use "node" and "host" interchangeably. This command sounds more like maintenance of all (or some) nodes on a given host, not of a single node. Meanwhile, the surrounding text sounds like it is about a single node.

Member Author: I removed the command to eliminate ambiguity. It's not 100% clear whether the final version of the command will take out a host for maintenance (with all its nodes) or individual nodes, so I've settled on the simple node case for now.


### Rolling restart {#rolling-restart}

To perform a rolling restart of the entire cluster, you can use the following command:
```
$ ydbops restart --endpoint grpc://<cluster-fqdn> --availability-mode strong
```
Member: Doesn't this have some additional requirements? For instance, the last time I tried (a while ago), it couldn't restart a cluster deployed with Ansible because it relied on some different systemd unit naming.

Collaborator: Things have changed a bit. By default the ydb-server-storage.service systemd name will be used, but it is possible to specify a different systemd unit name with the --systemd-unit flag (here is relevant --help output which contains this flag: https://pastebin.com/Jqcx31Jn).

Just for context: the fact that we have two different default unit names for deploying with Ansible and for everything else (cloud environments etc.) is horrible, and we just have to live with it for a while.

It is probably a good idea to include a mention of the --systemd-unit flag in the docs, but not in too much detail, maybe one sentence only. E.g. "If your systemd unit name is different from the default one, you may need to override it with the --systemd-unit flag."

Member Author: Added additional requirements.
The ydbops utility automatically creates a maintenance task to restart the entire cluster in accordance with the given availability mode. As it progresses, ydbops refreshes the maintenance task and acquires exclusive locks on the nodes in the CMS until all nodes have been restarted.
2 changes: 2 additions & 0 deletions ydb/docs/en/core/maintenance/toc_i.yaml
@@ -11,5 +11,7 @@ items:
include: { mode: link, path: manual/toc_p.yaml }
- name: Changing an actor system's configuration
href: manual/change_actorsystem_configs.md
- name: Maintenance without downtime
href: manual/maintenance-without-downtime.md
- name: Updating configurations via CMS
href: manual/cms.md
2 changes: 2 additions & 0 deletions ydb/docs/ru/core/devops/manual/toc_p.yaml
@@ -35,3 +35,5 @@ items:
href: ../../maintenance/manual/cms.md
- name: System views
href: system-views.md
- name: Maintenance without downtime
href: ../../maintenance/manual/maintenance-without-downtime.md
@@ -1,11 +1,11 @@
# Cluster maintenance without downtime

Periodically, the {{ ydb-short-name }} cluster needs maintenance, such as upgrading its version or replacing failed disks. Maintenance can cause the cluster or its databases to become unavailable due to:
-- Exceeding the failure model of the affected [storage groups](../concepts/databases.md#storage-groups).
-- Exceeding the [State Storage](../deploy/configuration/config.md#domains-state) failure model.
-- Lack of computational resources due to stopping too many [dynamic nodes](../concepts/cluster/common_scheme_ydb.md#nodes).
+- Exceeding the failure model of the affected [storage groups](../../concepts/databases.md#storage-groups).
+- Exceeding the [State Storage](../../deploy/configuration/config.md#domains-state) failure model.
+- Lack of computational resources due to stopping too many [dynamic nodes](../../concepts/cluster/common_scheme_ydb.md#nodes).

-To avoid such situations, {{ ydb-short-name }} has a system [tablet](../concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster: the *Cluster Management System (CMS)*. The CMS answers the question of whether a {{ ydb-short-name }} node, or a host running {{ ydb-short-name }} nodes, can be safely taken out for maintenance. To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it the exclusive locks to acquire on the nodes or hosts involved in the maintenance. The cluster components on which the locks are acquired are considered unavailable from the CMS perspective and can be safely maintained. The CMS will [check](#check-task-actions-algorithm) the current state of the cluster and acquire the locks only if the maintenance complies with the [availability mode](#availability-mode) and the [unavailable node limits](#unavailable-node-limits).
+To avoid such situations, {{ ydb-short-name }} has a system [tablet](../../concepts/cluster/common_scheme_ydb.md#tablets) that monitors the state of the cluster: the *Cluster Management System (CMS)*. The CMS answers the question of whether a {{ ydb-short-name }} node, or a host running {{ ydb-short-name }} nodes, can be safely taken out for maintenance. To do this, create a [maintenance task](#maintenance-task) in the CMS and specify in it the exclusive locks to acquire on the nodes or hosts involved in the maintenance. The cluster components on which the locks are acquired are considered unavailable from the CMS perspective and can be safely maintained. The CMS will [check](#checking-algorithm) the current state of the cluster and acquire the locks only if the maintenance complies with the [availability mode](#availability-mode) and the [unavailable node limits](#unavailable-node-limits).

{% note warning "Faults during maintenance" %}

@@ -38,11 +38,11 @@

In a maintenance task, you need to specify the cluster availability mode that must be complied with when checking whether actions can be performed. The following modes are supported:
- **Strong** - a mode that minimizes the risk of availability loss.
-  - No more than one unavailable [VDisk](../concepts/cluster/distributed_storage.md#storage-groups) is allowed in each affected storage group.
+  - No more than one unavailable [VDisk](../../concepts/cluster/distributed_storage.md#storage-groups) is allowed in each affected storage group.
  - No more than one unavailable State Storage ring is allowed.
- **Weak** - a mode that does not allow exceeding the failure model.
-  - No more than two unavailable VDisks are allowed for affected storage groups with the [block-4-2](../administration/production-storage-config.md#reliability) scheme.
-  - No more than four unavailable VDisks, three of which must be in the same data center, are allowed for affected storage groups with the [mirror-3-dc](../administration/production-storage-config.md#reliability) scheme.
+  - No more than two unavailable VDisks are allowed for affected storage groups with the [block-4-2](../../deploy/configuration/config.md#reliability) scheme.
+  - No more than four unavailable VDisks, three of which must be in the same data center, are allowed for affected storage groups with the [mirror-3-dc](../../deploy/configuration/config.md#reliability) scheme.
  - No more than `(nto_select - 1) / 2` unavailable State Storage rings are allowed.
- **Force** - forced mode; the failure model is ignored. Not recommended for use.

@@ -58,7 +58,7 @@

By default, no more than 10% of nodes may be unavailable in each database and in the cluster as a whole.

-## Task action checking algorithm {#check-task-actions-algorithm}
+## Checking algorithm {#checking-algorithm}

To check whether the actions of a maintenance task can be performed, the CMS sequentially goes through each action group in the task and checks the action from the group:
- If the object of the action is a host, the CMS checks whether the action can be performed with all nodes running on the host.
2 changes: 2 additions & 0 deletions ydb/docs/ru/core/maintenance/manual/node_restarting.md
@@ -1,3 +1,5 @@
# Safe restart and shutdown of nodes

## Stopping/restarting the ydb process on a node {#restart_process}

To make sure that the process can be stopped, perform the following steps.
2 changes: 1 addition & 1 deletion ydb/docs/ru/core/maintenance/toc_i.yaml
@@ -12,7 +12,7 @@ items:
- name: Changing an actor system's configuration
href: manual/change_actorsystem_configs.md
- name: Cluster maintenance without downtime
-href: maintenance-without-outages.md
+href: manual/maintenance-without-downtime.md
- name: Managing cluster configuration
items:
- name: Configuration overview