
Commit

creating client snippet
StekPerepolnen committed Jun 25, 2024
1 parent 783826f commit 0a64c0e
Showing 2 changed files with 63 additions and 30 deletions.
55 changes: 36 additions & 19 deletions ydb/docs/en/core/reference/ydb-sdk/health-check-api.md
@@ -7,10 +7,26 @@ description: "The article will tell you how to initiate the check using the Heal

{{ ydb-short-name }} has a built-in self-diagnostic system, which can be used to get a brief report on the database status and information about existing problems.

To initiate the check, call the `SelfCheck` method of the `Ydb.Monitoring` service in the SDK. You must also pass the name of the database being checked, as usual.

{% list tabs %}
- C++
Example application code for creating a client:
```cpp
auto client = NYdb::NMonitoring::TMonitoringClient(driver);
```
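
Here `driver` is an initialized `NYdb::TDriver` instance. A minimal sketch of constructing one (the endpoint and database path below are placeholders, not values from this commit):

```cpp
auto driverConfig = NYdb::TDriverConfig()
    .SetEndpoint("grpc://localhost:2136") // placeholder endpoint
    .SetDatabase("/local");               // placeholder database path
auto driver = NYdb::TDriver(driverConfig);
```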
Calling the `SelfCheck` method:
```cpp
auto settings = TSelfCheckSettings();
settings.ReturnVerboseStatus(true);
auto result = client.SelfCheck(settings).GetValueSync();
```
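
Once the future resolves, the result can be checked like any other SDK operation status. A minimal sketch using only the common status accessors (assumed to be available on the self-check result, as on other SDK results):

```cpp
if (!result.IsSuccess()) {
    // Request-level failure (transport or operation error), not a health verdict.
    Cerr << "SelfCheck request failed: " << result.GetIssues().ToString() << Endl;
}
```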
{% endlist %}

## Response Structure {#response-structure}
For the full response structure, see the [ydb_monitoring.proto](https://github.com/ydb-platform/ydb/public/api/protos/ydb_monitoring.proto) file in the {{ ydb-short-name }} Git repository.
Calling the `SelfCheck` method will return the following message:

```protobuf
message SelfCheckResult {
  ...
}
```

@@ -44,6 +60,8 @@

Each issue has a nesting `level`: the higher the `level`, the deeper the issue is.
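
For illustration, a parent issue references its children through `reason`, and deeper issues carry higher `level` values. A schematic example (IDs and messages invented for illustration):

```protobuf
self_check_result: DEGRADED
issue_log {
  id: "YELLOW-70fb"
  status: YELLOW
  message: "Database has storage issues"
  reason: "YELLOW-ef3e"
  type: "DATABASE"
  level: 1
}
issue_log {
  id: "YELLOW-ef3e"
  status: YELLOW
  message: "Storage degraded"
  type: "STORAGE"
  level: 2
}
```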
| `self_check_result` | enum field which contains the DB check result:<ul><li>`GOOD`: No problems were detected.</li><li>`DEGRADED`: Degradation of one of the database systems was detected, but the database is still functioning (for example, allowable disk loss).</li><li>`MAINTENANCE_REQUIRED`: Significant degradation was detected, there is a risk of availability loss, and human maintenance is required.</li><li>`EMERGENCY`: A serious problem was detected in the database, with complete or partial loss of availability.</li></ul> |
| `issue_log` | This is a set of elements, each of which describes a problem in the system at a certain level. |
| `issue_log.id` | A unique problem ID within this response. |
| `issue_log.status` | Status (severity) of the current problem. <br/>It can take one of the following values:<ul><li>`RED`: A component is faulty or unavailable.</li><li>`ORANGE`: A serious problem, we are one step away from losing availability. Maintenance may be required.</li><li>`YELLOW`: A minor problem, no risks to availability. We recommend you continue monitoring the problem.</li><li>`BLUE`: Temporary minor degradation that does not affect database availability. The system is expected to switch to `GREEN`.</li><li>`GREEN`: No problems were detected.</li><li>`GREY`: Failed to determine the status (a problem with the self-diagnostic mechanism).</li></ul> |
| `issue_log.message` | Text that describes the problem. |
| `issue_log.location` | Location of the problem. This can be a physical location or an execution context. |
@@ -59,13 +77,13 @@

The full list of extra parameters is presented below:

{% list tabs %}
- C++
```cpp
struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings> {
    FLUENT_SETTING_OPTIONAL(bool, ReturnVerboseStatus);
    FLUENT_SETTING_OPTIONAL(EStatusFlag, MinimumStatus);
    FLUENT_SETTING_OPTIONAL(ui32, MaximumLevel);
};
```
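
For instance, a check that returns a verbose report but only includes issues of severity `YELLOW` or worse, limited to three nesting levels, might look like this (a sketch; the exact `EStatusFlag` enumerator spelling is an assumption, not confirmed API):

```cpp
auto settings = TSelfCheckSettings();
settings.ReturnVerboseStatus(true);
settings.MinimumStatus(NYdb::NMonitoring::EStatusFlag::YELLOW); // assumed enumerator name
settings.MaximumLevel(3); // drop issues nested deeper than level 3
auto result = client.SelfCheck(settings).GetValueSync();
```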
{% endlist %}
| Parameter | Type | Description |
@@ -79,20 +97,20 @@
| Message | Description |
|:----|:----|
| **DATABASE** ||
| `Database has multiple issues`</br>`Database has compute issues`</br>`Database has storage issues` | These issues depend solely on the underlying `COMPUTE` and `STORAGE` layers. This is the most general status of a database. |
| **STORAGE** ||
| `There are no storage pools` | Storage pools aren't configured. |
| `Storage degraded`</br>`Storage has no redundancy`</br>`Storage failed` | These issues depend solely on the underlying `STORAGE_POOLS` layer. |
| `System tablet BSC didn't provide information` | Storage diagnostics will be generated by an alternative mechanism. |
| `Storage usage over 75%` <br>`Storage usage over 85%` <br>`Storage usage over 90%` | Some data needs to be removed, or the database needs to be reconfigured with additional disk space. |
| **STORAGE_POOL** ||
| `Pool degraded` <br>`Pool has no redundancy` <br>`Pool failed` | These issues depend solely on the underlying `STORAGE_GROUP` layer. |
| **STORAGE_GROUP** ||
| `Group has no vslots` | This case is not expected; it indicates an internal problem. |
| `Group degraded` | A number of disks within the group's failure-tolerance limit are unavailable. |
| `Group has no redundancy` | A storage group lost its redundancy. Another VDisk failure may lead to the loss of the group. |
| `Group failed` | A storage group lost its integrity; data is not available. |
||`HealthCheck` checks various parameters (fault tolerance mode, number of failed disks, disk status, etc.) and, depending on them, sets the appropriate status and displays a message. |
| **VDISK** ||
| `System tablet BSC didn't provide known status` | This case is not expected; it indicates an internal problem. |
| `VDisk is not available` | The disk is not operational at all. |
@@ -101,7 +119,7 @@
| `VDisk have space issue` | These issues depend solely on the underlying `PDISK` layer. |
| **PDISK** ||
| `Unknown PDisk state` | `HealthCheck` cannot parse the PDisk state; this is an internal problem. |
| `PDisk state is ...` | Indicates the state of a physical disk. |
| `Available size is less than 12%` <br>`Available size is less than 9%` <br>`Available size is less than 6%` | Free space on the physical disk is running out. |
| `PDisk is not available` | A physical disk is not available. |
| **STORAGE_NODE** ||
@@ -114,22 +132,21 @@
| `Compute quota usage` | These issues depend solely on the underlying `COMPUTE_QUOTA` layer. |
| `Compute has issues with tablets`| These issues depend solely on the underlying `TABLET` layer. |
| **COMPUTE_QUOTA** ||
| `Paths quota usage is over than 90%` <br>`Paths quota usage is over than 99%` <br>`Paths quota exhausted` </br>`Shards quota usage is over than 90%` </br>`Shards quota usage is over than 99%` </br>`Shards quota exhausted` | Quotas exhausted. |
| **SYSTEM_TABLET** ||
| `System tablet is unresponsive` <br>`System tablet response time over 1000ms` <br>`System tablet response time over 5000ms` | The system tablet is not responding or it takes too long to respond. |
| **TABLET** ||
| `Tablets are restarting too often` | Tablets are restarting too often. |
| `Tablets/Followers are dead` | Tablets are not running (probably cannot be started). |
| **LOAD_AVERAGE** ||
| `LoadAverage above 100%` | A physical host is overloaded ([load](https://en.wikipedia.org/wiki/Load_(computing)) exceeds the number of logical cores). </br> This indicates that the system is working at or beyond its capacity, potentially due to a high number of processes waiting for I/O operations. </br></br> Load information source: </br>`/proc/loadavg` </br></br> Logical core count, primary source: </br>`/sys/fs/cgroup/cpu.max` </br></br> Fallback sources: </br>`/sys/fs/cgroup/cpu/cpu.cfs_quota_us` </br>`/sys/fs/cgroup/cpu/cpu.cfs_period_us` </br> The number of cores is calculated by dividing the quota by the period (quota / period); see the sketch after this table. |
| **COMPUTE_POOL** ||
| `Pool usage is over than 90%` <br>`Pool usage is over than 95%` <br>`Pool usage is over than 99%` | One of the pools' CPUs is overloaded. |
| **NODE_UPTIME** ||
| `The number of node restarts has increased` | The number of node restarts has exceeded the threshold (by default, 10 restarts per hour). |
| `Node is restarting too often` | The number of node restarts has exceeded the threshold (by default, 30 restarts per hour). |
| **NODES_TIME_DIFFERENCE** ||
| `The nodes have a time difference of ... ms` | Time drift between nodes might lead to issues with coordinating distributed transactions. This issue appears when the difference reaches 5 ms. |
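
The `LOAD_AVERAGE` check described above boils down to comparing the 1-minute load average with the logical core count derived from cgroup limits. A self-contained sketch of that comparison (an approximation of the described behavior, covering only the cgroup v2 source listed in the table):

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    // First field of /proc/loadavg: 1-minute load average.
    double load1 = 0.0;
    std::ifstream loadavg("/proc/loadavg");
    loadavg >> load1;

    // cgroup v2: "cpu.max" holds "<quota> <period>" (or "max <period>" when unlimited).
    double cores = 0.0;
    std::ifstream cpuMax("/sys/fs/cgroup/cpu.max");
    std::string quota, period;
    if (cpuMax >> quota >> period && quota != "max") {
        cores = std::stod(quota) / std::stod(period);
    }

    // The check fires when load exceeds the number of cores (load > cores).
    if (cores > 0.0 && load1 > cores) {
        std::cout << "LoadAverage above 100%: " << (load1 / cores) * 100.0 << "%\n";
    }
    return 0;
}
```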
## Example {#examples}
38 changes: 27 additions & 11 deletions ydb/docs/ru/core/reference/ydb-sdk/health-check-api.md
@@ -7,8 +7,25 @@ description: "Из статьи вы узнаете, как инициирова

{{ ydb-short-name }} has a built-in self-diagnostic system, which can be used to get a brief report on the database status and information about existing problems.

To initiate the check, call the `SelfCheck` method of the `Ydb.Monitoring` service in the SDK. You must also pass the name of the database being checked, as usual.

{% list tabs %}
- C++
Example application code for creating a client:
```cpp
auto client = NYdb::NMonitoring::TMonitoringClient(driver);
```

Calling the `SelfCheck` method:
```cpp
auto settings = TSelfCheckSettings();
settings.ReturnVerboseStatus(true);
auto result = client.SelfCheck(settings).GetValueSync();
```
{% endlist %}

## Response Structure {#response-structure}
For the full response structure, see the [ydb_monitoring.proto](https://github.com/ydb-platform/ydb/public/api/protos/ydb_monitoring.proto) file in the {{ ydb-short-name }} Git repository.
Calling the `SelfCheck` method will return the following message:

```protobuf
  ...
```

@@ -59,12 +76,12 @@
The full list of extra parameters is presented below:
{% list tabs %}
- C++
```cpp
struct TSelfCheckSettings : public TOperationRequestSettings<TSelfCheckSettings> {
    FLUENT_SETTING_OPTIONAL(bool, ReturnVerboseStatus);
    FLUENT_SETTING_OPTIONAL(EStatusFlag, MinimumStatus);
    FLUENT_SETTING_OPTIONAL(ui32, MaximumLevel);
};
```
{% endlist %}
@@ -101,7 +118,7 @@
| `VDisk have space issue` | These issues depend on the underlying `PDISK` layer. |
| **PDISK** ||
| `Unknown PDisk state` | `HealthCheck` cannot parse the PDisk state; this is an internal error. |
| `PDisk state is ...` | Indicates the state of a physical disk. |
| `Available size is less than 12%` <br>`Available size is less than 9%` <br>`Available size is less than 6%` | Free space on the physical disk is running out. |
| `PDisk is not available` | A physical disk is not available. |
| **STORAGE_NODE** ||
@@ -115,21 +132,20 @@
| `Compute has issues with tablets` | These issues depend on the underlying `TABLET` layer. |
| **COMPUTE_QUOTA** ||
| `Paths quota usage is over than 90%` <br>`Paths quota usage is over than 99%` <br>`Paths quota exhausted` </br>`Shards quota usage is over than 90%` <br>`Shards quota usage is over than 99%` <br>`Shards quota exhausted` | Quotas exhausted. |
| **SYSTEM_TABLET** ||
| `System tablet is unresponsive` <br>`System tablet response time over 1000ms` <br>`System tablet response time over 5000ms` | The system tablet is not responding or takes too long to respond. |
| **TABLET** ||
| `Tablets are restarting too often` | Tablets are restarting too often. |
| `Tablets are dead` <br>`Followers are dead` | Tablets are not running (or cannot be started). |
| **LOAD_AVERAGE** ||
| `LoadAverage above 100%` | A physical host is overloaded ([load](https://en.wikipedia.org/wiki/Load_(computing)) exceeds the number of logical cores). </br> This indicates that the system is working at or beyond its capacity, most likely due to a high number of processes waiting for I/O operations. </br></br> Load information source: </br>`/proc/loadavg` </br></br> Logical core count, primary source: </br>`/sys/fs/cgroup/cpu.max` </br></br> Fallback sources: </br>`/sys/fs/cgroup/cpu/cpu.cfs_quota_us` </br>`/sys/fs/cgroup/cpu/cpu.cfs_period_us` </br> The number of cores is calculated by dividing the quota by the period (quota / period). |
| **COMPUTE_POOL** ||
| `Pool usage is over than 90%` <br>`Pool usage is over than 95%` <br>`Pool usage is over than 99%` | One of the CPU pools is overloaded. |
| **NODE_UPTIME** ||
| `The number of node restarts has increased` | The number of node restarts has exceeded the threshold (by default, 10 restarts per hour). |
| `Node is restarting too often` | The number of node restarts has exceeded the threshold (by default, 30 restarts per hour). |
| **NODES_TIME_DIFFERENCE** ||
| `The nodes have a time difference of ... ms` | Time drift between nodes might lead to issues with coordinating distributed transactions. This issue appears when the difference reaches 5 ms. |
## Example Response {#examples}
