Translated the Troubleshooting section into Russian (#12759)
Co-authored-by: Ivan Blinkov <ivan@ydb.tech>
anton-bobkov and blinkov authored Jan 29, 2025
1 parent 755e82b commit 7102021
Showing 80 changed files with 976 additions and 22 deletions.
@@ -4,7 +4,7 @@

![](../_assets/disk-time-available--disk-cost.png)

This chart shows the estimated total bandwidth capacity of the storage system in conventional units (green) and the total usage cost (blue). When the total usage cost exceeds the total bandwidth capacity, the {{ ydb-short-name }} storage system becomes overloaded, leading to increased latencies.
This chart shows the estimated total bandwidth capacity of the storage system in conventional units (green) and the total usage cost in conventional units (blue). When the total usage cost exceeds the total bandwidth capacity, the {{ ydb-short-name }} storage system becomes overloaded, leading to increased latencies.

1. On the **Total burst duration** chart, check for any load spikes on the storage system. This chart displays microbursts of load on the storage system, measured in microseconds.

@@ -15,4 +15,3 @@
This chart might show microbursts of the load that are not detected by the average usage cost in the **Cost and DiskTimeAvailable relation** chart.

{% endnote %}
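
The overload condition described above (total usage cost exceeding total bandwidth capacity) can be illustrated with a short Python sketch. It is not part of {{ ydb-short-name }} or Grafana: the function name `overloaded_intervals` and the sample values are hypothetical, and the two lists stand in for the green and blue series exported from the chart at the same timestamps.

```python
from typing import List, Tuple


def overloaded_intervals(capacity: List[float], cost: List[float]) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges where the usage cost exceeds the capacity."""
    if len(capacity) != len(cost):
        raise ValueError("both series must have the same number of samples")
    intervals = []
    start = None
    for i, (cap, used) in enumerate(zip(capacity, cost)):
        if used > cap:
            if start is None:
                start = i  # an overloaded stretch begins here
        elif start is not None:
            intervals.append((start, i))  # the stretch ended at the previous sample
            start = None
    if start is not None:
        intervals.append((start, len(cost)))  # still overloaded at the last sample
    return intervals


# Hypothetical samples in conventional units, not real measurements.
disk_time_available = [100, 100, 100, 100, 100, 100]
total_cost = [40, 120, 130, 60, 110, 50]
print(overloaded_intervals(disk_time_available, total_cost))  # [(1, 3), (4, 5)]
```

Averaged samples like these can hide short bursts, which is why the **Total burst duration** chart is checked separately.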

@@ -26,8 +26,10 @@ Additionally, which components within the {{ ydb-short-name }} process consume

Look for lines like these:

[ 2203.393223] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=ydb,pid=1332,uid=1000
[ 2203.393263] Out of memory: Killed process 1332 (ydb) total-vm:14219904kB, anon-rss:1771156kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4736kB oom_score_adj:0
```plaintext
[ 2203.393223] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=ydb,pid=1332,uid=1000
[ 2203.393263] Out of memory: Killed process 1332 (ydb) total-vm:14219904kB, anon-rss:1771156kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:4736kB oom_score_adj:0
```
Additionally, review the `ydbd` logs for relevant details.
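
When many kernel log lines have to be scanned, the `Out of memory: Killed process` line shown above can be picked out programmatically. The following is a minimal Python sketch, not a {{ ydb-short-name }} tool: the name `parse_oom_kill` and the regular expression are illustrative assumptions derived only from the sample line above, and the message format may differ between kernel versions.

```python
import re
from typing import Dict, Optional

# Pattern written against the sample line above; other kernels may format it differently.
KILLED_RE = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r".*?anon-rss:(?P<anon_rss_kb>\d+)kB"
)


def parse_oom_kill(line: str) -> Optional[Dict[str, object]]:
    """Extract pid, process name, and anonymous RSS (kB) from an OOM-killer line."""
    match = KILLED_RE.search(line)
    if match is None:
        return None  # not an OOM-killer "Killed process" line
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "anon_rss_kb": int(match.group("anon_rss_kb")),
    }


sample = (
    "[ 2203.393263] Out of memory: Killed process 1332 (ydb) "
    "total-vm:14219904kB, anon-rss:1771156kB, file-rss:0kB, "
    "shmem-rss:0kB, UID:1000 pgtables:4736kB oom_score_adj:0"
)
print(parse_oom_kill(sample))  # {'pid': 1332, 'name': 'ydb', 'anon_rss_kb': 1771156}
```

The function returns `None` for lines that do not match, so it can be applied to every line of a saved kernel log.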
@@ -53,6 +55,3 @@ Consider the following solutions for addressing insufficient memory:
- If the load on {{ ydb-short-name }} has increased due to new usage patterns or increased query rate, try optimizing the application to reduce the load on {{ ydb-short-name }} or add more {{ ydb-short-name }} nodes.
- If the load on {{ ydb-short-name }} has not changed but nodes are still restarting, consider adding more {{ ydb-short-name }} nodes or raising the hard memory limit for the nodes. For more information about memory management in {{ ydb-short-name }}, see [{#T}](../../../../reference/configuration/index.md#memory-controller).



8 changes: 6 additions & 2 deletions ydb/docs/en/core/dev/troubleshooting/performance/index.md
@@ -57,9 +57,13 @@ These issues refer to situations when the workload demands more physical resourc

- Actor system pools misconfiguration.

### Client application-related issues
### Schema design issues

- **[{#T}](./schemas/overloaded-shards.md)**. Data shards serving row-oriented tables may become overloaded for several reasons. Such overload leads to increased latencies for the transactions processed by the affected data shards.

- **Schema design issues**. Inefficient table and index design decisions can significantly impact query performance.
- **[{#T}](./schemas/splits-merges.md)**. {{ ydb-short-name }} supports automatic splitting and merging of data shards, which allows it to seamlessly adapt to changes in workloads. However, these operations are not free and might have a short-term negative impact on query latencies.

### Client application-related issues

- **Query design issues**. Inefficiently designed database queries may execute slower than expected.

@@ -4,7 +4,7 @@

* Overloaded table partitions with over 15000 queries in their queue.

* The outbound CDC queue exceeds the limit of 10000 elements or 125 MB.
* The outbound [CDC](../../../../concepts/glossary.md#cdc) queue exceeds the limit of 10000 elements or 125 MB.

* Table partitions in states other than normal, for example partitions in the process of splitting or merging.
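
The conditions in the list above can be expressed as a simple threshold check. The Python sketch below is only an illustration: the function `overload_reasons` and its parameters are hypothetical (they do not correspond to a {{ ydb-short-name }} API), and reading "125 MB" as 125 × 1024 × 1024 bytes is an assumption.

```python
from typing import List

# Limits quoted in the list above.
MAX_QUEUED_QUERIES = 15000
MAX_CDC_QUEUE_ELEMENTS = 10000
# Assumption: "125 MB" is interpreted as 125 * 1024 * 1024 bytes.
MAX_CDC_QUEUE_BYTES = 125 * 1024 * 1024


def overload_reasons(
    queued_queries: int,
    cdc_queue_elements: int,
    cdc_queue_bytes: int,
    partition_state: str = "normal",
) -> List[str]:
    """Return which of the listed conditions the given (hypothetical) readings meet."""
    reasons = []
    if queued_queries > MAX_QUEUED_QUERIES:
        reasons.append("partition queue holds over 15000 queries")
    if cdc_queue_elements > MAX_CDC_QUEUE_ELEMENTS or cdc_queue_bytes > MAX_CDC_QUEUE_BYTES:
        reasons.append("outbound CDC queue exceeds 10000 elements or 125 MB")
    if partition_state != "normal":
        reasons.append("partition is in state '%s'" % partition_state)
    return reasons


print(overload_reasons(16000, 500, 1024))
print(overload_reasons(10, 10, 10, partition_state="splitting"))
```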

@@ -22,7 +22,7 @@ Consider the following recommendations:

If possible, avoid [interactive transactions](../../../../concepts/glossary.md#interactive-transaction). A better approach is to use a single YQL query with `begin;` and `commit;` to select data, update data, and commit the transaction.

If you do need interactive transactions, append `commit;` to the last query in the transaction.
If you do need interactive transactions, perform `commit` in the last query in the transaction.

- Analyze the range of primary keys where conflicting modifications occur, and try to change the application logic to reduce the number of conflicts.


This file was deleted.

@@ -7,7 +7,7 @@ items:
  include:
    mode: link
    path: hardware/toc_p.yaml
- name: OS
- name: Operating system
  include:
    mode: link
    path: system/toc_p.yaml
@@ -30,7 +30,7 @@ Autobalancing occurs in the following cases:

- **Uneven distribution of database objects**

{{ ydb-short-name }} uses the **ObjectImbalance** metric to monitor the distribution of tablets utilizing the **[counter](*counter)** resource across {{ ydb-short-name }} nodes. When {{ ydb-short-name }} nodes restart, these tablets may not distribute evenly, prompting Hive to initiate the autobalancing procedure.
{{ ydb-short-name }} uses the **ObjectImbalance** metric to monitor the distribution of tablets utilizing the **[count](*count)** resource across {{ ydb-short-name }} nodes. When {{ ydb-short-name }} nodes restart, these tablets may not distribute evenly, prompting Hive to initiate the autobalancing procedure.


## Diagnostics
@@ -84,4 +84,3 @@ Adjust Hive balancer settings:


[*count]: Count is a virtual resource for distributing tablets of the same type evenly between nodes.

@@ -1,6 +1,6 @@
# Rolling restart

{{ ydb-short-name }} clusters can be updated without downtime, which is possible because {{ ydb-short-name }} normally has redundant components and supports a rolling restart procedure. To ensure continuous data availability, {{ ydb-short-name }} includes the Cluster Management System (CMS), which tracks all outages and nodes taken offline for maintenance, such as restarts. CMS halts new maintenance requests if they might risk data availability.
{{ ydb-short-name }} clusters can be updated without downtime, which is possible because {{ ydb-short-name }} normally has redundant components and supports a rolling restart procedure. To ensure continuous data availability, {{ ydb-short-name }} includes the [Cluster Management System (CMS)](../../../../concepts/glossary.md#cms), which tracks all outages and nodes taken offline for maintenance, such as restarts. CMS halts new maintenance requests if they might risk data availability.

However, even if data is always available, the restart of all nodes in a relatively short period of time might have a noticeable impact on overall performance. Each [tablet](../../../../concepts/glossary.md#tablet) running on a restarted node is relaunched on a different node. Moving a tablet between nodes takes time and may affect latencies of queries involving it. See recommendations [for rolling restart](#rolling-restart).

5 changes: 5 additions & 0 deletions ydb/docs/ru/core/dev/toc_p.yaml
@@ -18,6 +18,11 @@ items:
    path: primary-key/toc_p.yaml
- name: Secondary indexes
  href: secondary-indexes.md
- name: Troubleshooting
  href: troubleshooting/index.md
  include:
    mode: link
    path: troubleshooting/toc_p.yaml
- name: Query plan optimization
  href: query-plans-optimization.md
- name: Batch upload
5 changes: 5 additions & 0 deletions ydb/docs/ru/core/dev/troubleshooting/index.md
@@ -0,0 +1,5 @@
# Troubleshooting

This section describes how to diagnose problems that may occur with {{ ydb-short-name }} databases and the applications that work with them.

- [{#T}](performance/index.md)
@@ -0,0 +1,58 @@
1. Use the **Diagnostics** tab in the [Embedded UI](../../../../../reference/embedded-ui/index.md) to analyze CPU utilization in all resource pools:

    1. Open the [Embedded UI](../../../../../reference/embedded-ui/index.md), go to the **Databases** tab, and click the required database.

    1. On the **Navigation** tab, make sure the required database is selected.

    1. Open the **Diagnostics** tab.

    1. On the **Info** tab, click the **CPU** button and check the CPU utilization levels in all resource pools.

        ![](../_assets/embedded-ui-cpu-system-pool.png)

1. Analyze CPU utilization in all resource pools using Grafana charts:

    1. Open the **[CPU](../../../../../reference/observability/metrics/grafana-dashboards.md#cpu)** dashboard in Grafana.

    1. Check for spikes on the following charts:

        - **CPU by execution pool**

            ![](../_assets/cpu-by-pool.png)

        - **User pool - CPU by host**

            ![](../_assets/cpu-user-pool.png)

        - **System pool - CPU by host**

            ![](../_assets/cpu-system-pool.png)

        - **Batch pool - CPU by host**

            ![](../_assets/cpu-batch-pool.png)

        - **IC pool - CPU by host**

            ![](../_assets/cpu-ic-pool.png)

        - **IO pool - CPU by host**

            ![](../_assets/cpu-io-pool.png)

1. If the CPU usage spikes occur in the user pool, analyze changes in the user load that might have caused the CPU shortage. Check the following charts on the **DB overview** dashboard in Grafana:

    - **Requests**

        ![](../_assets/requests.png)

    - **Request size**

        ![](../_assets/request-size.png)

    - **Response size**

        ![](../_assets/response-size.png)

    Also check all charts in the **Operations** section of the **DataShard** dashboard.

1. If the CPU usage spikes occur in the batch pool, check whether any backups are running.
@@ -0,0 +1,17 @@
1. Open the **[Distributed Storage Overview](../../../../../reference/observability/metrics/grafana-dashboards.md)** dashboard in Grafana.

1. On the **DiskTimeAvailable and total Cost relation** chart, check whether the **Total Cost** spikes cross the **DiskTimeAvailable** level.

    ![](../_assets/disk-time-available--disk-cost.png)

    This chart shows the estimated total bandwidth capacity of the storage system in conventional units (green) and the total usage cost in conventional units (blue). When the total usage cost exceeds the total bandwidth capacity, the {{ ydb-short-name }} storage system becomes overloaded, leading to increased latencies.

1. On the **Total burst duration** chart, check for any load spikes on the storage system. This chart displays microbursts of load on the storage system, measured in microseconds.

    ![](../_assets/microbursts.png)

    {% note info %}

    This chart might show microbursts of the load that are not detected by the average usage cost in the **Cost and DiskTimeAvailable relation** chart.

    {% endnote %}
@@ -0,0 +1,14 @@
# CPU bottleneck

High CPU usage can lead to slow query processing and increased latencies. When CPU resources are constrained, the database may struggle to handle complex queries or high-load transactional workloads.

{{ ydb-short-name }} nodes primarily consume CPU resources for running [actors](../../../../concepts/glossary.md#actor). On each node, actors are executed using one of the [actor system pools](../../../../concepts/glossary.md#actor-system-pools). Resource consumption is measured separately for each pool, which makes it possible to track changes in consumption more precisely.

## Diagnostics

<!-- The include is added to allow partial overrides in overlays -->
{% include notitle [#](_includes/cpu-bottleneck.md) %}

## Recommendations

Add more [database nodes](../../../../concepts/glossary.md#database-node) to the cluster or allocate more CPU cores to the existing nodes. If that is not possible, consider redistributing CPU cores between the resource pools.
@@ -0,0 +1,29 @@
# Insufficient disk space

A lack of disk space can make it impossible to store new data, at which point the database switches to read-only mode. It can also slow the system down as it tries to reclaim disk space by compacting data more aggressively in the background.

## Diagnostics

1. Check for spikes on the charts of the **[DB overview > Storage](../../../../reference/observability/metrics/grafana-dashboards.md#dboverview)** dashboard in Grafana.

1. In the [Embedded UI](../../../../reference/embedded-ui/index.md), on the **Storage** tab, analyze the list of available storage groups and their disk space usage.

    {% note tip %}

    Use the **Out of Space** filter to list only the storage groups with full disks.

    {% endnote %}

    ![](_assets/storage-groups-disk-space.png)

{% note info %}

You can also use the [Healthcheck API](../../../../reference/ydb-sdk/health-check-api.md) to get this information.

{% endnote %}

## Recommendations

Add more [storage groups](../../../../concepts/glossary.md#storage-group) to the database.

If the cluster has no spare storage groups, configure them first. If necessary, add more [storage nodes](../../../../concepts/glossary.md#storage-node).