Merge pull request #238 from grafana/add-bucket-index-observability
Add bucket index observability
pracucci authored Jan 5, 2021
2 parents 6bb6fe1 + 05094da commit fda458b
Showing 4 changed files with 78 additions and 2 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,9 @@
## master / unreleased

* [ENHANCEMENT] Added `unregister_ingesters_on_shutdown` config option to disable unregistering ingesters on shutdown (default is enabled). #213
* [ENHANCEMENT] Improved blocks storage observability: #237
- Cortex / Queries: added bucket index load operations and latency (available only when bucket index is enabled)
- Alerts: added "CortexBucketIndexNotUpdated" (bucket index only) and "CortexTenantHasPartialBlocks"

## 1.6.0 / 2021-01-05

27 changes: 27 additions & 0 deletions cortex-mixin/alerts/blocks.libsonnet
@@ -184,6 +184,33 @@
message: 'Cortex Store Gateway {{ $labels.namespace }}/{{ $labels.instance }} has not successfully synched the bucket since {{ $value | humanizeDuration }}.',
},
},
{
// Alert if the bucket index has not been updated for a given user.
alert: 'CortexBucketIndexNotUpdated',
expr: |||
min by(namespace, user) (time() - cortex_bucket_index_last_successful_update_timestamp_seconds) > 7200
|||,
labels: {
severity: 'critical',
},
annotations: {
message: 'Cortex bucket index for tenant {{ $labels.user }} in {{ $labels.namespace }} has not been updated since {{ $value | humanizeDuration }}.',
},
},
{
// Alert if we consistently find partial blocks for a given tenant over a relatively large time range.
alert: 'CortexTenantHasPartialBlocks',
'for': '6h',
expr: |||
max by(namespace, user) (cortex_bucket_blocks_partials_count) > 0
|||,
labels: {
severity: 'warning',
},
annotations: {
message: 'Cortex tenant {{ $labels.user }} in {{ $labels.namespace }} has {{ $value }} partial blocks.',
},
},
],
},
],
28 changes: 26 additions & 2 deletions cortex-mixin/dashboards/queries.libsonnet
@@ -136,7 +136,7 @@ local utils = import 'mixin-utils/utils.libsonnet';
)
)
.addRowIf(
std.member($._config.storage_engine, 'chunks'),
std.member($._config.storage_engine, 'blocks'),
$.row('Querier - Blocks storage')
.addPanel(
$.panel('Number of store-gateways hit per Query') +
@@ -156,7 +156,31 @@ local utils = import 'mixin-utils/utils.libsonnet';
)
.addRowIf(
std.member($._config.storage_engine, 'blocks'),
$.row('Store-gateway - Blocks')
$.row('')
.addPanel(
$.panel('Bucket indexes loaded (per querier)') +
$.queryPanel([
'max(cortex_bucket_index_loaded{%s})' % $.jobMatcher($._config.job_names.querier),
'min(cortex_bucket_index_loaded{%s})' % $.jobMatcher($._config.job_names.querier),
'avg(cortex_bucket_index_loaded{%s})' % $.jobMatcher($._config.job_names.querier),
], ['Max', 'Min', 'Average']) +
{ yaxes: $.yaxes('short') },
)
.addPanel(
$.successFailurePanel(
'Bucket indexes load / sec',
'sum(rate(cortex_bucket_index_loads_total{%s}[$__rate_interval])) - sum(rate(cortex_bucket_index_load_failures_total{%s}[$__rate_interval]))' % [$.jobMatcher($._config.job_names.querier), $.jobMatcher($._config.job_names.querier)],
'sum(rate(cortex_bucket_index_load_failures_total{%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.querier),
)
)
.addPanel(
$.panel('Bucket indexes load latency') +
$.latencyPanel('cortex_bucket_index_load_duration_seconds', '{%s}' % $.jobMatcher($._config.job_names.querier)),
)
)
.addRowIf(
std.member($._config.storage_engine, 'blocks'),
$.row('Store-gateway - Blocks storage')
.addPanel(
$.panel('Blocks queried / sec') +
$.queryPanel('sum(rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",%s}[$__rate_interval]))' % $.jobMatcher($._config.job_names.store_gateway), 'blocks') +
22 changes: 22 additions & 0 deletions cortex-mixin/docs/playbooks.md
@@ -226,6 +226,28 @@ gsutil mv gs://BUCKET/TENANT/BLOCK gs://BUCKET/TENANT/corrupted-BLOCK

Same as [`CortexCompactorHasNotUploadedBlocks`](#CortexCompactorHasNotUploadedBlocks).

### CortexBucketIndexNotUpdated

This alert fires when the bucket index for a given tenant has not been updated for a long time. The bucket index is expected to be periodically updated by the compactor and is used by queriers and store-gateways to get a near-up-to-date view of the bucket store.

How to **investigate**:
- Ensure the compactor is successfully running
- Look for any error in the compactor logs
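
To make the alert condition concrete, here is a minimal Python sketch of what the expression in `cortex-mixin/alerts/blocks.libsonnet` computes: for each `(namespace, user)` pair it takes the smallest age of the last successful update and compares it with 7200 seconds. The function name, the sample tuple layout, and the example values are assumptions made for illustration; they are not part of Cortex or the mixin.

```python
from typing import Dict, List, Tuple

STALE_AFTER_SECONDS = 7200  # threshold taken from the alert expression


def stale_bucket_indexes(
    samples: List[Tuple[str, str, float]],
    now: float,
    threshold: float = STALE_AFTER_SECONDS,
) -> Dict[Tuple[str, str], float]:
    """Return {(namespace, user): age_seconds} for stale bucket indexes.

    Each sample is (namespace, user, last_successful_update_timestamp), one
    per series. This mirrors `min by(namespace, user) (time() - ts) > threshold`.
    """
    youngest: Dict[Tuple[str, str], float] = {}
    for namespace, user, last_update in samples:
        age = now - last_update
        key = (namespace, user)
        # Keep the smallest age per group, like the `min by(...)` aggregation.
        if key not in youngest or age < youngest[key]:
            youngest[key] = age
    return {key: age for key, age in youngest.items() if age > threshold}


# Hypothetical samples: tenant-a has one fresh series, tenant-b has none.
samples = [
    ("prod", "tenant-a", 1000.0),
    ("prod", "tenant-a", 9000.0),
    ("prod", "tenant-b", 1500.0),
]
print(stale_bucket_indexes(samples, now=10000.0))
# → {('prod', 'tenant-b'): 8500.0}
```

Because the aggregation is a `min`, a tenant is reported only when every series for it is older than the threshold; a single recent update anywhere keeps the alert quiet.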

### CortexTenantHasPartialBlocks

This alert fires when Cortex finds partial blocks for a given tenant. A partial block is a block missing its `meta.json`, which usually happens in one of two circumstances:

1. A block upload has been interrupted and not cleaned up or retried
2. A block deletion has been interrupted and `deletion-mark.json` has been deleted before `meta.json`

How to **investigate**:
- Look for the block ID in the logs
- Find out which Cortex component last operated on the block (e.g. uploaded by ingester/compactor, or deleted by compactor)
- Investigate whether it was a partial upload or a partial delete
- If it was a partial delete or a failed upload by a compactor, it is safe to manually delete the block from the bucket
- If it was a failed upload by an ingester that was not retried later, investigate further (ingesters are expected to retry uploads until they succeed)
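
As a starting point for the investigation, the definition above (a block directory without `meta.json`) can be checked against a listing of object keys. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes keys follow the `TENANT/BLOCK/...` layout used in the `gsutil` example in this playbook and that block IDs are ULIDs. It works on a plain list of keys and does not call any storage API.

```python
import re
from typing import Dict, Iterable, List, Set

# Block IDs are ULIDs: 26 characters of Crockford base32.
ULID_RE = re.compile(r"[0-9A-HJKMNP-TV-Z]{26}")


def find_partial_blocks(object_keys: Iterable[str], tenant: str) -> List[str]:
    """Return the sorted IDs of the tenant's blocks that have no meta.json.

    Assumes keys look like 'TENANT/BLOCK/FILE'. Objects stored directly
    under the tenant prefix are ignored.
    """
    files_by_block: Dict[str, Set[str]] = {}
    prefix = tenant + "/"
    for key in object_keys:
        if not key.startswith(prefix):
            continue  # belongs to another tenant
        parts = key[len(prefix):].split("/", 1)
        if len(parts) != 2 or not ULID_RE.fullmatch(parts[0]):
            continue  # not inside a block directory
        block_id, filename = parts
        files_by_block.setdefault(block_id, set()).add(filename)
    return sorted(
        block_id
        for block_id, files in files_by_block.items()
        if "meta.json" not in files
    )


# Hypothetical listing: the second block of tenant-a lacks meta.json.
keys = [
    "tenant-a/01EVN5QE8XJD0RY0BVFE2GTF91/meta.json",
    "tenant-a/01EVN5QE8XJD0RY0BVFE2GTF91/index",
    "tenant-a/01EVN5QE8XJD0RY0BVFE2GTF92/chunks/000001",
    "tenant-a/bucket-index.json.gz",
    "tenant-b/01EVN5QE8XJD0RY0BVFE2GTF91/index",
]
print(find_partial_blocks(keys, "tenant-a"))
# → ['01EVN5QE8XJD0RY0BVFE2GTF92']
```

A listing alone cannot tell a partial upload from a partial delete, so the block IDs it returns still need to be cross-checked against the component logs as described in the steps above.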

### CortexWALCorruption

This alert is only related to the chunks storage. It can fire for two reasons: (1) non-graceful shutdown of ingesters; (2) faulty storage or NFS.