Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck #341
…iesStuck Signed-off-by: Marco Pracucci <marco@pracucci.com>
Looks good AFAICT
Good job on improving playbooks!
cortex-mixin/docs/playbooks.md
Outdated
- An increased latency reduces the number of queries we can run / sec: once all workers are busy, new queries will pile up in the queue
- Temporarily scale up queriers to try to stop the bleed
- Check the `Cortex / Slow Queries` dashboard to see if a specific tenant is running heavy queries
- If it's a multi-tenant Cortex cluster and shuffle-sharing is disabled for queriers, you may consider to enable it only for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
- If it's a multi-tenant Cortex cluster and shuffle-sharing is disabled for queriers, you may consider to enable it only for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
- On multi-tenant Cortex cluster with shuffle-sharing for queriers disabled, you may consider to enable it for that specific tenant to reduce its blast radius. To enable queriers shuffle-sharding for a single tenant you need to set the `max_queriers_per_tenant` limit override for the specific tenant (the value should be set to the number of queriers assigned to the tenant).
If the user is already configured with querier sharding and is sending too many slow queries, then none of this will help. Arguably, the operator could increase the shard size, but that may affect other users too much. Depending on the situation, the operator may choose to do nothing and let Cortex return errors for that user. (This assumes that only a single user is affected.)
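As a rough illustration of the playbook bullet on latency ("once all workers are busy, new queries will pile up in the queue"), here is a back-of-the-envelope sketch in Python. It assumes each querier worker runs one query at a time and that latency is a steady average; the function names and the numbers are hypothetical and are not taken from Cortex.

```python
def max_queries_per_sec(workers: int, avg_latency_s: float) -> float:
    """Upper bound on throughput when every worker is busy.

    Assumption: one in-flight query per worker, constant average latency.
    """
    return workers / avg_latency_s


def queue_growth_per_sec(arrival_qps: float, workers: int, avg_latency_s: float) -> float:
    """Rate at which the queue grows once arrivals exceed capacity (0 if they do not)."""
    capacity = max_queries_per_sec(workers, avg_latency_s)
    return max(0.0, arrival_qps - capacity)


if __name__ == "__main__":
    # Hypothetical cluster: 40 workers, 60 incoming queries / sec.
    for latency in (0.5, 2.0):
        print(
            f"latency={latency}s "
            f"capacity={max_queries_per_sec(40, latency)} qps "
            f"queue growth={queue_growth_per_sec(60, 40, latency)} queries/sec"
        )
```

With these made-up numbers, the same workers keep up at 0.5s average latency but fall behind at 2s, which is why temporarily scaling up queriers can stop the queue from growing while the root cause is investigated.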
cortex-mixin/docs/playbooks.md
Outdated
- Is query latency increased?
- An increased latency reduces the number of queries we can run / sec: once all workers are busy, new queries will pile up in the queue
- Temporarily scale up queriers to try to stop the bleed
- Check the `Cortex / Slow Queries` dashboard to see if a specific tenant is running heavy queries
This can perhaps be better checked by looking at the `cortex_query_scheduler_queue_length` metric, which is already per-user: `sum by (user) (cortex_query_scheduler_queue_length)`
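A minimal sketch of how the per-user query above could be used to rank tenants by queue length. It assumes the standard Prometheus instant-query JSON response shape (as returned by `/api/v1/query`); the `top_tenants` helper, the sample payload, and the tenant names are hypothetical and only serve as an illustration.

```python
import json

# Query suggested in the review comment.
QUERY = "sum by (user) (cortex_query_scheduler_queue_length)"


def top_tenants(response_body: str, limit: int = 5) -> list[tuple[str, float]]:
    """Return (user, queue_length) pairs sorted by queue length, largest first.

    Assumption: response_body is the JSON body of a Prometheus instant query
    whose result type is "vector".
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")

    rows = []
    for sample in payload["data"]["result"]:
        user = sample["metric"].get("user", "<unknown>")
        _timestamp, value = sample["value"]  # value is encoded as a string
        rows.append((user, float(value)))

    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


# Hypothetical response body, hand-written for illustration only.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"user": "tenant-a"}, "value": [1620000000, "3"]},
            {"metric": {"user": "tenant-b"}, "value": [1620000000, "120"]},
            {"metric": {"user": "tenant-c"}, "value": [1620000000, "0"]},
        ],
    },
})

if __name__ == "__main__":
    for user, length in top_tenants(SAMPLE, limit=2):
        print(user, length)
```

A single tenant dominating this list would point at the per-tenant case discussed in the other thread; a queue spread evenly across tenants suggests a capacity or latency problem instead.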
Signed-off-by: Marco Pracucci <marco@pracucci.com>
Very good suggestions @pstibrany. Could you take another look, please?
Brings in the following changes:
- Use default as a picker value for datasource variable #204
- allow table link in new tab #238
- allow setting a default datasource #301
- Add textPanel #341
- make status code label name overrideable in qpsPanel #397
- use $__rate_interval over $__interval #401
- Set shared tooltip to false by default #458
- Use custom 'all' value to avoid massive regexes in queries. #469

https://github.com/grafana/jsonnet-libs/commits/master/grafana-builder/
…ok-for-stuck-queries-alerts Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck
What this PR does:
Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck. I tried to write down the most common root causes: anything else to add?
Which issue(s) this PR fixes:
N/A
Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`