[LOG-1815] Enhancement proposal: Add alerts and rules for operator-managed LokiStack #940
Conversation
the Loki Deployment. The alerts are based on metrics collected by Prometheus; the alert rules are evaluated and, when fired, pushed to Alertmanager to notify human operators.

Loki Alerts can be categorized as follows:
Done
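For context on the mechanism described in the quoted text above, here is a minimal sketch of how such an alert could be declared as a PrometheusRule resource. The resource name and namespace are illustrative, and the expression follows the loki-mixin's LokiRequestErrors shape; none of this is taken from the proposal's actual definitions:

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: lokistack-alerts        # hypothetical resource name
  namespace: openshift-logging  # assumed namespace
spec:
  groups:
    - name: loki.alerts
      rules:
        - alert: LokiRequestErrors
          # Prometheus evaluates the expression; if it holds for 15 minutes,
          # the alert fires and is pushed to Alertmanager for notification.
          expr: |
            100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route)
              /
            sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route)
              > 10
          for: 15m
          labels:
            severity: critical
          annotations:
            message: At least 10% of requests are responded with 5xx server errors.
```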
See also [Defining recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
##### LokiRequestErrors
Same request as above: if we can move to tables, it will be easier to read.
(I propose a table for each alert, with a known, organized set of fields; one possible shape follows.)
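For illustration, a per-alert table could look something like this (the field set is a suggestion, not taken from the proposal; the example values echo the LokiRequestErrors alert discussed in this thread):

| Field | Example |
| --- | --- |
| Name | LokiRequestErrors |
| Severity | critical |
| For | 15m |
| Message | At least 10% of requests are responded with 5xx server errors. |
| Runbook | link to SOP |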
Done
Loki Alerts can be categorized as follows:
* **Loki System**: One or more of the Loki components is misbehaving; this can be due to any part of the Loki or OCP system not working as expected.
@ronensc Any chance that we can label those with the relevant components that reported the panic? At the end of the day, each Loki component is a separate container, so it might be good to have the name of the component that broke in the alert.
* https://monitoring.mixins.dev/loki/
* https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#lokirequestpanics
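On the component question above: the mixin's panic metric already carries a per-component `job` label, so the alert can surface which container panicked. A sketch, assuming the mixin's `loki_panic_total` metric:

```
# Fires when any Loki component logged a panic in the last 10 minutes;
# the `job` label names the component (distributor, ingester, querier, ...).
sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
```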
##### LokiRequestLatency
Is that for all tenants, or could it affect only one or more tenants?
All tenants, I guess. I couldn't find a label that distinguishes between tenants.
References:
* https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#lokitenantratelimitwarning
There are a lot of missing alerts here (storage, etc.). Can we create a tracking table from the name of each alert to its category and severity? This will help us see that we have minimal coverage.
I added a summary table at the beginning and added a few more alerts.
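For illustration, such a summary table might take this shape (rows are drawn from the alerts discussed in this thread; the "Loki Tenant" category and the severities are placeholders to confirm against the proposal):

| Alert | Category | Severity |
| --- | --- | --- |
| LokiRequestErrors | Loki System | critical |
| LokiRequestPanics | Loki System | critical |
| LokiRequestLatency | Loki System | critical |
| LokiTenantRateLimitWarning | Loki Tenant | warning |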
##### LokiRequestLatency

Requests are experiencing high latency at the 99th percentile (more than 1 second, per the expression below).
Please extend this to also explain the "greater than 1" part.
Done
```
namespace_job_route:loki_request_duration_seconds:99quantile{route!~"(?i).*tail.*"} > 1
```
Underlying recording rule:
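A sketch of what that rule presumably looks like in the loki-mixin; it precomputes the per-route p99 latency so the alert only compares a pre-aggregated series against the 1-second threshold (treat the exact rate window as an assumption):

```
- record: namespace_job_route:loki_request_duration_seconds:99quantile
  # Precompute the 99th-percentile request duration per namespace/job/route
  # from the raw histogram buckets.
  expr: |
    histogram_quantile(0.99,
      sum(rate(loki_request_duration_seconds_bucket[1m])) by (le, namespace, job, route))
```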
Please explain the purpose of the metric created by this recording rule.
Done
##### LokiTenantRateLimitWarning

A tenant is experiencing rate limiting.
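A sketch of what this alert's expression might look like, built on Loki's `loki_discarded_samples_total` metric; the `> 100` threshold is the limit mentioned in the comment below, while the rate window and the `reason` value are assumptions:

```
# Fires when a tenant is discarding more than 100 samples/sec due to
# rate limiting; the `tenant` and `reason` labels say who and why.
sum(rate(loki_discarded_samples_total{reason="rate_limited"}[2m]))
  by (namespace, tenant, reason) > 100
```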
Please enhance this to explain the limit (> 100).
@ronensc and @periklis I found this page https://pages.github.ibm.com/CloudEngineering/system_architecture/guardrails_standards/monitoring_maturity.md, which provides generic but very useful information on the alerts and dashboards needed for services such as Loki. It is used by all IBM service teams, so I think we can learn a lot from it. @ronensc please add those to the list. I am not saying that we need to implement all of them for day zero, but we can add them and discuss what we implement and what not (and when).
Suggested some minor name changes, otherwise LGTM.
@alanconway @periklis @eranra
Need to squash commits before merge, otherwise LGTM.
Force-pushed from 083547e to ca7bed0.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway, periklis

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
https://issues.redhat.com/browse/LOG-1815