[LOG-1815] Enhancement proposal: Add alerts and rules for operator-managed LokiStack #940

Merged
merged 1 commit into openshift:master on Dec 6, 2021

Conversation

@ronensc (Contributor) commented Oct 21, 2021

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 21, 2021
the Loki Deployment. The alerts are based on metrics collected by Prometheus; the alerting rules are evaluated and, when fired, the alerts are pushed to Alertmanager to notify human operators.

Loki Alerts can be categorized as follows:
@eranra (Oct 24, 2021):
@ronensc (Contributor Author): Done


See also [Defining recording rules](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
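For illustration, a minimal sketch of how an alerting rule like the ones proposed here could be packaged as a prometheus-operator PrometheusRule resource, assuming the operator delivers rules that way (the metadata names, `for` duration, and severity below are hypothetical; the expression is the LokiRequestLatency expression shown later in this proposal):

```
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: lokistack-alerts            # hypothetical name
  namespace: openshift-logging      # hypothetical namespace
spec:
  groups:
    - name: loki.alerts
      rules:
        - alert: LokiRequestLatency
          # Fires when the per-route 99th percentile latency exceeds 1 second.
          expr: namespace_job_route:loki_request_duration_seconds:99quantile{route!~"(?i).*tail.*"} > 1
          for: 15m                  # illustrative duration
          labels:
            severity: warning       # illustrative severity
          annotations:
            summary: The 99th percentile is experiencing high latency (above 1 second).
```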

##### LokiRequestErrors
@eranra (Oct 24, 2021): Same request as above; if we can move to tables, it will be easier to read.

(I propose a table for each alert, with known organized fields)

@ronensc (Contributor Author): Done
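For context, a sketch of the kind of expression that typically backs a LokiRequestErrors-style alert in the upstream Loki mixin (referenced later in this proposal); the rate window and the 10% threshold are assumptions rather than values taken from this document:

```
# Percentage of 5xx responses per route over the last minute, alerting above 10%.
100 * sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1m])) by (namespace, job, route)
  /
sum(rate(loki_request_duration_seconds_count[1m])) by (namespace, job, route)
  > 10
```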

Loki Alerts can be categorized as follows:
* **Loki System**: One or more of the Loki components is misbehaving; this can be due to any part of Loki or the OCP system not working as expected
@ronensc Any chance that we can label those with the relevant component that reported the panic? At the end of the day, each Loki component is a separate container, so it might be good to have in the alert the name of the component that broke.

@ronensc (Contributor Author): I installed loki-operator and checked the labels of the loki_panic_total metric. It seems that we can distinguish between Loki components using the `job` label.

[screenshot: label values of the loki_panic_total metric]
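Building on that observation, a sketch of how a panics alert could surface the failing component through the `job` label (the 10-minute window and the grouping are assumptions):

```
# One or more panics in the last 10 minutes, broken out per component via the job label.
sum(increase(loki_panic_total[10m])) by (namespace, job) > 0
```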

* https://monitoring.mixins.dev/loki/
* https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#lokirequestpanics

##### LokiRequestLatency
Is that for all tenants, or maybe affecting only one or more tenants?

@ronensc (Contributor Author): All tenants, I guess. I couldn't find a label that distinguishes between tenants.


References:
* https://github.com/rhobs/configuration/blob/main/docs/sop/observatorium.md#lokitenantratelimitwarning

There are a lot of missing alerts here (storage, etc.). Can we create a table tracing each alert name to its category and severity? This will help us see that we have minimal coverage.

@ronensc (Contributor Author): I added a summary table at the beginning and added a few more alerts.


##### LokiRequestLatency

The 99th percentile is experiencing high latency.
Please also extend this to explain the threshold (greater than 1).

@ronensc (Contributor Author): Done

```
namespace_job_route:loki_request_duration_seconds:99quantile{route!~"(?i).*tail.*"} > 1
```

underlying recording rule:
Please explain the purpose of the metric created by this recording rule.

@ronensc (Contributor Author): Done
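For reference, a sketch of the kind of recording rule that typically produces the `namespace_job_route:loki_request_duration_seconds:99quantile` metric used above, following upstream Loki mixin conventions (the exact expression and window are assumptions). Its purpose is to pre-compute the per-route 99th-percentile latency so that the alert expression stays cheap to evaluate and easy to read:

```
groups:
  - name: loki_rules
    rules:
      # Pre-computes the per-route p99 latency so alert expressions can reference it directly.
      - record: namespace_job_route:loki_request_duration_seconds:99quantile
        expr: |
          histogram_quantile(0.99,
            sum(rate(loki_request_duration_seconds_bucket[1m])) by (le, namespace, job, route))
```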


##### LokiTenantRateLimitWarning

A tenant is experiencing rate limiting.
Please enhance this to explain the limit (> 100).
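For illustration, a sketch of what a tenant rate-limiting alert with such a threshold could look like. The metric name `loki_discarded_samples_total` and its labels are assumptions based on metrics Loki commonly exposes, not necessarily the rule this proposal ends up using:

```
# More than 100 samples/second discarded for a tenant, grouped by the reason for the discard.
sum by (namespace, tenant, reason) (rate(loki_discarded_samples_total[2m])) > 100
```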

@eranra commented Oct 27, 2021

@ronensc and @periklis I found this page https://pages.github.ibm.com/CloudEngineering/system_architecture/guardrails_standards/monitoring_maturity.md, which provides generic but very useful information on the alerts and dashboards needed for services such as Loki. It is used by all IBM service teams, so I think we can learn a lot from it. @ronensc, please add those to the list. I am not saying that we need to implement all of them for day zero, but we can add them and discuss what we implement and what not (and when).

@alanconway (Contributor) left a comment:

Suggested some minor name changes, otherwise LGTM.

enhancements/cluster-logging/loki-observability.md (review threads outdated, resolved)
@ronensc (Contributor Author) commented Nov 23, 2021

@alanconway @periklis @eranra
Since Loki is just a logging system, is there any alert that justifies a critical severity? (i.e. one that warrants waking someone in the middle of the night)

@ronensc ronensc changed the title [WIP] [LOG-1815] Enhancement proposal: Add alerts and rules for operator-managed LokiStack [LOG-1815] Enhancement proposal: Add alerts and rules for operator-managed LokiStack Nov 24, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 24, 2021
@alanconway (Contributor) left a comment:

Need to squash commits before merge, otherwise LGTM.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2021
@periklis (Contributor) left a comment:

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 6, 2021
@openshift-ci bot commented Dec 6, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alanconway, periklis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit afea709 into openshift:master Dec 6, 2021
@ronensc ronensc deleted the loki-observability branch January 9, 2022 15:33
Labels: approved, lgtm

5 participants