[apicast] prometheus metrics policy #5

Closed · wants to merge 16 commits

Conversation

@mikz (Contributor) commented Feb 27, 2018

Depends on 3scale/APIcast#629.

Example

# HELP apicast_status HTTP status generated by APIcast
# TYPE apicast_status counter
apicast_status{status="200"} 2
apicast_status{status="503"} 14116
# HELP cloud_hosted_balancer Cloud hosted balancer
# TYPE cloud_hosted_balancer counter
cloud_hosted_balancer{status="success"} 239
# HELP cloud_hosted_rate_limit Cloud hosted rate limits
# TYPE cloud_hosted_rate_limit counter
cloud_hosted_rate_limit{state="delayed "} 283
cloud_hosted_rate_limit{state="rejected"} 14116
# HELP nginx_error_log Items in nginx error log
# TYPE nginx_error_log counter
nginx_error_log{level="alert"} 4
nginx_error_log{level="warn"} 113
# HELP nginx_http_connections Number of HTTP connections
# TYPE nginx_http_connections gauge
nginx_http_connections{state="accepted"} 274
nginx_http_connections{state="active"} 114
nginx_http_connections{state="handled"} 274
nginx_http_connections{state="reading"} 0
nginx_http_connections{state="total"} 14422
nginx_http_connections{state="waiting"} 62
nginx_http_connections{state="writing"} 52
# HELP nginx_metric_errors_total Number of nginx-lua-prometheus errors
# TYPE nginx_metric_errors_total counter
nginx_metric_errors_total 0
# HELP upstream_status HTTP status from upstream servers
# TYPE upstream_status counter
upstream_status{status="200"} 233
upstream_status{status="404"} 3
# HELP openresty_shdict_capacity OpenResty shared dictionary capacity
# TYPE openresty_shdict_capacity gauge
openresty_shdict_capacity{dict="api_keys"} 10485760
openresty_shdict_capacity{dict="configuration"} 10485760
openresty_shdict_capacity{dict="init"} 16384
openresty_shdict_capacity{dict="locks"} 1048576
openresty_shdict_capacity{dict="prometheus_metrics"} 16777216
openresty_shdict_capacity{dict="rate_limit_req_store"} 10485760
# HELP openresty_shdict_free_space OpenResty shared dictionary free space
# TYPE openresty_shdict_free_space gauge
openresty_shdict_free_space{dict="api_keys"} 10412032
openresty_shdict_free_space{dict="configuration"} 10412032
openresty_shdict_free_space{dict="init"} 4096
openresty_shdict_free_space{dict="locks"} 1032192
openresty_shdict_free_space{dict="prometheus_metrics"} 16662528
openresty_shdict_free_space{dict="rate_limit_req_store"} 10412032

@@ -2,12 +2,14 @@ local PolicyChain = require('apicast.policy_chain')
 local policy_chain = context.policy_chain

 if not arg then -- {arg} is defined only when executing the CLI
+  policy_chain:insert(PolicyChain.load_policy('cloud_hosted.metrics', '0.1', { log_level = 'warn' }))
mikz (Contributor, Author):

For now it records warnings too, but for production we probably want error and above. Maybe an env var?

Reviewer:

Yes, log_level should be managed by an ENV var.

Would this be a policy-specific ENV var, e.g. PROMETHEUS_LOG_LEVEL, or system-wide?
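A minimal sketch of what reading such a variable could look like, assuming the policy keeps its log_level option (the thread below eventually settles on the name METRICS_LOG_LEVEL in f45985c):

-- sketch only: read the capture level from an env var, falling back to 'error'
local log_level = os.getenv('METRICS_LOG_LEVEL') or 'error'

policy_chain:insert(PolicyChain.load_policy('cloud_hosted.metrics', '0.1', { log_level = log_level }))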

@@ -0,0 +1 @@
lua_capture_error_log 4k;
mikz (Contributor, Author):

This is enough for roughly 10 log entries, so it will keep the last ~10 entries at the configured log level.

For example, if 15 log entries arrive between Prometheus pulls (say 5 error and 10 warning), the errors might not appear. So it is always good to restrict the capture to the log levels we actually need.

Reviewer:

So maybe this config could be an env var? We will certainly have to play with the Prometheus scrape interval and the log capture size.

mikz (Contributor, Author):

It could be, if this were part of the Liquid template in the main repo; here it can't be templated.

Also, disregard my comment about 10 entries. It fits more than that.

I think 4k is fine when properly configured. We should not capture log levels we don't care about. Say we take the top levels (emerg, alert, crit, error) and configure the capture for error and up: we would not really care if a few higher-level entries were missed, because an error by itself is enough to trigger an alert.

Reviewer:

And this is controlled by the "log_map" var? Can we expose this one as an ENV var?

Reviewer:

I see, so set_filter_level will capture everything >= METRICS_LOG_LEVEL, right?

mikz (Contributor, Author):

@jmprusi done in f45985c as METRICS_LOG_LEVEL. Better name suggestions are very welcome :)

@maneta yes.

Reviewer:

Cool! :)
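Putting this thread together, a minimal sketch of how the capture described above can work with lua-resty-core's ngx.errlog, assuming the lua_capture_error_log 4k; buffer above and a recent lua-resty-core where get_logs() returns (level, timestamp, message) triples; the helper name is illustrative, not taken from this PR:

local errlog = require('ngx.errlog')

-- init phase (init_by_lua*): only entries at ngx.ERR and above ever reach the
-- 4k buffer, which is why the small size is acceptable
assert(errlog.set_filter_level(ngx.ERR))

-- illustrative helper, e.g. run on each Prometheus scrape: drain the captured
-- entries and count them per level; get_logs() returns a flat array of
-- level/timestamp/message triples
local function count_captured_logs()
  local counts = {}
  local logs = errlog.get_logs(100) or {}
  for i = 1, #logs, 3 do
    local level = logs[i]
    counts[level] = (counts[level] or 0) + 1
  end
  return counts
end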

@mikz force-pushed the apicast-prometheus branch from 28eca89 to f45985c on February 28, 2018 09:13
   policy_chain:insert(PolicyChain.load_policy('cloud_hosted.rate_limit', '0.1', {
     limit = os.getenv('RATE_LIMIT') or 5,
     burst = os.getenv('RATE_LIMIT_BURST') or 50 }), 1)
   policy_chain:insert(PolicyChain.load_policy('cloud_hosted.balancer_blacklist', '0.1'), 1)
 end

 return {
-  policy_chain = policy_chain
+  policy_chain = policy_chain,
+  ports = { metrics = 9100 },
Reviewer:

9421 ?

mikz (Contributor, Author):

Yes. Thx!
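With the agreed port, the added lines above would presumably end up as:

return {
  policy_chain = policy_chain,
  ports = { metrics = 9421 },
}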

@mikz force-pushed the apicast-prometheus branch from 147c051 to 09f65c9 on February 28, 2018 09:50
@davidor (Contributor) commented Feb 28, 2018

👍

@mikz force-pushed the apicast-prometheus branch 3 times, most recently from 9c392cc to 7472e92 on February 28, 2018 15:11
@mikz force-pushed the apicast-prometheus branch from 7472e92 to d2f09fc on February 28, 2018 15:30
mikz added 2 commits on February 28, 2018 17:26 (one commit message notes that it is customizable by the RATE_LIMIT_STATUS env variable)
@mikz force-pushed the apicast-prometheus branch from 7f5e47e to 908ea05 on February 28, 2018 16:58
@mikz force-pushed the apicast-prometheus branch from c5f1e8e to 7fff097 on March 1, 2018 08:48
[apicast] prometheus metrics policy
@maneta commented Mar 1, 2018

@mikz @jmprusi @davidor

Here is the draft I have for alerts for APIcast-based apps. The thresholds are not defined yet, so input is welcome. The time range of the queries is 5 minutes but is totally flexible.

Nginx Error Logs

sum(increase(nginx_error_log{kubernetes_namespace="apicast-staging",level=~"(error|crit|alert|emerg)"}[5m]))

Detect spikes in nginx error logs (error|crit|alert|emerg) for the last five minutes.
Severity: Critical

Apicast HTTP status

sum(increase(apicast_status{kubernetes_namespace="apicast-staging",status=~"5\\d{2}"}[5m]))

Detect spikes in 5XX
Severity: Critical

increase(apicast_status{kubernetes_namespace="apicast-staging",status=~"4\\d{2}"}[5m])

Detect spikes in 4XX
Severity: Warning
Note: Not sure if we should alert on 4XX at all, so opinions are very welcome here.

Dropped connections

sum(increase(nginx_http_connections{kubernetes_namespace="apicast-staging",state="accepted"}[5m])) - sum(increase(nginx_http_connections{kubernetes_namespace="apicast-staging",state="handled"}[5m]))

Calculates dropped connections: dropped = (accepted - handled)
Severity: Critical

Request Processing Time (duration)

I don't have a query for this yet, but the idea is to extract $request_time from the APIcast logs and alert on it. I'm still trying to find a good way to do this; opinions are welcome (see the sketch after this comment for one possible alternative).

What do you think, guys?
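As a point of reference for the request-time idea above, the nginx-lua-prometheus library already backing these metrics can export the duration directly as a histogram instead of having it parsed out of access logs. A hedged sketch (metric, label, and function names are illustrative, not from this PR), using the prometheus_metrics shared dict visible in the example output:

-- illustrative only: register the histogram once, e.g. from init_worker_by_lua*
local prometheus = require('prometheus').init('prometheus_metrics')

local request_duration = prometheus:histogram(
  'nginx_http_request_duration_seconds',  -- hypothetical metric name
  'HTTP request latency',
  { 'host' })

-- to be called from a log-phase handler (log_by_lua*), where $request_time is final
local function observe_duration()
  local duration = tonumber(ngx.var.request_time)
  if duration then
    request_duration:observe(duration, { ngx.var.host })
  end
end

That would sidestep log parsing and make the latency distribution available for scraping on the same metrics port.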

@mikz (Contributor, Author) commented Mar 4, 2018

Looks good to me 👍

@mikz closed this Mar 5, 2018
@mikz deleted the apicast-prometheus branch March 5, 2018 13:31