[apicast] prometheus metrics policy #5

Closed · wants to merge 16 commits

Conversation

@mikz (Contributor) commented Feb 27, 2018

Depends on 3scale/APIcast#629.

Example

# HELP apicast_status HTTP status generated by APIcast
# TYPE apicast_status counter
apicast_status{status="200"} 2
apicast_status{status="503"} 14116
# HELP cloud_hosted_balancer Cloud hosted balancer
# TYPE cloud_hosted_balancer counter
cloud_hosted_balancer{status="success"} 239
# HELP cloud_hosted_rate_limit Cloud hosted rate limits
# TYPE cloud_hosted_rate_limit counter
cloud_hosted_rate_limit{state="delayed "} 283
cloud_hosted_rate_limit{state="rejected"} 14116
# HELP nginx_error_log Items in nginx error log
# TYPE nginx_error_log counter
nginx_error_log{level="alert"} 4
nginx_error_log{level="warn"} 113
# HELP nginx_http_connections Number of HTTP connections
# TYPE nginx_http_connections gauge
nginx_http_connections{state="accepted"} 274
nginx_http_connections{state="active"} 114
nginx_http_connections{state="handled"} 274
nginx_http_connections{state="reading"} 0
nginx_http_connections{state="total"} 14422
nginx_http_connections{state="waiting"} 62
nginx_http_connections{state="writing"} 52
# HELP nginx_metric_errors_total Number of nginx-lua-prometheus errors
# TYPE nginx_metric_errors_total counter
nginx_metric_errors_total 0
# HELP upstream_status HTTP status from upstream servers
# TYPE upstream_status counter
upstream_status{status="200"} 233
upstream_status{status="404"} 3
# HELP openresty_shdict_capacity OpenResty shared dictionary capacity
# TYPE openresty_shdict_capacity gauge
openresty_shdict_capacity{dict="api_keys"} 10485760
openresty_shdict_capacity{dict="configuration"} 10485760
openresty_shdict_capacity{dict="init"} 16384
openresty_shdict_capacity{dict="locks"} 1048576
openresty_shdict_capacity{dict="prometheus_metrics"} 16777216
openresty_shdict_capacity{dict="rate_limit_req_store"} 10485760
# HELP openresty_shdict_free_space OpenResty shared dictionary free space
# TYPE openresty_shdict_free_space gauge
openresty_shdict_free_space{dict="api_keys"} 10412032
openresty_shdict_free_space{dict="configuration"} 10412032
openresty_shdict_free_space{dict="init"} 4096
openresty_shdict_free_space{dict="locks"} 1032192
openresty_shdict_free_space{dict="prometheus_metrics"} 16662528
openresty_shdict_free_space{dict="rate_limit_req_store"} 10412032

@@ -2,12 +2,14 @@ local PolicyChain = require('apicast.policy_chain')
 local policy_chain = context.policy_chain

 if not arg then -- {arg} is defined only when executing the CLI
+  policy_chain:insert(PolicyChain.load_policy('cloud_hosted.metrics', '0.1', { log_level = 'warn' }))
mikz (Contributor, Author):

For now it records warnings too, but for production we probably want error and above. Maybe an env var?

Reviewer:

Yes, log_level should be managed by an ENV var.

Would this be a policy-specific ENV var, e.g. PROMETHEUS_LOG_LEVEL, or system-wide?
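A minimal sketch of what reading such a variable could look like, assuming the policy keeps its log_level option (the thread below eventually settles on the name METRICS_LOG_LEVEL in f45985c):

-- sketch only: read the capture level from an env var, falling back to 'error'
local log_level = os.getenv('METRICS_LOG_LEVEL') or 'error'

policy_chain:insert(PolicyChain.load_policy('cloud_hosted.metrics', '0.1', { log_level = log_level }))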

@@ -0,0 +1 @@
lua_capture_error_log 4k;
mikz (Contributor, Author):

This is enough for roughly 10 log entries, so it will keep the last ~10 entries at the configured log level.

For example, if 15 log entries arrive between Prometheus pulls (say 5 error and 10 warning), the errors might not appear. So it is always good to restrict the capture to the log levels we actually need.

Reviewer:

So maybe this config could be an env var? We will certainly have to play with the Prometheus scrape interval and the log capture size.

mikz (Contributor, Author):

It could be, if this were part of the Liquid template in the main repo; here it can't be templated.

Also, disregard my comment about 10 entries. It fits more than that.

I think 4k is fine when properly configured. We should not capture log levels we don't care about. Say we take the top levels (emerg, alert, crit, error) and configure the capture for error and up: we would not really care if a few higher-level entries were missed, because an error by itself is enough to trigger an alert.

Reviewer:

And this is controlled by the "log_map" var? Can we expose this one as an ENV var?

Reviewer:

I see, so set_filter_level will capture everything >= METRICS_LOG_LEVEL, right?

mikz (Contributor, Author):

@jmprusi done in f45985c as METRICS_LOG_LEVEL. Better name suggestions are very welcome :)

@maneta yes.

Reviewer:

Cool! :)
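Putting this thread together, a minimal sketch of how the capture described above can work with lua-resty-core's ngx.errlog, assuming the lua_capture_error_log 4k; buffer above and a recent lua-resty-core where get_logs() returns (level, timestamp, message) triples; the helper name is illustrative, not taken from this PR:

local errlog = require('ngx.errlog')

-- init phase (init_by_lua*): only entries at ngx.ERR and above ever reach the
-- 4k buffer, which is why the small size is acceptable
assert(errlog.set_filter_level(ngx.ERR))

-- illustrative helper, e.g. run on each Prometheus scrape: drain the captured
-- entries and count them per level; get_logs() returns a flat array of
-- level/timestamp/message triples
local function count_captured_logs()
  local counts = {}
  local logs = errlog.get_logs(100) or {}
  for i = 1, #logs, 3 do
    local level = logs[i]
    counts[level] = (counts[level] or 0) + 1
  end
  return counts
end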

@mikz force-pushed the apicast-prometheus branch from 28eca89 to f45985c on February 28, 2018 09:13
   policy_chain:insert(PolicyChain.load_policy('cloud_hosted.rate_limit', '0.1', {
     limit = os.getenv('RATE_LIMIT') or 5,
     burst = os.getenv('RATE_LIMIT_BURST') or 50 }), 1)
   policy_chain:insert(PolicyChain.load_policy('cloud_hosted.balancer_blacklist', '0.1'), 1)
 end

 return {
-  policy_chain = policy_chain
+  policy_chain = policy_chain,
+  ports = { metrics = 9100 },
Reviewer:

9421 ?

mikz (Contributor, Author):

Yes. Thx!
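With the agreed port, the added lines above would presumably end up as:

return {
  policy_chain = policy_chain,
  ports = { metrics = 9421 },
}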

@mikz force-pushed the apicast-prometheus branch from 147c051 to 09f65c9 on February 28, 2018 09:50
@davidor (Contributor) commented Feb 28, 2018

👍

@mikz force-pushed the apicast-prometheus branch 3 times, most recently from 9c392cc to 7472e92 on February 28, 2018 15:11
@mikz force-pushed the apicast-prometheus branch from 7472e92 to d2f09fc on February 28, 2018 15:30
mikz added 2 commits on February 28, 2018 17:26 (one commit message notes that it is customizable by the RATE_LIMIT_STATUS env variable)
@mikz force-pushed the apicast-prometheus branch from 7f5e47e to 908ea05 on February 28, 2018 16:58
@mikz force-pushed the apicast-prometheus branch from c5f1e8e to 7fff097 on March 1, 2018 08:48
[apicast] prometheus metrics policy
@maneta commented Mar 1, 2018

@mikz @jmprusi @davidor

Here is the draft I have for alerts for APIcast-based apps. The thresholds are not defined yet, so input is welcome. The time range of the queries is 5 minutes but is totally flexible.

Nginx Error Logs

sum(increase(nginx_error_log{kubernetes_namespace="apicast-staging",level=~"(error|crit|alert|emerg)"}[5m]))

Detect spikes in nginx error logs (error|crit|alert|emerg) for the last five minutes.
Severity: Critical

Apicast HTTP status

sum(increase(apicast_status{kubernetes_namespace="apicast-staging",status=~"5\\d{2}"}[5m]))

Detect spikes in 5XX
Severity: Critical

increase(apicast_status{kubernetes_namespace="apicast-staging",status=~"4\\d{2}"}[5m])

Detect spikes in 4XX
Severity: Warning
Note: Not sure if we should alert on 4XX at all, so opinions are very welcome here.

Dropped connections

sum(increase(nginx_http_connections{kubernetes_namespace="apicast-staging",state="accepted"}[5m])) - sum(increase(nginx_http_connections{kubernetes_namespace="apicast-staging",state="handled"}[5m]))

Calculates dropped connections: dropped = (accepted - handled)
Severity: Critical

Request Processing Time (duration)

I don't have a query for this yet, but the idea is to extract $request_time from the APIcast logs and alert on it. I'm still trying to find a good way to do this; opinions are welcome (see the sketch after this comment for one possible alternative).

What do you think, guys?
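As a point of reference for the request-time idea above, the nginx-lua-prometheus library already backing these metrics can export the duration directly as a histogram instead of having it parsed out of access logs. A hedged sketch (metric, label, and function names are illustrative, not from this PR), using the prometheus_metrics shared dict visible in the example output:

-- illustrative only: register the histogram once, e.g. from init_worker_by_lua*
local prometheus = require('prometheus').init('prometheus_metrics')

local request_duration = prometheus:histogram(
  'nginx_http_request_duration_seconds',  -- hypothetical metric name
  'HTTP request latency',
  { 'host' })

-- to be called from a log-phase handler (log_by_lua*), where $request_time is final
local function observe_duration()
  local duration = tonumber(ngx.var.request_time)
  if duration then
    request_duration:observe(duration, { ngx.var.host })
  end
end

That would sidestep log parsing and make the latency distribution available for scraping on the same metrics port.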

@mikz (Contributor, Author) commented Mar 4, 2018

Looks good to me 👍

@mikz closed this Mar 5, 2018
@mikz deleted the apicast-prometheus branch March 5, 2018 13:31