Add service label on metrics #1024
Conversation
Force-pushed from d1af596 to 5027c0b
  service = "all"
end
upstream_metrics.report(ngx.var.upstream_status, ngx.var.upstream_response_time, service)
report_req_response_time(service)
I know this is unrelated to this PR, but we should probably also introduce a counter of how many requests the gateway processed (with a label for each service), so Prometheus can derive requests per minute, for example.
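A minimal sketch of what such a counter could look like, assuming APIcast's `apicast.prometheus` helper follows the nginx-lua-prometheus API; the metric name `total_requests` and the `report_request` helper are hypothetical, not something this PR adds:

```lua
-- Hypothetical sketch: count requests processed by the gateway, per service.
-- Assumes the apicast.prometheus helper wraps nginx-lua-prometheus, so the
-- returned metric exposes :inc(value, label_values).
local prometheus = require('apicast.prometheus')

local requests_total = prometheus('counter', 'total_requests',
  'Requests processed by the gateway', { 'service' })

local function report_request(service)
  if requests_total then
    requests_total:inc(1, { service or 'all' })
  end
end

return { report_request = report_request }
```

Prometheus could then derive requests per minute per service with `rate()` over that counter.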
I'll leave the approval to David, but this is excellent work! 👍
spec/metrics/upstream_spec.lua
Outdated
@@ -36,8 +37,8 @@ describe('upstream metrics', function()
  end)

  it('adds the latency to the histogram', function()
    upstream_metrics.report(200, 0.1)
    assert.stub(test_histogram.observe).was_called_with(test_histogram, 0.1)
    upstream_metrics.report(200, 0.1, service_metric_name)
We should probably test both with and without a service, to make sure that both cases work correctly.
Also, we should do the same for the status codes counter.
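A rough sketch of how those two cases could look in the spec, reusing the existing stubs; the exact arguments `report` forwards to `observe` depend on the implementation, so the assertions below are placeholders:

```lua
-- Sketch only: exercise report() with and without a service label.
-- How the label reaches the histogram depends on the implementation,
-- so only the call itself is asserted here.
it('adds the latency to the histogram when a service is given', function()
  upstream_metrics.report(200, 0.1, service_metric_name)
  assert.stub(test_histogram.observe).was_called()
end)

it('adds the latency to the histogram when no service is given', function()
  upstream_metrics.report(200, 0.1)
  assert.stub(test_histogram.observe).was_called()
end)
```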
@eloycoto the new metrics should be added to https://github.com/3scale/APIcast/blob/master/doc/prometheus-metrics.md
I think that I found the issue with the test[0]. It is in the executor, where the phases run (APIcast/gateway/src/apicast/executor.lua, lines 40 to 51 in 66168b7).

The main problem is in the log phase when the batch policy or the cache policy is in place. I debugged all the code, and the issue is fixed if we add the following in APIcast/gateway/conf.d/apicast.conf (lines 68 to 71 in 66168b7).

I think that it's safe enough, but I want to make sure that I do not introduce a regression.

[0] https://github.com/3scale/APIcast/blob/master/t/apicast-policy-3scale-batcher.t#L403-L483
Force-pushed from 5027c0b to 190962e
Added the patch on the log phase in apicast.conf; now a different test is failing (lines 621 to 631 in 66168b7).

Should a request to an invalid service reach the log phase at all?
@eloycoto The log phase will run even when the service is invalid. We'll need to take that case into account when updating the Prometheus metrics. Regarding the 3scale batching problem: in that case, …
A couple of suggestions.
@eloycoto and I have been investigating this. Adding … (APIcast/t/apicast-policy-3scale-batcher.t, line 453 in 66168b7). Calling …
Force-pushed from b2fbd03 to 9b2deea
@eloycoto Sorry, I thought about this after approving the PR. Maybe we should use the service name (or the system_name, to guarantee uniqueness) instead of the ID as the label of the metric. I'm thinking about how this can be used to show data in something like Grafana. The ID will not give much information to the people operating APIcast; it's likely that they'll need to go to the admin portal to check which service that ID corresponds to. Maybe we can save them that step.
For the record, we discussed this offline with @davidor: if we use the ID, the number will never change. We are going to keep the ID, since it's the most reasonable option.
I think we discussed this earlier and kind of agreed on using the system name. The system name is immutable and should be used instead of an ID because it gives more context: it is immediately obvious in alerts and dashboards what service it is. IDs have zero context, and it would be impossible for humans to decipher them without a lookup through the 3scale API.
Changes to use system_name added in a new commit. Please review :-)
Hi, I made some performance testing around having multiple services in the labels.

Config: https://gist.githubusercontent.com/eloycoto/37b23b7c74a21645e6a4c37f41197c53/raw/447eaf62434e50cd1e0253b622443f0d6de99d64/3scale-100-nodes.json
Result with extended metrics:
Upstream status metrics:
Without extended metrics, wrk output:
Metrics:
@eloycoto I realised one shortcoming of using just system_name: it is no longer unique in multi-tenant environments. Would it be a big issue to have both the system name and the id? The granularity should be the same, so it shouldn't affect Prometheus.
@mikz your comment makes total sense! I don't like the system_name: a user can't change it, so it can become out of date quite quickly. I would format the label as ${id}-${name}; with that, users can build a custom legend based on the ID or the name. For the legend format they can use a PromQL query with label_replace, so users still get the correct names: https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace

What do you think about this? Are you ok with that?

Regards
@eloycoto I actually thought about using two labels: one for the id and one for the system_name. That way people do not have to bother with label_replace and can just read the system_name as a hint for what the service is, while the series stays unique because of the ID. It is just additional metadata.
@mikz the problem with adding two labels is that the Prometheus matrix will be bigger and Prometheus performance will be affected. I'm happy with two labels if they carry different data, but I'm a bit against having two labels for the same data.
@eloycoto how will it be bigger? The number of combinations is the same whether you make one identifier that contains both or have two identifiers where one is unique. Prometheus will not store anything for empty time series.
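For illustration, a hedged sketch of the two-label approach, assuming the `apicast.prometheus` helper and nginx-lua-prometheus semantics (the function and variable names here are made up): since every service_id maps to exactly one service_system_name, the second label adds metadata without adding series.

```lua
-- Sketch: histogram keyed by both service_id and service_system_name.
-- Cardinality is driven by service_id alone; the system name only
-- annotates the existing series, so the number of series does not grow.
local prometheus = require('apicast.prometheus')

local upstream_response_times = prometheus('histogram', 'upstream_response_time_seconds',
  'Response times from upstream servers', { 'service_id', 'service_system_name' })

local function observe(service, elapsed)
  if upstream_response_times then
    upstream_response_times:observe(elapsed, { service.id, service.system_name })
  end
end
```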
This commit adds support for additional information on some metrics[0]: it lets users know which service a metric belongs to.

Because this can be problematic for Prometheus[1], a new ENV variable (`APICAST_EXTENDED_METRICS`) is defined to enable it. Users that have a few services can run the extended metrics without affecting Prometheus stability; if the APIcast instance has thousands of services, the recommendation is to keep this option disabled. When extended_metrics is disabled, all the service labels will use the "all" value and Prometheus performance will not be affected.

The new metrics list looks like this:

```
bash-4.2$ curl http://localhost:9421/metrics -s | grep service
total_response_time_seconds_bucket{service="all",le="00.100"} 91
total_response_time_seconds_bucket{service="all",le="00.200"} 189
total_response_time_seconds_bucket{service="all",le="00.300"} 220
total_response_time_seconds_bucket{service="all",le="00.400"} 220
total_response_time_seconds_bucket{service="all",le="00.500"} 222
total_response_time_seconds_bucket{service="all",le="00.750"} 223
total_response_time_seconds_bucket{service="all",le="01.000"} 223
total_response_time_seconds_bucket{service="all",le="01.500"} 224
total_response_time_seconds_bucket{service="all",le="02.000"} 224
total_response_time_seconds_bucket{service="all",le="03.000"} 224
total_response_time_seconds_bucket{service="all",le="04.000"} 224
total_response_time_seconds_bucket{service="all",le="05.000"} 224
total_response_time_seconds_bucket{service="all",le="10.000"} 224
total_response_time_seconds_bucket{service="all",le="+Inf"} 224
total_response_time_seconds_count{service="all"} 224
total_response_time_seconds_sum{service="all"} 33.616
upstream_response_time_seconds_bucket{service="all",le="00.100"} 93
upstream_response_time_seconds_bucket{service="all",le="00.200"} 190
upstream_response_time_seconds_bucket{service="all",le="00.300"} 220
upstream_response_time_seconds_bucket{service="all",le="00.400"} 220
upstream_response_time_seconds_bucket{service="all",le="00.500"} 223
upstream_response_time_seconds_bucket{service="all",le="00.750"} 224
upstream_response_time_seconds_bucket{service="all",le="01.000"} 224
upstream_response_time_seconds_bucket{service="all",le="01.500"} 224
upstream_response_time_seconds_bucket{service="all",le="02.000"} 224
upstream_response_time_seconds_bucket{service="all",le="03.000"} 224
upstream_response_time_seconds_bucket{service="all",le="04.000"} 224
upstream_response_time_seconds_bucket{service="all",le="05.000"} 224
upstream_response_time_seconds_bucket{service="all",le="10.000"} 224
upstream_response_time_seconds_bucket{service="all",le="+Inf"} 224
upstream_response_time_seconds_count{service="all"} 224
upstream_response_time_seconds_sum{service="all"} 32.226
upstream_status{status="200",service="all"} 224
```

[0] List of affected metrics:
    total_response_time_seconds
    upstream_response_time_seconds
    upstream_status

[1] https://prometheus.io/docs/practices/naming/#labels

Signed-off-by: Eloy Coto <eloy.coto@gmail.com>
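A rough sketch of how that toggle could drive the label value, assuming `resty.env` exposes the `enabled()` helper used elsewhere in APIcast; this is illustrative, not the actual code in the commit:

```lua
-- Sketch: pick the service label depending on APICAST_EXTENDED_METRICS.
-- When the flag is off, everything is reported under the single "all"
-- label, so the number of Prometheus time series stays constant.
local resty_env = require('resty.env')

local function service_label(service)
  if resty_env.enabled('APICAST_EXTENDED_METRICS') and service then
    return tostring(service.id)
  end
  return 'all'
end
```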
This commit adds the EXTENDED_METRICS integration test.

Signed-off-by: Eloy Coto <eloy.coto@gmail.com>
Co-Authored-By: eloycoto <eloy.coto@gmail.com>
This commit changes the behaviour of the metrics: they now contain the service_id and the service_system_name labels.

```
bash-4.2$ curl http://localhost:9421/metrics -s | grep service
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.200"} 1
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.300"} 1
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.400"} 1
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.500"} 1
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.750"} 2
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="01.000"} 2
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="01.500"} 3
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="02.000"} 3
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="03.000"} 3
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="04.000"} 3
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="05.000"} 3
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="10.000"} 3
total_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="+Inf"} 3
total_response_time_seconds_count{service_id="2555417794444",service_system_name="api"} 3
total_response_time_seconds_sum{service_id="2555417794444",service_system_name="api"} 1.802
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.200"} 1
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.300"} 1
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.400"} 1
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.500"} 2
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="00.750"} 3
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="01.000"} 3
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="01.500"} 3
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="02.000"} 3
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="03.000"} 3
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="04.000"} 3
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="05.000"} 3
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="10.000"} 3
upstream_response_time_seconds_bucket{service_id="2555417794444",service_system_name="api",le="+Inf"} 3
upstream_response_time_seconds_count{service_id="2555417794444",service_system_name="api"} 3
upstream_response_time_seconds_sum{service_id="2555417794444",service_system_name="api"} 1.102
upstream_status{status="200",service_id="2555417794444",service_system_name="api"} 3
```

Signed-off-by: Eloy Coto <eloy.coto@gmail.com>
@mikz the latest changes now contain the service_id and the service_system_name labels per metric :-) Regards.

@eloycoto amazing! 👍

@eloycoto Good job! 👍
…ween phases

In this particular test, the context was not being shared properly between phases. The problem did not show up in this test, but it can cause problems in other tests that rely on always having the correct context in the log phase, for example, as happened here: #1024 (comment)