
[metricbeat] independent events based on le for prometheus histograms #12446

Closed
odacremolbap opened this issue Jun 5, 2019 · 16 comments
Labels: discuss (Issue needs further discussion.), Metricbeat, Team:Integrations (Label for the Integrations team)

@odacremolbap
Contributor

Describe the enhancement:

The Metricbeat Prometheus helper gathers bucket information into a single event, using a structure similar to:

              "my_metric": {
                "sum": 4560.874000000001,
                "count": 1,
                "bucket": {
                  "1000": 0,
                  "2000": 0,
                  "4000": 0,
                  "8000": 1,
                  "16000": 1,
                  "32000": 1,
                  "64000": 1,
                  "128000": 1,
                  "256000": 1,
                  "512000": 1,
                  "+Inf": 1
                }
              }

The values under bucket are hard to work with in visualizations. We mostly rely on count and sum to calculate averages, discarding all the other data.

Describe a specific use case for the enhancement or feature:

Expanding the data above into multiple events, each one containing the le key and value as provided by the Prometheus metric, would make it much more flexible to visualize, at the cost of storage:

              "my_metric": {
                "sum": 4560.874000000001,
                "count": 1
              }

              "my_metric": {
                "le": {
                  "value": "2000",
                  "count": 0
                }
              }

...

              "my_metric": {
                "le": {
                  "value": "+Inf",
                  "count": 1
                }
              }
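The expansion described above can be sketched in Python. This is a minimal illustration of the proposed event shapes, not actual Metricbeat code (the helper itself is Go); the function name is made up for the example:

```python
def expand_histogram(name, metric):
    """Expand one bucketed histogram metric into independent events:
    one event carrying sum/count, plus one event per `le` bucket."""
    events = [{name: {"sum": metric["sum"], "count": metric["count"]}}]
    for le, count in metric["bucket"].items():
        events.append({name: {"le": {"value": le, "count": count}}})
    return events

metric = {
    "sum": 4560.874000000001,
    "count": 1,
    "bucket": {"1000": 0, "2000": 0, "512000": 1, "+Inf": 1},
}
events = expand_histogram("my_metric", metric)
# one sum/count event plus one event per bucket
```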
@odacremolbap odacremolbap added discuss Issue needs further discussion. Metricbeat Metricbeat labels Jun 5, 2019
@odacremolbap odacremolbap self-assigned this Jun 5, 2019
@exekias exekias added the Team:Integrations Label for the Integrations team label Jun 5, 2019
@ruflin
Collaborator

ruflin commented Jun 6, 2019

Wouldn't this explode the number of events we have to store? Basically meaning we have 1 event per entry?

What does le stand for?

@odacremolbap Could you share some of the queries you want to run on the data?

@exekias
Contributor

exekias commented Jun 6, 2019

It would definitely create more events (for the histogram metric type only). We already made this change in the Prometheus collector; this change would align the helper with it.

le stands for less or equal, and it represents a bucket of the histogram; more info can be found here: https://prometheus.io/docs/concepts/metric_types/#histogram. I think the discussion on naming can still happen here, as the helper hides Prometheus logic behind our own format.
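For reference, this is roughly how a histogram looks in the Prometheus text exposition format the helper parses (values illustrative, matching the example above); note that bucket counts are cumulative, which is why le reads as "less or equal":

```
# TYPE my_metric histogram
my_metric_bucket{le="1000"} 0
my_metric_bucket{le="2000"} 0
my_metric_bucket{le="512000"} 1
my_metric_bucket{le="+Inf"} 1
my_metric_sum 4560.874000000001
my_metric_count 1
```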

This change would allow performing terms (group by) aggregations on buckets, to get a separate line per bucket without prior knowledge of the bucket sizes.
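A sketch of the kind of Elasticsearch query this enables, a terms aggregation over the le value with a date histogram per bucket. The field paths here are hypothetical and depend on how the helper ends up naming the fields:

```json
{
  "aggs": {
    "by_le": {
      "terms": { "field": "prometheus.my_metric.le.value" },
      "aggs": {
        "over_time": {
          "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
          "aggs": {
            "bucket_count": { "max": { "field": "prometheus.my_metric.le.count" } }
          }
        }
      }
    }
  }
}
```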

@odacremolbap
Contributor Author

Getting each bucket expanded increases storage requirements, and also CPU/memory/time when processing.

Before measuring that and posting the numbers here for consideration, I am trying to come up with an alternative; no luck so far:

    "coredns": {
      "stats": {
        "dns": {
          "request": {
            "size": {
              "bytes": [
                {
                  "value": 5559,
                  "le": "2047"
                },
                {
                  "le": "16000",
                  "value": 5559
                },
                {
                  "value": 5559,
                  "le": "200"
                },
                {
                  "le": "400",
                  "value": 5559
                },
...

That structure above is not searchable as-is, because of how (non-nested) arrays work in Elasticsearch. But I'm wondering whether a visualization internally retrieves the doc by timestamp and then parses the content; in that case a solution would be close at hand: expanding inside one event instead of sending one event per expanded value of the histogram/summary.
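For what it's worth, making such an array searchable would require mapping it as a nested type. A sketch, with a shortened hypothetical field path:

```json
{
  "mappings": {
    "properties": {
      "bytes": {
        "type": "nested",
        "properties": {
          "le":    { "type": "keyword" },
          "value": { "type": "long" }
        }
      }
    }
  }
}
```

The catch: as far as I know, Kibana visualizations do not support aggregations over nested fields, which would likely rule this layout out for the visualization use case.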

I'm ingesting into Elasticsearch using this template:

                "dns" : {
                  "properties" : {
                    "request" : {
                      "properties" : {
                        "duration" : {
                          "properties" : {
                            "ns" : {
                              "properties" : {
                                "count" : {
                                  "type" : "long"
                                },
                                "le" : {
                                  "ignore_above" : 1024,
                                  "type" : "keyword"
                                },
                                "sum" : {
                                  "type" : "long"
                                },
                                "value" : {
                                  "type" : "long"
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }

In the visualization, the terms aggregation by le shows the right data in the legend, but values use whatever aggregation I choose across all le values.

[screenshot: Kibana visualization with a terms aggregation by le]

There are also some glitches with a 0 le value and an empty one, which probably come from the +Inf label in Prometheus.

I'll try changing the types for all le keys and values to array.
I think this solution is a lot nicer than the many-events one, but I'm not sure how feasible it is.

@odacremolbap
Contributor Author

At the Kubernetes apiserver metricset:
using the standard layout, the resulting document is 514K
using the expanded layout, the resulting document is 1.6M

@ruflin

@exekias
Contributor

exekias commented Jun 13, 2019

Hey @odacremolbap, could you add some more detail? What do you mean by document? I guess this is the index size?

@odacremolbap
Contributor Author

@ruflin
Collaborator

ruflin commented Jun 13, 2019

We should definitely compare the size in Elasticsearch (index size after refresh, etc.).

@odacremolbap
Contributor Author

If this were merged, is there a way in Kibana to manage histograms?

we would need to:

  • group by le
  • get the max value per time bucket
  • add the derivative of value between buckets
    (at this point the result is probably a mess, a bunch of stacked lines that don't provide an intuitive meaning)
  • now the tricky part: we need to take those resulting buckets, de-group the `le` fields, and use that value plus the derivative of the `value` resulting from above to obtain the percentile.

Is that possible at all?
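For reference, the computation those steps describe is essentially what PromQL's histogram_quantile() does: find the bucket containing the target rank in the cumulative counts and interpolate linearly inside it. A rough Python sketch, with illustrative bucket data:

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative Prometheus-style buckets.

    buckets: list of (le, cumulative_count) pairs sorted by le, ending
    with ('+Inf', total_count). Interpolates linearly inside the bucket
    containing the target rank, like PromQL's histogram_quantile().
    """
    total = buckets[-1][1]            # the +Inf count is the total observations
    rank = q * total                  # target cumulative count
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == "+Inf":
                return prev_le        # cannot interpolate into an unbounded bucket
            # linear interpolation inside this bucket
            return prev_le + (float(le) - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = float(le), count
    return prev_le

# Illustrative cumulative buckets:
buckets = [("1000", 0), ("2000", 0), ("4000", 0), ("8000", 1), ("+Inf", 1)]
p50 = histogram_quantile(0.5, buckets)  # interpolates inside the (4000, 8000] bucket
```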

@odacremolbap
Contributor Author

Size-wise, these are the results of 5 minutes of monitoring the apiserver at 10s frequency:

| Format   | Doc Count | Index Size |
|----------|-----------|------------|
| Standard | 14080     | 2mb        |
| Expanded | 50840     | 5.1mb      |

3.6 x number of documents indexed
2.5 x storage size

The metricset contains 2 metrics

  • a counter, which is left unchanged when using the expanded format
  • a histogram, which uses 8 bucketed values and which, in the standard format, is expanded into one event per unique set of labels. When using the expanded format, these events are sub-expanded into 8 events each (one per bucket).
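The measured document counts are consistent with that: assuming counter events are unchanged and each histogram event fans out into 8, the split can be estimated from the two totals (rough arithmetic, rounding aside):

```python
# standard:  counters + histograms     = 14080
# expanded:  counters + 8 * histograms = 50840
# subtracting: 7 * histograms = 36760
histograms = round((50840 - 14080) / 7)  # histogram events per 5 min
counters = 14080 - histograms            # counter events per 5 min
expanded_estimate = counters + 8 * histograms
# expanded_estimate comes out within a few documents of the measured 50840
```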

@ruflin @exekias

@exekias
Contributor

exekias commented Jun 19, 2019

Thank you for running the numbers. Something that caught my attention is the number of documents a single fetch creates: even with the standard layout, it sounds like it creates ~450 docs. I'm wondering what the cause for this is (I seem to remember the API server is quite verbose, as it provides detailed info per client and path). From the other metricsets you are working on, is this amount of data common?

@odacremolbap
Contributor Author

I think the reason is the number of labels and the cardinality of those labels.
In the case of apiserver_request_duration_seconds_bucket, with almost no usage you get more than 6000 different values in Prometheus.

As usage increases in a production environment, the number of metrics might go up.

@exekias
Contributor

exekias commented Jun 19, 2019

Yeah, I'm guessing this is not the general case; what about the other histograms you saw?

@odacremolbap
Contributor Author

kubeproxy and kubescheduler are kept on the low side
kubecontroller has some 400-line histograms on a no-usage test cluster

@exekias
Contributor

exekias commented Jun 19, 2019

I understand by lines you mean documents.

Yeah, I can see how apiserver/kubecontroller can become a problem even with the standard layout; 2MB per 5 minutes sounds like a lot of data. We should decide whether it's worth it, and if so, probably make them optional (disabled by default?)

@odacremolbap
Contributor Author

I meant Prometheus metrics lines.

For histograms we will generate an event for every 8 lines when all labels are considered keylabels. (I think for the apiserver we were missing some labels in keylabels; that's being fixed in #12610.)

@odacremolbap
Contributor Author

Closing; as a workaround, we are selecting a reduced set of histogram buckets in visualizations.
