Histogram scrape performance with multiple labels/label values #216
Comments
If we have performance issues with serialising the metrics data, I think it makes sense to take a look at either our storage format or our algorithm 🙂 A test case would be awesome, so we have a baseline. Maybe look into adding benchmarks as well to the repo? /cc @siimon @zbjornson
Okay -- let me look at modifying our benchmark program for histograms to provide a baseline (and allow others to check I'm not missing something obvious!)
Apologies for the delay. I think I've got a relatively tidy benchmark program (warning -- my coding style is awful) -- I just want to add a few options to make it easier to run on single test cases (currently iterates over everything). I'll fork the archive and add a new histogramBenchmark.js into tests (probably tomorrow). Here is a snippet of the current output (not double checked yet!) showing the times taken to complete the generation of the text format data for return (apologies for length). Hopefully you'll see what I'm talking about with the increase of scrape times with number of labels.
Hmm. I think there is a bug in the multipliers above, so the numbers will be wrong...
Please give 11.1.2 a try 🙂 We can reopen if it's still an issue
Thanks. I'm seeing about a doubling in performance. Very nice! The performance is also much more linear, which is great. The benchmark test program is sitting in my fork in the benchmark directory. Shall I PR it into the main branch? Running as: Specifically: Previously (I think 11.1.1), and snipped from a different log (hence slight differences): I'll try and run some more exhaustive tests on our detailed use cases tonight.
Also see #220. I'd love to add a benchmark to this repo, feel free to PR it!
@SimenB, @siimon: I've been tinkering with performance using the benchmark tools in 11.1.3 (great stuff -- many thanks for them!). I'm pre-computing the Prometheus label string (mostly) in histogram, and I'm seeing about a 33% performance gain from this, at the expense of a 10% performance drop against getMetricsAsJSON and, obviously, an increase in memory to hold the precomputed labels. I'd appreciate comments on these trade-offs. Code is currently in https://github.com/KevinAMurray/prom-client/tree/performance-metrics-scrape (The hash used is very close to the Prometheus labels, but it doesn't do the string escaping. It may be possible to use the hash instead, at a lesser memory overhead. I've yet to see what the performance costs of that would be. Hope to create a branch to check that idea out sometime.)
@nowells I've also expanded the benchmarks a bit more now.
@KevinAMurray awesome! I can't wait to see how they evolve. Anything I can do to help?
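To make the idea of pre-computing the label string concrete, here is a minimal sketch. It is not the code from the fork linked above and not prom-client's API: `escapeLabelValue`, `buildLabelString` and `LabelStringCache` are hypothetical names, and the unescaped `name:value` cache key is only an assumption about what a label hash might look like. The escaping itself follows the Prometheus text exposition format (backslash, double quote and newline are escaped in label values).

```javascript
// Escape a label value as required by the Prometheus text exposition format.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Hypothetical helper: render labels as name="value",name="value" (sorted by name).
function buildLabelString(labels) {
  return Object.keys(labels)
    .sort()
    .map((name) => `${name}="${escapeLabelValue(labels[name])}"`)
    .join(',');
}

// Hypothetical cache: the escaped string is built once per label set and then
// reused on every scrape, trading memory for serialisation time.
class LabelStringCache {
  constructor() {
    this.cache = new Map();
    this.builds = 0; // counts how often the expensive path ran
  }

  get(labels) {
    // Cheap, unescaped key (an assumption, loosely modelled on a label hash).
    const key = Object.keys(labels)
      .sort()
      .map((name) => `${name}:${labels[name]}`)
      .join(',');
    let rendered = this.cache.get(key);
    if (rendered === undefined) {
      rendered = buildLabelString(labels);
      this.cache.set(key, rendered);
      this.builds += 1;
    }
    return rendered;
  }
}

const cache = new LabelStringCache();
const first = cache.get({ method: 'GET', path: '/a "quoted" path' });
const second = cache.get({ path: '/a "quoted" path', method: 'GET' });
console.log(first);
console.log(cache.builds);
```

Both lookups return the same cached string, so the escaping work happens once per series rather than once per scrape. An unescaped key like this can collide when label values themselves contain `,` or `:`, which is one reason a real implementation would need more care.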
Hi,
We have a use case where we are using histograms together with between 2 and 6 labels (depending on the metric), and where those labels have between 3 and 40 values. What we are seeing is that the time to perform the scrape increases dramatically when we have more labels and more label values, to the point where the scrape operation could take a couple of seconds (for a worst case situation, which I would expect to happen after the service has been running for some time).
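As a rough illustration of why the scrape cost grows so quickly, the worst case can be estimated by multiplying the number of values per label, then multiplying by the lines emitted per series. The sketch below assumes 11 finite buckets per histogram (an assumption chosen for illustration) and uses a hypothetical helper, `worstCaseLines`; each series contributes one line per finite bucket, one `+Inf` bucket line, plus `_sum` and `_count`.

```javascript
// Hypothetical helper: worst-case size of one histogram's text output.
// valuesPerLabel: number of distinct values for each label, e.g. [3, 3].
// bucketCount: number of finite buckets (assumed, not measured).
function worstCaseLines(valuesPerLabel, bucketCount) {
  // Every combination of label values can become its own time series.
  const combos = valuesPerLabel.reduce((acc, n) => acc * n, 1);
  // Finite buckets + the +Inf bucket + _sum + _count.
  const linesPerSeries = bucketCount + 1 + 2;
  return { combos, lines: combos * linesPerSeries };
}

const small = worstCaseLines([3, 3], 11);
const large = worstCaseLines([40, 40, 40], 11);
console.log(small); // { combos: 9, lines: 126 }
console.log(large); // { combos: 64000, lines: 896000 }
```

In practice not every combination is observed, so these are upper bounds, but they show how a few extra labels or label values multiply the work done per scrape.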
Whilst this clearly isn't a problem for Prometheus, in our particular case the node application is single threaded (e.g. we cannot use Cluster or similar). This means that a Prometheus scrape will block that particular instance for a second or two whilst the scrape is happening.
I've explored the prom-client code, and I have made some improvements by effectively pre-computing information when the histogram's time series is created, and by performing some calculations during observe(). I.e. I've slightly shifted the design trade-off away from optimal observe() performance in favour of scrape performance, with an increase in memory requirements (though I think those memory requirements are probably the same as when the scrape is happening).
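One possible shape of such a trade-off, sketched under assumptions and not taken from the actual changes described here: keep the bucket counts cumulative at observe() time, so that the scrape only has to read them out. `CumulativeHistogramSeries` is a hypothetical class, not part of prom-client.

```javascript
// Hypothetical series object: does slightly more work per observe() so that
// the scrape path needs no accumulation pass over the buckets.
class CumulativeHistogramSeries {
  constructor(upperBounds) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.cumulative = new Array(this.upperBounds.length).fill(0);
    this.sum = 0;
    this.count = 0;
  }

  observe(value) {
    // Walk down from the largest bound and bump every bucket that contains
    // the value (Prometheus buckets are cumulative: value <= upper bound).
    for (
      let i = this.upperBounds.length - 1;
      i >= 0 && value <= this.upperBounds[i];
      i--
    ) {
      this.cumulative[i] += 1;
    }
    this.sum += value;
    this.count += 1;
  }

  // Scrape-time view: counts are already cumulative, +Inf equals the count.
  snapshot() {
    const buckets = this.upperBounds.map((le, i) => ({
      le: String(le),
      value: this.cumulative[i],
    }));
    buckets.push({ le: '+Inf', value: this.count });
    return { buckets, sum: this.sum, count: this.count };
  }
}

const series = new CumulativeHistogramSeries([1, 5, 10]);
series.observe(0.5);
series.observe(3);
series.observe(20);
const snap = series.snapshot();
console.log(snap);
```

The cost per observe() grows with the number of buckets the value falls into, which is the kind of observe-versus-scrape trade-off discussed above; whether it pays off depends on how often values are observed relative to how often the metrics are scraped.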
Have other people encountered this sort of performance issue before, and are there other solutions? (As mentioned, effectively we can't use something like Cluster.)
I'm happy to provide code snippets (or a fork) as the basis of improvements or for more discussions.
Thanks,
Kevin