Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RollingSummary to prevent summary saturation #306

Merged
merged 7 commits into from
Jun 28, 2022
Merged

Add RollingSummary to prevent summary saturation #306

merged 7 commits into from
Jun 28, 2022

Conversation

danielnelson
Copy link
Contributor

This PR attempts to address the issue described in #269 where Prometheus
Summaries become saturated. I added a RollingSummary type that manages a list
of Summary so that old values can be retired.

There is a breaking change in AtomicStorage:

- type Histogram = Arc<AtomicBucket<f64>>;
+ type Histogram = Arc<AtomicBucket<(f64, Instant)>>;

I tested using a program that produces a test pattern, every two minutes it
switches between uniform random values between 0-100 and 100-200.

I also graph the summary rate(sum[1m]) / rate(count[1m]) for comparison to
the 50th quantile.

Before this patch:

2022-06-09-163831_1347x637_scrot

With this patch:

2022-06-09-160026_1345x635_scrot

@danielnelson
Copy link
Contributor Author

In a staging environment I have ran into this error:

panicked at ‘index out of bounds: the len is 10939 but the index is 10939’, /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sketches-ddsketch-0.1.3/src/store.rs:155:22

I'll try to find a simple way to reproduce.

@tobz
Copy link
Member

tobz commented Jun 10, 2022

First off, thank you for submitting a solution for this problem. :)

I'll need to see what the benchmarks look like, but the tracking of a timestamp per recorded value in a histogram is a bit of a red flag to me, performance-wise.

There's some thoughts I have in mind for how to avoid it while still providing proper time bucketing/rolling, but I'll wait to see what the benchmarks look like first before getting into any refactoring.

@tobz
Copy link
Member

tobz commented Jun 19, 2022

Alright, apologies for the delay.

After looking at the change, and running metrics-benchmark this adds more overhead than I'm comfortable with. Throughput seems to drop by around ~15%, which means it's actually higher because the benchmark app basically does a loop of "counter, gauge, histogram", so if 1/3" of it is contributing an overall ~15% slowdown.... but anyways, this is also in conjunction with the fact it's forcing everyone else using Handle to have to upgrade in the future and pay that cost, even if they don't need to.

I think what I'd want to see is localizing the change to metrics-exporter-prometheus. This should be doable without out much additional effort by creating a new Storage implementation in metrics-exporter-prometheus that emulates what you did to AtomicStorage<K>. You'd similarly have to create something like a newtype wrapper around AtomicBucket<(f64, Instant)> and impl HistogramFn.. but it should be doable.

As I try and get closer and closer to what a 1.0 release might look like, I'm striving to avoid continually changing every crate in the workspace due to changes that primarily occur in only one of them.

@danielnelson
Copy link
Contributor Author

That makes sense, I will try to make the suggested changes this week.

@danielnelson
Copy link
Contributor Author

I have updated the PR as suggested, please let me know what you think. BTW, the issue linked above with ddsketch is not yet resolved and will need to be before this could be merged.

@tobz
Copy link
Member

tobz commented Jun 24, 2022

Taking a cursory look, this a lot more in line with what I was thinking. 👍🏻

There's probably some performance to be squeezed out if we lowered the timing accuracy by bucketing with a background task on an interval, rather than getting the timestamp at record time, but I don't think this matters a ton for users using the Prometheus exporter anyways.

I'll do a more in-depth path, including looking at the DDSketch issue you mentioned, in the next few days. Thank you for making the requested changes.

@tobz
Copy link
Member

tobz commented Jun 27, 2022

Bah, looks like indexmap is unhappy on Rust 1.55. I'll get the MSRV bumped to 1.56 which should fix that and then we can see what CI looks like for this.

@danielnelson
Copy link
Contributor Author

The ddsketch issue has been fixed in a new 0.2.0 release. With it we can simplify the Summary, we no longer need the negative sketch or zero handling since it is part of the DDSketch type. I'll update this PR to contain the changes.

@tobz
Copy link
Member

tobz commented Jun 28, 2022

I updated the CI workflow to bump the MSRV to 1.56.1 which seems to get the MSRV-specific unit tests passing again, so I think you should be good if you merge in main.

Copy link
Member

@tobz tobz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was finally able to take a look through, and this looks solid to me.

Most of my thoughts/concerns are nits or can be addressed in the future (i.e. configurable summary window/bucket width, enforcing monotonicity of samples to drop a lot of the logic in add, etc) and nothing here feels like a design pitfall.

@tobz tobz merged commit 1b95311 into metrics-rs:main Jun 28, 2022
@tobz
Copy link
Member

tobz commented Jun 28, 2022

Thanks for your contribution! <3

@tobz tobz added C-util Component: utility classes and helpers. E-intermediate Effort: intermediate. T-bug Type: bug. T-refactor Type: refactor. S-awaiting-release Status: awaiting a release to be considered fixed/implemented. labels Jul 20, 2022
@tobz
Copy link
Member

tobz commented Jul 21, 2022

Released as metrics-exporter-prometheus@v0.11.0.

@tobz tobz removed the S-awaiting-release Status: awaiting a release to be considered fixed/implemented. label Jul 21, 2022
readysetbot pushed a commit to readysettech/readyset that referenced this pull request May 11, 2023
- Add metrics::SharedString and update function signatures in
  describe_* methods to have desc be a SharedString.
- Update metrics and friends to their latest (0.21, 0.12.1, 0.15)
- Update broken test due to a fixed bug (i assume) in the summary
  calculation: see: metrics-rs/metrics#306
- Unify MYSQL_* env variables used in benchmarks and generator
- benchmarks: Add .into() for each of the metrics descriptions
- .with_push_gateway requires user/pass to be passed, passing in None
  for now but this should probably become configurable as an env var
- Update tikv-jemallocator to 0.5.0 to remove a future incompat
  message from Rust

Change-Id: If92d136dd76c29b90623c36e7ff5ceb892089f01
Reviewed-on: https://gerrit.readyset.name/c/readyset/+/4892
Tested-by: Buildkite CI
Reviewed-by: Griffin Smith <griffin@readyset.io>
jvimal-eg added a commit to edgeguard-dev/metrics that referenced this pull request Dec 24, 2023
* fix prometheus metric name and label key sanitizer (metrics-rs#296)

Co-authored-by: Toby Lawrence <tobz@users.noreply.github.com>

* metrics-util: add ability to collect metrics on a per-thread basis via DebuggingRecorder (metrics-rs#299)

Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>

* (cargo-release) version 0.12.1

* Improve handling of the global recorder instance (metrics-rs#302)

This gets rid of the dangerous `static mut`, adds more comments
about the code, relaxes the orderings and documents the unsoundness
of the `clear` function, in addition to marking it unsafe.

It implements a small once_cell-like abstraction.

Co-authored-by: Toby Lawrence <toby@nuclearfurnace.com>

* Fix `metrics::Cow` provenance issue (metrics-rs#303)

* Quantile Remapping Fix (metrics-rs#304)

* update changelogs prior to release

* (cargo-release) version 0.19.0

* (cargo-release) version 0.13.0

* (cargo-release) version 0.11.0

* (cargo-release) version 0.10.0

* Update CHANGELOG.md

* Update README.md

* Update ci.yml

* Update ci.yml

* Add RollingSummary to prevent summary saturation (metrics-rs#306)

* Update quanta requirement from 0.9.3 to 0.10.0 (metrics-rs#301)

Updates the requirements on [quanta](https://github.com/metrics-rs/quanta) to permit the latest version.
- [Release notes](https://github.com/metrics-rs/quanta/releases)
- [Changelog](https://github.com/metrics-rs/quanta/blob/main/CHANGELOG.md)
- [Commits](metrics-rs/quanta@v0.9.3...v0.10.0)

---
updated-dependencies:
- dependency-name: quanta
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Document CI enforced msrv in readme and rust-version fields (metrics-rs#311)

* Change description to be SharedString (metrics-rs#312)

* Update hashbrown requirement from 0.11 to 0.12 (metrics-rs#266)

Updates the requirements on [hashbrown](https://github.com/rust-lang/hashbrown) to permit the latest version.
- [Release notes](https://github.com/rust-lang/hashbrown/releases)
- [Changelog](https://github.com/rust-lang/hashbrown/blob/master/CHANGELOG.md)
- [Commits](rust-lang/hashbrown@v0.11.0...v0.12.0)

---
updated-dependencies:
- dependency-name: hashbrown
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Toby Lawrence <tobz@users.noreply.github.com>

* Shims remaining AtomicU64 usage (metrics-rs#313)

* Update parking_lot requirement from 0.11 to 0.12 (metrics-rs#268)

Updates the requirements on [parking_lot](https://github.com/Amanieu/parking_lot) to permit the latest version.
- [Release notes](https://github.com/Amanieu/parking_lot/releases)
- [Changelog](https://github.com/Amanieu/parking_lot/blob/master/CHANGELOG.md)
- [Commits](Amanieu/parking_lot@0.11.0...0.12.0)

---
updated-dependencies:
- dependency-name: parking_lot
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Toby Lawrence <tobz@users.noreply.github.com>

* update changelogs

* (cargo-release) version 0.13.1

* add std atomics handle support back + changelogs

* (cargo-release) version 0.20.0

* (cargo-release) version 0.14.0

* (cargo-release) version 0.11.0

* changelog

* (cargo-release) version 0.12.0

* (cargo-release) version 0.7.0

* update changelog

* (cargo-release) version 0.6.0

* update changelog

* (cargo-release) version 0.20.1

* Remove incorrect return info (metrics-rs#316)

* Add a `KeyName` argument to `LabelFilter::should_include_label` (metrics-rs#342)

* rewind

* (cargo-release) version 0.13.0

* Use std sync primitives instead of parking_lot. (metrics-rs#344)

* Use std::sync::Mutex instead of parking_lot::Mutex.
* Bump MSRV to 1.60.0 for CI.

Co-authored-by: Toby Lawrence <tobz@users.noreply.github.com>

* clean up github CI workflow + rust-toolchain.toml to match quanta

* update portable-atomic to 1.0

* bump to prost 0.11 + fix spelling issue in metrics-tracing-context

* bump MSRV to 1.61 + update hashbrown/ahash deps

* update termion/ordered-float and pin predicates-* to avoid stupid 1.64 MSRV

* update syn

* update quanta + clean up semver notation in cargo.toml

* cleanup README wording for MSRV policy

* Add white background to splash image (metrics-rs#348)

* Install protoc in CI.

* fix spelling error in CI workflow

* Update ci.yml

* no need to run CI against macOS/Windows specifically

* Bring the metrics-observer protobufs up to date (metrics-rs#345)

* Use global paths for macros (metrics-rs#358)

* fix changes to fully qualified metrics crate ref change in macros

* update changelog

* update tui to 0.19

* bump numpy dep

* impl std::error::Error for metrics_exporter_tcp::Error

* changelog

* tweak test to avoid unused code

* rework 32 vs 64-bit arch atomics support + a lot of import consolidation/cleanup

* const-ify some stuff + rewrite some comments for the global recorder init code

* clean up clippy lints

* bump MSRV to 1.61.0

* (cargo-release) version 0.7.0

* (cargo-release) version 0.21.0

* (cargo-release) version 0.15.0

* (cargo-release) version 0.14.0

* (cargo-release) version 0.8.0

* (cargo-release) version 0.12.0

* allow publishing

* (cargo-release) version 0.2.0

* make it publishable pt 2

* push-gateway support authentication (metrics-rs#366)

* update changelog + fix failing feature check test

* (cargo-release) version 0.12.1

* feat(util): new helper type for recovering recorder after installing it (metrics-rs#362)

* Update aho-corasick to 1.0.

* Impl From<std::borrow::Cow> for KeyName (metrics-rs#378)

* changelog

* Add `Borrow` impl to `KeyName` (metrics-rs#381)

* bump deps + clippy stuff

* changelog

* pin hashbrown to avoid MSRV bump

* (cargo-release) version 0.15.1

* (cargo-release) version 0.21.1

* Add metadata to metrics (metrics-rs#380)

* add https support in Prometheus gateway (metrics-rs#392)

* migrate from procedural to declarative macros (metrics-rs#386)

* Make `Unit` methods return `&'static str` where possible (metrics-rs#393)

* simplify macros (metrics-rs#394)

* Add support for `Arc<T>` to `metrics::Cow<'a, T>`. (metrics-rs#402)

* Add support for `tracing::Span::record`ed fields in `metrics-tracing-context` (metrics-rs#408)

* fix(prom): `RollingSummary` overflow panic (metrics-rs#423)

* update CHANGELOG

* (cargo-release) version 0.12.2

* Remove unneeded unsafe from test (metrics-rs#427)

* Fix feature check in CI (metrics-rs#428)

* update changelogs/release notes

* fix unsafe/incorrect crossbeam-epoch usage in Block<T>

* Add a clippy CI check (metrics-rs#416) (metrics-rs#417)

* Try other resolved addresses if the first one fails (metrics-rs#429)

* update changelog

* Update quanta requirement from 0.11 to 0.12 (metrics-rs#396)

Updates the requirements on [quanta](https://github.com/metrics-rs/quanta) to permit the latest version.
- [Changelog](https://github.com/metrics-rs/quanta/blob/v0.12.0/CHANGELOG.md)
- [Commits](metrics-rs/quanta@v0.11.0...v0.12.0)

---
updated-dependencies:
- dependency-name: quanta
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Toby Lawrence <tobz@users.noreply.github.com>

* Update ordered-float requirement from 3.4 to 4.2 (metrics-rs#421)

Updates the requirements on [ordered-float](https://github.com/reem/rust-ordered-float) to permit the latest version.
- [Release notes](https://github.com/reem/rust-ordered-float/releases)
- [Commits](reem/rust-ordered-float@v3.4.0...v4.2.0)

---
updated-dependencies:
- dependency-name: ordered-float
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* fix quantile api

* missing clear

* reintroduce old way to avoid big refactor

---------

Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Shaoyuan CHEN <chensy20@mails.tsinghua.edu.cn>
Co-authored-by: Toby Lawrence <tobz@users.noreply.github.com>
Co-authored-by: Toby Lawrence <toby@nuclearfurnace.com>
Co-authored-by: nils <48135649+Nilstrieb@users.noreply.github.com>
Co-authored-by: Dan Wilbanks <78116928+dfwilbanks395@users.noreply.github.com>
Co-authored-by: Daniel Nelson <daniel@wavesofdawn.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Lucas Kent <rubickent@gmail.com>
Co-authored-by: Fredrik Enestad <fredrik@enestad.com>
Co-authored-by: Christopher Hunt <huntchr@gmail.com>
Co-authored-by: zohnannor <35764628+zohnannor@users.noreply.github.com>
Co-authored-by: Sinotov Aleksandr <bratsinot@gmail.com>
Co-authored-by: Jacob Kiesel <kieseljake@live.com>
Co-authored-by: C J Silverio <ceejceej@gmail.com>
Co-authored-by: CinchBlue <aytasato@gmail.com>
Co-authored-by: JasonLi <lijingxuan92@126.com>
Co-authored-by: Mostafa Elhemali <mostafa.elhemaly@gmail.com>
Co-authored-by: Harry Barber <106155934+hlbarber@users.noreply.github.com>
Co-authored-by: Qingwen Zhao <864593343@qq.com>
Co-authored-by: david-perez <d@vidp.dev>
Co-authored-by: Lucio Franco <luciofranco14@gmail.com>
Co-authored-by: Nicolas Stinus <nicolas.stinus@gmail.com>
Co-authored-by: Valeriy V. Vorotyntsev <valery.vv@gmail.com>
mnpw pushed a commit to mnpw/metrics that referenced this pull request Mar 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-util Component: utility classes and helpers. E-intermediate Effort: intermediate. T-bug Type: bug. T-refactor Type: refactor.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants