-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
monitor: Create realtime ratio histogram metric #1989
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just two minor suggestions.
Mind adding an entry to CHANGELOG_PENDING since this is a user facing update? See the docs for more info on the entry format. I think we should add an entry under "Breaking changes" as well since I think any Prometheus/Grafana tools that depend on the individual tier metrics won't work any more since the individual tier metrics are no longer recorded?
Also, related to the last point, we'll need to update the Grafana templates in docker-livepeer for the same reason right?
This will allow for simpler queries to be performed to evaluate our transcoding performance. Using Prometheus' histogram_quantile() [1] function, we can actually plot and alert on a concrete p99 value for the transcoding realtime ratio. [1] https://prometheus.io/docs/prometheus/latest/querying/functions/#histogram_quantile
Co-authored-by: Yondon Fu <yondon.fu@gmail.com>
Co-authored-by: Yondon Fu <yondon.fu@gmail.com>
71c35ea
to
e2d2975
Compare
@yondonfu Picking this back up!
Done!
Actually, we haven't stopped measuring the individual tier metrics, we've only started measuring the new histogram metric as well. Notice that here we send both of the I plan on updating our realtime alert query to use the histogram, but it will continue to work the same way as before until then, no need to worry about breaking them with this change. (although they are currently disabled due to the Grafana/VictoriaMetrics upgrade, we've got to look into that) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@victorges Thanks for the explanation re: no breaking change!
LGTM! Looks like some of the commits can be squashed before merge.
Thanks! Merged with squash to avoid the low-level history in |
What does this pull request do? Explain your changes. (required)
This will allow for simpler queries to be performed to evaluate our transcoding
performance. Using Prometheus' histogram_quantile() function, we can actually
plot and alert on a concrete p99 value for the transcoding realtime ratio.
e.g.: "p99" (actually p01 since higher is better) of transcode real-time ratio as
seen by
ewr
orchs: [query]Ended up coding this when digging deeper on the real-time alert, to make it less noisy.
Specific updates (required)
http_client_segment_transcoded_realtime_ratio
metric tomonitor/census.go
How did you test each of these updates (required)
Not really sure how to test the Prometheus logic locally, but the change is pretty simple.
Also cross-checked the implementation with the
transcode_score
metric which is almostthe same idea (it's only measured on a different step in the code, on orchs and transcoders afaict)
Does this pull request close any open issues?
Closes https://github.com/livepeer/livepeer-infra/issues/627 (still need to update the actual alert definitions though)
Checklist:
make
runs successfully./test.sh
pass