-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a Counter Aggregator #119
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a thorough review - this is a review just to point our the existing concern I have so far.
@knyar I have a higher level question. What is the motivation to do this, when Prometheus itself can do the same aggregation and many more? We want to keep the sidecar nimble and predictable, and I'm concerned that this feature and the new metrics are above the minimally required. |
I don't think there is a way in Prometheus to export a cumulative metric that corresponds to the sum of multiple counters if those counters can be reset independently. Recording rules can only be used to export a sum of per-second rates (e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anton, based on your explanation, I believe that this feature is useful, thanks for sending the change!
I see that you added a new mechanism for talking to the monitoring backend, and I'd like to understand better the reasons. The sidecar today uses the Stackdriver Monitoring API library and manages its own queues for batching, retries, etc. This PR introduces OpenCensus as a second active path with its own queuing and retry logic. It's not obvious how these two will interact. Can you think what it would take to merge the aggregator into the existing export path instead?
I think it's reasonable to use OpenCensus for exporting the collected data to Stackdriver, but I would want to see it as a replacement, not as a second active path. I also expect us to have to perform a more exhaustive test before we can migrate the export path to OpenCensus, because it's likely that the performance and resource usage will change.
Thanks for your comments, @jkohen.
Sure, the main reason to use OpenCensus here is that it already implements everything one needs to export cumulative metrics to Stackdriver: tracks start time, creates Stackdriver protos, flushes data according to a predefined interval. As you've mentioned, the sidecar is already instrumented to export OpenCensus metrics, and the two paths don't really overlap or interact with one another. After taking a quick look at the existing output path, I believe we will need to add the following additional functionality to Counter Aggregator to make it use QueueManager rather than OpenCensus for reporting aggregated counters to Stackdriver:
Overall, I think seriesCache + QueueManager are well optimized for high-bandwidth proxying of samples from Prometheus WAL to Stackdriver, but are too low level for a simple task of exporting a small number of counters to Stackdriver, for which OpenCensus provides a simpler and more intuitive API. It seems like integrating counter aggregator with QueueManager will bring additional complexity without significant benefits, but I am happy to take a stab at doing it if it's something you feel strongly about. |
@jkohen, @StevenYCChou, I've taken another look at this PR, and I believe I've resolved all comments that had a clear call for action. This is currently blocked on your decision on whether it's ok for Counter Aggregator to use OpenCensus (which would be my preference, since it's simpler), or if you'd like me to integrate it with QueueManager. |
Sorry for the delay. @StevenYCChou and I discussed your response offline, and I'm convinced to accept the new path you added. However, I still want to see some validation of performance and resource usage, because we are pulling in a large dependency with OpenCensus. We have a benchmark here: https://github.com/Stackdriver/stackdriver-prometheus-sidecar/tree/master/bench You'd have to trigger the codepaths you added. Let us know if you encounter issues, because we haven't used it in a while. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code generally looks good to me. I have a few minor comments.
Counter Aggregator allows exporting a sum of multiple Prometheus counters to Stackdriver as a single CUMULATIVE metric.
I've made a few changes necessary to run the benchmark with counter aggregator enabled and kept it running for 30 minutes. I am not sure what data you'd like to see, but in this Drive folder (shared with you, but not public) you can find the output along with metric dumps for Prometheus server, old sidecar (before this PR) and the new sidecar. Please let me know if you'd like me to add these changes to this PR as well. |
Thanks for running the benchmark. It'd be interesting to look at resource usage for both sidecar versions, maybe other metrics like RPC rate, just for sanity. Can you share it? It should be in Stackdriver :) |
I believe the benchmark is executed against a fake Stackdriver client in CPU and RAM usage is comparable. In
In
RPC stats are in the same files, but there's too much data for me to attach it here. |
Please let me know if you have any other comments, or if you would like me to squash the commits for this PR to be merged. I am also curious whether you'd like my changes to the benchmark to be included in this PR or not. Thanks! |
I or YCChou will squash before merge, don't worry.
Let's look at these separately.
Looks good. I also spot checked another pair of files 15 minutes into the test and I see the same. Not as cool as a time series, but it does the job for now. Thank you! |
@StevenYCChou can you confirm that this OK with you? |
Yes, LGTM. |
Counter Aggregator allows exporting a sum of multiple Prometheus counters to Stackdriver as a single CUMULATIVE metric.
This can be useful if you have so many counter metrics (or metrics with high label cardinality) in Prometheus that importing all of them to Stackdriver directly might be too expensive, however you would like to have a total cumulative counter that corresponds to the sum of those counters.