-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Metrics to the snapshot controller #142
Comments
/assign |
/kind feature |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Fixed in #227 |
@xing-yang @yuxiangqian You should open one a similar bug on the common snapshot controller |
Or I guess we can leave this open to track that |
Sure. |
Plan to add following metrics into common controller:
|
You can combine both (latency and count) in to a single metric. |
Count is strange for a total latency metric that you want to capture across retry loops in the controller. |
Maybe we can get some guidance from sig-instrumentation in this area. How do we map the K8s reconciliaton model to typical metrics patterns which are generally call-based? |
I share the some concern as Michelle, but @saad-ali what's the major benefits of combining them into one? one less TS? |
My recommendation is based on this conversation: kubernetes-csi/external-provisioner#386 (comment) |
Its certainly possible to save the error counter by introducing a new label into the latency histogram. Unlike the csi metrics, there is no well-defined return code ATM in snapshot-controller which could serve as another label. To achieve that, a more well-defined error reporting mechanism needs to be introduced in the snapshot-controller. I will do some more digging.
since the histogram metric tends to record end to end latency of an operation, the three types could also be used to determinate whether the operation has ended or not. |
A histogram metric is actually a series of metrics and it includes a count metric. So having another metric for counting is redundant; this is an artifact of the way prometheus exposition format expresses histograms. |
@logicalhan @saad-ali the challenge here is that this metric is a total latency metric that captures the time across multiple retry loops, so the "count" is really only the number of Snapshot objects that were created. How much value do you see in also having a per-retry loop metric? In reality, I found that per-loop metric at least for latency has not been very useful because it doesn't map to the user-perceived operations. |
this is slightly different than a pure count, it's an error count not total operation count. It also makes the situation more complex as the end to end latency metric tries to record the whole duration of an operation from the controller took it to either it succeeded or permanently failed. An operation might go through multiple reconcile loops before it's done, and same errors could happen. I probably need to add a cache to record starting timestamps for each operation, and mixing error count into the cache might bring another complexity as we do not have well defined error status as CSI does. |
/reopen |
@yuxiangqian: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
this is not done yet. #280 only supplies the metrics utility functions, implementation in controller is still needed. |
cc @AndiLi99 |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
/assign @ggriffiths |
Add Metrics to snapshot controller. Need to be consistent with Metrics in PV/PVC controller.
@jingxu97 to check if Shawn (?) can work on this.
The text was updated successfully, but these errors were encountered: