add metric to common controller - basic metrics utilities #280
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: yuxiangqian. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
In the release note, can you rename "common controller" to "snapshot controller"?
done, thanks
/retest
/retest
In the release notes, can you just include the following, as this will be included in the release notes in the future:
And move the remaining message to the "Special notes for your reviewer" section:
@xing-yang comments addressed.
pkg/metrics/metrics.go (Outdated)
// Failed is an operation state which means the controller has encountered
// an error which is not recoverable and the controller has marked a permanent
// failure of the operation.
Failure OperationState = "Failure"
In what situations does the controller not retry?
Do we also count user cancellation as a failure?
ATM, when a CSI call fails permanently, the controller stops retrying the call to CSI.
When you say user cancellation, are you referring to deletion of a VS?
What does "failed permanently" mean? I don't think we should stop retrying until the user cancels by deleting the VS; giving up is a Kubernetes anti-pattern.
This failure refers to the failure of one try; the controller should always retry. We do have a bug to be fixed: if create snapshot fails very early, before the request is sent to the CSI driver, there will be no retries. We plan to fix that when fixing the requeue logic.
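For context, the usual client-go pattern is to requeue the key with rate limiting whenever a sync attempt fails, so one failed try never ends retries. A generic sketch of that pattern, not the snapshot controller's actual code (`syncSnapshot` is a placeholder):

```go
package main

import (
	"k8s.io/client-go/util/workqueue"
)

// processNextItem sketches the standard requeue-with-backoff loop; the real
// controller's sync function stands behind syncSnapshot here.
func processNextItem(queue workqueue.RateLimitingInterface, syncSnapshot func(key string) error) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	if err := syncSnapshot(key.(string)); err != nil {
		// One failed try: requeue with exponential backoff so the controller
		// keeps retrying instead of marking a permanent failure.
		queue.AddRateLimited(key)
		return true
	}
	// Success: reset the backoff tracking for this key.
	queue.Forget(key)
	return true
}
```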
One concern is that the cache will grow linearly with the number of failed VolumeSnapshots over time. Also, when should we report an error metric if there is no permanently failed VolumeSnapshot?
"volume operation was aborted" - create snapshot will only be aborted when user deletes the snapshot, but how is that a failure?
A couple scenarios I see are:
- There was an invalid configuration, so the user canceled the operation, fixed the config, and then retried
- There was a system error and the user gave up and deleted the object
- Snapshot was taking too long and the user got impatient and gave up.
If reconcile retries CSI calls regardless of the CSI return code (which is not the case ATM), I feel an error metric for the end-to-end operation is not necessary and maybe not that valuable. User cancellation/abort is fairly random behavior; users could cancel at any time after a failure is observed, which in turn makes the latency value of the metric not that useful. On the other hand, the error count could be valuable, as it would give users an idea of how many total cancellations have been observed.
The latency value could tell us how long users have waited before giving up. I think since count is included as part of the histogram anyway, it doesn't hurt to include the latency. Consumers can decide if they want to use the data or not.
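For reference, a Prometheus histogram already exports a cumulative `_count` and `_sum` alongside its buckets, so recording the latency of cancelled operations yields the cancellation count for free. A minimal sketch, assuming a hypothetical metric name (not necessarily the PR's):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical metric name; the PR's actual metrics may differ.
	opDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "snapshot_operation_duration_seconds",
		Help: "End-to-end snapshot operation latency in seconds.",
	})
	reg := prometheus.NewRegistry()
	reg.MustRegister(opDuration)

	// Two cancelled operations, observed with how long the user waited.
	opDuration.Observe(42)
	opDuration.Observe(310)

	// A scrape now exposes the _bucket series plus
	//   snapshot_operation_duration_seconds_count 2   (the cancellation count)
	//   snapshot_operation_duration_seconds_sum 352   (total time waited)
	mfs, _ := reg.Gather()
	h := mfs[0].GetMetric()[0].GetHistogram()
	fmt.Println(h.GetSampleCount(), h.GetSampleSum())
}
```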
(Also, we should fix the controller to retry in all cases. This was one of the design changes we made for beta.)
@msau42 gentle ping
lgtm
Hi Xiangqian,
Michelle asked me to help out with this review. I've added a few comments.
Thanks!
Matt
Thanks Matt for spending time on this.
relax latency check to be greater than instead of within threshold
Thanks! LGTM.
Thanks for implementing the loose histogram comparison, your solution is nice and simple and should be more reliable.
Also thanks for the extra tests :)
LGTM! Just one minor question about the max histogram bucket.
String() string
}

var metricBuckets = []float64{0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 30, 60, 120, 300, 600}
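For context on how buckets like these are typically consumed, a hedged sketch of wiring them into a labeled histogram; the metric and label names below are hypothetical, not necessarily what the PR defines:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch only: plugs the custom buckets into a HistogramVec labeled by
// operation name and outcome.
var operationLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "snapshot_controller_operation_total_seconds",
		Help:    "Total latency of snapshot operations in seconds.",
		Buckets: metricBuckets, // the slice declared above
	},
	[]string{"operation_name", "operation_status"},
)

func recordOperation(name, status string, d time.Duration) {
	// Observations beyond the last explicit bucket (600s) still fall into the
	// implicit +Inf bucket and count toward _sum/_count, so the exact latency
	// is not lost even without a larger bucket.
	operationLatency.WithLabelValues(name, status).Observe(d.Seconds())
}
```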
Do we think 10 mins is enough? Are we capturing upload time too?
Good question. From my observation, using GCE PD snapshots as an example, a 4G PV roughly takes ~5-10s to snapshot (including uploading). Assuming linear interpolation, 600s should be enough to upload a snapshot for volumes between roughly 240G and 480G. Do you think that's sufficient? GCE PD can have disks up to 64TB, and I am a little concerned that such a big bucket would be overkill for most cases; the exact latency is captured by the sum metric anyway.
Snapshot latency varies a lot depending on the storage vendor's implementation. For vendors with an in-place implementation, the latency should be really small, within 1 second, whereas for those that do uploading it takes much longer and depends on the volume's size. WDYT?
@msau42 @yuxiangqian I believe we can't control this timeout from our end, considering that each CSI driver can have its own timeout configured with an arg to the sidecar, isn't it? Either we have to use that timeout value or stick to a general default like 10 mins. Am I missing something here?
This isn't about a timeout value; this is about the maximum bucket range that we want to have for our histograms. This is probably fine for now; we can add more buckets if needed.
LGTM from me too. Thanks @yuxiangqian for this!
/lgtm
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR provides basic metrics utilities which can later be used in the common controller for end-to-end operation metrics recording.
Which issue(s) this PR fixes:
Fixes #142
Special notes for your reviewer:
This PR only contains the basic metrics utility functions needed for end-to-end operational metrics recording in the snapshot controller. A follow-up PR will build on this to add metrics recording to the snapshot controller code. ATM, it should NOT have any user-facing changes.
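As a rough illustration of what such end-to-end operation metrics utilities could look like, here is a hedged sketch; the type and method names are hypothetical and not the PR's actual API:

```go
package metrics

import (
	"sync"
	"time"
)

// OperationState mirrors the outcome labels discussed in the review above.
type OperationState string

// MetricsManager is a hypothetical sketch: it caches the start time of each
// in-flight operation and records the elapsed time when the operation ends.
type MetricsManager struct {
	mu      sync.Mutex
	starts  map[string]time.Time                         // key: object UID + operation name
	observe func(opName, state string, seconds float64) // e.g. a HistogramVec.Observe wrapper
}

func NewMetricsManager(observe func(opName, state string, seconds float64)) *MetricsManager {
	return &MetricsManager{starts: map[string]time.Time{}, observe: observe}
}

// OperationStart records when an operation first began; repeated calls for
// the same key keep the original start time across retries.
func (m *MetricsManager) OperationStart(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.starts[key]; !ok {
		m.starts[key] = time.Now()
	}
}

// RecordMetrics emits the end-to-end latency for the operation and drops the
// cached start time so the cache does not grow unboundedly with failed snapshots.
func (m *MetricsManager) RecordMetrics(key, opName string, state OperationState) {
	m.mu.Lock()
	start, ok := m.starts[key]
	delete(m.starts, key)
	m.mu.Unlock()
	if ok {
		m.observe(opName, string(state), time.Since(start).Seconds())
	}
}
```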
Does this PR introduce a user-facing change?: