Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add metric to common controller - basic metrics utilities #280

Merged
merged 1 commit into from
May 26, 2020

Conversation

yuxiangqian
Copy link
Contributor

@yuxiangqian yuxiangqian commented Mar 23, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR is to provide basic metrics utilities which later could be used in common controller for end to end operation metrics recording.
Which issue(s) this PR fixes:
Fixes #142

Special notes for your reviewer:
This PR only contains the basic metrics utility functions needed for end to end operational metrics recording in snapshot controller. Another PR will be based on this to add metrics recording in snapshot controller code. ATM, it should NOT have any user-facing changes.

Does this PR introduce a user-facing change?:

Adds the basic metrics utility functions needed for end to end operational metrics recording in snapshot controller.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 23, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuxiangqian

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 23, 2020
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
@xing-yang
Copy link
Collaborator

In the release note, can you rename "common controller" to "snapshot controller".

pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Outdated Show resolved Hide resolved
@yuxiangqian
Copy link
Contributor Author

In the release note, can you rename "common controller" to "snapshot controller".

done. thnx

@yuxiangqian
Copy link
Contributor Author

/retest

@yuxiangqian yuxiangqian force-pushed the metrics branch 6 times, most recently from d1acbf8 to a3600be Compare March 26, 2020 07:04
@yuxiangqian yuxiangqian force-pushed the metrics branch 3 times, most recently from 15799ea to 8b0b81a Compare March 27, 2020 20:10
@yuxiangqian
Copy link
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot removed the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 27, 2020
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Show resolved Hide resolved
pkg/metrics/metrics_test.go Outdated Show resolved Hide resolved
@xing-yang
Copy link
Collaborator

In the release notes, can you just include the following as this will be included in the release notes in the future:

Adds the basic metrics utility functions needed for end to end operational metrics recording in snapshot controller.

And move the remaining message to the Special notes for the reviewer section:
Another PR will be based on this to add metrics recording in snapshot controller code. ATM, it should NOT have any user-facing changes.

@yuxiangqian yuxiangqian force-pushed the metrics branch 2 times, most recently from 58bdff6 to cbbdb75 Compare April 9, 2020 22:54
@yuxiangqian
Copy link
Contributor Author

@xing-yang comments addressed.

// Failed is an operation state which means the controller has encountered
// an error which is not recoverable and the controller has marked a permanent
// failure of the operation.
Failure OperationState = "Failure"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what situations does the controller not retry?

Do we also count user cancellation as a failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATM when CSI failed permanently it would stop retrying to call CSI.
when you say user cancellation, are you referring to deletion of a VS?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is failed permanently? I don't think we should stop retrying until user cancels by deleting the VS. It is a Kubernetes anti-pattern

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This failure refers to the failure of one try. The controller should always re-try. We do have a bug to be fixed - If create snapshot fails very early, before the request is sent to the CSI driver, there will be no retries. But we do plan to fix it when fixing the requeue logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one concern is that the cache will grow linearly to the number of failed volumesnapshots over time, and when should we report an error metrics if there is no permanently failed VolumeSnapshot?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"volume operation was aborted" - create snapshot will only be aborted when user deletes the snapshot, but how is that a failure?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple scenarios I see are:

  • There was an invalid configuration so user canceled the operation, fixed the config, and then retry
  • There was a system error and the user gave up and deleted the object
  • Snapshot was taking too long and the user got impatient and gave up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If reconcile will retry CSI calls regardless the CSI return code (which is not the case ATM), I feel having an error metrics for end to end operation is not necessary and maybe not that valuable. User cancellation/abort is a pretty random behavior, they could cancel anytime after a failure has been observed, which will in turn make the metrics latency value not that useful. On the other hand, the error count, could be potentially valuable, as it would give users an idea how many total cancellations have been observed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latency value could tell us information about how long users have waited before giving up. I think since count is included as part of the histogram anyway, it doesn't hurt to include the latency. Consumers can decide if they want the use the data or not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also we should fix the controller to retry in all cases. This was one of the design changes we made for beta)

pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 28, 2020
@yuxiangqian
Copy link
Contributor Author

@msau42 gental ping

@xing-yang
Copy link
Collaborator

lgtm

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 4, 2020
Copy link
Contributor

@mattcary mattcary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Xiangqian,

Michelle asked me to help out with this review. I've added a few comments.

Thanks!

Matt

pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Show resolved Hide resolved
pkg/metrics/metrics_test.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Show resolved Hide resolved
@yuxiangqian
Copy link
Contributor Author

yuxiangqian commented May 15, 2020

Hi Xiangqian,

Michelle asked me to help out with this review. I've added a few comments.

Thanks!

Matt

Thanks Matt for spending time on this.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 15, 2020
relax latency check to be greater than instead of within threshold
Copy link
Contributor

@mattcary mattcary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! LGTM.

Thanks for implementing the loose histogram comparison, your solution is nice and simple and should be more reliable.

Also thanks for the extra tests :)

@msau42
Copy link
Collaborator

msau42 commented May 22, 2020

LGTM! just one minor question about the max histogram bucket

String() string
}

var metricBuckets = []float64{0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 30, 60, 120, 300, 600}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we think 10 mins is enough? Are we capturing upload time too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. From my observation using gce PD snapshot as example, a 4G PV roughly takes ~5-10s to do a snapshot (including uploading), assuming linear interpolation, 600s should be able to upload any snapshot for volumes of size between 240G to 480G. do you think its sufficient? GCE PD can have up to 64TB disks, I am a little concerned to have a such big bucket which seems to be a bit overkill for most of the cases. and the exact latency time is anyway captured by the sum metric.
Snapshot latency, depending on storage vendor implementation, varies a lot. For storage vendors with inplace implementation, the latency should be really small like within 1 second, where as for those do uploading, it takes much longer and depends on a volume's size. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@msau42 @yuxiangqian I believe we can't control this timeout from our end, considering the fact that, all the CSI drivers can have their own timeout configured with an arg to the sidecar, Isn't it? Either we have to use that timeout value or stick to a default value in general like 10 mins. Am I missing something here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't about a timeout value, this is about the maximum bucket range that we want to have for our histograms. This is probably fine for now, we can add more buckets if needed.

@humblec
Copy link
Contributor

humblec commented May 26, 2020

LGTM from me too. Thanks @yuxiangqian for this !

@msau42
Copy link
Collaborator

msau42 commented May 26, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 26, 2020
@k8s-ci-robot k8s-ci-robot merged commit 63029d9 into kubernetes-csi:master May 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Metrics to the snapshot controller
6 participants