add metric to common controller - basic metrics utilities #280
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: yuxiangqian. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
In the release note, can you rename "common controller" to "snapshot controller"?
done, thanks
/retest
/retest
In the release notes, can you just include the following, as this will be included in the release notes in the future:
And move the remaining message to the "Special notes for your reviewer" section:
@xing-yang comments addressed.
pkg/metrics/metrics.go (Outdated)
// Failed is an operation state which means the controller has encountered
// an error which is not recoverable and the controller has marked a permanent
// failure of the operation.
Failure OperationState = "Failure"
In what situations does the controller not retry?
Do we also count user cancellation as a failure?
ATM, when a CSI call fails permanently, the controller stops retrying the call to CSI.
When you say user cancellation, are you referring to deletion of a VS?
What does "failed permanently" mean? I don't think we should stop retrying until the user cancels by deleting the VS; giving up is a Kubernetes anti-pattern.
This failure refers to the failure of one try; the controller should always retry. We do have a bug to be fixed: if create snapshot fails very early, before the request is sent to the CSI driver, there will be no retries. We plan to fix that when fixing the requeue logic.
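For context, the usual client-go pattern is to requeue the key with rate limiting whenever a sync attempt fails, so one failed try never ends retries. A generic sketch of that pattern, not the snapshot controller's actual code (`syncSnapshot` is a placeholder):

```go
package main

import (
	"k8s.io/client-go/util/workqueue"
)

// processNextItem sketches the standard requeue-with-backoff loop; the real
// controller's sync function stands behind syncSnapshot here.
func processNextItem(queue workqueue.RateLimitingInterface, syncSnapshot func(key string) error) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	if err := syncSnapshot(key.(string)); err != nil {
		// One failed try: requeue with exponential backoff so the controller
		// keeps retrying instead of marking a permanent failure.
		queue.AddRateLimited(key)
		return true
	}
	// Success: reset the backoff tracking for this key.
	queue.Forget(key)
	return true
}
```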
One concern is that the cache will grow linearly with the number of failed VolumeSnapshots over time. Also, when should we report an error metric if there is no permanently failed VolumeSnapshot?
"volume operation was aborted" - create snapshot will only be aborted when user deletes the snapshot, but how is that a failure?
A couple scenarios I see are:
- There was an invalid configuration, so the user canceled the operation, fixed the config, and then retried
- There was a system error and the user gave up and deleted the object
- Snapshot was taking too long and the user got impatient and gave up.
If reconcile retries CSI calls regardless of the CSI return code (which is not the case ATM), I feel an error metric for the end-to-end operation is not necessary and maybe not that valuable. User cancellation/abort is fairly random behavior; users could cancel at any time after a failure is observed, which in turn makes the latency value of the metric not that useful. On the other hand, the error count could be valuable, as it would give users an idea of how many total cancellations have been observed.
The latency value could tell us how long users have waited before giving up. I think since count is included as part of the histogram anyway, it doesn't hurt to include the latency. Consumers can decide if they want to use the data or not.
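For reference, a Prometheus histogram already exports a cumulative `_count` and `_sum` alongside its buckets, so recording the latency of cancelled operations yields the cancellation count for free. A minimal sketch, assuming a hypothetical metric name (not necessarily the PR's):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical metric name; the PR's actual metrics may differ.
	opDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "snapshot_operation_duration_seconds",
		Help: "End-to-end snapshot operation latency in seconds.",
	})
	reg := prometheus.NewRegistry()
	reg.MustRegister(opDuration)

	// Two cancelled operations, observed with how long the user waited.
	opDuration.Observe(42)
	opDuration.Observe(310)

	// A scrape now exposes the _bucket series plus
	//   snapshot_operation_duration_seconds_count 2   (the cancellation count)
	//   snapshot_operation_duration_seconds_sum 352   (total time waited)
	mfs, _ := reg.Gather()
	h := mfs[0].GetMetric()[0].GetHistogram()
	fmt.Println(h.GetSampleCount(), h.GetSampleSum())
}
```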
(Also, we should fix the controller to retry in all cases. This was one of the design changes we made for beta.)
@msau42 gentle ping
lgtm
Hi Xiangqian,
Michelle asked me to help out with this review. I've added a few comments.
Thanks!
Matt
Thanks Matt for spending time on this.
relax latency check to be greater than instead of within threshold
Thanks! LGTM.
Thanks for implementing the loose histogram comparison, your solution is nice and simple and should be more reliable.
Also thanks for the extra tests :)
LGTM! Just one minor question about the max histogram bucket.
String() string
}

var metricBuckets = []float64{0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 30, 60, 120, 300, 600}
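For context on how buckets like these are typically consumed, a hedged sketch of wiring them into a labeled histogram; the metric and label names below are hypothetical, not necessarily what the PR defines:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Sketch only: plugs the custom buckets into a HistogramVec labeled by
// operation name and outcome.
var operationLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "snapshot_controller_operation_total_seconds",
		Help:    "Total latency of snapshot operations in seconds.",
		Buckets: metricBuckets, // the slice declared above
	},
	[]string{"operation_name", "operation_status"},
)

func recordOperation(name, status string, d time.Duration) {
	// Observations beyond the last explicit bucket (600s) still fall into the
	// implicit +Inf bucket and count toward _sum/_count, so the exact latency
	// is not lost even without a larger bucket.
	operationLatency.WithLabelValues(name, status).Observe(d.Seconds())
}
```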
Do we think 10 mins is enough? Are we capturing upload time too?
Good question. From my observation, using GCE PD snapshots as an example, a 4G PV roughly takes ~5-10s to snapshot (including uploading). Assuming linear interpolation, 600s should be enough to upload a snapshot for volumes between roughly 240G and 480G. Do you think that's sufficient? GCE PD can have disks up to 64TB, and I am a little concerned that such a big bucket would be overkill for most cases; the exact latency is captured by the sum metric anyway.
Snapshot latency varies a lot depending on the storage vendor's implementation. For vendors with an in-place implementation, the latency should be really small, within 1 second, whereas for those that do uploading it takes much longer and depends on the volume's size. WDYT?
@msau42 @yuxiangqian I believe we can't control this timeout from our end, considering that each CSI driver can have its own timeout configured with an arg to the sidecar, isn't it? Either we have to use that timeout value or stick to a general default like 10 mins. Am I missing something here?
This isn't about a timeout value; this is about the maximum bucket range that we want to have for our histograms. This is probably fine for now; we can add more buckets if needed.
LGTM from me too. Thanks @yuxiangqian for this!
/lgtm
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR provides basic metrics utilities which can later be used in the common controller for end-to-end operation metrics recording.
Which issue(s) this PR fixes:
Fixes #142
Special notes for your reviewer:
This PR only contains the basic metrics utility functions needed for end-to-end operational metrics recording in the snapshot controller. A follow-up PR will build on this to add metrics recording to the snapshot controller code. ATM, it should NOT have any user-facing changes.
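As a rough illustration of what such end-to-end operation metrics utilities could look like, here is a hedged sketch; the type and method names are hypothetical and not the PR's actual API:

```go
package metrics

import (
	"sync"
	"time"
)

// OperationState mirrors the outcome labels discussed in the review above.
type OperationState string

// MetricsManager is a hypothetical sketch: it caches the start time of each
// in-flight operation and records the elapsed time when the operation ends.
type MetricsManager struct {
	mu      sync.Mutex
	starts  map[string]time.Time                         // key: object UID + operation name
	observe func(opName, state string, seconds float64) // e.g. a HistogramVec.Observe wrapper
}

func NewMetricsManager(observe func(opName, state string, seconds float64)) *MetricsManager {
	return &MetricsManager{starts: map[string]time.Time{}, observe: observe}
}

// OperationStart records when an operation first began; repeated calls for
// the same key keep the original start time across retries.
func (m *MetricsManager) OperationStart(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.starts[key]; !ok {
		m.starts[key] = time.Now()
	}
}

// RecordMetrics emits the end-to-end latency for the operation and drops the
// cached start time so the cache does not grow unboundedly with failed snapshots.
func (m *MetricsManager) RecordMetrics(key, opName string, state OperationState) {
	m.mu.Lock()
	start, ok := m.starts[key]
	delete(m.starts, key)
	m.mu.Unlock()
	if ok {
		m.observe(opName, string(state), time.Since(start).Seconds())
	}
}
```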
Does this PR introduce a user-facing change?: