Emit AWS API operation duration/error/throttle metrics #842

wongma7 · 2021-04-20T23:27:59Z

Is this a bug fix or adding new feature? fix #806

What is this PR about? / Why do we need it? This publishes API metrics equivalent to those that the EBS plugin/AWS cloud provider embedded in kube-controller-manager publishes today, such as DescribeInstances/DescribeVolumes call durations, error counts, and throttle error counts.

I'm not well-versed in how people intend to consume the metrics, so this PR is bare-bones: for now it just exposes them over port 80 using "k8s.io/component-base/metrics/legacyregistry".
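
A rough sketch of what "exposing them over HTTP" amounts to, for readers who have not wired up a metrics endpoint before. It uses only the Go standard library so it runs on its own. `stubMetricsHandler` is a hypothetical stand-in for the handler the registry would provide, and both the `/metrics` path and the helper names are assumptions for illustration, not taken from this PR.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// stubMetricsHandler is a hypothetical stand-in for the registry's handler.
// It writes one sample in the Prometheus text format.
func stubMetricsHandler() http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintln(w, `cloudprovider_aws_api_request_duration_seconds_count{request="DescribeInstances"} 1`)
	})
}

// newMetricsMux mounts the metrics handler on a dedicated mux.
// The "/metrics" path is an assumption made for this sketch.
func newMetricsMux(metrics http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/metrics", metrics)
	return mux
}

// scrapeOnce issues one in-memory GET against the handler, so the sketch
// can be exercised without opening a real port.
func scrapeOnce(h http.Handler) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := scrapeOnce(newMetricsMux(stubMetricsHandler()))
	fmt.Println(code)
	fmt.Print(body)
}
```

In a real binary the mux would be passed to `http.ListenAndServe` with the chosen address instead of being exercised through a recorder.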

DIFFERENCES between my implementation and the cloud provider one:

I implement a Complete handler and refer to request.Operation.Name when emitting metrics, as opposed to wrapping every call and referring to a custom request name.
- consequence: instead of emitting e.g. cloudprovider_aws_api_request_duration_seconds_bucket{request="describe_instance",...}, I emit cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",...}.
I use the SDK's IsErrorThrottle function to decide whether to emit a throttle metric, whereas cloudprovider only treats RequestLimitExceeded as a throttle.
- consequence: I don't think it matters for DescribeVolumes/DescribeInstances.
I did not change our retry logic. Unlike the cloudprovider one, which has process-wide retry, ours is per-request. Since the number of requests will differ, so will the metrics, meaning people may have to adjust their alarms.
- consequence: hard to say for certain without testing under conditions where lots of API calls fail and need to be retried. But generally I feel safer relying on the SDK's retry logic.
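
To make the Complete-handler approach above concrete, here is a minimal standalone sketch of the idea, not the code in this PR. It avoids the SDK so it compiles on its own: `apiRequest` is a hypothetical stand-in for the SDK request (which, as far as I know, exposes `Operation.Name`, `Time`, and `Error`), `errThrottled` stands in for the `IsErrorThrottle` check, and plain maps stand in for the Prometheus histogram and counters. Whether throttles should also count as errors is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// apiRequest is a hypothetical stand-in for the SDK request type; it only
// carries the fields this sketch reads.
type apiRequest struct {
	Operation string    // SDK operation name, e.g. "DescribeVolumes"
	Start     time.Time // when the request was created
	Err       error     // final error, nil on success
}

// errThrottled is a hypothetical sentinel standing in for the SDK's
// throttle classification.
var errThrottled = errors.New("RequestLimitExceeded")

// apiMetrics keeps per-operation totals in maps (stand-ins for real
// histogram and counter metrics).
type apiMetrics struct {
	calls     map[string]int
	seconds   map[string]float64
	errs      map[string]int
	throttles map[string]int
}

func newAPIMetrics() *apiMetrics {
	return &apiMetrics{
		calls:     map[string]int{},
		seconds:   map[string]float64{},
		errs:      map[string]int{},
		throttles: map[string]int{},
	}
}

// complete is the single hook run when a request finishes. It keys every
// metric on the operation name rather than a hand-written request label.
func (m *apiMetrics) complete(r apiRequest, now time.Time) {
	m.calls[r.Operation]++
	m.seconds[r.Operation] += now.Sub(r.Start).Seconds()
	if r.Err == nil {
		return
	}
	m.errs[r.Operation]++ // assumption: throttles are also counted as errors
	if errors.Is(r.Err, errThrottled) {
		m.throttles[r.Operation]++
	}
}

// demo feeds three finished requests through the hook.
func demo() *apiMetrics {
	m := newAPIMetrics()
	start := time.Unix(0, 0)
	m.complete(apiRequest{Operation: "DescribeVolumes", Start: start}, start.Add(100*time.Millisecond))
	m.complete(apiRequest{Operation: "DescribeVolumes", Start: start, Err: fmt.Errorf("call failed: %w", errThrottled)}, start.Add(50*time.Millisecond))
	m.complete(apiRequest{Operation: "DescribeInstances", Start: start, Err: errors.New("boom")}, start.Add(200*time.Millisecond))
	return m
}

func main() {
	m := demo()
	fmt.Println(m.calls["DescribeVolumes"], m.errs["DescribeVolumes"], m.throttles["DescribeVolumes"])
}
```

Because the hook sees each request after the SDK's own retries, its counts reflect per-request retry behaviour, which is why the numbers may differ from the cloud provider's.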

What testing is done?

kubectl port-forward deployment/ebs-csi-controller 8080:80 -n kube-system

excerpt:

# HELP cloudprovider_aws_api_request_duration_seconds [ALPHA] Latency of AWS API calls
# TYPE cloudprovider_aws_api_request_duration_seconds histogram
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.1"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.25"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="1"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="2.5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="10"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="+Inf"} 1
cloudprovider_aws_api_request_duration_seconds_sum{request="DescribeInstances"} 0.204679971
cloudprovider_aws_api_request_duration_seconds_count{request="DescribeInstances"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.1"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.25"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.5"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="1"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="2.5"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="5"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="10"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="+Inf"} 4
cloudprovider_aws_api_request_duration_seconds_sum{request="DescribeVolumes"} 0.35412039799999995
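
For anyone reading the excerpt above who is new to Prometheus histograms: the `le` buckets are cumulative, so a single 0.2047s DescribeInstances call shows up in `le="0.25"` and in every larger bucket. A small sketch of that counting rule follows; `defaultBuckets` copies the bounds from the excerpt, and `cumulativeCounts` is a hypothetical helper written for this illustration, not part of any metrics library.

```go
package main

import "fmt"

// defaultBuckets mirrors the upper bounds ("le" values) in the excerpt.
var defaultBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// cumulativeCounts returns, for each upper bound, how many observations are
// less than or equal to it. The extra final element is the +Inf bucket,
// which always equals the total number of observations.
func cumulativeCounts(bounds, observations []float64) []int {
	counts := make([]int, len(bounds)+1)
	for _, v := range observations {
		for i, le := range bounds {
			if v <= le {
				counts[i]++
			}
		}
		counts[len(bounds)]++ // +Inf bucket
	}
	return counts
}

func main() {
	// The single DescribeInstances observation from the excerpt.
	counts := cumulativeCounts(defaultBuckets, []float64{0.204679971})
	for i, le := range defaultBuckets {
		fmt.Printf("le=%g count=%d\n", le, counts[i])
	}
	fmt.Printf("le=+Inf count=%d\n", counts[len(defaultBuckets)])
}
```

The same rule gives the mean latency as `_sum / _count`; for DescribeInstances that is 0.204679971 / 1.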

k8s-ci-robot · 2021-04-20T23:28:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [wongma7]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wongma7 · 2021-04-20T23:29:29Z

/cc @gnufied

Need your expertise, or testimony from SREs relying on these metrics :-)

wongma7 · 2021-04-20T23:32:02Z

/cc @AndyXiangLi

BTW, once we have these metrics, we may need to do some basic comparison / load testing against the in-tree driver for API call volume, in addition to the pod-startup type of testing we are doing. I can probably sign up for it, time permitting :D

coveralls · 2021-04-20T23:33:02Z

Pull Request Test Coverage Report for Build 1869

0 of 46 (0.0%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-1.6%) to 80.51%

Changes Missing Coverage	Changed/Added Lines	%
pkg/cloud/cloud.go	12	0.0%
pkg/cloud/aws_metrics.go	15	0.0%
pkg/cloud/handlers.go	19	0.0%

Totals
Change from base Build 1858:	-1.6%
Covered Lines:	1896
Relevant Lines:	2355

💛 - Coveralls

gnufied

Left some minor comments. It is okay to rename the metrics as long as we cover it with release notes. I think whenever AWS migration is enabled by default, the release note should capture the renamed metrics.

cmd/main.go

pkg/cloud/handlers.go

gnufied · 2021-04-21T18:27:31Z

Also cc @Jiawei0227 and @msau42, who are tracking CSI migration work; metrics migration is a prerequisite for CSI migration.


gnufied · 2021-04-22T02:57:27Z

/lgtm

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 20, 2021

k8s-ci-robot requested review from AndyXiangLi and bertinatto April 20, 2021 23:28

k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 20, 2021

k8s-ci-robot requested a review from gnufied April 20, 2021 23:29

gnufied reviewed Apr 21, 2021


Emit AWS API operation duration/error/throttle metrics with --http-endpoint flag

3b0bc58

wongma7 force-pushed the cloudmetrics branch from d60bc5e to 3b0bc58 Compare April 21, 2021 21:07

k8s-ci-robot assigned gnufied Apr 22, 2021

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2021

wongma7 merged commit 57e57bb into kubernetes-sigs:master Apr 22, 2021

wongma7 mentioned this pull request Apr 23, 2021

Fix missing import #849

Merged

zetaab mentioned this pull request May 31, 2022

[cinder-csi-plugin] Add http endpoint of CSI container kubernetes/cloud-provider-openstack#1398

Closed
