Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit AWS API operation duration/error/throttle metrics #842

Merged
merged 1 commit into from
Apr 22, 2021

Conversation

wongma7
Copy link
Contributor

@wongma7 wongma7 commented Apr 20, 2021

Is this a bug fix or adding new feature? fix #806

What is this PR about? / Why do we need it? This publishes API metrics equivalent to those that the ebs plugin/aws cloud provider embedded in kube-controller-manager publishes today, like describeinstances/describevolume call durations, error counts, and throttle error counts.

I'm not well-versed in how people intend to consume the metrics so this PR is barebones, for now it just exposes them over port 80 using "k8s.io/component-base/metrics/legacyregistry"

DIFFERENCES between my implementation and the cloud provider one:

  • I implement a Complete handler and refer to request.Operation.Name when emitting metrics. As opposed to wrapping every call and referring to a custom request name.
    • consequence: instead of emitting e.g. cloudprovider_aws_api_request_duration_seconds_bucket{request="describe_instance" , I emit cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances"
  • I use SDK's IsErrorThrottle function to decide whether to emit a throttle metric whereas cloudprovider only treats RequestLimitExceeded as throttles.
    • consequence: I don't think it matters for DescribeVolumes/DescribeInstances.
  • I did not change our retry logic. Unlike cloudprovider one that has process-wide retry, ours is per-request. So since the quantity of requests will differ so will the metrics, meaning people will have to adjust their alarms?
    • consequence: hard to say for certain without testing under conditions where lots of api calls fail and need to be retried. But generally I feel safer relying on SDK retry logic.

What testing is done?

kubectl port-forward deployment/ebs-csi-controller 8080:80 -n kube-system

excerpt:

# HELP cloudprovider_aws_api_request_duration_seconds [ALPHA] Latency of AWS API calls
# TYPE cloudprovider_aws_api_request_duration_seconds histogram
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.1"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.25"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="0.5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="1"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="2.5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="5"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="10"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeInstances",le="+Inf"} 1
cloudprovider_aws_api_request_duration_seconds_sum{request="DescribeInstances"} 0.204679971
cloudprovider_aws_api_request_duration_seconds_count{request="DescribeInstances"} 1
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.005"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.01"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.025"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.05"} 0
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.1"} 3
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.25"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="0.5"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="1"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="2.5"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="5"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="10"} 4
cloudprovider_aws_api_request_duration_seconds_bucket{request="DescribeVolumes",le="+Inf"} 4
cloudprovider_aws_api_request_duration_seconds_sum{request="DescribeVolumes"} 0.35412039799999995

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 20, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 20, 2021
@wongma7
Copy link
Contributor Author

wongma7 commented Apr 20, 2021

/cc @gnufied

need your expertise or testimony from SREs relying on this metric : - )

@k8s-ci-robot k8s-ci-robot requested a review from gnufied April 20, 2021 23:29
@wongma7
Copy link
Contributor Author

wongma7 commented Apr 20, 2021

/cc @AndyXiangLi

BTW, once we have these metrics, we may need to do some basic comparison / load testing with in-tree driver for API call volume. In addition to the pod startup type of testing we are doing. I probably can sign up for it, time permitting : D

@coveralls
Copy link

coveralls commented Apr 20, 2021

Pull Request Test Coverage Report for Build 1869

  • 0 of 46 (0.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.6%) to 80.51%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/cloud/cloud.go 0 12 0.0%
pkg/cloud/aws_metrics.go 0 15 0.0%
pkg/cloud/handlers.go 0 19 0.0%
Totals Coverage Status
Change from base Build 1858: -1.6%
Covered Lines: 1896
Relevant Lines: 2355

💛 - Coveralls

Copy link
Contributor

@gnufied gnufied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some minor comments. It is okay to rename the metrics as long as - we cover it with release notes. I think whenever AWS migration is enabled by default, the release note should capture the renamed metrics.

cmd/main.go Outdated Show resolved Hide resolved
pkg/cloud/handlers.go Outdated Show resolved Hide resolved
@gnufied
Copy link
Contributor

gnufied commented Apr 21, 2021

also cc @Jiawei0227 and @msau42 who are tracking CSI migration work and metrics migration is a prerequisite for CSI migration.

@gnufied
Copy link
Contributor

gnufied commented Apr 22, 2021

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2021
@wongma7 wongma7 merged commit 57e57bb into kubernetes-sigs:master Apr 22, 2021
@wongma7 wongma7 mentioned this pull request Apr 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emit cloudprovider and throttling metrics
4 participants