
Confusing use of TooManyRequests error for eviction #106286

Open
theunrepentantgeek opened this issue Nov 10, 2021 · 26 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.

Comments

@theunrepentantgeek

theunrepentantgeek commented Nov 10, 2021

What happened?

When the Kubernetes API server returns a 429 (throttled), the response may include misleading cause information.

We observed that a pod eviction API call received a response that included both of the following fragments:

"responseStatus": {
  "metadata": {},
  "status": "Failure",
  "reason": "TooManyRequests",
  "code": 429
},

and

"responseObject": {
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Cannot evict pod as it would violate the pod's disruption budget.",
  "reason": "TooManyRequests",
  "details": {
    "causes": [
      {
        "reason": "DisruptionBudget",
        "message": "The disruption budget [elided] needs 11 healthy pods and has 12 currently"
      }
    ]
  },
  "code": 429
},

Note that the message given above is confusing, as there are more currently healthy pods (12) than the required number (11).

What did you expect to happen?

When a request is rejected due to throttling (429 HTTP response), either

  • no other error information should be included; or
  • the included error information should indicate throttling

How can we reproduce it (as minimally and precisely as possible)?

Provoke a 429 HTTP response and inspect the response.
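
For example, here is a minimal client-go sketch (not part of the original report; the pod and namespace names "example-pod" and "example-ns" are placeholders) that evicts a pod covered by a PDB with DisruptionsAllowed == 0 and prints the returned Status:

package main

import (
    "context"
    "fmt"

    policyv1 "k8s.io/api/policy/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig; error handling kept minimal.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    // Request eviction of a pod whose PDB currently allows no disruptions.
    eviction := &policyv1.Eviction{
        ObjectMeta: metav1.ObjectMeta{Name: "example-pod", Namespace: "example-ns"},
    }
    err = clientset.PolicyV1().Evictions("example-ns").Evict(context.TODO(), eviction)

    // The call fails with reason TooManyRequests / code 429 even though the
    // underlying cause is the PodDisruptionBudget.
    if apierrors.IsTooManyRequests(err) {
        if statusErr, ok := err.(*apierrors.StatusError); ok {
            status := statusErr.ErrStatus
            fmt.Printf("reason=%s code=%d message=%q\n", status.Reason, status.Code, status.Message)
            if status.Details != nil {
                for _, cause := range status.Details.Causes {
                    fmt.Printf("cause reason=%s message=%q\n", cause.Reason, cause.Message)
                }
            }
        }
    }
}

With a blocking PDB, the printed reason is TooManyRequests with code 429, while the message and cause talk about the disruption budget, which is the mismatch described above.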

Anything else we need to know?

Issue #88535 seems to be similar - it also describes a 429 response where inclusion of a cause resulted in confusion.

Kubernetes version

Version: 1.20.9

Cloud provider

Azure Kubernetes Service

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@theunrepentantgeek added the kind/bug label Nov 10, 2021
@k8s-ci-robot added the needs-sig and needs-triage labels Nov 10, 2021
@theunrepentantgeek
Author

/sig api-machinery

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label Nov 10, 2021
@liggitt
Member

liggitt commented Nov 12, 2021

/remove-sig api-machinery
/sig apps
for PDB

@k8s-ci-robot added the sig/apps label and removed the sig/api-machinery label Nov 12, 2021
@theunrepentantgeek
Author

theunrepentantgeek commented Nov 12, 2021

@liggitt , are you sure?

As far as I've been able to work out, the 429 HTTP response is correct and the PDB error is a red herring.

The PDB error we observed doesn't even make sense, given the number of healthy pods (12) was more than the needed (11):

The disruption budget [elided] needs 11 healthy pods and has 12 currently

(My experience is that I spent a bunch of time digging into possible causes of the PDB error before realizing that I'd been ignoring the HTTP response code - and once I started paying attention to that, everything started to make sense. Of course, you understand this area far better; I just don't want you to fall into the same trap that I did!)

@liggitt
Member

liggitt commented Nov 12, 2021

any message coming back with PDB language in it belongs to sig-apps

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 10, 2022
@theunrepentantgeek
Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Feb 11, 2022
@k8s-ci-robot added the lifecycle/stale label May 12, 2022
@theunrepentantgeek
Author

theunrepentantgeek commented May 12, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label May 12, 2022
@k8s-ci-robot added the lifecycle/stale label Aug 10, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Sep 9, 2022
@theunrepentantgeek
Author

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Sep 9, 2022
@k8s-ci-robot added the lifecycle/stale label Dec 8, 2022
@theunrepentantgeek
Author

/remove-lifecycle stale

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Mar 12, 2023
@theunrepentantgeek
Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Mar 12, 2023
@tallclair changed the title from "Throttled response from API server should not include misleading cause information" to "Confusing use of TooManyRequests error for eviction" Mar 21, 2023
@tallclair
Member

/triage accepted
/priority important-longterm

It looks like eviction has some creative uses of TooManyRequests errors. Why is this conflict error converted to TooManyRequests?

// Try the delete
err = addConditionAndDeletePod(r, ctx, eviction.Name, rest.ValidateAllObjectFunc, deleteOptions)
if err != nil {
    if errors.IsConflict(err) && updateDeletionOptions &&
        (originalDeleteOptions.Preconditions == nil || originalDeleteOptions.Preconditions.ResourceVersion == nil) {
        // If we encounter a resource conflict error, we updated the deletion options to include them,
        // and the original deletion options did not specify ResourceVersion, we send back
        // TooManyRequests so clients will retry.
        return nil, createTooManyRequestsError(pdbName)
    }
    return nil, err
}

Also this observed generation mismatch (stale cache?):

if pdb.Status.ObservedGeneration < pdb.Generation {
    return createTooManyRequestsError(pdb.Name)
}

Violating the disruption budget is more debatable, but I still don't think a 429 error is right:

if pdb.Status.DisruptionsAllowed == 0 {
    err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
    err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{
        Type:    policyv1.DisruptionBudgetCause,
        Message: fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy),
    })
    return err
}

@k8s-ci-robot added the triage/accepted and priority/important-longterm labels and removed the needs-triage label Mar 21, 2023
@liggitt
Member

liggitt commented Mar 21, 2023

It looks like eviction has some creative uses of TooManyRequests errors. Why is this conflict error converted to TooManyRequests?

that was added in https://github.com/kubernetes/kubernetes/pull/94381/files#r507982781 with some discussion there

I think the short answer is we wanted to drive an automatic retry (which the server returning TooManyRequests with a RetryAfter does), and returning a conflict when the user did not set a precondition on the request did not seem correct
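
For context, the retry behaviour that 429 is intended to drive looks roughly like this on the client side (a sketch of typical caller code, not code from this repository; the function name, retry count, and fallback delay are assumptions):

package evictretry

import (
    "context"
    "time"

    policyv1 "k8s.io/api/policy/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/client-go/kubernetes"
)

// evictWithRetry keeps retrying an eviction that is rejected with
// TooManyRequests, honouring the server-suggested delay when one is present
// in the Status (Details.RetryAfterSeconds).
func evictWithRetry(ctx context.Context, client kubernetes.Interface, eviction *policyv1.Eviction, maxAttempts int) error {
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        err = client.PolicyV1().Evictions(eviction.Namespace).Evict(ctx, eviction)
        if err == nil || !apierrors.IsTooManyRequests(err) {
            // Success, or an error that retrying will not fix.
            return err
        }
        delay := time.Second // assumed fallback when no delay is suggested
        if seconds, ok := apierrors.SuggestsClientDelay(err); ok {
            delay = time.Duration(seconds) * time.Second
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(delay):
        }
    }
    return err
}

kubectl drain, for instance, keeps retrying evictions that are refused this way.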

@github-project-automation bot moved this to Needs Triage in SIG Apps Sep 29, 2023
@helayoty moved this from Needs Triage to Backlog in SIG Apps Sep 29, 2023
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot added the needs-triage label and removed the triage/accepted label Mar 20, 2024
@flavianmissi
Contributor

If we put aside the 429 for a moment and focus on the error's cause in question:

  "details": {
    "causes": [
      {
        "reason": "DisruptionBudget",
        "message": "The disruption budget [elided] needs 11 healthy pods and has 12 currently"
      }
    ]
  },

source:

if pdb.Status.DisruptionsAllowed == 0 {
    err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
    err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{
        Type:    policyv1.DisruptionBudgetCause,
        Message: fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy),
    })
    return err
}

Isn't it weird that this state (needs 11 healthy pods and has 12 currently) is reported as an error?
The code returns this cause when pdb.Status.DisruptionsAllowed == 0. Is that value wrong in this case? Should it have been 1? I couldn't find where it is calculated, but I don't know the code well.

@liggitt
Member

liggitt commented May 2, 2024

pdb.Status.DisruptionsAllowed = 0 is set as a fail-safe when the DesiredHealthy / CurrentHealthy pods can't be calculated, so those numbers are not trustworthy:

// failSafe is an attempt to at least update the DisruptionsAllowed field to
// 0 if everything else has failed. This is one place we
// implement the "fail open" part of the design since if we manage to update
// this field correctly, we will prevent the /evict handler from approving an
// eviction when it may be unsafe to do so.
func (dc *DisruptionController) failSafe(ctx context.Context, pdb *policy.PodDisruptionBudget, err error) error {
    newPdb := pdb.DeepCopy()
    newPdb.Status.DisruptionsAllowed = 0
    if newPdb.Status.Conditions == nil {
        newPdb.Status.Conditions = make([]metav1.Condition, 0)
    }
    apimeta.SetStatusCondition(&newPdb.Status.Conditions, metav1.Condition{
        Type:               policy.DisruptionAllowedCondition,
        Status:             metav1.ConditionFalse,
        Reason:             policy.SyncFailedReason,
        Message:            err.Error(),
        ObservedGeneration: newPdb.Status.ObservedGeneration,
    })

If the type=DisruptionAllowedCondition, status=False condition is present, we should probably use the message from that condition rather than the pdb.Status.DesiredHealthy and pdb.Status.CurrentHealthy fields
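
One possible shape for that, sketched against the eviction snippet quoted above (the SyncFailedReason check and the replacement wording are assumptions, not an agreed design; package aliases follow the quoted snippets):

if pdb.Status.DisruptionsAllowed == 0 {
    err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
    // Default to the current message built from the healthy-pod counts.
    message := fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy)
    // If the disruption controller has recorded that it could not sync the
    // PDB, the counts are not trustworthy; surface the controller's own
    // message from the DisruptionAllowed=False condition instead.
    if cond := apimeta.FindStatusCondition(pdb.Status.Conditions, policyv1.DisruptionAllowedCondition); cond != nil &&
        cond.Status == metav1.ConditionFalse && cond.Reason == policyv1.SyncFailedReason {
        message = fmt.Sprintf("The disruption budget %s could not be synced: %s", pdb.Name, cond.Message)
    }
    err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{
        Type:    policyv1.DisruptionBudgetCause,
        Message: message,
    })
    return err
}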

@k8s-ci-robot added the lifecycle/stale label Jul 31, 2024
@theunrepentantgeek
Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Aug 1, 2024
@k8s-ci-robot added the lifecycle/stale label Oct 30, 2024
@theunrepentantgeek
Author

/lifecycle stale

@c-goes

c-goes commented Nov 25, 2024

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Nov 25, 2024
Projects
Status: Backlog
Development

No branches or pull requests

7 participants