
Confusing use of TooManyRequests error for eviction #106286

Open
theunrepentantgeek opened this issue Nov 10, 2021 · 26 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
sig/apps: Categorizes an issue or PR as relevant to SIG Apps.

Comments

@theunrepentantgeek

theunrepentantgeek commented Nov 10, 2021

What happened?

When the Kubernetes API server returns a 429 (throttled), the response may include misleading cause information.

We observed that a pod eviction API call received a response that included both of the following fragments:

"responseStatus": {
  "metadata": {},
  "status": "Failure",
  "reason": "TooManyRequests",
  "code": 429
},

and

"responseObject": {
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "Cannot evict pod as it would violate the pod's disruption budget.",
  "reason": "TooManyRequests",
  "details": {
    "causes": [
      {
        "reason": "DisruptionBudget",
        "message": "The disruption budget [elided] needs 11 healthy pods and has 12 currently"
      }
    ]
  },
  "code": 429
},

Note that the message given above is confusing, as there are more currently healthy pods (12) than the required number (11).

What did you expect to happen?

When a request is rejected due to throttling (429 HTTP response), either

  • no other error information should be included; or
  • the included error information should indicate throttling

How can we reproduce it (as minimally and precisely as possible)?

Provoke a 429 HTTP response and inspect the response.
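
For example, here is a minimal client-go sketch (not part of the original report; the pod and namespace names "example-pod" and "example-ns" are placeholders) that evicts a pod covered by a PDB with DisruptionsAllowed == 0 and prints the returned Status:

package main

import (
    "context"
    "fmt"

    policyv1 "k8s.io/api/policy/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from the local kubeconfig; error handling kept minimal.
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    // Request eviction of a pod whose PDB currently allows no disruptions.
    eviction := &policyv1.Eviction{
        ObjectMeta: metav1.ObjectMeta{Name: "example-pod", Namespace: "example-ns"},
    }
    err = clientset.PolicyV1().Evictions("example-ns").Evict(context.TODO(), eviction)

    // The call fails with reason TooManyRequests / code 429 even though the
    // underlying cause is the PodDisruptionBudget.
    if apierrors.IsTooManyRequests(err) {
        if statusErr, ok := err.(*apierrors.StatusError); ok {
            status := statusErr.ErrStatus
            fmt.Printf("reason=%s code=%d message=%q\n", status.Reason, status.Code, status.Message)
            if status.Details != nil {
                for _, cause := range status.Details.Causes {
                    fmt.Printf("cause reason=%s message=%q\n", cause.Reason, cause.Message)
                }
            }
        }
    }
}

With a blocking PDB, the printed reason is TooManyRequests with code 429, while the message and cause talk about the disruption budget, which is the mismatch described above.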

Anything else we need to know?

Issue #88535 seems to be similar - it also describes a 429 response where inclusion of a cause resulted in confusion.

Kubernetes version

Version: 1.20.9

Cloud provider

Azure Kubernetes Service

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@theunrepentantgeek added the kind/bug label Nov 10, 2021
@k8s-ci-robot added the needs-sig and needs-triage labels Nov 10, 2021
@theunrepentantgeek
Author

/sig api-machinery

@k8s-ci-robot added the sig/api-machinery label and removed the needs-sig label Nov 10, 2021
@liggitt
Member

liggitt commented Nov 12, 2021

/remove-sig api-machinery
/sig apps
for PDB

@k8s-ci-robot added the sig/apps label and removed the sig/api-machinery label Nov 12, 2021
@theunrepentantgeek
Author

theunrepentantgeek commented Nov 12, 2021

@liggitt , are you sure?

As far as I've been able to work out, the 429 HTTP response is correct and the PDB error is a red herring.

The PDB error we observed doesn't even make sense, given the number of healthy pods (12) was more than the needed (11):

The disruption budget [elided] needs 11 healthy pods and has 12 currently

(My experience is that I spent a bunch of time digging into possible causes of the PDB error before realizing that I'd been ignoring the HTTP response code - and once I started paying attention to that, everything started to make sense. Of course, you understand this area far better; I just don't want you to fall into the same trap that I did!)

@liggitt
Member

liggitt commented Nov 12, 2021

any message coming back with PDB language in it belongs to sig-apps

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 10, 2022
@theunrepentantgeek
Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Feb 11, 2022
@k8s-ci-robot added the lifecycle/stale label May 12, 2022
@theunrepentantgeek
Author

theunrepentantgeek commented May 12, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label May 12, 2022
@k8s-ci-robot added the lifecycle/stale label Aug 10, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Sep 9, 2022
@theunrepentantgeek
Author

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Sep 9, 2022
@k8s-ci-robot added the lifecycle/stale label Dec 8, 2022
@theunrepentantgeek
Author

/remove-lifecycle stale

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Mar 12, 2023
@theunrepentantgeek
Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Mar 12, 2023
@tallclair changed the title from "Throttled response from API server should not include misleading cause information" to "Confusing use of TooManyRequests error for eviction" Mar 21, 2023
@tallclair
Member

/triage accepted
/priority important-longterm

It looks like eviction has some creative uses of TooManyRequests errors. Why is this conflict error converted to TooManyRequests?

// Try the delete
err = addConditionAndDeletePod(r, ctx, eviction.Name, rest.ValidateAllObjectFunc, deleteOptions)
if err != nil {
    if errors.IsConflict(err) && updateDeletionOptions &&
        (originalDeleteOptions.Preconditions == nil || originalDeleteOptions.Preconditions.ResourceVersion == nil) {
        // If we encounter a resource conflict error, we updated the deletion options to include them,
        // and the original deletion options did not specify ResourceVersion, we send back
        // TooManyRequests so clients will retry.
        return nil, createTooManyRequestsError(pdbName)
    }
    return nil, err
}

Also this observed generation mismatch (stale cache?):

if pdb.Status.ObservedGeneration < pdb.Generation {
    return createTooManyRequestsError(pdb.Name)
}

Violating the disruption budget is more debatable, but I still don't think a 429 error is right:

if pdb.Status.DisruptionsAllowed == 0 {
    err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
    err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{
        Type:    policyv1.DisruptionBudgetCause,
        Message: fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy),
    })
    return err
}

@k8s-ci-robot added the triage/accepted and priority/important-longterm labels and removed the needs-triage label Mar 21, 2023
@liggitt
Member

liggitt commented Mar 21, 2023

It looks like eviction has some creative uses of TooManyRequests errors. Why is this conflict error converted to TooManyRequests?

that was added in https://github.com/kubernetes/kubernetes/pull/94381/files#r507982781 with some discussion there

I think the short answer is we wanted to drive an automatic retry (which the server returning TooManyRequests with a RetryAfter does), and returning a conflict when the user did not set a precondition on the request did not seem correct
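
For context, the retry behaviour that 429 is intended to drive looks roughly like this on the client side (a sketch of typical caller code, not code from this repository; the function name, retry count, and fallback delay are assumptions):

package evictretry

import (
    "context"
    "time"

    policyv1 "k8s.io/api/policy/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/client-go/kubernetes"
)

// evictWithRetry keeps retrying an eviction that is rejected with
// TooManyRequests, honouring the server-suggested delay when one is present
// in the Status (Details.RetryAfterSeconds).
func evictWithRetry(ctx context.Context, client kubernetes.Interface, eviction *policyv1.Eviction, maxAttempts int) error {
    var err error
    for attempt := 0; attempt < maxAttempts; attempt++ {
        err = client.PolicyV1().Evictions(eviction.Namespace).Evict(ctx, eviction)
        if err == nil || !apierrors.IsTooManyRequests(err) {
            // Success, or an error that retrying will not fix.
            return err
        }
        delay := time.Second // assumed fallback when no delay is suggested
        if seconds, ok := apierrors.SuggestsClientDelay(err); ok {
            delay = time.Duration(seconds) * time.Second
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(delay):
        }
    }
    return err
}

kubectl drain, for instance, keeps retrying evictions that are refused this way.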

@github-project-automation bot moved this to Needs Triage in SIG Apps Sep 29, 2023
@helayoty moved this from Needs Triage to Backlog in SIG Apps Sep 29, 2023
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot added the needs-triage label and removed the triage/accepted label Mar 20, 2024
@flavianmissi
Contributor

If we put aside the 429 for a moment and focus on the error's cause in question:

  "details": {
    "causes": [
      {
        "reason": "DisruptionBudget",
        "message": "The disruption budget [elided] needs 11 healthy pods and has 12 currently"
      }
    ]
  },

source:

if pdb.Status.DisruptionsAllowed == 0 {
    err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
    err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{
        Type:    policyv1.DisruptionBudgetCause,
        Message: fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy),
    })
    return err
}

Isn't it weird that this state (needs 11 healthy pods and has 12 currently) is reported as an error?
The code returns this cause when pdb.Status.DisruptionsAllowed == 0. Is that value wrong in this case? Should it have been 1? I couldn't find where it is calculated, but I don't know the code well.

@liggitt
Member

liggitt commented May 2, 2024

pdb.Status.DisruptionsAllowed = 0 is set as a fail-safe when the DesiredHealthy / CurrentHealthy pods can't be calculated, so those numbers are not trustworthy:

// failSafe is an attempt to at least update the DisruptionsAllowed field to
// 0 if everything else has failed. This is one place we
// implement the "fail open" part of the design since if we manage to update
// this field correctly, we will prevent the /evict handler from approving an
// eviction when it may be unsafe to do so.
func (dc *DisruptionController) failSafe(ctx context.Context, pdb *policy.PodDisruptionBudget, err error) error {
    newPdb := pdb.DeepCopy()
    newPdb.Status.DisruptionsAllowed = 0
    if newPdb.Status.Conditions == nil {
        newPdb.Status.Conditions = make([]metav1.Condition, 0)
    }
    apimeta.SetStatusCondition(&newPdb.Status.Conditions, metav1.Condition{
        Type:               policy.DisruptionAllowedCondition,
        Status:             metav1.ConditionFalse,
        Reason:             policy.SyncFailedReason,
        Message:            err.Error(),
        ObservedGeneration: newPdb.Status.ObservedGeneration,
    })

If the type=DisruptionAllowedCondition, status=False condition is present, we should probably use the message from that condition rather than the pdb.Status.DesiredHealthy and pdb.Status.CurrentHealthy fields
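
One possible shape for that, sketched against the eviction snippet quoted above (the SyncFailedReason check and the replacement wording are assumptions, not an agreed design; package aliases follow the quoted snippets):

if pdb.Status.DisruptionsAllowed == 0 {
    err := errors.NewTooManyRequests("Cannot evict pod as it would violate the pod's disruption budget.", 0)
    // Default to the current message built from the healthy-pod counts.
    message := fmt.Sprintf("The disruption budget %s needs %d healthy pods and has %d currently", pdb.Name, pdb.Status.DesiredHealthy, pdb.Status.CurrentHealthy)
    // If the disruption controller has recorded that it could not sync the
    // PDB, the counts are not trustworthy; surface the controller's own
    // message from the DisruptionAllowed=False condition instead.
    if cond := apimeta.FindStatusCondition(pdb.Status.Conditions, policyv1.DisruptionAllowedCondition); cond != nil &&
        cond.Status == metav1.ConditionFalse && cond.Reason == policyv1.SyncFailedReason {
        message = fmt.Sprintf("The disruption budget %s could not be synced: %s", pdb.Name, cond.Message)
    }
    err.ErrStatus.Details.Causes = append(err.ErrStatus.Details.Causes, metav1.StatusCause{
        Type:    policyv1.DisruptionBudgetCause,
        Message: message,
    })
    return err
}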

@k8s-ci-robot added the lifecycle/stale label Jul 31, 2024
@theunrepentantgeek
Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Aug 1, 2024
@k8s-ci-robot added the lifecycle/stale label Oct 30, 2024
@theunrepentantgeek
Author

/lifecycle stale

@c-goes

c-goes commented Nov 25, 2024

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Nov 25, 2024
Projects
Status: Backlog
Development

No branches or pull requests

7 participants