Exceeded Quota Causes Failed Workflows #3645

Closed
KPostOffice opened this issue Jul 30, 2020 · 17 comments

@KPostOffice

KPostOffice commented Jul 30, 2020

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:
The workflow failed because creating the pod exceeded the namespace CPU quota (and likewise the memory quota in the memory variant of the repro).

What you expected to happen:
The pod should stay in a Pending state until it is able to get the necessary resources.

How to reproduce it (as minimally and precisely as possible):

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cpu-limit-
spec:
  serviceAccountName: argo
  entrypoint: wait

  templates:
  - name: wait
    resubmitPendingPods: True
    script:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 30s"]
      resources:
        requests:
          cpu: 200m
        limits:
          cpu: 200m

# submit 20 copies so the namespace quota is exceeded
for i in {1..20}
do
    argo submit test-workflow.yaml
done

* Replace the CPU limit and request with memory to reproduce the memory-quota case.
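
The repro assumes the target namespace already has a ResourceQuota that the 20 submissions will exceed. A minimal sketch of such a quota, for illustration only (the quota name and the memory limit come from the error output further down; the CPU limit is an assumed value chosen so that 20 pods at 200m exceed it):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: thoth-test-core-quota   # name as it appears in the error message; adjust for your cluster
  namespace: thoth-test-core
spec:
  hard:
    limits.cpu: "2"          # assumed value; 20 pods x 200m = 4 CPU, which exceeds it
    limits.memory: 32Gi      # limit reported in the workflow error message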

Anything else we need to know?:

Environment:

  • Argo version:
$ argo version
argo: 2.8.2+8a151ae.dirty
  BuildDate: 2020-06-18T23:50:58Z
  GitCommit: 8a151aec6538c9442cf2380c2544ba3efb60ff60
  GitTreeState: dirty
  GitTag: 2.8.2
  GoVersion: go1.13
  Compiler: gc
  Platform: linux/amd64
  • Kubernetes version:
$ kubectl version -o yaml
clientVersion:
  buildDate: 2020-01-29T21:26:39Z
  compiler: gc
  gitCommit: d4cacc0
  gitTreeState: clean
  gitVersion: v1.10.0+d4cacc0
  goVersion: go1.14beta1
  major: "1"
  minor: 10+
  platform: linux/amd64
serverVersion:
  buildDate: 2020-05-04T12:54:43Z
  compiler: gc
  gitCommit: a3ec9df
  gitTreeState: clean
  gitVersion: v1.16.2
  goVersion: go1.12.12
  major: "1"
  minor: 16+
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
$ argo --loglevel DEBUG get <workflowname>
DEBU[0000] CLI version                                   version="{2.8.2+8a151ae.dirty 2020-06-18T23:50:58Z 8a151aec6538c9442cf2380c2544ba3efb60ff60 2.8.2 dirty go1.13 gc linux/amd64}"
DEBU[0000] Client options                                opts="{{ false false} 0x1574670 0xc000117900}"
Name:                cpu-limit-r4jsz
Namespace:           thoth-test-core
ServiceAccount:      argo
Status:              Error
Message:             pods "cpu-limit-r4jsz" is forbidden: exceeded quota: thoth-test-core-quota, requested: limits.memory=3048Mi, used: limits.memory=30096Mi, limited: limits.memory=32Gi
Conditions:          
 Completed           True
Created:             Thu Jul 30 14:11:30 -0400 (11 minutes ago)
Started:             Thu Jul 30 14:11:30 -0400 (11 minutes ago)
Finished:            Thu Jul 30 14:11:31 -0400 (11 minutes ago)
Duration:            1 second

STEP                TEMPLATE  PODNAME          DURATION  MESSAGE
 ⚠ cpu-limit-r4jsz  wait      cpu-limit-r4jsz  0s        pods "cpu-limit-r4jsz" is forbidden: exceeded quota: thoth-test-core-quota, requested: limits.memory=3048Mi, used: limits.memory=30096Mi, limited: limits.memory=32Gi 

Related
#3419
#3490

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@simster7
Member

simster7 commented Aug 3, 2020

Please change

- resubmitPendingPods: True
+ resubmitPendingPods: true

Boolean values are case-sensitive.

simster7 closed this as completed Aug 3, 2020
@simster7
Member

simster7 commented Aug 3, 2020

Feel free to reopen if this doesn't fix your issue.

@KPostOffice
Author

This didn't fix my issue

@KPostOffice
Author

@simster7 I can't seem to reopen this myself

simster7 reopened this Aug 4, 2020
@simster7
Member

simster7 commented Aug 4, 2020

Will take another look

@KPostOffice
Author

It may already be fixed by #3490; I haven't been able to test the pre-release.

@simster7
Member

simster7 commented Aug 4, 2020

Please test in v2.10.0-rc5 and let me know

@KPostOffice
Author

Does this version need to be present on the cluster or is it enough to just have the pre-release CLI?

@simster7
Member

simster7 commented Aug 4, 2020

It needs to be present on the cluster
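
For reference, one way to check which controller version is actually running on the cluster (this assumes the default install, i.e. a workflow-controller deployment in the argo namespace; adjust the namespace and deployment name if your install differs):

$ kubectl -n argo get deployment workflow-controller \
    -o jsonpath='{.spec.template.spec.containers[0].image}'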

@KPostOffice
Author

I don't have permission to do this

@simster7
Member

simster7 commented Aug 4, 2020

I'll take a look

@simster7
Member

Hi @KPostOffice, sorry for the delay. I tested this in 2.10.0-rc6 and 2.9.5 and it seems to work as expected on both.

@simster7
Member

Will close this again unless you confirm that you've tested this in 2.10.0-rc6 and 2.9.5 and still have the issue 🙂

@KPostOffice
Author

I still seem to be having issues. Does the executor that I'm using make any difference?

@alexec
Contributor

alexec commented Aug 19, 2020

I've created a new image for testing if you would like to try it: argoproj/workflow-controller:fix-3791.
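
A sketch of how the controller could be pointed at that test image, assuming the default install (a workflow-controller deployment, with a container of the same name, in the argo namespace; adjust names and namespace to match your setup):

$ kubectl -n argo set image deployment/workflow-controller \
    workflow-controller=argoproj/workflow-controller:fix-3791
$ kubectl -n argo rollout status deployment/workflow-controller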

@KPostOffice
Author

👍

@alexec
Contributor

alexec commented Aug 22, 2020

I've created another test image: argoproj/workflow-controller:fix-3791.

Can you please try it out to confirm it fixes your problem?
