Workflows can get stuck in Running with no backing pod #3006
Something that is invaluable when debugging errors like this is the full Workflow object that produced the error, captured after the workflow finishes running (or, in this case, stops making any further progress). You can get it by running:
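Presumably something along these lines, assuming the standard Workflow CRD is installed (`<workflow-name>` and `<namespace>` are placeholders, not from the original comment):

```shell
# Dump the full Workflow object, including its status and node details,
# so it can be attached to the issue (redact anything sensitive first):
kubectl get workflow <workflow-name> -n <namespace> -o yaml

# The argo CLI offers an equivalent view:
argo get <workflow-name> -n <namespace> -o yaml
```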
@simster7 This is the workflow @wreed4 was referring to. I've changed a few lines to REDACTED, but the majority is present.
We have experienced this as well. We think it's related to the cluster not having enough resources to run the pods, or to pods getting OOM-killed. The workflow controller seems to think the workflows are still running, but of course they aren't, and they'll never complete.
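One way to check the OOM theory is to inspect the container termination reasons on the workflow pods; a sketch (pod and namespace names are placeholders):

```shell
# Show a suspect pod's last container termination state;
# OOM kills appear with reason "OOMKilled":
kubectl describe pod <workflow-pod> -n <namespace> | grep -A3 'Last State'

# Or list termination reasons across all pods in the namespace:
kubectl get pods -n <namespace> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```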
Thanks, I'll take a look at this.
We will try updating to 2.8 as well. |
I haven't seen the issue with Running workflows not having a pod: all of my "stuck in Running" workflows do have a pod now. This is anecdotal data, though; we see this in a test cluster that is very starved of resources, so workflow pods get OOM-killed quite often. However, the pods are there now as far as I can see.
@antoniomo is this still an issue please? |
Hi, as reported, we haven't seen this again as described. However, by "we" I mean my team; we aren't the original poster of the issue (that would be @wreed4), sorry if I caused confusion. For my team, the version update solved an issue that seemed to be the same.
@wreed4 's team hasn't seen it since upgrading either. I think we can close this out.
Checklist:
What happened:
We have now seen, several times, a workflow stuck in the Running state waiting for pods that are nowhere to be found.
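A quick way to see the mismatch, assuming the controller's usual pod label `workflows.argoproj.io/workflow` (the names below are placeholders):

```shell
# List workflows and their phases:
kubectl get workflows -n <namespace>

# For a workflow reported as Running, check whether any pods actually back it;
# Argo labels the pods it creates with the owning workflow's name:
kubectl get pods -n <namespace> -l workflows.argoproj.io/workflow=<workflow-name>
```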
What you expected to happen:
The workflow should probably be marked as Failed, or the pods should be restarted.
How to reproduce it (as minimally and precisely as possible):
I'm unsure exactly how to achieve this state, but it may have something to do with our cluster not being the most stable at the moment, so pods are not always able to launch. Sometimes there aren't enough IP addresses available, so pods fail to start. Additionally, we're using spot instances, so a node may disappear out from under the workflow, taking its pod with it. In either case, I would expect the workflow to be resilient to this: either by failing when the pod is killed (when no retry behavior is specified) or by restarting the pod (when retry behavior is).
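For the retry case, a minimal sketch of what I'd expect to configure, assuming a template-level retryStrategy as supported by recent Argo versions (all names and values below are illustrative, not from this cluster):

```shell
# Submit a workflow whose template retries on errors such as a deleted pod
# or a lost spot node, rather than only on application failures:
kubectl create -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: spot-resilient-
spec:
  entrypoint: main
  templates:
    - name: main
      retryStrategy:
        limit: "3"           # give up after three retries
        retryPolicy: Always  # retry on errors (e.g. node loss), not just failures
      container:
        image: alpine:3.11
        command: [sh, -c, "echo hello"]
EOF
```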
Anything else we need to know?:
Environment:
Other debugging information (if applicable):
These are old workflows that have been around for days. Unfortunately, I don't think the logs here will help; also, some of the logs contain sensitive information.
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.