Argo cronworkflow showing a finished workflow under Active Workflows #4181
Comments
I restarted the Argo deployments to rule out a memory leak, and just a few minutes later the memory of the workflow-controller was back at 13 GB. I also managed to capture the end of the logs for the job in question:
So in the example above, the Active Workflows reference in the cronworkflow disappeared at 13:00. What is the Argo workflow doing between the second-to-last and last action/event?
@simster7 I don't see the 'failed to update CronWorkflow' string in the logs. Could you please elaborate on why there should be sudden CPU spikes? As I mentioned, in my case CPU spiked from 30 millicores to 1 core in just a couple of minutes, and jobs started behaving strangely after that event. The average CPU usage of the workflow-controller pods is not more than 50 millicores in normal cases.
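For reference, one way to search the controller logs for that string — the deployment name and `argo` namespace here are assumptions and may differ per install — is something like:

```sh
# Search the workflow-controller logs for the persistence error string;
# deployment name and "argo" namespace are illustrative assumptions.
kubectl -n argo logs deploy/workflow-controller --since=24h \
  | grep "failed to update CronWorkflow"
```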
Hi @simster7, yes, I do find those strings in our logs for the affected jobs, but also for others:
So these would be fixed by upgrading our Argo version? Are these errors leading to retries and thus delaying the completion of the workflows? Or how else would they explain what we see in our environment?
@dertodestod @Sushant20 I would recommend waiting until 2.11.2 is released.
These errors are an indication that the controller could not persist its updates to the CronWorkflow.
I can't think of a reason why the persistence issue would cause CPU spikes, but if the spikes happen around a time when you have many workflows running concurrently, that could be related.
@simster7 @alexec We are now observing similar behaviour in another environment.
As per the logs, the workflow update to 'Successful' is done; however, the workflow YAML still shows the status as Active. Also, even after deploying a sample cron YAML, it wasn't running on schedule, although a manual trigger of the job works. In summary, the entire Argo cron scheduler fails; it seems like an unknown bug and we don't even know how to reproduce it. I had to specifically remove the status section from the workflow YAML and restart the workflow-controller pod to stabilize the Argo scheduler. CPU usage comes back to normal after the issue is resolved. Could you please let us know the ETA for the 2.11.2 release?
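A minimal sketch of that workaround, assuming a CronWorkflow named `my-cron` in the `argo` namespace and a controller deployment named `workflow-controller` (all three names are illustrative, not from the original report):

```sh
# Drop the stale active references from the CronWorkflow status.
# If the CRD has the status subresource enabled, a plain patch may not
# touch .status; "kubectl edit" the object instead in that case.
kubectl -n argo patch cronworkflow my-cron \
  --type=json -p='[{"op": "remove", "path": "/status/active"}]'

# Restart the workflow-controller so it rebuilds its caches.
kubectl -n argo rollout restart deployment workflow-controller
```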
But at some point this persist operation is successful, right? If not, I would expect the incorrect active list to stay that way forever. So what decides whether the action fails or succeeds (which, in our case, seems to happen after 20-30 minutes)?
I will work to release 2.11.2. @Sushant20 Let's first check if 2.11.2 resolves your issue.
@dertodestod Yes, what I believe happens is that we then kick off our "conflict resolution" code that attempts to resolve the differences needed to allow the CRD to be persisted. My hypothesis is that this conflict resolution doesn't deal with removals (i.e. the removal of the workflow from the active list).
@simster7
@simster7 It is okay to wait a few more days; we would prefer a tested version (2.11.2) from your end because we also need to run the new version initially in a non-production environment. By any chance, would upgrading from 2.9.5 to 2.11.2 be a breaking change in your view?
Simon is away at the moment. We were concerned we might see breakage in v2.10, but we did not. However, we did in v2.11.0. Can I suggest you try v2.11.3, which has a significant performance improvement as well as a mitigation for the breakage. I would not expect to see the controller use more than 1-2 GB of memory.
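For reference, one way to spot-check the controller's actual memory footprint — assuming metrics-server is installed and the controller runs in the `argo` namespace (both assumptions) — is:

```sh
# Requires metrics-server; the "argo" namespace is an assumption.
kubectl -n argo top pods | grep workflow-controller
```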
@dertodestod @Sushant20 can you guys confirm if upgrading fixed the issues?
Will close this unless the issue persists. If so, please let me know and I'll reopen the issue.
Thank you for the update and potential fix(es) @simster7. We have not yet managed to upgrade due to time constraints and because the situation improved somewhat on its own (the cron jobs now start every 5-15 minutes). We will let you know in case the upgrade does not fix the issue. Thanks again.
@simster7 We did upgrade to 2.11.4 and haven't seen the issue since. It's hard to replicate because we don't know what caused it in the first place.
Summary
We have observed that cron jobs which should run every 5 minutes only start every 10, 20, or sometimes even 30 minutes. The workflows themselves finish in less than 5 minutes (please see logs below).
We also see high CPU and memory usage of the workflow-controller pod, and its log shows "completed" messages for many jobs at the wrong time.
We are using a relatively old Argo version, but we have not touched our configuration lately (though we have regularly added more and more workflows). The issue only happens in our production environment, where we regularly run 500+ jobs: a few hundred each hour, a few hundred every 3 hours, and a few every 5 minutes.
From everything I can see, I assume the workflow controller is not able to keep up with the number of cronworkflows/workflows in our cluster and therefore detects too late that the workflow of the cronworkflow is no longer active.
Maybe similar to #4174 ?
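A rough way to check this for a single cron — assuming a CronWorkflow named `my-cron` in the `argo` namespace (both names are illustrative) — is to compare what the CronWorkflow still lists as active against the workflows that actually exist:

```sh
# What the CronWorkflow believes is still active.
kubectl -n argo get cronworkflow my-cron -o jsonpath='{.status.active}{"\n"}'

# The workflows it spawned and their phases; the label selector is the one
# Argo applies to cron-spawned workflows, but if it is absent in your
# version, filter by the workflow name prefix instead.
kubectl -n argo get workflows \
  -l workflows.argoproj.io/cron-workflow=my-cron \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase
```

An entry in `.status.active` whose workflow has already Succeeded (or no longer exists) would match the stale-active behaviour described above.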
I'm open to upgrading our Argo deployment to a recent version, but I saw there are breaking changes in 2.10, and upgrading all environments including production (where we see the issue) would take some effort on our side. So if someone is aware of a certain fix for our problem short of the upgrade, I'd be really glad to hear it.
Please let me know if you need any additional information.
Thank you.
Diagnostics
What Kubernetes provider are you using?
EKS 1.14
What version of Argo Workflows are you running?
v2.8.1
Logs of workflow-controller:
CronWorkflow shows an Active Workflow, but it already finished 12 minutes ago:
High CPU and memory usage of the workflow-controller pod is observable:
The workflow-controller has many events like these, which should not appear at that point in time because the job ran 3 hours ago:
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.