
Argo cronworkflow showing a finished workflow under Active Workflows #4181

Closed
dertodestod opened this issue Oct 1, 2020 · 15 comments

@dertodestod

Summary

We have observed that cron jobs which should run every 5 minutes only start every 10, 20, or sometimes even 30 minutes. The workflows themselves finish in less than 5 minutes (please see the logs below).

We also see high CPU and memory usage on the workflow-controller pod, and its log shows 'completed' messages for many jobs at the wrong time.

We are using a relatively old Argo version, but we have not touched our configuration lately (we have, however, regularly added more and more workflows). The issue only happens in our Production environment, where we run 500+ jobs regularly: a few hundred each hour, a few hundred every 3 hours, and a few every 5 minutes.

From everything I can see, I assume the workflow controller is not able to keep up with the number of CronWorkflows/Workflows in our cluster, and thus it detects too late that the workflow spawned by the CronWorkflow is no longer active.
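
For reference, this is roughly how we compare what the CronWorkflow still lists as active with the actual state of that workflow (resource names are from our setup and the jsonpath expressions are just a convenience; adjust as needed):

# what the CronWorkflow still believes is running
kubectl get cronworkflow extractor -n landing -o jsonpath='{.status.active[*].name}'

# the actual phase of that workflow (already Succeeded in our case)
kubectl get workflow extractor-8hcfx -n landing -o jsonpath='{.status.phase}'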

Maybe similar to #4174?

I'm open to upgrading our Argo deployment to a recent version, but I saw there are breaking changes in 2.10, and upgrading all environments including Production (where we see the issue) would take some effort on our side. So if someone is aware of a certain fix for our problem short of the upgrade, I'd be really glad to hear it.

Please let me know if you need any additional information.
Thank you.

Diagnostics

What Kubernetes provider are you using?
EKS 1.14

What version of Argo Workflows are you running?
v2.8.1

Logs of workflow-controller:

time="2020-10-01T09:25:02Z" level=info msg="extractor has 'ConcurrencyPolicy: Forbid' and has an active Workflow so it was not run"
time="2020-10-01T09:25:02Z" level=info msg="extractor2 has 'ConcurrencyPolicy: Forbid' and has an active Workflow so it was not run"
time="2020-10-01T09:25:02Z" level=info msg="extractor3 has 'ConcurrencyPolicy: Forbid' and has an active Workflow so it was not run"
time="2020-10-01T09:25:02Z" level=info msg="extractor4 has 'ConcurrencyPolicy: Forbid' and has an active Workflow so it was not run"
time="2020-10-01T09:25:02Z" level=info msg="extractor5 has 'ConcurrencyPolicy: Forbid' and has an active Workflow so it was not run"

The CronWorkflow shows an Active Workflow, but it already finished 12 minutes ago:

user@pc:~ $ argo cron get extractor -n landing
Name:                          extractor
Namespace:                     landing
Created:                       Fri Sep 25 09:15:38 +0200 (6 days ago)
Schedule:                      */5 * * * *
Suspended:                     false
StartingDeadlineSeconds:       0
ConcurrencyPolicy:             Forbid
LastScheduledTime:             Thu Oct 01 10:31:29 +0200 (15 minutes ago)
Active Workflows:              extractor-8hcfx

user@pc:~ $ argo get extractor-8hcfx -n landing
Name:                extractor-8hcfx
Namespace:           landing
ServiceAccount:      argo
Status:              Succeeded
Conditions:
 Completed           True
Created:             Thu Oct 01 10:31:29 +0200 (15 minutes ago)
Started:             Thu Oct 01 10:31:51 +0200 (15 minutes ago)
Finished:            Thu Oct 01 10:34:11 +0200 (12 minutes ago)
Duration:            2 minutes 20 seconds

STEP                         TEMPLATE                  PODNAME                     DURATION  MESSAGE
 ✔ extractor-8hcfx           extractor-dag-name
 └-✔ extractor-s-task        extractor-s-template      extractor-8hcfx-1765851555  2m

 ✔ extractor-8hcfx.onExit    extractor-exit-template
 └---○ extractor-e-step      extractor-e-template                                            when 'Succeeded != Succeeded' evaluated false

High CPU and memory usage of the workflow-controller pod is observable:

user@pc:~ $ kubectl top pod -A
NAMESPACE         NAME                                                     CPU(cores)   MEMORY(bytes)
argo              argo-server-ff6f7f779-5wk2b                              1m           37Mi
argo              workflow-controller-5d689bd947-b8qmx                     3705m        10031Mi

The workflow-controller has many events like these, which should not appear at this point in time because the jobs ran 3 hours ago:

time="2020-10-01T09:38:07Z" level=warning msg="Workflow 's-log-8ctpp' from CronWorkflow 's-log' completed"
time="2020-10-01T09:38:07Z" level=info msg="Enforcing history limit for 's-log'"
time="2020-10-01T09:38:07Z" level=warning msg="Workflow 's-dm-dr5mj' from CronWorkflow 's-dm' completed"
time="2020-10-01T09:38:07Z" level=info msg="Enforcing history limit for 's-dm'"
time="2020-10-01T09:38:08Z" level=warning msg="Workflow 's-sup-mcsgs' from CronWorkflow 's-sup' completed"
time="2020-10-01T09:38:08Z" level=info msg="Enforcing history limit for 's-sup'"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@dertodestod
Author

I restarted the Argo deployments to rule out a memory leak, and after just a few minutes the memory of the workflow-controller was back at 13 GB.

I also managed to get the end of the logs for the job in question:

time="2020-10-01T12:41:10Z" level=info msg="Checking daemoned children of extractor-vvfd7-2002148658" namespace=landing workflow=extractor-vvfd7
time="2020-10-01T12:41:10Z" level=info msg="Updated phase Running -> Succeeded" namespace=landing workflow=extractor-vvfd7
time="2020-10-01T12:41:10Z" level=info msg="Marking workflow completed" namespace=landing workflow=extractor-vvfd7
time="2020-10-01T12:41:10Z" level=info msg="Checking daemoned children of " namespace=landing workflow=extractor-vvfd7
time="2020-10-01T12:41:10Z" level=info msg="Workflow update successful" namespace=landing phase=Succeeded resourceVersion=249079700 workflow=extractor-vvfd7
time="2020-10-01T12:41:11Z" level=info msg="Labeled pod landing/zextractor-vvfd7-804352055 completed"
time="2020-10-01T13:00:13Z" level=warning msg="Workflow 'extractor-vvfd7' from CronWorkflow 'extractor' completed"
client@pc:~ $ argo get extractor-vvfd7 -n landing
Name:                wal-extractor-vvfd7
Namespace:           landing
ServiceAccount:      argo
Status:              Succeeded
Conditions:
 Completed           True
Created:             Thu Oct 01 14:40:02 +0200 (1 minute ago)
Started:             Thu Oct 01 14:40:02 +0200 (1 minute ago)
Finished:            Thu Oct 01 14:41:10 +0200 (13 seconds ago)
Duration:            1 minute 8 seconds

So in the example above, the Active Workflows reference in the CronWorkflow disappeared at 13:00. What is Argo doing between the second-to-last and the last action/event?

@simster7 simster7 self-assigned this Oct 1, 2020
@simster7
Member

simster7 commented Oct 1, 2020

What version of Argo Workflows are you running?
v2.8.1

I suspect behavior like this is fixed by #3272, #3883, and #4172.

Can you find the string failed to update CronWorkflow anywhere in your logs?
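
Something along these lines should be enough to check (adjust the namespace and deployment name to your install):

kubectl logs -n argo deploy/workflow-controller --since=24h | grep -c 'failed to update CronWorkflow'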

@Sushant20
Contributor

Sushant20 commented Oct 1, 2020

@simster7 I don't see the 'failed to update CronWorkflow' string in the logs. Could you please elaborate on why there should be sudden CPU spikes? As I mentioned, in my case the CPU spiked from 30 millicores to 1 core in just a couple of minutes, and the jobs started behaving weirdly after that event; the average CPU usage of the workflow-controller pods is not more than 50 millicores in normal cases.

@dertodestod
Author

dertodestod commented Oct 2, 2020

Hi @simster7,
thank you for the reply.

Yes, I do find those strings in our logs, for the affected jobs but also for others:

time="2020-10-02T07:13:46Z" level=warning msg="Workflow 'extractor-fjmvs' from CronWorkflow 'extractor' completed"
time="2020-10-02T07:13:46Z" level=info msg="Enforcing history limit for 'extractor'"
time="2020-10-02T07:13:47Z" level=info msg="Deleted Workflow 'extractor-fjmvs' due to CronWorkflow 'extractor' history limit"

time="2020-10-02T07:17:54Z" level=warning msg="Workflow 'extractor-bb4mm' from CronWorkflow 'extractor' completed"
time="2020-10-02T07:17:54Z" level=info msg="Enforcing history limit for 'extractor'"
time="2020-10-02T07:17:55Z" level=warning msg="Workflow 'landing/extractor-fjmvs' no longer exists"

time="2020-10-02T07:19:12Z" level=warning msg="Workflow 'extractor-8pz6m' from CronWorkflow 'extractor' completed"
time="2020-10-02T07:19:12Z" level=error msg="Unable to update CronWorkflow 'extractor': failed to update CronWorkflow: Operation cannot be fulfilled on cronworkflows.argoproj.io "extractor": the object has been modified; please apply your changes to the latest version and try again""
time="2020-10-02T07:19:12Z" level=info msg="Enforcing history limit for 'extractor'"

time="2020-10-02T07:19:35Z" level=warning msg="Workflow 'extractor-cd85w' from CronWorkflow 'extractor' completed"
time="2020-10-02T07:19:35Z" level=error msg="Unable to update CronWorkflow 'extractor': failed to update CronWorkflow: Operation cannot be fulfilled on cronworkflows.argoproj.io "extractor": the object has been modified; please apply your changes to the latest version and try again"
time="2020-10-02T07:19:35Z" level=info msg="Enforcing history limit for 'extractor'"

time="2020-10-02T07:20:00Z" level=info msg="Running extractor"
time="2020-10-02T07:20:01Z" level=info msg="Processing workflow" namespace=landing workflow=extractor-5qg5f"
time="2020-10-02T07:20:01Z" level=info msg="Updated phase  -> Running" namespace=landing workflow=extractor-5qg5f"
time="2020-10-02T07:20:01Z" level=error msg="failed to update CronWorkflow: Operation cannot be fulfilled on cronworkflows.argoproj.io "extractor": the object has been modified; please apply your changes to the latest version and try again"

So would these be fixed by upgrading our Argo version?

Are these errors leading to retries and thus delaying the completion of the workflow? Or how would they explain what we see in our environment?

@simster7
Member

simster7 commented Oct 2, 2020

@dertodestod @Sushant20 I would recommend waiting until 2.11.2 is released (which will include #4172, a crucial part of this fix) and upgrading to it to fix this issue.

Are these errors leading to retries and thus delaying the completion of the workflow? Or how would they explain what we see in our environment?

These errors are an indication that the CronWorkflow CRD is not being persisted correctly. For example, when the controller tries to remove a workflow from the active list, and then tries to persist its modification, the operation fails and the CRD with the incorrect active list remains.
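
For context, this is the standard Kubernetes optimistic-concurrency error; you can provoke the same class of failure with any resource by writing back a stale copy (purely illustrative, no need to run this):

# export the object, let the controller update it in the meantime (bumping metadata.resourceVersion),
# then try to write the stale copy back:
kubectl get cronworkflow extractor -n landing -o yaml > /tmp/cwf.yaml
kubectl replace -f /tmp/cwf.yaml
# fails with: Operation cannot be fulfilled on cronworkflows.argoproj.io "extractor":
# the object has been modified; please apply your changes to the latest version and try again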

Could you please elaborate on why there should be sudden CPU spikes?

I can't think of a reason why the persistence issue would cause CPU spikes, but if the spikes happen at a time when many CronWorkflows are scheduled to run at once (such as at the start of a new minute), then a spike around that time would make sense.
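
A quick way to see how many CronWorkflows fire on the same minute is to dump the schedules (just a sketch):

kubectl get cronworkflows --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,SCHEDULE:.spec.schedule

If hundreds of them share a schedule like '*/5 * * * *' or '0 * * * *', a spike at those minutes is expected.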

@Sushant20
Contributor

Sushant20 commented Oct 3, 2020

@simster7 @alexec We are now observing similar behaviour in another environment.
Here is the series of events:

  1. CPU spikes from 30 millicores to 1 core in just 5 minutes and stays there until the issue is resolved. Our max CPU usage over the last 3 days is not more than 50 millicores, so I don't think a large number of workflows is causing the spikes. Something else is happening in the workflow-controller pod that is causing the sudden increase in CPU.

  2. "has 'ConcurrencyPolicy: Forbid' and has an active Workflow so it was not run" messages in the logs.

As per the logs, the workflow update is successful; however, the workflow YAML still shows the status as Active.

Also, even after deploying a sample cron YAML, it did not run as per the schedule; however, manually triggering the job works.

In summary, the entire Argo cron scheduler fails. It seems like an unknown bug, and we don't even know how to reproduce it.

I had to specifically remove the status section from the workflow YAML and restart the workflow-controller pod to stabilize the Argo scheduler. CPU usage returns to normal after the issue is resolved.
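
Concretely, the manual workaround was roughly the following (our resource names; this only clears the symptom, it is not a fix):

# clear the stale entries under status.active by editing the object directly
kubectl edit cronworkflow <cronworkflow-name> -n <namespace>

# then bounce the controller so it rebuilds its state (needs a reasonably recent kubectl)
kubectl rollout restart deployment workflow-controller -n argo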

Could you please let us know the ETA for 2.11.2 release?

@dertodestod
Author

dertodestod commented Oct 5, 2020

@simster7

These errors are an indication that the CronWorkflow CRD is not being persisted correctly. For example, when the controller tries to remove a workflow from the active list, and then tries to persist its modification, the operation fails and the CRD with the incorrect active list remains.

But at some point this persist operation is successful, right? If not, I would expect the incorrect active list to stay that way forever. So what decides whether the operation fails or succeeds (which in our case seems to happen only after 20-30 minutes)?

@simster7
Member

simster7 commented Oct 5, 2020

I will work to release 2.11.2 either today or tomorrow. However, I can get you guys an experimental dev build as soon as this morning if you guys would like to test what the release tomorrow would be like. Let me know if you'd like this.

@Sushant20 Let's first check if 2.11.2 fixes the CPU/consistency issues; if not, we can treat that separately and start an investigation. However, if we cannot reproduce the issue reliably, it will be difficult to diagnose.

But at some point this persist operation is successful, right? If not, I would expect the incorrect active list to stay that way forever. So what decides whether the operation fails or succeeds (which in our case seems to happen only after 20-30 minutes)?

@dertodestod Yes, what I believe happens is that we then kick off our "conflict resolution" code that attempts to resolve the differences needed to allow the CRD to be persisted. My hypothesis is that this conflict resolution doesn't deal with removals (i.e. the removal of the workflow from the Active list) correctly, and then successfully persists the object with the removal in place.

@dertodestod
Author

I will work to release 2.11.2 either today or tomorrow. However, I can get you guys an experimental dev build as soon as this morning if you guys would like to test what the release tomorrow would be like. Let me know if you'd like this.

@simster7
From my side it's OK to wait a few more days for the fix, as we would need some time anyway to plan and roll out the change across our environments.

@Sushant20
Contributor

@simster7 It is okay to wait a few more days; we would prefer a tested version (2.11.2) from your end, because we also need to run the new version in a non-production environment first. By any chance, would upgrading from 2.9.5 to 2.11.2 be a breaking change in your view?

@alexec
Contributor

alexec commented Oct 9, 2020

Simon is away at the moment. We were concerned we might see breakage in v2.10, but we did not. However, we did in v2.11.0. Can I suggest you try v2.11.3, which has a significant performance improvement as well as a mitigation for the breakage?

I would not expect to see the controller use more than 1-2 GB of memory.
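
If you installed from the release manifests, the upgrade itself is usually just re-applying the manifest for the new version, e.g. (adjust for namespace-scoped installs or Helm):

kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v2.11.3/install.yaml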

@simster7
Member

@dertodestod @Sushant20 can you guys confirm if upgrading fixed the issues?

@simster7
Member

I will close this unless the issue persists. If it does, please let me know and I'll reopen the issue.

@dertodestod
Author

Thank you for the update and the potential fix(es), @simster7. We have not yet managed to upgrade due to time constraints, and because the situation has improved somewhat on its own (the cron jobs now start every 5-15 minutes). We will let you know in case the upgrade does not fix the issue. Thanks again.

@rr-sushant-choudhary

@simster7 We upgraded to 2.11.4 and haven't seen the issue since. It's hard to replicate because we don't know what caused it in the first place.
