Offload TaskResultsCompletionStatus from etcd to db or use compression to allow large workflows (~100k) #13783
Comments
Also, it disappeared, but I am willing to work on this :) Just need to discuss how it should be done.
#7121 seems to be relevant to this issue, but the discussion there gravitated towards db optimisations, which won't solve this issue.
Duplicate of #13213.
I have also encountered this problem before. I implemented ALWAYS_OFFLOAD_TASK_RESULT_STATUS. If the maintainers think this requirement is reasonable, I can contribute it here.
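For context, the existing ALWAYS_OFFLOAD_NODE_STATUS feature is enabled via an environment variable on the workflow-controller. An ALWAYS_OFFLOAD_TASK_RESULT_STATUS flag does not exist upstream; the sketch below is a hypothetical patch-style fragment for the controller Deployment, assuming such a flag were wired the same way as the existing one:

```yaml
# Sketch only: ALWAYS_OFFLOAD_TASK_RESULT_STATUS is the hypothetical flag from the
# comment above; ALWAYS_OFFLOAD_NODE_STATUS is an existing controller env var.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            - name: ALWAYS_OFFLOAD_NODE_STATUS        # existing: always offload node status to the DB
              value: "true"
            - name: ALWAYS_OFFLOAD_TASK_RESULT_STATUS # proposed / hypothetical
              value: "true"
```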
Offloading task result status to the database along with node status seems like a reasonable thing to do. I haven't looked into it, but it felt like it might be possible to:
This might be a better solution than offloading, WDYT @shuangkun? These might be two separate things.
At the time, the task result status field was introduced to serve a key function: helping determine certain workflow states, such as whether all tasks are completed (which matters for GC and for deciding when the workflow can be marked complete) and whether output parsing for the previous step has finished (so the next pod can be started). I am not sure whether offloading some of the task result status can still achieve this; it may need more thought.
Hi @shuangkun, how is it going? Do you still want to contribute to this issue? If not, I can try to create a PR that fixes this. Thanks in advance.
@Hnatekmar Feel free to submit a PR, and we will review it.
I'm unconvinced we need to preserve taskResultCompletionStatuses that are in state
@Joibel, if I may: this could still cause issues depending on the current workload. I would be in favor of offloading it completely if that is feasible (I still have to orient myself in the codebase, but it doesn't seem impossible).
Agree. I went through how the taskresult information is used and found that it is mainly needed in two places:
The succeeded information and the taskresult completion information are currently important for each node.
Summary
I am currently evaluating argo-workflows as a go-to solution for scheduling tasks at my company. So far we really like it feature-wise and we think it is a really good fit 👍
The problem is that the number of tasks is expected to be around 100k per workflow, and so far I haven't managed to persuade Argo to handle that.
From what I've observed, there is a limitation imposed by the maximum size of an entity inside the etcd db, which is around 1.5 MB. From my testing this can be observed with the following workflow.
You can use it with `ytt -f <manifest_name> | kubectl create -f - -n <argo_namespace>`. This manifest will get stuck at around the 19177/20177 mark. When I look at the content of the Workflow manifest, it has the state of each job inside; the jobs are listed like this.
The size of the workflow manifest also roughly correlates to the etcd limit:
Also, when I decrease the size of the name prefix, I am able to schedule more jobs (around 80k with a single-character prefix).
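The test manifest itself is not reproduced above; as a rough sketch (not the author's ytt template), a workflow along these lines, with roughly 20k generated steps and a deliberately long name prefix, is the kind of thing that runs into the ~1.5 MB etcd object limit:

```yaml
# Illustrative only: a plain withSequence fan-out, not the original ytt template.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: etcd-size-limit-test-
spec:
  entrypoint: main
  parallelism: 50          # limit concurrent pods; the status maps grow regardless
  templates:
    - name: main
      steps:
        - - name: some-deliberately-long-task-name-prefix   # longer names inflate the status size
            template: noop
            withSequence:
              count: "20000"
    - name: noop
      container:
        image: alpine:3.19
        command: ["sh", "-c", "exit 0"]
```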
What I am proposing is offloading TaskResultsCompletionStatus (argo-workflows/pkg/apis/workflow/v1alpha1/workflow_types.go, line 1955 at c9b1477) from etcd to the db, or compressing it, similarly to what ALWAYS_OFFLOAD_NODE_STATUS does for node status.
Here is my current configuration for argo-workflows: https://github.com/Hnatekmar/kubernetes/blob/a09391109103d5ff9036eed85fd05577fff1c654/manifests/applications/argo-workflows.yaml
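For reference, the existing node status offload that the proposal mirrors is enabled through the persistence section of the workflow-controller-configmap (in addition to the ALWAYS_OFFLOAD_NODE_STATUS environment variable). A minimal sketch, with placeholder database settings:

```yaml
# Existing mechanism: nodeStatusOffLoad moves node status to a relational DB.
# A TaskResultsCompletionStatus offload could plausibly reuse this persistence config.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  persistence: |
    nodeStatusOffLoad: true
    postgresql:
      host: postgres            # placeholder values below
      port: 5432
      database: postgres
      tableName: argo_workflows
      userNameSecret:
        name: argo-postgres-config
        key: username
      passwordSecret:
        name: argo-postgres-config
        key: password
```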
Use Cases
When scheduling 100k or more jobs
Message from the maintainers:
Love this feature request? Give it a 👍. We prioritise the proposals with the most 👍.