
TaskRun with extremely large params fails to run #7513

Open
rouke-broersma opened this issue Dec 20, 2023 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@rouke-broersma

Expected Behavior

Either the TaskRun is rejected as invalid if the param size exceeds a certain limit, or the TaskRun starts successfully.

Actual Behavior

The TaskRun fails with the error exec /bin/sh: argument list too long:

completionTime: '2023-12-20T13:49:51Z'
conditions:
  - lastTransitionTime: '2023-12-20T13:49:51Z'
    message: |
      init container failed, "place-scripts" exited with code 1 (image: "cgr.dev/chainguard/busybox@sha256:19f02276bf8dbdd62f069b922f10c65262cc34b710eea26ff928129a736be791"); for logs run: kubectl -n xxx logs yyy -c place-scripts
      
    reason: Failed
    status: 'False'
    type: Succeeded
podName: yyy
provenance:
  featureFlags:
    AwaitSidecarReadiness: true
    Coschedule: workspaces
    DisableAffinityAssistant: false
    DisableCredsInit: false
    EnableAPIFields: stable
    EnableCELInWhenExpression: false
    EnableKeepPodOnCancel: false
    EnableParamEnum: false
    EnableProvenanceInStatus: true
    EnableStepActions: false
    EnableTektonOCIBundles: false
    EnforceNonfalsifiability: none
    MaxResultSize: 4096
    RequireGitSSHSecretKnownHosts: false
    ResultExtractionMethod: termination-message
    RunningInEnvWithInjectedSidecars: true
    ScopeWhenExpressionsToTask: false
    SendCloudEventsForRuns: false
    SetSecurityContext: true
    VerificationNoMatchPolicy: ignore
startTime: '2023-12-20T13:49:47Z'
kubectl -n xxx logs yyy -c place-scripts
exec /bin/sh: argument list too long

Steps to Reproduce the Problem

  1. Create a TaskRun with a param containing more than 25K characters (in our case, a base64-encoded file); see the sketch below.
  2. The TaskRun fails.
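
A minimal sketch of a reproducer, for illustration only (the task, image, and param name are made up, and the real value has to be a single very large string, e.g. tens of kilobytes of base64). Referencing the param in a script is the simplest way to make the value end up in the generated place-scripts init container:

apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: large-param-repro
spec:
  params:
    - name: payload
      # In the real case this is a base64-encoded file of 25K+ characters.
      value: "<tens of kilobytes of base64 here>"
  taskSpec:
    params:
      - name: payload
        type: string
    steps:
      - name: use-payload
        image: cgr.dev/chainguard/busybox
        script: |
          #!/bin/sh
          # The substituted param value is inlined into this script before
          # the Pod is created, which is what blows up the argument list.
          printf '%s' "$(params.payload)" | base64 -d > /tmp/payload.json
          wc -c /tmp/payload.json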

Additional Info

  • Kubernetes version:

    Output of kubectl version:

clientVersion:
  buildDate: "2023-06-14T09:47:38Z"
  compiler: gc
  gitCommit: 25b4e43193bcda6c7328a6d147b1fb73a33f1598
  gitTreeState: clean
  gitVersion: v1.27.3
  goVersion: go1.20.5
  major: "1"
  minor: "27"
  platform: linux/amd64
kustomizeVersion: v5.0.1
serverVersion:
  buildDate: "2023-10-23T08:13:55Z"
  compiler: gc
  gitCommit: a8a1abc25cad87333840cd7d54be2efaf31a3177
  gitTreeState: clean
  gitVersion: v1.28.3
  goVersion: go1.20.10
  major: "1"
  minor: "28"
  platform: linux/amd64
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.54.0
rouke-broersma added the kind/bug label on Dec 20, 2023
@chitrangpatel
Contributor

@rouke-broersma Are you calling these params in command/args? An example that triggers this error would be helpful.

@rouke-broersma
Author

@rouke-broersma Are you calling these params in command/args? An example that triggers this error would be helpful.

Afaik you don't need to use the params; simply setting the params on the TaskRun will cause the built-in init containers to crash when dealing with the param. Any param with more than 25,000 characters would cause this. I can provide an example later.

@chitrangpatel
Contributor

chitrangpatel commented Dec 20, 2023

I was able to reproduce it with a much larger param value.

I don't think this is a bug in Tekton. You are running into the maximum size of a command-line argument list, which varies depending on the operating system.

If the param value is huge or if you have too many params that cumulatively get this large, you will run into the same error. That's because they end up becoming part of a huge bash command (which has a limit) that is executed by K8s.

My suggestion, if you have such an extreme use case, is to see if you can send that information via a file and mount the file via a volume. Your Step can then access that file and extract the information. It's also important to note that if the params are too large, the TaskRun object itself will be very large too, and Kubernetes (etcd) imposes a maximum size of roughly 1.5 MB on the object as well.
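
A rough sketch of that workaround, assuming the payload fits under the ConfigMap size limit of roughly 1 MiB (all resource names here are illustrative): put the file content in a ConfigMap, bind it to the TaskRun as a ConfigMap-backed workspace, and read the file in the Step instead of interpolating a param:

apiVersion: v1
kind: ConfigMap
metadata:
  name: large-payload
data:
  payload.json.b64: "<tens of kilobytes of base64 here>"
---
apiVersion: tekton.dev/v1
kind: TaskRun
metadata:
  name: use-large-payload
spec:
  workspaces:
    - name: payload
      configMap:
        name: large-payload
  taskSpec:
    workspaces:
      - name: payload
    steps:
      - name: use-payload
        image: cgr.dev/chainguard/busybox
        script: |
          #!/bin/sh
          # Only the workspace path is substituted here, not the payload itself,
          # so the generated scripts stay small.
          base64 -d "$(workspaces.payload.path)/payload.json.b64" > /tmp/payload.json
          wc -c /tmp/payload.json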

I encourage other @tektoncd/core-collaborators to chime in here as well.

@rouke-broersma
Author

rouke-broersma commented Dec 20, 2023

@chitrangpatel I don't disagree, except that the place-scripts container is provided by Tekton, Tekton implements the mechanism for passing run params to containers, and Tekton accepts this large param in the CRD. So either the CRD or an admission controller should place limits on param size, or Tekton should change how params are passed so that large params work in Tekton's own containers. Obviously this can only cover what Tekton has control over.

We can work around this with a ConfigMap or similar, but as it stands Tekton allows this in one place and doesn't support it in another.

@vdemeester
Member

vdemeester commented Dec 21, 2023

@rouke-broersma Is this happening when using the script feature (with very, very big script content)?

Initially, the script feature is/was meant for relatively small "scripts" (few lines) and not full blown scripts (python, …) that would be hundreds of kilobytes or more — mainly because all this is "stored" in the object (CR), and there is a limit to the size of any object in Kubernetes.

This is indeed probably the case (looking at the code). I think the "current" workaround is indeed to use a configmap, or an image with most of the script content in it.

The tektoncd/pipeline controller could, in theory, handle this a bit differently, by creating a configmap per TaskRun to put the script content in, and have place-scripts mount it and copy things into the right place (as it does today). But it seems a bit too complicated for a relative edge case:

  • the script data would be duplicated (in the object definition and in the configmap)
  • it doesn't change anything about the size limitation on the object (i.e. you couldn't have a super huge script anyway)
  • it would generate a lot more objects (by default) than today, and would add a whole lot more pressure on etcd than it is already under
  • there are relatively decent workarounds that are probably more future-proof, for example:
    • Can the main part of the script be encapsulated in the image that is executed?
    • Can the script be replaced by its own tool (and thus image)?
    • It's always possible to use a configmap on demand

I think it should be documented as a "known limitation" with some alternatives / workarounds to apply in different contexts, though.

@vdemeester
Member

On validation, we may be able to do something as well, but it's not really clear what "limit" we would set. As written here, it's not super clear what the limit is per OS, and it can depend, on Linux for example, on kernel options, … I tried getconf ARG_MAX on two different machines (my NixOS laptop and a Fedora server) and the values are widely different 😓.

So we would have to decide on an arbitrary limit (or the lowest one we could find) to make it work in most cases.

@chitrangpatel
Contributor

chitrangpatel commented Dec 21, 2023

On validation, we may be able to do something as well, but it's not really clear what "limit" we would set. As written here, it's not super clear what the limit is per OS, and it can depend, on Linux for example, on kernel options, … I tried getconf ARG_MAX on two different machines (my NixOS laptop and a Fedora server) and the values are widely different 😓.

So we would have to decide on an arbitrary limit (or the lowest one we could find) to make it work in most cases.

And this limit would have to be on the size of all the scripts combined. Even with small param sizes, it can be easily triggered with many such params (or many steps with large scripts) that cumulatively reach this limit. At that point, I wonder if that's any better than the container failing (it's still a runtime error).

@chitrangpatel
Contributor

chitrangpatel commented Dec 21, 2023

@vdemeester I wonder if the place-init container can read the script from a file and do its thing instead of getting its information from arguments.
That would remove the cumulative effect because we wouldn't be passing entire scripts in as arguments. Sure, this can still happen with users' scripts, but then it would be because the argument is too large for a single step; at least place-init itself would never fail. WDYT?

On the other hand, how would the controller write the file which the init-container can access 🤔 because the container could be spawned on a completely different node. So this might not be cleanly possible. ConfigMap? Downward API? Or keep the behaviour the same and add one place-init container per script(?)

@vdemeester
Member

vdemeester commented Dec 21, 2023

@vdemeester I wonder if the place-init container can read the script from a file and do its thing instead of getting its information from arguments. That would remove the cumulative effect because we wouldn't be passing entire scripts in as arguments. Sure, this can still happen with users' scripts, but then it would be because the argument is too large for a single step; at least place-init itself would never fail. WDYT?

Yes, we have to "pass" the "content" of the script into a file somehow. Right now, we pass it as part of the Pod definition, in base64. We could also compress it a bit, but it would still run into limits at some point.
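
For context, the generated init container has roughly the following shape (an approximation for illustration, not the controller's actual output; paths and names are simplified). Because the base64-encoded, already-substituted scripts are embedded in the single argument passed to sh -c, that one argument grows with every script and every inlined param value, and it is that argument which exec rejects with "argument list too long":

# Approximate shape only, not the actual generated Pod spec.
initContainers:
  - name: place-scripts
    image: cgr.dev/chainguard/busybox
    command: ["sh"]
    args:
      - -c
      - |
        # The base64-encoded script for each step is embedded directly in this
        # argument and decoded into /tekton/scripts at runtime.
        cat > /tekton/scripts/script-0 << 'EOF'
        <base64-encoded content of step 0's script, params already substituted>
        EOF
        ...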

On the other hand, how would the controller write the file which the init-container can access 🤔 because the container could be spawned on a completely different node. So this might not be cleanly possible. ConfigMap? Downward API?

Anything "external" to the Pod definition has to be through another Kubernets object (ConfigMap, …). We could use volume but in all cases, we'll have to create more objects, putting more pressure on etcd anyway.

Or keep the behaviour the same and add one place-init container per script(?)

This has other implications, see here; essentially, more containers means "smaller" possible results, and it has already bitten us in the past.

@rouke-broersma
Author

@rouke-broersma Is this happening when using the script feature (with very, very big script content)?

Initially, the script feature is/was meant for relatively small "scripts" (few lines) and not full blown scripts (python, …) that would be hundreds of kilobytes or more — mainly because all this is "stored" in the object (CR), and there is a limit to the size of any object in Kubernetes.

We are using TaskRuns with the script step. We have about 10 steps in the TaskRun, with between 20 and 120 lines of bash per step, so I wouldn't consider them large. The large content is one TaskRun param (TaskRun.spec.params) which contains a base64-encoded JSON file that is used in the TaskRun steps.

@chitrangpatel
Contributor

@vdemeester I wonder if the place-init container can read the script from a file and do its thing instead of getting its information from arguments. That would remove the cumulative effect because we wouldn't be passing entire scripts in as arguments. Sure, this can still happen with users' scripts, but then it would be because the argument is too large for a single step; at least place-init itself would never fail. WDYT?

Yes, we have to "pass" the "content" of the script into a file somehow. Right now, we pass it as part of the Pod definition, in base64. We could also compress it a bit, but it would still run into limits at some point.

On the other hand, how would the controller write the file which the init-container can access 🤔 because the container could be spawned on a completely different node. So this might not be cleanly possible. ConfigMap? Downward API?

Anything "external" to the Pod definition has to be through another Kubernets object (ConfigMap, …). We could use volume but in all cases, we'll have to create more objects, putting more pressure on etcd anyway.

Or keep the behaviour the same and add one place-init container per script(?)

This has other implications, see here; essentially, more containers means "smaller" possible results, and it has already bitten us in the past.

Agreed. However, we now also offer larger results via sidecar logs, which do not run into these limits. So it may be worth revisiting that issue.

@vdemeester
Member

We are using TaskRuns with the script step. We have about 10 steps in the TaskRun, with between 20 and 120 lines of bash per step, so I wouldn't consider them large. The large content is one TaskRun param (TaskRun.spec.params) which contains a base64-encoded JSON file that is used in the TaskRun steps.

Ah I see. That's "kind-of" a problem as well. Params are really meant for "relatively" small "parameters". The content of a file is usually not considered as such. We could "try" to limit params to an arbitrary length, but it would be a breaking change (so… we would have to do it in a v2). And even then, what would that limit be, and how would we choose it?

For now, the only thing we can do is to document this as a "kind-of" known limitation, and discuss best practices.
