Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0040 - ignore step error #302

Merged
merged 1 commit into from
Feb 25, 2021

Conversation

pritidesai
Copy link
Member

@pritidesai pritidesai commented Jan 8, 2021

Proposing a TEP to ignore step error and provide an option to continue after capturing the non zero exit code.

Related issues:

@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 8, 2021
@pritidesai
Copy link
Member Author

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Jan 8, 2021
@pritidesai pritidesai force-pushed the capture-step-exit-code branch from 7141d27 to 91eafc4 Compare January 8, 2021 22:22
@bobcatfish
Copy link
Contributor

I'd really like to take a look at this before we merge it - I think there's a lot of overlap with the use case listed here and the goals we have for the PipelineResources revamp (specifically around task specialization)

/assign bobcatfish
/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 11, 2021
@afrittoli
Copy link
Member

The code of the PoC is at https://github.com/afrittoli/pipeline/commits/tep_0040 (the last two commits)

Copy link
Contributor

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused about the problem(s) we are trying to address here. There are a few related problems and it feels like this is kinda starting to address but it's not totally clear:

  1. Allow steps to fail (Add a field to Step that allows it to ignore failed prior Steps *within the same Task* pipeline#1559) - progress on this one has been stalled since @imjasonh pointed out it's not clear that a single task is the best solution and a pipeline with multiple Tasks might make more sense: Add a field to Step that allows it to ignore failed prior Steps *within the same Task* pipeline#1559 (comment)
  2. "off the shelf" images <-- this seems like an interesting problem on it's own and is the motivation behind TEP 11. id really like to understand some examples of this a bit better
  3. It's interesting to me that we are talking about providing the exit code itself - up until this point, and in proposals like TEP 28 we've discussed failure vs success but never at the level of the error code before. Why is it important to this proposal, or is this just a case of wanting to understand if a step has failed?
  4. I'm a bit confused about what level we want this at: is it important that steps within a Task be allowed to fail and other steps can see their status, or is it important that Tasks within a Pipeline be able to fail and provide their exit code as part of their failure? (some use cases might help clear this up) <-- if the main motivation here is to allow tasks in a pipeline to fail and for something to act on that failure, id like to understand more about why finally Tasks dont do the trick
  5. The use case listed in this TEP (reporting the failure of unit tests) doesn't seem well suited to this solution: a pipeline with a finally task seems like it could do the same job? baking everything into one task interferes with the task's resusability (and is one of the motivations of pipelineresources and task specialization

I think the key thing is I'm trying to understand what this provides that we don't get with finally tasks + TEP 28.

an option to capture non-zero exit code. The pipeline author can choose to continue execution after capturing
the non-zero exit code and make it available to access it after the execution.

Issue: [tektoncd/pipeline#2800](https://github.com/tektoncd/pipeline/issues/2800)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might also be covering tektoncd/pipeline#1559 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, even though the approach there is slightly different. The focus on that one is to let a step run even though there was a previous failure. Here is on letting the potentially failing step declare that the failure should be captured, so other steps don't need to be aware about the failure at all.

As a pipeline author, I would like to design a task with multiple steps. One of the steps is running an
enterprise image to run unit tests, and the next step needs to report test results even after a previous
step results in failure due to tests failure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you go into more detail with the use cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One example could be the nightly release pipeline. As part of the release pipeline we might want to run a security scan, unit tests and other checks. None of this checks should stop the release pipeline from executing, however we want to run the tests so that we keep a track of which failures we saw against which version.

This can be sometime achieved by wrapping the step that runs the tests with a script, but in other cases it would require to create a new container image that contains a shell. In cases where wrapping with a script is possible, it may still be undesirable.

This use case is strengthened by the case of users migrating to Tekton. They may have scripts / steps from their previous CI system where it was possible to ignore a failure, and they would have to modify them and/or wrap them to trap the exit code.

Copy link
Contributor

@bobcatfish bobcatfish Jan 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our discussion the other day: the nightly release pipeline example is kind of different b/c that's about allowing an individual Task to fail; this proposal is about allowing steps within Tasks to fail. If we wanted to apply this proposal to our nightly release pipeline for example, we'd have to modify the golang-test task

Copy link
Member

@afrittoli afrittoli Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to modify a Task is the main downside of this approach, we should capture it in the TEP.
The use case I'm facing is to migrate users to Tekton, they already have tasks written, and they need to be moved to Tekton, so it is not an issue in this case.

A possible workaround for existing Tasks, they could expose the capture flag through a task parameter, to allow the Task to use in both modes, with and without capture.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"we'd have to modify the golang-test task" ... wow I somehow hadn't considered that. All of our legacy cases are with teams providing an "image" that when run might "fail", but in Tekton where we're promoting a Task Catalog teams might provide a Task that we should equivalently let a Pipeline decide if it should fail. I think maybe we should consider support ignoring Task failure in this TEP or at least immediately create another TEP to look at adding support.

## Motivation

It should be possible to easily use off-the-shelves (OTS) images as steps in Tekton tasks. A task author has no
control on the image but may desire to capture an error and continue executing the pipeline.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you provide some examples of images that cannot be used today? or what it would look like to try to use them?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the main issue with these off-the-shelf images that they sometimes don't have a shell? this was the motivation for TEP-0011 as well

(if the images have a shell, you can write your own script to do whatever you want with the exit code)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed, the most difficult case is that of images that do not include a shell.
The wrapping script feels to me like a workaround though, I think it would be a nicer user experience if we allowed natively to trap the exit code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another detail we could include in the TEP: many projects have to create "debug" versions of their images to support users that want shells; in CD automation especially there are so many scenarios where you do need shells that often folks end up just having to always use the debug version

The wrapping script feels to me like a workaround though

could you explain a bit more? It feels to me like being able to run a script that runs the command that needs to be run gives the user the most flexibility, e.g. they can grab the exit code, they can grab stdout + stderr, they can do simple processing like grep and awk, etc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Container images are not necessarily written with a wrapping script in mind.
If an image sets an entrypoint, it is usually the recommended way to use the image. To write a script that wraps the container, we need to inspect the container image, find out what the entry point is, and run the command in our script the same way the entry point would have, which is not something we should ask Task author to be doing. If the entrypoint changes, the task will have to be updated too.

We could have a mechanism that fetches the entrypoint from the docker image automatically and runs it... oh, wait, we already do, it's Tekton's entrypoint. This TEP proposes to add a flag in the entrypoint to capture the exit code - I don't think we should do this instead in a shell script that will be executed by entrypoint.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk - maybe we can discuss this more when we start looking at possible designs

Steps are executed in order in which they are specified. One single step failure results in a task failure
in turn results in a pipeline failure. Once a step results in failure, rest of the steps are not executed.

Many common pipelines have requirement where a step failure must not halt the entire execution. In order to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im a bit confused about whether this problem exists at the pipeline level or at the step level, is it possible to dig into it a bit more?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The failure originates in a step and propagates to the entire pipeline.
In terms of granularity, the proposal of this TEP is to trap the exit code for specific steps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like we have two separate scenarios, maybe we want to handle both, but I'm not convinced both apply to the use cases we're discussing:

  1. I want to stop the failure of a Task in a Pipeline from stopping the execution of a Pipeline (and maybe stop it from failing as well)
  2. I want to stop the failure of a step in a Task from failing the Task

The failure originates in a step and propagates to the entire pipeline.

Right, there is a relationship between the two - you could potentially use (2) to give you (1) as well, but (2) doesn't let you have any control over this at the pipeline level, it has to be baked into the Task. I'd argue that to truly satisfy (1) we'd want that control at the pipeline level

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might eventually want support for ignoring both "step" and "task" failure.

@pritidesai
Copy link
Member Author

Thanks @bobcatfish for thorough review 😻

It's interesting to me that we are talking about providing the exit code itself - up until this point, and in proposals like TEP 28 we've discussed failure vs success but never at the level of the error code before. Why is it important to this proposal, or is this just a case of wanting to understand if a step has failed?

My understanding is taskrun and pipelinerun execution status are of type conditions (implementing duck type) vs step execution states are limited to container states with exitCode and reason.

TaskRun status:

"conditions": [
     {
         "lastTransitionTime": "2021-01-08T22:45:22Z", 
         "message": "All Steps have completed executing",
         "reason": "Succeeded",                                               
         "status": "True",
         "type": "Succeeded"
     }

TaskRun status with a list of steps:

"steps": [
      {
          "container": "execute-this-step",                                                        
          "imageID": "....",
          "terminated": {
              "containerID": "...",
              "exitCode": 0,
              "finishedAt": "2021-01-08T22:45:22Z",
              "message": "[{\"key\":\"sum\",\"value\":\"12\",\"type\":\"TaskRunResult\"}]",
              "reason": "Completed",
              "startedAt": "2021-01-08T22:45:22Z"
          }
      }
  ],

The duck typed conditions are not available at the step level (we could implement them if needed), and therefore exit codes are so important to this proposal. A non zero exit code of a container means a step failed. Failed step with capture exit code set to true allows to ignore that step failure and continue executing rest of the steps and tasks.

@pritidesai
Copy link
Member Author

pritidesai commented Jan 12, 2021

I'm a bit confused about what level we want this at: is it important that steps within a Task be allowed to fail and other steps can see their status, or is it important that Tasks within a Pipeline be able to fail and provide their exit code as part of their failure? (some use cases might help clear this up) <-- if the main motivation here is to allow tasks in a pipeline to fail and for something to act on that failure, id like to understand more about why finally Tasks dont do the trick

It is important that steps within a task be allowed to fail and other steps in the same task can access the exit code. This proposal is limited to step failure and does not propose allowing a task to fail. A step capturing exit code ignores step failure and declares that step as a success (exit with zero) which results in successful task and continues executing rest of the tasks in a pipeline. With this proposal, a task failure cannot be ignored without capturing a step exit code and a single task failure still results in pipeline failure.

@bobcatfish
Copy link
Contributor

It is important that steps within a task be allowed to fail and other steps in the same task can access the exit code.

Maybe we can dig into this a bit more. This feels to me like a few related but different features:

  1. Allow a step to fail (Add a field to Step that allows it to ignore failed prior Steps *within the same Task* pipeline#1559) without stopping execution of the next step
  2. In the case where a step is allowed to fail, allow the next step to know if the previous step failed
  3. In addition to letting the next step know if the previous step failed, provide it with the exit code of the previous step

( @afrittoli 's demo seemed to go a bit further than 3 by also exposing the exit code as a result of the Task but it sounds like that might be beyond the scope of this TEP? in the issue tektoncd/pipeline#2800 @afrittoli mentions "It may be desirable for the pipeline to continue even if there was a failure." so I feel like there might be a bit of a mismatch between this TEP and that issue 🤔 )

Before we get into (2) and (3) I'd like to understand a bit better why we need (1). At a glance it makes sense, however once we start to dig into it (e.g. @imjasonh 's comment tektoncd/pipeline#1559 (comment)) it seems like the use cases motivating this are much better suited to a Pipeline with multiple Tasks. If we can identify some use cases that need (1) and why a Pipeline with multiple Tasks doesn't work for them I think that will really help.

@afrittoli afrittoli changed the title TEP - capture step exit code TEP0040 - capture step exit code Jan 13, 2021
@afrittoli afrittoli changed the title TEP0040 - capture step exit code TEP-0040 - capture step exit code Jan 13, 2021
@afrittoli
Copy link
Member

It is important that steps within a task be allowed to fail and other steps in the same task can access the exit code.

Maybe we can dig into this a bit more. This feels to me like a few related but different features:

  1. Allow a step to fail (tektoncd/pipeline#1559) without stopping execution of the next step

...and ultimately, of the whole pipeline, since today a step failure implies failure of the task and of the overall pipeline

  1. In the case where a step is allowed to fail, allow the next step to know if the previous step failed

In fact what this TEP provides is a way to trap or capture the failure, so the step does not actually fail from a Tekton POV.
Let the next step access the previous step success status (or trapped exit code) is not something that I thought would be in scope for this TEP. It is something that we can add later.

  1. In addition to letting the next step know if the previous step failed, provide it with the exit code of the previous step

( @afrittoli 's demo seemed to go a bit further than 3 by also exposing the exit code as a result of the Task but it sounds like that might be beyond the scope of this TEP? in the issue tektoncd/pipeline#2800 @afrittoli mentions "It may be desirable for the pipeline to continue even if there was a failure." so I feel like there might be a bit of a mismatch between this TEP and that issue 🤔 )

The reason I added the original exit code as a result in the PoC is because normally the exit code would be available in the step status. I wanted to provide a way to still make it available even when it's captured. In fact the bit that is strictly in scope for this TEP is to expose whether the step would have failed or not, to allow for external systems to identify this cases and bring them to the attention of users.

Before we get into (2) and (3) I'd like to understand a bit better why we need (1). At a glance it makes sense, however once we start to dig into it (e.g. @imjasonh 's comment tektoncd/pipeline#1559 (comment)) it seems like the use cases motivating this are much better suited to a Pipeline with multiple Tasks. If we can identify some use cases that need (1) and why a Pipeline with multiple Tasks doesn't work for them I think that will really help.

A pipeline with multiple tasks would solve the issue because a task failure would still cause the pipeline to fail.
Capturing the error at step level provides some advantages compared to doing so at task level.

If I want to:

  • run tests (might fail)
  • push results and logs

and I can only capture errors at Task level, I need to run the tests in a task and push the logs in a different ones, which means I need to copy logs and results to some pipeline storage, download them from there and push them to the final storage, which means three times as much I/O.

@bobcatfish
Copy link
Contributor

Thanks for the extra info @afrittoli !

I need to run the tests in a task and push the logs in a different ones, which means I need to copy logs and results to some pipeline storage, download them from there and push them to the final storage, which means three times as much I/O.

Ah okay interesting! So my questions about this use case are:

  1. I guess we're talking about a scenario where the image that we are using to run the tests doesn't have a shell? If it did have a shell, we'd have control over whether to fail the step or not
  2. Sounds like an important requirement (maybe we can add this to the list) is to be able do all of this in one pod, e.g. something like "run a command that fails / run an image that does not contain a shell and fails, then act on that failure, all within the same pod (without requiring sharing data between pods)"?
  3. How important is it that the overall Pipeline doesn't fail in this case and why is it important?

Another way we could tackle this is via task specialization, i.e. if we provide a way to specify "wrapping" tasks (like in this example where there is a "finally" clause for an individual Task) - this way you could write one Task that runs the test, and another Task that pushes the results and logs, and (theoretically) execute all of that within the same pod. From the requirements you described, the only missing bit is whether the failure of the Task fails the entire Pipeline or not.

(Totally understand if you don't agree with that route to tackle this, but I want to make sure we really understand what problem we're solving exactly - it kinda sounds like the main problem is: once any step in a Task fails, if you want to act on that failure, you have to do it in a separate pod?)

Copy link
Contributor

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to discuss this with me yesterday @afrittoli and @pritidesai ! I've added some comments with updates we could make the TEP to flesh it out based on the discussion.

My biggest overall request is that we try to frame this problem clearly without referring to the solution (yet!)

- '@afrittoli'
---

# TEP-0040: Capture Step Exit Code
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This title feels to me like it jumps to an implementation but doesn't capture what's really motivating this.

Some ideas for alternative names: (sorry they are so wordy XD)

  • Support mulitstep Tasks when some steps have no shell
  • Allow Tasks to continue to act after the failure of a step
  • Allow steps to fail without failing or halting Tasks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup I agree, I think "Allow steps to fail without failing Tasks" fits well as the biggest motivation here is to continue executing subsequent steps and preventing the task from failing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"treat a step as passing regardless of whether it fails or succeeds" - @skaegi
"ignore the failure of a step" - @imjasonh

As a pipeline author, I would like to design a task with multiple steps. One of the steps is running an
enterprise image to run unit tests, and the next step needs to report test results even after a previous
step results in failure due to tests failure.

Copy link
Contributor

@bobcatfish bobcatfish Jan 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our discussion the other day: the nightly release pipeline example is kind of different b/c that's about allowing an individual Task to fail; this proposal is about allowing steps within Tasks to fail. If we wanted to apply this proposal to our nightly release pipeline for example, we'd have to modify the golang-test task

## Motivation

It should be possible to easily use off-the-shelves (OTS) images as steps in Tekton tasks. A task author has no
control on the image but may desire to capture an error and continue executing the pipeline.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another detail we could include in the TEP: many projects have to create "debug" versions of their images to support users that want shells; in CD automation especially there are so many scenarios where you do need shells that often folks end up just having to always use the debug version

The wrapping script feels to me like a workaround though

could you explain a bit more? It feels to me like being able to run a script that runs the command that needs to be run gives the user the most flexibility, e.g. they can grab the exit code, they can grab stdout + stderr, they can do simple processing like grep and awk, etc.

enterprise image to run unit tests, and the next step needs to report test results even after a previous
step results in failure due to tests failure.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a requirements section as well, e.g.:

Requirements

  • Users should be able to use prebuilt images as-is without having to do one or more of the following (see also TEP-0011:
    • Investigating how they are built to understand if they contain a shell and possibly overriding the entrypoint
    • Build and maintain their own images (i.e. add in required shell or other binaries) from those images
  • It should be possible to know that a step was allowed to fail by observing the status of the TaskRun (and PipelineRun if applicable) (e.g. to show a "warning" / display as "yellow" status in a UI)
  • When a step is allowed to fail, the exit code of the process that failed should not be lost and should at a minimum be available in the status of the TaskRun (and PipelineRun if applicable)
  • When a step is allowed to fail, the Task should not be marked as failed. <-- im a bit unsure about this one @afrittoli - if we want to do this, I think we need to make it possible for other steps to know that this happened and determine if the overall status should be failed, what do you think? (e.g. in the case of uploading test results, you want to keep running steps so you can do the upload, but still the overall Task has failed)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bumping this

teps/0040-capture-step-exit-code.md Outdated Show resolved Hide resolved
@afrittoli
Copy link
Member

It is important that steps within a task be allowed to fail and
other steps in the same task can access the exit code.

@pritidesai I slightly disagree on this last point specifically. I agree it could be interesting for steps to have access to other step outcomes, but that would be a different TEP with different use cases. For this TEP specifically, we should not plan on exposing outcome of steps within a running pipeline.

@bobcatfish
Copy link
Contributor

/assign vdemeester

@skaegi
Copy link
Contributor

skaegi commented Jan 25, 2021

/assign

@bobcatfish
Copy link
Contributor

@pritidesai I slightly disagree on this last point specifically. I agree it could be interesting for steps to have access to other step outcomes, but that would be a different TEP with different use cases. For this TEP specifically, we should not plan on exposing outcome of steps within a running pipeline.

My sticking point @afrittoli is that if we don't allow other steps to see the status of the step that was allowed to fail, the Task will never be able to fail (or it would require significant workarounds to make it happen).

For example in the use case where I want to run tests, then upload results, I think there is a good chance that a user will still want the overall Task to fail, they just want to be able to upload results before that happens.

@afrittoli
Copy link
Member

@pritidesai I slightly disagree on this last point specifically. I agree it could be interesting for steps to have access to other step outcomes, but that would be a different TEP with different use cases. For this TEP specifically, we should not plan on exposing outcome of steps within a running pipeline.

My sticking point @afrittoli is that if we don't allow other steps to see the status of the step that was allowed to fail, the Task will never be able to fail (or it would require significant workarounds to make it happen).

For example in the use case where I want to run tests, then upload results, I think there is a good chance that a user will still want the overall Task to fail, they just want to be able to upload results before that happens.

That's a good point.
However I think the main use case for this TEP is not delay the failure to another step but to prevent it.
It is true that we have use cases where we do some failure analysis in the pipeline itself, but that would be based on information extracted from the output of the previous step rather than the exit code.
For instance we could have a list of tests that must pass, and if one of them is in the list of failures we fail.

This TEP will work well in combination with that about capturing stdout, since with both it will be possible for the next step to always run and have access to the output, even when the original docker image did not support sending the output to a file via args.

@pritidesai pritidesai force-pushed the capture-step-exit-code branch from 91eafc4 to bc9ad7b Compare January 25, 2021 22:57
@pritidesai
Copy link
Member Author

pritidesai commented Jan 25, 2021

thanks @bobcatfish and @afrittoli for all the reviews.

I have addressed most of the comments except two:

  1. Alternative title
  2. Exposing exit code to subsequent steps in a task

Also, I will add drawback section after adding design proposal 🙃

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 2, 2021
@pritidesai
Copy link
Member Author

Right, there is a relationship between the two - you could potentially use (2) to give you (1) as well, but (2) doesn't let you have any control over this at the pipeline level, it has to be baked into the Task. I'd argue that to truly satisfy (1) we'd want that control at the pipeline level

I have expanded the scope of this TEP to include both (1 - task level) and (2 - pipeline level) given that not everything has to be part of the same PR. I can certainly drop (2) from this proposal if you would like. My rationale behind adding (2) is to make the API changes for (1) compatible with the changes we propose for (2).

I have added some examples and renamed the proposal to ignore step error. If we decide to drop support for (2) from this proposal, it can be further renamed to ignore step error in a task.

@pritidesai pritidesai force-pushed the capture-step-exit-code branch from 025386b to 2729f73 Compare February 2, 2021 22:06
@pritidesai pritidesai changed the title TEP-0040 - capture step exit code TEP-0040 - ignore step error Feb 2, 2021
Base automatically changed from master to main February 3, 2021 16:34
@vdemeester
Copy link
Member

I do see value in the proposal and those are valid use case we need to take care of. So I have a probably naive and general question, but given the following proposal (#318 and #316), and the potential of the pipeline-in-pipeline approach (with CustomTask support), what would it mean to support those use case at the Task + Pipeline level only ?

Having an option on both might be a bit confusing to users. When to fail, when to not ? I also wonder if this is a concern at the Task authoring time, at the Pipeline authoring time, at the PipelineRun execution ?

Similar to my comment on #316, I think we need to find where to put the cursor in our composability story and where each option fit.

On this proposal, I feel the use case can be both "at authoring time" and "at runtime". Those might be two different problems to tackle but on the authoring part, the question remain, if we were to have a way to say "don't fail the pipeline if this task fail", would we need the same at the task level (with steps) — and my initial thought is we would not.

  • Having it at the task/step level, we force the task author to take that decision for any user of the pipeline (related to @bobcatfish comment on modifying the catalog tasks).
  • Having it at the pipeline/task level, it's on the pipeline author to take that decision, which seems a bit more reusable (and would make the catalog more useful because more reusable).

I wonder thus if we should mention step in this proposal or not — as it is, from my perspective, already a beginning of a solution. And as linked above, we may need to consider our option taking into acount few other TEPs that aim to remove the overhead of running a pipeline.

@pritidesai pritidesai force-pushed the capture-step-exit-code branch from 2729f73 to 9c3ae5f Compare February 3, 2021 21:26
@bobcatfish
Copy link
Contributor

I have expanded the scope of this TEP to include both (1 - task level) and (2 - pipeline level) given that not everything has to be part of the same PR. I can certainly drop (2) from this proposal if you would like. My rationale behind adding (2) is to make the API changes for (1) compatible with the changes we propose for (2).

I think that expanding the scope of this TEP to include (2) is going to make it harder to merge - b/c of the potential overlap with the other TEPs that @vdemeester pointed out - i have been convinced that there is a legitimate gap that would address in isolation that specifically applies to:

Allowing a step to fail without stopping execution of a Task at Task authoring time

e.g. the use case about a platform team which wants to create a reusable Task that runs tests, and transforms their output, whether they fail or not

I suggest we tackle (2) separately

@pritidesai pritidesai force-pushed the capture-step-exit-code branch from 9c3ae5f to 86f9490 Compare February 5, 2021 05:24
@pritidesai
Copy link
Member Author

pritidesai commented Feb 5, 2021

Sounds good, thanks @bobcatfish @vdemeester 🙏

I have reduced its scope to only address (1):

Allowing a step to fail without stopping execution of a Task at Task authoring time

@pritidesai pritidesai force-pushed the capture-step-exit-code branch from 86f9490 to dbeda27 Compare February 5, 2021 05:27
Copy link
Contributor

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty much ready to approve this - I do want to add one more requirement, not sure if @afrittoli has strong feelings about it.

Also not sure where you stand @vdemeester , with the reduced scope are you happy to go ahead with this, or you want to see the other TEPs through first?

fi
```

This workaround does not apply to off-the-shelf container images.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the detailed overview!

in an image and then altering the entrypoint to allow capturing errors.

* It should be possible to know that a step failed and subsequent steps allowed to continue by observing the status of
the `TaskRun`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also like to include a requirement that other steps within the Task can access this information, I believe @afrittoli disagrees - I'm hoping we can include this as a requirement but also could deliver the feature incrementally such that the initial implementation doesn't need to support this.

I feel strongly that we haven't sufficiently addressed the use cases and problem statement without it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added that (other steps within the task can access the process exit code) in the requirements. @afrittoli please let me know if you still disagree or have any other ideas.

Proposing a tep to ignore step errors and provide an option to
continue after capturing the non zero exit code. Also document the
container termination state to access it after the pipeline exectution finishes.
@pritidesai pritidesai force-pushed the capture-step-exit-code branch from dbeda27 to 0b75788 Compare February 5, 2021 19:24
@afrittoli
Copy link
Member

@bobcatfish
Copy link
Contributor

Thanks for all the updates! I'm happy to go ahead with the problem statement!! Will be great to take a look at our options here.

/approve

We'll need to get @vdemeester to review before we merge tho - he had some thoughts as well

/hold

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 18, 2021
Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I have one question : if we would go ahead with Injecting a shell into steps, then a user could wrap on any image, would this be necessary ?

I feel it could still be useful but on the other hand, the "shell" injection could also bring some "tool" to wrap the exit code and make it available later on. I am mainly asking this to discuss if the use case needs to be handle by an API field or by some tooling.

@afrittoli
Copy link
Member

I think this would still be useful.

Having a shell still requires writing a script which points to the correct entrypoint for the container image, which is more work, more error prone, and it's more brittle as the entrypoint might change in the image.

Using a script on the other hand gives more flexibility, as it allows doing output redirection, reading the error code and more.

So I think they're both useful features to have.

@vdemeester
Copy link
Member

Having a shell still requires writing a script which points to the correct entrypoint for the container image, which is more work, more error prone, and it's more brittle as the entrypoint might change in the image.

Using a script on the other hand gives more flexibility, as it allows doing output redirection, reading the error code and more.

So I think they're both useful features to have.

Agreed 😉

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 24, 2021
@pritidesai
Copy link
Member Author

Thanks @vdemeester @afrittoli

@bobcatfish we have review from @vdemeester, can I remove the hold now? Thanks 🙏

@vdemeester
Copy link
Member

/hold cancel
@pritidesai next step is to add more use case and design detail to pass this as implementable then 😉

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 25, 2021
@tekton-robot tekton-robot merged commit 48e3015 into tektoncd:main Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants