Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[proposal] Capturing step exit codes in Tasks #2800

Closed
afrittoli opened this issue Jun 10, 2020 · 30 comments
Closed

[proposal] Capturing step exit codes in Tasks #2800

afrittoli opened this issue Jun 10, 2020 · 30 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@afrittoli
Copy link
Member

Expected Behavior

It should be possible to use easily use off-the-shelves (OTS) images as steps in Tekton tasks.

Actual Behavior

When using OTS images, a step / task author does not control the error case behaviour of the image. It may be desirable for the pipeline to continue even if there was a failure.

One way to achieve this could be to wrap the original image into a custom image, but that does not scale well if the number of images grows.

An alternative way would be to allow a Task to fail without having the whole pipeline failing. We discussed this a lot in the past. Finally tasks will help a bit, cloud events also provide a way around this, however neither solution provide a solution that feels natural for the problem.

Proposed Solution

We could optionally capture the exit code from steps and expose it as test results.
This could be achieved using entrypoint image, which already wraps around all steps images.
This feature could be enabled by setting a flag in the task spec (I would not consider this decision a runtime one).

The benefits would be:

  • the exit code becomes available as a result to following tasks and conditions, allowing for branching based on the result, using conditions as they exists today
  • limited change to the task execution logic:
    • once a step fails, no more steps are executed (like before)
    • if the exit code is captured, the step failure does not cause a task failure
  • no change to the pipeline logic: once a task fails, the pipeline fails
  • the solution could co-exist with any future development in terms of conditionals and workflow capabilities

/kind feature

@tekton-robot tekton-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 10, 2020
@afrittoli
Copy link
Member Author

@bitsofinfo
Copy link

bitsofinfo commented Jun 10, 2020

Thank you! (for reference) https://tektoncd.slack.com/archives/CK3HBG7CM/p1591741813404000

@jlpettersson
Copy link
Member

Indeed an interesting use case!

I feel that finally: in combination with a SlackMessage Task or a CloudEvent Task is very related here.

@afrittoli
Copy link
Member Author

@jlpettersson you can use finally and/or cloud events to achieve that, but you cannot achieve branching within the pipeline that way. The only way today to react to a failed task today is outside the pipeline. A task in finally could trigger another pipeline or a cloud event could also trigger another pipeline.

@jlpettersson
Copy link
Member

@jlpettersson you can use finally and/or cloud events to achieve that, but you cannot achieve branching within the pipeline that way. The only way today to react to a failed task today is outside the pipeline. A task in finally could trigger another pipeline or a cloud event could also trigger another pipeline.

That is a good point.

A solution here, could be allowedToFail as I proposed in #2635

@pritidesai
Copy link
Member

@afrittoli I can not think of any use case at this moment, but would be beneficial to pipeline level exception or onFailure

@afrittoli
Copy link
Member Author

The use case that triggered this is being able to use off-the-shelves images and control the exit code, so that the pipeline can be branched based on the outcome of the task.
If would allow implementing something like onFailure without forcing the task do define the logic to be taken on failure, leaving it to the next tasks to decide.

@bitsofinfo
Copy link

bitsofinfo commented Jun 12, 2020

Yes I'm not interested in having to create/maintain another thing outside of tekton for capturing/monitoring for events and then making additional calls against tekton to react to those events.

I just need a simple way to get access to a task's exit code. Right now I'm hacking this with something like the below (only for off the shelf images that I can wrap). I obviously can't do this with an off the shelf image that I am not wrapping in my own script task etc.

        <something that can fail, but should not halt the entire execution>
        LAST_EXIT_CODE=$?
        echo -n $LAST_EXIT_CODE > /tekton/results/last-exit-code

Then I can reference the task's exit code in the results from a subsequent pipeline condition.

This of course makes the failing task still look like it's succeeding because the echo exit code is zero... hacky but it works.

See example here for usage of this pattern: https://github.com/bitsofinfo/cicdstatemgr/tree/master/examples/tekton/pipelines

@tekton-robot
Copy link
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Copy link
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 14, 2020
@tekton-robot
Copy link
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vdemeester
Copy link
Member

/remove-lifecycle rotten
/remove-lifecycle stale
/reopen

@tekton-robot tekton-robot reopened this Aug 17, 2020
@tekton-robot
Copy link
Collaborator

@vdemeester: Reopened this issue.

In response to this:

/remove-lifecycle rotten
/remove-lifecycle stale
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 17, 2020
@afrittoli
Copy link
Member Author

@skaegi I believe this would partly help with your use case of failing tests that should not fail a pipeline :)

@vdemeester vdemeester mentioned this issue Nov 20, 2020
17 tasks
@pritidesai
Copy link
Member

/assign

I will start writing TEP for @skaegi's use case.

@bobcatfish
Copy link
Collaborator

  1. I'm wondering how this relates to Add a field to Step that allows it to ignore failed prior Steps *within the same Task* #1559 ? (if it does? they feel like similar features to me?)
  2. @afrittoli is it possible to add more detailed use cases? (maybe you are already doing this in the TEP) - as usual I feel skeptical about new features, particularly when they are changing the meaning of failure - but a few use cases can definitely bring me around pretty easily!

@afrittoli
Copy link
Member Author

  1. I'm wondering how this relates to Add a field to Step that allows it to ignore failed prior Steps *within the same Task* #1559 ? (if it does? they feel like similar features to me?)

It seems related, even though #1559 puts control on the step after the one that might fail.
I filed this issue to reply to a specific request, and I missed #1559 when doing so, sorry about that.

  1. @afrittoli is it possible to add more detailed use cases? (maybe you are already doing this in the TEP) - as usual I feel skeptical about new features, particularly when they are changing the meaning of failure - but a few use cases can definitely bring me around pretty easily!

The use case is being able to have a task in your pipeline that generates useful information (often by running some kind of test / information) which should not prevent a successful execution of the pipeline. Exposing the original exit code allows tools to visually highlight the fact that an error code was captured.

The only way to achieve something similar today is to use a script that wraps the execution of the tool and captures the original exit code in a result. The limitations of these approach are:

  • it requires wrapping off-the-shelve images with a script
  • there is no standard way to expose the original exit code, which makes it impossible for the tools (cli, dashboard) to build any functionality on top

This change has a very limited impact on the code base (see PoC) and rest of Tekton functionality; personally I feel it would be an easy win which would make a few users happy. It would be released in alpha state, so nothing would prevent us from taking a different approach in the future if we see this can be achieved better as part of a different feature.

@bobcatfish
Copy link
Collaborator

Exposing the original exit code allows tools to visually highlight the fact that an error code was captured.

I'm guessing these tools would probably want the logs too, right? I'm not sure the exit code alone is enough information.

@afrittoli
Copy link
Member Author

This change could be used also to achieve a series of steps like: running tests, post-process and upload logs and test results, all as part of the same task. In cases where a test failure should fail the pipeline, it would require the last step to exit with the exit code that was captured from the test-running step.

This specific use case could probably be implemented by introducing pre/post steps, or finally at task level, or by allowing a pipeline to be executed as a task (i.e. on a single node, with no need for extra I/O) thus leveraging the pipeline workflow capabilities in a task. But this will require more time and effort to design and implement.

@afrittoli
Copy link
Member Author

Exposing the original exit code allows tools to visually highlight the fact that an error code was captured.

I'm guessing these tools would probably want the logs too, right? I'm not sure the exit code alone is enough information.

Yeah, sure, but logs would still be available the same way they are today? The exit code should be enough info to make a step or task as "yellow".

@bobcatfish
Copy link
Collaborator

The exit code should be enough info to make a step or task as "yellow".

Would it be a situation like certain exit codes are yellow and some are red?

I'm also wondering: isn't this information available in the container statuses in a taskrun already, if a system really wanted to use it?

(I might be able to understand better if you could explain the use case in more detail)

@skaegi
Copy link
Contributor

skaegi commented Jan 12, 2021

For exit codes I think we should stick with just...

  1. exit code == 0 (success)
  2. exit code != 0 (failure)

A warn or yellow state is something that can get layered later by somehow tagging the step or even by searching the logs for [WARN] or similar. It might be nice to have smarts to mark a step as being "yellow" but there still are lots of cases where a step's success is not critical to the success of the task.

Some real world examples where you don't "necessarily" want to fail a build...

  1. Linting
  2. Localization checks
  3. Unit tests for experimental features typically behind a feature flag
  4. "clean" operations where it's possible there's nothing to clean-up
  5. "create" operations where it's possible the entity is already created

--
(Slightly off topic) Gnu Make has this support and used "ignore-errors", that might be a good choice for syntax ... https://www.gnu.org/software/make/manual/html_node/Errors.html

@vdemeester
Copy link
Member

So I am a bit torn about this feature (wasn't sure to comment here or on the TEP).

I tend to think, as it can definitely be done using the feature set (aka wrapping the binary/command/process run in a step and write in a result), I would advocate towards not adding it. That said, as @afrittoli, this would mean, to display things like a "yellow" task (similar to a yellow build/job in Jenkins), external tooling (dashboard, cli, …) would need to rely on some conventions rather than a "contract" from the API.

@bobcatfish
Copy link
Collaborator

I agree with @vdemeester - im gonna try to concentrate my feedback on the TEP PR (tektoncd/community#302) - I think there are a few features we're discussing simultaneously here (maybe? e.g. allowing a step to fail, exposing exit codes), I'm hoping we can narrow it down a bit in the TEP.

@skaegi
Copy link
Contributor

skaegi commented Jan 25, 2021

/assign

@tekton-robot
Copy link
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2021
@pritidesai
Copy link
Member

/remove-lifecycle stale
Problem statement (TEP-0040) is merged, working on the proposal.

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2021
@vdemeester
Copy link
Member

/lifecycle frozen

@tekton-robot tekton-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 26, 2021
@bobcatfish
Copy link
Collaborator

oooo I think maybe this can be closed now that @pritidesai has implemented onError?

@pritidesai
Copy link
Member

yes please 🙏 , this can be closed now since we have onError in a step now.
Implemented in #4106

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.
Projects
Status: Done
Development

No branches or pull requests

8 participants