Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEP-0013 for adding a limit to pipeline concurrency #228

Closed
wants to merge 1 commit into from

Conversation

NikeNano
Copy link

@NikeNano NikeNano commented Oct 8, 2020

After suggestion from @jerop, I have made a rough draft of a TEP for how to limit work related to Pipeline concurrency. This has been discussed in the following two issues:tektoncd/pipeline#2591, tektoncd/pipeline#1305.

Related PR: tektoncd/pipeline#3112

Would be great if you could help out with the TEP @jerop, thank you!

As a starting point I understand it as the two issues discuss similar but different things, tektoncd/pipeline#2591 relates to concurrency of tasks within a pipeline while tektoncd/pipeline#1305 the concurrency of several pipelines, but maybe I am misunderstanding the discussions?

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 8, 2020

CLA Signed

The committers are authorized under a signed CLA.

@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 8, 2020
@pritidesai
Copy link
Member

/area tep

@pritidesai
Copy link
Member

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Oct 9, 2020
@jerop
Copy link
Member

jerop commented Oct 12, 2020

Would be great if you could help out with the TEP @jerop, thank you!

@NikeNano thank you for creating this TEP, will contribute to it this week during the week of Oct 19th

Copy link

@ibotty ibotty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some typos I noticed.

teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
Copy link
Member

@jerop jerop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting this together @NikeNano -- suggested some additions and changes to the TEP :)

nit: the tep filename has a typo

teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
teps/0013-limit-pipeline-conecurrency.md Outdated Show resolved Hide resolved
@bobcatfish
Copy link
Contributor

Thanks for getting started on this @NikeNano !

I want to also draw your attention to #203

It's not exactly solving the same problem, but in solving the problem you are describing here, we have 2 choices:

  1. Add a feature to Tekton itself (i.e. directly in the pipelines controller)
  2. Create a new/separate component

The proposal in #203 unlocks the ability to do (2) by making it possible for some external service to decide when a PipelineRun or TaskRun is no longer "pending".

I am personally more inclined toward a solution like (2) because that would prevent the Pipelines controller from taking on more responsibilities and also would make it possible for folks to implement their own custom concurrency limitations however they wanted.

@NikeNano
Copy link
Author

Thanks for all the feedback, I will try to look at it later today or tomorrow and comeback with the updates.

@tekton-robot tekton-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 1, 2020
@NikeNano
Copy link
Author

NikeNano commented Nov 2, 2020

Thanks for all the comments @jerop! I will look over it a bit more and fixe some last things before done.

@NikeNano
Copy link
Author

NikeNano commented Nov 3, 2020

  1. Add a feature to Tekton itself (i.e. directly in the pipelines controller)
  2. Create a new/separate component

Yeah, I was inclined to favour 1, but after reading #203 is might give more possibilities if we go with 2.

The proposal in #203 unlocks the ability to do (2) by making it possible for some external service to decide when a
PipelineRun or TaskRun is no longer "pending".

Both TEPS seems to share some things in common, thanks for the pointer.

I am personally more inclined toward a solution like (2) because that would prevent the Pipelines controller from taking on more responsibilities and also would make it possible for folks to implement their own custom concurrency limitations however they wanted.

Would we in that case let the controller reach an external service to see if TaskRuns should be scheduled? As I understand it this new/seperate component has to interact with the controller to effect the way things are scheduled, is it wise to add this dependence to the controller of a separate service @bobcatfish?

@bobcatfish
Copy link
Contributor

@NikeNano > Would we in that case let the controller reach an external service to see if TaskRuns should be scheduled? As I understand it this new/seperate component has to interact with the controller to effect the way things are scheduled, is it wise to add this dependence to the controller of a separate service @bobcatfish?

I'm thinking of it kinda the other way around!

Looking at tektoncd/pipeline#3112 you were proposing adding MaxParallel to a PipelineRun. If we went the route where something external (let's call it "Limit Service") was in charge of these limits, it could look something like this:

  1. PipelineRun is created
  2. Pipelines controller sees PipelineRun, starts creating TaskRuns <-- each TaskRun needs to be created with spec.status.Pending (as per TEP-0015 - Add a pending setting to Tekton PipelineRun and TaskRuns  #203) (TBD how we'd express that in the API)
  3. Pipelines controller sees the new TaskRuns, but they all have spec.status.Pending; it doesn't do anything with them
  4. Limit Service also sees the TaskRusn with spec.status.Pending; it can now apply whatever logic it wants to determine if the TaskRuns are ready to run or not (e.g. maybe it checks a rule that says "only run a max of 4 Tasks at once per PipelineRun"), or check back later to see if it's ready to run
  5. When Service X decided the TaskRun can run, it removes spec.status.Pending from the TaskRuns(s)
  6. Pipelines controller now sees the TaskRuns are not longer pending, and it starts executing them

This is definitely more complicated but it also means the logic to determine the limits can be entirely pluggable <-- and how important this is probably depends on the use cases, e.g. if "MaxParallel" at a PipelineRun level meets 99% of ppl's use cases, maybe that's enough. Looking at tektoncd/pipeline#1305 as an example tho @jstrachan mentioned some cases where maybe you want limits on the number of Runs against a specific repository even

@NikeNano
Copy link
Author

NikeNano commented Nov 3, 2020

Thanks for the clarification! I like the idea with the flexibility and possibility for users to extend it further. I will update the suggested solution.

@NikeNano
Copy link
Author

FYI: have not forgot, just have had to much to do. Will try to get this done before the weekend.

@bobcatfish
Copy link
Contributor

Thanks for the update @NikeNano , no rush!

@NikeNano
Copy link
Author

Update the TEP now @bobcatfish, @jerop, will have more time to respond faster now as well!

@afrittoli afrittoli changed the title TEP for adding a limit to pipeline concurrency TEP-0013 for adding a limit to pipeline concurrency Nov 16, 2020
@afrittoli
Copy link
Member

@NikeNano could you squash your 31 commits into one please? Before we can proceed on this you'd need to sign the CLA too.
Thank you!

@jerop
Copy link
Member

jerop commented Nov 23, 2020

Thank you for the work on this @NikeNano -- I like the new approach much more!

@tektoncd/core-maintainers please take a look :)

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign ncskier
You can assign the PR to them by writing /assign @ncskier in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@NikeNano NikeNano force-pushed the master branch 2 times, most recently from d06bc5e to a73fc49 Compare November 23, 2020 21:25
@tekton-robot tekton-robot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 23, 2020
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 23, 2020
@NikeNano
Copy link
Author

@afrittoli I have signed it now, but seems to be some latency before it goes through. Will check tomorrow again.

This TEP describes how to add pipeline concurrency limits.

Author: Niklas Hansson <niklas.sven.hansson@gmail.com>
Co-authored-by: Jerop Kipruto <jerop@google.com>
@NikeNano
Copy link
Author

@afrittoli I have squashed and signed it. How do I run the linting locally to debug? Thank you

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a lot of questions throughout. I'm still a little hazy on how exactly this would all work, particularly wrt RBAC.


As suggested [here](https://github.com/tektoncd/pipeline/issues/2591#issuecomment-647754800), we can add a field - `MaxParallelTasks` - to `PipelineRunSpec` which is an integer that represents the maximum number of `Tasks` that can run concurrently in the `Pipeline`.

type PipelineRunSpec struct {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth wrapping this in backticks to make it into a code block.


Enable users to limit the number of tasks that can run simultaneously in a pipeline, which could help with:

- Tracking and limiting how much resources a Pipeline is consuming, and thus how much it costs.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to get more specific here. Reading tektoncd/pipeline#2591 it doesn't sound like this feature is supposed to solve resource or cost problems in the cluster - I think ResourceQuotas would be the right tool for that? I think this is supposed to solve the problem of flooding a service that the Pipeline relies on. In tektoncd/pipeline#2591 (comment) it's a Database server that is hit with 100 parallel reloads.

Maybe something more along the lines of:

Allow a user to limit the number of time-consuming or heavy operations being sent to a service at once in order to avoid accidentally causing a denial-of-service on it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, will update.

-->

#### Story 1
User has a Pipeline with 100 independent Tasks but they don't want all 100 tasks to run at once.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? It would be great to get a more concrete example, like the database server example from issue tektoncd/pipeline#2591

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I agree on the need of being more specific.

#### Story 1
User has a Pipeline with 100 independent Tasks but they don't want all 100 tasks to run at once.
#### Story 2
User wants to limit amount of resources used by a Pipeline at a given time.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be solved by a ResourceQuota.


Separating the logic if a `TaskRun` is allowed to run from the `Task` controller allows for extensibility for adding custom logic to the `Limit Service`.

As suggested [here](https://github.com/tektoncd/pipeline/issues/2591#issuecomment-647754800), we can add a field - `MaxParallelTasks` - to `PipelineRunSpec` which is an integer that represents the maximum number of `Tasks` that can run concurrently in the `Pipeline`.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. Up until here the document describes limiting concurrent Tasks per-Pipeline. Are we intending to limit per Pipeline or per PipelineRun?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I guess PipelineRun would allow for a larger flexibility. What do you think would be best @sbwsg ?

Copy link

@ghost ghost Jan 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone's trying to limit the number of concurrent uses of an external resource like a DB then Pipeline-level maximum makes sense (or even Task-level tbh). If it's PipelineRun-level then a user can just spawn more PipelineRuns and ultimately overwhelm the external thing anyway I think, right? What info does the LimitService have access to in order to decide what can / cannot run? Does it even make sense to include a configurable MaxParallelTasks as part of Tekton's CRDs? Why not make this part of the LimitService's configuration?

So I think this depends on the use-cases we're trying to solve. If we're going to pursue this we really have to nail these down and clearly decide what we're trying to achieve.


### Goals

- Limit how many tasks can run concurrently in a Pipeline.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, are we limiting concurrent TaskRuns per Pipeline or per PipelineRun? I can imagine both features being desirable for different scenarios but I'm curious which one we're specifically aiming for here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above.


`MaxParallelTasks` has to be >= 0 in. If `MaxParallelTasks` is not specified there should be no limit to how many `TaskRun` that can run in parallel and thus `spec.status.Pending` should be removed from all `TaskRuns`.

In order to not end up with a deadlock the order of the `Tasks` in a `Pipeline` has to be respected and accounted for by the `Limit service`.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you go into a bit more detail on how the deadlock could happen in the first place? That sounds like a possible Risk/Mitigation we need to be aware of.


The `Limit Service` could run similar to a control loop checking `TaskRuns` and the restrictions of `MaxParallelTasks` for the related `Pipeline`. If the count of running `TaskRuns` is less than `MaxParallelTasks`, a `TaskRun` would be update and `spec.status.Pending` removed. If the count of running `TaskRuns` equals `MaxParallelTasks`, no `TaskRun` would be updated until later when another `TaskRun` is completed.

`MaxParallelTasks` has to be >= 0 in. If `MaxParallelTasks` is not specified there should be no limit to how many `TaskRun` that can run in parallel and thus `spec.status.Pending` should be removed from all `TaskRuns`.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if MaxParallelTasks is specified but there's no Limit Service running in the cluster? Is there a default Limit Service that Tekton Pipelines comes with? Or are we asking all users to implement their own?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to have one by default, but allow users to implement there own.

MaxParallelTasks int `json:"maxParallelTasks,omitempty"`
}

The `Limit Service` could run similar to a control loop checking `TaskRuns` and the restrictions of `MaxParallelTasks` for the related `Pipeline`. If the count of running `TaskRuns` is less than `MaxParallelTasks`, a `TaskRun` would be update and `spec.status.Pending` removed. If the count of running `TaskRuns` equals `MaxParallelTasks`, no `TaskRun` would be updated until later when another `TaskRun` is completed.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would the proposed Limit Service work in multi-tenant scenarios? Would there be one Limit Service per tenant? If not are you expecting the Limit Service to have global read and write access to all TaskRuns?

It might be worth documenting a bit about how RBAC requirements might be affected with this feature.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm good point, I dont know. Maybe you have some better thought on this @bobcatfish ?

@bobcatfish
Copy link
Contributor

/assign @sbwsg
/assign @chhsia0

@tekton-robot tekton-robot assigned ghost Jan 25, 2021
@tekton-robot
Copy link
Contributor

@bobcatfish: GitHub didn't allow me to assign the following users: chhsia0.

Note that only tektoncd members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @sbwsg
/assign @chhsia0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ghost
Copy link

ghost commented Jan 26, 2021

In the experimental repo there's now discussion of building a controller that provides support for concurrency limits and demonstrates to Operators how to build their own. Operators could then take this controller and base their own implementation on it according to their org's specific concurrency rules. I'm not sure a TEP is necessary any longer if we go that route.

@NikeNano
Copy link
Author

the experimental repo

I think it still would be relevant to have some sort of concurrency limit in the default controller for Tekton, maintaining a controller on your own requires some work. I think it sounds great to allow for the flexibility to allow operators to write there own though.

@ghost
Copy link

ghost commented Jan 27, 2021

I want to make sure this point isn't missed: the proposed controller in tektoncd/experimental#699 is intended to be functional and support options for configuration. Operators won't have to build their own if their use-cases are supported by the implementation there. This doesn't seem that different to the LimitService proposed here, where we'd provide some kind default implementation / configuration out the box I guess?


## Requirements

- Users can specify the maximum number of Tasks that can run concurrently in a Pipeline.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @sbwsg that it would be good to see some concrete use case to better understand if this is the right approach to solve the problem.

If a pipeline or task requires a specific service, which we do not want to overload, limiting the number of concurrent tasks in a pipeline would not allow to control that. If the problem is about resource consumption on the cluster, again that might not provide a solution since:

  • we allow for execution of custom tasks, and we do not have control on the amount of resource used by a custom task (could be zero pods, or 10 pods)
  • the runtime model might change, today a task == a pod, a pipeline == N pods, but we are investigating alternative runtime approaches

Base automatically changed from master to main February 3, 2021 16:34
@pritidesai
Copy link
Member

Experiment with this in experimental repo to begin with.
assigning it to @imjasonh

@pritidesai
Copy link
Member

/assign @imjasonh

@ghost
Copy link

ghost commented Mar 1, 2021

Discussed again in API Working Group today.

We're going to close this TEP for now on the basis that we have a design ongoing in experimental. In combination with the Pending TEP I think this gives us enough room to start developing solutions to this problem without changes to Tekton Pipelines' Controller.

This doesn't preclude bringing this functionality into the controller in future though, if needed. Reopening this TEP or starting a new one at that time makes total sense.

/close

@tekton-robot
Copy link
Contributor

@sbwsg: Closed this PR.

In response to this:

Discussed again in API Working Group today.

We're going to close this TEP for now on the basis that we have a design ongoing in experimental. In combination with the Pending TEP I think this gives us enough room to start developing solutions to this problem without changes to Tekton Pipelines' Controller.

This doesn't preclude bringing this functionality into the controller in future though, if needed. Reopening this TEP or starting a new one at that time makes total sense.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants