Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TEP-0120] Add proposal for concurrency controls #818

Merged
merged 1 commit into from
Oct 3, 2022

Conversation

lbernick
Copy link
Member

@lbernick lbernick commented Sep 9, 2022

This commit adds design details for canceling concurrent PipelineRuns.

/kind tep

@tekton-robot tekton-robot added the kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). label Sep 9, 2022
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 9, 2022
The appeal of a solution involving a separate CRD is that any higher-level controller (e.g. Triggers, Workflows, Pipelines as Code) could get concurrency controls
for "free" by creating a ConcurrencyControl, and handing the logic off to a separate controller.
However, in practice, it's difficult to do this in a way that achieves good separation of concerns between reconcilers. Because the PipelineRun controller
(or ConcurrencyControl controller) is not responsible for creating PipelineRuns, it has to rely on other components starting PipelineRuns as pending, and matching
these PipelineRuns to the ConcurrencyControl via some strategy such as labels, ownerReferences, or modifying PipelineRuns themselves.
This controller's logic is tied to the logic of these other components, requiring updates to be coordinated and essentially creating an
unwritten contract between reconcilers. This solution adds significant complexity compared to the proposed solution while still requiring some concurrency code to be
added to each higher-level controller that wants to support concurrency controls.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really tried to get this solution working but I struggled to do so in a way that wasn't super complex-- this is my best attempt at explaining why.

@lbernick
Copy link
Member Author

lbernick commented Sep 9, 2022

@lbernick
Copy link
Member Author

lbernick commented Sep 9, 2022

Proof of concept: tektoncd/triggers#1446

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a "slight" difference of approach in what I had and yours @lbernick.
My initial thought on concurrency was to provides the concurrency control primitives to tektoncd/pipelne so that other tools (be it upstream or elsewhere) could use them (hence having it on PipelineRun or at least in tektoncd/pipeline).

What is proposed here, however, is an "end-user" proposal, for Triggers. The two approaches are not antinomic, but one might use the 2nd. I do think the proposal on trigger probably make sense for triggers. But I do think the "primitives for concurrency control" shouldn't be in triggers (Triggers "description" is "… that allows you to create Kubernetes resources based on information it extracts from event payloads.").

As proposed here, any tool that wants to implement concurrency control and that doesn't rely on triggers, will need to "copy" what's proposed (or come up with its own approach) — which was already the case today.

Comment on lines 192 to 265
### Reconciler logic

When a Trigger with a concurrency spec creates a new PipelineRun, it will substitute the parameters in the concurrency key and apply the concurrency key as a label
with key "triggers.tekton.dev/concurrency". It will use an informer to find all PipelineRuns with the same concurrency key from the same Trigger, using label selectors.
The reconciler will patch any matching PipelineRuns as canceled before creating the new PipelineRun, but will not wait for cancelation to complete.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view, the only "job" of triggers should be to create that label and most likely create the PipelineRun as pending, and have something else handle that label (aka not the triggers controller).

for "free" by creating a ConcurrencyControl, and handing the logic off to a separate controller.
However, in practice, it's difficult to do this in a way that achieves good separation of concerns between reconcilers. Because the PipelineRun controller
(or ConcurrencyControl controller) is not responsible for creating PipelineRuns, it has to rely on other components starting PipelineRuns as pending, and matching
these PipelineRuns to the ConcurrencyControl via some strategy such as labels, ownerReferences, or modifying PipelineRuns themselves.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the PipelineRun controller
(or ConcurrencyControl controller) is not responsible for creating PipelineRuns, it has to rely on other components starting PipelineRuns as pending, and matching
these PipelineRuns to the ConcurrencyControl via some strategy such as labels, ownerReferences, or modifying PipelineRuns themselves

Althought I slightly agree, the PipelineRun reconciler has all the possibilities to do whatever it wants with the newly created PipelineRun. We could decide that, if a "conccurency" label is set on a PipelineRun, we make it pending. This is a choice we can make (or not).

Comment on lines +380 to +362
This solution assumes that PipelineRuns using concurrency will typically be created by tooling such as Pipelines as Code, a Workflow, or similar,
and would likely not have different concurrency strategies.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which "is" kind of true, as this solution adds it in Triggers.

@abayer
Copy link
Contributor

abayer commented Sep 12, 2022

Yeah, I agree - Triggers is upstream of Pipelines, and isn't necessarily the only place users would want to be able to control concurrency.

@lbernick
Copy link
Member Author

@vdemeester could you elaborate on what a "concurrency primitive" might look like?
I like the idea of building something that higher-level systems can use for concurrency controls, but I think concurrency controls belong at a higher level than Pipelines. It doesn't make sense to talk about concurrency controls for a single PipelineRun, so I'm not sure what a primitive in Pipelines would look like-- do you have an example in mind?

@lbernick
Copy link
Member Author

/assign @vdemeester @abayer

@lbernick
Copy link
Member Author

@abayer @vdemeester @chmouel @dibyom I've opened an alternative POC in the Pipelines repo for "concurrency primitives". PTAL and let me know if this is in line with what you are thinking: tektoncd/pipeline#5501

@dibyom
Copy link
Member

dibyom commented Sep 15, 2022

So, I agree with @vdemeester 's comment about the primitive based approach. To me, that means that the solution has to be loosely coupled i.e. we should not require changes to pipelines/triggers code and ideally no changes to the existing crds either. This does not mean we cannot add those changes in later if we decide this is a very core component of say a trigger. Also, this does not mean we cannot provide a more opinionated approach in another higher level crd such as workflows. But the most basic implementation should be flexible and loosely coupled.

Here is roughly what I had in mind (let me know if some of this is infeasible)

  1. A way for a user to specify the Concurrency configuration which contains 1. the concurrency key, 2. the strategy and 3. a way to match this concurrency config to some subset of pipelineruns - I think we can start we a label selector here and optimize later
    (e.g. for the triggers PoC scenario - since each run created by a Triggers gets a label with the name of the trigger, the label selector can just match against that label)

  2. On Run creation, modify the run to use PipelineRunPending. (This can happen in a mutating webhook so that we don't need to change code in Pipeline or Triggers)

  3. For all runs in a pending state, a new reconciler first sees if a valid concurrency label is present. If not, it matches it against all label selectors for all relevant concurrency configs. If nothing matches, remove the pipelinerun pending state. If there is a match, add a new concurrency key label to the object, run the concurrency strategy (cancel any relevant runs) and then remove the pipelinerun pending state.

Some caveats/drawbacks of this approach I can think of

  1. The matching might get complex/slow - maybe a higher level thing can set the concurrencyLabel directly
  2. Label manipulation - Ideally the selection and the concurrency key should be immutable fields; not a easily changeable label
  3. Multiple additional components - For workflows, since we generate the trigger template, we can always set the status to pending at that point instead of relying on a webhook to do so.

@lbernick
Copy link
Member Author

lbernick commented Sep 19, 2022

thanks @dibyom for the feedback! I am experimenting with a solution in a separate reconciler, similar to what you've described. The one thing I might not end up implementing is the reconciler changing the pipelinerun status from pending -> running, partially because it's a bit of a code smell for the reconciler to edit the spec of the pipelinerun it's reconciling and partially because I haven't figured out yet how to get this to actually work 😅 edit: dibyo and I figured this out!

Would be great to hear others' thoughts as well!

@lbernick
Copy link
Member Author

There's one other issue with the mutating admission webhook strategy that @RafaeLeal mentioned to me. Labels and annotations aren't available during the create request (they're patched after creation), so we'd have to mark the pipelinerun as pending during an update request. However I'm not sure how to know during an update request whether it's appropriate to mark a pipelinerun as pending, and our webhook doesn't currently allow running pipelineruns to be changed to pending (although ofc this is something we can change)

@lbernick
Copy link
Member Author

POC number 3 as requested! I haven't yet implemented a mutating admission webhook but that's next on the list.

@lbernick
Copy link
Member Author

lbernick commented Sep 23, 2022

@ all -- I've updated the TEP to list out alternatives, and the proposed solution to basically state that we're going to experiment with a separate reconciler and use it in dogfooding, and we'll design a higher level api for concurrency in workflows once the workflows api exists and after getting experience testing out the experimental project. Also added links to existing POCs. Hoping this can be merged without major changes, please lmk what you think.

This commit adds design details for canceling concurrent PipelineRuns.
@lbernick
Copy link
Member Author

@abayer @vdemeester PTAL, hoping to merge during API WG on Monday

@abayer
Copy link
Contributor

abayer commented Sep 30, 2022

/approve

Looks like a good start to me - looking forward to seeing where we end up going with this!

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 30, 2022
- Do we want to allow a PipelineRun to be part of multiple concurrency groups?
- For example, when we later support queueing, we might want to allow users to configure a rule like "Cancel all but the last PR per pull request, and only allow 5 CI runs per repo at a time."
- What concurrency strategies should we support?
- For the initial version of this proposal, we will only support cancelation. However, should we support all possible strategies for canceling PipelineRuns,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default should be "graceful cancellation", aka cancel but run finally tasks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good-- the initial version I'm working on uses regular cancelation as the default but this should be super easy to change.

- For the initial version of this proposal, we will only support cancelation. However, should we support all possible strategies for canceling PipelineRuns,
including cancelation, graceful cancelation, and graceful stopping?
- Should we attempt to prevent users from interfering with concurrency controls?
- Many of the proposed solutions rely on labels, and a user editing labels could change PipelineRun behavior. Is this desired?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so yes. A user can change labels on a Pod from a TaskRun and provoke a weird behavior for example. This is very similar here, we cannot completely stop the user from shooting its foot, and that's fine 🙃

@tekton-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abayer, vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@afrittoli
Copy link
Member

LGTM in the working group
/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 3, 2022
@tekton-robot tekton-robot merged commit 3c33b95 into tektoncd:main Oct 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/tep Categorizes issue or PR as related to a TEP (or needs a TEP). lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: UnAssigned
Development

Successfully merging this pull request may close these issues.

6 participants