0132-queueing-concurrent-runs.md

File metadata and controls

347 lines (259 loc) · 14.6 KB
status: proposed
title: Queueing Concurrent Runs
creation-date: 2023-02-24
last-updated: 2023-03-20
authors and collaborators: @lbernick, @pritidesai, @chengjoey, @jerop

TEP-0132: Queueing Concurrent Runs

Summary

This TEP proposes allowing users to control the number of PipelineRuns and TaskRuns (and maybe CustomRuns) that can run concurrently. It also covers controlling the total number of Tekton runtime resources (PipelineRuns, TaskRuns, and CustomRuns) running in a cluster.

The focus of this TEP is different from TEP-0120: Canceling Concurrent PipelineRuns, which focuses on PipelineRuns that may have ordering dependencies, but we may choose to develop a solution that addresses both TEPs.

Motivation

It is very common to build and execute many different pipelines in any CI/CD deployment. When such workloads run in parallel, the cluster can become overloaded and, in the worst case, unresponsive.

This proposal aims to support limiting concurrency both for runs of a single Pipeline or a single Task and for a group of unrelated PipelineRuns or TaskRuns in a given cluster.

In addition, TEP-0092: TaskRun Timeouts provided motivation for this proposal. TEP-0092 proposed capping the amount of time a TaskRun could be queued for. However, the use cases specified in TEP-0092 (running Tasks in a resource-constrained environment, or "fire and forget") would also be met by queueing PipelineRuns or TaskRuns for execution. Queueing may also be easier to understand and more flexible.

Goals

  • Users can "fire and forget" runs: create many runs of one or more Pipelines while preventing all of them from executing at once.
  • Users can "fire and forget" runs: create many runs of one or more Tasks while preventing all of them from executing at once.
  • Users can control the number of matrixed Runs that can execute at once.

Non-Goals

  • Priority and preemption of queued Runs, including prioritizing based on compute resources
  • Load testing or performance testing

Use Cases

Queueing non-idempotent operations

Allow only a single run of a given Task or Pipeline to execute at any given time. For example:

  • An integration test communicates with a stateful external service (like a database), and a developer wants to ensure that integration testing TaskRuns within their CI PipelineRun don’t run concurrently with other integration testing TaskRuns of the same CI Pipeline (based on this comment).

Controlling load on a cluster or external service

Some of these use cases require being able to limit concurrent PipelineRuns for a given Pipeline, or concurrent TaskRuns for a given Task. Others require being able to limit the total number of PipelineRuns and TaskRuns, regardless of whether they are associated with the same Pipeline or Task.

  • An organization has multiple teams working on a mobile application with a limited number of test devices. They want to limit the number of concurrent CI runs per team, to prevent one team from using all the available devices and crowding out CI runs from other teams.

  • A cluster operator wants to cap the number of matrixed TaskRuns (alpha) that can run at a given time.

    • Currently, we have the feature flag “default-maximum-matrix-fan-out”, which restricts the total number of TaskRuns that can be created from one Pipeline Task. However, we would like to support capping the number of matrixed TaskRuns that can run concurrently, instead of statically capping the number of matrixed TaskRuns that can be created at all.

  • A PipelineRun or TaskRun communicates with a rate-limited external service, as described in this issue. Another example of such a requirement is an API call to a package registry to retrieve package metadata for SBOMs. The package registry blocks the requester if the number of requests exceeds its allowed quota. These requests could be generated from a single Pipeline/Task or from unrelated Pipelines/Tasks.

  • Tekton previously used GKE clusters allocated by Boskos for our Pipelines integration tests, and Boskos caps the number of clusters that can be used at a time. It would have been useful to queue builds so that they could not launch until a cluster was available. (We now use KinD for our Pipelines integration tests.)

  • A Pipeline performs multiple parallelizable tasks with different concurrency requirements, as described in this comment.

    • Configuring different concurrency limits for multiple pipelineTasks of the same Pipeline can be part of future work for this proposal.

  • A large number of resource-intensive pipelineTasks are running in parallel, putting a heavy load on a node. This load causes other, unrelated TaskRuns (not part of the same Pipeline) to time out.

  • A large number of PipelineRuns and TaskRuns are running concurrently, resulting in an overloaded cluster. These PipelineRuns could be thousands of runs of the same Pipeline or a combination of N different Pipelines. These Pipelines could be related or unrelated; for example, they may access the same remote resources. The cluster operator would like to configure a queue of PipelineRuns/TaskRuns for fire-and-forget operations such as these.

Existing Workarounds

Use an object count quota to restrict the number of Runs that can exist in a namespace. This doesn't account for Runs' state (e.g. completed and pending PipelineRuns count towards this total) and doesn't support queueing or more advanced concurrency strategies.
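
As a rough illustration of this workaround, the sketch below uses a Kubernetes ResourceQuota with object count entries for Tekton's custom resources. The quota name, the namespace `ci`, and the limits are placeholder assumptions, not values from this TEP.

```yaml
# Hypothetical example: the name, namespace, and limits are illustrative only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tekton-run-count   # assumed name
  namespace: ci            # assumed namespace
spec:
  hard:
    # Object count quota for custom resources uses count/<resource>.<group>.
    # These limits apply to all objects that exist in the namespace,
    # including completed and pending runs, not only those currently executing.
    count/pipelineruns.tekton.dev: "10"
    count/taskruns.tekton.dev: "50"
```

With a quota like this, creating a run beyond the limit is rejected by the API server rather than queued, so the client has to retry later, and completed runs must be deleted before new ones can be created.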

Requirements

  • Must be able to cap the amount of time a Run can be queued for
  • Must be able to clear the queue manually without having to cancel Runs individually

Proposal

Notes and Caveats

Design Details

Design Evaluation

Reusability

Simplicity

Flexibility

Conformance

User Experience

Performance

Risks and Mitigations

Drawbacks

Alternatives

Implementation Plan

Test Plan

Infrastructure Needed

Upgrade and Migration Strategy

Implementation Pull Requests

References

Feature Requests

Design Proposals

Similar features in other CI/CD systems