PS-67: stop image pull backoff error handling for sidecars #344
Currently, our image pull backoff error watcher monitors every container in a pod, which is a problem for customers who rely heavily on sidecars.
In principle, a sidecar error does not affect the health of a pipeline job, so canceling an entire job because of a sidecar's status only causes unnecessary disruption.
At the moment, we don't provide governance support for sidecars, so customers can't see sidecar logs. When we kill a job because of a sidecar problem, customers aren't given a proper reason. Sometimes their CI workload is running correctly but a sidecar starts late, and the job is killed anyway. From the customer's perspective, everything appears to be working until the job is terminated for no visible reason, which is frustrating.
Customers can debug sidecar issues themselves through their Kubernetes platform.
This PR reduces the scope of the image pull backoff error watcher so it only monitors containers that we actively govern.
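The narrowed scope can be illustrated with a minimal sketch. This is not the code changed in this PR: the function name `governed_image_pull_failures` and the governed container name `job` are hypothetical, and the input is assumed to be a list of dicts shaped like the `containerStatuses` entries of a Kubernetes pod status. Only the waiting reasons `ImagePullBackOff` and `ErrImagePull` are taken from Kubernetes itself.

```python
# Waiting reasons Kubernetes reports when a container image cannot be pulled.
IMAGE_PULL_ERROR_REASONS = {"ImagePullBackOff", "ErrImagePull"}

# Hypothetical: names of the containers the platform actively governs.
GOVERNED_CONTAINERS = frozenset({"job"})


def governed_image_pull_failures(container_statuses, governed=GOVERNED_CONTAINERS):
    """Return names of governed containers stuck on an image pull error.

    Containers outside `governed` (for example, customer sidecars) are
    skipped, so their pull errors never cause the job to be canceled.
    """
    failures = []
    for status in container_statuses:
        if status.get("name") not in governed:
            continue  # sidecar or other ungoverned container: ignore it
        waiting = (status.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") in IMAGE_PULL_ERROR_REASONS:
            failures.append(status["name"])
    return failures


if __name__ == "__main__":
    statuses = [
        {"name": "job", "state": {"running": {}}},
        {"name": "sidecar", "state": {"waiting": {"reason": "ImagePullBackOff"}}},
    ]
    # The sidecar's pull error is ignored because the sidecar is not governed.
    print(governed_image_pull_failures(statuses))
```

With this kind of filter, a job would be canceled only when the returned list is non-empty, that is, when a governed container itself cannot pull its image.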
NOTE: This PR is a temporary fix. A longer-term solution that addresses the observability issues comprehensively is being planned.