PS-67: stop image pull backoff error handling for sidecars #344

Merged: zhming0 merged 1 commit into main from ming/ps-67-part-1 on Jun 17, 2024

@zhming0 zhming0 commented Jun 7, 2024

Currently, our backoff error watcher monitors all containers in a pod, which is problematic for customers who heavily rely on sidecars.

Since sidecar errors theoretically do not impact the health of pipeline jobs, canceling an entire job based on the status of a sidecar only adds unnecessary trouble.

At the moment, we don't provide governance support for sidecars, meaning customers can't see logs from sidecars. When we kill a job due to a sidecar problem, customers aren't given a proper reason. Sometimes, their CI workload is functioning correctly, but some sidecars have a delayed start, leading to the job being killed. From the customers' perspective, everything appears to be working fine until something randomly terminates the job, which is frustrating.

Customers can debug sidecar issues themselves through their Kubernetes platform.

This PR reduces the scope of the image pull backoff error watcher so it only monitors containers that we actively govern.

NOTE: This PR is a temporary fix. A longer-term solution that addresses the observability issues comprehensively is being planned.
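
For readers skimming the description, here is a rough sketch of the narrowing described above: skip any container we don't govern before checking for image pull failures. This is not the code in this PR. `containerStatus`, `governedImagePullFailures`, and the `governed` set are hypothetical names, the container names are made up, and the status type is a trimmed-down stand-in rather than the real Kubernetes API type. `ImagePullBackOff` and `ErrImagePull` are real Kubernetes waiting reasons, but which of them the watcher actually checks is an assumption.

```go
package main

import "fmt"

// containerStatus is a hypothetical, trimmed-down stand-in for the
// per-container status that Kubernetes reports for a pod.
type containerStatus struct {
	Name          string
	WaitingReason string // empty when the container is not waiting
}

// isImagePullFailure reports whether a waiting reason indicates a failed
// image pull. Treating both reasons as failures is an assumption.
func isImagePullFailure(reason string) bool {
	return reason == "ImagePullBackOff" || reason == "ErrImagePull"
}

// governedImagePullFailures returns the names of governed containers that
// are stuck on an image pull. Containers missing from the governed set
// (for example, customer sidecars) are skipped entirely.
func governedImagePullFailures(statuses []containerStatus, governed map[string]bool) []string {
	var failing []string
	for _, s := range statuses {
		if !governed[s.Name] {
			continue // not ours to govern, so never a reason to cancel the job
		}
		if isImagePullFailure(s.WaitingReason) {
			failing = append(failing, s.Name)
		}
	}
	return failing
}

func main() {
	// Hypothetical pod: the job container is healthy, the sidecar is not.
	governed := map[string]bool{"job-container": true}
	statuses := []containerStatus{
		{Name: "job-container"},
		{Name: "customer-sidecar", WaitingReason: "ImagePullBackOff"},
	}

	failing := governedImagePullFailures(statuses, governed)
	fmt.Println("cancel job:", len(failing) > 0)
}
```

With this shape, a sidecar stuck in `ImagePullBackOff` no longer produces a failure, while the same waiting reason on a governed container still does.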

@zhming0 zhming0 requested a review from a team June 7, 2024 01:07
@zhming0 zhming0 changed the title PS-67 part 1: stop image pull backoff error handling for sidecars PS-67: stop image pull backoff error handling for sidecars Jun 14, 2024
@moskyb moskyb left a comment

one blocking question, but other than that LGTM!

internal/controller/scheduler/imagePullBackOffWatcher.go (outdated, resolved)
@zhming0 zhming0 force-pushed the ming/ps-67-part-1 branch from 6e5456c to e468fc9 Compare June 17, 2024 00:54
@zhming0 zhming0 enabled auto-merge June 17, 2024 00:55
@zhming0 zhming0 merged commit e4e8684 into main Jun 17, 2024
1 check passed
@zhming0 zhming0 deleted the ming/ps-67-part-1 branch June 17, 2024 01:08