Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use init containers to preflight-check image pulls #364

Merged
merged 1 commit into from
Aug 20, 2024

Conversation

DrJosh9000
Copy link
Contributor

@DrJosh9000 DrJosh9000 commented Aug 7, 2024

What

Make image pull failures more obvious:

  • fail the job
  • with an error message that identifies the image that failed to pull
Screenshot 2024-08-08 at 7 02 59 PM

Why

Previously, if an image failed to pull (within a timeout after the pod starts), the imagePullBackOffWatcher would use the GraphQL API to cancel the job. If the agent container had started, it could detect the mismatch between connected and expected number of containers and try to log something that would warn the user about ImagePullBackOff.

Cancelling the job looks weird, because that's normally an action performed by a human, and the agent's last-gasp log message is both easily ignored and hedges its bets (it could be ImagePullBackOff, but it might not be, ya know?)

The ability to directly acquire and fail a job in BK changes the set of possibilities. If we can detect image pull failures before the agent container runs in the pod, then we can directly fail the job.

How

The scheduler now includes an init container for each distinct image value in the podSpec.
Each new init container is called imagepullcheck-%d and has an entrypoint command that exits immediately with success. (So, in theory, the container should only fail if the image fails to pull.)
imagePullBackOffWatcher is updated to watch init container statuses. If an imagepullcheck-%d container fails to pull, imagePullBackOffWatcher uses the failJob mechanism from #363 to fail the job in Buildkite.

It also asks Kubernetes to evict the pod specifically to save a bit of time. (Maybe we should detect other failure modes and do this as cleanup.) I'm assuming we couldn't do this before, since it would risk evicting the agent before it could upload its last-gasp log and end the other containers.

TestImagePullBackOffOnSidecar is deleted, because it was used to specifically check that imagePullBackOffWatcher didn't touch the sidecar. Now Kubernetes won't even get to thinking about starting the sidecar, because the corresponding init container that pre-checks its image pull will fail.

The job cancellation mechanism is left in because pre-checking for image pull failure is TOCTOU - if the main containers have an image pull policy of "Always", and there is a transient network failure starting after the init containers succeeded, then they could still fail to pull at that moment. Our only termination mechanism after the agent container has started is still to cancel the job.

This also fixes two bugs I just discovered in #363:

  • in order to read the agent token from the secret, the controller service account needs secret read permissions. Duh.
  • in order to acquire a job, you need the tags when registering the agen (at least the queue, I think). Passed on CI?

@DrJosh9000 DrJosh9000 force-pushed the preflight-imagepull branch 12 times, most recently from f958422 to 5f8cef5 Compare August 8, 2024 06:48
@DrJosh9000 DrJosh9000 changed the title [WIP] Use init containers to preflight-check image pulls Use init containers to preflight-check image pulls Aug 8, 2024
@DrJosh9000 DrJosh9000 marked this pull request as ready for review August 8, 2024 06:53
@DrJosh9000 DrJosh9000 force-pushed the preflight-imagepull branch 2 times, most recently from c651eef to d5cda2f Compare August 8, 2024 07:15
@DrJosh9000 DrJosh9000 requested a review from a team August 8, 2024 07:16
@DrJosh9000 DrJosh9000 force-pushed the preflight-imagepull branch 3 times, most recently from 711c73f to a572d7f Compare August 8, 2024 08:05
@DrJosh9000 DrJosh9000 marked this pull request as draft August 8, 2024 08:23
@DrJosh9000 DrJosh9000 force-pushed the preflight-imagepull branch 2 times, most recently from aa4b836 to bfb9471 Compare August 8, 2024 08:59
@DrJosh9000 DrJosh9000 marked this pull request as ready for review August 8, 2024 09:00
@DrJosh9000 DrJosh9000 force-pushed the preflight-imagepull branch 4 times, most recently from 14552cf to 27af0c5 Compare August 13, 2024 06:21
@DrJosh9000 DrJosh9000 force-pushed the preflight-imagepull branch from 27af0c5 to 0c76f91 Compare August 13, 2024 06:53
Copy link
Contributor

@wolfeidau wolfeidau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻 looks great

@DrJosh9000 DrJosh9000 merged commit 755d797 into main Aug 20, 2024
1 check passed
@DrJosh9000 DrJosh9000 deleted the preflight-imagepull branch August 20, 2024 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants