Use init containers to preflight-check image pulls #364

DrJosh9000 · 2024-08-07T01:44:57Z

What

Make image pull failures more obvious:

fail the job
with an error message that identifies the image that failed to pull

Why

Previously, if an image failed to pull (within a timeout after the pod starts), the imagePullBackOffWatcher would use the GraphQL API to cancel the job. If the agent container had started, it could detect the mismatch between connected and expected number of containers and try to log something that would warn the user about ImagePullBackOff.

Cancelling the job looks weird, because that's normally an action performed by a human, and the agent's last-gasp log message is both easily ignored and hedges its bets (it could be ImagePullBackOff, but it might not be, ya know?)

The ability to directly acquire and fail a job in BK changes the set of possibilities. If we can detect image pull failures before the agent container runs in the pod, then we can directly fail the job.

How

The scheduler now includes an init container for each distinct image value in the podSpec.
Each new init container is called imagepullcheck-%d and has an entrypoint command that exits immediately with success. (So, in theory, the container should only fail if the image fails to pull.)
imagePullBackOffWatcher is updated to watch init container statuses. If an imagepullcheck-%d container fails to pull, imagePullBackOffWatcher uses the failJob mechanism from #363 to fail the job in Buildkite.

It also asks Kubernetes to evict the pod specifically to save a bit of time. (Maybe we should detect other failure modes and do this as cleanup.) I'm assuming we couldn't do this before, since it would risk evicting the agent before it could upload its last-gasp log and end the other containers.

TestImagePullBackOffOnSidecar is deleted, because it was used to specifically check that imagePullBackOffWatcher didn't touch the sidecar. Now Kubernetes won't even get to thinking about starting the sidecar, because the corresponding init container that pre-checks its image pull will fail.

The job cancellation mechanism is left in because pre-checking for image pull failure is TOCTOU - if the main containers have an image pull policy of "Always", and there is a transient network failure starting after the init containers succeeded, then they could still fail to pull at that moment. Our only termination mechanism after the agent container has started is still to cancel the job.

This also fixes two bugs I just discovered in #363:

in order to read the agent token from the secret, the controller service account needs secret read permissions. Duh.
in order to acquire a job, you need the tags when registering the agen (at least the queue, I think). Passed on CI?

wolfeidau

👍🏻 looks great

DrJosh9000 force-pushed the preflight-imagepull branch 12 times, most recently from f958422 to 5f8cef5 Compare August 8, 2024 06:48

DrJosh9000 changed the title ~~[WIP] Use init containers to preflight-check image pulls~~ Use init containers to preflight-check image pulls Aug 8, 2024

DrJosh9000 marked this pull request as ready for review August 8, 2024 06:53

DrJosh9000 force-pushed the preflight-imagepull branch 2 times, most recently from c651eef to d5cda2f Compare August 8, 2024 07:15

DrJosh9000 requested a review from a team August 8, 2024 07:16

DrJosh9000 force-pushed the preflight-imagepull branch 3 times, most recently from 711c73f to a572d7f Compare August 8, 2024 08:05

DrJosh9000 marked this pull request as draft August 8, 2024 08:23

DrJosh9000 force-pushed the preflight-imagepull branch 2 times, most recently from aa4b836 to bfb9471 Compare August 8, 2024 08:59

DrJosh9000 marked this pull request as ready for review August 8, 2024 09:00

DrJosh9000 force-pushed the preflight-imagepull branch 4 times, most recently from 14552cf to 27af0c5 Compare August 13, 2024 06:21

Use init containers to early-check image pull

0c76f91

DrJosh9000 force-pushed the preflight-imagepull branch from 27af0c5 to 0c76f91 Compare August 13, 2024 06:53

wolfeidau approved these changes Aug 20, 2024

View reviewed changes

DrJosh9000 merged commit 755d797 into main Aug 20, 2024
1 check passed

DrJosh9000 deleted the preflight-imagepull branch August 20, 2024 01:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use init containers to preflight-check image pulls #364

Use init containers to preflight-check image pulls #364

DrJosh9000 commented Aug 7, 2024 •

edited

Loading

wolfeidau left a comment

Use init containers to preflight-check image pulls #364

Use init containers to preflight-check image pulls #364

Conversation

DrJosh9000 commented Aug 7, 2024 • edited Loading

What

Why

How

wolfeidau left a comment

Choose a reason for hiding this comment

DrJosh9000 commented Aug 7, 2024 •

edited

Loading