
[ci] Fix bug where we pushed test images to the main image cache #11907

Closed
wants to merge 2 commits

Conversation

@jigold (Contributor) commented Jun 10, 2022

We don't want this happening in PR namespaces when running test_ci:

retry buildctl-daemonless.sh \
    build \
    --frontend dockerfile.v0 \
    --local context=/io/repo/ \
    --local dockerfile=/home/user \
    --output 'type=image,"name=gcr.io/hail-vdc/ci-intermediate:kbqu6modc66u,gcr.io/hail-vdc/service-base:cache",push=true' \
    --export-cache type=inline \
    --import-cache type=registry,ref=gcr.io/hail-vdc/service-base:cache \
    --trace=/home/user/trace
cat /home/user/trace

@jigold (Contributor, Author) commented Jun 10, 2022

Actually, maybe we just don't want to push to any cache at all for test deployments.

@danking (Contributor) commented Jun 10, 2022

Having tests push to the cache is important. For example, if a PR adds a new apt-get dependency, only the first build should have to rebuild the image; subsequent commits and retries should be fast.

        self.cache_repository = f'{DOCKER_PREFIX}/{self.publish_as}:cache'
    else:
-       self.cache_repository = f'{DOCKER_PREFIX}/ci-intermediate:cache'
+       self.cache_repository = f'{DOCKER_PREFIX}/ci-intermediate/{self.publish_as}:cache'
A reviewer (Contributor) commented on this diff:

if publish_as is None, this seems wrong
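
For illustration, a hedged sketch of one way the test-namespace branch could guard against a missing publish_as (the condition name deploy and this fallback are hypothetical, not the change made in this PR):

    # Hypothetical sketch only: fall back to the shared intermediate cache
    # when there is no publish_as to namespace the test cache under.
    if deploy:
        self.cache_repository = f'{DOCKER_PREFIX}/{self.publish_as}:cache'
    elif self.publish_as is not None:
        self.cache_repository = f'{DOCKER_PREFIX}/ci-intermediate/{self.publish_as}:cache'
    else:
        self.cache_repository = f'{DOCKER_PREFIX}/ci-intermediate:cache'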

@danking (Contributor) commented Jun 10, 2022

I'm missing something. Why shouldn't test deployments benefit from and contribute to the cache? Why isolate them somewhere else?

@jigold (Contributor, Author) commented Jun 10, 2022

I thought the purpose of the cache was to cache the latest version in production. Take service-base as an example. There's the deployment in production that we care about, but every PR is now going to overwrite the cache with whatever it thinks service-base is. This means the last 4 layers of service-base will change every time we run a test PR that changes hailtop, gear, or web-common.

If you don't like this change, then feel free to close it.

@danking (Contributor) commented Jun 10, 2022

Ah, I understand. You're correct that pushing to the cache tag dissociates that tag from whatever image was previously associated with it. That image is still in the registry, though. I thought that was sufficient for the cache to work. (I was wrong, see below!)

AFAICT, this change doesn't prevent PR tests from pushing to the cache tag. This change just makes the tests run by the CI-in-the-PR not overwrite the cache. Every image build for a PR (which is tested by default namespace CI) will still overwrite the cache tag.

AFAICT, this

     --import-cache type=registry,ref=gcr.io/hail-vdc/foo

will use as a cache source only the image tagged latest in the gcr.io/hail-vdc/foo repository. It is not sufficient for an image to be present in the repository untagged, or tagged with anything other than latest.

In particular, every push to the cache tag prevents us from using other images even though they are in the registry! For example, I pushed two images to cache:

(base) # gcloud container images list-tags gcr.io/hail-vdc/dktest
DIGEST        TAGS          TIMESTAMP
fb551d9bdb94                2022-06-10T14:16:39
afb4c5ad2d7b  cache,latest  2022-06-10T14:15:55

If I rebuild [1] the most recently pushed image with

     --import-cache type=registry,ref=gcr.io/hail-vdc/dktest:cache

it succeeds in getting the cache.

If I rebuild the other image with the same import-cache, it does not see that the (untagged) image is already there!


This all suggests that our attempts at image caching are failing terribly. Options:

  1. Only deploy builds push to a :cache tag, everyone uses that tag.
  2. List all the tags in the repository and include them all as --cache-from's (this doesn't actually work; see moby/moby#34715 (comment): "--cache-from and Multi Stage: Pre-Stages are not cached")
  3. Push a tag for each git SHA and then include as --cache-from's the last ten git SHAs on this branch, the most recent common commit with main (i.e. git merge-base origin/main this-branch), maybe the current main, and maybe the PR number?
  4. Write our own OCI image builder so we can write our own OCI image cache that actually works the way it ought to (everything in the registry is considered fair game for the cache).

It seems like 3 is actually a decent solution that should enable lots of caching.

  1. The last ten SHAs on the branch should speed up repeated builds when you're fixing little bugs.
  2. The most recent common commit with main should avoid rebuilds unless the packages changed.
  3. I suspect the current main is actually not helpful (either 2 will work or 3 wouldn't help).
  4. Pushing to something like cache-11907 would allow force pushes to still access the last build's images.

What do you think of the #3 proposal?
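
For concreteness, a hedged Python sketch of how the cache sources for proposal 3 could be assembled (all names and the cache-<ref> tag scheme here are hypothetical, not existing CI code):

    import subprocess

    def git_output(*args: str) -> str:
        return subprocess.check_output(['git', *args]).decode().strip()

    def cache_source_tags(repository: str, pr_number: int, n_recent: int = 10) -> list:
        # Last ten SHAs on this branch: repeated builds while fixing small
        # bugs should hit one of these.
        recent_shas = git_output('log', '--format=%H', f'-{n_recent}', 'HEAD').splitlines()
        # Most recent common commit with main: avoids rebuilds unless the
        # packages changed relative to main.
        merge_base = git_output('merge-base', 'origin/main', 'HEAD')
        # A per-PR tag so force pushes can still reuse the previous build's layers.
        candidates = recent_shas + [merge_base, f'pr-{pr_number}']
        return [f'{repository}:cache-{ref}' for ref in candidates]

    # Each resulting tag becomes one cache source, e.g. for buildctl:
    #   --import-cache type=registry,ref=gcr.io/hail-vdc/service-base:cache-<sha>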


[1]: I had two files:

# cat sleep/Dockerfile
FROM ubuntu:18.04
RUN sleep 10
# cat touch/Dockerfile
FROM ubuntu:18.04
RUN touch foo

To build I use this command (slightly different syntax from the buildctl syntax, but, AFAIK, uses the same backend):

docker buildx \
       build \
       DIRECTORY_NAME_HERE \
       --output 'type=image,"name=gcr.io/hail-vdc/dktest,gcr.io/hail-vdc/dktest:cache",push=true' \
       --cache-to type=inline \
       --cache-from type=registry,ref=gcr.io/hail-vdc/dktest

Before every build I clear the local cache with:

docker system prune -a

I can clear the remote cache with:

gcloud container images list-tags gcr.io/hail-vdc/dktest --format="get(digest)" \
  | awk '{print "gcr.io/hail-vdc/dktest@" $1}' \
  | xargs gcloud container images delete --force-delete-tags

@danking (Contributor) left a review comment

x

@jigold (Contributor, Author) commented Jun 10, 2022

Proposal 3 is good. I'll try to find time next week to work on this.
