Fix Integration tests #2425

Merged: 7 commits, Mar 21, 2023

Conversation

@BronzeDeer (Contributor) commented Mar 10, 2023

The existing setup for minikube is very complicated, replicating most of
the setup steps for a full Kubernetes cluster in an only partially
supported minikube configuration (driver=none). Furthermore, the existing
setup has been broken for some time, likely at least in part due to the
changes to CNI and CRI in recent Kubernetes versions.

Since what we actually need is only a running Kubernetes cluster on the
node and access to a registry on localhost:5000, we can swap the
extremely complicated minikube setup for a lightweight cluster using
k3s. Minikube came with a default addon for running a registry on every
node; k3s does not, so instead we make use of the packaged helm
controller and its HelmChart CR to deploy twuni/docker-registry.helm
and expose it on localhost using the integrated LoadBalancer controller.
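
For reference, a minimal sketch of what such a HelmChart resource could look like; the chart repo URL, values keys, and namespaces are assumptions based on the twuni chart's published defaults, not copied from this PR:

```yaml
# Sketch only: deploy twuni/docker-registry.helm through k3s's bundled helm
# controller and expose it via the built-in LoadBalancer controller (servicelb)
# so the registry becomes reachable on localhost:5000 on the node.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: docker-registry
  namespace: kube-system
spec:
  repo: https://helm.twun.io
  chart: docker-registry
  targetNamespace: default
  valuesContent: |-
    service:
      type: LoadBalancer
      port: 5000
```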

This pull request also includes another commit with a small change to the boilerplate-checking script.
The current implementation only accepts copyright notices from the years 2018-2022, and would therefore
reject the new 2023 notice on top of the newly introduced k3s setup script. The new commit widens the range of accepted copyright years to 2000-2099.

There was a single failing test in TestRun (issue_684), which was caused by container-diff not supporting OCI indices/images. The underlying Dockerfile did not pin the container hash of its FROM image, and in the meantime Ubuntu moved to using OCI manifests. While a fix for the issue has been submitted, it is unclear whether container-diff is still maintained, so this PR also includes a small commit pinning the base image of the test back to what it likely was when the Dockerfile was originally submitted.

Based on #2400, which should be merged first

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

  • Includes unit tests
  • Adds integration tests if needed.

See the contribution guide for more details.

Reviewer Notes

  • The code flow looks good.
  • Unit tests and/or integration tests added.

- Adapted error comparison according to linter recommendation
- Disabled noctx linting for http request where canceling makes no sense
- Disabled nilerror linting where nil error is returned on purpose
- Disabled makezero linter where slice is explicitly deepcopied
Previously the regex only allowed the copyright notice to contain the
years 2018, 2019, 2020, 2021, or 2022. This commit widens the regex to
20\d\d, allowing any year in the range [2000-2099].
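
Purely as an illustration of that widening (the actual boilerplate script and its exact pattern may differ), the effect is roughly:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Hypothetical patterns for illustration; the real check lives in the
	// repository's boilerplate script and may be written differently.
	oldYears := regexp.MustCompile(`Copyright (2018|2019|2020|2021|2022) Google LLC`)
	newYears := regexp.MustCompile(`Copyright 20\d\d Google LLC`)

	notice := "Copyright 2023 Google LLC"
	fmt.Println(oldYears.MatchString(notice)) // false: 2023 was rejected before
	fmt.Println(newYears.MatchString(notice)) // true: any year 2000-2099 passes
}
```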
@BronzeDeer (Contributor Author)

@imjasonh Some more notes based on my trek through the integration tests:

  • A lot of the time the integration tests take (aside from the setup, which has now been sped up by about 5 minutes) comes from the fact that every regression test is executed as an end-to-end test, running a full docker and kaniko build via the command line, instead of moving some of them to more scoped-down unit tests.

  • The Dockerfiles that capture the regression tests generally do not use version (or, better yet, container hash) pinning for the images or the packages installed in RUN commands, so it is unclear whether the current "interpretation" of the Dockerfile even captures the conditions that originally caused the issue, making the regression tests flaky at best and useless at worst. (A good example is the issue 684 test which currently holds up this PR: the test broke without any code changes when the ubuntu:rolling tag started pointing to OCI images.)

  • If the regression tests change to using reproducible container builds, all of the docker build calls can be replaced by uploading a series of comparison containers to a registry and comparing the kaniko-built images against them (a minimal sketch follows this list). This should almost halve the time spent running the tests, while also avoiding local runs hitting the Docker Hub pull limit quite as fast, and removing the need for a (specific) container runtime to run the integration tests.

  • The integration tests actually rebuild the images under test instead of testing the already-built image, which increases test execution time (slightly) and introduces the possibility of hard-to-detect discrepancies between the released and the tested images.
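
A minimal sketch of that digest comparison, assuming fully reproducible builds and using go-containerregistry's crane package (the function and reference names are illustrative, not from this PR):

```go
package integration

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/crane"
)

// sameDigest reports whether a kaniko-built image and a pre-built reference
// image resolve to the same digest. With reproducible builds this replaces
// running docker build inside the test suite.
func sameDigest(kanikoBuilt, reference string) (bool, error) {
	got, err := crane.Digest(kanikoBuilt)
	if err != nil {
		return false, fmt.Errorf("resolving %s: %w", kanikoBuilt, err)
	}
	want, err := crane.Digest(reference)
	if err != nil {
		return false, fmt.Errorf("resolving %s: %w", reference, err)
	}
	return got == want, nil
}
```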

@imjasonh (Collaborator)

@BronzeDeer Wow. Thanks for sticking with this, and especially for your thorough analysis!!

I totally agree that we'd probably get better results by moving some tests to unit tests, or at least faster-running integration tests. Avoiding rebuilding the image every time definitely seems like an improvement. Making kaniko less painful to develop will pay dividends when/if we want to make improvements to it in the future.

@BronzeDeer (Contributor Author) commented Mar 20, 2023

@imjasonh Right now even getting back to green tests will be a huge boon, enabling future work. On that note, since we are blocked by the container-diff PR, should we consider a plan B in case the maintainers over there don't react soon?

Here's my current understanding of the role that container-diff plays in the test workflow:

The integration tests attempt to verify that kaniko creates images that are "semantically equivalent" to those built by docker. This means that, ignoring outside variability (non-pinned package versions, anything involving RNGs, timestamps), the content (and metadata?) of the layers should be the same, even if the images are not bit-for-bit identical.

Since we do not currently make the effort to control for these sources of variability (and docker, afaik, doesn't offer any timestamp-agnostic builds), we cannot currently go for the simpler bit-for-bit comparison using container hashes, and instead need the more sophisticated logic offered by container-diff.

In the short term, kaniko could maintain a separate patched version of container-diff as needed, but if container-diff becomes fully unmaintained, I see a long-term alternative:

We focus on creating reproducible test cases and rely fully on simple hash comparisons, realizing many of the other benefits along the way. This would require a complete refactor of all integration tests: understanding what the original bug was and recreating the test case in a lower-level, more unit-test-like fashion, or creating a new Dockerfile with fully pinned FROM images and apt/apk packages (although if we understand the root cause of the problem, installing loads of packages will likely not be necessary). Finally, we would have to create a small tool that can blank the timestamps in the layers after the fact, creating a canonical version from a tar dump. Once that is accomplished, you wouldn't even need a remote registry and could simply cache the target image digests. Adding new test cases would be as simple as calling a helper script that builds with docker, canonicalizes the timestamps, and then saves the target digest. Integration tests would run kaniko, output to tar, then run the canonicalize tool and read out the digest to compare.
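
A rough sketch of what that canonicalization tool could look like, assuming plain (uncompressed) layer tars; a real tool would also need to handle gzip headers and the timestamps in the image config:

```go
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"time"
)

// canonicalize rewrites a layer tar (stdin to stdout) with fixed timestamps so
// that otherwise-identical layers produce identical bytes and digests.
func canonicalize(in io.Reader, out io.Writer) error {
	tr := tar.NewReader(in)
	tw := tar.NewWriter(out)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		// Pin every timestamp; all other metadata and the file contents are
		// copied through unchanged.
		hdr.ModTime = time.Unix(0, 0)
		hdr.AccessTime = time.Time{}
		hdr.ChangeTime = time.Time{}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := io.Copy(tw, tr); err != nil {
			return err
		}
	}
	return tw.Close()
}

func main() {
	if err := canonicalize(os.Stdin, os.Stdout); err != nil {
		log.Fatal(err)
	}
}
```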

Besides the obvious chore work of converting all existing test cases, this would also put a load on the maintainers to properly screen new test cases for sources of variability. (Although I imagine existing Dockerfile linters could be adapted to catch the most common cases automatically.)

I think this could be a viable long-term goal, no matter whether we restore container-diff functionality in the short term or not. If we get back to all-green tests, we could institute a policy requiring all new integration tests to follow the new approach and then slowly migrate the existing tests over, rather than attempting a big-bang change.

As for fixing the failing test in the short term, we could either pin an older (pre-OCI) ubuntu image as the base in the Dockerfile, or build and maintain or vendor a fixed container-diff. If we go with the older ubuntu image, we would need to prioritize a long-term fix, since any tag rolling over to OCI will create further sporadic test failures at random points in the future.

@imjasonh (Collaborator)

> @imjasonh Right now even getting back to green tests will be a huge boon, enabling future work. On that note, since we are blocked by the container-diff PR, should we consider a plan B in case the maintainers over there don't react soon?

We can disable any integration tests that are affected by this in the meantime. I don't know if anybody is still maintaining container-diff, unfortunately.

Let's just add a t.Skip or comment out TestRun/test_Dockerfile_test_issue_684 for now.
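
For illustration, the skip could look roughly like this; in the real suite it would sit inside TestRun's loop over the integration Dockerfiles rather than in a standalone test (the test name below is hypothetical):

```go
package integration

import "testing"

// Hypothetical placement; the actual test name and structure follow TestRun.
func TestDockerfileIssue684(t *testing.T) {
	t.Skip("temporarily skipped: container-diff cannot compare OCI images yet " +
		"(see GoogleContainerTools/container-diff#389)")
}
```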

@imjasonh (Collaborator)

Pinning to an older ubuntu is also a good fix. I'd rather not maintain a fork of container-diff here, since maintainer bandwidth is already a scant resource on this project, and picking up another possibly-abandoned project as a dependency isn't going to make that any better. I'd rather risk introducing a regression on that issue by skipping its integration test, and unblock future progress and cleanups.

Thank you again for your hard work and tenacity on this. This is likely going to end up being pretty thankless work in general, but I can give you all the thanks I have. Keep it up and we'll find a way to make you a maintainer*

*this is a threat 😆

@BronzeDeer (Contributor Author)

I added the older ubuntu pin. If the tests now pass and we can merge, I'll rebase the original reproducibility fix on top of it and we can get that out of the way as well

@BronzeDeer (Contributor Author)

Belay that: apparently the older ubuntu images have some execve problem with their shells. I'll look into it and push a fixed version soon.

The dockerfile for the regression test connected to issue 684 used a
rolling tag as its base image, making it flaky and prone to failure ever
since it was introduced.

This commit pins the base image to the digest of bionic-20200219, which,
based on the date of the commit that introduced the dockerfile, would
have been the newest ubuntu build at the time and likely what the
"rolling" tag resolved to back then. Since this is also an image from
the pre-OCI days of ubuntu, it circumvents a bug in container-diff as
well (GoogleContainerTools/container-diff#389).
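
In the Dockerfile this amounts to something like the following; the digest below is a placeholder, not the one used in the commit:

```dockerfile
# Placeholder digest for illustration only; the commit pins the actual
# sha256 digest of ubuntu:bionic-20200219.
FROM ubuntu:bionic-20200219@sha256:<digest-of-bionic-20200219>
```
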
@BronzeDeer (Contributor Author)

Pushed a fix; all tests are green on my fork now.

@imjasonh merged commit 14ea7c4 into GoogleContainerTools:main on Mar 21, 2023

@imjasonh mentioned this pull request on Mar 21, 2023