Fix Integration tests #2425

Merged: 7 commits, Mar 21, 2023

Conversation

@BronzeDeer (Contributor) commented Mar 10, 2023

The existing setup for minikube is very complicated, replicating most of
the setup steps for a full Kubernetes cluster in an only partially
supported minikube configuration (driver=none). Furthermore, the existing
setup has been broken for some time, likely at least in part due to the
changes to CNI and CRI in recent Kubernetes versions.

Since what we actually need is only a running Kubernetes cluster on the
node and access to a registry on localhost:5000, we can swap the
extremely complicated minikube setup for a lightweight cluster using
k3s. Minikube came with a default addon for running a registry on every
node; k3s does not, so instead we make use of the packaged helm
controller and its HelmChart CR to deploy twuni/docker-registry.helm
and expose it on localhost using the integrated LoadBalancer controller.
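
For reference, a minimal sketch of what such a HelmChart resource could look like; the chart repo URL, values keys, and namespaces are assumptions based on the twuni chart's published defaults, not copied from this PR:

```yaml
# Sketch only: deploy twuni/docker-registry.helm through k3s's bundled helm
# controller and expose it via the built-in LoadBalancer controller (servicelb)
# so the registry becomes reachable on localhost:5000 on the node.
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: docker-registry
  namespace: kube-system
spec:
  repo: https://helm.twun.io
  chart: docker-registry
  targetNamespace: default
  valuesContent: |-
    service:
      type: LoadBalancer
      port: 5000
```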

This pull request also includes another commit with a small change to the boilerplate-checking script.
The current implementation only accepts copyright notices from the years 2018-2022, and would therefore
reject the new 2023 notice on top of the newly introduced k3s setup script. The new commit widens the range of accepted copyright years to 2000-2099.

There was a single failing test in TestRun (issue_684), which was caused by container-diff not supporting OCI indices/images. The underlying Dockerfile did not pin the container hash of its FROM image, and in the meantime Ubuntu moved to using OCI manifests. While a fix for the issue has been submitted, it is unclear whether container-diff is still maintained, so this PR also includes a small commit pinning the base image of the test back to what it likely was when the Dockerfile was originally submitted.

Based on #2400, which should be merged first

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you
review them:

  • Includes unit tests
  • Adds integration tests if needed.

See the contribution guide for more details.

Reviewer Notes

  • The code flow looks good.
  • Unit tests and/or integration tests added.

- Adapted error comparison according to linter recommendation
- Disabled noctx linting for http request where canceling makes no sense
- Disabled nilerror linting where nil error is returned on purpose
- Disabled makezero linter where slice is explicitly deepcopied
Previously the regex only allowed the copyright notice to contain the
years 2018, 2019, 2020, 2021, or 2022. This commit widens the regex to
20\d\d, allowing any year in the range [2000-2099].
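
Purely as an illustration of that widening (the actual boilerplate script and its exact pattern may differ), the effect is roughly:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Hypothetical patterns for illustration; the real check lives in the
	// repository's boilerplate script and may be written differently.
	oldYears := regexp.MustCompile(`Copyright (2018|2019|2020|2021|2022) Google LLC`)
	newYears := regexp.MustCompile(`Copyright 20\d\d Google LLC`)

	notice := "Copyright 2023 Google LLC"
	fmt.Println(oldYears.MatchString(notice)) // false: 2023 was rejected before
	fmt.Println(newYears.MatchString(notice)) // true: any year 2000-2099 passes
}
```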
@BronzeDeer (Contributor Author)

@imjasonh Some more notes based on my trek through the integration tests:

  • A lot of the time the integration tests take (aside from the setup, which has now been sped up by about 5 minutes) comes from the fact that every regression test is executed as an end-to-end test, running a full docker and kaniko build via the command line, instead of moving some of them to more scoped-down unit tests.

  • The Dockerfiles that capture the regression tests generally do not use version (or, better yet, container hash) pinning for the images or the packages installed in RUN commands, so it is unclear whether the current "interpretation" of the Dockerfile even captures the conditions that originally caused the issue, making the regression tests flaky at best and useless at worst. (A good example is the issue 684 test which currently holds up this PR: the test broke without any code changes when the ubuntu:rolling tag started pointing to OCI images.)

  • If the regression tests change to using reproducible container builds, all of the docker build calls can be replaced by uploading a series of comparison containers to a registry and comparing the kaniko-built images against them (a minimal sketch follows this list). This should almost halve the time spent running the tests, while also avoiding local runs hitting the Docker Hub pull limit quite as fast, and removing the need for a (specific) container runtime to run the integration tests.

  • The integration tests actually rebuild the images under test instead of testing the already-built image, which increases test execution time (slightly) and introduces the possibility of hard-to-detect discrepancies between the released and the tested images.
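
A minimal sketch of that digest comparison, assuming fully reproducible builds and using go-containerregistry's crane package (the function and reference names are illustrative, not from this PR):

```go
package integration

import (
	"fmt"

	"github.com/google/go-containerregistry/pkg/crane"
)

// sameDigest reports whether a kaniko-built image and a pre-built reference
// image resolve to the same digest. With reproducible builds this replaces
// running docker build inside the test suite.
func sameDigest(kanikoBuilt, reference string) (bool, error) {
	got, err := crane.Digest(kanikoBuilt)
	if err != nil {
		return false, fmt.Errorf("resolving %s: %w", kanikoBuilt, err)
	}
	want, err := crane.Digest(reference)
	if err != nil {
		return false, fmt.Errorf("resolving %s: %w", reference, err)
	}
	return got == want, nil
}
```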

@imjasonh (Collaborator)

@BronzeDeer Wow. Thanks for sticking with this, and especially for your thorough analysis!!

I totally agree that we'd probably get better results by moving some tests to unit tests, or at least faster-running integration tests. Avoiding rebuilding the image every time definitely seems like an improvement. Making kaniko less painful to develop will pay dividends when/if we want to make improvements to it in the future.

@BronzeDeer (Contributor Author) commented Mar 20, 2023

@imjasonh Right now even getting back to green tests will be a huge boon, enabling future work. On that note, since we are blocked by the container-diff PR, should we consider a plan B in case the maintainers over there don't react soon?

Here's my current understanding of the role that container-diff plays in the test workflow:

The integration tests attempt to verify that kaniko creates images that are "semantically equivalent" to those built by docker. This means that, ignoring outside variability (non-pinned package versions, anything involving RNGs, timestamps), the content (and metadata?) of the layers should be the same, even if the images are not bit-for-bit identical.

Since we do not currently make the effort to control for these sources of variability (and docker, afaik, doesn't offer any timestamp-agnostic builds), we cannot currently go for the simpler bit-for-bit comparison using container hashes, and instead need the more sophisticated logic offered by container-diff.

In the short term, kaniko could maintain a separate patched version of container-diff as needed, but if container-diff becomes fully unmaintained, I see a long-term alternative:

We focus on creating reproducible test cases and rely fully on simple hash comparisons, realizing many of the other benefits along the way. This would require a complete refactor of all integration tests: understanding what the original bug was and recreating the test case in a lower-level, more unit-test-like fashion, or creating a new Dockerfile with fully pinned FROM images and apt/apk packages (although if we understand the root cause of the problem, installing loads of packages will likely not be necessary). Finally, we would have to create a small tool that can blank the timestamps in the layers after the fact, creating a canonical version from a tar dump. Once that is accomplished, you wouldn't even need a remote registry and could simply cache the target image digests. Adding new test cases would be as simple as calling a helper script that builds with docker, canonicalizes the timestamps, and then saves the target digest. Integration tests would run kaniko, output to tar, then run the canonicalize tool and read out the digest to compare.
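
A rough sketch of what that canonicalization tool could look like, assuming plain (uncompressed) layer tars; a real tool would also need to handle gzip headers and the timestamps in the image config:

```go
package main

import (
	"archive/tar"
	"io"
	"log"
	"os"
	"time"
)

// canonicalize rewrites a layer tar (stdin to stdout) with fixed timestamps so
// that otherwise-identical layers produce identical bytes and digests.
func canonicalize(in io.Reader, out io.Writer) error {
	tr := tar.NewReader(in)
	tw := tar.NewWriter(out)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		// Pin every timestamp; all other metadata and the file contents are
		// copied through unchanged.
		hdr.ModTime = time.Unix(0, 0)
		hdr.AccessTime = time.Time{}
		hdr.ChangeTime = time.Time{}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := io.Copy(tw, tr); err != nil {
			return err
		}
	}
	return tw.Close()
}

func main() {
	if err := canonicalize(os.Stdin, os.Stdout); err != nil {
		log.Fatal(err)
	}
}
```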

Besides the obvious chore work of converting all existing test cases, this would also put a load on the maintainers to properly screen new test cases for sources of variability. (Although I imagine existing Dockerfile linters could be adapted to catch the most common cases automatically.)

I think this could be a viable long-term goal, no matter whether we restore container-diff functionality in the short term or not. If we get back to all-green tests, we could institute a policy requiring all new integration tests to follow the new approach and then slowly migrate the existing tests over, rather than attempting a big-bang change.

As for fixing the failing test in the short term, we could either pin an older (pre-OCI) ubuntu image as the base in the Dockerfile, or build and maintain or vendor a fixed container-diff. If we go with the older ubuntu image, we would need to prioritize a long-term fix, since any tag rolling over to OCI will create further sporadic test failures at random points in the future.

@imjasonh (Collaborator)

> @imjasonh Right now even getting back to green tests will be a huge boon, enabling future work. On that note, since we are blocked by the container-diff PR, should we consider a plan B in case the maintainers over there don't react soon?

We can disable any integration tests that are affected by this in the meantime. I don't know if anybody is still maintaining container-diff, unfortunately.

Let's just add a t.Skip or comment out TestRun/test_Dockerfile_test_issue_684 for now.
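
For illustration, the skip could look roughly like this; in the real suite it would sit inside TestRun's loop over the integration Dockerfiles rather than in a standalone test (the test name below is hypothetical):

```go
package integration

import "testing"

// Hypothetical placement; the actual test name and structure follow TestRun.
func TestDockerfileIssue684(t *testing.T) {
	t.Skip("temporarily skipped: container-diff cannot compare OCI images yet " +
		"(see GoogleContainerTools/container-diff#389)")
}
```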

@imjasonh (Collaborator)

Pinning to an older ubuntu is also a good fix. I'd rather not maintain a fork of container-diff here, since maintainer bandwidth is already a scant resource on this project, and picking up another possibly-abandoned project as a dependency isn't going to make that any better. I'd rather risk introducing a regression on that issue by skipping its integration test, and unblock future progress and cleanups.

Thank you again for your hard work and tenacity on this. This is likely going to end up being pretty thankless work in general, but I can give you all the thanks I have. Keep it up and we'll find a way to make you a maintainer*

*this is a threat 😆

@BronzeDeer (Contributor Author)

I added the older ubuntu pin. If the tests now pass and we can merge, I'll rebase the original reproducibility fix on top of it and we can get that out of the way as well

@BronzeDeer (Contributor Author)

Belay that: apparently the older ubuntu images have some execve problem with their shells. I'll look into it and push a fixed version soon.

The dockerfile for the regression test connected to issue 684 used a
rolling tag as its base image, making it flaky and prone to failure ever
since it was introduced.

This commit pins the base image to the digest of bionic-20200219, which,
based on the date of the commit that introduced the dockerfile, would
have been the newest ubuntu build at the time and likely what the
"rolling" tag resolved to back then. Since this is also an image from
the pre-OCI days of ubuntu, it circumvents a bug in container-diff as
well (GoogleContainerTools/container-diff#389).
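
In the Dockerfile this amounts to something like the following; the digest below is a placeholder, not the one used in the commit:

```dockerfile
# Placeholder digest for illustration only; the commit pins the actual
# sha256 digest of ubuntu:bionic-20200219.
FROM ubuntu:bionic-20200219@sha256:<digest-of-bionic-20200219>
```
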
@BronzeDeer (Contributor Author)

Pushed a fix; all tests are green on my fork now.

@imjasonh merged commit 14ea7c4 into GoogleContainerTools:main on Mar 21, 2023

@imjasonh mentioned this pull request on Mar 21, 2023