Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should rulesets distribute a pre-built artifact rather than rely on GitHub source/release archive #11

Closed
alexeagle opened this issue Nov 16, 2021 · 101 comments

Comments

@alexeagle
Copy link
Contributor

Rules ought to distribute an artifact that doesn't contain references to development-time dependencies, and omits testing code and examples.

This means the distribution can be broken if files are missing.

In addition, rules ought to integration-test against all supported bazel versions. So there should be some bazel-in-bazel test that consumes the HEAD distribution artifact and tests that the examples work.

Right now there are a few ways. rules_nodejs and rules_python have a built-in integration test runner. rules_go has a special go_bazel_test rule.

@alexeagle
Copy link
Contributor Author

See https://docs.google.com/document/d/1s_8AihGXbYujNWU_VjKNYKQb8_QGGGq7iKwAVQgjbn0/edit?usp=sharing for discussion around the requirements for testing against multiple Bazel versions.

@aherrmann
Copy link
Member

Rules ought to distribute an artifact that doesn't contain references to development-time dependencies, and omits testing code and examples.

Could you motivate this? It is not clear to me why this should be mandated.

If the motivation is that users of a rule set should not depend on dev-dependencies of that rule set, then this can be achieved without a dedicated distribution artifact. E.g. in rules_haskell dev-dependencies are only pulled in in rules_haskell's own WORKSPACE file, while regular dependencies are pulled in by the rules_haskell_dependencies macro that users are meant to call as well. Also the upcoming Bazel modules mechanism has a notion of dev-dependencies IIRC.

I think it is a plus that Bazel rule sets can be imported directly from their source repository at any commit without needing to generate a distribution artifact first. This makes it very easy to pull in a non-release commit of a rule set that contains a needed fix. If rule sets are only intended to be used from distribution artifacts, then this use-case is no longer necessarily supported, as a rule set may depend on generated files that are only included in the distribution artifact.

Either way, I don't think this should be mandated without the required tooling being available. See below.


Regarding bazel-in-bazel tests. I agree that this would be useful to have. We have looked into this for rules_haskell and in this context looked into a Gazelle extension to generate filegroup targets capturing all files required to run the rule set. (The same would be useful for generating distribution artifacts.)

We based our efforts on Gazelle's test_filegroup. However, we found it to be lacking for our use-case. Issues that come to mind are that it does not respect .gitignore or .bazelignore files, leading to invalid file inclusions of e.g. embedded workspaces for integration testing or user local configuration files like .bazelrc.local. Or that it assumes that every directory is a Bazel package, which is not a valid assumption and breaks labels like //my/pkg:src/some/source/file.

It would be great to have general purpose versions of test_filegroup and go_bazel_test available for any rule set to use. I'd view this as a prerequisite for this recommendation.

@alexeagle
Copy link
Contributor Author

Mostly the pre-built distribution artifact is required to get a stable checksum. If you rely on GitHub's generated .tgz source archives, you get a breakage when GitHub makes OS updates on their servers that create those archives.
It's also handy to avoid someone building @some_ruleset//... and breaking because there's a load statement there from a dev dependency.
I agree that it's desirable that the distro artifact is same-shaped as the source repo (generally just a subset of files) so that it's easy to opt-in to a HEAD dependency. We made that mistake with rules_nodejs and are working to undo that.

@ittaiz
Copy link

ittaiz commented Nov 21, 2021

Hi 👋🏽,
Few more thoughts on my end:

  1. Having bazel in bazel tests is super valuable also for internal rules authors. I've had this need a few times.
  2. Just in case someone doesn't know then there's the bazel integration testing repo. I've failed to keep it alive but I think a lot of the concepts there are valuable.
  3. +1 for being able to use commits from github. In practice we only had 1 issue with the checksums in 5-6 years and I can't count how many builds.

@alexeagle
Copy link
Contributor Author

@ittaiz what do you think about the SIG contributing or owning the current integration test repo in bazelbuild org?

@ittaiz
Copy link

ittaiz commented Nov 21, 2021

Be happy to add contributors and even hand ownership over if you feel that's important

@aherrmann
Copy link
Member

Mostly the pre-built distribution artifact is required to get a stable checksum. If you rely on GitHub's generated .tgz source archives, you get a breakage when GitHub makes OS updates on their servers that create those archives.

Is this still true? I haven't found official GitHub documentation stating that the archives are reproducible, but I have found this reproducible-builds thread pointing out that Github uses git archive and that git archive is designed to be reproducible.

Just as a quick test I compared the GH archive to a git archive created locally on rules_haskell.

$ curl -L https://github.com/tweag/rules_haskell/archive/455d9e6e8212f0bb73cd6e5437b0f5ce093e44be.tar.gz|sha256sum -
6841e554566d0c326beac84442dd776c49fac7d6059fef4728e75ae37c8e92cc  -
$ git clone https://github.com/tweag/rules_haskell; cd rules_haskell; git archive --format=tar --prefix=rules_haskell-455d9e6e8212f0bb73cd6e5437b0f5ce093e44be/ 455d9e6e8212f0bb73cd6e5437b0f5ce093e44be | gzip > tarball.tgz; sha256sum tarball.tgz
6841e554566d0c326beac84442dd776c49fac7d6059fef4728e75ae37c8e92cc  tarball.tgz

As you can see, the SHA256 is identical. This suggests that the archive is generated reproducibly.

Anecdotally, the only instance where I encountered issues with a changing commit hash in the last couple years was kubernetes/kubernetes#99376. In this case the change was due to a problematic .gitattributes configuration.

@alexeagle
Copy link
Contributor Author

@aherrmann I've followed this guidance ever since Jay Conrod made a big deal out of it in rules_go and bazel_gazelle. bazel-contrib/rules_go#2340 suggests maybe some GitHub URLs are reliable and some are not?

There is yet another reason I think rules should build their own distribution archive, which is that you can calculate your own checksum to produce the WORKSPACE snippet in the release process before shipping the commits to GitHub.

@aherrmann
Copy link
Member

@aherrmann I've followed this guidance ever since Jay Conrod made a big deal out of it in rules_go and bazel_gazelle. bazelbuild/rules_go#2340 suggests maybe some GitHub URLs are reliable and some are not?

Thanks for the pointer, I dug into this a little. I've attached the details in the end, in short: I don't think this was a case of the Github generated source archive changing. Instead, it looks to me as though this was a mixup between the SHA for the Github generated source archive and the release artifact. So, I don't think this is evidence to support the claim that Github source archives are non-reproducible.

There is yet another reason I think rules should build their own distribution archive, which is that you can calculate your own checksum to produce the WORKSPACE snippet in the release process before shipping the commits to GitHub.

The same can be achieved using git archive --format=tar.gz --prefix=$NAME-$TAG/ $TAG | sha256sum when using source archives.

To be clear, I'm not saying one should not use release artifacts. But, I am saying that I don't see why it should be mandated that everyone use them without a good technical reason to motivate that mandate. I haven't seen such a reason, yet. As mentioned above, there are upsides to the source archive approach and costs to the release artifact approach.


Details:

If we take a look at the changes in the PR we see

--- a/multirun/deps.bzl
+++ b/multirun/deps.bzl
@@ -4,7 +4,7 @@ def multirun_dependencies():
     _maybe(
         http_archive,
         name = "bazel_skylib",
-        sha256 = "2ef429f5d7ce7111263289644d233707dba35e39696377ebab8b0bc701f7818e",
+        sha256 = "2ea8a5ed2b448baf4a6855d3ce049c4c452a6470b1efd1504fdb7c1c134d220a",
         strip_prefix = "bazel-skylib-0.8.0",
         urls = ["https://github.com/bazelbuild/bazel-skylib/archive/0.8.0.tar.gz"],
     )

The 0.8.0 release has a release artifact and of course the generated source archive. If we look at the SHAs of each of these we find

$ curl -L https://github.com/bazelbuild/bazel-skylib/releases/download/0.8.0/bazel-skylib.0.8.0.tar.gz|sha256sum -
2ef429f5d7ce7111263289644d233707dba35e39696377ebab8b0bc701f7818e  -
$ curl -L https://github.com/bazelbuild/bazel-skylib/archive/refs/tags/0.8.0.tar.gz|sha256sum -
2ea8a5ed2b448baf4a6855d3ce049c4c452a6470b1efd1504fdb7c1c134d220a  -

I.e. the old hash was the hash of the release artifact and the new hash is the hash of the generated source archive.

If we compare the contents of these two archives we find

$ curl -L https://github.com/bazelbuild/bazel-skylib/releases/download/0.8.0/bazel-skylib.0.8.0.tar.gz|tar ztv|head -n 5
...
drwxrwxr-x root/root         0 2019-03-20 18:13 .bazelci/
-rw-rw-r-- root/root      2348 2019-03-20 18:13 .bazelci/presubmit.yml
-rw-rw-r-- root/root         9 2019-03-20 18:13 .gitignore
-rw-rw-r-- root/root       308 2019-03-20 18:13 AUTHORS
-rw-rw-r-- root/root      1002 2019-03-20 18:13 BUILD

$ curl -L https://github.com/bazelbuild/bazel-skylib/archive/refs/tags/0.8.0.tar.gz|tar ztv|head -n 5
...
drwxrwxr-x root/root         0 2019-03-20 18:13 bazel-skylib-0.8.0/
drwxrwxr-x root/root         0 2019-03-20 18:13 bazel-skylib-0.8.0/.bazelci/
-rw-rw-r-- root/root      2348 2019-03-20 18:13 bazel-skylib-0.8.0/.bazelci/presubmit.yml
-rw-rw-r-- root/root         9 2019-03-20 18:13 bazel-skylib-0.8.0/.gitignore
-rw-rw-r-- root/root       308 2019-03-20 18:13 bazel-skylib-0.8.0/AUTHORS

I.e. the release artifact has no prefix, while the generated source archive does have the standard <repo>-<rev> prefix.

The change is from Jan 2020, I'm pretty sure Github generated source archives had the <repo>-<rev> prefixes at that time as well. So, it looks like the old hash was never that of a Github generated source archive, but that of the release artifact. It seems the issue here was most likely not that the generated source archive changed, but that the wrong hash was written into multirun/deps.bzl before.

For reference, I can produce an equivalent to the Github generated source archive with the same hash on my machine today:

$ git archive --format=tar.gz --prefix=bazel-skylib-0.8.0/ 0.8.0 | sha256sum
2ea8a5ed2b448baf4a6855d3ce049c4c452a6470b1efd1504fdb7c1c134d220a  -

If I try to reproduce the release artifact I get a different hash than the release artifact uploaded on

$ git archive --format=tar.gz 0.8.0 | sha256sum
a04a79bca280f759ec2339c035e19d1f249616c38a352f9fdb8837a7c0ea2f7c  -

But, comparing this generated prefix-less tarball to the release tarball I find

$ curl -L https://github.com/bazelbuild/bazel-skylib/releases/download/0.8.0/bazel-skylib.0.8.0.tar.gz > released.tar.gz
$ git archive --format=tar.gz 0.8.0 > generated.tar.gz
$ diffoscope released.tar.gz generated.tar.gz
--- released.tar.gz
+++ generated.tar.gz
├── filetype from file(1)
│ @@ -1 +1 @@
│ -gzip compressed data, last modified: Wed Mar 20 18:02:49 2019, max compression
│ +gzip compressed data, from Unix
│   --- released.tar
├── +++ generated.tar
│ ├── filetype from file(1)
│ │ @@ -1 +1 @@
│ │ -POSIX tar archive (GNU)
│ │ +POSIX tar archive

So, the difference comes down to the release artifact containing slightly different headers including a timestamp.

@alexeagle alexeagle changed the title Recommend how rulesets can integration-test their distribution artifact Should rulesets distribute a pre-built artifact rather than rely on GitHub source/release archive Nov 30, 2021
@alexeagle
Copy link
Contributor Author

Great discussion. I think this issue ended up conflating two things. We agree that we need bazel-in-bazel integration testing of rules, let's move that to a new issue since the bulk of discussion here was about the release archive and that's just one motivation for bazel-in-bazel testing.

@alexeagle
Copy link
Contributor Author

I've updated all my repos, as well as the rules-template, to reflect that GitHub produces a stable SHA for the artifacts it serves.

@fmeum
Copy link
Member

fmeum commented Feb 1, 2022

Sorry to revive this closed issue, but I just encountered a situation in which the SHA of a GitHub-provided archive changed over time and thus ended up breaking the build.

Over at https://github.com/CodeIntelligenceTesting/jazzer, we use the following dependency on abseil-cpp:

    maybe(
        http_archive,
        name = "com_google_absl",
        sha256 = "5e1cbf25bf501f8e37866000a6052d02dbdd7b19a5b592251c59a4c9aa5c71ae",
        strip_prefix = "abseil-cpp-f2dbd918d8d08529800eb72f23bd2829f92104a4",
        url = "https://github.com/abseil/abseil-cpp/archive/f2dbd918d8d08529800eb72f23bd2829f92104a4.zip",
    )

An hour ago, CI runs started to fail with this error:

ERROR: /home/runner/work/jazzer/jazzer/driver/BUILD.bazel:21:11: //driver:fuzzed_data_provider depends on @com_google_absl//absl/strings:str_format in repository @com_google_absl which failed to fetch. no such package '@com_google_absl//absl/strings': java.io.IOException: Error downloading [https://github.com/abseil/abseil-cpp/archive/f2dbd918d8d08529800eb72f23bd2829f92104a4.zip] to /home/runner/.cache/bazel/_bazel_runner/6bc610921f14939de4c55cf170d55a62/external/com_google_absl/temp17765729958342005876/f2dbd918d8d08529800eb72f23bd2829f92104a4.zip: Checksum was 70203fec1c4823d4fe689f1c413bc7a0e6b4556dbd55b5ac40fc8862bacc0dcb but wanted 5e1cbf25bf501f8e37866000a6052d02dbdd7b19a5b592251c59a4c9aa5c71ae

I attached both the ZIP file that can currently be obtained from https://github.com/abseil/abseil-cpp/archive/f2dbd918d8d08529800eb72f23bd2829f92104a4.zip (abseil-cpp-f2dbd918d8d08529800eb72f23bd2829f92104a4.github-new.zip) and the ZIP file that was previously generated by GitHub and that I obtained from my local repository cache (abseil-cpp-f2dbd918d8d08529800eb72f23bd2829f92104a4.github-old.zip).

Running diffoscope on these files shows that the mtimes hour changed:

...
│┄ Archive contents identical but files differ, possibly due to different compression levels. Falling back to binary comparison.
├── zipinfo -v {}
│ @@ -28,15 +28,15 @@
│    file system or operating system of origin:      MS-DOS, OS/2 or NT FAT
│    version of encoding software:                   0.0
│    minimum file system compatibility required:     MS-DOS, OS/2 or NT FAT
│    minimum software version required to extract:   1.0
│    compression method:                             none (stored)
│    file security status:                           not encrypted
│    extended local header:                          no
│ -  file last modified on (DOS date/time):          2021 Nov 11 00:09:50
│ +  file last modified on (DOS date/time):          2021 Nov 11 08:09:50
│    file last modified on (UT extra field modtime): 2021 Nov 11 08:09:50 local
│    file last modified on (UT extra field modtime): 2021 Nov 11 08:09:50 UTC
│    32-bit CRC value (hex):                         00000000
│    compressed size:                                0 bytes
│    uncompressed size:                              0 bytes
│    length of filename:                             52 characters
│    length of extra field:                          9 bytes
...

@aherrmann Do you have an idea how this could happen and whether tar.gz would not have been prone to this?

@fmeum
Copy link
Member

fmeum commented Feb 1, 2022

Looks like the change has been rolled back, so this might have been an honest bug.

@brentleyjones
Copy link

And they said that they would insure the checksum doesn't change in the future. So I think this might even harden the case that we can rely on the checksum.

@fmeum
Copy link
Member

fmeum commented Feb 1, 2022

@brentleyjones That's great to know. Could you point me to the place where they confirmed that?

@brentleyjones
Copy link

So not as strong as a guarantee as I originally read it as, but it seems the rollback was related to the checksum change: https://twitter.com/tgummerer/status/1488493440103030787

@fmeum
Copy link
Member

fmeum commented Feb 1, 2022

There is https://twitter.com/tgummerer/status/1488493481874055173 though, so depending on archives for individual commits is unsafe.

@brentleyjones
Copy link

Yikes 😕

@alexeagle
Copy link
Contributor Author

I think we have to push hard and escalate (like Ulf did) to point out that GH is running a package repo and the world relies on it for supply-chain safety...

@alexeagle
Copy link
Contributor Author

/cc @tgummerer

alexeagle added a commit to aspect-build/rules_jasmine that referenced this issue Feb 10, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.
See bazel-contrib/SIG-rules-authors#11 (comment)
alexeagle added a commit to bazel-contrib/rules_jvm that referenced this issue Feb 10, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.

See bazel-contrib/SIG-rules-authors#11 (comment)
alexeagle added a commit to bazel-contrib/rules_oci that referenced this issue Feb 10, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.

See bazel-contrib/SIG-rules-authors#11 (comment)
alexeagle added a commit to bazel-contrib/bazel-mypy-integration that referenced this issue Feb 10, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.

See bazel-contrib/SIG-rules-authors#11 (comment)
alexeagle added a commit to aspect-build/rules_deno that referenced this issue Feb 10, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.
See bazel-contrib/SIG-rules-authors#11 (comment)
alexeagle added a commit to bazel-contrib/bazel-mypy-integration that referenced this issue Feb 10, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.

See bazel-contrib/SIG-rules-authors#11 (comment)
alexeagle added a commit to aspect-build/rules_py that referenced this issue Feb 10, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.
See bazel-contrib/SIG-rules-authors#11 (comment)
alexeagle added a commit to aspect-build/rules_rollup that referenced this issue Feb 13, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.
See bazel-contrib/SIG-rules-authors#11 (comment)
illicitonion pushed a commit to bazel-contrib/rules_jvm that referenced this issue Feb 14, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.

See bazel-contrib/SIG-rules-authors#11 (comment)
illicitonion pushed a commit to bazel-contrib/rules_jvm that referenced this issue Feb 14, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.

See bazel-contrib/SIG-rules-authors#11 (comment)
alexeagle added a commit to aspect-build/rules_webpack that referenced this issue Feb 14, 2023
GitHub's stability guarantee for the archive is iffy, and we want metrics on downloads.
See bazel-contrib/SIG-rules-authors#11 (comment)
vermeeren added a commit to breathe-doc/breathe that referenced this issue Mar 31, 2023
Since git v2.38.0 git archive tar.gz format default has changed from
invoking gzip to an internal gzip compressor implementation. However,
the output bitstream is not identical, meaning the resulting tar.gz
archive's checksum is different. This causes problems for PGP signing.

In order to avoid this issue for both old and new archive generation
alike manually invoke gzip in mkrelease script, bypassing git archive's
internal compression logic completely regardless of version.

GitHub and others presumably use a similar method to deal with this
change to keep old tag archive checksums from changing.

* git/git@4f4be00
* https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/
* https://github.com/orgs/community/discussions/45830
* bazel-contrib/SIG-rules-authors#11
gromgit referenced this issue in Homebrew/homebrew-core Jul 30, 2024
tiponada4l added a commit to tiponada4l/bazel-lib that referenced this issue Aug 10, 2024
Trabochuk added a commit to Trabochuk/bazel-lib that referenced this issue Aug 10, 2024
Bai-Chiang added a commit to Bai-Chiang/soconda that referenced this issue Nov 6, 2024
Using `/archive/refs/tags/` check sum hash may be more stable.

Some conflicting infomation:

[1] bazel-contrib/SIG-rules-authors#11 (comment)
[2] https://github.com/orgs/community/discussions/45830#discussioncomment-4823531
Bai-Chiang added a commit to Bai-Chiang/soconda that referenced this issue Nov 6, 2024
Using `/archive/refs/tags/` check sum hash may be more stable.

Some conflicting infomation:

[1] bazel-contrib/SIG-rules-authors#11 (comment)
[2] https://github.com/orgs/community/discussions/45830#discussioncomment-4823531
Bai-Chiang added a commit to Bai-Chiang/soconda that referenced this issue Nov 6, 2024
Using `/archive/refs/tags/` check sum hash may be more stable.

Some conflicting infomation:

[1] bazel-contrib/SIG-rules-authors#11 (comment)
[2] https://github.com/orgs/community/discussions/45830#discussioncomment-4823531
Bai-Chiang added a commit to Bai-Chiang/soconda that referenced this issue Nov 6, 2024
Using `/archive/refs/tags/` check sum hash may be more stable.

Some conflicting infomation:

[1] bazel-contrib/SIG-rules-authors#11 (comment)
[2] https://github.com/orgs/community/discussions/45830#discussioncomment-4823531
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests