Bot missing recent LLVM releases #2531

Open

jakirkham opened this issue May 4, 2024 · 25 comments

@jakirkham
Contributor

It appears the bot started missing recent LLVM releases

The last bot release PR was 18.1.3:

However the last two needed to be handled manually:

That said, the bot does appear to have detected the releases:

So maybe there is an issue cropping up in the next step

@h-vetinari
Contributor

Thanks John! Other feedstocks that build from the exact same tag & sources had varying degrees of success.

| feedstock | 18.1.3 | 18.1.4 | 18.1.5 |
|---|---|---|---|
| llvmdev | bot | manual | manual |
| clangdev | bot | manual | manual |
| compiler-rt | bot | manual | no PR ❌ |
| openmp | bot | bot | no PR ❌ |
| lld | bot | manual | no PR ❌ |
| flang | bot | bot | bot |
| lldb | bot | bot | no PR ❌ |
| mlir | bot | manual | no PR ❌ |
| mlir-python-bindings | bot | bot | no PR ❌ |

@beckermr
Contributor

beckermr commented May 4, 2024

I think the bot has become sentient and only opens PRs now after we make issues noting they are not there! 😱

J/k but indeed the behavior is puzzling.

Typically the bot will try to make a version PR three times. If those three attempts fail, the PR is put in the backlog. PRs in the backlog are retried at random after any newly found versions are tried. So it could be that they were in the backlog and the bot finally cleared them.
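Very roughly, that logic looks something like the sketch below (illustrative only; the names `make_version_pr` and `MAX_ATTEMPTS` are made up for this comment and are not the actual cf-scripts code):

```python
import random

MAX_ATTEMPTS = 3  # hypothetical constant matching the "three attempts" described above


def run_version_pass(new_versions, backlog, attempts, make_version_pr):
    """One illustrative bot pass: fresh versions first, then random backlog retries.

    new_versions: iterable of (feedstock, version) keys found this pass
    backlog: set of keys that already exhausted their attempts
    attempts: dict mapping key -> number of attempts so far
    make_version_pr: callable returning True when the PR was opened successfully
    """
    for key in new_versions:
        attempts[key] = attempts.get(key, 0) + 1
        if make_version_pr(key):
            continue
        if attempts[key] >= MAX_ATTEMPTS:
            backlog.add(key)  # parked; only retried opportunistically from now on

    # Backlog entries are retried in random order after the newly found versions.
    for key in random.sample(sorted(backlog), k=len(backlog)):
        if make_version_pr(key):
            backlog.discard(key)
```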

@h-vetinari
Contributor

In the case of 18.1.4, the PRs didn't get opened over a period of 2 weeks, which is pretty long. I think the issue is that it failed on all of the first three attempts. The tag with the sources is there for all feedstocks equally. I guess one possible explanation is that upstream created the tags but left a longer-than-usual gap before uploading the llvm-project-{{ version }}.src.tar.xz sources, long enough for the updates to fall into the "failed three times" category?

Still, that doesn't really explain why almost all of the missing PRs got opened right after we started discussing it here - spooky! 😅

@jakirkham
Contributor Author

Lol 😂

Here are some other random guesses:

Did the fix you made yesterday, Matt, potentially have an effect on LLVM and friends?

Perhaps another possibility is some dependency changes over time.

Another thing of interest might be memory pressure. The LLVM (and Arrow) recipes are a bit more complicated, so they may be using more resources than the usual resource-light CI jobs have. I recognize there have been improvements made in various places (including conda-build), though I don't know which of those fixes are out in releases. If we see more evidence of this, it might be worth profiling.

Lastly, I recall there were some issues in the bot ~2 weeks ago that got cleared out. IIRC the first of the missed LLVM version updates was around then.

@xhochy
Member

xhochy commented May 6, 2024

@ytausch Can you also look at this? This is an issue close to the tooling you are working on.

@jakirkham
Contributor Author

Curious how things are looking a month later. Ok if we don't know. Just wanted to check back in 🙂

@ytausch
Contributor

ytausch commented May 31, 2024

I am currently working on decoupling some of the bot's code so that not only the version check but also the migration itself (which seems to be what is failing here) can be run locally with debugging enabled. That will provide a sustainable solution for problems like this one.

For that reason, I have not prioritized looking into this manually so far. Let me know if you see this differently.

@h-vetinari
Contributor

> Curious how things are looking a month later. Ok if we don't know. Just wanted to check back in 🙂

LLVM 18.1.6 worked fine (bot opened all relevant PRs); LLVM 18.1.7 got tagged >24h ago, but the official release was only ~7h ago.

Since we're generally relying on llvm-project-{{ 18.1.7 }}.src.tar.xz (which isn't generated by GitHub but uploaded by the release manager), it's possible/likely that the bot started looking, once the tag appeared, for a file that wasn't there yet. Indeed, the status page lists the llvmdev update as failed with:

3.00 attempts - bot error

I guess this is somewhat unavoidable as long as upstream has a long enough gap between tagging and uploading the tarballs. The solutions I see are:

  • Retry more often (expensive for the bot infra)
  • Special-case LLVM feedstocks (probably not worth it)
  • Switch the LLVM recipes to use the github sources directly -- that way there cannot be a race condition.

I think the last approach might actually be the sanest one.
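For concreteness, these are (to my understanding) the two source URL shapes in question; the version and patterns below are examples to illustrate the race and should be double-checked against the actual feedstock recipes:

```python
version = "18.1.7"          # example version only
tag = f"llvmorg-{version}"  # LLVM's release tag naming scheme

# Tarball uploaded by the LLVM release manager. It appears some time *after* the
# tag is pushed, which is the window the bot keeps tripping over:
release_tarball = (
    f"https://github.com/llvm/llvm-project/releases/download/{tag}/"
    f"llvm-project-{version}.src.tar.xz"
)

# Auto-generated archive for the tag. It is downloadable as soon as the tag
# exists, so there is no gap between "version detected" and "source available":
github_archive = f"https://github.com/llvm/llvm-project/archive/refs/tags/{tag}.tar.gz"
```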

@jakirkham
Contributor Author

Given every project needs to wait in 6hr increments for updates, this seems like reasonable behavior from the bot so far.

If there are ways to check for version updates more frequently than 6hrs, that seems like the best path for improvement (and is not specific to LLVM)

@h-vetinari
Contributor

> If there are ways to check for version updates more frequently than 6hrs, that seems like the best path for improvement

I don't see how that changes anything - the bot will just go into a "max failure" state faster. It's the recovery period after having hit max retries that seems to take 2-3 weeks (which presumably is the thing that would be effective to reduce).

In any case, switching to github-generated sources should 100% fix this problem for LLVM (and we don't even have submodules to deal with, so no benefit to using the upstream tarballs).

@jakirkham
Contributor Author

The difference would be that we would not need to wait another 6hrs once the source is available. We might wait 1hr or perhaps less. It also depends on whether we can move to something event-driven (as opposed to scraping-based).

However, the downside with GitHub-generated sources is that they are dynamically generated on demand, so their checksums can change between retrievals.
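To make the concern concrete, this is how one could spot such a change: hash the same auto-generated archive on two retrievals and compare (example URL; in practice the reported mismatches show up between retrievals that are days or weeks apart, not back to back):

```python
import hashlib
import urllib.request

# Example auto-generated archive; any tag archive would do.
URL = "https://github.com/llvm/llvm-project/archive/refs/tags/llvmorg-18.1.7.tar.gz"


def sha256_of(url: str) -> str:
    """Stream the archive and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with urllib.request.urlopen(url, timeout=60) as response:
        for chunk in iter(lambda: response.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


first, second = sha256_of(URL), sha256_of(URL)
print("stable" if first == second else f"hash changed: {first} != {second}")
```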

@h-vetinari
Contributor

> The difference would be that we would not need to wait another 6hrs once the source is available.

The point is we cannot influence the delay between tag creation and when the tarballs are uploaded; this may well be 24h, so retrying more often in that time has no use whatsoever. The only option would be to distinguish somehow between tag and tarball availability, and not count it as a failure if the tag is there but the tarball isn't. But that's just "more retries" by another name.
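That distinction could look roughly like this (a sketch only, using the public GitHub refs API plus a HEAD probe; how the real bot tracks attempts is a separate question):

```python
import urllib.error
import urllib.request


def tag_exists(repo: str, tag: str) -> bool:
    """The tag ref is queryable as soon as upstream pushes the tag."""
    # Unauthenticated requests are rate-limited; fine for an illustration.
    url = f"https://api.github.com/repos/{repo}/git/ref/tags/{tag}"
    try:
        with urllib.request.urlopen(url, timeout=30):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


def tarball_exists(url: str) -> bool:
    """The release asset only exists once the release manager has uploaded it."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=30):
            return True
    except urllib.error.HTTPError:
        return False


def classify(repo: str, tag: str, tarball_url: str) -> str:
    if not tag_exists(repo, tag):
        return "no new version"
    if not tarball_exists(tarball_url):
        return "tag only: defer, don't count this as a failed attempt"
    return "ready: open the version PR"
```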

> However, the downside with GitHub-generated sources is that they are dynamically generated on demand, so their checksums can change between retrievals.

That basically never happens because everyone and their dog depends on them being stable. 😅

@beckermr
Contributor

beckermr commented Jun 7, 2024

You may be able to restrict how the bot searches for versions so that it doesn't find the tag before the tarball is uploaded. I think you'd want to only have it look for URLs and not use GitHub's RSS feed.
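As a toy version of that idea (the candidate generation and helpers here are assumptions, not the bot's real version sources): instead of reading the release feed, probe the tarball URL for the next candidate version, so a release only counts as found once the file is actually downloadable.

```python
def bump_patch(version: str) -> str:
    """Illustrative candidate generation: only consider the next patch release."""
    major, minor, patch = version.split(".")
    return f"{major}.{minor}.{int(patch) + 1}"


def find_next_version(current: str, url_template: str, url_is_live):
    """Return the next patch release whose source tarball already exists, else None."""
    candidate = bump_patch(current)
    url = url_template.format(version=candidate)
    return candidate if url_is_live(url) else None


# Example wiring; url_is_live could be the HEAD probe from the sketch further up.
LLVM_TEMPLATE = (
    "https://github.com/llvm/llvm-project/releases/download/"
    "llvmorg-{version}/llvm-project-{version}.src.tar.xz"
)
```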

@beckermr
Contributor

beckermr commented Jun 7, 2024

Also, I very much doubt it is feasible that the bot could respond to release events from projects in an event-driven system.

@jakirkham
Contributor Author

> However, the downside with GitHub-generated sources is that they are dynamically generated on demand, so their checksums can change between retrievals.

> That basically never happens because everyone and their dog depends on them being stable. 😅

On the contrary, this happens quite regularly.

This affected us with the conda-build 24.5.0 release ( conda-forge/conda-build-feedstock#226 ) and conda before that ( conda-forge/conda-feedstock#228 (comment) ). There are well-documented cases elsewhere ( https://github.com/orgs/community/discussions/45830 ). In fact, when I have asked GitHub about the stability of these in the past, they have noted that they generate the artifacts dynamically and run some tests, but that checksums can change (so no guarantees). This issue has been going on for quite some time.

The general movement (even from GitHub) is towards more validation around artifacts (not less). Here is a blog post from GitHub last month on setting up artifact attestations, which provide even more information about published artifacts beyond their being stable (including an associated sha256 checksum). This of course requires stable artifacts produced once (not GitHub autogenerated releases).

Think we should consider carefully how we get our compiler source code and put a preference towards stable artifacts (ideally with more provenance data if possible).
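For what it's worth, "stable artifacts with an associated sha256" boils down to being able to pin and re-verify a digest, along these lines (illustrative sketch; the path and digest are placeholders):

```python
import hashlib
from pathlib import Path

# Placeholder value: the real digest would come from the recipe / upstream release data.
EXPECTED_SHA256 = "0000000000000000000000000000000000000000000000000000000000000000"


def verify_artifact(path: Path, expected_sha256: str) -> bool:
    """Re-hash a downloaded artifact and compare against the pinned digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256


# Example: verify_artifact(Path("llvm-project-18.1.7.src.tar.xz"), EXPECTED_SHA256)
```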

@jakirkham
Contributor Author

> Also, I very much doubt it is feasible that the bot could respond to release events from projects in an event-driven system.

We have long wanted an event-driven system ( #54 ), including for version updates ( #54 (comment) ).

Agree this may very well be a more substantial undertaking.

That said, I don't think we should rule out that possibility or the associated discussion simply because of that. Speccing it out would be the first step in creating a shovel-ready project for when someone shows up with the resources and interest to help out.

@h-vetinari
Contributor

> On the contrary, this happens quite regularly.

I'm aware of the cases you mention, and I still don't agree with "regularly". The first time GH changed the default compression level it broke the world (e.g. bazel recipes everywhere), and they reverted.

We're relying on GitHub-generated tarballs in many hundreds of feedstocks, and I can count on one hand the unexplained hash changes I've seen in the last couple of years while working across a similar number of feedstocks.

But even if a spurious change does happen, it is by far a smaller encumbrance than the bot tripping over itself and not opening PRs at all.

@jakirkham
Contributor Author

The frequency is not what is at issue. The unreliability is

For core infrastructure (like compilers), we should know reliably what it is produced from

@h-vetinari
Contributor

h-vetinari commented Jun 7, 2024

> The frequency is not what is at issue. The unreliability is

> For core infrastructure (like compilers), we should know reliably what it is produced from

We do know what it's produced from, i.e. the exact git tag. Whether the hash changes due to compression level or whatever else is completely irrelevant for provenance. Unless you are thinking about a scenario where github gets so compromised that someone can hijack the tarball generation, but that's not a realistic scenario to me (and we'd have much bigger problems then).

If you can solve the problem of the bot not opening PRs, whether through an event-based solution or some workaround in the bot infra, I'll happily switch back to the "official" tarballs (which, BTW, also aren't audited or signed). But it's not an option to have the bot regularly fail to issue PRs for this interrelated stack of feedstocks that are already a handful to maintain even with bot support.

@ytausch
Contributor

ytausch commented Jun 26, 2024

> You may be able to restrict how the bot searches for versions so that it doesn't find the tag before the tarball is uploaded. I think you'd want to only have it look for URLs and not use GitHub's RSS feed.

This would work, yes.

Making this configurable on a per-feedstock basis is probably not too complicated with an additional configuration option in the bot section of conda-forge.yml. However, I am still not really convinced that using the GitHub tarballs is a bad idea, at least for now.

@jakirkham
Contributor Author

Here's an example today of a GitHub autogenerated tarball having its checksum change

xref: conda-forge/cuda-python-feedstock#83 (comment)

@h-vetinari
Contributor

FWIW, since switching to GitHub tags across the LLVM feedstocks, PRs have been opened without problems (and no hashing issues observed either).

@ytausch
Contributor

ytausch commented Jun 27, 2024

> Here's an example today of a GitHub autogenerated tarball having its checksum change
>
> xref: conda-forge/cuda-python-feedstock#83 (comment)

Hmm, it doesn't seem like this should happen, as GitHub did not announce a change like this. Currently, GitHub-generated source archives should be stable, and they intend to announce any changes to this with six months' notice.

The checksum change in your example has another cause; I will comment on it there.

Edit: Oops, you seem to be right. It probably only happens very rarely that the hashes differ?

@ytausch
Contributor

ytausch commented Jun 27, 2024

> Making this configurable on a per-feedstock basis is probably not too complicated with an additional configuration option in the bot section of conda-forge.yml. However, I am still not really convinced that using the GitHub tarballs is a bad idea, at least for now.

I just found out this feature already exists: `bot.version_updates.sources`

Will create PRs for the LLVM repos.
