-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove Manifest.toml #3783
Remove Manifest.toml #3783
Conversation
@simone-silvestri @jagoosw I think this PR may solve the race condition issue. The essential reason is that the Manifest was being re-resolved after init because downstream tests imported additional packages that were not imported during the initial instantiation. |
It may not be necessary to delete the Manifest here but on the other hand, since we are always re-resolving I think we may as well. |
Is it possible to get julia to ignore the manifest (i.e. specify that we want to ignore it in the CI), so that when a user downloads it they can still use the manifest? Having asked that I don't actually know what the manifest does that isn't already taken care of by the Project? |
The Project compat is supposed to take care of everything. It's just that subtle bugs can creep into packages that might cause things to fail for us (for example something subtle can break type inference, causing everything to fail on the GPU). So using the Manifest is a bit more conservative, because we fix all of the packages that don't need to be upgraded. On the other hand, we are better community members by deleting the Manifest because catching those bugs is important for everyone. |
To be more specific, with Project.toml we specify compat on packages we depend on, but we do not have compat on packages that our dependencies depend on. The Manifest tracks all packages including dependencies of dependencies (and dependencies of those). We've caught issues in the past by comparing the committed Manifest to another Manifest with different packages. Some issues include, a version of MPI_jll that doesn't work on Mac, or some weird bug in ChainRulesCore... things that we don't have explicit compat for... |
Ah I understand, that makes sense! |
Well, it looks like this PR does not completely solve the problem, and the jury is out perhaps on whether it helps or not. I did learn a few things. It seems the race condition originates from an attempt to "re-resolve" the manifest during the test: https://buildkite.com/clima/oceananigans/builds/17516#01920c32-4e89-4654-a4a8-9b077e51e87c/39-221 The puzzling part is how to solve the problem. I thought at first we could solve it by preventing the Manifest from being "re-resolved" --- perhaps by resolving it correctly during initialization? But actually, this isn't possible, since the problematic manifest in question is a temporary run that is created only for the tests (and its different from the project's manifest, because it also includes the test dependencies). This temporary test manifest can't be initialized as far as I can tell and is created independently for each test. Finally, it seems that some race conditions happen in a print statement (this is just from reading the error message). It all boils down eventually to these two lines: The command its trying to run is ERROR: failed process: Process(`/net/ocean/home/data44/data5/glwagner/julia-1.10.5/bin/julia -C native -J/net/ocean/home/data44/data5/glwagner/julia-1.10.5/lib/julia/sys.so -O0 -g1 --color=yes -O0 --color=no --history-file=no --startup-file=no --project=/tmp/jl_f9Z8t1/Project.toml --eval 'append!(empty!(Base.DEPOT_PATH), ["/data5/glwagner/.julia-17516"])
--
| append!(empty!(Base.DL_LOAD_PATH), String[])
cd("/data5/glwagner/.julia-17516/packages/CUDA_Runtime_jll/YgJCI/.pkg")
--
| include("/data5/glwagner/.julia-17516/packages/CUDA_Runtime_jll/YgJCI/.pkg/select_artifacts.jl")
| ' -t1 --startup-file=no x86_64-linux-gnu-libgfortran5-cxx11-julia_version+1.10.5`, ProcessSignaled(11)) [0] and supposedly the issue arises within the https://buildkite.com/clima/oceananigans/builds/17516#01920c32-4eb2-4e77-80c5-fd88280e8c6c/37-224 |
Here's a tutorial for automatically retrying buildkite builds: https://buildkite.com/blog/job-retries |
Interesting. I noticed that there were some unnecessary test dependencies and removed those. That seems to have helped. |
Well, the tests passed finally with no manual intervention so I declare my work complete @simone-silvestri |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is great!
I noticed that some issues with buildkite have to do with needing to re-resolve the Manifest, so this is yet another attempt to fix the race condition in our CI...