Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Manifest.toml #3783

Merged
merged 15 commits into from
Sep 21, 2024
Merged

Remove Manifest.toml #3783

merged 15 commits into from
Sep 21, 2024

Conversation

glwagner
Copy link
Member

I noticed that some issues with buildkite have to do with needing to re-resolve the Manifest, so this is yet another attempt to fix the race condition in our CI...

@glwagner
Copy link
Member Author

@simone-silvestri @jagoosw I think this PR may solve the race condition issue. The essential reason is that the Manifest was being re-resolved after init because downstream tests imported additional packages that were not imported during the initial instantiation.

@glwagner
Copy link
Member Author

It may not be necessary to delete the Manifest here but on the other hand, since we are always re-resolving I think we may as well.

@jagoosw
Copy link
Collaborator

jagoosw commented Sep 19, 2024

Is it possible to get julia to ignore the manifest (i.e. specify that we want to ignore it in the CI), so that when a user downloads it they can still use the manifest?

Having asked that I don't actually know what the manifest does that isn't already taken care of by the Project?

@glwagner
Copy link
Member Author

Is it possible to get julia to ignore the manifest (i.e. specify that we want to ignore it in the CI), so that when a user downloads it they can still use the manifest?

Having asked that I don't actually know what the manifest does that isn't already taken care of by the Project?

The Project compat is supposed to take care of everything. It's just that subtle bugs can creep into packages that might cause things to fail for us (for example something subtle can break type inference, causing everything to fail on the GPU). So using the Manifest is a bit more conservative, because we fix all of the packages that don't need to be upgraded. On the other hand, we are better community members by deleting the Manifest because catching those bugs is important for everyone.

@glwagner
Copy link
Member Author

Is it possible to get julia to ignore the manifest (i.e. specify that we want to ignore it in the CI), so that when a user downloads it they can still use the manifest?
Having asked that I don't actually know what the manifest does that isn't already taken care of by the Project?

The Project compat is supposed to take care of everything. It's just that subtle bugs can creep into packages that might cause things to fail for us (for example something subtle can break type inference, causing everything to fail on the GPU). So using the Manifest is a bit more conservative, because we fix all of the packages that don't need to be upgraded. On the other hand, we are better community members by deleting the Manifest because catching those bugs is important for everyone.

To be more specific, with Project.toml we specify compat on packages we depend on, but we do not have compat on packages that our dependencies depend on. The Manifest tracks all packages including dependencies of dependencies (and dependencies of those). We've caught issues in the past by comparing the committed Manifest to another Manifest with different packages. Some issues include, a version of MPI_jll that doesn't work on Mac, or some weird bug in ChainRulesCore... things that we don't have explicit compat for...

@jagoosw
Copy link
Collaborator

jagoosw commented Sep 19, 2024

Ah I understand, that makes sense!

@glwagner
Copy link
Member Author

glwagner commented Sep 19, 2024

Well, it looks like this PR does not completely solve the problem, and the jury is out perhaps on whether it helps or not. I did learn a few things. It seems the race condition originates from an attempt to "re-resolve" the manifest during the test:

https://buildkite.com/clima/oceananigans/builds/17516#01920c32-4e89-4654-a4a8-9b077e51e87c/39-221

The puzzling part is how to solve the problem. I thought at first we could solve it by preventing the Manifest from being "re-resolved" --- perhaps by resolving it correctly during initialization? But actually, this isn't possible, since the problematic manifest in question is a temporary run that is created only for the tests (and its different from the project's manifest, because it also includes the test dependencies). This temporary test manifest can't be initialized as far as I can tell and is created independently for each test.

Finally, it seems that some race conditions happen in a print statement (this is just from reading the error message). It all boils down eventually to these two lines:

https://github.com/JuliaLang/Pkg.jl/blob/5fbfa125045ce3e68ce10bf9fc1727bb3232c123/src/Operations.jl#L799-L800

The command its trying to run is

ERROR: failed process: Process(`/net/ocean/home/data44/data5/glwagner/julia-1.10.5/bin/julia -C native -J/net/ocean/home/data44/data5/glwagner/julia-1.10.5/lib/julia/sys.so -O0 -g1 --color=yes -O0 --color=no --history-file=no --startup-file=no --project=/tmp/jl_f9Z8t1/Project.toml --eval 'append!(empty!(Base.DEPOT_PATH), ["/data5/glwagner/.julia-17516"])
--
  | append!(empty!(Base.DL_LOAD_PATH), String[])

cd("/data5/glwagner/.julia-17516/packages/CUDA_Runtime_jll/YgJCI/.pkg")
--
  | include("/data5/glwagner/.julia-17516/packages/CUDA_Runtime_jll/YgJCI/.pkg/select_artifacts.jl")
  | ' -t1 --startup-file=no x86_64-linux-gnu-libgfortran5-cxx11-julia_version+1.10.5`, ProcessSignaled(11)) [0]

and supposedly the issue arises within the select_artifacts.jl call in CUDA_Runtime_jll:

https://buildkite.com/clima/oceananigans/builds/17516#01920c32-4eb2-4e77-80c5-fd88280e8c6c/37-224

@glwagner
Copy link
Member Author

Here's a tutorial for automatically retrying buildkite builds: https://buildkite.com/blog/job-retries

@glwagner
Copy link
Member Author

glwagner commented Sep 20, 2024

Interesting. I noticed that there were some unnecessary test dependencies and removed those. That seems to have helped.

@glwagner
Copy link
Member Author

Well, the tests passed finally with no manual intervention so I declare my work complete @simone-silvestri

Copy link
Collaborator

@simone-silvestri simone-silvestri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is great!

@glwagner glwagner merged commit 1961b66 into main Sep 21, 2024
46 checks passed
@glwagner glwagner deleted the glw/rm-manifest branch September 21, 2024 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants