Ensure cached packages installed in CI test phase #14

Open
jakirkham opened this issue Jan 24, 2024 · 12 comments

@jakirkham
Member

Recently we ran into an issue on a project (cuCIM) where older packages of the project (libcucim & cucim from 23.12) were installed instead of the most recent packages from the PR (24.02.00a*). This made the installation look successful. However, old issues that had already been fixed in the development branch (branch-24.02) were not getting picked up.

This was ultimately caused by a solver issue. However, we were not able to ascertain that until we pinned the packages installed in the test phase to the latest version. Only then did the solver issue become clear, and we could work to resolve it.

I think we should take a closer look at this and come up with a way to guarantee that the cached packages are picked up rather than some other packages. I attempted to do this more directly by using the <channel>::<package> syntax, but that didn't work well with file-based channels. Maybe there is a better way to do this.
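
For reference, a minimal sketch of what that <channel>::<package> attempt looks like (the package name here is illustrative, not the exact command used):

# the <channel>::<package> syntax pins a package to a single channel; this works
# for named channels such as rapidsai-nightly
conda install "rapidsai-nightly::libcucim"

# the same pattern with a file-based channel (a local path of downloaded CI
# artifacts in place of the channel name) is the form that did not work well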

@jameslamb
Member

@jakirkham I'm a bit confused by what specifically "cached" means in your description, but I came here to report an issue that felt like the conda equivalent of #69 because that's how @vyasr described this issue in #69 (comment).

In cugraph 24.10, near the end of burndown, we observed a situation where 24.12 packages were getting pulled in on CI jobs targeting branch-24.10 (rapidsai/cugraph#4690).

It seems like the root cause was this pattern, which is used across RAPIDS CI scripts:

# create an environment
rapids-dependency-file-generator \
  --output conda \
  --file-key test_notebooks \
  --matrix "cuda=${RAPIDS_CUDA_VERSION%.*};arch=$(arch);py=${RAPIDS_PY_VERSION}" | tee env.yaml

rapids-mamba-retry env create --yes -f env.yaml -n test

# download packages built in earlier CI jobs
CPP_CHANNEL=$(rapids-download-conda-from-s3 cpp)
PYTHON_CHANNEL=$(rapids-download-conda-from-s3 python)

# install those packages
rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  libcugraph pylibcugraph cugraph

Adding those locally-downloaded files with --channel doesn't force conda install to use those packages... it's free to choose other packages to satisfy constraints like "cugraph", including packages from other RAPIDS releases!

It would be safer for CI purposes to do something like this for that last install:

RAPIDS_VERSION="$(rapids-version-major-minor)"

rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  "libcugraph=${RAPIDS_VERSION}.*" \
  "pylibcugraph=${RAPIDS_VERSION}.*" \
  "cugraph=${RAPIDS_VERSION}.*"

That would prevent the solver from ever falling back to older releases (e.g. using 24.08 packages on the 24.10 branch) or picking up newer releases (e.g. using 24.12 packages on the 24.10 branch).

I think it's worth making changes like that across RAPIDS in 24.12, what do you think?

cc @bdice @msarahan @KyleFromNVIDIA

@jakirkham
Member Author

Meaning the packages from S3

TBH this is something that would be solved by strict channel priority

@jameslamb
Member

Got it, ok.

I don't think the particular issue I just noted would be solved by strict channel priority. It's a case where conda chose 24.12.* packages from the rapidsai-nightly channel when we wanted to use 24.10.* packages from the rapidsai-nightly channel.

@jakirkham
Member Author

Hmm... I'm not following.

With your use case, it sounds like cuGraph packages should be in ${CPP_CHANNEL} and ${PYTHON_CHANNEL} correct?

If we added those channels higher in the channel order than rapidsai-nightly and enabled strict channel priority, it would only grab the cuGraph packages from ${CPP_CHANNEL} and ${PYTHON_CHANNEL} and ignore rapidsai-nightly. This is what we want, right?

What am I missing?
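
A rough sketch of that setup (assuming rapids-mamba-retry forwards these options to mamba/conda, and that the exact channel list may differ):

# enable strict channel priority (alternatively, pass --strict-channel-priority
# to the install command)
conda config --set channel_priority strict

# list the local CI-artifact channels first; under strict priority, any package
# they provide can then only come from them, never from rapidsai-nightly
rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  --channel rapidsai-nightly \
  --channel conda-forge \
  libcugraph pylibcugraph cugraph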

@jameslamb
Member

🙈 you're absolutely right John, the 24.10.* packages are coming from those local channels, not rapidsai-nightly, and strict channel priority would solve this. Sorry, I was not thinking about this the right way. Thanks for correcting me.

We might still want to consider something like the script changes I mentioned above, since they're a relatively small amount of effort compared to enforcing strict channel priority.

@jakirkham
Member Author

All good. Just wanted to make sure I wasn't missing something

Agree that would be a good improvement. We did the same thing in cuCIM: rapidsai/cucim#680

Another improvement that has been helpful is Charles's suggestion in #22. It helps both overall runtime and reliability.

@vyasr
Contributor

vyasr commented Oct 4, 2024

The reason that I likened this issue to #69 is that fundamentally --find-links is analogous to adding a conda channel, because it gives the package manager an extra place to pull dependencies from. The main difference between pip and conda in this regard is that pip has no concept of channel priority, so with pip you had to introduce constraints in order to force installation of the desired file. With conda, as John pointed out, channel priority should solve the problem. In the interim, though, the suggestion you proposed would work and would be the closest direct port possible of the pip constraints approach that you've added.
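
For comparison, a sketch of the pip-side pattern being referenced (the paths and version pin here are hypothetical): --find-links only adds an extra place to look, so a constraints file is what actually forces the locally built wheel to be chosen.

# exact-version constraint for the locally built wheel (version is illustrative)
echo "cugraph==24.10.0a123" > ./constraints.txt

# --find-links adds the local wheel directory as another source (analogous to
# adding a conda channel); --constraint forces that specific version to be used
pip install \
  --find-links ./local-wheel-dir/ \
  --constraint ./constraints.txt \
  cugraph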

@jameslamb
Member

Yep yep, makes sense. Thanks for saying you also support the idea. I'm gonna plan on doing this (adding the constraints in scripts) next week, for 24.12. It seems like a few hours of effort, and well worth it in my opinion to minimize pain like what we're experiencing in cugraph for 24.10 and trying to fix in rapidsai/cugraph#4690.

@jameslamb
Member

I created a sub-issue to track the work of updating those scripts: #106.

@jameslamb
Member

#106 helped limit the issue to "will always choose a nightly from within the same major RAPIDS release", but there's still room for improvement here (which is why this issue is still open).

Mentioning this because @davidwendt recently observed some behavior on rapidsai/cudf#17994 of the form "the package installed at test time was different from the one built in the same CI run".

Some things I can think of that would give us much greater confidence that the CI artifacts are getting installed:

  • strict channel priority (Switch to using strict channel priority during RAPIDS builds #84) + also putting the local downloaded-ci-artifacts channel first in channel order
  • getting the current version via rapids-generate-version (which computes a version with a git tag distance, like 25.04.00a357) and passing an == on that through to all installs in CI (see the sketch after this list)
    • then at most 2 packages would be viable: the one from the same CI run, and the corresponding nightly with exactly that same version (if it exists on the rapidsai-nightly channel)
    • would have to be really sure about how that tag distance is calculated... e.g. if I manually re-trigger a test job after a few days of active development on the target branch, will the tag distance be the same (because I haven't updated my branch)?
  • working on getting support for passing local filepaths through the {channel}::{package} pattern in conda install and similar
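
A sketch of that exact-version idea, assuming rapids-generate-version prints the full alpha version used for the CI artifacts (the variable name and exact output format here are illustrative):

# compute the full version built in this CI run, e.g. 25.04.00a357
RAPIDS_FULL_VERSION="$(rapids-generate-version)"

# pin every test-time install to exactly that version, so only the package from
# this CI run (or an identically-versioned nightly) can satisfy the solve
rapids-mamba-retry install \
  --channel "${CPP_CHANNEL}" \
  --channel "${PYTHON_CHANNEL}" \
  "libcugraph==${RAPIDS_FULL_VERSION}" \
  "pylibcugraph==${RAPIDS_FULL_VERSION}" \
  "cugraph==${RAPIDS_FULL_VERSION}"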

@gforsyth

I think I'm also hitting this while trying to port ucxx over to rattler-build, where the environment is failing to find the just-built packages: https://github.com/rapidsai/ucxx/actions/runs/13331288455/job/37236359429?pr=374#step:9:72

(It's possible something else is broken in the run above, but if so, stronger pinning would lead to clearer error messages.)

@jakirkham
Member Author

Are we using strict channel priority in that build? If so, is the nightly channel higher in priority?

If both of those are true, we should take a closer look at what is going on there
