
[REVIEW] Update CTK to CUDA 11.0 #9

Merged
merged 40 commits on Aug 25, 2020

Conversation


@mike-wendt commented Jul 30, 2020

In addition, add the NVIDIA EULA and update the About section to reflect the contents of this package.

This follows the initial work in #7 for the CUDA 11 RC.

@jjhelmus

@mike-wendt Can you rebase now that #6 has been merged?

Also, adding __cuda >=11.0 as a run requirement will prevent installation on incompatible systems, including CentOS 6.

Address merge-conflicts and update changes to work for CUDA 11

* upstream-master:
  Add override flag
  fix nonembedded extract
  fix for embedded image
  add ppc64le support

# Conflicts:
#	build.py

jjhelmus commented Aug 7, 2020

I'm getting errors similar to FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpnydooixm/cuda-toolkit/lib64' when I build this on linux. Should the recipe be looking for files in .../lib rather than .../lib64?

The recent merge seems to have broken the process that was working. CUDA 11 appears to prefer '--toolkit' over the '--extract' flag used for CUDA 10.2 and earlier.
@mike-wendt (Author)

I'm getting errors similar to FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpnydooixm/cuda-toolkit/lib64' when I build this on linux. Should the recipe be looking for files in .../lib rather than .../lib64?

I think fixing the merge in my latest commit resolves this, as I reverted to using --toolkit, which I had working successfully before. That said, I'm unable to build due to conda/conda-build#4000.

@jjhelmus @jakirkham do you have any suggestions? Right now I've been removing it for testing, but I know it is needed in the final released package.

Attempting to use the changes from ppc64le merge before changing them
Will probably need to add a ppc64le override to 'lib64'
This does not work either and causes failures during solve. Need to remove to unblock our CI.
@mike-wendt (Author)

mike-wendt commented Aug 20, 2020

These versions do not have any run constraints:

CUDA 11 GA - https://anaconda.org/nvidia/cudatoolkit/files?version=11.0.194
CUDA 11 Update 1 - https://anaconda.org/nvidia/cudatoolkit/files?version=11.0.221

Adding __cuda >=11.0 broke all of our solves in gpuCI, and trying __glibc >=2.17 also broke solves that should succeed. I understand the need for a constraint to ensure CentOS 6 users do not pick up this package, but there has to be a way to keep them from getting the package without breaking existing solves. I'm open to suggestions, but at this point neither option works for us.

Error message using __glibc>=2.17:

  - feature:/linux-64::__glibc==2.23=0
  - feature:|@/linux-64::__glibc==2.23=0
  - cudatoolkit=11.0 -> __glibc[version='>=2.17']

Your installed version is: 2.23

Error using __cuda>=11.0:

  - cudatoolkit=11.0 -> __cuda[version='>=11.0']

Your installed version is: not available

Both of these were from Docker builds running on a CPU-only node, but building CUDA images. As I mentioned in a commit, we need something that works for CPU-only environments as well, given that we build all of our conda packages on CPU-only nodes with CUDA images.

cc @kkraus14 @jakirkham
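One escape hatch conda provides for exactly this bind is the virtual-package override. A minimal sketch for a CPU-only build node, assuming conda >= 4.8; the recipe path is illustrative, not from this PR:

```shell
# On a CPU-only node conda reports the __cuda virtual package as
# "not available", so a __cuda >=11.0 dependency or constraint cannot solve.
# Exporting CONDA_OVERRIDE_CUDA tells the solver to assume a driver version:
export CONDA_OVERRIDE_CUDA="11.0"
echo "solver will assume __cuda==${CONDA_OVERRIDE_CUDA}"

# conda build recipe/    # illustrative invocation; solves without a GPU driver
```

This is the same CONDA_OVERRIDE_CUDA mechanism jjhelmus reports working with conda 4.8.4 later in this thread.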

@jakirkham

What does conda info say?

@mike-wendt (Author)

What does conda info say?

As far as what? I've replaced the packages, so I don't have any to test without rebuilding them.

@jakirkham

Am trying to understand more about the machine where this conflict is showing up. conda info would help with that.

@mike-wendt (Author)

mike-wendt commented Aug 21, 2020

Am trying to understand more about the machine where this conflict is showing up. conda info would help with that.

It was running on an Ubuntu 18.04 AWS node doing a docker build. The error comes from inside the docker build on any image; this is just one example.

Any of the failed CUDA 11 builds for this job have the same __glibc error

@mike-wendt (Author)

This is a failing job for the __cuda constraint and the full matrix

@mike-wendt (Author)

@jjhelmus this is ready for review, along with input on the above constraint issues. Thanks

@kkraus14

@jjhelmus would it be possible for you to review this again in the near future? We're trying to push out the RAPIDS 0.15 release and this update is needed to build CUDA 11 enabled conda packages in conda-forge for things like CuPy.

Is there anything we can do on our end to help reduce the maintenance burden on you?


jjhelmus commented Aug 25, 2020

@kkraus14 @mike-wendt This looks good outside of the question on how to constrain the package. I've been able to replicate the build on our machines for linux-64 and am trying a build on our linux-ppc64le machine as well.

I need to do some more testing around the __cuda and __glibc virtual packages; they are not behaving as I expect. Would it be reasonable to include run_constrained: __cuda >=11.0 here for the time being and, if necessary, hotfix this with something different later? The existing cudatoolkit packages use run_constrained to enforce the driver requirement.

@jjhelmus

Was able to confirm this builds fine on linux-ppc64le as well. The proposed change to match the other cudatoolkit packages is:

$ git diff 
diff --git a/meta.yaml b/meta.yaml
index 1f5361f..6ad4d6c 100644
--- a/meta.yaml
+++ b/meta.yaml
@@ -32,8 +32,8 @@ requirements:
     - tqdm
     # for run_exports
     - {{ compiler('cxx') }}
-  #run:
-  #  - __glibc >=2.17 # [linux]
+  run_constrained:
+    - __cuda >=11.0

I have linux packages built with this change that I plan on uploading to defaults tonight/early tomorrow morning unless there are concerns.

@jjhelmus

mike-wendt#4

@mike-wendt (Author)

@jjhelmus this is not in master currently. How did the ppc64le 10.2 package get published with the constraint shown on the web, while the tarball and its included meta.yaml do not have it?

add run_constrained requirement on __cuda
Co-authored-by: jakirkham <jakirkham@gmail.com>
@jakirkham

So I thought we were already hotfixing cudatoolkit packages in PR ( AnacondaRecipes/repodata-hotfixes#81 ). Does that already do what we want or should we be doing something different?

@jjhelmus

@jjhelmus this is not in master currently. How did the ppc64le 10.2 package get published with the constraint shown on the web, while the tarball and its included meta.yaml do not have it?

The constraint gets added via a patch when the packages are indexed. Details are in AnacondaRecipes/repodata-hotfixes#81

@jjhelmus

So I thought we were already hotfixing cudatoolkit packages in PR ( AnacondaRecipes/repodata-hotfixes#81 ). Does that already do what we want or should we be doing something different?

We are, but ideally the packages would have the constraint included rather than patched in.

@mike-wendt (Author)

So I thought we were already hotfixing cudatoolkit packages in PR ( AnacondaRecipes/repodata-hotfixes#81 ). Does that already do what we want or should we be doing something different?

This is my point: why is this change necessary here when it is obviously being added elsewhere? Now I know where.

@jakirkham

Ok, that sounds fine.

On the building point, could you please check whether this ( #9 (comment) ) works, Mike?

@mike-wendt (Author)

@jakirkham I have to rebuild this package and then try to build images, which is an hour or more of work. Given we're in the middle of a release, I don't have that time to troubleshoot this at the moment. If you're both happy with this, then I would say merge and publish.

I still have packages without the constraint, so we won't be impacted, but my suspicion is that this is a larger issue. From my view, an image that is FROM nvidia/cuda and has Miniconda installed should work and not fail with this constraint. That being said, it looks like I have a workaround, so I'm good.


jjhelmus commented Aug 25, 2020

With conda 4.8.4 I'm able to build packages from either of these two recipes if CONDA_OVERRIDE_CUDA=11.0 is set in the shell prior to calling conda build:

constrained:

package:
  name: test
  version: 1.0.0
requirements:
  run_constrained:
    - __cuda >=11.0
test:
  commands:
    - echo "Hi"

run:

package:
  name: test
  version: 2.0.0
requirements:
  run:
    - __cuda >=11.0
test:
  commands:
    - echo "Hi"

test-1.0.0 (the constrained version) is not installable with conda 4.8.4 on a system without the CUDA 11 driver, but it can be installed with earlier versions of conda.

test-2.0.0 is not installable without the CUDA 11 driver with either conda 4.8.4 or earlier versions.
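To make that difference concrete, here is a hedged sketch of checking both forms on a machine with no NVIDIA driver; the conda invocations are left as comments because they assume the test packages above have been built into a local channel:

```shell
# Assuming conda >= 4.8 and a driver-less machine, where conda reports the
# __cuda virtual package as "not available":
#   conda install --dry-run test=1.0.0   # run_constrained: fails on 4.8.4
#   conda install --dry-run test=2.0.0   # run: fails on 4.8.4 and earlier
# Overriding the virtual package satisfies either form without a real driver:
export CONDA_OVERRIDE_CUDA="11.0"
echo "override in place: solver assumes __cuda==${CONDA_OVERRIDE_CUDA}"
#   conda install --dry-run test=2.0.0   # solves with the override set
```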

@jakirkham

No worries @mike-wendt. Just trying to make sure you have a path forward 🙂

@jjhelmus

Merging. linux-64 and linux-ppc64le packages should be available on defaults tonight; win-64 will need to wait until the end of the week.

@jjhelmus jjhelmus merged commit 3310110 into AnacondaRecipes:master Aug 25, 2020