pytorch: 1.0.0 -> 1.2.0 #65041
Conversation
Thanks for working on this! Hopefully we can get this as feature complete as possible.
@@ -72,6 +67,12 @@ in buildPythonPackage rec {
    PYTORCH_BUILD_VERSION = version;
    PYTORCH_BUILD_NUMBER = 0;

    USE_FBGEMM = 0; # this can't build because of CMAKE downloads
    USE_MKLDNN = 0; # mkl with cuda is broken
    USE_NCCL = 0; # multigpu looks broken broken
Did you try including NCCL?
Yes, unfortunately. I reviewed setup.py and tried using their bundled NCCL (via USE_SYSTEM_NCCL=0), setting the NCCL variables manually (NCCL_ROOT_DIR, NCCL_LIB_DIR, NCCL_INCLUDE_DIR), and making sure that nccl.out and nccl.dev are included.
None of these seem to work. You will quickly (<10m) get the following errors:
nvlink error : Undefined reference to '_Z25ncclAllReduceRing_min_f32P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error : Undefined reference to '_Z27ncclAllReduceRingLL_min_f32P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error : Undefined reference to '_Z25ncclAllReduceTree_min_f32P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error : Undefined reference to '_Z27ncclAllReduceTreeLL_min_f32P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error : Undefined reference to '_Z25ncclAllReduceRing_min_f64P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error : Undefined reference to '_Z27ncclAllReduceRingLL_min_f64P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error : Undefined reference to '_Z25ncclAllReduceTree_min_f64P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
nvlink error : Undefined reference to '_Z27ncclAllReduceTreeLL_min_f64P14CollectiveArgs' in '/build/source/build/nccl/obj/collectives/device/functions.o' (target: sm_30)
make[5]: *** [Makefile:68: /build/source/build/nccl/obj/collectives/device/devlink.o] Error 255
make[4]: *** [Makefile:44: /build/source/build/nccl/obj/collectives/device/colldevice.a] Error 2
make[4]: *** Waiting for unfinished jobs....
There are a lot of these undefined references, all coming from /build/source/build/nccl/obj/collectives/device/functions.o.
Maybe this is because nvlink is attempting to perform checks in the GPU runtime?
Update: I found the symbols in "${nccl.dev}"/lib/libnccl_static.a, but they seem to fail lookups on build, even when they are included in $NIX_LDFLAGS. Manually setting the NCCL_* environment variables seems to fix this!
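For reference, a minimal sketch of how those variables might be wired into the derivation, assuming `nccl` here is the nixpkgs package with its usual `out` and `dev` outputs (the exact phase and paths are assumptions, not the final diff):

```nix
# Sketch only: point pytorch's build at the nixpkgs NCCL outputs instead of
# letting setup.py probe for a system install. Assumes `nccl` is in scope.
preConfigure = ''
  export NCCL_ROOT_DIR="${nccl}"
  export NCCL_LIB_DIR="${nccl}/lib"
  export NCCL_INCLUDE_DIR="${nccl.dev}/include"
'';
```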
@@ -36,14 +39,6 @@ in buildPythonPackage rec {
    sha256 = "076cpbig4sywn9vv674c0xdg832sdrd5pk1d0725pjkm436kpvlm";
Is this SHA correct? Looks the same to me as 1.0.0?
Ah, yes -- I'm primarily working off of https://github.com/stites/libtorch-nix and missed this when transferring the details over. Good catch -- I'm a bit surprised this slipped through, since I was building from my fork (using the command from the PR message).
@@ -72,6 +67,12 @@ in buildPythonPackage rec {
    PYTORCH_BUILD_VERSION = version;
    PYTORCH_BUILD_NUMBER = 0;

    USE_FBGEMM = 0; # this can't build because of CMAKE downloads
Worth reviewing the instructions at https://github.com/pytorch/FBGEMM. Looks like we can avoid downloads by setting ASMJIT_SRC_DIR, CPUINFO_SRC_DIR, GOOGLETEST_SOURCE_DIR.
So I did set these as environment variables, using pointers from your gist: https://gist.github.com/tbenst/15495f97ffe2a71dc18c49e247af8f51#file-pytorch_11-nix-L62
Unfortunately, these didn't seem to propagate through to CMake. I'll do a round-two later.
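For context, the approach being discussed amounts to pre-fetching FBGEMM's third-party sources and handing them to its build via those variables. A rough sketch is below; the owners/revisions/hashes are hypothetical placeholders, not the real pins, and whether plain environment attributes reach the nested CMake is exactly the open question here:

```nix
# Sketch only: pre-fetch FBGEMM's vendored dependencies and expose them through
# the variables its build looks for. Revisions and hashes below are fake and
# must be replaced with real pins before this can actually fetch anything.
ASMJIT_SRC_DIR = fetchFromGitHub {
  owner = "asmjit"; repo = "asmjit";
  rev = "0000000000000000000000000000000000000000";
  sha256 = lib.fakeSha256;
};
CPUINFO_SRC_DIR = fetchFromGitHub {
  owner = "pytorch"; repo = "cpuinfo";
  rev = "0000000000000000000000000000000000000000";
  sha256 = lib.fakeSha256;
};
GOOGLETEST_SOURCE_DIR = fetchFromGitHub {
  owner = "google"; repo = "googletest";
  rev = "0000000000000000000000000000000000000000";
  sha256 = lib.fakeSha256;
};
```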
@@ -72,6 +67,12 @@ in buildPythonPackage rec {
    PYTORCH_BUILD_VERSION = version;
    PYTORCH_BUILD_NUMBER = 0;

    USE_FBGEMM = 0; # this can't build because of CMAKE downloads
    USE_MKLDNN = 0; # mkl with cuda is broken
MKL with cuda is not broken, actually -- just MKLDNN. The problem seems to be a warning which is treated as an error:
/build/source/third_party/ideep/mkl-dnn/src/cpu/jit_uni_reorder_utils.cpp: In function ‘void mkldnn::impl::cpu::tr::prb_normalize(mkldnn::impl::cpu::tr::prb_t&)’:
/build/source/third_party/ideep/mkl-dnn/src/cpu/jit_uni_reorder_utils.cpp:273:29: error: array subscript is above array bounds [-Werror=array-bounds]
|| p.nodes[j].os < p.nodes[min_pos].os
~~~~~~~~~^
/build/source/third_party/ideep/mkl-dnn/src/cpu/jit_uni_reorder_utils.cpp:276:37: error: array subscript is above array bounds [-Werror=array-bounds]
&& p.nodes[j].n < p.nodes[min_pos].n);
~~~~~~~~~^
cc1plus: all warnings being treated as errors
make[2]: *** [third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/build.make:1272: third_party/ideep/mkl-dnn/src/CMakeFiles/mkldnn.dir/cpu/jit_uni_reorder_utils.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
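One possible workaround (not necessarily the one nixpkgs already carries) would be to demote that specific diagnostic from an error back to a warning, assuming the flag actually propagates into the vendored mkl-dnn sub-build:

```nix
# Assumption: NIX_CFLAGS_COMPILE reaches the bundled mkl-dnn compilation units.
# This keeps the warning visible but stops -Werror from failing the build on it.
NIX_CFLAGS_COMPILE = "-Wno-error=array-bounds";
```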
Found the issue: pytorch/pytorch#22346
I missed the workaround at https://github.com/NixOS/nixpkgs/blob/fb76f093c9ab204fbec9dc9c56cb9b4a81f43bc7/pkgs/development/python-modules/pytorch/default.nix#L79
I'll be adding a mklSupport ? false flag, which will swap out the numpy BLAS implementation so that this is more explicit. The working code is similar to my_magma:
my_numpy = if mklSupport && numpy.blasImplementation != "mkl" then numpy.override { blas = mkl; } else numpy;
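For anyone following along, usage would presumably look something like the snippet below once the flag lands; the attribute path and argument name are assumptions until the final diff is pushed:

```nix
# Hypothetical usage: build pytorch against an MKL-backed numpy/magma stack.
python3Packages.pytorch.override { mklSupport = true; }
```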
I've just rebased this on top of master, which means I need to cherry-pick and remove the #61347 commit (65494f2) before this is finalized. I don't think this should be a problem, since I believe everything needs to be squashed anyway. I've updated the PR with the following:
Things that still need to be worked on (a check means that I have these working locally and/or in libtorch-nix).
For updates on this work, please check out https://github.com/stites/libtorch-nix. @tscholak is also experimentally building the C++
FBGEMM works in https://github.com/stites/libtorch-nix/blob/master/deps/fbgemm.nix but, unfortunately, setup.py seems to hide arguments passed into the build. The problem, found in #61820, is that pytorch didn't have the fbgemm dependencies under submodule control until a commit in May (1.1.0 was released in April), and git submodules cannot be patched. Furthermore, since FBGEMM is optional, I think the simplest solution is to put a version bound that disables FBGEMM unless pytorch is >1.1.0. This way FBGEMM will automatically turn on if someone bumps the version and forgets about this workaround. According to the milestones, it looks like 1.1.1 might be cut for release soon.
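A version bound along those lines could be as simple as the sketch below; the exact comparison and cutoff are assumptions, the intent is just "off for 1.1.0 and earlier, on afterwards":

```nix
# Sketch: keep FBGEMM disabled for pytorch <= 1.1.0, where its dependencies
# aren't proper submodules yet, but let it switch on automatically on the
# next version bump so the workaround can't be forgotten.
USE_FBGEMM = if lib.versionOlder version "1.1.1" then 0 else 1;
```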
I asked around on the pytorch slack and the 1.1.1 release is going to be cut this week, so I think waiting is the best course of action. For posterity, if anyone is trying to build pytorch-1.1.0, this PR is fully functional without FBGEMM -- which provides optional, low-precision general matrix multiplication for small batch sizes, various MM-based optimizations, and fused operators. If you are working off of an x86 CPU backend, however, having FBGEMM might be a bit of a necessity.
https://github.com/pytorch/pytorch/releases/tag/v1.2.0 -- looks like they skipped the 1.1.1 release, so I'll update this PR for pytorch-1.2.0. Update: unfortunately, pytorch-1.2.0 brings back the NCCL issues. Here is the build matrix I am working with:
The ":grey_exclamation:" means that namedtensors and binaries weren't attempted in pytorch-v1.1.0. The "*" implies that FBGEMM is not included in any of the "Full" builds in pytorch-v1.1.0, but is included for all of pytorch-v1.2.0 I've also started including previously ignored tests for pytorch-1.2.0. It looks like we can get away with only disabling |
Still working on this, slowly, in pytorch-world, and am updating the above build matrix to cut down on GitHub notifications -- I'll make a final commit and message eventually. I am also going to set
I made an update-worthy breakthrough for pytorch-1.2.0. The errors I saw weren't with NCCL, but with the 3.0 CUDA architectures, documented in pytorch/pytorch#23573. I'm currently trying to figure out how to polish this up and did notice that cmake was throwing this warning:
Which is quite troubling. So far, we have been using
I'll be looking at adjusting
NOTE: Because of sandboxing, the build can't auto-detect the hardware's cuda architecture, so there is also now a problem around new architectures not being supported until explicitly added to this derivation.
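The shape of the fix being described, together with the cudaArchList argument mentioned in the changelog, might look roughly like the sketch below; the default list and the exact variable handoff are assumptions:

```nix
# Sketch: accept an explicit list of CUDA architectures as a package argument,
# since the sandboxed build cannot probe the GPU. The default list is an
# assumption, not what the final derivation ships.
{ cudaArchList ? [ "5.0" "6.0" "6.1" "7.0" "7.5" ] }:
{
  # pytorch's build reads this to decide which GPU code generation targets to use.
  TORCH_CUDA_ARCH_LIST = builtins.concatStringsSep ";" cudaArchList;
}
```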
I'll update this PR this week -- I've also added myself as a maintainer for this package.
I've gone ahead and pushed up the latest changes; pytorch-1.2.0 is now fully functional! Note that this is still blocked by #61347 and I can squash the commits if desired, but this PR now touches openmpi (adding cuda support to it) and magma (adding mkl support to it). Final notes:
This is ready for review. Again, I'm not sure if I need to squash all of this or split the PR up into parts. Because I just found out how to properly support cuda-9, cachix binaries are pending -- I'll comment one last time to ping folks who want to test-drive this (although only the vanilla cuda-9 build is in there).
Rebased! After this PR, I'll start on 1.3.0.
@jonringer good to merge?
Even with my 3900X, this takes a while to build, maxing out all 24 threads >.>
Is caffe2 supposed to be bundled with pytorch?
[nix-shell:/home/jon/.cache/nix-review/pr-65041-1]$ tree -L 4 ./results/python37Packages.pytorchWithCuda/
./results/python37Packages.pytorchWithCuda/
├── bin
│ ├── convert-caffe2-to-onnx
│ └── convert-onnx-to-caffe2
└── lib
└── python3.7
└── site-packages
├── caffe2
├── torch
└── torch-1.2.0.dist-info
If so, please add click to the propagatedBuildInputs:
[nix-shell:/home/jon/.cache/nix-review/pr-65041-1]$ ./results/python37Packages.pytorchWithCuda/bin/convert-caffe2-to-onnx --help
Traceback (most recent call last):
File "/nix/store/cbrg1ldydrl4wb6v4xpkkfh2jbmhwaiv-python3.7-pytorch-1.2.0/bin/.convert-caffe2-to-onnx-wrapped", line 7, in <module>
from caffe2.python.onnx.bin.conversion import caffe2_to_onnx
File "/nix/store/5ck5ncbzq8pr9aynzyp3j2y0bhk0v42i-python3.7-pytorch-1.2.0/lib/python3.7/site-packages/caffe2/python/onnx/bin/conversion.py", line 11, in <module>
import click
ModuleNotFoundError: No module named 'click'
closes #69424
I believe caffe2 is used for the "production-ready" features of pytorch. I've added click to the propagatedBuildInputs. My ML box is currently down (just moved offices), so this is a bit of an optimistic push.
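For completeness, the change is essentially a one-liner in the derivation; the list shown here is abbreviated, not the full set of propagated inputs:

```nix
# click is needed at runtime by the bundled convert-caffe2-to-onnx /
# convert-onnx-to-caffe2 scripts; other propagated inputs are omitted here.
propagatedBuildInputs = [ numpy click ];
```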
I'm building and testing now. If it seems to work in the trivial case, I'll merge :)
@GrahamcOfBorg eval
Not sure how much you use onnx, but I added a cpu-only version to nixpkgs in #67542.
nix-review passes on NixOS (python38Packages.pytest is broken)
diff LGTM
[nix-shell:/home/jon/.cache/nix-review/pr-65041-2]$ python
Python 3.7.4 (default, Jul 8 2019, 18:31:06)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.2.0'
[3 built (2 failed), 0.0 MiB DL]
error: build of '/nix/store/zqv58qrag64v4db7ljcmk62r3w4vjjd7-env.drv' failed
https://github.com/NixOS/nixpkgs/pull/65041
3 package failed to build:
python38Packages.pytorch python38Packages.pytorchWithCuda python38Packages.torchvision
4 package were build:
magma python37Packages.pytorch python37Packages.pytorchWithCuda python37Packages.torchvision
The caffe2 executables still need some love, but I'll do that in a different PR, seeing as this one has been going on for close to 3 months.
Oops -- to conform to CONTRIBUTING.md, please break out the work into the following commits:
maintainers: add tscholak
openmpi: enable cuda support
magma: use cudatoolkit
python3Packages.pytorch: 1.0.0 -> 1.2.0
Quick ping to mention that this change was made!
I might wait for #71780 to get in, as it has significant improvements to python38. And it's really hard to keep 480 commits from bit-rotting fast.
fridh said he would de-conflict the merge at a later time.
Heads-up: an initial build of pytorch-1.3.0 succeeds on python3.7 with and without cuda. Tests are failing due to some dependency issues. I'm also about to be very busy for the next week -- expect a PR after.
No sweat, take your time man. We appreciate your effort :)
Looks like there is still an issue in the grid sweep. pytorch-1.3 will be delayed a bit.
I think 2947c23 should also be backported to 19.09?
Motivation for this change
Closes #63073
Closes #65041
Closes #67468
This depends on @tbenst's #61347 (thank you!). See #63073 and this PR thread for more details.
Work can be previewed at stites/pytorch-world. Building from source takes a long time, so please check out the hosted cachix binaries.
Changelog:
- utils
- cudaArchList argument that allows users to test on the latest nvidia hardware without waiting for this package.
- .dev output which gives top-level access to generated C/C++ code (libtorch).
- cudaSupport is disabled.

Things done
- Tested using sandboxing (sandbox set in nix.conf on non-NixOS)
- Tested compilation of all pkgs that depend on this change using nix-shell -p nix-review --run "nix-review wip"
- Tested execution of all binary files (usually in ./result/bin/)
- Determined the impact on package closure size (by running nix path-info -S before and after)