-
-
Notifications
You must be signed in to change notification settings - Fork 14.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cuda-modules: apply lessons learned from cross-compilation attempts #301416
cuda-modules: apply lessons learned from cross-compilation attempts #301416
Conversation
a688406
to
79c50d7
Compare
Outdated `nixpkgs-review` for 79c50d75eb4593169b52154b8c65d896855b6848
✅
|
I'm able to run $ nix run -L --override-input nixpkgs "github:NixOS/nixpkgs/79c50d75eb4593169b52154b8c65d896855b6848" .#nix-cuda-test
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Files already downloaded and verified
Files already downloaded and verified
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0
1 | model | ViT | 86.3 M
-----------------------------------------------
86.3 M Trainable params
0 Non-trainable params
86.3 M Total params
345.317 Total estimated model params size (MB)
Epoch 9: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 781/781 [01:26<00:00, 9.04it/s, v_num=4, train_loss=2.350, val_loss=2.330]`Trainer.fit` stopped: `max_epochs=10` reached.
Epoch 9: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 781/781 [01:28<00:00, 8.82it/s, v_num=4, train_loss=2.350, val_loss=2.330] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing up this documentation @ConnorBaker !
I don't currently have a need for cross-compilation support, so I'm fine continuing to pursue the cudaPackages
transition, but I propose that we consult with @kmittman to double check that we're not building ourselves into a corner.
@kmittman I'm not sure if you've already been appraised of the cross-compilation work going on here, but tl;dr it looks like cross-compilation doesn't work with the CUDA redistributable packages. We've been investing in migrating away from the runfile-based CUDA towards the redistributables, but this discovery gives us pause. Any insight you may have would be constructive for us and much appreciated!
pkgs/development/cuda-modules/setup-hooks/auto-add-driver-runpath/default.nix
Outdated
Show resolved
Hide resolved
pkgs/development/cuda-modules/setup-hooks/mark-for-cudatoolkit-root/default.nix
Outdated
Show resolved
Hide resolved
.../development/cuda-modules/setup-hooks/mark-for-cudatoolkit-root/mark-for-cudatoolkit-root.sh
Outdated
Show resolved
Hide resolved
.../development/cuda-modules/setup-hooks/mark-for-cudatoolkit-root/mark-for-cudatoolkit-root.sh
Outdated
Show resolved
Hide resolved
|
d2963d7
to
4e84386
Compare
See #302694 (which is stacked on this PR).
I'll try this again later and see if I can get it to work. |
For more information about *why* this is desirable, see NixOS#45717 and NixOS#27069
…ix deprecation warning
@samuela @SomeoneSerge @kmittman thank you all for the feedback on the cross-compilation notes -- I've moved that to #303122 to try to unblock this PR. |
80330cd
to
b1158d6
Compare
Important This review was run against commit b1158d6ff287d1f9c726a516ab08330643f67a0a Note Template PR="301416"; \
SYSTEM="aarch64-linux"; \
CUDA_SUPPORT="true"; \
CUDA_CAPABILITIES='[ "7.5" ]'; \
nixpkgs-review pr "${PR:?}" \
--system "${SYSTEM:?}" \
--no-shell \
--checkout commit \
--allow aliases \
--build-args "--max-jobs 1" \
--build-graph nix \
--skip-package-regex 'python3(\d){1,2}Packages\.(pytorch-pfn-extras|ffcv)' \
--skip-package-regex 'python3(\d){1,2}Packages\.(shap|mlflow|optuna)' \
--skip-package-regex 'python3(\d){1,2}Packages\.(apricot-select)' \
--extra-nixpkgs-config "{
allowUnfree = true;
allowBroken = false;
checkMeta = true;
cudaSupport = ${CUDA_SUPPORT:?};
cudaCapabilities = ${CUDA_CAPABILITIES:?};
}" The packages Template log archival command: PR="301416"; \
RUN_NUMBER="13"; \
SYSTEM="aarch64-linux"; \
CUDA_CAPABILITIES="7_5"; \
tar -cvf "${SYSTEM:?}-cap-${CUDA_CAPABILITIES:?}-pr-${PR:?}${RUN_NUMBER:+-$RUN_NUMBER}.tar.zst" \
-I "zstd -T0 --long=31 --ultra -22" \
-C "${HOME:?}/.cache/nixpkgs-review/pr-${PR:?}${RUN_NUMBER:+-$RUN_NUMBER}" \
. Logs are made available at https://drive.google.com/drive/folders/1GgABhwa2ooKeZXMqf5He6PcyClNrE6cQ?usp=share_link ✅
|
pkgs/development/cuda-modules/cuda-library-samples/extension.nix
Outdated
Show resolved
Hide resolved
pkgs/development/cuda-modules/setup-hooks/mark-for-cudatoolkit-root-hook.sh
Outdated
Show resolved
Hide resolved
pkgs/development/cuda-modules/cuda-library-samples/extension.nix
Outdated
Show resolved
Hide resolved
The setupCudaHook always checks for the existence of `$prefix/nix-support/include-in-cudatoolkit-root`, so we need to be sure it always exists. It isn't populated when `strictDeps` is set, but it must exist.
On NixOS, the the return status is that of the last command executed within the function or script. When we're doing tests immediately before the return, the value you end up returning might not be what you want.
…rs, and enable on Jetson post-11.4
…lify src, break out comments in postPatch
af1ffad
to
6208368
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Now this looks strictly like a simplification over the previous state!
@@ -86,7 +86,7 @@ let | |||
[ | |||
(import ../development/cuda-modules/setup-hooks/extension.nix) | |||
(callPackage ../development/cuda-modules/cuda/extension.nix { inherit cudaVersion; }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could still add your comment about which callPackage
this is - this would be the place.
@@ -1,14 +1,25 @@ | |||
# shellcheck shell=bash | |||
|
|||
# Should we mimick cc-wrapper's "hygiene"? | |||
[[ -z ${strictDeps-} ]] || (( "$hostOffset" < 0 )) || return 0 | |||
(( ${hostOffset:?} == -1 && ${targetOffset:?} == 0)) || return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: I believe that "in practice" the derivation never sees offsets other than {-1, 0, 1}
, but I saw other hooks write < 0
instead of == -1
... I really can't say what this change means in the larger scheme of things
brokenConditions = prevAttrs.brokenConditions // { | ||
"libnvjitlink missing (CUDA >= 12.0)" = | ||
!(cudaAtLeast "12.0" -> (libnvjitlink != null && libnvjitlink.lib != null)); | ||
"libcusparse missing (CUDA >= 12.1)" = | ||
!(cudaAtLeast "12.1" -> (libcusparse != null && libcusparse.lib != null)); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noticed this leaks into the .drv
file, affecting the hash
Important
To-do:
cmakeFlagsArray
in the setup hooks does not cause problems by (for example) overriding user-specified values incmakeFlags
because of the order of string interpolation in the CMake setup hook.Note
Description of changes
Supersedes #279952 and #275560.
This PR instead adds fixes and updates accumulated over the course of work on #279952 and #275560. Work related specifically to rewriting the setup hooks is covered by #302694.
*Platform
throughstdenv
rather than directly frompkgs
substituteInPlace
to--replace-fail
to fix deprecationcmakeCudaArchitectures
andcmakeCudaArchitecturesString
return
so it does not use the return value of the previously executed command (which would sometimes cause failures when evaluating bash functions)setupCudaHook
wherein attempting to read an empty marker file would failnccl
andsaxpy
brokenConditions
Things done
nix.conf
? (See Nix manual)sandbox = relaxed
sandbox = true
nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"
. Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/
)Add a 👍 reaction to pull requests you find important.