{tools}[gfbf/2023a] jax v0.4.25 w/ CUDA 12.1.1 #20119
Test report by @ThomasHoffmann77 |
Test report by @ThomasHoffmann77 |
Test report by @ThomasHoffmann77 |
Test report by @branfosj Same three failures as #19841 (comment) |
Test report by @branfosj Same three failures as #19841 (comment) |
Test report by @ThomasHoffmann77 |
I don't have a build node setup to upload test reports. Did see this test error:
I see you're all building with |
Test report by @ThomasHoffmann77 |
Test report by @ThomasHoffmann77 |
Test report by @Flamefire |
Test report by @Flamefire |
In both cases the failure is:
Due to XLA comes with even more dependencies ( |
easybuild/easyconfigs/j/jax/jax-0.4.25-foss-2023a-CUDA-12.1.1.eb
Co-authored-by: Alexander Grund <Flamefire@users.noreply.github.com>
Test report by @Flamefire This is caused by a crash. It isn't really clear why it fails or in which test, as when I run the crashing test file manually it works. Attaching GDB shows |
easybuild/easyconfigs/m/ml_dtypes/ml_dtypes-0.3.2-foss-2023a.eb
Ah, I'm blind. So used to *100 being A100, not used to H100 yet (even though we should be having them in production in the next 3 or so weeks ;-)). It's a bit of a long shot, but what if you only build for a single compute capability, i.e. only the one for H100 (that's 9.0 I believe, right)? |
Test report by @casparvl |
Nope, using single compute capability for the H100 (9.0) also fails in the same way. |
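For reference, restricting a build to a single compute capability could look like the fragment below. This is a minimal sketch in easyconfig (Python) syntax, not the exact change that was tested here; it assumes the standard `cuda_compute_capabilities` easyconfig parameter is honoured by the easyblock in use, and that 9.0 is the right value for H100 as suggested above.

```python
# Hypothetical easyconfig fragment (assumption: not copied from this PR).
# Build only for compute capability 9.0 (H100) instead of a list of targets.
cuda_compute_capabilities = ['9.0']

# The same restriction can usually be applied site-wide on the command line:
#   eb --cuda-compute-capabilities=9.0 <easyconfig>.eb
```

As reported above, this did not avoid the failure on H100, so it is mainly useful for narrowing down whether multi-architecture builds are involved.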
0.4.29 has been released in the meantime. It might be worth trying this version.
First attempt at using 0.4.29 with this toolchain failed:
Test report by @VRehnberg |
Test report by @VRehnberg |
Test report by @VRehnberg |
To get this to run on H100, one needs a newer CUDA. 0.4.29 with foss/2023a and CUDA/12.5.0 passes all but a single test, and that failure is because the test itself is broken...
LGTM
🎉 |
easybuild/easyconfigs/j/jax/jax-0.4.25-gfbf-2023a-CUDA-12.1.1.eb
fix local_extract_cmd according to @akesandgren 's suggestion
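The extract command used in this PR, `cp %s %(builddir)s/archives`, mixes a bare `%s` placeholder (the source file) with a named easyconfig template (`%(builddir)s`). The sketch below illustrates one way such a string can be resolved in two steps. It is only an illustration under stated assumptions: `resolve_extract_cmd` is a hypothetical helper, not EasyBuild framework code, and the paths are made up.

```python
import re


def resolve_extract_cmd(cmd, templates, src):
    """Fill %(name)s templates first, then the bare %s source placeholder.

    Hypothetical helper for illustration; not part of EasyBuild.
    """
    def fill_named(match):
        name = match.group(1)
        # Leave unknown templates untouched rather than guessing a value.
        return templates.get(name, match.group(0))

    resolved = re.sub(r"%\((\w+)\)s", fill_named, cmd)
    # str.replace avoids the % operator, so a literal '%' elsewhere in the
    # command cannot raise a formatting error.
    return resolved.replace("%s", src, 1)


cmd = resolve_extract_cmd(
    "cp %s %(builddir)s/archives",
    {"builddir": "/tmp/eb-build"},   # hypothetical build directory
    "/tmp/sources/example.tar.gz",   # hypothetical source path
)
print(cmd)  # cp /tmp/sources/example.tar.gz /tmp/eb-build/archives
```

Resolving the named templates before the positional placeholder keeps the two substitution stages independent of each other.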
Edit: A recent change in a global pip.conf made the build fail. This is unrelated to this PR. |
Which one exactly and what was the error? Might be worth addressing in framework |
https://gist.github.com/VRehnberg/5be54199260e8a478002d18dd986725c
Test report by @VRehnberg |
Test report by @VRehnberg |
Ah I remember that. There is a fix in the easyblocks: easybuilders/easybuild-easyblocks#3374 |
Thanks, I wasn't using that easyblock; I'm still on easyblocks 4.9.2.
So the tests are not VRAM hungry. I never saw more than 1 GB used, and a T4 only has 16 GB in total, so that's a strict limit even if our monitoring missed a short spike. The tests do use about 27 GB of regular RAM though, in case that could be an issue.
Test report by @lexming |
LGTM
Merging, thanks a lot for keeping up with all the issues @ThomasHoffmann77 ! |
(created using eb --new-pr)
requires:
edit: requires bug fix in framework for "cp %s %(builddir)s/archives" to work as extract command: