reorganize Flax/JAX stack in 2023a: move jax + Optax to gfbf/2023a toolchain + use standalone Flax + absl-py as dependencies #21038

Merged

lexming
Contributor

@lexming lexming commented Jul 24, 2024

(created using eb --new-pr)

Adding new easyconfigs for Flax, which deserves its own package, and for absl-py, which is already used in many places.

This clarifies the dependency tree of the Flax/JAX stack as:
absl-py > jax > Optax > Flax
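
Read as a build order, the chain above means each package is installed only after the one to its left. A minimal sketch of that ordering, assuming a purely linear chain (the `deps` mapping is illustrative and not taken from the easyconfigs themselves, which list more dependencies):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the packages in its value set.
# This mirrors "absl-py > jax > Optax > Flax" from the description;
# the real easyconfigs declare more dependencies than shown here.
deps = {
    "jax": {"absl-py"},
    "Optax": {"jax"},
    "Flax": {"Optax"},
}

# static_order() yields every node after all of its dependencies.
build_order = list(TopologicalSorter(deps).static_order())
print(build_order)
```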

Changelog:

  • move jax v0.4.25 from foss/2023a to gfbf/2023a following the changes introduced in {tools}[gfbf/2023a] jax v0.4.25 w/ CUDA 12.1.1 #20119
  • simplify easyconfig of jax v0.4.25:
    • replace the absl-py component with a regular dependency
    • define source_urls individually for each downloaded source
    • re-enable the test step in jax
  • move Optax v0.2.2 from foss/2023a to gfbf/2023a and remove redundant extensions already provided by its dependencies
  • add easyconfig for Flax v0.8.4
  • remove redundant extensions from scvi-tools already provided by its dependencies
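
The "remove redundant extensions" items boil down to filtering an extension list against what the dependencies already ship. A rough sketch of that idea, assuming extensions are `(name, version)` tuples; the helper name `drop_redundant_extensions` and the `some-ext` entry are hypothetical and not part of EasyBuild or of this PR:

```python
def drop_redundant_extensions(exts_list, provided_by_deps):
    """Return exts_list without extensions that a dependency already ships."""
    return [ext for ext in exts_list if ext[0] not in provided_by_deps]


# Illustrative extension list before the cleanup. flit_core 3.9.0 and
# absl-py 2.1.0 appear in this PR's diff; 'some-ext' is made up.
exts_list = [
    ("flit_core", "3.9.0"),
    ("absl-py", "2.1.0"),
    ("some-ext", "1.0"),
]

# Assumption: absl-py is now a standalone module reached via dependencies.
provided_by_deps = {"absl-py"}

cleaned = drop_redundant_extensions(exts_list, provided_by_deps)
print(cleaned)
```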

@lexming lexming added change and removed update labels Jul 24, 2024
@lexming
Contributor Author

lexming commented Jul 24, 2024

@boegelbot: please test @ generoso

@boegelbot
Collaborator

@lexming: Request for testing this PR well received on login1

PR test command 'EB_PR=21038 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_21038 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13950

Test results coming soon (I hope)...

- notification for comment with ID 2248003332 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@verdurin
Member

Test report by @verdurin
SUCCESS
Build succeeded for 35 out of 35 (5 easyconfigs in total)
easybuild-el8.cloud.in.bmrc.ox.ac.uk - Linux Rocky Linux 8.10, x86_64, Intel Xeon Processor (Skylake, IBRS), Python 3.6.8
See https://gist.github.com/verdurin/348867169b5e2ae44f1bafbd338210db for a full test report.

@boegel
Member

boegel commented Jul 31, 2024

@boegelbot: please test @ generoso

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=21038 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_21038 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13983

Test results coming soon (I hope)...

- notification for comment with ID 2260701967 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 5 (5 easyconfigs in total)
cns2 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/a7663c32e0cb6d94146a6bd1e123fe7d for a full test report.

@yqshao

This comment was marked as off-topic.

@lexming
Contributor Author

lexming commented Aug 2, 2024

@boegel Test in generoso failed due to a dangling lock

== 2024-07-31 14:51:30,027 build_log.py:171 ERROR EasyBuild crashed with an error (at easybuild/easybuild-framework/easybuild/base/exceptions.py:126 in __init__): Lock /project/boegelbot/Rocky8/haswell/software/.locks/_project_boegelbot_Rocky8_haswell_software_jax_0.4.25-gfbf-2023a.lock already exists, aborting! (at easybuild/easybuild-framework/easybuild/tools/filetools.py:2013 in check_lock)
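
For readers unfamiliar with this failure mode, here is a small sketch of how a lock left behind by a killed build blocks the next attempt. It is a simplified model and not EasyBuild's implementation: the helper names are hypothetical, and the install directory is an assumption inferred from the lock name in the log above.

```python
import os
import tempfile


def lock_name(install_dir):
    """Derive a lock name by replacing path separators with underscores."""
    return install_dir.replace("/", "_") + ".lock"


def acquire_lock(locks_dir, install_dir):
    """Create the lock directory, or fail if it already exists."""
    lock_path = os.path.join(locks_dir, lock_name(install_dir))
    try:
        os.mkdir(lock_path)  # fails if the lock is already present
    except FileExistsError:
        raise RuntimeError(f"Lock {lock_path} already exists, aborting!")
    return lock_path


# Assumed install directory, inferred from the lock name in the log.
install_dir = "/project/boegelbot/Rocky8/haswell/software/jax/0.4.25-gfbf-2023a"

with tempfile.TemporaryDirectory() as locks_dir:
    first = acquire_lock(locks_dir, install_dir)
    # A build that is killed never removes its lock, so a retry is refused.
    try:
        acquire_lock(locks_dir, install_dir)
        second_attempt = "acquired"
    except RuntimeError:
        second_attempt = "blocked"
    # Removing the stale lock by hand lets the next build proceed.
    os.rmdir(first)
    acquire_lock(locks_dir, install_dir)
    third_attempt = "acquired"

print(second_attempt, third_attempt)
```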

@lexming
Contributor Author

lexming commented Aug 5, 2024

Test report by @lexming
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
node605.hydra.os - Linux Rocky Linux 8.10, x86_64, AMD EPYC 9384X 32-Core Processor (x86_64_v4), Python 3.6.8
See https://gist.github.com/lexming/c1364a7156454ddc04b7be9b7b334e6a for a full test report.

('flit_core', '3.9.0', {
'checksums': ['72ad266176c4a3fcfab5f2930d76896059851240570ce9a98733b658cb786eba'],
}),
('absl-py', '2.1.0', {
Member

@lexming Why isn't absl-py required as a dependency if this is removed? Was that incorrect to begin with?

@boegel
Member

boegel commented Aug 29, 2024

> @boegel Test in generoso failed due to a dangling lock

I've cleaned up the lock, trying again...

@boegel
Member

boegel commented Aug 29, 2024

@boegelbot please test @ generoso
CORE_CNT=16

@boegel boegel modified the milestones: 4.x, release after 4.9.2 Aug 29, 2024
@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=21038 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_21038 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 14161

Test results coming soon (I hope)...

- notification for comment with ID 2317735713 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/bdc5d1859582bb73670f8bf85d55927b for a full test report.

@boegel
Member

boegel commented Aug 29, 2024

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=21038 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_21038 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 4776

Test results coming soon (I hope)...

- notification for comment with ID 2318254795 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Aug 29, 2024

Test report by @boegel
FAILED
Build succeeded for 4 out of 5 (5 easyconfigs in total)
node3105.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/boegel/734a90fdf5b586bb5168fa2c1d36f79b for a full test report.

@boegel
Member

boegel commented Aug 29, 2024

test report failed for me because the existing jax/0.4.25-foss-2023a was picked up, I'll remove that and try again...

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
jsczen3c2.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/979031fd1db07a37fd0ed8d106334fcb for a full test report.

@boegel
Member

boegel commented Aug 29, 2024

Test report by @boegel
FAILED
Build succeeded for 5 out of 6 (5 easyconfigs in total)
node3105.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/boegel/0e9cd00e712862648fb3277e8f7514a4 for a full test report.

@lexming
Contributor Author

lexming commented Sep 2, 2024

@boegel your test failed because your robot path has both jax-0.4.25-gfbf-2023a from this PR and jax-0.4.25-foss-2023a from your local file tree. So scvi-tools-1.1.2-foss-2023a picks jax-0.4.25-foss-2023a instead of using jax from this PR.
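
To illustrate the effect, here is a simplified model of first-match dependency resolution. It is a sketch under stated assumptions, not EasyBuild's resolver: the `resolve_dep` helper is hypothetical, and the search order (parent toolchain before subtoolchain) is an assumption chosen to reproduce the behaviour described above.

```python
def resolve_dep(name, version, toolchain_order, available):
    """Return the first matching easyconfig filename, or None."""
    for toolchain in toolchain_order:
        candidate = f"{name}-{version}-{toolchain}.eb"
        if candidate in available:
            return candidate
    return None


# Assumption: for a foss/2023a package, foss is tried before gfbf
# (simplified; the real toolchain hierarchy has more levels).
toolchain_order = ["foss-2023a", "gfbf-2023a"]

pr_only = {"jax-0.4.25-gfbf-2023a.eb"}
pr_plus_local = pr_only | {"jax-0.4.25-foss-2023a.eb"}

picked_pr_only = resolve_dep("jax", "0.4.25", toolchain_order, pr_only)
picked_with_local = resolve_dep("jax", "0.4.25", toolchain_order, pr_plus_local)
print(picked_pr_only)
print(picked_with_local)
```

With only the PR's easyconfigs visible, the gfbf variant is the sole match; once the local foss variant is also on the search path, it is found first.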

@boegel
Member

boegel commented Sep 4, 2024

> @boegel your test failed because your robot path has both jax-0.4.25-gfbf-2023a from this PR and jax-0.4.25-foss-2023a from your local file tree. So scvi-tools-1.1.2-foss-2023a picks jax-0.4.25-foss-2023a instead of using jax from this PR.

Indeed. I'm a bit confused why this is not a problem for the bots though (since they use the develop branch to test PRs with, which has jax-0.4.25-foss-2023a.eb still in there currently).

@branfosj
Member

branfosj commented Sep 4, 2024

> > @boegel your test failed because your robot path has both jax-0.4.25-gfbf-2023a from this PR and jax-0.4.25-foss-2023a from your local file tree. So scvi-tools-1.1.2-foss-2023a picks jax-0.4.25-foss-2023a instead of using jax from this PR.
>
> Indeed. I'm a bit confused why this is not a problem for the bots though (since they use the develop branch to test PRs with, which has jax-0.4.25-foss-2023a.eb still in there currently).

Because you have LMOD_DISABLE_NAME_AUTOSWAP set? The bots do not and will swap between the possible modules.

On generoso:

$ module use /project/boegelbot/Rocky8/haswell/modules/all
$ module load jax/0.4.25-foss-2023a
$ module load Flax/0.8.4-gfbf-2023a

The following have been reloaded with a version change:
  1) jax/0.4.25-foss-2023a => jax/0.4.25-gfbf-2023a

$ export LMOD_DISABLE_NAME_AUTOSWAP=yes
$ module load jax/0.4.25-foss-2023a
$ module load Flax/0.8.4-gfbf-2023a
Lmod has detected the following error:  Your site prevents the automatic swapping of modules with same name. You must explicitly unload the loaded version of "jax/0.4.25-foss-2023a" before you
can load the new one. Use swap to do this:

   $ module swap jax/0.4.25-foss-2023a jax/0.4.25-gfbf-2023a

Alternatively, you can set the environment variable LMOD_DISABLE_SAME_NAME_AUTOSWAP to "no" to re-enable same name autoswapping.

While processing the following module(s):
    Module fullname        Module Filename
    ---------------        ---------------
    Flax/0.8.4-gfbf-2023a  /project/boegelbot/Rocky8/haswell/modules/all/Flax/0.8.4-gfbf-2023a.lua
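
The two outcomes in the transcript above can be modelled as a "one version per module name" rule with a switch that turns the automatic swap into an error. This is only a toy model of the behaviour shown, not Lmod's implementation; the `module_load` function and its `disable_autoswap` parameter are hypothetical.

```python
def module_load(loaded, fullname, disable_autoswap=False):
    """Load 'name/version', allowing only one version per module name."""
    name = fullname.split("/", 1)[0]
    current = loaded.get(name)
    if current is not None and current != fullname and disable_autoswap:
        raise RuntimeError(
            f"must explicitly unload {current} before loading {fullname}"
        )
    # Either nothing was loaded under this name, or autoswap replaces it.
    loaded[name] = fullname
    return loaded


# Bots: autoswap allowed, so the gfbf variant silently replaces foss.
bot_env = module_load({}, "jax/0.4.25-foss-2023a")
module_load(bot_env, "jax/0.4.25-gfbf-2023a")

# Autoswap disabled: the second load is refused instead.
strict_env = module_load({}, "jax/0.4.25-foss-2023a", disable_autoswap=True)
try:
    module_load(strict_env, "jax/0.4.25-gfbf-2023a", disable_autoswap=True)
    strict_result = "swapped"
except RuntimeError:
    strict_result = "refused"

print(bot_env["jax"], strict_result)
```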

@boegel
Member

boegel commented Sep 4, 2024

> > > @boegel your test failed because your robot path has both jax-0.4.25-gfbf-2023a from this PR and jax-0.4.25-foss-2023a from your local file tree. So scvi-tools-1.1.2-foss-2023a picks jax-0.4.25-foss-2023a instead of using jax from this PR.
> >
> > Indeed. I'm a bit confused why this is not a problem for the bots though (since they use the develop branch to test PRs with, which has jax-0.4.25-foss-2023a.eb still in there currently).
>
> Because you have LMOD_DISABLE_NAME_AUTOSWAP set? The bots do not and will swap between the possible modules.

Exactly, I overlooked that part. You're 100% right.

I'll dance around this using --robot-paths /tmp/$USER and get a new test report uploaded.

@boegel
Member

boegel commented Sep 4, 2024

Test report by @boegel
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
node3123.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/boegel/f1721d4bc981b2dddd8cf6082f14bcc9 for a full test report.

@boegel boegel changed the title from "reorganize Flax/JAX stack in 2023a: scvi-tools, Flax, Optax, jax, absl-py" to "reorganize Flax/JAX stack in 2023a: move jax + Optax to gfbf/2023a toolchain + use standalone Flax + absl-py as dependencies" Sep 4, 2024
Member

@boegel boegel left a comment

lgtm

@boegel
Member

boegel commented Sep 4, 2024

Going in, thanks @lexming!

@boegel boegel merged commit 5093821 into easybuilders:develop Sep 4, 2024
9 checks passed
@lexming lexming deleted the 20240724121105_new_pr_absl-py210 branch September 4, 2024 14:43