
{bio}[foss/2023a] AlphaFold v2.3.2, dm-haiku v0.0.12, tensorstore v0.1.65 w/ CUDA v12.1.1 #19942

Merged

Conversation

ThomasHoffmann77
Contributor

@ThomasHoffmann77 ThomasHoffmann77 commented Feb 20, 2024

@ThomasHoffmann77 ThomasHoffmann77 changed the title {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.11, tensorstore v0.1.53 w/ CUDA v12.1.1 {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.11, tensorstore v0.1.53, OpenMM v8.0.0 HH-Suite v3.3.0 w/ CUDA v12.1.1 Feb 20, 2024
@ThomasHoffmann77 ThomasHoffmann77 changed the title {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.11, tensorstore v0.1.53, OpenMM v8.0.0 HH-Suite v3.3.0 w/ CUDA v12.1.1 {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.11, tensorstore v0.1.53, OpenMM v8.0.0, dm-tree v0.1.8, HH-Suite v3.3.0 w/ CUDA v12.1.1 Feb 20, 2024
@jfgrimm jfgrimm added this to the 4.x milestone Feb 22, 2024
@easybuilders easybuilders deleted a comment from boegelbot Feb 22, 2024
@ThomasHoffmann77 ThomasHoffmann77 changed the title {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.11, tensorstore v0.1.53, OpenMM v8.0.0, dm-tree v0.1.8, HH-Suite v3.3.0 w/ CUDA v12.1.1 {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.11, tensorstore v0.1.53, OpenMM v8.0.0, HH-Suite v3.3.0 w/ CUDA v12.1.1 Feb 28, 2024
@migueldiascosta
Member

fwiw, I'm getting error: HWCAP_NEON was not declared in this scope when building OpenMM from this PR on NVIDIA Grace-Hopper

it looks like openmm-8.0.0/cmake_modules/TargetArch.cmake is detecting Grace-Hopper as arm instead of armv8,

which then leads to openmm-8.0.0/CMakeLists.txt setting -D__ARM__=1 instead of -D__ARM64__=1,

which in turn leads openmm-8.0.0/openmmapi/include/openmm/internal/vectorize_neon.h to use HWCAP_NEON instead of HWCAP_ASIMD

forcing TARGET_ARCH to be armv8 in openmm-8.0.0/CMakeLists.txt fixed the issue for me
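
A minimal, purely illustrative sketch of that workaround (not the fix applied in this PR), assuming the detection call in OpenMM's CMakeLists.txt is target_architecture(TARGET_ARCH):

```python
# Purely illustrative sketch of the workaround above (not the change made
# in this PR): force TARGET_ARCH to "armv8" in the OpenMM 8.0.0 sources
# before running CMake, so that CMakeLists.txt defines -D__ARM64__=1 and
# vectorize_neon.h uses HWCAP_ASIMD instead of HWCAP_NEON.
from pathlib import Path

cmakelists = Path("openmm-8.0.0/CMakeLists.txt")
text = cmakelists.read_text()

# Assumption: the architecture detection call looks like this; adjust if
# the actual CMakeLists.txt differs.
detect = "target_architecture(TARGET_ARCH)"
override = 'set(TARGET_ARCH "armv8")  # Grace-Hopper gets misdetected as "arm"'

if detect in text:
    cmakelists.write_text(text.replace(detect, override, 1))
else:
    print("detection call not found; CMakeLists.txt left untouched")
```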

@ThomasHoffmann77
Contributor Author

> fwiw, I'm getting error: HWCAP_NEON was not declared in this scope when building OpenMM from this PR on NVIDIA Grace-Hopper
>
> it looks like openmm-8.0.0/cmake_modules/TargetArch.cmake is detecting Grace-Hopper as arm instead of armv8,
>
> which then leads to openmm-8.0.0/CMakeLists.txt setting -D__ARM__=1 instead of -D__ARM64__=1,
>
> which in turn leads openmm-8.0.0/openmmapi/include/openmm/internal/vectorize_neon.h to use HWCAP_NEON instead of HWCAP_ASIMD
>
> forcing TARGET_ARCH to be armv8 in openmm-8.0.0/CMakeLists.txt fixed the issue for me

#18911

@ThomasHoffmann77 ThomasHoffmann77 changed the title {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.11, tensorstore v0.1.53, OpenMM v8.0.0, HH-Suite v3.3.0 w/ CUDA v12.1.1 {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.11, tensorstore v0.1.53, HH-Suite v3.3.0 w/ CUDA v12.1.1 Mar 4, 2024
@VRehnberg
Contributor

FYI, I've got a draft (#20421) that might become relevant for this one as well. Perhaps you'll have opinions :).

Contributor

@VRehnberg VRehnberg left a comment


Thanks for adding this.

So typical use would be(?):

  1. CPU-only job to get features (GPU not detected)
  2. Job array on GPUs to run predictions (features.pkl found and --only-model-pred="${SLURM_ARRAY_TASK_ID}")
  3. Single job with GPU to run relaxation (possibly in parallel, [How is this launched, or is it not run separately???])

@ThomasHoffmann77
Contributor Author

ThomasHoffmann77 commented May 21, 2024

> Thanks for adding this.
>
> So typical use would be(?):
>
> 1. CPU-only job to get features (GPU not detected)
> 2. Job array on GPUs to run predictions (features.pkl found and --only-model-pred="${SLURM_ARRAY_TASK_ID}")

Yes, for monomer jobs.
For multimer jobs, you need to translate the array ID to a pair X,Y, with X in [1..5] and Y in [0..4] (if you run with --num_multimer_predictions_per_model=5).
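
A minimal sketch of that translation (assuming X is the model index, Y the prediction index, and that predictions vary fastest within a model; the flag syntax the patched run script expects is not shown here):

```python
# Hypothetical helper: map a Slurm array task ID (0..24) onto the pair
# X,Y described above, assuming 5 models x 5 predictions per model
# (--num_multimer_predictions_per_model=5) and prediction-fastest order.
import os

NUM_PREDS_PER_MODEL = 5

task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])    # 0..24
model = task_id // NUM_PREDS_PER_MODEL + 1          # X in [1..5]
prediction = task_id % NUM_PREDS_PER_MODEL          # Y in [0..4]

print(f"array task {task_id} -> model {model}, prediction {prediction}")
```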

> 3. Single job with GPU to run relaxation (possibly in parallel, [How is this launched, or is it not run separately???])

To get the ranking, you can run a quick CPU job after the predictions have finished.
The default for --models_to_relax is changed from best to none, so the pipeline stops after the predictions.
You can resume with the relaxation by restarting with --models_to_relax=all (or best).

@ThomasHoffmann77 ThomasHoffmann77 force-pushed the 20240220124705_new_pr_AlphaFold232 branch from 0ebfd8f to e7367ff July 24, 2024 10:04
@ThomasHoffmann77
Contributor Author

accidentally closed

@ThomasHoffmann77 ThomasHoffmann77 marked this pull request as draft October 11, 2024 06:53
@ThomasHoffmann77 ThomasHoffmann77 changed the title {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.11, tensorstore v0.1.53, HH-Suite v3.3.0 w/ CUDA v12.1.1 {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.12, tensorstore v0.1.65, HH-Suite v3.3.0 w/ CUDA v12.1.1 Oct 11, 2024
@ThomasHoffmann77 ThomasHoffmann77 marked this pull request as ready for review October 11, 2024 14:10
@ThomasHoffmann77 ThomasHoffmann77 changed the title {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, Kalign v3.4.0, dm-haiku v0.0.12, tensorstore v0.1.65, HH-Suite v3.3.0 w/ CUDA v12.1.1 {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, dm-haiku v0.0.12, tensorstore v0.1.65 w/ CUDA v12.1.1 Oct 11, 2024
@ThomasHoffmann77 ThomasHoffmann77 changed the title {bio}[foss/2023a,GCCcore/12.3.0] AlphaFold v2.3.2, dm-haiku v0.0.12, tensorstore v0.1.65 w/ CUDA v12.1.1 {bio}[foss/2023a] AlphaFold v2.3.2, dm-haiku v0.0.12, tensorstore v0.1.65 w/ CUDA v12.1.1 Oct 11, 2024
@boegel boegel dismissed akesandgren’s stale review October 11, 2024 18:02

requested changes done

Member

@boegel boegel left a comment


lgtm

@boegel
Member

boegel commented Oct 11, 2024

@boegelbot please test @ jsc-zen3-a100

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=19942 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_19942 --ntasks=8 --partition=jsczen3g --gres=gpu:1 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 5066

Test results coming soon (I hope)...

- notification for comment with ID 2407894839 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Oct 11, 2024

Test report by @boegel
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
node3901.accelgor.os - Linux RHEL 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA A100-SXM4-80GB, 545.23.08, Python 3.6.8
See https://gist.github.com/boegel/cb450cdab5ed9c44eb1dd6e80e9541d2 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 4 out of 4 (3 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.4, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 555.42.06, Python 3.9.18
See https://gist.github.com/boegelbot/c87f706d4ad650a5227d1ab0da8297d1 for a full test report.

@boegel boegel modified the milestones: 4.x, release after 4.9.4 Oct 11, 2024
@boegel
Member

boegel commented Oct 11, 2024

Going in, thanks @ThomasHoffmann77!

@boegel boegel merged commit 96515d3 into easybuilders:develop Oct 11, 2024
9 checks passed