Extensible configuration of the superbuild location (#2426)
* Updated the CI to use a variant of the superbuild CMake scripts.  The third-party external dependencies should be installed out-of-band using the scripts in the superbuild CI directory.  The DiHydrogen, Hydrogen, and Aluminum dependencies are installed as part of the CI run.  Note that support for more extensible Spack-based builds is included but disabled.

----

* Adding some flexibility in the customized_build_env script to make the
location of the external superbuild dependencies easily relocatable.

* Adding code to explicitly get the hostname for the superbuild configuration.

* Updated to the latest ROCm versions.

* Added some env variables for RCCL

* Add spack type for mi300a

* Only include the external CUDA libraries on cuda systems.

* Fixed the external modules for cray-mpich.

* Ensure that the CMAKE_PREFIX_PATH is captured in the superbuild suggested prefix path.  Fixed bug where the forwarded CMAKE_PREFIX_PATH was overwritten when a package depended on other packages.

* Automatically output the suggested cmake prefix path to the install directory.

* Forwarded the CMAKE_PREFIX_PATH to the LBANN build.

* Added a flag to the build_lbann.sh script to specify a directory of superbuilt external libraries.

* Added the superbuild-prefix to the Pascal CI pipeline.

* Disable caliper and force gcc@11.2.1

* Switch back to using the system specific spack.

* Force the use of normal zlib

* Split the superbuild scripts into core dependencies and DHA dependencies.

* Added superbuild script for DHA with half.

* Updated the build scripts to allow for specific DHA compiled versions.

* Reenabled half on pascal CI test.

* Allow for newer gcc compilers.

* Updated all of the pascal CI scripts to use the new stable dependencies.

* Updating the Tioga scripts to use the superbuild.

* Fixed the sense of the shared variant on protobuf.

* Updated the AMD ROCm stack to 6.1.2

* Adding path for external HWLOC in superbuild stable dependencies.  Added code to export the CRAY_LD_LIBRARY_PATH.

* Add aws-ofi-rccl to the superbuild externals.

* Fix how the CMAKE_PREFIX_PATH is forwarded to DHA libraries.

* Updating the Tioga superbuild scripts to force the runpaths to be properly set.

* Updating the Pascal superbuild scripts to force the runpaths to be properly set.

* Added CMake flags to enable shared library builds.

* Added a path to cuTensor for x86_64 platforms.

* Added a path to the correct miopen.

* Mark the new MIOpen as develop.

* Disable the superbuild on Corona and Lassen

* Fixed the install path.

* Add some logic to clean up the initial CMAKE_INSTALL_RPATH

The path auto-generated by Spack may not be ideal.

* Remove system paths from build rpath

* Fixed how the CMake environment sets up the PYTHONPATH and caches it
in the lbann_pfe.sh and module files.

Added hints to the superbuild of where to install necessary Python
packages.

* Revert back to ROCm 5.7.1

* Updated the superbuild scripts to use LLD and Gold linkers as
appropriate.  Made the Tioga superbuild scripts easier to change to
new ROCm versions.

* Removing custom MIOpen build.

* Added the build modules to the LBANN_DEPENDENT_MODULES so that they are loaded at runtime since the RPATH and RUNPATH aren't capturing certain Cray packages.

* Fixed how the LBANN_DEPENDENT_MODULES are composed.

* Temporarily reduce the time for Tioga jobs

* Try a different set of modules for Tioga.

* Fixed grouping on link flags.  Fixed RPATH issues for build and install objects.

* Increasing the precision of the reported error for check metric.

* Force the installation of pip packages in the installed location to
avoid bad system install.

* Correctly set the --force-reinstall flag on the pip command.

* Correcting the nightly time limit.

* Set the CXX and CUDA flags to an optimized build.

* Updated the Tioga builds to include the PE_ENV field in the stable dependencies pathname.

* Updated the build path so that the source files can be saved for debugging.

* Updated the build path so that the source files can be saved for debugging on pascal.

* Removed the pip force-reinstall

* Fixed pascal build path.

* Fixed the quotes around the linker flags.

* Do not use gold linker for core dependencies because protobuf fails.

* Updated the version of half to 2.2.0

* Did not set the loaded modules in the LBANN module file.

* Include ROCM_PATH/lib to RPATH.  Switch Pascal back to gcc/10.3.1.

* Switch Pascal CI to using Clang 14. Added compiler into the CI
superbuild external paths.

* Fixed compiler paths and typos.

* Fixed typo.

* Commented out unused variable.

* Log file for superbuild shell script is now defined in the environment rather than passed as an argument.

* Fixed the extra RPATH on cray.

* Switched back to half v2.1.0.  Added logging for the modules used to
build the superbuild.

* Fixing the extra RPATHs field to handle multiple entries.

* Add an updated time limit for the reconstruction loss unit test.

* Add EnsureComm calls to truncation selection algo

* Use a vertical | to avoid issues propagating ;.

* Constrain version of NumPy to 1.22.3

* Removed the -O2 optimization flags from the pascal and tioga
environments because it will be set by the CMake build type.  Added a
superbuild package for hipTT.

* Added superbuild scripts for Corona.  Added hipTT to build_lbann.sh
build script set.  Updated Corona to 5.7.1.

Re-enabled the Corona CI builds.

* Moved the definition of the external hiptt to a ROCm only section.

* Update Corona to ROCm 6.0.2

* Changed the Corona externals to use variable for ROCm version.

* Exporting the shell variable.

* Moved when the ROCm version is defined.

* Back to 6.0.2

* Trying a unified single pipeline for Pascal CI.

* Working on updating the CI builds to use a more direct script setup.

* Added configure scripts for LBANN and a script to run the unit and integration tests.

* Cleaning up the CI scripts.

* Added GitLab CI yaml files.

* Lowered the git depth.

* Fix the submodule strategy.

* Fixed the CI tests to use 2 nodes.  Better error handling.

* Fixed the name of the test result files so that they would be picked up by CI.

* Added a test pascal pipeline.

* Fixed how the DistConv flag is propagated.

* Added external flags for building with HALF and FFT support.  Limited the distconv builds to only run the right tests.

* Cleaning up code.

* Added distconv pascal test.

* Fix the status capture.

* Fixed logic bug in bash.

* Fixed the include path to Half and disabled FFT

* Fixed the failed test reporting and that distconv and half don't play together.

* Extend the mpi catch tests time limit.

* Added optimization flags for DHA

* Added Corona to new CI.

* Added config for Lassen.

* Fixed how the lapack argument is passed to Hydrogen

* Fixed flag for LBANN BLA.

* Added scripts to install core dependencies for lassen.

* Added Lassen CI.

* Adding in some help for extra rpaths.

* Force LBANN to RPATH DHA libraries inside of the project.

* Improve the reporting of the MPI catch tests.  Consolidated all of the
MPI catch tests to a single execution.  Avoid logging unit and catch
testing outputs to console.

* Updated Lassen to use a newer python.  Tweaking how rpath's are set.

* Fixed quoting on RPATH

* Fixed the path for the catch tests.

* Fixed up a few shell details to make switching PEs simpler.

* Building for Mi300A as well as 250.

* Stop hardcoding the CRAY_MPICH_VERSION

* Added the ability to export the AWS_OFI_RCCL plugin to the
LD_LIBRARY_PATH when using the lbann_pfe.sh shell script.

* Tweak the Tioga build environment.

* Work on building the dependencies on PrgEnv-cray.

* Fixed accidental debugging code.

* Added DiHydrogen cache check.  Only add Half prefix path when asked for.

* Add the hash for H2.

* Ensure that for AMD/HIP/ROCm systems all three fields GPU_TARGETS,
AMDGPU_TARGETS, and CMAKE_HIP_ARCHITECTURES are set.

* Disable FFT on Lassen

* Disable installing torch.

* Disable FFT on lassen right now.

* Set proper AMD architectures.

* Use a special PR for 6.2.0

* Explicitly turned on the half feature, which is not properly disabled when not set.

* When not using a flag, set it to a NULL string, not 0.

* Reporting the state of the build script DHA features.

* Set flag to ON not 1

* Fix when local 6.2.0 MIOpen library is linked in.

* Auto-detect the CUDA version and compiler version.

* Working to consolidate how the core dependencies are built to use the
same setup file as the CI runs.  Fixed the build issues for CI on
corona.  Removed scripts for building DHA and LBANN manually (outside
of CI).

* Cleaning up Power and HIP specific flags.

* Added support for creating a Python virtual environment in the CI stack.

Improved the core dependencies for Power.

* Removed older core platform specific dependency scripts.

* Update python/lbann/contrib/lc/launcher.py

Co-authored-by: Tom Benson <benson31@llnl.gov>

* Add pytest to the venv.  Cleaned up.

* Added code to build OpenBLAS on Power and then install standard libraries via PIP in the stable dependencies.

* Only create the virtual environment if it doesn't exist.

* Changed to installing all of the PIP installs in the virtual env directory.

* Apply suggestions from code review

Co-authored-by: Tom Benson <benson31@llnl.gov>

* Renamed variable AWS_OFI_RCCL_LIBRARY to AWS_OFI_RCCL_LIBDIR.

* Gather the build logs for the DHA dependencies and keep them as artifacts.

* Added some cmake logic to capture the path to the python venv used during configuration.

* Removed bad debug statement.

* If a python virtual environment was defined and used during the build
time, the Lua module file will now activate it when loaded. Removed
the TCL module file since it wasn't being used by systems.

Added a prompt name to the python venv.

Fixed an empty variable field in the Lassen gitlab code that deleted
other variables.

* Trying to fix a bug where lbann_pfe.sh isn't found after loading the
module.

* Temporarily remove the lua code to activate the virtual environment.

* Disabled always rebuilding the dependencies.  Added a check to
deactivate an active environment before loading the LBANN module.

* Updated the Tioga tests to use ROCm 6.2.1beta1 and craycc.

* Rewound the Tioga ROCm versions.
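
Two of the items above concern CMake list handling: the forwarded CMAKE_PREFIX_PATH was being overwritten rather than extended, and `|` replaced `;` as the separator when forwarding. The following is a minimal shell sketch of both ideas. The function names `append_prefix_path` and `encode_cmake_list` are hypothetical and are not taken from the LBANN scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of the LBANN scripts.

# Append an entry to an existing ';'-separated CMake list instead of
# overwriting it. An empty list yields just the new entry.
append_prefix_path() {
    local current="$1" extra="$2"
    if [ -z "$current" ]; then
        printf '%s\n' "$extra"
    else
        printf '%s\n' "${current};${extra}"
    fi
}

# Swap ';' for '|' so the list is not split when it passes through an
# intermediate command line.
encode_cmake_list() {
    local list="$1"
    printf '%s\n' "${list//;/|}"
}

# Example: extend a prefix path, then encode it for forwarding.
paths="$(append_prefix_path '/opt/deps/core' '/opt/deps/dha')"
encode_cmake_list "$paths"
```

On the CMake side, `ExternalProject_Add` accepts a `LIST_SEPARATOR` argument that turns the chosen character back into `;`. Whether the superbuild relies on exactly that mechanism is an assumption here.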

---------

Co-authored-by: Tom Benson <benson31@llnl.gov>
bvanessen and benson31 authored Sep 3, 2024
1 parent 6adcc45 commit ead9cce
Showing 39 changed files with 1,843 additions and 310 deletions.
105 changes: 18 additions & 87 deletions .gitlab-ci.yml
@@ -28,110 +28,41 @@
 # clusters. To run testing locally, consult the README in the ci_test
 # directory.

-variables:
-  FF_USE_NEW_BASH_EVAL_STRATEGY: 'true'
-  FF_ENABLE_BASH_EXIT_CODE_CHECK: 1
-  LBANN_CI_CLEAN_BUILD: 'true'
+include:
+  - project: 'lc-templates/id_tokens'
+    file: 'id_tokens.yml'

 stages:
   - run-all-clusters

-corona testing:
-  stage: run-all-clusters
-  variables:
-    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
-    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
-  trigger:
-    strategy: depend
-    include: .gitlab/corona/pipeline.yml
-
-corona distconv testing:
-  stage: run-all-clusters
-  variables:
-    JOB_NAME_SUFFIX: _distconv
-    SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
-    SPACK_SPECS: "+rocm +distconv"
-    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
-    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
-    TEST_FLAG: "test_*_distconv.py"
-  trigger:
-    strategy: depend
-    include: .gitlab/corona/pipeline.yml
-
-lassen testing:
-  stage: run-all-clusters
-  variables:
-    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
-    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
-  trigger:
-    strategy: depend
-    include: .gitlab/lassen/pipeline.yml
-
-lassen distconv testing:
+tioga testing:
   stage: run-all-clusters
-  variables:
-    JOB_NAME_SUFFIX: _distconv
-    SPACK_ENV_BASE_NAME_MODIFIER: "-multi-stage-distconv"
-    SPACK_SPECS: "+cuda +distconv +fft"
-    # SPACK_SPECS: "+cuda +distconv +nvshmem +fft"
-    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
-    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
-    TEST_FLAG: "test_*_distconv.py"
   trigger:
     strategy: depend
-    include: .gitlab/lassen/multi_stage_pipeline.yml
+    include: '.gitlab/build-and-test-tioga.yml'
+    forward:
+      pipeline_variables: true

 pascal testing:
   stage: run-all-clusters
-  variables:
-    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
-    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
   trigger:
     strategy: depend
-    include: .gitlab/pascal/pipeline.yml
+    include: '.gitlab/build-and-test-pascal.yml'
+    forward:
+      pipeline_variables: true

-pascal compiler testing:
-  stage: run-all-clusters
-  variables:
-    SPACK_SPECS: "%gcc@10.3.1 +cuda +half +fft"
-    BUILD_SCRIPT_OPTIONS: "--no-default-mirrors"
-    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
-  trigger:
-    strategy: depend
-    include: .gitlab/pascal/pipeline_compiler_tests.yml
-
-pascal distconv testing:
-  stage: run-all-clusters
-  variables:
-    JOB_NAME_SUFFIX: _distconv
-    SPACK_SPECS: "%gcc@10.3.1 +cuda +distconv +fft"
-    BUILD_SCRIPT_OPTIONS: "--no-default-mirrors"
-    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
-    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
-    TEST_FLAG: "test_*_distconv.py"
-  trigger:
-    strategy: depend
-    include: .gitlab/pascal/pipeline.yml
-
-tioga testing:
+corona testing:
   stage: run-all-clusters
-  variables:
-    # FF_USE_NEW_BASH_EVAL_STRATEGY: 1
-    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
-    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
   trigger:
     strategy: depend
-    include: .gitlab/tioga/pipeline.yml
+    include: '.gitlab/build-and-test-corona.yml'
+    forward:
+      pipeline_variables: true

-tioga distconv testing:
+lassen testing:
   stage: run-all-clusters
-  variables:
-    JOB_NAME_SUFFIX: _distconv
-    SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
-    SPACK_SPECS: "+rocm +distconv"
-    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
-    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
-    TEST_FLAG: "test_*_distconv.py"
   trigger:
     strategy: depend
-    include: .gitlab/tioga/pipeline.yml
+    include: '.gitlab/build-and-test-lassen.yml'
+    forward:
+      pipeline_variables: true
56 changes: 56 additions & 0 deletions .gitlab/build-and-test-common.yml
@@ -0,0 +1,56 @@
################################################################################
## Copyright (c) 2014-2024, Lawrence Livermore National Security, LLC.
## Produced at the Lawrence Livermore National Laboratory.
## Written by the LBANN Research Team (B. Van Essen, et al.) listed in
## the CONTRIBUTORS file. <lbann-dev@llnl.gov>
##
## LLNL-CODE-697807.
## All rights reserved.
##
## This file is part of LBANN: Livermore Big Artificial Neural Network
## Toolkit. For details, see http://software.llnl.gov/LBANN or
## https://github.com/LLNL/LBANN.
##
## Licensed under the Apache License, Version 2.0 (the "Licensee"); you
## may not use this file except in compliance with the License. You may
## obtain a copy of the License at:
##
## http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
## implied. See the License for the specific language governing
## permissions and limitations under the license.
################################################################################

.build-and-test-base:
  variables:
    LLNL_SERVICE_USER: lbannusr
    LLNL_SLURM_SCHEDULER_PARAMETERS: "-N2 -t 90"
    LLNL_FLUX_SCHEDULER_PARAMETERS: "-N2 -t 120m"
    LLNL_LSF_SCHEDULER_PARAMETERS: "-q pbatch -nnodes 2 -W 60"
    GIT_SUBMODULE_STRATEGY: none
    GIT_DEPTH: 5
  script:
    - printenv > ${CI_PROJECT_DIR}/ci_environment.log
    - ${CI_PROJECT_DIR}/.gitlab/build-and-test.sh
  cache:
    key: $CI_JOB_NAME_SLUG
    paths:
      - install-deps-${CI_JOB_NAME_SLUG}
  timeout: 6h

.build-and-test:
  artifacts:
    when: always
    paths:
      - "${CI_PROJECT_DIR}/*junit.*xml"
      - "${CI_PROJECT_DIR}/ci_environment.log"
      - "${CI_PROJECT_DIR}/build-${CI_JOB_ID}/build-lbann/build.ninja"
      - "${CI_PROJECT_DIR}/build-${CI_JOB_ID}/build-lbann/CMakeFiles/rules.ninja"
      - "${CI_PROJECT_DIR}/build-${CI_JOB_ID}/build-deps/all_build_files.tar.gz"
      - "${CI_PROJECT_DIR}/build-${CI_JOB_ID}/build-deps/all_output_logs.tar.gz"
    reports:
      junit: "${CI_PROJECT_DIR}/*junit.*xml"
  extends: .build-and-test-base
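
The cache entry in this file keeps `install-deps-${CI_JOB_NAME_SLUG}` between runs, which fits the commit-message item about no longer always rebuilding the dependencies. `build-and-test.sh` is not shown in this view, so the following is only a sketch of how such a check could look. `deps_need_build` is a hypothetical name, and the fallback slug `local` is an assumption for running outside CI.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not taken from build-and-test.sh.
# Succeeds (exit 0) when the dependency prefix is missing or empty,
# i.e. when the dependencies still have to be built.
deps_need_build() {
    local prefix="$1"
    [ ! -d "$prefix" ] || [ -z "$(ls -A "$prefix" 2>/dev/null)" ]
}

# Example use with the cache path named in the job definition.
deps_dir="install-deps-${CI_JOB_NAME_SLUG:-local}"
if deps_need_build "$deps_dir"; then
    echo "building dependencies into $deps_dir"
else
    echo "reusing cached dependencies in $deps_dir"
fi
```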
54 changes: 54 additions & 0 deletions .gitlab/build-and-test-corona.yml
@@ -0,0 +1,54 @@
################################################################################
## Copyright (c) 2014-2024, Lawrence Livermore National Security, LLC.
## Produced at the Lawrence Livermore National Laboratory.
## Written by the LBANN Research Team (B. Van Essen, et al.) listed in
## the CONTRIBUTORS file. <lbann-dev@llnl.gov>
##
## LLNL-CODE-697807.
## All rights reserved.
##
## This file is part of LBANN: Livermore Big Artificial Neural Network
## Toolkit. For details, see http://software.llnl.gov/LBANN or
## https://github.com/LLNL/LBANN.
##
## Licensed under the Apache License, Version 2.0 (the "Licensee"); you
## may not use this file except in compliance with the License. You may
## obtain a copy of the License at:
##
## http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
## implied. See the License for the specific language governing
## permissions and limitations under the license.
################################################################################

default:
  id_tokens:
    SITE_ID_TOKEN:
      aud: https://lc.llnl.gov/gitlab

stages:
  - build

include:
  local: "/.gitlab/build-and-test-common.yml"

rocm-5-7-1-corona:
  variables:
    COMPILER_FAMILY: amdclang
    MODULES: "rocm/5.7.1 clang/14.0.6-magic openmpi/4.1.2"
  extends: .build-and-test-on-corona

rocm-5-7-1-distconv-corona:
  variables:
    COMPILER_FAMILY: amdclang
    MODULES: "rocm/5.7.1 clang/14.0.6-magic openmpi/4.1.2"
    WITH_DISTCONV: "ON"
  extends: .build-and-test-on-corona

.build-and-test-on-corona:
  stage: build
  tags: [corona, batch]
  extends: .build-and-test
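
The jobs in this file set `WITH_DISTCONV: "ON"` or leave it unset, in line with the commit-message items "Set flag to ON not 1" and "When not using a flag, set it to a NULL string, not 0". A minimal sketch of how a build script might turn such a variable into a CMake argument follows. The helper `cmake_feature_flag` is hypothetical, and the option name `LBANN_WITH_DISTCONV` is used as an assumed example rather than read from the scripts in this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not taken from the LBANN scripts.
# Prints a -D<name>=ON argument only when the value is exactly "ON".
# An unset or empty value prints nothing, so the feature stays at its default.
cmake_feature_flag() {
    local name="$1" value="${2:-}"
    if [ "$value" = "ON" ]; then
        printf -- '-D%s=ON\n' "$name"
    fi
}

# Example: WITH_DISTCONV comes from the CI job variables when it is set.
cmake_feature_flag LBANN_WITH_DISTCONV "${WITH_DISTCONV:-}"
```

Treating anything other than the literal `ON` as "not requested" avoids the ambiguity of values such as `0` or `1`, which a shell test for a non-empty string would both accept.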
55 changes: 55 additions & 0 deletions .gitlab/build-and-test-lassen.yml
@@ -0,0 +1,55 @@
################################################################################
## Copyright (c) 2014-2024, Lawrence Livermore National Security, LLC.
## Produced at the Lawrence Livermore National Laboratory.
## Written by the LBANN Research Team (B. Van Essen, et al.) listed in
## the CONTRIBUTORS file. <lbann-dev@llnl.gov>
##
## LLNL-CODE-697807.
## All rights reserved.
##
## This file is part of LBANN: Livermore Big Artificial Neural Network
## Toolkit. For details, see http://software.llnl.gov/LBANN or
## https://github.com/LLNL/LBANN.
##
## Licensed under the Apache License, Version 2.0 (the "Licensee"); you
## may not use this file except in compliance with the License. You may
## obtain a copy of the License at:
##
## http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
## implied. See the License for the specific language governing
## permissions and limitations under the license.
################################################################################

default:
  id_tokens:
    SITE_ID_TOKEN:
      aud: https://lc.llnl.gov/gitlab

stages:
  - build

include:
  local: "/.gitlab/build-and-test-common.yml"

# fftw/3.3.10-gcc-11.2.1
clang-16-0-6-gcc-11-2-1-cuda-12-2-2-lassen:
  variables:
    COMPILER_FAMILY: clang
    MODULES: "clang/16.0.6-gcc-11.2.1 spectrum-mpi/rolling-release cuda/12.2.2 cmake/3.29.2 python/3.11.5"
  extends: .build-and-test-on-lassen

clang-16-0-6-gcc-11-2-1-cuda-12-2-2-distconv-lassen:
  variables:
    COMPILER_FAMILY: clang
    MODULES: "clang/16.0.6-gcc-11.2.1 spectrum-mpi/rolling-release cuda/12.2.2 cmake/3.29.2 python/3.11.5"
    WITH_DISTCONV: "ON"
  extends: .build-and-test-on-lassen

.build-and-test-on-lassen:
  stage: build
  tags: [lassen, batch]
  extends: .build-and-test
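
The `MODULES` strings in this file carry the compiler and CUDA versions as `name/version` pairs, and the commit message mentions auto-detecting both. One possible way to read a version out of such a string is sketched here. `module_version` is a hypothetical helper, and the real scripts may instead query the loaded compiler or `nvcc` directly.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not taken from the LBANN scripts.
# Prints the version part of the first "<name>/<version>" entry whose name
# matches, and returns non-zero when no such module is listed.
module_version() {
    local wanted="$1" modules="$2" m
    # Intentionally unquoted: split the module list on whitespace.
    for m in $modules; do
        case "$m" in
            "$wanted"/*)
                printf '%s\n' "${m#*/}"
                return 0
                ;;
        esac
    done
    return 1
}

# Example with the module list from the Lassen jobs.
MODULES="clang/16.0.6-gcc-11.2.1 spectrum-mpi/rolling-release cuda/12.2.2 cmake/3.29.2 python/3.11.5"
module_version cuda "$MODULES"
```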
54 changes: 54 additions & 0 deletions .gitlab/build-and-test-pascal.yml
@@ -0,0 +1,54 @@
################################################################################
## Copyright (c) 2014-2024, Lawrence Livermore National Security, LLC.
## Produced at the Lawrence Livermore National Laboratory.
## Written by the LBANN Research Team (B. Van Essen, et al.) listed in
## the CONTRIBUTORS file. <lbann-dev@llnl.gov>
##
## LLNL-CODE-697807.
## All rights reserved.
##
## This file is part of LBANN: Livermore Big Artificial Neural Network
## Toolkit. For details, see http://software.llnl.gov/LBANN or
## https://github.com/LLNL/LBANN.
##
## Licensed under the Apache License, Version 2.0 (the "Licensee"); you
## may not use this file except in compliance with the License. You may
## obtain a copy of the License at:
##
## http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
## implied. See the License for the specific language governing
## permissions and limitations under the license.
################################################################################

default:
  id_tokens:
    SITE_ID_TOKEN:
      aud: https://lc.llnl.gov/gitlab

stages:
  - build

include:
  local: "/.gitlab/build-and-test-common.yml"

clang-14-0-6-cuda-11-8-0-pascal:
  variables:
    COMPILER_FAMILY: clang
    MODULES: "clang/14.0.6-magic openmpi/4.1.2 cuda/11.8.0 ninja/1.11.1"
    WITH_HALF: "ON"
  extends: [.build-and-test-on-pascal, .build-and-test]

clang-14-0-6-cuda-11-8-0-distconv-pascal:
  variables:
    COMPILER_FAMILY: clang
    MODULES: "clang/14.0.6-magic openmpi/4.1.2 cuda/11.8.0 ninja/1.11.1"
    WITH_DISTCONV: "ON"
  extends: [.build-and-test-on-pascal, .build-and-test]

.build-and-test-on-pascal:
  stage: build
  tags: [pascal, batch]