Using distconv and model-parallel for GAN #2447
I am trying to run a multi-GPU GAN similar to

Below I have put the log file for each run, the batch script used to run it (maybe I am missing some MPI flag?), and a short Python code snippet showing how I try to invoke model parallelism. (Happy to also provide the prototext if it is useful.)

One GPU:
Two GPUs:
Batch file used to run:

#!/bin/bash
export DISTCONV_WS_CAPACITY_FACTOR=0.8
export LBANN_DISTCONV_HALO_EXCHANGE=AL
export LBANN_DISTCONV_TENSOR_SHUFFLER=AL
export LBANN_DISTCONV_CONVOLUTION_FWD_ALGORITHM=AUTOTUNE
export LBANN_DISTCONV_CONVOLUTION_BWD_DATA_ALGORITHM=AUTOTUNE
export LBANN_DISTCONV_CONVOLUTION_BWD_FILTER_ALGORITHM=AUTOTUNE
export LBANN_DISTCONV_RANK_STRIDE=1
export LBANN_DISTCONV_NUM_IO_PARTITIONS=2
export LBANN_KEEP_ERROR_SIGNALS=1
echo "Started at $(date)"
mpiexec -n 2 --map-by ppr:2:node \
  -wdir /home/jwilliams/project/lbann_gan/stylegan2_withdiscriminator_weightDemodTal_modelparallel_filterLastLayer/20240507_114638_gan_HW256_nonsaturatingloss_n1_ppn2 \
  /home/jwilliams/lbann-builds/lbann-latest/build_appspack/install/bin/lbann \
  --use_data_store --preload_data_store \
  --prototext=/home/jwilliams/project/lbann_gan/stylegan2_withdiscriminator_weightDemodTal_modelparallel_filterLastLayer/20240507_114638_gan_HW256_nonsaturatingloss_n1_ppn2/experiment.prototext
status=$?
echo "Finished at $(date)"
exit ${status}

I am setting the parallel strategy in the following way:

environment = {}
parallel_strategy = None
procs_per_node = 1
if args.modelparallel:
    height_groups = 2
    procs_per_node = 2
    environment = get_distconv_environment(num_io_partitions=procs_per_node)
    parallel_strategy = lbann.core.util.get_parallel_strategy_args(
        height_groups=procs_per_node,
    )
# setup model
# (code omitted for brevity)...
# assign parallel strategy to each layer
layers = list(lbann.traverse_layer_graph(loss))
for l in layers:
    l.parallel_strategy = parallel_strategy
kwargs = {
    "nodes": 1,
    "procs_per_node": procs_per_node,
    "scheduler": "openmpi",
}
# run
lbann.run(
    trainer,
    model,
    data_reader,
    opt,
    job_name=job_name,
    environment=environment,
    lbann_args=lbann_args,
    **kwargs,
)
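The `environment` dictionary passed to `lbann.run` appears to be what ends up as the `export` lines in the batch file above. The following is only a minimal sketch of that mapping, assuming the helper does nothing more than mirror the nine variables seen in the batch script. The names `distconv_environment_sketch` and `to_export_lines` are hypothetical and are not part of LBANN's API.

```python
def distconv_environment_sketch(num_io_partitions=1):
    """Hypothetical stand-in for get_distconv_environment.

    Returns the same variables that the batch script above exports.
    The values are copied from that script, not from LBANN's source.
    """
    return {
        "DISTCONV_WS_CAPACITY_FACTOR": 0.8,
        "LBANN_DISTCONV_HALO_EXCHANGE": "AL",
        "LBANN_DISTCONV_TENSOR_SHUFFLER": "AL",
        "LBANN_DISTCONV_CONVOLUTION_FWD_ALGORITHM": "AUTOTUNE",
        "LBANN_DISTCONV_CONVOLUTION_BWD_DATA_ALGORITHM": "AUTOTUNE",
        "LBANN_DISTCONV_CONVOLUTION_BWD_FILTER_ALGORITHM": "AUTOTUNE",
        "LBANN_DISTCONV_RANK_STRIDE": 1,
        "LBANN_DISTCONV_NUM_IO_PARTITIONS": num_io_partitions,
        "LBANN_KEEP_ERROR_SIGNALS": 1,
    }


def to_export_lines(env):
    """Render a dict of environment variables as shell export lines."""
    return [f"export {name}={value}" for name, value in env.items()]


# Two processes on one node, as in the model-parallel run above.
env = distconv_environment_sketch(num_io_partitions=2)
export_lines = to_export_lines(env)
print("\n".join(export_lines))
```

Comparing output like this against the generated batch file is one quick way to confirm that the number of I/O partitions matches the process count given to `mpiexec -n`.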
Replies: 1 comment 7 replies
FYI, on this system we built the dependencies with spack:
and then compiled LBANN itself with CMake (there was an issue in building LBANN itself with Spack).
DistConv is part of DiHydrogen. In Spack, building dihydrogen+distconv should get you there -- though make sure you're up to date with LBANN@develop since I merged a PR yesterday addressing a versioning issue. If you build H2 directly with CMake, the option is -D H2_ENABLE_DISTCONV_LEGACY=ON.