
[BUG] restarting from checkpoint doesn't work with multiple GPUs #2712

Closed
luukasnik opened this issue Aug 2, 2023 · 2 comments · Fixed by #2716
Labels
bug, reproduced (This bug has been reproduced by developers)

Comments

@luukasnik

Bug summary

Restarting training from a checkpoint file doesn't work with multiple GPUs.

When running the command

srun dp train --mpi-log=workers input.json -r model.ckpt

I get the error log shown below. The command works without srun, but then training runs on only 1 GPU.
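For context, a minimal Slurm batch script that would produce this kind of launch (the job name, time limit, and GPU resource string are placeholders I've assumed, not taken from the original report; the srun line is the one from the report):

```shell
#!/bin/bash
#SBATCH --job-name=dp-restart        # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=4                   # one MPI rank per GPU
#SBATCH --gres=gpu:v100:4            # 4x V100, matching the log below
#SBATCH --time=01:00:00

# Launch DeePMD-kit training with one rank per GPU, restarting
# from the previous checkpoint (model.ckpt).
srun dp train --mpi-log=workers input.json -r model.ckpt
```

With `--ntasks=4` on a 4-GPU node, srun spawns four `dp train` processes, which matches the four ranks (rank:0 through rank:3) visible in the log.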

DeePMD-kit Version

2.2.2

TensorFlow Version

2.12

How did you download the software?

pip

Input Files, Running Commands, Error Log, etc.

The following modules were not unloaded:
(Use "module --force purge" to unload all):

  1. csc-tools
    NOTE: This module uses Apptainer (Singularity). Some commands execute inside
    the container (e.g. python3, pip3).

Currently Loaded Modules:

  1) csc-tools (S)   2) gcc/9.4.0   3) tensorflow/2.12

Where:
S: Module is Sticky, requires --force to unload or purge

2023-08-01 15:41:30.901704: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-08-01 15:41:30.901714: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-08-01 15:41:30.901797: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-08-01 15:41:30.901764: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-08-01 15:41:33.857100: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 15:41:33.857156: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 15:41:33.857189: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 15:41:33.857218: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 15:42:22.689050: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-01 15:42:22.689045: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-01 15:42:22.689087: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-01 15:42:22.689074: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD rank:0 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD rank:1 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD rank:3 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD rank:2 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
2023-08-01 15:43:32.927072: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-08-01 15:43:32.927064: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-08-01 15:43:32.927069: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-08-01 15:43:32.927104: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
DEEPMD rank:2 INFO training data with min nbor dist: 0.8505298046431992
DEEPMD rank:2 INFO training data with max nbor size: [8 8 2 1]
DEEPMD rank:2 INFO [DeePMD-kit ASCII-art banner, mangled by formatting; omitted]
DEEPMD rank:2 INFO Please read and cite:
DEEPMD rank:2 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD rank:2 INFO Zeng et al, arXiv:2304.09409
DEEPMD rank:2 INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD rank:2 INFO installed to: /project/skbuild/linux-x86_64-3.11/cmake-install
DEEPMD rank:2 INFO source : v2.2.2
DEEPMD rank:2 INFO source brach: HEAD
DEEPMD rank:2 INFO source commit: 92ca097
DEEPMD rank:2 INFO source commit at: 2023-05-24 13:45:03 +0800
DEEPMD rank:2 INFO build float prec: double
DEEPMD rank:2 INFO build variant: cuda
DEEPMD rank:2 INFO build with tf inc: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include;/tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include
DEEPMD rank:2 INFO build with tf lib: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/libtensorflow_cc.so.2
DEEPMD rank:2 INFO ---Summary of the training---------------------------------------
DEEPMD rank:2 INFO distributed
DEEPMD rank:2 INFO world size: 4
DEEPMD rank:2 INFO my rank: 2
DEEPMD rank:2 INFO node list: ['r02g01']
DEEPMD rank:2 INFO running on: r02g01
DEEPMD rank:2 INFO computing device: gpu:2
DEEPMD rank:2 INFO CUDA_VISIBLE_DEVICES: 0,1,2,3
DEEPMD rank:2 INFO Count of visible GPU: 4
DEEPMD rank:2 INFO num_intra_threads: 0
DEEPMD rank:2 INFO num_inter_threads: 0
DEEPMD rank:2 INFO -----------------------------------------------------------------
DEEPMD rank:2 INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD rank:2 INFO found 1 system(s):
DEEPMD rank:2 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:2 INFO -- vanillin_/10k/data/dipole/training_data 19 100 80 1.000 F
DEEPMD rank:2 INFO --------------------------------------------------------------------------------------
DEEPMD rank:3 INFO training data with min nbor dist: 0.8505298046431992
DEEPMD rank:3 INFO training data with max nbor size: [8 8 2 1]
DEEPMD rank:3 INFO [DeePMD-kit ASCII-art banner, mangled by formatting; omitted]
DEEPMD rank:3 INFO Please read and cite:
DEEPMD rank:3 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD rank:3 INFO Zeng et al, arXiv:2304.09409
DEEPMD rank:3 INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD rank:3 INFO installed to: /project/skbuild/linux-x86_64-3.11/cmake-install
DEEPMD rank:3 INFO source : v2.2.2
DEEPMD rank:3 INFO source brach: HEAD
DEEPMD rank:3 INFO source commit: 92ca097
DEEPMD rank:3 INFO source commit at: 2023-05-24 13:45:03 +0800
DEEPMD rank:3 INFO build float prec: double
DEEPMD rank:3 INFO build variant: cuda
DEEPMD rank:3 INFO build with tf inc: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include;/tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include
DEEPMD rank:3 INFO build with tf lib: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/libtensorflow_cc.so.2
DEEPMD rank:3 INFO ---Summary of the training---------------------------------------
DEEPMD rank:3 INFO distributed
DEEPMD rank:3 INFO world size: 4
DEEPMD rank:3 INFO my rank: 3
DEEPMD rank:3 INFO node list: ['r02g01']
DEEPMD rank:3 INFO running on: r02g01
DEEPMD rank:3 INFO computing device: gpu:3
DEEPMD rank:3 INFO CUDA_VISIBLE_DEVICES: 0,1,2,3
DEEPMD rank:3 INFO Count of visible GPU: 4
DEEPMD rank:3 INFO num_intra_threads: 0
DEEPMD rank:3 INFO num_inter_threads: 0
DEEPMD rank:3 INFO -----------------------------------------------------------------
DEEPMD rank:3 INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD rank:3 INFO found 1 system(s):
DEEPMD rank:3 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:3 INFO -- vanillin_/10k/data/dipole/training_data 19 100 80 1.000 F
DEEPMD rank:3 INFO --------------------------------------------------------------------------------------
DEEPMD rank:3 INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD rank:3 INFO found 1 system(s):
DEEPMD rank:3 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:3 INFO -- nillin_/10k/data/dipole/validation_data 19 500 4 1.000 F
DEEPMD rank:3 INFO --------------------------------------------------------------------------------------
DEEPMD rank:3 INFO training without frame parameter
DEEPMD rank:2 INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD rank:2 INFO found 1 system(s):
DEEPMD rank:2 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:2 INFO -- nillin_/10k/data/dipole/validation_data 19 500 4 1.000 F
DEEPMD rank:2 INFO --------------------------------------------------------------------------------------
DEEPMD rank:2 INFO training without frame parameter
DEEPMD rank:0 INFO training data with min nbor dist: 0.8505298046431992
DEEPMD rank:0 INFO training data with max nbor size: [8 8 2 1]
DEEPMD rank:1 INFO training data with min nbor dist: 0.8505298046431992
DEEPMD rank:1 INFO training data with max nbor size: [8 8 2 1]
DEEPMD rank:0 INFO [DeePMD-kit ASCII-art banner, mangled by formatting; omitted]
DEEPMD rank:0 INFO Please read and cite:
DEEPMD rank:0 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD rank:0 INFO Zeng et al, arXiv:2304.09409
DEEPMD rank:0 INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD rank:0 INFO installed to: /project/skbuild/linux-x86_64-3.11/cmake-install
DEEPMD rank:0 INFO source : v2.2.2
DEEPMD rank:0 INFO source brach: HEAD
DEEPMD rank:0 INFO source commit: 92ca097
DEEPMD rank:0 INFO source commit at: 2023-05-24 13:45:03 +0800
DEEPMD rank:0 INFO build float prec: double
DEEPMD rank:0 INFO build variant: cuda
DEEPMD rank:0 INFO build with tf inc: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include;/tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include
DEEPMD rank:0 INFO build with tf lib: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/libtensorflow_cc.so.2
DEEPMD rank:0 INFO ---Summary of the training---------------------------------------
DEEPMD rank:0 INFO distributed
DEEPMD rank:0 INFO world size: 4
DEEPMD rank:0 INFO my rank: 0
DEEPMD rank:0 INFO node list: ['r02g01']
DEEPMD rank:0 INFO running on: r02g01
DEEPMD rank:0 INFO computing device: gpu:0
DEEPMD rank:0 INFO CUDA_VISIBLE_DEVICES: 0,1,2,3
DEEPMD rank:0 INFO Count of visible GPU: 4
DEEPMD rank:0 INFO num_intra_threads: 0
DEEPMD rank:0 INFO num_inter_threads: 0
DEEPMD rank:0 INFO -----------------------------------------------------------------
DEEPMD rank:1 INFO [DeePMD-kit ASCII-art banner, mangled by formatting; omitted]
DEEPMD rank:1 INFO Please read and cite:
DEEPMD rank:1 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD rank:1 INFO Zeng et al, arXiv:2304.09409
DEEPMD rank:1 INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD rank:1 INFO installed to: /project/skbuild/linux-x86_64-3.11/cmake-install
DEEPMD rank:1 INFO source : v2.2.2
DEEPMD rank:1 INFO source brach: HEAD
DEEPMD rank:1 INFO source commit: 92ca097
DEEPMD rank:1 INFO source commit at: 2023-05-24 13:45:03 +0800
DEEPMD rank:1 INFO build float prec: double
DEEPMD rank:1 INFO build variant: cuda
DEEPMD rank:1 INFO build with tf inc: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include;/tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include
DEEPMD rank:1 INFO build with tf lib: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/libtensorflow_cc.so.2
DEEPMD rank:1 INFO ---Summary of the training---------------------------------------
DEEPMD rank:1 INFO distributed
DEEPMD rank:1 INFO world size: 4
DEEPMD rank:1 INFO my rank: 1
DEEPMD rank:1 INFO node list: ['r02g01']
DEEPMD rank:1 INFO running on: r02g01
DEEPMD rank:1 INFO computing device: gpu:1
DEEPMD rank:1 INFO CUDA_VISIBLE_DEVICES: 0,1,2,3
DEEPMD rank:1 INFO Count of visible GPU: 4
DEEPMD rank:1 INFO num_intra_threads: 0
DEEPMD rank:1 INFO num_inter_threads: 0
DEEPMD rank:1 INFO -----------------------------------------------------------------
DEEPMD rank:0 INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD rank:0 INFO found 1 system(s):
DEEPMD rank:0 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:0 INFO -- vanillin_/10k/data/dipole/training_data 19 100 80 1.000 F
DEEPMD rank:0 INFO --------------------------------------------------------------------------------------
DEEPMD rank:1 INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD rank:1 INFO found 1 system(s):
DEEPMD rank:1 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:1 INFO -- vanillin_/10k/data/dipole/training_data 19 100 80 1.000 F
DEEPMD rank:1 INFO --------------------------------------------------------------------------------------
DEEPMD rank:0 INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD rank:0 INFO found 1 system(s):
DEEPMD rank:0 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:0 INFO -- nillin_/10k/data/dipole/validation_data 19 500 4 1.000 F
DEEPMD rank:0 INFO --------------------------------------------------------------------------------------
DEEPMD rank:0 INFO training without frame parameter
DEEPMD rank:1 INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD rank:1 INFO found 1 system(s):
DEEPMD rank:1 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:1 INFO -- nillin_/10k/data/dipole/validation_data 19 500 4 1.000 F
DEEPMD rank:1 INFO --------------------------------------------------------------------------------------
DEEPMD rank:1 INFO training without frame parameter
2023-08-01 15:43:50.922270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2023-08-01 15:43:50.923407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 29357 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0
2023-08-01 15:43:50.924350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 29357 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
2023-08-01 15:43:50.925349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 29527 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
2023-08-01 15:43:50.944724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2023-08-01 15:43:50.945809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 29357 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0
2023-08-01 15:43:50.946740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 29357 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
2023-08-01 15:43:50.947576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 29501 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
2023-08-01 15:43:50.985007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2023-08-01 15:43:50.986080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 29357 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0
2023-08-01 15:43:50.987024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 29357 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
2023-08-01 15:43:50.987923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 29465 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
2023-08-01 15:43:51.227636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2023-08-01 15:43:51.228725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 29357 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0
2023-08-01 15:43:51.229691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 29357 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
2023-08-01 15:43:51.230606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 29357 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
DEEPMD rank:3 INFO built lr
DEEPMD rank:0 INFO built lr
DEEPMD rank:1 INFO built lr
DEEPMD rank:2 INFO built lr
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
DEEPMD rank:3 INFO built network
DEEPMD rank:3 INFO Scale learning rate by coef: 4.000000
DEEPMD rank:0 INFO built network
DEEPMD rank:0 INFO Scale learning rate by coef: 4.000000
DEEPMD rank:1 INFO built network
DEEPMD rank:1 INFO Scale learning rate by coef: 4.000000
DEEPMD rank:2 INFO built network
DEEPMD rank:2 INFO Scale learning rate by coef: 4.000000
DEEPMD rank:3 INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-08-01 15:43:57.005045: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (3 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
2023-08-01 15:43:57.005070: E tensorflow/c/c_api.cc:2218] ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (3 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
Traceback (most recent call last):
File "/users/luukasni/.local/bin/dp", line 8, in &lt;module&gt;
sys.exit(main())
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 623, in main
train_dp(**dict_args)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 164, in train
_do_work(jdata, run_opt, is_compress)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 278, in _do_work
model.train(train_data, valid_data)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 821, in train
DEEPMD rank:0 INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-08-01 15:43:57.070654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
self._init_session()
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 768, in _init_session
self.sess = tf.Session(config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1604, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 712, in __init__
self._session = tf_session.TF_NewSessionRef(c_graph, opts)
tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (GPU:0) is being mapped to multiple devices (3 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
DEEPMD rank:0 INFO restart from model /scratch/pyykko2/luukasni/turbomole_deepmd/vanillin_/10k/model_training/dipole/model.ckpt
DEEPMD rank:1 INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-08-01 15:43:57.238986: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (1 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
2023-08-01 15:43:57.239010: E tensorflow/c/c_api.cc:2218] ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (1 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
Traceback (most recent call last):
File "/users/luukasni/.local/bin/dp", line 8, in &lt;module&gt;
sys.exit(main())
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 623, in main
train_dp(**dict_args)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 164, in train
_do_work(jdata, run_opt, is_compress)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 278, in _do_work
model.train(train_data, valid_data)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 821, in train
self._init_session()
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 768, in _init_session
self.sess = tf.Session(config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1604, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 712, in __init__
self._session = tf_session.TF_NewSessionRef(c_graph, opts)
tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (GPU:0) is being mapped to multiple devices (1 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
DEEPMD rank:2 INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-08-01 15:43:57.345751: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (2 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
2023-08-01 15:43:57.345774: E tensorflow/c/c_api.cc:2218] ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (2 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
Traceback (most recent call last):
File "/users/luukasni/.local/bin/dp", line 8, in <module>
sys.exit(main())
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 623, in main
train_dp(**dict_args)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 164, in train
_do_work(jdata, run_opt, is_compress)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 278, in _do_work
model.train(train_data, valid_data)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 821, in train
self._init_session()
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 768, in _init_session
self.sess = tf.Session(config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1604, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 712, in __init__
self._session = tf_session.TF_NewSessionRef(c_graph, opts)
tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (GPU:0) is being mapped to multiple devices (2 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
srun: error: r02g01: tasks 1-3: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=18026489.0
slurmstepd: error: *** STEP 18026489.0 ON r02g01 CANCELLED AT 2023-08-01T15:43:58 ***
srun: error: r02g01: task 0: Terminated
srun: Force Terminated StepId=18026489.0

Steps to Reproduce

Run normal training until a checkpoint is created, then restart from the checkpoint with multiple GPUs.

Further Information, Files, and Links

No response

@luukasnik luukasnik added the bug label Aug 2, 2023
@njzjz
Copy link
Member

njzjz commented Aug 2, 2023

Did you install Horovod?

@luukasnik
Copy link
Author

Hello, yes, I have Horovod installed.

The normal training works with "world size: distributed" when running

srun dp train input.json

but when adding the checkpoint I get the aforementioned problem:

srun dp train input.json -r model.ckpt
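For context on why this is sensitive to the launch mode: under `srun` with Horovod, each process on a node is expected to pin a distinct GPU, typically keyed off the local rank. The helper below is a minimal, self-contained sketch of that selection logic only; in a real Horovod run `local_rank` would come from `hvd.local_rank()`, and the chosen id would be passed into the session configuration.

```python
# Hedged sketch: how one GPU per process is usually selected under srun.
# `gpu_ids` and `local_rank` are plain arguments here so the helper stays
# runnable without Horovod or TensorFlow installed.

def visible_gpu_for_rank(gpu_ids, local_rank):
    """Return the single GPU id this rank should see, cycling over the
    available GPUs if there are more ranks per node than GPUs."""
    if not gpu_ids:
        raise RuntimeError("no GPUs available on this node")
    return gpu_ids[local_rank % len(gpu_ids)]


if __name__ == "__main__":
    gpus = [0, 1, 2, 3]
    # Six ranks on a four-GPU node: ranks 4 and 5 wrap around.
    print([visible_gpu_for_rank(gpus, r) for r in range(6)])
```

In DeePMD-kit's Horovod path this roughly corresponds to setting `config.gpu_options.visible_device_list` per rank; the "mapped to multiple devices" error above is what TensorFlow raises when two sessions in the same process end up with inconsistent device lists.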

@njzjz njzjz added the reproduced This bug has been reproduced by developers label Aug 3, 2023
@njzjz njzjz linked a pull request Aug 3, 2023 that will close this issue
wanghan-iapcm pushed a commit that referenced this issue Aug 7, 2023
Fix #2712.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Dec 10, 2023
When fine-tuning with Horovod, the same error as deepmodeling#2712 is thrown at the place I modified in this PR.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
wanghan-iapcm pushed a commit that referenced this issue Dec 11, 2023
When fine-tuning with Horovod, the same error as
#2712 is thrown at the
place I modified in this PR.

It seems `tf.test.is_gpu_available` will try to use all GPUs, but
`tf.config.get_visible_devices` won't.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
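The distinction in the commit message above can be sketched as follows. `tf.test.is_gpu_available()` builds a session and touches every GPU in the process, which can later conflict with a per-rank `visible_device_list`, whereas `tf.config.get_visible_devices("GPU")` only inspects the device list. This is a hedged illustration, not the exact code from the fixing PR; the `try`/`except` keeps it runnable on machines without TensorFlow.

```python
def gpus_visible():
    """Report whether any GPU is visible to this process without
    initializing the devices (unlike tf.test.is_gpu_available)."""
    try:
        import tensorflow as tf
    except ImportError:
        # No TensorFlow installed; treat as CPU-only.
        return False
    return len(tf.config.get_visible_devices("GPU")) > 0
```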