
[BUG] restarting from checkpoint doesn't work with multiple GPUs #2712

Closed
luukasnik opened this issue Aug 2, 2023 · 2 comments · Fixed by #2716
Labels
bug, reproduced (This bug has been reproduced by developers)

Comments

@luukasnik

Bug summary

Restarting training from a checkpoint file doesn't work with multiple GPUs.

When running the command

srun dp train --mpi-log=workers input.json -r model.ckpt

I get the error log shown below. The command works without srun, but then training runs on only 1 GPU.
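For context, a minimal Slurm batch script that would produce this kind of launch (the job name, time limit, and GPU resource string are placeholders I've assumed, not taken from the original report; the srun line is the one from the report):

```shell
#!/bin/bash
#SBATCH --job-name=dp-restart        # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=4                   # one MPI rank per GPU
#SBATCH --gres=gpu:v100:4            # 4x V100, matching the log below
#SBATCH --time=01:00:00

# Launch DeePMD-kit training with one rank per GPU, restarting
# from the previous checkpoint (model.ckpt).
srun dp train --mpi-log=workers input.json -r model.ckpt
```

With `--ntasks=4` on a 4-GPU node, srun spawns four `dp train` processes, which matches the four ranks (rank:0 through rank:3) visible in the log.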

DeePMD-kit Version

2.2.2

TensorFlow Version

2.12

How did you download the software?

pip

Input Files, Running Commands, Error Log, etc.

The following modules were not unloaded:
(Use "module --force purge" to unload all):

  1. csc-tools
    NOTE: This module uses Apptainer (Singularity). Some commands execute inside
    the container (e.g. python3, pip3).

Currently Loaded Modules:

  1) csc-tools (S)   2) gcc/9.4.0   3) tensorflow/2.12

Where:
S: Module is Sticky, requires --force to unload or purge

2023-08-01 15:41:30.901704: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-08-01 15:41:30.901714: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-08-01 15:41:30.901797: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-08-01 15:41:30.901764: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-08-01 15:41:33.857100: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 15:41:33.857156: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 15:41:33.857189: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 15:41:33.857218: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-01 15:42:22.689050: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-01 15:42:22.689045: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-01 15:42:22.689087: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-01 15:42:22.689074: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DEEPMD rank:0 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD rank:1 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD rank:3 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
DEEPMD rank:2 INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
2023-08-01 15:43:32.927072: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-08-01 15:43:32.927064: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-08-01 15:43:32.927069: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-08-01 15:43:32.927104: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
DEEPMD rank:2 INFO training data with min nbor dist: 0.8505298046431992
DEEPMD rank:2 INFO training data with max nbor size: [8 8 2 1]
DEEPMD rank:2 INFO [DeePMD-kit ASCII-art banner, mangled by formatting; omitted]
DEEPMD rank:2 INFO Please read and cite:
DEEPMD rank:2 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD rank:2 INFO Zeng et al, arXiv:2304.09409
DEEPMD rank:2 INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD rank:2 INFO installed to: /project/skbuild/linux-x86_64-3.11/cmake-install
DEEPMD rank:2 INFO source : v2.2.2
DEEPMD rank:2 INFO source brach: HEAD
DEEPMD rank:2 INFO source commit: 92ca097
DEEPMD rank:2 INFO source commit at: 2023-05-24 13:45:03 +0800
DEEPMD rank:2 INFO build float prec: double
DEEPMD rank:2 INFO build variant: cuda
DEEPMD rank:2 INFO build with tf inc: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include;/tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include
DEEPMD rank:2 INFO build with tf lib: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/libtensorflow_cc.so.2
DEEPMD rank:2 INFO ---Summary of the training---------------------------------------
DEEPMD rank:2 INFO distributed
DEEPMD rank:2 INFO world size: 4
DEEPMD rank:2 INFO my rank: 2
DEEPMD rank:2 INFO node list: ['r02g01']
DEEPMD rank:2 INFO running on: r02g01
DEEPMD rank:2 INFO computing device: gpu:2
DEEPMD rank:2 INFO CUDA_VISIBLE_DEVICES: 0,1,2,3
DEEPMD rank:2 INFO Count of visible GPU: 4
DEEPMD rank:2 INFO num_intra_threads: 0
DEEPMD rank:2 INFO num_inter_threads: 0
DEEPMD rank:2 INFO -----------------------------------------------------------------
DEEPMD rank:2 INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD rank:2 INFO found 1 system(s):
DEEPMD rank:2 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:2 INFO -- vanillin_/10k/data/dipole/training_data 19 100 80 1.000 F
DEEPMD rank:2 INFO --------------------------------------------------------------------------------------
DEEPMD rank:3 INFO training data with min nbor dist: 0.8505298046431992
DEEPMD rank:3 INFO training data with max nbor size: [8 8 2 1]
DEEPMD rank:3 INFO [DeePMD-kit ASCII-art banner, mangled by formatting; omitted]
DEEPMD rank:3 INFO Please read and cite:
DEEPMD rank:3 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD rank:3 INFO Zeng et al, arXiv:2304.09409
DEEPMD rank:3 INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD rank:3 INFO installed to: /project/skbuild/linux-x86_64-3.11/cmake-install
DEEPMD rank:3 INFO source : v2.2.2
DEEPMD rank:3 INFO source brach: HEAD
DEEPMD rank:3 INFO source commit: 92ca097
DEEPMD rank:3 INFO source commit at: 2023-05-24 13:45:03 +0800
DEEPMD rank:3 INFO build float prec: double
DEEPMD rank:3 INFO build variant: cuda
DEEPMD rank:3 INFO build with tf inc: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include;/tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include
DEEPMD rank:3 INFO build with tf lib: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/libtensorflow_cc.so.2
DEEPMD rank:3 INFO ---Summary of the training---------------------------------------
DEEPMD rank:3 INFO distributed
DEEPMD rank:3 INFO world size: 4
DEEPMD rank:3 INFO my rank: 3
DEEPMD rank:3 INFO node list: ['r02g01']
DEEPMD rank:3 INFO running on: r02g01
DEEPMD rank:3 INFO computing device: gpu:3
DEEPMD rank:3 INFO CUDA_VISIBLE_DEVICES: 0,1,2,3
DEEPMD rank:3 INFO Count of visible GPU: 4
DEEPMD rank:3 INFO num_intra_threads: 0
DEEPMD rank:3 INFO num_inter_threads: 0
DEEPMD rank:3 INFO -----------------------------------------------------------------
DEEPMD rank:3 INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD rank:3 INFO found 1 system(s):
DEEPMD rank:3 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:3 INFO -- vanillin_/10k/data/dipole/training_data 19 100 80 1.000 F
DEEPMD rank:3 INFO --------------------------------------------------------------------------------------
DEEPMD rank:3 INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD rank:3 INFO found 1 system(s):
DEEPMD rank:3 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:3 INFO -- nillin_/10k/data/dipole/validation_data 19 500 4 1.000 F
DEEPMD rank:3 INFO --------------------------------------------------------------------------------------
DEEPMD rank:3 INFO training without frame parameter
DEEPMD rank:2 INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD rank:2 INFO found 1 system(s):
DEEPMD rank:2 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:2 INFO -- nillin_/10k/data/dipole/validation_data 19 500 4 1.000 F
DEEPMD rank:2 INFO --------------------------------------------------------------------------------------
DEEPMD rank:2 INFO training without frame parameter
DEEPMD rank:0 INFO training data with min nbor dist: 0.8505298046431992
DEEPMD rank:0 INFO training data with max nbor size: [8 8 2 1]
DEEPMD rank:1 INFO training data with min nbor dist: 0.8505298046431992
DEEPMD rank:1 INFO training data with max nbor size: [8 8 2 1]
DEEPMD rank:0 INFO [DeePMD-kit ASCII-art banner, mangled by formatting; omitted]
DEEPMD rank:0 INFO Please read and cite:
DEEPMD rank:0 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD rank:0 INFO Zeng et al, arXiv:2304.09409
DEEPMD rank:0 INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD rank:0 INFO installed to: /project/skbuild/linux-x86_64-3.11/cmake-install
DEEPMD rank:0 INFO source : v2.2.2
DEEPMD rank:0 INFO source brach: HEAD
DEEPMD rank:0 INFO source commit: 92ca097
DEEPMD rank:0 INFO source commit at: 2023-05-24 13:45:03 +0800
DEEPMD rank:0 INFO build float prec: double
DEEPMD rank:0 INFO build variant: cuda
DEEPMD rank:0 INFO build with tf inc: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include;/tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include
DEEPMD rank:0 INFO build with tf lib: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/libtensorflow_cc.so.2
DEEPMD rank:0 INFO ---Summary of the training---------------------------------------
DEEPMD rank:0 INFO distributed
DEEPMD rank:0 INFO world size: 4
DEEPMD rank:0 INFO my rank: 0
DEEPMD rank:0 INFO node list: ['r02g01']
DEEPMD rank:0 INFO running on: r02g01
DEEPMD rank:0 INFO computing device: gpu:0
DEEPMD rank:0 INFO CUDA_VISIBLE_DEVICES: 0,1,2,3
DEEPMD rank:0 INFO Count of visible GPU: 4
DEEPMD rank:0 INFO num_intra_threads: 0
DEEPMD rank:0 INFO num_inter_threads: 0
DEEPMD rank:0 INFO -----------------------------------------------------------------
DEEPMD rank:1 INFO [DeePMD-kit ASCII-art banner, mangled by formatting; omitted]
DEEPMD rank:1 INFO Please read and cite:
DEEPMD rank:1 INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
DEEPMD rank:1 INFO Zeng et al, arXiv:2304.09409
DEEPMD rank:1 INFO See https://deepmd.rtfd.io/credits/ for details.
DEEPMD rank:1 INFO installed to: /project/skbuild/linux-x86_64-3.11/cmake-install
DEEPMD rank:1 INFO source : v2.2.2
DEEPMD rank:1 INFO source brach: HEAD
DEEPMD rank:1 INFO source commit: 92ca097
DEEPMD rank:1 INFO source commit at: 2023-05-24 13:45:03 +0800
DEEPMD rank:1 INFO build float prec: double
DEEPMD rank:1 INFO build variant: cuda
DEEPMD rank:1 INFO build with tf inc: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include;/tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/include
DEEPMD rank:1 INFO build with tf lib: /tmp/pip-build-env-vl5nal1k/normal/lib/python3.11/site-packages/tensorflow/libtensorflow_cc.so.2
DEEPMD rank:1 INFO ---Summary of the training---------------------------------------
DEEPMD rank:1 INFO distributed
DEEPMD rank:1 INFO world size: 4
DEEPMD rank:1 INFO my rank: 1
DEEPMD rank:1 INFO node list: ['r02g01']
DEEPMD rank:1 INFO running on: r02g01
DEEPMD rank:1 INFO computing device: gpu:1
DEEPMD rank:1 INFO CUDA_VISIBLE_DEVICES: 0,1,2,3
DEEPMD rank:1 INFO Count of visible GPU: 4
DEEPMD rank:1 INFO num_intra_threads: 0
DEEPMD rank:1 INFO num_inter_threads: 0
DEEPMD rank:1 INFO -----------------------------------------------------------------
DEEPMD rank:0 INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD rank:0 INFO found 1 system(s):
DEEPMD rank:0 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:0 INFO -- vanillin_/10k/data/dipole/training_data 19 100 80 1.000 F
DEEPMD rank:0 INFO --------------------------------------------------------------------------------------
DEEPMD rank:1 INFO ---Summary of DataSystem: training -----------------------------------------------
DEEPMD rank:1 INFO found 1 system(s):
DEEPMD rank:1 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:1 INFO -- vanillin_/10k/data/dipole/training_data 19 100 80 1.000 F
DEEPMD rank:1 INFO --------------------------------------------------------------------------------------
DEEPMD rank:0 INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD rank:0 INFO found 1 system(s):
DEEPMD rank:0 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:0 INFO -- nillin_/10k/data/dipole/validation_data 19 500 4 1.000 F
DEEPMD rank:0 INFO --------------------------------------------------------------------------------------
DEEPMD rank:0 INFO training without frame parameter
DEEPMD rank:1 INFO ---Summary of DataSystem: validation -----------------------------------------------
DEEPMD rank:1 INFO found 1 system(s):
DEEPMD rank:1 INFO system natoms bch_sz n_bch prob pbc
DEEPMD rank:1 INFO -- nillin_/10k/data/dipole/validation_data 19 500 4 1.000 F
DEEPMD rank:1 INFO --------------------------------------------------------------------------------------
DEEPMD rank:1 INFO training without frame parameter
2023-08-01 15:43:50.922270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2023-08-01 15:43:50.923407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 29357 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0
2023-08-01 15:43:50.924350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 29357 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
2023-08-01 15:43:50.925349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 29527 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
2023-08-01 15:43:50.944724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2023-08-01 15:43:50.945809: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 29357 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0
2023-08-01 15:43:50.946740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 29357 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
2023-08-01 15:43:50.947576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 29501 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
2023-08-01 15:43:50.985007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2023-08-01 15:43:50.986080: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 29357 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0
2023-08-01 15:43:50.987024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 29357 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
2023-08-01 15:43:50.987923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 29465 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
2023-08-01 15:43:51.227636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
2023-08-01 15:43:51.228725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 29357 MB memory: -> device: 1, name: Tesla V100-SXM2-32GB, pci bus id: 0000:62:00.0, compute capability: 7.0
2023-08-01 15:43:51.229691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 29357 MB memory: -> device: 2, name: Tesla V100-SXM2-32GB, pci bus id: 0000:89:00.0, compute capability: 7.0
2023-08-01 15:43:51.230606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 29357 MB memory: -> device: 3, name: Tesla V100-SXM2-32GB, pci bus id: 0000:8a:00.0, compute capability: 7.0
DEEPMD rank:3 INFO built lr
DEEPMD rank:0 INFO built lr
DEEPMD rank:1 INFO built lr
DEEPMD rank:2 INFO built lr
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
WARNING:tensorflow:From /users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: accumulate_n (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.math.add_n Instead
DEEPMD rank:3 INFO built network
DEEPMD rank:3 INFO Scale learning rate by coef: 4.000000
DEEPMD rank:0 INFO built network
DEEPMD rank:0 INFO Scale learning rate by coef: 4.000000
DEEPMD rank:1 INFO built network
DEEPMD rank:1 INFO Scale learning rate by coef: 4.000000
DEEPMD rank:2 INFO built network
DEEPMD rank:2 INFO Scale learning rate by coef: 4.000000
DEEPMD rank:3 INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-08-01 15:43:57.005045: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (3 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
2023-08-01 15:43:57.005070: E tensorflow/c/c_api.cc:2218] ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (3 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
Traceback (most recent call last):
File "/users/luukasni/.local/bin/dp", line 8, in &lt;module&gt;
sys.exit(main())
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 623, in main
train_dp(**dict_args)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 164, in train
_do_work(jdata, run_opt, is_compress)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 278, in _do_work
model.train(train_data, valid_data)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 821, in train
DEEPMD rank:0 INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-08-01 15:43:57.070654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 29357 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:61:00.0, compute capability: 7.0
self._init_session()
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 768, in _init_session
self.sess = tf.Session(config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1604, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 712, in __init__
self._session = tf_session.TF_NewSessionRef(c_graph, opts)
tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (GPU:0) is being mapped to multiple devices (3 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
DEEPMD rank:0 INFO restart from model /scratch/pyykko2/luukasni/turbomole_deepmd/vanillin_/10k/model_training/dipole/model.ckpt
DEEPMD rank:1 INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-08-01 15:43:57.238986: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (1 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
2023-08-01 15:43:57.239010: E tensorflow/c/c_api.cc:2218] ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (1 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
Traceback (most recent call last):
File "/users/luukasni/.local/bin/dp", line 8, in &lt;module&gt;
sys.exit(main())
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 623, in main
train_dp(**dict_args)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 164, in train
_do_work(jdata, run_opt, is_compress)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 278, in _do_work
model.train(train_data, valid_data)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 821, in train
self._init_session()
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 768, in _init_session
self.sess = tf.Session(config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1604, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 712, in __init__
self._session = tf_session.TF_NewSessionRef(c_graph, opts)
tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (GPU:0) is being mapped to multiple devices (1 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
DEEPMD rank:2 INFO built training
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-08-01 15:43:57.345751: E tensorflow/core/common_runtime/session.cc:91] Failed to create session: ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (2 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
2023-08-01 15:43:57.345774: E tensorflow/c/c_api.cc:2218] ALREADY_EXISTS: TensorFlow device (GPU:0) is being mapped to multiple devices (2 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
Traceback (most recent call last):
File "/users/luukasni/.local/bin/dp", line 8, in <module>
sys.exit(main())
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 623, in main
train_dp(**dict_args)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 164, in train
_do_work(jdata, run_opt, is_compress)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 278, in _do_work
model.train(train_data, valid_data)
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 821, in train
self._init_session()
File "/users/luukasni/.local/lib/python3.9/site-packages/deepmd/train/trainer.py", line 768, in _init_session
self.sess = tf.Session(config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1604, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/users/luukasni/.local/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 712, in __init__
self._session = tf_session.TF_NewSessionRef(c_graph, opts)
tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (GPU:0) is being mapped to multiple devices (2 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see tensorflow/tensorflow#19083
srun: error: r02g01: tasks 1-3: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=18026489.0
slurmstepd: error: *** STEP 18026489.0 ON r02g01 CANCELLED AT 2023-08-01T15:43:58 ***
srun: error: r02g01: task 0: Terminated
srun: Force Terminated StepId=18026489.0

Steps to Reproduce

Run normal training until a checkpoint is created, then restart from the checkpoint with multiple GPUs.

Further Information, Files, and Links

No response

@luukasnik luukasnik added the bug label Aug 2, 2023
@njzjz
Copy link
Member

njzjz commented Aug 2, 2023

Did you install Horovod?

@luukasnik
Copy link
Author

Hello, yes, I have Horovod installed.

The normal training works with "world size: distributed" when running

srun dp train input.json

but when adding the checkpoint I get the aforementioned problem:

srun dp train input.json -r model.ckpt
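For context on why this is sensitive to the launch mode: under `srun` with Horovod, each process on a node is expected to pin a distinct GPU, typically keyed off the local rank. The helper below is a minimal, self-contained sketch of that selection logic only; in a real Horovod run `local_rank` would come from `hvd.local_rank()`, and the chosen id would be passed into the session configuration.

```python
# Hedged sketch: how one GPU per process is usually selected under srun.
# `gpu_ids` and `local_rank` are plain arguments here so the helper stays
# runnable without Horovod or TensorFlow installed.

def visible_gpu_for_rank(gpu_ids, local_rank):
    """Return the single GPU id this rank should see, cycling over the
    available GPUs if there are more ranks per node than GPUs."""
    if not gpu_ids:
        raise RuntimeError("no GPUs available on this node")
    return gpu_ids[local_rank % len(gpu_ids)]


if __name__ == "__main__":
    gpus = [0, 1, 2, 3]
    # Six ranks on a four-GPU node: ranks 4 and 5 wrap around.
    print([visible_gpu_for_rank(gpus, r) for r in range(6)])
```

In DeePMD-kit's Horovod path this roughly corresponds to setting `config.gpu_options.visible_device_list` per rank; the "mapped to multiple devices" error above is what TensorFlow raises when two sessions in the same process end up with inconsistent device lists.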

@njzjz njzjz added the reproduced This bug has been reproduced by developers label Aug 3, 2023
@njzjz njzjz linked a pull request Aug 3, 2023 that will close this issue
wanghan-iapcm pushed a commit that referenced this issue Aug 7, 2023
Fix #2712.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Dec 10, 2023
When fine-tuning with Horovod, the same error as deepmodeling#2712 is thrown at the place I modified in this PR.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
wanghan-iapcm pushed a commit that referenced this issue Dec 11, 2023
When fine-tuning with Horovod, the same error as
#2712 is thrown at the
place I modified in this PR.

It seems `tf.test.is_gpu_available` will try to use all GPUs, but
`tf.config.get_visible_devices` won't.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
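The distinction in the commit message above can be sketched as follows. `tf.test.is_gpu_available()` builds a session and touches every GPU in the process, which can later conflict with a per-rank `visible_device_list`, whereas `tf.config.get_visible_devices("GPU")` only inspects the device list. This is a hedged illustration, not the exact code from the fixing PR; the `try`/`except` keeps it runnable on machines without TensorFlow.

```python
def gpus_visible():
    """Report whether any GPU is visible to this process without
    initializing the devices (unlike tf.test.is_gpu_available)."""
    try:
        import tensorflow as tf
    except ImportError:
        # No TensorFlow installed; treat as CPU-only.
        return False
    return len(tf.config.get_visible_devices("GPU")) > 0
```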