
The function deepmd.infer.DeepDipole.eval() cannot utilize multiple GPUs in parallel #2877

Closed · Kehan-Cai-nanako opened this issue Sep 28, 2023 · 9 comments · Fixed by #3046

@Kehan-Cai-nanako
Bug summary

When using deepmd.infer.DeepDipole.eval() to infer Wannier centroids, only one GPU is used in practice even though I requested multiple GPUs; the others stay idle. The function feeds all atomic positions to a single GPU, which can trigger an out-of-memory error when the simulation system is large.
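
For reference, a minimal sketch of the call pattern from my script (the model file name is a placeholder; the atom count and cell come from this issue):

```python
import numpy as np
from deepmd.infer import DeepDipole

# Hypothetical frozen dipole (Wannier) model; the real path differs.
dw = DeepDipole("dipole.pb")

natoms = 34560                          # system size from this issue
coords = np.random.rand(1, natoms * 3)  # one frame, flattened coordinates
cells = np.diag([67.88, 67.88, 96.0]).reshape(1, 9)
atypes = np.zeros(natoms, dtype=int)    # per-atom types, model-dependent

# The whole frame is fed to a single GPU; other visible GPUs stay idle.
wannier = dw.eval(coords, cells, atom_types=atypes).reshape(-1, 3)
```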

DeePMD-kit Version

2.2.4

TensorFlow Version

2.12.0

How did you download the software?

conda

Input Files, Running Commands, Error Log, etc.

2023-09-28 11:33:23.683264: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-09-28 11:33:25.043901: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-28 11:33:27.204238: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, TF_INTRA_OP_PARALLELISM_THREADS, and TF_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
2023-09-28 11:33:33.162274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79067 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:65:00.0, compute capability: 8.0
2023-09-28 11:33:33.163249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 79067 MB memory: -> device: 1, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:ca:00.0, compute capability: 8.0
2023-09-28 11:33:33.231080: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:353] MLIR V1 optimization pass is not enabled
2023-09-28 11:34:55.644022: W tensorflow/tsl/framework/bfc_allocator.cc:366] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
cuda assert: invalid argument /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/lib/src/cuda/neighbor_list.cu 194
2023-09-28 11:34:56.865522: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at custom_op.cc:18 : INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
2023-09-28 11:34:56.865588: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
2023-09-28 11:34:56.865609: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
Traceback (most recent call last):
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1378, in _do_call
return fn(*args)
^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1361, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1454, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 131, in
compute_wannier_centroid_savenpz(read_conf_directory, read_traj_directory, DW, 'full') # MODIFY!! concern atom_style = 'full' or 'atomic'
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/gpfs/kehanc/StudyFold/ResearchFold/Relaxor_Ferroelectrics/preliminary/src/wannier3.py", line 64, in compute_wannier_centroid_savenpz
wannier_ref = DW.eval(pos_ref, cell_ref, atom_types=atypes).reshape(-1,3)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/deepmd/infer/deep_tensor.py", line 229, in eval
v_out = self.sess.run(t_out, feed_dict=feed_dict_test)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 968, in run
result = self._run(None, fetches, feed_dict, options_ptr,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1191, in _run
results = self._do_run(handle, final_targets, final_fetches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1371, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tigress/yifanl/usr/licensed/anaconda3/2021.11/envs/dpc/lib/python3.11/site-packages/tensorflow/python/client/session.py", line 1397, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
Detected at node 'load/ProdEnvMatA' defined at (most recent call last):
Node: 'load/ProdEnvMatA'
2 root error(s) found.
(0) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
[[load/o_dipole/_25]]
(1) INTERNAL: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /scratch/gpfs/yifanl/Softwares/deepmd-kit-dev/deepmd-kit/source/op/custom_op.cc:18
[[{{node load/ProdEnvMatA}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'load/ProdEnvMatA':

give_yifan.zip

Steps to Reproduce

Run the command:

sbatch run_wc3.slurm

Further Information, Files, and Links

No response

Yi-FanLi self-assigned this Sep 28, 2023
@Yi-FanLi
Collaborator

I guess this error stems from the lack of support for multi-GPU parallel inference through the Python API. @njzjz Is that true?

@Kehan-Cai-nanako Can you try LAMMPS's rerun feature with the compute deeptensor/atom command to do the inference?

@njzjz
Member

njzjz commented Sep 28, 2023

Do you input one frame or multiple frames? Currently, DeepTensor does not support automatic batch sizing, unlike DeepPot, so inputting multiple frames may cause an OOM error.
@Yi-FanLi You can try to support it. See #1173.
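
Until then, a possible manual workaround is to split the frames yourself; a minimal sketch (the helper name and batch size are mine, not part of the API):

```python
import numpy as np

def eval_in_batches(model, coords, cells, atom_types, batch_size=1):
    """Evaluate a DeepTensor model a few frames at a time.

    coords: (nframes, natoms*3); cells: (nframes, 9).
    """
    results = []
    for i in range(0, coords.shape[0], batch_size):
        # Only batch_size frames are resident on the GPU per call.
        results.append(
            model.eval(coords[i:i + batch_size], cells[i:i + batch_size], atom_types)
        )
    return np.concatenate(results, axis=0)
```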

@Kehan-Cai-nanako
Author

Kehan-Cai-nanako commented Sep 28, 2023 via email

@Kehan-Cai-nanako
Author

Kehan-Cai-nanako commented Sep 28, 2023 via email

@Yi-FanLi
Collaborator

It's a little odd, because you have only 1 frame and ~30k atoms. With an energy model based on the se_e2_a descriptor, an 80 GB A100 can handle ~1000k atoms. @njzjz I think we need a more detailed analysis of the memory use in DeepTensor's inference.

@Yi-FanLi
Collaborator

Yi-FanLi commented Oct 1, 2023

We concluded that this issue arises because the system is too large for the GPU version of the Python interface: the neighbor list is too large to be allocated on the GPU. The workaround is to use LAMMPS's rerun command together with the compute deeptensor/atom command, as sketched below.
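
A minimal LAMMPS input sketch of that workaround; all file and model names are placeholders, not taken from this issue:

```
# Re-evaluate an existing trajectory without time integration.
units        metal
atom_style   atomic
read_data    conf.lmp                  # placeholder data file

pair_style   deepmd ener.pb            # placeholder energy model
pair_coeff   * *

# Per-atom dipole (Wannier centroid) from a frozen tensor model.
compute      wc all deeptensor/atom dipole.pb
dump         1 all custom 1 wc.dump id c_wc[1] c_wc[2] c_wc[3]

rerun        traj.lammpstrj dump x y z # placeholder trajectory
```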

@Kehan-Cai-nanako
Author

Kehan-Cai-nanako commented Oct 1, 2023 via email

@njzjz
Member

njzjz commented Oct 1, 2023

The space complexity of the current neighbor-list algorithm is $O(n^2)$. A better algorithm is required when the number of atoms is large.
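
To illustrate the scaling (this is not the actual DeePMD-kit kernel): a brute-force neighbor search materializes one entry per atom pair, which already reaches several GB at the system size in this issue:

```python
n = 34_560                    # atom count from this issue
pair_bytes = n * n * 8        # one float64 distance per atom pair
print(f"{pair_bytes / 1e9:.1f} GB")  # ~9.6 GB for a single n x n matrix
```

A cell-list (linked-cell) search instead stores only neighbors within the cutoff, giving $O(n)$ memory at fixed density.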

njzjz added the enhancement label and removed the bug label Oct 1, 2023
njzjz added a commit to njzjz/deepmd-kit that referenced this issue Dec 8, 2023
Fix deepmodeling#2877

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
njzjz linked a pull request Dec 8, 2023 that will close this issue
njzjz self-assigned this Dec 8, 2023
wanghan-iapcm pushed a commit that referenced this issue Dec 11, 2023
Fix #2877

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
@njzjz
Member

njzjz commented Dec 11, 2023

With #3046, I can now run the script, so I think this issue has been resolved.

output:

read_conf_directory = ./T50/
read_traj_directory = ./T50/
atypes.shape = (34560,)
atypes = [3 1 2 2 2 3 1 2 2 2 3 1 2 2 2 3 0 2 2 2]
pos_ref.shape = (34560, 3)
cell_ref = Cell([67.88225099, 67.88225099, 96.0])
np.diag(cell_ref) = [67.88225099 67.88225099 96.        ]
type(wannier_ref) is <class 'numpy.ndarray'>
wannier_ref.shape = (27648, 3)
wannier_ref = [[ 6.76164581e-03  4.77885641e-11 -4.78120556e-03]
 [-4.82161764e-11  4.37749264e-11  4.10922896e-03]
 [ 7.36184699e-02 -7.36184692e-02 -5.14084738e-03]
 [ 5.39583441e-03 -1.28866391e-10  3.81543074e-03]
 [ 3.29759379e-10  6.76164553e-03  4.78120554e-03]
 [ 4.25239296e-10  3.73859745e-10  9.63019841e-02]
 [-5.18825737e-03  5.18825643e-03  3.24166692e-18]
 [ 6.80957854e-02  6.80957861e-02 -5.94455577e-18]
 [ 1.37889803e-02 -1.37889803e-02 -8.30501870e-18]
 [-5.00852858e-11 -5.39583395e-03 -3.81543078e-03]]

Note that ASE may not be the fastest implementation. One can use another implementation and convert it to the ASE interface; see the sketch below.
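
For anyone adapting their own neighbor search, a minimal sketch of the ASE neighbor-list interface mentioned above (the toy system and cutoff are placeholders):

```python
from ase import Atoms
from ase.neighborlist import neighbor_list

# Tiny toy system; a real use would load the actual configuration.
atoms = Atoms("H2O",
              positions=[[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
              cell=[10.0, 10.0, 10.0], pbc=True)

# Pair indices (i, j) and distances d within a 6.0 Angstrom cutoff,
# computed with a binning (linked-cell) algorithm rather than all pairs.
i, j, d = neighbor_list("ijd", atoms, cutoff=6.0)
```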

njzjz closed this as completed Dec 11, 2023