Error when running LAMMPS in the devel branch #4161

wujing81 · 2024-09-24T08:02:44Z

Summary

I created a container node from the image registry.dp.tech/dptech/deepmd-kit:3.0.0b3-cuda12.1 on the Bohrium platform. Then I installed the devel branch of DeePMD-kit with:
conda create -n deepmd-dev python=3.10
source activate deepmd-dev
pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
rsync -a --ignore-existing /opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/ /opt/deepmd-kit-3.0.0b3/
The command /opt/deepmd-kit-3.0.0b3/bin/dp --version displays: DeePMD-kit v3.0.0b4.dev56+g0b72dae3.
I trained a model with this version of dp (the training input file is attached) and ran dp --pt freeze to produce a .pth file. I then used this model to run an MD simulation with the command /opt/deepmd-kit-3.0.0b3/bin/lmp -i lammps.in. The input.lammps and conf.lmp files are attached.
An error occurs:
[bohrium-11849-1195151:01982] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
LAMMPS (2 Aug 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
DeePMD-kit: Successfully load libcudart.so.11.0
2024-09-24 15:37:29.837816: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-24 15:37:29.837871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-24 15:37:29.837882: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Loaded 1 plugins from /opt/deepmd-kit-3.0.0b3/lib/deepmd_lmp
Reading data file ...
triclinic box = (0 0 0) to (12.4447 12.4447 12.4447) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms ...
192 atoms
read_data CPU = 0.003 seconds
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Summary of lammps deepmd module ...

Info of deepmd-kit:
installed to: /opt/deepmd-kit-3.0.0b3
source:
source branch: HEAD
source commit: cbf2de6
source commit at: 2024-07-27 05:11:58 +0000
support model ver.: 1.1
build variant: cuda
build with tf inc: /opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/tensorflow/include;/opt/deepmd-kit-3.0.0b3/include
build with tf lib: /opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2
build with pt lib: torch;torch_library;/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/torch/lib/libc10.so;/usr/local/cuda/lib64/stubs/libcuda.so;/usr/local/cuda/lib64/libnvrtc.so;/usr/local/cuda/lib64/libnvToolsExt.so;/usr/local/cuda/lib64/libcudart.so;/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
set tf intra_op_parallelism_threads: 0
set tf inter_op_parallelism_threads: 0
Info of lammps module:
use deepmd-kit at: /opt/deepmd-kit-3.0.0b3load model from: model.pth to cpu
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Info of model(s):
using 1 model(s): model.pth
rcut in model: 4.5
ntypes in model: 118

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

USER-DEEPMD package:
The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 10 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6.5
ghost atom cutoff = 6.5
binsize = 3.25, bins = 4 4 4
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair deepmd, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend JIT error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 56, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_5 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _5, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_6 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 213, in forward_common_lower
cc_ext, _36, fp, ap, input_prec, = _35
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_37 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 50, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 93, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
fitting_net = self.fitting_net
File "code/torch/deepmd/pt/model/descriptor/dpa2.py", line 98, in forward
repformers1 = self.repformers
_17 = nlist_dict[_1(_16, (repformers1).get_nsel(), )]
_18 = (repformers).forward(_17, extended_coord, extended_atype, g13, mapping0, comm_dict0, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
g14, g2, h2, rot_mat, sw, = _18
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repformers.py", line 364, in forward
_65 = "border_op is not available since customized PyTorch OP library is not built when freezing the model."
_66 = uninitialized(Tensor)
ops.prim.RaiseException(_65, "builtins.NotImplementedError")

return _66

Traceback of TorchScript, original code (most recent call last):
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/model/ener_model.py", line 109, in forward_lower
      comm_dict: Optional[Dict[str, torch.Tensor]] = None,
  ):
      model_ret = self.forward_common_lower(
                  ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
          extended_coord,
          extended_atype,
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/model/make_model.py", line 261, in forward_common_lower
          )
          del extended_coord, fparam, aparam
          atomic_ret = self.atomic_model.forward_common_atomic(
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
              cc_ext,
              extended_atype,
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 242, in forward_common_atomic
  
      ext_atom_mask = self.make_atom_mask(extended_atype)
      ret_dict = self.forward_atomic(
                 ~~~~~~~~~~~~~~~~~~~ <--- HERE
          extended_coord,
          torch.where(ext_atom_mask, extended_atype, 0),
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 189, in forward_atomic
      if self.do_grad_r() or self.do_grad_c():
          extended_coord.requires_grad_(True)
      descriptor, rot_mat, g2, h2, sw = self.descriptor(
                                        ~~~~~~~~~~~~~~~ <--- HERE
          extended_coord,
          extended_atype,
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/descriptor/dpa2.py", line 799, in forward
          g1 = g1_ext
      # repformer
      g1, g2, h2, rot_mat, sw = self.repformers(
                                ~~~~~~~~~~~~~~~ <--- HERE
          nlist_dict[
              get_multiple_nlist_key(
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/descriptor/repformers.py", line 62, in forward
      argument8,
  ) -> torch.Tensor:
      raise NotImplementedError(
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
          "border_op is not available since customized PyTorch OP library is not built when freezing the model."
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
      )
builtins.NotImplementedError: border_op is not available since customized PyTorch OP library is not built when freezing the model.
(/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run             1000
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------


### DeePMD-kit Version

DeePMD-kit v3.0.0b4.dev56+g0b72dae3

### Backend and its version

PyTorch v2.4.1+cu121-g38b96d3399a

### Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

_No response_

### Details

[input.zip](https://github.com/user-attachments/files/17110378/input.zip)


iProzd · 2024-09-24T14:22:09Z

@wujing81 Apologies for the confusion during installation; I faced the same issue while debugging.

The problem arises because DPA2 requires the border_op operator, which belongs to the customized PyTorch OP library and is only built when PyTorch support is enabled at installation time. You can enable it with the following command:
DP_VARIANT=cuda DP_ENABLE_PYTORCH=1 pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
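As a rough sketch (not from the maintainers), the check below looks for the border_op message in a saved LAMMPS log and prints the two follow-up steps: the reinstall command above, and re-running dp --pt freeze, since the message says the library was missing when the model was frozen. The function name has_border_op_error and the log file name log.lammps are assumptions for illustration, and re-freezing is an inference from the error text rather than something confirmed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit status 0) if the given log file contains
# the border_op error quoted in this issue. The name is illustrative only.
has_border_op_error() {
    # -F matches the message as a fixed string, -q suppresses output.
    grep -qF -- "border_op is not available since customized PyTorch OP library is not built" "$1"
}

# Usage sketch: assumes the LAMMPS output was captured in ./log.lammps.
if [ -f log.lammps ] && has_border_op_error log.lammps; then
    echo "The frozen model was built without the customized PyTorch OP library."
    echo "1. Reinstall: DP_VARIANT=cuda DP_ENABLE_PYTORCH=1 pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel"
    echo "2. Re-run 'dp --pt freeze' so the new .pth file no longer contains the fallback that raises the error."
fi
```

If the error text was only printed to the terminal, redirecting the lmp output to a file first gives the function something to scan.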

But why is this option False by default? @njzjz @CaRoLZhangxy As I understand it, anyone who wants to run a DPA2 model with LAMMPS needs this option. Also, the documentation for this option may not be clear enough: https://docs.deepmodeling.com/projects/deepmd/en/latest/install/install-from-source.html#envvar-DP_ENABLE_PYTORCH

njzjz · 2024-09-24T19:00:38Z

> But why is this option False by default?

xref: #3891 (comment)

I am not going to change the default option to True until PyTorch fixes pytorch/pytorch#78530.


wujing81 added the wontfix label Sep 24, 2024

njzjz added Docs and removed wontfix labels Sep 24, 2024

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Oct 1, 2024

docs: add documentation for installation requirements of DPA-2

9e47e3f

Fix deepmodeling#4161. Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>

njzjz linked a pull request Oct 1, 2024 that will close this issue

docs: add documentation for installation requirements of DPA-2 #4178

Open
