Error when running LAMMPS in the devel branch #4161

Open
wujing81 opened this issue Sep 24, 2024 · 2 comments · May be fixed by #4178

@wujing81

Summary

I created a container node from the image registry.dp.tech/dptech/deepmd-kit:3.0.0b3-cuda12.1 on the Bohrium platform, then installed the devel branch of DeePMD-kit with:
conda create -n deepmd-dev python=3.10
source activate deepmd-dev
pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
rsync -a --ignore-existing /opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/ /opt/deepmd-kit-3.0.0b3/
The command /opt/deepmd-kit-3.0.0b3/bin/dp --version displays: DeePMD-kit v3.0.0b4.dev56+g0b72dae3.
I trained a model with this version of dp (the training input file is attached) and froze it with dp --pt freeze to produce a .pth file. I then used this model to run an MD simulation with the command /opt/deepmd-kit-3.0.0b3/bin/lmp -i lammps.in. The input.lammps and conf.lmp files are attached.
The run fails with the following error:
[bohrium-11849-1195151:01982] mca_base_component_repository_open: unable to open mca_btl_openib: librdmacm.so.1: cannot open shared object file: No such file or directory (ignored)
LAMMPS (2 Aug 2023)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
DeePMD-kit: Successfully load libcudart.so.11.0
2024-09-24 15:37:29.837816: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-24 15:37:29.837871: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-24 15:37:29.837882: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Loaded 1 plugins from /opt/deepmd-kit-3.0.0b3/lib/deepmd_lmp
Reading data file ...
triclinic box = (0 0 0) to (12.4447 12.4447 12.4447) with tilt (0 0 0)
1 by 1 by 1 MPI processor grid
reading atoms ...
192 atoms
read_data CPU = 0.003 seconds
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Summary of lammps deepmd module ...

Info of deepmd-kit:
installed to: /opt/deepmd-kit-3.0.0b3
source:
source branch: HEAD
source commit: cbf2de6
source commit at: 2024-07-27 05:11:58 +0000
support model ver.: 1.1
build variant: cuda
build with tf inc: /opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/tensorflow/include;/opt/deepmd-kit-3.0.0b3/include
build with tf lib: /opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/tensorflow/libtensorflow_cc.so.2
build with pt lib: torch;torch_library;/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/torch/lib/libc10.so;/usr/local/cuda/lib64/stubs/libcuda.so;/usr/local/cuda/lib64/libnvrtc.so;/usr/local/cuda/lib64/libnvToolsExt.so;/usr/local/cuda/lib64/libcudart.so;/opt/deepmd-kit-3.0.0b3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
set tf intra_op_parallelism_threads: 0
set tf inter_op_parallelism_threads: 0
Info of lammps module:
use deepmd-kit at: /opt/deepmd-kit-3.0.0b3load model from: model.pth to cpu
DeePMD-kit WARNING: Environmental variable DP_INTRA_OP_PARALLELISM_THREADS is not set. Tune DP_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable DP_INTER_OP_PARALLELISM_THREADS is not set. Tune DP_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
Info of model(s):
using 1 model(s): model.pth
rcut in model: 4.5
ntypes in model: 118

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

  • USER-DEEPMD package:
    The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Generated 0 of 1 mixed pair_coeff terms from geometric mixing rule
Neighbor list info ...
update: every = 10 steps, delay = 0 steps, check = no
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 6.5
ghost atom cutoff = 6.5
binsize = 3.25, bins = 4 4 4
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair deepmd, perpetual
attributes: full, newton on
pair build: full/bin/atomonly
stencil: full/bin/3d
bin: standard
Setting up Verlet run ...
Unit style : metal
Current step : 0
Time step : 0.0005
ERROR on proc 0: DeePMD-kit C API Error: DeePMD-kit Error: DeePMD-kit PyTorch backend JIT error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
File "code/torch/deepmd/pt/model/model/ener_model.py", line 56, in forward_lower
comm_dict: Optional[Dict[str, Tensor]]=None) -> Dict[str, Tensor]:
_5 = (self).need_sorted_nlist_for_lower()
model_ret = (self).forward_common_lower(extended_coord, extended_atype, nlist, mapping, fparam, aparam, do_atomic_virial, comm_dict, _5, )
~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_6 = (self).get_fitting_net()
model_predict = annotate(Dict[str, Tensor], {})
File "code/torch/deepmd/pt/model/model/ener_model.py", line 213, in forward_common_lower
cc_ext, _36, fp, ap, input_prec, = _35
atomic_model = self.atomic_model
atomic_ret = (atomic_model).forward_common_atomic(cc_ext, extended_atype, nlist0, mapping, fp, ap, comm_dict, )
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
_37 = (self).atomic_output_def()
training = self.training
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 50, in forward_common_atomic
ext_atom_mask = (self).make_atom_mask(extended_atype, )
_3 = torch.where(ext_atom_mask, extended_atype, 0)
ret_dict = (self).forward_atomic(extended_coord, _3, nlist, mapping, fparam, aparam, comm_dict, )
~~~~~~~~~~~~~~~~~~~~ <--- HERE
ret_dict0 = (self).apply_out_stat(ret_dict, atype, )
_4 = torch.slice(torch.slice(ext_atom_mask), 1, None, nloc)
File "code/torch/deepmd/pt/model/atomic_model/energy_atomic_model.py", line 93, in forward_atomic
pass
descriptor = self.descriptor
_16 = (descriptor).forward(extended_coord, extended_atype, nlist, mapping, comm_dict, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
descriptor0, rot_mat, g2, h2, sw, = _16
fitting_net = self.fitting_net
File "code/torch/deepmd/pt/model/descriptor/dpa2.py", line 98, in forward
repformers1 = self.repformers
_17 = nlist_dict[_1(_16, (repformers1).get_nsel(), )]
_18 = (repformers).forward(_17, extended_coord, extended_atype, g13, mapping0, comm_dict0, )
~~~~~~~~~~~~~~~~~~~ <--- HERE
g14, g2, h2, rot_mat, sw, = _18
concat_output_tebd = self.concat_output_tebd
File "code/torch/deepmd/pt/model/descriptor/repformers.py", line 364, in forward
_65 = "border_op is not available since customized PyTorch OP library is not built when freezing the model."
_66 = uninitialized(Tensor)
ops.prim.RaiseException(_65, "builtins.NotImplementedError")

return _66

Traceback of TorchScript, original code (most recent call last):
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/model/ener_model.py", line 109, in forward_lower
      comm_dict: Optional[Dict[str, torch.Tensor]] = None,
  ):
      model_ret = self.forward_common_lower(
                  ~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
          extended_coord,
          extended_atype,
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/model/make_model.py", line 261, in forward_common_lower
          )
          del extended_coord, fparam, aparam
          atomic_ret = self.atomic_model.forward_common_atomic(
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
              cc_ext,
              extended_atype,
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/base_atomic_model.py", line 242, in forward_common_atomic
  
      ext_atom_mask = self.make_atom_mask(extended_atype)
      ret_dict = self.forward_atomic(
                 ~~~~~~~~~~~~~~~~~~~ <--- HERE
          extended_coord,
          torch.where(ext_atom_mask, extended_atype, 0),
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 189, in forward_atomic
      if self.do_grad_r() or self.do_grad_c():
          extended_coord.requires_grad_(True)
      descriptor, rot_mat, g2, h2, sw = self.descriptor(
                                        ~~~~~~~~~~~~~~~ <--- HERE
          extended_coord,
          extended_atype,
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/descriptor/dpa2.py", line 799, in forward
          g1 = g1_ext
      # repformer
      g1, g2, h2, rot_mat, sw = self.repformers(
                                ~~~~~~~~~~~~~~~ <--- HERE
          nlist_dict[
              get_multiple_nlist_key(
File "/opt/deepmd-kit-3.0.0b3/envs/deepmd-dev/lib/python3.10/site-packages/deepmd/pt/model/descriptor/repformers.py", line 62, in forward
      argument8,
  ) -> torch.Tensor:
      raise NotImplementedError(
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
          "border_op is not available since customized PyTorch OP library is not built when freezing the model."
          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
      )
builtins.NotImplementedError: border_op is not available since customized PyTorch OP library is not built when freezing the model.
(/home/conda/feedstock_root/build_artifacts/deepmd-kit_1722057353391/work/source/lmp/pair_deepmd.cpp:586)
Last command: run             1000
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
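
For triage, the two most useful parts of a log like the one above are the "Info of deepmd-kit" block and the final exception line of the TorchScript traceback. The following is a minimal sketch of pulling those out of a saved log. `summarize_log` is a hypothetical helper written for this illustration, not part of DeePMD-kit or LAMMPS, and it assumes the log keeps the `key: value` layout shown above.

```python
import re


def summarize_log(log_text: str) -> dict:
    """Extract triage-relevant fields from a LAMMPS + DeePMD-kit log (hypothetical helper)."""
    summary = {}
    # Fields from the "Info of deepmd-kit" block, e.g. "source commit: cbf2de6".
    # The colon directly after the key keeps "source commit at:" from matching.
    for key in ("installed to", "source commit", "build variant"):
        match = re.search(
            rf"^[ \t]*{re.escape(key)}:[ \t]*(\S+)[ \t]*$", log_text, re.MULTILINE
        )
        if match:
            summary[key] = match.group(1)
    # The line that starts with "builtins." names the Python exception and its message.
    match = re.search(r"^builtins\.(\w+): (.+)$", log_text, re.MULTILINE)
    if match:
        summary["exception"] = match.group(1)
        summary["message"] = match.group(2).strip()
    return summary


# A few lines copied from the log above.
sample = """\
installed to: /opt/deepmd-kit-3.0.0b3
source commit: cbf2de6
source commit at: 2024-07-27 05:11:58 +0000
build variant: cuda
builtins.NotImplementedError: border_op is not available since customized PyTorch OP library is not built when freezing the model.
"""

for field, value in summarize_log(sample).items():
    print(f"{field}: {value}")
```

Pasting this short summary at the top of a report, alongside the output of dp --version, makes it easier to see which build produced the error without scrolling through the full traceback.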


### DeePMD-kit Version

DeePMD-kit v3.0.0b4.dev56+g0b72dae3

### Backend and its version

PyTorch v2.4.1+cu121-g38b96d3399a

### Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

_No response_

### Details

[input.zip](https://github.com/user-attachments/files/17110378/input.zip)
@iProzd
Collaborator

iProzd commented Sep 24, 2024

@wujing81 Apologies for the confusion during installation; I faced the same issue while debugging.

The problem arises because DPA2 requires the border_op operator, which belongs to the customized PyTorch OP library and is only built when PyTorch support is enabled during installation. You can enable it with the following command:
DP_VARIANT=cuda DP_ENABLE_PYTORCH=1 pip install git+https://github.com/deepmodeling/deepmd-kit.git@devel
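
The two variables only take effect if they are set in the environment of the pip process that builds the package. When the install is driven from a script, one way to make that explicit is sketched below. `build_install` and `run_install` are hypothetical helper names; only the variable names, their values, and the repository URL come from the command above. Using `sys.executable -m pip` is a choice made here so that the package lands in the interpreter that runs the script, which matters when several environments coexist.

```python
import os
import subprocess
import sys

# Repository URL and branch taken from the command above.
REPO = "git+https://github.com/deepmodeling/deepmd-kit.git@devel"


def build_install(variant: str = "cuda", enable_pytorch: bool = True):
    """Return (env, cmd) for a source install with the build options set."""
    env = dict(os.environ)  # copy, so the current process environment is untouched
    env["DP_VARIANT"] = variant
    if enable_pytorch:
        env["DP_ENABLE_PYTORCH"] = "1"
    else:
        env.pop("DP_ENABLE_PYTORCH", None)
    cmd = [sys.executable, "-m", "pip", "install", REPO]
    return env, cmd


def run_install() -> None:
    """Run the install; call this explicitly, since it builds from source."""
    env, cmd = build_install()
    subprocess.run(cmd, env=env, check=True)


# Show what would be executed without starting the build.
env, cmd = build_install()
print("DP_VARIANT=" + env["DP_VARIANT"], "DP_ENABLE_PYTORCH=" + env["DP_ENABLE_PYTORCH"])
print(" ".join(cmd))
```

The error message says the OP library was not built "when freezing the model", which suggests the model has to be frozen again with dp --pt freeze after reinstalling. That is an inference from the message, not something confirmed in this thread.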

But why is this option False by default? @njzjz @CaRoLZhangxy As I understand it, anyone who wants to use a DPA2 model with LAMMPS needs this option. Also, the documentation for this option may not be clear enough: https://docs.deepmodeling.com/projects/deepmd/en/latest/install/install-from-source.html#envvar-DP_ENABLE_PYTORCH

@njzjz njzjz added Docs and removed wontfix labels Sep 24, 2024
@njzjz
Member

njzjz commented Sep 24, 2024

> But why is this option False by default?

xref: #3891 (comment)

I am not going to change the default option to True until PyTorch fixes pytorch/pytorch#78530.

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Oct 1, 2024
Fix deepmodeling#4161.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>