Merge pull request #163 from amcadmus/master

Merge devel into master
deepmodeling · Jun 11, 2021 · commit 59dfe7d
2 parents b6247d5 + d044d90

Showing 136 changed files with 59,935 additions and 108 deletions.
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
@@ -13,12 +13,18 @@ jobs:
 
     steps:
     - uses: actions/checkout@v2
+    # set up conda
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v2
+      uses: conda-incubator/setup-miniconda@v2
       with:
-        python-version: ${{ matrix.python-version }}
+        auto-activate-base: true
+        activate-environment: ""
+    # install rdkit and openbabel
+    - name: Install rdkit
+      run: conda create -c conda-forge -n my-rdkit-env python=${{ matrix.python-version }} rdkit openbabel;
     - name: Install dependencies
-      run: pip install . coverage codecov
+      run: source $CONDA/bin/activate my-rdkit-env && pip install .[amber] coverage codecov
     - name: Test
-      run: cd tests && coverage run --source=../dpdata -m unittest && cd .. && coverage combine tests/.coverage && coverage report
-    - run: codecov
+      run: source $CONDA/bin/activate my-rdkit-env && cd tests && coverage run --source=../dpdata -m unittest && cd .. && coverage combine tests/.coverage && coverage report
+    - name: Run codecov
+      run: source $CONDA/bin/activate my-rdkit-env && codecov
diff --git a/.gitignore b/.gitignore
@@ -20,4 +20,5 @@ dist
 dpdata.egg-info
 _version.py
 !tests/cp2k/aimd/cp2k.log
+!tests/cp2k/restart_aimd/ch4.log
 __pycache__
diff --git a/README.md b/README.md
@@ -78,7 +78,8 @@ The `System` or `LabeledSystem` can be constructed from the following file forma
 | PWmat   | movement    | True         | True    | LabeledSystem | 'pwmat/movement'     |
 | PWmat   | OUT.MLMD    | True         | True    | LabeledSystem | 'pwmat/out.mlmd'     |
 | Amber   | multi       | True         | True    | LabeledSystem | 'amber/md'           |
-| Gromacs | gro         | False        | False   | System        | 'gromacs/gro'        |
+| Amber/sqm | sqm.out   | False        | False   | System        | 'sqm/out'            |
+| Gromacs | gro         | True         | False   | System        | 'gromacs/gro'        |
 
 
 The class `dpdata.MultiSystems` can read data from a directory that may contain many files of different systems, or from a single xyz file that contains different systems.
@@ -206,4 +207,48 @@ s.replace('Hf', 'Zr', 8)
 s.to_vasp_poscar('POSCAR.P42nmc.replace')
 ```
 
+# BondOrderSystem
+A new class `BondOrderSystem`, which inherits from `System`, is introduced in dpdata. It stores chemical bond and formal charge information in `BondOrderSystem.data['bonds']` and `BondOrderSystem.data['formal_charges']`. Currently `BondOrderSystem` can only read .mol and .sdf files, because it depends on rdkit (so rdkit must be installed to use this class). Other formats, such as pdb, must first be converted to .mol/.sdf, for example with Open Babel.
+```python
+import dpdata
+system_1 = dpdata.BondOrderSystem("tests/bond_order/CH3OH.mol", fmt="mol") # read from .mol file
+system_2 = dpdata.BondOrderSystem("tests/bond_order/methane.sdf", fmt="sdf") # read from .sdf file
+```
+In an sdf file, all molecules must have the same topology (i.e. they must be conformers of the same molecule).
+`BondOrderSystem` can also be initialized directly from a `rdkit.Chem.rdchem.Mol` object.
+```python
+from rdkit import Chem
+from rdkit.Chem import AllChem
+import dpdata
+
+mol = Chem.MolFromSmiles("CC")
+mol = Chem.AddHs(mol)
+AllChem.EmbedMultipleConfs(mol, 10)
+system = dpdata.BondOrderSystem(rdkit_mol=mol)
+```
+
+## Bond Order Assignment
+`BondOrderSystem` implements a more robust sanitization procedure for rdkit Mol objects, defined in `dpdata.rdkit.sanitize.Sanitizer`. This class defines three levels of sanitization: low, medium and high (the default is medium).
++ low: use the `rdkit.Chem.SanitizeMol()` function to sanitize the molecule.
++ medium: before calling rdkit, the program first assigns the formal charge of each atom to avoid inappropriate-valence exceptions. This mode requires the bond orders in the given molecule to be correct.
++ high: the program tries to fix inappropriate bond orders in aromatic heterocycles and in phosphate, sulfate, carboxyl, nitro, nitrine and guanidine groups. If this fails to sanitize the molecule, the program then calls `obabel` to pre-process the mol and repeats the sanitization procedure. **That is to say, if you want to use this level of sanitization, please ensure `obabel` is installed in the environment.**
+According to our tests, this sanitization procedure can successfully read 4852 small molecules in the PDBBind-refined-set. Note that the number of explicit hydrogens in the molecule file (mol/sdf) has to be correct. We therefore recommend using
+ `obabel xxx -O xxx -h` to pre-process the file. This hydrogen-adding step is not implemented in dpdata because we cannot ensure its correctness.
+
+```python
+import glob
+import dpdata
+for sdf_file in glob.glob("bond_order/refined-set-ligands/obabel/*sdf"):
+    syst = dpdata.BondOrderSystem(sdf_file, sanitize_level='high', verbose=False)
+```
+## Formal Charge Assignment
+`BondOrderSystem` implements a method that assigns a formal charge to each atom based on the 8-electron rule (see below). Note that it only supports elements common in biological systems: B, C, N, O, P, S, As.
+```python
+import dpdata
+
+syst = dpdata.BondOrderSystem("tests/bond_order/CH3NH3+.mol", fmt='mol')
+print(syst.get_formal_charges()) # return the formal charge on each atom
+print(syst.get_charge()) # return the total charge of the system
+```
 
+If a valence of 3 is detected on carbon, its formal charge is set to -1, because in most cases (alkynyl anion, isonitrile, cyclopentadienyl anion) the formal charge on a 3-valent carbon is -1. This is also consistent with the 8-electron rule.
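
The formal-charge rule described in the README hunk above can be illustrated with a short sketch. It is not the dpdata implementation: the function name `octet_formal_charge` and the table `VALENCE_ELECTRONS` are hypothetical, and the sketch assumes every atom has a complete octet, so it covers only C, N and O (it would be wrong for electron-deficient boron or hypervalent P and S). Under that assumption an atom with V valence electrons and total bond order B keeps 8 - 2B nonbonding electrons, which gives a formal charge of V - (8 - 2B) - B = V + B - 8.

```python
# Hypothetical helper, not part of dpdata: formal charge under a strict octet assumption.
VALENCE_ELECTRONS = {"C": 4, "N": 5, "O": 6}


def octet_formal_charge(element, total_bond_order):
    """Return V + B - 8 for an atom assumed to have a complete octet."""
    if element not in VALENCE_ELECTRONS:
        raise ValueError(f"element {element!r} is outside the scope of this sketch")
    return VALENCE_ELECTRONS[element] + total_bond_order - 8


# 3-valent carbon (e.g. in an alkynyl anion), the case discussed in the README
print(octet_formal_charge("C", 3))
# 4-valent nitrogen, as in CH3NH3+
print(octet_formal_charge("N", 4))
```

For a 3-valent carbon the sketch gives 4 + 3 - 8 = -1, which agrees with the value the README says is assigned.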
diff --git a/dpdata/__init__.py b/dpdata/__init__.py
@@ -9,3 +9,15 @@
     from ._version import version as __version__
 except ImportError:
     from .__about__ import __version__
+
+# BondOrderSystem depends on rdkit
+try:
+    import rdkit
+    USE_RDKIT = True
+except ModuleNotFoundError:
+    USE_RDKIT = False
+
+if USE_RDKIT:
+    from .bond_order_system import BondOrderSystem
+
+
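
The hunk above guards an optional dependency with `try`/`except`. A minimal alternative sketch of the same idea, assuming only the standard library, checks whether a module can be located without importing it; the helper name `has_module` is hypothetical and not part of dpdata.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# Mirrors the USE_RDKIT flag from the hunk above; the value depends on the environment.
USE_RDKIT = has_module("rdkit")
print("rdkit available:", USE_RDKIT)
```

With the guard as written in the diff, `BondOrderSystem` is only imported into the `dpdata` namespace when rdkit is importable, so calling code can test `dpdata.USE_RDKIT` before using it.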
diff --git a/dpdata/abacus/scf.py b/dpdata/abacus/scf.py
@@ -0,0 +1,180 @@
+import os,sys
+import numpy as np
+
+bohr2ang = 0.5291770
+ry2ev = 13.605698
+kbar2evperang3 = 1e3 / 1.6021892e6
+# The constants are taken from $ABACUS_ROOT/source/src_global/constant.h
+
+
+def get_block (lines, keyword, skip = 0, nlines = None):
+    ret = []
+    found = False
+    if not nlines:
+        nlines = 1e6
+    for idx,ii in enumerate(lines) :
+        if keyword in ii :
+            found = True
+            blk_idx = idx + 1 + skip
+            line_idx = 0
+            while blk_idx < len(lines) and len(lines[blk_idx]) == 0:
+                blk_idx += 1
+            while blk_idx < len(lines) and len(lines[blk_idx]) != 0 and line_idx < nlines:
+                ret.append(lines[blk_idx])
+                blk_idx += 1
+                line_idx += 1
+            break
+    if not found:
+        raise RuntimeError("The keyword %s is not found in the script." %keyword)
+    return ret
+
+def get_geometry_in(fname, inlines):
+    geometry_path_in = os.path.join(fname, "STRU")
+    for line in inlines:
+        if "atom_file" in line and "atom_file"==line.split()[0]:
+           atom_file = line.split()[1]
+           geometry_path_in = os.path.join(fname, atom_file)
+           break
+    return geometry_path_in
+
+def get_path_out(fname, inlines):
+    path_out = os.path.join(fname, "OUT.ABACUS/running_scf.log")
+    for line in inlines:
+        if "suffix" in line and "suffix"==line.split()[0]:
+           suffix = line.split()[1]
+           path_out = os.path.join(fname, "OUT.%s/running_scf.log" % suffix)
+           break
+    return path_out
+
+def get_cell(geometry_inlines):
+    cell_lines = get_block(geometry_inlines, "LATTICE_VECTORS", skip = 0, nlines = 3)
+    celldm_lines = get_block(geometry_inlines, "LATTICE_CONSTANT", skip=0, nlines=1)
+
+    celldm = float(celldm_lines[0].split()[0]) * bohr2ang # lattice const is in Bohr
+    cell = []
+    for ii in range(3):
+        cell.append([float(jj) for jj in cell_lines[ii].split()[0:3]])
+    cell = celldm*np.array(cell)
+    return celldm, cell
+
+def get_coords(celldm, cell, geometry_inlines, inlines):
+    coords_lines = get_block(geometry_inlines, "ATOMIC_POSITIONS", skip=0)
+    # assuming that ATOMIC_POSITIONS is at the bottom of the STRU file
+    coord_type = coords_lines[0].split()[0].lower() # cartesian or direct
+    atom_names = [] # element abbreviations from the periodic table
+    atom_types = [] # index into atom_names for each atom in the geometry
+    atom_numbs = [] # number of atoms of each element
+    coords = [] # coordinates of atoms
+    ntype = 0
+    for line in inlines:
+        if "ntype" in line and "ntype"==line.split()[0]:
+            ntype = int(line.split()[1])
+            break
+    if ntype <= 0:
+        raise RuntimeError('ntype cannot be found in INPUT file.')
+    line_idx = 1 # starting line of first element
+    for it in range(ntype):
+        atom_names.append(coords_lines[line_idx].split()[0])
+        line_idx+=2
+        atom_numbs.append(int(coords_lines[line_idx].split()[0]))
+        line_idx+=1
+        for iline in range(atom_numbs[it]):
+            xyz = np.array([float(xx) for xx in coords_lines[line_idx].split()[0:3]])
+            if coord_type == "cartesian":
+                xyz = xyz*celldm
+            elif coord_type == "direct":
+                tmp = np.matmul(xyz, cell)
+                xyz = tmp
+            else:
+                print("coord_type = %s" % coord_type)
+                raise RuntimeError("Input coordinate type is invalid.\n Only direct and cartesian are accepted.")
+            coords.append(xyz)
+            atom_types.append(it)
+            line_idx += 1
+    coords = np.array(coords) # need transformation!!!
+    atom_types = np.array(atom_types)
+    return atom_names, atom_numbs, atom_types, coords
+
+def get_energy(outlines):
+    Etot = None
+    for line in outlines:
+        if "!FINAL_ETOT_IS" in line:
+            Etot = float(line.split()[1]) # in eV
+            break
+    if Etot is None:
+        not_converge = False
+        for line in outlines:
+            if "convergence has NOT been achieved!" in line:
+                not_converge = True
+                raise RuntimeError("convergence has NOT been achieved in scf!")
+                break
+        if not not_converge:
+            raise RuntimeError("Final total energy cannot be found in output. Unknown problem.")
+    return Etot
+
+def get_force (outlines):
+    force = []
+    force_inlines = get_block (outlines, "TOTAL-FORCE (eV/Angstrom)", skip = 4)
+    for line in force_inlines:
+        force.append([float(f) for f in line.split()[1:4]])
+    force = np.array(force)
+    return force
+
+def get_stress(outlines):
+    stress = []
+    stress_inlines = get_block(outlines, "TOTAL-STRESS (KBAR)", skip = 3)
+    for line in stress_inlines:
+        stress.append([float(f) for f in line.split()])
+    stress = np.array(stress) * kbar2evperang3
+    return stress
+
+
+
+def get_frame (fname):
+    if isinstance(fname, str):
+        # if the input parameter is only one string, it is assumed that it is the 
+        # base directory containing INPUT file;
+        path_in = os.path.join(fname, "INPUT")
+    else:
+        raise RuntimeError('invalid input')    
+    with open(path_in, 'r') as fp:
+        inlines = fp.read().split('\n')
+
+    geometry_path_in = get_geometry_in(fname, inlines) 
+    path_out = get_path_out(fname, inlines) 
+
+    with open(geometry_path_in, 'r') as fp:
+        geometry_inlines = fp.read().split('\n')
+    with open(path_out, 'r') as fp:
+        outlines = fp.read().split('\n')
+
+    celldm, cell = get_cell(geometry_inlines) 
+    atom_names, natoms, types, coords = get_coords(celldm, cell, geometry_inlines, inlines) 
+
+    energy = get_energy(outlines) 
+    force = get_force (outlines) 
+    stress = get_stress(outlines) * np.linalg.det(cell) 
+
+    data = {}
+    data['atom_names'] = atom_names
+    data['atom_numbs'] = natoms
+    data['atom_types'] = types
+    data['cells'] = cell[np.newaxis, :, :]
+    data['coords'] = coords[np.newaxis, :, :]
+    data['energies'] = np.array(energy)[np.newaxis]
+    data['forces'] = force[np.newaxis, :, :]
+    data['virials'] = stress[np.newaxis, :, :]
+    data['orig'] = np.zeros(3)
+    # print("atom_names = ", data['atom_names'])
+    # print("natoms = ", data['atom_numbs'])
+    # print("types = ", data['atom_types'])
+    # print("cells = ", data['cells'])
+    # print("coords = ", data['coords'])
+    # print("energy = ", data['energies'])
+    # print("force = ", data['forces'])
+    # print("virial = ", data['virials'])
+    return data
+
+if __name__ == "__main__":
+    path = "/home/lrx/work/12_ABACUS_dpgen_interface/dpdata/dpdata/tests/abacus.scf"
+    data = get_frame(path)
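
The keyword-block extraction that `get_block` performs can be sketched in isolation. The version below is a variant written for illustration, not the code in the diff: it checks the index against `len(lines)` before reading a line, and it treats whitespace-only lines as blank. The toy input is made up to match the shape `get_cell` expects (a keyword line followed by its values) and is not taken from a real ABACUS file; the Bohr-to-Angstrom conversion is left out.

```python
def get_block(lines, keyword, skip=0, nlines=None):
    """Return the non-blank lines that follow the first line containing keyword."""
    for idx, line in enumerate(lines):
        if keyword not in line:
            continue
        start = idx + 1 + skip
        # Skip blank lines between the keyword and the start of the block.
        while start < len(lines) and not lines[start].strip():
            start += 1
        block = []
        # Collect lines until a blank line, the line limit, or the end of input.
        while (start < len(lines) and lines[start].strip()
               and (nlines is None or len(block) < nlines)):
            block.append(lines[start])
            start += 1
        return block
    raise RuntimeError("The keyword %s is not found." % keyword)


# Made-up input; the numbers are illustrative only.
toy = """LATTICE_CONSTANT
2.0

LATTICE_VECTORS
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
""".split("\n")

scale = float(get_block(toy, "LATTICE_CONSTANT", nlines=1)[0].split()[0])
vectors = [[scale * float(x) for x in row.split()[:3]]
           for row in get_block(toy, "LATTICE_VECTORS", nlines=3)]
print(vectors)
```

Checking the index first means a keyword on the last line of the input yields an empty block instead of an `IndexError`.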
diff --git a/dpdata/amber/__init__.py b/dpdata/amber/__init__.py
@@ -0,0 +1 @@
+
diff --git a/dpdata/amber/mask.py b/dpdata/amber/mask.py
@@ -0,0 +1,35 @@
+"""Amber mask"""
+try:
+    import parmed
+except ImportError:
+    pass
+
+def pick_by_amber_mask(param, maskstr, coords=None):
+    """Pick atoms by amber masks
+    
+    Parameters
+    ----------
+    param: str or parmed.Structure
+      filename of Amber param file or parmed.Structure
+    maskstr: str
+      Amber masks
+    coords: np.ndarray (optional)
+      frame coordinates, shape: N*3
+    """
+    parm = load_param_file(param)
+    if coords is not None:
+        parm.initialize_topology(xyz=coords)
+    sele = []
+    if len(maskstr) > 0:
+        newmaskstr = maskstr.replace("@0", "!@*")
+        sele = [parm.atoms[i].idx for i in parmed.amber.mask.AmberMask(
+            parm, newmaskstr).Selected()]
+    return sele
+
+def load_param_file(param_file):
+    if isinstance(param_file, str):
+        return parmed.load_file(param_file)
+    elif isinstance(param_file, parmed.Structure):
+        return param_file
+    else:
+        raise RuntimeError("Unsupported structure")
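
The only string handling in `pick_by_amber_mask` is the rewrite of `@0` to `!@*` before the mask is passed to parmed. A minimal sketch of that step on its own follows; the helper name `normalize_mask` is hypothetical, and reading `@0` as "select no atoms" is an interpretation of the code, on the basis that `@*` matches every atom and `!` negates a selection.

```python
def normalize_mask(maskstr):
    """Rewrite '@0' to '!@*', mirroring the replacement in pick_by_amber_mask."""
    return maskstr.replace("@0", "!@*")


print(normalize_mask("@0"))
print(normalize_mask(":1-3@CA"))
```

Because this is a plain substring replacement, any mask that happens to contain `@0` as part of a longer token is rewritten as well.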