Merge pull request #163 from amcadmus/master
Merge devel into master
amcadmus authored Jun 11, 2021
2 parents b6247d5 + d044d90 commit 59dfe7d
Showing 136 changed files with 59,935 additions and 108 deletions.
16 changes: 11 additions & 5 deletions .github/workflows/test.yml
@@ -13,12 +13,18 @@ jobs:

     steps:
     - uses: actions/checkout@v2
+    # set up conda
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v2
+      uses: conda-incubator/setup-miniconda@v2
       with:
-        python-version: ${{ matrix.python-version }}
+        auto-activate-base: true
+        activate-environment: ""
+    # install rdkit and openbabel
+    - name: Install rdkit
+      run: conda create -c conda-forge -n my-rdkit-env python=${{ matrix.python-version }} rdkit openbabel;
     - name: Install dependencies
-      run: pip install . coverage codecov
+      run: source $CONDA/bin/activate my-rdkit-env && pip install .[amber] coverage codecov
     - name: Test
-      run: cd tests && coverage run --source=../dpdata -m unittest && cd .. && coverage combine tests/.coverage && coverage report
-    - run: codecov
+      run: source $CONDA/bin/activate my-rdkit-env && cd tests && coverage run --source=../dpdata -m unittest && cd .. && coverage combine tests/.coverage && coverage report
+    - name: Run codecov
+      run: source $CONDA/bin/activate my-rdkit-env && codecov
1 change: 1 addition & 0 deletions .gitignore
@@ -20,4 +20,5 @@ dist
dpdata.egg-info
_version.py
!tests/cp2k/aimd/cp2k.log
!tests/cp2k/restart_aimd/ch4.log
__pycache__
47 changes: 46 additions & 1 deletion README.md
@@ -78,7 +78,8 @@ The `System` or `LabeledSystem` can be constructed from the following file forma
 | PWmat | movement | True | True | LabeledSystem | 'pwmat/movement' |
 | PWmat | OUT.MLMD | True | True | LabeledSystem | 'pwmat/out.mlmd' |
 | Amber | multi | True | True | LabeledSystem | 'amber/md' |
-| Gromacs | gro | False | False | System | 'gromacs/gro' |
+| Amber/sqm | sqm.out | False | False | System | 'sqm/out' |
+| Gromacs | gro | True | False | System | 'gromacs/gro' |


The class `dpdata.MultiSystems` can read data from a directory that may contain many files of different systems, or from a single xyz file that contains different systems.
@@ -206,4 +207,48 @@ s.replace('Hf', 'Zr', 8)
s.to_vasp_poscar('POSCAR.P42nmc.replace')
```

# BondOrderSystem
A new class `BondOrderSystem`, which inherits from `System`, is introduced in dpdata. It stores chemical bond and formal charge information (in `BondOrderSystem.data['bonds']` and `BondOrderSystem.data['formal_charges']`). Currently `BondOrderSystem` can only read the .mol and .sdf formats because it depends on rdkit, so rdkit must be installed to use this class. Other formats, such as pdb, must first be converted to .mol/.sdf (for example with Open Babel).
```python
import dpdata
system_1 = dpdata.BondOrderSystem("tests/bond_order/CH3OH.mol", fmt="mol") # read from .mol file
system_2 = dpdata.BondOrderSystem("tests/bond_order/methane.sdf", fmt="sdf") # read from .sdf file
```
In an sdf file, all molecules must share the same topology (i.e. they must be conformers of the same molecule).
`BondOrderSystem` can also be initialized directly from a `rdkit.Chem.rdchem.Mol` object.
```python
from rdkit import Chem
from rdkit.Chem import AllChem
import dpdata

mol = Chem.MolFromSmiles("CC")
mol = Chem.AddHs(mol)
AllChem.EmbedMultipleConfs(mol, 10)
system = dpdata.BondOrderSystem(rdkit_mol=mol)
```

## Bond Order Assignment
The `BondOrderSystem` implements a more robust sanitization procedure for rdkit Mol objects, defined in `dpdata.rdkit.sanitize.Sanitizer`. This class defines three levels of sanitization: low, medium and high (the default is medium).
+ low: use the `rdkit.Chem.SanitizeMol()` function to sanitize the molecule.
+ medium: before calling rdkit, the program first assigns a formal charge to each atom to avoid inappropriate-valence exceptions. However, this mode requires the bond order information in the given molecule to be correct.
+ high: the program tries to fix inappropriate bond orders in aromatic heterocycles and in phosphate, sulfate, carboxyl, nitro, nitrine and guanidine groups. If this procedure fails to sanitize the given molecule, the program then tries to call `obabel` to pre-process the mol and repeats the sanitization procedure. **If you want to use this level of sanitization, make sure `obabel` is installed in the environment.**

According to our tests, this sanitization procedure successfully reads 4852 small molecules in the PDBBind-refined-set. Note that the number of explicit hydrogens in the molecule file (mol/sdf) has to be correct, so we recommend pre-processing the file with `obabel xxx -O xxx -h`. This hydrogen-adding step is not implemented in dpdata because we cannot ensure its correctness.

```python
import glob

import dpdata

for sdf_file in glob.glob("bond_order/refined-set-ligands/obabel/*sdf"):
    syst = dpdata.BondOrderSystem(sdf_file, sanitize_level='high', verbose=False)
```
## Formal Charge Assignment
`BondOrderSystem` implements a method that assigns a formal charge to each atom based on the 8-electron rule (see below). Note that it only supports elements common in biological systems: B, C, N, O, P, S, As.
```python
import dpdata

syst = dpdata.BondOrderSystem("tests/bond_order/CH3NH3+.mol", fmt='mol')
print(syst.get_formal_charges()) # return the formal charge on each atom
print(syst.get_charge()) # return the total charge of the system
```

If a valence of 3 is detected on a carbon atom, its formal charge is set to -1, because in most cases (alkynyl anion, isonitrile, cyclopentadienyl anion) the formal charge on a 3-valent carbon is -1, which is also consistent with the 8-electron rule.
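
As a rough illustration of the 8-electron rule (a simplified sketch, not dpdata's actual implementation), the snippet below computes a formal charge from an atom's total bond order, assuming the atom completes an octet. The function `octet_formal_charge` and the table `VALENCE_ELECTRONS` are hypothetical names, and only C, N and O are covered because B, P and S often deviate from a strict octet.

```python
# Simplified sketch of the 8-electron rule; not dpdata's implementation.
# Valence electrons of the neutral atom (hypothetical lookup table).
VALENCE_ELECTRONS = {"C": 4, "N": 5, "O": 6}


def octet_formal_charge(element: str, total_bond_order: int) -> int:
    """Formal charge of an atom that completes an octet.

    With total bond order v, the atom shares 2*v electrons and keeps
    8 - 2*v non-bonding electrons, so
    charge = valence_electrons - (8 - 2*v) - v = valence_electrons - 8 + v.
    """
    return VALENCE_ELECTRONS[element] - 8 + total_bond_order


print(octet_formal_charge("N", 4))  # nitrogen in CH3NH3+   -> 1
print(octet_formal_charge("C", 3))  # 3-valent carbon       -> -1
print(octet_formal_charge("O", 2))  # oxygen in methanol    -> 0
```

The 3-valent carbon case reproduces the -1 charge described above.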
12 changes: 12 additions & 0 deletions dpdata/__init__.py
@@ -9,3 +9,15 @@
    from ._version import version as __version__
except ImportError:
    from .__about__ import __version__

# BondOrder System has dependency on rdkit
try:
    import rdkit
    USE_RDKIT = True
except ModuleNotFoundError:
    USE_RDKIT = False

if USE_RDKIT:
    from .bond_order_system import BondOrderSystem

180 changes: 180 additions & 0 deletions dpdata/abacus/scf.py
@@ -0,0 +1,180 @@
import os,sys
import numpy as np

bohr2ang = 0.5291770
ry2ev = 13.605698
kbar2evperang3 = 1e3 / 1.6021892e6
# The consts are cited from $ABACUS_ROOT/source/src_global/constant.h


def get_block(lines, keyword, skip=0, nlines=None):
    """Return the non-empty lines following the first line that contains `keyword`."""
    ret = []
    found = False
    if not nlines:
        nlines = 1e6
    for idx, ii in enumerate(lines):
        if keyword in ii:
            found = True
            blk_idx = idx + 1 + skip
            line_idx = 0
            # skip blank lines before the block; check the bound before indexing
            while blk_idx < len(lines) and len(lines[blk_idx]) == 0:
                blk_idx += 1
            # collect lines until a blank line, the line limit, or the end of input
            while blk_idx < len(lines) and len(lines[blk_idx]) != 0 and line_idx < nlines:
                ret.append(lines[blk_idx])
                blk_idx += 1
                line_idx += 1
            break
    if not found:
        raise RuntimeError("The keyword %s is not found in the script." % keyword)
    return ret

def get_geometry_in(fname, inlines):
    geometry_path_in = os.path.join(fname, "STRU")
    for line in inlines:
        if "atom_file" in line and "atom_file"==line.split()[0]:
            atom_file = line.split()[1]
            geometry_path_in = os.path.join(fname, atom_file)
            break
    return geometry_path_in

def get_path_out(fname, inlines):
    path_out = os.path.join(fname, "OUT.ABACUS/running_scf.log")
    for line in inlines:
        if "suffix" in line and "suffix"==line.split()[0]:
            suffix = line.split()[1]
            path_out = os.path.join(fname, "OUT.%s/running_scf.log" % suffix)
            break
    return path_out

def get_cell(geometry_inlines):
    cell_lines = get_block(geometry_inlines, "LATTICE_VECTORS", skip = 0, nlines = 3)
    celldm_lines = get_block(geometry_inlines, "LATTICE_CONSTANT", skip=0, nlines=1)

    celldm = float(celldm_lines[0].split()[0]) * bohr2ang # lattice const is in Bohr
    cell = []
    for ii in range(3):
        cell.append([float(jj) for jj in cell_lines[ii].split()[0:3]])
    cell = celldm*np.array(cell)
    return celldm, cell

def get_coords(celldm, cell, geometry_inlines, inlines):
    coords_lines = get_block(geometry_inlines, "ATOMIC_POSITIONS", skip=0)
    # assuming that ATOMIC_POSITIONS is at the bottom of the STRU file
    coord_type = coords_lines[0].split()[0].lower() # cartesian or direct
    atom_names = [] # element abbr in periodic table
    atom_types = [] # index of atom_names of each atom in the geometry
    atom_numbs = [] # number of atoms for each element
    coords = [] # coordinates of atoms
    ntype = 0
    for line in inlines:
        if "ntype" in line and "ntype"==line.split()[0]:
            ntype = int(line.split()[1])
            break
    if ntype <= 0:
        raise RuntimeError('ntype cannot be found in INPUT file.')
    line_idx = 1 # starting line of first element
    for it in range(ntype):
        atom_names.append(coords_lines[line_idx].split()[0])
        line_idx+=2
        atom_numbs.append(int(coords_lines[line_idx].split()[0]))
        line_idx+=1
        for iline in range(atom_numbs[it]):
            xyz = np.array([float(xx) for xx in coords_lines[line_idx].split()[0:3]])
            if coord_type == "cartesian":
                xyz = xyz*celldm
            elif coord_type == "direct":
                tmp = np.matmul(xyz, cell)
                xyz = tmp
            else:
                print("coord_type = %s" % coord_type)
                raise RuntimeError("Input coordination type is invalid.\n Only direct and cartesian are accepted.")
            coords.append(xyz)
            atom_types.append(it)
            line_idx += 1
    coords = np.array(coords) # need transformation!!!
    atom_types = np.array(atom_types)
    return atom_names, atom_numbs, atom_types, coords

def get_energy(outlines):
    Etot = None
    for line in outlines:
        if "!FINAL_ETOT_IS" in line:
            Etot = float(line.split()[1]) # in eV
            break
    # compare against None so that a legitimate energy of 0.0 is not treated as missing
    if Etot is None:
        for line in outlines:
            if "convergence has NOT been achieved!" in line:
                raise RuntimeError("convergence has NOT been achieved in scf!")
        raise RuntimeError("Final total energy cannot be found in output. Unknown problem.")
    return Etot

def get_force (outlines):
    force = []
    force_inlines = get_block (outlines, "TOTAL-FORCE (eV/Angstrom)", skip = 4)
    for line in force_inlines:
        force.append([float(f) for f in line.split()[1:4]])
    force = np.array(force)
    return force

def get_stress(outlines):
    stress = []
    stress_inlines = get_block(outlines, "TOTAL-STRESS (KBAR)", skip = 3)
    for line in stress_inlines:
        stress.append([float(f) for f in line.split()])
    stress = np.array(stress) * kbar2evperang3
    return stress



def get_frame (fname):
    if type(fname) == str:
        # if the input parameter is only one string, it is assumed that it is the
        # base directory containing INPUT file;
        path_in = os.path.join(fname, "INPUT")
    else:
        raise RuntimeError('invalid input')
    with open(path_in, 'r') as fp:
        inlines = fp.read().split('\n')

    geometry_path_in = get_geometry_in(fname, inlines)
    path_out = get_path_out(fname, inlines)

    with open(geometry_path_in, 'r') as fp:
        geometry_inlines = fp.read().split('\n')
    with open(path_out, 'r') as fp:
        outlines = fp.read().split('\n')

    celldm, cell = get_cell(geometry_inlines)
    atom_names, natoms, types, coords = get_coords(celldm, cell, geometry_inlines, inlines)

    energy = get_energy(outlines)
    force = get_force (outlines)
    stress = get_stress(outlines) * np.linalg.det(cell)

    data = {}
    data['atom_names'] = atom_names
    data['atom_numbs'] = natoms
    data['atom_types'] = types
    data['cells'] = cell[np.newaxis, :, :]
    data['coords'] = coords[np.newaxis, :, :]
    data['energies'] = np.array(energy)[np.newaxis]
    data['forces'] = force[np.newaxis, :, :]
    data['virials'] = stress[np.newaxis, :, :]
    data['orig'] = np.zeros(3)
    # print("atom_names = ", data['atom_names'])
    # print("natoms = ", data['atom_numbs'])
    # print("types = ", data['atom_types'])
    # print("cells = ", data['cells'])
    # print("coords = ", data['coords'])
    # print("energy = ", data['energies'])
    # print("force = ", data['forces'])
    # print("virial = ", data['virials'])
    return data

if __name__ == "__main__":
    path = "/home/lrx/work/12_ABACUS_dpgen_interface/dpdata/dpdata/tests/abacus.scf"
    data = get_frame(path)
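
A quick sanity check of the stress conversion in `scf.py`, written as a standalone sketch: it rederives `kbar2evperang3` from SI definitions and scales a stress value by the cell volume, mirroring `get_stress(outlines) * np.linalg.det(cell)` in `get_frame`. The helper names, the cubic cell and the stress value are made up for illustration; the electron-volt value is the one implied by the hard-coded constant.

```python
import math

EV_IN_JOULE = 1.6021892e-19   # value implied by the constant in scf.py
KBAR_IN_PASCAL = 1.0e8        # 1 kbar = 1e8 Pa = 1e8 J/m^3
ANGSTROM3_IN_M3 = 1.0e-30     # 1 Angstrom^3 expressed in m^3


def kbar_to_ev_per_ang3(pressure_kbar):
    """Convert a pressure from kbar to eV/Angstrom^3."""
    return pressure_kbar * KBAR_IN_PASCAL * ANGSTROM3_IN_M3 / EV_IN_JOULE


def det3(m):
    """Signed determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# The SI derivation reproduces the hard-coded factor 1e3 / 1.6021892e6.
assert math.isclose(kbar_to_ev_per_ang3(1.0), 1e3 / 1.6021892e6, rel_tol=1e-12)

cell = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]  # hypothetical cell, Angstrom
stress_kbar = 10.0                                           # hypothetical stress component
virial_component = kbar_to_ev_per_ang3(stress_kbar) * det3(cell)  # eV
print(virial_component)
```

Note that the determinant is signed, so a left-handed set of lattice vectors would flip the sign of the product.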
1 change: 1 addition & 0 deletions dpdata/amber/__init__.py
@@ -0,0 +1 @@

35 changes: 35 additions & 0 deletions dpdata/amber/mask.py
@@ -0,0 +1,35 @@
"""Amber mask"""
try:
    import parmed
except ImportError:
    pass

def pick_by_amber_mask(param, maskstr, coords=None):
    """Pick atoms by amber masks
    Parameters
    ----------
    param: str or parmed.Structure
        filename of Amber param file or parmed.Structure
    maskstr: str
        Amber masks
    coords: np.ndarray (optional)
        frame coordinates, shape: N*3
    """
    parm = load_param_file(param)
    if coords is not None:
        parm.initialize_topology(xyz=coords)
    sele = []
    if len(maskstr) > 0:
        newmaskstr = maskstr.replace("@0", "!@*")
        sele = [parm.atoms[i].idx for i in parmed.amber.mask.AmberMask(
            parm, newmaskstr).Selected()]
    return sele

def load_param_file(param_file):
    if isinstance(param_file, str):
        return parmed.load_file(param_file)
    elif isinstance(param_file, parmed.Structure):
        return param_file
    else:
        raise RuntimeError("Unsupported structure")