Polymer builder overhaul #34

Merged 78 commits on Dec 12, 2024
Showing changes from 1 commit
78 commits
1adc0b4
Added typehints for SMILES and SMARTS strings
timbernat Dec 4, 2024
5919913
Exposed Smiles and Smarts typehints at subpackage level
timbernat Dec 4, 2024
d5cd8fc
Added function for uniquifying strings (which can preserve character …
timbernat Dec 4, 2024
9a6278e
Updated DOP calculation to check for and yield correct number of mono…
timbernat Dec 4, 2024
c18d394
Established placeholder file + sample fragments for polymer building …
timbernat Dec 4, 2024
4cc360d
Added internal used-monomer-only MonomerGroup which improves accuracy…
timbernat Dec 4, 2024
173990c
Deprecated DOP alias for n_monomers property
timbernat Dec 4, 2024
8eb7267
Expunged all references to "DOP" in favor of clearer terminology
timbernat Dec 4, 2024
88f8b6b
Deprecated filter_text_by_condition()
timbernat Dec 4, 2024
98cec89
Renamed textual.strsearch to textual.substrings, updated docstring
timbernat Dec 4, 2024
de314e7
Implemented function for repeating a string a (possibly fractional) n…
timbernat Dec 4, 2024
3685b39
Added argument for indicating separator between string repeats
timbernat Dec 4, 2024
497791e
Renamed uniquify_str() to unique_string()
timbernat Dec 4, 2024
3c7448e
Renamed "join_indicator" to "joiner" for brevity
timbernat Dec 4, 2024
13c9db7
Fixed bug with parenthesization vs tuplification
timbernat Dec 4, 2024
d3c8be7
Wrote unit tests for textual.substrings
timbernat Dec 4, 2024
487ae6a
Delayed monomer linearity check to only be on the monomer fragments s…
timbernat Dec 4, 2024
ddc20bc
Added range and int typing checks to target_length
timbernat Dec 5, 2024
8714eb8
Added option to register residue names when converting a spec SMARTS …
timbernat Dec 5, 2024
0fe10e9
Renamed "resname_repl" to "resname_map" throughout
timbernat Dec 5, 2024
0bd4355
Implemented mBuild Compound to RDKit converter which preserves confor…
timbernat Dec 5, 2024
d947336
Deprecated irrelevant custom Exceptions, pared down use of "Error" su…
timbernat Dec 5, 2024
7d4b5a1
Implemented support for fractional sequence repeats, with informative…
timbernat Dec 5, 2024
3797c64
Added new custom Exception for end-group dominated chains
timbernat Dec 5, 2024
8c96e48
Separated procrustean sequence determination into dedicated helper fu…
timbernat Dec 5, 2024
f7422bb
Switched order of residue name and head/tail identifier in MonomerGro…
timbernat Dec 5, 2024
e2be34d
Added __post_init__ check for listification of bare SMARTS and for SM…
timbernat Dec 5, 2024
56b4144
Added module-level logger
timbernat Dec 5, 2024
d852ed3
Added internal method for producing end groups for linear polymer bui…
timbernat Dec 5, 2024
9903dd1
Changed MonomerGroup.linear_end_groups from property to vanilla metho…
timbernat Dec 5, 2024
70ee787
Deprecated _has_valid_linear_term_orient, included residue name in li…
timbernat Dec 6, 2024
df12d48
Deferred end group determination to internal implementation in Monomer…
timbernat Dec 6, 2024
f1e6925
Enhanced logging of sequence breakdown, unified logging between whole…
timbernat Dec 6, 2024
dc053af
Added custom Exception for missing package dependency which reduces e…
timbernat Dec 6, 2024
d4f6361
Deleted superfluous imports
timbernat Dec 6, 2024
88182e7
Converted polymers.building into a package, split up functionality am…
timbernat Dec 6, 2024
c66a3f2
Fixed missing "raise" keywords and incorrect package checks
timbernat Dec 6, 2024
e696828
Fiddled with MissingPrerequisitePackage error message format
timbernat Dec 6, 2024
31f0675
Added Exception for unexpectedly-empty copolymer sequences
timbernat Dec 6, 2024
79296ae
Added precheck for empty sequence kernel
timbernat Dec 9, 2024
cb2587a
Expanded PROCRUSTEAN sequencing algorithm into dedicated dataclass
timbernat Dec 9, 2024
742522e
Made LinearCopolymerSequencer serializable to/from JSON
timbernat Dec 9, 2024
013e09b
Added RDKit-driven PDB writer for mbuild Compounds
timbernat Dec 9, 2024
fb3d898
Expanded out unit test modules for .polymers.building
timbernat Dec 9, 2024
677fede
Updated description of the "PROCRUSTEAN" acronym
timbernat Dec 9, 2024
870a1eb
Wrote unit tests for copolymer sequencing
timbernat Dec 9, 2024
ce9c55c
Updates SMILES/SMARTS-related type annotations on validation functions
timbernat Dec 9, 2024
75fccee
Moved fragment data directly into code, as opposed to maintaining sep…
timbernat Dec 9, 2024
59f988a
Removed superfluous mBuild imports
timbernat Dec 9, 2024
29d3198
Added devnote to revisit SMARTS-specification auto-cleaning
timbernat Dec 9, 2024
7afceed
Added devnote for spec compliance checker
timbernat Dec 9, 2024
d6d5880
Added MPD-TMC polyamide fragments for examples
timbernat Dec 9, 2024
0e7749f
Added unit tests for MonomerGroup initialization and core properties
timbernat Dec 10, 2024
e33ca3d
Added polyethylene example to test when fewer than the max 2 end grou…
timbernat Dec 10, 2024
0074e4a
Wrote unit test for end group identification
timbernat Dec 10, 2024
f27263f
Expanded syntax and support for addition/validation of new monomer SM…
timbernat Dec 10, 2024
10b8b8c
Added bug note for validation skipping when accessing monomer attribu…
timbernat Dec 10, 2024
30642dc
Added test for degenerate end group autoassignment (i.e. when NO term…
timbernat Dec 10, 2024
bda08d8
Attempted (unsuccessfully) to get __hash__ working for MonomerGroup
timbernat Dec 10, 2024
32f8d53
Wrote unit tests for linear polymer builder
timbernat Dec 10, 2024
4e31488
Fixed indent on openff_topology_to_openmm() arguments
timbernat Dec 10, 2024
051d128
Removed deprecated local TKREGS import
timbernat Dec 10, 2024
e0cdfc5
Moved unitsys outside of omminter to resolve circular import
timbernat Dec 10, 2024
f416a32
Moved sample monomer fragment sets from unit tests to polymerist proper
timbernat Dec 11, 2024
a770906
Corrected typo in end group autogen warning
timbernat Dec 11, 2024
c991c17
Fixed indent on serialize_openmm_pdb() arguments
timbernat Dec 11, 2024
040a718
Fixed accidental duplication of 3-functional TMC monomer fragment
timbernat Dec 11, 2024
bbd1b85
Added new subpackage for molecule file I/O
timbernat Dec 11, 2024
34d1a7b
Froze SerialAtomLabeller dataclass to avoid unintentional label forma…
timbernat Dec 11, 2024
d9afdc4
Switched PDB atom labeller to dependency-injection based model
timbernat Dec 11, 2024
668c3f9
Renamed "chain" to "polymer" where it occurs to avoid confusion with …
timbernat Dec 11, 2024
a5cc442
Added placeholder unit tests for newly-created `molfiles` subpackage
timbernat Dec 11, 2024
4d128e3
Added residue info injection into mbmol_to_openmm_pdb (PDB outputs ar…
timbernat Dec 11, 2024
f1f8039
Renamed "atom_label_size" to "atom_label_length" for clarity
timbernat Dec 12, 2024
370dc3a
Renamed once more to atom_label_width
timbernat Dec 12, 2024
c61f16a
Fixed non-attribute value in atom_label_width Exception message
timbernat Dec 12, 2024
e40600e
Added string type check for atom element symbols
timbernat Dec 12, 2024
5c282d7
Wrote unit tests for molfiles.pdb.SerialAtomLabeller
timbernat Dec 12, 2024
Implemented support for fractional sequence repeats, with informative Exceptions for invalid inputs
timbernat committed Dec 5, 2024
commit 7d4b5a147f8b800ea55558e604829d082c7f6ca0
82 changes: 55 additions & 27 deletions polymerist/polymers/building.py
@@ -15,15 +15,17 @@
from mbuild import Compound
from mbuild.lib.recipes.polymer import Polymer as MBPolymer

from fractions import Fraction
from pathlib import Path
from rdkit import Chem
from collections import Counter

from .exceptions import InsufficientChainLengthError, MorphologyError
from rdkit import Chem

from .exceptions import InsufficientChainLength, PartialBlockSequence, MorphologyError
from .estimation import estimate_n_atoms_linear

from ..genutils.decorators.functional import allow_string_paths
from ..genutils.textual.substrings import unique_string
from ..genutils.textual.substrings import unique_string, repeat_string_to_length

from ..rdutils.bonding.portlib import get_linker_ids
from ..rdutils.bonding.substitution import saturate_ports, hydrogenate_rdmol_ports
@@ -53,7 +55,12 @@ def mbmol_from_mono_rdmol(rdmol : Chem.Mol, resname : Optional[str]=None) -> tup
return mb_compound, linker_ids

@allow_string_paths
def mbmol_to_openmm_pdb(pdb_path : Path, mbmol : Compound, num_atom_digits : int=2, resname_map : dict[str, str]=None) -> None:
def mbmol_to_openmm_pdb(
pdb_path : Path,
mbmol : Compound,
num_atom_digits : int=2,
resname_map : Optional[dict[str, str]]=None,
) -> None:
'''Save an MBuild Compound into an OpenMM-compatible PDB file'''
if resname_map is None: # avoid mutable default
resname_map = {'RES' : 'Pol'}
@@ -116,53 +123,74 @@ def mbmol_to_rdmol(

return rdmol


# LINEAR POLYMER BUILDING
def build_linear_polymer(
monomers : MonomerGroup,
n_monomers : int,
sequence : str='A',
allow_partial_sequences : bool=True,
allow_partial_sequences : bool=False,
add_Hs : bool=False,
energy_minimize : bool=False,
) -> MBPolymer:
'''Accepts a dict of monomer residue names and SMARTS (as one might find in a monomer JSON)
and a degree of polymerization (i.e. chain length in number of monomers)) and returns an mbuild Polymer object'''
# 0) DETERMINE THE ORIENTATION AND NUMBER OF TERMINAL MONOMERS, SUPPLYING THIS IF AN INVALID DEFINITION IS PROVIDED
if monomers.has_valid_linear_term_orient: # DEV: consider moving this logic into MonomerGroup
# 0) DETERMINE THE ORIENTATION AND NUMBER OF TERMINAL MONOMERS, SUPPLYING THIS IF AN INVALID DEFINITION IS PROVIDED - DEV: consider moving this logic into MonomerGroup
if monomers.has_valid_linear_term_orient:
term_orient = monomers.term_orient
LOGGER.info(f'Using pre-defined terminal group orientation {term_orient}')
else:
term_orient = {
resname : orient
for (resname, rdmol), orient in zip(monomers.iter_rdmols(term_only=True), ['head', 'tail']) # will raise StopIteration if fewer
for (resname, rdmol), orient in zip(monomers.iter_rdmols(term_only=True), ['head', 'tail'])
}
LOGGER.warning(f'No valid terminal monomer orientations defined; autogenerated orientations "{term_orient}"; USER SHOULD VERIFY THIS YIELDS A CHEMICALLY-VALID POLYMER!')

# 1) DETERMINE NUMBER OF SEQUENCE REPEATS NEEDED TO MEET TARGET NUMBER OF MONOMER UNITS (IF POSSIBLE)
n_mono_term = len(term_orient) # determine how many terminal monomers are actually present and well-defined
n_mono_middle = n_monomers - n_mono_term # in a linear chain, all monomers are either middle of terminal
# 1) DETERMINE NUMBER OF SEQUENCE REPEATS NEEDED TO MEET TARGET NUMBER OF MONOMER UNITS (IF POSSIBLE) - DEV: consider making a separate function
block_size = len(sequence)
n_mono_term = len(term_orient) # number of terminal monomers actually present and well-defined
n_mono_middle = n_monomers - n_mono_term # number of middle monomers needed to reach target; in a linear chain, all monomers are either middle or terminal
if n_mono_middle < 0:
raise InsufficientChainLength(f'Registered number of terminal monomers exceeds requested chain length ({n_monomers}-mer chain can\'t possibly contain {n_mono_term} terminal monomers)')

n_seq_whole : int # number of full sequence repeats to reach a number of monomers less than or equal to the target
n_symbols_remaining : int # number of any remaining symbols in sequence (i.e. monomers) needed to close the gap to the target (allowed to be 0 if target is a multiple of the sequence length)
n_seq_whole, n_symbols_remaining = divmod(n_mono_middle, block_size)
print(n_seq_whole, n_symbols_remaining)

if n_symbols_remaining != 0: # a whole number of sequence repeats (including possibly 0) plus some fraction of a full block sequence
if not allow_partial_sequences:
raise PartialBlockSequence(
f'Partial polymer block sequence required to meet target number of monomers ("{sequence[:n_symbols_remaining]}" prefix of sequence "{sequence}"). ' \
'If this is acceptable, set "allow_partial_sequences=True" and try calling build routine again'
)
sequence_selected = repeat_string_to_length(sequence, target_length=n_mono_middle, joiner='')
n_seq_repeats = 1 # just repeat the entire mixed-fraction length sequence (no full sequence repeats to exploit)
LOGGER.warning(
f'Target number of monomers is achievable WITH a partial {n_symbols_remaining}/{block_size} sequence repeat; ' \
f'({n_seq_whole}*{block_size} [{sequence}] + {n_symbols_remaining} [{sequence[:n_symbols_remaining]}]) middle monomers + {n_mono_term} terminal monomers = {n_monomers} total monomers'
)
else: # for a purely-whole number of block sequence repeats
if n_seq_whole < 1: # NOTE: if it were up to me, this would be < 0 to allow dimers, but mBuild has forced my hand
raise InsufficientChainLength(
f'{n_monomers}-monomer chain cannot accommodate both {n_mono_term} end groups AND at least 1 middle monomer sequence'
)
sequence_selected = sequence # NOTE: rename here is for clarity, and for consistency with partial sequence case
n_seq_repeats = n_seq_whole
LOGGER.info(
f'Target chain length achievable with {n_seq_repeats} whole block(s) of the sequence "{sequence_selected}"; ' \
f'({n_seq_repeats}*{block_size} [{sequence_selected}]) middle monomers + {n_mono_term} terminal monomers = {n_monomers} total monomers'
)
print(sequence_selected, n_seq_repeats)

if (n_mono_middle % block_size) != 0:
raise ValueError(f'Cannot build a(n) {n_monomers}-monomer chain from any number of {block_size}-monomer blocks and {n_mono_term} end groups')
# NOTE: not explicitly forcing n_seq_reps to catch lingering float input / inexact division errors
n_seq_reps = n_mono_middle // block_size # number of times to repeat the block sequence between end groups to reach the target chain length
if n_seq_reps < 1: # NOTE: if it were up to me, this would be < 0 to allow dimers, but mBuild has forced by hand
raise InsufficientChainLengthError(f'{n_monomers}-monomer chain has few total monomers to accomodate {n_mono_term} end groups AND at least 1 middle monomer sequence')
# TODO: consider adding support for fractional sequence lengths IFF that fraction is a rational number whose denominator divides the sequence length...
# ...for example, could allow 5/2 * 'BACA' to be interpreted as 'BACA|BACA|BA'; 5/3 * 'BACA' would still be invalid though
LOGGER.info(f'Target chain length achievable with {n_seq_reps} block sequence repeat(s) ({n_seq_reps}*{block_size} [{sequence}] middle monomers + {n_mono_term} terminal monomers = {n_monomers} total monomers)')

# 2) REGISTERING MONOMERS TO BE USED FOR CHAIN ASSEMBLY
monomers_selected = MonomerGroup() # used to track and estimate sizes of the monomers being used for building
## 2A) ADD MIDDLE MONOMERS TO CHAIN
chain = MBPolymer()
for (resname, middle_monomer), sequence_key in zip(
for (resname, middle_monomer), symbol in zip(
monomers.iter_rdmols(term_only=False),
unique_string(sequence, preserve_order=True), # only register a new monomer for each appearance of a new indicator in the sequence
unique_string(sequence_selected, preserve_order=True), # only register a new monomer for each appearance of a new indicator in the sequence
): # zip with sequence limits number of middle monomers to length of block sequence
LOGGER.info(f'Registering middle monomer {resname} (block identifier "{sequence_key}")')
LOGGER.info(f'Registering middle monomer {resname} (block identifier "{symbol}")')
mb_monomer, linker_ids = mbmol_from_mono_rdmol(middle_monomer, resname=resname)
chain.add_monomer(compound=mb_monomer, indices=linker_ids)
monomers_selected.monomers[resname] = monomers.monomers[resname]
@@ -173,7 +201,7 @@ def build_linear_polymer(
for resname, rdmol_list in monomers.rdmols(term_only=True).items()
}
for resname, head_or_tail in term_orient.items():
term_monomer = next(term_iters[resname])
term_monomer = next(term_iters[resname]) # will raise StopIteration if the terminal monomer in question is empty
LOGGER.info(f'Registering terminal monomer {resname} (orientation "{head_or_tail}")')
mb_monomer, linker_ids = mbmol_from_mono_rdmol(term_monomer, resname=resname)
chain.add_end_groups(compound=mb_monomer, index=linker_ids.pop(), label=head_or_tail, duplicate=False) # use single linker ID and provided head-tail orientation
@@ -185,7 +213,7 @@ def build_linear_polymer(

n_atoms_est = estimate_n_atoms_linear(monomers_selected, n_monomers) # TODO: create new MonomerGroup with ONLY the registered monomers to guarantee accuracy
LOGGER.info(f'Assembling linear {n_monomers}-mer chain (estimated {n_atoms_est} atoms)')
chain.build(n_seq_reps, sequence=sequence, add_hydrogens=add_Hs) # "-2" is to account for term groups (in mbuild, "n" is the number of times to replicate just the middle monomers)
chain.build(n_seq_repeats, sequence=sequence_selected, add_hydrogens=add_Hs) # "-2" is to account for term groups (in mbuild, "n" is the number of times to replicate just the middle monomers)
for atom in chain.particles():
atom.charge = 0.0 # initialize all atoms as being uncharged (gets rid of pesky blocks of warnings)
LOGGER.info(f'Successfully assembled linear {n_monomers}-mer chain (exactly {chain.n_particles} atoms)')
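
For reviewers who want to check the new repeat-count logic in isolation, here is a minimal standalone sketch of step 1 of this diff. It is not the code from the commit: `repeat_to_length` is a hypothetical stand-in for `repeat_string_to_length` (whose real implementation is not shown here), `plan_middle_sequence` and `SequencePlan` are illustrative names, and the two exception classes are local placeholders for the ones imported from `.exceptions`. The empty-sequence check is also an addition of this sketch.

```python
from typing import NamedTuple


class InsufficientChainLength(Exception):
    '''Local placeholder for the exception of the same name in the diff'''

class PartialBlockSequence(Exception):
    '''Local placeholder for the exception of the same name in the diff'''


class SequencePlan(NamedTuple):
    '''Hypothetical container: the sequence to pass to the builder and how often to repeat it'''
    sequence: str
    n_repeats: int


def repeat_to_length(string: str, target_length: int, joiner: str = '') -> str:
    '''Assumed behavior of repeat_string_to_length(): whole repeats plus a prefix of the string'''
    if not string:
        raise ValueError('Cannot repeat an empty string')
    n_whole, n_extra = divmod(target_length, len(string))
    parts = [string] * n_whole
    if n_extra:
        parts.append(string[:n_extra])  # partial repeat is always a prefix of the string
    return joiner.join(parts)


def plan_middle_sequence(
    sequence: str,
    n_monomers: int,
    n_mono_term: int,
    allow_partial_sequences: bool = False,
) -> SequencePlan:
    '''Mirror the divmod-based branching in the diff, without mBuild or logging'''
    if not sequence:
        raise ValueError('Sequence must contain at least one symbol')  # not in the diff
    block_size = len(sequence)
    n_mono_middle = n_monomers - n_mono_term  # middle monomers needed to reach the target
    if n_mono_middle < 0:
        raise InsufficientChainLength(
            f'{n_monomers}-mer chain cannot contain {n_mono_term} terminal monomers'
        )

    n_seq_whole, n_symbols_remaining = divmod(n_mono_middle, block_size)
    if n_symbols_remaining != 0:
        if not allow_partial_sequences:
            raise PartialBlockSequence(
                f'Prefix "{sequence[:n_symbols_remaining]}" of "{sequence}" would be needed'
            )
        # Spell out the whole mixed-length sequence once; it is repeated exactly 1 time
        return SequencePlan(repeat_to_length(sequence, n_mono_middle), 1)

    if n_seq_whole < 1:  # at least one full middle block is required
        raise InsufficientChainLength(
            f'{n_monomers}-monomer chain has no room for a middle monomer sequence'
        )
    return SequencePlan(sequence, n_seq_whole)


if __name__ == '__main__':
    # 12-mer with 2 end groups: 10 middle monomers = 2 whole "BACA" blocks + "BA"
    print(plan_middle_sequence('BACA', 12, 2, allow_partial_sequences=True))
    # 10-mer with 2 end groups: 8 middle monomers = 4 whole "AB" blocks
    print(plan_middle_sequence('AB', 10, 2))
```

With `allow_partial_sequences=False` (the new default in this commit), the first call would raise `PartialBlockSequence` instead of silently truncating the final block, so callers have to opt in to partial blocks.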