Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nuclinfo Major Pair and Minor Pair overhaul #3735

Merged
merged 45 commits into from
Oct 10, 2023

Conversation

ALescoulie
Copy link
Contributor

@ALescoulie ALescoulie commented Jun 26, 2022

Fixes #3720

Changes made in this Pull Request:

  • Add a class for Major Pair and Minor Pair distance in the new overhauled nucleic acids module.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@pep8speaks
Copy link

pep8speaks commented Jun 26, 2022

Hello @ALescoulie! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2023-10-10 06:31:35 UTC

@codecov
Copy link

codecov bot commented Jun 26, 2022

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (427f1a7) 93.40% compared to head (ee44dca) 93.41%.

Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #3735     +/-   ##
==========================================
  Coverage    93.40%   93.41%             
==========================================
  Files          170      184     +14     
  Lines        22257    23394   +1137     
  Branches      4071     4075      +4     
==========================================
+ Hits         20790    21854   +1064     
- Misses         951     1024     +73     
  Partials       516      516             
Files Coverage Δ
package/MDAnalysis/analysis/nucleicacids.py 100.00% <100.00%> (ø)

... and 14 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry only had the time/energy for a quick review.

Couple of extra things I couldn't add comments for but worth mentioning since it seems like there's going to be further changes to this code:

  1. lines 132, 139, and 142 - we already populate a times array under self.times, is this still necessary to populate? Can we just do a call to np.array(self.times)?
  2. The Attributes for NucPairDist aren't indented properly, could you change them here?
  3. Dicussion point (other @MDAnalysis/coredevs please weigh in here), the way the results dictionary is being used here is very different from how we do results everywhere else & imho makes it a bit hard to specifically extract the pairs of distances vs say the times entry. I realise it's a bit late since this already shipped in 2.2.0, but before we keep going, would it make sense to consider switching this to instead be results.pair_distances (maybe even making it a pairs x times 2D ndarray?

Bases are matched by their index in the lists given as arguments.

Parameters
__________
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(here and everywhere else) I know these still render, but the line style is very much different from the numpy-like style guide we use. Can you instead replace this with a ------ line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in a recent commit

Comment on lines 285 to 291
for s in strand:
if s[0].resname[0] in [c_name, t_name, u_name]:
a1, a2 = o2_name, c2_name
elif s[0].resname[0] in [a_name, g_name]:
a1, a2 = c2_name, o2_name
else:
raise ValueError(f"{s} are not valid nucleic acids")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm misreading, this code block is essentially being repeated in all the child classes. Could you just use a staticmethod here that generalises this code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then you can test it separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored it into the NucPairDist parent class like @IAlibay recommended

else:
raise ValueError(f"{s} are not valid nucleic acids")

sel1.append(s[0].atoms.select_atoms(f'name {a1}'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can double check this on your end but testing things out you'd probably get a ~ 5x speedup if you did
s[0].atoms[np.where(s[0].atoms.names == a1)] instead

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be curious if this was still true after the string interning changes...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardjgowers PLEASE write a benchmark case to be run for anything that you improve

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Has this been bench-marked yet? If not I can set one up and adjust the selection code accordingly.

elif s[0].resname[0] in [a_name, g_name]:
a1, a2 = c2_name, o2_name
else:
raise ValueError(f"{s} are not valid nucleic acids")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please cover these errors with tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I test in the refactor of the selection into NucPairDist

Attributes
__________
results: numpy.ndarray
first index is selection second index is time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is missing indentation it seems?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

raise ValueError(f"{s} are not valid nucleic acids")

sel1.append(s[0].atoms.select_atoms(f'name {a1}'))
sel2.append(s[1].atoms.select_atoms(f'name {a2}'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens here if one of the selections returns an empty atomgroup and the other doesn't? (for some reason or another, maybe the wrong name selection was passed?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a check for that in the static method in NucPairDist

@@ -56,6 +64,7 @@
from .distances import calc_bonds
from .base import AnalysisBase, Results
from MDAnalysis.core.groups import Residue
from .dihedrals import Dihedral
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably missing something obvious, where is this used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that didn't get cut when I moved a commit around. I was working on the rebuilding another the torsion calculator and had imported that, I'll remove that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed in latest push

Copy link
Member

@IAlibay IAlibay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of extra things since I was playing with pytest.approx

Comment on lines 26 to 27
from MDAnalysis.analysis.nucleicacids import WatsonCrickDist,\
MajorPairDist, MinorPairDist
Copy link
Member

@IAlibay IAlibay Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't get picked up by pep8speaks, but implicit continuation is best where possible

Suggested change
from MDAnalysis.analysis.nucleicacids import WatsonCrickDist,\
MajorPairDist, MinorPairDist
from MDAnalysis.analysis.nucleicacids import (WatsonCrickDist,
MajorPairDist, MinorPairDist,)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the issue in my latest push

MI = MinorPairDist(strand1, strand2)
MI.run()

assert_allclose(MI.results[0][0], 15.06506, atol=1e-3)
Copy link
Member

@IAlibay IAlibay Jun 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another one for @MDAnalysis/coredevs to consider for our testing style going ahead - single float comparisons are ~ 20x faster (at least on my laptop) if done using pytest.approx instead of assert_allclose. It's microseconds (170 microsec vs 9 microsec), but it adds up quickly to seconds, how do we feel about making pytest.approx the assert method of choice in those cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I raised #3743 for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored my test to use pytest approx

@@ -27,6 +27,8 @@ Enhancements
* Additional logger.info output when per-frame analysis starts (PR #3710)
* Timestep has been converted to a Cython Extension type
(CZI Performance track, PR #3683)
* Add higher performance AnalysisBase derived Major and Minor pair distances to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually go newest first for these entries, could you move it to the top of the enhancements list?

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! Refactoring old code is never pleasant.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the rewrite.

I have a bunch of readability comments (see inline).

I agree with @IAlibay that we should remove results['times']. I also agree that results needs restructuring, both for consistency but also for ease of accessing the data with e.g. mdacli. Let's raise a separate issue where others can also easily chime in. Normally we adhere to semantic versioning. However, I would consider this module still experimental and I am willing to break strict semver here and just change results without deprecations. The changes to results can go into a separate PR (preferrable) or this one (if a lot easier).

package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
Comment on lines 269 to 272
ValueError
if the residues given are not amino acids
ValueError
if the selections given are not the same length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


"""

def __init__(self, strand1: List[Residue], strand2: List[Residue],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union(List[Residue], ResidueGroup) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add that, I looked at ResidueGroup and there is no reason NucPairDist can't accept it or Residue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used a type alias to do it so that the code is less verbose, what do you think about that solution? It is a bit less clear in the docs, but simplifies the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what a type alias is and where you used it — sorry, apologies for my ignorance. Can you please explain and comment on the corresponding piece of code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least from the docs it is not clear to me that the class would now also accept strand1: ResidueGroup — ... at least from your comment I surmise that's what should be possible now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orbeckst A type alias is when you give a specific name to a composite type in my case ResidueClass is an alias for Union[Residue, ResidueGroup] so my code's type signatures are a bit less verbose

**kwargs) -> None:
sel1: List[mda.AtomGroup] = []
sel2: List[mda.AtomGroup] = []
strand = zip(strand1, strand2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plural "strands" ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or perhaps clearer, double_strand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set it to strands

package/MDAnalysis/analysis/nucleicacids.py Show resolved Hide resolved
package/MDAnalysis/analysis/nucleicacids.py Show resolved Hide resolved
package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
@orbeckst
Copy link
Member

Before going much further here we should discuss the format of Results — please see #3744.

@IAlibay
Copy link
Member

IAlibay commented Jul 31, 2023

@ALescoulie It would be great to not let this work get forgotten, are you still looking to contribute this?

@IAlibay IAlibay added the close? Evaluate if issue/PR is stale and can be closed. label Jul 31, 2023
@ALescoulie
Copy link
Contributor Author

Yeah, I'll continue working on this. Sorry for abandoning it, I was busy with school and dealing with mental health issues. I'm getting the development repo set up on my desktop now.

@orbeckst orbeckst removed the close? Evaluate if issue/PR is stale and can be closed. label Aug 18, 2023
@orbeckst
Copy link
Member

Hi @ALescoulie good to hear from you! Would be fantastic to get your work in!

@github-actions
Copy link

github-actions bot commented Aug 19, 2023

Linter Bot Results:

Hi @ALescoulie! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/6465590293/job/17552033113


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

@ALescoulie
Copy link
Contributor Author

ALescoulie commented Aug 21, 2023

I pushed a larger change to the PR. It was a bit hard picking up where I left off, but I was able to jump back into developing without too many issues.

I refactored my code so that all of the Pair distance analysis classes use a NucPairDist static method select_strand_atoms which takes in the strands, selected atoms names, and the nucleic acid names. It then returns a tuple containing the the lists of atoms for each strand. It also has checks ensuring that the strands only contain Nucleic Acids and that the selections don't return empty AtomGroups. I also added tests for both errors.

TODO

  • Reformat NucPairDist to use the newer results format
  • Fix misc docs issues
  • Update the change log

ALescoulie and others added 4 commits September 5, 2023 14:17
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
descriptions and fix misc typos
strand1 and strand2 are different types
simplify type checks and catch TypeError
@ALescoulie
Copy link
Contributor Author

@orbeckst I believe I addressed your feedback on the documentation, and added some further clarity. I also rewrote how WatsonCrickDist handles List[Resiude] in order to catch possible errors when both a List[Residue] and a ResidueGroup are passed in and in cases where the list is not entirely Residues. I think I should still add one additional test to get back to 100% coverage as there is one if statement I don't hit, I also will go read the documentation one more time.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments. The only real issue is that we need to deprecate results.pair_distance in the WatsonCrick class properly.

In the new classes there's nothing to deprecate as they appear here for the first time.

package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved
package/MDAnalysis/analysis/nucleicacids.py Show resolved Hide resolved
package/MDAnalysis/analysis/nucleicacids.py Outdated Show resolved Hide resolved

.. versionchanged:: 2.5.0
Accessing results by passing strand indices to :attr:`results` is
was deprecated and is now removed as of MDAnalysis version 2.5.0. Please
use :attr:`results.pair_distances` instead.
The :attr:`results.times` was deprecated and is now removed as of
MDAnalysis 2.5.0. Please use the class attribute :attr:`times` instead.

.. versionchanged:: 2.7.0
.. _Deprecation Notice
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as above : does the anchor work, did you check the generated docs?

@ALescoulie
Copy link
Contributor Author

@orbeckst I think this PR about ready, assuming I didn't miss any typos in the docs. I also pulled most of the atom selections in the tests into a fixture so there is less repetition when running the CI.

@ALescoulie ALescoulie requested a review from orbeckst October 4, 2023 20:30
- use deprecated sphinx role
- fixed minor errors in descriptions
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ALescoulie , looks good.

I made some changes to the docs, mainly to use sphinx markup for deprecations, but I also included a few fixes. In particular, please have a look for yourself that I correctly defined results.distances.

package/MDAnalysis/analysis/nucleicacids.py Show resolved Hide resolved
Comment on lines 337 to 339
Distances are stored in a 2d numpy array with axis 0 (first index)
indexing the trajectory frame and axis 1 (second index) selecting the
Residue pair.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ALescoulie please check that this rewritten description is correct. It is now consistent with the description of results.pair_distances.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle all good, but please fix PEP8 things.

@ALescoulie
Copy link
Contributor Author

@orbeckst, I'll get it done by tomorrow than I can start on the next part of #3720, the pep8 stuff will just take a few minutes.

@ALescoulie ALescoulie force-pushed the nuclinfo_dist_rebuild branch from 06a55cb to ee44dca Compare October 10, 2023 06:31
@ALescoulie
Copy link
Contributor Author

@orbeckst I got the pep8 stuff done

@orbeckst orbeckst merged commit 18372f1 into MDAnalysis:develop Oct 10, 2023
@orbeckst
Copy link
Member

Hooray. I reopen #3720 for the other things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Re-implement nuclinfo using AnalysisBase style subclasses
5 participants