Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should readers demand an n_atoms kwarg #733

Closed
kain88-de opened this issue Feb 23, 2016 · 9 comments
Closed

Should readers demand an n_atoms kwarg #733

kain88-de opened this issue Feb 23, 2016 · 9 comments

Comments

@kain88-de
Copy link
Member

I've started to add the new standard reader tests to our PDB tests. I've noticed now that PrimitivePDBReader and ExtendedPDBReader expect an natoms kwarg. None of the other two readers (XDR, XYZ) I have adopted so far expects this kwarg. Does our topology system a check of the n_atoms read by the trajectory?

Questions is should the natom kwarg be removed or become an explicit part of the reader API?

@richardjgowers
Copy link
Member

I think a lot of the binary readers need the keyword in order to read their files, so Universe always passes the kwarg (and sometimes it's ignored).

We then check that Reader and Topology had the same number of atoms here:

if self.trajectory.n_atoms != self.atoms.n_atoms:

@kain88-de
Copy link
Member Author

If they truly need to know the number of atoms in advance we should make it an explicit part of the reader API.

@richardjgowers
Copy link
Member

@kain88-de
Copy link
Member Author

OK so it is a requirement. This is ok. I can check the requirements then in the tests.

@orbeckst
Copy link
Member

Readers can ignore n_atoms in the list of arguments that they get so it is not a requirement for the reader that it must be processed in any way; that's why we used kwargs instead of explicit naming. See the Trajectory Reader specs in the Trajectory API docs. We don't really have a API for topology things... probably should have one with/after #363.

@kain88-de
Copy link
Member Author

I wrote a script to check which Readers currently require the n_atoms kwarg, see bottom.
The following readers currently trigger an exception if n_atoms isn't passed as an argument.

TRJReader, PrimitivePDBReader, ExtentedPDBReader, LAMMPS.DATAReader, TRZReader

From these the PDB readers don't really need the kwarg. LAMMPS also looks like it can infer the number of atoms from the trajectory itself. For the TRJ and TRZ format we need number of atoms beforehand because they don't encode it separately in the trajectory file. From our implementation I also don't see a way that we could infer the number of atoms reliable from the trajectory.

Personally I would like if the API to construct a Reader to be consistent between all formats. We have currently two formats which need a to know n_atoms to be parsed. Because of that I would say that all Readers require a n_atoms kwarg independent of if we can infer n_atoms from the trajectory or not.

Test Script

import MDAnalysis as mda
from os.path import basename

from MDAnalysisTests.datafiles import (CRD, DCD, DMS, GRO, INPCRD, PDB, PQR,
                                       TRJ, TRR, TRZ, GMS_ASYMOPT, DLP_CONFIG,
                                       DLP_HISTORY, XTC, LAMMPScnt,
                                       mol2_molecules, PDBQT_input, XYZ)


c = mda.coordinates
readers = {c.CRD.CRDReader: CRD,
           c.DCD.DCDReader: DCD,
           c.DMS.DMSReader: DMS,
           c.GRO.GROReader: GRO,
           c.INPCRD.INPReader: INPCRD,
           c.PDB.PDBReader: PDB,
           c.PDB.PrimitivePDBReader: PDB,
           c.PDB.ExtendedPDBReader: PDB,
           c.PQR.PQRReader: PQR,
           c.TRJ.TRJReader: TRJ,
           c.TRR.TRRReader: TRR,
           c.TRZ.TRZReader: TRZ,
           c.XTC.XTCReader: XTC,
           c.XYZ.XYZReader: XYZ,
           c.DLPoly.ConfigReader: DLP_CONFIG,
           c.DLPoly.HistoryReader: DLP_HISTORY,
           c.GMS.GMSReader: GMS_ASYMOPT,
           c.MOL2.MOL2Reader: mol2_molecules,
           c.PDBQT.PDBQTReader: PDBQT_input,
           c.LAMMPS.DATAReader: LAMMPScnt,
           }


for read, topo in readers.items():
    # print("using Reader = {}".format(read))
    try:
        read(topo)
    except Exception as e:
        print("open failed for format {}".format(basename(topo)))
        print("used Reader = {}".format(read))
        print("Exception = {}", e)
        print("")

@mnmelo
Copy link
Member

mnmelo commented Feb 25, 2016

I second @orbeckst's view. Readers that don't need n_atoms can just ignore the kwarg. We just need to properly document this aspect of the API.

I would definitely not make it mandatory for every Reader. It brings unneeded work for the user for the majority of the cases.
Plus, it raises the question: what happens when a user supplies n_atoms different from the trajectory's? Is the kwarg ignored? Then it makes little sense demanding it be present.

@richardjgowers
Copy link
Member

So I think I wrote DATAReader, and if I remember right I had in my mind that because n_atoms is read in the topology the Reader shouldn't have to do it. But yeah, upon reflection it makes more sense for the Reader to work standalone, so that's an issue.

TRZReader I could get to work without n_atoms as each written fortran block is prefixed by its size in bytes, and you can figure out how many floats (==natoms). But that's a bit hacky.

@kain88-de you've forgot TRJ.NCDF, it can calculate n_atoms on its own.

I don't see a problem with an optional requirement, it just reflects the formats strengths/weaknesses

@kain88-de
Copy link
Member Author

The "problem" comes with the standardized tests BaseReaderTest for the Readers I introduced last fall.
It expects that all Readers can be constructed with exactly the same calls, right now just a string pointing to a trajectory file. It wouldn't be to hard to overwrite the __init__ function for the formats that need to have the n_atoms kwarg.

If we are all OK to have Readers by default work without any topology information. I can keep BaseReaderTest as it is and add documentation what should happen for formats where we need additional information besides just the filename.

As far as documentation goes for users. I would remove the hint about the kwargs and say only that some formats need additional information from the topology to be read and that more information can be found in the doc-strings of the specific format readers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants