Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

1 A^3 boxes from EM structures in the PDB #2599

Closed
IAlibay opened this issue Mar 9, 2020 · 16 comments · Fixed by #2645
Closed

1 A^3 boxes from EM structures in the PDB #2599

IAlibay opened this issue Mar 9, 2020 · 16 comments · Fixed by #2645

Comments

@IAlibay
Copy link
Member

IAlibay commented Mar 9, 2020

Context

This is related to this discussion: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/mdnalysis-discussion/IKYv18MqlFs

From a cursory look at a couple of systems (6NT5, 6CVM, 3JAC), it looks like PDB files for EM structures have a tendency to have the following CRYST1 records:

CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1

As dicussed in the dicussion thread, this is causing issues with periodic selections as it is folding the system into a cube of 1 A.

Proposed solution

My solution here would be to catch cases where the CRYST1 record is defining a 1 A cube, warn the user and make the system dimensionless.

Code to reproduce the behavior

Very slightly altered code from Lucas Siemons' post.

import MDAnalysis as mda
import numpy as np

u = mda.Universe('6cvm.pdb')

cur_sele = u.select_atoms('around 0.1 (resid 4 and name CA and segid A)')

ref_ca = u.select_atoms('resid 4 and name CA and segid A')

print('sele len: ', len(cur_sele))
for i in cur_sele:
    dist = np.linalg.norm(i.position-ref_ca[0].position)
    print('Distance: %f '%(dist), i)

Current version of MDAnalysis

  • Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)") 0.20.2-dev0
  • Which version of Python (python -V)? 3.7.6
  • Which operating system? Ubuntu 18.04.3 LTS
@richardjgowers
Copy link
Member

richardjgowers commented Mar 9, 2020 via email

@IAlibay
Copy link
Member Author

IAlibay commented Mar 9, 2020

@siddharthjain1611 this issue is indeed currently free to work on if you wish to do so. I would however suggest possibly not over-extending yourself by looking to fix multiple issues concurrently (namely #2469), and instead focusing on one so that you can get a PR merged before the GSOC application window opens. Doing so means that it also doesn't block off issues from other prospective GSOC applicants.

Edit: looks like the comment this was aimed at has now been deleted.

@siddharthjain1611
Copy link
Member

Sir, I would first focus on the other issue, therefore, please feel free to assign someone else to this issue who is more worthy.

@orbeckst
Copy link
Member

Note:

REMARK 247 ELECTRON MICROSCOPY
REMARK 247  THE COORDINATES IN THIS ENTRY WERE GENERATED FROM ELECTRON
REMARK 247  MICROSCOPY DATA. PROTEIN DATA BANK CONVENTIONS REQUIRE
REMARK 247  THAT CRYST1 AND SCALE RECORDS BE INCLUDED, BUT THE VALUES
REMARK 247  ON THESE RECORDS ARE MEANINGLESS EXCEPT FOR THE CALCULATION
REMARK 247  OF THE STRUCTURE FACTORS.

Technically, it’s the users responsibility to provide correct input.

In practice, this can easily lead to very wrong results when working with PDB files.

I really dislike having to put a lot of special casing in readers but if

CRYST1    1.000    1.000    1.000  90.00  90.00  90.00 P 1

is a common thing in cryo-EM structures (which will soon make up a significant portion of the PDB) then we can’t just slip them through. Perhaps we should have a way to set the box to None (or whatever we use for no box) when we encounter either these magic box size or if we have another clear indication that the box is non-sensical.

TODO

  • establish if these special CRYST1 records are the standard for cryo-EM
  • decide on a way to treat them by default
  • warn user
  • give user option to keep the box (not sure when a user would want to do this but perhaps for testing)

@richardjgowers
Copy link
Member

So the codes pasted above give no dimensions info when using fetch_mmtf so it looks like it's a strictly PDB problem at least.

We could try and do some sort of string match in the remarks if we really wanted, but I just think it's fine to warn and delete the box info.

@IAlibay
Copy link
Member Author

IAlibay commented Mar 10, 2020

So to tackle the first "TODO" from @orbeckst's post, I had a quick thrawl through the RSCB PDB to investigate the frequency of cryo-EM PDB CRYST1 records matching:

CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1

Out of 2286 entries matching the a text search for "cryo-em", 2109 (~92%) matched the above CRYST1 record.

I've not had a thorough look at the cases where the CRYST1 record was different, but in ~ dozen cases it was a case that the CRYST1 record was partially empty in some way, in ~ 50 cases there were no CRYST1 records at all, in the remainder of cases some kind of non-1 A^3 box was defined.

@orbeckst
Copy link
Member

I really don't want to start digging through REMARKS.

Let's define what we mean by "a sensible box", document, throw it out & warn by default, and give user an option to keep whatever rubbish is in the file if they really want to.

@0xethsign
Copy link

is someone currently working on this issue?

@orbeckst
Copy link
Member

@smrpn, as far as I can see there's no PR linked to this issue at the moment so by all means go ahead and put up a PR.

@GomaruSai
Copy link

Hello!
I would like to know if someone is working on this issue?

@0xethsign
Copy link

@GomaruSai i'm not

@IAlibay
Copy link
Member Author

IAlibay commented Mar 22, 2020

@GomaruSai as far as I'm aware no one has opened a PR on this, please do go ahead and open one if you'd like to.

@Oscuro-Phoenix
Copy link
Contributor

Hi! I'm trying this out.

If you trace the issue back to groups.py in the core module, it seems like unless the periodic key in the dictionary is not specified as periodic=False (during select_atoms function call) it would take it as True by default. Could you (@IAlibay) kindly clarify this ?

@IAlibay
Copy link
Member Author

IAlibay commented Mar 23, 2020

I was under the impression that @GomaruSai wanted to work on this issue.

If you trace the issue back to groups.py in the core module, it seems like unless the periodic key in the dictionary is not specified as periodic=False (during select_atoms function call) it would take it as True by default. Could you (@IAlibay) kindly clarify this ?

I'm not sure I follow you fully here, but yes this is happening because periodicity is applied by default if a CRYST record exist in the PDB. That being said, we wouldn't want to change the default value of the periodicity flag.

Instead we just want to pick up a case where an 1 A^3 box is defined as a possible EM structure that was given a default non-real CRYST record. On identifying this, we'd then just warn the user, and not set the box dimensions.

@Oscuro-Phoenix
Copy link
Contributor

I'm kind-of struggling to find the genesis of this default behavior on encountering CRYST. I found that the PDB.py has a section that reads CRYST1 but it doesn't explain the behavior. Any hints on what I might be missing?

@IAlibay
Copy link
Member Author

IAlibay commented Mar 24, 2020

I'm kind-of struggling to find the genesis of this default behavior on encountering CRYST. I found that the PDB.py has a section that reads CRYST1 but it doesn't explain the behavior. Any hints on what I might be missing?

So the section you're probably looking for is:

elif line[:6] == 'CRYST1':
# does an implicit str -> float conversion
try:
self.ts._unitcell[:] = [line[6:15], line[15:24],
line[24:33], line[33:40],
line[40:47], line[47:54]]
except ValueError:
warnings.warn("Failed to read CRYST1 record, "
"possibly invalid PDB file, got:\n{}"
"".format(line))

What we want to do is identify where the unit cell values being fed here define a 1 A^3 box, i.e. when the CRYST1 record is: CRYST1 1.000 1.000 1.000 90.00 90.00 90.00 P 1 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants