Circular chromosomes in @SQ header #403

Closed
jmarshall opened this issue Apr 25, 2019 · 9 comments
@jmarshall
Member

It is occasionally suggested that it would be useful to have a convention for annotating reference sequences as being circular. See for example this 2011/12 samtools-devel thread and this tweet. There are further questions about how to represent mappings across the “join” in a circular chromosome (as mentioned in that thread), but being able to represent the concept in SAM at all is a useful first step.

For example, this could be

@SQ    SN:MT    CI:true

for Circular—true, or perhaps more self-explanatorily and flexibly something along the lines of

@SQ    SN:MT    TP:circular

for Molecule Topology—circular (which would have a default implied value of linear).
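As a minimal sketch of how a tool might read the second form, the following parses a tab-separated `@SQ` line and treats a missing `TP` field as linear. The function names `parse_sq_line` and `topology` are hypothetical and not part of any existing library.

```python
def parse_sq_line(line):
    """Parse one tab-separated @SQ header line into a dict of tag -> value."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@SQ":
        raise ValueError("not an @SQ line: %r" % line)
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only, so values containing ':' survive.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return tags


def topology(tags):
    """Return the proposed TP value, with 'linear' as the implied default."""
    return tags.get("TP", "linear")


mt = parse_sq_line("@SQ\tSN:MT\tTP:circular")
chr1 = parse_sq_line("@SQ\tSN:chr1\tLN:1000")
print(topology(mt), topology(chr1))
```

A string value rather than a boolean leaves room for other topologies later, at the cost of tools having to validate the value.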

@jmarshall jmarshall added the sam label Apr 25, 2019
@yfarjoun
Contributor

Sounds like a nice addition. I don't know offhand of other topologies that we would want to include...unless we want to eventually allow knot notations... (http://katlas.math.toronto.edu/wiki/The_Rolfsen_Knot_Table) I would be in favor of a boolean option.

@jkbonfield
Contributor

It's a nice idea and I don't see a problem with adding it, even if downstream tools don't yet support it. Add it and there's a chance that'll happen.

Ancient history: way back in an earlier job we found this notion useful; e.g., see page 274 (actual 294) of http://nebc.nerc.ac.uk/bioinformatics/documentation/staden/doc/manual_unix.pdf. In gap4, if a contig reference was marked as circular then the editor would permit scrolling beyond the end and wrapping around to the start. You could also define where the starting point was.

Both of these were feature requests from people studying mitochondrial genomes, for which the standard reference sequence at the time just happened to have base number 1 in the hypervariable region, so people often rotated the genome so that alignments to that variable region (which was the thing under study) worked.
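The rotation described above can be sketched as follows. This is only an illustration of the idea, not how gap4 implemented it; `rotate_circular` and `rotated_to_original` are hypothetical names, and positions are 1-based.

```python
def rotate_circular(seq, new_start):
    """Rotate a circular sequence so that 1-based position new_start becomes position 1."""
    if not 1 <= new_start <= len(seq):
        raise ValueError("new_start out of range")
    i = new_start - 1
    return seq[i:] + seq[:i]


def rotated_to_original(pos, new_start, length):
    """Map a 1-based position in the rotated sequence back to original coordinates."""
    return (pos - 1 + new_start - 1) % length + 1


seq = "ACGTTG"
rotated = rotate_circular(seq, 3)
# Every base in the rotated sequence maps back to the same base in the original.
for p in range(1, len(seq) + 1):
    assert rotated[p - 1] == seq[rotated_to_original(p, 3, len(seq)) - 1]
```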

@colinhercus

colinhercus commented Apr 26, 2019 via email

@jmarshall
Member Author

>chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion AS:GRCh38 tp:circular

@colinhercus: I guess this is from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/ which contains README_analysis_sets.txt describing the various tags used (see §4), several of which are SAM @SQ fields brought across to FASTA:

…AC, gi, LN, rg, rl, M5, AS, hm…
tp: topology

  • circular for chrM and chrEBV
  • not present for linear chromosomes and scaffolds

For a definition in SAM @SQ we'd want to make the tag uppercase,¹ but this is certainly motivation for the TP / Topology terminology.


¹ “Tags containing lowercase letters are reserved for local use and will not be formally defined in any future version of this specification.”
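To illustrate the uppercase mapping, here is a minimal sketch that converts an analysis-set FASTA header like the one quoted above into an `@SQ` line. The function name `fasta_header_to_sq` is hypothetical, and the choice to carry over only `LN`, `AS`, and `M5` alongside `tp` is an assumption made for this example.

```python
# Assumption: only these FASTA header tags are carried over unchanged.
SQ_TAGS = ("LN", "AS", "M5")


def fasta_header_to_sq(header):
    """Build an @SQ line from a FASTA header with space-separated tag:value pairs."""
    parts = header.lstrip(">").split()
    name = parts[0]
    tags = dict(p.split(":", 1) for p in parts[1:])
    fields = ["@SQ", "SN:" + name]
    for tag in SQ_TAGS:
        if tag in tags:
            fields.append(tag + ":" + tags[tag])
    # Lowercase 'tp' in the FASTA header becomes the proposed uppercase TP field.
    if "tp" in tags:
        fields.append("TP:" + tags["tp"])
    return "\t".join(fields)


header = ">chrM AC:J01415.2 gi:113200490 LN:16569 rl:Mitochondrion AS:GRCh38 tp:circular"
print(fasta_header_to_sq(header))
```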

@colinhercus

colinhercus commented Apr 27, 2019 via email

@nh13
Member

nh13 commented Apr 27, 2019

I’d love this addition to be made to the spec. I think an uppercase tag is appropriate, as lowercase tags are reserved for local use (as @jmarshall cites from the spec). I have an aligner that I am working on for a very long read application that will need to map across the origin. I’ll leave any changes to the spec about how to represent alignments across the origin in a circular reference for later (currently I split the alignment into two).
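The split-into-two workaround could look roughly like the sketch below, which turns one alignment span on a circular reference into one or two linear intervals. This is an assumption about the general approach, not the aligner's actual code; `split_at_origin` is a hypothetical name, and coordinates are 1-based and inclusive.

```python
def split_at_origin(start, length, ref_len):
    """Split a span on a circular reference into linear (start, end) intervals.

    start is 1-based; the span covers `length` reference bases and may run
    past ref_len, in which case it wraps around to position 1.
    """
    if not 1 <= start <= ref_len:
        raise ValueError("start out of range")
    if not 1 <= length <= ref_len:
        raise ValueError("length must be between 1 and ref_len")
    end = start + length - 1
    if end <= ref_len:
        return [(start, end)]  # does not cross the origin
    return [(start, ref_len), (1, end - ref_len)]


# A 20-base span starting 10 bases before the end of a 16569-base reference.
print(split_at_origin(16560, 20, 16569))
```

Each resulting interval would then be reported as its own alignment record, which is why a spec-level representation for origin-spanning alignments remains a separate question.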

jmarshall pushed a commit to nh13/hts-specs that referenced this issue May 3, 2019
This is to support annotating reference sequences as circular,
e.g., for bacterial organisms or the human mitochondrial chromosome.
[Summarise @nh13's footnote text so it fits on one line, so `@RG-SM`
is not pushed off to the next page as an orphan. Remove now unneeded
pagebreak hint.] Fixes samtools#403.
@yfarjoun
Contributor

yfarjoun commented May 6, 2019

I commented in the PR...but I'll comment here too:

One thing is troubling me: if two people have two different versions of a reference sequence, one of which is TP:circular and the other not, the two versions will have the same md5, which means that refGet will clash and other sanity checks will fail. Should we redefine md5 for TP:circular in some way to avoid this?

For example:

For the purposes of calculating md5, a '$' sign will be appended to the end of a TP:circular reference sequence. Example:
>chr1
CTCAACCACTTGAGCAAACTCCAAGAC
has md5 of 4ac2dee8e6ac06422295df0320981c70
but 
>chr1 TP:circular 
CTCAACCACTTGAGCAAACTCCAAGAC
has md5 of de8fd4b0606b285e11251b538b27ce76
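The idea can be sketched with Python's `hashlib`. The function name `reference_md5` and its `circular` parameter are hypothetical; uppercasing the sequence before hashing is an assumption based on the usual M5 convention, and the '$' suffix is only the proposal made here, not an adopted rule.

```python
import hashlib


def reference_md5(seq, circular=False):
    """MD5 of a reference sequence, with '$' appended for circular references."""
    data = seq.upper()
    if circular:
        data += "$"  # proposed marker so circular and linear digests differ
    return hashlib.md5(data.encode("ascii")).hexdigest()


seq = "CTCAACCACTTGAGCAAACTCCAAGAC"
print(reference_md5(seq))
print(reference_md5(seq, circular=True))
```

Because '$' is not a valid sequence character, the suffixed input cannot collide with the plain input of any real linear sequence.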

@peterjc
Contributor

peterjc commented May 6, 2019

@yfarjoun seems like long term I'd want the metadata clash to be spotted and treated as an error. In the short term (since the circular tag will take a while to be adopted), have the sanity checker treat this as a warning only? Am I missing something (I'm not familiar with the implementation details of refGet)?

jmarshall pushed a commit that referenced this issue May 7, 2019
This is to support annotating reference sequences as circular,
e.g., for bacterial organisms or the human mitochondrial chromosome.
[Summarise @nh13's footnote text so it fits on one line, so `@RG-SM`
is not pushed off to the next page as an orphan. Remove now unneeded
pagebreak hint.] Fixes #403.
@jmarshall
Member Author

Closed as the SQ-TP header field has now been merged. See #405 (comment) and following for the remainder of the MD5 concern discussion.

6 participants