Mm tag clarification #582

jts · 2021-07-20T14:29:46Z

In implementing the Mm/Ml tags in nanopolish, and parsing them in a small toolkit I'm writing, I've run into a potential ambiguity I'd like clarification on. For illustration, here's a toy example:

SEQ ACGTCGACGA
Mm:C+m,0,1;
Ml:B:C,230,10

Here, the first C should be interpreted as having a high probability of modification, and the third C should be considered canonical with high probability. Should the second C be considered canonical (without any indication of probability?) or should it be treated as non-informative/undefined? In my current implementation untagged bases are considered non-informative (so wouldn't be used when calculating methylation frequency at a particular genomic position) but I believe the modification-aware version of guppy considers them to be canonical, so it would be good to clarify how these should be interpreted.

The text was updated successfully, but these errors were encountered:

cjw85 · 2021-07-20T14:38:47Z

To follow-up on the Guppy behaviour, Guppy can output a modification score for any (relevant) canonical base regardless of locus and surrounding motif; its not limited to e.g. labelling bases in DAM, DCM, CpG, etc.

It does however have the option:

  --bam_methylation_threshold arg       The value below which a predicted
                                        methylation probability will not be
                                        emitted into a BAM file, expressed as a
                                        percentage. Default is 5.0(%).

which in essence masks the output and cuts down the entries in the bam tag. That could be interpreted as the basecaller determining that there's a low probability of modification: an assessment was made, the score was low and was dispensed with for brevity.

I do agree with @jts that the specification needs to be more explicit even if thats "the interpretation is undefined and implementation dependent".

jkbonfield · 2021-07-21T10:28:34Z

Any suggested wording? My suggestion would be "The lack of a modification should be considered as indicative of an unmodified base with a high, but implementation dependent, probability."

As mentioned above, the intent here was for the unlisted entries to be consider as having negligable modification probabilities. It's a useful default position as it means there is an option of shrinking down the modification string, which would otherwise be quite a significant overhead once we start to have callers trained on multiple modification calls for A, C, G and T.

jts · 2021-07-21T14:48:15Z

Thanks @jkbonfield, your suggested wording sounds good to me.

For my use case (which I admit is somewhat outside the intention of the spec) it is desirable to annotate unknown/non-informative sites. This could be due to caller restrictions (e.g. only calling at CpG sites, so modification status at other Cs is unknown) or technical issues (a certain region of the read couldn't be processed, for various reasons).

Is there a way to handle this? I see three possibilities:

Write a probability of 0.5 for unknown sites
We define a special probability code (say, 0 or 255) for unknown sites
We define a mask string in Mm that annotates the unknown sites. Using my example above, Mm:C+m,0,1;C+x,1 would indicate the status of the second C is unknown

I'd prefer 2 or 3, over option 1, which would require special interpretation downstream.

cjw85 · 2021-07-21T15:12:00Z

As biased as I might be given my employer toward the contrary position, I tend to agree this seems like a deficiency in the specification: there is no way to inform the end user that no call was attempted. That doesn't mean the user should blindly assume some bases to be unmodified.

I can't immediately think of alternatives to @jts's proposals that provide definitive guidance to an end user.

cjw85 · 2021-07-21T15:31:36Z

Thinking about option 1 more: expressing the idea of unknown/uncalled like this is not unsavoury simply because its somewhat asserting a uniform prior and confused with true P=0.5 calls, but also if we record an entry for every base we're losing any benefit of the tag structure in being brief and alternatively could take the opportunity to be simultaneously more explicit (e.g. option 2).

Something more like option 3 is most appealing: it extend the current spec. I thought about the C+x mask being optional to leave the current idea that a lack of call implies a canonical base, but that's a little messy in that the interpretation of C+m becomes different when the mask is present. A mandatory, but possibly empty, mask is more clear.

jkbonfield · 2021-07-21T15:35:30Z

All of these require the caller to be amended, and some require the decoder to be aware too.

Option 1 doesn't need any decoder changes, but what value would you use? How would you tell them apart from called modifications that just happen to have that value as a probability? Hence it doesn't seem to actually solve the problem.

Option 2 could work, albeit perhaps requiring the tools reading the data stream to have recognition for a specific code, but if we went down this road then 0 would make sense as it's an illogical value to write out given the default position of not-listed means not-(probably)-modified and a naive interpretation that doesn't treat 0 as a special case should still work (albeit assuming no modification rather than lack of looking for a modification). That's still better than reading it as having a very high quality modification. If we explicitly did this for every C then it would mean the C+m string would have many runs of 0,0,0,... in this and the corresponding Ml would equally be 0,0,0,.... Both of these ought to compress well.

Option 3 is like 2 but a bit messier as it has the added complexity of what to do with Ml? Writing out C+x,1 would require an extra quality to be consistent, but they're rather meaningless placeholders. It feels unclean unclean to just fabricate something, but also a bit changing the format to have one modification type being absent from the Ml list. It also adds another loop hole - what if we have 4 Cs but 2 are listed as m and 1 as x. We're back to square 1! (We can obviously state this as illegal, but if it can happen someone probably will do it.)

So of the three, option 2 feels best.

I wonder though if there is an easier way via a header. These are tricky to extend IMO, but maybe something in the read-group? PG fields most ideal, but it's very program specific and not particularly robustly implemented usually. The thought here is we need a way of saying "unmentioned values mean X" where X can be one of "not called, so unknown" and "called as very likely unmodified". It's basically just a boolean so we don't really need to store an extra piece of information for every C in every sequence to say whether or not we attempted to call it, do we?
As 4, but append this information into the Mm/Ml strings direct. So not C+x,1 for your example for a specific base or bases, but maybe an extra markup to distinguish whether with Mm chain is complete or not. This has the added benefit of letting us say we checked every C but only some Gs (for example).

Thus we could have:

Mm:C+m,5,17,2;

All Cs have been checked and 6th, 24th and 27th Cs have been called as methylated. In this scenario we can omit the positions we check but find to be unmodified.

Or:

Mm:C+m,*,5,17,2;

We checked C position 6, 24 and 27 and their likelihood is recorded, with everything else unchecked. In this scenario we must include the positions we check and find to be unmodified, so the corresponding Ml line will have lots of low qualities. There would be no Ml entry for the C position *. For ease of parsing we'd require it to be the first in the list too.

Option 5 is more similar to your option 3, but without the requirement to explicitly list all the other C positions.

jts · 2021-07-22T13:12:05Z

I agree with your assessment of the different options, and favour option 5, so put my vote down in that column. While it would require a breaking change to the format, it would be nice if the extra flag was part of the description segment (C+m*,... instead of C+m,*,...).

jkbonfield · 2021-07-22T13:21:51Z

I did consider that option, and don't mind either way. Basically it's just a marker so it can fit anywhere.

I was thinking on it again this morning and basically it's distinguishing between implicit and explicit calls for unmodified. The current definition is implicit - we list the modified bases and implicitly assert everything else to be unmodified. The alternative is that we explicitly state which bases are unmodified (as well as obviously the modified ones), meaning the remainder are unspecified instead.

I'm not sure if that is a clearer way to describe it though. :)

cjw85 · 2021-07-22T13:46:27Z

I would think of the implicit form as applying some prior belief that all bases are canonical. I think @jts and I agree that the assymetric property of the implicit version isn't terribly nice; why should we assume canonical, not assume modified? Only because in the absence of the modified tags that's what we've been doing to this point!

We should be careful in saying in the implicit case that things are unmodified: the more accurate statement in the case of Guppy's current behaviour is that the absent positions have a modification probability less than some threshold.

I prefer option 5 over putting the marker in a header, where it is easily lost.

jkbonfield · 2021-07-22T16:40:37Z

(Via John Marshall from conf call)

A couple more thoughts on this.

We need to be explicit about what happens when there is no Mm tag at all. It needs saying, but I think it's uncontraversial to state that no tags means not analysed, and analysed with no observed modifications can be implemented as an empty Mm string.
For @jts, do you have long regions of a read where you have modifications called, followed by regions that you have no calls, followed by regions with calls again, and so on? The thought here is whether a simple binary "implicit" vs "explicit" model is sufficient or whether we need a way to turn it on and off (not a pleasant idea).
Finally, before really agreeing on any specific change, we probably do need to test just how inefficient it is to store a modification at every base for a specific type with most of them having a minimal "almost certainly not modified" likelihood. In Mm world this just turns into a run of 0,0,0... and Ml equally most scores will be identical. I'd hope this means it compresses very well, but it does still have implications for tools that hold many uncompressed reads in memory (pileup, sort, etc).

jts · 2021-07-22T17:17:27Z

Agreed
For my use case we call exclusively at CpG (or optionally a handful of other motifs that we have trained models on). So any C not followed by a G will not be tagged. In addition we have to skip long regions with very high CpG density (CGCGCGCGCG....) for limitations specific to nanopolish. These regions will not be tagged as well. So the distribution of called/not called sites depends on the underlying distribution of the motifs in the genome. I agree that toggling on/off the mode wouldn't be pleasant and I'd prefer to pay a higher storage cost to avoid it.
@cjw85 and I were discussing this as well. I'll generate some test files with a dense representation (at all Cs), encoding skipped positions with likelihood 0.

jts · 2021-07-22T19:54:18Z

Here's a quick test of sparse (current spec) versus dense tags (forcing output at all Cs) for 4,500 records from chr20:

Mm/Ml tags	format	size (bytes)
none	bam	52,841,385
sparse	bam	54,457,647
dense	bam	54,987,708
none	cram	33,764,559
sparse	cram	34,152,974
dense	cram	34,171,589

So, changing from sparse to dense tags increases size by 118 bytes per record for bam. For cram, the increase is only 4 bytes per record.

jkbonfield · 2021-07-23T08:13:43Z

Percentage wise, it's a 33% growth in these tags for BAM and 5% for CRAM, but the overall percentage of the file is still minimal. That doesn't seem too bad. (It's more or less what I expected, with BAM suffering a larger growth.)

When designing these tags though I was thinking about the future, where calling models may get much much bigger and we're predicting dozens of different modification types. I suspect such models though won't be needing to use the explicit mode of labelling every single nucleotide, so I think this is workable.

Thanks for the data.

jts · 2021-08-18T15:03:19Z

@jkbonfield I'd be interested to hear your thoughts on the proposed change linked above - if you think the basic structure is OK, I'll expand the examples to include these flags.

jkbonfield · 2021-08-18T15:38:19Z

Apologies for not getting back to you on this. I've had my head down doing other things and been a bit remiss with the specs.

I think the proposal is reasonable, so yes please. it would be most helpful if you could expand the examples. Did we agree on the notion of a markup in the symbol part of the string, eg C+m*,10 vs C+m,10. Or even the corner case of C+mh*,10 implying we checked position 10, looking for m/h mods jointly.

jts · 2021-08-18T15:45:15Z

No problem, I was off on holidays after our initial discussion so I'm just getting back to this myself.

We didn't decide on the positioning of the flag but I propose it be mandatory and following [+-] as it would be simpler to parse 3 single-character values, then the possibly variable-length modification code, rather than having this flag follow the mod code. Up for discussion though, as this would be a breaking change, but maybe this is OK since the format is so new.

jkbonfield · 2021-08-18T16:08:17Z

I'd be happy with it being between the +/- and the mod types, although I don't think it has to be a breaking change. We could decree a default position (what we have now) and use the additional flag as a way to indicate explicit scoring (only those listed) rather than implicit (by omission). Note I'm not against having markup stating "this list is implicit" either, but I'm saying that the absence of such a markup shouldn't be invalid and will be interpreted as implicit by default.

cjw85 · 2021-08-18T16:13:40Z

@jkbonfield How are you thinking C+mh*,10 (or C+mh? in @jts's draft) is a corner case? I think you mean should it be always understood that both m and h were evaluated explicitely? The current spec indicates you always need to provide a score in the multi-modification case.

jkbonfield · 2021-08-18T17:11:08Z

Sorry that was a poor choice of words. I meant more complex case, where we have multiple symbols listed but only one flag which binds to both rather than just one. Ie c+*mh (reversing it as discussed above) means both m and h are explicitly listed rather than only m and not h. It would be messy to try and describe one as explicit and one as implicit, plus we can always break the combined case down into two lists which then would permit such a description.

I haven't seen a draft yet either so can't comment on that specifically.

jkbonfield · 2021-08-18T17:14:02Z

Also I'm not sure what symbols make the most sense here. I just had * as a sort of place-holder flag without thinking about it. If we wish to add extra markup to state implicit vs explicit list, then maybe we want a pair of symbols that are related in some way, eg > and <. Open to suggestions, but will see what @jts comes up with.

jts · 2021-08-18T17:29:35Z

Sorry, I should have made this clear earlier, my proposed changes are here: jts@ea2784b

I chose . for assume-canonical as . is used to denote the reference base in other contexts (mpileup, etc).? for assume-unknown is fairly intuitive, imo.

jkbonfield added sam sam-tags labels Jul 21, 2021

jts added a commit to jts/hts-specs that referenced this issue Aug 18, 2021

first pass at addressing samtools#582

ea2784b

jts added a commit to jts/hts-specs that referenced this issue Aug 25, 2021

updates following the discussion in samtools#582

59b785e

jts mentioned this issue Aug 25, 2021

Extend Mm tag to allow callers to indicate unknown positions #592

Closed

jkbonfield closed this as completed in 71241fe Dec 9, 2021

cmdcolin mentioned this issue Dec 13, 2021

Add support for updated Mm tag unknown positions GMOD/jbrowse-components#2582

Closed

cjw85 mentioned this issue Jun 17, 2022

Clarification on absence of ? or . in MM tags #654

Closed

yinshiyi mentioned this issue Sep 12, 2022

nanopolish output bam file with MM/ML tags jts/nanopolish#1037

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mm tag clarification #582

Mm tag clarification #582

jts commented Jul 20, 2021

cjw85 commented Jul 20, 2021

jkbonfield commented Jul 21, 2021 •

edited

Loading

jts commented Jul 21, 2021 •

edited

Loading

cjw85 commented Jul 21, 2021 •

edited

Loading

cjw85 commented Jul 21, 2021

jkbonfield commented Jul 21, 2021 •

edited

Loading

jts commented Jul 22, 2021

jkbonfield commented Jul 22, 2021 •

edited

Loading

cjw85 commented Jul 22, 2021

jkbonfield commented Jul 22, 2021 •

edited

Loading

jts commented Jul 22, 2021

jts commented Jul 22, 2021

jkbonfield commented Jul 23, 2021 •

edited

Loading

jts commented Aug 18, 2021

jkbonfield commented Aug 18, 2021

jts commented Aug 18, 2021

jkbonfield commented Aug 18, 2021

cjw85 commented Aug 18, 2021

jkbonfield commented Aug 18, 2021

jkbonfield commented Aug 18, 2021 •

edited

Loading

jts commented Aug 18, 2021

Mm tag clarification #582

Mm tag clarification #582

Comments

jts commented Jul 20, 2021

cjw85 commented Jul 20, 2021

jkbonfield commented Jul 21, 2021 • edited Loading

jts commented Jul 21, 2021 • edited Loading

cjw85 commented Jul 21, 2021 • edited Loading

cjw85 commented Jul 21, 2021

jkbonfield commented Jul 21, 2021 • edited Loading

jts commented Jul 22, 2021

jkbonfield commented Jul 22, 2021 • edited Loading

cjw85 commented Jul 22, 2021

jkbonfield commented Jul 22, 2021 • edited Loading

jts commented Jul 22, 2021

jts commented Jul 22, 2021

jkbonfield commented Jul 23, 2021 • edited Loading

jts commented Aug 18, 2021

jkbonfield commented Aug 18, 2021

jts commented Aug 18, 2021

jkbonfield commented Aug 18, 2021

cjw85 commented Aug 18, 2021

jkbonfield commented Aug 18, 2021

jkbonfield commented Aug 18, 2021 • edited Loading

jts commented Aug 18, 2021

jkbonfield commented Jul 21, 2021 •

edited

Loading

jts commented Jul 21, 2021 •

edited

Loading

cjw85 commented Jul 21, 2021 •

edited

Loading

jkbonfield commented Jul 21, 2021 •

edited

Loading

jkbonfield commented Jul 22, 2021 •

edited

Loading

jkbonfield commented Jul 22, 2021 •

edited

Loading

jkbonfield commented Jul 23, 2021 •

edited

Loading

jkbonfield commented Aug 18, 2021 •

edited

Loading