Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mm tag clarification #582

Closed
jts opened this issue Jul 20, 2021 · 21 comments
Closed

Mm tag clarification #582

jts opened this issue Jul 20, 2021 · 21 comments

Comments

@jts
Copy link
Contributor

jts commented Jul 20, 2021

In implementing the Mm/Ml tags in nanopolish, and parsing them in a small toolkit I'm writing, I've run into a potential ambiguity I'd like clarification on. For illustration, here's a toy example:

SEQ ACGTCGACGA
Mm:C+m,0,1;
Ml:B:C,230,10

Here, the first C should be interpreted as having a high probability of modification, and the third C should be considered canonical with high probability. Should the second C be considered canonical (without any indication of probability?) or should it be treated as non-informative/undefined? In my current implementation untagged bases are considered non-informative (so wouldn't be used when calculating methylation frequency at a particular genomic position) but I believe the modification-aware version of guppy considers them to be canonical, so it would be good to clarify how these should be interpreted.

@cjw85
Copy link

cjw85 commented Jul 20, 2021

To follow-up on the Guppy behaviour, Guppy can output a modification score for any (relevant) canonical base regardless of locus and surrounding motif; its not limited to e.g. labelling bases in DAM, DCM, CpG, etc.

It does however have the option:

  --bam_methylation_threshold arg       The value below which a predicted
                                        methylation probability will not be
                                        emitted into a BAM file, expressed as a
                                        percentage. Default is 5.0(%).

which in essence masks the output and cuts down the entries in the bam tag. That could be interpreted as the basecaller determining that there's a low probability of modification: an assessment was made, the score was low and was dispensed with for brevity.

I do agree with @jts that the specification needs to be more explicit even if thats "the interpretation is undefined and implementation dependent".

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 21, 2021

Any suggested wording? My suggestion would be "The lack of a modification should be considered as indicative of an unmodified base with a high, but implementation dependent, probability."

As mentioned above, the intent here was for the unlisted entries to be consider as having negligable modification probabilities. It's a useful default position as it means there is an option of shrinking down the modification string, which would otherwise be quite a significant overhead once we start to have callers trained on multiple modification calls for A, C, G and T.

@jts
Copy link
Contributor Author

jts commented Jul 21, 2021

Thanks @jkbonfield, your suggested wording sounds good to me.

For my use case (which I admit is somewhat outside the intention of the spec) it is desirable to annotate unknown/non-informative sites. This could be due to caller restrictions (e.g. only calling at CpG sites, so modification status at other Cs is unknown) or technical issues (a certain region of the read couldn't be processed, for various reasons).

Is there a way to handle this? I see three possibilities:

  1. Write a probability of 0.5 for unknown sites
  2. We define a special probability code (say, 0 or 255) for unknown sites
  3. We define a mask string in Mm that annotates the unknown sites. Using my example above, Mm:C+m,0,1;C+x,1 would indicate the status of the second C is unknown

I'd prefer 2 or 3, over option 1, which would require special interpretation downstream.

@cjw85
Copy link

cjw85 commented Jul 21, 2021

As biased as I might be given my employer toward the contrary position, I tend to agree this seems like a deficiency in the specification: there is no way to inform the end user that no call was attempted. That doesn't mean the user should blindly assume some bases to be unmodified.

I can't immediately think of alternatives to @jts's proposals that provide definitive guidance to an end user.

@cjw85
Copy link

cjw85 commented Jul 21, 2021

Thinking about option 1 more: expressing the idea of unknown/uncalled like this is not unsavoury simply because its somewhat asserting a uniform prior and confused with true P=0.5 calls, but also if we record an entry for every base we're losing any benefit of the tag structure in being brief and alternatively could take the opportunity to be simultaneously more explicit (e.g. option 2).

Something more like option 3 is most appealing: it extend the current spec. I thought about the C+x mask being optional to leave the current idea that a lack of call implies a canonical base, but that's a little messy in that the interpretation of C+m becomes different when the mask is present. A mandatory, but possibly empty, mask is more clear.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 21, 2021

All of these require the caller to be amended, and some require the decoder to be aware too.

Option 1 doesn't need any decoder changes, but what value would you use? How would you tell them apart from called modifications that just happen to have that value as a probability? Hence it doesn't seem to actually solve the problem.

Option 2 could work, albeit perhaps requiring the tools reading the data stream to have recognition for a specific code, but if we went down this road then 0 would make sense as it's an illogical value to write out given the default position of not-listed means not-(probably)-modified and a naive interpretation that doesn't treat 0 as a special case should still work (albeit assuming no modification rather than lack of looking for a modification). That's still better than reading it as having a very high quality modification. If we explicitly did this for every C then it would mean the C+m string would have many runs of 0,0,0,... in this and the corresponding Ml would equally be 0,0,0,.... Both of these ought to compress well.

Option 3 is like 2 but a bit messier as it has the added complexity of what to do with Ml? Writing out C+x,1 would require an extra quality to be consistent, but they're rather meaningless placeholders. It feels unclean unclean to just fabricate something, but also a bit changing the format to have one modification type being absent from the Ml list. It also adds another loop hole - what if we have 4 Cs but 2 are listed as m and 1 as x. We're back to square 1! (We can obviously state this as illegal, but if it can happen someone probably will do it.)

So of the three, option 2 feels best.

  1. I wonder though if there is an easier way via a header. These are tricky to extend IMO, but maybe something in the read-group? PG fields most ideal, but it's very program specific and not particularly robustly implemented usually. The thought here is we need a way of saying "unmentioned values mean X" where X can be one of "not called, so unknown" and "called as very likely unmodified". It's basically just a boolean so we don't really need to store an extra piece of information for every C in every sequence to say whether or not we attempted to call it, do we?

  2. As 4, but append this information into the Mm/Ml strings direct. So not C+x,1 for your example for a specific base or bases, but maybe an extra markup to distinguish whether with Mm chain is complete or not. This has the added benefit of letting us say we checked every C but only some Gs (for example).

Thus we could have:

Mm:C+m,5,17,2;

All Cs have been checked and 6th, 24th and 27th Cs have been called as methylated. In this scenario we can omit the positions we check but find to be unmodified.

Or:

Mm:C+m,*,5,17,2;

We checked C position 6, 24 and 27 and their likelihood is recorded, with everything else unchecked. In this scenario we must include the positions we check and find to be unmodified, so the corresponding Ml line will have lots of low qualities. There would be no Ml entry for the C position *. For ease of parsing we'd require it to be the first in the list too.

Option 5 is more similar to your option 3, but without the requirement to explicitly list all the other C positions.

@jts
Copy link
Contributor Author

jts commented Jul 22, 2021

I agree with your assessment of the different options, and favour option 5, so put my vote down in that column. While it would require a breaking change to the format, it would be nice if the extra flag was part of the description segment (C+m*,... instead of C+m,*,...).

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 22, 2021

I did consider that option, and don't mind either way. Basically it's just a marker so it can fit anywhere.

I was thinking on it again this morning and basically it's distinguishing between implicit and explicit calls for unmodified. The current definition is implicit - we list the modified bases and implicitly assert everything else to be unmodified. The alternative is that we explicitly state which bases are unmodified (as well as obviously the modified ones), meaning the remainder are unspecified instead.

I'm not sure if that is a clearer way to describe it though. :)

@cjw85
Copy link

cjw85 commented Jul 22, 2021

I would think of the implicit form as applying some prior belief that all bases are canonical. I think @jts and I agree that the assymetric property of the implicit version isn't terribly nice; why should we assume canonical, not assume modified? Only because in the absence of the modified tags that's what we've been doing to this point!

We should be careful in saying in the implicit case that things are unmodified: the more accurate statement in the case of Guppy's current behaviour is that the absent positions have a modification probability less than some threshold.

I prefer option 5 over putting the marker in a header, where it is easily lost.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 22, 2021

(Via John Marshall from conf call)

A couple more thoughts on this.

  1. We need to be explicit about what happens when there is no Mm tag at all. It needs saying, but I think it's uncontraversial to state that no tags means not analysed, and analysed with no observed modifications can be implemented as an empty Mm string.

  2. For @jts, do you have long regions of a read where you have modifications called, followed by regions that you have no calls, followed by regions with calls again, and so on? The thought here is whether a simple binary "implicit" vs "explicit" model is sufficient or whether we need a way to turn it on and off (not a pleasant idea).

  3. Finally, before really agreeing on any specific change, we probably do need to test just how inefficient it is to store a modification at every base for a specific type with most of them having a minimal "almost certainly not modified" likelihood. In Mm world this just turns into a run of 0,0,0... and Ml equally most scores will be identical. I'd hope this means it compresses very well, but it does still have implications for tools that hold many uncompressed reads in memory (pileup, sort, etc).

@jts
Copy link
Contributor Author

jts commented Jul 22, 2021

  1. Agreed

  2. For my use case we call exclusively at CpG (or optionally a handful of other motifs that we have trained models on). So any C not followed by a G will not be tagged. In addition we have to skip long regions with very high CpG density (CGCGCGCGCG....) for limitations specific to nanopolish. These regions will not be tagged as well. So the distribution of called/not called sites depends on the underlying distribution of the motifs in the genome. I agree that toggling on/off the mode wouldn't be pleasant and I'd prefer to pay a higher storage cost to avoid it.

  3. @cjw85 and I were discussing this as well. I'll generate some test files with a dense representation (at all Cs), encoding skipped positions with likelihood 0.

@jts
Copy link
Contributor Author

jts commented Jul 22, 2021

Here's a quick test of sparse (current spec) versus dense tags (forcing output at all Cs) for 4,500 records from chr20:

Mm/Ml tags format size (bytes)
none bam 52,841,385
sparse bam 54,457,647
dense bam 54,987,708
none cram 33,764,559
sparse cram 34,152,974
dense cram 34,171,589

So, changing from sparse to dense tags increases size by 118 bytes per record for bam. For cram, the increase is only 4 bytes per record.

@jkbonfield
Copy link
Contributor

jkbonfield commented Jul 23, 2021

Percentage wise, it's a 33% growth in these tags for BAM and 5% for CRAM, but the overall percentage of the file is still minimal. That doesn't seem too bad. (It's more or less what I expected, with BAM suffering a larger growth.)

When designing these tags though I was thinking about the future, where calling models may get much much bigger and we're predicting dozens of different modification types. I suspect such models though won't be needing to use the explicit mode of labelling every single nucleotide, so I think this is workable.

Thanks for the data.

jts added a commit to jts/hts-specs that referenced this issue Aug 18, 2021
@jts
Copy link
Contributor Author

jts commented Aug 18, 2021

@jkbonfield I'd be interested to hear your thoughts on the proposed change linked above - if you think the basic structure is OK, I'll expand the examples to include these flags.

@jkbonfield
Copy link
Contributor

Apologies for not getting back to you on this. I've had my head down doing other things and been a bit remiss with the specs.

I think the proposal is reasonable, so yes please. it would be most helpful if you could expand the examples. Did we agree on the notion of a markup in the symbol part of the string, eg C+m*,10 vs C+m,10. Or even the corner case of C+mh*,10 implying we checked position 10, looking for m/h mods jointly.

@jts
Copy link
Contributor Author

jts commented Aug 18, 2021

No problem, I was off on holidays after our initial discussion so I'm just getting back to this myself.

We didn't decide on the positioning of the flag but I propose it be mandatory and following [+-] as it would be simpler to parse 3 single-character values, then the possibly variable-length modification code, rather than having this flag follow the mod code. Up for discussion though, as this would be a breaking change, but maybe this is OK since the format is so new.

@jkbonfield
Copy link
Contributor

I'd be happy with it being between the +/- and the mod types, although I don't think it has to be a breaking change. We could decree a default position (what we have now) and use the additional flag as a way to indicate explicit scoring (only those listed) rather than implicit (by omission). Note I'm not against having markup stating "this list is implicit" either, but I'm saying that the absence of such a markup shouldn't be invalid and will be interpreted as implicit by default.

@cjw85
Copy link

cjw85 commented Aug 18, 2021

@jkbonfield How are you thinking C+mh*,10 (or C+mh? in @jts's draft) is a corner case? I think you mean should it be always understood that both m and h were evaluated explicitely? The current spec indicates you always need to provide a score in the multi-modification case.

@jkbonfield
Copy link
Contributor

Sorry that was a poor choice of words. I meant more complex case, where we have multiple symbols listed but only one flag which binds to both rather than just one. Ie c+*mh (reversing it as discussed above) means both m and h are explicitly listed rather than only m and not h. It would be messy to try and describe one as explicit and one as implicit, plus we can always break the combined case down into two lists which then would permit such a description.

I haven't seen a draft yet either so can't comment on that specifically.

@jkbonfield
Copy link
Contributor

jkbonfield commented Aug 18, 2021

Also I'm not sure what symbols make the most sense here. I just had * as a sort of place-holder flag without thinking about it. If we wish to add extra markup to state implicit vs explicit list, then maybe we want a pair of symbols that are related in some way, eg > and <. Open to suggestions, but will see what @jts comes up with.

@jts
Copy link
Contributor Author

jts commented Aug 18, 2021

Sorry, I should have made this clear earlier, my proposed changes are here: jts@ea2784b

I chose . for assume-canonical as . is used to denote the reference base in other contexts (mpileup, etc).? for assume-unknown is fairly intuitive, imo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants