-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mm tag clarification #582
Comments
To follow-up on the Guppy behaviour, Guppy can output a modification score for any (relevant) canonical base regardless of locus and surrounding motif; its not limited to e.g. labelling bases in DAM, DCM, CpG, etc. It does however have the option:
which in essence masks the output and cuts down the entries in the bam tag. That could be interpreted as the basecaller determining that there's a low probability of modification: an assessment was made, the score was low and was dispensed with for brevity. I do agree with @jts that the specification needs to be more explicit even if thats "the interpretation is undefined and implementation dependent". |
Any suggested wording? My suggestion would be "The lack of a modification should be considered as indicative of an unmodified base with a high, but implementation dependent, probability." As mentioned above, the intent here was for the unlisted entries to be consider as having negligable modification probabilities. It's a useful default position as it means there is an option of shrinking down the modification string, which would otherwise be quite a significant overhead once we start to have callers trained on multiple modification calls for A, C, G and T. |
Thanks @jkbonfield, your suggested wording sounds good to me. For my use case (which I admit is somewhat outside the intention of the spec) it is desirable to annotate unknown/non-informative sites. This could be due to caller restrictions (e.g. only calling at CpG sites, so modification status at other Cs is unknown) or technical issues (a certain region of the read couldn't be processed, for various reasons). Is there a way to handle this? I see three possibilities:
I'd prefer 2 or 3, over option 1, which would require special interpretation downstream. |
As biased as I might be given my employer toward the contrary position, I tend to agree this seems like a deficiency in the specification: there is no way to inform the end user that no call was attempted. That doesn't mean the user should blindly assume some bases to be unmodified. I can't immediately think of alternatives to @jts's proposals that provide definitive guidance to an end user. |
Thinking about option 1 more: expressing the idea of unknown/uncalled like this is not unsavoury simply because its somewhat asserting a uniform prior and confused with true P=0.5 calls, but also if we record an entry for every base we're losing any benefit of the tag structure in being brief and alternatively could take the opportunity to be simultaneously more explicit (e.g. option 2). Something more like option 3 is most appealing: it extend the current spec. I thought about the C+x mask being optional to leave the current idea that a lack of call implies a canonical base, but that's a little messy in that the interpretation of C+m becomes different when the mask is present. A mandatory, but possibly empty, mask is more clear. |
All of these require the caller to be amended, and some require the decoder to be aware too. Option 1 doesn't need any decoder changes, but what value would you use? How would you tell them apart from called modifications that just happen to have that value as a probability? Hence it doesn't seem to actually solve the problem. Option 2 could work, albeit perhaps requiring the tools reading the data stream to have recognition for a specific code, but if we went down this road then 0 would make sense as it's an illogical value to write out given the default position of not-listed means not-(probably)-modified and a naive interpretation that doesn't treat 0 as a special case should still work (albeit assuming no modification rather than lack of looking for a modification). That's still better than reading it as having a very high quality modification. If we explicitly did this for every C then it would mean the Option 3 is like 2 but a bit messier as it has the added complexity of what to do with Ml? Writing out So of the three, option 2 feels best.
Thus we could have:
All Cs have been checked and 6th, 24th and 27th Cs have been called as methylated. In this scenario we can omit the positions we check but find to be unmodified. Or:
We checked C position 6, 24 and 27 and their likelihood is recorded, with everything else unchecked. In this scenario we must include the positions we check and find to be unmodified, so the corresponding Ml line will have lots of low qualities. There would be no Ml entry for the C position Option 5 is more similar to your option 3, but without the requirement to explicitly list all the other C positions. |
I agree with your assessment of the different options, and favour option 5, so put my vote down in that column. While it would require a breaking change to the format, it would be nice if the extra flag was part of the description segment ( |
I did consider that option, and don't mind either way. Basically it's just a marker so it can fit anywhere. I was thinking on it again this morning and basically it's distinguishing between implicit and explicit calls for unmodified. The current definition is implicit - we list the modified bases and implicitly assert everything else to be unmodified. The alternative is that we explicitly state which bases are unmodified (as well as obviously the modified ones), meaning the remainder are unspecified instead. I'm not sure if that is a clearer way to describe it though. :) |
I would think of the implicit form as applying some prior belief that all bases are canonical. I think @jts and I agree that the assymetric property of the implicit version isn't terribly nice; why should we assume canonical, not assume modified? Only because in the absence of the modified tags that's what we've been doing to this point! We should be careful in saying in the implicit case that things are unmodified: the more accurate statement in the case of Guppy's current behaviour is that the absent positions have a modification probability less than some threshold. I prefer option 5 over putting the marker in a header, where it is easily lost. |
(Via John Marshall from conf call) A couple more thoughts on this.
|
|
Here's a quick test of sparse (current spec) versus dense tags (forcing output at all Cs) for 4,500 records from chr20:
So, changing from sparse to dense tags increases size by 118 bytes per record for bam. For cram, the increase is only 4 bytes per record. |
Percentage wise, it's a 33% growth in these tags for BAM and 5% for CRAM, but the overall percentage of the file is still minimal. That doesn't seem too bad. (It's more or less what I expected, with BAM suffering a larger growth.) When designing these tags though I was thinking about the future, where calling models may get much much bigger and we're predicting dozens of different modification types. I suspect such models though won't be needing to use the explicit mode of labelling every single nucleotide, so I think this is workable. Thanks for the data. |
@jkbonfield I'd be interested to hear your thoughts on the proposed change linked above - if you think the basic structure is OK, I'll expand the examples to include these flags. |
Apologies for not getting back to you on this. I've had my head down doing other things and been a bit remiss with the specs. I think the proposal is reasonable, so yes please. it would be most helpful if you could expand the examples. Did we agree on the notion of a markup in the symbol part of the string, eg |
No problem, I was off on holidays after our initial discussion so I'm just getting back to this myself. We didn't decide on the positioning of the flag but I propose it be mandatory and following |
I'd be happy with it being between the +/- and the mod types, although I don't think it has to be a breaking change. We could decree a default position (what we have now) and use the additional flag as a way to indicate explicit scoring (only those listed) rather than implicit (by omission). Note I'm not against having markup stating "this list is implicit" either, but I'm saying that the absence of such a markup shouldn't be invalid and will be interpreted as implicit by default. |
@jkbonfield How are you thinking |
Sorry that was a poor choice of words. I meant more complex case, where we have multiple symbols listed but only one flag which binds to both rather than just one. Ie I haven't seen a draft yet either so can't comment on that specifically. |
Also I'm not sure what symbols make the most sense here. I just had |
Sorry, I should have made this clear earlier, my proposed changes are here: jts@ea2784b I chose |
In implementing the Mm/Ml tags in nanopolish, and parsing them in a small toolkit I'm writing, I've run into a potential ambiguity I'd like clarification on. For illustration, here's a toy example:
Here, the first C should be interpreted as having a high probability of modification, and the third C should be considered canonical with high probability. Should the second C be considered canonical (without any indication of probability?) or should it be treated as non-informative/undefined? In my current implementation untagged bases are considered non-informative (so wouldn't be used when calculating methylation frequency at a particular genomic position) but I believe the modification-aware version of guppy considers them to be canonical, so it would be good to clarify how these should be interpreted.
The text was updated successfully, but these errors were encountered: