Commit
Recommend always using the MM ? . encoding.
The '.' code is the default interpretation, but historically tools
omitting '?' and '.' have used both styles.  An explicit definition in
the MM string removes any ambiguity.

See samtools#654 comments for background.
jkbonfield committed Jun 22, 2022
1 parent f2dbeb3 commit a80276e
Showing 1 changed file with 3 additions and 0 deletions.
3 changes: 3 additions & 0 deletions SAMtags.tex
@@ -491,9 +491,12 @@ \subsection{Base modifications}
Note `{\tt N}' may be used to match any base rather than specifically an `{\tt N}' call by the sequencing instrument.
This may be used in situations where the base modification is not a derivation of a standard base type.
This is followed by either plus or minus indicating the strand the modification was observed on (relative to the original sequenced strand of {\sf SEQ} with plus meaning same orientation),\footnote{Hence a tool that may reverse complement sequences does not need to understand how to manipulate the {\tt MM} and {\tt ML} tags.} and one or more base modification codes.

Following the base modification codes is an optional `{\tt .}' or `{\tt ?}' describing how skipped seq bases of the stated base type should be interpreted by downstream tools.
When this flag is `{\tt ?}', no information is provided about the modification status of the skipped bases.
When this flag is not present, or it is `{\tt .}', these bases should be assumed to have low probability of modification.\footnote{The decision whether a base is assumed to be unmodified or has a probability explicitly provided is up to the modification calling program. Some programs will elide calls with modification probabilities below a threshold to provide a more compact modification tag.}
While optional with a default interpretation, it is strongly recommended that tools are explicit and always add either `{\tt ?}' or `{\tt .}' to avoid ambiguity.
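
As an illustration of this recommendation, the following Python sketch rewrites an {\tt MM} value so that every entry carries an explicit flag. The function name `make_flags_explicit` and its `default` parameter are hypothetical, and the regular expression assumes modification codes are either lowercase letters or a numeric identifier. Filling in `.` is only safe when the producing tool is known to follow the default interpretation; it is a sketch under those assumptions, not a reference implementation.

```python
import re

# Assumed entry layout: base type, strand, modification codes,
# optional '.' or '?' flag, then zero or more ",<count>" fields.
_ENTRY = re.compile(r"^([ACGTUN][+-](?:[a-z]+|[0-9]+))([.?]?)((?:,[0-9]+)*)$")


def make_flags_explicit(mm, default="."):
    """Return the MM value with an explicit '.' or '?' on every entry.

    Entries that already carry a flag are left unchanged; entries
    without one receive `default` (hypothetical parameter).
    """
    if default not in (".", "?"):
        raise ValueError("default must be '.' or '?'")
    out = []
    for entry in mm.split(";"):
        if not entry:  # skip the empty piece after the trailing ';'
            continue
        m = _ENTRY.match(entry)
        if m is None:
            raise ValueError("unrecognised MM entry: " + entry)
        head, flag, counts = m.groups()
        out.append(head + (flag or default) + counts + ";")
    return "".join(out)
```

For example, `make_flags_explicit("C+m,5,12,0;C+h?,1;")` adds a `.` to the first entry and leaves the second, which already has `?`, untouched.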

This is then followed by a comma-separated list of how many seq bases of the stated base type to skip, stored as a delta from the previous entry, with 0 denoting the first (or next) base of that type, counting from the uncomplemented 5' end of the {\sf SEQ} field.
This number series is comparable to the numbers in an {\tt MD} tag,
albeit counting specific base types only and potentially reverse-complemented.
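
The delta encoding described above can be sketched in Python as follows. The helper name `modified_positions` is hypothetical, and the sketch assumes {\sf SEQ} is already in its original (not reverse-complemented) orientation, so no complementing is needed before counting. It returns 0-based indices into the sequence.

```python
def modified_positions(seq, base, deltas):
    """Convert MM skip counts into 0-based positions in seq.

    Assumes seq is in its original sequenced orientation.
    A base type of 'N' matches any base, as the text describes.
    """
    # Indices of every base of the stated type, in 5' to 3' order.
    candidates = [i for i, b in enumerate(seq) if base == "N" or b == base]
    positions = []
    idx = -1  # index into candidates of the previously reported base
    for skip in deltas:
        idx += skip + 1  # skip `skip` matching bases, then take the next one
        if idx >= len(candidates):
            raise ValueError("skip counts run past the end of the sequence")
        positions.append(candidates[idx])
    return positions
```

With the sequence `ACCGCTC`, whose `C` bases sit at indices 1, 2, 4 and 6, the skip list `[1, 1]` selects the second and fourth `C`, while `[0, 0]` selects the first two.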