Commit

Updates from review.
- Fixed typos / grammar.
- Require strand to be indicative of the original sequence orientation
  and not the current one listed in the file.
- Added example text for multiple choices of modification at a single
  site.
- Added space as a separator between MP entries.
- Adjusted the regexp to permit empty numeric series (representing no
  modification present).
jkbonfield committed Oct 21, 2019
1 parent 6b06097 commit 11d7fb9
Showing 1 changed file with 19 additions and 13 deletions.
32 changes: 19 additions & 13 deletions SAMtags.tex
@@ -457,7 +457,7 @@ \subsubsection{Color space}

\subsection{Base modifications}

-Base basications, including base methylation, are represented as a
+Base modifications, including base methylation, are represented as a
series of edits from the primary unmodified sequence stored in the
main SAM {\sf SEQ} field. If the modified base has no natural
unmodified form then this should be stored as ``N''.
@@ -468,29 +468,32 @@ \subsection{Base modifications}
this modification being correct, rather than the base being unmodified.

\begin{description}
-\item[MM:Z:\tagregex{([ACGTN][-+][a-z](,[0-9]+)+;)*}]
+\item[MM:Z:\tagregex{([ACGTN][-+][a-z](,[0-9]+)*;)*}]
\hfill\\
The first character is the unmodified base as seen in the {\sf SEQ}
field, one of {\tt A}, {\tt C}, {\tt G}, {\tt T} or {\tt N}, with
the exception that {\tt N} is used to match any base rather than
strictly {\tt N}. This is followed by either plus or minus
indicating the strand the modification was observed on (relative to
-the recorded strand of {\sf SEQ} with plus meaning same
-orientation), and a base modification symbol. This is then followed
-by a comma separated list of how many unmodified seq bases of the
+the original sequenced strand of {\sf SEQ} with plus meaning same
+orientation\footnote{Hence a tool that may reverse complement sequences does not need to understand how to manipulate the {\tt MM} and {\tt MP} tags.}), and a base modification symbol.
+This is then followed by a comma separated list of how many unmodified seq bases of the
stated base type to skip, stored as a delta to the last and starting
with 0 as the first (or next) base. Hence this number series is
comparable to the numbers in an {\tt MD} tag.

For example {\tt C+m,5,12,0;} tells us there are three 5-Methylcytosine
bases in the original {\sf SEQ}. The first 5 {\tt C} bases are
-unmodified and the 6th is modified, as are the 19th (12 inbetween the
+unmodified and the 6th is modified, as are the 19th (12 between the
6th and 19th) and 20th. Similarly {\tt G-m,14;} indicates the 15th
{\tt G} is a 5-Methylcytosine on the opposite strand.

This permits modifications to be listed on either strand with the rare
potential for both strands to have a modification at the same site.

+Note it is permitted for the coordinate list to be empty (for example {\tt MM:Z:C+m;}), which may be used as an explicit indicator that this base modification is not present.
+It is not permitted for coordinates to be beyond the length of the sequence.

If the modification is not one of the standard common types (listed
below) it can be specified as a numeric ChEBI code. For example
{\tt C+76792,57;} is the same as {\tt C+h,57;}.
@@ -536,22 +539,25 @@ \subsection{Base modifications}

\item[MP:Z:\tagvalue{qualities}]
\hfill\\
-The {\tt MP} tag if present lists the Phred qualities of each
-modification listed in the {\tt MM} tag. The length should match the
-number of position deltas from {\tt MM}. The qualities are encoded in
-the same manner as the primary {\sf QUAL} field; one byte per quality
-with ASCII value Phred score + 33. No separators should be present.
+The optional {\tt MP} tag lists the Phred qualities of each modification listed in the {\tt MM} tag.
+The qualities are encoded in the same manner as the primary {\sf QUAL} field; one byte per quality with ASCII value Phred score + 33.
+A space character (`{\tt \textvisiblespace}') should be used as a separator between concatenated quality strings when multiple modification lists are present in the {\tt MM} tag.
+The length should match the number of position deltas from {\tt MM} plus 1 per space character required.

For example {\tt MM:Z:C+m,5,12,3;C+h,57;} may have an associated
-quality tag of {\tt MP:Z:5EB/}.
+quality tag of {\tt MP:Z:5EB /}.

Quality values for ambiguity codes give the likelihood that the
modification is one of the possible codes compatible with that
-ambiguity code. For example {\tt MM:Z:C+C,10 MP:Z:+} indicates a C
+ambiguity code. For example {\tt MM:Z:C+C,10; MP:Z:+} indicates a C
call with an unspecified modification and the phred score of 10 (ASCII
value {\tt +}). This corresponds to a 90\% chance of the base being
modified.

+To represent several possible modifications at the same site the {\tt MP} tag can be used to indicate the probabilities of each possibility.
+The values used should be absolute probabilities, not relative between the alternatives.
+For example, a C base that has 90\% chance of being modified with 5mC being three times more likely than 5hmC will encode 5mC with 67.5\% probability ($0.9 * 0.75$) and 5hmC with 22.5\% probability ($0.9 * 0.25$).
+This could be represented with {\tt MM:Z:C+m,10;C+h,10; MP:Z:" \&}.

\end{description}
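
The delta encoding described for the {\tt MM} tag can be illustrated with a short Python sketch. This is not part of the commit; the names `ENTRY_RE` and `decode_mm` are hypothetical. It assumes that `seq` is already in the original sequenced orientation, and it accepts numeric ChEBI codes and upper-case ambiguity codes as the modification symbol, which is looser than the regex quoted in the diff.

```python
import re

# One MM entry: base, strand, modification symbol, optional comma-separated deltas.
# Assumption: the symbol may be a single letter or a numeric ChEBI code, which is
# looser than the regex shown in the diff.
ENTRY_RE = re.compile(r"([ACGTN])([-+])([A-Za-z]|[0-9]+)((?:,[0-9]+)*);")


def decode_mm(mm_value, seq):
    """Return a list of (base, strand, symbol, positions) tuples.

    positions are 0-based indices into seq, which is assumed to be in the
    original sequenced orientation.
    """
    results = []
    pos = 0
    while pos < len(mm_value):
        match = ENTRY_RE.match(mm_value, pos)
        if match is None:
            raise ValueError("malformed MM entry at offset %d" % pos)
        base, strand, symbol, deltas_text = match.groups()
        # ",5,12,0" -> [5, 12, 0]; an empty list means "modification not present".
        deltas = [int(d) for d in deltas_text.split(",")[1:]]
        # Indices of every SEQ base of the stated type; N matches any base.
        candidates = [i for i, b in enumerate(seq) if base == "N" or b == base]
        positions = []
        cursor = 0
        for delta in deltas:
            cursor += delta  # skip this many bases of the stated type
            if cursor >= len(candidates):
                raise ValueError("coordinate beyond the end of the sequence")
            positions.append(candidates[cursor])
            cursor += 1  # the next delta counts from the following base
        results.append((base, strand, symbol, positions))
        pos = match.end()
    return results
```

With a sequence of twenty alternating `AC` pairs, `decode_mm("C+m,5,12,0;", "AC" * 20)` selects the 6th, 19th and 20th `C`, matching the worked example in the text.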

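The {\tt MP} encoding (Phred score + 33, with a space between the quality strings of successive {\tt MM} lists) can be sketched in Python as follows. The function names are hypothetical, and rounding the Phred score to the nearest integer is an assumption; the text does not state a rounding rule.

```python
import math


def phred_char(p_correct):
    """ASCII character for the Phred score of a probability of being correct.

    Assumes 0 <= p_correct < 1 and rounds to the nearest integer score
    (the rounding rule is an assumption, not taken from the text).
    """
    error = 1.0 - p_correct
    score = round(-10.0 * math.log10(error))
    return chr(score + 33)


def encode_mp(probability_lists):
    """Build an MP value from one list of probabilities per MM entry."""
    return " ".join(
        "".join(phred_char(p) for p in probs) for probs in probability_lists
    )


def decode_mp(mp_value):
    """Split an MP value on spaces and convert each byte back to a Phred score."""
    return [[ord(c) - 33 for c in part] for part in mp_value.split(" ")]
```

For instance, `decode_mp("5EB /")` gives Phred scores 20, 36 and 33 for the first list and 14 for the second. Under the rounding assumption above, a probability of 0.9 maps to `+`, 0.675 maps to `&` and 0.225 maps to `"`.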