merge public master to my fork (#11)
* Update CRAM spec section on substitution matrix and codes.

* Respond to review comments.

* CRAM Slice and Container ref seq IDs must match (samtools#401)

* Code review part 2.

* Fix minor typo in the predecessors of BCF2 (samtools#427)

* layed -> laid

* Update MAINTAINERS.md (samtools#432)

Proposal to add Rasko Leinonen as refget maintainer.

* add jmmut to MAINTAINERS.md and move Cristina to "Past Members"

* change order of maintainers

* Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436)

INFO/END (when present) provides the size of the interval that the
variant is located in, along with the CHROM and POS fields. This is
also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's
description of BCF's rlen field.

The implications of INFO/END have not previously been clear. In the
absence of clear documentation, some SV tools have been using INFO/END
fields for their own semi-related purposes (using INFO/CHR2:INFO/END
as the other side's position in an interchromosomal rearrangement),
leading to broken .csi indexes and region queries that don't work.
Fixes samtools#425.

* Allow C and Java native text spellings of NaN and infinities (samtools#409)

C's printf doesn't output mixed case, while Java's Double.valueOf and
Double.toString parse/output only `Infinity`, not `Inf`. Rather than
requiring special-case code for both input and output in both languages,
relax the VCF specification to allow NAN/INF/INFINITY case-insensitively.

(Add "IEEE-754" to be specific and to improve the line breaks.)

* Adding a note about <NON_REF> (samtools#380)

* Update PDFs (CRAM and VCF additions; others cosmetic)

CRAMv3 PRs samtools#401 and samtools#412.
VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END).
All: Minor typo and whitespace formatting fixes.

* Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385)

Includes barebones authorizationCode OAuth2 flow, which should aid/inform
code generation.

Uses int64 with minimum 0, unsure if that is really a uint64 though.

* Codify the existing policy of generally squashing PRs (PR samtools#444)

* Permit AP_Delta in multi-ref slices.

This means AP_delta can become negative.  I have validated this
decodes fine in both htsjdk and htslib.  This is because AP is ITF8
and hence signed, like all other integers, so it would need explicit
code to forbid this (which obviously isn't in the implementations).
Hence the limitation is primarily one of an over-zealous specification.

The impact of this is for position-sorted multi-ref slices AP can
legally be stored efficiently.

Also clarified fields in the container compression header when in
multi-ref mode.

Fixes samtools#431
thefferon authored Oct 9, 2019
1 parent d70aca8 commit c5a5102
Showing 14 changed files with 403 additions and 38 deletions.
Binary file modified BCFv2_qref.pdf
Binary file not shown.
Binary file modified CRAMv3.pdf
Binary file not shown.
73 changes: 44 additions & 29 deletions CRAMv3.tex
@@ -446,13 +446,16 @@ \section{\textbf{Container header structure}}
\hline
itf8 & reference sequence id & reference sequence identifier or\linebreak{}
-1 for unmapped reads\linebreak{}
-2 for multiple reference sequences\tabularnewline
-2 for multiple reference sequences.\linebreak{}
All slices in this container must have a reference sequence id matching this value.\tabularnewline
\hline
itf8 & starting position on the reference & the alignment start position or\linebreak{}
0 for unmapped reads\tabularnewline
0 if the container is multiple-reference
or contains unmapped unplaced reads\tabularnewline
\hline
itf8 & alignment span & the length of the alignment or\linebreak{}
0 for unmapped reads\tabularnewline
0 if the container is multiple-reference
or contains unmapped unplaced reads\tabularnewline
\hline
itf8 & number of records & number of records in the container\tabularnewline
\hline
@@ -631,10 +634,11 @@ \subsubsection*{Data series encodings}
\hline
RL & encoding\texttt{<}int\texttt{>} & read lengths & read lengths\tabularnewline
\hline
AP & encoding\texttt{<}int\texttt{>} & in-seq positions & if \textbf{APDelta} = true: 0-based alignment start
delta from the previous record. When the record is the first in the slice,
its alignment start will be equal to that of the slice, so its alignment delta is 0.\linebreak{}
if \textbf{APDelta} = false: encodes the alignment start position directly\tabularnewline
AP & encoding\texttt{<}int\texttt{>} & in-seq positions & if \textbf{AP-Delta} = true: 0-based alignment start
delta from the AP value in the previous record.
Note this delta may be negative, for example when switching references in a multi-reference slice.
When the record is the first in the slice, the previous position used is the slice alignment-start field (hence the first delta should be zero for single-reference slices, or the AP value itself for multi-reference slices). \linebreak{}
if \textbf{AP-Delta} = false: encodes the alignment start position directly\tabularnewline
\hline
RG & encoding\texttt{<}int\texttt{>} & read groups & read groups. Special value
`-1' stands for no group.\tabularnewline
@@ -778,14 +782,15 @@ \subsection{\textbf{Slice header block}}
\hline
itf8 & reference sequence id & reference sequence identifier or\linebreak{}
-1 for unmapped reads\linebreak{}
-2 for multiple reference sequences\tabularnewline
-2 for multiple reference sequences.\linebreak{}
This value must match that of its enclosing container.\tabularnewline
\hline
itf8 & alignment start & the alignment start position.\linebreak{}
Ignored on read and set to 0 on write if the slice is multiple-reference
0 if the slice is multiple-reference
or contains unmapped unplaced reads\tabularnewline
\hline
itf8 & alignment span & the length of the alignment.\linebreak{}
Ignored on read and set to 0 on write if the slice is multiple-reference
0 if the slice is multiple-reference
or contains unmapped unplaced reads\tabularnewline
\hline
itf8 & number of records & the number of records in the slice\tabularnewline
@@ -1096,7 +1101,8 @@ \subsection{\textbf{CRAM positional data}}
Positional data is stored for both mapped and unmapped sequences, as unmapped data may still be ``placed'' at a specific location in the genome (without being aligned).
Typically this is done to keep a sequence pair (paired-end or mate-pair sequencing libraries) together when one of the pair aligns and the other does not.

The AP data series is delta encoded for reads mapped to a position-sorted slice containing data from a single reference, and as a normal integer value in all other cases.
For reads stored in a position-sorted slice, the AP-delta flag in the compression header preservation map should be set and the AP data series will be delta encoded, using the slice alignment-start value as the first position to delta against.
Note for multi-reference slices this may mean that the AP series includes negative values, such as when moving from an alignment to the end of one reference sequence to the start of the next or to unmapped unplaced data. When the AP-delta flag is not set the AP data series is stored as a normal integer value.
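The AP-delta rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the htslib or htsjdk implementation: the function name `decode_positions` and its parameters are hypothetical, and the AP values are assumed to have been read from the data series already.

```python
def decode_positions(ap_values, ap_delta, slice_alignment_start):
    """Turn raw AP data series values into alignment start positions.

    ap_values: integers already read from the AP data series (assumed input).
    ap_delta: the AP-delta flag from the compression header preservation map.
    slice_alignment_start: the alignment start field of the slice header
        (0 for multi-reference slices).
    """
    positions = []
    last_position = slice_alignment_start  # first delta is taken against this
    for ap in ap_values:
        if ap_delta:
            # Deltas may be negative, e.g. when the reference changes
            # inside a multi-reference slice.
            last_position += ap
            positions.append(last_position)
        else:
            positions.append(ap)  # stored directly, no delta
    return positions
```

For a single-reference slice starting at 100, `decode_positions([0, 5, 10], True, 100)` gives 100, 105 and 115. For a multi-reference slice (alignment start 0), a negative delta such as -950 simply moves the running position back towards the start of the next reference.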

\begin{tabular}{|>{\raggedright}p{70pt}|>{\raggedright}p{75pt}|>{\raggedright}p{90pt}|>{\raggedright}p{171pt}|}
\hline
@@ -1121,7 +1127,7 @@ \subsection{\textbf{CRAM positional data}}
\State $reference\_id\gets slice\_header.reference\_sequence\_id$
\EndIf
\State $read\_length \gets$ \Call{ReadItem}{RL, Integer}
\If{$container\_pmap.AP\_delta \ne 0$ \textbf{and} $slice\_header.reference\_sequence\_id \geq 0$}
\If{$container\_pmap.AP\_delta \ne 0$}
\If{$first\_record\_in\_slice$}
\State $last\_position\gets$ $slice\_header.alignment\_start$
\EndIf
@@ -1378,19 +1384,31 @@ \subsubsection*{Read feature codes}

\subsubsection*{Base substitution codes (BS data series)}

A base substitution is defined as a change from one nucleotide base (reference
base) to another (read base) including N as an unknown or missing base. There are
5 possible bases ACGTN, 4 possible substitutions for each base and 20 substitutions
in total. Substitutions for the same reference base are assigned integer codes
from 0 to 3 inclusive. To restore a base one would need to know its substitution
code and the reference base.
A base substitution is defined as a change from one nucleotide base (reference base) to
another (read base), including N as an unknown or missing base. There are 5 possible reference
bases (ACGTN), with 4 possible substitutions for each base, and 20 substitutions in total.
The codes for all possible substitutions are stored in a substitution matrix. To restore a
base, one would use the reference base and the substitution code, resolving the base via lookup
in the substitution matrix.

\subsubsection*{Substitution Matrix Format}

A base substitution matrix assigns integer codes to all possible substitutions.
Each of the 4 possible substitutions for a given reference base is assigned a 2-bit integer
code (see below) with a value ranging from 0 to 3 inclusive. The 4 2-bit codes are packed
into a single byte, high 2-bits first, for each base ACGTN (minus the reference base itself).
The entire substitution matrix is written as 5 such bytes, one for each reference base, also
in the order ACGTN.

Substitution matrix is written as follows. Substitutions for a given reference
base are sorted by their frequencies in descending order then assigned numbers
from 0 to 3. Same-frequency ties are broken using alphabetical order. For example,
let us assume the following substitution frequencies for base A:
\subsubsection*{Substitution Code Assignment}

To assign the substitution code for a given reference base/read base, the substitutions for
each reference base may optionally be sorted by their frequencies, in descending order, with
same-frequency ties broken using the fixed order ACGTN. Although sorting by substitution
frequency is not required by the CRAM format, assigning substitution codes based on frequency
maximizes compression by ensuring that the most frequent substitutions use the shortest possible
codes.

For example, let us assume the following substitution frequencies for base A:

AC: 15\%

@@ -1410,12 +1428,9 @@ \subsubsection*{Base substitution codes (BS data series)}

AN: 3

and they are written as a single byte, 10 01 00 11 = 147 decimal or 0x93 in this
case. The whole substitution matrix is written as 5 bytes, one for each reference
base in the alphabetical order: A, C, G, T and N.

Note: the last two bits of each substitution code are redundant but still required
to simplify the reading.
The first byte of the substitution matrix entry for reference base A is written as a single byte,
with the codes in the order CGTN: 10 01 00 11 = 147 decimal, or 0x93 in this case. This will then
be followed by 4 more bytes representing substitutions for reference bases C, G, T and N.
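The code assignment and byte packing described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and because the diff view elides most of the example frequencies, the values used for AG, AT and AN in the usage note are assumptions chosen to reproduce the 0x93 byte shown in the text (only AC: 15\% appears in the source).

```python
BASES = "ACGTN"  # fixed order used for tie-breaking and for packing


def assign_codes(ref_base, freqs):
    """Rank the 4 substitutions for ref_base by descending frequency.

    Ties are broken using the fixed order ACGTN. Returns {read_base: code}.
    """
    others = [b for b in BASES if b != ref_base]
    ranked = sorted(others, key=lambda b: (-freqs.get(b, 0), BASES.index(b)))
    return {base: code for code, base in enumerate(ranked)}


def pack_row(ref_base, codes):
    """Pack the four 2-bit codes into one byte, high 2 bits first."""
    byte = 0
    for base in BASES:
        if base != ref_base:
            byte = (byte << 2) | codes[base]
    return byte


def unpack_row(ref_base, byte):
    """Inverse of pack_row: recover {read_base: code} from one matrix byte."""
    others = [b for b in BASES if b != ref_base]
    return {b: (byte >> shift) & 3 for b, shift in zip(others, (6, 4, 2, 0))}


def read_base(ref_base, byte, code):
    """Resolve a BS substitution code to the read base via the matrix byte."""
    for base, base_code in unpack_row(ref_base, byte).items():
        if base_code == code:
            return base
    raise ValueError("code must be in the range 0..3")
```

With the assumed frequencies `{"C": 15, "G": 25, "T": 55, "N": 5}` for reference base A, `assign_codes` yields T=0, G=1, C=2, N=3, and `pack_row("A", codes)` packs them in the order CGTN as 10 01 00 11, matching the 0x93 byte in the example. A full matrix would repeat this for reference bases C, G, T and N to produce the remaining 4 bytes.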

\subsubsection*{Decode mapped read pseudocode}

9 changes: 7 additions & 2 deletions MAINTAINERS.md
@@ -22,11 +22,11 @@ Past CRAM maintainers include Vadim Zalunin.

### VCF/BCF

* Cristina Yenyxe Gonzalez Garcia (@cyenyxe)
* Louis Bergelson (@lbergelson)
* Petr Danecek (@pd3)
* Jose Miguel Mut Lopez (@jmmut)

Past VCF/BCF maintainers include Ryan Poplin and David Roazen.
Past VCF/BCF maintainers include Cristina Yenyxe Gonzalez Garcia, Ryan Poplin, and David Roazen.

### Htsget

@@ -38,6 +38,7 @@ Past VCF/BCF maintainers include Ryan Poplin and David Roazen.

* Andy Yates (@andrewyatz)
* Matt Laird (@lairdm)
* Rasko Leinonen (@raskoleinonen)

[ga4gh-ff]: https://www.ga4gh.org/howwework/workstreams/#lsg

@@ -49,6 +50,10 @@ Larger changes should be proposed as pull requests so that they can be discussed
(Even those with write access to the **samtools/hts-specs** repository should in general create their pull request branches within their own **hts-specs** forks.
This way when the main repository is forked again, the new fork is created with a minimum of extraneous volatile branches.)

In general, pull requests should be squashed and rebased before merging: _squashed_ to avoid immortalising trivial editorial commits that occurred during refinement of the PR, and _rebased_ (where practical) to avoid unnecessary merge commits.
Cases where this shouldn't be done include pull requests with multiple non-trivial commits (e.g., separate changes, or a series of commits that tells a story), which should be rebased and/or, if the branch point is reasonably recent, simply merged with a merge commit.

Ensure that the pull request number is present in the resulting commit history, either in the merge commit message or by adding `(PR #NNN)` to the first line of the squashed commit or one that is representative of the PR.

## Generating PDF specification documents

Binary file modified SAMtags.pdf
Binary file not shown.
Binary file modified SAMv1.pdf
Binary file not shown.
Binary file modified VCFv4.1.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion VCFv4.1.tex
@@ -1214,7 +1214,7 @@ \subsubsection{Type encoding}

\vspace{0.3cm}

\textbf{Vectors} --- The BCF2 type byte may indicate that the upcoming data stream contains not a single value but a fixed length vector of values. The vector values occur in order (1st, 2nd, 3rd, etc) encoded as expected for the type declared in the vector's type byte. For example, a vector of 3 16-bit integers would be layed out as first the vector type byte, followed immediately by 3 2-byte values for each integer, including a total of 7 bytes.
\textbf{Vectors} --- The BCF2 type byte may indicate that the upcoming data stream contains not a single value but a fixed length vector of values. The vector values occur in order (1st, 2nd, 3rd, etc) encoded as expected for the type declared in the vector's type byte. For example, a vector of 3 16-bit integers would be laid out as first the vector type byte, followed immediately by 3 2-byte values for each integer, including a total of 7 bytes.
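The 7-byte layout in this example can be checked with a short Python sketch. It assumes the BCF2 typing-byte convention from the wider specification (vector length in the upper 4 bits, type code in the lower 4 bits, with type code 2 for 16-bit integers, little-endian values); those details are not part of this hunk, and the function name is hypothetical.

```python
import struct

BCF_INT16 = 2  # assumed BCF2 type code for 16-bit integers


def encode_int16_vector(values):
    """Encode a short vector of 16-bit integers: type byte, then the values.

    Only lengths 1..14 are handled here; longer vectors need the overflow
    length encoding, which this sketch does not cover.
    """
    n = len(values)
    if not 0 < n < 15:
        raise ValueError("this sketch handles vector lengths 1..14 only")
    type_byte = (n << 4) | BCF_INT16
    # "<" selects little-endian with no padding: 1 byte + n * 2 bytes.
    return struct.pack("<B%dh" % n, type_byte, *values)
```

Calling `encode_int16_vector([1, 2, 3])` produces one type byte (0x32) followed by three 2-byte values, 7 bytes in total.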

Missing values in vectors are handled slightly differently from atomic values. There are two possibilities for missing values:

Binary file modified VCFv4.2.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion VCFv4.2.tex
@@ -1231,7 +1231,7 @@ \subsubsection{Type encoding}

\vspace{0.3cm}

\textbf{Vectors} --- The BCF2 type byte may indicate that the upcoming data stream contains not a single value but a fixed length vector of values. The vector values occur in order (1st, 2nd, 3rd, etc) encoded as expected for the type declared in the vector's type byte. For example, a vector of 3 16-bit integers would be layed out as first the vector type byte, followed immediately by 3 2-byte values for each integer, including a total of 7 bytes.
\textbf{Vectors} --- The BCF2 type byte may indicate that the upcoming data stream contains not a single value but a fixed length vector of values. The vector values occur in order (1st, 2nd, 3rd, etc) encoded as expected for the type declared in the vector's type byte. For example, a vector of 3 16-bit integers would be laid out as first the vector type byte, followed immediately by 3 2-byte values for each integer, including a total of 7 bytes.

Missing values in vectors are handled slightly differently from atomic values. There are two possibilities for missing values:

Binary file modified VCFv4.3.pdf
Binary file not shown.
23 changes: 18 additions & 5 deletions VCFv4.3.tex
@@ -94,7 +94,9 @@ \subsection{Character encoding, non-printable characters and characters with spe


\subsection{Data types}
Data types supported by VCF are: Integer (32-bit, signed), Float (32-bit, formatted to match the regular expression \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$|, \texttt{NaN}, or \texttt{+/-Inf}), Flag, Character, and String.
Data types supported by VCF are: Integer (32-bit, signed), Float (32-bit IEEE-754, formatted to match one of the regular expressions \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| or \verb"^[-+]?(INF|INFINITY|NAN)$" case insensitively),%
\footnote{Note Java's {\tt Double.valueOf} is particular about capitalisation, so additional code is needed to parse all VCF infinite/NaN values.}
Flag, Character, and String.
For the Integer type, the values from $-2^{31}$ to $-2^{31}+7$ cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see \ref{BcfTypeEncoding}.
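As a rough illustration of the relaxed Float syntax, the two regular expressions can be applied directly in Python, whose built-in `float` already accepts `inf`, `infinity` and `nan` in any letter case. The function name `parse_vcf_float` is hypothetical, and the sketch returns a Python double rather than a 32-bit value.

```python
import re

# The two patterns quoted in the Data types paragraph above.
NUMERIC = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$")
SPECIAL = re.compile(r"^[-+]?(INF|INFINITY|NAN)$", re.IGNORECASE)


def parse_vcf_float(text):
    """Parse a VCF Float field value, rejecting anything outside the grammar."""
    # fullmatch avoids "$" also matching just before a trailing newline.
    if NUMERIC.fullmatch(text) or SPECIAL.fullmatch(text):
        return float(text)
    raise ValueError("not a valid VCF Float: %r" % text)
```

Values such as `1e-3`, `.5`, `-Infinity`, `inf` and `NaN` are accepted, while `1.` is rejected because the numeric pattern requires at least one digit after the decimal point. A Java parser would need extra handling, since `Double.valueOf` is particular about capitalisation as the footnote observes.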

\subsection{Meta-information lines}
@@ -355,7 +357,7 @@ \subsubsection{Fixed fields}
INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: key[=data[,data]].
INFO keys must match the regular expression \texttt{\^{}([A-Za-z\_][0-9A-Za-z\_.]*|1000G)\$}, please note that ``1000G'' is allowed as a special legacy value.
Duplicate keys are not allowed.
Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} are reserved (albeit optional).
Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} and described below are reserved (albeit optional).

The exact format of each INFO key should be specified in the meta-information (as described above).
Example for an INFO field: DP=154;MQ=52;H2.
@@ -386,7 +388,7 @@ \subsubsection{Fixed fields}
CIGAR & A & String & Cigar string describing how to align an alternate allele to the reference allele \\
DB & 0 & Flag & dbSNP membership \\
DP & 1 & Integer & Combined depth across samples \\
END & 1 & Integer & End position (for use with symbolic alleles) \\
END & 1 & Integer & End position on CHROM (used with symbolic alleles; see below) \\
H2 & 0 & Flag & HapMap2 membership \\
H3 & 0 & Flag & HapMap3 membership \\
MQ & 1 & Float & RMS mapping quality \\
@@ -398,6 +400,15 @@ \subsubsection{Fixed fields}
1000G & 0 & Flag & 1000 Genomes membership \\
\end{longtable}

\begin{itemize}
\renewcommand{\labelitemii}{$\circ$}
\item END: End reference position (1-based), indicating the variant spans positions POS--END on reference/contig CHROM.
Normally this is the position of the last base in the REF allele, so it can be derived from POS and the length of REF, and no END INFO field is needed.
However when symbolic alleles are used, e.g.\ in gVCF or structural variants, an explicit END INFO field provides variant span information that is otherwise unknown.

This field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position.
\end{itemize}
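The relationship between POS, REF and INFO/END described above can be sketched as follows. The function names are hypothetical, the INFO field is assumed to be already parsed into a dictionary, and the `rlen` formula (END minus POS plus 1) is an assumption based on treating POS--END as an inclusive 1-based span; the BCF site encoding section remains the authority.

```python
def variant_end(pos, ref, info):
    """Return the 1-based end position of a record on CHROM.

    An explicit INFO/END wins; otherwise the end is the position of the
    last base of the REF allele.
    """
    if "END" in info:
        return int(info["END"])
    return pos + len(ref) - 1


def reference_length(pos, ref, info):
    """Length of the inclusive span POS..END, as an indexer might use it."""
    return variant_end(pos, ref, info) - pos + 1
```

For example, a record at POS 100 with REF `ACGT` and no END spans 100 to 103, whereas a symbolic-allele record at POS 100 with REF `A` and `END=5000` spans 100 to 5000. Putting a position from another chromosome into END would yield a meaningless span here, which is the indexing problem the commit message describes.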

\subsubsection{Genotype fields}
If genotype information is present, then the same types of data must be present for all samples.
First a FORMAT field is given specifying the data types and order (colon-separated FORMAT keys matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate keys are not allowed).
@@ -1393,9 +1404,10 @@ \subsubsection{Phasing adjacencies in an aneuploid context}
\pagebreak
\subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format\footnote{\url{https://help.basespace.illumina.com/articles/descriptive/gvcf-files/}}.
The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele.

The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele represented as $<$*$>$.
Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise).
A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele.
The $<$*$>$ representation is preferred over the symbolic allele $<$NON\_REF$>$.

Example records are given below:
\scriptsize
@@ -1529,6 +1541,7 @@ \subsection{BCF2 records}
Compression of a BCF file is recommended but not required.

\subsubsection{Site encoding}
\label{BcfSiteEncoding}

{\small
\begin{tabular}{|l | l | p{30em} | } \hline
1 change: 1 addition & 0 deletions htsget.md
@@ -22,6 +22,7 @@ Explicitly this API does NOT:

* Provide a way to discover the identifiers for valid ReadGroupSets --- clients obtain these via some out of band mechanism

This protocol specification is accompanied by a [corresponding OpenAPI description](pub/htsget-openapi.yaml). OpenAPI is a language-independent way of describing REST services and is compatible with a number of [third party tools](http://openapi.tools/).

# Protocol essentials
