merge public master to my fork (#11)

* Update CRAM spec section on substitution matrix and codes.
* Respond to review comments.
* CRAM Slice and Container ref seq IDs must match (samtools#401)
* Code review part 2.
* Fix minor typo in the predecessors of BCF2 (samtools#427)
* layed -> laid
* Update MAINTAINERS.md (samtools#432)
  Proposal to add Rasko Leinonen as refget maintainer.
* add jmmut to MAINTAINERS.md and move Cristina to "Past Members"
* change order of maintainers
* Clarify that INFO/END is used to form a CHROM:POS-END region (PR samtools#436) (samtools#436)
  INFO/END (when present) provides the size of the interval that the variant is located in, along with the CHROM and POS fields. This is also used when indexing VCF/BCF files, as can be gleaned from §6.3.1's description of BCF's rlen field.
  The implications of INFO/END have not previously been clear. In the absence of clear documentation, some SV tools have been using INFO/END fields for their own semi-related purposes (using INFO/CHR2:INFO/END as the other side's position in an interchromosomal rearrangement), leading to broken .csi indexes and region queries that don't work.
  Fixes samtools#425.
* Allow C and Java native text spellings of NaN and infinities (samtools#409)
  C's printf doesn't output mixed case, while Java's Double.valueOf and Double.toString parse/output only `Infinity`, not `Inf`. Rather than requiring special-case code for both input and output in both languages, relax the VCF specification to allow NAN/INF/INFINITY case-insensitively.
  (Add "IEEE-754" to be specific and to improve the line breaks.)
* Adding a note about <NON_REF> (samtools#380)
* Update PDFs (CRAM and VCF additions; others cosmetic)
  CRAMv3 PRs samtools#401 and samtools#412. VCFv4.3 PRs samtools#380 (<NON_REF>), samtools#409 (infinity/NaN), and samtools#436 (INFO/END). All: Minor typo and whitespace formatting fixes.
* Add htsget 1.2.0 OpenAPI v3.0.2 spec (PR samtools#385)
  Includes barebones authorizationCode Oauth2 flow, which should aid/inform code generation. Uses int64 with minimum 0, unsure if that is really an uint64 though.
* Codify the existing policy of generally squashing PRs (PR samtools#444)
* Permit AP_Delta in multi-ref slices.
  This means AP_delta can become negative. I have validated this decodes fine in both htsjdk and htslib. This is because AP is ITF8 and hence signed, like all other integers, so it would need explicit code to forbid this (which obviously isn't in the implementations). Hence the limitation is primarily one of an over-zealous specification.
  The impact of this is for position-sorted multi-ref slices AP can legally be stored efficiently.
  Also clarified fields in the container compression header when in multi-ref mode.
  Fixes samtools#431
thefferon · Oct 9, 2019 · c5a5102
1 parent d70aca8
commit c5a5102

Showing 14 changed files with 403 additions and 38 deletions.
diff --git a/BCFv2_qref.pdf b/BCFv2_qref.pdf
diff --git a/CRAMv3.pdf b/CRAMv3.pdf
diff --git a/CRAMv3.tex b/CRAMv3.tex
@@ -446,13 +446,16 @@ \section{\textbf{Container header structure}}
 \hline
 itf8 & reference sequence id & reference sequence identifier  or\linebreak{}
 -1 for unmapped reads\linebreak{}
--2 for multiple reference sequences\tabularnewline
+-2 for multiple reference sequences.\linebreak{}
+All slices in this container must have a reference sequence id matching this value.\tabularnewline
 \hline
 itf8 & starting position on the reference & the alignment start position or\linebreak{}
-0 for unmapped reads\tabularnewline
+0 if the container is multiple-reference
+or contains unmapped unplaced reads\tabularnewline
 \hline
 itf8 & alignment span & the length of the alignment or\linebreak{}
-0 for unmapped reads\tabularnewline
+0 if the container is multiple-reference
+or contains unmapped unplaced reads\tabularnewline
 \hline
 itf8 & number of records & number of records in the container\tabularnewline
 \hline
@@ -631,10 +634,11 @@ \subsubsection*{Data series encodings}
 \hline
 RL & encoding\texttt{<}int\texttt{>} & read lengths & read lengths\tabularnewline
 \hline
-AP & encoding\texttt{<}int\texttt{>} & in-seq positions & if \textbf{APDelta} = true: 0-based alignment start
-delta from the previous record.  When the record is the first in the slice,
-its alignment start will be equal to that of the slice, so its alignment delta is 0.\linebreak{}
-if \textbf{APDelta} = false: encodes the alignment start position directly\tabularnewline
+AP & encoding\texttt{<}int\texttt{>} & in-seq positions & if \textbf{AP-Delta} = true: 0-based alignment start
+delta from the AP value in the previous record.
+Note this delta may be negative, for example when switching references in a multi-reference slice.
+When the record is the first in the slice, the previous position used is the slice alignment-start field (hence the first delta should be zero for single-reference slices, or the AP value itself for multi-reference slices).  \linebreak{}
+if \textbf{AP-Delta} = false: encodes the alignment start position directly\tabularnewline
 \hline
 RG & encoding\texttt{<}int\texttt{>} & read groups & read groups. Special value 
 `-1' stands for no group.\tabularnewline
@@ -778,14 +782,15 @@ \subsection{\textbf{Slice header block}}
 \hline
 itf8 & reference sequence id & reference sequence identifier or\linebreak{}
 -1 for unmapped reads\linebreak{}
--2 for multiple reference sequences\tabularnewline
+-2 for multiple reference sequences.\linebreak{}
+This value must match that of its enclosing container.\tabularnewline
 \hline
 itf8 & alignment start & the alignment start position.\linebreak{}
-Ignored on read and set to 0 on write if the slice is multiple-reference
+0 if the slice is multiple-reference
 or contains unmapped unplaced reads\tabularnewline
 \hline
 itf8 & alignment span & the length of the alignment.\linebreak{}
-Ignored on read and set to 0 on write if the slice is multiple-reference
+0 if the slice is multiple-reference
 or contains unmapped unplaced reads\tabularnewline
 \hline
 itf8 & number of records & the number of records in the slice\tabularnewline
@@ -1096,7 +1101,8 @@ \subsection{\textbf{CRAM positional data}}
 Positional data is stored for both mapped and unmapped sequences, as unmapped data may still be ``placed'' at a specific location in the genome (without being aligned).
 Typically this is done to keep a sequence pair (paired-end or mate-pair sequencing libraries) together when one of the pair aligns and the other does not.
 
-The AP data series is delta encoded for reads mapped to a position-sorted slice containing data from a single reference, and as a normal integer value in all other cases.
+For reads stored in a position-sorted slice, the AP-delta flag in the compression header preservation map should be set and the AP data series will be delta encoded, using the slice alignment-start value as the first position to delta against.
+Note for multi-reference slices this may mean that the AP series includes negative values, such as when moving from an alignment to the end of one reference sequence to the start of the next or to unmapped unplaced data.  When the AP-delta flag is not set the AP data series is stored as a normal integer value.
 
 \begin{tabular}{|>{\raggedright}p{70pt}|>{\raggedright}p{75pt}|>{\raggedright}p{90pt}|>{\raggedright}p{171pt}|}
 \hline
@@ -1121,7 +1127,7 @@ \subsection{\textbf{CRAM positional data}}
   \State $reference\_id\gets slice\_header.reference\_sequence\_id$
 \EndIf
 \State $read\_length \gets$ \Call{ReadItem}{RL, Integer}
-\If{$container\_pmap.AP\_delta \ne 0$ \textbf{and} $slice\_header.reference\_sequence\_id \geq 0$}
+\If{$container\_pmap.AP\_delta \ne 0$}
     \If{$first\_record\_in\_slice$}
         \State $last\_position\gets$ $slice\_header.alignment\_start$
     \EndIf
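
Note on the hunk above: with the reference-id guard removed, AP-delta decoding becomes the same running sum for single-reference and multi-reference slices, seeded from the slice alignment-start field. The following is a minimal Python sketch of that loop, not code from htslib or htsjdk; `decode_ap` and its parameter names are hypothetical, and the input values are made up for illustration.

```python
from typing import Iterable, List


def decode_ap(values: Iterable[int], slice_alignment_start: int,
              ap_delta: bool) -> List[int]:
    """Turn raw AP data-series values into alignment start positions."""
    if not ap_delta:
        # AP-Delta = false: each value is the alignment start itself.
        return list(values)
    positions = []
    # The first delta is taken against the slice alignment-start field.
    last_position = slice_alignment_start
    for delta in values:
        # Deltas are signed, so the position may move backwards, e.g. when
        # the next record belongs to a different reference sequence.
        last_position += delta
        positions.append(last_position)
    return positions


# Single-reference slice starting at 1000: the first delta is zero.
print(decode_ap([0, 5, 12], 1000, True))           # [1000, 1005, 1017]
# Multi-reference slice (alignment start 0): the first delta is the AP value
# itself, and the third delta is negative.
print(decode_ap([99990, 10, -99900], 0, True))     # [99990, 100000, 100]
```

Because the deltas are ordinary signed integers, no special case is needed for the negative step in the second call.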
@@ -1378,19 +1384,31 @@ \subsubsection*{Read feature codes}
 
 \subsubsection*{Base substitution codes (BS data series)}
 
-A base substitution is defined as a change from one nucleotide base (reference 
-base) to another (read base) including N as an unknown or missing base. There are 
-5 possible bases ACGTN, 4 possible substitutions for each base and 20 substitutions 
-in total. Substitutions for the same reference base are assigned integer codes 
-from 0 to 3 inclusive. To restore a base one would need to know its substitution 
-code and the reference base. 
+A base substitution is defined as a change from one nucleotide base (reference base) to
+another (read base), including N as an unknown or missing base. There are 5 possible reference
+bases (ACGTN), with 4 possible substitutions for each base, and 20 substitutions in total.
+The codes for all possible substitutions are stored in a substitution matrix. To restore a
+base, one would use the reference base and the substitution code, resolving the base via lookup
+in the substitution matrix.
+
+\subsubsection*{Substitution Matrix Format}
 
-A base substitution matrix assigns integer codes to all possible substitutions. 
+Each of the 4 possible substitutions for a given reference base is assigned a 2-bit integer
+code (see below) with a value ranging from 0 to 3 inclusive. The 4 2-bit codes are packed
+into a single byte, high 2-bits first, for each base ACGTN (minus the reference base itself).
+The entire substitution matrix is written as 5 such bytes, one for each reference base, also
+in the order ACGTN.
 
-Substitution matrix is written as follows. Substitutions for a given reference 
-base are sorted by their frequencies in descending order then assigned numbers 
-from 0 to 3. Same-frequency ties are broken using alphabetical order. For example, 
-let us assume the following substitution frequencies for base A: 
+\subsubsection*{Substitution Code Assignment}
+
+To assign the substitution code for a given reference base/read base, the substitutions for
+each reference base may optionally be sorted by their frequencies, in descending order, with
+same-frequency ties broken using the fixed order ACGTN. Although sorting by substitution
+frequency is not required by the CRAM format, assigning substitution codes based on frequency
+maximizes compression by ensuring that the most frequent substitutions use the shortest possible
+codes.
+
+For example, let us assume the following substitution frequencies for base A: 
 
 AC: 15\%
 
@@ -1410,12 +1428,9 @@ \subsubsection*{Base substitution codes (BS data series)}
 
 AN: 3
 
-and they are written as a single byte, 10 01 00 11 = 147 decimal or 0x93 in this 
-case. The whole substitution matrix is written as 5 bytes, one for each reference 
-base in the alphabetical order: A, C, G, T and N.
-
-Note: the last two bits of each substitution code are redundant but still required 
-to simplify the reading. 
+The first byte of the substitution matrix entry for reference base A is written as a single byte,
+with the codes in the order CGTN: 10 01 00 11 = 147 decimal, or 0x93 in this case. This will then
+be followed by 4 more bytes representing substitutions for reference bases C, G, T and N.
 
 \subsubsection*{Decode mapped read pseudocode}
 

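The byte packing described in the substitution matrix hunk above can be checked against its 0x93 example. The sketch below is an illustration under stated assumptions rather than a reference implementation: `pack_row` and `unpack_row` are hypothetical helper names, and the codes used for reference base A (C=2, G=1, T=0, N=3) are read directly off the bit pattern 10 01 00 11 quoted in the hunk.

```python
from typing import Dict

BASES = "ACGTN"  # fixed base order used throughout the substitution matrix


def pack_row(ref_base: str, codes: Dict[str, int]) -> int:
    """Pack the four 2-bit codes for one reference base into a single byte."""
    byte = 0
    for read_base in BASES:
        if read_base == ref_base:
            continue  # the reference base itself has no substitution code
        code = codes[read_base]
        if not 0 <= code <= 3:
            raise ValueError(f"code for {read_base} must be 0..3, got {code}")
        byte = (byte << 2) | code  # earlier bases end up in the high bits
    return byte


def unpack_row(ref_base: str, byte: int) -> Dict[int, str]:
    """Return a {substitution code: read base} lookup for one reference base."""
    others = [b for b in BASES if b != ref_base]
    return {(byte >> shift) & 0b11: base
            for base, shift in zip(others, (6, 4, 2, 0))}


row_a = pack_row("A", {"C": 2, "G": 1, "T": 0, "N": 3})
print(hex(row_a))               # 0x93
print(unpack_row("A", row_a))   # {2: 'C', 1: 'G', 0: 'T', 3: 'N'}
```

A full matrix would be five such bytes, one per reference base in the order ACGTN; a decoder resolves a BS value by looking it up in the row for the reference base at that position.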
diff --git a/MAINTAINERS.md b/MAINTAINERS.md
@@ -22,11 +22,11 @@ Past CRAM maintainers include Vadim Zalunin.
 
 ### VCF/BCF
 
-* Cristina Yenyxe Gonzalez Garcia (@cyenyxe)
 * Louis Bergelson (@lbergelson)
 * Petr Danecek (@pd3)
+* Jose Miguel Mut Lopez (@jmmut)
 
-Past VCF/BCF maintainers include Ryan Poplin and David Roazen.
+Past VCF/BCF maintainers include Cristina Yenyxe Gonzalez Garcia, Ryan Poplin, and David Roazen.
 
 ### Htsget
 
@@ -38,6 +38,7 @@ Past VCF/BCF maintainers include Ryan Poplin and David Roazen.
 
 * Andy Yates (@andrewyatz)
 * Matt Laird (@lairdm)
+* Rasko Leinonen (@raskoleinonen)
 
 [ga4gh-ff]:  https://www.ga4gh.org/howwework/workstreams/#lsg
 
@@ -49,6 +50,10 @@ Larger changes should be proposed as pull requests so that they can be discussed
 (Even those with write access to the **samtools/hts-specs** repository should in general create their pull request branches within their own **hts-specs** forks.
 This way when the main repository is forked again, the new fork is created with a minimum of extraneous volatile branches.)
 
+In general, pull requests should be squashed and rebased before merging: _squashed_ to avoid immortalising trivial editorial commits that occurred during refinement of the PR, and _rebased_ (where practical) to avoid unnecessary merge commits.
+Cases where this shouldn't be done include pull requests with multiple non-trivial commits (e.g., separate changes, or a series of commits that tells a story), which should be rebased and/or, if the branch point is reasonably recent, simply merged with a merge commit.
+
+Ensure that the pull request number is present in the resulting commit history, either in the merge commit message or by adding `(PR #NNN)` to the first line of the squashed commit or one that is representative of the PR.
 
 ## Generating PDF specification documents
 

diff --git a/SAMtags.pdf b/SAMtags.pdf
diff --git a/SAMv1.pdf b/SAMv1.pdf
diff --git a/VCFv4.1.pdf b/VCFv4.1.pdf
diff --git a/VCFv4.1.tex b/VCFv4.1.tex
@@ -1214,7 +1214,7 @@ \subsubsection{Type encoding}
 
 \vspace{0.3cm}
 
-\textbf{Vectors} --- The BCF2 type byte may indicate that the upcoming data stream contains not a single value but a fixed length vector of values.  The vector values occur in order (1st, 2nd, 3rd, etc) encoded as expected for the type declared in the vector's type byte.  For example, a vector of 3 16-bit integers would be layed out as first the vector type byte, followed immediately by 3 2-byte values for each integer, including a total of 7 bytes.
+\textbf{Vectors} --- The BCF2 type byte may indicate that the upcoming data stream contains not a single value but a fixed length vector of values.  The vector values occur in order (1st, 2nd, 3rd, etc) encoded as expected for the type declared in the vector's type byte.  For example, a vector of 3 16-bit integers would be laid out as first the vector type byte, followed immediately by 3 2-byte values for each integer, including a total of 7 bytes.
 
 Missing values in vectors are handled slightly differently from atomic values.  There are two possibilities for missing values:
 

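The 7-byte figure in the Vectors paragraph above can be reproduced with a short Python sketch. It relies on details of the BCF2 typing scheme that are not shown in this hunk and are stated here as assumptions: the type byte carries the element count in its high 4 bits and the type code in its low 4 bits, the code for a 16-bit integer is 2, values are little-endian, and counts of 15 or more use a separate overflow form that this sketch does not handle. `encode_int16_vector` is a hypothetical name.

```python
import struct
from typing import Sequence

BCF_INT16 = 2  # assumed BCF2 type code for 16-bit integers


def encode_int16_vector(values: Sequence[int]) -> bytes:
    """Encode a short vector of int16 values as type byte + payload."""
    n = len(values)
    if not 1 <= n <= 14:
        # Counts of 15 or more need the overflow form, not covered here.
        raise ValueError("this sketch only handles vectors of 1..14 values")
    type_byte = (n << 4) | BCF_INT16
    # '<' selects little-endian with no padding; 'B' is the type byte and
    # 'h' is a signed 16-bit integer.
    return struct.pack(f"<B{n}h", type_byte, *values)


encoded = encode_int16_vector([1, 2, 3])
print(len(encoded))      # 7
print(hex(encoded[0]))   # 0x32
```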
diff --git a/VCFv4.2.pdf b/VCFv4.2.pdf
diff --git a/VCFv4.2.tex b/VCFv4.2.tex
@@ -1231,7 +1231,7 @@ \subsubsection{Type encoding}
 
 \vspace{0.3cm}
 
-\textbf{Vectors} --- The BCF2 type byte may indicate that the upcoming data stream contains not a single value but a fixed length vector of values.  The vector values occur in order (1st, 2nd, 3rd, etc) encoded as expected for the type declared in the vector's type byte.  For example, a vector of 3 16-bit integers would be layed out as first the vector type byte, followed immediately by 3 2-byte values for each integer, including a total of 7 bytes.
+\textbf{Vectors} --- The BCF2 type byte may indicate that the upcoming data stream contains not a single value but a fixed length vector of values.  The vector values occur in order (1st, 2nd, 3rd, etc) encoded as expected for the type declared in the vector's type byte.  For example, a vector of 3 16-bit integers would be laid out as first the vector type byte, followed immediately by 3 2-byte values for each integer, including a total of 7 bytes.
 
 Missing values in vectors are handled slightly differently from atomic values.  There are two possibilities for missing values:
 

diff --git a/VCFv4.3.pdf b/VCFv4.3.pdf
diff --git a/VCFv4.3.tex b/VCFv4.3.tex
@@ -94,7 +94,9 @@ \subsection{Character encoding, non-printable characters and characters with spe
 
 
 \subsection{Data types}
-Data types supported by VCF are: Integer (32-bit, signed), Float (32-bit, formatted to match the regular expression \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$|, \texttt{NaN}, or \texttt{+/-Inf}), Flag, Character, and String.
+Data types supported by VCF are: Integer (32-bit, signed), Float (32-bit IEEE-754, formatted to match one of the regular expressions \verb|^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$| or \verb"^[-+]?(INF|INFINITY|NAN)$" case insensitively),%
+\footnote{Note Java's {\tt Double.valueOf} is particular about capitalisation, so additional code is needed to parse all VCF infinite/NaN values.}
+Flag, Character, and String.
 For the Integer type, the values from $-2^{31}$ to $-2^{31}+7$ cannot be stored in the binary version and therefore are disallowed in both VCF and BCF, see \ref{BcfTypeEncoding}.
 
 \subsection{Meta-information lines}
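
The relaxed Float grammar in the Data types hunk above can be exercised with the two regular expressions exactly as written there. This is a minimal Python sketch, not part of the specification; `parse_vcf_float` is a hypothetical name. It leans on the fact that Python's built-in `float()` already accepts `inf`, `infinity` and `nan` in any letter case with an optional sign, so no extra mapping is needed once the text has passed the grammar check.

```python
import re

# The two patterns quoted in the VCFv4.3 text; only the second is
# matched case-insensitively.
NUMERIC = re.compile(r"^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$")
SPECIAL = re.compile(r"^[-+]?(INF|INFINITY|NAN)$", re.IGNORECASE)


def parse_vcf_float(text: str) -> float:
    """Parse a VCF Float value, rejecting spellings outside the grammar."""
    if NUMERIC.match(text) or SPECIAL.match(text):
        return float(text)
    raise ValueError(f"not a valid VCF Float: {text!r}")


print(parse_vcf_float("1e5"))         # 100000.0
print(parse_vcf_float(".5"))          # 0.5
print(parse_vcf_float("-Infinity"))   # -inf
print(parse_vcf_float("NAN"))         # nan
```

Values such as `1.` or `0x10` are accepted by some language parsers but fall outside both patterns, so the sketch raises `ValueError` for them.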
@@ -355,7 +357,7 @@ \subsubsection{Fixed fields}
   INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: key[=data[,data]].
   INFO keys must match the regular expression \texttt{\^{}([A-Za-z\_][0-9A-Za-z\_.]*|1000G)\$}, please note that ``1000G'' is allowed as a special legacy value.
   Duplicate keys are not allowed.
-  Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} are reserved (albeit optional).
+  Arbitrary keys are permitted, although those listed in Table~\ref{table:reserved-info} and described below are reserved (albeit optional).
 
   The exact format of each INFO key should be specified in the meta-information (as described above).
   Example for an INFO field: DP=154;MQ=52;H2.
@@ -386,7 +388,7 @@ \subsubsection{Fixed fields}
 	CIGAR		& A		& String	& Cigar string describing how to align an alternate allele to the reference allele \\
 	DB		& 0		& Flag		& dbSNP membership \\
 	DP		& 1		& Integer	& Combined depth across samples \\
-	END		& 1		& Integer	& End position (for use with symbolic alleles) \\
+	END		& 1		& Integer	& End position on CHROM (used with symbolic alleles; see below) \\
 	H2		& 0		& Flag		& HapMap2 membership \\
 	H3		& 0		& Flag		& HapMap3 membership \\
 	MQ		& 1		& Float		& RMS mapping quality \\
@@ -398,6 +400,15 @@ \subsubsection{Fixed fields}
 	1000G		& 0		& Flag		& 1000 Genomes membership \\
 \end{longtable}
 
+\begin{itemize}
+\renewcommand{\labelitemii}{$\circ$}
+\item END: End reference position (1-based), indicating the variant spans positions POS--END on reference/contig CHROM.
+Normally this is the position of the last base in the REF allele, so it can be derived from POS and the length of REF, and no END INFO field is needed.
+However when symbolic alleles are used, e.g.\ in gVCF or structural variants, an explicit END INFO field provides variant span information that is otherwise unknown.
+
+This field is used to compute BCF's {\tt rlen} field (see~\ref{BcfSiteEncoding}) and is important when indexing VCF/BCF files to enable random access and querying by position.
+\end{itemize}
+
 \subsubsection{Genotype fields}
 If genotype information is present, then the same types of data must be present for all samples.
 First a FORMAT field is given specifying the data types and order (colon-separated FORMAT keys matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate keys are not allowed).
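
The END description in the hunk above amounts to a small rule for deriving a record's span. The Python sketch below illustrates it under two assumptions: the function names `variant_span` and `rlen` are hypothetical, and the formula `rlen = END - POS + 1` (with 1-based inclusive coordinates) is this sketch's reading of how the span maps onto BCF's `rlen` field, not something stated in the hunk itself.

```python
from typing import Optional, Tuple


def variant_span(chrom: str, pos: int, ref: str,
                 info_end: Optional[int] = None) -> Tuple[str, int, int]:
    """Return (CHROM, POS, END) using 1-based inclusive coordinates."""
    # Without an explicit INFO/END, END is the last base of the REF allele.
    end = info_end if info_end is not None else pos + len(ref) - 1
    return chrom, pos, end


def rlen(pos: int, ref: str, info_end: Optional[int] = None) -> int:
    """Length of the reference interval covered by the record (assumed formula)."""
    _, start, end = variant_span("", pos, ref, info_end)
    return end - start + 1


# Ordinary deletion: the span follows from POS and the length of REF.
print(variant_span("chr1", 100, "ACG"))        # ('chr1', 100, 102)
# gVCF-style reference block: END is supplied explicitly.
chrom, start, end = variant_span("chr1", 1000, "G", info_end=1999)
print(f"{chrom}:{start}-{end}")                # chr1:1000-1999
print(rlen(1000, "G", info_end=1999))          # 1000
```

Under this reading, an INFO/END that refers to a position on a different chromosome would produce a meaningless interval, which is the indexing problem the commit message describes.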
@@ -1393,9 +1404,10 @@ \subsubsection{Phasing adjacencies in an aneuploid context}
 \pagebreak
 \subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
 In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format\footnote{\url{https://help.basespace.illumina.com/articles/descriptive/gvcf-files/}}.
-The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele.
+
+The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele represented as $<$*$>$.
 Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise).
-A symbolic alternate allele $<$*$>$ is used to represent this unspecified alternate allele.
+The $<$*$>$ representation is preferred over the symbolic allele $<$NON\_REF$>$.
 
 Example records are given below:
 \scriptsize
@@ -1529,6 +1541,7 @@ \subsection{BCF2 records}
 Compression of a BCF file is recommended but not required.
 
 \subsubsection{Site encoding}
+\label{BcfSiteEncoding}
 
 {\small
 \begin{tabular}{|l | l | p{30em} | } \hline

diff --git a/htsget.md b/htsget.md
@@ -22,6 +22,7 @@ Explicitly this API does NOT:
 
 * Provide a way to discover the identifiers for valid ReadGroupSets --- clients obtain these via some out of band mechanism
 
+This protocol specification is accompanied by a [corresponding OpenAPI description](pub/htsget-openapi.yaml). OpenAPI is a language-independent way of describing REST services and is compatible with a number of [third party tools](http://openapi.tools/).
 
 # Protocol essentials