Commit 5d9fc52

jmthibault79 authored and jkbonfield committed
Update slice header text and table
- MAPPED_SLICE_HEADER -> SLICE_HEADER
- remove "unsorted"
- clarify unmapped-placed and unmapped-unplaced distinction
- add -2 multi-ref flag
- multi-ref may have unmapped-placed but must only be external-ref
- add multi-ref and unmapped text to index section

threeparttable note for SLICE_HEADER
clarify "unmapped-but-placed"
Add RI data series text
rm "unmapped-unplaced", add CIGAR comment and bold
Set indexing Aln Start + Span to 0 for unmapped
unmapped footnote
rm idea that unmapped unplaced can't go into slices with other reads - some wording and formatting changes
rm "Multiple-reference slices have index entries for each of their constituent reference sequences."
not required -> not present
Add MD5 clarification text, per JKB
rm "as is possible in BAM" and clarify unmapped BAM bit flag
1 parent 715354a commit 5d9fc52

1 file changed

+38
-17
lines changed

CRAMv3.tex

@@ -521,6 +521,7 @@ \subsection{\textbf{Block content types}}
521521

522522
CRAM has the following block content types:
523523

524+
\begin{threeparttable}[t]
524525
\begin{tabular}{|>{\raggedright}p{143pt}|>{\raggedright}p{45pt}|>{\raggedright}p{116pt}|>{\raggedright}p{114pt}|}
525526
\hline
526527
\textbf{Block content type} & \textbf{Block content type id} & \textbf{Name} & \textbf{Contents}\tabularnewline
@@ -529,7 +530,7 @@ \subsection{\textbf{Block content types}}
529530
\hline
530531
COMPRESSION\_HEADER & 1 & Compression header block & See specific section\tabularnewline
531532
\hline
532-
MAPPED\_SLICE\_HEADER & 2 & Slice header block & See specific section\tabularnewline
533+
SLICE\_HEADER\tnote{a} & 2 & Slice header block & See specific section\tabularnewline
533534
\hline
534535
& 3 & & reserved\tabularnewline
535536
\hline
@@ -538,7 +539,10 @@ \subsection{\textbf{Block content types}}
538539
CORE\_DATA & 5 & core data block & bit stream of all encodings except for external\tabularnewline
539540
\hline
540541
\end{tabular}
541-
542+
\begin{tablenotes}
543+
\item[a] Formerly MAPPED\_SLICE\_HEADER. Now used by all slice headers regardless of mapping status.
544+
\end{tablenotes}
545+
\end{threeparttable}
542546

543547
\subsection{\textbf{Block content id}}
544548

@@ -737,30 +741,45 @@ \subsection{\textbf{Slice header block}}
737741

738742
The slice header block is never compressed (block method=raw). For reference mapped
739743
reads the slice header also defines the reference sequence context of the data
740-
blocks associated with the slice. Mapped and unmapped reads can be stored within
741-
the same slice similarly to BAM file. Slices with unsorted reads must not contain
742-
any other types of reads.
744+
blocks associated with the slice. Mapped reads can be stored along with
745+
\textbf{placed unmapped}\footnote{Unmapped reads can be \textit{placed} or \textit{unplaced}.
746+
By a placed unmapped read we mean a read that is unmapped according to bit 0x4 of the
747+
BF (BAM bit flags) data series, but has position fields filled in, thus ``placing'' it on a reference sequence. In contrast,
748+
unplaced unmapped reads have a reference sequence ID of -1 and alignment position of 0.}
749+
reads on the same reference within the same slice.
750+
751+
Slices with the Multiple Reference flag (-2) set as the sequence ID in the header may contain reads
752+
mapped to multiple external references, including unmapped\footnotemark[\value{footnote}] reads (placed on these references or unplaced),
753+
but multiple embedded references cannot be combined in this way. When multiple references are
754+
used, the RI data series will be used to determine the reference sequence ID for each record. This
755+
data series is not present when only a single reference is used within a slice.
756+
757+
The Unmapped (-1) sequence ID in the header is for slices containing only unplaced
758+
unmapped\footnotemark[\value{footnote}] reads.
743759

744760
A slice containing data that does not use the external reference in
745761
any sequence may set the reference MD5 sum to zero. This can happen
746762
because the data is unmapped or the sequence has been stored verbatim
747763
instead of via reference-differencing. This latter scenario is
748-
recommended for unsorted or non-coordinate sorted data.
764+
recommended for unsorted or non-coordinate-sorted data.
749765

750766
The slice header block contains the following fields.
751767

752768
\begin{tabular}{|l|l|>{\raggedright}p{200pt}|}
753769
\hline
754770
\textbf{Data type} & \textbf{Name} & \textbf{Value}\tabularnewline
755771
\hline
756-
itf8 & reference sequence id & reference sequence identifier or -1 for unmapped
757-
or unsorted reads\tabularnewline
772+
itf8 & reference sequence id & reference sequence identifier or\linebreak{}
773+
-1 for unmapped reads\linebreak{}
774+
-2 for multiple reference sequences\tabularnewline
758775
\hline
759-
itf8 & alignment start & the alignment start position or -1 for unmapped or unsorted
760-
reads\tabularnewline
776+
itf8 & alignment start & the alignment start position.\linebreak{}
777+
Ignored on read and set to 0 on write if the slice is multiple-reference
778+
or contains unmapped unplaced reads\tabularnewline
761779
\hline
762-
itf8 & alignment span & the length of the alignment or 0 for unmapped or unsorted
763-
reads\tabularnewline
780+
itf8 & alignment span & the length of the alignment.\linebreak{}
781+
Ignored on read and set to 0 on write if the slice is multiple-reference
782+
or contains unmapped unplaced reads\tabularnewline
764783
\hline
765784
itf8 & number of records & the number of records in the slice\tabularnewline
766785
\hline
@@ -774,7 +793,9 @@ \subsection{\textbf{Slice header block}}
774793
reference sequence bases or -1 for none\tabularnewline
775794
\hline
776795
byte[16] & reference md5 & MD5 checksum of the reference bases within the slice
777-
boundaries or 16 \textbackslash{}0 bytes when unused\tabularnewline
796+
boundaries. If this slice has a reference sequence id of -1 (unmapped) or -2 (multi-ref),
797+
the MD5 should be 16 bytes of \textbackslash{}0. For embedded references, the MD5
798+
can either be all-zeros or the MD5 of the embedded sequence.\tabularnewline
778799
\hline
779800
byte[] & optional tags & a series of tag,type,value tuples encoded as
780801
per BAM auxiliary fields.\tabularnewline
@@ -1506,7 +1527,7 @@ \subsubsection*{General notes}
15061527

15071528
Please note that CRAM indexing is external to the file format itself and may change
15081529
independently of the file format specification in the future. For example, a new
1509-
type of index files may appear.
1530+
type of index file may appear.
15101531

15111532
Individual records are not indexed in CRAM files; slices should be used instead
15121533
as a unit of random access. Another important difference between CRAM and BAM indexing
@@ -1526,9 +1547,9 @@ \subsubsection*{CRAM index}
15261547
\begin{enumerate}
15271548
\item Sequence id
15281549

1529-
\item Alignment start
1550+
\item Alignment start (ignored on read for unmapped slices, set to 0 on write)
15301551

1531-
\item Alignment span
1552+
\item Alignment span (ignored on read for unmapped slices, set to 0 on write)
15321553

15331554
\item Container start byte offset in the file
15341555

@@ -1538,7 +1559,7 @@ \subsubsection*{CRAM index}
15381559
\end{enumerate}
15391560

15401561
Each line represents a slice in the CRAM file. Please note that all slices must
1541-
be listed in index file.
1562+
be listed in the index file.
15421563

15431564
\subsubsection*{BAM index}
15441565
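
The slice header rules added in the diff above can be illustrated with a minimal Python sketch. It assumes a simplified header with only the four fields the commit touches; `SliceHeader`, `prepare_for_write` and `needs_ri_series` are hypothetical names chosen for illustration and are not part of any CRAM implementation. The sketch also ignores embedded references, for which the new text allows either an all-zero MD5 or the MD5 of the embedded sequence.

```python
from dataclasses import dataclass, replace

# Sentinel values for the slice header "reference sequence id" field,
# as listed in the updated table.
UNMAPPED = -1    # slice contains only unplaced unmapped reads
MULTI_REF = -2   # slice contains reads from multiple references
ZERO_MD5 = bytes(16)  # 16 bytes of \0


@dataclass(frozen=True)
class SliceHeader:
    """Hypothetical, heavily simplified view of a slice header."""
    ref_seq_id: int
    alignment_start: int
    alignment_span: int
    reference_md5: bytes = ZERO_MD5


def prepare_for_write(header):
    """Apply the write-side rules for unmapped and multi-ref slices.

    Single-reference slices (id >= 0) are returned unchanged. For ids
    -1 and -2, alignment start and span are set to 0 and the MD5 is
    set to all zeros.
    """
    if header.ref_seq_id >= 0:
        return header
    return replace(
        header,
        alignment_start=0,
        alignment_span=0,
        reference_md5=ZERO_MD5,
    )


def needs_ri_series(header):
    """The RI data series is present only for multiple-reference slices."""
    return header.ref_seq_id == MULTI_REF
```

A reader would apply the mirror-image rule: when the header id is -1 or -2, the alignment start and span are ignored regardless of the stored values.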
