@@ -521,6 +521,7 @@ \subsection{\textbf{Block content types}}
CRAM has the following block content types:

+ \begin{threeparttable}[t]
\begin{tabular}{|>{\raggedright}p{143pt}|>{\raggedright}p{45pt}|>{\raggedright}p{116pt}|>{\raggedright}p{114pt}|}
\hline
\textbf{Block content type} & \textbf{Block content type id} & \textbf{Name} & \textbf{Contents}\tabularnewline
@@ -529,7 +530,7 @@ \subsection{\textbf{Block content types}}
\hline
COMPRESSION\_HEADER & 1 & Compression header block & See specific section\tabularnewline
\hline
- MAPPED\_SLICE\_HEADER & 2 & Slice header block & See specific section\tabularnewline
+ SLICE\_HEADER\tnote{a} & 2 & Slice header block & See specific section\tabularnewline
\hline
& 3 & & reserved\tabularnewline
\hline
@@ -538,7 +539,10 @@ \subsection{\textbf{Block content types}}
CORE\_DATA & 5 & core data block & bit stream of all encodings except for external\tabularnewline
\hline
\end{tabular}
-
+ \begin{tablenotes}
+ \item[a] Formerly MAPPED\_SLICE\_HEADER. Now used by all slice headers regardless of mapping status.
+ \end{tablenotes}
+ \end{threeparttable}

\subsection{\textbf{Block content id}}
@@ -737,30 +741,45 @@ \subsection{\textbf{Slice header block}}
The slice header block is never compressed (block method=raw). For reference mapped
reads the slice header also defines the reference sequence context of the data
- blocks associated with the slice. Mapped and unmapped reads can be stored within
- the same slice similarly to BAM file. Slices with unsorted reads must not contain
- any other types of reads.
+ blocks associated with the slice. Mapped reads can be stored along with
+ \textbf{placed unmapped}\footnote{Unmapped reads can be \textit{placed} or \textit{unplaced}.
+ By a placed unmapped read we mean a read that is unmapped according to bit 0x4 of the
+ BF (BAM bit flags) data series, but has position fields filled in, thus ``placing'' it on a reference sequence. In contrast,
+ unplaced unmapped reads have a reference sequence ID of -1 and an alignment position of 0.}
+ reads on the same reference within the same slice.
+
+ Slices with the Multiple Reference flag (-2) set as the sequence ID in the header may contain reads
+ mapped to multiple external references, including unmapped\footnotemark[\value{footnote}] reads (placed on these references or unplaced),
+ but multiple embedded references cannot be combined in this way. When multiple references are
+ used, the RI data series is used to determine the reference sequence ID for each record. This
+ data series is not present when only a single reference is used within a slice.
+
+ The Unmapped (-1) sequence ID in the header is for slices containing only unplaced
+ unmapped\footnotemark[\value{footnote}] reads.

A slice containing data that does not use the external reference in
any sequence may set the reference MD5 sum to zero. This can happen
because the data is unmapped or the sequence has been stored verbatim
instead of via reference-differencing. This latter scenario is
- recommended for unsorted or non-coordinate sorted data.
+ recommended for unsorted or non-coordinate-sorted data.

The slice header block contains the following fields.

\begin{tabular}{|l|l|>{\raggedright}p{200pt}|}
\hline
\textbf{Data type} & \textbf{Name} & \textbf{Value}\tabularnewline
\hline
- itf8 & reference sequence id & reference sequence identifier or -1 for unmapped
- or unsorted reads\tabularnewline
+ itf8 & reference sequence id & reference sequence identifier or\linebreak{}
+ -1 for unmapped reads\linebreak{}
+ -2 for multiple reference sequences\tabularnewline
\hline
- itf8 & alignment start & the alignment start position or -1 for unmapped or unsorted
- reads\tabularnewline
+ itf8 & alignment start & the alignment start position.\linebreak{}
+ Ignored on read and set to 0 on write if the slice is multiple-reference
+ or contains unplaced unmapped reads\tabularnewline
\hline
- itf8 & alignment span & the length of the alignment or 0 for unmapped or unsorted
- reads\tabularnewline
+ itf8 & alignment span & the length of the alignment.\linebreak{}
+ Ignored on read and set to 0 on write if the slice is multiple-reference
+ or contains unplaced unmapped reads\tabularnewline
\hline
itf8 & number of records & the number of records in the slice\tabularnewline
\hline
@@ -774,7 +793,9 @@ \subsection{\textbf{Slice header block}}
reference sequence bases or -1 for none\tabularnewline
\hline
byte[16] & reference md5 & MD5 checksum of the reference bases within the slice
- boundaries or 16 \textbackslash{}0 bytes when unused\tabularnewline
+ boundaries. If this slice has a reference sequence id of -1 (unmapped) or -2 (multi-ref),
+ the MD5 should be 16 bytes of \textbackslash{}0. For embedded references, the MD5
+ can be either all zeros or the MD5 of the embedded sequence.\tabularnewline
\hline
byte[] & optional tags & a series of tag,type,value tuples encoded as
per BAM auxiliary fields.\tabularnewline
@@ -1506,7 +1527,7 @@ \subsubsection*{General notes}
Please note that CRAM indexing is external to the file format itself and may change
independently of the file format specification in the future. For example, a new
- type of index files may appear.
+ type of index file may appear.

Individual records are not indexed in CRAM files; slices should be used instead
as a unit of random access. Another important difference between CRAM and BAM indexing
@@ -1526,9 +1547,9 @@ \subsubsection*{CRAM index}
\begin{enumerate}
\item Sequence id

- \item Alignment start
+ \item Alignment start (ignored on read for unmapped slices, set to 0 on write)

- \item Alignment span
+ \item Alignment span (ignored on read for unmapped slices, set to 0 on write)

\item Container start byte offset in the file
@@ -1538,7 +1559,7 @@ \subsubsection*{CRAM index}
\end{enumerate}

Each line represents a slice in the CRAM file. Please note that all slices must
- be listed in index file.
+ be listed in the index file.

\subsubsection*{BAM index}
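The slice header rules introduced by this change can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the specification or of any CRAM library: the names `SliceHeader`, `normalize_for_write`, `needs_ri_series`, and `is_unplaced_unmapped` are hypothetical, and only the constants (-1, -2, bit 0x4, 16 zero bytes) are taken from the text of the diff. It also assumes that "contains unplaced unmapped reads" can be detected from the header's reference sequence id alone.

```python
from dataclasses import dataclass, replace

# Constants taken from the slice header text in the diff.
REF_UNMAPPED = -1     # slice contains only unplaced unmapped reads
REF_MULTI = -2        # slice contains reads from multiple references
BF_UNMAPPED = 0x4     # "unmapped" bit of the BF (BAM bit flags) data series
ZERO_MD5 = bytes(16)  # 16 bytes of \0


@dataclass(frozen=True)
class SliceHeader:
    """Hypothetical in-memory view of the first slice header fields."""
    ref_seq_id: int
    alignment_start: int
    alignment_span: int
    ref_md5: bytes = ZERO_MD5


def normalize_for_write(header: SliceHeader) -> SliceHeader:
    """Zero start, span and MD5 for unmapped (-1) and multi-ref (-2) slices.

    Assumption: the header's reference sequence id is enough to decide
    whether the 'set to 0 on write' rule applies.
    """
    if header.ref_seq_id in (REF_UNMAPPED, REF_MULTI):
        return replace(header, alignment_start=0, alignment_span=0,
                       ref_md5=ZERO_MD5)
    return header


def needs_ri_series(header: SliceHeader) -> bool:
    """The RI data series is present only for multiple-reference slices."""
    return header.ref_seq_id == REF_MULTI


def is_unplaced_unmapped(bf: int, ref_seq_id: int, position: int) -> bool:
    """Unmapped by bit 0x4, with reference id -1 and alignment position 0."""
    return bool(bf & BF_UNMAPPED) and ref_seq_id == -1 and position == 0
```

A reader following the same rules would ignore `alignment_start` and `alignment_span` whenever `ref_seq_id` is -1 or -2, rather than trusting whatever values the writer stored.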