Read Structures
In fgbio, and also in Picard, a Read Structure is a string that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina's bcl2fastq software, but provides some additional capabilities.
A Read Structure is a sequence of <number><operator> pairs, or segments, where the last segment in the string may optionally use + instead of a number for its length. The + stands for whatever bases are left after the other segments are processed, and can be thought of as meaning [0..infinity].
Read Structures are most commonly used in tools that convert from sequencer output formats (e.g. fastq files, BCLs) to downstream formats like SAM/BAM/CRAM, and in tools that process SAM/BAM/CRAM to extract non-template bases from the reads. Examples include:
- DemuxFastqs in fgbio, to demultiplex a set of multi-sample fastq files and optionally extract UMIs
- FastqToBam in fgbio, to convert from fastq to BAM while preserving sample barcode and UMI information
- ExtractUmisFromBam in fgbio, which rewrites a BAM file with UMI sequences extracted from the reads and placed into tags
- IlluminaBasecallsToSam and IlluminaBasecallsToFastq in Picard, both of which process BCLs and related files in an Illumina run folder and create BAMs or FASTQs respectively
Four kinds of operator are supported:
- T or Template: the bases in the segment are reads of template (e.g. genomic DNA or RNA)
- B or Sample Barcode: the bases in the segment are an index sequence used to identify the sample being sequenced
- M or Molecular Barcode: the bases in the segment are an index sequence used to identify the unique source molecule being sequenced (i.e. a UMI)
- S or Skip: the bases in the segment should be skipped or ignored, for example if they are monotemplate sequence generated by the library preparation
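To make the four operators concrete, the sketch below slices a string of bases according to a Read Structure and groups the pieces by operator. It is an illustration only, not the fgbio or Picard implementation: the function name `extract_by_operator` is hypothetical, and the input is assumed to be an already valid Read Structure.

```python
import re
from collections import defaultdict

# Hypothetical helper for illustration; not part of fgbio or Picard.
# Each segment is a length (digits or "+") followed by one operator.
SEGMENT = re.compile(r"([0-9]+|\+)([TBMS])")

def extract_by_operator(read_structure: str, bases: str) -> dict[str, list[str]]:
    """Slice `bases` by `read_structure`, grouping the pieces by operator.

    Assumes `read_structure` is valid; no rule checking is done here.
    """
    pieces = defaultdict(list)
    offset = 0
    for length, operator in SEGMENT.findall(read_structure):
        # "+" consumes whatever bases remain after the earlier segments.
        end = len(bases) if length == "+" else offset + int(length)
        if operator != "S":  # skipped bases are simply discarded
            pieces[operator].append(bases[offset:end])
        offset = end
    return dict(pieces)

print(extract_by_operator("4M2S+T", "ACGTNNGATTACA"))
# {'M': ['ACGT'], 'T': ['GATTACA']}
```

Here the first four bases are kept as the molecular barcode, the next two are skipped, and the remainder is template.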
Read Structures must follow these rules:
- Any number of segments >= 1 is valid
- The length of each segment must be a positive integer >= 1 (or +)
- Only the last segment in a read structure may use + for its length
- Adjacent segments may use the same operator. E.g. if two sample indices are ligated onto a molecule separately such that they are adjacent, a structure of 6B6B+T is perfectly acceptable.
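The rules above can be checked one by one in a small parser. This is a minimal sketch under stated assumptions: `parse_read_structure` is a hypothetical name, the error messages are made up for illustration, and leading zeros are rejected as a design choice of this sketch. It does not reproduce how fgbio or Picard validate their input.

```python
OPERATORS = "TBMS"
LENGTH_CHARS = "0123456789+"

def parse_read_structure(text: str) -> list[tuple[str, str]]:
    """Return (length, operator) pairs, raising ValueError if a rule is broken.

    Hypothetical helper for illustration; not the fgbio implementation.
    """
    segments = []
    length = ""
    for char in text:
        if char in LENGTH_CHARS:
            length += char
        elif char in OPERATORS:
            segments.append((length, char))
            length = ""
        else:
            raise ValueError(f"unexpected character {char!r}")
    if length:
        raise ValueError("trailing length without an operator")
    if not segments:
        raise ValueError("at least one segment is required")
    for index, (length, _) in enumerate(segments):
        is_last = index == len(segments) - 1
        if length == "+":
            if not is_last:
                raise ValueError("only the last segment may use '+'")
        elif not length.isdigit() or length.startswith("0"):
            # Covers a missing length, "0", leading zeros, and mixes like "1+".
            raise ValueError(f"invalid segment length {length!r}")
    return segments

print(parse_read_structure("6B6B+T"))
# [('6', 'B'), ('6', 'B'), ('+', 'T')]
```

Adjacent segments with the same operator, as in 6B6B+T, pass through unchanged because the sketch never compares neighbouring operators.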
The following examples show the recommended way to describe a sequencing run in two different ways: first as a single Read Structure for the entire run, as you might use with IlluminaBasecallsToSam, and second as a set of Read Structures that map one-to-one to the physical reads after fastq conversion and, optionally, adapter trimming (which creates variable-length reads):
- A simple 2x150bp paired-end run with no sample or molecular indices:
  - 150T150T
  - [+T, +T]
- A 2x75bp paired-end run with an 8bp I1 index read:
  - 75T8B75T
  - [+T, 8B, +T]
- A 2x150bp paired-end run with an 8bp I1 index read and an inline 8bp UMI in read 1:
  - 8M142T8B150T
  - [8M+T, 8B, +T]
- A 2x150bp duplex sequencing run with dual sample-barcoding (I1 and I2) and both a 10bp UMI and 5bp monotemplate at the start of both R1 and R2:
  - 10M5S135T8B8B10M5S135T
  - [10M5S+T, 8B, 8B, 10M5S+T]
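The relationship between the two forms in each example can be sketched in code. Given a whole-run Read Structure and the number of cycles in each physical read, the hypothetical function `split_by_read` below cuts the run structure at the read boundaries and rewrites a trailing template segment as +T. Both the function name and the "trailing template becomes +" rule are assumptions inferred from the examples above, not a documented fgbio or Picard feature.

```python
import re

# Whole-run structures use fixed lengths only, so "+" is not matched here.
SEGMENT = re.compile(r"([0-9]+)([TBMS])")

def split_by_read(run_structure: str, read_lengths: list[int]) -> list[str]:
    """Split a whole-run structure into one structure per physical read.

    Hypothetical helper for illustration; `read_lengths` holds the cycle
    count of each physical read, in sequencing order.
    """
    segments = [(int(n), op) for n, op in SEGMENT.findall(run_structure)]
    per_read = []
    position = 0
    for read_length in read_lengths:
        parts = []
        remaining = read_length
        while remaining > 0:
            if position >= len(segments) or segments[position][0] > remaining:
                raise ValueError("segments do not line up with read boundaries")
            length, operator = segments[position]
            parts.append((str(length), operator))
            remaining -= length
            position += 1
        # Assumption: template reads may shrink after adapter trimming, so a
        # trailing template segment is written with "+" instead of a number.
        if parts[-1][1] == "T":
            parts[-1] = ("+", "T")
        per_read.append("".join(n + op for n, op in parts))
    if position != len(segments):
        raise ValueError("segments remain after the last read")
    return per_read

print(split_by_read("10M5S135T8B8B10M5S135T", [150, 8, 8, 150]))
# ['10M5S+T', '8B', '8B', '10M5S+T']
```

The sketch also doubles as a consistency check: it raises an error when the segment lengths in the run structure do not add up to the stated read lengths.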
The formal grammar for Read Structures supported by fgbio is as follows:
<read-structure> ::= {<fixed-length><operator>} <any-length><operator>
<operator> ::= "T" / "B" / "M" / "S"
<fixed-length> ::= <non-zero-digit>{<digit>}
<variable-length> ::= "+"
<any-length> ::= <fixed-length> / <variable-length>
<non-zero-digit> ::= "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"
<digit> ::= "0" / <non-zero-digit>
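As a rough illustration, the grammar can be translated into a single regular expression, with each segment written as a length followed by an operator, as in the examples earlier on this page. The names `READ_STRUCTURE` and `is_valid_read_structure` are hypothetical, and this is a sketch rather than the validation code used by fgbio.

```python
import re

# Zero or more fixed-length segments, then one final segment whose length
# may be either a fixed length or "+". Fixed lengths have no leading zero.
READ_STRUCTURE = re.compile(
    r"(?:[1-9][0-9]*[TBMS])*"   # {<fixed-length><operator>}
    r"(?:[1-9][0-9]*|\+)[TBMS]"  # <any-length><operator>
)

def is_valid_read_structure(text: str) -> bool:
    """Return True if `text` matches the grammar in full."""
    return READ_STRUCTURE.fullmatch(text) is not None

for candidate in ["150T150T", "8M+T", "+T8B", "0T"]:
    print(candidate, is_valid_read_structure(candidate))
```

Using `fullmatch` rather than `match` matters here, since a prefix such as +T would otherwise make +T8B look valid.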