Read Structures

Nils Homer edited this page Apr 25, 2017 · 10 revisions

In fgbio, and also in Picard, a Read Structure is a string that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina's bcl2fastq software, but provides some additional capabilities.

A Read Structure is a sequence of <number><operator> pairs, or segments, where the last segment in the string may optionally use + instead of a number for its length. The + stands for whatever bases are left after the other segments are processed, and can be thought of as a length in the range [0..infinity].
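As a minimal sketch of the <number><operator> layout, the Python below splits a read structure string into (length, operator) segments. The function name `parse_read_structure` is hypothetical and is not part of fgbio or Picard; the operator letters are the four described in the Operators section.

```python
import re
from typing import List, Optional, Tuple

# One segment: (length, operator). A length of None stands for "+".
Segment = Tuple[Optional[int], str]

# <number><operator>, where the number may instead be "+".
_SEGMENT = re.compile(r"([1-9][0-9]*|\+)([TBMS])")


def parse_read_structure(structure: str) -> List[Segment]:
    """Split a read structure string into (length, operator) segments."""
    segments: List[Segment] = []
    pos = 0
    while pos < len(structure):
        match = _SEGMENT.match(structure, pos)
        if match is None:
            raise ValueError(f"cannot parse {structure!r} at offset {pos}")
        length, operator = match.groups()
        # "+" is only allowed on the final segment.
        if length == "+" and match.end() != len(structure):
            raise ValueError("only the last segment may use '+'")
        segments.append((None if length == "+" else int(length), operator))
        pos = match.end()
    if not segments:
        raise ValueError("a read structure needs at least one segment")
    return segments


print(parse_read_structure("8M142T8B150T"))
# [(8, 'M'), (142, 'T'), (8, 'B'), (150, 'T')]
print(parse_read_structure("10M5S+T"))
# [(10, 'M'), (5, 'S'), (None, 'T')]
```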

Read Structures are most commonly used in tools that convert from sequencer output formats (e.g. fastq files, BCLs) to downstream formats like SAM/BAM/CRAM, and in tools that process SAM/BAM/CRAM to extract non-template bases from the reads. Examples include:

  • DemuxFastqs in fgbio to demultiplex a set of multi-sample fastq files and optionally extract UMIs
  • FastqToBam in fgbio to convert from fastq to BAM while preserving sample barcode and UMI information
  • ExtractUmisFromBam in fgbio which re-writes a BAM file with UMI sequences extracted from the reads and placed into tags
  • IlluminaBasecallsToSam and IlluminaBasecallsToFastq in Picard, both of which process BCLs and related files in an Illumina run folder and create BAMs or FASTQs respectively

Operators

Four kinds of operator are supported:

  • T or Template: the bases in the segment are reads of template (e.g. genomic DNA, RNA, etc.)
  • B or Sample Barcode: the bases in the segment are an index sequence used to identify the sample being sequenced
  • M or Molecular Barcode: the bases in the segment are an index sequence used to identify the unique source molecule being sequenced (i.e. a UMI)
  • S or Skip: the bases in the segment should be skipped or ignored, for example if they are monotemplate sequence generated by the library preparation
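To illustrate what the operators mean for a single read, here is a rough sketch that walks a read's bases segment by segment and groups them by operator. The helper `extract_by_operator` is hypothetical (it is not an fgbio or Picard API) and assumes the read structure is already valid; it simply ignores any bases left over after the final fixed-length segment, which may differ from how the real tools behave.

```python
import re
from collections import defaultdict
from typing import Dict, List


def extract_by_operator(bases: str, structure: str) -> Dict[str, List[str]]:
    """Group the bases of one read by the operator of the segment covering them."""
    extracted: Dict[str, List[str]] = defaultdict(list)
    offset = 0
    for length, operator in re.findall(r"(\d+|\+)([TBMS])", structure):
        # "+" consumes everything that is left, which may be zero bases.
        end = len(bases) if length == "+" else offset + int(length)
        if end > len(bases):
            raise ValueError("read is shorter than the read structure requires")
        if operator != "S":  # skipped bases are dropped
            extracted[operator].append(bases[offset:end])
        offset = end
    return dict(extracted)


# A 4bp UMI, 2 skipped bases, and the rest template.
print(extract_by_operator("ACGTTTGGGCCC", "4M2S+T"))
# {'M': ['ACGT'], 'T': ['GGGCCC']}
```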

General Rules

  • Any number of segments >= 1 is valid
  • The length of each segment must be a positive integer (or +)
  • Only the last segment in a read structure may use + for its length
  • Adjacent segments may use the same operator. E.g. if two sample indices are ligated onto a molecule separately such that they are adjacent, a structure of 6B6B+T is perfectly acceptable.

Examples

The following examples show the recommended way to describe a sequencing run in two different ways: first as a single Read Structure for the entire run, as you might use with IlluminaBasecallsToSam, and second as a set of Read Structures that map one-to-one to the physical reads after fastq conversion and, optionally, adapter trimming (which creates variable-length reads):

  • A simple 2x150bp paired end run with no sample or molecular indices:
    • 150T150T
    • [+T, +T]
  • A 2x75bp paired end run with an 8bp I1 index read:
    • 75T8B75T
    • [+T, 8B, +T]
  • A 2x150bp paired end run with an 8bp I1 index read and an inline 8bp UMI in read 1:
    • 8M142T8B150T
    • [8M+T, 8B, +T]
  • A 2x150bp duplex sequencing run with dual sample-barcoding (I1 and I2) and both a 10bp UMI and 5bp monotemplate at the start of both R1 and R2:
    • 10M5S135T8B8B10M5S135T
    • [10M5S+T, 8B, 8B, 10M5S+T]
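The relationship between the two forms can be sketched in code: given the cycle count of each physical read, a whole-run structure splits into one fixed-length structure per read. The function `split_by_read` is a hypothetical helper, not something fgbio or Picard provides, and it assumes the whole-run structure uses only fixed lengths and that no segment spans a read boundary. The per-read forms listed above then relax the trailing template segment to +T so that trimmed, variable-length reads still match.

```python
import re
from typing import List


def split_by_read(structure: str, read_lengths: List[int]) -> List[str]:
    """Split a fixed-length, whole-run read structure into one structure per read."""
    segments = re.findall(r"(\d+)([TBMS])", structure)
    per_read: List[str] = []
    i = 0
    for read_length in read_lengths:
        remaining = read_length
        parts: List[str] = []
        while remaining > 0:
            if i >= len(segments):
                raise ValueError("read structure covers fewer bases than the reads")
            length, operator = segments[i]
            if int(length) > remaining:
                raise ValueError("a segment crosses a read boundary")
            parts.append(length + operator)
            remaining -= int(length)
            i += 1
        per_read.append("".join(parts))
    if i != len(segments):
        raise ValueError("read structure covers more bases than the reads")
    return per_read


# R1, I1, I2, R2 of the duplex example: 150 + 8 + 8 + 150 cycles.
print(split_by_read("10M5S135T8B8B10M5S135T", [150, 8, 8, 150]))
# ['10M5S135T', '8B', '8B', '10M5S135T']
```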

Formal Grammar

The formal grammar for Read Structures supported by fgbio is as follows:

<read-structure>  ::= {<fixed-length><operator>} <any-length><operator>
<operator>        ::= "T" / "B" / "M" / "S"
<fixed-length>    ::= <non-zero-digit>{<digit>}
<variable-length> ::= "+"
<any-length>      ::= <fixed-length> / <variable-length>
<non-zero-digit>  ::= "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"
<digit>           ::= "0" / <non-zero-digit>
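For quick validation, the grammar can be approximated with a single regular expression. The sketch below is not taken from fgbio; it assumes each segment is written with the length before the operator, as in the examples above, and the names `READ_STRUCTURE` and `is_valid_read_structure` are made up for illustration.

```python
import re

# Zero or more fixed-length segments, then one final segment whose length
# is either fixed or "+".
READ_STRUCTURE = re.compile(r"(?:[1-9][0-9]*[TBMS])*(?:[1-9][0-9]*|\+)[TBMS]")


def is_valid_read_structure(structure: str) -> bool:
    """Return True if the whole string matches the read structure pattern."""
    return READ_STRUCTURE.fullmatch(structure) is not None


for candidate in ["150T150T", "6B6B+T", "+T", "0T", "+T8B", ""]:
    print(repr(candidate), is_valid_read_structure(candidate))
```

Of these candidates, the first three are accepted; "0T" is rejected because lengths must be at least 1, "+T8B" because + appears before the last segment, and the empty string because at least one segment is required.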