seqspec is a machine-readable YAML file format for genomic library sequence and structure. It was inspired by and builds off of the Teichmann Lab Single Cell Genomics Library Structure by Xi Chen.
A list of seqspec examples for multiple assays can be found in the assays/ folder. Each spec.yaml describes the 5'->3' "Final library structure" for the assay. Sequence specification files can be formatted with the seqspec command line tool.
# development
pip install git+https://github.com/IGVF/seqspec.git
# released
pip install seqspec
seqspec format --help
Each assay is described by two objects: the Assay object and the Region object. A library is described by one Assay object and multiple (possibly nested) Region objects. The Region objects are grouped with a join operation and an order on the subRegions specified. A simple (but not fully specified) example looks like the following:
modalities:
- Modality1
- Modality2
assay_spec:
- region_id: Modality1
  regions:
  - region_id: Region1
    ...
  - region_id: Region2
    ...
- region_id: Modality2
  ...
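The join operation over ordered, possibly nested regions can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the seqspec implementation: the `Region` dataclass and the `joined_sequence` function are hypothetical names, and the sequences are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """Hypothetical stand-in for a seqspec Region (not the library's class)."""
    region_id: str
    sequence: str = ""
    regions: List["Region"] = field(default_factory=list)


def joined_sequence(region: Region) -> str:
    """Concatenate the sequences of nested regions, in order, depth first."""
    if not region.regions:
        return region.sequence
    return "".join(joined_sequence(sub) for sub in region.regions)


# Placeholder sequences, chosen only for illustration.
modality = Region(
    region_id="Modality1",
    regions=[
        Region(region_id="Region1", sequence="ACGT"),
        Region(
            region_id="Region2",
            regions=[
                Region(region_id="Region2a", sequence="NNNN"),
                Region(region_id="Region2b", sequence="TTAA"),
            ],
        ),
    ],
)

print(joined_sequence(modality))  # ACGTNNNNTTAA
```

The order of the list is what determines the 5'->3' order of the joined sequence, which is why the sketch never sorts or reorders the nested regions.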
In order to catalogue relevant information for each library structure, multiple properties are specified for each Assay and each Region. A description of the Assay and Region schema can be found in seqspec/schema/seqspec.schema.json.
Below is an example of an Assay.
!Assay
name: SPLiT-seq
doi: https://doi.org/10.1126/science.aam8999
publication_date: 15 March 2018
description: split-pool ligation-based transcriptome sequencing
modalities:
- RNA
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/SPLiT-seq.html
assay_spec:
- name is a free-form string that labels the assay
- doi is the doi link to the paper/protocol that describes the assay (if it exists)
- publication_date is the date the assay was published (linked to by the doi). Must be in DD Month Year format.
- description is a free-form string that describes the assay
- modalities is a list of region_types that are contained within the library. Each string must be present in exactly one Region in the first "level" of the assay_spec.
- lib_struct is a link to the manually annotated library structure developed by Xi Chen in Sarah Teichmann's lab.
- assay_spec is a list of Regions.
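Two of the Assay rules above are easy to check mechanically: the DD Month Year date format and the requirement that each modality appears in exactly one first-level region. The sketch below works on a plain dict rather than a loaded seqspec object. The function name `check_assay` is hypothetical, and matching a modality against the `region_type` of the first-level regions is an assumption made for illustration.

```python
from datetime import datetime


def check_assay(assay: dict) -> list:
    """Return a list of problems found in an Assay-like dict (empty if none)."""
    problems = []

    # publication_date must look like "15 March 2018".
    # Note: %B parses month names in the current locale (English by default).
    try:
        datetime.strptime(assay["publication_date"], "%d %B %Y")
    except ValueError:
        problems.append("publication_date is not in DD Month Year format")

    # Each modality must be present in exactly one first-level region.
    # Assumption: the modality string is compared to the region_type.
    top_level_types = [region["region_type"] for region in assay["assay_spec"]]
    for modality in assay["modalities"]:
        count = top_level_types.count(modality)
        if count != 1:
            problems.append(
                f"modality {modality!r} appears in {count} first-level regions"
            )

    return problems


assay = {
    "name": "SPLiT-seq",
    "publication_date": "15 March 2018",
    "modalities": ["RNA"],
    "assay_spec": [{"region_id": "RNA", "region_type": "RNA"}],
}

print(check_assay(assay))  # []
```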
Below is an example of a Region.
!Region
region_id: barcode-1
region_type: barcode
name: barcode-1
sequence_type: onlist
sequence: NNNNNNNN
min_len: 8
max_len: 8
onlist: !Onlist
  filename: barcode-1_onlist.txt
  md5: null
regions: null
- region_id is a free-form string and must be unique across all regions in the seqspec file.
  - if the assay contains multiple regions of the same region_type it may be useful to append an integer to the end of the region_id to differentiate those regions. For example, if the assay had four barcodes then each of the individual barcode regions could have the region_ids barcode-1, barcode-2, barcode-3, barcode-4.
- region_type can be one of the following:
  - RNA
  - ATAC
  - CRISPR
  - Protein
  - illumina_p5
  - illumina_p7
  - nextera_read1
  - nextera_read2
  - s5
  - s7
  - ME1
  - ME2
  - truseq_read1
  - truseq_read2
  - index5
  - index7
  - fastq
  - barcode
  - umi
  - cDNA
  - gDNA
- name is a free-form string for describing the region
- sequence_type can be one of the following:
  - fixed indicates that the sequence string is known
  - joined indicates that the sequence is created (joined) from nested regions
  - onlist indicates that the sequence is derived from an onlist (if specified, then onlist must be non-null)
  - random indicates that the sequence is not known a priori
- sequence is a representation of the sequence
  - if the sequence_type is fixed then the actual sequence string is provided
  - if the sequence_type is joined then the field must be the concatenation of the nested regions
  - if the sequence_type is onlist then the field must be an N string with the length of the shortest sequence on the onlist
  - if the sequence_type is random then the field must be an X
- min_len is an integer greater than or equal to zero. It represents the minimum possible length of the sequence
- max_len is an integer greater than or equal to the min_len. It represents the maximum length of the sequence
- onlist can be null or contain
  - filename which is a path (relative to the seqspec file) to a file containing a list of sequences
  - md5 is the md5sum of the uncompressed file in filename
- regions can either be null or contain a list of regions as specified above.
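The Region rules listed above can be expressed as a small checker. This is a minimal sketch operating on plain dicts, not the validation that the seqspec tool or its JSON schema performs; the function name `check_region` and the wording of the messages are hypothetical.

```python
def check_region(region: dict) -> list:
    """Return a list of rule violations for one Region-like dict."""
    problems = []
    seq = region.get("sequence") or ""
    seq_type = region.get("sequence_type")
    min_len, max_len = region["min_len"], region["max_len"]
    subregions = region.get("regions") or []

    # Length bounds: 0 <= min_len <= max_len.
    if min_len < 0:
        problems.append("min_len must be >= 0")
    if max_len < min_len:
        problems.append("max_len must be >= min_len")

    if seq_type == "onlist":
        # An onlist region needs an onlist and an all-N sequence.
        if region.get("onlist") is None:
            problems.append("sequence_type is onlist but onlist is null")
        if set(seq) != {"N"}:
            problems.append("onlist sequence must be a string of N")
    elif seq_type == "random":
        if seq != "X":
            problems.append("random sequence must be X")
    elif seq_type == "joined":
        # A joined sequence is the concatenation of its nested regions.
        expected = "".join(sub.get("sequence") or "" for sub in subregions)
        if seq != expected:
            problems.append("joined sequence must concatenate nested regions")

    return problems


# The barcode-1 Region from the example, written as a plain dict.
barcode = {
    "region_id": "barcode-1",
    "region_type": "barcode",
    "name": "barcode-1",
    "sequence_type": "onlist",
    "sequence": "NNNNNNNN",
    "min_len": 8,
    "max_len": 8,
    "onlist": {"filename": "barcode-1_onlist.txt", "md5": None},
    "regions": None,
}

print(check_region(barcode))  # []
```

Returning a list of messages instead of raising on the first failure makes it possible to report every violation in a region at once.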
For more information about the specification of the various fields, please see seqspec.schema.json which is the JSON schema representation of the various fields described above.
The YAML file contains tags (strings prepended with an exclamation point !) to describe the various objects (Assay, Region, Onlist). The purpose of these tags is to make it easy to load the seqspec into Python as a Python object. This makes it possible to access the various attributes of the seqspec file with "dot notation" as follows:
from seqspec.utils import load_spec
spec = load_spec("seqspec/assays/10x-RNA-v3/spec.yaml")
print(spec.get_modality("RNA").sequence)
# AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNXAGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
For consistency across assays I suggest the following naming conventions for standard regions. Note that the region_id for all atomic regions should be unique.
# Assay region
!Assay
name: My-RNA-Assay
doi: mydoi.org
publication_date: 01 January 2001
description: My custom assay
modalities:
- RNA
lib_struct: www.link-to-libstructs.com
assay_spec:
- !Region
  region_id: RNA
  region_type: RNA
  name: My RNA
  sequence_type: joined
  sequence:
  min_len: 0
  max_len: 0
  onlist:
  regions:
  # illumina_p5
  - !Region
    region_id: illumina_p5
    region_type: illumina_p5
    name: illumina_p5
    sequence_type: fixed
    sequence: AATGATACGGCGACCACCGAGATCTACAC
    min_len: 29
    max_len: 29
    onlist:
    regions:
  # illumina_p7
  - !Region
    region_id: illumina_p7
    region_type: illumina_p7
    name: illumina_p7
    sequence_type: fixed
    sequence: ATCTCGTATGCCGTCTTCTGCTTG
    min_len: 24
    max_len: 24
    onlist:
    regions:
  # nextera_read1
  - !Region
    region_id: nextera_read1
    region_type: nextera_read1
    name: nextera_read1
    sequence_type: joined
    sequence: TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
    min_len: 33
    max_len: 33
    onlist:
    regions:
    - !Region
      region_id: s5
      region_type: s5
      name: s5
      sequence_type: fixed
      sequence: TCGTCGGCAGCGTC
      min_len: 14
      max_len: 14
      onlist:
      regions:
    - !Region
      region_id: ME1
      region_type: ME1
      name: ME1
      sequence_type: fixed
      sequence: AGATGTGTATAAGAGACAG
      min_len: 19
      max_len: 19
      onlist:
      regions:
  # nextera_read2
  - !Region
    region_id: nextera_read2
    region_type: nextera_read2
    name: nextera_read2
    sequence_type: joined
    sequence: CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
    min_len: 34
    max_len: 34
    onlist:
    regions:
    - !Region
      region_id: ME2
      region_type: ME2
      name: ME2
      sequence_type: fixed
      sequence: CTGTCTCTTATACACATCT
      min_len: 19
      max_len: 19
      onlist:
      regions:
    - !Region
      region_id: s7
      region_type: s7
      name: s7
      sequence_type: fixed
      sequence: CCGAGCCCACGAGAC
      min_len: 15
      max_len: 15
      onlist:
      regions:
  # truseq_read1
  - !Region
    region_id: truseq_read1
    region_type: truseq_read1
    name: truseq_read1
    sequence_type: fixed
    sequence: ACACTCTTTCCCTACACGACGCTCTTCCGATCT
    min_len: 33
    max_len: 33
    onlist:
    regions:
  # truseq_read2
  - !Region
    region_id: truseq_read2
    region_type: truseq_read2
    name: truseq_read2
    sequence_type: fixed
    sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    min_len: 34
    max_len: 34
    onlist:
    regions:
  # index5
  - !Region
    region_id: I2.fastq.gz
    region_type: I2.fastq.gz
    name: Index 2 FASTQ
    sequence_type: joined
    sequence: NNNNNNNN
    min_len: 8
    max_len: 8
    onlist:
    regions:
    - !Region
      region_id: index5
      region_type: index5
      name: index5
      sequence_type: onlist
      sequence: NNNNNNNN
      min_len: 8
      max_len: 8
      onlist: !Onlist
        filename: index5_onlist.txt
        md5: null
      regions:
  # index7
  - !Region
    region_id: I1.fastq.gz
    region_type: I1.fastq.gz
    name: Index 1 FASTQ
    sequence_type: joined
    sequence: NNNNNNNN
    min_len: 8
    max_len: 8
    onlist:
    regions:
    - !Region
      region_id: index7
      region_type: index7
      name: index7
      sequence_type: onlist
      sequence: NNNNNNNN
      min_len: 8
      max_len: 8
      onlist: !Onlist
        filename: index7_onlist.txt
        md5: null
      regions:
  # Read 1 Fastq
  - !Region
    region_id: R1.fastq.gz
    region_type: R1.fastq.gz
    name: Read 1 FASTQ
    sequence_type: joined
    sequence:
    min_len: 0
    max_len: 0
    onlist:
    regions:
  # Read 2 Fastq
  - !Region
    region_id: R2.fastq.gz
    region_type: R2.fastq.gz
    name: Read 2 FASTQ
    sequence_type: joined
    sequence:
    min_len: 0
    max_len: 0
    onlist:
    regions:
  # barcode
  # note for multiple of the same region
  # the region id gets a number, i.e. barcode-1 barcode-2
  - !Region
    region_id: barcode
    region_type: barcode
    name: Barcode
    sequence_type: onlist
    sequence: NNNNNNNNNNNNNNNN
    min_len: 16
    max_len: 16
    onlist: !Onlist
      filename: barcode_onlist.txt
      md5: null
    regions:
  # umi "Unique Molecular Identifier"
  - !Region
    region_id: umi
    region_type: umi
    name: Unique Molecular Identifier
    sequence_type: random
    sequence: NNNNNNNNNN
    min_len: 10
    max_len: 10
    onlist:
    regions:
  # cDNA "complementary DNA"
  - !Region
    region_id: cDNA
    region_type: cDNA
    name: Complementary DNA
    sequence_type: random
    sequence: X
    min_len: 1
    max_len: 98
    onlist:
    regions:
  # gDNA "genomic DNA"
  - !Region
    region_id: gDNA
    region_type: gDNA
    name: Genomic DNA
    sequence_type: random
    sequence: X
    min_len: 1
    max_len: 98
    onlist:
    regions:
  # Regions corresponding to FASTQ files are annotated with a standard naming convention
  # R1.fastq.gz "Read 1"
  # R2.fastq.gz "Read 2"
  # I1.fastq.gz "Index 1, i7 index"
  # I2.fastq.gz "Index 2, i5 index"
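The note that the region_id of all atomic regions should be unique can be checked by walking the nested regions and collecting the leaves. The sketch below assumes that "atomic" means a region with no nested regions and uses plain dicts holding a small excerpt of the template above; `atomic_region_ids` and `duplicated_ids` are hypothetical helper names, not part of the seqspec tool.

```python
from collections import Counter


def atomic_region_ids(regions):
    """Yield the region_id of every region that has no nested regions."""
    for region in regions:
        nested = region.get("regions") or []
        if nested:
            yield from atomic_region_ids(nested)
        else:
            yield region["region_id"]


def duplicated_ids(regions):
    """Return the atomic region_ids that occur more than once, sorted."""
    counts = Counter(atomic_region_ids(regions))
    return sorted(rid for rid, n in counts.items() if n > 1)


# A small excerpt of the template, written as plain dicts.
assay_spec = [
    {
        "region_id": "RNA",
        "regions": [
            {"region_id": "illumina_p5", "regions": None},
            {
                "region_id": "nextera_read1",
                "regions": [
                    {"region_id": "s5", "regions": None},
                    {"region_id": "ME1", "regions": None},
                ],
            },
            {"region_id": "barcode", "regions": None},
        ],
    }
]

print(list(atomic_region_ids(assay_spec)))  # ['illumina_p5', 's5', 'ME1', 'barcode']
print(duplicated_ids(assay_spec))           # []
```

If an assay had two barcode regions that both used the id barcode, `duplicated_ids` would return `['barcode']`, which is the situation the barcode-1, barcode-2 numbering convention avoids.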
Thank you for wanting to improve seqspec. If you have a bug that is related to seqspec please create an issue. The issue should contain
- the seqspec command that was run,
- the error message, and
- the seqspec and python version.
If you'd like to add assay sequence specifications or make modifications to the seqspec tool please do the following:
- Fork the project.
# Press "Fork" at the top right of the GitHub page
- Clone the fork and create a branch for your feature
git clone https://github.com/<USERNAME>/seqspec.git
cd seqspec
git checkout -b cool-new-feature
- Make changes, add files, and commit
# make changes, add files, and commit them
git add path/to/file1.yaml path/to/file2.yaml
git commit -m "I made these changes"
- Push changes to GitHub
git push origin cool-new-feature
- Submit a pull request
If you are unfamiliar with pull requests, you can find more information on the GitHub help page.