GitHub - markschl/seqtool: Fast and flexible tool for reading, modifying and writing biological sequences

Seqtool is a fast and flexible command line program for dealing with large amounts of biological sequences. It provides different subcommands for converting, inspecting and modifying sequences. The standalone binary (5-7 MB) is simply named st to save some typing.

Note: this page describes the development version 0.4-beta. The older stable version (v0.3.0) is documented here.

Detailed documentation

See this page

Downloads

📥 download beta release (v0.4.0-beta.3)

Should be pretty safe to use despite the considerable refactoring. Approximate matching (find command) is yet to be fully tested.

📥 download stable release (v0.3.0)

⚠ Note: there are a few unfixed bugs in v0.3.0 (currently) when reading GZIP files or searching/replacing; see CHANGELOG for v0.4.0-beta.

Feature overview

File formats

Reads and writes FASTA, FASTQ and CSV/TSV, optionally compressed with GZIP, BZIP2, or the faster and more modern Zstandard or LZ4 formats

Example: compressed FASTQ to FASTA

Combine multiple compressed FASTQ files, converting them to FASTA, using pass.

st pass file1.fastq.gz file2.fastq.gz -o output.fasta

Note: almost every command can read multiple input files and convert between formats, but pass does nothing other than reading and writing while other command perform certain actions.

Example: FASTA to tab-separated list

Aside from ID and sequence, any variable/function such as the sequence length (seqlen) can be written to delimited text.

st pass input.fasta --to-tsv id,seq,seqlen

id1	ACG	3
id1	ACGTACGT	7
id1	ACGTA	5

Highly versatile thanks to variables/functions

See also variables/functions for more details.

Example: count sequences in a large set of FASTQ files

st count -k path data/*.fastq.gz

data/sample1.fastq.gz	30601
data/sample2.fastq.gz	15702
data/sample3.fastq.gz	264965
data/sample4.fastq.gz	1120
data/sample5.fastq.gz	7021
(...)

In count, one or several categorical variables/functions can be specified with -k/--key.

Example: summarize the GC content in 10% intervals

The function bin(variable, interval) groups continuous numeric values into intervals

st count -k 'bin(gc_percent, 10)' sequences.fasta

(10, 20]	57
(20, 30]	2113
(30, 40]	11076
(40, 50]	7184
(50, 60]	12

Example: Assign new sequence IDs

st set -i 'seq_{num}' seqs.fasta > renamed.fasta

>seq_1
SEQUENCE
>seq_2
SEQUENCE
>seq_3
SEQUENCE
(...)

Example: De-replicate by description and sequence

seqs.fasta with a 'group' annotation in the header:

>id1 group1
SEQUENCE1
>id2 group1
SEQUENCE2
>id3 group1
SEQUENCE2
>id4 group2
SEQUENCE1
>id5 group2
SEQUENCE1

st unique 'desc,seq' seqs.fasta > grouped_uniques.fasta

>id1 group1
SEQUENCE1
>id2 group1
SEQUENCE2
>id4 group2
SEQUENCE1

Expressions

From simple math to complicated filter expressions, the tiny integrated JavaScript engine (QuickJS) offers countless possibilities for customized sequence processing.

Example: filter FASTQ sequences by quality and length

This filter command removes sequencing reads with more than one expected sequencing error (like USEARCH can do) or sequence length of <100 bp.

st filter 'exp_err < 1 && seqlen >= 100' reads.fastq > filtered.fastq

Header attributes for metadata storage

key=value header attributes allow storing and passing on all kinds of information

Example: De-replicate by sequence (seq variable) and/or other properties

The unique command returns all unique sequences and annotates the number of records with the same sequence in the header:

st unique seq -a abund={n_duplicates} input.fasta > uniques.fasta

>id1 abund=3
TCTTTAATAACCTGATTAG
>id3 abund=1
GGAGGATCCGAGCG
(...)

It is also possible to de-replicate by multiple keys, e.g. by sequence, but grouped by a sample attribute in the header:

st unique 'seq,attr(sample)' input.fasta > uniques.fasta

>id1 sample=1
SEQUENCE1
>id3 sample=2
SEQUENCE2
>id10 sample=1
SEQUENCE3
>id11 sample=3
SEQUENCE4
(...)

Example: pre-processing of mixed multi-marker amplicon sequences (primer trimming, grouping by amplicon)

These steps could be part of an amplicon pipeline that de-multiplexes multi-marker amplicons. find searches for a set of primers, which are removed by trim, and finally split distributes the sequences into different files named by the forward primer.

primers.fasta

>prA
PRIMER
>prB
PRIMER

Command for searching/trimming

st find file:primers.fasta -a primer='{pattern_name}' -a end='{match_end}' sequences.fasta |
  st trim -e '{attr(end)}..' | 
  st split -o '{attr(primer)}'

prA.fasta prB.fasta undefined.fasta

>id1 primer=prA end=22
SEQUENCE
>id4 primer=prA end=21
SEQUENCE
(...)

>id2 primer=prB end=20
SEQUENCE
>id3 primer=prB end=22
SEQUENCE
(...)

>id5 primer=undefined end=undefined
UNTRIMMEDSEQUENCE
(...)

Note: no primer, sequence not trimmed since end=undefined (see ranges).

Integration of external metadata

Integration of sequence metadata sources in the form of delimited text

Example: Add Genus names from a separate tab-separated list

input.fasta	genus.tsv
`>id1 SEQUENCE >id2 SEQUENCE (...)`	`id genus seq1 Actinomyces seq2 Amycolatopsis (...)`

Using -m/--meta to include genus.tsv as metadata source:

st set -m genus.tsv --desc '{meta(genus)}' input.fasta > with_genus.fasta

with_genus.fasta
`>seq1 Actinomyces SEQUENCE >seq2 Amycolatopsis SEQUENCE (...)`

Example: Choose specific sequences given a separate file with an ID list

input.fasta	id_list.txt
`>id1 SEQUENCE >id2 SEQUENCE >id3 SEQUENCE >id4 SEQUENCE`	`id1 id4`

st filter -m id_list.txt 'has_meta()' input.fasta > subset.fasta

subset.fasta
`>id1 SEQUENCE >id4 SEQUENCE`

Commands

Basic conversion/editing

pass: Directly pass input to output without any processing, useful for converting and attribute setting

Information about sequences

view: View biological sequences, colored by base / amino acid, or by sequence quality
count: Count all records in the input (total or categorized by variables/functions)
stat: Return per-sequence statistics as tab delimited list

Subset/shuffle

sort: Sort records by sequence or any other criterion
unique: De-replicate by sequence and/or other properties, returning only unique records
filter: Keep/exclude sequences based on different properties with a mathematical (JavaScript) expression
split: Distribute sequences into multiple files based on a variable/function or advanced expression
sample: Get a random subset of sequences; either a fixed number or an approximate fraction of the input
slice: Return a range of sequence records from the input
head: Return the first N sequences
tail: Return the last N sequences
interleave: Interleave records of all files in the input

Search and replace

find: Search for pattern(s) in sequences or sequene headers for record filtering, pattern replacement or passing hits to next command
replace: Fast and simple pattern replacement in sequences or headers

Modifying commands

del: Delete header ID/description and/or attributes
set: Replace the header, header attributes or sequence with new content
trim: Trim sequences on the left and/or right (single range) or extract and concatenate several ranges
mask: Soft or hard mask sequence ranges
upper: Convert sequences to uppercase
lower: Convert sequences to lowercase
revcomp: Reverse complements DNA or RNA sequences
concat: Concatenates sequences/alignments from different files

Comparison with other tools

There are other tools with a similar focus such as Seqtk, SeqKit, the FASTX-Toolkit, as well as the more specialized USEARCH and VSEARCH offering some of the functions as well.

Seqtool performs well compared to these tools on a selection of diverse tasks:

Comparison of tools

Name		Name	Last commit message	Last commit date
Latest commit History 274 Commits
.github/workflows		.github/workflows
js		js
profile		profile
scripts		scripts
src		src
var_provider		var_provider
wix		wix
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
build.rs		build.rs
dist-workspace.toml		dist-workspace.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Detailed documentation

Downloads

Feature overview

File formats

Highly versatile thanks to variables/functions

Expressions

Header attributes for metadata storage

Integration of external metadata

Commands

Basic conversion/editing

Information about sequences

Subset/shuffle

Search and replace

Modifying commands

Comparison with other tools

About

Releases 8

Packages

Contributors 2

Languages

License

markschl/seqtool

Folders and files

Latest commit

History

Repository files navigation

Detailed documentation

Downloads

Feature overview

File formats

Highly versatile thanks to variables/functions

Expressions

Header attributes for metadata storage

Integration of external metadata

Commands

Basic conversion/editing

Information about sequences

Subset/shuffle

Search and replace

Modifying commands

Comparison with other tools

About

Resources

License

Stars

Watchers

Forks

Releases 8

Packages 0

Contributors 2

Languages

Packages