-
Notifications
You must be signed in to change notification settings - Fork 1
Home
Seqtool is a fast and flexible command line program for dealing with
large amounts of biological sequences. It can read and write the
FASTA, FASTQ and QUAL files, as well as CSV and other delimited files. It also handles different common compression formats out of the box.
The tool is written in Rust and aims at solving
simple tasks that might otherwise only be solved by writing
custom scripts while being very fast. It uses
seq_io and
Rust-Bio, amongst others,
and compiles to a standalone binary named st
. See below for
instructions.
Features:
- Format conversion, including different FASTQ variants. File extensions are auto-recognized if possible
- Many commands for summarizing, viewing, searching, shuffling and modifying sequences
- Variables allow to integrate sequence properties, metadata from sequence headers and from other files, and enable a flexible configuration of commands
- Filtering of sequences using mathematical expressions containing variables
- Passing metadata of FASTA/FASTQ sequences between commands is made easy by the ability to write and parse sequence attributes, which are key=value annotations in the sequence headers.
- Commands can be connected using the pipe (
|
) operator.
- pass: This command is useful for converting from one format to another and/or setting attributes.
- view: View biological sequences, coloured by base / amino acid, or by sequence quality. The output is automatically forwarded to the 'less' pager on UNIX.
- count: This command counts the number of sequences and prints the number to STDOUT. Advanced grouping of sequences is possible by supplying or more key strings containing variables (-k).
- stat: Invalid arguments.
- head: Returns the first sequences of the input.
- tail: Returns the last sequences of the input.
- slice: Get a slice of the sequences within a defined range.
- sample: Return a random subset of sequences.
- filter: Filters sequences by a mathematical expression which may contain any variable.
- split: This command distributes sequences into multiple files based on different criteria. In contrast to other commands, the output (-o) argument can contain variables in order to determine the file path for each sequence.
- interleave: Interleaves records of all files in the input. The records will returned in the same order as the files.
- find: Fast searching for one or more patterns in sequences or ids/descriptions, with optional multithreading.
- replace: This command does fast search and replace for patterns in sequences or ids/descriptions.
- del: Deletes description field or attributes.
- set: Replaces the contents of sequence IDs, descriptions or sequences.
- trim: Trims sequences to a given range.
- mask: Masks the sequence within a given range or comma delimited list of ranges by converting to lowercase (soft mask) or replacing with a character (hard masking). Reverting soft masking is also possible.
- upper: Converts all characters in the sequence to uppercase.
- lower: Converts all characters in the sequence to lowercase.
- revcomp: Reverse complements DNA sequences. If quality scores are present, their order is just reversed.
- concat: Concatenates sequences/alignments from different files in the order in which they are provided. Fails if the IDs don't match.
Binaries for Linux, Mac OS X and Windows can be
downloaded from the releases section.
For compiling from source, install Rust, download the source
code; and inside the root directory type cargo build --release
. The binary is found in
target/release/.
st <command> [<options>] [<files>...]
All commands accept one or multiple files and STDIN input. The output is written
to STDOUT or a file (-o
, useful for format conversion). Commands can
be easily chained using the pipe.
Use st <command> -h
to see all available options. A full list of options
that are accepted by all commands can be found here.
The following run time comparison of diffferent tasks aims to give a quick overview but is not comprehensive by any means. Comparisons to a selection of other tools/toolsets are shown if there exists an equivalent operation. For all commands, a 1.1 Gb FASTQ file containing 1.73 million Illumina reads of 150-500 bp length was used. They were run on a Mac Pro (Mid 2010, 2.8 GHz Quad-Core Intel Xeon, OS X 10.9) (script).
seqtool | [4 threads] | seqtk | seqkit | FASTX | biopieces | |
---|---|---|---|---|---|---|
Simple counting | 0.62s | 46.99s | ||||
Conversion to FASTA | 1.20s | 2.85s | 4.93s | 3min 38.4s | 3min 37.8s | |
Reverse complement | 3.91s | 1.14s | 5.46s | 10.14s | 6min 11.8s | 1m33.6s |
Random subsampling (to 10%) | 0.83s | 2.05s | 2.54s | |||
DNA to RNA (T -> U) | 8.03s | 2.35s | 6.13s | 7min 9.4s | 1min 49.1s | |
Remove short sequences | 1.62s | 3.45s | 2.91s | 1min 23.6s | ||
Summarize GC content | 4.45s | |||||
.. with math formula (GC% / 100) | 4.55s | |||||
Find forward primers with max. 4 mismatches | 8.02s | 2.34s | ||||
Remove the primers if found (1.36 M seqs) | 2.26s |
Simple counting is the fastest operation, faster than the UNIX line counting
command (wc -l
, 2.70s) on OS X. The commands find
, replace
and revcomp
additionally profit from multithreading.
Compressed files are recognized based on their extension (Example:
st . seqs.lz4
). Compressed I/O is done in a separate thread by default,
which makes reading/writing faster than via the pipe (e.g. lz4 -dc seqs.lz4 | st .
),
with the exception of GZIP on OS X. Reading/writing LZ4
is almost as fast as reading uncompressed input. Writing LZ4 is only slightly
slower while providing a reasonable compression ratio. For files stored on
slow hard disks, LZ4 can be even faster than uncompressed I/O.
Zstandard was added because it provides a
better compression than LZ4 while still being very fast.
format | file size (Mb) | read | (piped) | compress | (piped) |
---|---|---|---|---|---|
uncompressed1 | 1199 | 1.28s | - | 1.23s | - |
LZ4 | 192 | 1.36s | 2.71s | 2.60s | 3.95s |
GZIP | 101 | 10.98s | 6.15s | 53.83s | 50.63s |
Zstandard | 86 | 2.33s | 3.62s | 4.29s | 5.79s |
BZIP2 | 60 | 32.85s | 30.25s | 3min 35.3s | 4min 20.4s |
1 Using -T/--read-thread
/ --write-thread
I am grateful for comments and ideas on how to improve the tool and also about feedback in general. Commands for sorting, dereplication and for working with alignments are partly implemented but not ready.
Since the tool is quite new, it is possible that there are bugs, even if tests for every command and for most parameter combinations have been written.