muset pipeline
muset v0.2
DESCRIPTION:
muset - a pipeline for building an abundance unitig matrix from a list of FASTA/FASTQ files.
USAGE:
muset [options] INPUT_FILE
OPTIONS:
-i PATH skip matrix construction and run the pipeline with a previously computed matrix
-k INT k-mer size (default: 31)
-a INT min abundance to keep a k-mer (default: 2)
-l INT minimum size of the unitigs to be retained in the final matrix (default: 2k-1)
-o PATH output directory (default: output)
-m INT minimizer length (default: 15)
-n INT minimum number of samples from which a k-mer should be absent (mutually exclusive with -f)
-f FLOAT fraction of samples from which a k-mer should be absent (default: 0.1, mutually exclusive with -n)
-N INT minimum number of samples in which a k-mer should be present (mutually exclusive with -F)
-F FLOAT fraction of samples in which a k-mer should be present (default: 0.1, mutually exclusive with -N)
-t INT number of cores (default: 4)
-s write the unitig sequence in the first column of the output matrix instead of the identifier
-h show this help message and exit
-V show version number and exit
POSITIONAL ARGUMENTS:
INPUT_FILE Input file (fof) containing the description of input samples.
It is ignored if -i option is used.
NOTES:
Options -n and -f are mutually exclusive, as well as options -N and -F.
When either -n or -f is used, -N or -F must also be provided, and vice versa.
If none of the -n, -N, -f, -F options is used, -f and -F are applied with their default values.
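The rules in the NOTES above can be sketched in a few lines of Python. This is only an illustration of the logic described in the help text, not muset's own argument handling; the function name resolve_filters and its return shape are hypothetical.

```python
from typing import Optional, Tuple

DEFAULT_FRACTION = 0.1  # default of -f and -F according to the help text


def resolve_filters(
    n: Optional[int] = None,
    f: Optional[float] = None,
    N: Optional[int] = None,
    F: Optional[float] = None,
) -> Tuple[Tuple[str, float], Tuple[str, float]]:
    """Return the (absence, presence) filters implied by the four options.

    Each filter is a pair (flag, value), for example ("-f", 0.1).
    Hypothetical helper: it mirrors the NOTES, not muset's source code.
    """
    if n is not None and f is not None:
        raise ValueError("-n and -f are mutually exclusive")
    if N is not None and F is not None:
        raise ValueError("-N and -F are mutually exclusive")

    absence_given = n is not None or f is not None
    presence_given = N is not None or F is not None

    if not absence_given and not presence_given:
        # Nothing given: fall back to -f and -F with their defaults.
        return ("-f", DEFAULT_FRACTION), ("-F", DEFAULT_FRACTION)
    if absence_given != presence_given:
        # One side without the other is rejected.
        raise ValueError("give one of -n/-f together with one of -N/-F")

    absence = ("-n", n) if n is not None else ("-f", f)
    presence = ("-N", N) if N is not None else ("-F", F)
    return absence, presence
```

For instance, resolve_filters(n=2, F=0.5) is accepted, while resolve_filters(n=2) alone raises an error because no presence filter accompanies it.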
If you do not have a k-mer matrix ready, create a "fof" file: a text file that contains one line per sample, with the following syntax:
<Sample ID> : <1.fastq.gz> ; ... ; <N.fastq.gz>
Files can be in either FASTA or FASTQ format, gzipped or not. Multiple files per sample can be provided by separating them with a semicolon.
Example:
A1 : /path/to/fastq_A1_1
B1 : /path/to/fastq_B1_1 ; /with/multiple/fasta_B1_2
You can generate such an input file from a folder containing many input files as follows:
ls -1 folder/* | sort -n -t/ -k2 | xargs -L1 readlink -f | awk '{ print ++count" : "$1 }' >fof.txt
Then simply run:
muset fof.txt
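The one-liner above gives every file its own numeric sample ID. When several files belong to the same sample, a small script can group them instead. The Python sketch below assumes a hypothetical naming rule in which the sample ID is the file name up to the last underscore (A1_1.fastq.gz belongs to sample A1); adapt sample_id_of to your own naming scheme.

```python
import os
from collections import defaultdict
from typing import Dict, Iterable, List


def sample_id_of(path: str) -> str:
    """Hypothetical naming rule: '/data/A1_1.fastq.gz' -> 'A1'."""
    stem = os.path.basename(path).split(".")[0]  # drop all extensions
    return stem.rsplit("_", 1)[0]                # drop the trailing _<n>


def build_fof_lines(paths: Iterable[str]) -> List[str]:
    """Group files by sample and format one fof line per sample."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(paths):
        groups[sample_id_of(path)].append(path)
    # Syntax from the text: <Sample ID> : <file 1> ; ... ; <file N>
    return [
        f"{sample} : {' ; '.join(files)}"
        for sample, files in sorted(groups.items())
    ]
```

Writing "\n".join(build_fof_lines(paths)) to fof.txt gives a file that follows the syntax shown earlier, with one line per sample.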
If you are familiar with kmtricks and/or have already produced a k-mer matrix on your own, you can run muset with the -i option and provide your own input matrix (skipping the possibly long matrix construction).
Make sure to provide a matrix in text format. You can easily output one from a kmtricks run using the command kmtricks aggregate with parameters --matrix kmer --format text.
By default, kmtricks will write it on stdout, so you might want to set the --output parameter.
Ex: kmtricks aggregate --matrix kmer --format text --cpr-in --sorted --output sorted_matrix.txt --run-dir kmtricks_output_dir
The pipeline can then be run as follows:
muset -i sorted_matrix.txt <input_fof.txt>
The output of muset is a folder containing intermediate results and a unitigs.mat file, which is an abundance unitig matrix. Each row corresponds to a unitig and each column to a sample. Each entry of the matrix gives the average abundance and the fraction of the unitig k-mers belonging to the sample, separated by a semicolon. Ex:
Unitig ID | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
---|---|---|---|---|---|
0 | 0.00;0.00 | 0.00;0.00 | 0.00;0.00 | 0.00;0.00 | 2.00;1.00 |
1 | 2.00;1.00 | 2.00;1.00 | 2.00;1.00 | 2.00;1.00 | 0.00;0.00 |
2 | 0.00;0.00 | 0.00;0.00 | 0.00;0.00 | 0.00;0.00 | 2.00;1.00 |
3 | 0.00;0.00 | 0.00;0.00 | 0.00;0.00 | 0.00;0.00 | 2.00;1.00 |
4 | 2.00;1.00 | 2.00;1.00 | 2.00;1.00 | 2.00;1.00 | 0.00;0.00 |
Note: If instead of the unitig identifier you prefer to have the unitig sequence, run muset with the flag -s.
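To work with the matrix programmatically, each entry has to be split at the semicolon. The Python sketch below parses one row into the unitig identifier (or sequence, when -s is used) and a list of (average abundance, fraction) pairs. It assumes that the fields of a row in unitigs.mat are separated by whitespace, which the table above does not confirm; the function names are hypothetical.

```python
from typing import List, Tuple


def parse_entry(entry: str) -> Tuple[float, float]:
    """Split 'abundance;fraction' into two floats."""
    abundance, fraction = entry.split(";")
    return float(abundance), float(fraction)


def parse_row(line: str) -> Tuple[str, List[Tuple[float, float]]]:
    """Parse one matrix row.

    Assumption: fields are whitespace-separated; the first field is the
    unitig ID (or the unitig sequence when muset was run with -s).
    """
    unitig, *entries = line.split()
    return unitig, [parse_entry(e) for e in entries]
```

Keeping the first field as a string means the same function handles both identifiers and sequences.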
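The average abundance of a unitig in a sample is the mean abundance, in that sample, of the k-mers that make up the unitig.
The fraction of k-mers in a unitig is the number of the unitig's k-mers that are present in the sample divided by the total number of k-mers in the unitig.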
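As an illustration of the two values stored in each matrix entry, the Python sketch below computes them from the abundances of a unitig's k-mers in one sample. It assumes that the average is taken over all k-mers of the unitig, with absent k-mers counted as zero, and that a k-mer is present when its abundance is greater than zero; the text here does not spell out either point, and unitig_stats is a hypothetical name.

```python
from typing import Sequence, Tuple


def unitig_stats(kmer_abundances: Sequence[int]) -> Tuple[float, float]:
    """Return (average abundance, fraction of present k-mers) for one sample.

    kmer_abundances[i] is the abundance of the unitig's i-th k-mer in the
    sample; 0 means the k-mer is absent (assumption, see lead-in).
    """
    if not kmer_abundances:
        raise ValueError("a unitig contains at least one k-mer")
    total = len(kmer_abundances)
    average = sum(kmer_abundances) / total
    fraction = sum(1 for c in kmer_abundances if c > 0) / total
    return average, fraction
```

Under these assumptions, a unitig whose k-mers all have abundance 2 in a sample yields the pair written as 2.00;1.00 in the matrix, and a unitig with no k-mer in the sample yields 0.00;0.00.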