muset pipeline

Riccardo Vicedomini edited this page Jul 24, 2024 · 2 revisions


muset v0.2

   muset - a pipeline for building an abundance unitig matrix from a list of FASTA/FASTQ files.

   muset [options] INPUT_FILE

   -i PATH    skip matrix construction and run the pipeline with a previously computed matrix
   -k INT     k-mer size (default: 31)
   -a INT     min abundance to keep a k-mer (default: 2)
   -l INT     minimum size of the unitigs to be retained in the final matrix (default: 2k-1)
   -o PATH    output directory (default: output)
   -m INT     minimizer length  (default: 15)
   -n INT     minimum number of samples from which a k-mer should be absent (mutually exclusive with -f)
   -f FLOAT   fraction of samples from which a k-mer should be absent (default: 0.1, mutually exclusive with -n)
   -N INT     minimum number of samples in which a k-mer should be present (mutually exclusive with -F)
   -F FLOAT   fraction of samples in which a k-mer should be present (default: 0.1, mutually exclusive with -N)
   -t INT     number of cores (default: 4)
   -s         write the unitig sequence in the first column of the output matrix instead of the identifier
   -h         show this help message and exit
   -V         show version number and exit

    INPUT_FILE   Input file (fof) containing the description of input samples.
                 It is ignored if -i option is used.

   Options -n and -f are mutually exclusive, as well as options -N and -F.
   When either -n or -f is used, -N or -F must also be provided, and vice versa.
   If none of the -n, -N, -f, -F options are used the last two options are used with their default values.
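
The option rules above can be summarized in a short sketch. This Python is not muset's implementation: `resolve_filters` and `keep_kmer` are hypothetical names, and the way a fractional threshold is compared (the observed fraction of samples must be at least the given value) is an assumption drawn from the help text.

```python
from typing import Optional, Sequence, Tuple, Union

Threshold = Union[int, float]

DEFAULT_FRACTION = 0.1  # default of -f and -F according to the help text


def resolve_filters(n: Optional[int] = None, f: Optional[float] = None,
                    N: Optional[int] = None, F: Optional[float] = None
                    ) -> Tuple[Threshold, Threshold]:
    """Return (absent_threshold, present_threshold) following the help text rules."""
    if n is not None and f is not None:
        raise ValueError("-n and -f are mutually exclusive")
    if N is not None and F is not None:
        raise ValueError("-N and -F are mutually exclusive")
    absent_given = n is not None or f is not None
    present_given = N is not None or F is not None
    if absent_given != present_given:
        raise ValueError("-n/-f requires -N/-F, and vice versa")
    if not absent_given:
        # None of the four options was given: fall back to the -f/-F defaults.
        return DEFAULT_FRACTION, DEFAULT_FRACTION
    absent = n if n is not None else f
    present = N if N is not None else F
    return absent, present


def keep_kmer(presence: Sequence[bool], absent_thr: Threshold,
              present_thr: Threshold) -> bool:
    """Assumed filter: enough samples lack the k-mer AND enough samples contain it."""
    total = len(presence)
    present = sum(presence)
    absent = total - present

    def meets(count: int, thr: Threshold) -> bool:
        # Integers are sample counts, floats are fractions of all samples.
        return count >= thr if isinstance(thr, int) else count / total >= thr

    return meets(absent, absent_thr) and meets(present, present_thr)
```

For instance, `resolve_filters(n=2, F=0.5)` mixes a count with a fraction, which the rules allow, whereas `resolve_filters(n=2)` raises an error because no presence threshold accompanies it.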

Input data

I do not have a k-mer matrix

If you do not have a k-mer matrix ready, create a "fof" (file of files): a text file with one line per sample, using the following syntax:

  • <Sample ID> : <1.fastq.gz> ; ... ; <N.fastq.gz>

Files can be in either FASTA or FASTQ format, gzipped or not. Multiple files per sample can be provided by separating them with a semicolon. For example:

A1 : /path/to/fastq_A1_1
B1 : /path/to/fastq_B1_1 ; /with/multiple/fasta_B1_2
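
To make the syntax concrete, here is a minimal sketch of how one fof line could be split into a sample ID and its file list. `parse_fof_line` is a hypothetical helper, not part of muset or kmtricks, and it assumes that the sample ID contains no colon and that file paths contain no semicolon.

```python
from typing import List, Tuple


def parse_fof_line(line: str) -> Tuple[str, List[str]]:
    """Split '<Sample ID> : <file1> ; ... ; <fileN>' into (sample_id, [files])."""
    # Split on the first colon only, so later colons stay inside the paths.
    sample_id, sep, files = line.partition(":")
    if not sep:
        raise ValueError(f"missing ':' in fof line: {line!r}")
    paths = [p.strip() for p in files.split(";") if p.strip()]
    if not sample_id.strip() or not paths:
        raise ValueError(f"empty sample ID or file list in fof line: {line!r}")
    return sample_id.strip(), paths


print(parse_fof_line("B1 : /path/to/fastq_B1_1 ; /path/to/fasta_B1_2"))
# ('B1', ['/path/to/fastq_B1_1', '/path/to/fasta_B1_2'])
```

A check like this can catch malformed lines before a long run is started.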

You can generate such an input file from a folder of input files (one file per sample, with consecutive numbers as sample IDs) as follows:

ls -1 folder/*  | sort -n -t/ -k2 | xargs -L1 readlink -f | awk '{ print ++count" : "$1 }' >fof.txt

Then simply run:

muset fof.txt
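
If a shell pipeline is inconvenient, the same kind of fof file could be produced from Python. The sketch below is an assumed equivalent of the one-liner above rather than a muset utility: `write_fof` is a hypothetical name, files are ordered by plain name sorting instead of `sort -n`, and each file becomes its own sample with a consecutive number as ID.

```python
from pathlib import Path


def write_fof(folder: str, fof_path: str = "fof.txt") -> int:
    """Write one 'ID : absolute_path' line per regular file in folder; return the line count."""
    # List the files before creating the output, so the fof never lists itself.
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    with open(fof_path, "w") as out:
        for count, path in enumerate(files, start=1):
            # resolve() yields an absolute path, similar to `readlink -f`.
            out.write(f"{count} : {path.resolve()}\n")
    return len(files)
```

Calling `write_fof("folder")` writes `fof.txt` in the current directory. Samples made of several files would still need their lines merged by hand, using semicolons.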

I already have a k-mer matrix

If you are familiar with kmtricks and/or have already produced a k-mer matrix on your own, you can run muset with the -i option and provide your own input matrix, which skips the potentially long matrix construction step.

Make sure to provide the matrix in text format. You can produce one from a kmtricks run with the command kmtricks aggregate and the parameters --matrix kmer --format text. By default, kmtricks writes the matrix to stdout, so you may want to set the --output parameter. For example:

kmtricks aggregate --matrix kmer --format text --cpr-in --sorted --output sorted_matrix.txt --run-dir kmtricks_output_dir

The pipeline can then be run as follows:

muset -i sorted_matrix.txt <input_fof.txt>

Output data

muset writes its results to an output folder that contains intermediate files and a unitigs.mat file, the abundance unitig matrix. Each row corresponds to a unitig and each column to a sample. Each entry holds two values separated by a semicolon: the average abundance of the unitig's k-mers in the sample and the fraction of the unitig's k-mers present in the sample. For example:

Unitig ID   Sample 1    Sample 2    Sample 3    Sample 4    Sample 5
0           0.00;0.00   0.00;0.00   0.00;0.00   0.00;0.00   2.00;1.00
1           2.00;1.00   2.00;1.00   2.00;1.00   2.00;1.00   0.00;0.00
2           0.00;0.00   0.00;0.00   0.00;0.00   0.00;0.00   2.00;1.00
3           0.00;0.00   0.00;0.00   0.00;0.00   0.00;0.00   2.00;1.00
4           2.00;1.00   2.00;1.00   2.00;1.00   2.00;1.00   0.00;0.00

Note: To get the unitig sequence in the first column instead of the unitig identifier, run muset with the -s flag.
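
Downstream scripts usually need the two numbers of each entry separately. The sketch below shows one way to split a data row. `parse_matrix_row` is a hypothetical helper, and it assumes whitespace-separated columns and a row without header text, which may not match the exact layout of unitigs.mat.

```python
from typing import List, Tuple


def parse_matrix_row(row: str) -> Tuple[str, List[Tuple[float, float]]]:
    """Parse '<unitig> <abundance;fraction> ...' into (unitig, [(abundance, fraction), ...])."""
    # First field: unitig ID (or sequence with -s); remaining fields: one per sample.
    unitig, *entries = row.split()
    values = []
    for entry in entries:
        abundance, fraction = entry.split(";")
        values.append((float(abundance), float(fraction)))
    return unitig, values


unitig, values = parse_matrix_row("1 2.00;1.00 2.00;1.00 2.00;1.00 2.00;1.00 0.00;0.00")
# Indices (0-based) of the samples in which every k-mer of the unitig is present
print(unitig, [i for i, (_, frac) in enumerate(values) if frac == 1.0])
# 1 [0, 1, 2, 3]
```

Because the first field is kept as a string, the same function also works when muset is run with -s and the first column holds the unitig sequence.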

The average abundance of a unitig $u$ with respect to a sample $S$ (number on the left of the semicolon) is defined as:

$$ A(u, S) = \frac{\sum\limits_{i=1}^{N}{c_i}}{N} $$

where $N$ is the number of k-mers in $u$, and $c_i$ is the abundance of the $i$-th k-mer of $u$ in sample $S$.

The fraction of k-mers in a unitig $u$ that are present in a sample $S$ (number on the right of the semicolon) is defined as:

$$ f(u, S) = \frac{\sum\limits_{i=1}^{N}{x_i}}{N} $$

where $N$ is the number of k-mers in $u$, and $x_i$ is a binary variable that is 1 when the $i$-th k-mer is present in sample $S$ and 0 otherwise.
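
As a worked example of the two definitions, the sketch below computes both values from the per-k-mer abundances of one unitig in one sample. `unitig_stats` is a hypothetical function, and treating a k-mer as present whenever its abundance is greater than zero is an assumption; muset may apply its own presence criterion.

```python
from typing import Sequence, Tuple


def unitig_stats(counts: Sequence[int]) -> Tuple[float, float]:
    """Return (A(u, S), f(u, S)) from the per-k-mer abundances c_1..c_N of a unitig in one sample."""
    n = len(counts)
    if n == 0:
        raise ValueError("a unitig contains at least one k-mer")
    average = sum(counts) / n                  # A(u, S): mean abundance over all N k-mers
    fraction = sum(c > 0 for c in counts) / n  # f(u, S): share of k-mers with x_i = 1 (assumed: c_i > 0)
    return average, fraction


average, fraction = unitig_stats([2, 2, 0, 4])
print(f"{average:.2f};{fraction:.2f}")  # 2.00;0.75
```

Here the unitig has four k-mers with abundances 2, 2, 0 and 4, so the average is 8/4 = 2.00 and three of the four k-mers are present, giving 0.75.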