muset_pa pipeline

Riccardo Vicedomini edited this page Jul 24, 2024 · 2 revisions

Usage

MUSET also includes muset_pa, an auxiliary executable that uses GGCAT and kmat_tools to generate a presence-absence unitig matrix in text format from a list of input samples.

muset_pa v0.2

DESCRIPTION:
   muset_pa - a pipeline for building a presence-absence unitig matrix from a list of FASTA/FASTQ files.
              this pipeline has fewer parameters than muset and less filtering options as it does not build
              nor use an intermediate k-mer abundance matrix.
              If you wish a 0/1 binary matrix instead of the fraction of kmers from the sample present in the
              unitig, please use the option -r and a value x, 0 < x <= 1 as minimum threshold to count a sample
              as present (1).

USAGE:
   muset_pa [options] INPUT_FILE

OPTIONS:
   -k INT     k-mer size (default: 31)
   -a INT     min abundance to keep a k-mer (default: 2)
   -l INT     minimum size of the unitigs to be retained in the final matrix (default: 2k-1)
   -r FLOAT   minimum kmer presence ratio in the unitig for 1/0 
   -o PATH    output directory (default: output)
   -m INT     minimizer length  (default: 15)
   -t INT     number of cores (default: 4)
   -s         write the unitig sequence in the first column of the output matrix instead of the identifier
   -h         show this help message and exit
   -V         show version number and exit

POSITIONAL ARGUMENTS:
    INPUT_FILE   Input file (fof) containing paths of input samples (one per line).

Input file

The input is a file containing a list of paths (one per line), as required by the -l parameter of GGCAT. Make sure to either specify absolute paths or paths relative to the directory from which you intend to run muset_pa.
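Such a list can be generated rather than written by hand. The following is a minimal sketch, not part of MUSET: the helper name write_fof and the set of file suffixes are assumptions made for illustration, so adjust them to your data. It writes absolute paths, so the resulting file works regardless of the directory from which muset_pa is run.

```python
from pathlib import Path

# Hypothetical suffix list (an assumption, not taken from MUSET): adapt as needed.
DEFAULT_SUFFIXES = (".fa", ".fasta", ".fq", ".fastq")


def write_fof(sample_dir, fof_path, suffixes=DEFAULT_SUFFIXES):
    """Write one absolute sample path per line; return the paths written."""
    sample_dir = Path(sample_dir).resolve()  # make every path absolute
    paths = sorted(
        str(p)
        for p in sample_dir.iterdir()
        if p.is_file() and p.name.lower().endswith(suffixes)
    )
    # One path per line, as expected for the input file of files.
    Path(fof_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

For example, write_fof("samples", "fof.txt") lists every matching file in the samples directory, and fof.txt can then be passed as INPUT_FILE.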

A simple test example can be run from the test directory:

cd test
muset_pa -o output_pa fof_pa.txt

Output file

The pipeline produces several intermediate output files, including the JSONL dictionary of the colors for each unitig that GGCAT normally produces. The pipeline automatically converts this dictionary into a unitig matrix in CSV format. With option -r the matrix is binary (0/1); otherwise each entry reports the fraction of the unitig's k-mers present in the corresponding sample. Samples are named after the input files. Here is an example:

Unitig ID   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
0           0          1          0.23       0.3        1
1           1          1          0          0.8        0.4
2           0.47       0.2        1          1          0
3           0.8        1          0.78       1          0.81
4           0.79       1          1          0.87       0.89

With -r 0.8, the same matrix becomes:

Unitig ID   Sample 1   Sample 2   Sample 3   Sample 4   Sample 5
0           0          1          0          0          1
1           1          1          0          1          0
2           0          0          1          1          0
3           1          1          0          1          1
4           0          1          1          1          1
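
The relationship between the two example matrices can be reproduced with a short Python sketch. This is an illustration only, not the muset_pa implementation: the function name binarize is hypothetical, and the inclusive comparison (a value equal to the threshold counts as present) is an assumption inferred from the example, where 0.8 maps to 1 under -r 0.8.

```python
def binarize(fractions, min_ratio):
    """Map each fraction to 1 if it reaches min_ratio, else 0."""
    if not 0 < min_ratio <= 1:
        raise ValueError("min_ratio must satisfy 0 < x <= 1")
    # Assumption: the threshold is inclusive (value >= min_ratio means present).
    return [[1 if value >= min_ratio else 0 for value in row] for row in fractions]


# Fraction matrix from the first example (rows are unitigs 0 to 4).
fractions = [
    [0, 1, 0.23, 0.3, 1],
    [1, 1, 0, 0.8, 0.4],
    [0.47, 0.2, 1, 1, 0],
    [0.8, 1, 0.78, 1, 0.81],
    [0.79, 1, 1, 0.87, 0.89],
]

binary = binarize(fractions, 0.8)
for unitig_id, row in enumerate(binary):
    print(unitig_id, *row)
```

Running the sketch prints the rows of the binary matrix shown for -r 0.8.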