muset_pa pipeline
MUSET also includes muset_pa, an auxiliary executable that generates a presence-absence unitig matrix in text format from a list of input samples using GGCAT and kmat_tools.
muset_pa v0.2
DESCRIPTION:
muset_pa - a pipeline for building a presence-absence unitig matrix from a list of FASTA/FASTQ files.
this pipeline has fewer parameters than muset and less filtering options as it does not build
nor use an intermediate k-mer abundance matrix.
If you wish a 0/1 binary matrix instead of the fraction of kmers from the sample present in the
unitig, please use the option -r and a value x, 0 < x <=1 as minimum treshold to count a sample
as present (1).
USAGE:
muset_pa [options] INPUT_FILE
OPTIONS:
-k INT k-mer size (default: 31)
-a INT min abundance to keep a k-mer (default: 2)
-l INT minimum size of the unitigs to be retained in the final matrix (default: 2k-1)
-r FLOAT minimum kmer presence ratio in the unitig for 1/0
-o PATH output directory (default: output)
-m INT minimizer length (default: 15)
-t INT number of cores (default: 4)
-s write the unitig sequence in the first column of the output matrix instead of the identifier
-h show this help message and exit
-V show version number and exit
POSITIONAL ARGUMENTS:
INPUT_FILE Input file (fof) containing paths of input samples (one per line).
The input is a file containing a list of paths (one per line), as required by the -l
parameter of GGCAT.
Make sure to specify either absolute paths or paths relative to the directory from which you intend to run muset_pa.
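A minimal sketch of one way to build such an input file with absolute paths, so that it works regardless of where muset_pa is launched. The helper name `write_fof` and the list of file suffixes are assumptions made for illustration and are not part of MUSET; adjust the suffixes to match your data.

```python
from pathlib import Path

# Hypothetical suffix list for this sketch; extend or trim it as needed.
SEQ_SUFFIXES = (".fa", ".fasta", ".fq", ".fastq")


def write_fof(sample_dir, fof_path, suffixes=SEQ_SUFFIXES):
    """Write one absolute sample path per line and return the paths written."""
    paths = sorted(
        p.resolve()
        for p in Path(sample_dir).iterdir()
        if p.is_file() and p.name.endswith(suffixes)
    )
    # One path per line, as expected for the INPUT_FILE argument.
    Path(fof_path).write_text("".join(f"{p}\n" for p in paths))
    return paths
```

Calling `write_fof("samples", "fof.txt")` would list every matching file under the (hypothetical) `samples` directory; the resulting `fof.txt` can then be passed as INPUT_FILE.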
A simple test example can be run from the test directory:
cd test
muset_pa -o output_pa fof_pa.txt
The pipeline produces several intermediate output files, including the JSONL dictionary of the colors of each unitig that GGCAT normally produces. The pipeline automatically converts this dictionary into a unitig matrix in CSV format, with one column per sample. With the -r option the matrix is binary (0/1); otherwise it reports the fraction of k-mers from each sample present in each unitig. Samples are named after the input files you used. Here is an example:
Unitig ID | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
---|---|---|---|---|---|
0 | 0 | 1 | 0.23 | 0.3 | 1 |
1 | 1 | 1 | 0 | 0.8 | 0.4 |
2 | 0.47 | 0.2 | 1 | 1 | 0 |
3 | 0.8 | 1 | 0.78 | 1 | 0.81 |
4 | 0.79 | 1 | 1 | 0.87 | 0.89 |
With -r 0.8, every value of at least 0.8 becomes 1 and every other value becomes 0, so the same matrix becomes:
Unitig ID | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 |
1 | 1 | 1 | 0 | 1 | 0 |
2 | 0 | 0 | 1 | 1 | 0 |
3 | 1 | 1 | 0 | 1 | 1 |
4 | 0 | 1 | 1 | 1 | 1 |
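The relationship between the two tables can be sketched in a few lines of Python. This is only an illustration of the thresholding rule, not muset_pa's own code: the function name `binarize` is hypothetical, and the use of an inclusive comparison (`>=`) is inferred from the example above, where a value of exactly 0.8 maps to 1.

```python
def binarize(rows, threshold):
    """Map each fraction to 1 if it reaches the threshold, else 0."""
    if not 0 < threshold <= 1:
        raise ValueError("threshold must satisfy 0 < x <= 1")
    # Inclusive comparison: assumed from the README example (0.8 -> 1).
    return [[1 if value >= threshold else 0 for value in row] for row in rows]


# Fractions from the first example table (one row per unitig).
fractions = [
    [0, 1, 0.23, 0.3, 1],
    [1, 1, 0, 0.8, 0.4],
    [0.47, 0.2, 1, 1, 0],
    [0.8, 1, 0.78, 1, 0.81],
    [0.79, 1, 1, 0.87, 0.89],
]

binary = binarize(fractions, 0.8)
for unitig_id, row in enumerate(binary):
    print(unitig_id, *row)
```

Keeping the fractional matrix and applying a threshold afterwards, as in this sketch, makes it possible to try several cutoffs without rerunning the pipeline.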