Output Files
Once you've run Geneshot, you will want to look at the results. Because Geneshot runs many different types of analyses, you may have to consult this documentation in order to find the element of the results which is most important for your application.
Many of the individual aspects of the results can be found in two formats:
- A compressed text file in the output directory specified by `--output`
- A table in the `results.hdf5` file written to the output directory
NOTE: The file format HDF5 is used to collect all of the results because of its flexible support for combining multiple tabular datasets into a single data object. We have found that the most convenient way to read and write this data format is with the Python/Pandas library.
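As a minimal sketch of reading one of these tables with Pandas: the keys below are the ones listed on this page, while the `read_table` helper and the short names used as dictionary keys are illustrative assumptions, not part of Geneshot. Note that `pandas.read_hdf` relies on the optional PyTables package.

```python
# Keys of the tables in results.hdf5, as listed in this documentation.
# The short names on the left are made up for this example.
HDF5_KEYS = {
    "allele_assembly": "/abund/allele/assembly",
    "allele_gene": "/annot/allele/gene",
    "gene_cag": "/annot/gene/cag",
    "gene_abund_wide": "/abund/gene/wide",
    "cag_abund_wide": "/abund/cag/wide",
    "gene_abund_long": "/abund/gene/long",
    "gene_tax": "/annot/gene/tax",
    "gene_eggnog": "/annot/gene/eggnog",
    "cag_corncob": "/stats/cag/corncob",
}


def read_table(hdf5_path, name):
    """Read one named table from results.hdf5 into a DataFrame (hypothetical helper)."""
    # Imported lazily so the key lookup above works without pandas installed.
    import pandas as pd

    return pd.read_hdf(hdf5_path, key=HDF5_KEYS[name])
```

For example, `read_table("results.hdf5", "cag_abund_wide")` would load the CAG abundance table, assuming the file is in the current directory.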
If the `--savereads` flag is used, all of the preprocessed WGS FASTQ files will be written to the `qc/` folder.
All of the contigs, scaffolds, gene annotations, and logs produced during de novo assembly for each specimen will be found in the `assembly/<specimen>/` folder.
A CSV with information on every protein-coding sequence assembled from every sample (operationally termed 'alleles'), including the contig, position, sequencing depth, and GC content, will be written to `assembly/allele.assembly.metrics.csv.gz`. This table will also be available in `results.hdf5` under `/abund/allele/assembly`.
Example:
contig_depth | contig_length | contig_num | details | gc | gene_name | gene_num | specimen | start | stop | strand |
---|---|---|---|---|---|---|---|---|---|---|
22.380753 | 4357.0 | 1 | ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.287 | 0.287 | 1_length_4357_cov_22.380753_1 | 1 | Mock__2 | 1 | 1338 | 1 |
22.380753 | 4357.0 | 1 | ID=1_2;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.302 | 0.302 | 1_length_4357_cov_22.380753_2 | 2 | Mock__2 | 1524 | 2765 | 1 |
22.380753 | 4357.0 | 1 | ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.265 | 0.265 | 1_length_4357_cov_22.380753_3 | 3 | Mock__2 | 2951 | 3052 | 1 |
22.380753 | 4357.0 | 1 | ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.322 | 0.322 | 1_length_4357_cov_22.380753_4 | 4 | Mock__2 | 3292 | 4236 | 1 |
16.069435000000002 | 4318.0 | 2 | ID=2_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.361 | 0.361 | 2_length_4318_cov_16.069435_1 | 1 | Mock__2 | 212 | 856 | 1 |
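The `details` column in the example above appears to hold semicolon-separated `key=value` pairs. A minimal sketch for unpacking it, assuming that layout holds for every row (the `parse_details` helper is hypothetical):

```python
def parse_details(details):
    """Split a 'key=value;key=value' string into a dict of strings."""
    fields = {}
    for item in details.strip().strip(";").split(";"):
        if not item:
            continue  # skip empty fragments, e.g. from an empty string
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


# The details string from the first row of the example table.
row_details = "ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.287"
parsed = parse_details(row_details)
print(parsed["start_type"], float(parsed["gc_cont"]))
```

All values are kept as strings, so numeric fields such as `gc_cont` need an explicit conversion.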
All deduplicated gene sequences are written to `ref/genes.fasta.gz` in GZIP compressed FASTA format.
A table in TSV format describing which alleles were grouped together to form deduplicated genes, using the indicated percent identity threshold, will be written to `ref/genes.alleles.tsv.gz`. It will also be available in `results.hdf5` under `/annot/allele/gene`.
The DIAMOND reference database for all of the deduplicated gene sequences will be written to `ref/genes.dmnd`.
A two-column CSV describing which genes were grouped into which CAGs will be written to `ref/CAGs.csv.gz`. Columns are `CAG` and `gene`, with one row per gene. It will also be available in `results.hdf5` under `/annot/gene/cag`.
Example:
CAG | gene |
---|---|
0 | Mock__1_NODE_7510_length_263_cov_1.826923_1 |
0 | Mock__3_NODE_6989_length_270_cov_1.767442_1 |
0 | Mock__1_NODE_8655_length_250_cov_2.646154_1 |
0 | Mock__2_NODE_6791_length_271_cov_1.759259_1 |
0 | Mock__3_NODE_211_length_709_cov_5.498471_1 |
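A table of this shape can be read with the Python standard library alone. The sketch below assumes a gzip-compressed CSV with a header row naming the `CAG` and `gene` columns. The `read_cag_membership` helper is hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def read_cag_membership(path):
    """Return {CAG: [gene, ...]} from a gzip-compressed CSV with CAG and gene columns."""
    members = defaultdict(list)
    # Text mode with newline="" is the recommended way to feed a file to csv.
    with gzip.open(path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            members[row["CAG"]].append(row["gene"])
    return dict(members)
```

CAG identifiers are kept as strings here. Convert them with `int()` if numeric sorting is needed.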
Abundance of every gene in every sample in feather format will be written to `abund/gene.abund.feather`. This table will be available in `results.hdf5` under `/abund/gene/wide`. Abundances are calculated as the depth of sequencing for each individual gene, divided by the sum of all depths of sequencing for every gene detected in a sample.
Example:
index | Mock__1 | Mock__2 | Mock__4 | Mock__3 |
---|---|---|---|---|
Mock__4_NODE_7717_length_262_cov_2.753623_1 | 0E+002 | 2E-052 | 6E-052 | 5E-052 |
Mock__2_NODE_3808_length_333_cov_2.330935_1 | 4E-052 | 3E-052 | 0E+002 | 4E-052 |
Mock__2_NODE_12408_length_211_cov_2.435897_1 | 6E-052 | 5E-052 | 0E+002 | 0E+002 |
Mock__1_NODE_7309_length_265_cov_1.809524_1 | 4E-052 | 3E-052 | 0E+002 | 3E-052 |
Mock__1_NODE_11183_length_224_cov_2.071006_1 | 4E-052 | 2E-052 | 3E-052 | 2E-052 |
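The abundance calculation described above (each gene's depth divided by the summed depth of all genes in the sample) can be illustrated with a small sketch. The depths are made-up numbers and `relative_abundance` is a hypothetical helper, not Geneshot's own code.

```python
def relative_abundance(depths):
    """Divide each gene's depth by the total depth of all genes detected in one sample."""
    total = sum(depths.values())
    if total == 0:
        # No reads detected: report zero abundance rather than dividing by zero.
        return {gene: 0.0 for gene in depths}
    return {gene: depth / total for gene, depth in depths.items()}


# Made-up depths for three genes in a single sample.
sample_depths = {"gene_a": 3.0, "gene_b": 1.0, "gene_c": 0.0}
abund = relative_abundance(sample_depths)
print(abund)  # {'gene_a': 0.75, 'gene_b': 0.25, 'gene_c': 0.0}
```

Computed this way, the abundances within a sample sum to 1 whenever at least one gene is detected.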
Abundance of every CAG in every sample in feather format will be written to `abund/CAG.abund.feather`. This table will be available in `results.hdf5` under `/abund/cag/wide`. Abundances are calculated as the sum of the gene-level abundances (above) over all of the genes contained in a given CAG.
Example:
CAG | Mock__1 | Mock__2 | Mock__4 | Mock__3 |
---|---|---|---|---|
0.0 | 0.08 | 0.12 | 0.12 | 0.11 |
1.0 | 0.08 | 0.07 | 0.02 | 0.1 |
2.0 | 0.09 | 0.01 | 0.0 | 0.0 |
3.0 | 0.0 | 0.08 | 0.02 | 0.05 |
4.0 | 0.01 | 0.04 | 0.08 | 0.01 |
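The CAG-level sum described above can be sketched for a single sample as follows. The gene names, abundances, and the `cag_abundance` helper are illustrative assumptions.

```python
from collections import defaultdict


def cag_abundance(gene_abund, gene_to_cag):
    """Sum gene-level abundances over the genes assigned to each CAG (one sample)."""
    totals = defaultdict(float)
    for gene, abund in gene_abund.items():
        cag = gene_to_cag.get(gene)
        if cag is None:
            continue  # genes without a CAG assignment are skipped in this sketch
        totals[cag] += abund
    return dict(totals)


# Made-up gene abundances and CAG assignments.
gene_abund = {"gene_a": 0.5, "gene_b": 0.25, "gene_c": 0.25}
gene_to_cag = {"gene_a": 0, "gene_b": 0, "gene_c": 1}
cag_totals = cag_abundance(gene_abund, gene_to_cag)
print(cag_totals)  # {0: 0.75, 1: 0.25}
```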
A more complete description of the detection of every gene in every sample can be found in the JSON files output by FAMLI, which will be written to `/abund/details` with a single file per specimen. In addition, this data will be aggregated into a single table in `results.hdf5` under `/abund/gene/long`. Here is an example of the information found in that table:
coverage | depth | id | length | nreads | std | specimen |
---|---|---|---|---|---|---|
1.0 | 3.42 | Mock__2_NODE_12649_length_209_cov_2.454545_1 | 69 | 6 | 0.97 | Mock__1 |
1.0 | 2.48 | Mock__1_NODE_9620_length_240_cov_2.054054_1 | 79 | 4 | 0.85 | Mock__1 |
0.66 | 2.28 | Mock__4_NODE_3651_length_340_cov_1.985965_1 | 86 | 4 | 1.78 | Mock__1 |
0.64 | 1.28 | Mock__3_NODE_10136_length_235_cov_3.166667_1 | 78 | 2 | 0.96 | Mock__1 |
1.0 | 5.69 | Mock__1_NODE_6010_length_280_cov_3.377778_2 | 32 | 6 | 0.73 | Mock__1 |
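Long-format rows like these can be pivoted into a gene-by-specimen layout when a wide view of one column is more convenient. The sketch below uses plain dictionaries, shortened made-up gene IDs, and a hypothetical `long_to_wide` helper. It is not how Geneshot builds its own wide tables.

```python
def long_to_wide(records, value="depth"):
    """Pivot long-format rows into {gene id: {specimen: value}}."""
    wide = {}
    for rec in records:
        wide.setdefault(rec["id"], {})[rec["specimen"]] = rec[value]
    return wide


# Illustrative rows using the column names from the table above.
records = [
    {"id": "gene_a", "specimen": "Mock__1", "depth": 3.42, "coverage": 1.0},
    {"id": "gene_b", "specimen": "Mock__1", "depth": 2.48, "coverage": 1.0},
    {"id": "gene_a", "specimen": "Mock__2", "depth": 1.28, "coverage": 0.64},
]
wide = long_to_wide(records)
```

A gene that was not detected in a specimen has no entry for that specimen, so callers should use `.get(specimen, 0.0)` or similar when a dense matrix is needed.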
The taxonomic assignment for each gene (DIAMOND output in taxonomic assignment mode) in TSV format will be written to `annot/genes.tax.aln.gz`. It will also be available in `results.hdf5` under `/annot/gene/tax`.
The functional assignment for each gene (eggNOG output) in TSV format will be written to `annot/genes.emapper.annotations.gz`. It will also be available in `results.hdf5` under `/annot/gene/eggnog`.
A CSV with the Corncob results for this dataset will be written to `stats/corncob.results.csv`. It will also be available in `results.hdf5` under `/stats/cag/corncob`.