Output Files

Sam Minot edited this page Jan 24, 2020 · 37 revisions

Once you've run Geneshot, you will want to look at the results. Because Geneshot runs many different types of analyses, use this page to find the part of the results that matters most for your application.

Many of the individual aspects of the results can be found in two formats:

  • A compressed text file in the output directory specified by --output
  • A table in the results.hdf5 file written to the output directory

NOTE: The file format HDF5 is used to collect all of the results because of its flexible support for combining multiple tabular datasets into a single data object. We have found that the most convenient way to read and write this data format is with the Python/Pandas library.
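As a minimal sketch of reading either format with Pandas: the helper names below are illustrative and not part of Geneshot, the `output/` paths in the comments are assumptions about where --output pointed, and `pandas.read_hdf` additionally needs the PyTables package installed.

```python
import pandas as pd


def read_text_table(path, sep=","):
    """Read a delimited text table, gzip-compressed or not.

    Pandas infers gzip compression from the ".gz" suffix.
    Pass a tab character as sep for the TSV outputs.
    """
    return pd.read_csv(path, sep=sep)


def read_hdf_table(hdf_path, key):
    """Read one table from results.hdf5, e.g. key="/annot/gene/cag"."""
    return pd.read_hdf(hdf_path, key=key)


# Example usage (hypothetical output directory):
# cags = read_text_table("output/ref/CAGs.csv.gz")
# cags = read_hdf_table("output/results.hdf5", "/annot/gene/cag")
```

Both calls return a DataFrame, so downstream code does not need to care which of the two formats was read.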

Preprocessing

If the --savereads flag is used, all of the preprocessed WGS FASTQ files will be written to the qc/ folder.

De novo Assembly

All of the contigs, scaffolds, gene annotations, and logs produced during de novo assembly for each specimen will be found in the assembly/<specimen>/ folder.

A CSV with information on every protein-coding sequence assembled from every sample (operationally termed 'alleles'), including the contig, position, sequencing depth, and GC content, will be written to assembly/allele.assembly.metrics.csv.gz. This table will also be available in results.hdf5 under /abund/allele/assembly.

Example:

contig_depth contig_length contig_num details gc gene_name gene_num specimen start stop strand
22.380753 4357.0 1 ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.287 0.287 1_length_4357_cov_22.380753_1 1 Mock__2 1 1338 1
22.380753 4357.0 1 ID=1_2;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.302 0.302 1_length_4357_cov_22.380753_2 2 Mock__2 1524 2765 1
22.380753 4357.0 1 ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.265 0.265 1_length_4357_cov_22.380753_3 3 Mock__2 2951 3052 1
22.380753 4357.0 1 ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.322 0.322 1_length_4357_cov_22.380753_4 4 Mock__2 3292 4236 1
16.069435000000002 4318.0 2 ID=2_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.361 0.361 2_length_4318_cov_16.069435_1 1 Mock__2 212 856 1
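The details column in the rows above packs several fields into one string of semicolon-separated key=value pairs. A small sketch of unpacking it, assuming every entry follows that pattern (the function name is hypothetical):

```python
def parse_details(details):
    """Split a 'key=value;key=value' string into a dict of strings."""
    fields = {}
    for item in details.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


# One of the details strings from the example table
example = "ID=1_2;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.302"
parsed = parse_details(example)
print(parsed["start_type"])      # ATG
print(float(parsed["gc_cont"]))  # 0.302
```

All values are kept as strings; convert individual fields (such as gc_cont) as needed.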

Gene Catalog

All deduplicated gene sequences are written to ref/genes.fasta.gz in GZIP compressed FASTA format.

A table in TSV format describing which alleles were grouped together to form deduplicated genes, using the indicated percent identity threshold, will be written to ref/genes.alleles.tsv.gz. It will also be available in results.hdf5 under /annot/allele/gene.

The DIAMOND reference database for all of the deduplicated gene sequences will be written to ref/genes.dmnd.

Co-Abundant Gene Groups (CAGs)

A two-column CSV describing which genes were grouped into which CAGs will be written to ref/CAGs.csv.gz. Columns are CAG and gene, with one row per gene. It will also be available in results.hdf5 under /annot/gene/cag.

Example:

CAG gene
0 Mock__1_NODE_7510_length_263_cov_1.826923_1
0 Mock__3_NODE_6989_length_270_cov_1.767442_1
0 Mock__1_NODE_8655_length_250_cov_2.646154_1
0 Mock__2_NODE_6791_length_271_cov_1.759259_1
0 Mock__3_NODE_211_length_709_cov_5.498471_1
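To list the genes belonging to each CAG, the file can be read with the Python standard library alone. This sketch assumes the file is comma-delimited with a header row naming the CAG and gene columns; the function name is illustrative.

```python
import csv
import gzip
from collections import defaultdict


def genes_by_cag(path):
    """Return {CAG: [gene, ...]} from a gzipped CSV with CAG and gene columns."""
    groups = defaultdict(list)
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            groups[row["CAG"]].append(row["gene"])
    return dict(groups)


# Example usage (hypothetical output directory):
# groups = genes_by_cag("output/ref/CAGs.csv.gz")
# print(len(groups["0"]))  # number of genes in CAG 0
```

Note that the CAG identifiers are returned as strings, because the csv module does not convert types.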

Gene Abundance

The abundance of every gene in every sample will be written in feather format to abund/gene.abund.feather. This table will also be available in results.hdf5 under /abund/gene/wide. Abundances are calculated as the depth of sequencing for each individual gene, divided by the sum of all depths of sequencing for every gene detected in a sample.

Example:

index Mock__1 Mock__2 Mock__4 Mock__3
Mock__4_NODE_7717_length_262_cov_2.753623_1 0E+002 2E-052 6E-052 5E-052
Mock__2_NODE_3808_length_333_cov_2.330935_1 4E-052 3E-052 0E+002 4E-052
Mock__2_NODE_12408_length_211_cov_2.435897_1 6E-052 5E-052 0E+002 0E+002
Mock__1_NODE_7309_length_265_cov_1.809524_1 4E-052 3E-052 0E+002 3E-052
Mock__1_NODE_11183_length_224_cov_2.071006_1 4E-052 2E-052 3E-052 2E-052
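The abundance calculation described above (each gene's depth divided by the total depth in the sample) can be sketched as follows. The gene names and depths are made up for illustration, and this is not the code Geneshot itself runs.

```python
def relative_abundance(depths):
    """Divide each gene's depth by the total depth across all genes in one sample."""
    total = sum(depths.values())
    if total == 0:
        # No gene detected: report zero abundance rather than dividing by zero
        return {gene: 0.0 for gene in depths}
    return {gene: depth / total for gene, depth in depths.items()}


# Hypothetical depths for three genes detected in a single sample
sample_depths = {"gene_a": 3.0, "gene_b": 1.0, "gene_c": 4.0}
abund = relative_abundance(sample_depths)
print(abund["gene_a"])  # 0.375
```

By construction, the abundances within one sample sum to 1.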

The abundance of every CAG in every sample will be written in feather format to abund/CAG.abund.feather. This table will also be available in results.hdf5 under /abund/cag/wide. Abundances are calculated as the sum of the gene-level abundances (above) over all of the genes contained in a given CAG.

Example:

CAG Mock__1 Mock__2 Mock__4 Mock__3
0.0 0.08 0.12 0.12 0.11
1.0 0.08 0.07 0.02 0.1
2.0 0.09 0.01 0.0 0.0
3.0 0.0 0.08 0.02 0.05
4.0 0.01 0.04 0.08 0.01
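A sketch of the CAG-level sum for a single sample, using made-up gene abundances and CAG assignments. Skipping genes that have no CAG assignment is an assumption of this example, not documented Geneshot behavior.

```python
from collections import defaultdict


def cag_abundance(gene_abund, gene_to_cag):
    """Sum gene-level abundances over the genes assigned to each CAG."""
    totals = defaultdict(float)
    for gene, value in gene_abund.items():
        if gene in gene_to_cag:  # assumption: unassigned genes are ignored
            totals[gene_to_cag[gene]] += value
    return dict(totals)


# Hypothetical abundances in one sample and hypothetical CAG membership
gene_abund = {"gene_a": 0.375, "gene_b": 0.125, "gene_c": 0.5}
gene_to_cag = {"gene_a": 0, "gene_b": 0, "gene_c": 1}
print(cag_abundance(gene_abund, gene_to_cag))  # {0: 0.5, 1: 0.5}
```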

A more complete description of the detection of every gene in every sample can be found in the JSON files output by FAMLI, which will be written to abund/details/ with a single file per specimen. In addition, this data will be aggregated into a single table in results.hdf5 under /abund/gene/long. Here is an example of the information found in that table:

coverage depth id length nreads std specimen
1.0 3.42 Mock__2_NODE_12649_length_209_cov_2.454545_1 69 6 0.97 Mock__1
1.0 2.48 Mock__1_NODE_9620_length_240_cov_2.054054_1 79 4 0.85 Mock__1
0.66 2.28 Mock__4_NODE_3651_length_340_cov_1.985965_1 86 4 1.78 Mock__1
0.64 1.28 Mock__3_NODE_10136_length_235_cov_3.166667_1 78 2 0.96 Mock__1
1.0 5.69 Mock__1_NODE_6010_length_280_cov_3.377778_2 32 6 0.73 Mock__1
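Because this table holds one row per gene per specimen, it is sometimes handy to reshape it into a gene-by-specimen matrix. A possible sketch with Pandas, assuming the id, specimen, and depth columns shown above and treating missing gene/specimen pairs as zero depth (the function name and the toy rows are illustrative):

```python
import pandas as pd


def depth_matrix(long_df):
    """Pivot a long table (id, specimen, depth) into genes x specimens."""
    wide = long_df.pivot(index="id", columns="specimen", values="depth")
    # Assumption: a gene with no row for a specimen was not detected there
    return wide.fillna(0.0)


# Toy rows with the same column names as the long table
long_df = pd.DataFrame(
    {
        "id": ["g1", "g2", "g1"],
        "specimen": ["Mock__1", "Mock__1", "Mock__2"],
        "depth": [3.42, 2.48, 1.28],
    }
)
wide = depth_matrix(long_df)
```

With the real data, `long_df` would come from `pandas.read_hdf` on results.hdf5 with the key /abund/gene/long.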

Annotations

The taxonomic assignment for each gene (DIAMOND output in taxonomic assignment mode) in TSV format will be written to annot/genes.tax.aln.gz. It will also be available in results.hdf5 under /annot/gene/tax.

The functional assignment for each gene (eggNOG output) in TSV format will be written to annot/genes.emapper.annotations.gz. It will also be available in results.hdf5 under /annot/gene/eggnog.

Statistical Analysis

A CSV with the Corncob results for this dataset will be written to stats/corncob.results.csv. It will also be available in results.hdf5 under /stats/cag/corncob.