
Output Files

Sam Minot edited this page Jan 31, 2020 · 37 revisions

Once you've run Geneshot, you will want to look at the results. Because Geneshot runs many different types of analysis, use this page to find the part of the results that matters most for your application.

Many of the individual aspects of the results can be found in two formats:

  • A compressed text file in the output directory specified by --output
  • A table in the ${output_prefix}.full.hdf5 output file, whose name is set by --output_prefix

NOTE: The file format HDF5 is used to collect all of the results because of its flexible support for combining multiple tabular datasets into a single data object. We have found that the most convenient way to read and write this data format is with the Python/Pandas library.
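As a minimal sketch of reading one of these tables with Pandas: the helper names below are illustrative and not part of Geneshot, the path to the HDF5 file is something you supply, and `pandas.read_hdf` needs the PyTables package to be installed.

```python
def hdf_filenames(output_prefix):
    """Return the names of the full and summary HDF5 files for a prefix."""
    return {
        "full": f"{output_prefix}.full.hdf5",
        "summary": f"{output_prefix}.summary.hdf5",
    }


def read_table(hdf_path, key):
    """Read one table (for example "/abund/cag/wide") into a DataFrame."""
    # Imported here so the filename helper works without pandas installed.
    import pandas as pd

    return pd.read_hdf(hdf_path, key)
```

For example, `read_table(hdf_filenames("my_run")["full"], "/abund/cag/wide")` would load the CAG abundance table, assuming a hypothetical prefix of `my_run` and that the file is in the current directory.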

Preprocessing

If the --savereads flag is used, all of the preprocessed WGS FASTQ files will be written to the qc/ folder.

De novo Assembly

All of the contigs, scaffolds, gene annotations, and logs produced during de novo assembly for each specimen will be found in the assembly/<specimen>/ folder.

A CSV with information on every protein-coding sequence assembled from every sample (operationally termed 'alleles'), including the contig, position, sequencing depth, and GC content, will be written to assembly/allele.assembly.metrics.csv.gz. This table will also be available in results.hdf5 under /abund/allele/assembly.

Example:

contig_depth contig_length contig_num details gc gene_name gene_num specimen start stop strand
22.380753 4357.0 1 ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.287 0.287 1_length_4357_cov_22.380753_1 1 Mock__2 1 1338 1
22.380753 4357.0 1 ID=1_2;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.302 0.302 1_length_4357_cov_22.380753_2 2 Mock__2 1524 2765 1
22.380753 4357.0 1 ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.265 0.265 1_length_4357_cov_22.380753_3 3 Mock__2 2951 3052 1
22.380753 4357.0 1 ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.322 0.322 1_length_4357_cov_22.380753_4 4 Mock__2 3292 4236 1
16.069435000000002 4318.0 2 ID=2_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.361 0.361 2_length_4318_cov_16.069435_1 1 Mock__2 212 856 1
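The details column in the example packs several attributes into one string. A minimal sketch for unpacking it, assuming (from the example rows only) that the value is always a semicolon-separated list of key=value pairs; the function name is illustrative:

```python
def parse_details(details):
    """Split a `details` string into a dict of key/value strings."""
    fields = {}
    for item in details.split(";"):
        if not item:
            continue  # tolerate an empty item, e.g. from a trailing semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


# The `details` value from the first example row.
example = "ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.287"
parsed = parse_details(example)
print(parsed["start_type"], float(parsed["gc_cont"]))
```

All values are kept as strings, so numeric fields such as gc_cont need an explicit conversion.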

Gene Catalog

All deduplicated gene sequences are written to ref/genes.fasta.gz in GZIP compressed FASTA format.

A table in TSV format describing which alleles were grouped together to form deduplicated genes, using the indicated percent identity threshold, will be written to ref/genes.alleles.tsv.gz. It will also be available in results.hdf5 under /annot/allele/gene.

Example:

gene allele
Mock__4_NODE_1_length_4168_cov_15.434476_1 Mock__4_NODE_1_length_4168_cov_15.434476_1
Mock__4_NODE_1_length_4168_cov_15.434476_1 Mock__2_NODE_2_length_4318_cov_16.069435_1
Mock__4_NODE_1_length_4168_cov_15.434476_1 Mock__3_NODE_2_length_4301_cov_15.401319_1
Mock__4_NODE_1_length_4168_cov_15.434476_1 Mock__1_NODE_2_length_4150_cov_16.511355_1
Mock__4_NODE_1_length_4168_cov_15.434476_2 Mock__4_NODE_1_length_4168_cov_15.434476_2
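To see which alleles were collapsed into each gene, the table can be grouped by its gene column. The sketch below uses only the standard library and assumes the TSV file starts with a header row naming the gene and allele columns, as in the example; the function names are illustrative.

```python
import csv
import gzip
from collections import defaultdict


def alleles_by_gene(rows):
    """Map each deduplicated gene to the list of alleles grouped into it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["gene"]].append(row["allele"])
    return dict(groups)


def read_gene_alleles(path):
    """Read a gzip-compressed TSV with `gene` and `allele` header columns."""
    with gzip.open(path, "rt", newline="") as handle:
        return alleles_by_gene(csv.DictReader(handle, delimiter="\t"))
```

Calling `read_gene_alleles("ref/genes.alleles.tsv.gz")` from the output directory would return a dictionary keyed by gene name.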

The DIAMOND reference database for all of the deduplicated gene sequences will be written to ref/genes.dmnd.

Co-Abundant Gene Groups (CAGs)

A two-column CSV describing which genes were grouped into which CAGs will be written to ref/CAGs.csv.gz. Columns are CAG and gene, with one row per gene. It will also be available in results.hdf5 under /annot/gene/cag.

Example:

CAG gene
0 Mock__1_NODE_7510_length_263_cov_1.826923_1
0 Mock__3_NODE_6989_length_270_cov_1.767442_1
0 Mock__1_NODE_8655_length_250_cov_2.646154_1
0 Mock__2_NODE_6791_length_271_cov_1.759259_1
0 Mock__3_NODE_211_length_709_cov_5.498471_1

Gene Abundance

Abundance of every gene in every sample in feather format will be written to abund/gene.abund.feather. This table will be available in results.hdf5 under /abund/gene/wide. Abundances are calculated as the depth of sequencing for each individual gene, divided by the sum of all depths of sequencing for every gene detected in a sample.

Example:

index Mock__1 Mock__2 Mock__4 Mock__3
Mock__4_NODE_7717_length_262_cov_2.753623_1 0E+00 2E-05 6E-05 5E-05
Mock__2_NODE_3808_length_333_cov_2.330935_1 4E-05 3E-05 0E+00 4E-05
Mock__2_NODE_12408_length_211_cov_2.435897_1 6E-05 5E-05 0E+00 0E+00
Mock__1_NODE_7309_length_265_cov_1.809524_1 4E-05 3E-05 0E+00 3E-05
Mock__1_NODE_11183_length_224_cov_2.071006_1 4E-05 2E-05 3E-05 2E-05
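The gene-level abundance calculation described above (depth for one gene divided by the summed depth of all genes detected in the sample) can be sketched as follows. The depths are made up for illustration and the function name is not part of Geneshot.

```python
def relative_abundance(depth_by_gene):
    """Divide each gene's depth by the total depth across all genes."""
    total = sum(depth_by_gene.values())
    if total == 0:
        # No reads aligned to any gene: report zero everywhere.
        return {gene: 0.0 for gene in depth_by_gene}
    return {gene: depth / total for gene, depth in depth_by_gene.items()}


# Made-up depths for three genes in one sample (not Geneshot output).
depths = {"gene_a": 6.0, "gene_b": 3.0, "gene_c": 1.0}
abund = relative_abundance(depths)
print(abund)
```

By construction, the abundances within a sample sum to 1 whenever at least one gene has nonzero depth.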

Abundance of every CAG in every sample in feather format will be written to abund/CAG.abund.feather. This table will be available in results.hdf5 under /abund/cag/wide. Abundances are calculated as the sum of the gene-level abundances (above) over all of the genes contained in a given CAG.

Example:

CAG Mock__1 Mock__2 Mock__4 Mock__3
0.0 0.08 0.12 0.12 0.11
1.0 0.08 0.07 0.02 0.1
2.0 0.09 0.01 0.0 0.0
3.0 0.0 0.08 0.02 0.05
4.0 0.01 0.04 0.08 0.01
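The CAG-level calculation described above is a sum of gene-level abundances over the genes in each CAG. A minimal sketch for a single sample, using made-up values and illustrative names:

```python
from collections import defaultdict


def cag_abundance(gene_abund, cag_of_gene):
    """Sum gene-level abundances over the genes assigned to each CAG."""
    totals = defaultdict(float)
    for gene, abund in gene_abund.items():
        cag = cag_of_gene.get(gene)
        if cag is None:
            continue  # gene is missing from the gene-to-CAG mapping
        totals[cag] += abund
    return dict(totals)


# Made-up values for one sample (not Geneshot output).
gene_abund = {"gene_a": 0.5, "gene_b": 0.25, "gene_c": 0.25}
cag_of_gene = {"gene_a": 0, "gene_b": 0, "gene_c": 1}
cag_abund = cag_abundance(gene_abund, cag_of_gene)
print(cag_abund)
```

Here the gene-to-CAG mapping plays the role of the CAG and gene columns in ref/CAGs.csv.gz.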

A more complete description of the detection of every gene in every sample can be found in the JSON files output by FAMLI, which will be written to /abund/details with a single file per specimen. In addition, this data will be aggregated into a single table in results.hdf5 under /abund/gene/long. Here is an example of the information found in that table:

coverage depth id length nreads std specimen
1.0 3.42 Mock__2_NODE_12649_length_209_cov_2.454545_1 69 6 0.97 Mock__1
1.0 2.48 Mock__1_NODE_9620_length_240_cov_2.054054_1 79 4 0.85 Mock__1
0.66 2.28 Mock__4_NODE_3651_length_340_cov_1.985965_1 86 4 1.78 Mock__1
0.64 1.28 Mock__3_NODE_10136_length_235_cov_3.166667_1 78 2 0.96 Mock__1
1.0 5.69 Mock__1_NODE_6010_length_280_cov_3.377778_2 32 6 0.73 Mock__1

Annotations

The taxonomic assignment for each gene (DIAMOND output in taxonomic assignment mode) in TSV format will be written to annot/genes.tax.aln.gz. It will also be available in results.hdf5 under /annot/gene/tax.

Example:

gene tax_id evalue
Mock__13_NODE_19_length_684_cov_3.917329_1 1224 8.8e-21
Mock__13_NODE_80_length_217_cov_2.308642_1 562 1.3e-29
Mock__13_NODE_82_length_215_cov_4.750000_1 2608889 1.2e-10
Mock__13_NODE_16_length_757_cov_4.584046_2 543 3.2e-51
Mock__13_NODE_29_length_514_cov_2.932462_1 1236 4.8e-17

The functional assignment for each gene (eggNOG output) in TSV format will be written to annot/genes.emapper.annotations.gz. It will also be available in results.hdf5 under /annot/gene/eggnog.

Example:

query_name seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score best_tax_level taxonomic scope eggNOG OGs COG Functional cat. eggNOG free text desc.
Mock__12_NODE_1_length_5360_cov_15.598303_1 1453503.AU05_02695 6.3e-21 105.9 Proteobacteria Bacteria 1P1ED@1224,2FFHF@1,347EX@2 S Bacteriophage protein K
Mock__12_NODE_1_length_5360_cov_15.598303_2 1116472.MGMO_205c00050 1.3999999999999996e-78 298.9 Gammaproteobacteria Bacteria 1NZPF@1224,1SRZ6@1236,28HVF@1,2Z81Q@2 S Bacteriophage scaffolding protein D
Mock__12_NODE_1_length_5360_cov_15.598303_3 1116472.MGMO_205c00040 3.3e-13 79.7 Gammaproteobacteria Bacteria 1NKF2@1224,1ST7F@1236,2EFW9@1,339NJ@2 S Microvirus J protein
Mock__12_NODE_1_length_5360_cov_15.598303_4 1116472.MGMO_205c00030 5.399999999999998e-258 896.3 Gammaproteobacteria Bacteria 1NRFQ@1224,1SKG7@1236,28IN7@1,2Z8NM@2 S Capsid protein (F protein)
Mock__12_NODE_1_length_5360_cov_15.598303_5 1118153.MOY_16472 1.1e-95 355.9 Gammaproteobacteria Bacteria 1R4RW@1224,1SM1Q@1236,28HGS@1,2Z7SJ@2 S Major spike protein (G protein)
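The eggNOG OGs column holds several entries in one comma-separated string. The sketch below splits it up, assuming each entry has the form group@taxid, where the number after the @ is an NCBI taxonomy ID (for instance, 1224 lines up with Proteobacteria in the rows above). That reading is inferred from the example, not stated on this page.

```python
def parse_eggnog_ogs(ogs):
    """Split an `eggNOG OGs` value into (orthologous group, tax ID) pairs."""
    pairs = []
    for entry in ogs.split(","):
        og, _, tax_id = entry.strip().partition("@")
        if not og:
            continue  # skip empty entries
        # Keep None when an entry has no "@taxid" suffix.
        pairs.append((og, int(tax_id) if tax_id else None))
    return pairs


# The `eggNOG OGs` value from the first example row.
ogs = parse_eggnog_ogs("1P1ED@1224,2FFHF@1,347EX@2")
print(ogs)
```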

Statistical Analysis

A CSV with the Corncob results for this dataset will be written to stats/corncob.results.csv. It will also be available in results.hdf5 under /stats/cag/corncob.

Example:

parameter type value CAG
mu.(Intercept) estimate -1.5722784150839353 0
mu.label1 estimate -0.2687179385894458 0
mu.label2 estimate -0.17127772136300107 0
phi.(Intercept) estimate -6.573817086548077 0
mu.(Intercept) std_error 0.15771708448454666 0
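Because this table is in long format (one row per CAG, parameter, and type), it is often handier to gather the estimate and standard error for each parameter side by side. A minimal sketch using only the standard library; the function name is illustrative, and the rows are given as strings, the way `csv.DictReader` would yield them:

```python
from collections import defaultdict


def pivot_corncob(rows):
    """Reshape long-format rows into {(CAG, parameter): {type: value}}."""
    wide = defaultdict(dict)
    for row in rows:
        wide[(row["CAG"], row["parameter"])][row["type"]] = float(row["value"])
    return dict(wide)


# Two of the example rows shown above.
rows = [
    {"parameter": "mu.(Intercept)", "type": "estimate",
     "value": "-1.5722784150839353", "CAG": "0"},
    {"parameter": "mu.(Intercept)", "type": "std_error",
     "value": "0.15771708448454666", "CAG": "0"},
]
result = pivot_corncob(rows)
print(result[("0", "mu.(Intercept)")])
```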

Summary HDF

In addition to the full set of results found in the ${output_prefix}.full.hdf5 HDF store, we will also make a smaller summary file: ${output_prefix}.summary.hdf5. This will contain a subset of the data found in the full HDF store:

  • Manifest table provided by the user: /manifest
  • Abundance of every CAG in every sample: /abund/cag/wide
  • Summary of all annotations available for each gene: /annot/gene/all
  • Table with statistical analysis results: /stats/cag/corncob
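A quick way to confirm that a summary file contains the tables listed above is to compare its keys against that list. In this sketch the function name is illustrative, and the list of available keys is assumed to come from something like `pandas.HDFStore(path, mode="r").keys()`. Whether every table is present in every run is not stated here.

```python
EXPECTED_SUMMARY_KEYS = (
    "/manifest",
    "/abund/cag/wide",
    "/annot/gene/all",
    "/stats/cag/corncob",
)


def missing_summary_keys(available_keys):
    """Return the expected summary tables absent from `available_keys`."""
    available = set(available_keys)
    return [key for key in EXPECTED_SUMMARY_KEYS if key not in available]


# Example with a hypothetical store that holds only two of the tables.
print(missing_summary_keys(["/manifest", "/abund/cag/wide"]))
```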