Skip to content

Output Files

Sam Minot edited this page Jul 19, 2020 · 37 revisions

Once you've run Geneshot, you will want to look at the results. Because Geneshot runs many different types of analyses, you may have to consult this documentation in order to find the element of the results which is most important for your application.

Many of the individual aspects of the results can be found in two formats:

  • A compressed text file in the output directory specified by --output
  • A table in either the ${output_prefix}.results.hdf5 or ${output_prefix}.details.hdf5 output files. These are both written to the same output directory using the file name prefix specified by --output_prefix

NOTE: The file format HDF5 is used to collect all of the results because of its flexible support for combining multiple tabular datasets into a single data object. We have found that the most convenient way to read and write this data format is with the Python/Pandas library.

Example:

import pandas as pd
manifest_df = pd.read_hdf("geneshot.results.hdf5", "/manifest")

Preprocessing

If the --savereads flag is used, all of the preprocessed WGS FASTQ files will be written to the qc/ folder.

Manifest

The table used as input for the entire analysis will be saved in the results.hdf5 under /manifest.

specimen R1 R2 I1 I2 batch label1 label2
Mock__11 data/Mock__11.R1.fastq.gz data/Mock__11.R2.fastq.gz data/Mock__11.I1.fastq.gz data/Mock__11.I2.fastq.gz mock 0 1
Mock__11 data/Mock__11.R1.fastq.gz data/Mock__11.R2.fastq.gz data/Mock__11.I1.fastq.gz data/Mock__11.I2.fastq.gz mock 0 1
Mock__12 data/Mock__12.R1.fastq.gz data/Mock__12.R2.fastq.gz data/Mock__12.I1.fastq.gz data/Mock__12.I2.fastq.gz mock 1 0
Mock__13 data/Mock__13.R1.fastq.gz data/Mock__13.R2.fastq.gz data/Mock__13.I1.fastq.gz data/Mock__13.I2.fastq.gz mock 1 1
Mock__14 data/Mock__14.R1.fastq.gz data/Mock__14.R2.fastq.gz data/Mock__14.I1.fastq.gz data/Mock__14.I2.fastq.gz mock 0 0

Specimen Summary

Each sample can be summarized overall with a number of metrics, including:

  • The number of paired-end reads input per sample (2X the number of read pairs, to be explicit)
  • The number of paired-end reads which successfully align against the gene catalog
  • The estimated total number of protein coding genes in a given sample (using the breakaway algorithm)
  • The number of genes detected by assembly
  • The number of genes detected by alignment

For example:

/summary/all

specimen n_reads aligned_reads n_genes_aligned error estimand estimate interval_lower interval_upper reasonable n_genes_assembled
Mock__11 8076 None 75 0.0473 richness 75.0022 75.0004 75.0114 False 63
Mock__12 876 None 7 2.2399e-24 richness 7.0 7.0 nan False 7
Mock__13 2492 None 90 0.4269 richness 90.1764 90.0338 90.9194 False 116
Mock__14 8252 None 81 0.0411 richness 81.0017 81.0003 81.0086 False 71
Mock__15 8952 None 82 0.0473 richness 82.0022 82.0004 82.0114 False 70

De novo Assembly

All of the contigs, scaffolds, gene annotations, and logs produced during de novo assembly for each specimen will be found in the assembly/<specimen>/ folder.

A CSV with information on every protein-coding sequence assembled from every sample (operationally termed 'alleles'), including the contig, position, sequencing depth, and GC content will be written to assembly/allele.assembly.metrics.csv.gz. This table will also be available in details.hdf5 under /abund/allele/assembly/<specimen> with one table per specimen.

Example:

contig flag gc_cont gene_name len multi partial rbs_motif rbs_spacer specimen start start_type stop strand catalog_gene
k59_0 1 0.386 Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_1 43660 20.7292 0 GGAG/GAGG 5-10bp Mock__15 510 ATG 716 -1 gene_f33b2fec_69aa
k59_0 1 0.326 Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_2 43660 20.7292 0 GGA/GAG/AGG 5-10bp Mock__15 1002 ATG 1412 1 gene_4f04b95b_137aa
k59_0 1 0.4320000000000001 Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_3 43660 20.7292 0 GGA/GAG/AGG 5-10bp Mock__15 1702 ATG 1995 1 gene_c49af40d_98aa
k59_0 1 0.513 Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_4 43660 20.7292 0 GGA/GAG/AGG 5-10bp Mock__15 2027 ATG 2488 -1 gene_305ad6e4_154aa
k59_0 1 0.503 Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_5 43660 20.7292 0 GGAG/GAGG 5-10bp Mock__15 2485 ATG 2961 -1 gene_2550103b_159aa

Gene Catalog

The DIAMOND reference database for all of the deduplicated gene sequences will be written to ref/genes.dmnd. This file can be converted into FASTA by the user if needed.

Co-Abundant Gene Groups (CAGs)

A two column CSV describing which genes were grouped into which samples will be written to ref/CAGs.csv.gz. Columns are CAG and gene, with one row per gene. It will also be available in results.hdf5 under /annot/gene/cag.

Example:

CAG gene
0 Mock__1_NODE_7510_length_263_cov_1.826923_1
0 Mock__3_NODE_6989_length_270_cov_1.767442_1
0 Mock__1_NODE_8655_length_250_cov_2.646154_1
0 Mock__2_NODE_6791_length_271_cov_1.759259_1
0 Mock__3_NODE_211_length_709_cov_5.498471_1

CAG Summary

A short summary of every CAG will be available in results.hdf5 under /annot/cag/all, containing the:

  • Size (number of genes) per CAG
  • Prevalence: proportion of samples in which the CAG was observed with a relative abundance > 0
  • Mean abundance: Average abundance across all samples
  • Std. abundance: Standard deviation of the abundance values across all samples

Example:

CAG mean_abundance prevalence size std_abundance
0.0 0.7048885226249695 0.8 75.0 0.42067644000053406
1.0 0.014859299175441265 0.2 12.0 0.03322640433907509
2.0 0.2802521586418152 0.8 7.0 0.41875842213630676

Gene Abundance

Abundance of every CAG in every sample in feather format will be written to abund/CAG.abund.feather. This table will be available in results.hdf5 under /abund/cag/wide. Abundances are calculated as the sum of the gene-level sequencing depth divided by the sum of sequencing depth for all genes in that sample.

Example:

CAG Mock__1 Mock__2 Mock__4 Mock__3
0.0 0.08 0.12 0.12 0.11
1.0 0.08 0.07 0.02 0.1
2.0 0.09 0.01 0.0 0.0
3.0 0.0 0.08 0.02 0.05
4.0 0.01 0.04 0.08 0.01

Abundance of every gene in every sample in will be available in results.hdf5 under /abund/gene/long/<specimen>. Here is an example of the information found in that table:

coverage depth id length nreads std specimen
1.0 3.42 Mock__2_NODE_12649_length_209_cov_2.454545_1 69 6 0.97 Mock__1
1.0 2.48 Mock__1_NODE_9620_length_240_cov_2.054054_1 79 4 0.85 Mock__1
0.66 2.28 Mock__4_NODE_3651_length_340_cov_1.985965_1 86 4 1.78 Mock__1
0.64 1.28 Mock__3_NODE_10136_length_235_cov_3.166667_1 78 2 0.96 Mock__1
1.0 5.69 Mock__1_NODE_6010_length_280_cov_3.377778_2 32 6 0.73 Mock__1

Annotations

The taxonomic assignment for each gene (DIAMOND output in taxonomic assignment mode) in TSV format will be written to annot/genes.tax.aln.gz. It will also be available in results.hdf5 under /annot/gene/tax.

Example:

gene tax_id evalue
Mock__13_NODE_19_length_684_cov_3.917329_1 1224 8.8e-21
Mock__13_NODE_80_length_217_cov_2.308642_1 562 1.3e-29
Mock__13_NODE_82_length_215_cov_4.750000_1 2608889 1.2e-10
Mock__13_NODE_16_length_757_cov_4.584046_2 543 3.2e-51
Mock__13_NODE_29_length_514_cov_2.932462_1 1236 4.8e-17

The functional assignment for each gene (eggNOG output) in TSV format will be written to annot/genes.emapper.annotations.gz. It will also be available in results.hdf5 under /annot/gene/eggnog.

Example:

query_name seed_eggNOG_ortholog seed_ortholog_evalue seed_ortholog_score best_tax_level taxonomic scope eggNOG OGs COG Functional cat. eggNOG free text desc.
Mock__12_NODE_1_length_5360_cov_15.598303_1 1453503.AU05_02695 6.3e-21 105.9 Proteobacteria Bacteria 1P1ED@1224,2FFHF@1,347EX@2 S Bacteriophage protein K
Mock__12_NODE_1_length_5360_cov_15.598303_2 1116472.MGMO_205c00050 1.3999999999999996e-78 298.9 Gammaproteobacteria Bacteria 1NZPF@1224,1SRZ6@1236,28HVF@1,2Z81Q@2 S Bacteriophage scaffolding protein D
Mock__12_NODE_1_length_5360_cov_15.598303_3 1116472.MGMO_205c00040 3.3e-13 79.7 Gammaproteobacteria Bacteria 1NKF2@1224,1ST7F@1236,2EFW9@1,339NJ@2 S Microvirus J protein
Mock__12_NODE_1_length_5360_cov_15.598303_4 1116472.MGMO_205c00030 5.399999999999998e-258 896.3 Gammaproteobacteria Bacteria 1NRFQ@1224,1SKG7@1236,28IN7@1,2Z8NM@2 S Capsid protein (F protein)
Mock__12_NODE_1_length_5360_cov_15.598303_5 1118153.MOY_16472 1.1e-95 355.9 Gammaproteobacteria Bacteria 1R4RW@1224,1SM1Q@1236,28HGS@1,2Z7SJ@2 S Major spike protein (G protein)

Taxonomy

The taxonomy used for the taxonomic annotation is provided in the output HDF5 under /ref/taxonomy, e.g.:

tax_id name parent rank
1 root 1 no rank
2 Bacteria 131567 superkingdom
6 Azorhizobium 335928 genus
7 Azorhizobium caulinodans 6 species
9 Buchnera aphidicola 32199 species

Ordination

For convenience, ordination is performed using the CAG abundance information to help provide a quick measure of sample similarity to the user. This includes both PCA and t-SNE, and the output tables can be found in results.hdf5 under /ordination/pca and /ordination/tsne. Examples of those outputs are:

specimen PC1 (94.4%) PC2 (5.6%) PC3 (4.8E-15%)
Mock__12 0.986596 -0.097462 -7.470813e-11
Mock__14 -0.395635 -0.052598 6.401242e-09
Mock__13 0.141401 0.257170 -5.739458e-16
Mock__15 -0.305487 -0.055524 -7.774022e-10
Mock__11 -0.426874 -0.051585 -5.549131e-09
specimen t-SNE 1 t-SNE 2
Mock__12 -67.135109 26.804821
Mock__14 -115.964615 10.169141
Mock__13 -126.923698 62.394192
Mock__15 -38.498631 69.708954
Mock__11 -86.144646 93.732803

Statistical Analysis

A CSV with the Corncob results for this dataset will be written to stats/corncob.results.csv. It will also be available in results.hdf5 under /stats/cag/corncob.

Example:

parameter type value CAG
mu.(Intercept) estimate -1.5722784150839353 0
mu.label1 estimate -0.2687179385894458 0
mu.label2 estimate -0.17127772136300107 0
phi.(Intercept) estimate -6.573817086548077 0
mu.(Intercept) std_error 0.15771708448454666 0

File Hierarchy

Here's an example of how the output files can be structured. This is what you get when you analyze some of the test data, so depending on how many samples you have, etc., you will see different exact set of files, but in the same pattern.

output
├── [ 160]  abund
│   ├── [ 664]  CAG.abund.feather
│   ├── [ 130]  CAG.readcounts.csv.gz
│   └── [ 224]  details
│       ├── [2.9K]  Mock__11.json.gz
│       ├── [ 404]  Mock__12.json.gz
│       ├── [3.7K]  Mock__13.json.gz
│       ├── [3.1K]  Mock__14.json.gz
│       └── [3.1K]  Mock__15.json.gz
├── [ 224]  assembly
│   ├── [ 224]  Mock__11
│   │   ├── [ 14K]  Mock__11.contigs.fasta.gz
│   │   ├── [1.8K]  Mock__11.csv.gz
│   │   ├── [ 10K]  Mock__11.faa.gz
│   │   ├── [2.9K]  Mock__11.gff.gz
│   │   └── [ 67K]  Mock__11.megahit.log
│   ├── [ 224]  Mock__12
│   │   ├── [1.8K]  Mock__12.contigs.fasta.gz
│   │   ├── [ 399]  Mock__12.csv.gz
│   │   ├── [1.3K]  Mock__12.faa.gz
│   │   ├── [ 653]  Mock__12.gff.gz
│   │   └── [ 51K]  Mock__12.megahit.log
│   ├── [ 224]  Mock__13
│   │   ├── [ 15K]  Mock__13.contigs.fasta.gz
│   │   ├── [3.4K]  Mock__13.csv.gz
│   │   ├── [ 11K]  Mock__13.faa.gz
│   │   ├── [6.8K]  Mock__13.gff.gz
│   │   └── [ 68K]  Mock__13.megahit.log
│   ├── [ 224]  Mock__14
│   │   ├── [ 16K]  Mock__14.contigs.fasta.gz
│   │   ├── [2.1K]  Mock__14.csv.gz
│   │   ├── [ 11K]  Mock__14.faa.gz
│   │   ├── [3.4K]  Mock__14.gff.gz
│   │   └── [ 67K]  Mock__14.megahit.log
│   └── [ 224]  Mock__15
│       ├── [ 16K]  Mock__15.contigs.fasta.gz
│       ├── [2.0K]  Mock__15.csv.gz
│       ├── [ 11K]  Mock__15.faa.gz
│       ├── [3.2K]  Mock__15.gff.gz
│       └── [ 67K]  Mock__15.megahit.log
├── [316K]  geneshot.details.hdf5
├── [372K]  geneshot.results.hdf5
├── [  96]  qc
│   └── [  86]  readcounts.csv
├── [  96]  ref
│   └── [ 21K]  genes.dmnd
└── [ 128]  stats
    ├── [1.4K]  corncob.results.csv
    └── [ 104]  geneshot.breakaway.csv.gz