Output Files

Once you've run Geneshot, you will want to look at the results. Because Geneshot runs many different types of analyses, you may need to consult this documentation to find the part of the results that matters most for your application.

Many of the individual aspects of the results can be found in two formats:

A compressed text file in the output directory specified by --output
A table in either the ${output_prefix}.results.hdf5 or ${output_prefix}.details.hdf5 output files. These are both written to the same output directory using the file name prefix specified by --output_prefix

NOTE: The file format HDF5 is used to collect all of the results because of its flexible support for combining multiple tabular datasets into a single data object. We have found that the most convenient way to read and write this data format is with the Python/Pandas library.

Example:

import pandas as pd
manifest_df = pd.read_hdf("geneshot.results.hdf5", "/manifest")
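
Before reading a specific table it can help to see which tables a given HDF5 file contains. The following is a minimal sketch, not part of Geneshot itself: the helper name list_tables is hypothetical, and it assumes the PyTables package is installed, which pandas needs for HDF5 support.

```python
import pandas as pd


def list_tables(hdf5_path):
    """Return the sorted keys of every table stored in an HDF5 file.

    Hypothetical helper for illustration; not provided by Geneshot.
    """
    # Open read-only so the results file cannot be modified by accident
    with pd.HDFStore(hdf5_path, mode="r") as store:
        return sorted(store.keys())
```

Calling list_tables("geneshot.results.hdf5") on a finished run should return keys such as /manifest, any of which can then be passed to pd.read_hdf as in the example above.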

Preprocessing

If the --savereads flag is used, all of the preprocessed WGS FASTQ files will be written to the qc/ folder.

Manifest

The table used as input for the entire analysis will be saved in the results.hdf5 under /manifest.

specimen	R1	R2	I1	I2	batch	label1	label2
Mock__11	data/Mock__11.R1.fastq.gz	data/Mock__11.R2.fastq.gz	data/Mock__11.I1.fastq.gz	data/Mock__11.I2.fastq.gz	mock	0	1
Mock__12	data/Mock__12.R1.fastq.gz	data/Mock__12.R2.fastq.gz	data/Mock__12.I1.fastq.gz	data/Mock__12.I2.fastq.gz	mock	1	0
Mock__13	data/Mock__13.R1.fastq.gz	data/Mock__13.R2.fastq.gz	data/Mock__13.I1.fastq.gz	data/Mock__13.I2.fastq.gz	mock	1	1
Mock__14	data/Mock__14.R1.fastq.gz	data/Mock__14.R2.fastq.gz	data/Mock__14.I1.fastq.gz	data/Mock__14.I2.fastq.gz	mock	0	0

Specimen Summary

Each sample can be summarized overall with a number of metrics, including:

The number of paired-end reads input per sample (2X the number of read pairs, to be explicit)
The number of paired-end reads which successfully align against the gene catalog
The estimated total number of protein coding genes in a given sample (using the breakaway algorithm)
The number of genes detected by assembly
The number of genes detected by alignment

For example:

/summary/all

specimen	n_reads	aligned_reads	n_genes_aligned	error	estimand	estimate	interval_lower	interval_upper	reasonable	n_genes_assembled
Mock__11	8076	None	75	0.0473	richness	75.0022	75.0004	75.0114	False	63
Mock__12	876	None	7	2.2399e-24	richness	7.0	7.0	nan	False	7
Mock__13	2492	None	90	0.4269	richness	90.1764	90.0338	90.9194	False	116
Mock__14	8252	None	81	0.0411	richness	81.0017	81.0003	81.0086	False	71
Mock__15	8952	None	82	0.0473	richness	82.0022	82.0004	82.0114	False	70
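
As a small worked example of how these columns relate, the sketch below rebuilds a subset of the example rows by hand and derives the number of read pairs from n_reads, which counts both mates. The aligned_per_assembled column is a hypothetical convenience ratio, not something Geneshot writes out.

```python
import pandas as pd

# Subset of columns copied from the /summary/all example rows above
summary_df = pd.DataFrame(
    {
        "specimen": ["Mock__11", "Mock__12", "Mock__13", "Mock__14", "Mock__15"],
        "n_reads": [8076, 876, 2492, 8252, 8952],
        "n_genes_aligned": [75, 7, 90, 81, 82],
        "n_genes_assembled": [63, 7, 116, 71, 70],
    }
)

# n_reads counts both mates of each pair, so halve it to get read pairs
summary_df["n_read_pairs"] = summary_df["n_reads"] // 2

# Hypothetical helper column: genes seen by alignment per gene seen by assembly
summary_df["aligned_per_assembled"] = (
    summary_df["n_genes_aligned"] / summary_df["n_genes_assembled"]
)

print(summary_df[["specimen", "n_read_pairs", "aligned_per_assembled"]])
```

With a real run, summary_df would instead come from pd.read_hdf on the results file with the key /summary/all.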

De novo Assembly

All of the contigs, scaffolds, gene annotations, and logs produced during de novo assembly for each specimen will be found in the assembly/<specimen>/ folder.

A CSV with information on every protein-coding sequence assembled from every sample (operationally termed 'alleles'), including the contig, position, sequencing depth, and GC content, will be written to assembly/allele.assembly.metrics.csv.gz. This table will also be available in details.hdf5 under /abund/allele/assembly/<specimen>, with one table per specimen.

Example:

contig	flag	gc_cont	gene_name	len	multi	rbs_motif	rbs_spacer	specimen	start	start_type	stop	strand	catalog_gene
k59_0	1	0.386	Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_1	43660	20.7292	GGAG/GAGG	5-10bp	Mock__15	510	ATG	716	-1	gene_f33b2fec_69aa
k59_0	1	0.326	Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_2	43660	20.7292	GGA/GAG/AGG	5-10bp	Mock__15	1002	ATG	1412	1	gene_4f04b95b_137aa
k59_0	1	0.4320000000000001	Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_3	43660	20.7292	GGA/GAG/AGG	5-10bp	Mock__15	1702	ATG	1995	1	gene_c49af40d_98aa
k59_0	1	0.513	Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_4	43660	20.7292	GGA/GAG/AGG	5-10bp	Mock__15	2027	ATG	2488	-1	gene_305ad6e4_154aa
k59_0	1	0.503	Mock__15__GENE__k59_0__flag=1__multi=20.7292__len=43660_5	43660	20.7292	GGAG/GAGG	5-10bp	Mock__15	2485	ATG	2961	-1	gene_2550103b_159aa
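
To illustrate how this table can be summarized, the sketch below rebuilds a few columns of the example rows and derives the coding length of each allele and a per-contig summary. It assumes that start and stop are 1-based, inclusive coordinates; in the example rows the resulting codon counts match the number in each catalog_gene name, but that is an observation about these rows, not a documented guarantee. The real file can be loaded with pd.read_csv("assembly/allele.assembly.metrics.csv.gz"), since pandas infers gzip compression from the extension.

```python
import pandas as pd

# Subset of columns copied from the example rows above
alleles = pd.DataFrame(
    {
        "contig": ["k59_0"] * 5,
        "specimen": ["Mock__15"] * 5,
        "gc_cont": [0.386, 0.326, 0.432, 0.513, 0.503],
        "start": [510, 1002, 1702, 2027, 2485],
        "stop": [716, 1412, 1995, 2488, 2961],
        "catalog_gene": [
            "gene_f33b2fec_69aa",
            "gene_4f04b95b_137aa",
            "gene_c49af40d_98aa",
            "gene_305ad6e4_154aa",
            "gene_2550103b_159aa",
        ],
    }
)

# ASSUMPTION: coordinates are 1-based and inclusive
alleles["cds_len_nt"] = alleles["stop"] - alleles["start"] + 1
alleles["cds_len_codons"] = alleles["cds_len_nt"] // 3

# One row per (specimen, contig): allele count, mean GC, total coding bases
per_contig = alleles.groupby(["specimen", "contig"]).agg(
    n_alleles=("catalog_gene", "size"),
    mean_gc=("gc_cont", "mean"),
    coding_nt=("cds_len_nt", "sum"),
)
print(per_contig)
```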

Gene Catalog

The DIAMOND reference database for all of the deduplicated gene sequences will be written to ref/genes.dmnd. This file can be converted into FASTA by the user if needed.

Co-Abundant Gene Groups (CAGs)

A two-column CSV describing which genes were grouped into which CAGs will be written to ref/CAGs.csv.gz. Columns are CAG and gene, with one row per gene. It will also be available in results.hdf5 under /annot/gene/cag.

Example:

CAG	gene
0	Mock__1_NODE_7510_length_263_cov_1.826923_1
0	Mock__3_NODE_6989_length_270_cov_1.767442_1
0	Mock__1_NODE_8655_length_250_cov_2.646154_1
0	Mock__2_NODE_6791_length_271_cov_1.759259_1
0	Mock__3_NODE_211_length_709_cov_5.498471_1
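
Because the table has one row per gene, the number of genes in each CAG and the members of a given CAG can be recovered with a group-by or a filter. The sketch below uses the example rows; the function names cag_sizes and genes_in_cag are hypothetical helpers, not part of Geneshot.

```python
import pandas as pd

# Rows copied from the example above
cag_df = pd.DataFrame(
    {
        "CAG": [0, 0, 0, 0, 0],
        "gene": [
            "Mock__1_NODE_7510_length_263_cov_1.826923_1",
            "Mock__3_NODE_6989_length_270_cov_1.767442_1",
            "Mock__1_NODE_8655_length_250_cov_2.646154_1",
            "Mock__2_NODE_6791_length_271_cov_1.759259_1",
            "Mock__3_NODE_211_length_709_cov_5.498471_1",
        ],
    }
)


def cag_sizes(cag_df):
    """Number of distinct genes assigned to each CAG."""
    return cag_df.groupby("CAG")["gene"].nunique().rename("size")


def genes_in_cag(cag_df, cag_id):
    """List of gene names assigned to a single CAG."""
    return cag_df.loc[cag_df["CAG"] == cag_id, "gene"].tolist()


print(cag_sizes(cag_df))
```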

CAG Summary

A short summary of every CAG will be available in results.hdf5 under /annot/cag/all, containing the:

Size (number of genes) per CAG
Prevalence: proportion of samples in which the CAG was observed with a relative abundance > 0
Mean abundance: Average abundance across all samples
Std. abundance: Standard deviation of the abundance values across all samples

Example:

CAG	mean_abundance	prevalence	size	std_abundance
0.0	0.7048885226249695	0.8	75.0	0.42067644000053406
1.0	0.014859299175441265	0.2	12.0	0.03322640433907509
2.0	0.2802521586418152	0.8	7.0	0.41875842213630676

Gene Abundance

Abundance of every CAG in every sample in feather format will be written to abund/CAG.abund.feather. This table will be available in results.hdf5 under /abund/cag/wide. Abundances are calculated as the sum of the gene-level sequencing depth divided by the sum of sequencing depth for all genes in that sample.

Example:

CAG	Mock__1	Mock__2	Mock__4	Mock__3
0.0	0.08	0.12	0.12	0.11
1.0	0.08	0.07	0.02	0.1
2.0	0.09	0.01	0.0	0.0
3.0	0.0	0.08	0.02	0.05
4.0	0.01	0.04	0.08	0.01
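
The metrics described under CAG Summary (prevalence, mean abundance, standard deviation) can be recomputed from a wide table like this one. The sketch below uses the example values; the choice of ddof=0 for the standard deviation is an assumption, since this page does not state which estimator Geneshot uses.

```python
import pandas as pd

# Values copied from the wide example table above (CAG as the index)
wide = pd.DataFrame(
    {
        "CAG": [0, 1, 2, 3, 4],
        "Mock__1": [0.08, 0.08, 0.09, 0.0, 0.01],
        "Mock__2": [0.12, 0.07, 0.01, 0.08, 0.04],
        "Mock__4": [0.12, 0.02, 0.0, 0.02, 0.08],
        "Mock__3": [0.11, 0.1, 0.0, 0.05, 0.01],
    }
).set_index("CAG")

cag_summary = pd.DataFrame(
    {
        # Proportion of samples with relative abundance > 0
        "prevalence": (wide > 0).mean(axis=1),
        # Average abundance across all samples
        "mean_abundance": wide.mean(axis=1),
        # ASSUMPTION: population standard deviation (ddof=0)
        "std_abundance": wide.std(axis=1, ddof=0),
    }
)
print(cag_summary)
```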

Abundance of every gene in every sample will be available in results.hdf5 under /abund/gene/long/<specimen>. Here is an example of the information found in that table:

coverage	depth	id	length	nreads	std	specimen
1.0	3.42	Mock__2_NODE_12649_length_209_cov_2.454545_1	69	6	0.97	Mock__1
1.0	2.48	Mock__1_NODE_9620_length_240_cov_2.054054_1	79	4	0.85	Mock__1
0.66	2.28	Mock__4_NODE_3651_length_340_cov_1.985965_1	86	4	1.78	Mock__1
0.64	1.28	Mock__3_NODE_10136_length_235_cov_3.166667_1	78	2	0.96	Mock__1
1.0	5.69	Mock__1_NODE_6010_length_280_cov_3.377778_2	32	6	0.73	Mock__1
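
The CAG abundance calculation described earlier in this section (sum of gene-level depth within a CAG divided by the sum of depth for all genes in the sample) can be sketched from this long-format table. The gene depths below are copied from the example rows, but the CAG membership is HYPOTHETICAL and invented only to make the example runnable; the function name cag_abundance is also an assumption.

```python
import pandas as pd

# Gene-level depths for one specimen, copied from the example rows above
gene_long = pd.DataFrame(
    {
        "id": [
            "Mock__2_NODE_12649_length_209_cov_2.454545_1",
            "Mock__1_NODE_9620_length_240_cov_2.054054_1",
            "Mock__4_NODE_3651_length_340_cov_1.985965_1",
            "Mock__3_NODE_10136_length_235_cov_3.166667_1",
            "Mock__1_NODE_6010_length_280_cov_3.377778_2",
        ],
        "depth": [3.42, 2.48, 2.28, 1.28, 5.69],
    }
)

# HYPOTHETICAL CAG membership, invented for illustration only
cag_df = pd.DataFrame({"CAG": [0, 0, 1, 1, 2], "gene": gene_long["id"]})


def cag_abundance(gene_long, cag_df):
    """Relative abundance of each CAG within a single specimen."""
    merged = gene_long.merge(cag_df, left_on="id", right_on="gene", how="left")
    # Denominator: depth summed over all genes in the sample
    total_depth = merged["depth"].sum()
    # Numerator: depth summed over the genes in each CAG
    return merged.groupby("CAG")["depth"].sum() / total_depth


abund = cag_abundance(gene_long, cag_df)
print(abund)
```

Genes without a CAG assignment would still count toward the denominator in this sketch, which follows the wording "all genes in that sample".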

Annotations

The taxonomic assignment for each gene (DIAMOND output in taxonomic assignment mode) in TSV format will be written to annot/genes.tax.aln.gz. It will also be available in results.hdf5 under /annot/gene/tax.

Example:

gene	tax_id	evalue
Mock__13_NODE_19_length_684_cov_3.917329_1	1224	8.8e-21
Mock__13_NODE_80_length_217_cov_2.308642_1	562	1.3e-29
Mock__13_NODE_82_length_215_cov_4.750000_1	2608889	1.2e-10
Mock__13_NODE_16_length_757_cov_4.584046_2	543	3.2e-51
Mock__13_NODE_29_length_514_cov_2.932462_1	1236	4.8e-17

The functional assignment for each gene (eggNOG output) in TSV format will be written to annot/genes.emapper.annotations.gz. It will also be available in results.hdf5 under /annot/gene/eggnog.

Example:

query_name	seed_eggNOG_ortholog	seed_ortholog_evalue	seed_ortholog_score	best_tax_level	taxonomic scope	eggNOG OGs	COG Functional cat.	eggNOG free text desc.
Mock__12_NODE_1_length_5360_cov_15.598303_1	1453503.AU05_02695	6.3e-21	105.9	Proteobacteria	Bacteria	1P1ED@1224,2FFHF@1,347EX@2	S	Bacteriophage protein K
Mock__12_NODE_1_length_5360_cov_15.598303_2	1116472.MGMO_205c00050	1.3999999999999996e-78	298.9	Gammaproteobacteria	Bacteria	1NZPF@1224,1SRZ6@1236,28HVF@1,2Z81Q@2	S	Bacteriophage scaffolding protein D
Mock__12_NODE_1_length_5360_cov_15.598303_3	1116472.MGMO_205c00040	3.3e-13	79.7	Gammaproteobacteria	Bacteria	1NKF2@1224,1ST7F@1236,2EFW9@1,339NJ@2	S	Microvirus J protein
Mock__12_NODE_1_length_5360_cov_15.598303_4	1116472.MGMO_205c00030	5.399999999999998e-258	896.3	Gammaproteobacteria	Bacteria	1NRFQ@1224,1SKG7@1236,28IN7@1,2Z8NM@2	S	Capsid protein (F protein)
Mock__12_NODE_1_length_5360_cov_15.598303_5	1118153.MOY_16472	1.1e-95	355.9	Gammaproteobacteria	Bacteria	1R4RW@1224,1SM1Q@1236,28HGS@1,2Z7SJ@2	S	Major spike protein (G protein)
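
The "eggNOG OGs" column packs several orthologous group assignments into one comma-separated string, each written as an OG identifier followed by @ and a number. The sketch below splits such a field into pairs. It assumes the number after @ is the NCBI taxon ID of the taxonomic level (for instance 2 for Bacteria, as in the taxonomy table on this page); the function name parse_eggnog_ogs is hypothetical.

```python
def parse_eggnog_ogs(og_field):
    """Split an 'eggNOG OGs' string into (og_id, tax_id) pairs.

    ASSUMPTION: each item has the form '<og_id>@<integer tax_id>'.
    """
    pairs = []
    for item in og_field.split(","):
        og_id, _, tax_id = item.strip().partition("@")
        pairs.append((og_id, int(tax_id)))
    return pairs


# Value copied from the first example row above
print(parse_eggnog_ogs("1P1ED@1224,2FFHF@1,347EX@2"))
```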

Taxonomy

The taxonomy used for the taxonomic annotation is provided in the output HDF5 under /ref/taxonomy, e.g.:

tax_id	name	parent	rank
1	root	1	no rank
2	Bacteria	131567	superkingdom
6	Azorhizobium	335928	genus
7	Azorhizobium caulinodans	6	species
9	Buchnera aphidicola	32199	species
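
Because each row points to its parent, the lineage of any tax_id (for example one taken from the taxonomic assignment table) can be reconstructed by following parent links until the root is reached. The sketch below uses only the five example rows, so the walk stops early when a parent is not present in the excerpt. It assumes tax_id and parent are stored as integers; the function name lineage is hypothetical.

```python
import pandas as pd

# Rows copied from the example above, indexed by tax_id
tax = pd.DataFrame(
    {
        "tax_id": [1, 2, 6, 7, 9],
        "name": [
            "root",
            "Bacteria",
            "Azorhizobium",
            "Azorhizobium caulinodans",
            "Buchnera aphidicola",
        ],
        "parent": [1, 131567, 335928, 6, 32199],
        "rank": ["no rank", "superkingdom", "genus", "species", "species"],
    }
).set_index("tax_id")


def lineage(tax, tax_id):
    """Return (rank, name) pairs from tax_id upward through its ancestors."""
    path = []
    seen = set()
    # Stop at the root (its own parent) or when a parent is missing
    while tax_id in tax.index and tax_id not in seen:
        seen.add(tax_id)
        row = tax.loc[tax_id]
        path.append((row["rank"], row["name"]))
        tax_id = int(row["parent"])
    return path


print(lineage(tax, 7))
```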

Ordination

For convenience, ordination is performed using the CAG abundance information to help provide a quick measure of sample similarity to the user. This includes both PCA and t-SNE, and the output tables can be found in results.hdf5 under /ordination/pca and /ordination/tsne. Examples of those outputs are:

specimen	PC1 (94.4%)	PC2 (5.6%)	PC3 (4.8E-15%)
Mock__12	0.986596	-0.097462	-7.470813e-11
Mock__14	-0.395635	-0.052598	6.401242e-09
Mock__13	0.141401	0.257170	-5.739458e-16
Mock__15	-0.305487	-0.055524	-7.774022e-10
Mock__11	-0.426874	-0.051585	-5.549131e-09

specimen	t-SNE 1	t-SNE 2
Mock__12	-67.135109	26.804821
Mock__14	-115.964615	10.169141
Mock__13	-126.923698	62.394192
Mock__15	-38.498631	69.708954
Mock__11	-86.144646	93.732803
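
In the PCA table the proportion of variance explained by each component is embedded in the column name. A minimal sketch for pulling those percentages out as numbers, assuming the headers always follow the "PC<n> (<value>%)" pattern seen in the example (the function name explained_variance is hypothetical):

```python
import re


def explained_variance(columns):
    """Map 'PC1', 'PC2', ... to the percentage found in each column header."""
    pattern = re.compile(r"^(PC\d+) \(([^%]+)%\)$")
    result = {}
    for col in columns:
        match = pattern.match(col)
        if match:
            result[match.group(1)] = float(match.group(2))
    return result


# Column names copied from the example above
cols = ["specimen", "PC1 (94.4%)", "PC2 (5.6%)", "PC3 (4.8E-15%)"]
print(explained_variance(cols))
```

With a table loaded from /ordination/pca, the same function can be applied to the DataFrame's columns.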

Statistical Analysis

A CSV with the Corncob results for this dataset will be written to stats/corncob.results.csv. It will also be available in results.hdf5 under /stats/cag/corncob.

Example:

parameter	type	value
mu.(Intercept)	estimate	-1.5722784150839353
mu.label1	estimate	-0.2687179385894458
mu.label2	estimate	-0.17127772136300107
phi.(Intercept)	estimate	-6.573817086548077
mu.(Intercept)	std_error	0.15771708448454666
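
The results are stored in long format, with one row per parameter and value type. For inspection it is often easier to pivot them so that each parameter has one row with an estimate column and a std_error column. The sketch below does this for the example rows only; if the full table carries additional identifying columns (for example a CAG identifier, which is an assumption here), those would need to be included in the pivot index.

```python
import pandas as pd

# Rows copied from the example above
corncob = pd.DataFrame(
    {
        "parameter": [
            "mu.(Intercept)",
            "mu.label1",
            "mu.label2",
            "phi.(Intercept)",
            "mu.(Intercept)",
        ],
        "type": ["estimate", "estimate", "estimate", "estimate", "std_error"],
        "value": [
            -1.5722784150839353,
            -0.2687179385894458,
            -0.17127772136300107,
            -6.573817086548077,
            0.15771708448454666,
        ],
    }
)

# One row per parameter, one column per value type; missing pairs become NaN
corncob_wide = corncob.pivot(index="parameter", columns="type", values="value")
print(corncob_wide)
```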

File Hierarchy

Here's an example of how the output files are structured. This listing comes from an analysis of some of the test data, so depending on how many samples you have, etc., you will see a different exact set of files, but in the same pattern.

output
├── [ 160]  abund
│   ├── [ 664]  CAG.abund.feather
│   ├── [ 130]  CAG.readcounts.csv.gz
│   └── [ 224]  details
│       ├── [2.9K]  Mock__11.json.gz
│       ├── [ 404]  Mock__12.json.gz
│       ├── [3.7K]  Mock__13.json.gz
│       ├── [3.1K]  Mock__14.json.gz
│       └── [3.1K]  Mock__15.json.gz
├── [ 224]  assembly
│   ├── [ 224]  Mock__11
│   │   ├── [ 14K]  Mock__11.contigs.fasta.gz
│   │   ├── [1.8K]  Mock__11.csv.gz
│   │   ├── [ 10K]  Mock__11.faa.gz
│   │   ├── [2.9K]  Mock__11.gff.gz
│   │   └── [ 67K]  Mock__11.megahit.log
│   ├── [ 224]  Mock__12
│   │   ├── [1.8K]  Mock__12.contigs.fasta.gz
│   │   ├── [ 399]  Mock__12.csv.gz
│   │   ├── [1.3K]  Mock__12.faa.gz
│   │   ├── [ 653]  Mock__12.gff.gz
│   │   └── [ 51K]  Mock__12.megahit.log
│   ├── [ 224]  Mock__13
│   │   ├── [ 15K]  Mock__13.contigs.fasta.gz
│   │   ├── [3.4K]  Mock__13.csv.gz
│   │   ├── [ 11K]  Mock__13.faa.gz
│   │   ├── [6.8K]  Mock__13.gff.gz
│   │   └── [ 68K]  Mock__13.megahit.log
│   ├── [ 224]  Mock__14
│   │   ├── [ 16K]  Mock__14.contigs.fasta.gz
│   │   ├── [2.1K]  Mock__14.csv.gz
│   │   ├── [ 11K]  Mock__14.faa.gz
│   │   ├── [3.4K]  Mock__14.gff.gz
│   │   └── [ 67K]  Mock__14.megahit.log
│   └── [ 224]  Mock__15
│       ├── [ 16K]  Mock__15.contigs.fasta.gz
│       ├── [2.0K]  Mock__15.csv.gz
│       ├── [ 11K]  Mock__15.faa.gz
│       ├── [3.2K]  Mock__15.gff.gz
│       └── [ 67K]  Mock__15.megahit.log
├── [316K]  geneshot.details.hdf5
├── [372K]  geneshot.results.hdf5
├── [  96]  qc
│   └── [  86]  readcounts.csv
├── [  96]  ref
│   └── [ 21K]  genes.dmnd
└── [ 128]  stats
    ├── [1.4K]  corncob.results.csv
    └── [ 104]  geneshot.breakaway.csv.gz
