Output Files

Once you've run Geneshot, you will want to look at the results. Because Geneshot runs many different types of analyses, you may have to consult this documentation in order to find the element of the results which is most important for your application.

Many of the individual aspects of the results can be found in two formats:

A compressed text file in the output directory specified by --output
A table in the ${output_prefix}.full.hdf5 output file, whose name is set by --output_prefix

NOTE: The file format HDF5 is used to collect all of the results because of its flexible support for combining multiple tabular datasets into a single data object. We have found that the most convenient way to read and write this data format is with the Python/Pandas library.
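As a minimal sketch of reading one table with Pandas, the helper below wraps `pd.read_hdf`, which relies on the optional PyTables package being installed. The helper name `read_result_table` and the file name in the usage comment are illustrative assumptions, not part of Geneshot.

```python
import pandas as pd


def read_result_table(hdf_path, key):
    """Load a single table (for example "/abund/cag/wide") as a DataFrame."""
    # pandas delegates HDF5 access to the optional PyTables ("tables") package
    return pd.read_hdf(hdf_path, key=key)


# Hypothetical usage; substitute your own --output_prefix for "my_run":
# cag_abund = read_result_table("my_run.full.hdf5", "/abund/cag/wide")
```

The keys listed throughout this page (such as /abund/gene/wide or /annot/gene/tax) can be passed as the `key` argument.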

Preprocessing

If the --savereads flag is used, all of the preprocessed WGS FASTQ files will be written to the qc/ folder.

De novo Assembly

All of the contigs, scaffolds, gene annotations, and logs produced during de novo assembly for each specimen will be found in the assembly/<specimen>/ folder.

A CSV with information on every protein-coding sequence assembled from every sample (operationally termed 'alleles'), including the contig, position, sequencing depth, and GC content, will be written to assembly/allele.assembly.metrics.csv.gz. This table will also be available in results.hdf5 under /abund/allele/assembly.

Example:

contig_depth	contig_length	contig_num	details	gc	gene_name	gene_num	specimen	start	stop	strand
22.380753	4357.0	1	ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.287	0.287	1_length_4357_cov_22.380753_1	1	Mock__2	1	1338	1
22.380753	4357.0	1	ID=1_2;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.302	0.302	1_length_4357_cov_22.380753_2	2	Mock__2	1524	2765	1
22.380753	4357.0	1	ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.265	0.265	1_length_4357_cov_22.380753_3	3	Mock__2	2951	3052	1
22.380753	4357.0	1	ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.322	0.322	1_length_4357_cov_22.380753_4	4	Mock__2	3292	4236	1
16.069435000000002	4318.0	2	ID=2_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.361	0.361	2_length_4318_cov_16.069435_1	1	Mock__2	212	856	1
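Judging from the example rows, the details column packs several attributes into one semicolon-delimited string of key=value pairs. A minimal sketch for unpacking it, assuming that layout holds for every row (the function name `parse_details` is illustrative):

```python
def parse_details(details):
    """Split a 'key=value;key=value' string into a dict of strings."""
    fields = {}
    for item in details.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


# The details string is copied from the second example row above
row = "ID=1_2;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.302"
info = parse_details(row)
print(info["start_type"], float(info["gc_cont"]))  # ATG 0.302
```

All values are kept as strings, so numeric fields such as gc_cont need an explicit conversion.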

Gene Catalog

All deduplicated gene sequences are written to ref/genes.fasta.gz in GZIP compressed FASTA format.

A table in TSV format describing which alleles were grouped together to form deduplicated genes, using the indicated percent identity threshold, will be written to ref/genes.alleles.tsv.gz. It will also be available in results.hdf5 under /annot/allele/gene.

Example:

gene	allele
Mock__4_NODE_1_length_4168_cov_15.434476_1	Mock__4_NODE_1_length_4168_cov_15.434476_1
Mock__4_NODE_1_length_4168_cov_15.434476_1	Mock__2_NODE_2_length_4318_cov_16.069435_1
Mock__4_NODE_1_length_4168_cov_15.434476_1	Mock__3_NODE_2_length_4301_cov_15.401319_1
Mock__4_NODE_1_length_4168_cov_15.434476_1	Mock__1_NODE_2_length_4150_cov_16.511355_1
Mock__4_NODE_1_length_4168_cov_15.434476_2	Mock__4_NODE_1_length_4168_cov_15.434476_2
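To see how many alleles were collapsed into each deduplicated gene, the table can be grouped by its gene column. The sketch below uses the example rows above and assumes the file is tab-delimited with a gene/allele header row, as shown. A real file would be opened with `gzip.open(..., "rt")` rather than `io.StringIO`.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the example above
example = io.StringIO(
    "gene\tallele\n"
    "Mock__4_NODE_1_length_4168_cov_15.434476_1\tMock__4_NODE_1_length_4168_cov_15.434476_1\n"
    "Mock__4_NODE_1_length_4168_cov_15.434476_1\tMock__2_NODE_2_length_4318_cov_16.069435_1\n"
    "Mock__4_NODE_1_length_4168_cov_15.434476_1\tMock__3_NODE_2_length_4301_cov_15.401319_1\n"
    "Mock__4_NODE_1_length_4168_cov_15.434476_1\tMock__1_NODE_2_length_4150_cov_16.511355_1\n"
    "Mock__4_NODE_1_length_4168_cov_15.434476_2\tMock__4_NODE_1_length_4168_cov_15.434476_2\n"
)

# Collect the alleles that belong to each deduplicated gene
alleles_by_gene = defaultdict(list)
for record in csv.DictReader(example, delimiter="\t"):
    alleles_by_gene[record["gene"]].append(record["allele"])

for gene, alleles in alleles_by_gene.items():
    print(gene, len(alleles))
```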

The DIAMOND reference database for all of the deduplicated gene sequences will be written to ref/genes.dmnd.

Co-Abundant Gene Groups (CAGs)

A two-column CSV describing which genes were grouped into which CAGs will be written to ref/CAGs.csv.gz. Columns are CAG and gene, with one row per gene. It will also be available in results.hdf5 under /annot/gene/cag.

Example:

CAG	gene
0	Mock__1_NODE_7510_length_263_cov_1.826923_1
0	Mock__3_NODE_6989_length_270_cov_1.767442_1
0	Mock__1_NODE_8655_length_250_cov_2.646154_1
0	Mock__2_NODE_6791_length_271_cov_1.759259_1
0	Mock__3_NODE_211_length_709_cov_5.498471_1
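The compressed text outputs can be read with the Python standard library alone. The sketch below counts the genes in each CAG, assuming a comma-delimited file with a CAG,gene header. It runs on a small stand-in file with hypothetical gene names so that it is self-contained, and the function name `cag_sizes` is illustrative.

```python
import csv
import gzip
import os
import tempfile
from collections import Counter


def cag_sizes(csv_gz_path):
    """Count the genes assigned to each CAG in a gzip-compressed CSV."""
    with gzip.open(csv_gz_path, "rt", newline="") as handle:
        return Counter(row["CAG"] for row in csv.DictReader(handle))


# Self-contained demo on a small stand-in file with hypothetical gene names
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "CAGs.csv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        handle.write("CAG,gene\n0,gene_A\n0,gene_B\n1,gene_C\n")
    sizes = cag_sizes(path)

print(sizes)  # Counter({'0': 2, '1': 1})
```

Note that the csv module returns the CAG identifiers as strings; convert them with `int` if numeric sorting is needed.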

Gene Abundance

The abundance of every gene in every sample will be written in feather format to abund/gene.abund.feather. This table will also be available in results.hdf5 under /abund/gene/wide. Each abundance is calculated as the depth of sequencing for an individual gene, divided by the sum of the depths of sequencing for every gene detected in that sample.

Example:

index	Mock__1	Mock__2	Mock__4	Mock__3
Mock__4_NODE_7717_length_262_cov_2.753623_1	0E+00	2E-05	6E-05	5E-05
Mock__2_NODE_3808_length_333_cov_2.330935_1	4E-05	3E-05	0E+00	4E-05
Mock__2_NODE_12408_length_211_cov_2.435897_1	6E-05	5E-05	0E+00	0E+00
Mock__1_NODE_7309_length_265_cov_1.809524_1	4E-05	3E-05	0E+00	3E-05
Mock__1_NODE_11183_length_224_cov_2.071006_1	4E-05	2E-05	3E-05	2E-05
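The gene-level abundance calculation described above (depth of one gene divided by the summed depth of all genes in the sample) can be sketched as follows. The depths and gene names are hypothetical, and this illustrates the arithmetic only, not Geneshot's own code.

```python
def relative_abundance(depths):
    """Divide each gene's depth by the total depth across all genes in one sample."""
    total = sum(depths.values())
    if total == 0:
        # No gene was detected in this sample, so every abundance is zero
        return {gene: 0.0 for gene in depths}
    return {gene: depth / total for gene, depth in depths.items()}


# Hypothetical depths for three genes in one sample
depths = {"gene_A": 3.0, "gene_B": 1.0, "gene_C": 0.0}
abund = relative_abundance(depths)
print(abund)  # {'gene_A': 0.75, 'gene_B': 0.25, 'gene_C': 0.0}
```

Because of this normalization, the abundances of all genes within one sample sum to 1.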

Abundance of every CAG in every sample in feather format will be written to abund/CAG.abund.feather. This table will be available in results.hdf5 under /abund/cag/wide. Abundances are calculated as the sum of the gene-level abundances (above) over all of the genes contained in a given CAG.

Example:

CAG	Mock__1	Mock__2	Mock__4	Mock__3
0.0	0.08	0.12	0.12	0.11
1.0	0.08	0.07	0.02	0.1
2.0	0.09	0.01	0.0	0.0
3.0	0.0	0.08	0.02	0.05
4.0	0.01	0.04	0.08	0.01
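The CAG-level abundance is the sum of the gene-level abundances of its member genes. A minimal sketch for a single sample, using hypothetical gene names, abundances, and CAG membership:

```python
from collections import defaultdict


def cag_abundance(gene_abund, gene_to_cag):
    """Sum gene-level abundances within each CAG for a single sample."""
    totals = defaultdict(float)
    for gene, abund in gene_abund.items():
        totals[gene_to_cag[gene]] += abund
    return dict(totals)


# Hypothetical gene abundances and CAG membership
gene_abund = {"gene_A": 0.5, "gene_B": 0.25, "gene_C": 0.25}
gene_to_cag = {"gene_A": 0, "gene_B": 0, "gene_C": 1}
print(cag_abundance(gene_abund, gene_to_cag))  # {0: 0.75, 1: 0.25}
```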

A more complete description of the detection of every gene in every sample can be found in the JSON files output by FAMLI, which will be written to the abund/details/ folder with a single file per specimen. In addition, this data will be aggregated into a single table in results.hdf5 under /abund/gene/long. Here is an example of the information found in that table:

coverage	depth	id	length	nreads	std	specimen
1.0	3.42	Mock__2_NODE_12649_length_209_cov_2.454545_1	69	6	0.97	Mock__1
1.0	2.48	Mock__1_NODE_9620_length_240_cov_2.054054_1	79	4	0.85	Mock__1
0.66	2.28	Mock__4_NODE_3651_length_340_cov_1.985965_1	86	4	1.78	Mock__1
0.64	1.28	Mock__3_NODE_10136_length_235_cov_3.166667_1	78	2	0.96	Mock__1
1.0	5.69	Mock__1_NODE_6010_length_280_cov_3.377778_2	32	6	0.73	Mock__1

Annotations

The taxonomic assignment for each gene (DIAMOND output in taxonomic assignment mode) in TSV format will be written to annot/genes.tax.aln.gz. It will also be available in results.hdf5 under /annot/gene/tax.

Example:

gene	tax_id	evalue
Mock__13_NODE_19_length_684_cov_3.917329_1	1224	8.8e-21
Mock__13_NODE_80_length_217_cov_2.308642_1	562	1.3e-29
Mock__13_NODE_82_length_215_cov_4.750000_1	2608889	1.2e-10
Mock__13_NODE_16_length_757_cov_4.584046_2	543	3.2e-51
Mock__13_NODE_29_length_514_cov_2.932462_1	1236	4.8e-17

The functional assignment for each gene (eggNOG output) in TSV format will be written to annot/genes.emapper.annotations.gz. It will also be available in results.hdf5 under /annot/gene/eggnog.

Example:

query_name	seed_eggNOG_ortholog	seed_ortholog_evalue	seed_ortholog_score	best_tax_level	taxonomic scope	eggNOG OGs	COG Functional cat.	eggNOG free text desc.
Mock__12_NODE_1_length_5360_cov_15.598303_1	1453503.AU05_02695	6.3e-21	105.9	Proteobacteria	Bacteria	1P1ED@1224,2FFHF@1,347EX@2	S	Bacteriophage protein K
Mock__12_NODE_1_length_5360_cov_15.598303_2	1116472.MGMO_205c00050	1.3999999999999996e-78	298.9	Gammaproteobacteria	Bacteria	1NZPF@1224,1SRZ6@1236,28HVF@1,2Z81Q@2	S	Bacteriophage scaffolding protein D
Mock__12_NODE_1_length_5360_cov_15.598303_3	1116472.MGMO_205c00040	3.3e-13	79.7	Gammaproteobacteria	Bacteria	1NKF2@1224,1ST7F@1236,2EFW9@1,339NJ@2	S	Microvirus J protein
Mock__12_NODE_1_length_5360_cov_15.598303_4	1116472.MGMO_205c00030	5.399999999999998e-258	896.3	Gammaproteobacteria	Bacteria	1NRFQ@1224,1SKG7@1236,28IN7@1,2Z8NM@2	S	Capsid protein (F protein)
Mock__12_NODE_1_length_5360_cov_15.598303_5	1118153.MOY_16472	1.1e-95	355.9	Gammaproteobacteria	Bacteria	1R4RW@1224,1SM1Q@1236,28HGS@1,2Z7SJ@2	S	Major spike protein (G protein)
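In the example rows, the eggNOG OGs column holds a comma-separated list of entries of the form group@taxid. A minimal sketch for splitting that column into pairs, assuming every entry follows this pattern (the function name `parse_eggnog_ogs` is illustrative):

```python
def parse_eggnog_ogs(ogs):
    """Turn 'OG@taxid,OG@taxid' into a list of (og_id, tax_id) tuples."""
    pairs = []
    for entry in ogs.split(","):
        if not entry:
            continue  # skip empty entries, e.g. from an empty cell
        og_id, _, tax_id = entry.partition("@")
        pairs.append((og_id, int(tax_id)))
    return pairs


# Value copied from the first example row above
print(parse_eggnog_ogs("1P1ED@1224,2FFHF@1,347EX@2"))
# [('1P1ED', 1224), ('2FFHF', 1), ('347EX', 2)]
```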

Statistical Analysis

A CSV with the Corncob results for this dataset will be written to stats/corncob.results.csv. It will also be available in results.hdf5 under /stats/cag/corncob.

Example:

parameter	type	value
mu.(Intercept)	estimate	-1.5722784150839353
mu.label1	estimate	-0.2687179385894458
mu.label2	estimate	-0.17127772136300107
phi.(Intercept)	estimate	-6.573817086548077
mu.(Intercept)	std_error	0.15771708448454666
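The results are stored in long format, with one row per parameter and value type. To look up the estimate and standard error of a parameter together, the rows can be regrouped by parameter. This sketch uses only the three columns and five rows shown in the example above; a full results table may need an additional grouping key.

```python
from collections import defaultdict

# (parameter, type, value) rows copied from the example above
rows = [
    ("mu.(Intercept)", "estimate", -1.5722784150839353),
    ("mu.label1", "estimate", -0.2687179385894458),
    ("mu.label2", "estimate", -0.17127772136300107),
    ("phi.(Intercept)", "estimate", -6.573817086548077),
    ("mu.(Intercept)", "std_error", 0.15771708448454666),
]

# Regroup as {parameter: {type: value}}
by_parameter = defaultdict(dict)
for parameter, value_type, value in rows:
    by_parameter[parameter][value_type] = value

intercept = by_parameter["mu.(Intercept)"]
print(intercept["estimate"], intercept["std_error"])
```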

Summary HDF

In addition to the full set of results found in the ${output_prefix}.full.hdf5 HDF store, we will also make a smaller summary file: ${output_prefix}.summary.hdf5. This will contain a subset of the data found in the full HDF store:

Manifest table provided by the user: /manifest
Abundance of every CAG in every sample: /abund/cag/wide
Summary of all annotations available for each gene: /annot/gene/all
Table with statistical analysis results: /stats/cag/corncob
