Output Files

Once you've run Geneshot, you will want to look at the results. Because Geneshot runs many different types of analyses, you may have to consult this documentation in order to find the element of the results which is most important for your application.

Many of the individual aspects of the results can be found in two formats:

A compressed text file in the output directory specified by --output
A table in the results.hdf5 file written to the output directory

NOTE: The file format HDF5 is used to collect all of the results because of its flexible support for combining multiple tabular datasets into a single data object. We have found that the most convenient way to read and write this data format is with the Python/Pandas library.
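
As a minimal sketch of reading one of these tables with Pandas: the HDF5 keys below are the ones listed on this page, while the short names in `RESULT_KEYS` and the helper `read_result` are hypothetical and exist only for this example. Note that `pandas.read_hdf` relies on the optional PyTables package being installed.

```python
import pandas as pd

# HDF5 keys documented on this page; the short names are made up for this sketch.
RESULT_KEYS = {
    "allele_assembly": "/abund/allele/assembly",
    "allele_gene": "/annot/allele/gene",
    "gene_cag": "/annot/gene/cag",
    "gene_abund": "/abund/gene/wide",
    "cag_abund": "/abund/cag/wide",
    "gene_detail": "/abund/gene/long",
    "gene_tax": "/annot/gene/tax",
    "gene_eggnog": "/annot/gene/eggnog",
    "corncob": "/stats/cag/corncob",
}


def read_result(hdf_path, name):
    """Return one table from results.hdf5 as a DataFrame."""
    key = RESULT_KEYS[name]  # raises KeyError for names not listed above
    return pd.read_hdf(hdf_path, key=key)
```

For example, `read_result("output/results.hdf5", "cag_abund")` would load the CAG abundance table, where `output/` stands in for whatever directory was passed to --output.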

Preprocessing

If the --savereads flag is used, all of the preprocessed WGS FASTQ files will be written to the qc/ folder.

De novo Assembly

All of the contigs, scaffolds, gene annotations, and logs produced during de novo assembly for each specimen will be found in the assembly/<specimen>/ folder.

A CSV with information on every protein-coding sequence assembled from every sample (operationally termed 'alleles'), including the contig, position, sequencing depth, and GC content, will be written to assembly/allele.assembly.metrics.csv.gz. This table will also be available in results.hdf5 under /abund/allele/assembly.

Example:

contig_depth	contig_length	contig_num	details	gc	gene_name	gene_num	specimen	start	stop	strand
22.380753	4357.0	1	ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.287	0.287	1_length_4357_cov_22.380753_1	1	Mock__2	1	1338	1
22.380753	4357.0	1	ID=1_2;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.302	0.302	1_length_4357_cov_22.380753_2	2	Mock__2	1524	2765	1
22.380753	4357.0	1	ID=1_3;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.265	0.265	1_length_4357_cov_22.380753_3	3	Mock__2	2951	3052	1
22.380753	4357.0	1	ID=1_4;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.322	0.322	1_length_4357_cov_22.380753_4	4	Mock__2	3292	4236	1
16.069435000000002	4318.0	2	ID=2_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.361	0.361	2_length_4318_cov_16.069435_1	1	Mock__2	212	856	1
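
The details column in the example above packs several attributes into one semicolon-delimited string of key=value pairs. A small sketch of splitting it apart is shown here; the helper name `parse_details` is hypothetical, and all values are kept as strings because this page does not define their types.

```python
def parse_details(details):
    """Split a 'key=value;key=value' details string into a dict of strings."""
    fields = {}
    for item in details.strip().split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


# Details string copied from the second row of the example table
example = "ID=1_2;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.302"
parsed = parse_details(example)
print(parsed["start_type"], parsed["gc_cont"])  # ATG 0.302
```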

Gene Catalog

All deduplicated gene sequences are written to ref/genes.fasta.gz in GZIP compressed FASTA format.

A table in TSV format describing which alleles were grouped together to form deduplicated genes, using the indicated percent identity threshold, will be written to ref/genes.alleles.tsv.gz. It will also be available in results.hdf5 under /annot/allele/gene.

The DIAMOND reference database for all of the deduplicated gene sequences will be written to ref/genes.dmnd.

Co-Abundant Gene Groups (CAGs)

A two-column CSV describing which genes were grouped into which CAGs will be written to ref/CAGs.csv.gz. Columns are CAG and gene, with one row per gene. It will also be available in results.hdf5 under /annot/gene/cag.

Example:

CAG	gene
0	Mock__1_NODE_7510_length_263_cov_1.826923_1
0	Mock__3_NODE_6989_length_270_cov_1.767442_1
0	Mock__1_NODE_8655_length_250_cov_2.646154_1
0	Mock__2_NODE_6791_length_271_cov_1.759259_1
0	Mock__3_NODE_211_length_709_cov_5.498471_1
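
One common first look at this table is the number of genes in each CAG. The following is a sketch using only the Python standard library; the function names are hypothetical, and the comma delimiter is an assumption based on the file being described as a CSV (pass `delimiter="\t"` if the file turns out to be tab-separated).

```python
import csv
import gzip
from collections import Counter


def read_cag_table(path, delimiter=","):
    """Read the gzip-compressed CAG table into a list of row dicts."""
    with gzip.open(path, mode="rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter=delimiter))


def cag_sizes(rows):
    """Count genes per CAG from row dicts with 'CAG' and 'gene' keys."""
    # CAG identifiers are left as strings, exactly as read from the file.
    return Counter(row["CAG"] for row in rows)
```

Calling `cag_sizes(read_cag_table("ref/CAGs.csv.gz")).most_common(10)` would then list the ten largest CAGs.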

Gene Abundance

The abundance of every gene in every sample will be written in feather format to abund/gene.abund.feather. This table will also be available in results.hdf5 under /abund/gene/wide. Abundances are calculated as the depth of sequencing for each individual gene, divided by the sum of all depths of sequencing for every gene detected in a sample.

Example:

index	Mock__1	Mock__2	Mock__4	Mock__3
Mock__4_NODE_7717_length_262_cov_2.753623_1	0E+00	2E-05	6E-05	5E-05
Mock__2_NODE_3808_length_333_cov_2.330935_1	4E-05	3E-05	0E+00	4E-05
Mock__2_NODE_12408_length_211_cov_2.435897_1	6E-05	5E-05	0E+00	0E+00
Mock__1_NODE_7309_length_265_cov_1.809524_1	4E-05	3E-05	0E+00	3E-05
Mock__1_NODE_11183_length_224_cov_2.071006_1	4E-05	2E-05	3E-05	2E-05
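
The gene-level abundance calculation described above (each gene's depth divided by the summed depth of all genes detected in the sample) can be sketched as follows. This is an illustration of the arithmetic only, not Geneshot's own code; the function name and the toy depths are hypothetical.

```python
def relative_abundance(depths):
    """Divide each gene's depth by the total depth across all genes in one sample."""
    total = sum(depths.values())
    if total == 0:
        return {gene: 0.0 for gene in depths}  # nothing detected in this sample
    return {gene: depth / total for gene, depth in depths.items()}


# Toy depths for a single sample (not real Geneshot output)
sample_depths = {"gene_a": 1.0, "gene_b": 3.0, "gene_c": 4.0}
abund = relative_abundance(sample_depths)
print(abund)  # {'gene_a': 0.125, 'gene_b': 0.375, 'gene_c': 0.5}
```

By construction, the abundances within one sample sum to 1.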

The abundance of every CAG in every sample will be written in feather format to abund/CAG.abund.feather. This table will also be available in results.hdf5 under /abund/cag/wide. Abundances are calculated as the sum of the gene-level abundances (above) over all of the genes contained in a given CAG.

Example:

CAG	Mock__1	Mock__2	Mock__4	Mock__3
0.0	0.08	0.12	0.12	0.11
1.0	0.08	0.07	0.02	0.1
2.0	0.09	0.01	0.0	0.0
3.0	0.0	0.08	0.02	0.05
4.0	0.01	0.04	0.08	0.01
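
The CAG-level calculation is a grouped sum: each gene's abundance is added to the total for the CAG that contains it. A minimal sketch for one sample is shown here, with a hypothetical function name and toy inputs; a gene missing from the gene-to-CAG mapping raises a `KeyError` in this version.

```python
from collections import defaultdict


def cag_abundance(gene_abund, gene_to_cag):
    """Sum gene-level abundances over the genes assigned to each CAG."""
    totals = defaultdict(float)
    for gene, value in gene_abund.items():
        totals[gene_to_cag[gene]] += value
    return dict(totals)


# Toy inputs for a single sample (not real Geneshot output)
gene_abund = {"gene_a": 0.125, "gene_b": 0.375, "gene_c": 0.5}
gene_to_cag = {"gene_a": 0, "gene_b": 0, "gene_c": 1}
print(cag_abundance(gene_abund, gene_to_cag))  # {0: 0.5, 1: 0.5}
```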

A more complete description of the detection of every gene in every sample can be found in the JSON files output by FAMLI, which will be written to /abund/details with a single file per specimen. In addition, this data will be aggregated into a single table in results.hdf5 under /abund/gene/long. Here is an example of the information found in that table:

coverage	depth	id	length	nreads	std	specimen
1.0	3.42	Mock__2_NODE_12649_length_209_cov_2.454545_1	69	6	0.97	Mock__1
1.0	2.48	Mock__1_NODE_9620_length_240_cov_2.054054_1	79	4	0.85	Mock__1
0.66	2.28	Mock__4_NODE_3651_length_340_cov_1.985965_1	86	4	1.78	Mock__1
0.64	1.28	Mock__3_NODE_10136_length_235_cov_3.166667_1	78	2	0.96	Mock__1
1.0	5.69	Mock__1_NODE_6010_length_280_cov_3.377778_2	32	6	0.73	Mock__1
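
Because this table has one row per gene per specimen, it is often convenient to filter it or reshape it into a gene-by-specimen matrix. The sketch below rebuilds the example rows as a Pandas DataFrame rather than reading results.hdf5. The 0.9 cutoff is an arbitrary illustration, and it assumes that coverage is the fraction of the gene covered by reads, which this page does not state.

```python
import pandas as pd

# Rows copied from the example table above
long_df = pd.DataFrame(
    [
        (1.0, 3.42, "Mock__2_NODE_12649_length_209_cov_2.454545_1", 69, 6, 0.97, "Mock__1"),
        (1.0, 2.48, "Mock__1_NODE_9620_length_240_cov_2.054054_1", 79, 4, 0.85, "Mock__1"),
        (0.66, 2.28, "Mock__4_NODE_3651_length_340_cov_1.985965_1", 86, 4, 1.78, "Mock__1"),
        (0.64, 1.28, "Mock__3_NODE_10136_length_235_cov_3.166667_1", 78, 2, 0.96, "Mock__1"),
        (1.0, 5.69, "Mock__1_NODE_6010_length_280_cov_3.377778_2", 32, 6, 0.73, "Mock__1"),
    ],
    columns=["coverage", "depth", "id", "length", "nreads", "std", "specimen"],
)

# Keep rows at or above an arbitrary coverage cutoff (assumption, see above)
well_covered = long_df[long_df["coverage"] >= 0.9]

# One row per gene, one column per specimen; genes absent from a specimen become 0
wide_depth = long_df.pivot(index="id", columns="specimen", values="depth").fillna(0)
```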

Annotations

The taxonomic assignment for each gene (DIAMOND output in taxonomic assignment mode) in TSV format will be written to annot/genes.tax.aln.gz. It will also be available in results.hdf5 under /annot/gene/tax.

The functional assignment for each gene (eggNOG output) in TSV format will be written to annot/genes.emapper.annotations.gz. It will also be available in results.hdf5 under /annot/gene/eggnog.

Statistical Analysis

A CSV with the Corncob results for this dataset will be written to stats/corncob.results.csv. It will also be available in results.hdf5 under /stats/cag/corncob.
