00. These scripts are intended purely as a descriptive extension of the computational methods used in the MARsym paper. They are not intended as tools for further use and should be treated accordingly.

01. Mapping: mapping.bbmapv36x.sh

[01] Purpose of script:

Shows the detailed commands used for read mapping in the MARsym paper. Reads of quality q20 were mapped with BBMap to a consensus reference sequence, using a minimum nucleotide identity of 0.95.

[01] Programs that need to be installed to execute script:

BBMap v36.x https://sourceforge.net/projects/bbmap/ (for later versions, command details might need to be adapted)
samtools https://github.com/samtools/samtools

[01] Required input files:

Illumina reads in gzipped fastq format for each sample s1, s2, s3, ... sn (e.g. s1.fq.gz)
Reference sequence in fasta format: ref.fasta. Note: the script uses one consensus reference for all samples. If you have one reference per sample, replace every $ref in the code with $sample.
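The $ref to $sample substitution described above could be done by hand or with sed. The following is only a sketch on a toy script: the file names toy_mapping.sh and toy_mapping.per_sample_ref.sh and the toy script content are hypothetical and not taken from the repository.

```shell
#!/bin/sh
# Toy stand-in for one of the workflow scripts (hypothetical content,
# not copied from the repository).
workdir=$(mktemp -d)
cat > "$workdir/toy_mapping.sh" <<'EOF'
for sample in s1 s2 s3; do
  echo "mapping $sample.fq.gz against $ref.fasta"
done
EOF

# Replace every literal "$ref" with "$sample" and write the result to a
# new file, leaving the original script untouched.
sed 's/\$ref/$sample/g' "$workdir/toy_mapping.sh" > "$workdir/toy_mapping.per_sample_ref.sh"

cat "$workdir/toy_mapping.per_sample_ref.sh"
```

Note that this simple pattern would also rewrite any longer variable name that starts with $ref, so the result should be checked by eye before running it.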

[01] Example of output files for sample s1

mapping.$DATE.log - logfile for mapping (all samples s1, s2, s3, ... sn)
coverage_depth_bam.txt - coverage depths of the mapping file of each sample (s1, s2, s3, ... sn)
coverage_depth_rmdup.txt - coverage depths of the mapping file of each sample (s1, s2, s3, ... sn) after PCR duplicate removal
s1.q20.bam - sorted and indexed mapping file
s1.rmdup.bam - sorted and indexed mapping file after PCR duplicate removal

02. SNPcalling: SNPcalling.gatk3.3.0.sh

[02] Purpose of script:

Shows the detailed commands used for SNP calling in the MARsym paper. The script calls SNPs from reads mapped to a reference, using GATK with a ploidy setting of 10. Input files can be created with MARsym_mapping (see 01. Mapping).

[02] Programs that need to be installed to execute script:

Genome Analysis Toolkit (GATK) v3.3.0 https://software.broadinstitute.org/gatk/download/ (for later versions, command details might need to be adapted)
picard-tools v1.119 https://github.com/broadinstitute/picard (for later versions, command details might need to be adapted)
samtools https://github.com/samtools/samtools
ucsc tools (executable faCount) https://github.com/adamlabadorf/ucsc_tools

[02] Required input files:

Reference sequence in fasta format: ref.fasta. Note: the script uses one consensus reference for all samples. If you have one reference per sample, replace every $ref in the code with $sample.
Sorted and indexed bam files of reads mapped to the reference fasta, with PCR duplicates removed: s1.id95.rmdup.bam s2.id95.rmdup.bam s3.id95.rmdup.bam ... sn.id95.rmdup.bam, where s1, s2, s3, ... sn are the individual $samples. Input bam files can be created with MARsym_mapping.

[02] Example of output files for sample s1 with target read coverage of 100x

ref.dict - reference dictionary (for all samples s1, s2, s3, ... sn)
SNPcounts.txt - final SNP counts: absolute and per kbp (for all samples s1, s2, s3, ... sn)
SNPcalling.$DATE.log - logfile (for all samples s1, s2, s3, ... sn)
s1.real.bam - bamfiles with readgroups and realigned reads around INDELs
100x.s1.bam - downsampled bam file (with readgroups and realigned reads around INDELs) to target read coverage 100x
100x.s1.rawVar_q30_ploidy10.vcf - raw variants called with ploidy 10
100x.s1.rawSNPs_ploidy10.vcf - raw SNPs called with ploidy 10
100x.s1.rawINDELs_ploidy10.vcf - raw INDELs called with ploidy 10
100x.s1.filtSNPs_ploidy10.vcf - filtered SNPs: SNPs that don't pass the filter are flagged with corresponding filter(s)
100x.s1.filtINDELs_ploidy20.vcf - filtered INDELs: INDELs that don't pass the filter are flagged with corresponding filter(s)
100x.s1.filtSNPs_PASS_ploidy10.vcf - file with only those SNPs that passed all filters
100x.s1.filtINDELs_PASS_ploidy10.vcf - file with only those INDELs that passed all filters
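As a rough illustration of how a PASS-only file and an absolute and per-kbp SNP count relate to a filtered VCF, the sketch below works on toy data with awk and grep. It is not the code from the SNP calling script: the toy file names, the filter name toyFilter, and the use of awk (rather than faCount) to get the reference length are all assumptions made for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Toy reference (20 bp) and toy filtered VCF; both are illustrative only.
printf '>contig1\nACGTACGTAC\nGTACGTACGT\n' > "$workdir/ref.fasta"
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'contig1\t3\t.\tG\tA\t50\tPASS\t.\n'
  printf 'contig1\t8\t.\tT\tC\t12\ttoyFilter\t.\n'
  printf 'contig1\t15\t.\tC\tT\t60\tPASS\t.\n'
} > "$workdir/toy.filtSNPs.vcf"

# Keep header lines plus records whose FILTER column (7th field) is PASS.
awk -F'\t' '/^#/ || $7 == "PASS"' "$workdir/toy.filtSNPs.vcf" > "$workdir/toy.filtSNPs_PASS.vcf"

# Absolute SNP count: all non-header lines of the PASS-only file.
n_snps=$(grep -vc '^#' "$workdir/toy.filtSNPs_PASS.vcf")

# Reference length: sum of all sequence line lengths in the fasta file.
ref_len=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' "$workdir/ref.fasta")

# SNPs per kbp of reference.
snps_per_kbp=$(awk -v s="$n_snps" -v l="$ref_len" 'BEGIN { printf "%.2f", s / l * 1000 }')

echo "SNPs: $n_snps  reference length: $ref_len bp  SNPs per kbp: $snps_per_kbp"
```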

03. Strain number estimation: geneHaplotyping.viquas1.3.sh

[03] Purpose of script:

Shows detailed commands that were used for estimating the number of gene versions for the provided set of genes using the tool ViQuaS in the MARsym paper.

[03] Programs that need to be installed to execute script:

ViQuaS https://academic.oup.com/bioinformatics/article/31/6/886/215466
samtools https://github.com/samtools/samtools

[03] Required input files:

Reference sequence in fasta format: ref.fasta. Note: the script uses one consensus reference for all samples. If you have one reference per sample, replace every $ref in the code with $sample.
s1.real.bam - bamfiles with readgroups and realigned reads around INDELs OR (depends on sample/question)
100x.s1.bam - downsampled bam file (with readgroups and realigned reads around INDELs) with target read coverage 100x

[03] Example of output files for sample s1

Spectrum-file contains the reconstructed sequences in fasta format and the abundance of each sequence.
Richness-file contains the f_min value. We discarded all reconstructed sequences below this frequency.
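The f_min cutoff can be pictured with the small sketch below. It deliberately does not parse ViQuaS output: it assumes the sequence IDs and frequencies have already been written to a hypothetical two-column, tab-separated table (toy_frequencies.tsv), and the f_min value of 0.10 is a placeholder rather than a value from the paper.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical table: sequence ID <tab> relative frequency.
printf 'hap1\t0.62\nhap2\t0.30\nhap3\t0.05\nhap4\t0.03\n' > "$workdir/toy_frequencies.tsv"

# Placeholder cutoff; in practice this would be read from the Richness-file.
f_min=0.10

# Keep sequences at or above f_min, i.e. discard everything below it.
awk -F'\t' -v fmin="$f_min" '$2 + 0 >= fmin + 0 { print $1 }' \
  "$workdir/toy_frequencies.tsv" > "$workdir/kept_sequences.txt"

# The number of retained sequences is the estimated number of gene versions.
n_kept=$(wc -l < "$workdir/kept_sequences.txt" | tr -d ' ')
echo "retained sequences: $n_kept"
```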

04. Identification of low-coverage genes: LowCovGene_identification.sh

[04] Purpose of script:

Shows the detailed commands used to identify genes whose coverage falls below the coverage range of gammaproteobacterial marker genes. These genes were classified as strain-specific in the MARsym paper.

[04] Programs that need to be installed to execute script:

samtools https://github.com/samtools/samtools
ucsc tools (executable faCount) https://github.com/adamlabadorf/ucsc_tools
bedtools https://github.com/arq5x/bedtools2
R https://www.r-project.org/
Phyla_AMPHORA https://github.com/martinwu/Phyla_AMPHORA

[04] Required input files:

Reference sequence in fasta format: ref.fasta
Annotations of reference in gff3 format: ref.gff
Predicted amino acid sequences of proteins: ref.faa
s1.real.bam - bamfiles with readgroups and realigned reads around INDELs OR (depends on sample/question)
100x.s1.bam - downsampled bam file (with readgroups and realigned reads around INDELs) with target read coverage 100x

[04] Example of output files

(among others)

LowCovGenes_$sample - list with low-coverage genes
LowCovGenes_noHypo_$sample - list with low-coverage genes, excluding hypothetical proteins
$ref.plots.pdf - plots with the coverage distribution of genes
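The idea of flagging genes whose coverage lies below the range spanned by the marker genes can be sketched as follows. This is a simplified illustration, not the logic of the actual script: the two input tables (toy_gene_coverage.tsv and toy_marker_genes.txt), the gene names, and the choice of the lowest marker-gene coverage as the cutoff are assumptions made for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Hypothetical per-gene coverage table: gene ID <tab> mean coverage.
printf 'geneA\t95\ngeneB\t102\ngeneC\t31\ngeneD\t88\ngeneE\t12\n' > "$workdir/toy_gene_coverage.tsv"

# Hypothetical list of marker gene IDs (one per line).
printf 'geneA\ngeneB\ngeneD\n' > "$workdir/toy_marker_genes.txt"

# Pass 1 (marker list): remember marker IDs.
# Pass 2 (coverage table): store every gene and track the lowest marker coverage.
# END: print all genes whose coverage is below that minimum.
awk -F'\t' '
  FNR == NR { marker[$1] = 1; next }
  { cov[$1] = $2 + 0; order[++n] = $1 }
  ($1 in marker) { if (!seen || $2 + 0 < min) { min = $2 + 0; seen = 1 } }
  END {
    if (!seen) exit 1   # no marker gene found in the coverage table
    for (i = 1; i <= n; i++) if (cov[order[i]] < min) print order[i]
  }
' "$workdir/toy_marker_genes.txt" "$workdir/toy_gene_coverage.tsv" > "$workdir/toy_low_cov_genes.txt"

cat "$workdir/toy_low_cov_genes.txt"
```

In this toy data the marker genes span a coverage of 88 to 102, so geneC and geneE fall below the range and are listed.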

05. Produce figures and analyses from source data: all .Rmd scripts

With the manuscript we submitted Excel files of source data for Figures 2, 3, 4; Extended Data 5, 6, 7; and the PERMANOVA analysis of pi values. Every Excel sheet needs to be exported to a separate file, and the term "PATH" in the .Rmd scripts needs to be replaced with the location of those files. (To access the Excel files, please wait for publication of the manuscript.)
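Replacing the "PATH" placeholder can be done in a text editor or, as sketched below, with sed. The example runs on a toy file: the file names toy.Rmd and toy.local.Rmd, the read.csv line, and the directory /home/user/MARsym_source_data are hypothetical, and how "PATH" actually appears in the .Rmd scripts is assumed here.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Toy stand-in for one of the .Rmd scripts (hypothetical content).
cat > "$workdir/toy.Rmd" <<'EOF'
dat <- read.csv("PATH/Fig2_sheet1.csv")
EOF

# Hypothetical location of the exported source data files.
data_dir="/home/user/MARsym_source_data"

# Use "|" as the sed delimiter so the slashes in the directory need no escaping.
sed "s|PATH|$data_dir|g" "$workdir/toy.Rmd" > "$workdir/toy.local.Rmd"

cat "$workdir/toy.local.Rmd"
```

Writing to a new file keeps the original script unchanged. The substitution assumes the directory name contains neither "|" nor "&", which sed would interpret specially.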

--

If you find these workflows useful and use some of them in your study, please cite the following manuscript and the studies that developed the bioinformatic tools within the workflow: Ansorge, R., Romano, S., Sayavedra, L., Kupczok, A., Tegetmeyer, H. E., Dubilier, N., Petersen, J. Diversity matters: Deep-sea mussels harbor multiple symbiont strains, in prep
