Population-based framework for introgression/selection/resequencing experiments
Designed as an extension of the "PsiSeq" protocol, which identifies candidate genomic loci in selection/backcross experiments (Earley & Jones 2011).
PopPsiSeq updates the original protocol in several ways:
- consolidation into a Snakemake pipeline simplifies a somewhat unruly workflow
- improved data QA/QC, quality & uniqueness of mapping
- uses empirical sequenced reads rather than a fragmented reference genome to characterize species
- incorporates software advances in variant calling (eg, freebayes rather than directly examining pileup), data processing & visualization (eg, ggplot), and other utilities (eg, vcftools, bedtools)
- PsiSeq uses a reciprocal mapping scheme to call variants (eg, simulans reads vs sechellia reference and sechellia reads vs simulans reference), whereas PopPsiSeq currently maps both to a third, common reference (eg, simulans reads & sechellia reads vs melanogaster reference).
- PsiSeq assumes that differences between species are fixed; PopPsiSeq examines local changes in allele frequency (of which fixation is an extreme case).
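To make the last point concrete, here is a minimal Python sketch of a per-site allele frequency shift. It illustrates the idea under stated assumptions and is not the logic of scripts/freqShifter.R: the function name, the polarization rule (sign oriented toward the donor parent), and the toy frequencies are all hypothetical.

```python
from typing import Optional


def polarized_shift(recurrent_freq: float, donor_freq: float,
                    offspring_freq: float) -> Optional[float]:
    """Shift of the offspring away from the recurrent parent at one site.

    All arguments are frequencies of the same (alternate) allele.
    The sign is polarized so that positive values point toward the donor
    parent (an assumed convention). Returns None when the parents do not
    differ, because such a site carries no information about ancestry.
    """
    parental_difference = donor_freq - recurrent_freq
    if parental_difference == 0:
        return None
    direction = 1.0 if parental_difference > 0 else -1.0
    return direction * (offspring_freq - recurrent_freq)


# Fixed difference between parents (the case PsiSeq assumes).
fixed = polarized_shift(recurrent_freq=0.0, donor_freq=1.0, offspring_freq=0.25)

# Segregating site: the parents differ in frequency but neither is fixed.
segregating = polarized_shift(recurrent_freq=0.8, donor_freq=0.2, offspring_freq=0.5)

print(fixed, segregating)
```

In this sketch a fixed difference is simply the special case where the parental frequencies are 0 and 1, so the same function covers both situations.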
The core pipeline is contained in Snakefile and expects a configuration file that supplies operational information (such as the path to the reference genome files) and metadata about the samples analyzed (including their files' paths and their relationship to the backcross).
As an example, the data from Earley & Jones (2011) can be reanalyzed alongside other published DNA-Seq data from Drosophila simulans and Drosophila sechellia by running the snakemake command:
snakemake data/ultimate/freq_shift/freebayes/all.Earley2011_with_allSim_and_allSech.vs_dm6.bwaUniq.windowed_w100000_s100000.frqShift.bed --configfile configurations/config.basicExample.yaml
This will download the reads from NCBI, map them to the dm6 reference genome with bwa (with filtering), call variants with freebayes, calculate the allele frequency shift, and smooth it over bookended 100 kb genomic windows.
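The final smoothing step can be pictured with a short Python sketch. This is an illustration rather than the pipeline's implementation (which relies on tools such as bedtools): the function name, the tuple layout, and the toy sites are assumptions. It reads the `w100000_s100000` part of the target filename as a window width and a step size; when the two are equal the windows are bookended, so each site falls in exactly one window.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int, float]         # (chromosome, 0-based position, per-site value)
Window = Tuple[str, int, int, float]  # (chromosome, start, end, mean value), BED-like


def window_means(sites: Iterable[Site], width: int = 100_000,
                 step: int = 100_000) -> List[Window]:
    """Average per-site values within fixed genomic windows.

    Windows start at multiples of `step` and span `width` bases.
    Simplifications: windows without sites are omitted, and windows are
    not truncated at chromosome ends.
    """
    sums: Dict[Tuple[str, int], float] = defaultdict(float)
    counts: Dict[Tuple[str, int], int] = defaultdict(int)
    for chrom, pos, value in sites:
        # Walk backwards from the last window starting at or before pos,
        # adding the site to every window that still covers it.
        start = (pos // step) * step
        while start >= 0 and start + width > pos:
            sums[(chrom, start)] += value
            counts[(chrom, start)] += 1
            start -= step
    return [
        (chrom, start, start + width, sums[(chrom, start)] / counts[(chrom, start)])
        for chrom, start in sorted(sums)
    ]


# Toy per-site frequency shifts on one chromosome arm.
toy_sites = [("chr2L", 10_000, 0.25), ("chr2L", 90_000, 0.75), ("chr2L", 150_000, 0.5)]
for window in window_means(toy_sites):
    print(*window, sep="\t")
```

Setting `step` smaller than `width` would give overlapping (sliding) windows instead, at the cost of each site contributing to several window means.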
The core pipeline can be included as a module in a larger workflow. As a simple example, this command acts as a wrapper for the above data generation; it will build the data, summarize/visualize it, and also quantify/document the workflow itself:
snakemake --snakefile workflows/Snakefile.basicExample --configfile configurations/config.basicExample.yaml
The original PsiSeq, as well as an intermediate rewrite (PsiSeq2), are included as unsupported legacy code. Their pipelines can be imported by including utils/modules/Snakefile.legacy. An example workflow can be run:
snakemake data/ultimate/shared_SNPs/PsiSeq/droSim1/bwaUniq/Earley2011.SNPs_shared_with.fragSimulated_dSec1.vs_droSim1.bwaUniq.genomeWindowed_w100000_s100000.bed data/ultimate/shared_SNPs/PsiSeq2/droSim1/bwaUniq/Earley2011.SNPs_shared_with.fragSimulated_dSec1.vs_droSim1.bwaUniq.genomeWindowed_w100000_s100000.bed --snakefile workflows/Snakefile.legacy --configfile configurations/config.legacyExample.yaml
or, with self-documentation:
snakemake --snakefile workflows/Snakefile.legacyExample --configfile configurations/config.legacyExample.yaml
The PopPsiSeq algorithm was used to analyze a backcross & introgression experiment in (citation). These results can be generated by running:
snakemake --configfile configurations/config.Moehring2024.yaml --snakefile workflows/Snakefile.Moehring2024
This analysis was originally written as a test comparison of the PopPsiSeq algorithm with earlier versions. This comparison, including the legacy results, can be generated with:
snakemake results/Moehring_PsiSeqDev.pdf --configfile configurations/config.Moehring2024.yaml --snakefile workflows/Snakefile.Moehring2024
This workflow illustrates how modules in the original pipelines can be swapped out (eg, the smrtFreeBayes variant caller and the PsiSeq2_relaxed) as well as built upon with project-specific tasks.
A look back at the development of the PsiSeq software, comparing versions 1 and 2 with the present algorithm on a variety of data sets. The real results are the population genetics we met along the way.
snakemake --configfile configurations/config.PsiSeqDeepDive.yaml --snakefile workflows/Snakefile.PsiSeqDeepDive
In development, with unpublished data. Coming soon!
PopPsiSeq/
├── configurations # configuration files - sample metadata, important filepaths, etc
│   ├── config.basicExample.yaml
│   └── ...
├── data
│   ├── external # SRA downloads stored here
│   ├── intermediate # alignments, variant calls, etc
│   ├── raw # ie, unpublished
│   ├── summaries # summary data eg read QC
│   └── ultimate # windowed results are stored here
├── markdowns # markdown files for self-summary and writeup
│   ├── PopPsiSeq_basicExample.Rmd
│   └── ...
├── README.md
├── scripts
│   ├── freqShifter.R # this is the script that polarizes and calculates the allele shift
│   └── legacy # unsupported code from v1 and v2
│       ├── PsiSeq
│       └── PsiSeq2
├── Snakefile # core pipeline
├── utils
│   ├── genelists
│   ├── genome_windows
│   ├── legacy
│   │   └── PsiSeq.zip # the SI for Earley 2011 is not currently available so it's mirrored here
│   └── modules # useful sub-pipelines
│       ├── Snakefile.legacy
│       └── Snakefile.popgentools
└── workflows # example use cases
    ├── Snakefile.basicExample
    └── ...
Earley, Eric J., and Corbin D. Jones. 2011. “Next-generation mapping of complex traits with phenotype-based selection and introgression.” Genetics 189 (4): 1203–9. doi:10.1534/genetics.111.129445.