DiPPER2: Diagnostic Primer Picking and Evaluation pipeline for Reliability and Reproducibility
+++++NOTE - IMPORTANT:+++++
This is a work in progress.
++++++++++++++++++++++++++
This pipeline and its modules are meant to make finding diagnostic targets reliable and reproducible, and to make picking primers for those targets as user-friendly as possible. The approach is phylogeny-driven and clade-specific.
The pipeline, once functional, automatically finds unique genomic regions for a target species, pathovar or race with moderate user intervention, automatically picks primers for these regions that are amenable to both conventional PCR and quantitative PCR, and finally tests them for specificity and sensitivity.
DiPPER2 is a tool to find unique genomic regions of target species, build primers (conventional or qPCR) for them, validate them in silico, define the identity of the unique genomic regions (either by homology or by genome coordinates) and produce a user-friendly HTML report (and a more machine-readable .txt report).
Key features of DiPPER2:
- generates an HTML and a .txt report with the primers, the identity of the region the primers target and information on specificity and sensitivity
- generates .txt files (in fasta format) with primers for the unique genomic regions found
- generates fasta files with the sequence of the unique genomic regions the primers target
- generates .txt files that contain information on the primers' characteristics (Tm, GC, product size etc.)
- generates .bed files of unique genomic regions that cannot be identified by homology (blastx). These can be used in IGV to find where the unique genomic region falls in a reference genome (either provided, or longest)
- generates .bed files for the in silico tests of primers
For more details on the outputs and the folder structure, see the 'DiPPER2 results walkthrough' section.
The following programs need to be in path:
DiPPER2 was developed using Python 3.9 and 3.12 and runs stably on both versions, but Python >=3.12 is recommended.
DiPPER2 automatically generates a Python environment and installs the necessary Python modules defined in the requirements file.
Seqkit, BLAST+ and primer3 can be installed via conda. primer3 can also be installed using homebrew.
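Before running the pipeline, it can help to confirm that the external programs are actually found on the PATH. The following is a minimal sketch, not part of DiPPER2 itself. The executable names in `ASSUMED_EXECUTABLES` are assumptions based on the usual command names of SeqKit, BLAST+, Primer3, FUR and Entrez Direct; adjust the list to whatever your installation provides.

```python
import shutil

# Assumed executable names (not taken from the DiPPER2 documentation):
# adjust this list to match the programs your installation requires.
ASSUMED_EXECUTABLES = [
    "seqkit",        # SeqKit
    "blastx",        # BLAST+
    "primer3_core",  # Primer3
    "fur",           # FUR
    "makeFurDb",     # FUR database builder
    "esearch",       # Entrez Direct
    "efetch",        # Entrez Direct
]


def find_missing(executables):
    """Return the names in `executables` that cannot be located on the PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All assumed executables were found on PATH.")
```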
Assuming all the required programs listed in Requirements are installed and in path, DiPPER2 can be installed by cloning the repo:
git clone https://github.com/ThWacker/DiPPER2.git
- subfolders:
  - FUR.P3.PRIMERS
  - FUR.P3.PRIMERS/primer_data
  - FUR.P3.PRIMERS/in_silico_tests
  - FUR.P3.TARGETS
- files:
  - Results.txt and Results.html: the results summary files
  - <Prefix>.targets.txt and <Prefix>.neighbours.txt: files listing the accessions/files used as targets and neighbours
  - <Prefix>.FUR.db.out.txt: the FUR results file (a multifasta file)
  - <Prefix>.FUR.db.out.primers.txt: the FUR results file reformatted in the format required by Primer3 as an input
  - <Prefix>.FUR.db.out.primers.primer3_out.txt: the primer3 output results file
  - FUR_Summary_output.txt: FUR output to STDERR which summarizes how many sequences were retained after each processing step, including the length in bp and the number of Ns
  - optional: bed files generated by seqkit locate
- within FUR.P3.PRIMERS: Primer fasta files (.txt ending, but fasta formatted). Primers are numbered; the number is the unique identifier of the primer.
- within FUR.P3.PRIMERS/primer_data: Text files with information about Tm, amplicon length, GC etc. These follow the Primer3 conventions; see the Primer3 manual for an explanation of the output tags (https://primer3.org/manual.html#outputTags).
- within FUR.P3.PRIMERS/in_silico_tests: The results of the in silico PCR tests, performed using seqkit locate. Files with 'target' in the name are used for sensitivity testing, running an in silico PCR against the concatenated targets. Files with 'neighbour' in the name are used for specificity testing, running an in silico PCR against the concatenated neighbours. The 'm' in the name refers to the number of mismatches allowed in the primer.
- within FUR.P3.TARGETS: Fasta files with 'Target' in the name, followed by a number that matches the unique identifier of the primer pair; each contains the target sequence. The blastx results are also found here. These are tab-separated, and the headers of the blastx files are "qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore".
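The tab-separated blastx results can be loaded with a few lines of Python. The sketch below uses the column names listed above. The function names `read_blastx` and `best_hit` are hypothetical helpers, and the two example rows (including the accessions) are invented for illustration. Because it is not stated whether the files contain a header line, the parser skips one if it is present.

```python
import csv
import io

# Column names as listed for the blastx result files.
BLASTX_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
FLOAT_FIELDS = ("pident", "evalue", "bitscore")
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")


def read_blastx(handle):
    """Parse tab-separated blastx hits into dicts with numeric fields converted."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comment lines and an optional header line.
        if not row or row[0].startswith("#") or row == BLASTX_COLUMNS:
            continue
        hit = dict(zip(BLASTX_COLUMNS, row))
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        hits.append(hit)
    return hits


def best_hit(hits):
    """Return the hit with the highest bitscore, or None if there are no hits."""
    return max(hits, key=lambda h: h["bitscore"]) if hits else None


# Invented example rows; real files live in FUR.P3.TARGETS.
EXAMPLE = (
    "Target_1\tHYPOTHETICAL_A\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-100\t400.0\n"
    "Target_1\tHYPOTHETICAL_B\t75.0\t180\t45\t2\t10\t550\t5\t185\t2e-40\t150.5\n"
)
example_hits = read_blastx(io.StringIO(EXAMPLE))
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper, for example `read_blastx(open(path, newline=""))`.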
All relevant information is found in the results files, which indicate whether a primer passed or failed. Currently, the results files do not contain amplicon lengths, Tms etc.; please refer to the text files in the FUR.P3.PRIMERS/primer_data folder for those.
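To pull Tm, GC content and amplicon length out of a primer_data file, a small parser is enough. This sketch assumes the files contain Primer3-style `TAG=value` lines; the exact layout of the DiPPER2 files is not described here, so treat that as an assumption. The tag names are standard Primer3 output tags. `parse_boulder` and `summarise_pair` are hypothetical helper names, and the example values are invented.

```python
def parse_boulder(text):
    """Collect TAG=value lines into a dict; a lone '=' ends the record."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "=":
            break
        if "=" in line:
            tag, value = line.split("=", 1)
            record[tag] = value
    return record


def summarise_pair(record, index=0):
    """Extract Tm, GC content and product size for primer pair `index`."""
    return {
        "left_tm": float(record[f"PRIMER_LEFT_{index}_TM"]),
        "right_tm": float(record[f"PRIMER_RIGHT_{index}_TM"]),
        "left_gc": float(record[f"PRIMER_LEFT_{index}_GC_PERCENT"]),
        "right_gc": float(record[f"PRIMER_RIGHT_{index}_GC_PERCENT"]),
        "product_size": int(record[f"PRIMER_PAIR_{index}_PRODUCT_SIZE"]),
    }


# Invented example values, for illustration only.
EXAMPLE = """PRIMER_LEFT_0_TM=59.8
PRIMER_RIGHT_0_TM=60.2
PRIMER_LEFT_0_GC_PERCENT=50.0
PRIMER_RIGHT_0_GC_PERCENT=55.0
PRIMER_PAIR_0_PRODUCT_SIZE=150
=
"""
summary = summarise_pair(parse_boulder(EXAMPLE))
```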
Up to 4 primers are picked, but Primer3 sometimes generates fewer than 4. Having fewer than 4 primers in the results does not mean the run/pipeline did not complete.
Currently, the default target parameters for conventional primers are:
primMinTm=58 primOptTm=60 primMaxTm=62 inMinTm=63 inOptTm=65 inMaxTm=67 prodMinSize=200 prodMaxSize=1000 Oligo=0
The default target parameters for qPCR primers are:
primMinTm=58 primOptTm=60 primMaxTm=62 inMinTm=63 inOptTm=65 inMaxTm=67 prodMinSize=100 prodMaxSize=200
This means that conventional primers have a target optimum Tm of 60 °C and a product size of 200-1000 bp, and qPCR primers a Tm of 60 °C and an amplicon size of 100-200 bp. For qPCR primers, an internal probe with an optimal Tm of 65 °C is also picked.
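As an illustration of how these defaults could translate into a Primer3 input record, the sketch below maps them onto standard Primer3 global tags. The mapping is an assumption, not a description of DiPPER2's internals. `Oligo=1` for qPCR is also assumed: the qPCR defaults above list no Oligo value, but an internal probe is picked for qPCR. `to_boulder` is a hypothetical helper name.

```python
# Defaults as quoted above.
CONVENTIONAL = {
    "primMinTm": 58, "primOptTm": 60, "primMaxTm": 62,
    "inMinTm": 63, "inOptTm": 65, "inMaxTm": 67,
    "prodMinSize": 200, "prodMaxSize": 1000, "Oligo": 0,
}
# Assumption: Oligo=1 switches on internal probe picking for qPCR.
QPCR = dict(CONVENTIONAL, prodMinSize=100, prodMaxSize=200, Oligo=1)


def to_boulder(params, sequence_id, template):
    """Build a Primer3 Boulder-IO record from DiPPER2-style parameter names."""
    lines = [
        f"SEQUENCE_ID={sequence_id}",
        f"SEQUENCE_TEMPLATE={template}",
        f"PRIMER_MIN_TM={params['primMinTm']}",
        f"PRIMER_OPT_TM={params['primOptTm']}",
        f"PRIMER_MAX_TM={params['primMaxTm']}",
        f"PRIMER_PICK_INTERNAL_OLIGO={params['Oligo']}",
        f"PRIMER_INTERNAL_MIN_TM={params['inMinTm']}",
        f"PRIMER_INTERNAL_OPT_TM={params['inOptTm']}",
        f"PRIMER_INTERNAL_MAX_TM={params['inMaxTm']}",
        f"PRIMER_PRODUCT_SIZE_RANGE={params['prodMinSize']}-{params['prodMaxSize']}",
        "=",  # a lone '=' terminates a Boulder-IO record
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder template sequence, for illustration only.
    print(to_boulder(QPCR, "Target_1", "ACGTACGTACGT"))
```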
Either open an issue in this repo or contact Theresa Wacker at t[dot]wacker2[at]exeter[dot]ac[dot]uk.
DiPPER2 uses FUR, primer3, seqkit, BLAST+ and E-utilities, all of which have been published. When using DiPPER2, please also cite:
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K. and Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinformatics, 10(1), p.421.
Haubold, B., Klötzl, F., Hellberg, L., Thompson, D. and Cavalar, M. (2021). Fur: Find unique genomic regions for diagnostic PCR. Bioinformatics, 37(15), pp.2081–2087. doi:https://doi.org/10.1093/bioinformatics/btab059.
Kans, J. (2016). Entrez Direct: E-utilities on the UNIX Command Line. [online] Available at: https://www.ncbi.nlm.nih.gov/books/NBK179288/ [Accessed 4 Dec. 2024].
Shen, W., Le, S., Li, Y. and Hu, F. (2016). SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE, 11(10), p.e0163962. doi:https://doi.org/10.1371/journal.pone.0163962.
Untergasser, A., Cutcutache, I., Koressaar, T., Ye, J., Faircloth, B.C., Remm, M. and Rozen, S.G. (2012). Primer3—new capabilities and interfaces. Nucleic Acids Research, 40(15), pp.e115–e115. doi:https://doi.org/10.1093/nar/gks596.