Skip to content

pipeline for optimizing the clustering threshold in ipyrad RADseq assemblies

Notifications You must be signed in to change notification settings

rancilhac/optiRADCT

Repository files navigation

optiRADCT

OptiRADCT is a pipeline for optimizing Clustering Threshold values in Restrinction-site Associated DNA sequencing (RADseq) data assemblies performed with ipyrad for phylogenetic and phylogeography studies. It can be use to assemble RADseq datasets with a range of clustering parameters and vizualise how properties of the assemblies (e.g., proportion of informative sites and of missing data) and downstream phylogenetic inferences are affected by the values used.

Dependencies

The use of OptiRADCT presupposes a good understanding of the way ipyrad works. I strongly encourage the users to first get familiar with ipyrad's documentation.

Usage

A typical RADseq analysis workflow with de novo assembly would include the following steps:

  • Step 1: Data demultiplexing and reads filtering. This can be performed by running steps 1 & 2 of ipyard, or using the process_radtags module of stacks.
  • Step 2: Optimization of the clustering parameters.
    • Step 2.1: Optimization of the intra-samples clustering threshold, which is used to cluster reads within each samples.
    • Step 2.2: Optimization of the between-samples clustering threshold, which is used to cluster consensus sequences across samples.
  • Step 3: Additionnal filtering of the dataset assembled with "optimal" parameters, and data formating. This is performed by re-running step 7 of ipyrad on the "best" assembly produced in Step 2.2
  • Step 4: Downstream analyses.

OptiRADCT is meant to facilitate data exploration and vizualisation for the optimization step 2. Because ipyrad encodes a unique clustering threshold value for intra-samples and across-samples clustering, this is done by running ipyrad sequentially twice. For more detailed explanations, refer to Rancilhac et al. (2023) BioRxiv.

1) optimization of the intra-samples Clustering Threshold (1_run_iCT.sh)

in this first part, intra-samples clustering is performed with a user-specified range of thresholds (iCTs). Between samples clustering can be subsequently completed in each assembly using a user-specified standardized between-samples threshold (bCT).

bash 1_run_iCT.sh -d /path/to/data -n assemblies_base_name -w /path/to/work/dir -i '0.80 0.81 0.82 0.83' -b 0.9 -D 8 -T 4 -t pairddrad -s /path/to/scripts -p -r /path/to/reference.tre

arguments:

-d | --data : full path to a directory containing the demultiplexed data. Files are expected to be gzip-compressed (.gz extension) and all .gz files in the directory will be used. See ipyrad's documentation for more details on files naming conventions.

-n | --name : a base name for ipyrad's assemblies.

-w | --work : full path to working directory where ipyrad assemblies and optiRADCT will be stored

-i | --iCT : the range of iCTs to test, provided between apostrophes and seprated by single spaces, e.g. '0.80 0.85 0.90'.

-b | --bCT : standardized clustering threshold to be used for between-samples clustering (e.g. 0.90) (optional)

-D | --depth : minimum depth of ipyrad's clusters (ipyrad parameter #11 & #12)

-T | --threads : number of threads to use (default=1)

-t | --type : the data type used (one of: rad, ddrad, gbs, pairddrad, pairgbs, 2brad, pair3rad; ipyard parameter #7)

-s | --scripts : full path to the directory containing the optiRADCT scripts

-p | --tree : if this option is specified, maximum likelihood trees will be inferred with RAxML for each iCT.

-r | --ref : can be used to specify a reference tree in newick format to compare the RAxML trees to (using topological distances). The tip labels must be identical to the samples names in the assemblies. Used only if -p is specified.

-h | --help : prints a help message and exits.

This script will execute ipyrad's steps 1-5 for each of the iCTs specified with -i. If -b is specified, it will then finish each ipyrad run (i.e., run steps 6-7) with the standardized bCT. Assembly metrics for the clusters and loci are harvested internaly with harvest_stats.sh and plots are produced. The assembly metrics are saved in the directory stats_iCT, and plots vizualizing the different metrics can be found in stats_iCT/iCT_plots.pdf. Finally, if -p if specified, a directory trees_iCT is created, containing the assemblies concatenation matrices in phylip format, the RAxML trees and plots showing changes in branch length and bootstrap support in .

2) optimization of the between-samples Clustering Threshold (3_run_bCT.sh)

Once the best iCT value has been chosen, it is time to optimize the between-samples clustering threshold. This is done by using the clusters generated by ipyrad at the previous step with the chosen iCT to repeat steps 6-7 with a range of bCT.

bash 3_run_bCT.sh -n assemblies_base_name -b '0.80 0.85 0.90' -w /path/to/working/directory -i params-best_iCT.txt -T 4 -s /path/to/scripts -p -r reference.tre

arguments:

-n | --name : a base name for ipyrad's assemblies

-b | --bCT : the range of bCTs to test, provided between apostrophes and seprated by single spaces, e.g. '0.80 0.85 0.90'.

-w | --work : path to working directory, where the outputs of the previous step are saved.

-i | --iCT : name of the parameters file generated by ipyrad corresponding to the chosen iCT value. Assuming the chosen value is 0.95 and the base name in the previous step (-n) was test, the file will be names params-test_iCT_95.txt.

-T | --threads: number of threads to use (default=1)

-s | --scripts : : full path to the directory containing the optiRADCT scripts

-p | --tree : if this option is specified, maximum likelihood trees will be inferred with RAxML for each iCT.

-r | --ref : can be used to specify a reference tree in newick format to compare the RAxML trees to (using topological distances). The tip labels must be identical to the samples names in the assemblies. Used only if -p is specified.

-h | --help : prints a help message and exits.

The outputs are similar to that of the previous step, and saved in the directories stats_bCT and trees_bCT

Note: the R scripts, harvest_stats.sh and 5_infer_trees.sh are executed internally, so that only 1_run_iCT.sh and 3_run_bCT.sh need to be run by the user.

About

pipeline for optimizing the clustering threshold in ipyrad RADseq assemblies

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published