Recovering genes from targeted sequence capture data

polymathcat/HybSeqPipeline

HybSeq Pipeline

by Matt Johnson and Norm Wickett, Chicago Botanic Garden

Purpose: 
Targeted bait capture is a technique for sequencing many loci simultaneously based on bait sequences.
This pipeline starts with Illumina reads, and assigns them to target genes using BLASTx.
The reads are distributed to separate directories, where they are assembled separately using Velvet and CAP3. 
The main output is one FASTA file per protein, containing the CDS portion of the homologous protein sequence.

------------Dependencies----------------
Python 2.7 or later
BIOPYTHON 1.59 or later: http://biopython.org/wiki/Main_Page
EXONERATE: http://www.ebi.ac.uk/~guy/exonerate/
BLAST command line tools: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Velvet: https://www.ebi.ac.uk/~zerbino/velvet/
CAP3: http://seq.cs.iastate.edu/cap3.html
GNU Parallel: http://www.gnu.org/software/parallel/

This pipeline runs with Python, version 2.7 or later, to take advantage of the argparse module for command-line argument parsing.
It requires BIOPYTHON (version 1.59 or later) for parsing and handling nucleotide and protein sequences (www.biopython.org).
EXONERATE is used for alignment between bait proteins and assemblies, and retrieval of in-frame CDS sequences.
This script requires that the BLAST command-line tools (specifically blastx) are installed in the $PATH.
GNU Parallel is a *nix tool for running many jobs (for example, 1000s of Velvet assemblies) simultaneously on a multi-processor computer. It is required for every step of the pipeline.

-----------------------------------------

-----------------Setup-------------------
Install the dependencies.
Normal installations are fine for most tools (see below for installation of Velvet), as long as they are in your $PATH.
You can check the installation by executing the "reads_first.py" script with the --check-depend flag.
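
The --check-depend flag is the supported way to verify the installation. As a rough illustration of what such a check involves, the sketch below searches $PATH for each external program. The executable names in REQUIRED_TOOLS are assumptions based on the dependency list above, and the functions are hypothetical helpers, not part of the pipeline.

```python
import os

# Executable names are assumptions; adjust them to match your installation.
REQUIRED_TOOLS = ["blastx", "exonerate", "velveth", "velvetg", "cap3", "parallel"]


def find_on_path(name):
    """Return the full path of an executable found in $PATH, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(names=REQUIRED_TOOLS):
    """Return the names that cannot be found in $PATH."""
    return [name for name in names if find_on_path(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found in $PATH: " + ", ".join(missing))
    else:
        print("All tools found.")
```

This only confirms that the programs can be located. It does not confirm that Velvet was compiled with a large enough MAXKMERLENGTH.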

NOTE: Velvet must be compiled with the ability to handle k-mer values > 31.
	make 'MAXKMERLENGTH=67' 

--------MACOSX installation notes---------

Both Exonerate and Velvet require zlib. 
Perhaps the easiest way to install Exonerate is with homebrew: http://brew.sh/
Install homebrew, and then "tap" the science repository:
	brew tap homebrew/science
Now install exonerate:
	brew install exonerate

For velvet, this command line worked for me in Mac OS 10.9:
	make 'MAXKMERLENGTH=67' LDFLAGS='-L third-party/zlib-1.2.3/ -lm'

--------Preparing your files-------------

Construct a "bait file" of protein sequences. If you have just one sequence per bait, simply concatenate all of them into one file.
If you have more than one sequence per bait, they need to be identified before concatenation.
The ID for each sequence should include the bait source and the protein ID, separated by a hyphen. For example:
>Amborella-atpH
MNPLISAASVIAAGLAVGLASIGPGVGQGTAAGQAVEGIARQPEAEGKIRGTLLLSLAFM
[Future plans: Add helper script to do this]
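
Until such a helper exists, the renaming can be sketched in a few lines of plain Python. The function name tag_bait_headers is hypothetical, and the sketch assumes the protein ID is the first word of each existing FASTA header.

```python
def tag_bait_headers(fasta_lines, source):
    """Yield FASTA lines with each header rewritten as '>source-proteinID'."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the first whitespace-delimited word is the protein ID.
            words = line[1:].split()
            protein_id = words[0] if words else ""
            yield ">{0}-{1}".format(source, protein_id)
        else:
            yield line


if __name__ == "__main__":
    example = [">atpH ATP synthase subunit", "MNPLISAASVIAAGLAVGLASIGPGVGQGTAAGQAVEGIARQPEAEGKIRGTLLLSLAFM"]
    for out_line in tag_bait_headers(example, "Amborella"):
        print(out_line)
```

Protein IDs that already contain a hyphen would be ambiguous with this naming scheme, so it is safest to avoid them.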

To execute the entire pipeline, create a directory containing the paired-end read files.
The script "reads_first.py" will create a directory based on the fastq filenames:
	Anomodon-rostratus_L0001_R1.fastq ---> Anomodon-rostratus/
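
The exact naming rule is not spelled out here. One reading consistent with the example above is that the directory name is everything before the first underscore in the file name. The sketch below encodes that assumption; sample_dir_name is a hypothetical helper.

```python
import os


def sample_dir_name(fastq_path):
    """Guess the sample directory name from a FASTQ file name.

    Assumption: the name is the part of the file name before the first "_".
    """
    base = os.path.basename(fastq_path)
    return base.split("_")[0]
```

Under this assumption, sample names should not contain underscores themselves.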

-----------Running the pipeline----------

The following command will execute the entire pipeline on a pair of Illumina read files, using the baits in the file "baits.fasta":

python reads_first.py -r MySpecies_R1.fastq MySpecies_R2.fastq -b baits.fasta

For best results, these three input files should be in the current directory.
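
To run several samples in turn, the same command can be built and launched from Python. This is a minimal sketch that uses only the -r and -b options shown above; the function names are hypothetical, and it assumes "python" and reads_first.py can be found from the current directory.

```python
import subprocess


def reads_first_command(read1, read2, baitfile, script="reads_first.py"):
    """Build the argument list for one run of the wrapper script."""
    return ["python", script, "-r", read1, read2, "-b", baitfile]


def run_sample(read1, read2, baitfile):
    """Run the pipeline for one sample and return the exit status."""
    return subprocess.call(reads_first_command(read1, read2, baitfile))
```

For example, run_sample("MySpecies_R1.fastq", "MySpecies_R2.fastq", "baits.fasta") launches the command shown above.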



--------------Pipeline Scripts-----------

---------------reads_first.py-------------
A wrapper script that:
	1. Can check if all dependencies are installed correctly. (--check-depend)
	2. Creates sub-directories.
	3. Calls all downstream analyses

You can tell the script to skip upstream steps (for example: --no-blast), but it will still assume that the output files of those steps exist!

This script will call, in order:
	1. Blastx
	2. distribute_reads_to_targets.py and distribute_targets.py
	3. velveth and velvetg
	4. CAP3
	5. exonerate_hits.py
	
Some program-specific options may be passed at the command line, for example the e-value threshold for BLASTX (--evalue, default is 1e-9) or the coverage-cutoff level for Velvet assemblies (--cov_cutoff, default is 5).

-------distribute_reads_to_targets.py-----
After a BLASTx search of the raw reads against the target sequences, the reads need to be 
sorted according to the successful hits. This script takes the BLASTx output (tabular)
and the raw read files, and distributes the reads into FASTA files ready for assembly.

If there are multiple BLAST results (for example, one for each read direction),
concatenate them prior to sorting.
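
The sorting step can be sketched as follows. This is an illustration, not the script itself: it assumes standard tabular BLAST output (query ID in column 1, subject ID in column 2) and bait IDs of the form Source-gene, and the function name reads_by_gene is hypothetical.

```python
from collections import defaultdict


def reads_by_gene(blast_lines, delimiter="-"):
    """Group read IDs by the gene named in the subject (bait) ID."""
    genes = defaultdict(set)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        read_id, bait_id = fields[0], fields[1]
        # Assumption: the gene name follows the last delimiter in the bait ID.
        gene = bait_id.split(delimiter)[-1]
        genes[gene].add(read_id)
    return genes
```

Each set of read IDs can then be used to pull the matching records from the raw read files into one FASTA file per gene.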


----------distribute_targets.py-----------
Given a file containing all of the "baits" for a target enrichment, create separate
FASTA files with all copies of that bait. Multiple copies of the same bait can be 
specified using a "-" delimiter. For example, the following will be sorted into the same
file:

Anomodon-rbcl
Physcomitrella-rbcl

Given multiple baits, the script will choose the most appropriate 'reference' sequence
using the highest cumulative BLAST scores across all hits.

Output directories can also be created, one for each target category
	(the default is to put them all in the current one)
The field delimiter may also be changed.
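
The choice of reference can be illustrated with a short sketch. It assumes that "BLAST score" means the bit score in column 12 of standard tabular output and that bait IDs look like Source-gene; best_bait_per_gene is a hypothetical name, and the script itself may differ in detail.

```python
from collections import defaultdict


def best_bait_per_gene(blast_lines, delimiter="-"):
    """Return {gene: bait_id} for the bait with the highest summed bit score."""
    totals = defaultdict(float)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a full 12-column tabular line
        totals[fields[1]] += float(fields[11])  # column 12: bit score (assumed)

    best = {}
    for bait_id, score in totals.items():
        gene = bait_id.split(delimiter)[-1]
        if gene not in best or score > totals[best[gene]]:
            best[gene] = bait_id
    return best
```

Given hits to both Anomodon-rbcl and Physcomitrella-rbcl, whichever source accumulates the larger total becomes the reference for rbcl.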


----------------exonerate_hits.py----------
After the query_file_builder is complete, run this script next.
The minimal inputs are the tailored bait file and the assembly.

If run immediately after query_file_builder, use the --prefix flag to specify the file names:

EXAMPLE COMMAND LINE

exonerate_hits.py speciesName/baitfile.FAA speciesName/assembly.fasta --prefix=speciesName

The threshold for accepting an exonerate hit can be adjusted with -t (Default: 55 percent)


--------------RESULTS AND OUTPUT FILES---------------

Current directory: 
	the master bait file is copied here
	a BLAST database is generated
	One directory is generated for every gene with BLAST hits. 
	A file "bait_tallies.txt" summarizes which bait sources were chosen.
	
Within the Directory For Each Gene:
	Velvet and CAP3 results. Final assembly is at "GeneName_cap3ed.fa"
	Fasta file for reference bait chosen by the distribute_targets.py script.
	Directory of Exonerate results (with same name as the sample)

Within the Exonerate results directory (same name as sample)
	exonerate_results.fasta -- Results of the initial exonerate search for all contigs.
	supercontig_exonerate.fasta -- Long concatenated contig from final exonerate search.
	sequences directory

Within the sequences directory:
	FNA/GeneName.FNA: In-frame nucleotide sequence
	FAA/GeneName.FAA: Amino acid sequence.
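
For downstream scripts it can help to build these paths programmatically. The sketch below follows one reading of the layout described above (gene directory, then a directory named after the sample, then "sequences"); output_paths is a hypothetical helper and the nesting should be checked against your own results.

```python
import os


def output_paths(sample, gene, base_dir="."):
    """Return the expected FNA and FAA paths for one gene of one sample."""
    # Assumed nesting: base_dir/GeneName/SampleName/sequences/{FNA,FAA}/
    seq_dir = os.path.join(base_dir, gene, sample, "sequences")
    return {
        "FNA": os.path.join(seq_dir, "FNA", gene + ".FNA"),
        "FAA": os.path.join(seq_dir, "FAA", gene + ".FAA"),
    }
```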
	

	
The major steps of the pipeline include:

1) Blast search of the reads against the target sequences.

2) Distribution of reads into separate directories, one per gene.

3) Assembly of reads for each gene into contigs with Velvet, using multiple k-mer values. The multiple runs of Velvet are summarized using CAP3.

4) Conduct one or more exonerate searches for each contig in the assembly. If multiple contigs match the same protein in non-overlapping sequences, stitch the hits together into a "supercontig".

5) In a subdirectory, generate separate FASTA files containing either the nucleotide (FNA) or amino acid (FAA) sequence for each protein.

Still to come:
6) Optional utilities after running the pipeline for multiple assemblies:
     NOTE: for these utilities to work, the files must be in the same directory hierarchy created by the pipeline.
               (i.e. species/alignments/FAA/ and species/alignments/FNA/)
     a) Generate a single file for one or more proteins.
               Simple FASTA or use MAFFT to align it.
     b) Summarize the completeness of the alignments, gene by gene.
               Creates a matrix: each row is a species with length of each protein (in alphabetical order) separated by tabs.
     c) Matrixmerge: generate supermatrix from all aligned matrices.
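
The "supercontig" stitching in the major steps above can be pictured with a simplified sketch. This is not the logic of exonerate_hits.py: it assumes each hit is reduced to a half-open (start, end) range on the bait protein plus its nucleotide sequence, and it greedily drops any hit that overlaps one already kept. The function name stitch_hits is hypothetical.

```python
def stitch_hits(hits):
    """Join non-overlapping hits in protein order.

    hits: list of (protein_start, protein_end, nucleotide_sequence) tuples,
    with half-open coordinates on the bait protein (an assumption).
    """
    kept = []
    last_end = 0
    for start, end, seq in sorted(hits):
        if start >= last_end:  # no overlap with the previously kept hit
            kept.append(seq)
            last_end = end
    return "".join(kept)
```

A real implementation would also need to decide which of two overlapping hits to prefer, for example by score, rather than keeping whichever starts first.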

----------DEPRECATED SCRIPTS----------
These scripts are left over from a version of the pipeline that started with sequence assemblies, rather than raw reads.

----------------query_file_builder.py-----

This script generates the "tailored baitfile" for the species by choosing the best representative at each gene using a BLAST search against the assembly file. It also sets up all of the necessary file hierarchy to run the next step of the pipeline.

The input to the script requires a fasta file containing protein bait sequences, as described in Setup, and the nucleotide assembly file.

The script will use the prefix of the assembly file to generate a directory containing all the results. For the cleanest results, create a new directory, and use relative or absolute paths to indicate the locations of both the protein and assembly file.

For example, if one level up there is one directory containing baits and another containing assemblies:

EXAMPLE COMMAND LINE

query_file_builder.py ../baits/all_plastid_baits.FAA ../assemblies/speciesName.fasta

