You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The basic idea is to make two sets of calls, one from global assembly of the input FASTQs (fermikit), and another from global alignment and variant calling (freebayes).
https://github.com/lh3/fermikit
fermikit by Heng Li is a de novo assembly based variant calling pipeline for Illumina short reads.
Instead of mapping reads to the reference genome, fermikit constructs a de novo assembly of the reads into uniquely assemble-able contigs or "unitigs" which preserve distinct haplotypes to the extent possible and unambiguous. Subsequently, fermikit aligns these unitigs to the reference genome and calls both small variants and structural variants.
This app maps Illumina FASTQ reads to the human reference genome (hs37d5 or GRCh37), using BWA-MEM. The mappings are then sorted, indexed and marked for duplicates using bamsormadup from biobambam2.
# If no output name was given, construct one from the readsif [[ "$output_name"=="" ]];then
output_name="$reads_prefix"# Reads are often called "something_1.fastq.gz"; remove "_1"
output_name="${output_name%_1}"
output_name="${output_name%_R1}"fi# Clean up sample
sample="${sample//[^-.0-9@A-Z_a-z]/}"# If no read group was given, use same value as sampleif [[ "$read_group_id"=="" ]];then
read_group_id="$sample"fi# Set up BWA options
opts=()
opts+=("-t""`nproc`")
opts+=("-R""@RG\tID:${read_group_id}\tPL:ILLUMINA\tSM:${sample}")
if [[ "$mark_as_secondary"=="true" ]];then
opts+=("-M")
fiif [[ "$use_hs37d5"=="true" ]];then
opts+=("hs37d5.fa")
else
opts+=("grch37.fa")
fi
opts+=("$reads_path")
if [[ "$reads2"!="" ]];then
opts+=("$reads2_path")
fi
bwa mem "${opts[@]}"| /opt/biobambam2/bin/bamsormadup inputformat=sam indexfilename="$output_name".bam.bai M="$output_name".duplication_metrics >"$output_name".bam
emit sorted_bam "$output_name".bam
emit sorted_bai "$output_name".bam.bai
emit duplication_metrics "$output_name".duplication_metrics
We then derive a VCF from this output using freebayes. This is very fast and sometimes it made sense to reconfigure it, so in practice I've done so in line with the HHGA generation, which is a bit messy. This is the pipeline I'd use to do so:
ref=hs37d5.fa # we're using GRCh37 plus the "decoy" sequences
region=22:100-200 # dummy region, for example
freebayes -f $ref --region $region --no-partial-observations --min-repeat-entropy 1 -C 2 -F 0.05 $bam \
| vcffilter -f 'QUAL > 1e-8'| bgzip >freebayes.vcf.gz
Work is parallelized over a list of regions that have roughly equal number of alignments in them for one of the Garvan NA12878 sequencing runs.
The basic idea is to make two sets of calls, one from global assembly of the input FASTQs (fermikit), and another from global alignment and variant calling (freebayes).
We run fermikit on the FASTQs, then take its raw VCF output as a list of candidates. (likely private: https://precision.fda.gov/apps/app-BvJPP100469368x7QvJkKG9Y/fork, but documented here)
We run bwa mem on the FASTQs and reference genome, then mark PCR duplicates in the alignments. (https://precision.fda.gov/apps/app-BpF9YGQ0bxjbjk6Fx1F0KJF0/fork)
We then derive a VCF from this output using freebayes. This is very fast and sometimes it made sense to reconfigure it, so in practice I've done so in line with the HHGA generation, which is a bit messy. This is the pipeline I'd use to do so:
Work is parallelized over a list of regions that have roughly equal number of alignments in them for one of the Garvan NA12878 sequencing runs.
I made them like this:
This would be the process per region:
The bgziptabix and vcfjoincalls scripts are in vcflib: https://github.com/vcflib/vcflib/blob/master/scripts/vcfjoincalls, https://github.com/vcflib/vcflib/blob/master/scripts/bgziptabix
hhga_region is in this repo: https://github.com/ekg/hhga/blob/master/scripts/hhga_region
The text was updated successfully, but these errors were encountered: