
document and reproduce the precisionFDA pipeline #23

Open
ekg opened this issue May 2, 2016 · 0 comments
ekg commented May 2, 2016

The basic idea is to make two sets of calls, one from global assembly of the input FASTQs (fermikit), and another from global alignment and variant calling (freebayes).

We run fermikit on the FASTQs, then take its raw VCF output as a list of candidates. (The app is likely private: https://precision.fda.gov/apps/app-BvJPP100469368x7QvJkKG9Y/fork, but it is documented here.)

https://github.com/lh3/fermikit
fermikit, by Heng Li, is a de novo assembly-based variant calling pipeline for Illumina short reads.
Instead of mapping reads to the reference genome, fermikit assembles the reads de novo into uniquely assemblable contigs, or "unitigs", which preserve distinct haplotypes wherever the data allow this unambiguously. fermikit then aligns these unitigs to the reference genome and calls both small variants and structural variants.

# Build a single "zcat f1 f2 ..." command string covering all input FASTQs
read_reads=$(find /work/in -type f -path "/work/in/reads*/*" | xargs -n 100 echo zcat)

# Generate and run the fermikit assembly makefile
/fermi.kit/fermi2.pl unitig "-s${estimated_genome_size}" -t$(nproc) "-l${read_length}" -p genome "$read_reads" > genome.mak
cat genome.mak
make -f genome.mak

# Index the reference, then align the unitigs and call small and structural variants
zcat /work/in/reference_genome/* > reference_genome.fa
samtools faidx reference_genome.fa
/fermi.kit/bwa index reference_genome.fa
/fermi.kit/run-calling -t$(nproc) reference_genome.fa genome.mag.gz | sh

ls -lh

mv genome.srt.bam "${output_name}.bam"
emit unitigs_bam "${output_name}.bam"

mv genome.raw.vcf.gz "${output_name}.raw.vcf.gz"
emit raw_vcf "${output_name}.raw.vcf.gz"

mv genome.flt.vcf.gz "${output_name}.flt.vcf.gz"
emit filtered_vcf "${output_name}.flt.vcf.gz"

mv genome.sv.vcf.gz "${output_name}.sv.vcf.gz"
emit sv_vcf "${output_name}.sv.vcf.gz"

zip "${output_name}.fermikit.logs.zip" *.log
emit logs "${output_name}.fermikit.logs.zip"
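The find | xargs idiom at the top of the listing builds a single "zcat f1 f2 ..." command string that fermi2.pl embeds in its makefile. A minimal demonstration, using a hypothetical scratch directory in place of /work/in/reads*:

```shell
# Demonstration of the find | xargs idiom above: build a single
# "zcat f1 f2 ..." command string from the files in a directory.
# /tmp/reads_demo is a hypothetical stand-in for /work/in/reads*.
mkdir -p /tmp/reads_demo
touch /tmp/reads_demo/a.fastq.gz /tmp/reads_demo/b.fastq.gz
read_reads=$(find /tmp/reads_demo -type f -name "*.fastq.gz" | sort | xargs -n 100 echo zcat)
echo "$read_reads"   # zcat /tmp/reads_demo/a.fastq.gz /tmp/reads_demo/b.fastq.gz
```

Note that xargs -n 100 would emit multiple "zcat ..." lines if there were more than 100 input files; with typical input counts the result is one command string.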

We run bwa mem on the FASTQs and reference genome, then mark PCR duplicates in the alignments. (https://precision.fda.gov/apps/app-BpF9YGQ0bxjbjk6Fx1F0KJF0/fork)

This app maps Illumina FASTQ reads to the human reference genome (hs37d5 or GRCh37) using BWA-MEM. The mappings are then sorted, indexed, and duplicate-marked using bamsormadup from biobambam2.

# If no output name was given, construct one from the reads
if [[ "$output_name" == "" ]]; then
  output_name="$reads_prefix"
  # Reads are often called "something_1.fastq.gz"; remove "_1"
  output_name="${output_name%_1}"
  output_name="${output_name%_R1}"
fi

# Clean up sample
sample="${sample//[^-.0-9@A-Z_a-z]/}"

# If no read group was given, use same value as sample
if [[ "$read_group_id" == "" ]]; then
  read_group_id="$sample"
fi

# Set up BWA options
opts=()
opts+=("-t" "$(nproc)")
opts+=("-R" "@RG\tID:${read_group_id}\tPL:ILLUMINA\tSM:${sample}")
if [[ "$mark_as_secondary" == "true" ]]; then
  opts+=("-M")
fi
if [[ "$use_hs37d5" == "true" ]]; then
  opts+=("hs37d5.fa")
else
  opts+=("grch37.fa")
fi
opts+=("$reads_path")
if [[ "$reads2" != "" ]]; then
  opts+=("$reads2_path")
fi

bwa mem "${opts[@]}" \
    | /opt/biobambam2/bin/bamsormadup inputformat=sam indexfilename="$output_name".bam.bai M="$output_name".duplication_metrics \
    > "$output_name".bam

emit sorted_bam "$output_name".bam
emit sorted_bai "$output_name".bam.bai
emit duplication_metrics "$output_name".duplication_metrics
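The name-derivation and sample-sanitization expansions at the top of the script can be exercised in isolation; this is a small sketch with hypothetical input values:

```shell
# Sketch: the name-derivation and sample-sanitization parameter
# expansions from the script above, run on hypothetical inputs.
reads_prefix="NA12878-Garvan-Vial1_R1"
output_name="$reads_prefix"
output_name="${output_name%_1}"    # no trailing "_1" here, so unchanged
output_name="${output_name%_R1}"   # strips the trailing "_R1"
echo "$output_name"                # NA12878-Garvan-Vial1

sample="NA12878 (Garvan)"
sample="${sample//[^-.0-9@A-Z_a-z]/}"   # drops the space and parentheses
echo "$sample"                          # NA12878Garvan
```

The character class keeps only characters that are safe in read-group and file names, which matters because the sample string is interpolated into the @RG header line.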

We then derive a VCF from these alignments using freebayes. This step is fast, and it sometimes made sense to reconfigure it, so in practice I've run it inline with the HHGA generation, which is a bit messy. This is the pipeline I'd use to do so:

ref=hs37d5.fa      # we're using GRCh37 plus the "decoy" sequences
region=22:100-200  # dummy region, for example
# -C 2: require at least 2 reads supporting an alternate allele
# -F 0.05: require an alternate allele fraction of at least 5%
freebayes -f "$ref" --region "$region" --no-partial-observations --min-repeat-entropy 1 -C 2 -F 0.05 "$bam" \
    | vcffilter -f 'QUAL > 1e-8' | bgzip > freebayes.vcf.gz

Work is parallelized over a list of regions that contain roughly equal numbers of alignments in one of the Garvan NA12878 sequencing runs.

I made them like this:

bamprofile -f "$ref" -b aln.bam -e 1000000 > wg.e1M.regions

This would be the process per region:

#!/bin/bash

name=vial1l
ref=~/graphs/hs37d5.fa
fermikitvcf=NA12878-Garvan-Vial_1.fermikit.raw.vcf.gz
bam=NA12878-Garvan-Vial1.bam
region=$1

mkdir -p "$name"

freebayes -f "$ref" --region "$region" \
         --no-partial-observations \
         --min-repeat-entropy 1 \
         -C 2 -F 0.05 "$bam" \
    | vcffilter -f 'QUAL > 1e-8' \
    | bgziptabix "$name/$region.freebayes.vcf.gz"

tabix -h "$fermikitvcf" "$region" | bgziptabix "$name/$region.fermikit.vcf.gz"

vcfjoincalls "$ref" "$name/$region.freebayes.vcf.gz" "$name/$region.fermikit.vcf.gz" fermikit "$name" \
    | bgziptabix "$name/$region.union.vcf.gz"

hhga_region \
    -r "$region" \
    -w 16 \
    -x 10 \
    -f "$ref" \
    -T giab_v2.19.vcf.gz \
    -B giab_callable_v2.19.bed.gz \
    -b "$bam" \
    -v "$name/$region.union.vcf.gz" \
    -o "$name" -S NA12878
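To fan the per-region script out over the whole regions list, a small xargs driver suffices. This is a minimal sketch: the call_region.sh name is a hypothetical stand-in for the script above, and the stub script and demo directory exist only to make the example self-contained.

```shell
# Hypothetical driver: run the per-region script above (saved here as
# call_region.sh) once per region, nproc jobs at a time.
# A stub script and a two-region file stand in for the real ones.
mkdir -p /tmp/region_demo && cd /tmp/region_demo
rm -f done.log
printf '#!/bin/bash\necho "called on $1" >> done.log\n' > call_region.sh
chmod +x call_region.sh
printf '22:100-200\n22:201-300\n' > wg.e1M.regions
xargs -n 1 -P "$(nproc)" ./call_region.sh < wg.e1M.regions
wc -l < done.log   # one line per region processed
```

Because each region produces its own tabix-indexed VCFs under "$name", the per-region outputs can be concatenated or queried independently afterwards.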

The bgziptabix and vcfjoincalls scripts are in vcflib (bgziptabix bgzip-compresses its stdin to the named file and tabix-indexes it): https://github.com/vcflib/vcflib/blob/master/scripts/vcfjoincalls, https://github.com/vcflib/vcflib/blob/master/scripts/bgziptabix

hhga_region is in this repo: https://github.com/ekg/hhga/blob/master/scripts/hhga_region
