-
Notifications
You must be signed in to change notification settings - Fork 2
Hap10 Wiki page
The goal is to reconstruct accurate and long haplotypes polyploid genome using linked reads (10xgenomics). You can use it for haplotype phasing (phase the SNPs) also know as haplotype assembly of 10x read data (10x genomics).
Step 0. Preparation procedure
Step 1. Extracting haplotype information
Step 2. Extracting molecule-specific fragments
Step 3. Extracting strongly connected components of fragments
Step 4. Haplotyping
First you need to clone the Hap10 package.
git clone https://github.com/smajidian/Hap10.git
cd Hap10
Consider that you have a reference genome, ref/ref.fa
, and two FASTQ files, reads/R1.fastq.gz
and reads/R2.fastq.gz
, corresponding to the Illumina paired-end reads. As mentioned in the paper, you first need to align them to the reference genome using Longranger. Then, variants can be called using freebayes.
longranger mkref ref/ref.fasta
mkdir longranger_output
longranger align --id=longranger_output --fastqs=reads --reference=refdata-ref
freebayes -f ref/ref.fasta -p $k longranger_output/outs/possorted_bam.bam > var.vcf
in which k
is the ploidy level. The outputs of this step are aligned reads and called SNVs as BAM and VCF files, respectively.
Here, a fragment file (as a text file) is generated using BAM and filtered VCF file. Firstly, you need to build extract_poly, an edited version (under test) of extracthairs in which polyploid genomes are also allowed. The makefile will attempt to build samtools and htslib as git submodules. The output of this step is a binary file extractHAIRS
in folder build
.
git clone https://github.com/smajidian/extract_poly
cd extract_poly
make
The inputs of this step are two files:
- BAM file for an individual containing reads aligned to a reference genome
- VCF file containing only heterzygous SNVs. Complex SNVs and indels are not allowed for this version. The VCF file should only contain genotyped SNPs.
When the SNP rate is high,specially more than 1%, Freebayes has a problem: SNV that are so close to each other are considered as complex variants. break_vcf.sh
is a bash code from TriPoly to overcome this issue and converts those complex variants to SNVs.
utilities/break_vcf.sh var.vcf var_break.vcf
cat var_break.vcf | grep -v "1/1/1" | grep -v "0/0/0" grep -v "/2" > var_het.vcf #This is for triploid.
./extract_poly/build/extractHAIRS --10X 1 --bam out_longranger/outs/possorted_bam.bam --VCF var_het.vcf --out unlinked_fragment_file
The second bash line is for triploid, you can easily change it for higher ploidy levels.
In the output file of the last bash line, unlinked_fragment_file
, each line corresponds to one read. Now, We link the reads with the same barcode and generate a barcode-specific fragment file.
grep -v "NULL" unlinked_fragment_file > unlinked_fragment_file_filtered # removing those reads without barcode
python3 utilities/LinkFragments_brcd_based.py unlinked_fragment_file_filtered frag.txt
The goal of this this step is to extracting molecule-specific fragments from barcode-specific fragment file.
python3 utilities/splitter.py frag.txt $m frag_sp.txt
in which m
is the mean 10X molecule length (in Kb) which can be set as 50.
python3 utilities/extract_scc.py frag_sp.txt scc ./out
If you want to have larger haplotype block but with a bit lower quality, you can use cc
instead of scc
.
ُThe output of the previous step is several fragment files. The haplotype assembly core should be run on each of them. For haplotyping, the user can use one of the two modes: fast or accurate. The core of fast mode, Hap++, is SDhaP which is written in C. The core of accurate mode, Hap10, is MATLAB code.
You need to install SDhaP. To do so, I provide a through instruction here.
python2 utilities/FragmentPoly.py -f frag_sp.txt -o frag_sd.txt -x SDhaP
./sdhap/hap_poly frag_sd.txt out.hap 3
python2 utilities/ConvertAllelesSDhaP.py -p out.hap -o out_with_genomic_position.hap -v var_het.vcf
You may refer to this page.
You can convert the output as phased VCF file using the following line. You need the unphased VCF and the haplotype file generated by either Hap++ or Hap10.
python hap2vcf.py input.hap input.vcf
Sina Majidian, Mohammad Hossein Kahaei, Dick de Ridder, "Hap10: reconstructing accurate and long polyploid haplotypes using linked reads." https://www.biorxiv.org/content/10.1101/2020.01.08.899013v1
This package is distributed under the Creative Commons Attribution-ShareAlike 4.0 International Public License.