pansel
aims at finding over/under diverse regions.
The tool considers a pangenome graph, and a reference haplotype. It cuts the haplotype into fixed size bins, and computes the number of paths that are found in each bin. The bins with low numbers are probably under selection.
In practice, the number of paths between two genomic positions cannot be always computed. The easiest is to find nodes that are present in "many" haplotypes, that we call anchor nodes and compute the number of paths between these nodes. However, these nodes may not be located at each end of the bins. In these cases, we find the closest anchor nodes, which are located outside of the bins, and compute the number of paths between these nodes. This way, the number of paths may be over-estimated, but not under-estimated.
Typing make
with a fairly decent C++ compiler (compatible with C++11) should do.
pansel [parameters] > output_file 2> log_file
Compulsory parameters:
-i string: file name in GFA format
-r string: reference path name (should be in the GFA)
Optional parameters:
-z int: bin size (default: 1000)
-n int: min # paths
-b string: use BED format, and print the parameter as reference name
Other:
-h: print this help and exit
-v: print version number to stderr
- The GFA file should not be rGFA, and contain segments (
S
) and full length paths (P
) or walks (W
). Other lines are unused. This GFA should only store one chromosome. - The reference path name should be the name of a
P
line or aW
line in the GFA file. - The bin size is the distance between two anchor nodes that will be considered.
- The
-n
parameter is a threshold, that help selecting highly conserved nodes. The nodes that are traversed by at least n different paths are selected this way. If this parameter is not provided, it will be computed by the programme. Briefly, it counts the number of times each node is traversed, and takes the mode of the distribution (with n > 3).
It is a flat, tabulation-separated file. The meaning is:
- An ID of the bin.
- The targeted start position of the bin.
- The targeted end position of the bin.
- The Jaccard index between each pair of paths from first anchor node to the last anchor node.
- The number of different paths from first anchor node to the last anchor node (2 identical paths will be counted once).
- The total number of paths from first anchor node to the last anchor node (2 identical paths will be counted twice).
- The start position used (is different from the targeted start position when no anchor node overlaps in the bin start position).
- The end position used (is different from the targeted end position when no anchor node overlaps in the bin end position).
- The name of the left-most anchor node.
- The start position of the left-most anchor node.
- The end position of the left-most anchor node.
- The name of the right-most anchor node.
- The start position of the right-most anchor node.
- The end position of the right-most anchor node.
Notes:
- An anchor node may cover the whole bin. In this case, the number of different paths is 1, and the left-most anchor node is equal to the right-most anchor node.
This file contains different statistics on the GFA file, as well as the reference path.
You can use pansel
on a small sample, provided by the BubbleGun package:
wget -c https://zenodo.org/record/7937947/files/ecoli50.gfa.zst
./pansel -i <( zstd -d -c ecoli50.gfa.zst ) -r GCF_019614135.1#1#NZ_CP080645.1 > ecoli.out 2> ecoli.log
This will detail an example of a real use of pansel
.
Dowload the HPRC data (1 file per chromosome, restrict to autosomes), and tranform from VG to GFA format using vg view
:
for i in `seq 1 22`
do
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/scratch/2022_03_11_minigraph_cactus/chrom-graphs-hprc-v1.1-mc-chm13-full/chr${i}.vg
vg view chr${i}.vg | gzip -c > chr${i}.gfa.gz
rm chr${i}.vg
done
Run pansel
on each file:
for i in `seq 1 22`
do
/usr/bin/time ./pansel -i <( zcat chr${i}.gfa.gz ) -r GRCh38.0.chr${i} -n 91 > chr${i}.tsv 2> chr${i}.log
done
Merge output files, and produce a BED file:
for i in `seq 1 22`
do
sed "s/^/chr${i}\t/g" chr${i}.tsv
done | awk '{print $1 "\t" ($3-1) "\t" $4 "\tbin_" NR "\t" $5 "\t+"}' > chrall.bed
Fit distributions to the number of paths, and get the 5% threshold for the most conserved, and the most divergent regions (the R script file is included in the repository):
Rscript getExtremes.R -i chrall.bed -p 0.05 -P 0.05 -t fit.png -o fit_conserved.bed -O fit_divergent.bed &> fit.log
The parameters are:
-i string: the input file, output of the previous step
-p float: the p-value threshold of the conserved regions
-o string: the conserved regions, in BED format
-P float: the p-value threshold of the divergent regions
-O string: the divergent regions, in BED format
-t string: a plot of the fit (observed: solid black, fit: dashed green, conserved threshold: dotted blue, divergent threshold: dotted red)
-h: a help message
There are many ways to do so. One possibility is to use a dedicated pipe-line: the nf-core pangenome pipe-line. Otherwise, you can follow the procedure I used to build a small pangenome (see Create graph for M_xanthus).