1. Prepare meryl dbs
To run Merqury, we need `child.meryl` and the hap-mer dbs. The hap-mer dbs, `mat.hapmers.meryl` and `pat.hapmers.meryl`, can be generated with `hapmers.sh` using the parental meryl dbs.
When not sure which k-mer size to use, run `best_k.sh` with `genome_size` given as the number of bases.
sh $MERQURY/best_k.sh <genome_size> [tolerable_collision_rate=0.001]
For a human genome, the best k-mer size is `k=21` for both the haploid (3.1G) and the diploid (6.2G) genome size.
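As a rough check on where `k=21` comes from, the k-mer size can be estimated from the genome size G and the tolerable collision rate p as k = log(G(1-p)/p) / log(4). The sketch below assumes that `best_k.sh` follows this formula; the function name `estimate_k` is hypothetical and not part of Merqury.

```shell
#!/bin/sh
# Hypothetical helper, not part of Merqury: estimate k from the genome size
# (in bases) and the tolerable collision rate (default 0.001),
# assuming k = log(G * (1 - p) / p) / log(4), rounded to the nearest integer.
estimate_k() {
    genome_size=$1
    collision_rate=${2:-0.001}
    awk -v g="$genome_size" -v p="$collision_rate" \
        'BEGIN { k = log(g * (1 - p) / p) / log(4); printf "%.0f\n", k }'
}

estimate_k 3100000000   # haploid human genome size
estimate_k 6200000000   # diploid human genome size
```

Under this assumption, doubling the genome size only raises the unrounded estimate by half a base, which is why the haploid and diploid sizes land on the same k.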
Ways to build k-mer dbs may vary between sequencing platforms. Here, we provide scripts to run meryl on a single node or in batches using job arrays.
For typical Illumina whole genome sequencing, it is recommended to pre-process the reads (e.g. adapter removal) before building the meryl dbs. A meryl db can be created in one step, or parallelized across multiple nodes by counting each input file separately and merging at the end.
meryl k=$k count *.fastq.gz output $genome.meryl
# 1. Build meryl dbs on each input, could be run on different nodes
meryl k=$k count output read$i.meryl read$i.fastq.gz
# 2. Merge
meryl union-sum output $genome.meryl read*.meryl
`$MERQURY/build/count.sh` builds the meryl dbs, and `$MERQURY/build/union_sum.sh` does the merging.
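The count-then-merge pattern can be previewed before anything is submitted. This is a minimal dry-run sketch: `print_build_cmds` is a hypothetical helper (not a Merqury script) that only prints the `meryl` commands shown above, one count per input file followed by a single `union-sum`, and assumes the inputs end in `.fastq.gz`, `.fq.gz`, `.fastq` or `.fq`.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the meryl commands instead of running
# them, so each count line can be handed to a different node or job.
print_build_cmds() {
    k=$1
    genome=$2
    shift 2
    dbs=""
    for fq in "$@"; do
        # Derive the db name from the file name without its extension.
        name=$(basename "$fq")
        name=${name%.gz}
        name=${name%.fastq}
        name=${name%.fq}
        echo "meryl k=$k count output $name.meryl $fq"
        dbs="$dbs $name.meryl"
    done
    # One merge over all per-file dbs.
    echo "meryl union-sum output $genome.meryl$dbs"
}

print_build_cmds 21 child read1.fastq.gz read2.fastq.gz
```

Listing the per-file dbs explicitly in the merge step, instead of relying on `read*.meryl`, avoids picking up stale dbs left in the same directory.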
By default, `meryl` uses up to 64 cpus and all available memory. Limit `cpus` and `memory` if needed, e.g. `cpus=24 memory=48g`.
When memory is full, meryl does counting in batches.
`_submit_build.sh` automatically submits the build and merge jobs in a SLURM environment.
Homopolymer-compressed k-mers can be counted by adding the `compress` option to `meryl count`, or the `-c` option to the Merqury counting scripts. For example,
meryl k=$k count compress *.fastq.gz output $genome.meryl
or
$MERQURY/build/count.sh -c $k input.fofn [offset line_num]
# Once each meryl db is built in compressed space,
# there is no need to add -c in the merging step
$MERQURY/build/union_sum.sh $k <meryl.list> <output_prefix>
Again, on SLURM, this can be done with `$MERQURY/_submit_meryl.sh -c $k input.fofn out_prefix`.
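The counting scripts take an `input.fofn`, a file of file names. A minimal sketch for creating one is given below, assuming the scripts expect one input path per line (which the `line_num` argument suggests) and that the reads end in `.fastq.gz`. `make_fofn` is a hypothetical helper, not part of Merqury.

```shell
#!/bin/sh
# Hypothetical helper: write one FASTQ path per line into a file of file
# names, then print how many inputs were listed.
make_fofn() {
    read_dir=$1
    fofn=$2
    : > "$fofn"                      # start from an empty list
    for fq in "$read_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue     # skip the literal pattern if nothing matches
        echo "$fq" >> "$fofn"
    done
    wc -l < "$fofn" | tr -d ' '
}
```

For example, `make_fofn /path/to/reads input.fofn` lists every `.fastq.gz` file under `/path/to/reads` in sorted order.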
For 10X reads, building and merging are the same as for regular Illumina data, except that the first 23 bases of R1 are trimmed.
Use `$MERQURY/build/count_10x.sh` for R1 and `$MERQURY/build/count.sh` for R2.
`$MERQURY/build/count_10x.sh` trims and counts as follows:
zcat $input | awk '{if (NR%2==1) {print $1} else {print substr($1,24)}}' | meryl k=$k count output $output -
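To see what the trimming step does without running meryl, the same awk expression can be tried on a single uncompressed record. This is only an illustration: `trim_r1` is a hypothetical wrapper, and the read below is made up so that the 23 leading bases are easy to spot.

```shell
#!/bin/sh
# Same awk expression as in count_10x.sh above, wrapped in a hypothetical
# function. Odd lines (header and "+") keep only their first field;
# even lines (sequence and quality) are kept from base 24 onward.
trim_r1() {
    awk '{if (NR%2==1) {print $1} else {print substr($1,24)}}'
}

# Made-up record: 23 leading bases (ACGT x4 + N x7), then GGGGTTTT.
printf '%s\n' \
    '@read1' \
    'ACGTACGTACGTACGTNNNNNNNGGGGTTTT' \
    '+' \
    'IIIIIIIIIIIIIIII#######FFFFFFFF' | trim_r1
```

The expression relies on every record spanning exactly four lines, so multi-line FASTQ records would be trimmed incorrectly.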
`_submit_build_10x.sh` automatically submits the build and merge jobs in a SLURM environment.
Assume we have generated `maternal.meryl`, `paternal.meryl` and `child.meryl`.
sh $MERQURY/trio/hapmers.sh maternal.meryl paternal.meryl child.meryl
or submit as a job
sh $MERQURY/_submit_hapmers.sh maternal.meryl paternal.meryl child.meryl
This will generate
* parental specific dbs: `mat.only.meryl` and `pat.only.meryl`
* inherited dbs: `mat.inherited.meryl` and `pat.inherited.meryl`
* inherited hap-mer dbs (which will be used for evaluation): `mat.hapmers.meryl` and `pat.hapmers.meryl`
* `inherited_hapmers.png`: k-mer distribution of the inherited dbs and the cutoffs used to generate the hap-mer dbs
One way to make `child.meryl` without a separate child read set is to generate the parental dbs from maternal and paternal sequencing sets down-sampled to matching coverage, then perform `union-sum` on `maternal.meryl` and `paternal.meryl`. Note that the single-copy peak then contains both inherited and non-inherited k-mers, so we expect a larger portion of `read-only` k-mers in the spectra-cn analysis.
Find `haplotype-*.meryl` under `haplotype/0-kmers/`. These are k-mers found in only one parent. We need to filter out the erroneous k-mers to reduce false positives.
Let’s assume we have `haplotype-Maternal.meryl` and `haplotype-Paternal.meryl`.
Find the following in `splitHaplotype.*.out` or `haplotype.log`:
-- Haplotype './0-kmers/haplotype-Maternal.meryl':
-- use kmers with frequency at least 19.
(…)
-- Haplotype './0-kmers/haplotype-Paternal.meryl':
-- use kmers with frequency at least 17.
Filter out erroneous k-mers:
meryl greater-than 19 output mat.only.meryl haplotype-Maternal.meryl
meryl greater-than 17 output pat.only.meryl haplotype-Paternal.meryl
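The thresholds can also be read out of the log instead of being copied by hand. The sketch below assumes the log uses exactly the two-line layout quoted above (a `-- Haplotype '...':` line followed by a `frequency at least N.` line); `haplotype_thresholds` is a hypothetical helper, not part of Merqury or Canu.

```shell
#!/bin/sh
# Hypothetical helper: read a log on stdin and print
# "<db path><TAB><threshold>" for each haplotype it reports.
haplotype_thresholds() {
    awk -v q="'" '
        /-- Haplotype / {
            db = $3
            gsub(q, "", db)      # drop the surrounding single quotes
            sub(/:$/, "", db)    # drop the trailing colon
        }
        /use kmers with frequency at least/ {
            thr = $NF
            sub(/\.$/, "", thr)  # drop the trailing period
            print db "\t" thr
        }
    '
}
```

For example, `haplotype_thresholds < haplotype.log` prints one line per haplotype db, and the second column is the value used in the filtering commands.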
Here, `mat.only.meryl` and `pat.only.meryl` are the parental specific k-mers used for binning.
We can re-use them to generate `inherited_hapmers.png`.
Make a directory, then link `mat.only.meryl`, `pat.only.meryl` and the complete `.meryl` dbs of the parents.
mkdir hapmers
cd hapmers
ln -s /path/to/mat.only.meryl
ln -s /path/to/pat.only.meryl
ln -s /path/to/maternal.meryl # Complete maternal db, made outside of trioCanu
ln -s /path/to/paternal.meryl # Complete paternal db, made outside of trioCanu
Now we can run `hapmers.sh`. This generates the inherited, filtered `*.hapmers.meryl` dbs.
sh $MERQURY/trio/hapmers.sh maternal.meryl paternal.meryl child.meryl