
1. Prepare meryl dbs

Arang Rhie edited this page Dec 16, 2022 · 5 revisions

To run Merqury, we need child.meryl and the hap-mer dbs. The hap-mer dbs, mat.hapmers.meryl and pat.hapmers.meryl, can be generated with hapmers.sh from the parental meryl dbs.

1. Get the right k size

When not sure which k-mer size to use, run best_k.sh with the genome size given in number of bases.

sh $MERQURY/best_k.sh <genome_size> [tolerable_collision_rate=0.001]

For a human genome, the best k-mer size is k=21 for both the haploid (3.1G) and diploid (6.2G) genome sizes.
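As a rough illustration of where k=21 comes from, the sketch below evaluates k = log4(G(1-p)/p), the collision-rate formula described in the Merqury paper, for a genome size G and tolerable collision rate p. That best_k.sh computes exactly this expression is an assumption here, and `best_k_sketch` is a hypothetical helper name that is not part of Merqury.

```shell
# Hypothetical helper (not part of Merqury): estimate k from the genome size.
# $1: genome size in bases
# $2: tolerable collision rate (defaults to 0.001, as in the usage above)
best_k_sketch() {
  awk -v g="$1" -v p="${2:-0.001}" \
    'BEGIN { printf "%.4f\n", log(g * (1 - p) / p) / log(4) }'
}

best_k_sketch 3100000000   # haploid human genome
best_k_sketch 6200000000   # diploid human genome
```

Under this formula both values round to the nearest integer as 21, which agrees with the recommendation above.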

2. Build k-mer dbs with meryl

Ways to build k-mer dbs may vary between sequencing platforms. Here, we provide scripts to run meryl on a single node or in batches using job arrays.

Illumina whole genome sequencing reads

For typical Illumina whole genome sequencing, it is recommended to pre-process the reads (e.g. adapter removal) before building the meryl dbs. A meryl db can be created in one step, or parallelized over multiple nodes by counting each input file separately and merging at the end.

One-step counting

meryl k=$k count *.fastq.gz output $genome.meryl

Two-step counting

# 1. Build meryl dbs on each input, could be run on different nodes
meryl k=$k count output read$i.meryl read$i.fastq.gz

# 2. Merge
meryl union-sum output $genome.meryl read*.meryl
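The two steps above can be sketched as a single loop for the case where everything runs on one machine. `count_and_merge` is a hypothetical helper written for this page, not a Merqury script, and the `MERYL` variable is an assumption added so the commands can be previewed without running them.

```shell
# Hypothetical helper: count each FASTQ separately, then merge with union-sum.
# Usage: count_and_merge <k> <genome_prefix> <reads1.fastq.gz> [reads2.fastq.gz ...]
# Set MERYL="echo meryl" to print the commands instead of running them.
count_and_merge() {
  k=$1
  genome=$2
  shift 2
  dbs=""
  for fq in "$@"; do
    # read1.fastq.gz -> read1.meryl
    db="$(basename "$fq" .fastq.gz).meryl"
    ${MERYL:-meryl} k="$k" count output "$db" "$fq"
    dbs="$dbs $db"
  done
  # $dbs is left unquoted on purpose so it splits into one argument per db
  ${MERYL:-meryl} union-sum output "$genome.meryl" $dbs
}
```

For example, `MERYL="echo meryl"` followed by `count_and_merge 21 asm read1.fastq.gz read2.fastq.gz` prints two count commands and one union-sum command. On a cluster, the per-file counts would normally go to separate jobs instead of a loop.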

$MERQURY/build/count.sh builds the meryl dbs, and $MERQURY/build/union_sum.sh does the merging.

By default, meryl uses up to 64 cpus and all available memory. Limit cpus and memory if needed, e.g. cpus=24 memory=48g. When memory is full, meryl counts in batches.

_submit_build.sh automatically submits the build and merge jobs in a SLURM environment.

Counting compressed k-mers

Homopolymer compressed k-mers can be counted by adding the compress option to meryl count or -c option to Merqury counting scripts. For example,

meryl k=$k count compress *.fastq.gz output $genome.meryl

or

$MERQURY/build/count.sh -c $k input.fofn [offset line_num]

# Once each meryl db is built in compressed space,
# there is no need to add -c in the merging step
$MERQURY/build/union_sum.sh $k <meryl.list> <output_prefix>

Again, on SLURM, this can be done with $MERQURY/_submit_build.sh -c $k input.fofn out_prefix.
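To make "homopolymer compressed" concrete: each run of identical bases is collapsed to a single base before k-mers are counted. The sketch below only illustrates that transformation on a bare sequence string. `hpc` is a hypothetical helper, and meryl performs the compression internally, so nothing like this is needed in practice.

```shell
# Hypothetical helper: collapse each run of identical bases into one base.
# Meant for bare sequence lines only (it would also squeeze FASTA/FASTQ headers).
hpc() {
  tr -s 'ACGTNacgtn'
}

printf 'AAACCGTTTTA\n' | hpc   # prints ACGTA
```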

10X Genomics whole genome sequencing reads

Building and merging are the same as for regular Illumina data, except that the first 23 bases of R1 are trimmed. Use $MERQURY/build/count_10x.sh for R1 and $MERQURY/build/count.sh for R2. $MERQURY/build/count_10x.sh trims and counts as follows:

zcat $input | awk '{if (NR%2==1) {print $1} else {print substr($1,24)}}' | meryl k=$k count output $output -
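The effect of the awk step can be checked on a toy record without meryl. In the sketch below, the read name, sequence, and quality string are made up for illustration, and `trim_10x_r1` is a hypothetical wrapper around the same awk program shown above.

```shell
# Hypothetical wrapper around the awk program used by count_10x.sh.
# Odd lines (header and "+") keep only their first field;
# even lines (sequence and quality) lose their first 23 characters.
trim_10x_r1() {
  awk '{if (NR%2==1) {print $1} else {print substr($1,24)}}'
}

# Toy R1 record: 23 leading bases (16 + 7) followed by GATTACA.
printf '%s\n' \
  '@read1 made-up description' \
  'ACGTACGTACGTACGTNNNNNNNGATTACA' \
  '+' \
  '01234567890123456789012HHHHHHH' | trim_10x_r1
```

The output is a four-line record with the header shortened to `@read1`, the sequence `GATTACA`, and the quality string `HHHHHHH`.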

_submit_build_10x.sh automatically submits the build and merge jobs in a SLURM environment.

3. Build hap-mer dbs for trios

Assume we have generated maternal.meryl, paternal.meryl and child.meryl.

sh $MERQURY/trio/hapmers.sh maternal.meryl paternal.meryl child.meryl

or submit as a job

sh $MERQURY/_submit_hapmers.sh maternal.meryl paternal.meryl child.meryl

This will generate

* parent-specific dbs: `mat.only.meryl` and `pat.only.meryl`
* inherited dbs: `mat.inherited.meryl` and `pat.inherited.meryl`
* inherited hap-mer dbs (which will be used for evaluation): `mat.hapmers.meryl` and `pat.hapmers.meryl`
* `inherited_hapmers.png`: k-mer distribution of the inherited dbs and the cutoffs used to generate the hap-mer dbs
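After the run finishes, a quick way to confirm that everything listed above was written is to test for each output by name. This is a minimal sketch: `check_hapmer_outputs` is a hypothetical helper, and it assumes the outputs land in the directory passed as its first argument (the current directory by default).

```shell
# Hypothetical helper: report any expected hapmers.sh output that is missing.
# $1: directory to check (defaults to the current directory)
# Returns 0 if everything is present, 1 otherwise.
check_hapmer_outputs() {
  dir=${1:-.}
  missing=0
  for out in mat.only.meryl pat.only.meryl \
             mat.inherited.meryl pat.inherited.meryl \
             mat.hapmers.meryl pat.hapmers.meryl \
             inherited_hapmers.png; do
    if [ ! -e "$dir/$out" ]; then
      echo "missing: $out"
      missing=1
    fi
  done
  return "$missing"
}
```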

When the child's reads are not available

One option is to build the parental dbs from maternal and paternal read sets that have been down-sampled to match coverage, then run meryl union-sum on maternal.meryl and paternal.meryl to make child.meryl. Note that the single-copy peak then contains both inherited and non-inherited k-mers, so we expect a larger portion of read-only k-mers in the spectra-cn analysis.
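The merge itself is a single union-sum call, in the same form used for merging read dbs earlier on this page. In this sketch, `make_pseudo_child` is a hypothetical helper name and the `MERYL` variable is an assumption added so the command can be previewed first. The parental dbs are assumed to have been built from the down-sampled reads already.

```shell
# Hypothetical helper: merge two parental dbs into a stand-in child db.
# $1: maternal db, $2: paternal db, $3: output db
# Set MERYL="echo meryl" to print the command instead of running it.
make_pseudo_child() {
  ${MERYL:-meryl} union-sum output "$3" "$1" "$2"
}
```

Calling `make_pseudo_child maternal.meryl paternal.meryl child.meryl` runs `meryl union-sum output child.meryl maternal.meryl paternal.meryl`.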

Hap-mers used to classify child's reads in TrioCanu

Find haplotype-*.meryl under haplotype/0-kmers/. These are k-mers found only in one parent. We need to filter out the erroneous k-mers to reduce false positives. Let’s assume we have haplotype-Maternal.meryl and haplotype-Paternal.meryl.

Find the following in splitHaplotype.*.out or haplotype.log:

--  Haplotype './0-kmers/haplotype-Maternal.meryl':
--   use kmers with frequency at least 19.
(…)
--  Haplotype './0-kmers/haplotype-Paternal.meryl':
--   use kmers with frequency at least 17.
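If the log is long, the two cutoffs can be pulled out with awk rather than read by eye. The sketch below assumes the log lines look exactly like the excerpt above. `extract_cutoffs` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<db name> <cutoff>" for each haplotype in the log.
# $@: log file(s), e.g. haplotype.log
extract_cutoffs() {
  awk '
    $2 == "Haplotype" {
      name = $3
      gsub("[^A-Za-z0-9._/-]", "", name)   # drop the quotes and trailing colon
      sub("^.*/", "", name)                # drop the directory part
    }
    /use kmers with frequency at least/ {
      cutoff = $NF
      sub("[.]$", "", cutoff)              # drop the trailing period
      print name, cutoff
    }
  ' "$@"
}
```

On the excerpt above this prints `haplotype-Maternal.meryl 19` and `haplotype-Paternal.meryl 17`.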

Filter out erroneous k-mers:

meryl greater-than 19 output mat.only.meryl haplotype-Maternal.meryl
meryl greater-than 17 output pat.only.meryl haplotype-Paternal.meryl

Here, mat.only.meryl and pat.only.meryl are the parent-specific k-mers used for binning.

To generate inherited_hapmers.png, we can re-use them.

Make a directory and link mat.only.meryl, pat.only.meryl, and the complete .meryl dbs of the parents.

mkdir hapmers
cd hapmers
ln -s /path/to/mat.only.meryl
ln -s /path/to/pat.only.meryl
ln -s /path/to/maternal.meryl	# Complete maternal db, made outside of trioCanu
ln -s /path/to/paternal.meryl	# Complete paternal db, made outside of trioCanu

Now we can run hapmers.sh. This generates the inherited, filtered *.hapmers.meryl dbs.

sh $MERQURY/trio/hapmers.sh maternal.meryl paternal.meryl child.meryl