Skip to content

Best practices for Merfin

Giulio Formenti edited this page Sep 13, 2022 · 5 revisions

Merfin is a k-mer based tool for variant call filtering, polishing, and assembly evaluation. Provided here are the best-practices with examples from our paper.

1. Preparing k-mer dbs

Count k-mers from the sequencing reads using Meryl. A more detailed description is available at Merqury.

Merfin works best with Illumina reads. If only HiFi reads are available, that could be used as an alternate option however less recommended due to its known homopolymer or microsatellite repeat sequencing biases.

After constructing the k-mer db, we recommend excluding k-mers with frequency 1 for computational acceleration and removal of erroneous k-mers.

# Construct k-mer db
meryl count k=$k reads.fastq.gz output reads.meryl

# Collect histogram for GenomeScope
meryl histogram reads.meryl > reads.hist

# Exclude frequency = 1 k-mers
meryl greater-than 1 reads.meryl output reads.gt1.meryl

2. Genotyping

A small example from HG002 that was described in our paper can be found here along with a smaller test example.

The variant call sets were submitted to PrecisionFDA challenge II, in which the raw (default) calls were obtained through GATK variant calling pipeline. Variant callings were performed on Illumina reads (ill.vcf) and combined Illumina, HiFi, and ONT reads (multi.vcf). Post-filtered with Merfin are also provided.

For test-running Merfin, download:

  • chr20.fasta.gz
  • HG002.k21.gt1.meryl.tar.gz
  • test.vcf.gz

Once downloaded, decompress the meryl db. For example, using pigz:

pigz -d HG002.k21.gt1.meryl.tar.gz
tar -xf HG002.k21.gt1.meryl.tar

Finally, run Merfin:

# Calls from Illumina
merfin -filter -sequence chr20.fasta.gz  \
           -readmers HG002.k21.gt1.meryl \
           -vcf ill.vcf.gz               \
           -output test.merfin

This generates chr20.fasta.gz.meryl as seqmers and loads both readmers and seqmers. Alternatively, the seqmers could be generated beforehand and re-used:

meryl count k=21 chr20.fasta.gz output chr20.meryl
merfin -filter -sequence chr20.fasta.gz   \
           -seqmers chr20.meryl           \
           -readmers HG002.k21.gt1.meryl  \
           -vcf ill.vcf.gz                \
           -output test.merfin

It takes ~1 hour to process the dataset with 24 cpus and ~30 GB of memory. Note parallelization takes place per sequence entry (i.e. chromosome, scaffolds or contigs).

3. De novo assembly evaluation

Here, we are taking the intermediate Archocentrus centrarchus genome assembly presented in one of the VGP assemblies we polished as an example.

The data can be downloaded from here. We need:

  • fArcCen1.21.meryl.tar.gz
  • gs2/lookup_table.txt
  • gs2/model.txt (or linear_plot.png. optional, for the kcov or kmercov)
  • arrow/fArcCen1_s4.fasta
  • arrow/fArcCen1_s4.reshaped.vcf.gz

As we are going to use the sequence k-mer database multiple times, let’s count the k-mers first:

meryl count k=21 fArcCen1_s4.fasta output fArcCen1_s4.fasta.meryl

This is optional, and the fArcCen1_s4.fasta.meryl can be provided as -seqmers in the following examples instead of -sequence.

Possible collapse and expansions

Dump readK (Kr), asmK (KC), and K* per bases (k-mers) in the sequence.

merfin -dump                              \
           -sequence fArcCen1_s4.fasta    \
           -readmers fArcCen1.21.meryl    \
           -peak 10.8                     \
           -prob lookup_table.txt         \
           -output test.dump

It takes ~45 minutes to process with 24 cpus and a maximum of 110 GB of memory.

  • For advanced users: it is one option to use -skipMissing for not displaying erroneous k-mers in the track.

The output test.dump contains 5 columns:

seqName    - name of the sequence this k-mer is from
seqPos     - start position (0-based) of the k-mer in the sequence
readK      - normalized read copies (read multiplicity / peak)
asmK       - assembly copies as found in <seq.meryl>
K*         - 0-centered K* value

This file can be converted to a .wig or .bw format with:

dump_output=test.dump
asm_fa=fArcCen1_s4.fasta

# Convert to .wig
awk 'BEGIN {print "track autoScale=on"} { if ($1!=chr) { chr=$1; print "variableStep chrom="chr" span=1"}; if ( $3!=0 ) { printf $2+1"\t"$5"\n" } }' $dump_output > $dump_output.wig

# Generate chromosome sizes:
awk '$0 ~ ">" {if (NR > 1) { print c; } c=0; printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' $asm_fa > $asm_fa.sizes

# Convert to bigWig:
wigToBigWig $dump_output.wig $asm_fa.sizes $dump_output.bw

The wigToBigWig is part of the UCSC Kent Tools.

K* histogram and QV*

This generates a 0-centered K* histogram for sequence.

merfin -hist                                  \
           -sequence fArcCen1_s4.fasta        \
           -readmers fArcCen1.21.meryl        \
           -peak 10.8                         \
           -prob lookup_table.txt             \
           -output test.hist 2> test.hist.log

It takes ~10 minutes to process with 24 cpus and a maximum of 64 GB of memory.

The output contains two columns:

K* value <tab> frequency
  • Positive k* values are expected collapsed copies.
  • Negative k* values are expected expanded copies.
  • Closer to 0 means the expected and found k-mers are well balanced, 1:1.

QV* is printed at the end of the log; test.hist.log.

K-mers not found in reads (missing) : 64922262
K-mers overly represented in assembly: 248125467.40
K-mers found in the assembly: 1984603394
Missing QV: 28.01
Merfin QV*: 20.89

K* Completeness

Compute k-mer completeness using expected copy numbers.

merfin -completeness                         \
           -seqmers fArcCen1_s4.fasta.meryl  \
           -readmers fArcCen1.21.meryl       \
           -prob lookup_table.txt            \
           -peak 10.8 2> test.completeness

It takes ~10 minutes to process with 24 cpus and a maximum of 69.1 GB of memory.

The completeness is printed at the end of test.completeness:

TOTAL readK:     2069659896.00
TOTAL undrcpy:    436891737.00000
COMPLETENESS:             0.78891

Each values mean:

TOTAL readK: total kmers in reads
TOTAL undrcpy: number of kmers under the expected copy number
COMPLETENESS: completeness

3. Polishing

Score each variant, or variants within distance k and their combinations by K*.

Preparing input vcf

Merfin expects a proper GT field to be present under the first SAMPLE column.

In this example, we use the output from Arrow. By default, Arrow is missing a GT field, so we added it using this script:

bash reshape_arrow.sh arrow.vcf

Through experiments for haplotype switching, we found the heterozygous calls were not so useful in partially phased pseudo-haplotype assemblies and recommended removing them before supplying to Merfin.

This can be done with bcftools:

bcftools view -Oz -i '(GT="AA" || GT="Aa")' in.vcf > out.vcf.gz

In addition, spurious variant calls could be removed with low stringency filters, i.e. QUAL>1, which will improve computation time, especially in freebayes outputs.

Polish

With the input .vcf prepared, we are ready to run Merfin:

merfin -polish                               \
           -sequence fArcCen1_s4.fasta       \
           -readmers fArcCen1.21.meryl       \
           -peak 10.8                        \
           -prob lookup_table.txt            \
           -vcf fArcCen1_s4.reshaped.vcf.gz  \
           -output test

It takes ~1.7 hours to process with 24 cpus and a maximum of 68.0 GB of memory used.

Alternatively, we can run Merfin to only include k-mers with frequency > 1, which ensures at least two k-mers (possibly from two reads) support the alternate variant.

Again, we can filter the readmers:

meryl greater-than 1 fArcCen1.21.meryl output fArcCen1.21.gt1.meryl

And use it to run Merfin:

merfin -polish                               \
           -sequence fArcCen1_s4.fasta       \
           -readmers fArcCen1.21.gt1.meryl   \
           -peak 10.8                        \
           -prob lookup_table.txt            \
           -vcf fArcCen1_s4.reshaped.vcf.gz  \
           -output test.gt1

It takes ~1.5 hours to process with 24 cpus and a maximum of 24.3 GB of memory.

Consensus

Finally, we can generate a consensus fasta files with bcftools:

bcftools view -Oz test.merfin.vcf > test.merfin.vcf.gz
bcftools index test.merfin.vcf.gz
bcftools consensus -H1 --chain chain_file.txt -f fArcCen1_s4.fasta test.merfin.vcf.gz > test.merfin.fasta

And restart the evaluation.

Happy Merfin!