Pairwise assembly comparison

Ryan Wick edited this page May 30, 2023 · 18 revisions

This page describes in detail the logic behind the Verticall pairwise command. While Verticall is typically run on a group of assemblies (see Distance tree workflow and Alignment tree workflow), pairwise comparisons are at the core of its analysis. So to understand how Verticall works, it's best to think about a single pairwise comparison: assembly A vs assembly B.

Briefly, Verticall uses a nonparametric approach to identify what parts of the assemblies are vertically-inherited or horizontally-inherited, relative to each other. This allows for a genomic distance that is less influenced (ideally not at all) by horizontal gene transfer. If you're interested in the nitty-gritty of how it works, read on!

Toy example

This page uses this highly simplified toy example to illustrate the process:

assembly A: ACTCTCAGCGGAACTGCCGCTATGCTTTACACATACCGTAGGACGTAGGTGGGCCGATCTATCGCAGAA
assembly B: TGGTCTCAGAGGAACTGCCGCTATTTACACGTATCCGTAGGACGAACCTAAGCCGATCTATCTAGCC

See the illustrated example pages (1, 2 and 3) for real assembly pairs.

Step 1: align the assemblies

Verticall first produces a set of alignments between homologous sequences in the two assemblies using minimap2. For our toy example, we only get a single alignment:

 ACTCTCAGCGGAACTGCCGCTATGCTTTACACATA-CCGTAGGACGTAGGTGGGCCGATCTATCGCAGAA
   |||||| |||||||||||||   ||||||| || |||||||||| |  |  |||||||||||
TGGTCTCAGAGGAACTGCCGCTA---TTTACACGTATCCGTAGGACGAACCTAAGCCGATCTATCTAGCC

Each alignment has a CIGAR string that describes matches (=), mismatches (X), insertions (I) and deletions (D). Most of the operations described on this page work on CIGARs, with the actual sequences only returning in Step 5. In our toy example, the CIGAR string looks like this: 6=1X13=3D7=1X2=1I10=1X1=2X1=2X11=

Verticall then expands the CIGARs to make them easier to work with:

expanded CIGAR: ======X=============DDD=======X==I==========X=XX=XX===========

And it compresses the indels[1], effectively treating each indel as a single difference regardless of size:

simplified CIGAR: ======X=============D=======X==I==========X=XX=XX===========

A simple way of quantifying the genomic distance would be to now count differences in the simplified CIGAR and divide by the length. For our toy example, this is 9 over 60, giving a distance of 0.15. This is similar to the concept of gap-compressed identity[2].
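
The CIGAR handling described above can be sketched in a few lines of Python. This is only an illustration of the idea and not Verticall's own code; the function names (expand_cigar, compress_indels, simple_distance) are made up for this example.

```python
import re
from itertools import groupby

def expand_cigar(cigar):
    """Turn a run-length CIGAR such as '6=1X' into '======X'."""
    return "".join(op * int(n) for n, op in re.findall(r"(\d+)([=XID])", cigar))

def compress_indels(expanded):
    """Collapse each run of consecutive I (or D) characters into one character."""
    pieces = []
    for op, run in groupby(expanded):
        pieces.append(op if op in "ID" else "".join(run))
    return "".join(pieces)

def simple_distance(simplified):
    """Fraction of positions in the simplified CIGAR that are not matches."""
    return sum(op != "=" for op in simplified) / len(simplified)

cigar = "6=1X13=3D7=1X2=1I10=1X1=2X1=2X11="
expanded = expand_cigar(cigar)
simplified = compress_indels(expanded)
print(expanded)
print(simplified)
print(simple_distance(simplified))
```

For the toy CIGAR, the expanded string has 62 characters, the simplified string has 60, and 9 of those 60 are differences.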

However, in our case, this distance is elevated by the cluster of five mismatches. What if those mismatches were horizontally-acquired, and we therefore want to exclude them from our genomic distance? This is the exact scenario Verticall was made for: identifying regions of the alignments that look different from the rest and masking them out of the distance calculation.

Step 2: build a distance distribution

Using a sliding window over the simplified CIGARs, Verticall counts the differences in each window.

Here is our toy example using a window size of 12 and a step of 4, resulting in 13 windows[3]. The numbers give the difference count in each window:

  ======X=============D=======X==I==========X=XX=XX===========
1:━━━━━━━━━━━━      3:━━━━━━━━━━━━      5:━━━━━━━━━━━━
    1:━━━━━━━━━━━━      2:━━━━━━━━━━━━      4:━━━━━━━━━━━━
        0:━━━━━━━━━━━━      2:━━━━━━━━━━━━      1:━━━━━━━━━━━━
            1:━━━━━━━━━━━━      1:━━━━━━━━━━━━
                1:━━━━━━━━━━━━      4:━━━━━━━━━━━━
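
As a rough sketch (not Verticall's implementation), the window counts above can be reproduced like this. The function name and its parameters are hypothetical, and the sketch assumes that only full-length windows are used.

```python
def window_difference_counts(simplified, window_size, window_step):
    """Count non-match characters in each full-length window of a simplified CIGAR."""
    counts = []
    for start in range(0, len(simplified) - window_size + 1, window_step):
        window = simplified[start:start + window_size]
        counts.append(sum(op != "=" for op in window))
    return counts

# The simplified CIGAR from the toy example (60 characters).
simplified = ("=" * 6 + "X" + "=" * 13 + "D" + "=" * 7 + "X" + "==" + "I"
              + "=" * 10 + "X=XX=XX" + "=" * 11)

counts = window_difference_counts(simplified, window_size=12, window_step=4)
print(counts)
```

The counts are listed left to right by window start position, so they follow the diagram column by column rather than row by row.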

Using these per-window difference counts, Verticall can build a discrete distribution:

      ┃
      ┃
      ┃
      ┃  
      ┃  ┃     ┃
   ┃  ┃  ┃  ┃  ┃  ┃
  ──────────────────
   0  1  2  3  4  5
differences per window

While our toy example contains only a single alignment, most real assembly pairs will produce many alignments, and Verticall builds a single distribution using all alignments (not a separate distribution for each alignment).
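
Pooling the windows from every alignment into one distribution might look like the sketch below. The build_distribution name is hypothetical, and representing the distribution as a list of masses that sum to 1 is an assumption made for this example.

```python
from collections import Counter

def build_distribution(counts_per_alignment):
    """Pool window counts from all alignments into one list of masses.

    Index d of the returned list holds the fraction of windows with d differences.
    """
    pooled = Counter()
    for counts in counts_per_alignment:
        pooled.update(counts)
    total = sum(pooled.values())
    return [pooled[d] / total for d in range(max(pooled) + 1)]

# The toy example has a single alignment, so the outer list has one entry.
toy_counts = [1, 1, 0, 1, 1, 3, 2, 2, 1, 4, 5, 4, 1]
masses = build_distribution([toy_counts])
print(masses)
```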

Step 3: smooth and partition the distribution

Our toy example is too small to demonstrate smoothing, and due to its simple nature, this distribution doesn't require smoothing. But for real cases, the distribution is typically noisy, especially as it extends to the right (higher difference counts), so Verticall smooths it to better identify peaks. It does this by conducting kernel smoothing with an Epanechnikov kernel. However, instead of a constant smoothing bandwidth, Verticall uses a bandwidth that is a function of the distribution's x-axis. Specifically, the bandwidth is equal to ds, where d is the difference count and s is the value set by --smoothing_factor (default of 0.8, check out this interactive plot). This has the effect of little-to-no smoothing on the left side of the distribution and an increasing amount of smoothing on the right side of the distribution. See the illustrated examples (1, 2 and 3) for more information on smoothing.
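
To give a feel for kernel smoothing with a position-dependent bandwidth, here is a minimal sketch. It is not Verticall's implementation: the smooth function, the choice to spread each bar's mass using the bandwidth at that bar, and the renormalisation at the edges are all assumptions made for illustration. The bandwidth rule itself is passed in as a function so that the sketch does not commit to a particular formula.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) inside (-1, 1), zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def smooth(masses, bandwidth_for):
    """Spread each bar's mass over its neighbours using a per-position bandwidth.

    bandwidth_for(d) is a caller-supplied (hypothetical) function giving the
    bandwidth at difference count d. Total mass is conserved.
    """
    smoothed = [0.0] * len(masses)
    for d, mass in enumerate(masses):
        h = bandwidth_for(d)
        if h <= 1.0:
            # The kernel reaches no neighbouring integer, so nothing is smoothed.
            smoothed[d] += mass
            continue
        weights = [epanechnikov((x - d) / h) for x in range(len(masses))]
        total = sum(weights)
        for x, w in enumerate(weights):
            smoothed[x] += mass * w / total
    return smoothed

masses = [1 / 13, 6 / 13, 2 / 13, 1 / 13, 2 / 13, 1 / 13]
# Arbitrary placeholder bandwidth that grows with d, for demonstration only.
print(smooth(masses, lambda d: d / 2))
```

With any bandwidth that grows from zero, the left side of the distribution is left essentially untouched while bars further right are blended with their neighbours.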

Verticall then identifies peaks in the smoothed distribution by starting at local maxima (2 in our toy example) and broadening them to the local minima:

      ┃
      ┃
      ┃
      ┃  
      ┃  ┃     ┃
   ┃  ┃  ┃  ┃  ┃  ┃
  ──────────────────
   0  1  2  3  4  5
  ┗━━━━━━━━━━┛
    peak 1
    (77%)  ┗━━━━━━━┛
             peak 2
             (31%)

Each peak has an associated mass[4], and Verticall chooses the one with the largest mass as the 'true' distance peak, i.e. the vertically-inherited part of the alignment[5].
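
The peak-finding idea can be sketched as follows. This is a simplified illustration with a hypothetical find_peaks function, and it handles ties between neighbouring bars naively, so it should not be read as Verticall's exact logic.

```python
def find_peaks(values):
    """Return (left, top, right, mass) for each local maximum, bounds inclusive.

    Each peak is widened from its top down to the local minimum on either side,
    so neighbouring peaks share the minimum between them.
    """
    peaks = []
    last = len(values) - 1
    for top, value in enumerate(values):
        left_lower = top == 0 or values[top - 1] < value
        right_lower = top == last or values[top + 1] <= value
        if not (left_lower and right_lower):
            continue
        left = top
        while left > 0 and values[left - 1] <= values[left]:
            left -= 1
        right = top
        while right < last and values[right + 1] <= values[right]:
            right += 1
        peaks.append((left, top, right, sum(values[left:right + 1])))
    return peaks

# Window counts per difference value (0 to 5) from the toy example.
counts = [1, 6, 2, 1, 2, 1]
peaks = find_peaks(counts)
total = sum(counts)
for left, top, right, mass in peaks:
    print(f"peak at {top}: spans {left} to {right}, mass {mass / total:.0%}")
most_massive = max(peaks, key=lambda peak: peak[3])
```

For the toy distribution this finds one peak spanning 0 to 3 (10 of 13 windows) and one spanning 3 to 5 (4 of 13 windows), with the first being the most massive.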

Verticall then partitions the distribution into three categories: vertical (confidently part of the most-massive peak), horizontal (confidently not part of the most-massive peak) and ambiguous (in between):

      ┃
      ┃
      ┃
      ┃  
      ┃  ┃     ┃
   ┃  ┃  ┃  ┃  ┃  ┃
  ──────────────────
   0  1  2  3  4  5
  ┗━━━━━━━┛   ┗━━━━┛
  vertical   horizontal

           ┗━┛
        ambiguous

While our toy example only has a horizontal partition to the right of the most-massive peak (i.e. too many differences), Verticall can also define horizontal partitions to the left (i.e. too few differences).

The exact rules for partitioning the smoothed distribution are as follows:

  • Set 'high' and 'very-high' thresholds (t_high and t_v-high) to the right of the most-massive peak:
    • Starting at the local maximum in the most-massive peak (max_peak), find the local minimum to the right (min_right) and the local maximum to the right of that (max_right).
    • t_high is halfway between the peak and the minimum to the right: (max_peak + min_right) / 2
    • t_v-high is halfway between the minimum to the right and the maximum to the right: (min_right + max_right) / 2
    • If there is no min_right (i.e. the distribution decreases to zero on the right), then t_high and t_v-high are set to ∞.
  • Set 'low' and 'very-low' thresholds (t_low and t_v-low) to the left of the most-massive peak:
    • Starting at the local maximum in the most-massive peak (max_peak), find the local minimum to the left (min_left) and the local maximum to the left of that (max_left).
    • t_low is halfway between the peak and the minimum to the left: (max_peak + min_left) / 2
    • t_v-low is halfway between the minimum to the left and the maximum to the left: (min_left + max_left) / 2
    • If there is no min_left (i.e. the distribution decreases to zero on the left), then t_low and t_v-low are set to -∞.
  • Each difference-count (d) in the distribution is then labelled:
    • Horizontal: d < t_v-low
    • Ambiguous: t_v-low ≤ d < t_low
    • Vertical: t_low ≤ d ≤ t_high
    • Ambiguous: t_high < d ≤ t_v-high
    • Horizontal: t_v-high < d

In our toy example: t_v-low = -∞, t_low = -∞, t_high = 2, t_v-high = 3.5
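
The partitioning rules can be written out as a short sketch. The helper names (walk, thresholds, classify) are hypothetical, and the sketch treats running off the end of the distribution without a rise as 'no local minimum', which is an assumption about how the edge case is handled.

```python
import math

def walk(values, start, step):
    """From a peak top, walk left (step=-1) or right (step=+1).

    Returns (local minimum, next local maximum), or None if the values never
    rise again before the end of the distribution.
    """
    i = start
    while 0 <= i + step < len(values) and values[i + step] <= values[i]:
        i += step
    if not 0 <= i + step < len(values):
        return None
    minimum = i
    while 0 <= i + step < len(values) and values[i + step] >= values[i]:
        i += step
    return minimum, i

def thresholds(values, max_peak):
    """Return the very-low, low, high and very-high thresholds, in that order."""
    t_low = t_v_low = -math.inf
    t_high = t_v_high = math.inf
    right = walk(values, max_peak, +1)
    if right is not None:
        min_right, max_right = right
        t_high = (max_peak + min_right) / 2
        t_v_high = (min_right + max_right) / 2
    left = walk(values, max_peak, -1)
    if left is not None:
        min_left, max_left = left
        t_low = (max_peak + min_left) / 2
        t_v_low = (min_left + max_left) / 2
    return t_v_low, t_low, t_high, t_v_high

def classify(d, t_v_low, t_low, t_high, t_v_high):
    """Label a difference count as V (vertical), A (ambiguous) or H (horizontal)."""
    if d < t_v_low or d > t_v_high:
        return "H"
    if d < t_low or d > t_high:
        return "A"
    return "V"

counts = [1, 6, 2, 1, 2, 1]   # toy distribution, difference counts 0 to 5
limits = thresholds(counts, max_peak=1)
labels = "".join(classify(d, *limits) for d in range(len(counts)))
print(limits, labels)
```

For the toy distribution, difference counts 0 to 2 come out as vertical, 3 as ambiguous, and 4 and 5 as horizontal.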

Step 4: paint the alignments

Now that each difference-count has a classification, we can apply those to our sliding windows (V=vertical, A=ambiguous, H=horizontal):

  ======X=============D=======X==I==========X=XX=XX===========
1:VVVVVVVVVVVV      3:AAAAAAAAAAAA      5:HHHHHHHHHHHH
    1:VVVVVVVVVVVV      2:VVVVVVVVVVVV      4:HHHHHHHHHHHH
        0:VVVVVVVVVVVV      2:VVVVVVVVVVVV      1:VVVVVVVVVVVV
            1:VVVVVVVVVVVV      1:VVVVVVVVVVVV
                1:VVVVVVVVVVVV      4:HHHHHHHHHHHH

To simplify things, it is easier to consider windows which have been trimmed so they are non-overlapping:

  ======X=============D=======X==I==========X=XX=XX===========
  VVVVVVVV                AAAA                HHHH
          VVVV                VVVV                HHHH
              VVVV                VVVV                VVVVVVVV
                  VVVV                VVVV
                      VVVV                HHHH

Now these window classifications can be 'painted' onto the CIGAR:

  ======X=============D=======X==I==========X=XX=XX===========
  VVVVVVVVVVVVVVVVVVVVVVVVAAAAVVVVVVVVVVVVHHHHHHHHHHHHVVVVVVVV
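
The trimming and painting can be sketched like this. The paint_cigar function is hypothetical, and the trimming rule (keep the middle of each window, and let the first and last windows extend to the ends) is inferred from the diagrams above rather than taken from Verticall's code.

```python
def paint_cigar(length, window_labels, window_size, window_step):
    """Paint per-window labels onto a simplified CIGAR of the given length.

    Each window is trimmed to its central window_step positions so the windows
    no longer overlap; the first and last windows are extended to the ends.
    """
    trim = (window_size - window_step) // 2
    paint = [None] * length
    for k, label in enumerate(window_labels):
        start = 0 if k == 0 else k * window_step + trim
        is_last = k == len(window_labels) - 1
        end = length if is_last else k * window_step + trim + window_step
        for i in range(start, end):
            paint[i] = label
    return "".join(paint)

# One label per window, in order of window start position (toy example).
window_labels = "VVVVVAVVVHHHV"
painted = paint_cigar(60, window_labels, window_size=12, window_step=4)
print(painted)
```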

Ambiguous regions are then resolved based on their neighbours:

  • Ambiguous regions surrounded on both sides by vertical regions are changed to vertical.
  • Ambiguous regions that are adjacent to a horizontal region are changed to horizontal.
  • Ambiguous regions at the start/end of an alignment are changed to match their neighbouring region.
  • Ambiguous regions that span an entire alignment are conservatively changed to horizontal.

For our toy example, the ambiguous region resolves to vertical:

  ======X=============D=======X==I==========X=XX=XX===========
  VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVHHHHHHHHHHHHVVVVVVVV
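
The four rules for ambiguous regions can be expressed compactly. This is a sketch with a hypothetical resolve_ambiguous function, operating on a painted string of V, A and H characters for a single alignment.

```python
from itertools import groupby

def resolve_ambiguous(paint):
    """Replace each run of A with V or H according to its neighbouring runs."""
    runs = [(label, len(list(group))) for label, group in groupby(paint)]
    if len(runs) == 1 and runs[0][0] == "A":
        return "H" * len(paint)   # whole alignment ambiguous: be conservative
    resolved = []
    for i, (label, n) in enumerate(runs):
        if label == "A":
            # Runs alternate labels, so the neighbours of an A run are V or H.
            neighbours = {runs[j][0] for j in (i - 1, i + 1) if 0 <= j < len(runs)}
            label = "H" if "H" in neighbours else "V"
        resolved.append(label * n)
    return "".join(resolved)

painted = "V" * 24 + "A" * 4 + "V" * 12 + "H" * 12 + "V" * 8
resolved = resolve_ambiguous(painted)
print(resolved)
```

An ambiguous run at the start or end of an alignment has only one neighbour, so the same neighbour check covers that rule as well.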

There are a number of different ways one can extract a distance value from this process – see Columns in pairwise TSV file for descriptions. However, the most intuitive and useful is likely what Verticall calls the 'mean vertical distance': the number of differences in vertical regions over the total number of vertical positions. For our toy example this is: 4 / 48 = 0.0833. Since the horizontal region was more divergent than the rest of the sequence in this case, the mean vertical distance is less than the simple distance of 0.15 we got from our gap-compressed CIGAR.
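
The mean vertical distance for the toy example can be checked with a few lines of Python. The function name is hypothetical, and returning None when there are no vertical positions is an assumption made for this sketch.

```python
def mean_vertical_distance(simplified, paint):
    """Differences in vertical positions divided by the number of vertical positions."""
    vertical = [op for op, label in zip(simplified, paint) if label == "V"]
    if not vertical:
        return None   # no vertical positions, so the distance is undefined here
    return sum(op != "=" for op in vertical) / len(vertical)

# Simplified CIGAR and resolved paint from the toy example (60 characters each).
simplified = ("=" * 6 + "X" + "=" * 13 + "D" + "=" * 7 + "X" + "==" + "I"
              + "=" * 10 + "X=XX=XX" + "=" * 11)
paint = "V" * 40 + "H" * 12 + "V" * 8

distance = mean_vertical_distance(simplified, paint)
print(round(distance, 4))
```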

If we are only interested in the distance between assemblies (e.g. for the distance tree workflow), then we are finished! However, if we are interested in which regions of the assemblies are vertical or horizontal, then there is one more step...

Step 5: paint the contigs

Since the positions of the painted CIGAR can be mapped back onto the original sequences, Verticall can now transfer the 'paint' back onto the contigs:

assembly A: ACTCTCAGCGGAACTGCCGCTATGCTTTACACATACCGTAGGACGTAGGTGGGCCGATCTATCGCAGAA
            UUVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVHHHHHHHHHHHHVVVVVVVVUUUUUU

assembly B: TGGTCTCAGAGGAACTGCCGCTATTTACACGTATCCGTAGGACGAACCTAAGCCGATCTATCTAGCC
            UUUVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVHHHHHHHHHHHHVVVVVVVVUUUUU

Note that in addition to vertical and horizontal (which came from the painted CIGAR) there is a third 'colour': unaligned (shown above as U). This indicates which regions of the assemblies were not covered by any of the pairwise alignments, i.e. sequences unique to just one of the assemblies.
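
Here is a sketch of how the paint could be projected onto both sequences for a single forward-strand alignment. It is not Verticall's code: the paint_contigs function is hypothetical, and it assumes that D positions exist only in assembly A and I positions only in assembly B (which is how the toy alignment reads). The start offsets (2 for A, 3 for B) and sequence lengths (69 and 67) are read off the toy example.

```python
from itertools import groupby

def paint_contigs(expanded, paint, a_start, a_len, b_start, b_len):
    """Project a painted, indel-compressed CIGAR onto sequences A and B.

    expanded is the expanded CIGAR; paint has one label per position of the
    simplified CIGAR. Positions outside the alignment are labelled U.
    """
    a_paint, b_paint = [], []
    pos = 0   # index into the paint string
    for op, run in groupby(expanded):
        n = len(list(run))
        if op in "ID":
            labels = paint[pos] * n      # a whole indel shares one painted position
            pos += 1
        else:
            labels = paint[pos:pos + n]
            pos += n
        if op != "I":
            a_paint.append(labels)       # =, X and D consume bases of A
        if op != "D":
            b_paint.append(labels)       # =, X and I consume bases of B
    a_mid, b_mid = "".join(a_paint), "".join(b_paint)
    a = "U" * a_start + a_mid + "U" * (a_len - a_start - len(a_mid))
    b = "U" * b_start + b_mid + "U" * (b_len - b_start - len(b_mid))
    return a, b

expanded = ("=" * 6 + "X" + "=" * 13 + "DDD" + "=" * 7 + "X" + "==" + "I"
            + "=" * 10 + "X=XX=XX" + "=" * 11)
paint = "V" * 40 + "H" * 12 + "V" * 8
a_paint, b_paint = paint_contigs(expanded, paint, a_start=2, a_len=69, b_start=3, b_len=67)
print(a_paint)
print(b_paint)
```

A real assembly pair would have many alignments, on either strand, all contributing paint to the same contigs, which this single-alignment sketch does not attempt to handle.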

This contig-level painting is needed for the alignment tree workflow, where we want to mask out parts of a whole-genome pseudo-alignment.

TSV output

Each pairwise comparison carried out by the Verticall pairwise command produces a single line in the TSV output file. This file contains a lot of information about all the steps described above and is used by other Verticall commands: Verticall matrix, Verticall mask and Verticall summary. See Columns in pairwise TSV file for a full description of that file.

Footnotes

1: If you use the --ignore_indels option, Verticall will remove all indels from the CIGAR. This is useful for when you want to consider only SNVs or you don't trust indels in your assembly, e.g. when using a sequencing platform with a high rate of homopolymer indel errors.

2: Identity equals one minus distance. So to calculate distance (as Verticall does) you count the mismatches+indels in the CIGAR, while for identity you count the matches in the CIGAR.

3: This window size and step were chosen by me to make the example easy to understand. In real assembly pairs, Verticall dynamically chooses a window size which gives a target number of total windows (set by --window_count, default=50000), and it uses a window step equal to 1% of the window size.

4: Note that since the peaks share the local minima between them, the total mass of all peaks can exceed 100%.

5: In some cases, the most-massive peak and second-most-massive peak might have similar mass – see Primary vs secondary results for more discussion on this.