Sources of Error

Leaky Alignments

Alignment Leakage

If a known virus A is present at a high read count, then sequencing error, biological artifacts, and mis-mapping will cause a small fraction of its reads to be assigned to related but incorrect reference sequences (B and C). Measured in nt or aa substitutions, the virus in the sequencing library may then fall within the "known" range of distance from virus A but within the "unknown" range from viruses B and C, so the leaked reads can look like a divergent relative of B or C.

Often this leaked signal falls well below the level of "noise", but in libraries with high viral read counts (tens of thousands of reads), it may produce an appreciable signal for neighboring viruses.
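As a rough illustration of how such leakage might be spotted, the sketch below flags references whose read count is only a small fraction of the dominant hit in the same library. The function name, the input layout (a mapping of reference name to read count, restricted to related references), and the 1% ratio cutoff are all assumptions made for this example, not values taken from any pipeline.

```python
def flag_possible_leakage(read_counts, max_ratio=0.01):
    """Return references that may only be receiving leaked reads.

    read_counts: dict mapping reference name -> reads assigned in one library
                 (assumed to contain related references only).
    max_ratio:   hypothetical cutoff; a reference is flagged when its reads
                 are at most this fraction of the dominant reference's reads.
    """
    if not read_counts:
        return []

    # The dominant reference plays the role of "virus A" in the text.
    top_ref = max(read_counts, key=read_counts.get)
    top_reads = read_counts[top_ref]
    if top_reads == 0:
        return []

    # Neighbors with a small but non-zero share of the dominant signal
    # are candidates for alignment leakage ("viruses B and C").
    return sorted(
        ref
        for ref, n in read_counts.items()
        if ref != top_ref and 0 < n / top_reads <= max_ratio
    )


# Made-up counts: A dominates, B and C receive a trickle, D is a real co-hit.
example = {"A": 50000, "B": 40, "C": 12, "D": 9000}
flagged = flag_possible_leakage(example)
```

With these made-up counts, B and C are flagged while D is not, because D carries 18% of the dominant read count. A flag is only a prompt to inspect the alignments, not proof of leakage.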

The best way to mitigate this issue is to search at a higher level of the taxonomic hierarchy when locating novel viruses. For instance, instead of asking "Find me a novel PCV2-related sequence", first ask "Find a novel Circovirus sequence", and then subset those results with "In which of those libraries is PCV2 the best-available match?"
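The two-step query can be sketched as follows. This is a minimal illustration under stated assumptions: the `Hit` record, its field names, the library identifiers, and the 90% identity cutoff for "novel" are hypothetical and do not reflect a real result schema.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    library: str      # sequencing library identifier (made up here)
    genus: str        # higher-level group of the best match
    best_match: str   # closest reference sequence in that group
    identity: float   # percent identity to the best match


def novel_in_genus(hits, genus, max_identity=90.0):
    """Step 1: search broadly. Keep hits in the genus that are divergent
    enough to count as novel (the cutoff is an assumed value)."""
    return [h for h in hits if h.genus == genus and h.identity < max_identity]


def with_best_match(hits, reference):
    """Step 2: narrow down. Keep hits whose closest reference is the one
    of interest."""
    return [h for h in hits if h.best_match == reference]


hits = [
    Hit("lib1", "Circovirus", "PCV2", 82.0),
    Hit("lib2", "Circovirus", "PCV1", 85.0),
    Hit("lib3", "Circovirus", "PCV2", 99.5),  # known PCV2, not novel
    Hit("lib4", "Other", "X", 70.0),
]

novel_circo = novel_in_genus(hits, "Circovirus")   # lib1, lib2
pcv2_like = with_best_match(novel_circo, "PCV2")   # lib1
```

Searching at the genus level first means each library is judged by its best match across all Circovirus references, so reads that merely leaked from an abundant PCV1 onto PCV2 are less likely to be reported as a novel PCV2-related virus.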

PCV1 and PCV2 Leak