Sources of Error
If a known virus A is present at a high read count, then sequencing error, biological artifacts, and mis-mapping will cause a small fraction of reads to be assigned to related but less suitable reference sequences (B and C). Measured in nt or aa substitutions, the virus in the sequencing library may then fall within the "known range" of virus A but within the "unknown range" of viruses B and C.
Often this falls well below the level of "noise", but in libraries with high viral read counts (tens of thousands of reads), it may produce an appreciable signal in neighboring viruses.
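To make the scaling concrete, here is a rough back-of-the-envelope sketch in Python. The function name `expected_spillover` and the 0.1% misassignment rate are illustrative assumptions, not measured values; the real rate depends on the library, the aligner, and how closely related the reference sequences are.

```python
def expected_spillover(total_reads: int, misassign_rate: float) -> float:
    """Expected number of reads assigned to neighboring references.

    misassign_rate is a hypothetical fraction of virus A reads that end up
    on related references (B, C) through sequencing error or mis-mapping.
    """
    if total_reads < 0:
        raise ValueError("total_reads must be non-negative")
    if not 0.0 <= misassign_rate <= 1.0:
        raise ValueError("misassign_rate must be between 0 and 1")
    return total_reads * misassign_rate


# Assumed rate of 0.1%, chosen only for illustration.
ASSUMED_RATE = 0.001

for reads in (100, 10_000, 50_000):
    spill = expected_spillover(reads, ASSUMED_RATE)
    print(f"{reads:>6} reads of A: about {spill:g} reads on neighbors")
```

Under this assumed rate, a library with 100 reads of virus A would rarely show any spillover at all, while a library with tens of thousands of reads could place dozens of reads on B and C.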
The best way to mitigate this issue is to consider a higher level of the hierarchy when locating novel viruses. For instance, instead of asking "Find me a novel PCV2-related sequence", first ask "Find a novel Circovirus sequence" and then subset those results by asking "For which of those libraries is the best-available match PCV2?"
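The two-step query can be sketched as a pair of filters over per-library summaries. This is a minimal illustration in Python, not the project's actual query interface: the `Hit` record, its field names, the library names, and the 90% identity cutoff for "novel" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    library: str      # sequencing library identifier (made up here)
    group: str        # higher-level taxon, e.g. "Circovirus"
    best_match: str   # closest reference within that group, e.g. "PCV2"
    identity: float   # percent identity to best_match


def novel_in_group(hits, group, max_identity):
    """Step 1: libraries whose best match in `group` is below the cutoff."""
    return [h for h in hits if h.group == group and h.identity < max_identity]


def with_best_match(hits, reference):
    """Step 2: keep only libraries whose best match is `reference`."""
    return [h for h in hits if h.best_match == reference]


# Hypothetical per-library summaries, one record per library.
hits = [
    Hit("LIB_1", "Circovirus", "PCV2", 82.0),
    Hit("LIB_2", "Circovirus", "PCV3", 75.0),
    Hit("LIB_3", "Circovirus", "PCV2", 99.5),  # known PCV2, not novel
    Hit("LIB_4", "OtherGenus", "OtherVirus", 60.0),
]

novel_circo = novel_in_group(hits, "Circovirus", max_identity=90.0)
pcv2_like = with_best_match(novel_circo, "PCV2")

print([h.library for h in novel_circo])
print([h.library for h in pcv2_like])
```

Because each library is summarized by its single best match across the whole group, a library dominated by known PCV2 (like `LIB_3` here) is excluded as a whole, rather than reappearing as a weak "novel" hit against a neighboring reference.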