Feature request: Utilize Diamond's contig features #10

schorlton · 2021-03-21T02:57:04Z

Thanks for the awesome software! If I understand the code correctly, Diamond is executed using largely default parameters. I'd suggest adding --range-culling --top 10 -F 15 (source), but this will likely require rewrites of other areas of contigtax. These parameters make Diamond perform local alignment along the query contig, retaining the hits that score within 10% of the top hit in each region of the contig. contigtax would then have to stop filtering by bitscore, and should rely on Diamond's evalue parameter instead of filtering on e-value itself. Just wanted to get the discussion started - there are probably a bunch of other design decisions that I'm missing.
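For concreteness, here is a minimal sketch of how such a call could be assembled. The `build_diamond_cmd` helper, the file names and the e-value default are hypothetical and not taken from contigtax; only the Diamond flags themselves come from the suggestion above.

```python
import shlex


def build_diamond_cmd(query, db, out, top=10, frameshift=15,
                      evalue=0.001, range_culling=True):
    """Build a Diamond blastx argument list (hypothetical helper, not contigtax code)."""
    cmd = [
        "diamond", "blastx",
        "--query", query,
        "--db", db,
        "--out", out,
        "--evalue", str(evalue),  # let Diamond apply the e-value cutoff
        "--top", str(top),        # keep hits within `top` percent of the best score
    ]
    if range_culling:
        # -F (frameshift penalty) is passed together with --range-culling,
        # as in the suggested parameter set.
        cmd += ["--range-culling", "-F", str(frameshift)]
    return cmd


if __name__ == "__main__":
    # File names are placeholders.
    print(shlex.join(build_diamond_cmd("contigs.fasta", "db.dmnd", "hits.tsv")))
```

Returning a list rather than a string means it can be handed to `subprocess.run` without shell quoting issues.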


johnne · 2021-04-05T20:29:38Z

Thanks for using the software and for making the suggestion!

Yes, most of the call to diamond is made with default parameters, and the configurable options are so far only --evalue, --top, --mode (blastx/blastp), --blocksize and --chunks. Note that you can use --top in both the search and assign steps of contigtax. In search the value is passed directly to Diamond, which means the resulting output will already be filtered to whatever --top setting is used (10% by default). In the assign step you have the option of supplying the --top parameter again, and here the value is used to filter the results file once more prior to assigning taxonomies (default here is 5%). My idea was that since the search step is the most time-consuming, you can run it with slightly more relaxed bitscore filtering and then modify the stringency at the assign step if need be.
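To illustrate the two-stage filtering, here is a small sketch (not the actual contigtax code). It assumes hits for one contig are (subject, bitscore) pairs and that "within top percent" means a bitscore of at least best * (1 - top/100); the `filter_top` name and the example scores are made up.

```python
def filter_top(hits, top):
    """Keep hits whose bitscore is within `top` percent of the best bitscore.

    `hits` is a list of (subject_id, bitscore) tuples for a single contig.
    This is an illustrative sketch, not the contigtax implementation.
    """
    if not hits:
        return []
    best = max(score for _, score in hits)
    cutoff = best * (1 - top / 100)
    return [(subject, score) for subject, score in hits if score >= cutoff]


# Made-up hits for one contig.
hits = [("A", 200.0), ("B", 185.0), ("C", 192.0), ("D", 150.0)]

# Relaxed filtering at the search step (10%): cutoff is 180, D is dropped.
searched = filter_top(hits, 10)

# Stricter filtering at the assign step (5%): cutoff is 190, B is dropped too.
assigned = filter_top(searched, 5)
```

Because the search output is already restricted to the search-step setting, a value at the assign step can only make the filtering stricter, never recover hits that were dropped earlier.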

I briefly remember thinking about the --range-culling feature of diamond a long time ago, but never got around to testing it that much. From what I understand the idea with it is to allow several hits with lower scores than the best-scoring hit to be reported from the same contig. This may impact the output from contigtax, especially for long contigs, depending on how the assignment step is run.
I've noticed that contigtax seems to perform best on contigs <10 kbp in length. For very long contigs (close to complete bacterial chromosomes), hits for ribosomal sequences are likely to limit the resolution of the assignments, because these genes have high bitscores and are well conserved between lineages, which pushes the LCA up in the taxonomic hierarchy.
Using range-culling would probably at least make sure more hits are reported for long contigs. However, with the default rank_lca mode of assigning taxonomy I suspect the final output will be the same, because all reported hits are used to assign the LCA (including those with high bitscores). With the rank_vote assign mode the output may however change, since here contigtax makes a decision (takes a 'vote') from the list of hit taxa, choosing the one which makes up at least vote_threshold (default = 0.5) of taxa at the considered rank. For a contig queried with range-culling, the extra reported hits could then help push some taxa above the vote_threshold, leading them to be assigned to the contig. That may however take some additional coding, maybe to make contigtax take a vote on a per-region basis.
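A rough sketch of the voting idea at a single rank, to show how extra hits could tip the result. This is a simplification and not the contigtax implementation; the `rank_vote` function below and the taxon lists are invented for illustration.

```python
from collections import Counter


def rank_vote(taxa, vote_threshold=0.5):
    """Return the taxon making up at least `vote_threshold` of `taxa`, else None.

    `taxa` holds one taxon name per hit at the rank being considered.
    Simplified sketch only; on an exact tie the first taxon seen wins.
    """
    if not taxa:
        return None
    taxon, count = Counter(taxa).most_common(1)[0]
    if count / len(taxa) >= vote_threshold:
        return taxon
    return None


# Made-up genus-level hits for one contig without range-culling:
# no genus reaches 50%, so no assignment is made at this rank.
without_culling = ["Escherichia", "Salmonella", "Klebsiella"]

# With extra hits from other regions of the contig (also made up),
# Escherichia makes up 3 of 5 hits and passes the threshold.
with_culling = without_culling + ["Escherichia", "Escherichia"]

result_without = rank_vote(without_culling)
result_with = rank_vote(with_culling)
```

A per-region vote could reuse the same function by grouping hits by query coordinates first and then voting within each group.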

Again, thanks for making the suggestion. It's interesting to think about and discuss these things. Adding the range-culling feature as an option to contigtax should not be a problem, as it doesn't appear to affect the output format and thus doesn't cause problems with downstream assignments. I noticed that the feature requires diamond to be run with a frameshift penalty, which is mostly recommended for error-prone long-read sequence output and not the assembled short-read sequences I had in mind when designing contigtax, but nevertheless it may have its benefits.

I've found that the lca classify functionality of sourmash is a good complement to contigtax because it performs very well on the long contigs where contigtax struggles.

johnne added the enhancement New feature or request label Apr 6, 2021
