-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RocksDB-based fastmultigather
is broken.
#322
Comments
Oh, and here are |
Related - in sourmash-bio/sourmash#3138 (comment), commit sourmash-bio/sourmash@10d5ee8, I add an analog to the Python test |
This PR fixes an issue introduced in #2943 where we introduced a subtly broken calculation that uses the _current_ size of the query metagenome as the denominator for the `f_unique_to_query` calculation. Fixes #3137 This PR also adds some commented-out test code that demonstrates #3139 / sourmash-bio/sourmash_plugin_branchwater#322. That's something I haven't been able to debug, so I'd suggest fixing that independently - I'd rather fix _a_ problem _now_, rather than waiting until we can fix _multiple_ problems at some later indeterminate time :). ## Notes - [x] do we need to fix same problem in `linear.rs`? or just rename things per #3137? - [x] we should add some tests for this
…d by plugins (#3193) This PR fixes a bug in `disk_revindex.rs::RevIndexOps::gather` where RocksDB-based `gather` was incorrectly subtracting hashes multiple times from the counter in situations of high redundancy. For example, consider this Venn diagram of the 3-way intersection between three sketches: ![image](https://github.com/sourmash-bio/sourmash/assets/51016/f98bf6a0-acc4-415f-bf39-1c48a599996a) When a metagenome contains the union of all three of these sketches, the broken implementation would subtract the `10` at the center multiple times. This was caused by removing hashes from the matches, rather than the intersection, each pass through the counter. Of note, this made RocksDB-based `fastmultigather` return incorrect results, ref sourmash-bio/sourmash_plugin_branchwater#322; first discovered in #3138 (comment). The PR fixes this, and adds a more complete pair of tests, based on `test_gather_metagenome_num_results` in the Python tests for sourmash. This PR also adjusts the hash function string for DNA sketches in Rust to be uppercase `DNA` rather than lowercase `dna`, ref sourmash-bio/sourmash_plugin_directsketch#49 And remember, it's not *just* the destination - it's the friends you make along the way, like `env_logger`. * Fixes #3139 * Fixes sourmash-bio/sourmash_plugin_directsketch#49 For consideration: Right now we are calculating the intersection twice, once in `disk_revindex.rs` and once in `calculate_gather_stats` in `revindex/mod.rs`. This is unnecessary, of course. But the function signature for `calculate_gather_stats` would need to change to take the intersection as an argument. We could: * keep calculating it twice, just for simplicity; * change `calculate_gather_stats` to take an _optional_ intersection, and calculate it if not provided; * change `calculate_gather_stats` to require an intersection. TODO: - [x] add at least one explicit test for the moltype fix Other notes: * there is a discrepancy between the Python (`sourmash gather`) and Rust (`sourmash_plugin_branchwater` results) calculations for `remaining_bp`. It seems to me like the Python one is definitely wrong; not yet sure about Rust. Viz #3194.
On latest
main
, the following demonstrates thatfastmultigather
is yielding incorrect (incomplete) results. (This was discovered as part of debugging differences in #298, but the behavior is present in the main branch.)Extract leptothrix sig:
Find all overlaps with that:
Extract to new collection:
Index:
Run fmg with resulting rocksdb:
Run fastgather with the original:
Observe different output sizes: 10 matches from fmg-rdb, 13 for fg:
Inspection of results, ordered by unique_intersect_bp:
shows rapidly increasing disagreement - the numbers in the first column are the fastmultigather results, the numbers in the second are from fastgather.
See notebook compare-fmg-limited for processing & comparison code.
The text was updated successfully, but these errors were encountered: