MRG: Calculate all gather stats in rust; use for rocksdb gather (#2943)
This PR adds a `calculate_gather_stats` function (and criterion benchmark) to calculate all gather statistics directly in Rust. It makes use of this function in `disk_revindex.rs`, meaning that gather on a rocksdb database will now produce the full `GatherResult`.

Note that this PR does not change the Python layer to use these stats. This is just a step on the way to that :).

## New functions:

- `Sketch::MinHash::sum_abunds()` - sum abundances in a MinHash
- `Sketch::MinHash::n_unique_kmers()` - `len(self) * self.scaled`
- `Sketch::inflate(abunds_from)` - inflate abundances in self using abund values from the `abunds_from` MinHash
- `Sketch::inflated_abundances(abunds_from)` - same process as `inflate`, but just return the abundance vector and sum of abundances
- `ani_utils::ani_from_containment` - a streamlined function to get ANI from containment. I suspect a lot of the time added for ANI calculation (ref: testing over in branchwater) comes from the way we calculate it in Python (including calculating a total number of k-mers). This excludes anything not really needed for basic ANI without confidence intervals.
- `ani_utils::ani_from_containment_ci` - get high and low CI ANI from containment
- `index::calculate_gather_stats` - calculate all gather statistics that can be separated from the gather iteration and return the full `GatherResult`

The "expensive" calculations:

- abundance-weighted values require re-calculating intersections and/or manipulating the abundance vector
  > Thankfully I think the new `inflate`/`inflated_abundances` code is pretty efficient? It uses the `itertools` `merge_join_by` strategy from https://github.com/sourmash-bio/sourmash/compare/lirber/itertools_merge. Thanks @luizirber :)
- abundance `median` and `stddev` currently require cloning abunds, not sure how big of an issue that is
- ANI with confidence intervals

## Punted Issues

- #3020: calculate all gather stats for all other database types. Start with `linear.rs` since it uses the `GatherResult` struct already.
- #3021: try moving `calculate_gather_stats` outside of the `gather` functions to enable postprocessing in parallel.
- #3022: write ffi for the new functions?
  - `sketch::minhash::sum_abunds`
  - `sketch::minhash::n_unique_kmers`
  - `sketch::minhash::inflate`
  - `sketch::minhash::inflated_abundances`
  - `ani_utils::ani_from_containment`
  - `ani_utils::ani_from_containment_ci`
  - `ani_utils::prob_nothing_in_common`

ref sourmash-bio/sourmash_plugin_branchwater#187.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Mohamed Abuelanin <mabuelanin@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Co-authored-by: Luiz Irber <contact+github@luizirber.org>
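
For readers who want a feel for the new helpers, here is a minimal sketch of three of them. It is not the code in this PR. The free functions, the slice-based signatures, and the `(hash, abundance)` pair layout are all assumptions made for illustration, and a hand-written two-pointer merge stands in for the `itertools` `merge_join_by` call so the block needs only the standard library. The ANI formula shown is the usual point estimate `containment^(1/ksize)`, with no confidence interval.

```rust
use std::cmp::Ordering;

/// Estimated number of unique k-mers: retained hashes times `scaled`.
/// (Hypothetical free function; the PR describes this as a MinHash method.)
pub fn n_unique_kmers(n_hashes: u64, scaled: u64) -> u64 {
    n_hashes * scaled
}

/// Collect the abundance from `abunds_from` for every hash that also
/// appears in `mins`, and return the abundance vector plus its sum.
/// Both inputs are assumed to be sorted by hash, which is what makes a
/// single linear merge pass sufficient.
pub fn inflated_abundances(mins: &[u64], abunds_from: &[(u64, u64)]) -> (Vec<u64>, u64) {
    let mut abunds = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < mins.len() && j < abunds_from.len() {
        match mins[i].cmp(&abunds_from[j].0) {
            // Hash only in `mins`: nothing to inflate from, skip it.
            Ordering::Less => i += 1,
            // Hash only in `abunds_from`: not part of the intersection.
            Ordering::Greater => j += 1,
            // Shared hash: take the abundance from `abunds_from`.
            Ordering::Equal => {
                abunds.push(abunds_from[j].1);
                i += 1;
                j += 1;
            }
        }
    }
    let sum: u64 = abunds.iter().sum();
    (abunds, sum)
}

/// ANI point estimate from containment, without confidence intervals.
pub fn ani_from_containment(containment: f64, ksize: u32) -> f64 {
    containment.powf(1.0 / f64::from(ksize))
}
```

With these definitions, `inflated_abundances(&[1, 3, 5], &[(1, 2), (2, 7), (5, 4)])` keeps only the shared hashes 1 and 5, so it returns the abundances `[2, 4]` and the sum `6`.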