Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optionally enable full gather csv output from fastmultigather #187

Closed
bluegenes opened this issue Jan 18, 2024 · 2 comments · Fixed by #298
Closed

optionally enable full gather csv output from fastmultigather #187

bluegenes opened this issue Jan 18, 2024 · 2 comments · Fixed by #298

Comments

@bluegenes
Copy link
Contributor

after #134, I was hoping to provide all gather columns in the branchwater fastmultigather results csv by using the rust GatherResult and adding in the query-specific columns within branchwater.

I think that's feasible, but might require a little more work in rust core:

Ref:

As noted in #107, we need to be wary of performance hits as a result of calculating the additional columns. We might want to make lightweight version default, and support an option for calculating + outputting the full csv (or vice versa)

@ctb
Copy link
Collaborator

ctb commented Jan 26, 2024

or we could shortcut!

https://github.com/ctb/2024-calc-full-gather

@ctb
Copy link
Collaborator

ctb commented Feb 4, 2024

So a few lessons from calc-full-gather - basically, calc-full-gather works with fastgather results, but not with fastmultigather results. This is because fastmultigather for various reasons doesn't have the same columns as fastgather. SIGH.

Ref ctb/2024-calc-full-gather#2

bluegenes added a commit to sourmash-bio/sourmash that referenced this issue Feb 21, 2024
This PR adds a `calculate_gather_stats` function (and criterion
benchmark) to calculate all gather statistics directly in rust. It makes
use of this function in `disk_revindex.rs`, meaning that gather on a
rocksdb database will now produce the full `GatherResult`.

Note that this PR does not change the python layer to use these stats.
This is just a step on the way to that :).

## New functions:

- `Sketch::MinHash::sum_abunds()` - sum abundances in a MinHash
- `Sketch::MinHash::n_unique_kmers()` - len(self) * self.scaled.
- `Sketch::inflate(abunds_from)` - inflate abundances in self using
abund values from `abunds_from` MinHash
- `Sketch::inflated_abundances(abunds_from)` - same process as
`inflate`, but just return the abundance vector and sum of abundances.
- `ani_utils::ani_from_containment` - a streamlined function to get ANI
from containment. I suspect a lot of the time added for ANI calculation
(ref: testing over in branchwater) is the way we calculate it in python
(including calculating a total number of k-mers). This excludes anything
not really needed for basic ANI without confidence intervals.
- `ani_utils::ani_from_containment_ci` - get high and low CI ANI from
containment
- `index::calculate_gather_stats` - calculate all gather statistics that
can be separated from the gather iteration and return the full
`GatherResult`

The "expensive" calculations:
- abundance-weighted values require re-calculating intersections and/or
manipulating the abundance vector
> Thankfully I think the new `inflate`/`inflated_abundances` code is
pretty efficient? It uses the `itertools` `merge_join_by` strategy from
https://github.com/sourmash-bio/sourmash/compare/lirber/itertools_merge.
Thanks @luizirber :)
- abundance `median` and `stddev` currently require cloning abunds, not
sure how big of an issue that is.
- ANI with confidence intervals

## Punted Issues
- #3020: calculate all gather stats for all other database types. start
with `linear.rs` since it uses `GatherResult` struct already.
- #3021 try moving `calculate_gather_stats` outside of the `gather`
functions to enable postprocessing in parallel.
- #3022 : write ffi for the new functions?
  - `sketch::minhash::sum_abunds`
  - `sketch::minhash::n_unique_kmers`,
  - `sketch::minhash::inflate`
  - `sketch::minhash::inflated_abundances`
  - `ani_utils::ani_from_containment`
  - `ani_utils::ani_from_containment_ci`
  - `ani_utils::prob_nothing_in_common`

ref
sourmash-bio/sourmash_plugin_branchwater#187.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Mohamed Abuelanin <mabuelanin@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Luiz Irber <luizirber@users.noreply.github.com>
Co-authored-by: Luiz Irber <contact+github@luizirber.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants