MRG: write full gather result from `fastgather` and non-rocksdb `fastmultigather` #298
benchmarking with this branch: session:
location: hackmd version: https://hackmd.io/IgakZLceRyeTCXuCgH0iDA?edit
I ran non-rocksdb
Just looking at the number of lines in the gather result:
... so rocksdb fastmultigather produces 79 result lines, while the rest had 85 🤔
@ctb ready for review. Note: When I did column-by-column comparisons with python-based sourmash gather, the
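A column-by-column comparison of two gather CSVs, as mentioned in this comment, could be sketched roughly as follows. This is a minimal illustration and not the script used for this PR: the helper names (`load_rows`, `compare_columns`, `values_equal`), the choice of `match_name` as the key column, and the inline CSV data are all assumptions made for the example.

```python
import csv
import io
import math


def load_rows(handle, key="match_name"):
    """Read a gather CSV into {key column value: row dict}."""
    return {row[key]: row for row in csv.DictReader(handle)}


def values_equal(x, y, rel_tol):
    """Compare numerically when both cells parse as floats, else as strings."""
    try:
        return math.isclose(float(x), float(y), rel_tol=rel_tol)
    except ValueError:
        return x == y


def compare_columns(rows_a, rows_b, rel_tol=1e-6):
    """Yield (match, column, value_a, value_b) for each shared cell that differs."""
    for name in sorted(rows_a.keys() & rows_b.keys()):
        a, b = rows_a[name], rows_b[name]
        for col in sorted(a.keys() & b.keys()):
            if not values_equal(a[col], b[col], rel_tol):
                yield name, col, a[col], b[col]


# Made-up data standing in for two gather outputs (column names are illustrative).
python_csv = "match_name,intersect_bp,f_match\nA,1000,0.5\nB,2000,0.25\n"
rust_csv = "match_name,intersect_bp,f_match\nA,1000,0.5000000001\nB,2,0.25\n"

diffs = list(
    compare_columns(
        load_rows(io.StringIO(python_csv)),
        load_rows(io.StringIO(rust_csv)),
    )
)
for name, col, a, b in diffs:
    print(f"{name}: {col} differs ({a} vs {b})")
```

The float tolerance keeps tiny rounding differences between implementations from being reported, so only real discrepancies (here, `intersect_bp` for match `B`) show up.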
A few notes and questions -
when I look at the differences in the matching names between
To me this looks like it's simply the difference between different implementations of the underlying gather algorithm: there are different matches, but they're all the same species, so I bet there is a tie at some point and
See https://github.com/ctb/2024-debug-gather-difference for a notebook exploring the differences.
I think the column name update is good, per sourmash-bio/sourmash#1555, since we want to use the prefetch-like column names. Just, y'know, we should put them in the PR description and also document it. I'll look at the set of column names here and see if there's something else we should change here in prep for sourmash v5.
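To illustrate how a tie could produce different match names from otherwise equivalent implementations, here is a toy greedy min-set-cover over plain Python sets. It is only a sketch under stated assumptions: the `greedy_gather` function, its `tie_break` parameter, and the sketch names are hypothetical, and the real sourmash and branchwater code paths (thresholds, abundance handling, ordering) are not reproduced here.

```python
def greedy_gather(query, database, tie_break=min):
    """Toy greedy gather: repeatedly take the sketch with the largest overlap
    with the remaining query hashes, then subtract that sketch's hashes."""
    remaining = set(query)
    results = []
    while remaining:
        overlaps = {name: len(hashes & remaining) for name, hashes in database.items()}
        best = max(overlaps.values())
        if best == 0:
            break  # nothing in the database matches what is left
        tied = [name for name, n in overlaps.items() if n == best]
        winner = tie_break(tied)  # the only place the two runs below differ
        results.append((winner, best))
        remaining -= database[winner]
    return results


# Hypothetical hashes: two near-identical strains plus one unrelated sketch.
query = {1, 2, 3, 4, 5, 6, 7}
database = {
    "strain_a": {1, 2, 3, 4},
    "strain_b": {1, 2, 3, 5},
    "other": {6, 7},
}

first_wins = greedy_gather(query, database, tie_break=min)
last_wins = greedy_gather(query, database, tie_break=max)
print(first_wins)
print(last_wins)
```

Both runs return the same number of rows and the same overlap sizes, but the top-ranked name differs, which is the kind of difference described above for closely related genomes.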
Actually, I'm finding (on a stripped-down subset of 90 sketches that contains the union of all matches across the various CSVs) that my newly generated
That suggests that maybe fastmultigather is the problem, not rocksdb.
I see the same 79-line output from fastmultigather against my
ok, I'm wrong, because of the way fastmultigather -o works. Digging in more.
Debugged one mismatch here: #318 - this was causing
Fixed by #319, which can be merged into this PR.
There's another set of mismatches occurring from the same error in the RocksDB code, which uses sourmash-rs core code to calculate
And there's an additional discrepancy in the RocksDB-based fastmultigather, as noted above. I've zeroed in on one specific problem that shows up when looking at gather results (from Python, or fastgather, for fastmultigather against a sig list ;). In brief, the 12th match in RocksDB-fastmultigather is returning too small an overlap. I can spend more time debugging this in the future, but it suggests to me that something is rotten in
Random additional thought - is there any reason not to make
That change would also fix #254
Yes, we can do that. I set it up this way b/c you were originally concerned about slowdown :)
and I appreciate it, but I think you've done a great job of showing that that's not a concern!
NOTE: PR into #298 Removes `--full-results` and updates tests for switch to full results by default.
some things to do before merge -
done! ready for re-review, etc
I ran some separate validation scripts and it all looks great! Thank you!! It's a go for merge!
This PR adds utilities for building full gather results file for `fastgather` and non-rocksdb `fastmultigather`, and makes full output default.
- `fastmultigather` should report full gather results for non rocksdb databases! #287
- `gather` csv output from `fastmultigather` #187
- `fastgather` CSV output for `intersect_bp` is in hashes, not in hashes * scaled as it is in `sourmash gather` output #254
`fastgather` and non-rocksdb `fastmultigather` full output here matches sourmash gather. Issues with rocksdb gather are being handled elsewhere.
Benchmarking (`fastgather`, `fastmultigather`)
progress/separate PRs:
- `match_filename` in full results (use `Record.filename` to get match filename for full gather outputs #303; requires new sourmash core release with MRG: allow get/set record.filename sourmash#3121)
- `KmerMinHashBTree` for hash subtraction + benchmark. Per luiz, `KmerMinHashBTree` are better for any situation where we'll be subtracting/adding hashes to a sketch (WIP: use KmerMinHashBTree for hash subtraction #310)
- `Record`.filename public in order to keep match_filename and write it to full results. (MRG: allow get/set record.filename sourmash#3121)
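Issue #254, referenced in this description, comes down to units: `intersect_bp` is meant to be the number of shared hashes multiplied by `scaled`, not the raw hash count. A minimal sketch of that conversion, with a hypothetical helper name and made-up numbers:

```python
def intersect_bp(n_shared_hashes: int, scaled: int) -> int:
    """Estimate overlap in base pairs from a count of shared hashes.

    Assumes a scaled (FracMinHash-style) sketch, where each retained hash
    stands in for roughly `scaled` k-mers of sequence.
    """
    return n_shared_hashes * scaled


# Made-up example: 150 shared hashes at scaled=1000.
example_bp = intersect_bp(150, 1000)
print(example_bp)
```

Reporting the raw count (150 here) instead of the scaled value makes the column off by a factor of `scaled`, which is easy to miss when every other column agrees.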