Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: support RocksDB databases in sourmash proper through FFI #3545

Open
wants to merge 89 commits into
base: latest
Choose a base branch
from

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Feb 23, 2025

This PR exposes RocksDB-based RevIndex to Python via the FFI layer, supports the load/search/gather APIs, and adds command-line based creation via sourmash index --rocksdb.

Tackles #3558
Fixes #3570

Specifics:

  • adds support for all basic operations on RocksDB databases through FFI wrapping of sourmash::index::revindex::disk_revindex::RevIndex, including creation with sourmash index, support for the Index search/gather protocol, and full manifest/picklist operations 🎉 .
  • adds support for zipfile output from sourmash index, as well as RocksDB, via -F/--index-type, viz viz support more outputs from index command & add more on-disk index tests? #3570;
  • extends tests/test_revindex.py to test finicky details; adds branchwater-created RocksDB files for testing;
  • refactors a number of command-line tests in tests/test_sourmash.py and moves them to test_cmd_index.py; expands the tests to cover SBT, zip, and RocksDB index types;
  • adds a picklist/Idx filter to the disk_revindex.rs code to support picklists on RocksDB databases (but, so far, only for the limited set of operations exposed via FFI);
  • adds some skip-mer tests to test_cmd_index.py viz document and test skipmer moltypes #3449

TODO:

  • clean up code, generally :)
  • clean up code: comment/cleanup revindex.py
  • do targeted test additions based on Python layer, FFI layer
  • add documentation!!
  • update docstring in cli/index.py;
  • benchmark index vs branchwater, search vs manysearch, gather vs other implementations...
  • consider picklist & manifest support; take advantage of Feature: implement FFI for manifest, picklist and selection in Rust #2726?
  • create issues for potential optimization: track gather counters via intermediate object?
  • add rust test for dataset picklist
  • update documentation; esp note that 'index' at command line is same ksize/moltype/scaled.
  • fix scaled skip test
  • fix: flatten in revindex CounterGather add
  • improve/refactor manifest creation
  • benchmark picklists on RocksDB RevIndex, potentially using the entire manifest? :)
  • clean up code: refactor index() in commands.py
  • clean up code: change save/load code to check for directory, then try loading
  • clean up code: consider: revert changes to test_index_protocol?
  • clean up code: revert comments in test_sbt.py
  • clean up code: do a separate PR to remove tests in test_sourmash.py? or, consider dup tests, or move more?

Build RocksDB:

sourmash sig cat tests/test-data/{2,47,63}.fa.sig -o 3sigs.sig.zip
sourmash scripts index 3sigs.sig.zip -k 31 -s 1000 -o tests/test-data/3sigs.branch_0913.rocksdb --rocksdb

run describe

% sourmash sig describe tests/test-data/3sigs.branch_0913.rocksdb 

== This is sourmash version 4.8.15.dev0. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==

---
signature filename: b'tests/test-data/3sigs.branch_0913.rocksdb'
signature: CP001071.1 Akkermansia muciniphila ATCC BAA-835, complete genome
source file: 2.fa
md5: f3a90d4e5528864a5bcc8434b0d0c3b1
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 2701
sum hashes: 2701
signature license: CC0

---
signature filename: b'tests/test-data/3sigs.branch_0913.rocksdb'
signature: NC_009665.1 Shewanella baltica OS185, complete genome
source file: 47.fa
md5: 09a08691ce52952152f0e866a59f6261
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 5177
sum hashes: 5177
signature license: CC0

---
signature filename: b'tests/test-data/3sigs.branch_0913.rocksdb'
signature: NC_011663.1 Shewanella baltica OS223, complete genome
source file: 63.fa
md5: 38729c6374925585db28916b82a6f513
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 5238
sum hashes: 5238
signature license: CC0

loaded 3 signatures total, from 1 files

run gather

% sourmash gather SRR606249.abundtrim.sig tests/test-data/3sigs.branch_0913.rocksdb
...
overlap     p_query p_match
---------   ------- -------
5.2 Mbp        1.4%   99.0%    NC_011663.1 Shewanella baltica OS223, complete...
2.7 Mbp        0.7%  100.0%    CP001071.1 Akkermansia muciniphila ATCC BAA-83...
5.2 Mbp        0.7%   51.0%    NC_009665.1 Shewanella baltica OS185, complete...
found less than 50.0 kbp in common. => exiting

Copy link

codecov bot commented Feb 23, 2025

Codecov Report

Attention: Patch coverage is 63.47305% with 122 lines in your changes missing coverage. Please review.

Project coverage is 87.72%. Comparing base (9e255ee) to head (ce517a8).

Files with missing lines Patch % Lines
src/core/src/ffi/index/disk_revindex.rs 0.00% 84 Missing ⚠️
src/sourmash/commands.py 60.34% 14 Missing and 9 partials ⚠️
src/sourmash/index/revindex.py 94.15% 5 Missing and 5 partials ⚠️
src/core/src/index/revindex/disk_revindex.rs 58.33% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3545      +/-   ##
==========================================
- Coverage   88.03%   87.72%   -0.31%     
==========================================
  Files         136      137       +1     
  Lines       22275    22588     +313     
  Branches     2225     2260      +35     
==========================================
+ Hits        19609    19815     +206     
- Misses       2353     2450      +97     
- Partials      313      323      +10     
Flag Coverage Δ
hypothesis-py 25.39% <22.22%> (-0.05%) ⬇️
python 92.31% <85.89%> (-0.01%) ⬇️
rust 81.30% <11.00%> (-0.77%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ctb
Copy link
Contributor Author

ctb commented Mar 8, 2025

interim benchmarks:

1234.2674s / 20m 34s
1.1 GB of RAM

for python gather --no-prefetch against RocksDB on SRR1976948.

with prefetch it's much worse, of course: 1886s / 8 GB of RAM.

compare with fmg x rocksdb at 120s and 476 MB of RAM 😭 link

Still, it's better than straight Python gather on both RAM and time!!

@ctb
Copy link
Contributor Author

ctb commented Mar 9, 2025

full set of benchmarks -

prefix s max_rss
fastmultigather_rocksdb 21.5 0.5
fastgather 199.8 12.0
fastmultigather 910.9 12.8
pygather_rocksdb 1471.0 1.1
pygather 2594.4 13.8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support more outputs from index command & add more on-disk index tests?
1 participant