Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

supporting RocksDB-based RevIndex in the Python API and command line #3558

Open
ctb opened this issue Feb 27, 2025 · 4 comments
Open

supporting RocksDB-based RevIndex in the Python API and command line #3558

ctb opened this issue Feb 27, 2025 · 4 comments
Labels

Comments

@ctb
Copy link
Contributor

ctb commented Feb 27, 2025

In #3545, I am adding support for RocksDB-based RevIndex databases to sourmash proper. While this database type is fully supported in the Rust layer and used by both branchwater and the branchwater plugin, the true power of this fully operational inverted index has not yet been made available in Python!

It turns out to fit pretty well! A few points:

  • thanks to branchwater and the branchwater plugin, we already know that RocksDB databases work quite well and integrate with the rest of the sourmash ecosystem 🎉 . Major motivation!
  • the RocksDB implementation is single-threaded, so we are not introducing any complicated threading issues into sourmash proper.
  • providing full support for RocksDB in the Python layer enables database composability (searching multiple RocksDB databases, and multiple types of databases, all in one command).

Items to consider and discuss:

  • for now, you can only create RevIndex databases using one of the branchwater repos. I am not sure how to implement RevIndex creation in sourmash via the CLI; maybe expand the index command, which is only used for SBTs currently? EDIT: done
  • I have not implemented picklists, manifests, or selection yet. We probably need all three. It would be ~straightforward to implement picklists in Python, I think, but picklists would be much faster in Rust. Not sure how difficult implementing them in Rust will be. EDIT: working on this in EXP: implement picklists in RocksDB RevIndex #3568
  • our test framework does not do a great job of generalizing to new database types, so doing thorough testing is going to be a pain. Probably this will involve some combination of bespoke tests, additions to the index protocol tests, and generalization of existing database tests to encompass zipfiles, SBTs, and RocksDB databases. I might look at refactoring SBT tests here, specifically... EDIT: see new test_index_cmd.py, and also support more outputs from index command & add more on-disk index tests? #3570
  • this might be a good opportunity to expand the in-memory RevIndex per index.RevIndex is not a fully implemented Index class #1939.

Related issues:

@ctb ctb added the rust label Feb 27, 2025
@ctb
Copy link
Contributor Author

ctb commented Feb 27, 2025

A reasonably simple partial solution to the test & creation problem:

  • expand sourmash index to create RocksDB indices;
  • update SBT tests to test both rocksdb and SBT functionality;

EDIT: done in new test_index_cmd.py

@ctb
Copy link
Contributor Author

ctb commented Feb 28, 2025

Over in #3545, I dug into the options for creating a new RocksDB via the FFI interface. RevIndex::create needs a CollectionSet, and the options there seem to be (1) zipfile (2) list of in-memory sigs (3) pathlist (4) another RocksDB. I decided, for now, to go with a list of in-memory sigs, since that will let me test things.

We may want to stick with that for a while, though, until we deal with #3321 in some form. And, in any case, we have other methods for creating RocksDB databases over in branchwater and sourmash_plugin_branchwater.

@ctb
Copy link
Contributor Author

ctb commented Mar 2, 2025

misc thoughts for planning:

  • might be good to tackle mem_revindex and RevIndex. The former is a not-yet-complete (?) implementation of the RevIndexOps trait over in src/core/src/index/revindex/mem_revindex.rs, and the latter wraps that for Python via FFI. Once functional, this could potentially be used for the CounterGather implementation, presumably speeding it up across the board.

@ctb
Copy link
Contributor Author

ctb commented Mar 3, 2025

also TODO: add some skip-mer tests for revindex

EDIT: done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant