Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

thoughts on Storage, manifests, etc. #1901

Open
ctb opened this issue Mar 26, 2022 · 1 comment
Open

thoughts on Storage, manifests, etc. #1901

ctb opened this issue Mar 26, 2022 · 1 comment

Comments

@ctb
Copy link
Contributor

ctb commented Mar 26, 2022

This issue extracts the leftover bits from #1352, mostly to do with evolving the Storage class and SBTs to better intersect.

From comment - @luizirber:

Quick comment on supporting multiple indexes per storage: during #648 I added the list_sbts method to ZipStorage with a trivial implementation, but having in mind that in the future it would be useful to have a similar method in Storage to be able to pull the indices descriptions and be able to select the desired one. Current impl just returns the first .sbt.json file it finds, but we can store multiple indices in one storage and share the signatures/ between them, for example.

From comment - @ctb:

I think we can evolve Storage classes a little bit to support the relevant functionality - we'd want some functionality like,

  • Storage.get_manifest(...) - loads exactly one manifest file

  • Index.signatures_with_internal(...), the current method for getting all signatures in a location in order to build a manifest.

In the short term, supporting manifests on SBTs would be pretty easy as long as we don't try to allow multiple indices within an SBT. I would probably start by supporting separate manifest creation, storage, and loading as per the ZipFileLinearIndex implementation in #1590. This would give us good picklist functionality with SBT.signatures(...) which is very useful in the short term for supporting charcoal and grist use cases.

In the longer-term, some questions will arise -

  • do we want to have separate SBT indices and manifests in a single file? I think the answer is "yes" because that way we can support full signature files in SBTs with Save full signature file in the SBT leaves? #198, and multiple indices on those signature files, all within a single zip file. (Which is good, right?)
  • if we want to support multiple SBT indices and want to keep them separate from a manifest, then I think we must forego using the SBT index itself as a manifest, because it will not necessarily have all the signatures in it and also that gets confusing.
  • if we don't want to support multiple SBT indices, we could upgrade the SBT index to support all the information that needs to be in the manifest, e.g. ksize and moltype (SBT JSON file should contain ksize and moltype #63).

and more -

hmm, maybe we want to start thinking about using Storage for arbitrary collections of signatures, with manifests; and Index for indexed collections that support search/gather/prefetch?

Then the .signatures() method (and related methods) are defined on Storage, while .find(...) is provided for Index classes, and things like SBTs and LCA DBs support both.

Could also add a Storage.indexes(...) that returns all indexes (currently just SBTs?) in any given storage.

Then manifests belong to Storage, and can be used efficiently with picklists; while Indexes are good for search, but don't work efficiently with picklists.

@ctb
Copy link
Contributor Author

ctb commented Aug 15, 2022

maybe relevant: perhaps Storage classes could provide internal_location information for direct loading viaIndex.get, suggested in #1848.

ctb added a commit that referenced this issue Dec 1, 2023
On-disk RevIndex based on RocksDB, initially implemented in
https://github.com/luizirber/2022-06-26-rocksdb-eval
This is the index/core data structure backing
https://mastiff.sourmash.bio

There are many changes in the Rust code, so bumping the version to
`0.12.0`.

This is mostly not exposed thru the FFI yet. Tests from the from the
in-memory `RevIndex` (greyhound) from #1238 were kept working, but it is
not well supported (doesn't allow saving/loading from disk, for
example), and should be wholly replaced by
`sourmash::index::revindex::disk_revindex` (the on-disk RevIndex) in the
future.
It is confusing to have these different RevIndex impls in Rust, and I
started converging them, but the work is not completely done yet.

#2727 is a better starting point for how `Index` abc/trait should work
acrosss Python/Rust, and I started moving the Rust indices to start from
a `LinearIndex` and later specialize into a `RevIndex`, which will make
easier to translate the work from #2727 for future indices across FFI.

A couple of new concepts introduced in this PR:

- a `Collection` is a `Manifest` + `Storage`. So a zip file like the
ones for GTDB databases fit this easily (storage = `ZipStorage`,
manifest is read from the zipfile), but a file paths list does too
(manifest built from the file paths, storage = `FSStorage`). This goes
in a bit of different direction than #1901, which was extending
`Storage` to support more functionality. I think `Storage` should stay
pretty bare and mostly deal with loading/saving data, but not much
knowledge of **what** data is there (this is covered with `Manifest`).

- a `CollectionSet` is a consistent collection of signatures. Consistent
here means: same k-size, downsample-compatible for scaled, same moltype.
You can create a `CollectionSet` by running `.select()` on a
`Collection`. `CollectionSet` is required for building indices (because
we shouldn't be building indices mixing k-size/moltype), and greatly
simplifies the logic in many places because it is not necessary to check
for compatibility.

- `LinearIndex` was rewritten based on `Collection` (and also the
`greyhound`/`branchwater` parallelism), and this supports the "parallel
search without an index" use case. There is no index construction per se
here, pretty much just a thin layer on top of `Collection` implementing
functionality expected from indices.

- `Manifest`, `Selection`, and `Picklist` are still incomplete, but the
relevant function definitions are in place, need to barrage it with
tests (and potentially exposing to Python and reusing the ones there in
#2726)

## Feature

- Initial implementation for `Manifest`, `Selection`, and `Picklist`
following
  the Python API.
- `Collection` is a new abstraction for working with a set of
signatures. A
collection needs a `Storage` for holding the signatures (on-disk,
in-memory,
or remotely), and a `Manifest` to describe the metadata for each
signature.
- Expose CSV parsing and RocksDB errors.
- New module `sourmash::index::revindex::disk_revindex` with the on-disk
  RevIndex implementation based on RocksDB.
- Add `iter` and `iter_mut` methods for `Signature`.
- Add `load_sig` and `save_sig` methods to `Storage` trait for
higher-level data
  manipulation and caching.
- Add `spec` method to `Storage` to allow constructing a concrete
`Storage` from
  a string description.
- Add `InnerStorage` for synchronizing parallel access to `Storage`
  implementations.
- Add `MemStorage` for keeping signatures in-memory (mostly for
debugging and
  testing).

## Refactor

- Rename `HashFunctions` variants to follow camel-case, so
`Murmur64Protein` instead of `murmur64_protein`
- `LinearIndex` is now implemented as a thin layer on top of
`Collection`.
- Move `GatherResult` to `sourmash::index` module.
- Move `sourmash::index::revindex` to `sourmash::index::mem_revindex`
(this is
the Greyhound version of revindex, in-memory only). It was also
refactored
internally to build a version of a `LinearIndex` that will be merged in
the
  future with `sourmash::index::LinearIndex`
- Move `select` method from `Index` trait into a separate `Select`
trait,
  and implement it for `Signature` based on the new `Selection` API.
- Move `SigStore` into `sourmash::storage` module, and remove the
generic. Now
  it always stores `Signature`. Also implement `Select` for it.

## Build

- Add new `branchwater` feature (enabled by default), which can be
disabled by downstream projects to limit bringing heavy dependencies
like rocksdb
- Add new `rkyv` feature (disabled by default), making `MinHash`
serializable
  with the `rkyv` crate.
- Add semver checks for CI (so we bump versions accordingly, or avoid
breaking changes)
- Reduce features combinations on Rust checks (takes much less time to
run)
- Disable `musllinux` wheels (need to figure out how to build rocksdb
for it)

---------

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Co-authored-by: C. Titus Brown <titus@idyll.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant