Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

standalone manifests and storage and internal_location - what do we recommend? #3008

Open
ctb opened this issue Feb 17, 2024 · 6 comments
Open

Comments

@ctb
Copy link
Contributor

ctb commented Feb 17, 2024

over in In sourmash-bio/sourmash_plugin_branchwater#213, I tried to describe standalone manifests and their benefits for lazy loading. I wrote:

There are various places where we recommend using manifests instead of zip files. Why?

Well, first, if you are using a zip file created by sourmash, you are already using a manifest! And you will get all of the benefits described above!

But if you want to use a collection of multiple very large metagenomes (as search targets in manysearch, or as queries in fastmultigather), then standalone manifests might be a good solution for you.

This is for two specific reasons:

  • first, metagenome sketches are often extremely large (100s of MBs to GBs), and it is not ideal to zip many large sketches into a single zip file;
  • second, both manysearch and fastmultigather take a single argument that specifies collections of metagenomes which need to be loaded on demand, because they cannot fit into memory;

so the question becomes, how do you provide collections of large metagenomes to manysearch and fastmultigather in a single filename?

And the answer is: manifests. Manifests are a sourmash filetype that contains information about sketches without containing the actual sketch content, and they can be used as "catalogs" of sketch content.

The branchwater plugin supports manifest CSVs. These can be created from lists of sketches by using sourmash sig collect or sourmash sig manifest; for example,

sourmash sig manifest <from file> -o manifest.csv

will create a manifest CSV from a list of sketches.

And when I actually had to create a standalone manifest for my own use (for benchmarking) I was careful to use absolute paths.

Which made me think, ok, I just made a manifest, and it's got absolute paths to a bunch of sig files in it, and that's nice and robust. But it means if I move the directory full of sig files, I have to regenerate the manifest from scratch. Now, I could use my own special knowledge of manifest files and go in and change all the path names manually. But maybe it would be better to better support relative paths? Not sure.

A few things I could imagine:

  • a standard way to say, hey, please interpret this manifest relative to this directory, with it defaulting to cwd.
  • in addition, support container files (e.g. zip files) and/or Web URLs via plugins.

Related issues:

@luizirber
Copy link
Member

Hmm, do we want standalone manifest to be a thing? Or should they be paired with a storage?

I like the internal_location because it makes things relative, so it becomes easy to move from dir to zipfile to (http/s3/ipfs/rocksdb) as different backends for the same content (more info: #2230 (comment)).

so in this case you would call
sourmash sig describe --storage sigs.zip mf.csv
to explicitly set the storage.

or, maybe... a standalone manifest is a collection? This way it ties manifest + storage, and the command would be
sourmash sig describe mf.collection
with an option to "overwrite" the storage to point to another backend? so you could say
sourmash sig describe --storage sigs.zip mf.collection
or
sourmash sig describe --storage s3://bucket-name mf.collection
and so on

(originally posted in #3009, moved comment here)

@luizirber
Copy link
Member

In branchwater-index I called it location in the CLI:
https://github.com/sourmash-bio/branchwater/blob/9ab2de5df0795ea5b90791e6aa2d2dd7432d9193/crates/index/src/main.rs#L23-L25
and this was needed because I definitely don't want to build a zipfile of all SRA metagenomes, or force people to have /data/wort in their systems for loading data from a directory...

@ctb
Copy link
Contributor Author

ctb commented Feb 23, 2024

some hot-take thoughts (so, may be quite dumb, but I've got to get them out of my head ;) -

standalone manifests are already a thing, of course, so 🤷

I'm thinking of tackling the MultiIndex/pathlists-load-everything-all-together problem by using @bluegenes genius solution in the branchwater plugin, in which we load the pathlist twice - once to generate a manifest, and then again when we actually need the sketches. This will also have the beneficial effect of subtly deprecating pathlists in favor of standalone manifests, for performance and flexibility reasons, without actually breaking pathlists.

It feels like we should implement a focused Index.get(...) method per #1848. In one of my custom scripts I'm using storage there quite explicitly.

could a simple hack that addresses the storage idea be to support a generic --prefix argument when loading standalone manifest?

Or something that fits better with the way we're doing save/load and save/load plugins - what if we introduce a new URI-focused scheme (beta-tested via a plugin) per #2258, and then use that to interpret manifest paths somehow?

There are some interesting thoughts on Storage lingering in this issue over here: #1901

It feels like there is a possible convergence now that we have the ability to really use the Rust Collection stuff in action after sourmash-bio/sourmash_plugin_branchwater#197 &c.

@ctb
Copy link
Contributor Author

ctb commented Feb 24, 2024

Restating what I think @luizirber said, but having thought about it some more, esp in terms of implementation and CLI -

What if we add a --storage option to everything?

It will default to ., i.e., interpret paths relative to local directory.

But you could also specify:

  • --storage file:///some/abs/path or --storage file://../some/rel/path to interpret paths relative to that location
  • --storage zip://path/to/zipfile.zip to interpret paths relative to that location
  • --storage http:// etc for other URIs

then we could have storage plugins separate from save/load plugins, too.

A key thing each storage would have is a relpath or internal method that for a given location, gave you the internal path for that storage.

I think this pretty closely matches what luiz said above :)

@luizirber
Copy link
Member

luizirber commented Feb 24, 2024

Restating what I think @luizirber said, but having thought about it some more, esp in terms of implementation and CLI -

What if we add a --storage option to everything?

the branchwater search index (server) is initialized like this:

CMD ["/app/bin/branchwater-server", "--port", "80", "-k21", "-a", "/app/assets", "--location", "/data/sigs.zip", "/data/index"]

that --location (which could be --storage) is implement with

    let location = opts.location.map(|path| if path.ends_with(".zip") {
        format!("zip://{}", path)
    } else { format!("fs://{}", path) });

But you could also specify:

  • --storage file:///some/abs/path or --storage file://../some/rel/path to interpret paths relative to that location
  • --storage zip://path/to/zipfile.zip to interpret paths relative to that location
  • --storage http:// etc for other URIs
    then we could have storage plugins separate from save/load plugins, too.

the file:// from your example maps tofs:// , zip is still the same, and they are parsed/created the right Storage on the Rust side here:

pub fn from_spec(spec: String) -> Result<Self> {
Ok(match spec {
x if x.starts_with("fs") => {
let path = x.split("://").last().expect("not a valid path");
InnerStorage::new(FSStorage::new("", path))
}
x if x.starts_with("memory") => InnerStorage::new(MemStorage::new()),
x if x.starts_with("zip") => {
let path = x.split("://").last().expect("not a valid path");
InnerStorage::new(ZipStorage::from_file(path)?)
}
_ => todo!("storage not supported, throw error"),
})
}

file:// does sound better than fs:// because it matches browsers. In general I suggest we keep the syntax as close as possible to fsspec

A key thing each storage would have is a relpath or internal method that for a given location, gave you the internal path for that storage.

Both ZipStorage and FSStorage in Rust already do that (with path and fullpath):

/// Store files locally into a directory
#[derive(TypedBuilder, Debug, Clone, Default)]
pub struct FSStorage {
/// absolute path for the directory where data is saved.
fullpath: PathBuf,
subdir: String,
}
#[ouroboros::self_referencing]
pub struct ZipStorage {
mapping: Option<memmap2::Mmap>,
#[borrows(mapping)]
#[covariant]
archive: piz::ZipArchive<'this>,
subdir: Option<String>,
path: Option<PathBuf>,
#[borrows(archive)]
#[covariant]
metadata: Metadata<'this>,
}

I think this pretty closely matches what luiz said above :)

Yup. =]

@ctb
Copy link
Contributor Author

ctb commented Mar 3, 2024

while debugging #3053 I realized that we have some inconsistencies in the code -

  • paths within standalone manifests are interpreted relative to the manifest location: if a manifest is at mf_dir/manifest.csv, and it contains a path sketches/sketch.sig.gz, sourmash will try to load mf_dir/sketches/sketch.sig.gz.
  • however, sig collect and sig check make standalone manifests that are (by default) relative to the current working directory.
  • so, sig collect / sig check by default create manifests that are incompatible with the way we load manifests. sigh.
  • also note that pathlists are interpreted relative to the current working directory, not the location of the pathlist (document that files in pathlists are relative to the cwd, not to the pathlist location. #1414).

I am slowly coming around to the idea that loading things relative to the manifest path is correct:

  • it makes standalone manifests more relocatable, in that you can move the directory hierarchy containing the manifest AND the signatures together, and it will all work;
  • it mimics the way zip files behave: a zip file is at zip_dir/collection.zip, and contains sketches at signatures/<md5>.sig.gz, and the manifest in the zip file contains that path signatures/<md5>.sig.gz, and loads as signatures/<md5>.sig.gz in zip_dir/collection.zip. So, the analogy of "zip file and directories are both storage, and paths are relative to the storage" seems to fit.

So in PR #3054 I'm trying out the following:

  • adding a new option --relpath to sig check and sig collect that changes their default behavior to output manifests relative to the manifest output location.
  • supporting --abspath on both sig check and sig collect so that you can also output manifests containing absolute paths. (sig collect already supports --abs-path)

@ctb ctb changed the title manifests: should we encourage absolute paths, and/or better support relative paths? standalone manifests and storage and internal_location - what do we recommend? Mar 5, 2024
ctb added a commit that referenced this issue Mar 8, 2024
…` and `sig collect` (#3054)

This PR updates `sig collect` and `sig check` so that they can produce
standalone manifests that work properly with default sourmash loading
behavior. The default behavior produces broken manifests in some
situations and is not changed, but will be deprecated in v5.

## Details

Currently, `sig collect` and `sig check` default to producing standalone
manifests with internal path locations relative to the current working
directory. This conflicts with the default `StandaloneManifest` behavior
implemented in `save_load.py` that loads path locations relative to the
manifest location. As a result, whenever the manifest was in a
subdirectory, the standalone manifests output by `sig check` and `sig
collect` were broken. The only way to make good manifests in this
situation was to use `sig collect --abspath`, but `sig check` didn't
support `--abspath`, and using absolute paths is brittle in situations
where you want to distribute manifests.

This PR adds `--relpath` to both `sig check` and `sig collect`, and adds
`--abspath` to `sig check`. It also demonstrates the bad behavior in
tests and annotates the tests appropriately.

See
#3008 (comment)
for more detailed discussion of why I think `--relpath` is the right
behavior for the future.

- [x] adds `--abspath` and `--relpath` to `sig check`, to properly
support relative paths;
- [x] adds `--relpath` to `sig collect`, to properly support relative
paths;
- [x] documents this behavior properly for creating standalone
manifests;
- [ ] create issue to change default `sig check` and `sig collect`
behavior for v4, and disable cwd behavior.

Techie TODO:
- [x] explicitly test `relpath` and `abspath` behavior in `sig check`;
- [x] explicitly test `relpath` behavior in `sig collect`
- [x] write some tests for `sig check` and `sig collect` to explore the
relative path loading issue, with all three combinations of relpath: mf
in cwd, sigs in subdir; mf in subdir, sigs in cwd; mf in subdir, sigs in
subdir.

Related issues:
* Addresses #3008
* Addresses issues in
#3048 by updating `sig
check` to support `--relpath`;
* Fixes #3053 -
`--relpath` again

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants