Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add ZipStorage, support loading tree from storage #648

Merged
merged 18 commits into from
Apr 30, 2020

Conversation

luizirber
Copy link
Member

@luizirber luizirber commented Feb 23, 2019

Fixes #490 , closes #60

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@luizirber
Copy link
Member Author

Needs more testing, but it is slower than uncompressing the file and loading from the hidden dir for now

@phiweger phiweger mentioned this pull request Feb 25, 2019
@luizirber luizirber force-pushed the feature/load_from_storage branch from bdcda6a to 95dab12 Compare July 7, 2019 20:28
@luizirber luizirber added this to the 3.3 milestone Jan 14, 2020
@luizirber luizirber force-pushed the feature/load_from_storage branch from 95dab12 to e7677c8 Compare April 22, 2020 17:13
@luizirber luizirber changed the base branch from master to rust_nodegraph April 22, 2020 17:13
@luizirber luizirber force-pushed the feature/load_from_storage branch from e7677c8 to 1e0889c Compare April 22, 2020 18:21
@luizirber
Copy link
Member Author

From #799 (comment):

SBTs are currently v4; saving compressed nodes imply a v5 (because if we keep the same version number, older versions of sourmash can't load a v4 anymore if it is compressed...)

There is already experimental support for loading v5 in #694, but we still save as v4 by default. Since we don't save as v5 yet, there is no officially supported v5 SBTs in existence, so it's OK to update the format (probably also adding some field for the LocalizedSBT in #925?)

(Currently the difference from v4 to v5 is splitting leaves and internal nodes, which allows saving LinearIndex in the same format. Also, only the leaves are required, and maybe start calling them datasets or signatures is also good)

@luizirber luizirber force-pushed the feature/load_from_storage branch from 1e0889c to 9c8451f Compare April 23, 2020 18:50
@luizirber luizirber changed the base branch from rust_nodegraph to master April 23, 2020 18:50
@luizirber luizirber force-pushed the feature/load_from_storage branch 2 times, most recently from d8d3683 to f53ab67 Compare April 24, 2020 01:40
@luizirber
Copy link
Member Author

luizirber commented Apr 24, 2020

Loading from a zipfile already works, and I implemented the same logic for tar files but it was waaay to slow (I tried with the current prepared DBs and... it just never finishes), so I removed it.

Following what we did with previous SBT updates, I propose:

  • start saving to v5 (we support loading from previous versions, but always save to the most recent one)
  • [-] v5 prepared DBs are zip files, instead of tar files (so people don't try to run older DBs with newer sourmash versions) -> upgrade sourmash_databases #970
  • use zip files only as a containers, and store data as it is added to the containers. This avoids slowness in the zipfile Python module (since it will just outputs what was added, without trying to decompress)
  • core is ready for loading compressed files (nodegraphs or signatures), so...
  • start always saving nodegraphs as compressed. This is not compatible with khmer (only CountGraph can be gzipped), but... it doesn't matter?
  • also compress signatures when they are added to an SBT storage (but not signatures in general, because it would break previous sourmash versions)
  • [-] update docs/tutorials in this repo to use new v5 zipped DBs -> document new SBT zip file output. #973
  • adding a field for SBT/index type, for the LocalizedSBT in Add option for SBT creation to be localized  #925?
  • Currently the difference from v4 to v5 is splitting leaves and internal nodes, which allows saving LinearIndex in the same format. Also, only the leaves are required. Calling leaves as signatures in the SBT JSON now, but internally still using leaves.

Still missing:

  • saving into a zipfile... I created tests/test-data/v5.zip manually, running
cd tests/test-data
zip -0 v5.zip .sbt.v3 v5.sbt.json

It is a bit weird to use the zip file just as containers, and do the compression/decompression outside it, but the main benefit is that even if the file is unzipped the FSStorage will use about the same size (because each individual node data will still be compressed).

@luizirber luizirber force-pushed the feature/load_from_storage branch 2 times, most recently from b1d38bc to 0e956ad Compare April 26, 2020 02:15
@ctb
Copy link
Contributor

ctb commented Apr 26, 2020

👍

@luizirber luizirber force-pushed the feature/load_from_storage branch 2 times, most recently from 0bff7d7 to 47aecfc Compare April 27, 2020 16:23
@luizirber luizirber marked this pull request as ready for review April 27, 2020 16:25
@luizirber
Copy link
Member Author

OK, time to bikeshed: I'm checking for .zip extensions for loading/saving indices, but should it be .sbt.zip instead? I think I prefer the latter because it has .sbt in the name.

@ctb
Copy link
Contributor

ctb commented Apr 27, 2020 via email

@luizirber luizirber force-pushed the feature/load_from_storage branch from 75e1f78 to 667c95d Compare April 27, 2020 19:55
@ctb
Copy link
Contributor

ctb commented Apr 27, 2020

I got this to work manually by taking an already-existing SBT and zipping it up. Nice!

Misc questions in initial conceptual review --

  • what requirements are there for the name of the index? does it just find the first .sbt.json file in there?
  • how do you create these with sourmash? Is this PR only adding the functionality to read them? Skimming the code I would have expected to be able to provide a '.sbt.zip' filename and automatically produce a zipfile, but it doesn't seem to work. That might be because the sourmash index CLI automatically adds a .sbt.json on to the end of files... if so, should we fix that here or in another, later PR? (That later PR could also add documentation :)
  • this PR also seems to add loading of compressed (.gz) signatures!? ref Enable --gzip for 'sourmash compute' #60

@luizirber
Copy link
Member Author

I got this to work manually by taking an already-existing SBT and zipping it up. Nice!

This is a nice side benefit, but it is slower. If you zip it up this way it's up to the Python ZipFile to decompress, and that is slower than letting the Rust code decompress. Hence why all the discussion of storing "uncompressed" zip files, because this way ZipFile will read the data and only when it is passed to Rust it will be properly decompressed.

Misc questions in initial conceptual review --

what requirements are there for the name of the index? does it just find the first .sbt.json file in there?

It find the ONLY .sbt.json in there. If there is more than one it complains at the moment:
https://github.com/dib-lab/sourmash/pull/648/files#diff-f01fb6a1122709b5d46828f26c23854aR609-R613

how do you create these with sourmash? Is this PR only adding the functionality to read them? Skimming the code I would have expected to be able to provide a '.sbt.zip' filename and automatically produce a zipfile, but it doesn't seem to work. That might be because the sourmash index CLI automatically adds a .sbt.json on to the end of files...

Was this before 667c95d? The name was just .zip (instead of .sbt.zip) before this commit.

Up for discussion: default the index creation to zipped SBTs, if the name doesn't specify .sbt.json? For now it still defaults to FSStorage (the hidden dir).

Also important: SBTs created with whatever this version of sourmash ends up being (3.3?) won't be backward-compatible (because they will be saved as v5). So I think defaulting to zipped SBTs is OK.

if so, should we fix that here or in another, later PR? (That later PR could also add documentation :)

It's in the TODO list, together with rewriting the docs to use the new trees. Maybe add the zipped SBTs creation to this PR, but punt tutorials/docs changes to another PR (but pre 3.3?)

this PR also seems to add loading of compressed (.gz) signatures!? ref #60

yup =]

They are still saved uncompressed by default, but the SBT.save method uses compressed=True in save_signatures, so they are compressed starting in v5.

@ctb
Copy link
Contributor

ctb commented Apr 27, 2020

  • yep, creation with a .sbt.zip suffix now works!
  • defaulting to zipped SBT seems like a breach of semantic versioning for v3, since the output filename(s) change, right? There is a simple way to override it - looks like explicitly specifying .sbt.json works ok. I dunno. Seems like having it be an easter egg for people until v4 is safest.
  • good to bump to v5 by default now, I think? what happens if you try to load a v5 with sourmash 3.2?
  • I think having a separate PR for docs and command line (keeping this PR from getting bigger and uglier) is better. That leaves this as "technical support for ZipStorage."

I still want to go through some of the bits of code and tests and understand them, tho.

@ctb
Copy link
Contributor

ctb commented Apr 27, 2020

(have you run this on a really big SBT? How well does it perform?)

@ctb
Copy link
Contributor

ctb commented Apr 29, 2020 via email

@luizirber
Copy link
Member Author

On Wed, Apr 29, 2020 at 08:36:14AM -0700, Luiz Irber wrote: Better solution: create a new exception type, raise it and catch where appropriate (search/gather CLI, for example). For interactive uses it is horrible to raise SystemExit(1)...
yes. => new issue?

this is now raising IndexNotSupported, still need to catch it in other places for nicer messages (but the traceback is more informative, at least)

maybe we could add an explicit test for the name? that way we're aware of when it changes and have that included in a test for diff/blame/changelog purposes.

Added test, and a sourmash index test too. I think it is checking too deep into the generated file (in a way that might complicate future changes that keep the test working), but better safe than sorry?

@ctb
Copy link
Contributor

ctb commented Apr 29, 2020

Looking pretty fully baked - I have two more files to dig into, but the functionality all seems there! Let me know if you want me to expedite the review.

@luizirber
Copy link
Member Author

Looking pretty fully baked - I have two more files to dig into, but the functionality all seems there! Let me know if you want me to expedite the review.

Almost, it was missing a test for sourmash index --append. Now it has one, and the ZipStorage can write new entries and avoid duplications (check the latest commit for untold horrors =])

Ready for final review and merge

@luizirber luizirber changed the title [WIP] add ZipStorage, support loading tree from storage [MRG] add ZipStorage, support loading tree from storage Apr 29, 2020
# TODO: leave it open, or close/open every time?

if path is None:
# TODO: Open a temporary file?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to flag things here (raise an exception, or some such)? or just require path in the constructor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be interesting to do what is happening with the buffer now, where it is kept in memory, and throw a Warning if the file is not open? But probably just make it required is easier.

Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks excellent to me! On or before merge, pls create issue to update docs and do other leftovers (e.g. for v4, make sbt.zip the default).

@ctb
Copy link
Contributor

ctb commented Apr 29, 2020

added the issues - now just gotta get the tests passing, I guess :)

@luizirber luizirber merged commit ccfcb77 into master Apr 30, 2020
@luizirber luizirber deleted the feature/load_from_storage branch April 30, 2020 03:22
@ctb
Copy link
Contributor

ctb commented Apr 30, 2020

very nice work, @luizirber!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support all-in-one SBT indexes
2 participants