[MRG] implement a simple ZipFileLinearIndex class #1349
Codecov Report
@@            Coverage Diff             @@
##           latest    #1349      +/-   ##
==========================================
+ Coverage   89.33%   89.44%   +0.11%
==========================================
  Files         123      123
  Lines       18913    19105     +192
  Branches     1463     1471       +8
==========================================
+ Hits        16895    17089     +194
+ Misses       1785     1784       -1
+ Partials      233      232       -1
|
Interested in comments, @bluegenes @luizirber! I still have a few tests to write and some documentation to update, but curious what you think. So far I really like the convenience of supporting everything in one file, and it's particularly nice that it supports incompatible signatures (unlike the indexed data types, for now). |
love this -- much easier to distribute one file than many, plus compression! A couple thoughts:
|
Love the comments, excellent brainstorming :). I will probably punt most of them to new issues as out of my desired scope for this PR, but I think they're great next steps! Note that outputting '.gz' formats is just as good as a zip file when compression is desired; zip files are mostly nice for situations where you have a directory of things to distribute, and want to be able to update them etc (the random access case). I like the idea of a |
Can you expand on this a bit? I guess you mean that for compression, we could gzip the individual sigfiles, and then optionally zip the directory, which enables distribution/updating/etc. But I guess I haven't been thinking about compression much-- do we currently enable reading/writing |
On the plus side, this is a simple implementation, with easy-to-use commands in the CLI, and covers an increasingly common use case. But =]
Not at first sight, but we will only know when we measure =]
This already works transparently, because
So, I'm +1 on this PR. I think it fragments the codebase a bit and makes it harder to maintain in the long run, but it is useful NOW, which is more important =] |
some responses and thoughts -
sourmash can load multiple signatures from So the question becomes, what does a .zip file bring that goes beyond compression? The main advantages that I see -
That is, zip files are a package of compressed files and directories, not a compressed package of signatures.
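To make the "package of files" point concrete, here is a minimal sketch using only Python's standard `zipfile` module. The member names and payloads are made up for illustration, and nothing here is sourmash API. It shows the two things a `.gz` stream does not give you: reading one member by name, and adding a member to an existing archive.

```python
import io
import zipfile


def build_example_zip():
    """Build a small in-memory zip with two independently compressed members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Hypothetical member names; each member is compressed on its own.
        zf.writestr("signatures/a.sig", '[{"name": "a"}]')
        zf.writestr("signatures/b.sig", '[{"name": "b"}]')
    return buf


def read_one_member(buf, name):
    """Read a single member by name without decompressing the others."""
    with zipfile.ZipFile(buf, "r") as zf:
        return zf.read(name).decode("utf-8")


def append_member(buf, name, payload):
    """Add a new member to an existing archive (mode 'a'), then list members."""
    with zipfile.ZipFile(buf, "a", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, payload)
    with zipfile.ZipFile(buf, "r") as zf:
        return zf.namelist()
```

Because each member is compressed separately, a zip of many small files will usually compress less tightly than a single gzipped stream of the same data; that is the price of random access.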
This is an excellent point, and one I want to dig into. Are you thinking maybe that
I'm not too concerned about the versioning, since zipfile format is pretty stable, is supported by Python, and presumably is supported by Rust. But your point about this being Yet Another file format to support is well taken. How many more
So this brings up a key question: do we want to require that there be a manifest? A required manifest can add loading overhead and maintenance overhead (if we update the things in the zip file, we need to update the manifest). In the greyhound use case we simply want to be able to iterate across the signatures as quickly as possible, and I think the current SBT format is ill suited for that.
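One way to picture the trade-off is an optional manifest: use it when it is there, fall back to scanning member names when it is not. The sketch below assumes a hypothetical `MANIFEST.csv` member with a single `internal_location` column; neither name is taken from sourmash.

```python
import csv
import io
import zipfile

MANIFEST_NAME = "MANIFEST.csv"  # hypothetical name, not a sourmash convention


def write_zip(buf, members, with_manifest):
    """Write members (name -> text) and, optionally, a manifest listing them."""
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, text in members.items():
            zf.writestr(name, text)
        if with_manifest:
            out = io.StringIO()
            writer = csv.writer(out)
            writer.writerow(["internal_location"])
            for name in members:
                writer.writerow([name])
            zf.writestr(MANIFEST_NAME, out.getvalue())


def list_signature_members(buf):
    """Use the manifest when present; otherwise scan every member name."""
    with zipfile.ZipFile(buf, "r") as zf:
        names = zf.namelist()
        if MANIFEST_NAME in names:
            text = zf.read(MANIFEST_NAME).decode("utf-8")
            rows = csv.DictReader(io.StringIO(text))
            return [row["internal_location"] for row in rows]
        # No manifest: fall back to a linear scan by file extension.
        return [n for n in names if n.endswith(".sig")]
```

The maintenance cost mentioned above shows up here directly: anything that adds or removes a `.sig` member would also have to rewrite the manifest, or the two listings drift apart.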
k, I'll take a look.
Yes, it should, but we need to recognize
This matches my hot take (once I'd finished getting it working) - good functionality, but not that cleanly integrated into sourmash. There's no immediate hurry, so let's keep talking. So a follow-on question: What about having some logic where the Storage tells us what kind of file it is? Then we can implement ZipStorage logic to tell us if it's an SBT or something else. |
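A rough sketch of what "the storage tells us what kind of file it is" could look like, using only the standard library. The rules are assumptions made for illustration: that a zipped SBT carries a `*.sbt.json` description file, and that a plain collection carries `*.sig` or `*.sig.gz` members. `classify_zip` and `make_zip` are hypothetical helpers, not existing sourmash functions.

```python
import io
import zipfile


def classify_zip(fileobj):
    """Guess what a zip holds by looking only at member names.

    Assumption: a zipped SBT carries a '*.sbt.json' description file, and a
    plain collection carries '*.sig' or '*.sig.gz' members. Both rules are
    illustrative heuristics.
    """
    with zipfile.ZipFile(fileobj, "r") as zf:
        names = zf.namelist()
    if any(n.endswith(".sbt.json") for n in names):
        return "sbt"
    if any(n.endswith((".sig", ".sig.gz")) for n in names):
        return "signatures"
    return "unknown"


def make_zip(names):
    """Helper: build an in-memory zip with empty members of the given names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    return buf
```

Checking for the SBT marker first matters, since an SBT zip may also contain signature-like members; the order of the tests encodes "most specific format wins".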
Wrote a bit about how a
tl;dr I am wondering if we should provide a URL designator for signature origins? In doing so, I came across this interesting discussion about how to write URLs for files in zip archives (from frictionlessdata), which does a nice breakdown of some issues with origin URLs. Which led me to this issue on bdbags 🤣 so it's all getting very complicated! |
while I'm weaving a web of issues this morning...
|
further brain breaking fun thoughts: aren't indexes separate from collections? indexes are indices ON collections of sketches...
thinking down this road a bit further,
there is a slightly messy interplay with sketch types and selectors and tagging -
but this all seems like buying ourselves conceptual trouble for the future that might not be that interesting in practice... note we reframe This further opens up the idea of commands to generate manifests on collections separately from indices. |
Here's some simple code to load a pile o' signatures into a zipfile:

#! /usr/bin/env python
import sys
import zipfile
import sourmash
import argparse


def main():
    p = argparse.ArgumentParser()
    p.add_argument('zipfile')
    p.add_argument('signatures', nargs='+')
    args = p.parse_args()

    n = 0
    # the context manager closes the archive so the zip directory gets written
    with zipfile.ZipFile(args.zipfile, 'w', zipfile.ZIP_DEFLATED) as zf:
        for filename in args.signatures:
            print(f"reading signatures from '{filename}'")
            for sig in sourmash.load_file_as_signatures(filename):
                # store each signature under signatures/<md5sum>.sig
                md5 = 'signatures/' + sig.md5sum() + '.sig'
                sigstr = sourmash.save_signatures([sig])
                zf.writestr(md5, sigstr)
                n += 1

    print(f"wrote {n} signatures to '{args.zipfile}'")
    return 0


if __name__ == '__main__':
    sys.exit(main())
|
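For the reverse direction, here is a minimal sketch that walks a zip laid out as `signatures/<md5>.sig`, the layout used by the script above. It yields raw member bytes rather than parsed signatures, so it stays independent of any particular sourmash loading function; `iter_signature_payloads` is a hypothetical helper name.

```python
import io
import zipfile


def iter_signature_payloads(fileobj, prefix="signatures/", suffix=".sig"):
    """Yield (member_name, raw_bytes) for each signature file in the zip."""
    with zipfile.ZipFile(fileobj, "r") as zf:
        for info in zf.infolist():
            # Skip directory entries and anything outside the expected layout.
            if info.is_dir():
                continue
            if info.filename.startswith(prefix) and info.filename.endswith(suffix):
                yield info.filename, zf.read(info)
```

This is the "iterate across the signatures as quickly as possible" case: one pass over the archive's member list, with no index or manifest required.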
more idle thinking, would the results of ...perhaps with a new kind of index, a summary sketch that contains aggregate hashes plus their abundances. or maybe that's just an lca/revindex index. which leads me to think that the output of commands like prefetch and |
Could the result of Would be really nice to avoid saving a |
With #1406 and #1420 merged, loading of databases and selection of compatible signatures has improved muchly; while there is still much to be improved, I think this PR is reasonably well scoped and fully baked. I'll take care of creating new issues based on the conversations above before merge, but for now I am happy to say: ready for review! pls review at your leisure, @bluegenes! (because I know you're enthusiastic about this feature :) |
(the test failures appear to be related to something else going on with the docs; I'll look into this separately) |
looking good to me, just a few questions /comments!
Co-authored-by: Tessa Pierce <bluegenes@users.noreply.github.com>
FYI this may have issues with duplicated md5sums, see #1483 (comment) |
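To illustrate the concern: when a zip is written with member names derived from md5sums, two signatures that share an md5sum collide on the same name. As far as I know, Python's `zipfile` only warns about a duplicate name and appends a second entry, and a later read by name returns just one of them. The sketch below sidesteps that by renaming on collision; the `_N` counter suffix and both helper names are hypothetical, not what sourmash does.

```python
import io
import zipfile


def unique_member_name(existing, md5, prefix="signatures/", suffix=".sig"):
    """Return a member name for md5 that is not already in `existing`.

    The '_N' counter suffix is a hypothetical convention for this sketch.
    """
    name = f"{prefix}{md5}{suffix}"
    counter = 1
    while name in existing:
        name = f"{prefix}{md5}_{counter}{suffix}"
        counter += 1
    return name


def write_all(fileobj, items):
    """Write (md5, payload) pairs, renaming on collision instead of clobbering."""
    with zipfile.ZipFile(fileobj, "w", zipfile.ZIP_DEFLATED) as zf:
        seen = set()
        for md5, payload in items:
            name = unique_member_name(seen, md5)
            seen.add(name)
            zf.writestr(name, payload)
        return zf.namelist()
```

An alternative design would be to skip the second write when the payloads are byte-identical, which keeps the archive smaller but needs a comparison against the member already stored.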
This PR implements linear indices that are just zipped collections of signatures. Fixes #1320.
Example usage:
A few notes -
zip at the command line.
I'm a little concerned at how easy this was, TBH... what am I missing? 😅
Questions:
sourmash sig describe .) and list-of-file inputs (--from-file)?
TODO:
find functionality for Index classes #1392 currently.