[MRG] implement a simple ZipFileLinearIndex class #1349

Merged
merged 95 commits into from
Apr 3, 2021
6bacb41
implement a simple ZipFileLinearIndex class
ctb Feb 25, 2021
61245eb
fix load_file_as_signatures
ctb Feb 25, 2021
31b0ca4
Merge branch 'latest' of github.com:dib-lab/sourmash into add/zipfile…
ctb Feb 26, 2021
b22248b
add tests for zipfile searching etc.
ctb Feb 26, 2021
8c9c9ae
add sig describe test for loading from zipfile
ctb Feb 26, 2021
7087d82
fix load_file_as_index to support zipfiles
ctb Feb 26, 2021
c75bd83
rename force; add docstrings
ctb Feb 26, 2021
da3a58c
Merge branch 'latest' into add/zipfile_index
ctb Mar 5, 2021
92e5fdc
add an IndexOfIndexes class
ctb Mar 6, 2021
5c71e11
rename to MultiIndex
ctb Mar 7, 2021
85efdaf
switch to using MultiIndex for loading from a directory
ctb Mar 7, 2021
04f9de1
some more MultiIndex tests
ctb Mar 7, 2021
201a89a
add test of MultiIndex.signatures
ctb Mar 7, 2021
07d2c32
add docstring for MultiIndex
ctb Mar 7, 2021
61d15c3
stop special-casing SIGLISTs
ctb Mar 7, 2021
16f9ee2
fix test to match more informative error message
ctb Mar 7, 2021
c6bf314
switch to using LinearIndex.load for stdin, too
ctb Mar 7, 2021
dd0f3b8
add __len__ to MultiIndex
ctb Mar 8, 2021
9211a74
add check_csv to check for appropriate filename loading info
ctb Mar 8, 2021
75069ff
add comment
ctb Mar 8, 2021
d2294fb
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 9, 2021
9f39623
fix databases load
ctb Mar 9, 2021
ac63cf8
more tests needed
ctb Mar 9, 2021
d5059eb
Merge branch 'latest' into add/multi_index
ctb Mar 9, 2021
3e06dbf
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 9, 2021
5590d70
add tests for incompatible signatures
ctb Mar 9, 2021
14891bd
add filter to LinearIndex and MultiIndex
ctb Mar 9, 2021
40395ff
clean up sourmash_args some more
ctb Mar 9, 2021
8c51452
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 9, 2021
fbf3bb9
Merge branch 'latest' into add/multi_index
ctb Mar 12, 2021
abd84b2
Merge branch 'latest' of github.com:dib-lab/sourmash into add/zipfile…
ctb Mar 12, 2021
dd52be6
Merge branch 'latest' of github.com:dib-lab/sourmash into add/multi_i…
ctb Mar 24, 2021
f377dc4
shift loading over to Index classes
ctb Mar 24, 2021
250c49a
refactor, fix tests
ctb Mar 24, 2021
9a921f9
switch to a list of loader functions
ctb Mar 25, 2021
780fb71
comments, docstrings, and tests passing
ctb Mar 26, 2021
d261963
update to use f strings throughout sourmash_args.py
ctb Mar 26, 2021
4b4174e
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/db…
ctb Mar 26, 2021
93fca04
add docstrings
ctb Mar 26, 2021
0203357
update comments
ctb Mar 26, 2021
cd53f02
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/db…
ctb Mar 26, 2021
8a0200a
remove unnecessary changes
ctb Mar 26, 2021
e9df90f
revert to original test
ctb Mar 26, 2021
9e427e3
remove unneeded comment
ctb Mar 26, 2021
0dd390a
clean up a bit
ctb Mar 26, 2021
2c0ee29
debugging update
ctb Mar 27, 2021
edcb483
better exception raising and capture for signature parsing
ctb Mar 27, 2021
3f6c3f2
more specific error message
ctb Mar 27, 2021
78dbb1d
revert change in favor of creating new issue
ctb Mar 27, 2021
229b1d7
add commentary => TODO
ctb Mar 28, 2021
20ed9f0
add tests for MultiIndex.load_from_directory; fix traverse code
ctb Mar 28, 2021
16a119e
switch lca summarize over to usig MultiIndex
ctb Mar 28, 2021
cb1e8a3
switch to using MultiIndex in categorize
ctb Mar 28, 2021
c9e176d
remove LoadSingleSignatures
ctb Mar 28, 2021
8f914f1
test errors in lca database loading
ctb Mar 28, 2021
a43b011
remove unneeded categorize code
ctb Mar 28, 2021
15328ae
add testme info
ctb Mar 28, 2021
f674232
verified that this was tested
ctb Mar 28, 2021
01c54c0
remove testme comments
ctb Mar 28, 2021
ae3f66d
add tests for MultiIndex.load_from_file_list
ctb Mar 28, 2021
7f52d7c
refactor select, add scaled/num/abund
ctb Mar 28, 2021
dde14fd
more work
ctb Mar 28, 2021
3f498a4
catch ValueError from db.select
ctb Mar 29, 2021
df19926
update debug print to sys.stder
ctb Mar 29, 2021
e8233ca
fix scaled check for LCA database
ctb Mar 29, 2021
b44c3cf
add debug_literal
ctb Mar 29, 2021
7133ac1
break things when filter returns empty Index
ctb Mar 29, 2021
f5f1c9c
fix scaled check for SBT
ctb Mar 29, 2021
d6f156f
fix a few tests
ctb Mar 30, 2021
785a9a4
fix LCA database ksize message & test
ctb Mar 30, 2021
23d7ac4
flag for removal
ctb Mar 30, 2021
efc07cd
Merge branch 'latest' of github.com:dib-lab/sourmash into refactor/si…
ctb Mar 30, 2021
12399e7
add 'containment' to 'select'
ctb Mar 31, 2021
b6a4dff
Merge branch 'latest' into refactor/db_load_multiindex
ctb Mar 31, 2021
2b7acb9
fix remaining tests
ctb Mar 31, 2021
f663426
Merge branch 'refactor/db_load_multiindex' into refactor/siglist_loading
ctb Mar 31, 2021
9aae1cb
update comments
ctb Mar 31, 2021
2630be2
remove all the cruft, yay
ctb Mar 31, 2021
4f1a7fe
added 'is_database' flag for nicer UX
ctb Mar 31, 2021
736ddf3
remove overly broad exception catching
ctb Mar 31, 2021
16719ce
add docstrings
ctb Mar 31, 2021
6d8663e
document downsampling foo
ctb Mar 31, 2021
9832810
Merge branch 'latest' of github.com:dib-lab/sourmash into add/zipfile…
ctb Apr 1, 2021
4854325
Merge branch 'refactor/siglist_loading' into merge
ctb Apr 1, 2021
c4de8fb
update for additional test files
ctb Apr 1, 2021
31194bf
update ZipFileLinearIndex for new selector criteria
ctb Apr 1, 2021
be502ab
remove leftover code fragment
ctb Apr 1, 2021
c072866
Merge branch 'latest' of github.com:dib-lab/sourmash into add/zipfile…
ctb Apr 2, 2021
ec13d03
add zipfile API tests; use .location
ctb Apr 3, 2021
0339ed7
update docs to include zipfile collections
ctb Apr 3, 2021
9b016dd
add zipfile loading tests
ctb Apr 3, 2021
7c0e54a
add __len__ to ZipFileLinearIndex and test MultiIndex load of zipfile
ctb Apr 3, 2021
d755f95
Update doc/command-line.md
ctb Apr 3, 2021
b0f5241
Merge branch 'latest' into add/zipfile_index
ctb Apr 3, 2021
6ce3ce5
add test of incompatible sig search for zipfile
ctb Apr 3, 2021
47 changes: 29 additions & 18 deletions doc/command-line.md
@@ -842,7 +842,9 @@ scaled values will be made compatible.
### Storing (and searching) signatures

Backing up a little, there are many ways to store and search
signatures.
signatures. `sourmash` supports storing and loading signatures from JSON
files, directories, lists of files, Zip files, and indexed databases.
These can all be used interchangeably for sourmash operations.

The simplest is one signature in a single JSON file. You can also put
many signatures in a single JSON file, either by building them that
@@ -851,7 +853,29 @@ commands. Searching or comparing these files involves loading them
sequentially and iterating across all of the signatures - which can be
slow, especially for many (100s or 1000s) of signatures.

Indexed databases can make searching signatures a lot faster. SBT
### Zip files

All of the `sourmash` commands support loading collections of
signatures from zip files. You can create a compressed collection of
signatures using `zip -r collection.zip *.sig` and then specify
`collection.zip` on the command line.

### Loading all signatures under a directory

All of the `sourmash` commands support loading signatures from
beneath directories; provide the paths on the command line.

#### Passing in lists of files

Most sourmash commands will also take `--from-file` or
`--query-from-file`, which will take a path to a text file containing
a list of file paths. This can be useful for situations where you want
to specify thousands of queries, or a subset of signatures produced by
some other command.

#### Indexed databases

Indexed databases can make searching signatures much faster. SBT
databases are low memory and disk-intensive databases that allow for
fast searches using a tree structure, while LCA databases are higher
memory and (after a potentially significant load time) are quite fast.
@@ -869,19 +893,6 @@ will complain. In contrast, signature files can
contain many different types of signatures, and compatible ones will
be discovered automatically.

### Passing in lists of files

Various sourmash commands will also take `--from-file` or
`--query-from-file`, which will take a path to a text file containing
a list of file paths. This can be useful for situations where you want
to specify thousands of queries, or a subset of signatures produced by
some other command.

### Loading all signatures under a directory

All of the `sourmash` commands support loading signatures from
beneath directories; provide the paths on the command line.

### Combining search databases on the command line

All of the commands in sourmash operate in "online" mode, so you can
@@ -902,9 +913,9 @@ been useful. :)

### Using stdin

Most commands will take stdin via the usual UNIX convention, `-`.
Moreover, `sourmash sketch` and the `sourmash sig` commands will
output to stdout. So, for example,
Most commands will take signature JSON data via stdin using the usual
UNIX convention, `-`. Moreover, `sourmash sketch` and the `sourmash
sig` commands will output to stdout. So, for example,

`sourmash sketch ... -o - | sourmash sig describe -` will describe the
signatures that were just created.
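
As a rough illustration of the "Zip files" section in this hunk, the sketch below builds a zip collection from Python rather than with `zip -r collection.zip *.sig`. It uses only the standard-library `zipfile` module; the helper name `build_sig_zip` and the placeholder file contents are assumptions made for the example, not part of sourmash.

```python
import tempfile
import zipfile
from pathlib import Path


def build_sig_zip(sig_dir, zip_path):
    """Zip every *.sig file directly inside sig_dir; return the member names."""
    sig_dir = Path(sig_dir)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for sig_file in sorted(sig_dir.glob("*.sig")):
            # Store each file under its base name, like `zip collection.zip *.sig`.
            zf.write(sig_file, arcname=sig_file.name)
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()


# Demo with placeholder files; real .sig files hold sourmash signature JSON.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.sig", "b.sig", "notes.txt"):
        Path(tmp, name).write_text("[]")
    members = build_sig_zip(tmp, Path(tmp, "collection.zip"))
    print(members)
```

The resulting archive could then be passed wherever a signature file is accepted, for example `sourmash sig describe collection.zip`, which is the invocation exercised by `test_sig_describe_1_zipfile` in this PR.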
68 changes: 64 additions & 4 deletions src/sourmash/index.py
@@ -3,6 +3,7 @@
import sourmash
from abc import abstractmethod, ABC
from collections import namedtuple
import zipfile
import os


@@ -83,7 +84,7 @@ def search(self, query, threshold=None,
for ss in self.signatures():
score = query_match(ss)
if score >= threshold:
matches.append((score, ss, self.filename))
matches.append((score, ss, self.location))

# sort!
matches.sort(key=lambda x: -x[0])
@@ -119,7 +120,7 @@ def gather(self, query, *args, **kwargs):
for ss in self.signatures():
cont = query.minhash.contained_by(ss.minhash, True)
if cont and cont >= threshold:
results.append((cont, ss, self.filename))
results.append((cont, ss, self.location))

results.sort(reverse=True, key=lambda x: (x[0], x[1].md5sum()))
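
The sort key in the `gather` hunk orders results by containment and breaks ties on `md5sum()`, so equal scores come back in a deterministic order. A minimal sketch of that behaviour, using a hypothetical `FakeSig` class in place of a real sourmash signature:

```python
class FakeSig:
    """Hypothetical stand-in for a signature; only md5sum() is needed here."""

    def __init__(self, name, digest):
        self.name = name
        self._digest = digest

    def md5sum(self):
        return self._digest


# (containment, signature, location) tuples, as appended in the hunk above.
results = [
    (0.5, FakeSig("b", "bbbb"), "coll.zip"),
    (0.9, FakeSig("c", "cccc"), "coll.zip"),
    (0.5, FakeSig("a", "aaaa"), "other.sig"),
]

# Highest containment first; ties resolved by md5sum, also in descending order.
results.sort(reverse=True, key=lambda x: (x[0], x[1].md5sum()))
order = [(cont, ss.name, loc) for cont, ss, loc in results]
print(order)
```

Without the second key element, the two 0.5 entries would keep their input order, which can differ from run to run when signatures come from several sources.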

@@ -182,7 +183,7 @@ def __init__(self, _signatures=None, filename=None):
self._signatures = []
if _signatures:
self._signatures = list(_signatures)
self.filename = filename
self.location = filename

def signatures(self):
return iter(self._signatures)
@@ -219,7 +220,66 @@ def select(self, **kwargs):
if select_signature(ss, **kwargs):
siglist.append(ss)

return LinearIndex(siglist, self.filename)
return LinearIndex(siglist, self.location)


class ZipFileLinearIndex(Index):
    """\
    A read-only collection of signatures in a zip file.

    Does not support `insert` or `save`.
    """
    is_database = True

    def __init__(self, zf, selection_dict=None,
                 traverse_yield_all=False):
        self.zf = zf
        self.selection_dict = selection_dict
        self.traverse_yield_all = traverse_yield_all

    def __len__(self):
        return len(list(self.signatures()))

    @property
    def location(self):
        return self.zf.filename

    def insert(self, signature):
        raise NotImplementedError

    def save(self, path):
        raise NotImplementedError

    @classmethod
    def load(cls, location, traverse_yield_all=False):
        "Class method to load a zipfile."
        zf = zipfile.ZipFile(location, 'r')
        return cls(zf, traverse_yield_all=traverse_yield_all)

    def signatures(self):
        "Load all signatures in the zip file."
        from .signature import load_signatures
        for zipinfo in self.zf.infolist():
            # should we load this file? if it ends in .sig OR we are forcing:
            if zipinfo.filename.endswith('.sig') or \
               zipinfo.filename.endswith('.sig.gz') or \
               self.traverse_yield_all:
                fp = self.zf.open(zipinfo)

                # now load all the signatures and select on ksize/moltype:
                selection_dict = self.selection_dict
                for ss in load_signatures(fp):
                    if selection_dict:
                        if select_signature(ss, **self.selection_dict):
                            yield ss
                    else:
                        yield ss

    def select(self, **kwargs):
        "Select signatures in zip file based on ksize/moltype/etc."
        return ZipFileLinearIndex(self.zf,
                                  selection_dict=kwargs,
                                  traverse_yield_all=self.traverse_yield_all)
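
The class above defers all reading to `signatures()`: `select()` only records the criteria and returns a new view over the same open zip file. The following self-contained sketch mimics that pattern with plain JSON records instead of sourmash signatures. `ZipRecordIndex`, its `records()` method, and the equality-based matching are assumptions made for illustration; the real class delegates matching to `select_signature` and also accepts `.sig.gz` members.

```python
import io
import json
import zipfile


class ZipRecordIndex:
    """Hypothetical stand-in: reads JSON lists of dicts from *.sig members."""

    def __init__(self, zf, selection_dict=None, traverse_yield_all=False):
        self.zf = zf
        self.selection_dict = selection_dict
        self.traverse_yield_all = traverse_yield_all

    def _matches(self, rec):
        if not self.selection_dict:
            return True
        return all(rec.get(k) == v for k, v in self.selection_dict.items())

    def records(self):
        # Members are read lazily, one at a time, on every iteration.
        for info in self.zf.infolist():
            if info.filename.endswith(".sig") or self.traverse_yield_all:
                with self.zf.open(info) as fp:
                    for rec in json.load(fp):
                        if self._matches(rec):
                            yield rec

    def select(self, **kwargs):
        # No I/O here: just a new view over the same open zip file.
        return ZipRecordIndex(self.zf, selection_dict=kwargs,
                              traverse_yield_all=self.traverse_yield_all)

    def __len__(self):
        return sum(1 for _ in self.records())


# Build a small in-memory archive: one .sig member and one without that suffix.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.sig", json.dumps([{"ksize": 19, "moltype": "protein"},
                                     {"ksize": 19, "moltype": "hp"}]))
    zf.writestr("b.noext", json.dumps([{"ksize": 31, "moltype": "DNA"}]))

idx = ZipRecordIndex(zipfile.ZipFile(buf))
forced = ZipRecordIndex(zipfile.ZipFile(buf), traverse_yield_all=True)
print(len(idx), len(idx.select(moltype="hp")), len(forced))
```

Two consequences of this design are visible in the sketch and appear to hold for the class in the diff as well: `__len__` re-reads the archive on every call, and a second `select()` replaces the earlier criteria rather than combining with them.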


class MultiIndex(Index):
4 changes: 1 addition & 3 deletions src/sourmash/sbt.py
@@ -754,9 +754,7 @@ def load(cls, location, *, leaf_loader=None, storage=None, print_version_warning

if storage:
sbts = storage.list_sbts()
if len(sbts) != 1:
print("no SBT, or too many SBTs!")
else:
if len(sbts) == 1:
tree_data = storage.load(sbts[0])

tempfile = NamedTemporaryFile()
29 changes: 22 additions & 7 deletions src/sourmash/sourmash_args.py
@@ -17,7 +17,7 @@
from . import signature
from .logging import notify, error, debug_literal

from .index import LinearIndex, MultiIndex
from .index import (LinearIndex, ZipFileLinearIndex, MultiIndex)
from . import signature as sig
from .sbt import SBT
from .sbtmh import SigLeaf
@@ -143,7 +143,7 @@ def traverse_find_sigs(filenames, yield_all_files=False):

def load_dbs_and_sigs(filenames, query, is_similarity_query, *, cache_size=None):
"""
Load one or more SBTs, LCAs, and/or signatures.
Load one or more SBTs, LCAs, and/or collections of signatures.

Check for compatibility with query.

@@ -251,7 +251,7 @@ def _load_sbt(filename, **kwargs):

try:
db = load_sbt_index(filename, cache_size=cache_size)
except FileNotFoundError as exc:
except (FileNotFoundError, TypeError) as exc:
raise ValueError(exc)

return db
@@ -263,6 +263,16 @@ def _load_revindex(filename, **kwargs):
return db


def _load_zipfile(filename, **kwargs):
    "Load collection from a .zip file."
    db = None
    if filename.endswith('.zip'):
        traverse_yield_all = kwargs['traverse_yield_all']
        db = ZipFileLinearIndex.load(filename,
                                     traverse_yield_all=traverse_yield_all)
    return db


# all loader functions, in order.
_loader_functions = [
("load from stdin", _load_stdin),
@@ -271,6 +281,7 @@ def _load_revindex(filename, **kwargs):
("load from file list", _multiindex_load_from_pathlist),
("load SBT", _load_sbt),
("load revindex", _load_revindex),
("load collection from zipfile", _load_zipfile),
]
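
The `_loader_functions` list pairs a description with a loader, and the loaders shown in this diff either return `None` when they do not apply (`_load_zipfile`) or raise `ValueError` (`_load_sbt`). The body of `_load_database` is not part of this diff, so the dispatch loop below is only a guess at how such a list might be consumed. Every name in it (`_load_text`, `_load_strict`, `_loaders`, `load_first_match`) is hypothetical.

```python
def _load_text(filename, **kwargs):
    # Hypothetical loader: declines by returning None unless the suffix matches.
    if filename.endswith(".txt"):
        return ("text", filename)
    return None


def _load_strict(filename, **kwargs):
    # Hypothetical loader: signals failure by raising ValueError.
    if not filename.endswith(".db"):
        raise ValueError(f"'{filename}' is not a .db file")
    return ("db", filename)


# All loaders, in the order they are tried.
_loaders = [
    ("load text", _load_text),
    ("load strict db", _load_strict),
]


def load_first_match(filename, **kwargs):
    """Try each loader in order; return (description, result) of the first success."""
    for desc, loader in _loaders:
        try:
            result = loader(filename, **kwargs)
        except ValueError:
            continue  # this loader cannot handle the file; try the next one
        if result is not None:
            return desc, result
    raise ValueError(f"no loader could handle '{filename}'")


print(load_first_match("notes.txt"))
print(load_first_match("index.db"))
```

With this shape, supporting a new container format means appending one tuple to the list, which matches how the zipfile loader is registered in the hunk above. Order matters: a permissive loader placed early would shadow the more specific ones after it.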


@@ -328,8 +339,10 @@ def _load_database(filename, traverse_yield_all, *, cache_size=None):
def load_file_as_index(filename, yield_all_files=False):
"""Load 'filename' as a database; generic database loader.

If 'filename' contains an SBT or LCA indexed database, will return
the appropriate objects.
If 'filename' contains an SBT or LCA indexed database, or a regular
Zip file, will return the appropriate objects. If a Zip file and
yield_all_files=True, will try to load all files within zip, not just
.sig files.

If 'filename' is a JSON file containing one or more signatures, will
return an Index object containing those signatures.
@@ -346,8 +359,10 @@ def load_file_as_signatures(filename, select_moltype=None, ksize=None,
progress=None):
"""Load 'filename' as a collection of signatures. Return an iterable.

If 'filename' contains an SBT or LCA indexed database, will return
a signatures() generator.
If 'filename' contains an SBT or LCA indexed database, or a regular
Zip file, will return a signatures() generator. If a Zip file and
yield_all_files=True, will try to load all files within zip, not just
.sig files.

If 'filename' is a JSON file containing one or more signatures, will
return a list of those signatures.
Binary file added tests/test-data/prot/all.zip
Binary file not shown.
Binary file added tests/test-data/prot/dayhoff.zip
Binary file not shown.
1 change: 1 addition & 0 deletions tests/test-data/prot/dna-sig.noext

Large diffs are not rendered by default.

Binary file added tests/test-data/prot/dna-sig.sig.gz
Binary file not shown.
Binary file added tests/test-data/prot/hp.zip
Binary file not shown.
Binary file added tests/test-data/prot/protein.zip
Binary file not shown.
16 changes: 16 additions & 0 deletions tests/test_api.py
@@ -49,6 +49,22 @@ def test_load_index_3():
assert len(sigs) == 2


def test_load_index_4():
    testfile = utils.get_test_data('prot/all.zip')
    idx = sourmash.load_file_as_index(testfile)

    sigs = list(idx.signatures())
    assert len(sigs) == 7


def test_load_index_4_b():
    testfile = utils.get_test_data('prot/protein.zip')
    idx = sourmash.load_file_as_index(testfile)

    sigs = list(idx.signatures())
    assert len(sigs) == 2


def test_load_fasta_as_signature():
# try loading a fasta file - should fail with informative exception
testfile = utils.get_test_data('short.fa')
21 changes: 21 additions & 0 deletions tests/test_cmd_signature.py
@@ -1421,6 +1421,27 @@ def test_sig_describe_1_dir(c):
assert line.strip() in out


@utils.in_tempdir
def test_sig_describe_1_zipfile(c):
    # get basic info on multiple signatures in a zipfile
    sigs = utils.get_test_data('prot/all.zip')
    c.run_sourmash('sig', 'describe', sigs)

    out = c.last_result.out
    print(c.last_result)

    expected_output = """\
k=19 molecule=dayhoff num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=dayhoff num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=hp num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=hp num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=protein num=0 scaled=100 seed=42 track_abundance=0
k=19 molecule=protein num=0 scaled=100 seed=42 track_abundance=0
""".splitlines()
    for line in expected_output:
        assert line.strip() in out


@utils.in_thisdir
def test_sig_describe_stdin(c):
sig = utils.get_test_data('prot/protein/GCA_001593925.1_ASM159392v1_protein.faa.gz.sig')