[MRG] add sig fileinfo and standard manifest-generating functionality #1837

Merged
merged 51 commits into from
Feb 24, 2022
Changes from 48 commits
Commits
49af6f2
initial addition of 'sig fileinfo'
ctb Feb 12, 2022
f3b399a
finish first-draft implementation of fileinfo and get_manifest
ctb Feb 12, 2022
ca7630b
cleanup and move over to sourmash_args
ctb Feb 12, 2022
190d53f
add manifest and length support to LCA_Database
ctb Feb 12, 2022
f814e01
add rebuild/no-rebuild args
ctb Feb 12, 2022
323651b
fix __len__ for zipfiles, __bool__ interpretation
ctb Feb 13, 2022
aabd459
add some comments
ctb Feb 14, 2022
1c9d392
fix error on bad path
ctb Feb 16, 2022
490c67d
fix len(LCA_Database)
ctb Feb 16, 2022
90ad101
fix fileinfo on a single .sig file
ctb Feb 16, 2022
ebdc572
initial stab at updating sig extract to use manifests
ctb Feb 16, 2022
6744f1f
fix sig extract mistake
ctb Feb 16, 2022
fab92ec
fix manifest test for LCA databases, which now works
ctb Feb 16, 2022
c702ce2
add __len__ to base Index class
ctb Feb 16, 2022
b30159e
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Feb 16, 2022
d34070f
a test, a test
ctb Feb 16, 2022
3a6028f
more tests
ctb Feb 16, 2022
4808308
fix more MultiIndex
ctb Feb 17, 2022
0e1330b
change up MultiIndex
ctb Feb 18, 2022
5cc2a82
fix tests etc
ctb Feb 19, 2022
a106c70
fix remaining test
ctb Feb 19, 2022
deebee2
minor refactor
ctb Feb 19, 2022
046058e
add tests for LCA_Database __len__ and __bool__
ctb Feb 19, 2022
9160c55
add test for zip manifest + select
ctb Feb 19, 2022
2021967
add explicit tests for location
ctb Feb 19, 2022
3fcd522
add tests for prepend_location
ctb Feb 20, 2022
59d7c77
update sbts
ctb Feb 20, 2022
7cb67b4
cleanup and fix
ctb Feb 20, 2022
80d4403
update docs
ctb Feb 20, 2022
023ac9c
test sbt.json inputs for fileinfo
ctb Feb 20, 2022
cb5ad6c
update to show combinations of sketches
ctb Feb 20, 2022
660977f
update docstring at top of sourmash_args
ctb Feb 20, 2022
3cc1499
fix up stdin loading
ctb Feb 20, 2022
11f1671
switch to wrapping stdin with MultiIndex
ctb Feb 20, 2022
dea3869
update docs
ctb Feb 20, 2022
53c6425
rough out last set of tests
ctb Feb 20, 2022
e9d892d
add tests for get_manifest()
ctb Feb 20, 2022
02aa19b
fix both test and code ;)
ctb Feb 20, 2022
b8b9b31
add debug, do more tests, uncover some ...puzzling behavior
ctb Feb 21, 2022
10bc508
fix abund problem
ctb Feb 21, 2022
c8934d7
add abunds sig test
ctb Feb 21, 2022
8c7c29f
clear up the abund comments
ctb Feb 21, 2022
fca7526
add yaml and json out
ctb Feb 22, 2022
dc81741
more cleanup, test yaml and json output contents
ctb Feb 22, 2022
169df97
remove debug print
ctb Feb 22, 2022
02a1b42
add YAML to install
ctb Feb 22, 2022
ee36324
fix pyyaml spec
ctb Feb 22, 2022
4b9feab
remove yaml & pyyaml dep
ctb Feb 23, 2022
8d1ae14
Apply suggestions from code review
ctb Feb 24, 2022
56a678e
put 'total hashes' in same format as rest
ctb Feb 24, 2022
d4facce
update docs with latest format
ctb Feb 24, 2022
47 changes: 47 additions & 0 deletions doc/command-line.md
@@ -993,6 +993,39 @@ size: 5177
signature license: CC0
```

### `sourmash signature fileinfo` - display a summary of the contents of a sourmash collection

Display a summary of the contents of a signature file, database, or collection.

For example,
```
sourmash sig fileinfo tests/test-data/prot/all.zip
```
will display:
```
path filetype: ZipFileLinearIndex
location: /Users/t/dev/sourmash/tests/test-data/prot/all.zip
is database? yes
has manifest? yes
is nonempty? yes
num signatures: 8
** examining manifest...
31758 total hashes
summary of sketches:
2 sketches with dayhoff, k=19, scaled=100, abund
2 sketches with hp, k=19, scaled=100, abund
2 sketches with protein, k=19, scaled=100, abund
2 sketches with DNA, k=31, scaled=1000, abund
```

`sig fileinfo` will recognize
[all accepted sourmash input files](#loading-signatures-and-databases),
including individual .sig and .sig.gz files, Zip file collections, SBT
databases, LCA databases, and directory hierarchies.

`sourmash sig fileinfo` provides optional JSON output via `--json-out`,
and that format is under semantic versioning.
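
As a rough illustration of how the `summary of sketches` lines could be derived, the sketch below tallies manifest-style rows by molecule type, k-mer size, scaled value, and abundance. The row dictionaries and their keys (`moltype`, `ksize`, `scaled`, `with_abundance`) are assumptions made for this example, and `summarize_sketches` is a hypothetical helper, not the function `sig fileinfo` uses.

```python
from collections import Counter

# Hypothetical manifest-style rows; the keys are assumed for illustration.
rows = [
    {"moltype": "protein", "ksize": 19, "scaled": 100, "with_abundance": True},
    {"moltype": "protein", "ksize": 19, "scaled": 100, "with_abundance": True},
    {"moltype": "DNA", "ksize": 31, "scaled": 1000, "with_abundance": True},
]


def summarize_sketches(rows):
    """Count sketches per (moltype, ksize, scaled, abundance) combination."""
    counts = Counter(
        (r["moltype"], r["ksize"], r["scaled"], bool(r["with_abundance"]))
        for r in rows
    )
    lines = []
    for (moltype, ksize, scaled, abund), n in sorted(counts.items()):
        suffix = ", abund" if abund else ""
        lines.append(
            f"{n} sketches with {moltype}, k={ksize}, scaled={scaled}{suffix}"
        )
    return lines


for line in summarize_sketches(rows):
    print(line)
```

Grouping on the full tuple keeps sketches that differ in any one parameter on separate lines, which matches the shape of the example output.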

### `sourmash signature split` - split signatures into individual files

Split each signature in the input file(s) into individual files, with
@@ -1271,6 +1304,20 @@ exit on the first bad k-mer. If `--check-sequence --force` is provided,
`sig kmers` will provide error messages (and skip bad sequences), but
will continue processing input sequences.

### `sourmash signature manifest` - output a manifest for a file

Output a manifest for a file, database, or collection.

For example,
```
sourmash sig manifest tests/test-data/prot/all.zip -o manifest.csv
```
will create a CSV file, `manifest.csv`, in the internal sourmash
manifest format. The manifest will contain an entry for every
signature in the file, database, or collection. This format is largely
meant for internal use, but it can also serve as a picklist file
(pickfile) for subsetting large collections.
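
A minimal sketch of reading such a CSV back with the standard library and selecting a subset of rows. The leading `#` version line, the column names, and the sample values are all assumptions made for this illustration; check a real `manifest.csv` for the actual layout.

```python
import csv
import io

# Illustrative manifest text. The version comment line, the column names and
# the values are assumptions for this example, not taken from a real file.
MANIFEST_TEXT = """\
# SOURMASH-MANIFEST-VERSION: 1.0
internal_location,md5,ksize,moltype,scaled,with_abundance,name
a.sig,0123abcd,31,DNA,1000,1,genome A
b.sig,4567ef01,19,protein,100,1,proteome B
"""


def load_manifest_rows(fp):
    """Read manifest rows from a file-like object, skipping a leading '#' line."""
    first = fp.readline()
    if not first.startswith("#"):
        fp.seek(0)  # no comment line; rewind so DictReader sees the header
    return list(csv.DictReader(fp))


rows = load_manifest_rows(io.StringIO(MANIFEST_TEXT))

# csv returns strings, so numeric columns need explicit conversion.
dna_k31 = [r for r in rows if r["moltype"] == "DNA" and int(r["ksize"]) == 31]
print([r["name"] for r in dna_k31])
```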

## Advanced command-line usage

### Loading signatures and databases
1 change: 1 addition & 0 deletions src/sourmash/cli/sig/__init__.py
@@ -11,6 +11,7 @@
 from . import extract
 from . import filter
 from . import flatten
+from . import fileinfo
 from . import kmers
 from . import intersect
 from . import manifest
31 changes: 31 additions & 0 deletions src/sourmash/cli/sig/fileinfo.py
@@ -0,0 +1,31 @@
+"""provide summary information on the given file"""
+
+
+def subparser(subparsers):
+    subparser = subparsers.add_parser('fileinfo')
+    subparser.add_argument('path')
+    subparser.add_argument(
+        '-q', '--quiet', action='store_true',
+        help='suppress non-error output'
+    )
+    subparser.add_argument(
+        '-d', '--debug', action='store_true',
+        help='output debug information'
+    )
+    subparser.add_argument(
+        '-f', '--force', action='store_true',
+        help='try to load all files as signatures'
+    )
+    subparser.add_argument(
+        '--rebuild-manifest', help='forcibly rebuild the manifest',
+        action='store_true'
+    )
+    subparser.add_argument(
+        '--json-out', help='output information in JSON format only',
+        action='store_true'
+    )
+
+
+def main(args):
+    import sourmash
+    return sourmash.sig.__main__.fileinfo(args)
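
A minimal, standalone sketch of how a subparser like this one behaves, using only `argparse` from the standard library. The top-level parser, the `demo` program name, and `dest="cmd"` are assumptions for illustration; sourmash wires its subcommands up through its own CLI machinery.

```python
import argparse

# Hypothetical top-level parser; only the 'fileinfo' arguments mirror the diff.
parser = argparse.ArgumentParser(prog="demo")
subparsers = parser.add_subparsers(dest="cmd")

sub = subparsers.add_parser("fileinfo")
sub.add_argument("path")
sub.add_argument("--rebuild-manifest", action="store_true",
                 help="forcibly rebuild the manifest")
sub.add_argument("--json-out", action="store_true",
                 help="output information in JSON format only")

# argparse maps '--json-out' to the attribute name 'json_out'.
args = parser.parse_args(["fileinfo", "all.zip", "--json-out"])
print(args.cmd, args.path, args.json_out, args.rebuild_manifest)
```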
8 changes: 8 additions & 0 deletions src/sourmash/cli/sig/manifest.py
@@ -8,6 +8,10 @@ def subparser(subparsers):
         '-q', '--quiet', action='store_true',
         help='suppress non-error output'
     )
+    subparser.add_argument(
+        '-d', '--debug', action='store_true',
+        help='output debug information'
+    )
     subparser.add_argument(
         '-o', '--output', '--csv', metavar='FILE',
         help='output information to a CSV file',
@@ -17,6 +21,10 @@ def subparser(subparsers):
         '-f', '--force', action='store_true',
         help='try to load all files as signatures'
     )
+    subparser.add_argument(
+        '--no-rebuild-manifest', help='use existing manifest if available',
+        action='store_true'
+    )


 def main(args):
73 changes: 51 additions & 22 deletions src/sourmash/index/__init__.py
@@ -52,6 +52,10 @@ class Index(ABC):
     is_database = False
     manifest = None

+    @abstractmethod
+    def __len__(self):
+        "Return the number of signatures in this Index object."
+
     @property
     def location(self):
         "Return a resolvable location for this index, if possible."
@@ -408,11 +412,13 @@ def save(self, path):
             save_signatures(self.signatures(), fp)

     @classmethod
-    def load(cls, location):
+    def load(cls, location, filename=None):
         "Load signatures from a JSON signature file."
         si = load_signatures(location, do_raise=True)

-        lidx = LinearIndex(si, filename=location)
+        if filename is None:
+            filename=location
+        lidx = LinearIndex(si, filename=filename)
         return lidx

     def select(self, **kwargs):
@@ -557,6 +563,14 @@ def __bool__(self):
         return True

     def __len__(self):
+        "calculate number of signatures."
+
+        # use manifest, if available.
+        m = self.manifest
+        if self.manifest is not None:
+            return len(m)
+
+        # otherwise, iterate across all signatures.
         n = 0
         for _ in self.signatures():
             n += 1
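
The manifest-first `__len__` pattern in this hunk can be sketched in isolation as follows. `SketchCollection` is a hypothetical stand-in class, and plain lists stand in for both the signatures and the manifest; this is not the sourmash implementation.

```python
class SketchCollection:
    """Toy collection that prefers a precomputed manifest for len()."""

    def __init__(self, sketches, manifest=None):
        self._sketches = list(sketches)
        self.manifest = manifest  # any sized object, or None if unavailable

    def signatures(self):
        yield from self._sketches

    def __len__(self):
        # Cheap path: the manifest already knows how many entries exist.
        if self.manifest is not None:
            return len(self.manifest)
        # Fallback: count by iterating over every signature.
        return sum(1 for _ in self.signatures())


no_manifest = SketchCollection(["sig1", "sig2"])
with_manifest = SketchCollection(["sig1", "sig2"], manifest=["row1", "row2"])
print(len(no_manifest), len(with_manifest))
```

One design point worth noting: when a class defines `__len__` but not `__bool__`, Python treats an instance of length zero as falsy, so a class that should always be truthy needs an explicit `__bool__`.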
@@ -845,12 +859,20 @@ class MultiIndex(Index):

     Concrete class; signatures held in memory; builds and uses manifests.
     """
-    def __init__(self, manifest, parent=""):
+    def __init__(self, manifest, parent, *, prepend_location=False):
         """Constructor; takes manifest containing signatures, together with
-        optional top-level location to prepend to internal locations.
+        the top-level location.
         """
         self.manifest = manifest
         self.parent = parent
+        self.prepend_location = prepend_location
+
+        if prepend_location and self.parent is None:
+            raise ValueError("must set 'parent' if 'prepend_location' is set")
+
+    @property
+    def location(self):
+        return self.parent

     def signatures(self):
         for row in self.manifest.rows:
@@ -861,7 +883,7 @@ def signatures_with_location(self):
             loc = row['internal_location']
             # here, 'parent' may have been removed from internal_location
             # for directories; if so, add it back in.
-            if self.parent:
+            if self.prepend_location:
                 loc = os.path.join(self.parent, loc)
             yield row['signature'], loc
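
A small sketch of the location rule this hunk introduces: the parent is joined onto the internal location only when `prepend_location` is set, so that a parent used purely as a label (for example a pathlist file) is never prefixed onto paths. `resolve_location` is a hypothetical helper written for this illustration.

```python
import os


def resolve_location(parent, internal_location, prepend_location=False):
    """Return the path to report for a signature stored under 'parent'."""
    if prepend_location:
        if parent is None:
            raise ValueError("must set 'parent' if 'prepend_location' is set")
        # Directory case: internal locations are relative to the parent.
        return os.path.join(parent, internal_location)
    # Otherwise the internal location is already resolvable on its own.
    return internal_location


print(resolve_location("sigs", "a.sig", prepend_location=True))
print(resolve_location("pathlist.txt", "other/b.sig"))
```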

@@ -877,13 +899,16 @@ def _signatures_with_internal(self):


     def __len__(self):
+        if self.manifest is None:
+            return 0
+
         return len(self.manifest)

     def insert(self, *args):
         raise NotImplementedError

     @classmethod
-    def load(cls, index_list, source_list, parent=""):
+    def load(cls, index_list, source_list, parent, *, prepend_location=False):
         """Create a MultiIndex from already-loaded indices.

         Takes two arguments: a list of Index objects, and a matching list
@@ -903,10 +928,11 @@ def sigloc_iter():
                     yield ss, iloc

         # build manifest; note, signatures are stored in memory.
+        # CTB: could do this on demand?
         manifest = CollectionManifest.create_manifest(sigloc_iter())

         # create!
-        return cls(manifest, parent=parent)
+        return cls(manifest, parent, prepend_location=prepend_location)

     @classmethod
     def load_from_directory(cls, pathname, *, force=False):
@@ -942,7 +968,8 @@ def load_from_directory(cls, pathname, *, force=False):
         if not index_list:
             raise ValueError(f"no signatures to load under directory '{pathname}'")

-        return cls.load(index_list, source_list, parent=pathname)
+        return cls.load(index_list, source_list, pathname,
+                        prepend_location=True)

     @classmethod
     def load_from_path(cls, pathname, force=False):
@@ -957,19 +984,20 @@ def load_from_path(cls, pathname, force=False):

         if os.path.isdir(pathname): # traverse
             return cls.load_from_directory(pathname, force=force)
-        else: # load as a .sig/JSON file
-            index_list = []
-            source_list = []
-            try:
-                idx = LinearIndex.load(pathname)
-                index_list = [idx]
-                source_list = [pathname]
-            except (IOError, sourmash.exceptions.SourmashError):
-                if not force:
-                    raise ValueError(f"no signatures to load from '{pathname}'")
-                return None

-            return cls.load(index_list, source_list)
+        # load as a .sig/JSON file
+        index_list = []
+        source_list = []
+        try:
+            idx = LinearIndex.load(pathname)
+            index_list = [idx]
+            source_list = [pathname]
+        except (IOError, sourmash.exceptions.SourmashError):
+            if not force:
+                raise ValueError(f"no signatures to load from '{pathname}'")
+            return None
+
+        return cls.load(index_list, source_list, pathname)

     @classmethod
     def load_from_pathlist(cls, filename):
@@ -992,15 +1020,16 @@ def load_from_pathlist(cls, filename):
             idx_list.append(idx)
             src_list.append(src)

-        return cls.load(idx_list, src_list)
+        return cls.load(idx_list, src_list, filename)

     def save(self, *args):
         raise NotImplementedError

     def select(self, **kwargs):
         "Run 'select' on the manifest."
         new_manifest = self.manifest.select_to_manifest(**kwargs)
-        return MultiIndex(new_manifest, parent=self.parent)
+        return MultiIndex(new_manifest, self.parent,
+                          prepend_location=self.prepend_location)


 class LazyLoadedIndex(Index):
16 changes: 16 additions & 0 deletions src/sourmash/lca/lca_db.py
@@ -78,6 +78,9 @@ def __init__(self, ksize, scaled, moltype='DNA'):
     def location(self):
         return self.filename

+    def __len__(self):
+        return self._next_index
+
     def _invalidate_cache(self):
         if hasattr(self, '_cache'):
             del self._cache
@@ -177,6 +180,10 @@ def signatures(self):
         for v in self._signatures.values():
             yield v

+    def _signatures_with_internal(self):
+        for idx, ss in self._signatures.items():
+            yield ss, self.location, idx
+
     def select(self, ksize=None, moltype=None, num=0, scaled=0, abund=None,
                containment=False, picklist=None):
         """Make sure this database matches the requested requirements.
@@ -297,6 +304,15 @@ def load(cls, db_name):
             for k, v in load_d['idx_to_lid'].items():
                 db.idx_to_lid[int(k)] = v

+        if db.ident_to_idx:
+            db._next_index = max(db.ident_to_idx.values()) + 1
+        else:
+            db._next_index = 0
+        if db.idx_to_lid:
+            db._next_lid = max(db.idx_to_lid.values()) + 1
+        else:
+            db._next_lid = 0
+
         db.filename = db_name

         return db
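
The counter reset added to `load` can be read as "one past the largest value already in use, or zero for an empty mapping". A small sketch of that rule follows; `next_free_index` is a hypothetical helper written for this illustration, not part of `LCA_Database`.

```python
def next_free_index(mapping):
    """Return one more than the largest value in 'mapping', or 0 if it is empty."""
    # max() raises ValueError on an empty sequence, hence the explicit guard.
    if mapping:
        return max(mapping.values()) + 1
    return 0


print(next_free_index({"sigA": 0, "sigB": 1, "sigC": 4}))
print(next_free_index({}))
```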
6 changes: 5 additions & 1 deletion src/sourmash/manifest.py
@@ -2,6 +2,7 @@
 Manifests for collections of signatures.
 """
 import csv
+import ast

 from sourmash.picklist import SignaturePicklist

@@ -40,6 +41,9 @@ def __bool__(self):
     def __len__(self):
         return len(self.rows)

+    def __eq__(self, other):
+        return self.rows == other.rows
+
     @classmethod
     def load_from_csv(cls, fp):
         "load a manifest from a CSV file."
@@ -70,7 +74,7 @@ def load_from_csv(cls, fp):
             for k in introws:
                 row[k] = int(row[k])
             for k in boolrows:
-                row[k] = bool(row[k])
+                row[k] = bool(ast.literal_eval(str(row[k])))
             row['signature'] = None
             manifest_list.append(row)
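
The change to the boolean columns matters because `bool()` on any non-empty string is `True`, so a CSV field holding the text `False` would be read back as `True`. A minimal sketch of the difference, where `parse_bool_field` is a hypothetical helper name used only for this illustration:

```python
import ast


def parse_bool_field(value):
    """Convert a CSV field such as 'True', 'False', '1' or '0' into a bool."""
    # literal_eval turns the text into the Python literal it spells out
    # (True, False, 1, 0, ...); bool() then normalizes the result.
    return bool(ast.literal_eval(str(value)))


print(bool("False"))              # True: any non-empty string is truthy
print(parse_bool_field("False"))  # False
print(parse_bool_field("1"))      # True
print(parse_bool_field("0"))      # False
```

Note that `ast.literal_eval` raises an exception on text that is not a valid Python literal, such as an empty field, so this approach assumes the column is always populated.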
