[MRG] add sig fileinfo and standard manifest-generating functionality (#1837)

* initial addition of 'sig fileinfo'

* finish first-draft implementation of fileinfo and get_manifest

* cleanup and move over to sourmash_args

* add manifest and length support to LCA_Database

* add rebuild/no-rebuild args

* fix __len__ for zipfiles, __bool__ interpretation

* add some comments

* fix error on bad path

* fix len(LCA_Database)

* fix fileinfo on a single .sig file

* initial stab at updating sig extract to use manifests

* fix sig extract mistake

* fix manifest test for LCA databases, which now works

* add __len__ to base Index class

* a test, a test

* more tests

* fix more MultiIndex

* change up MultiIndex

* fix tests etc

* fix remaining test

* minor refactor

* add tests for LCA_Database __len__ and __bool__

* add test for zip manifest + select

* add explicit tests for location

* add tests for prepend_location

* update sbts

* cleanup and fix

* update docs

* test sbt.json inputs for fileinfo

* update to show combinations of sketches

* update docstring at top of sourmash_args

* fix up stdin loading

* switch to wrapping stdin with MultiIndex

* update docs

* rough out last set of tests

* add tests for get_manifest()

* fix both test and code ;)

* add debug, do more tests, uncover some ...puzzling behavior

* fix abund problem

* add abunds sig test

* clear up the abund comments

* add yaml and json out

* more cleanup, test yaml and json output contents

* remove debug print

* add YAML to install

* fix pyyaml spec

* remove yaml & pyyaml dep

* Apply suggestions from code review

thanks @bluegenes!

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

* put 'total hashes' in same format as rest

* update docs with latest format

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
ctb and bluegenes authored Feb 24, 2022
1 parent 530d833 commit 27826d8
Showing 18 changed files with 1,016 additions and 99 deletions.
47 changes: 47 additions & 0 deletions doc/command-line.md
@@ -993,6 +993,39 @@ size: 5177
signature license: CC0
```

### `sourmash signature fileinfo` - display a summary of the contents of a sourmash collection

Display a summary of the contents of a signature file, database, or collection.

For example,
```
sourmash sig fileinfo tests/test-data/prot/all.zip
```
will display:
```
path filetype: ZipFileLinearIndex
location: /Users/t/dev/sourmash/tests/test-data/prot/all.zip
is database? yes
has manifest? yes
is nonempty? yes
num signatures: 8
** examining manifest...
31758 total hashes
summary of sketches:
2 sketches with dayhoff, k=19, scaled=100 7945 total hashes
2 sketches with hp, k=19, scaled=100 5184 total hashes
2 sketches with protein, k=19, scaled=100 8214 total hashes
2 sketches with DNA, k=31, scaled=1000 10415 total hashes
```

`sig fileinfo` will recognize
[all accepted sourmash input files](#loading-signatures-and-databases),
including individual .sig and .sig.gz files, Zip file collections, SBT
databases, LCA databases, and directory hierarchies.

`sourmash sig fileinfo` provides optional JSON and YAML output, and
those formats are under semantic versioning.

### `sourmash signature split` - split signatures into individual files

Split each signature in the input file(s) into individual files, with
@@ -1271,6 +1304,20 @@ exit on the first bad k-mer. If `--check-sequence --force` is provided,
`sig kmers` will provide error messages (and skip bad sequences), but
will continue processing input sequences.

### `sourmash signature manifest` - output a manifest for a file

Output a manifest for a file, database, or collection.

For example,
```
sourmash sig manifest tests/test-data/prot/all.zip -o manifest.csv
```
will create a CSV file, `manifest.csv`, in the internal sourmash
manifest format. The manifest will contain an entry for every
signature in the file, database, or collection. This format is largely
meant for internal use, but it can serve as a picklist pickfile for
subsetting large collections.

## Advanced command-line usage

### Loading signatures and databases
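
The `summary of sketches` lines in the `sig fileinfo` output above group sketches by molecule type, k-mer size, and scaled value. A minimal sketch of that kind of grouping over manifest-like rows follows. The column names (`moltype`, `ksize`, `scaled`, `n_hashes`) and the helper `summarize_rows` are assumptions made for illustration; they are not taken from this commit's implementation.

```python
from collections import defaultdict


def summarize_rows(rows):
    """Group manifest-like rows by (moltype, ksize, scaled).

    Returns (moltype, ksize, scaled, n_sketches, total_hashes) tuples in
    order of first appearance. Column names are assumed, not confirmed.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_sketches, total_hashes]
    for row in rows:
        key = (row["moltype"], int(row["ksize"]), int(row["scaled"]))
        counts[key][0] += 1
        counts[key][1] += int(row["n_hashes"])
    return [(*key, n, total) for key, (n, total) in counts.items()]


# Hypothetical rows; a real manifest carries more columns than these.
example_rows = [
    {"moltype": "DNA", "ksize": 31, "scaled": 1000, "n_hashes": 5000},
    {"moltype": "DNA", "ksize": 31, "scaled": 1000, "n_hashes": 5415},
    {"moltype": "protein", "ksize": 19, "scaled": 100, "n_hashes": 8214},
]

for moltype, ksize, scaled, n, total in summarize_rows(example_rows):
    print(f"{n} sketches with {moltype}, k={ksize}, scaled={scaled} "
          f"{total} total hashes")
```

Because the grouping needs only per-sketch metadata, it can run over a manifest without loading any signature data.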
1 change: 1 addition & 0 deletions src/sourmash/cli/sig/__init__.py
@@ -11,6 +11,7 @@
 from . import extract
 from . import filter
 from . import flatten
+from . import fileinfo
 from . import kmers
 from . import intersect
 from . import manifest
31 changes: 31 additions & 0 deletions src/sourmash/cli/sig/fileinfo.py
@@ -0,0 +1,31 @@
+"""provide summary information on the given file"""
+
+
+def subparser(subparsers):
+    subparser = subparsers.add_parser('fileinfo')
+    subparser.add_argument('path')
+    subparser.add_argument(
+        '-q', '--quiet', action='store_true',
+        help='suppress non-error output'
+    )
+    subparser.add_argument(
+        '-d', '--debug', action='store_true',
+        help='output debug information'
+    )
+    subparser.add_argument(
+        '-f', '--force', action='store_true',
+        help='try to load all files as signatures'
+    )
+    subparser.add_argument(
+        '--rebuild-manifest', help='forcibly rebuild the manifest',
+        action='store_true'
+    )
+    subparser.add_argument(
+        '--json-out', help='output information in JSON format only',
+        action='store_true'
+    )
+
+
+def main(args):
+    import sourmash
+    return sourmash.sig.__main__.fileinfo(args)
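
To see how these options parse, a trimmed copy of `subparser()` can be attached to a throwaway `argparse` parser. This is a minimal sketch: the top-level parser, its `prog` name, and `dest='cmd'` are assumptions for illustration and do not reproduce sourmash's actual CLI wiring.

```python
import argparse


def subparser(subparsers):
    # Trimmed copy of the options added in fileinfo.py above.
    p = subparsers.add_parser('fileinfo')
    p.add_argument('path')
    p.add_argument('-f', '--force', action='store_true')
    p.add_argument('--rebuild-manifest', action='store_true')
    p.add_argument('--json-out', action='store_true')


# Hypothetical top-level parser standing in for the real CLI.
parser = argparse.ArgumentParser(prog='sig')
subparser(parser.add_subparsers(dest='cmd'))

args = parser.parse_args(['fileinfo', 'all.zip', '--json-out'])
print(args.cmd, args.path, args.json_out, args.rebuild_manifest)
```

Both boolean flags use `store_true`, so they are `False` unless given on the command line.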
8 changes: 8 additions & 0 deletions src/sourmash/cli/sig/manifest.py
@@ -8,6 +8,10 @@ def subparser(subparsers):
         '-q', '--quiet', action='store_true',
         help='suppress non-error output'
     )
+    subparser.add_argument(
+        '-d', '--debug', action='store_true',
+        help='output debug information'
+    )
     subparser.add_argument(
         '-o', '--output', '--csv', metavar='FILE',
         help='output information to a CSV file',
@@ -17,6 +21,10 @@ def subparser(subparsers):
         '-f', '--force', action='store_true',
         help='try to load all files as signatures'
     )
+    subparser.add_argument(
+        '--no-rebuild-manifest', help='use existing manifest if available',
+        action='store_true'
+    )


 def main(args):
73 changes: 51 additions & 22 deletions src/sourmash/index/__init__.py
@@ -52,6 +52,10 @@ class Index(ABC):
     is_database = False
     manifest = None

+    @abstractmethod
+    def __len__(self):
+        "Return the number of signatures in this Index object."
+
     @property
     def location(self):
         "Return a resolvable location for this index, if possible."
@@ -408,11 +412,13 @@ def save(self, path):
             save_signatures(self.signatures(), fp)

     @classmethod
-    def load(cls, location):
+    def load(cls, location, filename=None):
         "Load signatures from a JSON signature file."
         si = load_signatures(location, do_raise=True)

-        lidx = LinearIndex(si, filename=location)
+        if filename is None:
+            filename=location
+        lidx = LinearIndex(si, filename=filename)
         return lidx

     def select(self, **kwargs):
@@ -557,6 +563,14 @@ def __bool__(self):
         return True

     def __len__(self):
+        "calculate number of signatures."
+
+        # use manifest, if available.
+        m = self.manifest
+        if self.manifest is not None:
+            return len(m)
+
+        # otherwise, iterate across all signatures.
         n = 0
         for _ in self.signatures():
             n += 1
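
The pattern in this hunk, preferring a precomputed manifest and falling back to iteration, can be sketched in isolation. `TinyIndex` below is a hypothetical stand-in, not a sourmash class, and its `__bool__` is deliberately simplified. It also illustrates why an explicit `__bool__` matters: without one, Python falls back to `__len__` for truth testing, which could trigger the slow path.

```python
class TinyIndex:
    """Hypothetical stand-in for an index with an optional manifest."""

    def __init__(self, sigs, manifest=None):
        self._sigs = list(sigs)
        self.manifest = manifest
        self.iterations = 0  # counts full scans, for demonstration only

    def signatures(self):
        self.iterations += 1
        yield from self._sigs

    def __bool__(self):
        # Simplified: always truthy. Without a __bool__, `if idx:` would
        # call __len__ and possibly scan every signature.
        return True

    def __len__(self):
        # Fast path: the manifest already knows how many entries exist.
        if self.manifest is not None:
            return len(self.manifest)
        # Slow path: count by iterating over every signature.
        return sum(1 for _ in self.signatures())


with_manifest = TinyIndex(["a", "b", "c"], manifest=["a", "b", "c"])
without_manifest = TinyIndex(["a", "b", "c"])

print(len(with_manifest), with_manifest.iterations)
print(len(without_manifest), without_manifest.iterations)
print(bool(without_manifest), without_manifest.iterations)
```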
@@ -845,12 +859,20 @@ class MultiIndex(Index):
     Concrete class; signatures held in memory; builds and uses manifests.
     """
-    def __init__(self, manifest, parent=""):
+    def __init__(self, manifest, parent, *, prepend_location=False):
         """Constructor; takes manifest containing signatures, together with
-        optional top-level location to prepend to internal locations.
+        the top-level location.
         """
         self.manifest = manifest
         self.parent = parent
+        self.prepend_location = prepend_location
+
+        if prepend_location and self.parent is None:
+            raise ValueError("must set 'parent' if 'prepend_location' is set")
+
+    @property
+    def location(self):
+        return self.parent

     def signatures(self):
         for row in self.manifest.rows:
@@ -861,7 +883,7 @@ def signatures_with_location(self):
             loc = row['internal_location']
             # here, 'parent' may have been removed from internal_location
             # for directories; if so, add it back in.
-            if self.parent:
+            if self.prepend_location:
                 loc = os.path.join(self.parent, loc)
             yield row['signature'], loc
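
The effect of `prepend_location` on reported locations can be sketched without sourmash. The function name and row layout below are hypothetical; only the `internal_location` key and the `os.path.join` step come from the hunk above.

```python
import os.path


def locations(rows, parent, *, prepend_location=False):
    """Yield the location reported for each manifest-like row."""
    for row in rows:
        loc = row["internal_location"]
        if prepend_location:
            # The stored path may be relative to `parent`; add it back.
            loc = os.path.join(parent, loc)
        yield loc


rows = [{"internal_location": "a.sig"}, {"internal_location": "sub/b.sig"}]

# Directory-style load: the parent is joined onto each internal location.
print(list(locations(rows, "sigs", prepend_location=True)))

# Single-file or pathlist load: internal locations are used unchanged.
print(list(locations(rows, "list.txt")))
```

Before this change the test was `if self.parent:`, so any non-empty parent caused prepending. With an explicit flag, `parent` can always record where the index came from, and it is joined onto internal locations only when the caller asks for that.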

@@ -877,13 +899,16 @@ def _signatures_with_internal(self):


     def __len__(self):
+        if self.manifest is None:
+            return 0
+
         return len(self.manifest)

     def insert(self, *args):
         raise NotImplementedError

     @classmethod
-    def load(cls, index_list, source_list, parent=""):
+    def load(cls, index_list, source_list, parent, *, prepend_location=False):
         """Create a MultiIndex from already-loaded indices.

         Takes two arguments: a list of Index objects, and a matching list
@@ -903,10 +928,11 @@ def sigloc_iter():
                     yield ss, iloc

         # build manifest; note, signatures are stored in memory.
+        # CTB: could do this on demand?
         manifest = CollectionManifest.create_manifest(sigloc_iter())

         # create!
-        return cls(manifest, parent=parent)
+        return cls(manifest, parent, prepend_location=prepend_location)

     @classmethod
     def load_from_directory(cls, pathname, *, force=False):
@@ -942,7 +968,8 @@ def load_from_directory(cls, pathname, *, force=False):
         if not index_list:
             raise ValueError(f"no signatures to load under directory '{pathname}'")

-        return cls.load(index_list, source_list, parent=pathname)
+        return cls.load(index_list, source_list, pathname,
+                        prepend_location=True)

     @classmethod
     def load_from_path(cls, pathname, force=False):
@@ -957,19 +984,20 @@ def load_from_path(cls, pathname, force=False):

         if os.path.isdir(pathname): # traverse
             return cls.load_from_directory(pathname, force=force)
-        else: # load as a .sig/JSON file
-            index_list = []
-            source_list = []
-            try:
-                idx = LinearIndex.load(pathname)
-                index_list = [idx]
-                source_list = [pathname]
-            except (IOError, sourmash.exceptions.SourmashError):
-                if not force:
-                    raise ValueError(f"no signatures to load from '{pathname}'")
-                return None
-
-            return cls.load(index_list, source_list)
+
+        # load as a .sig/JSON file
+        index_list = []
+        source_list = []
+        try:
+            idx = LinearIndex.load(pathname)
+            index_list = [idx]
+            source_list = [pathname]
+        except (IOError, sourmash.exceptions.SourmashError):
+            if not force:
+                raise ValueError(f"no signatures to load from '{pathname}'")
+            return None
+
+        return cls.load(index_list, source_list, pathname)

     @classmethod
     def load_from_pathlist(cls, filename):
@@ -992,15 +1020,16 @@ def load_from_pathlist(cls, filename):
             idx_list.append(idx)
             src_list.append(src)

-        return cls.load(idx_list, src_list)
+        return cls.load(idx_list, src_list, filename)

     def save(self, *args):
         raise NotImplementedError

     def select(self, **kwargs):
         "Run 'select' on the manifest."
         new_manifest = self.manifest.select_to_manifest(**kwargs)
-        return MultiIndex(new_manifest, parent=self.parent)
+        return MultiIndex(new_manifest, self.parent,
+                          prepend_location=self.prepend_location)


 class LazyLoadedIndex(Index):
16 changes: 16 additions & 0 deletions src/sourmash/lca/lca_db.py
@@ -78,6 +78,9 @@ def __init__(self, ksize, scaled, moltype='DNA'):
     def location(self):
         return self.filename

+    def __len__(self):
+        return self._next_index
+
     def _invalidate_cache(self):
         if hasattr(self, '_cache'):
             del self._cache
@@ -177,6 +180,10 @@ def signatures(self):
         for v in self._signatures.values():
             yield v

+    def _signatures_with_internal(self):
+        for idx, ss in self._signatures.items():
+            yield ss, self.location, idx
+
     def select(self, ksize=None, moltype=None, num=0, scaled=0, abund=None,
                containment=False, picklist=None):
         """Make sure this database matches the requested requirements.
@@ -297,6 +304,15 @@ def load(cls, db_name):
             for k, v in load_d['idx_to_lid'].items():
                 db.idx_to_lid[int(k)] = v

+        if db.ident_to_idx:
+            db._next_index = max(db.ident_to_idx.values()) + 1
+        else:
+            db._next_index = 0
+        if db.idx_to_lid:
+            db._next_lid = max(db.idx_to_lid.values()) + 1
+        else:
+            db._next_lid = 0
+
         db.filename = db_name

         return db
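
The counters restored in this hunk follow a simple rule: the next free identifier is one past the largest identifier already in use, or 0 for an empty mapping. A minimal sketch with a hypothetical helper, `next_free_id`:

```python
def next_free_id(mapping):
    """Return one past the largest value in `mapping`, or 0 if empty."""
    return max(mapping.values()) + 1 if mapping else 0


print(next_free_id({"sigA": 0, "sigB": 1, "sigC": 2}))
print(next_free_id({}))
```

Since `LCA_Database.__len__` in this commit returns `_next_index`, the reported length equals the number of signatures only when indices are assigned contiguously from 0; that contiguity is an assumption of this sketch, not something the diff states.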
6 changes: 5 additions & 1 deletion src/sourmash/manifest.py
@@ -2,6 +2,7 @@
 Manifests for collections of signatures.
 """
 import csv
+import ast

 from sourmash.picklist import SignaturePicklist

@@ -40,6 +41,9 @@ def __bool__(self):
     def __len__(self):
         return len(self.rows)

+    def __eq__(self, other):
+        return self.rows == other.rows
+
     @classmethod
     def load_from_csv(cls, fp):
         "load a manifest from a CSV file."
@@ -70,7 +74,7 @@ def load_from_csv(cls, fp):
             for k in introws:
                 row[k] = int(row[k])
             for k in boolrows:
-                row[k] = bool(row[k])
+                row[k] = bool(ast.literal_eval(str(row[k])))
             row['signature'] = None
             manifest_list.append(row)
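
The one-line change to the boolean columns fixes a common CSV pitfall: the `csv` module yields strings, and any non-empty string, including `'False'` and `'0'`, is truthy. The sketch below contrasts the two conversions on an in-memory CSV. The column name `with_abundance` and the sample rows are assumptions used for illustration.

```python
import ast
import csv
import io

# Hypothetical two-column CSV; values arrive as strings.
data = "name,with_abundance\nsigA,True\nsigB,False\nsigC,0\n"

results = []
for row in csv.DictReader(io.StringIO(data)):
    raw = row["with_abundance"]
    naive = bool(raw)                          # old conversion: truthy string
    parsed = bool(ast.literal_eval(str(raw)))  # new conversion: parse first
    results.append((row["name"], naive, parsed))

for name, naive, parsed in results:
    print(name, naive, parsed)
```

Note that `ast.literal_eval` raises on input that is not a Python literal (an empty field, for instance), so this conversion assumes the column always holds values such as `True`, `False`, `0`, or `1`.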
