Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible inconsistency of methods of class MinHash #82

Closed
lgautier opened this issue Jan 3, 2017 · 9 comments · Fixed by #1128
Closed

Possible inconsistency of methods of class MinHash #82

lgautier opened this issue Jan 3, 2017 · 9 comments · Fixed by #1128

Comments

@lgautier
Copy link
Contributor

lgautier commented Jan 3, 2017

The class defines a __len__(), but the length of what is returned by get_mins is not necessarily consistent:

In [1]: import sourmash_lib._minhash

In [2]: mh = sourmash_lib._minhash.MinHash(1000, 21)

In [3]: len(mh)
Out[3]: 1000

In [4]: mh.get_mins()
Out[4]: []
@ctb
Copy link
Contributor

ctb commented Jan 3, 2017 via email

@lgautier
Copy link
Contributor Author

lgautier commented Jan 3, 2017

May be drop __len__ altogether then. MinHash does not seem to implement of Python's sequence protocol anyways.

Note that PR #79 is fixing a bit of the C code to allow the use of getters/setters (which allow size or max_size to be properties).

@ctb
Copy link
Contributor

ctb commented Jan 3, 2017 via email

@lgautier
Copy link
Contributor Author

lgautier commented Jan 3, 2017

May be drop __len__ altogether then. MinHash does not seem to implement of Python's sequence protocol anyways.

well, that could be fixed...?
__len__ is used for true/false or empty/not-empty calculations too, which I
like. Seems pythonic to have it work :)

Then if __len__ is desirable, it would make sense that get_mins() returns an object of identical length.

An alternative to that would be to have an attribute mins that is implementing __len__. Places where len(mh) is currently used would then become len(mh.mins) (keeping, length, and removing possible ambiguities).

@ctb
Copy link
Contributor

ctb commented Jan 3, 2017

yes, from my perspective:

__len__ should return what len(mh.get_mins()) would return (for efficiency)

new attribute mh.max_size returns the minhash num

It is important to have a mh.get_mins() function so that we can pass
in with_abundance keyword. In that context I worry that mh.mins would
be confusing if we had both.

@lgautier
Copy link
Contributor Author

lgautier commented Jan 3, 2017

What about the following as an alternative to mh.get_mins(with_abundance=True) ?

zip(mh.mins, mh.abundances)

@ctb
Copy link
Contributor

ctb commented Jan 3, 2017

We would have to encode the logic in minhash_get_mins (in _minhash.cc) in Python then.

@ctb
Copy link
Contributor

ctb commented Jan 3, 2017

This is probably OK and not too confusing, but it seems like something that should be done in fast C land if we already have the code working.

@lgautier
Copy link
Contributor Author

lgautier commented Jan 3, 2017

I am uncertain the loss in performance would matter much (but depends on use cases). If C at all costs one could have a specialized method mh.mins_with_abundance() (reusing the code already implemented). An other way to look at might be that abundances are bolted onto a container of hash values and an alternative design pattern could be considered (e.g, additional class QuantifiedMinHash).

Not a priority and this can be revisited later, if at all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants