If you have sketched many samples and want to remove "rare" k-mers (those present in only one or a few samples), this plugin is for you! This procedure helps reduce noise in Jaccard comparisons between samples.
See sourmash#2383 for an extended discussion!
Thanks to Taylor Reiter and Jessica Lumian for all their work on this!
pip install sourmash_plugin_commonhash
sourmash scripts commonhash <multiple sketches> -o commonhashes.zip
commonhash will output one filtered sketch for each input sketch.
You can then use the various sourmash sig commands to union these sketches, extract individual ones, etc.
sourmash scripts commonhash examples/*.sig.gz -o commonhash.zip
should yield:
...
Selecting k=31, DNA
Loaded 10587 hashes from 3 sketches in 3 files.
Of 10587 hashes, keeping 2529 that are in 2 or more samples.
Saved 3 signatures to 'commonhash.zip'
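To make the filtering step concrete, here is a minimal sketch in plain Python of the idea described above: count how many samples each hash appears in, then keep only the hashes found in 2 or more samples. This is an illustration only, not the plugin's actual code; it uses ordinary sets of integers in place of sourmash sketches, and the names filter_common_hashes, min_samples, and jaccard are hypothetical.

```python
from collections import Counter


def filter_common_hashes(samples, min_samples=2):
    """Return a filtered copy of each sample's hashes.

    `samples` maps a sample name to a set of integer hashes.
    `min_samples` is a hypothetical parameter for this sketch; the
    example output above uses a threshold of 2 samples.
    """
    # Count the number of samples in which each hash occurs.
    counts = Counter()
    for hashes in samples.values():
        counts.update(set(hashes))

    # Hashes present in at least `min_samples` samples are "common".
    keep = {h for h, n in counts.items() if n >= min_samples}

    # One filtered set per input sample, mirroring one sketch in, one out.
    return {name: set(hashes) & keep for name, hashes in samples.items()}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy data: hashes 10, 11 and 12 each occur in a single sample only.
samples = {
    "a": {1, 2, 3, 10},
    "b": {2, 3, 4, 11},
    "c": {3, 4, 5, 12},
}
filtered = filter_common_hashes(samples)

print(jaccard(samples["a"], samples["b"]))    # similarity before filtering
print(jaccard(filtered["a"], filtered["b"]))  # similarity after filtering
```

In this toy data, removing the hashes that occur in only one sample raises the Jaccard similarity between samples "a" and "b", because those hashes inflate the union without ever contributing to the intersection.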
We suggest filing issues in the main sourmash issue tracker as that receives more attention!
commonhash is developed at https://github.com/ctb/sourmash_plugin_commonhash.
Bump version number in pyproject.toml and push.
Make a new release on GitHub.
Then pull, and:
python -m build
followed by twine upload dist/...