kProcessor/kSpider/sourmash thoughts for downsampling/containment matrix #90
note that sourmash is lower case 😁
Some sourmash code - haven't tested it, but it should mostly work :)
Load signatures from ...anything - a .sig file, a .zip file, a directory:
>>> import sourmash
>>> loaded_sigs = sourmash.load_file_as_signatures(fpath)
You might want to select out only the scaled signatures, since num signatures are a different beast and can't really be used the way we would like:
>>> loaded_sigs = loaded_sigs.select(scaled=True)
Retrieve sketches:
>>> for ss in loaded_sigs:
...     mh = ss.minhash
Retrieve ksize and moltype and scaled/num from the sketches:
>>> ksize = mh.ksize
>>> moltype = mh.moltype # 'DNA', 'protein', 'dayhoff', 'hp'
>>> scaled = mh.scaled # if 0, this is a 'num' sketch
Get actual hashes:
>>> for hashval in mh.hashes:
...     print(hashval)
Retrieve abundances:
>>> for hashval, abund in mh.hashes.items():
...     print(hashval, abund)
🎉
@ctb
it's complicated but tl;dr the code above will work, because each signature will be made to have exactly one sketch. (the signature creation and save code does things slightly differently; see sourmash-bio/sourmash#1647 and sourmash-bio/sourmash#616 esp, but the load code splits it out so it's one signature for one sketch, and each sketch has one ksize.) you might also be getting confused because a single .sig file can contain many different signatures, as well as signatures with multiple sketches. so, as I said, confusing. but the code above will work.
Extending sourmash-bio/sourmash#1750, I am copying a conversation between @ctb and @drtamermansour to be detailed later into tasks.
Slack Conversation
Tamer Mansour 8:38 AM
@titus @taylorreiter We can solve this problem VERY efficiently by using kProcessor/kSpider.
Titus Brown 8:39 AM
it’s not hard to do in sourmash, either, although I suspect kProcessor/kSpider has specialized lookup tables that make it even faster.
8:39
the issue is integrating it into the CLI.
Tamer Mansour 8:41 AM
kProcessor/kSpider can perform pairwise calculation of shared k-mers for 20k genes in a couple of GB of space and a few minutes
8:42
We can prepare the output in a format that sourmash can read
8:44
All we need is to implement a simple parser for sourmash signature files
8:45
In the CLI, I think sourmash can call kProcessor/kSpider under the hood
Titus Brown 8:46 AM
so, two issues here -
Rob is talking about (potentially) very large metagenomes, and the sourmash downsampling approach is probably quite important for his application, since metagenomes can be so much larger than transcriptomes. e.g. “a couple GB space” rapidly becomes 100s of GB.
As Taylor experienced, sourmash isn’t doing a good job with sparse matrices, either, so if you have 20k x 20k queries you run out of memory regardless :)
In this case, kProcessor/kSpider would be completely replacing sourmash, not producing output for it to read.
I think what we want is something similar to what you suggested - a parser for kProcessor/kSpider to read sourmash sig files.
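The memory concern above comes from storing every cell of a 20k x 20k comparison, even though most pairs share nothing. A minimal sketch of one way to keep the result sparse: build an inverted index from hash value to sample names, then count only the pairs that actually share a hash. This is an illustration under stated assumptions, not the kProcessor/kSpider or sourmash implementation; `shared_hash_counts` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shared_hash_counts(sketches):
    """Count shared hashes for every pair of samples that shares at least one.

    `sketches` maps a sample name to a set of integer hash values.
    Pairs with nothing in common never appear in the result, so the
    output stays sparse. (Hypothetical helper, not a library function.)
    """
    # Inverted index: hash value -> names of the samples containing it.
    index = defaultdict(list)
    for name, hashes in sketches.items():
        for h in hashes:
            index[h].append(name)

    counts = defaultdict(int)
    for names in index.values():
        # Each hash adds 1 to every pair of samples that holds it.
        for a, b in combinations(sorted(names), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Made-up sketches for illustration only.
sketches = {
    "s1": {1, 2, 3, 4},
    "s2": {3, 4, 5},
    "s3": {9},
}
pairs = shared_hash_counts(sketches)
print(pairs)
```

Only the `("s1", "s2")` pair is stored here; `s3` shares nothing and costs no memory in the output. A hash present in very many samples still generates many pairs, so this helps most when overlaps are rare.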
8:46
Integrating kProcessor into sourmash is not simple or easy, especially since we moved sourmash over to Rust.
Tamer Mansour 8:50 AM
Sourmash does not need to integrate kProcessor. Just use it as a third party tool. Sourmash has to do the downsampling, prep the signature files, call kProcessor/kSpider module, and finally present the output through the sourmash visualization scripts
Titus Brown 8:50 AM
call kProcessor/kSpider module
8:51
if we did it that way, sourmash would now include kProcessor/kSpider as a dependency 🙂
Tamer Mansour 8:51 AM
yes
8:51
it is a python package
8:52
This is what kProcessor is made for 🙂
Titus Brown 8:54 AM
I don’t think that’s a good idea; sourmash is pretty strict about versioning and dependencies.
It should be pretty straightforward to have kProcessor read sourmash sig files (it’s just k-mers and hashes!), do the comparison, and output a numpy matrix that can be read by sourmash plot. In this case I’m pretty sure Rob (and Taylor) don’t want to use the viz tools, anyway, which won’t scale to that number of samples.
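For reference, the "do the comparison" step can be sketched in a few lines once each sample is reduced to a set of hashes. The example below computes a containment matrix, where cell [i][j] is the fraction of sample i's hashes also found in sample j, so the matrix is not symmetric. `containment_matrix` and the data are hypothetical; the result is a plain list of lists that could be converted to a numpy array, and the exact files that sourmash plot expects should be checked against the sourmash documentation.

```python
def containment_matrix(names, sketches):
    """Return a row-major matrix where cell [i][j] = |A_i & A_j| / |A_i|.

    `names` fixes the row/column order; `sketches` maps each name to a
    set of integer hashes. Empty sketches get 0.0 to avoid dividing by zero.
    """
    matrix = []
    for a in names:
        hashes_a = sketches[a]
        row = []
        for b in names:
            shared = len(hashes_a & sketches[b])
            row.append(shared / len(hashes_a) if hashes_a else 0.0)
        matrix.append(row)
    return matrix


# Made-up sketches for illustration only.
names = ["s1", "s2"]
sketches = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5, 6, 7, 8, 9, 10}}
m = containment_matrix(names, sketches)
for name, row in zip(names, m):
    print(name, row)
```

Half of `s1` is contained in `s2`, but only a quarter of `s2` is contained in `s1`, which is why containment is kept per direction rather than folded into one similarity value.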
8:55
If you put together a demo of the 20kx transcript query somewhere, we can suggest it, even without the downsampling.
8:56
If we had an extensions framework in sourmash, could do it that way, too.
8:56
But I don’t think we want sourmash to have kProcessor as a required dependency.
Tamer Mansour 8:57 AM
Sure, we can prep a demo. But our current indexing cannot handle 20k whole datasets.
8:57
Mostafa is working on the new indexing algorithm to do so in the near future
Titus Brown 8:58 AM
So this seems like a good use case for future development, but it’s not something we should suggest to Rob right now ;)
Tamer Mansour 8:59 AM
We can implement downsampling in kProcessor. That should be even easier than developing a parser for sourmash signatures
9:00
I can make something to try next week
Titus Brown 9:00 AM
yes, it’s easy, but now you have distinct code bases and you don’t get the advantage of all the sourmash signature manipulation utilities. If it’s simple to parse sourmash signatures (which it should be - they’re “just” JSON, plus we have a sourmash API to load it) then might as well add that too.
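To illustrate the "just JSON" point, here is a rough sketch of a standalone parser that flattens a signature file into one record per sketch, mirroring the one-signature-per-sketch behavior described earlier in the thread. The JSON layout and field names used here (`signatures`, `ksize`, `molecule`, `mins`) are assumptions based on a trimmed-down, made-up example and should be verified against a real .sig file; `iter_sketches` is a hypothetical helper.

```python
import json

# Hypothetical, trimmed-down signature file. Field names are assumptions
# and must be checked against real sourmash output before relying on them.
SIG_JSON = """
[
  {
    "name": "sample1",
    "signatures": [
      {"ksize": 21, "molecule": "DNA", "max_hash": 1000, "mins": [5, 17, 42]},
      {"ksize": 31, "molecule": "DNA", "max_hash": 1000, "mins": [8, 99]}
    ]
  }
]
"""


def iter_sketches(text):
    """Yield one flat record per sketch found in the JSON text."""
    for sig in json.loads(text):
        for sketch in sig.get("signatures", []):
            yield {
                "name": sig.get("name", ""),
                "ksize": sketch["ksize"],
                "moltype": sketch.get("molecule", "DNA"),
                "hashes": set(sketch["mins"]),
            }


records = list(iter_sketches(SIG_JSON))
for rec in records:
    print(rec["name"], rec["ksize"], rec["moltype"], len(rec["hashes"]))
```

Loading through the sourmash API instead avoids depending on the on-disk layout, which is the trade-off raised in the message above.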
9:01
If nothing else, it’s a good way to test your downsampling code in kProcessor by running the same operations in kProcessor directly vs loading with sourmash.
9:02
(the downsampling code is now pretty simple in sourmash, but it took a while to get there, and we have a LOT of tests for it. so it’s pretty robustly tested. No reason to discard that.)
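As a rough illustration of why scaled downsampling is simple, the sketch below keeps only the hashes that fall in the lowest 1/scaled fraction of a 64-bit hash space. Because the decision depends only on the hash value, the same k-mer is kept or dropped consistently in every sample, and a sketch can be downsampled again later to a larger scaled value. The function names are hypothetical, and the exact rounding and boundary handling in sourmash may differ from the integer division used here.

```python
MAX_HASH_SPACE = 2**64  # assumes 64-bit hash values


def max_hash_for_scaled(scaled):
    """Largest hash value retained for a given scaled factor (assumed rule)."""
    return MAX_HASH_SPACE // scaled


def downsample(hashes, scaled):
    """Keep only hashes in the lowest 1/scaled of the hash space."""
    cutoff = max_hash_for_scaled(scaled)
    return {h for h in hashes if h <= cutoff}


# Made-up hash values spread across the 64-bit range.
hashes = {0, 2**60, 2**63, 2**64 - 1}

direct = downsample(hashes, 4)
two_step = downsample(downsample(hashes, 2), 4)
print(sorted(direct))
print(direct == two_step)
```

Comparing output like this between two implementations, on the same input hashes, is one way to run the cross-check suggested above.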
Tamer Mansour 9:03 AM
This is also a good solution
Titus Brown 9:03 AM
also check out the sourmash sketch documentation. Soooo maaaaany oppppppptions to implement. Ugh.
Tamer Mansour 9:05 AM
I was thinking of a very simple approach. Incorporating one k-mer out of every 1000 while reading the input sequences. That is it 😄
9:05
But using sourmash is a better idea
Titus Brown 9:06 AM
I see value in both, TBH. No reason to force people to jump through hoops either way, and as you say, the code is simple for scaled downsampling.
Titus Brown 9:28 AM
One other thought - the seq-to-hashes stuff that @mr-eyes implemented for sourmash could be used directly by kProcessor to build scaled=1 dataframes in DNA space (which you already have working) as well as translated and protein queries. Again, ultimately you probably want this in kProcessor directly, but it’s a pretty simple call to sourmash to get the functionality working right now.
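For readers unfamiliar with the seq-to-hashes idea, the following is a minimal sketch of turning a DNA sequence into one hash per k-mer, which corresponds to keeping everything (scaled=1). It uses canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) so both strands give the same hashes. The hash here is a `hashlib.blake2b` stand-in chosen only to keep the example self-contained; sourmash uses MurmurHash, so these values will not match sourmash output. Translated and protein modes are not covered, and all function names are hypothetical.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, ksize):
    """Yield the canonical form of every k-mer in a DNA sequence."""
    seq = seq.upper()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize]
        revcomp = kmer.translate(COMPLEMENT)[::-1]
        yield min(kmer, revcomp)


def hash_kmer(kmer):
    """Stand-in 64-bit hash; NOT the hash function sourmash uses."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def seq_to_hashes(seq, ksize):
    """Return one hash per k-mer, in sequence order (no downsampling)."""
    return [hash_kmer(k) for k in canonical_kmers(seq, ksize)]


kmers = list(canonical_kmers("ACGTT", 3))
forward = seq_to_hashes("ACGTT", 3)
reverse = seq_to_hashes("AACGT", 3)  # reverse complement of ACGTT
print(kmers)
print(set(forward) == set(reverse))
```

Calling into sourmash for this step, as suggested above, would give hashes that are directly comparable with existing sourmash signatures, which a stand-in hash cannot do.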