multigather individual contigs from one file #3089

Closed
Krasnopeev opened this issue Mar 20, 2024 · 5 comments

Krasnopeev commented Mar 20, 2024

Hi,

I can't figure out how the multigather option works and need some help with it.
My goal is to classify each contig from metagenome assembly for downstream analysis.

I use the latest version of sourmash, installed via conda:

== This is sourmash version 4.8.6. ==

I have an assembly file contigs.fasta with many contigs in it (about 30k contigs in total).

commands I use:

sourmash sketch dna -p scaled=100,k=51 contigs.fasta --singleton --name-from-first
sourmash multigather --query contigs.fasta.sig --db gtdb-rs214-reps.k51.zip --threshold-bp 1500

The problem is that the multigather command overwrites the output file on each iteration.

So should I make a bunch of FASTA files, one per contig, and try to reproduce the pipeline from issue #2816 (comment), or is there another way to get a taxonomic classification for each contig?

Thanks a lot!


ctb commented Mar 20, 2024

hi @Krasnopeev, in re the overwriting of outputs, I think you're running into the gap between our latest development version and the latest release! See the comment here; as of a few weeks ago, #2722 was merged, but has not yet been released - it will be in sourmash v4.8.7, sometime soon (next 3-4 weeks).

If you want to use the latest dev version, you can try compiling sourmash yourself as per comment with the following modification:

git clone https://github.com/sourmash-bio/sourmash 
cd sourmash
pip install -e .

but I know that's a lot of unwelcome work ;(.

I still have some more work to do on #2816 to respond to the latest comments, too. Apologies!

In terms of things to do today with v4.8.6 -

  • it should work to do 30k different files, one for each contig, with multigather; I know that's annoying but it will work. You will probably need to resketch them, I'm afraid, so that the filename attribute of the sketch is correct. (You could also use the branchwater plugin's manysketch command to sketch 30k individual FASTA files - it's nice and fast!)
  • you could brave fastmultigather
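
For the one-file-per-contig route, here is a minimal sketch of the split step using only bash and awk. The function name `split_fasta` and the `contig_000001.fasta` naming scheme are hypothetical choices for illustration, not part of sourmash; files are numbered rather than named after the contig so that unusual characters in FASTA headers cannot produce invalid filenames.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of sourmash): split a multi-record FASTA
# into one file per record, numbered in input order.
split_fasta() {
    local input="$1" outdir="$2"
    mkdir -p "$outdir"
    awk -v outdir="$outdir" '
        /^>/ {
            # A new header starts a new output file; close the previous one
            # so that 30k contigs do not exhaust open file handles.
            if (out) close(out)
            n++
            out = sprintf("%s/contig_%06d.fasta", outdir, n)
        }
        # Copy header and sequence lines; anything before the first
        # header is skipped because out is still empty.
        out { print > out }
    ' "$input"
}

# Example call (paths are placeholders):
#   split_fasta contigs.fasta contigs_split
```

Each resulting file would then need to be sketched on its own, with the same parameters as the original command (`-p scaled=100,k=51`), so that every sketch carries the right filename.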

sorry again for all the construction dust!

@ctb ctb changed the title multigather induvidual contigs from one file multigather individual contigs from one file Mar 20, 2024
Krasnopeev commented

thanks @ctb !
Yesterday I tried to set up the dev version of sourmash, but ran into errors in the Rust environment and got stuck there. Today I noticed that you published a new version, 4.8.7, and it seems to work just fine!
I finally got separate reports for individual contigs.

I also noticed that multigather takes a long time to process my 30k-contig input against the gtdb-rs214-reps.k51.zip database. Is there any way to speed this up?
My idea is to go back to your suggestion of splitting the input file into single queries, and then run sourmash on them in parallel, using something like parallel. That may require a lot of RAM, though; I just need to try. I have a node with 128 threads and 1 TB of RAM. I haven't tried this yet, but I hope it will work :)
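
A rough sketch of that split-and-parallelize idea, assuming there is already one signature file per contig in a directory and that `sourmash` is on the PATH. The function name `run_gather_parallel`, the directory layout, and the `.gather.csv` suffix are hypothetical; `xargs -P` stands in here for the parallel tool. Each worker opens the database on its own, so memory use should be expected to grow with the number of jobs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run one `sourmash gather` per signature file,
# with up to $jobs processes at a time.
run_gather_parallel() {
    local sigdir="$1" db="$2" outdir="$3" jobs="${4:-8}"
    mkdir -p "$outdir"
    find "$sigdir" -name '*.sig' -print0 |
        xargs -0 -P "$jobs" -I{} sh -c '
            # $1 = query signature, $2 = database, $3 = output directory
            name=$(basename "$1" .sig)
            sourmash gather "$1" "$2" --threshold-bp 1500 \
                -o "$3/$name.gather.csv"
        ' _ {} "$db" "$outdir"
}

# Example call (paths and job count are placeholders):
#   run_gather_parallel sigs gtdb-rs214-reps.k51.zip gather_out 16
```

Starting with a small job count and watching memory before scaling up is probably the safer way to find out how many workers the node can hold.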


ctb commented Mar 22, 2024

> Today I noticed that you published a new version, 4.8.7, and it seems to work just fine!
> I finally got separate reports for individual contigs.

Excellent! Yes, I was going to let you know about 4.8.7 but didn't get a chance before you found it for yourself ;)

> I also noticed that multigather takes a long time to process my 30k-contig input against the gtdb-rs214-reps.k51.zip database. Is there any way to speed this up?

Yep - fastmultigather should help. It uses multithreading.

Please post here (this issue, or a new issue) if you have trouble using fastmultigather!


ctb commented Mar 22, 2024

(I need to write a little tutorial for using fastmultigather, TBH. It's a little bit different from gather at the moment.)


ctb commented Mar 22, 2024

I've written a short fastmultigather quickstart here: #3095.
