problem with creating gcsa index and singularity #861
Comments
Hello @Chinaza11! Nice to see you are interested in using our software. If you enclose your logs in
Then they will be a lot easier to read, because GitHub will not try to format them as Markdown. It looks like your problem with Singularity is:
You're not actually allowed to run that
But you probably don't need Singularity; you managed to get the dependencies installed just fine without it.
I think your real problem is that your graph is too complex for GCSA indexing. If a graph has too many closely spaced variants, computing the GCSA index becomes impractical. This is one of the main reasons our lab has moved away from
Now, we also have tools to "prune" complex graphs, to get a subgraph that can be indexed with GCSA indexing. You can then use the smaller graph's index to map against the full graph with
But by default, after pruning the graph, we put back all the nodes and edges that were on named paths embedded in the graph. This makes sense for VCF-based graphs: we want to remove the extra VCF variants we can't handle but keep the backbone linear reference. But if you're using the yeast graph, which is made from yeast assemblies, then every node is on a path, since each assembly has a path in the graph. So pruning might not actually be doing anything, because we immediately put back all the pruned nodes, since they're on paths.
It looks like the only way to avoid that would be to avoid the
toil-vg/src/toil_vg/vg_index.py Lines 203 to 209 in 58773e1
And it looks like you can do that by using a GBWT file. In that case, we take the path "threads" stored in the GBWT and replace the pruned-out areas with all the threads as long nodes that are alternatives to each other. So basically we undo the alignments in complex regions and replace them with unaligned sequences.
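To make the path-restoring behavior described above concrete, here is a minimal toy sketch in Python. It is not the vg or toil-vg implementation: the function name `prune_with_restore`, the node numbering, and the path names are all made up for illustration, and real pruning works on graph topology rather than a hand-picked set of "complex" nodes.

```python
def prune_with_restore(all_nodes, complex_nodes, paths):
    """Toy model of pruning followed by restoring embedded paths.

    all_nodes:     iterable of node IDs in the graph
    complex_nodes: node IDs the pruner would drop (hypothetical input)
    paths:         dict mapping path name -> list of node IDs it visits
    """
    all_nodes = set(all_nodes)
    # Step 1: drop the nodes in regions considered too complex.
    kept = all_nodes - set(complex_nodes)
    # Step 2: put back every node that lies on a named path.
    on_paths = set()
    for node_ids in paths.values():
        on_paths.update(node_ids)
    return kept | (on_paths & all_nodes)


nodes = {1, 2, 3, 4, 5, 6}
complex_nodes = {3, 5}

# VCF-style graph: only the linear reference is an embedded path,
# so the variant-only nodes 3 and 5 stay pruned.
vcf_result = prune_with_restore(nodes, complex_nodes, {"ref": [1, 2, 4, 6]})

# Assembly-style graph: every node is on some assembly's path,
# so restoring paths brings back everything that was pruned.
assembly_result = prune_with_restore(
    nodes,
    complex_nodes,
    {"asmA": [1, 2, 4, 6], "asmB": [1, 3, 4, 5, 6]},
)

print(sorted(vcf_result))       # nodes left in the VCF-style case
print(sorted(assembly_result))  # nodes left in the assembly-style case
```

In the assembly-style case the result is the full node set again, which is why pruning with path restoration may have no effect on a graph built purely from assemblies.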
So, overall, I would recommend:
Thanks a lot @adamnovak for your quick, detailed response and recommendations.
Hmm, we might not have written the logic to generate a GBWT from embedded paths into
Maybe instead of
Interestingly, I used
I will revisit
That aside, I will also read up on
Thanks once again for your quick responses and insights. I now have a clearer understanding and a path to execute this project.
Hi,
Thanks for creating this tool. I have been trying to index a variation graph, but I keep running into errors. Could you please help me figure out what I might be missing?
I am working in an HPC cluster environment where Docker is not allowed, so I installed the required dependencies (vg, pigz, tabix, bcftools) in a conda environment. I ran the task with 700 GB of memory and it was killed due to memory issues. When I used 1.37 TB of memory, the task didn't get killed, but it didn't run successfully either.
The variation graph was the first chromosome of five yeast species (from your paper). My actual data is much larger, but I was using the yeast data for a speed test, so I think 700 GB of memory should have been enough. I did some research and tentatively concluded that there might be a memory leak, and that the differing dependency versions in the conda environment might be the cause. So I decided to use the Singularity container option. That threw a different kind of error, which I have not been able to find a solution to online. So far, I saw something about embedding an overlay to fix the read-only issue, but I haven't gotten far with this.