unicycler stalls indefinitely while 'creating simple long read bridges' #256

Closed
biolene opened this issue Mar 3, 2021 · 3 comments

biolene commented Mar 3, 2021

I've been testing Unicycler extensively for hybrid assembly on a number of bacterial isolates (Bacillus). For 3 out of 10 samples I'm working on, Unicycler delivers a fairly complete, and, as it turns out, correct and accurate assembly, which confirms that this tool is very performant (and kudos for that!).
However, for the other 7 samples it stalls indefinitely at the step where long-read alignments are used to resolve repeat structures (I've pasted the last part of an example Unicycler log below). In those cases the process is killed, either by me or automatically after several hours, and Unicycler never finishes. This is especially peculiar because we now know that the 10 isolates represent essentially the same strain.

I've tried all kinds of fixes, but none of them really resolves the problem. I changed parameter settings in the Unicycler command; of these, only enforcing a specific k-mer length for the SPAdes assembly sometimes lets Unicycler run to completion, but the resulting assembly is bad. More extensive filtering of the long-read (Nanopore) data set does not help. I also tried more rigorous filtering of the Illumina reads and combining short-read sets from different samples; the latter sometimes solves the issue, but is obviously not a sustainable solution.

The isolates all carry an extrachromosomal recombinant plasmid containing an inserted gene that is also present on the genome. If I remove all reads matching the plasmid from both the short- and long-read data, Unicycler runs to completion on the filtered data. The SPAdes assembly graph shows that SPAdes initially assembles the plasmid sequence as part of the genome (which I know for sure to be incorrect). But this is the case for all the isolates, including the ones for which Unicycler runs to completion, so this in itself cannot explain why it fails on the other samples.

Despite all this digging, I still fail to grasp what exactly the problem is and how we could fix it. Any thoughts and suggestions are highly appreciated. Has anybody else experienced a similar problem?

`Example output of Unicycler:

Creating simple long read bridges (2021-02-26 21:58:09)

Unicycler uses long read alignments (from minimap) to resolve simple repeat structures in the graph. This takes care of some "low-hanging fruit" of the graph simplification.

Aligning long reads to graph using minimap

Two-way junctions are defined as cases where two graph contigs (A and B) join together (C) and then split apart again (D and E). This usually represents a simple 2-copy repeat, and there are two possible options for its resolution: (A→C→D and B→C→E) or (A→C→E and B→C→D). Each read which spans such a junction gets to "vote" for option 1, option 2 or neither. Unicycler creates a bridge at each junction for the most voted for option.

                                                                           Op. 1   Op. 2   Neither   Final   Bridge
Junction   Option 1                        Option 2                        votes   votes   votes     op.     quality
87         -40 → 87 → -39, -39 → 87 → 17   -40 → 87 → 17, -39 → 87 → -39   635     0       37        1       59.9
47         2 → 47 → 28, 44 → 47 → 44       2 → 47 → 44, 44 → 47 → 28       155     13      518       1       0.0
93         -5 → 93 → -12, -3 → 93 → 23     -5 → 93 → 23, -3 → 93 → -12     453     1       33        1       85.0

Simple loops are parts of the graph where two contigs (A and B) are connected via a repeat (C) which loops back to itself (via D). It is possible to traverse the loop zero times (A→C→B), one time (A→C→D→C→B), two times (A→C→D→C→D→C→B), etc. Long reads which span the loop inform which is the correct number of times through. In this step, such reads are found and each is aligned against alternative loop counts. A reads casts its "vote" for the loop count it agrees best with, and Unicycler creates a bridge using the most voted for count.

                                Read                        Loop    Bridge
Start   Repeat   Middle   End   count   Read votes          count   quality
-17     -87      39       40    100     1 loop: 96 votes    1       59.1
                                        2 loops: 4 votes `
=> here it stalls indefinitely
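
For readers unfamiliar with the voting described in the log above, here is a minimal sketch of the tally for a two-way junction. It is not Unicycler's code: the function name `pick_option` and the tie-breaking rule are assumptions, and the sketch makes no attempt to reproduce how the bridge quality score is computed. The vote counts are copied from the junction table in the log.

```python
def pick_option(op1_votes: int, op2_votes: int) -> int:
    """Return 1 or 2 for the option with more votes.

    Ties go to option 1 here; that is an arbitrary choice for this sketch,
    not documented Unicycler behaviour.
    """
    return 1 if op1_votes >= op2_votes else 2


# (junction, option 1 votes, option 2 votes, "neither" votes) from the log above
junctions = [(87, 635, 0, 37), (47, 155, 13, 518), (93, 453, 1, 33)]

for junction, op1, op2, neither in junctions:
    total = op1 + op2 + neither
    print(
        f"junction {junction}: option {pick_option(op1, op2)} has the most votes, "
        f"{neither / total:.0%} of spanning reads voted for neither"
    )
```

For junction 47, 518 of the 686 spanning reads voted for neither option, and that is also the junction the log reports with a bridge quality of 0.0.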

Owner

rrwick commented Jan 19, 2022

I don't fully understand what's going on here, and I'll need to get my hands on a dataset which causes the issue to really investigate. I have, however, got a workaround in place. In the new version of Unicycler I'm working on now, there is an option, --no_simple_bridges, to skip the simple long-read bridging step entirely. There were options to turn off other bridging steps (--no_miniasm and --no_long_read_alignment) so this new option fits in well enough.

So even if I don't know what the root problem is, you can now use --no_simple_bridges when you encounter this to avoid the stall.
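
As a rough illustration of the workaround, the sketch below builds a hybrid-assembly command with the new option appended. The helper name `build_unicycler_command` and the file names are placeholders, not taken from this issue; `-1`, `-2`, `-l` and `-o` are Unicycler's usual short-read, long-read and output options, and `--no_simple_bridges` only exists in versions that include the change described above.

```python
import shlex


def build_unicycler_command(short1, short2, long_reads, out_dir, skip_simple_bridges=True):
    """Assemble the argument list for a hybrid Unicycler run."""
    cmd = ["unicycler", "-1", short1, "-2", short2, "-l", long_reads, "-o", out_dir]
    if skip_simple_bridges:
        # Option named in this thread; older Unicycler versions will reject it.
        cmd.append("--no_simple_bridges")
    return cmd


# Placeholder file names for illustration only.
cmd = build_unicycler_command(
    "short_1.fastq.gz", "short_2.fastq.gz", "long.fastq.gz", "assembly_out"
)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if a path contains spaces.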

Ryan

Author

biolene commented Jan 19, 2022

Hi, good to hear that you are working on a new version of Unicycler! Meanwhile we have published these results (https://doi.org/10.3390/foods10112637), and the datasets are publicly available under study accession number PRJEB44065. Details concerning which assemblies succeeded and failed are given in the supplementary data of the manuscript. So feel free to have a go at the data, and if you find out something I would be very interested to learn about it :)

Owner

rrwick commented Jan 20, 2022

Great - I think I've figured it out. Unicycler was encountering what it thought might be a simple loop in the graph, but which was actually a high-depth plasmid. This led it to try far too many loop-count possibilities in the simple long-read bridging step. It would have finished eventually, but it was wasting tons of time.

I've put a simple fix in place: f4afc33. It might still be slow in cases like this, but it should be much better than before. Thanks for the help with the link to the dataset!
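
To illustrate why a high-depth segment can make this step so slow, the sketch below enumerates the candidate paths for a simple loop (A→C→B, A→C→D→C→B, and so on) and shows how a cap bounds the number of alternatives. This is not the change made in f4afc33, whose details are not described in this thread; `loop_path`, `candidate_loop_counts` and the `max_loops` cap are hypothetical names and values chosen for illustration.

```python
def loop_path(start, repeat, middle, end, loops):
    """Path through a simple loop traversed `loops` times: A, C, then (D, C) per loop, then B."""
    return [start, repeat] + [middle, repeat] * loops + [end]


def candidate_loop_counts(estimated_loops, max_loops=10):
    """Loop counts to test, from 0 up to an estimate, capped at max_loops.

    Both the estimate and the cap are assumptions made for this sketch.
    """
    return list(range(min(estimated_loops, max_loops) + 1))


# Segment numbers from the simple-loop table in the log:
# start -17, repeat -87, middle 39, end 40.
for loops in candidate_loop_counts(2):
    print(loops, loop_path(-17, -87, 39, 40, loops))

# A very high estimate (as a high-depth plasmid might suggest) with and without a cap.
print(len(candidate_loop_counts(500)), "capped vs", len(candidate_loop_counts(500, max_loops=500)), "uncapped")
```

Since each spanning read is aligned against every alternative loop count, the work grows roughly with the number of reads times the number of candidate counts, so bounding the candidates bounds the run time.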

Ryan
