IndexError: list index out of range and AssertionError #186

Closed
gemygk opened this issue Jun 19, 2019 · 20 comments

@gemygk
Collaborator

gemygk commented Jun 19, 2019

Hi @lucventurini,

The Mikado pick stage is giving me some errors with version mikado-20190610_94160dd.

Please see below the error and logs.

CMD:

Mikado pick command:

singularity exec /ei/software/testing/mikado/20190610_94160dd/x86_64/Singularity.img mikado pick --procs 32 --json-conf mikado.configuration.update_protein.yaml --subloci_out mikado.subloci.gff3

Pick Log:

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/pick.log

WD:

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all

ERROR:

2019-06-18 19:23:04,018 - scaffold_2:828029-24085450 - loci_processer.py:589 - ERROR - analyse_locus - LociProcesser-22 - list index out of range
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/loci_processer.py", line 583, in analyse_locus
    stranded_locus.define_loci()
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 1106, in define_loci
    self.define_monosubloci()
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 988, in define_monosubloci
    self.define_subloci()
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 898, in define_subloci
    transcript_graph = self.reduce_complex_loci(transcript_graph)
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 742, in reduce_complex_loci
    transcript_graph, max_edges = self.reduce_method_two(transcript_graph)
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 789, in reduce_method_two
    if neigh_first_corr[0][0] > current.start:
IndexError: list index out of range
2019-06-18 19:23:04,019 - scaffold_2:828029-24085450 - loci_processer.py:590 - ERROR - analyse_locus - LociProcesser-22 - Removing failed locus superlocus:scaffold_2-:2729704-24066884
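
For illustration only (this is not Mikado's code): the failing line indexes the first element of neigh_first_corr without first checking whether the list is empty, and an empty list reproduces exactly this exception. A minimal sketch, assuming the neighbour correlation list can be empty for a transcript:

# Hypothetical reproduction of the failure mode shown in the traceback above.
neigh_first_corr = []          # assumption: a transcript with no scored neighbours
current_start = 828029         # placeholder coordinate

try:
    if neigh_first_corr[0][0] > current_start:
        pass
except IndexError as exc:
    print(exc)                 # prints "list index out of range", as in the log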

Mikado has not generated any files since Jun 18 20:26 - is it hanging at the moment?

In addition, there is one more error that I can see:

AssertionError: (492, 220, 492, 'Invalid CDS length: 220 % 3 = 1', '#')

Can you please look into this?

Thanks,
Gemy

@lucventurini
Collaborator

Hi @gemygk, would you be able to repeat the run with the latest container I uploaded yesterday?
mikado-20190618_3b32a01

I may have already solved this bug, but in case I have not, I'll try to fix it today.

@gemygk
Collaborator Author

gemygk commented Jun 19, 2019

Hi @lucventurini,

Sure, I will test it on just one scaffold and update you. I cannot test it on the full run, though, as the full run has already taken >5 days for the pick stage alone (with 32 threads).

@lucventurini
Collaborator

Hi @gemygk,

I cannot test it on the full run, though, as the full run has already taken >5 days for the pick stage alone (with 32 threads).

That is really strange and worrying. I will have a look at it.

@gemygk
Collaborator Author

gemygk commented Jun 19, 2019

I think it is the depth of transcripts we have at a locus that is causing these long runtimes.

@gemygk
Collaborator Author

gemygk commented Jun 19, 2019

Hi @lucventurini,

I am getting another error now.

WD:
/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/test_mikado-20190618_3b32a01

CMD used:
source mikado-20190618_3b32a01 && /usr/bin/time -v mikado pick --procs 32 --json-conf mikado.configuration.update_protein.yaml --subloci_out mikado.subloci.gff3

The error that I am getting is below:

2019-06-19 10:20:14,144 - pick_init - configurator.py:601 - ERROR - check_json - MainProcess - 'Scoring file not found: ../plant.yaml'
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 585, in check_json
    json_conf, overwritten = _check_scoring_file(json_conf, logger)
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 539, in _check_scoring_file
    raise InvalidJson("Scoring file not found: {0}".format(json_conf["pick"]["scoring_file"]))
Mikado.exceptions.InvalidJson: 'Scoring file not found: ../plant.yaml'
2019-06-19 10:20:14,145 - main - __init__.py:123 - ERROR - main - MainProcess - Mikado crashed, cause:
2019-06-19 10:20:14,145 - main - __init__.py:124 - ERROR - main - MainProcess - (InvalidJson('Scoring file not found: ../plant.yaml'), '/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/test_mikado-20190618_3b32a01/mikado.configuration.update_protein.yaml')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 652, in to_json
    json_dict = check_json(json_dict, simple=simple, logger=logger)
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 585, in check_json
    json_conf, overwritten = _check_scoring_file(json_conf, logger)
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 539, in _check_scoring_file
    raise InvalidJson("Scoring file not found: {0}".format(json_conf["pick"]["scoring_file"]))
Mikado.exceptions.InvalidJson: 'Scoring file not found: ../plant.yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/__init__.py", line 109, in main
    args.func(args)
  File "/usr/local/lib/python3.7/site-packages/Mikado/subprograms/pick.py", line 175, in pick
    args.json_conf = to_json(args.json_conf.name, logger=logger)
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 654, in to_json
    raise OSError((exc, string))
OSError: (InvalidJson('Scoring file not found: ../plant.yaml'), '/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/test_mikado-20190618_3b32a01/mikado.configuration.update_protein.yaml')

@lucventurini
Collaborator

Hi @gemygk, this is because you are launching Mikado from a different folder, so the relative soft link ("../plant.yaml") is no longer valid.
I have created the soft link and relaunched.

@gemygk
Collaborator Author

gemygk commented Jun 19, 2019

@lucventurini, ah, I see your point. My mistake for not changing the scoring file location in the Mikado configuration file.

@lucventurini
Collaborator

Out of curiosity, by the way, how did we end up with a region (scaffold_2:828029-24085450) that has a whopping 167,901 transcripts?

I am not surprised that Mikado has difficulties managing such a huge amount of data! I will have to investigate where the choke point is, but this is orders of magnitude bigger than what I wrote the program for ...

@swarbred self-assigned this Jun 19, 2019
@swarbred
Collaborator

"Out of curiosity, by the way, how did we end up with a region (scaffold_2:=828029..24085450) that has a whopping 167,901 transcripts?"

@lucventurini had a look with @gemygk and this is due to our PacBio gmap alignments having some very large introns (up to 1.8Mb). This is even though in gmap we set a maximum middle-intron size of 50kb. Looking at the gmap parameters, there is a --split_large_intron option which indicates that gmap will generate alignments with introns over the max middle-intron setting unless this is also set.

While we have max intron settings in the requirements section of pick, I assume these are applied after superloci construction.

It might be useful to add a max intron size to prepare (we have a min cDNA size already) so that these alignments would be removed at the prepare stage. The default for this could be large, e.g. 1Mb (suitable for mammalian genomes), but it would at least filter out the most problematic alignments and avoid users running into similar issues.
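
As a rough illustration of this kind of manual pre-filtering (a sketch only, not a Mikado feature: it assumes a GTF from mikado prepare whose exon lines carry a transcript_id attribute, and the file names and the 50kb cutoff are placeholders):

#!/usr/bin/env python3
"""Drop transcripts whose largest intron exceeds a threshold from a prepare GTF."""

import re
from collections import defaultdict

MAX_INTRON = 50000                       # assumed cutoff, matching the gmap middle-intron limit
GTF_IN = "mikado_prepared.gtf"           # placeholder input (mikado prepare output)
GTF_OUT = "mikado_prepared.filtered.gtf" # placeholder output

tid_pattern = re.compile(r'transcript_id "([^"]+)"')
exons = defaultdict(list)    # transcript_id -> list of (exon_start, exon_end)
records = defaultdict(list)  # transcript_id -> all of its GTF lines

with open(GTF_IN) as handle:
    for line in handle:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = tid_pattern.search(fields[8])
        if match is None:
            continue
        tid = match.group(1)
        records[tid].append(line)
        if fields[2] == "exon":
            exons[tid].append((int(fields[3]), int(fields[4])))

def max_intron(coords):
    """Largest gap between consecutive exons (0 for mono-exonic transcripts)."""
    coords = sorted(coords)
    gaps = [nxt[0] - prev[1] - 1 for prev, nxt in zip(coords, coords[1:])]
    return max(gaps, default=0)

kept = {tid for tid, coords in exons.items() if max_intron(coords) <= MAX_INTRON}

with open(GTF_OUT, "w") as out:
    for tid, tid_lines in records.items():
        if tid in kept:
            out.writelines(tid_lines)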

@lucventurini
Collaborator

@swarbred agreed. This should avoid other issues as well.
It does not answer the question of what else is making Mikado so slow in the locus, but it might be a start.

@lucventurini
Collaborator

Hi @swarbred, this should now be implemented in the latest commit (a131b94). I have put in a generous default value of 1 million bps.
You can modify it like this in the configuration file:

prepare:
  max_intron_length: 50000

I would recommend redoing the whole M. persicae analysis with this parameter in place, to be honest, given the spurious alignments.

@swarbred
Collaborator

"I would recommend redoing the whole M. persicae analysis with this parameter in place, to be honest, given the spurious alignments."

Thanks @lucventurini, yes, we were going to just filter the prepare output manually; that way we don't need to redo the BLAST / ORF prediction. I assume it shouldn't be an issue having, say, ORFs loaded for the serialise step that are not in the prepare output that is then passed to pick.

@lucventurini
Collaborator

Thanks @lucventurini, yes, we were going to just filter the prepare output manually; that way we don't need to redo the BLAST / ORF prediction. I assume it shouldn't be an issue having, say, ORFs loaded for the serialise step that are not in the prepare output that is then passed to pick.

No, absolutely, it should not pose any problem at all. Please let me know how it goes.

@lucventurini
Collaborator

Hi @gemygk, @swarbred, regarding issue no. 1 (Mikado taking forever and crashing), it seems that the filtering did the trick.
I think that what happened is that, with such long introns, Mikado ended up having to digest in a single block a number of transcripts equivalent to the whole of the A. thaliana test set we used for the article. Unfortunately, it is not robust to such a deluge of data.

Regarding issue no. 2, there are still some transcripts for which the splitting mechanism seems to fail with the AssertionError reported above. I can definitely investigate those, but it should be a minor issue (I count 14 cases at ~31% of the run done).

@lucventurini
Collaborator

This particular error seems to be triggered by a specific case: ORFs assigned by the caller to the negative strand that are incorrectly given a phase of zero. E.g.:

sex_morph_FW.stringtie_sex_morph_FW_str.232.2	Prodigal_v2.6.3	CDS	934	1137	5.7	-	0	ID=12199_3;partial=00;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.657;conf=78.73;score=5.69;cscore=7.02;sscore=-1.33;rscore=0.33;uscore=-0.91;tscore=-0.75;

Here the ORF found by Prodigal has a GTG start, which is discarded by Mikado; the ORF is consequently enlarged to the end of the transcript, and instead of being assigned a phase of 2, it was assigned a phase of 0.
As Mikado ignores ORFs on the negative strand unless the transcript is mono-exonic and unstranded, this is a rare enough case that this bug was not triggered earlier.
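
For reference, a minimal sketch of the phase bookkeeping at play here, assuming the GFF3 convention that the phase of a 5'-truncated CDS counts the leading bases to skip so that the remaining length is a whole number of codons. The 220 bp length comes from the AssertionError reported earlier; the code is an illustration, not Mikado's:

def truncated_cds_phase(cds_length: int) -> int:
    """Phase of a 5'-truncated CDS whose downstream end is frame-complete."""
    return cds_length % 3

cds_length = 220                          # length from the AssertionError above
wrong_phase = 0                           # what the buggy code assigned
print((cds_length - wrong_phase) % 3)     # 1 -> frame inconsistent, hence the assertion

phase = truncated_cds_phase(cds_length)   # 1
print((cds_length - phase) % 3)           # 0 -> frame consistent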

@lucventurini
Collaborator

lucventurini commented Jul 4, 2019

Hi @gemygk,
I have found the origin of the bug. It indeed affected any ORF on the negative strand with an "invalid" start codon (typically GTG). I think this is not a massive bug in terms of effects, but it would probably be wise to redo all Mikado serialise runs. To clarify, I think that the faulty section of the code made a hash of things and created completely invalid ORFs in many of those instances (probably losing them completely instead of serialising them). Again, it should not affect a huge number of cases, but it is probably not as benign as I initially thought.

I understand that Mikado serialise took an extremely long time for your data on M. persicae; looking at the logs, this is due to the slow parsing of the XMLs. I will see what can be done about that at a later date.

@lucventurini
Collaborator

Hi @gemygk,
as a postscript to my previous comment: 3b32a01 should have made the serialisation faster, especially for XML files. Emphasis on should. It would therefore be really good to test the most recent version on your M. persicae data, to see whether there has been the improvement I hope for.

@lucventurini
Collaborator

Hi @gemygk,
using 10 cores instead of 1 (the default when you launched serialise) and increasing the number of objects to keep in memory before dumping to the database to 1M (default 100k) definitely made a difference to your M. persicae serialisation runtime, together with the other code changes.

The total running time was about 6 hours. This is not a free lunch, though, as the total memory requirement increased to ~40GB (indeed, I had to relaunch multiple times while fine-tuning the parameters, as memory usage grew too high). See:

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/serialise_test_7f87abe

and Job ID 21943543.

I am now analysing the same run using the new database (Job ID 21961614). So far I cannot see any error in the log. If it completes without warnings or errors, I will close the issue.
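
For reference, the settings described above would translate into something like the following in the configuration file (a sketch only: the parameter names under the serialise section are assumed here, so check them against the documentation of your Mikado version):

serialise:
  procs: 10
  max_objects: 1000000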

@gemygk
Collaborator Author

gemygk commented Jul 8, 2019

Hi @lucventurini,

Thanks for the update. Yes, I will keep monitoring.

@lucventurini
Collaborator

No errors were found in either the serialise or the pick step. Closing the issue.

lucventurini added a commit that referenced this issue Jul 8, 2019
* Fix #189
* Fix #186
* #183: added static seed from CLI for pick.
* #186: introduced a maximum intron length parameter for mikado prepare (prepare/max_intron_length), with a default value of 1M bps and a minimum value of 20.
* #186: there was a very serious bug in the evaluation of negative truncated ORFs, which potentially led to a lot of them being called incorrectly at the serialisation stage. Refactored the function responsible for the mishap and added a unit-test which confirmed fixing of the bug.
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
…ioinformatics#191)

* Fix EI-CoreBioinformatics#189
* Fix EI-CoreBioinformatics#186
* EI-CoreBioinformatics#183: added static seed from CLI for pick.
* EI-CoreBioinformatics#186: introduced a maximum intron length parameter for mikado prepare (prepare/max_intron_length), with a default value of 1M bps and a minimum value of 20.
* EI-CoreBioinformatics#186: there was a very serious bug in the evaluation of negative truncated ORFs, which potentially led to a lot of them being called incorrectly at the serialisation stage. Refactored the function responsible for the mishap and added a unit-test which confirmed fixing of the bug.