IndexError: list index out of range and AssertionError #186

Closed
gemygk opened this issue Jun 19, 2019 · 20 comments

@gemygk
Collaborator

gemygk commented Jun 19, 2019

Hi @lucventurini,

The Mikado pick stage is giving me some errors with version mikado-20190610_94160dd.

Please see below the error and logs.

CMD:

Mikado pick command:

singularity exec /ei/software/testing/mikado/20190610_94160dd/x86_64/Singularity.img mikado pick --procs 32 --json-conf mikado.configuration.update_protein.yaml --subloci_out mikado.subloci.gff3

Pick Log:

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/pick.log

WD:

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all

ERROR:

2019-06-18 19:23:04,018 - scaffold_2:828029-24085450 - loci_processer.py:589 - ERROR - analyse_locus - LociProcesser-22 - list index out of range
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/loci_processer.py", line 583, in analyse_locus
    stranded_locus.define_loci()
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 1106, in define_loci
    self.define_monosubloci()
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 988, in define_monosubloci
    self.define_subloci()
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 898, in define_subloci
    transcript_graph = self.reduce_complex_loci(transcript_graph)
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 742, in reduce_complex_loci
    transcript_graph, max_edges = self.reduce_method_two(transcript_graph)
  File "/usr/local/lib/python3.7/site-packages/Mikado/loci/superlocus.py", line 789, in reduce_method_two
    if neigh_first_corr[0][0] > current.start:
IndexError: list index out of range
2019-06-18 19:23:04,019 - scaffold_2:828029-24085450 - loci_processer.py:590 - ERROR - analyse_locus - LociProcesser-22 - Removing failed locus superlocus:scaffold_2-:2729704-24066884
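
For illustration only (this is not Mikado's code): the failing line indexes the first element of neigh_first_corr without first checking whether the list is empty, and an empty list reproduces exactly this exception. A minimal sketch, assuming the neighbour correlation list can be empty for a transcript:

# Hypothetical reproduction of the failure mode shown in the traceback above.
neigh_first_corr = []          # assumption: a transcript with no scored neighbours
current_start = 828029         # placeholder coordinate

try:
    if neigh_first_corr[0][0] > current_start:
        pass
except IndexError as exc:
    print(exc)                 # prints "list index out of range", as in the log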

Mikado has not generated any files since Jun 18 20:26 - is it hanging at the moment?

In addition, there is one more error that I can see:

AssertionError: (492, 220, 492, 'Invalid CDS length: 220 % 3 = 1', '#')

Can you please look into this?

Thanks,
Gemy

@lucventurini
Collaborator

Hi @gemygk, would you be able to repeat the run with the latest container I uploaded yesterday?
mikado-20190618_3b32a01

I may have already solved this bug, but in case I have not, I'll try to fix it today.

@gemygk
Collaborator Author

gemygk commented Jun 19, 2019

Hi @lucventurini,

Sure, I will test it on just one scaffold and update you. I cannot test it on the full run, though, as the full run has already taken >5 days for the pick stage alone (with 32 threads).

@lucventurini
Collaborator

Hi @gemygk,

I cannot test it on the full run, though, as the full run has already taken >5 days for the pick stage alone (with 32 threads).

That is really strange and worrying. I will have a look at it.

@gemygk
Collaborator Author

gemygk commented Jun 19, 2019

I think it is the depth of transcripts we have at a locus that is causing these long runtimes.

@gemygk
Collaborator Author

gemygk commented Jun 19, 2019

Hi @lucventurini,

I am getting another error now.

WD:
/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/test_mikado-20190618_3b32a01

CMD used:
source mikado-20190618_3b32a01 && /usr/bin/time -v mikado pick --procs 32 --json-conf mikado.configuration.update_protein.yaml --subloci_out mikado.subloci.gff3

The error that I am getting is below:

2019-06-19 10:20:14,144 - pick_init - configurator.py:601 - ERROR - check_json - MainProcess - 'Scoring file not found: ../plant.yaml'
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 585, in check_json
    json_conf, overwritten = _check_scoring_file(json_conf, logger)
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 539, in _check_scoring_file
    raise InvalidJson("Scoring file not found: {0}".format(json_conf["pick"]["scoring_file"]))
Mikado.exceptions.InvalidJson: 'Scoring file not found: ../plant.yaml'
2019-06-19 10:20:14,145 - main - __init__.py:123 - ERROR - main - MainProcess - Mikado crashed, cause:
2019-06-19 10:20:14,145 - main - __init__.py:124 - ERROR - main - MainProcess - (InvalidJson('Scoring file not found: ../plant.yaml'), '/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/test_mikado-20190618_3b32a01/mikado.configuration.update_protein.yaml')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 652, in to_json
    json_dict = check_json(json_dict, simple=simple, logger=logger)
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 585, in check_json
    json_conf, overwritten = _check_scoring_file(json_conf, logger)
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 539, in _check_scoring_file
    raise InvalidJson("Scoring file not found: {0}".format(json_conf["pick"]["scoring_file"]))
Mikado.exceptions.InvalidJson: 'Scoring file not found: ../plant.yaml'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/__init__.py", line 109, in main
    args.func(args)
  File "/usr/local/lib/python3.7/site-packages/Mikado/subprograms/pick.py", line 175, in pick
    args.json_conf = to_json(args.json_conf.name, logger=logger)
  File "/usr/local/lib/python3.7/site-packages/Mikado/configuration/configurator.py", line 654, in to_json
    raise OSError((exc, string))
OSError: (InvalidJson('Scoring file not found: ../plant.yaml'), '/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/test_mikado-20190618_3b32a01/mikado.configuration.update_protein.yaml')

@lucventurini
Collaborator

Hi @gemygk, this is because you are launching Mikado from a different folder, so the relative soft link ("../plant.yaml") is no longer valid.
I have created the soft link and relaunched.

@gemygk
Collaborator Author

gemygk commented Jun 19, 2019

@lucventurini, ah, I see your point. My mistake for not changing the scoring file location in the Mikado configuration file.

@lucventurini
Collaborator

Out of curiosity, by the way, how did we end up with a region (scaffold_2:828029-24085450) that has a whopping 167,901 transcripts?

I am not surprised that Mikado has difficulties managing such a huge amount of data! I will have to investigate where the choke point is, but this is orders of magnitude bigger than what I wrote the program for ...

@swarbred self-assigned this Jun 19, 2019
@swarbred
Collaborator

"Out of curiosity, by the way, how did we end up with a region (scaffold_2:=828029..24085450) that has a whopping 167,901 transcripts?"

@lucventurini had a look with @gemygk and this is due to our PacBio gmap alignments having some very large introns (up to 1.8Mb). This is even though in gmap we set a maximum middle-intron size of 50kb. Looking at the gmap parameters, there is a --split_large_intron option which indicates that gmap will generate alignments with introns over the max middle-intron setting unless this is also set.

While we have max intron settings in the requirements section of pick, I assume these are applied after superloci construction.

It might be useful to add a max intron size to prepare (we have a min cDNA size already) so that these alignments would be removed at the prepare stage. The default for this could be large, e.g. 1Mb (suitable for mammalian genomes), but it would at least filter out the most problematic alignments and avoid users running into similar issues.
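
As a rough illustration of this kind of manual pre-filtering (a sketch only, not a Mikado feature: it assumes a GTF from mikado prepare whose exon lines carry a transcript_id attribute, and the file names and the 50kb cutoff are placeholders):

#!/usr/bin/env python3
"""Drop transcripts whose largest intron exceeds a threshold from a prepare GTF."""

import re
from collections import defaultdict

MAX_INTRON = 50000                       # assumed cutoff, matching the gmap middle-intron limit
GTF_IN = "mikado_prepared.gtf"           # placeholder input (mikado prepare output)
GTF_OUT = "mikado_prepared.filtered.gtf" # placeholder output

tid_pattern = re.compile(r'transcript_id "([^"]+)"')
exons = defaultdict(list)    # transcript_id -> list of (exon_start, exon_end)
records = defaultdict(list)  # transcript_id -> all of its GTF lines

with open(GTF_IN) as handle:
    for line in handle:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = tid_pattern.search(fields[8])
        if match is None:
            continue
        tid = match.group(1)
        records[tid].append(line)
        if fields[2] == "exon":
            exons[tid].append((int(fields[3]), int(fields[4])))

def max_intron(coords):
    """Largest gap between consecutive exons (0 for mono-exonic transcripts)."""
    coords = sorted(coords)
    gaps = [nxt[0] - prev[1] - 1 for prev, nxt in zip(coords, coords[1:])]
    return max(gaps, default=0)

kept = {tid for tid, coords in exons.items() if max_intron(coords) <= MAX_INTRON}

with open(GTF_OUT, "w") as out:
    for tid, tid_lines in records.items():
        if tid in kept:
            out.writelines(tid_lines)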

@lucventurini
Collaborator

@swarbred agreed. This should avoid other issues as well.
It does not answer the question of what else is making Mikado so slow in the locus, but it might be a start.

@lucventurini
Collaborator

Hi @swarbred, this should now be implemented in the latest commit (a131b94). I have put in a generous default value of 1 million bps.
You can modify it like this in the configuration file:

prepare:
  max_intron_length: 50000

I would recommend redoing the whole M. persicae analysis with this parameter in place, to be honest, given the spurious alignments.

@swarbred
Collaborator

"I would recommend redoing the whole M. persicae analysis with this parameter in place, to be honest, given the spurious alignments."

Thanks @lucventurini, yes, we were going to just filter the prepare output manually; that way we don't need to redo the BLAST / ORF prediction. I assume it shouldn't be an issue having, say, ORFs loaded for the serialise step that are not in the prepare output that is then passed to pick.

@lucventurini
Collaborator

Thanks @lucventurini, yes, we were going to just filter the prepare output manually; that way we don't need to redo the BLAST / ORF prediction. I assume it shouldn't be an issue having, say, ORFs loaded for the serialise step that are not in the prepare output that is then passed to pick.

No, absolutely, it should not pose any problem at all. Please let me know how it goes.

@lucventurini
Collaborator

Hi @gemygk, @swarbred, regarding issue no. 1 (Mikado taking forever and crashing), it seems that the filtering did the trick.
I think that what happened is that, with such long introns, Mikado ended up having to digest in a single block a number of transcripts equivalent to the whole of the A. thaliana test set we used for the article. Unfortunately, it is not robust to such a deluge of data.

Regarding issue no. 2, there are still some transcripts for which the splitting mechanism seems to fail with the AssertionError reported above. I can definitely investigate those, but it should be a minor issue (I count 14 cases at ~31% of the run done).

@lucventurini
Collaborator

This particular error seems to be triggered by a specific case: ORFs assigned by the caller to the negative strand that are incorrectly given a phase of zero. E.g.:

sex_morph_FW.stringtie_sex_morph_FW_str.232.2	Prodigal_v2.6.3	CDS	934	1137	5.7	-	0	ID=12199_3;partial=00;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.657;conf=78.73;score=5.69;cscore=7.02;sscore=-1.33;rscore=0.33;uscore=-0.91;tscore=-0.75;

Here the ORF found by Prodigal has a GTG start, which is discarded by Mikado; the ORF is consequently enlarged to the end of the transcript, and instead of being assigned a phase of 2, it was assigned a phase of 0.
As Mikado ignores ORFs on the negative strand unless the transcript is mono-exonic and unstranded, this is a rare enough case that this bug was not triggered earlier.
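
For reference, a minimal sketch of the phase bookkeeping at play here, assuming the GFF3 convention that the phase of a 5'-truncated CDS counts the leading bases to skip so that the remaining length is a whole number of codons. The 220 bp length comes from the AssertionError reported earlier; the code is an illustration, not Mikado's:

def truncated_cds_phase(cds_length: int) -> int:
    """Phase of a 5'-truncated CDS whose downstream end is frame-complete."""
    return cds_length % 3

cds_length = 220                          # length from the AssertionError above
wrong_phase = 0                           # what the buggy code assigned
print((cds_length - wrong_phase) % 3)     # 1 -> frame inconsistent, hence the assertion

phase = truncated_cds_phase(cds_length)   # 1
print((cds_length - phase) % 3)           # 0 -> frame consistent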

@lucventurini
Collaborator

lucventurini commented Jul 4, 2019

Hi @gemygk,
I have found the origin of the bug. It indeed affected any ORF on the negative strand with an "invalid" start codon (typically GTG). I think this is not a massive bug in terms of effects, but it would probably be wise to redo all Mikado serialise runs. To clarify, I think that the faulty section of the code made a hash of things and created completely invalid ORFs in many of those instances (probably losing them completely instead of serialising them). Again, it should not affect a huge number of cases, but it is probably not as benign as I initially thought.

I understand that Mikado serialise took an extremely long time for your data on M. persicae; looking at the logs, this is due to the slow parsing of the XMLs. I will see what can be done about that at a later date.

@lucventurini
Collaborator

Hi @gemygk,
as a postscript to my previous comment: 3b32a01 should have made the serialisation faster, especially for XML files. Emphasis on should. It would therefore be really good to test the most recent version on your M. persicae data, to see whether there has been the improvement I hope for.

@lucventurini
Collaborator

Hi @gemygk,
using 10 cores instead of 1 (the default when you launched serialise) and increasing the number of objects to keep in memory before dumping to the database to 1M (default 100k) definitely made a difference to your M. persicae serialisation runtime, together with the other code changes.

The total running time was about 6 hours. This is not a free lunch, though, as the total memory requirement increased to ~40GB (indeed, I had to relaunch multiple times while fine-tuning the parameters, as memory usage grew too high). See:

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-20190606_6c8d542/trans_run1/mikado_all/serialise_test_7f87abe

and Job ID 21943543.

I am now analysing the same run using the new database (Job ID 21961614). So far I cannot see any error in the log. If it completes without warnings or errors, I will close the issue.
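
For reference, the settings described above would translate into something like the following in the configuration file (a sketch only: the parameter names under the serialise section are assumed here, so check them against the documentation of your Mikado version):

serialise:
  procs: 10
  max_objects: 1000000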

@gemygk
Collaborator Author

gemygk commented Jul 8, 2019

Hi @lucventurini,

Thanks for the update. Yes, I will keep monitoring.

@lucventurini
Collaborator

No errors were found in either the serialise or the pick step. Closing the issue.

lucventurini added a commit that referenced this issue Jul 8, 2019
* Fix #189
* Fix #186
* #183: added static seed from CLI for pick.
* #186: introduced a maximum intron length parameter for mikado prepare (prepare/max_intron_length), with a default value of 1M bps and a minimum value of 20.
* #186: there was a very serious bug in the evaluation of negative truncated ORFs, which potentially led to a lot of them being called incorrectly at the serialisation stage. Refactored the function responsible for the mishap and added a unit-test which confirmed fixing of the bug.
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
…ioinformatics#191)

* Fix EI-CoreBioinformatics#189
* Fix EI-CoreBioinformatics#186
* EI-CoreBioinformatics#183: added static seed from CLI for pick.
* EI-CoreBioinformatics#186: introduced a maximum intron length parameter for mikado prepare (prepare/max_intron_length), with a default value of 1M bps and a minimum value of 20.
* EI-CoreBioinformatics#186: there was a very serious bug in the evaluation of negative truncated ORFs, which potentially led to a lot of them being called incorrectly at the serialisation stage. Refactored the function responsible for the mishap and added a unit-test which confirmed fixing of the bug.