Add option to allow requirements and not_fragmentary checks to be applied to models marked as reference #240

swarbred · 2019-10-17T08:46:22Z

Hi @lucventurini
In GMC we currently make use of the --only-reference-update flag as we have labels (sets of models) marked as reference (e.g. multiple sets of augustus models + subset of mikado models) but also other sets e.g. say the full mikado output that we only include for the purposes of adding splice variants (so marked as not reference). To avoid these models being selected as a primary models we give them a low base score and run with the --only-reference-update flag so loci based on these transcripts alone are excluded.

This works but it means that the requirements and not_fragmentary checks are not applied to the models marked as reference. Most of the time this is the desired behaviour but in our case I would like these checks to be applied to the reference models.

I wondered if we could add an option to override the default behaviour and allow requirements and not_fragmentary checks to be applied to reference models?

lucventurini · 2019-10-17T13:09:53Z

Hi @swarbred, yes, I think it is feasible.
I should have the fix soon.

…STED.

…tch for both `pick` and `configure`.

lucventurini · 2019-10-17T15:13:51Z

Hi @swarbred,
I should have done this in b1d0cf1. Contextually, I have also added the relevant option to both pick and configure.

I have not tested this as creating the right dataset would take quite a bit of time. If you could have a go with it, and verify it functions, it would be great.

gemygk · 2019-10-18T09:21:07Z

Thanks @lucventurini

@swarbred, I have now installed mikado-2.0rc6_b1d0cf1_CBG to our cluster for testing.

singularity exec /ei/software/cb/mikado/2.0rc6_b1d0cf1_CBG/x86_64/mikado-2.0rc6_b1d0cf1_CBG.img show_commit_hash
b1d0cf10f926d299dcd9c8731cba1480353408bf

swarbred · 2019-10-25T12:25:55Z

@lucventurini

Have just tested this with version 89eca18 and this looks to have an issue (or i'm not understanding how it should work)

My test data set is at

/tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET

I ran two picks the only difference being check_references being set as either true or false

mikado pick -v --mode nosplit --seed 10 --only-reference-update --procs 1 --json-conf mikado.configuration.run5-test_check_references_true.yaml --subloci-out mikado.subloci.gff3 -od mikado_pick_check_references_true

mikado pick -v --mode nosplit --seed 10 --only-reference-update --procs 1 --json-conf mikado.configuration.run5-test_check_references_false.yaml --subloci-out mikado.subloci.gff3 -od mikado_pick_check_references_false

The false run outputs models in the mikado.loci.gff3 file the true version has no models.

So I was expecting the true run to apply the requirements to models marked as reference, some filtering should occur but not all models should be removed. Looking at the pick log and it looks like models were removed as the gene had no reference transcripts but the model below is marked as a reference (and was output in the "false" run). So perhaps check_references is not working as expected when set to true.

2019-10-25 12:01:52,094 - scaffold_1:815344-829529 - abstractlocus.py:1151 - DEBUG - _check_not_passing - MainProcess - Transcript MYZPE13164_run2_MYZPE13164_r2_0000950.1 (source MYZPE13164_run2) is not a reference transcript (references: ['Agly_coding', 'Apis_coding', 'BtabMEAM1_coding', 'BtabMED_coding', 'Dnox_coding', 'Mcer_coding', 'Mikado_All_gold', 'Mikado_PacBio_gold', 'MperG006_coding', 'MYZPE13164_run1', 'MYZPE13164_run2', 'MYZPE13164_run3', 'Nlug_coding', 'Pnig_coding', 'Rmai_coding', 'Rpad_coding', 'Sfuc_coding']; in it: True) 2019-10-25 12:01:52,097 - scaffold_1:815344-829529 - locus.py:195 - INFO - finalize_alternative_splicing - MainProcess - MYZPE13164_run2_MYZPE13164_r2_0000950.1 is now padded 2019-10-25 12:01:52,097 - scaffold_1:815344-829529 - locus.py:199 - DEBUG - finalize_alternative_splicing - MainProcess - gene:scaffold_1+:823385-829529 has 2 transcripts (MYZPE13164_run3_MYZPE13164_r3_0000980.1, MYZPE13164_run2_MYZPE13164_r2_0000950.1) 2019-10-25 12:01:52,099 - scaffold_1:815344-829529 - superlocus.py:1222 - DEBUG - define_loci - MainProcess - Removing gene:scaffold_1+:823385-829529 (primary: MYZPE13164_run2_MYZPE13164_r2_0000950.1) as it has no reference transcripts.

lucventurini · 2019-10-25T12:36:26Z

Hi @swarbred,
no, that's a different bug. I have no idea of why, but Mikado has started (only in the single-threaded mode) to use the "True/False" values as "1/0". This has ... created all sorts of havoc. Including the reference check.

I was planning to investigate that.

In the meantime, da0265b contains both the fix for the reference check problem and should fix the other issues with the alias.

I am really sorry about all of this - I did some changes that should have been transparent to the backbone, but apparently are not.

…that caused boolean values to be converted into integers.

lucventurini · 2019-10-25T16:08:34Z

Hi @swarbred , @gemygk , solved the bug for the boolean values being converted into integers in pick (which caused the strange behaviour observed by @swarbred earlier).

swarbred · 2019-10-25T16:31:26Z

@lucventurini I'm losing track of versions, if we are going to test everything works we should pull 5d9794a ? I had a go at installing using singularity but am getting an error @gemygk can educate me how to do this on Monday :-)

lucventurini · 2019-10-25T16:35:19Z

Hi @swarbred,
yes, let's put everything back into master. I will create the pull request now and delete the branch. If it does not function, I will re-create another one :-)

Sorry for all these commits. Basically when I was about at rc4 various people noticed how Mikado was slow on certain datasets, and that pushed me to do many changes under the hood to make it work with big datasets of millions of transcripts. It was worth it but I fear it pushed the software back into a beta state ...

lucventurini · 2019-10-25T17:07:00Z

Hi @swarbred, the latest commits to master (160a3a3 and 10f0c51) have ported all the bug-fixes etc. back into the main trunk.

Let us use this as the base for the time being.

Add fixes for #240

swarbred · 2019-10-30T10:15:46Z

@lucventurini
cc @gemygk

Using the test region at /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET

running version mikado-2.0rc6_89eca18_CBG
Output = /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado_pick_check_references_false

we see models in the output that are identical to other outputted models
e.g. MYZPE13164_run2_MYZPE13164_r2_0000950.1 and MYZPE13164_run3_MYZPE13164_r3_0000980.1 but there are multiple occurrences

I've just tested this with mikado-2.0_rc1
output dir = /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado-2.0_rc1_mikado_pick_check_references_false

and this is correct

running with mikado-2.0rc6_a9ef516
output dir = /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado-2.0rc6_a9ef516_mikado_pick_check_references_false

and mikado errors (with an unrelated issue)

mikado/2.0rc6_a9ef516 is sourced from /ei/software/cb location
Usage:
  mikado --help
2019-10-30 09:59:00,391 - main - __init__.py:120 - ERROR - main - MainProcess - Mikado crashed, cause:
2019-10-30 09:59:00,392 - main - __init__.py:121 - ERROR - main - MainProcess - Invalid value for has_start_codon: <class 'int'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/__init__.py", line 106, in main
    args.func(args)
  File "/usr/local/lib/python3.7/site-packages/Mikado/subprograms/pick.py", line 205, in pick
    creator()
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1124, in __call__
    self._parse_and_submit_input()
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1092, in _parse_and_submit_input
    self.__submit_single_threaded()
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1020, in __submit_single_threaded
    gene_counter, curr_chrom, locus_printer, submit_locus)
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1049, in __check_transcript
    current_transcript.finalize()
  File "/usr/local/lib/python3.7/site-packages/Mikado/transcripts/transcript.py", line 1244, in finalize
    finalize(self)
  File "/usr/local/lib/python3.7/site-packages/Mikado/transcripts/transcript_methods/finalizing.py", line 780, in finalize
    setattr(transcript, prop, val)
  File "/usr/local/lib/python3.7/site-packages/Mikado/transcripts/transcript.py", line 2831, in has_start_codon
    "Invalid value for has_start_codon: {0}".format(type(value)))
TypeError: Invalid value for has_start_codon: <class 'int'>
Command exited with non-zero status 1
	Command being timed: "mikado pick -v --mode nosplit --seed 10 --only-reference-update --procs 1 --json-conf mikado.configuration.run5-test_check_references_false.yaml --subloci-out mikado.subloci.gff3 -od mikado-2.0rc6_a9ef516_mikado_pick_check_references_false"
	User time (seconds): 3.26
	System time (seconds): 2.42
	Percent of CPU this job got: 74%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.68
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 141516
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 326
	Minor (reclaiming a frame) page faults: 123399
	Voluntary context switches: 1560
	Involuntary context switches: 337
	Swaps: 0
	File system inputs: 114664
	File system outputs: 128
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 1

lucventurini · 2019-10-30T10:33:31Z

Hi @swarbred, I thought I had solved that specific, unrelated bug earlier last week .. I will have to double check. Please give me ~10 minutes.

lucventurini · 2019-10-30T11:01:01Z

Hi @swarbred, the bug is in the GTF file. If you inspect it you will see:

scaffold_1	mikado_all	mRNA	815344	829474	.	+	.	gene_id "mikado_all_mikado.scaffold_1G122.1.gene"; transcript_id "mikado_all_mikado.scaffold_1G122.1"; Name "mikado_all_mikado.scaffold_1G122.1"; alias "LIB1777.scallop_sca.gene.LIB1777.110.0.1"; primary "1"; canonical_number "1"; canonical_proportion "1"; canonical_junctions "1"; has_start_codon "1"; has_stop_codon "1";

That's wrong, those "has_start_codon", "has_stop_codon" and "is_reference" should be True (instead of 1) or False (instead of 0). That's due to a bug I already solved but that is affecting this file ...

In 5744533 I made the code resilient to that. In the meantime, I did correct the input GTF and reran with the same mikado version (a9ef516) and the job finished successfully.

See

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado-2.0rc6_a9ef516_mikado_pick_check_references_false

lucventurini · 2019-10-30T11:05:45Z

PS: the secondary problem - double models being retained after padding - is still there though.

lucventurini · 2019-10-30T16:56:26Z

@swarbred
cc @gemygk

f89b0f2 should have solved the problem. There were three separate issues:

1- Reference models were never marked as redundant - even when ending up with two identical reference models.
2- When doing the padding, in order to allow for UTR expansion, Mikado has to allow redundant models into the loci in the first place. However, when doing the padding, this should be reverted. This has now been fixed.
3- There was a bug in another part, which marked two identical models as "non-intersecting" ... hence leading to duplicated loci.

f89b0f2 fixes all of this, so that now in

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado-2.0rc6_a9ef516_mikado_pick_check_references_false

there are no duplicated models.

lucventurini · 2019-10-31T14:51:57Z

PPS: @swarbred , @gemygk

I have run both with and without the "check_references" flag on. Results are different, ie the checks are triggered as expected.

lucventurini · 2019-11-14T11:27:21Z

Hi @swarbred, @gemygk,
would either of you be able to certify that the changes are as we want after the last commit to master?

swarbred · 2019-11-14T14:48:44Z

Hi @lucventurini Yes we will take a look at this next week as like you we want to tie up the release.

On a separate note while I remember for ease of testing it would be handy to be able to give a list of comma separated scaffold:region..region via the command line to restrict pick to specified regions (easier than cutting the gtf down).

lucventurini · 2019-11-14T14:59:58Z

Hi @lucventurini Yes we will take a look at this next week as like you we want to tie up the release.

Great thanks

On a separate note while I remember for ease of testing it would be handy to be able to give a list of comma separated scaffold:region..region via the command line to restrict pick to specified regions (easier than cutting the gtf down).

I will have a look at this, it should not be too difficult to do.

…ovide target regions for analysis. This should speed up testing.

lucventurini · 2019-11-15T16:10:23Z

Hi @swarbred , @gemygk , commit be2a884 introduces the option you were asking for.

Specifically, using the -r/--regions flag in the command line, it will be possible to feed mikado pick a region/list of regions that have to be considered for the analysis.

The regions should be in the form of chrom:start..end, ie the same format used by WebApollo (a touch that I hope you will find useful to go back and forth from the browser!).

Please let me know how you get along with this new function.

swarbred · 2019-11-20T11:04:46Z

Thanks @lucventurini will install this and test, if I'm going to download and install to test this and other changes should I use be2a884 or something else?

lucventurini · 2019-11-20T11:21:17Z

The commit from yesterday (0723988) has cleared up a bit the code for the configuration files and added a new simplified configuration format, TOML (https://github.com/toml-lang/toml; see #239). I would suggest using the latest release but be2a884 should be functionally equivalent.

Best

swarbred · 2019-11-20T14:14:49Z

Thanks @lucventurini

@gemygk installed as mikado-2.0rc6_0723988_CBG

lucventurini · 2019-11-20T14:21:44Z

Great. Please do not hate me, but this morning I have also finished working on #237, with ab11a75: the code in that branch now splits models with too-long introns as the default.

I would recommend testing that separately as the only thing in that branch that really has changed from the code you just installed is the corner case in mikado prepare. So that change can be eventually tested later, quickly, on a subset of stuff.

lucventurini · 2020-02-24T10:57:03Z

@swarbred

A quick notice to please check and update on this issue as well when testing the develop branch.

lucventurini · 2020-02-27T16:56:24Z

Closing for now as no further report of problems since changes in November 2019.

…STED.

…tch for both `pick` and `configure`.

* Fix EI-CoreBioinformatics#240, EI-CoreBioinformatics#243 * Solved a bug that caused boolean values to be converted into integers for `pick`.

Add fixes for EI-CoreBioinformatics#240

…ovide target regions for analysis. This should speed up testing.

swarbred added the enhancement label Oct 17, 2019

lucventurini self-assigned this Oct 17, 2019

lucventurini added this to the 2.0 milestone Oct 17, 2019

lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 17, 2019

This should provide a solution to EI-CoreBioinformatics#240. TO BE TE…

9ada270

…STED.

lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 17, 2019

EI-CoreBioinformatics#240: fix 9ada270 and provide a command line swi…

ec0a433

…tch for both `pick` and `configure`.

lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 17, 2019

EI-CoreBioinformatics#240: fix ec0a433

b1d0cf1

lucventurini mentioned this issue Oct 19, 2019

Issue 241 lucventurini/mikado#9

Merged

lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 25, 2019

EI-CoreBioinformatics#240, EI-CoreBioinformatics#243: solved the bug …

5d9794a

…that caused boolean values to be converted into integers.

lucventurini mentioned this issue Oct 25, 2019

Issue 240 #245

Merged

lucventurini closed this as completed in 160a3a3 Oct 25, 2019

lucventurini reopened this Oct 25, 2019

lucventurini added a commit that referenced this issue Oct 28, 2019

Issue 240 (#246)

b9d8ce3

Add fixes for #240

lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 30, 2019

EI-CoreBioinformatics#240: solving the double structure problem

f89b0f2

lucventurini added a commit to lucventurini/mikado that referenced this issue Nov 15, 2019

EI-CoreBioinformatics#240: added a -r option to mikado pick to pr…

be2a884

…ovide target regions for analysis. This should speed up testing.

lucventurini closed this as completed Feb 27, 2020

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

This should provide a solution to EI-CoreBioinformatics#240. TO BE TE…

9c707ae

…STED.

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

EI-CoreBioinformatics#240: fix 9c707ae and provide a command line swi…

45cf8ec

…tch for both `pick` and `configure`.

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

EI-CoreBioinformatics#240: fix 45cf8ec

6f66389

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Issue 240 (EI-CoreBioinformatics#245)

31a88d5

* Fix EI-CoreBioinformatics#240, EI-CoreBioinformatics#243 * Solved a bug that caused boolean values to be converted into integers for `pick`.

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

Issue 240 (EI-CoreBioinformatics#246)

d013410

Add fixes for EI-CoreBioinformatics#240

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

EI-CoreBioinformatics#240: solving the double structure problem

e200c51

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021

EI-CoreBioinformatics#240: added a -r option to mikado pick to pr…

eab4425

…ovide target regions for analysis. This should speed up testing.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add option to allow requirements and not_fragmentary checks to be applied to models marked as reference #240

Add option to allow requirements and not_fragmentary checks to be applied to models marked as reference #240

swarbred commented Oct 17, 2019

lucventurini commented Oct 17, 2019

lucventurini commented Oct 17, 2019

gemygk commented Oct 18, 2019

swarbred commented Oct 25, 2019

lucventurini commented Oct 25, 2019

lucventurini commented Oct 25, 2019

swarbred commented Oct 25, 2019

lucventurini commented Oct 25, 2019

lucventurini commented Oct 25, 2019

swarbred commented Oct 30, 2019 •

edited by lucventurini

Loading

lucventurini commented Oct 30, 2019

lucventurini commented Oct 30, 2019

lucventurini commented Oct 30, 2019

lucventurini commented Oct 30, 2019

lucventurini commented Oct 31, 2019

lucventurini commented Nov 14, 2019

swarbred commented Nov 14, 2019

lucventurini commented Nov 14, 2019 •

edited

Loading

lucventurini commented Nov 15, 2019

swarbred commented Nov 20, 2019

lucventurini commented Nov 20, 2019

swarbred commented Nov 20, 2019

lucventurini commented Nov 20, 2019 •

edited

Loading

lucventurini commented Feb 24, 2020

lucventurini commented Feb 27, 2020

Add option to allow requirements and not_fragmentary checks to be applied to models marked as reference #240

Add option to allow requirements and not_fragmentary checks to be applied to models marked as reference #240

Comments

swarbred commented Oct 17, 2019

lucventurini commented Oct 17, 2019

lucventurini commented Oct 17, 2019

gemygk commented Oct 18, 2019

swarbred commented Oct 25, 2019

lucventurini commented Oct 25, 2019

lucventurini commented Oct 25, 2019

swarbred commented Oct 25, 2019

lucventurini commented Oct 25, 2019

lucventurini commented Oct 25, 2019

swarbred commented Oct 30, 2019 • edited by lucventurini Loading

lucventurini commented Oct 30, 2019

lucventurini commented Oct 30, 2019

lucventurini commented Oct 30, 2019

lucventurini commented Oct 30, 2019

lucventurini commented Oct 31, 2019

lucventurini commented Nov 14, 2019

swarbred commented Nov 14, 2019

lucventurini commented Nov 14, 2019 • edited Loading

lucventurini commented Nov 15, 2019

swarbred commented Nov 20, 2019

lucventurini commented Nov 20, 2019

swarbred commented Nov 20, 2019

lucventurini commented Nov 20, 2019 • edited Loading

lucventurini commented Feb 24, 2020

lucventurini commented Feb 27, 2020

swarbred commented Oct 30, 2019 •

edited by lucventurini

Loading

lucventurini commented Nov 14, 2019 •

edited

Loading

lucventurini commented Nov 20, 2019 •

edited

Loading