Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to allow requirements and not_fragmentary checks to be applied to models marked as reference #240

Closed
swarbred opened this issue Oct 17, 2019 · 25 comments
Assignees
Milestone

Comments

@swarbred
Copy link
Collaborator

Hi @lucventurini
In GMC we currently make use of the --only-reference-update flag as we have labels (sets of models) marked as reference (e.g. multiple sets of augustus models + subset of mikado models) but also other sets e.g. say the full mikado output that we only include for the purposes of adding splice variants (so marked as not reference). To avoid these models being selected as a primary models we give them a low base score and run with the --only-reference-update flag so loci based on these transcripts alone are excluded.

This works but it means that the requirements and not_fragmentary checks are not applied to the models marked as reference. Most of the time this is the desired behaviour but in our case I would like these checks to be applied to the reference models.

I wondered if we could add an option to override the default behaviour and allow requirements and not_fragmentary checks to be applied to reference models?

@lucventurini
Copy link
Collaborator

Hi @swarbred, yes, I think it is feasible.
I should have the fix soon.

@lucventurini lucventurini self-assigned this Oct 17, 2019
@lucventurini lucventurini added this to the 2.0 milestone Oct 17, 2019
lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 17, 2019
lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 17, 2019
lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 17, 2019
@lucventurini
Copy link
Collaborator

Hi @swarbred,
I should have done this in b1d0cf1. Contextually, I have also added the relevant option to both pick and configure.

I have not tested this as creating the right dataset would take quite a bit of time. If you could have a go with it, and verify it functions, it would be great.

@gemygk
Copy link
Collaborator

gemygk commented Oct 18, 2019

Thanks @lucventurini

@swarbred, I have now installed mikado-2.0rc6_b1d0cf1_CBG to our cluster for testing.

singularity exec /ei/software/cb/mikado/2.0rc6_b1d0cf1_CBG/x86_64/mikado-2.0rc6_b1d0cf1_CBG.img show_commit_hash
b1d0cf10f926d299dcd9c8731cba1480353408bf

@swarbred
Copy link
Collaborator Author

@lucventurini

Have just tested this with version 89eca18 and this looks to have an issue (or i'm not understanding how it should work)

My test data set is at

/tgac/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET

I ran two picks the only difference being check_references being set as either true or false

mikado pick -v --mode nosplit --seed 10 --only-reference-update --procs 1 --json-conf mikado.configuration.run5-test_check_references_true.yaml --subloci-out mikado.subloci.gff3 -od mikado_pick_check_references_true

mikado pick -v --mode nosplit --seed 10 --only-reference-update --procs 1 --json-conf mikado.configuration.run5-test_check_references_false.yaml --subloci-out mikado.subloci.gff3 -od mikado_pick_check_references_false

The false run outputs models in the mikado.loci.gff3 file the true version has no models.

So I was expecting the true run to apply the requirements to models marked as reference, some filtering should occur but not all models should be removed. Looking at the pick log and it looks like models were removed as the gene had no reference transcripts but the model below is marked as a reference (and was output in the "false" run). So perhaps check_references is not working as expected when set to true.

2019-10-25 12:01:52,094 - scaffold_1:815344-829529 - abstractlocus.py:1151 - DEBUG - _check_not_passing - MainProcess - Transcript MYZPE13164_run2_MYZPE13164_r2_0000950.1 (source MYZPE13164_run2) is not a reference transcript (references: ['Agly_coding', 'Apis_coding', 'BtabMEAM1_coding', 'BtabMED_coding', 'Dnox_coding', 'Mcer_coding', 'Mikado_All_gold', 'Mikado_PacBio_gold', 'MperG006_coding', 'MYZPE13164_run1', 'MYZPE13164_run2', 'MYZPE13164_run3', 'Nlug_coding', 'Pnig_coding', 'Rmai_coding', 'Rpad_coding', 'Sfuc_coding']; in it: True) 2019-10-25 12:01:52,097 - scaffold_1:815344-829529 - locus.py:195 - INFO - finalize_alternative_splicing - MainProcess - MYZPE13164_run2_MYZPE13164_r2_0000950.1 is now padded 2019-10-25 12:01:52,097 - scaffold_1:815344-829529 - locus.py:199 - DEBUG - finalize_alternative_splicing - MainProcess - gene:scaffold_1+:823385-829529 has 2 transcripts (MYZPE13164_run3_MYZPE13164_r3_0000980.1, MYZPE13164_run2_MYZPE13164_r2_0000950.1) 2019-10-25 12:01:52,099 - scaffold_1:815344-829529 - superlocus.py:1222 - DEBUG - define_loci - MainProcess - Removing gene:scaffold_1+:823385-829529 (primary: MYZPE13164_run2_MYZPE13164_r2_0000950.1) as it has no reference transcripts.

@lucventurini
Copy link
Collaborator

Hi @swarbred,
no, that's a different bug. I have no idea of why, but Mikado has started (only in the single-threaded mode) to use the "True/False" values as "1/0". This has ... created all sorts of havoc. Including the reference check.

I was planning to investigate that.

In the meantime, da0265b contains both the fix for the reference check problem and should fix the other issues with the alias.

I am really sorry about all of this - I did some changes that should have been transparent to the backbone, but apparently are not.

lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 25, 2019
…that caused boolean values to be converted into integers.
@lucventurini
Copy link
Collaborator

Hi @swarbred , @gemygk , solved the bug for the boolean values being converted into integers in pick (which caused the strange behaviour observed by @swarbred earlier).

@swarbred
Copy link
Collaborator Author

@lucventurini I'm losing track of versions, if we are going to test everything works we should pull 5d9794a ? I had a go at installing using singularity but am getting an error @gemygk can educate me how to do this on Monday :-)

@lucventurini
Copy link
Collaborator

Hi @swarbred,
yes, let's put everything back into master. I will create the pull request now and delete the branch. If it does not function, I will re-create another one :-)

Sorry for all these commits. Basically when I was about at rc4 various people noticed how Mikado was slow on certain datasets, and that pushed me to do many changes under the hood to make it work with big datasets of millions of transcripts. It was worth it but I fear it pushed the software back into a beta state ...

@lucventurini
Copy link
Collaborator

Hi @swarbred, the latest commits to master (160a3a3 and 10f0c51) have ported all the bug-fixes etc. back into the main trunk.

Let us use this as the base for the time being.

lucventurini added a commit that referenced this issue Oct 28, 2019
Add fixes for #240
@swarbred
Copy link
Collaborator Author

swarbred commented Oct 30, 2019

@lucventurini
cc @gemygk

Using the test region at /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET

running version mikado-2.0rc6_89eca18_CBG
Output = /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado_pick_check_references_false

we see models in the output that are identical to other outputted models
e.g. MYZPE13164_run2_MYZPE13164_r2_0000950.1 and MYZPE13164_run3_MYZPE13164_r3_0000980.1 but there are multiple occurrences

I've just tested this with mikado-2.0_rc1
output dir = /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado-2.0_rc1_mikado_pick_check_references_false

and this is correct

running with mikado-2.0rc6_a9ef516
output dir = /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado-2.0rc6_a9ef516_mikado_pick_check_references_false

and mikado errors (with an unrelated issue)

mikado/2.0rc6_a9ef516 is sourced from /ei/software/cb location
Usage:
  mikado --help
2019-10-30 09:59:00,391 - main - __init__.py:120 - ERROR - main - MainProcess - Mikado crashed, cause:
2019-10-30 09:59:00,392 - main - __init__.py:121 - ERROR - main - MainProcess - Invalid value for has_start_codon: <class 'int'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/Mikado/__init__.py", line 106, in main
    args.func(args)
  File "/usr/local/lib/python3.7/site-packages/Mikado/subprograms/pick.py", line 205, in pick
    creator()
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1124, in __call__
    self._parse_and_submit_input()
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1092, in _parse_and_submit_input
    self.__submit_single_threaded()
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1020, in __submit_single_threaded
    gene_counter, curr_chrom, locus_printer, submit_locus)
  File "/usr/local/lib/python3.7/site-packages/Mikado/picking/picker.py", line 1049, in __check_transcript
    current_transcript.finalize()
  File "/usr/local/lib/python3.7/site-packages/Mikado/transcripts/transcript.py", line 1244, in finalize
    finalize(self)
  File "/usr/local/lib/python3.7/site-packages/Mikado/transcripts/transcript_methods/finalizing.py", line 780, in finalize
    setattr(transcript, prop, val)
  File "/usr/local/lib/python3.7/site-packages/Mikado/transcripts/transcript.py", line 2831, in has_start_codon
    "Invalid value for has_start_codon: {0}".format(type(value)))
TypeError: Invalid value for has_start_codon: <class 'int'>
Command exited with non-zero status 1
	Command being timed: "mikado pick -v --mode nosplit --seed 10 --only-reference-update --procs 1 --json-conf mikado.configuration.run5-test_check_references_false.yaml --subloci-out mikado.subloci.gff3 -od mikado-2.0rc6_a9ef516_mikado_pick_check_references_false"
	User time (seconds): 3.26
	System time (seconds): 2.42
	Percent of CPU this job got: 74%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:07.68
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 141516
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 326
	Minor (reclaiming a frame) page faults: 123399
	Voluntary context switches: 1560
	Involuntary context switches: 337
	Swaps: 0
	File system inputs: 114664
	File system outputs: 128
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 1

@lucventurini
Copy link
Collaborator

Hi @swarbred, I thought I had solved that specific, unrelated bug earlier last week .. I will have to double check. Please give me ~10 minutes.

@lucventurini
Copy link
Collaborator

Hi @swarbred, the bug is in the GTF file. If you inspect it you will see:

scaffold_1	mikado_all	mRNA	815344	829474	.	+	.	gene_id "mikado_all_mikado.scaffold_1G122.1.gene"; transcript_id "mikado_all_mikado.scaffold_1G122.1"; Name "mikado_all_mikado.scaffold_1G122.1"; alias "LIB1777.scallop_sca.gene.LIB1777.110.0.1"; primary "1"; canonical_number "1"; canonical_proportion "1"; canonical_junctions "1"; has_start_codon "1"; has_stop_codon "1";

That's wrong, those "has_start_codon", "has_stop_codon" and "is_reference" should be True (instead of 1) or False (instead of 0). That's due to a bug I already solved but that is affecting this file ...

In 5744533 I made the code resilient to that. In the meantime, I did correct the input GTF and reran with the same mikado version (a9ef516) and the job finished successfully.

See

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado-2.0rc6_a9ef516_mikado_pick_check_references_false

@lucventurini
Copy link
Collaborator

PS: the secondary problem - double models being retained after padding - is still there though.

lucventurini added a commit to lucventurini/mikado that referenced this issue Oct 30, 2019
@lucventurini
Copy link
Collaborator

@swarbred
cc @gemygk

f89b0f2 should have solved the problem. There were three separate issues:

1- Reference models were never marked as redundant - even when ending up with two identical reference models.
2- When doing the padding, in order to allow for UTR expansion, Mikado has to allow redundant models into the loci in the first place. However, when doing the padding, this should be reverted. This has now been fixed.
3- There was a bug in another part, which marked two identical models as "non-intersecting" ... hence leading to duplicated loci.

f89b0f2 fixes all of this, so that now in

/ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Analysis/mikado-2.0rc4/annotation_run2/mikado-2.0rc6_89eca18_CBG_TEST_SET/mikado-2.0rc6_a9ef516_mikado_pick_check_references_false

there are no duplicated models.

@lucventurini
Copy link
Collaborator

PPS: @swarbred , @gemygk

I have run both with and without the "check_references" flag on. Results are different, ie the checks are triggered as expected.

@lucventurini
Copy link
Collaborator

Hi @swarbred, @gemygk,
would either of you be able to certify that the changes are as we want after the last commit to master?

@swarbred
Copy link
Collaborator Author

Hi @lucventurini Yes we will take a look at this next week as like you we want to tie up the release.

On a separate note while I remember for ease of testing it would be handy to be able to give a list of comma separated scaffold:region..region via the command line to restrict pick to specified regions (easier than cutting the gtf down).

@lucventurini
Copy link
Collaborator

lucventurini commented Nov 14, 2019

Hi @lucventurini Yes we will take a look at this next week as like you we want to tie up the release.

Great thanks

On a separate note while I remember for ease of testing it would be handy to be able to give a list of comma separated scaffold:region..region via the command line to restrict pick to specified regions (easier than cutting the gtf down).

I will have a look at this, it should not be too difficult to do.

lucventurini added a commit to lucventurini/mikado that referenced this issue Nov 15, 2019
…ovide target regions for analysis. This should speed up testing.
@lucventurini
Copy link
Collaborator

Hi @swarbred , @gemygk , commit be2a884 introduces the option you were asking for.

Specifically, using the -r/--regions flag in the command line, it will be possible to feed mikado pick a region/list of regions that have to be considered for the analysis.

The regions should be in the form of chrom:start..end, ie the same format used by WebApollo (a touch that I hope you will find useful to go back and forth from the browser!).

Please let me know how you get along with this new function.

@swarbred
Copy link
Collaborator Author

Thanks @lucventurini will install this and test, if I'm going to download and install to test this and other changes should I use be2a884 or something else?

@lucventurini
Copy link
Collaborator

The commit from yesterday (0723988) has cleared up a bit the code for the configuration files and added a new simplified configuration format, TOML (https://github.com/toml-lang/toml; see #239). I would suggest using the latest release but be2a884 should be functionally equivalent.

Best

@swarbred
Copy link
Collaborator Author

Thanks @lucventurini

@gemygk installed as mikado-2.0rc6_0723988_CBG

@lucventurini
Copy link
Collaborator

lucventurini commented Nov 20, 2019

Great. Please do not hate me, but this morning I have also finished working on #237, with ab11a75: the code in that branch now splits models with too-long introns as the default.

I would recommend testing that separately as the only thing in that branch that really has changed from the code you just installed is the corner case in mikado prepare. So that change can be eventually tested later, quickly, on a subset of stuff.

@lucventurini
Copy link
Collaborator

@swarbred

A quick notice to please check and update on this issue as well when testing the develop branch.

@lucventurini
Copy link
Collaborator

Closing for now as no further report of problems since changes in November 2019.

lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
* Fix EI-CoreBioinformatics#240, EI-CoreBioinformatics#243
* Solved a bug that caused boolean values to be converted into integers for `pick`.
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
lucventurini added a commit to lucventurini/mikado that referenced this issue Feb 11, 2021
…ovide target regions for analysis. This should speed up testing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants