mikado util convert - additional gene lines #384

Closed
swarbred opened this issue Mar 3, 2021 · 4 comments · Fixed by #383
@swarbred
Collaborator

swarbred commented Mar 3, 2021

Run with the test file reat_xspecies_wport_wex_wutr.loci.regionA.gtf, which is part of the minos input test set (it comes from reat homology, so it is a Mikado output file).

The file contains a gene (mikado.Chr3G297) with two transcripts that are not adjacent in the file. As a result, mikado util convert generates an additional, invalid gene line:

[swarbred@EI-HPC interactive Models]$ mikado util convert -if gtf -of gff3 reat_xspecies_wport_wex_wutr.loci.regionA.gtf reat_xspecies_wport_wex_wutr.loci.regionA.gff3
[swarbred@EI-HPC interactive Models]$ grep mikado.Chr3G297 reat_xspecies_wport_wex_wutr.loci.regionA.gff3
Chr3	Mikado_xspecies_loci	gene	1123361	1140723	.	+	.	ID=mikado.Chr3G297
Chr3	Mikado_xspecies_loci	mRNA	1123361	1140723	31.36	+	.	ID=mikado.Chr3G297.2;Parent=mikado.Chr3G297;Name=mikado.Chr3G297.2;alias=Ppersica_Prupe.8G212600.1.v2.1.m1;avg_ef1=27;avg_jf1=33;canonical_junctions=1,2,3;canonical_number=3;canonical_proportion=1;ccode=j;cds_ccode=j;cds_exon_f1=33;cds_id=46;cds_junction_f1=50;has_start_codon=True;has_stop_codon=True;is_reference=True;max_ef1=100;max_jf1=100;min_ef1=0;min_jf1=0;multiexonic=True;note=Prupe.8G212600.1.v2.1|cov:100.00|id:59.30|cds_id:46.38|cds_exon_f1:33.33|cds_junction_f1:50.00|cds_ccode:j|min_ef1:0.00|max_ef1:100.00|avg_ef1:27.50|min_jf1:0.00|max_jf1:100.00|avg_jf1:33.33|xscore:10.00;primary=False;retained_intron=False;superlocus=Mikado_xspecies_superlocus:Chr3+:1123361-1140723;xscore=10
Chr3	Mikado_xspecies_loci	CDS	1123361	1123379	.	+	0	ID=mikado.Chr3G297.2.CDS1;Parent=mikado.Chr3G297.2
Chr3	Mikado_xspecies_loci	exon	1123361	1123379	.	+	.	ID=mikado.Chr3G297.2.exon1;Parent=mikado.Chr3G297.2
Chr3	Mikado_xspecies_loci	CDS	1124237	1124367	.	+	2	ID=mikado.Chr3G297.2.CDS2;Parent=mikado.Chr3G297.2
Chr3	Mikado_xspecies_loci	exon	1124237	1124367	.	+	.	ID=mikado.Chr3G297.2.exon2;Parent=mikado.Chr3G297.2
Chr3	Mikado_xspecies_loci	CDS	1124995	1125035	.	+	0	ID=mikado.Chr3G297.2.CDS3;Parent=mikado.Chr3G297.2
Chr3	Mikado_xspecies_loci	exon	1124995	1125035	.	+	.	ID=mikado.Chr3G297.2.exon3;Parent=mikado.Chr3G297.2
Chr3	Mikado_xspecies_loci	CDS	1140504	1140723	.	+	1	ID=mikado.Chr3G297.2.CDS4;Parent=mikado.Chr3G297.2
Chr3	Mikado_xspecies_loci	exon	1140504	1140723	.	+	.	ID=mikado.Chr3G297.2.exon4;Parent=mikado.Chr3G297.2
Chr3	Mikado_xspecies_loci	gene	1140318	1140723	.	+	.	ID=mikado.Chr3G297
Chr3	Mikado_xspecies_loci	mRNA	1140318	1140723	43.03	+	.	ID=mikado.Chr3G297.1;Parent=mikado.Chr3G297;Name=mikado.Chr3G297.1;alias=Crubella_Carub.0003s0351.1.v1.1.m1;avg_ef1=72;avg_jf1=73;canonical_junctions=1;canonical_number=1;canonical_proportion=1;cds_exon_f1=100;cds_id=96;cds_junction_f1=100;has_start_codon=True;has_stop_codon=True;is_reference=True;max_ef1=100;max_jf1=100;min_ef1=0;min_jf1=0;multiexonic=True;note=carub.0003s0351.1.v1.1|cov:100.00|id:96.90|cds_id:96.91|cds_exon_f1:100.00|cds_junction_f1:100.00|cds_ccode:=|min_ef1:0.00|max_ef1:100.00|avg_ef1:72.50|min_jf1:0.00|max_jf1:100.00|avg_jf1:73.33|xscore:70.00;primary=True;superlocus=Mikado_xspecies_superlocus:Chr3+:1123361-1140723;xscore=70
Chr3	Mikado_xspecies_loci	CDS	1140318	1140388	.	+	0	ID=mikado.Chr3G297.1.CDS1;Parent=mikado.Chr3G297.1
Chr3	Mikado_xspecies_loci	exon	1140318	1140388	.	+	.	ID=mikado.Chr3G297.1.exon1;Parent=mikado.Chr3G297.1
Chr3	Mikado_xspecies_loci	CDS	1140504	1140723	.	+	1	ID=mikado.Chr3G297.1.CDS2;Parent=mikado.Chr3G297.1
Chr3	Mikado_xspecies_loci	exon	1140504	1140723	.	+	.	ID=mikado.Chr3G297.1.exon2;Parent=mikado.Chr3G297.1
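
A quick way to check a converted file for this problem is to count how many `gene` lines share the same ID. The snippet below is only a minimal sketch, not part of Mikado: the function name `duplicated_gene_ids` is made up for illustration, and the example lines are the gene and mRNA lines from the output above with the mRNA attributes shortened.

```python
from collections import Counter


def duplicated_gene_ids(gff3_lines):
    """Return the IDs that appear on more than one 'gene' line."""
    counts = Counter()
    for line in gff3_lines:
        # Skip blank lines, comments and directives.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        # GFF3 attributes are ';'-separated key=value pairs.
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        if "ID" in attrs:
            counts[attrs["ID"]] += 1
    return sorted(gene_id for gene_id, n in counts.items() if n > 1)


# Gene and mRNA lines from the output above (mRNA attributes shortened).
example = [
    "Chr3\tMikado_xspecies_loci\tgene\t1123361\t1140723\t.\t+\t.\tID=mikado.Chr3G297",
    "Chr3\tMikado_xspecies_loci\tmRNA\t1123361\t1140723\t31.36\t+\t.\tID=mikado.Chr3G297.2;Parent=mikado.Chr3G297",
    "Chr3\tMikado_xspecies_loci\tgene\t1140318\t1140723\t.\t+\t.\tID=mikado.Chr3G297",
    "Chr3\tMikado_xspecies_loci\tmRNA\t1140318\t1140723\t43.03\t+\t.\tID=mikado.Chr3G297.1;Parent=mikado.Chr3G297",
]
print(duplicated_gene_ids(example))  # ['mikado.Chr3G297']
```

An empty list means every gene ID has a single gene line.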
@lucventurini
Collaborator

This is because mikado util convert presumes that the file is already sorted. I can amend this, but the memory requirements will go up.

I will probably add a flag, such as -as/--assume-sorted, to avoid holding everything in memory.
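
The trade-off between streaming and buffering can be illustrated with a small sketch. This is not Mikado's code: `group_by_gene`, its `assume_sorted` parameter and the `other.gene` record are hypothetical, and records are reduced to `(gene_id, transcript_id)` pairs. With `assume_sorted=True` only one gene is held at a time, but a gene whose transcripts are not adjacent is emitted twice; with `assume_sorted=False` each gene is emitted once, at the cost of keeping all records in memory.

```python
from itertools import groupby


def group_by_gene(transcripts, assume_sorted=True):
    """Yield (gene_id, [transcript_id, ...]) from (gene_id, transcript_id) pairs."""
    if assume_sorted:
        # Streaming: groupby only merges *consecutive* records, so a gene that
        # is split across the file comes out once per run of records.
        for gene_id, run in groupby(transcripts, key=lambda rec: rec[0]):
            yield gene_id, [tid for _, tid in run]
    else:
        # Buffered: everything is held until the input ends, so each gene is
        # emitted exactly once whatever the input order.
        genes = {}
        for gene_id, tid in transcripts:
            genes.setdefault(gene_id, []).append(tid)
        yield from genes.items()


# "other.gene" is an invented record standing in for whatever separates the
# two transcripts in the real file.
records = [
    ("mikado.Chr3G297", "mikado.Chr3G297.2"),
    ("other.gene", "other.gene.1"),
    ("mikado.Chr3G297", "mikado.Chr3G297.1"),
]
streamed = list(group_by_gene(records, assume_sorted=True))
buffered = list(group_by_gene(records, assume_sorted=False))
print(len(streamed), len(buffered))  # 3 2
```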

@swarbred
Collaborator Author

swarbred commented Mar 3, 2021

The file is coordinate-sorted (as gffread does) but not sorted by gene (i.e. as gt gff --sort does). Yes, the -as option seems useful.

@lucventurini
Collaborator

Coordinate-sorted, as in transcript lines might come after exon lines? My parsers won't particularly like that, but I can see what to do.
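
To make the concern concrete: in a pure chrom/start/end sort, the first exon of mikado.Chr3G297.1 (1140318-1140388) would sort before its own mRNA line (1140318-1140723). One way a parser could cope is to park exons whose parent has not been seen yet. The sketch below only illustrates that idea; `attach_exons` and the tuple layout are hypothetical and are not how the Mikado parsers are written.

```python
def attach_exons(features):
    """Map transcript ID -> sorted exon (start, end) list, whatever the line order.

    `features` is an iterable of (feature_type, feature_id, parent_id, start, end).
    """
    transcripts = {}
    pending = {}  # exons seen before their transcript line
    for ftype, fid, parent, start, end in features:
        if ftype == "mRNA":
            # Adopt any exons that arrived before this transcript line.
            transcripts[fid] = pending.pop(fid, [])
        elif ftype == "exon":
            if parent in transcripts:
                transcripts[parent].append((start, end))
            else:
                pending.setdefault(parent, []).append((start, end))
    if pending:
        raise ValueError(f"exons without a transcript line: {sorted(pending)}")
    return {tid: sorted(exons) for tid, exons in transcripts.items()}


# Coordinates taken from mikado.Chr3G297.1, ordered by (start, end).
features = [
    ("exon", "mikado.Chr3G297.1.exon1", "mikado.Chr3G297.1", 1140318, 1140388),
    ("mRNA", "mikado.Chr3G297.1", "mikado.Chr3G297", 1140318, 1140723),
    ("exon", "mikado.Chr3G297.1.exon2", "mikado.Chr3G297.1", 1140504, 1140723),
]
print(attach_exons(features))
```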

lucventurini added a commit that referenced this issue Mar 3, 2021
lucventurini self-assigned this Mar 3, 2021
@lucventurini
Collaborator

lucventurini commented Mar 3, 2021

Hi @swarbred

b071298 will fix this. It can also deal with files sorted purely by chrom, start, end.

Best,

lucventurini added a commit that referenced this issue Mar 3, 2021
lucventurini linked a pull request Mar 13, 2021 that will close this issue
lucventurini added a commit that referenced this issue Mar 15, 2021
# Version 2.2.0
Removed Cython from the requirements.txt file. This allows the tests to run correctly in a Conda environment (as Conda disallows installing Cython as part of a distributed package).
As a result of this change, the preferred installation procedure from source has to be slightly amended:
- either install using `pip wheel -w dist . && pip install dist/Mikado*whl`
- or install with `python setup.py bdist_wheel` **after** having forcibly installed Cython, with `pip install Cython` or the like.

Other changes:
- Fix [#381](#381): Mikado can now correctly guess 
  the input file format, instead of relying on the file name extension or the user's settings. Sniffing of files 
  provided as a stream is *disabled*, though.
- Fix [#382](#382): now Mikado can accept generic BED12 files 
  as input junctions, not just Portcullis junctions. This allows e.g. a user to provide a ***set of gene models*** 
  in BED12 format as sources of valid junctions.
- Fix [#384](#384): now Mikado convert deals properly with 
  unsorted GTFs/GFFs. 
- Fix [#386](#386): dealing better with unsorted GFFs/GTFs for 
  the stats utility.
- Fix [#387](#387): Mikado now always uses a static seed unless specifically 
  instructed otherwise, rather than generating a new one per call. The old behaviour can still be 
  replicated by either setting the `seed` parameter to `null` (i.e. `None`) in the configuration file, or by 
  specifying `--random-seed` during the command invocation.
- General increase in code unit-test coverage; in particular:  
  - Slightly increased the unit-test coverage for the locus classes, e.g. properly covering the `as_dict` and `load_dict`
    methods. Minor bugfixes related to the introduction of these unit-tests.
- `Mikado.parsers.to_gff` has been renamed to `Mikado.parsers.parser_factory`.
- The code related to the transcript padding has been moved to the submodule `Mikado.transcripts.pad`, rather than 
  being part of the `Mikado.loci.locus` submodule.
- Mikado will error informatively if the scoring configuration file is malformed.
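
The seed behaviour described for #387 can be pictured roughly as follows. This is a sketch of the described logic only, not Mikado's implementation: `resolve_seed`, its parameters and the `DEFAULT_SEED` value are placeholders chosen for illustration.

```python
import random

DEFAULT_SEED = 0  # placeholder value, not necessarily Mikado's actual default


def resolve_seed(configured_seed=DEFAULT_SEED, random_seed=False):
    """Return the seed to use for a run.

    A fresh seed is drawn only when explicitly requested (random_seed=True,
    mirroring `--random-seed`) or when the configured seed is None (`null` in
    the configuration file); otherwise the static value is reused on every call.
    """
    if random_seed or configured_seed is None:
        return random.SystemRandom().randint(0, 2**32 - 1)
    return configured_seed


# Repeated calls with the defaults give the same seed.
print(resolve_seed() == resolve_seed())  # True
```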