diff --git a/CHANGELOG.md b/CHANGELOG.md
index e96280d31..1daa44718 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,4 +1,4 @@
-# Version 1.2.5
+# Version 1.3
 
 One of the major highlights of this release is the completion of the "padding" functionality. Briefly, if instructed to do so, now Mikado will be able to uniform the ends of transcripts within a single locus (similar to what was done for the last _Arabidopsis thaliana_ annotation release).
 
@@ -12,6 +12,8 @@ Bugfixes and improvements:
 - Fixed [#127](https://github.com/lucventurini/mikado/issues/127): previously, Mikado _prepare_ only considered cDNA coordinates when determining the redundancy of two models. In some edge cases, two models could be identical but have a different ORF called. Now Mikado will also consider the CDS before deciding whether to discard a model as redundant.
 - [#129](https://github.com/lucventurini/mikado/issues/129): Mikado is now capable of correctly padding the transcripts so to uniform their ends in a single locus. This will also have the effect of trying to enlarge the ORF of a transcript if it is truncated to begin with.
 - [#130](https://github.com/lucventurini/mikado/issues/130): it is now possible to specify a different metric inside the "filter" section of scoring.
+- [#131](https://github.com/lucventurini/mikado/issues/131): in rare instances, Mikado could have missed loci if they were lost between the sublocus and monosublocus stages. Now Mikado implements a basic backtracking recursive algorithm that should ensure no locus is missed.
+- [#132](https://github.com/lucventurini/mikado/issues/132)
 
 # Version 1.2.4
 
diff --git a/Mikado/__init__.py b/Mikado/__init__.py
index 7d9cd4a67..2665e8c76 100755
--- a/Mikado/__init__.py
+++ b/Mikado/__init__.py
@@ -9,8 +9,8 @@
 __title__ = "Mikado"
 __author__ = 'Luca Venturini'
 __license__ = 'GPL3'
-__copyright__ = 'Copyright 2015-2019 Luca Venturini'
-__version__ = "1.2.5"
+__copyright__ = 'Copyright 2015-2020 Luca Venturini'
+__version__ = "1.3"
 
 __all__ = ["configuration",
            "exceptions",
diff --git a/Mikado/loci/superlocus.py b/Mikado/loci/superlocus.py
index ef12c41c3..dd9604902 100644
--- a/Mikado/loci/superlocus.py
+++ b/Mikado/loci/superlocus.py
@@ -1142,20 +1142,31 @@ def define_loci(self):
 
     def __find_lost_transcripts(self):
 
-        if self.loci_defined is True:
-            return
+        cds_only = self.json_conf["pick"]["clustering"]["cds_only"]
+        # simple_overlap = self.json_conf["pick"]["run_options"]["monoloci_from_simple_overlap"]
+        cdna_overlap = self.json_conf["pick"]["clustering"]["min_cdna_overlap"]
+        cds_overlap = self.json_conf["pick"]["clustering"]["min_cds_overlap"]
+
+        t_graph = self.define_graph(self.transcripts,
+                                    inters=MonosublocusHolder.is_intersecting,
+                                    cds_only=cds_only,
+                                    logger=self.logger,
+                                    min_cdna_overlap=cdna_overlap,
+                                    min_cds_overlap=cds_overlap,
+                                    simple_overlap_for_monoexonic=False)
 
-        loci_transcripts = itertools.chain(*[{self.loci[_].transcripts.keys()} for _ in self.loci])
+        loci_transcripts = set()
+        for locus in self.loci.values():
+            loci_transcripts.update(set([_ for _ in locus.transcripts.keys()]))
 
-        for tid in set.difference({self.transcripts.keys()}, loci_transcripts):
-            found = False
-            for lid in self.loci:
-                if MonosublocusHolder.in_locus(self.loci[lid], self.transcripts[tid]):
-                    found = True
-                    break
-                else:
-                    continue
-            if found is True:
+        not_loci_transcripts = set.difference({_ for _ in self.transcripts.keys()}, loci_transcripts)
+
+        if not not_loci_transcripts:
+            return
+
+        for tid in not_loci_transcripts:
+            neighbours = set(t_graph.neighbors(tid))
+            if set.intersection(neighbours, loci_transcripts):
                 continue
             else:
                 self.__lost.update({tid: self.transcripts[tid]})
diff --git a/docs/Algorithms.rst b/docs/Algorithms.rst
index 7658af79d..9435f37fb 100644
--- a/docs/Algorithms.rst
+++ b/docs/Algorithms.rst
@@ -219,6 +219,13 @@ For example, this is a snippet of a scoring section:
         end_distance_from_junction:
             filter: {operator: lt, value: 55}
             rescaling: min
+        non_verified_introns_num:
+            rescaling: max
+            multiplier: -10
+            filter:
+                operator: gt
+                value: 1
+                metric: exons_num
 
 Using this snippet as a guide, Mikado will score transcripts in each locus as follows:
 
@@ -228,6 +235,11 @@ Using this snippet as a guide, Mikado will score transcripts in each locus as fo
 * Assign a full score (**two points**, as a multiplier of 2 is specified) to transcripts that have a total amount of CDS bps approximating 80% of the transcript cDNA length (*combined_cds_fraction*)
 * Assign a full score (one point, as no multiplier is specified) to transcripts that have a 5' UTR whose length is nearest to 100 bps (*five_utr_length*); if the 5' UTR is longer than 2,500 bps, this score will be 0 (see the filter section)
 * Assign a full score (one point, as no multiplier is specified) to transcripts which have the lowest distance between the CDS end and the most downstream exon-exon junction (*end_distance_from_junction*). If such a distance is greater than 55 bps, assign a score of 0, as it is a probable target for NMD (see the filter section).
+* Assign a maximum penalty (**minus 10 points**, as a **negative** multiplier is specified) to the transcript with the highest number of non-verified introns in the locus.
+  * Again, we are using a "filter" section to define which transcripts will be exempted from this scoring (in this case, a penalty)
+  * However, please note that we are using the keyword **metric** in this section. This tells Mikado to check a *different* metric for evaluating the filter. Nominally, in this case we are excluding from the penalty any *monoexonic* transcript. This makes sense as a monoexonic transcript will never have an intron to be confirmed to start with.
+
+.. note:: The possibility of using different metrics for the "filter" section is present from Mikado 1.3 onwards.
 
 .. _Metrics:
 
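
Reviewer note on the `Mikado/loci/superlocus.py` hunk: the rewritten `__find_lost_transcripts` appears to reduce to plain set logic over an intersection graph. The sketch below is only a reading aid, not the Mikado implementation. The function name `find_lost_transcripts`, the plain-dict inputs, and the `neighbours` adjacency mapping (standing in for the graph returned by `define_graph`) are all assumptions made for illustration.

```python
def find_lost_transcripts(transcripts, loci, neighbours):
    """Return IDs of transcripts that are in no locus and intersect no locus member.

    All three arguments are hypothetical stand-ins for the real objects:
      transcripts: iterable of every transcript ID in the superlocus
      loci:        dict mapping locus ID -> iterable of member transcript IDs
      neighbours:  dict mapping transcript ID -> set of intersecting transcript IDs
                   (a simplified replacement for the transcript graph)
    """
    # Collect every transcript already assigned to some locus.
    in_loci = set()
    for members in loci.values():
        in_loci.update(members)

    # Candidates are the transcripts that no locus has claimed.
    not_in_loci = set(transcripts) - in_loci

    lost = set()
    for tid in not_in_loci:
        # A candidate that intersects a locus member is already covered by that locus.
        if neighbours.get(tid, set()) & in_loci:
            continue
        lost.add(tid)
    return lost


# Toy data: t2 intersects t1 (which is in a locus); t3 and t4 only intersect each other.
transcripts = ["t1", "t2", "t3", "t4"]
loci = {"locus1": ["t1"]}
neighbours = {"t1": {"t2"}, "t2": {"t1"}, "t3": {"t4"}, "t4": {"t3"}}
lost = find_lost_transcripts(transcripts, loci, neighbours)
print(sorted(lost))  # ['t3', 't4']
```

On the toy data, `t2` is skipped because it intersects a transcript that already belongs to a locus, while `t3` and `t4` are reported as lost. In the patch these would presumably be the transcripts handed back for another round of locus definition, as the changelog entry for #131 suggests.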