Improvement on #131: now we are using a graph-based function, rather than a for loop, to find the missing loci. This also ensures coherence in terms of the overlapping parameters.
EI-CoreBioinformatics · Oct 5, 2018 · a94578e
1 parent ffbff8b
Showing 4 changed files with 40 additions and 15 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,4 +1,4 @@
-# Version 1.2.5
+# Version 1.3
 
 One of the major highlights of this release is the completion of the "padding" functionality.
 Briefly, if instructed to do so, now Mikado will be able to uniform the ends of transcripts within a single locus (similar to what was done for the last _Arabidopsis thaliana_ annotation release).
@@ -12,6 +12,8 @@ Bugfixes and improvements:
 - Fixed [#127](https://github.com/lucventurini/mikado/issues/127): previously, Mikado _prepare_ only considered cDNA coordinates when determining the redundancy of two models. In some edge cases, two models could be identical but have a different ORF called. Now Mikado will also consider the CDS before deciding whether to discard a model as redundant.
 - [#129](https://github.com/lucventurini/mikado/issues/129): Mikado is now capable of correctly padding the transcripts so to uniform their ends in a single locus. This will also have the effect of trying to enlarge the ORF of a transcript if it is truncated to begin with.
 - [#130](https://github.com/lucventurini/mikado/issues/130): it is now possible to specify a different metric inside the "filter" section of scoring.
+- [#131](https://github.com/lucventurini/mikado/issues/131): in rare instances, Mikado could have missed loci if they were lost between the sublocus and monosublocus stages. Now Mikado implements a basic backtracking recursive algorithm that should ensure no locus is missed.
+- [#132](https://github.com/lucventurini/mikado/issues/132)
 
 # Version 1.2.4
 

diff --git a/Mikado/__init__.py b/Mikado/__init__.py
@@ -9,8 +9,8 @@
 __title__ = "Mikado"
 __author__ = 'Luca Venturini'
 __license__ = 'GPL3'
-__copyright__ = 'Copyright 2015-2019 Luca Venturini'
-__version__ = "1.2.5"
+__copyright__ = 'Copyright 2015-2020 Luca Venturini'
+__version__ = "1.3"
 
 __all__ = ["configuration",
            "exceptions",

diff --git a/Mikado/loci/superlocus.py b/Mikado/loci/superlocus.py
@@ -1142,20 +1142,31 @@ def define_loci(self):
 
     def __find_lost_transcripts(self):
 
-        if self.loci_defined is True:
-            return
+        cds_only = self.json_conf["pick"]["clustering"]["cds_only"]
+        # simple_overlap = self.json_conf["pick"]["run_options"]["monoloci_from_simple_overlap"]
+        cdna_overlap = self.json_conf["pick"]["clustering"]["min_cdna_overlap"]
+        cds_overlap = self.json_conf["pick"]["clustering"]["min_cds_overlap"]
+
+        t_graph = self.define_graph(self.transcripts,
+                                    inters=MonosublocusHolder.is_intersecting,
+                                    cds_only=cds_only,
+                                    logger=self.logger,
+                                    min_cdna_overlap=cdna_overlap,
+                                    min_cds_overlap=cds_overlap,
+                                    simple_overlap_for_monoexonic=False)
 
-        loci_transcripts = itertools.chain(*[{self.loci[_].transcripts.keys()} for _ in self.loci])
+        loci_transcripts = set()
+        for locus in self.loci.values():
+            loci_transcripts.update(set([_ for _ in locus.transcripts.keys()]))
 
-        for tid in set.difference({self.transcripts.keys()}, loci_transcripts):
-            found = False
-            for lid in self.loci:
-                if MonosublocusHolder.in_locus(self.loci[lid], self.transcripts[tid]):
-                    found = True
-                    break
-                else:
-                    continue
-            if found is True:
+        not_loci_transcripts = set.difference({_ for _ in self.transcripts.keys()}, loci_transcripts)
+
+        if not not_loci_transcripts:
+            return
+
+        for tid in not_loci_transcripts:
+            neighbours = set(t_graph.neighbors(tid))
+            if set.intersection(neighbours, loci_transcripts):
                 continue
             else:
                 self.__lost.update({tid: self.transcripts[tid]})
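
The idea behind the new version of `__find_lost_transcripts` can be sketched outside of Mikado as follows. This is only an illustration under stated assumptions: a plain adjacency dictionary stands in for the graph object returned by `define_graph`, the `overlaps` helper is a hypothetical replacement for `MonosublocusHolder.is_intersecting` (it ignores `cds_only`, `min_cdna_overlap` and `min_cds_overlap`), and transcripts are reduced to `(start, end)` tuples.

```python
from itertools import combinations


def overlaps(a, b):
    """Hypothetical intersection test: plain overlap of two (start, end) intervals."""
    return a[0] <= b[1] and b[0] <= a[1]


def build_graph(transcripts):
    """Map each transcript ID to the set of transcript IDs it intersects."""
    graph = {tid: set() for tid in transcripts}
    for tid_a, tid_b in combinations(transcripts, 2):
        if overlaps(transcripts[tid_a], transcripts[tid_b]):
            graph[tid_a].add(tid_b)
            graph[tid_b].add(tid_a)
    return graph


def find_lost(transcripts, loci):
    """Return the transcripts that are in no locus and touch no locus member."""
    graph = build_graph(transcripts)

    # Collect every transcript ID already assigned to a locus.
    in_loci = set()
    for members in loci.values():
        in_loci.update(members)

    lost = {}
    for tid in set(transcripts) - in_loci:
        # A transcript with a neighbour inside a locus was legitimately
        # excluded; one without any such neighbour was lost along the way.
        if not (graph[tid] & in_loci):
            lost[tid] = transcripts[tid]
    return lost


# Toy data: t1 is in a locus, t2 overlaps t1, t3 is isolated.
transcripts = {
    "t1": (100, 500),
    "t2": (400, 900),
    "t3": (2000, 2500),
}
loci = {"locus1": {"t1"}}
lost = find_lost(transcripts, loci)
```

Because the graph is built once with the same overlap criteria, every neighbour lookup is consistent with the clustering parameters, which is the coherence the commit message refers to. As a side note, the removed lines wrapped `self.transcripts.keys()` in a set literal; in Python 3 this raises a `TypeError`, since dictionary views are unhashable, so building the sets explicitly also avoids that problem.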

diff --git a/docs/Algorithms.rst b/docs/Algorithms.rst
@@ -219,6 +219,13 @@ For example, this is a snippet of a scoring section:
         end_distance_from_junction:
             filter: {operator: lt, value: 55}
             rescaling: min
+        non_verified_introns_num:
+            rescaling: max
+            multiplier: -10
+            filter:
+                operator: gt
+                value: 1
+                metric: exons_num
 
 
 Using this snippet as a guide, Mikado will score transcripts in each locus as follows:
@@ -228,6 +235,11 @@ Using this snippet as a guide, Mikado will score transcripts in each locus as fo
 * Assign a full score (**two points**, as a multiplier of 2 is specified) to transcripts that have a total amount of CDS bps approximating 80% of the transcript cDNA length (*combined_cds_fraction*)
 * Assign a full score (one point, as no multiplier is specified) to transcripts that have a 5' UTR whose length is nearest to 100 bps (*five_utr_length*); if the 5' UTR is longer than 2,500 bps, this score will be 0 (see the filter section)
 * Assign a full score (one point, as no multiplier is specified) to transcripts which have the lowest distance between the CDS end and the most downstream exon-exon junction (*end_distance_from_junction*). If such a distance is greater than 55 bps, assign a score of 0, as it is a probable target for NMD (see the filter section).
+* Assign a maximum penalty (**minus 10 points**, as a **negative** multiplier is specified) to the transcript with the highest number of non-verified introns in the locus.
+  * Again, we are using a "filter" section to define which transcripts will be exempted from this scoring (in this case, a penalty)
+  * However, please note that we are using the keyword **metric** in this section. This tells Mikado to check a *different* metric for evaluating the filter. Nominally, in this case we are excluding from the penalty any *monoexonic* transcript. This makes sense as a monoexonic transcript will never have an intron to be confirmed to start with.
+
+.. note:: The possibility of using different metrics for the "filter" section is present from Mikado 1.3 onwards.
 
 .. _Metrics:
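
The effect of the new `metric` keyword in a "filter" section, as documented in the `docs/Algorithms.rst` change above, can be illustrated with a small standalone sketch. This is not Mikado's scoring code: the `score_metric` function and the `OPERATORS` table are hypothetical, only the `max` rescaling is handled, and the plain min-max normalisation over all transcripts in the locus is an assumption made for the example.

```python
import operator

# Hypothetical mapping from operator names in the configuration to functions.
OPERATORS = {
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}


def score_metric(transcripts, metric, rule):
    """Score one metric with "max" rescaling and an optional filter."""
    values = [t[metric] for t in transcripts.values()]
    low, high = min(values), max(values)
    flt = rule.get("filter")
    multiplier = rule.get("multiplier", 1)

    scores = {}
    for tid, t in transcripts.items():
        if flt is not None:
            # The filter checks its own "metric" if given, else the scored one.
            checked = t[flt.get("metric", metric)]
            if not OPERATORS[flt["operator"]](checked, flt["value"]):
                scores[tid] = 0  # filter failed: no score and no penalty
                continue
        if high == low:
            fraction = 1.0  # arbitrary choice in this sketch
        else:
            fraction = (t[metric] - low) / (high - low)
        scores[tid] = fraction * multiplier
    return scores


rule = {
    "rescaling": "max",
    "multiplier": -10,
    "filter": {"operator": "gt", "value": 1, "metric": "exons_num"},
}
locus = {
    "mono": {"non_verified_introns_num": 0, "exons_num": 1},
    "multi_a": {"non_verified_introns_num": 1, "exons_num": 4},
    "multi_b": {"non_verified_introns_num": 3, "exons_num": 5},
}
scores = score_metric(locus, "non_verified_introns_num", rule)
```

In this toy locus the monoexonic transcript fails the `exons_num > 1` filter and is exempted from the penalty, while the transcript with the most non-verified introns receives the full penalty of minus 10 points.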