Scoring overhaul #634

Closed
tokebe opened this issue Apr 25, 2023 · 18 comments
Labels: enhancement (New feature or request), On Test (Related changes are deployed to Test server)

@tokebe (Member) commented Apr 25, 2023

Scoring could use some changes:

  • More hops (more qEdges) should reduce score (possibly only in creative mode? Possibly in all circumstances?)
  • More knowledge edges supporting a result should increase score (already happens, but might need some tuning)
  • A rudimentary "knowledge type" scoring weight, or at least a weight-by-api?
  • Creative mode: multiple templates supporting the same result should increase score, but not by as much as it presently does (once the scoring normalization PR is merged, the only bonus this gives is taking the max of the scores from different templates, which is too small a bonus)

This issue requires further investigation and discussion. At some point I'll work on slides to give a few case examples. It should not be put under active development yet.
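
For illustration, here's a rough sketch of how these principles might compose as tunable weights on top of a base result score (all names and constants below are hypothetical placeholders, not BTE's actual implementation):

    // Hypothetical sketch: each scoring principle becomes a tunable
    // weight applied on top of a base (e.g. NGD-derived) score.
    interface ResultInfo {
      baseScore: number;        // pre-normalization score
      numQEdges: number;        // hops in the query path
      numSupportEdges: number;  // knowledge edges supporting the result
      numTemplates: number;     // creative mode: templates yielding this result
    }

    // Made-up starting constants, to be tuned against benchmarks.
    const HOP_PENALTY = 0.9;     // <1, so more hops reduce the score
    const SUPPORT_BONUS = 0.05;  // diminishing bonus per extra support edge
    const TEMPLATE_BONUS = 0.1;  // modest bonus per extra template

    function weightedScore(r: ResultInfo): number {
      const hopFactor = Math.pow(HOP_PENALTY, r.numQEdges - 1);
      const supportFactor = 1 + SUPPORT_BONUS * Math.log2(1 + r.numSupportEdges);
      const templateFactor = 1 + TEMPLATE_BONUS * (r.numTemplates - 1);
      return r.baseScore * hopFactor * supportFactor * templateFactor;
    }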

@andrewsu (Member) commented

Another example worth looking into is https://ui.test.transltr.io/results?l=Postaxial%20Acrofacial%20Dysostosis&t=0&q=72fedb01-7ee2-4c7d-8a78-2cab471d4df8, described in NCATSTranslator/Feedback#349. This is a visual summary of the result:

[image: visual summary of the result]

From https://arax.ci.transltr.io/api/arax/v1.4/response/3709473e-e00a-40a7-bb4b-37d75ec57c29, this is the second-highest scoring answer with a normalized score of 99.72677595628416. Since "failure to thrive" and "growth delay" are relatively generic phenotypes, I would have expected the paths leading to those nodes to be relatively low scoring.

@andrewsu (Member) commented Jun 21, 2023

Basic principles laid out in 2023-06-21 team meeting:

[image: slide of basic scoring principles from the 2023-06-21 team meeting]

Also need to think about benchmarking. @tokebe please check out https://github.com/TranslatorSRI/Benchmarks as a possible framework for running benchmarks... (NOTE: we may also need to think about excluding certain resources during the benchmark runs to avoid trivial one-hop retrievals of the right answer...)

@andrewsu (Member) commented Jun 21, 2023

Regarding benchmarking, I gave the Benchmarks tool a try and pretty easily came up with this result table (based on this template and this data file):

Benchmark: treats
Results Directory: bte_test

                        k=1     k=5     k=10    k=20
Precision @ k           0.0000  0.0182  0.0409  0.0364
Recall @ k              0.0000  0.0476  0.2143  0.3810
mAP @ k                 0.0000  0.0205  0.0507  0.0627
Top-k Accuracy          0.0000  0.0476  0.2143  0.3810

Mean Reciprocal Rank    0.06197026657552973

Looks like it ran 22 TRAPI queries to generate those results. Overall, it seems like a reasonable system on which to base our scoring optimization efforts...
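
For anyone reproducing this, the table's metrics can be computed from ranked result lists roughly as follows (a generic sketch of the standard definitions, not the Benchmarks tool's actual code):

    // Generic sketch: `ranked` is one query's result IDs in descending
    // score order; `relevant` is the set of known-correct answers.
    function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
      const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
      return hits / k;
    }

    function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
      const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
      return hits / relevant.size;
    }

    // Mean Reciprocal Rank: 1 / (rank of first relevant result), averaged
    // over all queries; 0 for queries returning no relevant result.
    function meanReciprocalRank(
      queries: { ranked: string[]; relevant: Set<string> }[],
    ): number {
      const total = queries.reduce((sum, { ranked, relevant }) => {
        const idx = ranked.findIndex((id) => relevant.has(id));
        return sum + (idx === -1 ? 0 : 1 / (idx + 1));
      }, 0);
      return total / queries.length;
    }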

@andrewsu (Member) commented

Another factor that we can use in scoring is opposing evidence, as described in this comment:

the question is whether we should have a filter in text-mined resources to remove X - TREATS - Y when there are many more PMIDs associated with X - CAUSES - Y. This intuition seems to hold for the Halothane / malignant hyperthermia example. Semmeddb currently has 3 PMIDs for TREATS and 32 PMIDs for CAUSES. (link)
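
One way this intuition could enter scoring (a hypothetical sketch, not a settled design) is to down-weight a TREATS edge by a smoothed ratio of supporting to opposing publication counts:

    // Hypothetical opposing-evidence weight for a TREATS edge, given
    // PMID counts for TREATS (supporting) and CAUSES (opposing).
    function opposingEvidenceFactor(treatsPmids: number, causesPmids: number): number {
      // Add-one smoothing avoids zeroing out edges with few publications.
      return (treatsPmids + 1) / (treatsPmids + causesPmids + 2);
    }

    // Halothane / malignant hyperthermia example: 3 TREATS vs 32 CAUSES
    // PMIDs gives a factor of ~0.11, strongly penalizing the edge, while
    // 30 TREATS vs 2 CAUSES gives ~0.91, leaving it nearly untouched.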

@tokebe (Member, Author) commented Jun 27, 2023

@ericz1803 Assigning this one to you -- general idea is:

  1. Get benchmarking working as a means to test changes
  2. Test current scoring
  3. Implement each scoring principle from above as a weight applied on top of the NGD score, probably prior to score normalization. After each principle is implemented, run benchmark tests to determine the best value to tune its specific weighting to. Any principles that seem to just reduce quality regardless of weight can be removed.

There will be some complications around creative mode, as its score combination takes place after primary scoring. You can generally consider a result being merged as an instance of principle 3 (rather than the present implementation of taking the max score of the merged results).

Please let me know if you have any questions on implementation details/issue expectations/etc.
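
For step 3, the per-principle tuning loop could look something like this sketch (the benchmark runner and weight hook are assumed interfaces for illustration, not existing BTE APIs):

    // Hypothetical tuning harness: sweep candidate values for one
    // principle's weight, benchmark each, and keep the best performer.
    async function tuneWeight(
      runBenchmark: (weight: number) => Promise<number>, // e.g. returns MRR
      candidates: number[],
    ): Promise<{ weight: number; metric: number }> {
      let best = { weight: candidates[0], metric: -Infinity };
      for (const weight of candidates) {
        const metric = await runBenchmark(weight);
        if (metric > best.metric) best = { weight, metric };
      }
      // If no candidate beats the no-weight baseline, drop the principle.
      return best;
    }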

@tokebe (Member, Author) commented Aug 1, 2023

Just noting that #515 seems relevant; I noticed it while diving through old issues. Not sure if it's something we want to do for this overhaul.

@andrewsu (Member) commented Aug 1, 2023

#515 will likely have some independent challenges associated with it related to identifier mapping, so let's keep it separate from this issue.

@tokebe (Member, Author) commented Aug 2, 2023

@ericz1803 Just checking, in the case of results merging in creative mode, have any changes been made to this behavior? In current live code, the highest score is taken.

@colleenXu (Collaborator) commented Aug 2, 2023

(brought up in discussion with Jackson and also discussed here (lab Slack link))

@ericz1803 Note that the scoring-related logs may need updating. Perhaps we could remove them, or change them to record something useful about how results were scored.

Examples:

        {
            "timestamp": "2023-08-02T20:40:04.476Z",
            "level": "DEBUG",
            "message": "[Template-3]: Successfully scored 320 results, couldn't score 258 results.",
            "code": null
        },
        {
            "timestamp": "2023-08-02T20:40:04.532Z",
            "level": "INFO",
            "message": "Scoring Summary: (0) scored / (500) unscored",
            "code": null
        }

@ericz1803 (Contributor) commented

@tokebe I didn't make any changes to the way results are merged in creative mode.

@colleenXu I'll take a look at the logs. All results should now be getting scored, so I'll try to make them record more useful info.

@tokebe (Member, Author) commented Aug 3, 2023

We may want to apply some sort of transformation similar to situation 2 (higher score) for those cases, though that might take some undue effort and could be something we address in a future iteration.

@tokebe added this to the 2023-08-18 Code Freeze milestone (Aug 3, 2023)
@ericz1803 (Contributor) commented

@tokebe you're right. I implemented score addition for the creative queries; it seemed to have a pretty significant positive effect on the results, so I added it to the PR.
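
In other words, where merged creative-mode results previously kept only the maximum score, each supporting template now contributes additively; roughly (an illustrative comparison, not the PR's exact code):

    // Scores one result earned under three different creative-mode templates.
    const templateScores = [0.6, 0.5, 0.3];

    // Old behavior: the merged result keeps only its best template's score.
    const maxMerged = Math.max(...templateScores); // 0.6

    // New behavior (score addition): every supporting template contributes,
    // so results found by multiple templates outrank single-template ones.
    const sumMerged = templateScores.reduce((a, b) => a + b, 0); // 1.4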

@tokebe added and then removed the "On CI" and "On Dev" labels (Aug 8, 2023)
@colleenXu (Collaborator) commented

From a discussion with Jackson just now:

  • as we add or adjust the KPs that BTE uses, we shouldn't (in theory) have to rerun the scoring analysis to re-tune the weights.

@ericz1803 (Contributor) commented

Leaving this here in case anyone needs to run benchmarks on BTE in the future:
Either we have a non-standard results object or the Benchmarks tool hasn't been updated yet, since our scores are kept at result['analyses'][0]['score'] while the Benchmarks tool expects them at result['score'].

Also, all of the scoring optimization was based on the MostPrescribed_2019 and DrugCentral_creative datasets. I did try the other datasets (CPIC, DrugMechDB, and IRDiRC), but BTE either came up with no results or missed the target results for all the queries.

@andrewsu (Member) commented

Either we have a non-standard results object or the Benchmarks tool hasn't been updated yet, since our scores are kept at result['analyses'][0]['score'] while the Benchmarks tool expects them at result['score'].

Great point, @ericz1803. The change in the location of the score was actually made Consortium-wide recently. So if you've already updated your local instance to use the new location, please create a PR on the Benchmarks repo with that change.

@colleenXu (Collaborator) commented

It's related to TRAPI 1.4:

I think for our tool, in all cases we have only 1 analysis object?

But other tools (ARAs / KPs) may have multiple analyses objects and scores... so I don't know if that'll cause issues for them trying to benchmark with our PR / changes...
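
A Benchmarks-side fix would therefore need to read scores generically from the TRAPI 1.4 result shape, e.g. by taking the best analysis score per result (a sketch under that assumption, not the tool's actual code):

    // Sketch: pull one score from a TRAPI 1.4 result, which may carry a
    // single analysis (as BTE emits) or several (as other ARAs might).
    interface Analysis { score?: number; }
    interface TrapiResult { analyses: Analysis[]; }

    function resultScore(result: TrapiResult): number {
      const scores = result.analyses
        .map((a) => a.score)
        .filter((s): s is number => typeof s === "number");
      return scores.length > 0 ? Math.max(...scores) : 0;
    }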

@tokebe added the "On Test" label and removed the "On CI" label (Aug 14, 2023)