Scoring overhaul #634

Closed
tokebe opened this issue Apr 25, 2023 · 18 comments
Labels: enhancement (New feature or request), On Test (Related changes are deployed to Test server)

@tokebe (Member) commented Apr 25, 2023

Scoring could use some changes:

  • More hops (more qEdges) should reduce score (possibly only in creative mode? Possibly in all circumstances?)
  • More knowledge edges supporting a result should increase score (already happens, but might need some tuning)
  • A rudimentary "knowledge type" scoring weight, or at least a weight-by-api?
  • Creative mode: multiple templates supporting the same result should increase score, but not by as much as it presently does (once the scoring normalization PR is merged, the only bonus this gives is taking the max of the scores from different templates, which is too small a bonus)

This issue requires further investigation and discussion. At some point I'll work on slides to give a few case examples. It should not be put under active development yet.
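
For illustration, here's a rough sketch of how these principles might compose as tunable weights on top of a base result score (all names and constants below are hypothetical placeholders, not BTE's actual implementation):

    // Hypothetical sketch: each scoring principle becomes a tunable
    // weight applied on top of a base (e.g. NGD-derived) score.
    interface ResultInfo {
      baseScore: number;        // pre-normalization score
      numQEdges: number;        // hops in the query path
      numSupportEdges: number;  // knowledge edges supporting the result
      numTemplates: number;     // creative mode: templates yielding this result
    }

    // Made-up starting constants, to be tuned against benchmarks.
    const HOP_PENALTY = 0.9;     // <1, so more hops reduce the score
    const SUPPORT_BONUS = 0.05;  // diminishing bonus per extra support edge
    const TEMPLATE_BONUS = 0.1;  // modest bonus per extra template

    function weightedScore(r: ResultInfo): number {
      const hopFactor = Math.pow(HOP_PENALTY, r.numQEdges - 1);
      const supportFactor = 1 + SUPPORT_BONUS * Math.log2(1 + r.numSupportEdges);
      const templateFactor = 1 + TEMPLATE_BONUS * (r.numTemplates - 1);
      return r.baseScore * hopFactor * supportFactor * templateFactor;
    }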

@andrewsu (Member) commented

Another example worth looking into is https://ui.test.transltr.io/results?l=Postaxial%20Acrofacial%20Dysostosis&t=0&q=72fedb01-7ee2-4c7d-8a78-2cab471d4df8, described in NCATSTranslator/Feedback#349. This is a visual summary of the result:

[image: visual summary of the result]

From https://arax.ci.transltr.io/api/arax/v1.4/response/3709473e-e00a-40a7-bb4b-37d75ec57c29, this is the second-highest scoring answer with a normalized score of 99.72677595628416. Since "failure to thrive" and "growth delay" are relatively generic phenotypes, I would have expected the paths leading to those nodes to be relatively low scoring.

@andrewsu (Member) commented Jun 21, 2023

Basic principles laid out in 2023-06-21 team meeting:

[image: slide of basic scoring principles from the 2023-06-21 team meeting]

Also need to think about benchmarking. @tokebe please check out https://github.com/TranslatorSRI/Benchmarks as a possible framework for running benchmarks... (NOTE: we may also need to think about excluding certain resources during the benchmark runs to avoid trivial one-hop retrievals of the right answer...)

@andrewsu (Member) commented Jun 21, 2023

Regarding benchmarking, I gave the Benchmarks tool a try and pretty easily came up with this result table (based on this template and this data file):

Benchmark: treats
Results Directory: bte_test

                        k=1     k=5     k=10    k=20
Precision @ k           0.0000  0.0182  0.0409  0.0364
Recall @ k              0.0000  0.0476  0.2143  0.3810
mAP @ k                 0.0000  0.0205  0.0507  0.0627
Top-k Accuracy          0.0000  0.0476  0.2143  0.3810

Mean Reciprocal Rank    0.06197026657552973

Looks like it ran 22 TRAPI queries to generate those results. Overall, it seems like a reasonable system on which to base our scoring optimization efforts...
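
For anyone reproducing this, the table's metrics can be computed from ranked result lists roughly as follows (a generic sketch of the standard definitions, not the Benchmarks tool's actual code):

    // Generic sketch: `ranked` is one query's result IDs in descending
    // score order; `relevant` is the set of known-correct answers.
    function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
      const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
      return hits / k;
    }

    function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
      const hits = ranked.slice(0, k).filter((id) => relevant.has(id)).length;
      return hits / relevant.size;
    }

    // Mean Reciprocal Rank: 1 / (rank of first relevant result), averaged
    // over all queries; 0 for queries returning no relevant result.
    function meanReciprocalRank(
      queries: { ranked: string[]; relevant: Set<string> }[],
    ): number {
      const total = queries.reduce((sum, { ranked, relevant }) => {
        const idx = ranked.findIndex((id) => relevant.has(id));
        return sum + (idx === -1 ? 0 : 1 / (idx + 1));
      }, 0);
      return total / queries.length;
    }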

@andrewsu (Member) commented

Another factor that we can use in scoring is opposing evidence, as described in this comment:

the question is whether we should have a filter in text-mined resources to remove X - TREATS - Y when there are many more PMIDs associated with X - CAUSES - Y. This intuition seems to hold for the Halothane / malignant hyperthermia example. Semmeddb currently has 3 PMIDs for TREATS and 32 PMIDs for CAUSES. (link)
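
One way this intuition could enter scoring (a hypothetical sketch, not a settled design) is to down-weight a TREATS edge by a smoothed ratio of supporting to opposing publication counts:

    // Hypothetical opposing-evidence weight for a TREATS edge, given
    // PMID counts for TREATS (supporting) and CAUSES (opposing).
    function opposingEvidenceFactor(treatsPmids: number, causesPmids: number): number {
      // Add-one smoothing avoids zeroing out edges with few publications.
      return (treatsPmids + 1) / (treatsPmids + causesPmids + 2);
    }

    // Halothane / malignant hyperthermia example: 3 TREATS vs 32 CAUSES
    // PMIDs gives a factor of ~0.11, strongly penalizing the edge, while
    // 30 TREATS vs 2 CAUSES gives ~0.91, leaving it nearly untouched.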

@tokebe (Member, Author) commented Jun 27, 2023

@ericz1803 Assigning this one to you -- general idea is:

  1. Get benchmarking working as a means to test changes
  2. Test current scoring
  3. Implement each scoring principle from above as a weight applied on top of the NGD score, probably prior to score normalization. After each principle is implemented, run benchmark tests to determine the best value to tune its specific weighting to. Any principles that seem to just reduce quality regardless of weight can be removed.

There will be some complications around creative mode, as its score combination takes place after primary scoring. You can generally consider a result being merged as an instance of principle 3 (rather than the present implementation of taking the max score of the merged results).

Please let me know if you have any questions on implementation details/issue expectations/etc.
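
For step 3, the per-principle tuning loop could look something like this sketch (the benchmark runner and weight hook are assumed interfaces for illustration, not existing BTE APIs):

    // Hypothetical tuning harness: sweep candidate values for one
    // principle's weight, benchmark each, and keep the best performer.
    async function tuneWeight(
      runBenchmark: (weight: number) => Promise<number>, // e.g. returns MRR
      candidates: number[],
    ): Promise<{ weight: number; metric: number }> {
      let best = { weight: candidates[0], metric: -Infinity };
      for (const weight of candidates) {
        const metric = await runBenchmark(weight);
        if (metric > best.metric) best = { weight, metric };
      }
      // If no candidate beats the no-weight baseline, drop the principle.
      return best;
    }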

@tokebe (Member, Author) commented Aug 1, 2023

Just noting that #515 seems relevant; I noticed it while diving through old issues. Not sure if it's something we want to do for this overhaul.

@andrewsu (Member) commented Aug 1, 2023

#515 will likely have some independent challenges associated with it related to identifier mapping, so let's keep it separate from this issue.

@tokebe (Member, Author) commented Aug 2, 2023

@ericz1803 Just checking, in the case of results merging in creative mode, have any changes been made to this behavior? In current live code, the highest score is taken.

@colleenXu (Collaborator) commented Aug 2, 2023

(brought up in discussion with Jackson and also discussed here (lab Slack link))

@ericz1803 Note that the scoring-related logs may need updating. Perhaps we could remove them, or change them to record something useful about how results were scored.

Examples:

        {
            "timestamp": "2023-08-02T20:40:04.476Z",
            "level": "DEBUG",
            "message": "[Template-3]: Successfully scored 320 results, couldn't score 258 results.",
            "code": null
        },
        {
            "timestamp": "2023-08-02T20:40:04.532Z",
            "level": "INFO",
            "message": "Scoring Summary: (0) scored / (500) unscored",
            "code": null
        }

@ericz1803 (Contributor) commented

@tokebe I didn't make any changes to the way results are merged in creative mode.

@colleenXu I'll take a look at the logs. All results should now be getting scored, so I'll try to make them record more useful info.

@tokebe (Member, Author) commented Aug 3, 2023

We may want to apply some sort of transformation similar to situation 2 (higher score) for those cases, though that might take some undue effort and could be something we address in a future iteration.

@tokebe added this to the 2023-08-18 Code Freeze milestone (Aug 3, 2023)
@ericz1803 (Contributor) commented

@tokebe you're right. I implemented score addition for the creative queries; it seemed to have a pretty significant positive effect on the results, so I added it to the PR.
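
In other words, where merged creative-mode results previously kept only the maximum score, each supporting template now contributes additively; roughly (an illustrative comparison, not the PR's exact code):

    // Scores one result earned under three different creative-mode templates.
    const templateScores = [0.6, 0.5, 0.3];

    // Old behavior: the merged result keeps only its best template's score.
    const maxMerged = Math.max(...templateScores); // 0.6

    // New behavior (score addition): every supporting template contributes,
    // so results found by multiple templates outrank single-template ones.
    const sumMerged = templateScores.reduce((a, b) => a + b, 0); // 1.4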

@tokebe added and then removed the "On CI" and "On Dev" labels (Aug 8, 2023)
@colleenXu (Collaborator) commented

From a discussion with Jackson just now:

  • as we add or adjust the KPs that BTE uses, we shouldn't (in theory) have to rerun the scoring analysis to re-tune the weights.

@ericz1803 (Contributor) commented

Leaving this here in case anyone needs to run benchmarks on BTE in the future:
Either we have a non-standard results object or the Benchmarks tool hasn't been updated yet, since our scores are kept at result['analyses'][0]['score'] while the Benchmarks tool expects them at result['score'].

Also, all of the scoring optimization was based on the MostPrescribed_2019 and DrugCentral_creative datasets. I did try the other datasets (CPIC, DrugMechDB, and IRDiRC), but BTE either came up with no results or missed the target results for all the queries.

@andrewsu (Member) commented

Either we have a non-standard results object or the Benchmarks tool hasn't been updated yet, since our scores are kept at result['analyses'][0]['score'] while the Benchmarks tool expects them at result['score'].

Great point, @ericz1803. The change in the location of the score was actually made Consortium-wide recently. So if you've already updated your local instance to use the new location, please create a PR on the Benchmarks repo with that change.

@colleenXu (Collaborator) commented

It's related to TRAPI 1.4:

I think for our tool, in all cases we have only 1 analysis object?

But other tools (ARAs / KPs) may have multiple analyses objects and scores... so I don't know if that'll cause issues for them trying to benchmark with our PR / changes...
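
A Benchmarks-side fix would therefore need to read scores generically from the TRAPI 1.4 result shape, e.g. by taking the best analysis score per result (a sketch under that assumption, not the tool's actual code):

    // Sketch: pull one score from a TRAPI 1.4 result, which may carry a
    // single analysis (as BTE emits) or several (as other ARAs might).
    interface Analysis { score?: number; }
    interface TrapiResult { analyses: Analysis[]; }

    function resultScore(result: TrapiResult): number {
      const scores = result.analyses
        .map((a) => a.score)
        .filter((s): s is number => typeof s === "number");
      return scores.length > 0 ? Math.max(...scores) : 0;
    }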

@tokebe added the "On Test" label and removed the "On CI" label (Aug 14, 2023)