Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude some synonym types from duplicate_exact_synonym report query #1179

Merged
merged 5 commits into from
May 1, 2024

Conversation

allenbaron
Copy link
Contributor

@allenbaron allenbaron commented Feb 1, 2024

Resolves #1175

  • docs/ have been added/updated
  • tests have been added/updated
  • mvn verify says all tests pass
  • mvn site says all JavaDocs correct
  • CHANGELOG.md has been updated

As discussed in issue #1175, this PR excludes exact synonyms annotated with the abbreviation (OMO:0003000) and acronym (OMO:0003012) synonym type from the duplicate_exact_synonym test of ROBOT report. It also now ignores case when making comparisons.

Exclusion of abbreviations and acronyms should be backward compatible and only incur a time hit for those not using them, while ignoring case will bring up new errors.

I'm happy to change this PR as needed and will attempt to update the documentation and tests if I can (I have no experience with Java).

- Excludes abbreviation (OMO:0003000) and acronym (OMO:0003012)
synonym type from this test
- Ignores case when comparing synonyms
@allenbaron
Copy link
Contributor Author

All tests passed but it doesn't look like there's one specifically for the output of the duplicate_exact_synonym query.

After checking that ROBOT built correctly with this PR, I ran some ROBOT report tests (default profile) comparing the current release (1.9.5) and the build based on this PR on both the DO's edit file (with and without acronym annotations) and the uberon.owl file.

All the results came out as expected: some new case-insensitive matches were identified in the DO (oops, 😅) and the acronym exclusion worked as expected. The execution time for 5 runs each can be seen in the plot and summaries below (cpu = user + sys; not showing the run against DO with the acronym annotations added but it was the same as shown below for the doid-edit.owl file before I added them). I think the increased time is reasonable and expect most people won't really notice.

There was one problem I can't explain or fix: the values column of the report output was empty for the robot.jar with the new duplicate_exact_synonym query robot.jar built on this PR (executed from upper dir of this repo as java -jar bin/robot.jar report -i <path_to_owl_file> -o DEL.tsv). It's not a problem with the query; using the query command with the same jar file produced the expected output.

Time Summaries

CPU Time Summary (user + sys)

input version cpu_mean cpu_sd cpu_min cpu_max cpu_diff
doid-edit report-1.9.5 36.16 0.8 33.74 38.57 1
doid-edit report-dev 40.35 1.16 36.87 43.84 1.12
uberon report-1.9.5 102.32 12.23 65.62 139.02 1
uberon report-dev 111.5 13.91 69.78 153.23 1.09

Real Time Summary

input version real_mean real_sd real_min real_max real_diff
doid-edit report-1.9.5 11.57 0.18 11.02 12.13 1
doid-edit report-dev 12.62 0.31 11.69 13.55 1.09
uberon report-1.9.5 36.03 0.33 35.04 37.03 1
uberon report-dev 45.73 1.07 42.51 48.94 1.27

@allenbaron allenbaron marked this pull request as ready for review February 1, 2024 21:21
Copy link
Contributor

@matentzn matentzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a very nice solution. Using a subquery: if there is an acronym or abbreviation annotation, we filter the exact match out (BOUND(?exclude)). The actual equivalence check happens outside the subquery.

Cool, I like it, but I would be more confident if @anitacaron could review this..

@allenbaron
Copy link
Contributor Author

I agree that it was a pretty smart way to solve the problem but it wasn't mine. The query is essentially the one @anitacaron shared from Uberon with only slight modification to improve performance and additionally exclude acronym. Having her review it is a good idea.

@allenbaron
Copy link
Contributor Author

In this query I decided to pass the property up from the subquery to improve performance slightly. In thinking about this just now while working on another query, it occurred to me that this might not work as desired in a specific scenario.

Here's the scenario:
An ontology uses both IAO:0000118 and oio:hasExactSynonym and has the same synonym once with each of these properties on different terms. These synonyms would not be identified by this query as errors since they use different properties (false negative). Essentially, passing the property up prevents cross-property identification of duplicate synonyms.

Does this matter?

I've actually been wondering about the inclusion of IAO:0000118. I left it in because it was in the original duplicate_exact_synonym query but I've wondered if it should be removed, since it appears that IAO:0000118 is the annotation property grouping (i.e. parent) for all of the oboInOwl synonym terms, which include broad, narrow and related types. Based on comments I've heard about annotation properties, my assumption is that annotation property hierarchy doesn't mean anything, so maybe it's more about how IAO:0000118 is used in practice?

DO & Uberon don't use IAO:0000118 but RO, BFO, SO, GENO, and a number of other ontologies do appear to use it.

@matentzn
Copy link
Contributor

matentzn commented Feb 3, 2024

@allenbaron very good thinking - I totally missed this.

Here is my position:

  1. The query may be a tiny bit more permissive on the case that an entity has both an obo:hasExactSynonym and an IAO:0000118 associated with it. A bit more permissive wont annoy anyone.
  2. The case is not very frequent in the OBO World:
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
     PREFIX rh:<http://rdf.rhea-db.org/>
     
     SELECT DISTINCT ?cls ?x ?y WHERE {
       ?cls <http://purl.obolibrary.org/obo/IAO_0000118> ?x .
       ?cls <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> ?y .
     } LIMIT 1000
    
    Yields 25 results on Ubergraph and ca. 5128 cases on ontobee, the vast majority of which are some stray EFO classes, and some CHEBI and NCBITaxon classes.
  3. I personally think that if people choose to use both of these properties, they intend to duplicate them for interoperability reasons. So indeed, I believe they should not raise a red flag, as it currently does.

My position is that this "scenario" you identified actually makes the situation better, not worse.

@jamesaoverton jamesaoverton merged commit c1ad167 into ontodev:master May 1, 2024
3 checks passed
@allenbaron allenbaron deleted the query_update branch December 12, 2024 15:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ignoring duplicate exact synonyms that are acronyms in robot report
3 participants