Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce time required to prepare for ROBOT query/verify #666

Merged
merged 5 commits into from
Apr 15, 2020

Conversation

balhoff
Copy link
Contributor

@balhoff balhoff commented Apr 2, 2020

This makes two changes:

  1. In creating the Jena Dataset used to execute SPARQL queries, instead of TDBFactory.createDataset(), DatasetFactory.createGeneral() is used. This eliminates a large amount of time spent initializing the Dataset (240 seconds in the case of the GO build on my laptop).
  2. I implemented a direct triple conversion from Sesame objects to Jena objects, avoiding creating a huge string representing the ontology. In the end this amazingly provides no speedup that I've been able to measure. But I expect it will reduce peak memory usage, which is a problem for us on Travis.

Behavior changes worth noting:

  • I had to make a change to one test. In a SPARQL query, when specifying a named graph representing an ontology in the import chain, you need to use FROM instead of FROM NAMED. This is how I would have expected it to work in the first place, and I prefer the new behavior (in my opinion FROM NAMED is super confusing). I'm not sure if this impacts anyone. Neither was documented before.

  • Method signatures with regard to checked exceptions have changed. Will this be a problem? If so I can catch and wrap those exceptions. (UPDATE—I saw that Travis failed; I wrapped the exceptions and now it passes.)

  • docs/ have been added/updated

  • tests have been added/updated

  • mvn verify says all tests pass

  • mvn site says all JavaDocs correct

  • CHANGELOG.md has been updated

…g time. Implement direct triple conversion to avoid intermediate Turtle string.
@cmungall
Copy link
Contributor

cmungall commented Apr 2, 2020

This is great! Issue 666 ftw 😈😈😈

I assume the tests were with in memory Jena - does this affect things when the tdb options is set?

@balhoff
Copy link
Contributor Author

balhoff commented Apr 2, 2020

@cmungall I haven't tested the disk-based TDB. It looks like a separate code path to me. @beckyjackson am I right?

@jamesaoverton
Copy link
Member

Thanks @balhoff! I want to test this on some of my ontology projects. I'll try to do that soon.

@jamesaoverton
Copy link
Member

I haven't yet had time for this -- I sincerely apologize.

@ontodev/robot-team Everyone is welcome to test this using this JAR: https://build.obolibrary.io/job/ontodev/job/robot/job/direct-rdf/lastSuccessfulBuild/artifact/bin/robot.jar

@beckyjackson
Copy link
Contributor

Sorry for the delay in response. Yes, the --tdb option goes a separate way, never loading the ontology into memory.

I tested this on DOID and it works great - a query that previously took 20 seconds only took 9 seconds 😄

I also tested this with MRO and a query that previously took 29 seconds took 14 seconds, and the results were essentially the same. The only (tiny) difference was that a GROUP_CONCAT on synonyms returned the same values in a different order for two results (out of 941 results), but I don't think this is a problem?

@balhoff
Copy link
Contributor Author

balhoff commented Apr 15, 2020

The only (tiny) difference was that a GROUP_CONCAT on synonyms returned the same values in a different order for two results (out of 941 results), but I don't think this is a problem?

That seems understandable for different backends. As far as I know you can't guarantee an ordering within GROUP_CONCAT.

@jamesaoverton jamesaoverton merged commit dd2d871 into master Apr 15, 2020
@jamesaoverton
Copy link
Member

Thanks @balhoff! Sorry again about the delay -- we're even busier than usual lately.

@balhoff
Copy link
Contributor Author

balhoff commented Apr 15, 2020

No problem, thanks!!

@jamesaoverton jamesaoverton deleted the direct-rdf branch June 16, 2022 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants