Boost efficiency of CHEBI/NCBITaxon imports? #552
Comments
@pbuttigieg I 100% feel your pain, and share it on a daily basis: PR, CHEBI, NCBITAXON. OK, let me put it like this: you can reduce time by reducing network traffic, and you do that by using the gzipped versions of an ontology; many, such as CHEBI, have one. There is a patch for ROBOT in the pipeline to handle gzipped IRIs correctly (it's fixed but not published); until then, wget the gzipped ontology and just open the file with ROBOT the usual way. The other way to handle this would be to lobby the OBO Foundry to define "useful subsets" of these ontologies; this would be best, but I can't see it happening soon. Unfortunately, this is not a ROBOT issue. :/ I personally excluded these huge ontologies from the automatic update pipelines à la ODK, and only refresh them explicitly overnight when I feel like it. Not good.
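The "fetch the gzipped file first, then open it locally" workaround could be scripted roughly as below. This is only a sketch: the function names are made up for illustration, no real ontology URL is assumed (pass whatever gzipped IRI the ontology publishes), and the decompression step is optional if your tool reads `.gz` files from disk.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def fetch_gzipped(url: str, dest: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream a file from `url` to `dest` in chunks, never holding it all in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest


def is_gzip(path: Path) -> bool:
    """Check the two-byte gzip magic number so a bad download is caught early."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def gunzip(src: Path, dest: Path) -> Path:
    """Optional step: decompress on disk for tools that cannot read .gz directly."""
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dest
```

The downloaded file can then be handed to ROBOT as a local input in the usual way.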
ROBOT has been able to handle gzipped files from disk for a while now. There was a bug with
The NCBI Taxonomy is quite simple. It's just driven by some tables, so it's easy enough to write custom code for working with it, along the lines of https://github.com/obophenotype/ncbitaxon
For the other large ontologies that take advantage of OWL, sometimes you can get away with just using SPARQL. If you need to do proper OWL work, you need OWLAPI, and large ontologies just require a lot of memory. ROBOT tries to be a lightweight layer on top of OWLAPI -- if anybody sees inefficiencies in ROBOT, let us know and we'll try to optimize. The main exception is ROBOT
@matentzn is right that we could work toward providing useful subsets of the large ontologies.
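To illustrate the "it's just driven by some tables" point about NCBI Taxonomy, here is a minimal sketch that reads rows in the layout of the taxdump `nodes.dmp` table (fields separated by tab, pipe, tab; the first two fields are the taxon ID and its parent ID) and walks up to the root. It is not the code from the linked repository, and the function names are illustrative only.

```python
from typing import Dict, Iterable, List


def parse_nodes(lines: Iterable[str]) -> Dict[str, str]:
    """Build a child -> parent map from nodes.dmp-style rows."""
    parents: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.endswith("\t|"):
            line = line[:-2]  # drop the row terminator
        fields = line.split("\t|\t")
        parents[fields[0]] = fields[1]
    return parents


def lineage(tax_id: str, parents: Dict[str, str]) -> List[str]:
    """Return tax_id followed by its ancestors; the root is its own parent."""
    path = [tax_id]
    while parents.get(path[-1], path[-1]) != path[-1]:
        path.append(parents[path[-1]])
    return path
```

Because only the ID columns are kept, the memory footprint is a single dictionary rather than a full OWL model.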
I agree with all the above, re gz and tbd. An overly complicated solution would be to put all ontologies in S3 buckets and make it easier for people to run things on EC2 colocated, perhaps via some build-as-a-service [future grant idea]. We could also implement SLME over SPARQL. Or just MIREOT (usually the formal guarantees of SLME are not required for something like ncbitaxon). Or we could simply have ROBOT call the OntoFox API. Or you could simply implement OntoFox in your pipeline. (I think that is in decreasing order of work for ROBOT developers.) There are various dependency issues here. If we are to depend on an external SPARQL endpoint, I would rather have it depend on one that implements standard patterns for organizing ontologies into NGs (another hobby horse of mine). Note for ncbitaxon we do have a slim (that is actually quite chonk). I thought we had something in ODK to make it easy to specify a slim rather than a full product, but it looks not to be the case. It's hard to say what the right subset of CHEBI would be. Sometimes the 'naturally occurring' subset is most useful, but maybe not for Pier's use case, where we might want X-contaminated soil, where X is an anthropogenic product.
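For readers unfamiliar with MIREOT-style extraction, the core idea (keep the requested terms plus their superclass chain, nothing else) can be sketched over a plain parent map. This is a simplified illustration, not how ROBOT or OntoFox implement it; the term names and the `mireot_extract` function are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def mireot_extract(
    seeds: Iterable[str], parents: Dict[str, List[str]]
) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Collect the seed terms, all their ancestors, and the subclass edges between them."""
    keep: Set[str] = set()
    edges: Set[Tuple[str, str]] = set()
    stack = list(seeds)
    while stack:
        term = stack.pop()
        if term in keep:
            continue  # already visited via another path
        keep.add(term)
        for parent in parents.get(term, []):
            edges.add((term, parent))
            stack.append(parent)
    return keep, edges


# Toy hierarchy (made-up labels, not real CHEBI content).
toy = {
    "ethanol": ["alcohol"],
    "alcohol": ["organic compound"],
    "organic compound": ["chemical entity"],
    "water": ["chemical entity"],
}
terms, subclass_edges = mireot_extract(["ethanol"], toy)
```

Siblings such as `water` are left out, which is why the result is small; the trade-off is that no logical axioms beyond the subclass chain come along.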
Thanks @cmungall @matentzn @jamesaoverton for the perspective and guidance. Strong +1 for going for a Data as a Service model (push the code to a remote data hosting space and pull back only results). Until then, and because we have the luxury of servers plugged into beast-mode internet/network resources at our institute, I'm just spinning up the Docker container there to do any heavy lifting. Would be cool to see this issue punted to OBO Operations if you feel it's better there.
Just as a side note: with the new ROBOT (1.4.3) you can now do this:
Which will save you a great deal of time when running CHEBI.
Hi all,
In the day-to-day of mirroring and importing minimal subgraphs, CHEBI and NCBITaxon stick out as outliers due to their relatively monstrous size.
Is there any magic that can be done to reduce the memory requirements for handling these?