
Task 3 - API - add search models by species #290

Closed
2 of 3 tasks
lpalbou opened this issue Mar 18, 2020 · 9 comments

lpalbou commented Mar 18, 2020

Task requirement from Noctua Landing Page Project

This will be a three-step task:

  • Batch update of all current models to add taxon id of each gene
  • Update minerva so that anytime a model is added or updated, the taxon ids will be added to each gene of the model
  • Provide the API route for NLP UI

Also linked to #230

lpalbou changed the title from "API - add search by species" to "API - add search models by species" on Mar 18, 2020
lpalbou changed the title from "API - add search models by species" to "Task 3 - API - add search models by species" on Mar 18, 2020

lpalbou commented Mar 18, 2020

@goodb what is the status of the new neo? We can discuss this requirement further.

goodb commented Mar 22, 2020

@lpalbou the status for minerva is that I am about to put in a PR to the dev branch that provides search by taxon id, among several other things.

I accomplished this by (1) building a taxon-to-models map on server startup, (2) using that map to handle the search, and (3) updating the map when models are saved.

I didn't touch the model metadata but could of course do it that way as well.
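
A minimal sketch of that map-based approach, for illustration only (class and method names are hypothetical, not minerva's actual code):

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical taxon -> models index mirroring the three steps described above.
    public class TaxonModelIndex {

        // taxon IRI -> set of model IRIs
        private final Map<String, Set<String>> taxonToModels = new ConcurrentHashMap<>();

        // (1) populate the map once on server startup, one call per known model
        public void index(String modelIri, Collection<String> taxonIris) {
            for (String taxon : taxonIris) {
                taxonToModels.computeIfAbsent(taxon, t -> ConcurrentHashMap.newKeySet()).add(modelIri);
            }
        }

        // (2) answer a search-by-taxon request entirely from memory
        public Set<String> modelsForTaxon(String taxonIri) {
            return taxonToModels.getOrDefault(taxonIri, Collections.emptySet());
        }

        // (3) keep the map current whenever a model is saved
        public void onModelSaved(String modelIri, Collection<String> taxonIris) {
            index(modelIri, taxonIris);
        }
    }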

Still don't know what the NLP UI is or how it is intended to interact here. It seems my previous comment to that effect was lost in some kind of issue reshuffle.

lpalbou commented Mar 26, 2020

Update from recent discussions:

  • @tmushayahama is able to power both the search and the browse-models-by-species features

  • @vanaukenk is testing and it looks good at the moment

  • @goodb I like the taxon <-> model map solution you implemented, and as long as it gets updated whenever a model is saved, it should be fine and fast enough. However, one query is taking a long time (~15s):

http://barista-dev.berkeleybop.org/search?offset=0&limit=50&taxon=http://purl.obolibrary.org/obo/NCBITaxon_10090

Considering your elegant solution, I could not see any other cause than a SPARQL query in need of a redesign, but when looking at the response, I see a "sparql" field at the end that is cluttering the response. Could that be why this query is slow?

Sample from the response:

"sparql": "PREFIX owl: <http://www.w3.org/2002/07/owl#> \nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> \n#model metadata\nPREFIX metago: <http://model.geneontology.org/>\nPREFIX lego: <http://geneontology.org/lego/> \n#model data\nPREFIX part_of: <http://purl.obolibrary.org/obo/BFO_0000050>\nPREFIX occurs_in: <http://purl.obolibrary.org/obo/BFO_0000066>\nPREFIX enabled_by: <http://purl.obolibrary.org/obo/RO_0002333>\nPREFIX has_input: <http://purl.obolibrary.org/obo/RO_0002233>\nPREFIX has_output: <http://purl.obolibrary.org/obo/RO_0002234>\nPREFIX causally_upstream_of: <http://purl.obolibrary.org/obo/RO_0002411>\nPREFIX provides_direct_input_for: <http://purl.obolibrary.org/obo/RO_0002413>\nPREFIX directly_positively_regulates: <http://purl.obolibrary.org/obo/RO_0002629>\n\nSELECT  ?id ?date ?title ?state  (GROUP_CONCAT(DISTINCT ?contributor;separator=\";\") AS ?contributors) (GROUP_CONCAT(DISTINCT ?group;separator=\";\") AS ?groups)    \nWHERE {\n  GRAPH ?id {  \n        ?id <http://purl.org/dc/elements/1.1/title> ?title ;\n           <http://purl.org/dc/elements/1.1/date> ?date ;\n           <http://purl.org/dc/elements/1.1/contributor> ?contributor ;   \n        optional{?id <http://purl.org/pav/providedBy> ?group } .   \n        optional{?id lego:modelstate ?state } .    \n       \n      \n      \n       \n      \n      \n      \n       VALUES ?id { \n<http://model.geneontology.org/SYNGO_1940> \n<http://model.geneontology.org/cec18c47-fdc7-49e2-b984-f18ff6e879f8> \n<http://model.geneontology.org/3050ee6a-25b5-4589-9fe1-403433c0a70b> \n<http://model.geneontology.org/SYNGO_1943> \n<http://model.geneontology.org/0cb2a12e-36d5-4be9-838b-f3c52938768b> \n<http://model.geneontology.org/3c017263-064d-4ea4-9982-4bf5ad754a81> \n<http://model.geneontology.org/313091b5-f5be-4be4-b814-4b2cc462be74> \n<http://model.geneontology.org/3c977124-f610-4db7-bfa4-e04f0d505cf9> \n<http://model.geneontology.org/78e3156d-3d80-4ba2-8556-76c3b186dc5a> \n<http://model.geneontology.org/13942cf0-359b-4ec9-9091-9e67c23a353b> \n<http://model.geneontology.org/b6043995-b203-494c-8d84-883669765dd9> \n<http://model.geneontology.org/ec3ba64b-34ee-4f61-bcc1-99cd0ce252cc> \n<http://model.geneontology.org/3abb0e36-6ba2-4548-a37f-6f105407874e> \n<http://model.geneontology.org/db3f468e-ab8d-41df-8049-2151b14af94b> \n<http://model.geneontology.org/8d539789-349d-4d5f-8be9-9b761b499ae0> \n<http://model.geneontology.org/160a7be8-43f1-4b6b-9edd-116bee206837> \n<http://model.geneontology.org/7759d242-bb8d-4f83-8406-67f7770f7d60> \n<http://model.geneontology.org/ce10473b-df09-4744-8774-17545a78c446> \n<http://model.geneontology.org/9712a11d-60f1-4b41-a04a-4131b95e2176> \n<http://model.geneontology.org/b36dee6e-4b7c-460c-8c6e-197e3a321fe0> \n<http://model.geneontology.org/SYNGO_1931> \n<http://model.geneontology.org/SYNGO_1930> \n<http://model.geneontology.org/6acc8709-5e45-4792-8112-a90b9cc76b2e> \n<http://model.geneontology.org/1cb9ff3f-b2a5-44e0-b38d-2fcf68488046> \n<http://model.geneontology.org/SYNGO_1932> \n<

goodb added a commit that referenced this issue Mar 28, 2020
…lag - work on #290

Surprisingly, this doesn't seem to impact response time for #290.
It appears this may be related to #249, as I am intermittently seeing the same error for large responses:

SEVERE: An I/O error has occurred while writing a response message entity to the container output stream.
org.glassfish.jersey.server.internal.process.MappableException: com.google.gson.JsonIOException: org.eclipse.jetty.io.EofException
	at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:67)
	at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:139)

Notably, response time is very fast for the Master model collection (a couple thousand models) but starts lagging for the dev collection (tens of thousands).

goodb commented Mar 28, 2020

@lpalbou surprisingly the sparql information doesn't seem to be the root of the problem. I made that an optional parameter anyway, to clear up the response. Now you need to add &debug to see it.

Still investigating; it seems to be triggering another problem in the server that I've come across elsewhere.

Apart from this, I'm not opposed to adding the taxon information to the model metadata. But I think that should probably be done (if desired) as part of a small overhaul of all the desired model metadata, e.g. created/modified date, shex=valid, etc.

goodb commented Mar 28, 2020

@lpalbou notably, this is not an issue at all in the Master repo; scaling up to the dev collection triggers it.

goodb added a commit that referenced this issue Mar 29, 2020
This includes a method to update all models in a given journal with taxon information as metadata on the models.  It will need to be run on the input database for the taxa-related search features to work.  Models saved using this build will include the taxon data.  Notable performance improvements over last incarnation.

Batch update all taxon metadata:

minerva-cli.sh --add-taxon-metadata \
  -j blazegraph-gocam-db.jnl \
  -ontojournal /tmp/blazegraph.jnl

goodb commented Mar 30, 2020

@lpalbou the latest PR has things mostly set up as you wrote in the ticket. I couldn't get the first approach to work at the scale of the dev model collection, so I am now adding the taxon as an annotation on the models themselves (same level as the title); this seems to work much faster. It will take some coordination to switch over e.g. @dustine32's generator and to add the information to existing models, but it's not hard.
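
For illustration, a model-level taxon annotation added with the OWL API could look roughly like the sketch below; the annotation property IRI is a placeholder assumption, not necessarily the one minerva actually writes:

    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;

    // Hypothetical sketch: annotate a model ontology with its taxon, at the same level as dc:title.
    public class AddTaxonAnnotationSketch {
        public static void main(String[] args) throws OWLOntologyCreationException {
            OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
            OWLDataFactory df = manager.getOWLDataFactory();
            OWLOntology model = manager.createOntology(
                    IRI.create("http://model.geneontology.org/example-model"));

            // Placeholder property IRI; the real annotation property used by minerva may differ.
            OWLAnnotationProperty inTaxon = df.getOWLAnnotationProperty(
                    IRI.create("https://w3id.org/biolink/vocab/in_taxon"));
            OWLAnnotation taxonAnnotation = df.getOWLAnnotation(
                    inTaxon, IRI.create("http://purl.obolibrary.org/obo/NCBITaxon_10090"));

            // An ontology-level annotation, i.e. model metadata rather than data inside the model.
            manager.applyChange(new AddOntologyAnnotation(model, taxonAnnotation));
        }
    }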

lpalbou commented Apr 1, 2020

That's indeed what I was guessing... although your map (model -> taxon), updated by minerva on every model save, should have worked? I wonder if it could be related to #291. Anyhow, thanks; adding the taxon to the model directly would work too. I agree we'll have to revisit the metadata we want to include in the model.

goodb commented Apr 2, 2020

It did work, just not well at scale: it results in a very large query in the second step. Yes, the same pattern could be related to the slowdown when searching by ontology terms (e.g. MF with expand), as that again results in a very large query.

It may be more efficient to try a more direct integration of the ontology graph with the model graph. That would require more time to figure out, as I suspect a lot of things assume the model graph has nothing else in it.

The dev server appears to be working with the new model now, e.g. http://noctua-dev.berkeleybop.org:6800/search/?taxon=10090&limit=100000.
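
A quick client sketch against that route using the Java 11+ HttpClient; the URL and parameters are exactly those above, and the timing printout is only there to spot the kind of lag discussed earlier:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Minimal client for the search-by-taxon route on the dev server.
    public class SearchByTaxonClient {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://noctua-dev.berkeleybop.org:6800/search/?taxon=10090&limit=100000"))
                    .GET()
                    .build();

            long start = System.currentTimeMillis();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            long elapsed = System.currentTimeMillis() - start;

            // Report status, payload size and round-trip time.
            System.out.println("HTTP " + response.statusCode() + ", "
                    + response.body().length() + " chars in " + elapsed + " ms");
        }
    }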

goodb commented Apr 18, 2020

Code here is working. Will need to update the production and other incoming models with the taxon field.

goodb closed this as completed Apr 18, 2020