
Pipelines currently failing due to solr disconnect behavior #71

Closed
kltm opened this issue Jan 9, 2019 · 5 comments

kltm commented Jan 9, 2019

Starting on 2018-12-29, the snapshot release (possibly earlier for NEO, see below) has consistently failed due to issues around Solr. Exception examples below.

Also see:
geneontology/neo#38 (comment)
geneontology/noctua#595

Current work is on alleviating the symptoms and tracing possible causes for their sudden onset.

kltm commented Jan 9, 2019

The most informative exception so far:

2019-01-08 09:26:31,469 INFO  (FlexSolrDocumentLoader:47) Processed 1000 flex ontology docs at 72000 and committing...

Exception in thread "main" org.apache.solr.common.SolrException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms

java.lang.RuntimeException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
	at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
	at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
	at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
	at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
	at org.apa


request: http://localhost:8080/solr/update?wt=javabin&version=2
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)
	at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)
	at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
	at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:69)
	at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
	at owltools.solrj.AbstractSolrLoader.addToServer(AbstractSolrLoader.java:166)
	at owltools.solrj.AbstractSolrLoader.incrementalAddAndCommit(AbstractSolrLoader.java:160)
	at owltools.solrj.FlexSolrDocumentLoader.load(FlexSolrDocumentLoader.java:48)
	at owltools.cli.SolrCommandRunner.flexLoadOntologySolr(SolrCommandRunner.java:325)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:4779)
	at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:76)
	at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:68)
	at owltools.cli.CommandLineInterface.main(CommandLineInterface.java:12)
Check that results have been stored properly

The vast majority are less informative, like:

2019-01-01 01:47:34,641 INFO  (FlexSolrDocumentLoader:47) Processed 1000 flex ontology docs at 211000 and committing...
Exception in thread "main" org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Broken pipe (Write failed)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:475)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:69)
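Not something tried in this thread, but a generic mitigation for transient disconnects during a long bulk load is to retry each incremental commit with backoff instead of letting the whole load die. A minimal self-contained sketch, where the retry helper and its parameters are hypothetical and a stubbed Callable stands in for owltools' incrementalAddAndCommit:

```java
import java.util.concurrent.Callable;

public class RetryingCommit {

    // Retry an operation up to maxAttempts times with linear backoff.
    // The Callable stands in for a SolrJ add-and-commit batch, which in
    // the traces above dies on idle-timeout / broken-pipe errors.
    static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;                          // remember the failure
                Thread.sleep(backoffMs * attempt); // back off before retrying
            }
        }
        throw last; // all attempts failed; surface the last error
    }

    public static void main(String[] args) throws Exception {
        // Simulate two transient "Broken pipe" failures before success.
        int[] failures = {2};
        String result = withRetries(() -> {
            if (failures[0]-- > 0) {
                throw new java.io.IOException("Broken pipe (Write failed)");
            }
            return "committed";
        }, 5, 10);
        System.out.println(result); // prints "committed"
    }
}
```

Retrying only papers over the symptom, of course; it buys a failing pipeline time while the underlying timeout cause is traced.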

kltm commented Jan 9, 2019

Note that most changes to geneontology/pipeline and geneontology/operations over the last week have been related to this issue.

kltm commented Jan 11, 2019

The original logic for the new way of dealing with memory comes from: http://lucene.472066.n3.nabble.com/java-util-concurrent-TimeoutException-Idle-timeout-expired-50001-50000-ms-td4321209.html
Tying it back to the XML parsing comes from playing with settings while looking at NEO and the error messages above.
That said, this is a weak model that doesn't make a lot of sense. Even though the issue seems to have stopped occurring, so many things were updated in that window (docker base image to 18.04, java initial memory allocation, java entity expansion), on top of the possibility that the ontology itself may have become "fixed", that I think "memory-hungry water sprites" has equal predictive power.
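For reference, the JVM-side knobs of the kind mentioned above can be inspected or set at runtime. A small illustrative sketch, assuming JDK 8+ where `jdk.xml.entityExpansionLimit` is the standard JAXP property name; the value here is illustrative, not the pipeline's actual setting:

```java
public class XmlLimits {
    public static void main(String[] args) {
        // Heap sizing is a launch-time flag (-Xms / -Xmx), but the XML
        // entity-expansion ceiling is an ordinary system property.
        // 0 removes the JDK's default limit of 64000 expansions.
        System.setProperty("jdk.xml.entityExpansionLimit", "0");
        System.out.println(System.getProperty("jdk.xml.entityExpansionLimit")); // prints "0"
    }
}
```

Equivalently this can be passed on the command line as `-Djdk.xml.entityExpansionLimit=0`, which is easier to audit in a pipeline's launch scripts than a runtime call.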

cmungall commented:

Thanks! I have 98-99% confidence that there has been no variability in either ontology size or factors that would likely have any noticeable effect, if that helps future debugging.

kltm commented Jan 12, 2019

I'll explain a little more for future historians.

I'm also pretty sure that nothing has changed on those systems or within the docker image that the pipeline loader is part of and running in. The only really free variables are a silent host OS update or a data change. I could spend a little more time on the former, but I think I've hit diminishing returns for sleuthing and can't really think of a mechanism, and you're pretty sure about the latter.
So, we have a case where nothing looks like it was changed, yet on a certain date everything went screwy for NEO (no docker image, running fairly manual commands) and soon after for (at least) the ontology load in the snapshot pipeline (entirely ensconced in a docker image). My best guess is that some small change in the data nudged a marginal setup into a terminal one, possibly related to Java memory handling. But I can prove little, as running the pipelines takes so long that trying to do controlled recreations would be a slog.

Anyways, if things are fine for the next few days I'll be pretty happy with the outcome. I don't think there was much change in the data--the indices are almost exactly the same--but something happened...
