blazegraph struggles under load #187
Comments
ERROR WITH JOB: blob-6603611803232263074
Traceback (most recent call last):
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/rq/worker.py", line 700, in perform_job
    rv = job.perform()
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/rq/job.py", line 500, in perform
    self._result = self.func(*self.args, **self.kwargs)
  File "./modules/blazeUploader/reserve_id.py", line 138, in write_reserve_id
    spfyid = reserve_id(query_file)
  File "./modules/blazeUploader/reserve_id.py", line 121, in reserve_id
    largest = check_largest_spfyid()
  File "./modules/blazeUploader/reserve_id.py", line 62, in check_largest_spfyid
    results = sparql.query().convert()
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/SPARQLWrapper/Wrapper.py", line 567, in query
    return QueryResult(self._query())
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/SPARQLWrapper/Wrapper.py", line 537, in _query
    response = urlopener(request)
  File "/opt/conda/envs/backend/lib/python2.7/urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/envs/backend/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/opt/conda/envs/backend/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/opt/conda/envs/backend/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/opt/conda/envs/backend/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/opt/conda/envs/backend/lib/python2.7/urllib2.py", line 1201, in do_open
    r = h.getresponse(buffering=True)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/raven/breadcrumbs.py", line 346, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/opt/conda/envs/backend/lib/python2.7/httplib.py", line 1121, in getresponse
    response.begin()
  File "/opt/conda/envs/backend/lib/python2.7/httplib.py", line 438, in begin
    version, status, reason = self._read_status()
  File "/opt/conda/envs/backend/lib/python2.7/httplib.py", line 394, in _read_status
    line = self.fp.readline(_MAXLINE + 1)
  File "/opt/conda/envs/backend/lib/python2.7/socket.py", line 480, in readline
    data = self._sock.recv(self._rbufsize)
  File "/opt/conda/envs/backend/lib/python2.7/site-packages/rq/timeouts.py", line 51, in handle_death_penalty
    'value ({0} seconds)'.format(self._timeout))
JobTimeoutException: Job exceeded maximum timeout value (180 seconds) |
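For reference, the call that times out is a plain SPARQLWrapper round-trip; roughly the sketch below (the endpoint URL, predicate and MAX aggregate are placeholders inferred from the function name, not copied from reserve_id.py):
# Sketch of the kind of round-trip check_largest_spfyid() makes; the endpoint
# URL and the spfyId predicate below are placeholders, not the real ones.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('http://localhost:9999/blazegraph/sparql')
sparql.setQuery('''
    SELECT (MAX(?id) AS ?largest)
    WHERE { ?s <https://example.org/spfyId> ?id . }
''')
sparql.setReturnFormat(JSON)
# This blocks until Blazegraph answers; under load it outlives the 180 s
# RQ job timeout and the worker raises JobTimeoutException as above.
results = sparql.query().convert()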
It looks like our earlier concern about blazegraph not freeing up memory properly is also seen by other users https://www.linkedin.com/pulse/blazegraph-doesnt-do-windows-paul-houle |
Hoping this issue is only connected to speed of disks on |
To address Could do this via |
May also want to bump up the default timeout in any task that queries blazegraph; could do this by setting it when declaring the
Also, have to decide whether to address this in the production branches of spfy or in master, as this issue is localized to |
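A possible sketch of the timeout bump with RQ (queue name, the 600 s value and the query-file path are illustrative; older RQ takes timeout= on enqueue, newer releases take job_timeout=):
# Sketch only: raise the per-job timeout for anything that touches Blazegraph.
# Queue name, timeout value and file path are illustrative.
from redis import Redis
from rq import Queue
from modules.blazeUploader.reserve_id import write_reserve_id

redis_conn = Redis()
# Higher default for every job placed on this queue...
blaze_q = Queue('blazegraph', connection=redis_conn, default_timeout=600)

# ...or override per job at enqueue time (keyword is `timeout` in the RQ
# versions contemporary with this stack, `job_timeout` in newer releases).
query_file = '/tmp/reserve.rq'  # illustrative path for the query file argument
job = blaze_q.enqueue(write_reserve_id, query_file, timeout=600)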
Had to restart the blazegraph instance today: was hitting timeouts on sparql queries when submitting new subtyping tasks, though the db status option was working. |
As reported by @chadlaing, one of our collaborators recently started to get errors. Looks like the errors are linked to this issue.
|
Perhaps related to the main issue: https://jira.blazegraph.com/browse/BLZG-9003 |
Perhaps relevant: https://jira.blazegraph.com/browse/BLZG-9058 |
Looks like the issue has been solved as of https://github.com/superphy/backend/releases/tag/v5.0.3 with a caveat. Some notes from Slack:
The caveat is that the change to upload graphs via a queue needs to be modified to serialize the graph files, as we've run out of RAM (24 GB) on the VM. The other option would be to allow Redis DB to store temp data to disk. Will verify the inferencing setup and make this change over the weekend. |
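A rough sketch of the serialization idea, assuming the graphs are rdflib Graph objects and that the worker task can take a file path (the temp dir, the turtle format and the upload_graph_file helper are all illustrative):
# Sketch: enqueue a file path instead of an in-memory graph so Redis (and the
# worker) never hold the whole graph in memory.
import os
import tempfile
from rdflib import Graph

def enqueue_graph_upload(graph, queue):
    # Serialize to disk so the queue only stores a short file path.
    fd, path = tempfile.mkstemp(suffix='.ttl', dir='/tmp')
    os.close(fd)
    graph.serialize(destination=path, format='turtle')
    # upload_graph_file is a hypothetical worker task that reads the file,
    # pushes it to Blazegraph, then deletes it.
    return queue.enqueue(upload_graph_file, path)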
Looks like it didn't load the properties file correctly. Will have to debug. |
Must've started blazegraph (& created a bigdata.jnl file) before making proper changes to the defaults. Inferencing is working now. Here's a defaults file:
JETTY_HOME=/Warehouse/Users/claing/jetty
JETTY_USER=claing
JETTY_PORT=9999
JETTY_HOST=192.168.0.1
JETTY_LOGS=/Warehouse/Users/claing/jetty/logs/
JAVA_OPTIONS="-Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=/Warehouse/Users/claing/RWStore.properties"
What the status should look like:
[claing@superphy claing]$ service jetty status
** WARNING: JETTY_LOGS is Deprecated. Please configure logging within the jetty base.
Jetty running pid=22135
JAVA = /bin/java
JAVA_OPTIONS = -Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=/Warehouse/Users/claing/RWStore.properties -Djetty.home=/Warehouse/Users/claing/jetty -Djetty.base=/Warehouse/Users/claing/jetty -Djava.io.tmpdir=/tmp
JETTY_HOME = /Warehouse/Users/claing/jetty
JETTY_BASE = /Warehouse/Users/claing/jetty
START_D = /Warehouse/Users/claing/jetty/start.d
START_INI = /Warehouse/Users/claing/jetty/start.ini
JETTY_START = /Warehouse/Users/claing/jetty/start.jar
JETTY_CONF = /Warehouse/Users/claing/jetty/etc/jetty.conf
JETTY_ARGS = jetty.state=/Warehouse/Users/claing/jetty/jetty.state jetty-started.xml
JETTY_RUN = /Warehouse/Users/claing/jetty/jetty
JETTY_PID = /Warehouse/Users/claing/jetty/jetty/jetty.pid
JETTY_START_LOG= /Warehouse/Users/claing/jetty/jetty/jetty-start.log
JETTY_STATE = /Warehouse/Users/claing/jetty/jetty.state
RUN_CMD = /bin/java -Dcom.bigdata.rdf.sail.webapp.ConfigParams.propertyFile=/Warehouse/Users/claing/RWStore.properties -Djetty.home=/Warehouse/Users/claing/jetty -Djetty.base=/Warehouse/Users/claing/jetty -Djava.io.tmpdir=/tmp -jar /Warehouse/Users/claing/jetty/start.jar jetty.state=/Warehouse/Users/claing/jetty/jetty.state jetty-started.xml
Bulk loading time! |
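As a side note, pushing a single RDF file into Blazegraph over the REST endpoint looks roughly like the sketch below (endpoint URL and content type are assumptions based on a default NanoSparqlServer install, not the chunk loader actually used here; Python 2.7 to match the backend):
# Sketch: POST one turtle file straight to the Blazegraph SPARQL endpoint.
# URL and content type are assumptions; adjust for the actual deployment.
import urllib2

def load_turtle(path, endpoint='http://localhost:9999/blazegraph/sparql'):
    with open(path, 'rb') as fh:
        data = fh.read()
    req = urllib2.Request(endpoint, data,
                          {'Content-Type': 'application/x-turtle'})
    # Blazegraph answers with a small HTML body containing the mutationCount.
    return urllib2.urlopen(req).read()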
|
We have some IO optimization already implemented, as @chadlaing had suggested: https://wiki.blazegraph.com/wiki/index.php/IOOptimization https://github.com/superphy/docker-blazegraph/blob/master/2.1.4-inferencing/RWStore.properties
We're also currently using |
Suppressing truth maintenance via the REST API doesn't seem to work without a query parameter:
|
Not sure what this just gave me: https://gist.github.com/kevinkle/45268c4c84a996c2c5dd3d759c326077 |
Maybe this is only a one-time disable thing??? Looks like there's also some SPARQL UPDATE command for this https://wiki.blazegraph.com/wiki/index.php/SPARQL_Update#Manage_truth_maintenance_in_SPARQL_UPDATE |
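If the SPARQL UPDATE route works, sending those commands from Python would look roughly like this (the endpoint URL is an assumption; the command names are the ones documented on the wiki page above; Python 2.7 urllib2 to match the backend):
# Sketch: send the wiki's truth-maintenance commands over the SPARQL 1.1
# update protocol.
import urllib
import urllib2

ENDPOINT = 'http://localhost:9999/blazegraph/sparql'

def sparql_update(command):
    data = urllib.urlencode({'update': command})
    return urllib2.urlopen(urllib2.Request(ENDPOINT, data)).read()

sparql_update('DISABLE ENTAILMENTS;')  # suspend truth maintenance before a bulk load
# ... bulk load here ...
sparql_update('ENABLE ENTAILMENTS;')   # turn it back on
sparql_update('CREATE ENTAILMENTS;')   # recompute the full closure once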
re: #187 (comment) Looks like we're still generating timeouts at the same rate. Will have to check how this can be set. |
Nothing good old DevTools + cURL can't get around:
Response: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8"><title>blazegraph™ by SYSTAP</title
></head
><body<p>totalElapsed=215ms, elapsed=13ms, connFlush=0ms, batchResolve=0, whereClause=0ms, deleteClause=0ms, insertClause=0ms</p
><hr><p>COMMIT: totalElapsed=1815ms, commitTime=1512928665654, mutationCount=0</p
></html
> |
Queuing up another batch for testing:
[claing@superphy ~]$ cd /docker/
[claing@superphy docker]$ sudo mkdir chunk4
[sudo] password for claing:
[claing@superphy docker]$ sudo chown -R claing:docker chunk4
[claing@superphy docker]$ cd /opt/chunk-pickles/
[claing@superphy chunk-pickles]$ screen
[claing@superphy chunk-pickles]$ python chunk.py -c batch_6310_9465.p -d /docker/chunk4/ |
Loaded this batch:
# nginx has to be stopped before touching the docker-compose
# think it has something to do with the way routing to the containers works
[claing@superphy backend-4.4.0]$ sudo systemctl stop nginx
# change mapping of volume
[claing@superphy backend-4.4.0]$ vim docker-compose.yml
# delete the redis persistence file
[claing@superphy backend-4.4.0]$ cd /docker/
[claing@superphy redis]$ sudo rm appendonly.aof
# bring everything back up
[claing@superphy redis]$ cd /opt/backend-4.4.0/
[claing@superphy backend-4.4.0]$ docker-compose up -d
[claing@superphy backend-4.4.0]$ sudo systemctl start nginx |
Going to work on tuning Blazegraph (likely via branching factors) on a local box while the above is testing on corefacility. |
Checking the branching factors (newM is the recommended value):
ubuntu@host-10-1-5-81:~$ curl -d 'dumpPages' -d 'dumpJournal' 'http://localhost:8080/bigdata/status'
name indexType m height nnodes nleaves nentries nrawRecs nerrors nodeBytes leafBytes rawRecBytes totalBytes avgNodeBytes avgLeafBytes avgRawRecBytes minNodeBytes maxNodeBytes minLeafBytes maxLeafBytes 64 128 192 320 512 768 1024 2048 3072 4096 8192 blobs newM curM
__globalRowStore BTree 32 1 1 3 66 0 0 139 6404 0 6543 139 2134 0139 139 971 3974 0.0 0.0 0.25 0.0 0.0 0.0 0.25 0.25 0.0 0.25 0.0 0.0 803 32
kb.lex.BLOBS BTree 400 1 1 105 27507 27507 0 1419 536517 391801311 392339247 1419 5109 14243 1419 1419 3950 7724 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.009433962264150943 0.0 0.0660377358490566 0.9245283018867925 0.0 1180 400
kb.lex.ID2TERM BTree 400 2 4 679 135940 126070 0 9186 2285517 11773580 14068283 2296 3366 93 84 3732 3265 5071 0.0 0.0014641288433382138 0.0 0.0 0.0 0.0 0.0 0.0 0.0029282576866764276 0.9941434846266471 0.0014641288433382138 0.0 959 400
kb.lex.TERM2ID BTree 400 2 3 566 135940 0 0 19549 7568016 0 7587565 6516 13371 0 163 10061 6319 47559 0.0 0.0 0.0017574692442882249 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1265377855887522 0.8717047451669596 299 400
kb.spo.JUST BTree 1024 2 12 7438 4045350 0 0 289345 202792383 0 203081728 24112 27264 0 520 36563 8862 71349 0.0 0.0 0.0 0.0 0.0 1.3422818791946307E-4 0.0 0.0 0.0 0.0 0.0 0.9998657718120806 262 1024
kb.spo.OSP BTree 1024 2 10 5637 3354502 0 0 100552 47560227 0 47660779 10055 8437 0 231 15766 3871 24212 0.0 0.0 0.0 1.7708517797060386E-4 0.0 0.0 0.0 0.0 0.0 0.21073136178501858 0.21692934301398972 0.572162210023021 731 1024
kb.spo.POS BTree 1024 2 9 5552 3354502 0 0 100029 27106902 0 27206931 11114 4882 0 218 18414 2818 16447 0.0 0.0 0.0 1.798237727027513E-4 0.0 0.0 0.0 0.0 0.2138104657435713 0.24887610142060781 0.481568063297968 0.05556554576515015 988 1024
kb.spo.SPO BTree 1024 2 11 6056 3354502 0 0 123800 31663320 0 31787120 11254 5228 0 273 20896 3423 10633 0.0 0.0 0.0 1.6482610845557937E-4 0.0 0.0 0.0 0.0 0.0 0.21279050601615296 0.7667710565353552 0.020273611340036263 939 1024
from corefacility:
curl 'http://192.168.0.1:8080/blazegraph/status?dumpJournal&dumpPages' -H 'Pragma: no-cache' -H 'Origin: http://192.168.5.19:8080' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: en-US,en;q=0.9,la;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: */*' -H 'Cache-Control: no-cache' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'Referer: http://192.168.0.1:8080/blazegraph/' -H 'DNT: 1' --compressed
name indexType m height nnodes nleaves nentries nrawRecs nerrors nodeBytes leafBytes rawRecBytes totalBytes avgNodeBytes avgLeafBytes avgRawRecBytes minNodeBytes maxNodeBytes minLeafBytes maxLeafBytes 64 128 192 320 512 768 1024 2048 3072 4096 8192 blobs newM curM
__globalRowStore BTree 32 1 1 3 66 0 0 139 6404 0 6543 139 2134 0139 139 971 3974 0.0 0.0 0.25 0.0 0.0 0.0 0.25 0.25 0.0 0.25 0.0 0.0 803 32
kb.lex.BLOBS BTree 400 2 17 3845 1036630 1036630 0 51306 19495651 27581346105 27600893062 3018 5070 26606 269 3601 3768 7506 0.0 0.0 0.0 2.589331952356292E-4 0.0 0.0 0.0 0.0 0.0010357327809425167 0.10227861211807354 0.8964267219057483 0.0 692 400
kb.lex.ID2TERM BTree 400 2 128 25728 5145622 5139450 0 332519 84057361 516142833 600532713 2597 3267 100 2010 4388 3264 3737 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.8675742574257426E-5 0.004834467821782178 0.9950881806930693 3.8675742574257426E-5 0.0 905 400
kb.lex.TERM2ID BTree 400 2 78 19912 5145622 0 0 898267 325617776 0 326516043 11516 16352 0 3810 22772 3604 130533 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 9.504752376188094E-4 0.006053026513256629 0.9929964982491246 193 400
kb.spo.JUST BTree 1024 2 340 188967 103902351 0 0 6873183 5322237409 0 5329110592 20215 28164 0 7789 40163 8862 74882 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.1657941861632165E-4 0.9997834205813837 284 1024
kb.spo.OSP BTree 1024 2 251 143414 86535527 0 0 2624942 1239165246 0 1241790188 10457 8640 0 5477 19697 3764 23826 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.17524101207670623 0.23344586364110953 0.5913131242821843 708 1024
kb.spo.POS BTree 1024 2 252 145839 86535527 0 0 2648135 728544112 0 731192247 10508 4995 0 5450 21343 2917 15590 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.17228987411955562 0.25579946745521626 0.5135018584307042 0.05840879999452396 990 1024
kb.spo.SPO BTree 1024 2 266 155958 86535527 0 0 3588811 873139445 0 876728256 13491 5598 0 7388 22900 3423 11171 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.21875 0.7392846169602622 0.04196538303973781 847 1024
Raw data is at: superphy/docker-blazegraph@c1a13b4
New branchingFactors:
# Set the branching factor for "__globalRowStore" to the specified value.
com.bigdata.namespace.__globalRowStore.com.bigdata.btree.BTree.branchingFactor=32
# Set the branching factor for "kb.lex.BLOBS" to the specified value.
com.bigdata.namespace.kb.lex.BLOBS.com.bigdata.btree.BTree.branchingFactor=692
# Set the branching factor for "kb.lex.ID2TERM" to the specified value.
com.bigdata.namespace.kb.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=905
# Set the branching factor for "kb.lex.TERM2ID" to the specified value.
com.bigdata.namespace.kb.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=193
# Set the branching factor for "kb.spo.JUST" to the specified value.
com.bigdata.namespace.kb.spo.JUST.com.bigdata.btree.BTree.branchingFactor=284
# Set the branching factor for "kb.spo.OSP" to the specified value.
com.bigdata.namespace.kb.spo.OSP.com.bigdata.btree.BTree.branchingFactor=708
# Set the branching factor for "kb.spo.POS" to the specified value.
com.bigdata.namespace.kb.spo.POS.com.bigdata.btree.BTree.branchingFactor=990
# Set the branching factor for "kb.spo.SPO" to the specified value.
com.bigdata.namespace.kb.spo.SPO.com.bigdata.btree.BTree.branchingFactor=1024
The new factors vs the current ones:
Tangential relation blazegraph/database#9 |
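Side note: the branchingFactor overrides above can be generated straight from the dumpPages output; a small sketch assuming the whitespace-separated layout shown earlier (name is the first column, newM the second-to-last):
# Sketch: derive branchingFactor overrides from the dumpPages table above.
def branching_factor_overrides(dump_text):
    out = []
    for row in dump_text.splitlines():
        fields = row.split()
        if len(fields) < 3 or fields[1] != 'BTree':
            continue  # skip the header row and any non-index lines
        name, new_m = fields[0], fields[-2]
        out.append('com.bigdata.namespace.%s.com.bigdata.btree.BTree.branchingFactor=%s'
                   % (name, new_m))
    return '\n'.join(out)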
Looks like no luck with the |
Swapped in the new branching factors and loaded a test set now. Also configured the |
|
The first 3155 genomes loaded fine, but this is as before. Loading in the second set. |
Looks like the errors we still see trace back to two things:
Before, our assumption was that we had too much concurrency for blazegraph to handle (uploads were handled by the task worker that completed the job), at least while it ran on our LFS. At the moment, we only have 2 concurrent processes which interface with blazegraph (above). An optimistic outlook would be that moving
Will make the changes to the uploading and see. A follow-up would be to cache some info on identifiers (or whatever we can) in redis and not go back and forth with blazegraph queries as much. |
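A rough sketch of that redis caching follow-up (key prefix, TTL and the query_blazegraph_for_spfyid helper are illustrative names, not existing code):
# Sketch: read-through Redis cache in front of a Blazegraph identifier lookup.
from redis import StrictRedis

r = StrictRedis()

def cached_spfyid(genome_uri, ttl=3600):
    key = 'spfyid:' + genome_uri
    hit = r.get(key)
    if hit is not None:
        return hit
    value = query_blazegraph_for_spfyid(genome_uri)  # hypothetical SPARQL helper
    r.set(key, value, ex=ttl)  # cache it so repeat requests skip Blazegraph
    return value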
|
We've now offshored the indexing of spfyids into MongoDB, as well as the storage of the largest current spfyid. Blazegraph retrieval will still be needed for tasks like phylotyper or group comparisons, but this should solve ~90% of the issues. PR: #284 |
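For reference, the usual MongoDB pattern for this is an atomic $inc counter, roughly as below (database, collection and field names are illustrative, not necessarily what the PR uses):
# Sketch: reserve the next spfyid from an atomic counter in MongoDB.
from pymongo import MongoClient, ReturnDocument

counters = MongoClient('mongodb://localhost:27017/').spfy.counters

def reserve_spfyid():
    # find_one_and_update with $inc is atomic, so concurrent workers can
    # never be handed the same id.
    doc = counters.find_one_and_update(
        {'_id': 'spfyid'},
        {'$inc': {'seq': 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc['seq']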