
Dealing with rapper's 2GB limitation

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

Problem: csv2rdf4lod outputs Turtle, and rapper cannot parse Turtle files larger than 2GB.

Recognizing that the file is too big

  • `find . -size +1900M` lists output files that rapper will fail to parse.
  • `stat -c "%s" FILE` (GNU/Linux) or `stat -f "%z" FILE` (BSD/OS X) prints a file's size in bytes, depending on the flavor of Unix; see the portable sketch after this list.
  • `du -sch *.ttl | tail -1` shows the total size of all Turtle files in the current directory.
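
For portability across the two stat flavors, a minimal sketch (the FILE path here is just a placeholder):

```bash
# Print the size of a file in bytes: try GNU stat first, fall back to BSD stat.
FILE="publish/example.ttl"   # placeholder path
size=$(stat -c '%s' "$FILE" 2>/dev/null || stat -f '%z' "$FILE")
echo "$FILE is $size bytes"
```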

Encapsulating the logic

Logic to determine whether a Turtle file is too big for rapper is in:

  • `bin/util/too-big-for-rapper.sh` (use this one)
  • `bin/util/rdf2nt.sh` (reproduces the logic to avoid csv2rdf4lod dependencies, so that the script can stand alone)
  • `bin/convert-aggregate.sh` (uses `stat -f` and `find publish -size +1900M`) SHOULD BE UPDATED TO USE too-big-for-rapper.sh
  • `bin/util/pvload.sh` (uses `find . -size +1900M`) SHOULD BE UPDATED TO USE too-big-for-rapper.sh
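
The real check lives in too-big-for-rapper.sh; as a rough sketch of the kind of logic these scripts duplicate (the threshold here is an assumption matching find's +1900M, and the real script may differ):

```bash
#!/bin/bash
# Sketch only: prints "yes" if the file exceeds ~1.9 GiB, else "no",
# mirroring the yes/no behavior of too-big-for-rapper.sh.
FILE="$1"
THRESHOLD=$((1900 * 1024 * 1024))   # 1900 MiB, matching find's +1900M
size=$(stat -c '%s' "$FILE" 2>/dev/null || stat -f '%z' "$FILE")
if [ "$size" -gt "$THRESHOLD" ]; then
   echo "yes"
else
   echo "no"
fi
```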

Dealing with a too-big file

  • `$CSV2RDF4LOD_HOME/bin/util/too-big-for-rapper.sh` will tell you "yes" or "no" (see the sketch after this list).

  • `$CSV2RDF4LOD_HOME/bin/split_ttl.pl` will take a list of files and split each into chunk-FILENAME-NNN.ttl pieces. It assumes that @prefix definitions are sprinkled every 1.9 GB or so (acceptable for csv2rdf4lod outputs, but does not work in general).

  • `$CSV2RDF4LOD_HOME/bin/util/bigttl2nt.sh` will print N-Triples to stdout. It makes the same assumption about @prefix placement.
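
Putting it together, a hedged sketch of a check-then-convert flow; the exact argument conventions of too-big-for-rapper.sh and bigttl2nt.sh are assumed here, not confirmed:

```bash
ttl="publish/dataset.ttl"   # placeholder path
# Assumes too-big-for-rapper.sh takes the file as an argument and prints yes/no.
if [ "$("$CSV2RDF4LOD_HOME/bin/util/too-big-for-rapper.sh" "$ttl")" = "yes" ]; then
   # Too big for rapper: stream to N-Triples instead.
   "$CSV2RDF4LOD_HOME/bin/util/bigttl2nt.sh" "$ttl" > dataset.nt
else
   rapper -i turtle -o ntriples "$ttl" > dataset.nt
fi
```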

And then there was serdi!

serdi does not have rapper's 2GB restriction. And it's fast, with "no" dependencies. serdi doesn't handle RDF/XML, though, so rapper is still in the game...

http://drobilla.net/software/serd/
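
For example, converting a huge Turtle file to N-Triples looks something like:

```bash
# serdi streams the input, so file size is not a problem.
serdi -i turtle -o ntriples big.ttl > big.nt
```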
