NTriplesParser optimization #35
Comments
Sounds great! It's up to @robstewart57 to decide, but I think it would be better if you just created a pull request for your changes - this way we can both see what's changed and discuss stuff in more detail in the comments (and reject your pull request, if it's not good enough 😉).
Hi @mgmeier I'm very excited to see the work you've put into this, thanks! I'm also snowed under for the next week or so, but after that I'll definitely look in detail at the changes your commits have made, and will provide feedback.
@mgmeier interesting approach. First things first. Prior to your commits:
Once you've followed these instructions: how many tests fail?
I've merged 76722cc. This does seem to have triggered some failing tests: 94 of the W3C RDF tests failed before, now 105 fail. Nevertheless, a 9000% speedup is too good to resist 👍 I have resisted merging the other two commits for now, as they introduce state in the GenParser used for parsing NTriples. I'm going to upload rdf4h-2.0.0 to Hackage, which represents a substantially more robust collection of RDF parsers and a slight change in the API in the names of the RDF graph instances. The important thing now is to try to squash the 105 failing tests out of the 784 W3C unit tests.
Nice!
I worked on various optimizations of the NTriplesParser. It's not ready for a PR yet, but if you're interested, you can check it out from the `optimize1` branch of my fork (https://github.com/mgmeier/rdf4h/tree/optimize1). There are two major optimizations:
Blank node label parsing
`oneOf`/`noneOf` parsers with large sets of characters are always a red flag with parsec. In particular, one `oneOf` parser had ranges spanning thousands of characters combined. The problem with these is that all of those values have to be constructed (by enumeration) for the parser to run; in the worst case multiple times, if the corresponding closure gets GC'ed. I replaced them with simple range checks: `if (c >= x && c <= y) || (c >= x' && c <= y') ...`
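For illustration, a range-check predicate along these lines can stand in for an enumerated `oneOf`; the character ranges below are made up for the sketch and are not the ones the NTriplesParser actually uses:

```haskell
-- Sketch only: membership is decided by direct range comparisons
-- instead of enumerating every allowed character into a list/set,
-- which is what `oneOf` forces the parser to do.
isLabelChar :: Char -> Bool
isLabelChar c =
     (c >= 'A' && c <= 'Z')
  || (c >= 'a' && c <= 'z')
  || (c >= '0' && c <= '9')
  || (c >= '\x00C0' && c <= '\x00D6')

main :: IO ()
main = mapM_ (print . isLabelChar) "aZ9\x00C5 ?"
```

With parsec, `satisfy isLabelChar` then replaces the enumerated `oneOf`, so no character list is ever materialized (or re-materialized after GC).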
This brought parsing a `.nt` file (~600,000 triples) with many blank node labels down from completely unusable to 45 seconds.

UNode parsing
I used a combination of different optimizations:

- Avoid constructing `Text` values prematurely, especially not thousands of singletons; at the same time, avoid conversions to and from `String` for running several validations. I had to duplicate some of the functionality in `Types.hs` for that.
- The `Triple` type is replaced by an intermediate type containing hash values; the triples are filled in only at the end of a parse.

These optimizations gave a speed-up of ~35% parsing an `.nt` file with loads of UNodes/URIs.

If you're interested in my approach and can verify that it does improve things, we could definitely discuss how to proceed from here (maybe IRC or mail), since what I suggested here is a proof of concept, which may need more cleanup and streamlining of the design with the rest of the library. Also, the optimizations imply some complex changes and are more than simple one-liners. However, I'd be happy to contribute to rdf4h, so let me know what you think.
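The intermediate-representation idea (carry lightweight placeholder triples during the parse and materialize full triples only at the end) could be sketched roughly as below. All names here (`NodeId`, `IdTriple`, `Interner`, `internNode`) are hypothetical, not the actual types in the `optimize1` branch, and the branch uses hash values where this sketch uses sequential interned ids for simplicity:

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical sketch: node text is interned to an Int id the first
-- time it is seen, triples carry only ids while parsing runs, and
-- real (string) triples are rebuilt in one pass at the end.

type NodeId = Int

data IdTriple = IdTriple NodeId NodeId NodeId
  deriving (Eq, Show)

data Interner = Interner
  { internMap :: M.Map String NodeId  -- node text -> id
  , nextId    :: NodeId
  }

emptyInterner :: Interner
emptyInterner = Interner M.empty 0

-- Look up a node's id, allocating a fresh one on first sight.
internNode :: String -> Interner -> (NodeId, Interner)
internNode s st@(Interner m n) =
  case M.lookup s m of
    Just i  -> (i, st)
    Nothing -> (n, Interner (M.insert s n m) (n + 1))

-- Final pass: turn id triples back into string triples.
resolve :: Interner -> IdTriple -> (String, String, String)
resolve (Interner m _) (IdTriple s p o) =
  let back  = M.fromList [(i, t) | (t, i) <- M.toList m]
      get i = M.findWithDefault "" i back
  in (get s, get p, get o)

main :: IO ()
main = do
  let (s1, st1) = internNode "ex:alice"   emptyInterner
      (p1, st2) = internNode "foaf:knows" st1
      (o1, st3) = internNode "ex:bob"     st2
      (s2, st4) = internNode "ex:alice"   st3  -- repeated node, same id
  print (s1 == s2)
  print (resolve st4 (IdTriple s1 p1 o1))
```

The payoff is that repeated node text (very common in `.nt` files) is compared and stored as a machine `Int` during the hot parsing loop, with the expensive text values built exactly once.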
BTW I have not touched any other parsers, the library's core types or any RDF representations.