
NTriplesParser optimization #35

Closed
mgmeier opened this issue Feb 22, 2016 · 5 comments

@mgmeier (Contributor) commented Feb 22, 2016:

I worked on various optimizations of the NTriplesParser. It's not ready for a PR yet, but if you're interested, you can check it out from the optimize1 branch of my fork (https://github.com/mgmeier/rdf4h/tree/optimize1).

There are two major optimizations:

  1. Blank node label parsing
    oneOf / noneOf parsers with large sets of characters are always a red flag with Parsec. In particular, one oneOf parser covered ranges spanning thousands of characters combined. The problem with these is that all of those character values have to be constructed (by enumeration) for the parser to run, and in the worst case multiple times, if the corresponding closure gets GC'ed. I replaced them with simple range checks: if (c >= x && c <= y) || (c >= x' && c <= y') ... (see the first sketch after this list).
    This brought parsing of a .nt file (~600,000 triples) with many blank node labels down from completely unusable to 45 seconds.

  2. UNode parsing
    I used a combination of different optimizations (a sketch of the parser-state approach follows this list):

    • Don't construct Text values prematurely, and especially not thousands of singletons; at the same time, avoid conversions to and from String when running the various validations. I had to duplicate some of the functionality in Types.hs for that.
    • Avoid constructing the same string values over and over. For that, I use a map from hash value to string as parser state; only previously unencountered strings are memorized. The Triple type is replaced by an intermediate type containing hash values, and the actual triples are filled in only at the end of a parse.
    • As an extension of the former, skip the well-formedness check for a URI whose hash value is already in the parser state (meaning it has been validated previously). Validation is a somewhat expensive operation, and we only need it once per URI.

    These optimizations gave a speed-up of ~35% when parsing an .nt file with loads of UNodes/URIs.
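
For illustration, here is a minimal sketch of the range-check idea; it is not the actual code from the optimize1 branch, and the predicate name and the character ranges shown are only an approximation of the blank node label grammar:

```haskell
import Text.Parsec
import Text.Parsec.Text (Parser)

-- Hypothetical predicate: a plain range check instead of
-- oneOf (['A'..'Z'] ++ ['a'..'z'] ++ ['\x00C0'..'\x00D6'] ++ ...),
-- so no large character lists ever have to be enumerated.
isLabelChar :: Char -> Bool
isLabelChar c =
     (c >= 'A' && c <= 'Z')
  || (c >= 'a' && c <= 'z')
  || (c >= '0' && c <= '9')
  || (c >= '\x00C0' && c <= '\x00D6')
  || (c >= '\x00D8' && c <= '\x00F6')

-- satisfy applies the predicate directly to the next input character.
labelChar :: Parser Char
labelChar = satisfy isLabelChar
```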
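And a minimal sketch of the parser-state idea from the second point, again with illustrative names (Seen, internUri, validateUri) rather than the branch's actual API; it assumes Parsec's user state and the hashable package:

```haskell
import qualified Data.Map.Strict as Map
import           Data.Text (Text)
import qualified Data.Text as T
import           Data.Hashable (hash)
import           Text.Parsec

-- User state: strings that have already been seen (and validated),
-- keyed by their hash value.
type Seen = Map.Map Int Text
type HashedParser a = Parsec Text Seen a

-- Stand-in for the URI well-formedness check duplicated from Types.hs.
validateUri :: Text -> HashedParser ()
validateUri t
  | T.null t  = parserFail "empty URI"
  | otherwise = return ()

-- Return the hash of a URI; validate and memorize it only on first sight,
-- so the expensive check runs once per distinct URI.
internUri :: Text -> HashedParser Int
internUri t = do
  let h = hash t
  seen <- getState
  case Map.lookup h seen of
    Just _  -> return h
    Nothing -> do
      validateUri t
      putState (Map.insert h t seen)
      return h
```

The intermediate triples would then carry these Int hashes, and the final Triple values would be reconstructed from the map once the parse has finished.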

If you're interested in my approach and can verify that it does improve things, we could definitely discuss how to proceed from here (maybe via IRC or mail), since what I've suggested here is a proof of concept, which may need more cleanup and streamlining to align the design with the rest of the library. Also, the optimizations imply some fairly involved changes rather than simple one-liners. However, I'd be happy to contribute to rdf4h, so let me know what you think.

BTW, I have not touched any other parsers, the library's core types, or any RDF representations.

@cordawyn (Collaborator) commented:

Sounds great! It's up to @robstewart57 to decide, but I think it would be better if you just created a pull request for your changes; that way we can both see what's changed and discuss things in more detail in the comments (and reject the pull request if it's not good enough 😉).

@robstewart57 (Owner) commented:

Hi @mgmeier, I'm very excited to see the work you've put into this, thanks! I'm also snowed under for the next week or so, but after that I'll definitely look at your commits in detail and provide feedback.

@robstewart57 (Owner) commented:

@mgmeier, interesting approach. First things first. Prior to your commits:

91 out of 781 tests failed (38.00s)

Once you've followed these instructions:
https://github.com/robstewart57/rdf4h#running-tests

How many tests fail?

@robstewart57 (Owner) commented:

I've merged 76722cc.

This does seem to have triggered some failing tests: 94 RDF W3C tests failed before, now 105 fail. Nevertheless, a 9000% speedup is too good to resist 👍

I have resisted merging the other two commits for now, as they introduce state into the GenParser used for parsing NTriples.

I'm going to upload rdf4h-2.0.0 to Hackage, which represents a substantially more robust collection of RDF parsers and a slight change in the API in the names of the RDF graph instances.

The important thing now is to try to squash the 105 failures out of the 784 W3C unit tests.

@cordawyn (Collaborator) commented:

Nice!
