
NTriplesParser optimization #35

Closed
mgmeier opened this issue Feb 22, 2016 · 5 comments

@mgmeier (Contributor) commented Feb 22, 2016:

I worked on various optimizations of the NTriplesParser. It's not ready for a PR yet, but if you're interested, you can check it out from the optimize1 branch of my fork (https://github.com/mgmeier/rdf4h/tree/optimize1).

There are two major optimizations:

  1. Blank node label parsing
    oneOf / noneOf parsers with large sets of characters are always a red flag with Parsec. In particular, one oneOf parser covered ranges spanning thousands of characters combined. The problem with these is that all of those character values have to be constructed (by enumeration) for the parser to run, and in the worst case multiple times, if the corresponding closure gets GC'ed. I replaced them with simple range checks: if (c >= x && c <= y) || (c >= x' && c <= y') ... (see the first sketch after this list).
    This brought parsing of a .nt file (~600,000 triples) with many blank node labels down from completely unusable to 45 seconds.

  2. UNode parsing
    I used a combination of different optimizations (a sketch of the parser-state approach follows this list):

    • Don't construct Text values prematurely, and especially not thousands of singletons; at the same time, avoid conversions to and from String when running the various validations. I had to duplicate some of the functionality in Types.hs for that.
    • Avoid constructing the same string values over and over. For that, I use a map from hash value to string as parser state; only previously unencountered strings are memorized. The Triple type is replaced by an intermediate type containing hash values, and the actual triples are filled in only at the end of a parse.
    • As an extension of the former, skip the well-formedness check for a URI whose hash value is already in the parser state (meaning it has been validated previously). Validation is a somewhat expensive operation, and we only need it once per URI.

    These optimizations gave a speed-up of ~35% when parsing an .nt file with loads of UNodes/URIs.
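
For illustration, here is a minimal sketch of the range-check idea; it is not the actual code from the optimize1 branch, and the predicate name and the character ranges shown are only an approximation of the blank node label grammar:

```haskell
import Text.Parsec
import Text.Parsec.Text (Parser)

-- Hypothetical predicate: a plain range check instead of
-- oneOf (['A'..'Z'] ++ ['a'..'z'] ++ ['\x00C0'..'\x00D6'] ++ ...),
-- so no large character lists ever have to be enumerated.
isLabelChar :: Char -> Bool
isLabelChar c =
     (c >= 'A' && c <= 'Z')
  || (c >= 'a' && c <= 'z')
  || (c >= '0' && c <= '9')
  || (c >= '\x00C0' && c <= '\x00D6')
  || (c >= '\x00D8' && c <= '\x00F6')

-- satisfy applies the predicate directly to the next input character.
labelChar :: Parser Char
labelChar = satisfy isLabelChar
```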
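And a minimal sketch of the parser-state idea from the second point, again with illustrative names (Seen, internUri, validateUri) rather than the branch's actual API; it assumes Parsec's user state and the hashable package:

```haskell
import qualified Data.Map.Strict as Map
import           Data.Text (Text)
import qualified Data.Text as T
import           Data.Hashable (hash)
import           Text.Parsec

-- User state: strings that have already been seen (and validated),
-- keyed by their hash value.
type Seen = Map.Map Int Text
type HashedParser a = Parsec Text Seen a

-- Stand-in for the URI well-formedness check duplicated from Types.hs.
validateUri :: Text -> HashedParser ()
validateUri t
  | T.null t  = parserFail "empty URI"
  | otherwise = return ()

-- Return the hash of a URI; validate and memorize it only on first sight,
-- so the expensive check runs once per distinct URI.
internUri :: Text -> HashedParser Int
internUri t = do
  let h = hash t
  seen <- getState
  case Map.lookup h seen of
    Just _  -> return h
    Nothing -> do
      validateUri t
      putState (Map.insert h t seen)
      return h
```

The intermediate triples would then carry these Int hashes, and the final Triple values would be reconstructed from the map once the parse has finished.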

If you're interested in my approach and can verify that it does improve things, we could definitely discuss how to proceed from here (maybe via IRC or mail), since what I've suggested here is a proof of concept, which may need more cleanup and streamlining to align the design with the rest of the library. Also, the optimizations imply some fairly involved changes rather than simple one-liners. However, I'd be happy to contribute to rdf4h, so let me know what you think.

BTW, I have not touched any other parsers, the library's core types, or any RDF representations.

@cordawyn (Collaborator) commented:

Sounds great! It's up to @robstewart57 to decide, but I think it would be better if you just created a pull request for your changes; that way we can both see what's changed and discuss things in more detail in the comments (and reject the pull request if it's not good enough 😉).

@robstewart57 (Owner) commented:

Hi @mgmeier, I'm very excited to see the work you've put into this, thanks! I'm also snowed under for the next week or so, but after that I'll definitely look at your commits in detail and provide feedback.

@robstewart57 (Owner) commented:

@mgmeier, interesting approach. First things first. Prior to your commits:

91 out of 781 tests failed (38.00s)

Once you've followed these instructions:
https://github.com/robstewart57/rdf4h#running-tests

How many tests fail?

@robstewart57 (Owner) commented:

I've merged 76722cc.

This does seem to have triggered some failing tests: 94 RDF W3C tests failed before, now 105 fail. Nevertheless, a 9000% speedup is too good to resist 👍

I have resisted merging the other two commits for now, as they introduce state into the GenParser used for parsing NTriples.

I'm going to upload rdf4h-2.0.0 to Hackage, which represents a substantially more robust collection of RDF parsers and a slight change in the API in the names of the RDF graph instances.

The important thing now is to try to squash the 105 failures out of the 784 W3C unit tests.

@cordawyn (Collaborator) commented:

Nice!
