Improve performance #56

wismill · 2018-07-23T08:49:03Z

Now that rdf4h has complete support for NTriples & Turtle, it may be a good time to focus on performance:

Parsers
Graph implementation
IRI handling

As mentioned in #35 and #44, there are several places where we could improve the parsers. I think it would be a good idea to keep only one modern parser library (attoparsec or megaparsec) to keep the implementation simple and make it more efficient.

I think the handling of prefixes in UNode is unsatisfactory. For instance, several important operations require expandTriples, which is very expensive. I propose that we remove expandTriples and make use of a smart constructor unode :: Text -> Either IRIError UNode for IRIs (currently merely a constructor synonym) that ensures the IRI is a valid absolute IRI. Then we would have a function mkIRI that accepts a namespace (or a prefix mapping, using a new type class) to create IRIs from a relative IRI or a prefixed IRI (see expandURI).
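A minimal sketch of the proposed smart constructor, to make the idea concrete. Everything here is an assumption for illustration: the IRIError constructors, the PrefixMap type, and the newtype UNode (a stand-in for the UNode constructor of rdf4h's node type) are hypothetical, and the "absolute IRI" check only looks for a leading scheme rather than validating against RFC 3987. This mkIRI handles prefixed names only, not relative IRIs.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Char       (isAlpha, isAlphaNum, isAscii)
import qualified Data.Map.Strict as Map
import           Data.Text       (Text)
import qualified Data.Text       as T

-- Hypothetical error type; not part of rdf4h.
data IRIError
  = EmptyIRI
  | MissingScheme Text
  deriving (Eq, Show)

-- Stand-in for rdf4h's UNode constructor, so the sketch is self-contained.
newtype UNode = UNode Text
  deriving (Eq, Show)

-- Hypothetical prefix mapping: prefix label to namespace IRI.
type PrefixMap = Map.Map Text Text

-- Smart constructor: accept only text that starts with "scheme:".
-- This is a simplified check, not full IRI validation.
unode :: Text -> Either IRIError UNode
unode t
  | T.null t    = Left EmptyIRI
  | hasScheme t = Right (UNode t)
  | otherwise   = Left (MissingScheme t)

-- True when the text has the shape ALPHA *(ALPHA / DIGIT / "+" / "-" / ".") ":".
hasScheme :: Text -> Bool
hasScheme t =
  case T.break (== ':') t of
    (scheme, rest)
      | T.null rest -> False -- no colon at all
      | otherwise ->
          case T.uncons scheme of
            Just (c, cs) -> isAsciiAlpha c && T.all isSchemeChar cs
            Nothing      -> False
  where
    isAsciiAlpha c = isAscii c && isAlpha c
    isSchemeChar c = isAscii c && (isAlphaNum c || c `elem` ("+-." :: String))

-- Expand a prefixed name once, at construction time, then validate it.
-- Text whose prefix is not in the map is passed to 'unode' unchanged.
mkIRI :: PrefixMap -> Text -> Either IRIError UNode
mkIRI prefixes t =
  case T.break (== ':') t of
    (prefix, rest)
      | not (T.null rest)
      , Just ns <- Map.lookup prefix prefixes -> unode (ns <> T.drop 1 rest)
    _ -> unode t
```

With this shape, every UNode in a graph is already absolute, so operations that compare or look up nodes would not need a separate expansion pass.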

Edit: change the proposed signature of unode to use Either rather than Maybe.


wismill · 2019-05-22T14:03:22Z

@robstewart57 what do you think about this proposal?

I would like to keep only Megaparsec for the parsing, as it is now very fast and robust.

robstewart57 · 2019-05-31T21:19:58Z

@wismill the idea of #36 was to generalise rdf4h across numerous parsers, specifically attoparsec and parsec. E.g.

-- |'NTriplesParser' is an instance of 'RdfParser' using parsec based parsers.
instance RdfParser NTriplesParser where
  parseString _  = parseStringParsec
  parseFile   _  = parseFileParsec
  parseURL    _  = parseURLParsec

-- |'NTriplesParserCustom' is an instance of 'RdfParser' that selects the parsec or attoparsec backend.
instance RdfParser NTriplesParserCustom where
  parseString (NTriplesParserCustom Parsec)     = parseStringParsec
  parseString (NTriplesParserCustom Attoparsec) = parseStringAttoparsec
  parseFile   (NTriplesParserCustom Parsec)     = parseFileParsec
  parseFile   (NTriplesParserCustom Attoparsec) = parseFileAttoparsec
  parseURL    (NTriplesParserCustom Parsec)     = parseURLParsec
  parseURL    (NTriplesParserCustom Attoparsec) = parseURLAttoparsec

(Source: rdf4h/src/Text/RDF/RDF4H/NTriplesParser.hs, line 44 in 896e21e.)

This functionality comes from this PR: #36

The motivation for this flexibility between parsers is that each of them has trade-offs. E.g. "Megaparsec vs Attoparsec": https://github.com/mrkkrp/megaparsec#megaparsec-vs-attoparsec

Attoparsec is sometimes faster but not that feature-rich. It should be used when you want to process large amounts of data where performance matters more than quality of error messages.

This is a realistic scenario when working with RDF data, which might be tens or hundreds of megabytes, or millions of triples.

There are megaparsec instances in parsers-megaparsec that we could use: https://hackage.haskell.org/package/parsers-megaparsec

The issue arises if the attoparsec, parsec and megaparsec instances of the type classes in the parsers package have different parsing semantics (which presumably shouldn't happen). In that case the rdf4h tests against the W3C tests might pass for one instance of the parsers type classes, e.g. parsec, but fail for another, e.g. megaparsec.
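One way to catch such divergence is to run every backend over the same input and compare the results. The following is only a sketch under stated assumptions: Backend, Verdict and compareBackends are hypothetical names, not rdf4h API, and each backend is modelled as a plain function from Text to Either. Failures are collapsed to Nothing because error messages are expected to differ between libraries.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Text (Text)
import qualified Data.Text as T

-- A named parser backend. In rdf4h this would wrap e.g. a parsec or
-- attoparsec based parseString; here it is just a function (hypothetical).
data Backend e a = Backend
  { backendName :: String
  , runBackend  :: Text -> Either e a
  }

-- Outcome of feeding one input to every backend.
data Verdict a
  = AllAgree (Maybe a)          -- shared result; Nothing means all failed
  | Disagree [(String, Maybe a)] -- per-backend results when they differ
  deriving (Eq, Show)

-- Run all backends on the same input and compare successful values.
-- Parse errors are reduced to Nothing, since their text is backend-specific.
compareBackends :: Eq a => [Backend e a] -> Text -> Verdict a
compareBackends backends input =
  case results of
    [] -> AllAgree Nothing
    ((_, r) : rest)
      | all ((== r) . snd) rest -> AllAgree r
      | otherwise               -> Disagree results
  where
    results =
      [ (backendName b, either (const Nothing) Just (runBackend b input))
      | b <- backends
      ]
```

Applied to each file of a test suite, a Disagree verdict points directly at the input where two backends diverge.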

wismill · 2019-05-31T23:49:14Z

Ok, but at least we could drop parsec and keep only attoparsec and megaparsec. Then we could use parser-combinators.

I think we should also provide a way to stream parsing results. As there are several frameworks for this, I wonder if we should provide new packages such as: rdf4h-pipes, rdf4h-conduit, rdf4h-streamly, etc.
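As a rough illustration of what a framework-independent core for streaming could look like, here is a sketch for a line-based format such as N-Triples. It is not an rdf4h API: foldStatements and its per-line parser argument are hypothetical, and a pipes, conduit or streamly package would replace the plain list of lines with its own stream type.

```haskell
{-# LANGUAGE BangPatterns #-}

import           Data.Text (Text)
import qualified Data.Text as T

-- Fold over the statements of a line-based document, one line at a time.
-- The per-line parser is a placeholder for a real N-Triples line parser.
-- On the first parse error, return the 1-based line number and the error.
foldStatements
  :: (Text -> Either e a) -- parse a single non-blank, non-comment line
  -> (acc -> a -> acc)    -- combine the accumulator with one parsed statement
  -> acc                  -- initial accumulator
  -> [Text]               -- input lines
  -> Either (Int, e) acc
foldStatements parseLine step = go 1
  where
    go _  !acc []       = Right acc
    go !n !acc (l : ls)
      | skippable l = go (n + 1) acc ls
      | otherwise   =
          case parseLine l of
            Left err -> Left (n, err)
            Right x  -> go (n + 1) (step acc x) ls

    -- Blank lines and whole-line comments carry no statement.
    skippable l =
      let s = T.strip l
      in T.null s || T.isPrefixOf (T.pack "#") s
```

Because the accumulator is forced at every step and each parsed statement is handed to the step function immediately, no list of all parsed triples is ever built.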

robstewart57 · 2020-01-03T09:27:43Z

@wismill

I think we should also provide a way to stream parsing results. As there are several frameworks for this, I wonder if we should provide new packages such as: rdf4h-pipes, rdf4h-conduit, rdf4h-streamly, etc.

I would also really like to see this!

robstewart57 added the enhancement label Oct 1, 2018
