Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsers based parsers #36

Merged
merged 13 commits into from
Nov 26, 2016
Merged

Conversation

axman6
Copy link
Contributor

@axman6 axman6 commented Nov 18, 2016

I mentioned on reddit that there might be some advantages to using Ed Kmett's parsers package. This PR replaces all explicit uses of Parsec with the type class CharParsing, and then instantiated with Parsec (and ParsecT over State because ParsecT's MonadState instance is practically useless...).

The advantage of making this change is that it will make changing parser libraries easy in the future - if you want the nice error messages trifecta offers, only the calls to runParser need to be replaced (with some other fairly trivial changes, like changing the error type returned from Parsec's).

I've also put some effort into making the code more idiomatic - avoid monadic functions when functor and applicative will do, make some things shorter, and generally follow most of the suggestions from hlint.

There are still a few places where the code can be made shorter but I think this does a good job improving readability.

@robstewart57
Copy link
Owner

This is great, thank you!

There are additional 6 tests that fail as a result of these commits, so I've pushed them to a parsers-based-parsers branch. If you'd like to submit more PRs, please point them to that branch until those 6 tests pass. These are the newly failing tests:

  SPARQL_style_prefix:                                                FAIL
    Exception: ParseFailure "(line 1, column 7):\nunexpected \" \"\nexpecting subject resource"
    CallStack (from HasCallStack):
      error, called at src/Data/RDF/Query.hs:93:19 in rdf4h-3.0.1-CjcbQra4TThLE7wwhzXueu:Data.RDF.Query

  SPARQL_style_base:                                                  FAIL
    Exception: ParseFailure "(line 1, column 5):\nunexpected \" \"\nexpecting subject resource"
    CallStack (from HasCallStack):
      error, called at src/Data/RDF/Query.hs:93:19 in rdf4h-3.0.1-CjcbQra4TThLE7wwhzXueu:Data.RDF.Query

  turtle-syntax-base-02:                                              FAIL
    Exception: HUnitFailure (Just (Location {locationFile = "testsuite/tests/W3C/W3CAssertions.hs", locationLine = 39, locationColumn = 3})) "unable to parse, reason:\nLeft (ParseFailure \"(line 1, column 5):\\nunexpected \\\" \\\"\\nexpecting subject resource\")

  turtle-syntax-base-04:                                              FAIL
    Exception: HUnitFailure (Just (Location {locationFile = "testsuite/tests/W3C/W3CAssertions.hs", locationLine = 39, locationColumn = 3})) "unable to parse, reason:\nLeft (ParseFailure \"(line 1, column 5):\\nunexpected \\\" \\\"\\nexpecting subject resource\")"

  turtle-syntax-prefix-02:                                            FAIL
    Exception: HUnitFailure (Just (Location {locationFile = "testsuite/tests/W3C/W3CAssertions.hs", locationLine = 39, locationColumn = 3})) "unable to parse, reason:\nLeft (ParseFailure \"(line 1, column 7):\\nunexpected \\\" \\\"\\nexpecting subject resource\")"

  turtle-syntax-prefix-03:                                            FAIL
    Exception: HUnitFailure (Just (Location {locationFile = "testsuite/tests/W3C/W3CAssertions.hs", locationLine = 39, locationColumn = 3})) "unable to parse, reason:\nLeft (ParseFailure \"(line 1, column 7):\\nunexpected \\\" \\\"\\nexpecting subject resource\")"

I see that there's an attoparsec instance for the Parsing type class. I'll set up parsing W3C tests and also parsing benchmarks for parsec and attoparsec instances for both the Turtle and NTriples parsers (on this parsers-bases-parsers branch).

@axman6
Copy link
Contributor Author

axman6 commented Nov 20, 2016

Oops, that must've been problem with one of the more recent commits, I'll see if I can track it down.

@axman6
Copy link
Contributor Author

axman6 commented Nov 20, 2016

I can't seem to reproduce these test failures, I assume running stack test should be enough to run the whole test suite? I've downloaded the W3C test suites that need to be needed, and looking at the output, there's no difference between master and my branch (baring differences in test timing).
master-test.txt
parsers-test.txt

(I made a small change to the tests suite code to print each triple on a new line to help dinging differences)

@robstewart57
Copy link
Owner

Hmm, that's a strange output. I'm surprised to see:

parser-w3c-tests
test-rdf4h: query empty: subject mf:node & predicate mf:name in:

That is, the W3C tests aren't being run. The data files are in the gh-pages branch (I don't know why). Could you run:

$ git submodule update --init --recursive
$ git submodule foreach git pull origin gh-pages
$ stack test --test-arguments="--pattern parser-w3c-tests"

I get:

70 out of 521 tests failed (1.41s)

@robstewart57
Copy link
Owner

I've fixed those 6 test failures:

146db62

The NTriples and Turtle parsers have been migrated to the parsers
library. A key advantage is that it opens up the parsec and attoparsec
instances "for free". This migration was added in:

robstewart57#36

Here are results from running:

$ stack bench --benchmark-arguments 'rdf4h/parsers'

Benchmark rdf4h-bench: RUNNING...
benchmarking rdf4h/parsers/ntriples-parsec
time                 323.8 ms   (317.8 ms .. 331.3 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 317.3 ms   (313.9 ms .. 319.8 ms)
std dev              3.256 ms   (3.257 μs .. 3.947 ms)
variance introduced by outliers: 16% (moderately inflated)

benchmarking rdf4h/parsers/ntriples-attoparsec
time                 243.0 ms   (233.0 ms .. 250.2 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 242.8 ms   (239.4 ms .. 245.9 ms)
std dev              3.566 ms   (2.099 ms .. 5.014 ms)
variance introduced by outliers: 16% (moderately inflated)

benchmarking rdf4h/parsers/turtle-parsec
time                 349.1 ms   (329.8 ms .. 367.2 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 345.9 ms   (342.6 ms .. 348.0 ms)
std dev              3.203 ms   (0.0 s .. 3.692 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking rdf4h/parsers/turtle-attoparsec
time                 310.0 ms   (303.6 ms .. 317.2 ms)
                     1.000 R²   (0.998 R² .. 1.000 R²)
mean                 308.2 ms   (306.1 ms .. 311.1 ms)
std dev              3.019 ms   (749.2 μs .. 3.740 ms)
variance introduced by outliers: 16% (moderately inflated)
@robstewart57
Copy link
Owner

robstewart57 commented Nov 20, 2016

I've added the parsec and attoparsec instances for Turtle and NTriples parsers to the criterion benchmarking suite.

7416501

The results are interesting when using the attoparsec instance instead of parsec:

  • 25% speedup for NTriples parsing
  • 11% speedup for Turtle parsing
$ stack bench --benchmark-arguments 'rdf4h/parsers'

Benchmark rdf4h-bench: RUNNING...
benchmarking rdf4h/parsers/ntriples-parsec
time                 323.8 ms   (317.8 ms .. 331.3 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 317.3 ms   (313.9 ms .. 319.8 ms)
std dev              3.256 ms   (3.257 μs .. 3.947 ms)
variance introduced by outliers: 16% (moderately inflated)

benchmarking rdf4h/parsers/ntriples-attoparsec
time                 243.0 ms   (233.0 ms .. 250.2 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 242.8 ms   (239.4 ms .. 245.9 ms)
std dev              3.566 ms   (2.099 ms .. 5.014 ms)
variance introduced by outliers: 16% (moderately inflated)

benchmarking rdf4h/parsers/turtle-parsec
time                 349.1 ms   (329.8 ms .. 367.2 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 345.9 ms   (342.6 ms .. 348.0 ms)
std dev              3.203 ms   (0.0 s .. 3.692 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking rdf4h/parsers/turtle-attoparsec
time                 310.0 ms   (303.6 ms .. 317.2 ms)
                     1.000 R²   (0.998 R² .. 1.000 R²)
mean                 308.2 ms   (306.1 ms .. 311.1 ms)
std dev              3.019 ms   (749.2 μs .. 3.740 ms)
variance introduced by outliers: 16% (moderately inflated)

Copy link
Contributor Author

@axman6 axman6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary: I'm an idiot =)

(do rdfContent <- T.pack <$> readFile "bills.099.actions.rdf"
fawltyContentTurtle <- T.pack <$> readFile "data/ttl/fawlty1.ttl"
fawltyContentNTriples <- T.pack <$> readFile "data/nt/all-fawlty-towers.nt"
let (Right rdf1) =
parseString (XmlParser Nothing Nothing) rdfContent
let (Right rdf2) =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to not just replace all uses of rdf2 with rdf1? they are guaranteed to be identical.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, they're different types! never mind!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have type sigs here instead of at their use site.

@axman6
Copy link
Contributor Author

axman6 commented Nov 21, 2016

That's great news, I'm glad you managed to fix those failures. Have you confirmed that the attoparsec implementation passes all the tests? this isn't guaranteed since parsec and attoparsec may implement different semantics (though your liberal use of try should avoid that problem - it might be worth trying its usage if possible).

Also, would you be able to add the instructions for running the tests to the README? I was running them by downloading the W3C tests from the W3C directly and putting them in the ref-tests directory, which seemed to work, but was apparently wrong.

@robstewart57 robstewart57 merged commit 7416501 into robstewart57:master Nov 26, 2016
@robstewart57
Copy link
Owner

Good point. I've separated out the tests for Turtle and NTriples parsing, to separately report parsec and attoparsec results. Here they are:

NTriples

  • NTriples with parsec
$ stack test --test-arguments="--pattern parser-w3c-tests-ntriples-parsec"
...
All 68 tests passed (1.41s)
  • NTriples with attoparsec
$ stack test --test-arguments="--pattern parser-w3c-tests-ntriples-attoparsec"
...
27 out of 68 tests failed (1.56s)

Turtle

  • Turtle with parsec
$ stack test --test-arguments="--pattern parser-w3c-tests-turtle-parsec"
...
20 out of 291 tests failed (1.48s)
  • Turtle with attoparsec
$ stack test --test-arguments="--pattern parser-w3c-tests-turtle-attoparsec"
...
110 out of 291 tests failed (1.61s)

Summary

So, indeed their semantics are different. I haven't looked into the failure reports to see why.

I've exposed TurtleParserCustom and NTriplesParserCustom, to opt in to either the parsec and attoparsec instances. The existing TurtleParser and NTriplesParser instances of RdfParser use the parsec instances of Parsing, because it passes more W3C tests than the attoparsec instance currently. If a user wants higher performance, they should use TurtleParserCustom and NTriplesParserCustom.

I've merged all of your commits, and my recent commits, into the master branch.

@robstewart57
Copy link
Owner

Also, would you be able to add the instructions for running the tests to the README? I was running them by downloading the W3C tests from the W3C directly and putting them in the ref-tests directory, which seemed to work, but was apparently wrong.

Done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants