Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Start hand-written PHP combinator-based parser #1681

Closed
wants to merge 2 commits into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Jul 3, 2017

Foreward

Clearly the PHP parser is the harder parser to write because JavaScript is much more amenable to the kind of programming style needed to effectively and efficiently write a parser. This PR is currently in progress experimenting with an idea which I believe will end up much more efficient than the generated parser from the PEG.

This code is still highly volatile and likely to change. That being said I welcome your feedback along the way. 😄

I'm testing this code outside of WordPress. Locally I installed the PHPUnit phar and am running this to test…

php phpunit.phar --no-configuration --colors phpunit/class.block-parser-test.php

Please note that some cruft is lying around and there are style violations. Please know that I'm aware of these minutia and have no intention of leaving them in the final versions of this PR 😉


The goal of this alternative parser is to make a faster parser than the
one generated by the PEG grammar directly. It must not take multiple
seconds to parse a document no matter how big.

This parser currently is designed to produce the blocks and their
attributes, leaving all inside content raw.

This is based on the seminal paper "Monadic Parser Combinators" by
Hutton and Meijer, University of Nottingham, 1996. I have had a hard
time finding literature detailing the performance of such an approach in
PHP.

Although this approach will be building up user-defined function
calls it avoids creating closures by means of passing around
descriptions of parsers in the PHP callback style. That is, instead of
passing partial functions, we pass a callable string and an array of
partial arguments. The input is added as the final input before calling
the given parser.

A further speedup taken in this approach involves defining parse rules "on
the branches" so as to reduce backtracking. That is, if two rules share
a prefix then we can parse the prefix and descend into a first_of
branch instead of having a top-level first_of branch and duplicating
the prefix. The latter approach has been taken in the formal grammar
spec to make the rules easier to read.

RegExp patterns have also been dupilcated where possible to combine
rules. That is, although we could ignore whitespace and then ignore a
closing block comment, instead I'm combining the RegExp pattern so that
we benefit from the faster performance of the RegExp and from only
making one call instead of two. Each call passes around its own state on
the parse "stack."

@dmsnell dmsnell added [Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f [Status] In Progress Tracking issues with work in progress labels Jul 3, 2017
Copy link
Member

@nylen nylen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I am finding this code a bit hard to follow. That is probably because I am not well-versed in the theory at play here. However, this is a general concern about our parsing implementation: so far, it seems like only a few people will have the knowledge needed to maintain these.

There are several categories of tests that are absolutely mandatory that we add very soon, especially before switching over to a parsing solution that is different than "use the same grammar on the client and server":

  • Invalid nesting of blocks, malformed delimiters, content that cuts off before the end of a block delimiter, etc.
  • Multibyte character handling, including block delimiters that are made invalid by the insertion of multibyte characters in various places

Finally, I question whether the assumptions that led to the creation of this PR are actually valid. It seems possible to automatically generate a quite performant parser based on a PEG.js grammar. See #1775 for details.

We need to do similar profiling of this solution. In particular there will be thousands of call_user_func calls every time a post is parsed, and we can likely save a lot of time by getting rid of these.


function test_combinator_succeed() {
$this->assertEquals(
[ [ 'test', 'bork' ] ],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere (though I only see this in the tests), the [ ... ] syntax must be changed to array( ... ) for PHP 5.2 compatibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's only in the test code

Copy link
Member

@nylen nylen Jul 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, which will need to be merged into the plugin and run against PHP 5.2 along with the rest of the tests.

// trampoline for stack-safe recursion of the actual parser
while ( $this->input && ( microtime( true ) - $tic ) < self::MAX_RUNTIME ) {
return $this->proceed( $input );
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about the non-deterministic behavior introduced by the time limit here. It will likely mean, for example, that some posts parse correctly on one page load but not on the next. This is a bit worrisome to me, and it would be better to make sure we understand the performance characteristics of our parser(s) well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well clearly it's not fully built out. there will be some post that's about two million lines long and that will hang on the server. this isn't a big part of this PR but the general idea is that it provides a safe limit where if we run out of time can stop parsing and return the rest as unparsed

again, not a big part of this (unfinished) PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Generally I would hope that we can do better than the rest of WP core here.


public static function literal( $value, $input ) {
return strpos( $input, $value ) === 0
? array( $value, substr( $input, strlen( $value ) ) )
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and below, should we be using mb_substr instead? If not, how can we guarantee that this parser will behave well with UTF-8 input?

If we do switch to mb_substr, then performance becomes a concern because it has to scan through the entire string until each start position.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a good point and I was planning on creating a local string length function. one thing I wanted to check on is if not using mb_substr is actually going to cause problems, since our syntax is all ASCII. we shouldn't theoretically be splitting on any non-ASCII characters and so I think we can probably make this assumption safely (though I haven't finished considering that yet)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed possible that this won't be an issue here. Let's test it thoroughly rather than assuming.

@@ -0,0 +1,205 @@
<?php

if (!class_exists('Gutenberg_Block_Parser_State', false)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would these classes already exist?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully they wouldn't. just guarding

class Gutenberg_Block_Parser {
const BLOCK_COMMENT_OPEN = '(^<!--)';
const BLOCK_COMMENT_CLOSE = '(^/?-->)';
const BLOCK_NAME = '(^[[:alpha:]](?:[[:alnum:]]|/[[:alnum:]])*)i';
Copy link
Member

@nb nb Jul 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we limit this to the registered block names and avoid regular expressions altogether?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to be able to safely parse blocks from plugins that have been disabled or uninstalled. This could work by including them within larger chunks of content, but I think it is better to show them as an independent but "unrecognized" block. See #1735.

I also don't like the idea of coupling the parser to the list of registered blocks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nb I don't think that we should totally avoid RegExps. In this case I'm only using simple RegExps to tokenize and not to parse. As @nylen pointed out we unfortunately can't limit this to known block types because it has to parse everything.

The grammar doesn't use RegExps and we wouldn't have to here, but I felt like it was going to be faster and clearer than running character-by-character the way the PEG parser does.

@dmsnell
Copy link
Member Author

dmsnell commented Jul 10, 2017

I'm planning on closing this but might play a little more with the concept. There's no way to actually test the performance until the full grammar is incorporated, so maybe I'll try and fill that out and test.

The goal of this alternative parser is to make a faster parser than the
one generated by the PEG grammar directly. It must not take multiple
seconds to parse a document no matter how big.

This parser currently is designed to produce the blocks and their
attributes, leaving all inside content raw.

This is based on the seminal paper "Monadic Parser Combinators" by
Hutton and Meijer, University of Nottingham, 1996. I have had a hard
time finding literature detailing the performance of such an approach in
PHP.

Although this approach will be building up user-defined function
calls it avoids creating closures by means of passing around
descriptions of parsers in the PHP callback style. That is, instead of
passing partial functions, we pass a callable string and an array of
partial arguments. The input is added as the final input before calling
the given parser.

A further speedup taken in this approach involves defining parse rules "on
the branches" so as to reduce backtracking. That is, if two rules share
a prefix then we can parse the prefix and descend into a `first_of`
branch instead of having a top-level `first_of` branch and duplicating
the prefix. The latter approach has been taken in the formal grammar
spec to make the rules easier to read.

RegExp patterns have also been dupilcated where possible to combine
rules. That is, although we could ignore whitespace and then ignore a
closing block comment, instead I'm combining the RegExp pattern so that
we benefit from the faster performance of the RegExp and from only
making one call instead of two. Each call passes around its own state on
the parse "stack."
Not included in this update:
 - Void block syntax
 - More tag, nofollow tag, nextpage tag
@dmsnell
Copy link
Member Author

dmsnell commented Jul 10, 2017

Okay so my curiosity got to me and I don't think I'll close this yet. Right now it handles most of the grammar and I'd really like to see if how its performance compares to the generated parser. I wouldn't be at all surprised if it's much slower actually.

It's still a WIP so don't be surprised at the ugly proceed() function. There are things I don't like about it but yet that is still the place where we diverge from the pattern of monadic parsers and start thinking iteratively with a stack and state.

Some of the returned values from the parsers I have been thinking of changing, plus this is still all immutable. These changes could speed things up dramatically especially for long documents. In the first version I wanted to make sure I could do it right and immutability has helped a lot by keeping different things different.

One big point that I think I need to add in the values from the parsers is some indication of which production is being generated. It's hard right now to use the results of the first_of combinator because you have to structurally determine which of the productions you have; this is impossible if the productions contain the same number of values.

If this all works well we might be able to turn the proceed() function into a parser too. I think it would be something like…

block => sequence [
  block_opener,
  zero_or_more [
    block,
    raw_content
  ],
  block_closer
]

…but the one big problem there is that if this isn't fast enough we have no safe exit. with process() imperatively built we can set a threshold on the parse and when we reach the limit, return the remainder of the document as a freeform block - a concept I find pretty cool. this would obviously depend on Gutenberg not killing blocks it doesn't understand though

@dmsnell
Copy link
Member Author

dmsnell commented Jul 10, 2017

In general I am finding this code a bit hard to follow. That is probably because I am not well-versed in the theory at play here. However, this is a general concern about our parsing implementation: so far, it seems like only a few people will have the knowledge needed to maintain these.

This is something I also recognize but at the same time I feel like if we want to accomplish our goals we will have to accept that the theory is there to enable us and will keep us safe and successful even if it is unfamiliar.

As it becomes more stable I would expect more of our efforts to go into documenting the concepts and building guides so that more people can learn about the theory and how to participate in this code (all parsing code). It will help the project and hopefully help them too.

That raises a point I'd like to look into and that is making the parser less integral to Gutenberg (as you did similar work earlier) and let it be overwritable by a plugin. We have to cover both the parse and the serialize functions. In my head I would like to explore some other approaches that don't involve serializing to a string…

  • save JSON blob directly to post meta
  • save structure to Simperium as-is
  • API calls instead of local processing
  • something else fun, maybe directly into a (gasp, MongoDB) database table?

These approaches don't actually need string parsing (other than maybe the native JSON parse) and could offer novel ways to speed up the editor for heavy workloads or integrate into other storage systems.

Anyway, still doodling here

@dmsnell
Copy link
Member Author

dmsnell commented Sep 21, 2017

It's not necessary to keep this open so I'm going to close it. The idea may be worth exploring more but I don't have the time now to do it.

@dmsnell dmsnell closed this Sep 21, 2017
@dmsnell dmsnell deleted the add/php-block-parser branch September 21, 2017 05:40
dmsnell added a commit that referenced this pull request Jul 20, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Jul 23, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 23, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 23, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 24, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 27, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 29, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Aug 30, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
mcsf pushed a commit that referenced this pull request Aug 31, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Sep 3, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
dmsnell added a commit that referenced this pull request Sep 5, 2018
For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file
gziolo pushed a commit that referenced this pull request Sep 6, 2018
* Parser: Propose new hand-coded PHP parser

For some time we've needed a more performant PHP parser for the first
stage of parsing the `post_content` document.

See #1681 (early exploration)
See #8044 (parser performance issue)
See #1775 (parser performance, fixed in php-pegjs)

I'm proposing this implementation of the spec parser as an alternative
to the auto-generated parser from the PEG definition.

This is not yet ready to go but I wanted to get the code in a branch
so I can iterate on it and garner early feedback.

This should eventually provide a setup fixture for #6831 wherein we
are testing alternate parser implementations.

 - designed as a basic recursive-descent
 - but doesn't recurse on the call-stack, recurses via trampoline
 - moves linearly through document in one pass
 - relies on RegExp for tokenization

 - nested blocks include the nested content in their `innerHTML`
   this needs to go away
 - create test fixutre
 - figure out where to save this file

* Fix issue with containing the nested innerHTML

* Also handle newlines as whitespace

* Use classes for some static typing

* add type hints

* remove needless comment

* space where space is due

* meaningless rename

* remove needless function call

* harmonize with spec parser

* don't forget freeform HTML before blocks

* account for oddity in spec-parser

* add some polish, fix a thing

* comment it

* add JS version too

* Change `.` to `[^]` because `/s` isn't well supported in JS

The `s` flag on the RegExp object informs the engine to treat a
dot character as a class that includes the newline character.
Without it newlines aren't considered in the dot.

Since this flag is new to Javascript and not well supported in
different browsers I have removed it in favor of an explicit class
of characters that _does_ include the newline, namely the open
exclusion of `[^]` which permits all input characters.

Hat-top to @Hywan for finding this.

* Move code into `/packages` directory, prepare for review

* take out names from RegExp pattern to not fail tests

* Fix bug in parser: store HTML soup in stack frames while parsing

Previously we were sending all "HTML soup" segments of HTML between
blocks to the output list before any blocks were processed. We should
have been tracking these segments during the parsing and only spit
them out when closing a block at the top level.

This change stores the index into the input document at which that
soup starts if it exists and then produces the freeform block when
adding a block to the output from the parse frame stack.

* fix whitespace

* fix oddity in spec

* match styles

* use class name filter on server-side parser class

* fix whitespace

* Document extensibility

* fix typo in example code

* Push failing parsing test

* fix lazy/greedy bug in parser regexp

* Docs: Fix typos, links, tweak style.

* update from PR feedback

* trim docs

* Load default block parser, replacing PEG-generated one

* Expand `?:` shorthand for PHP 5.2 compat

* add fixtures test for default parser

* spaces to tabs

* could we need no assoc?

* fill out return array

* put that assoc back in there

* isometrize

* rename and add 0

* Conditionally include the parser class

* Add docblocks

* Standardize the package configuration
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] Parsing Related to efforts to improving the parsing of a string of data and converting it into a different f [Status] In Progress Tracking issues with work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants