Migration from `pest` to `logos` + `rowan`? #29

peterhuene · 2024-05-02T21:22:52Z

peterhuene
May 2, 2024
Maintainer

Overview

As development on top of the wdl family of crates ramps up in terms of linters, formatters, documenters, and, perhaps one day, an LSP server for live analysis of WDL documents, I wanted to discuss the use of pest and whether or not it makes sense to migrate to a different parser implementation.

Definitions of a few terms that are relevant to this discussion:

pest - a PEG-based parser that uses a grammar definition file to generate a parser via a Rust procedural macro; this crate is currently used by wdl-grammar.
logos - a lexer that uses procedural macros containing regular expressions and compiles those regular expressions into a DFA that drives tokenization; as a result, it has very good performance for incrementally classifying text into a stream of tokens.
rowan - a crate created for use in rust-analyzer that implements a "red-green" tree data structure; this data structure is meant for efficient "live" analysis and updates that are required from interactive parses such as from within a code editor.
CST - "Concrete Syntax Tree" or parse tree, a tree which losslessly includes every span of the source document.
AST - "Abstract Syntax Tree", a tree that only contains elements with semantic meaning (e.g. ignores trivia).
"Infallible" or "recoverable" parser - a parser capable of producing a CST despite syntactic errors in the source; very useful for interactive parsing.
Trivia - tokens such as whitespace, comments, or, in the case of infallible parsers, syntactically invalid tokens.
LSP - a stand-in acronym to mean a server implementing the Language Server Protocol, typically for use in a code editor.
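To make "lossless" and "trivia" concrete, here is a minimal sketch of a tokenizer that keeps every byte of its input. It is a hand-written, standard-library-only stand-in and does not use the logos API; the names `Kind`, `Token`, and `lex` and the toy token set (identifiers, braces, `#` line comments) are assumptions made for illustration.

```rust
/// Token kinds for a toy language; `Whitespace`, `Comment`, and `Unknown`
/// are treated as trivia (all names here are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Ident,
    OpenBrace,
    CloseBrace,
    Whitespace,
    Comment,
    Unknown,
}

impl Kind {
    pub fn is_trivia(self) -> bool {
        matches!(self, Kind::Whitespace | Kind::Comment | Kind::Unknown)
    }
}

/// A token borrows its exact text from the source, so nothing is lost.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token<'a> {
    pub kind: Kind,
    pub text: &'a str,
}

/// Length in bytes of the longest prefix of `s` whose chars satisfy `pred`.
fn prefix_len(s: &str, pred: impl Fn(char) -> bool) -> usize {
    s.char_indices()
        .find(|&(_, c)| !pred(c))
        .map_or(s.len(), |(i, _)| i)
}

/// Splits `src` into tokens that together cover every byte of the input.
pub fn lex(src: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    let mut rest = src;
    while let Some(first) = rest.chars().next() {
        let (kind, len) = if first.is_whitespace() {
            (Kind::Whitespace, prefix_len(rest, char::is_whitespace))
        } else if first == '#' {
            // A comment runs to the end of the line (newline excluded).
            (Kind::Comment, prefix_len(rest, |c| c != '\n'))
        } else if first.is_alphabetic() || first == '_' {
            (Kind::Ident, prefix_len(rest, |c| c.is_alphanumeric() || c == '_'))
        } else if first == '{' {
            (Kind::OpenBrace, 1)
        } else if first == '}' {
            (Kind::CloseBrace, 1)
        } else {
            // Invalid input still becomes a token instead of aborting the lex.
            (Kind::Unknown, first.len_utf8())
        };
        let (text, tail) = rest.split_at(len);
        tokens.push(Token { kind, text });
        rest = tail;
    }
    tokens
}
```

Concatenating the `text` of every token reproduces the input exactly, which is the CST property; filtering with `is_trivia` leaves only the tokens an AST would care about.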

One developer's Rust-based parser journey

For one of the (entirely unrelated) languages I've implemented in the past, I started with pest because it was pretty easy to get going: it handled both lexing and parsing in a single PEG grammar definition.

As time went on, I realized that the parse errors produced by pest weren't great (they confused users by referencing rule names that are internal implementation details), the performance of the parse itself could be better, the rules were difficult to debug, and it was very challenging to preserve whitespace and comments at any location in the grammar, which led to overly complicated rules and parsing bugs involving whitespace. So I decided to abandon pest in favor of logos with a hand-crafted parser that consumes tokens from the stream with a single token of lookahead.

The immediate benefits were the improved diagnostics (this change also coincided with switching to thiserror and miette for defining lexing/parsing errors) and consistent handling of trivia during the parse. Additionally, the rules were easier to debug as they were implemented directly in source and not hidden behind pest proc-macro output. The obvious downside was that it was a lot more code to implement the parser than a proc-macro invocation.

As I started to implement an LSP (said LSP is a side project that is still nowhere near ready), I realized that I needed an infallible parser, one capable of always producing a CST no matter what the input was. So the hand-crafted parser was once again re-implemented, this time in the manner in which rust-analyzer constructs its parse tree: a series of "rule" functions that consume tokens and, at certain points in the grammar, attempt recovery when they encounter errors (the general strategy is to "abandon" the node currently being constructed and look for where the next node might start, depending on the current context).

For example, here's what one such rule function looks like in my parser:

fn world_decl(parser: &mut Parser, marker: Marker) -> Result<(), (Marker, Error)> {
    require!(parser, Token::WorldKeyword);
    expect!(parser, marker, Token::Ident);
    expect!(parser, marker, Token::OpenBrace);
    delimited!(
        parser,
        marker,
        UNTIL_CLOSE_BRACE,
        WORLD_ITEM_RECOVERY_SET,
        |parser, marker| {
            world_item(parser, marker)?;
            Ok(true)
        }
    );
    expect!(parser, marker, Token::CloseBrace);
    marker.complete(parser, NodeKind::WorldDecl);
    Ok(())
}

Which parses this rule:

world-decl ::= 'world' id '{' world-item* '}'
world-item ::= <omitted for brevity>

Note that trivia is attached to the node currently being parsed by the parser implementation, so the rules don't concern themselves with it; additionally, it provides detailed "expected/found" diagnostics when it encounters unexpected tokens.
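The recovery strategy can be sketched without the macros used above. What follows is a simplified, standard-library-only illustration, not the parser the rule function comes from: the `world-item ::= ('import' | 'export') id` production, the whitespace-splitting `lex`, and every type and method name are assumptions. The point is only that the parser records an "expected/found" error, skips to a token in a recovery set, and keeps going.

```rust
/// Tokens of a toy grammar (hypothetical; trivia is ignored in this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Tok { World, Import, Export, Ident(String), OpenBrace, CloseBrace }

/// Splits on whitespace only, which is enough for the sketch.
pub fn lex(src: &str) -> Vec<Tok> {
    src.split_whitespace()
        .map(|word| match word {
            "world" => Tok::World,
            "import" => Tok::Import,
            "export" => Tok::Export,
            "{" => Tok::OpenBrace,
            "}" => Tok::CloseBrace,
            other => Tok::Ident(other.to_string()),
        })
        .collect()
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct World {
    pub name: Option<String>,
    pub items: Vec<String>,
}

pub struct Parser {
    tokens: Vec<Tok>,
    pos: usize,
    pub errors: Vec<String>,
}

impl Parser {
    pub fn new(tokens: Vec<Tok>) -> Self {
        Parser { tokens, pos: 0, errors: Vec::new() }
    }

    fn peek(&self) -> Option<&Tok> {
        self.tokens.get(self.pos)
    }

    /// Records an "expected/found" diagnostic without consuming anything.
    fn error(&mut self, expected: &str) {
        let message = format!("expected {}, found {:?}", expected, self.peek());
        self.errors.push(message);
    }

    /// Consumes the next token if it equals `expected`, else records an error.
    fn expect(&mut self, expected: Tok, what: &str) {
        if self.peek() == Some(&expected) {
            self.pos += 1;
        } else {
            self.error(what);
        }
    }

    fn ident(&mut self) -> Option<String> {
        if let Some(Tok::Ident(name)) = self.peek() {
            let name = name.clone();
            self.pos += 1;
            return Some(name);
        }
        self.error("identifier");
        None
    }

    /// Recovery set: skip until a token that can start an item or close the world.
    fn recover(&mut self) {
        while !matches!(self.peek(), None | Some(Tok::Import | Tok::Export | Tok::CloseBrace)) {
            self.pos += 1;
        }
    }

    /// world-decl ::= 'world' id '{' world-item* '}'
    pub fn world_decl(&mut self) -> World {
        let mut world = World::default();
        self.expect(Tok::World, "'world'");
        world.name = self.ident();
        self.expect(Tok::OpenBrace, "'{'");
        while !matches!(self.peek(), None | Some(Tok::CloseBrace)) {
            if matches!(self.peek(), Some(Tok::Import | Tok::Export)) {
                self.pos += 1;
                match self.ident() {
                    Some(name) => world.items.push(name),
                    None => self.recover(), // abandon this item, resume at the next
                }
            } else {
                self.error("item");
                self.recover();
            }
        }
        self.expect(Tok::CloseBrace, "'}'");
        world
    }
}
```

Parsing `world w { import a { export b import }` with `Parser::new(lex(..)).world_decl()` still yields a `World` holding items `a` and `b`, plus two entries in `errors`, rather than stopping at the stray `{`.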

In addition to the need for an infallible parser, I needed a proper representation of the CST that allowed for incremental re-parses, incremental tree updates, and fast span lookup. Again I asked myself "what would rust-analyzer do?" and the answer was rowan, which implements a red-green tree that can scale to very large parse trees. So my AST representation was converted to be an adapter over a rowan::SyntaxNode (basically a red tree node) which knows how to walk the CST to provide only semantically necessary elements.
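As an illustration of the "AST as an adapter over the CST" idea, here is a minimal sketch using a plain owned tree in place of rowan. None of these types are rowan's; `SyntaxKind`, `Node`, `Element`, `WorldDecl`, and the `tok` helper are hypothetical names, and a real red-green tree would add parent pointers and offsets that this sketch leaves out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxKind {
    WorldDecl,
    WorldKeyword,
    Ident,
    OpenBrace,
    CloseBrace,
    Whitespace,
    Comment,
}

#[derive(Debug)]
pub struct Token {
    pub kind: SyntaxKind,
    pub text: String,
}

#[derive(Debug)]
pub enum Element {
    Node(Node),
    Token(Token),
}

/// A CST node: every token, trivia included, is a child somewhere below it.
#[derive(Debug)]
pub struct Node {
    pub kind: SyntaxKind,
    pub children: Vec<Element>,
}

impl Node {
    /// Concatenates all token text below this node, reproducing the source.
    pub fn text(&self) -> String {
        let mut out = String::new();
        for child in &self.children {
            match child {
                Element::Node(node) => out.push_str(&node.text()),
                Element::Token(token) => out.push_str(&token.text),
            }
        }
        out
    }
}

/// Convenience constructor for a token element (hypothetical helper).
pub fn tok(kind: SyntaxKind, text: &str) -> Element {
    Element::Token(Token { kind, text: text.to_string() })
}

/// Typed AST view over a `WorldDecl` node; it holds a reference and copies nothing.
#[derive(Debug, Clone, Copy)]
pub struct WorldDecl<'a>(&'a Node);

impl<'a> WorldDecl<'a> {
    /// Succeeds only when the untyped node has the expected kind.
    pub fn cast(node: &'a Node) -> Option<Self> {
        (node.kind == SyntaxKind::WorldDecl).then(|| WorldDecl(node))
    }

    /// The declared name: the first `Ident` token among the direct children.
    /// Trivia is skipped simply because it never matches.
    pub fn name(&self) -> Option<&'a str> {
        let node: &'a Node = self.0;
        node.children.iter().find_map(|child| match child {
            Element::Token(t) if t.kind == SyntaxKind::Ident => Some(t.text.as_str()),
            _ => None,
        })
    }
}
```

The adapter answers semantic questions ("what is this world called?") while the underlying node still carries the whitespace and comments needed by formatters.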

The discussion topic

From the lessons I've learned, I wanted to propose the following question for discussion: should we continue to base wdl off of pest or should we consider a different parser implementation?

For what it's worth, here's my opinionated list of pros and cons for pest:

Pros

pest is currently used by wdl-grammar and is exposed publicly in its API, which means switching to a new parser implementation would be a very disruptive breaking change; to me, this is by far the biggest pro.
The single PEG grammar makes it (relatively) easy to see the language being defined, albeit in a syntax that only pest uses so it does require pest familiarity to understand.
The parser is generated, so a lot less code is required to implement a parser.

Cons

The diagnostics aren't good and may directly reference rule names that may have no relevance to users.
Preserving trivia in the parse tree is onerous and error-prone; for example, look at how often (WHITESPACE | COMMENT) appears in the current WDL pest grammar.
Debugging a rule parse failure can be difficult.
The performance of pest has historically been not all that great.

And for a different parser implementation:

Pros

It gives us an opportunity to improve the parser diagnostics (it also allows for a transition to thiserror and miette for defining errors, making error handling better from both an API perspective and a UX perspective).
It could* be a significantly faster parser implementation (* supporting evidence required in the form of benchmarks).
An infallible parser could improve the UX by displaying multiple errors in the parse/validation rather than stopping at the first encountered error.
An infallible parser with a different parse tree representation could be the basis for a future LSP implementation.

Cons

It'll be a highly disruptive change to existing API consumers:
- wdl-grammar users will need to use rowan::SyntaxNode for traversing all elements in the CST.
- wdl-ast users will need to use our provided AST adapters that adapt SyntaxNode into something that walks the CST for them based on WDL semantics.
Existing lints will have to be updated.
Documentation will need to be updated.
It might not be worth it? pest is an entirely adequate parser implementation for our current needs.

Should we continue to base `wdl` off of `pest` or should we consider a different parser implementation?

Yes, keep `wdl` based off of `pest` (an alternative implementation is not worth the effort).

0%

No, let's consider a different basis for the parser implementation.

100%

3 votes

a-frantz · 2024-05-03T17:02:02Z

a-frantz
May 3, 2024
Maintainer

It sounds to me like the right call is to ditch pest in favor of logos and rowan. The main conclusion I'm drawing is that sticking with pest will save us dev-time in the short term but will likely cause headaches and potentially limit us in the future, whereas switching to logos and rowan is an up-front investment for a better dev experience and UX down the line.

I think there's a good chance we regret it further down the line if we stick with pest. Regretting using logos and rowan seems unlikely to me.

@peterhuene I appreciate that this was a thorough write-up, and you've preemptively answered a lot of questions I would have asked.

0 replies

claymcleod · 2024-05-03T19:00:16Z

claymcleod
May 3, 2024
Maintainer

Two quick questions (sorry, I was not able to give this deep thought):

In the logos/rowan setup, how much more complex is parsing itself? I gather that logos provides the lexing, but is the parsing itself going to be totally custom? Should we consider using something like ungrammar, which appears to be part of rust-analyzer?
Though it's a bit orthogonal to the choice of logos/rowan vs. pest, is there any value in making the AST mutable and independent as opposed to having it be an immutable interpretation of the CST? In particular, I'm wondering if there are going to be benefits from, say, the ability to compose an AST using code and then convert it to syntax (say, from a code generation standpoint, which I've definitely been thinking about).

Overall, I'm not super concerned about the downsides—we're still v0.x, so we'd be okay to change IMO.

2 replies

peterhuene May 3, 2024
Maintainer Author

In the logos/rowan setup, how much more complex is parsing itself? I gather that logos provides the lexing, but is the parsing itself going to be totally custom? Should we consider using something like ungrammar, which appears to be part of rust-analyzer?

The lexer would be pretty straightforward with logos (a few hundred lines), but the parser would be hand-written and certainly more complex (in terms of code size) if an infallible parse is desired. An implementation I've made in the past loosely followed rust-analyzer's design, where the parser essentially consumes a token stream and produces a "parser event" list which can be used to construct the CST bottom-up. In that implementation, the parser event list may contain error events for errors encountered in the parse, which are collected and provided alongside the always-produced CST for user feedback.
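A rough sketch of the event-list idea, assuming a flat list of start/token/error/finish events; this is a simplified illustration rather than rust-analyzer's actual event format, and `Event`, `Tree`, and `build` are hypothetical names. Each node is created only when its `Finish` event arrives, so children are always complete before their parent.

```rust
/// Events a parser might emit instead of building the tree directly.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    Start(&'static str),
    Token(String),
    Error(String),
    Finish,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Tree {
    Node { kind: &'static str, children: Vec<Tree> },
    Token(String),
}

/// Replays `events` into a tree and returns it together with the collected
/// errors. Assumes balanced events with a single outermost `Start`/`Finish`
/// pair; tokens that appear outside any open node are dropped.
pub fn build(events: &[Event]) -> (Option<Tree>, Vec<String>) {
    // Each stack entry is a node that has been started but not yet finished.
    let mut stack: Vec<(&'static str, Vec<Tree>)> = Vec::new();
    let mut errors = Vec::new();
    let mut root = None;
    for event in events {
        match event {
            Event::Start(kind) => stack.push((*kind, Vec::new())),
            Event::Token(text) => {
                if let Some((_, children)) = stack.last_mut() {
                    children.push(Tree::Token(text.clone()));
                }
            }
            // Errors do not affect the tree shape; they travel alongside it.
            Event::Error(message) => errors.push(message.clone()),
            Event::Finish => {
                if let Some((kind, children)) = stack.pop() {
                    let node = Tree::Node { kind, children };
                    match stack.last_mut() {
                        Some((_, parent)) => parent.push(node),
                        None => root = Some(node),
                    }
                }
            }
        }
    }
    (root, errors)
}
```

Because errors are just another event, the caller always gets a tree back and can report every diagnostic at once.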

ungrammar is used by rust-analyzer to automatically generate the SyntaxKind (i.e. the enum used to differentiate nodes/tokens in the CST) as well as the AST adapters for the CST, which definitely saves a lot of hand-written boilerplate code for a complex language like Rust. The ungrammar crate itself isn't a general-purpose code generator, just a parser of ungrammars that produces a rule representation for a code generator to make use of; we'd have to implement said code generator ourselves (or otherwise adapt rust-analyzer's task for our needs). I think we'd only discover if such an investment is useful at the stage where we're authoring the WDL AST adapters.

That said, I found that for my (comparatively simple) language I didn't mind writing those adapters by hand. It also allowed my AST adapters to be a little more precise than what their generator outputs. For example, it might know that if a node was produced, then a certain child token is always present, so the adapter function would return a representation of the SyntaxToken rather than an Option<SyntaxToken>, leading to a slightly more ergonomic API.

Though it's a bit orthogonal to the choice of logos/rowan vs. pest, is there any value in making the AST mutable and independent as opposed to having it be an immutable interpretation of the CST? In particular, I'm wondering if there are going to be benefits from, say, the ability to compose an AST using code and then convert it to syntax (say, from a code generation standpoint, which I've definitely been thinking about).

That's a great question! Certainly basing the CST implementation on an immutable data structure that is practically (from an ergonomics standpoint) only constructable from a parser's output would not lend itself to a good code generation API.

The nice thing about having the AST be a facade of the CST is that it requires no additional space. Linters*, doc generators, and other tools that simply want to analyze the source without generating any new source don't pay the cost of copying data out into a separate, mutable representation. The alternative, parsing to produce a mutable CST in the first place, loses a whole lot of the benefits of a red-green tree representation (namely not having to completely recreate the tree for live editing).

But I certainly see the potential need for a WDL source generator. Personally I would advocate for a separate "builder" API that has a builder for each node in the CST, producing a CST as an artifact from a top-level DocumentBuilder or some such type. While certainly that would mean more code than having a single mutable CST, it does mean users only pay for what they need.
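To give a feel for what such a builder API might look like, here is a deliberately small sketch. `TaskBuilder` and its methods are invented for illustration, only `DocumentBuilder` is named above, and to stay short the sketch renders WDL source text instead of producing a CST.

```rust
/// Builds one `task` definition (hypothetical API).
pub struct TaskBuilder {
    name: String,
    command: String,
}

impl TaskBuilder {
    pub fn new(name: &str) -> Self {
        TaskBuilder { name: name.to_string(), command: String::new() }
    }

    /// Sets the single-line body of the task's command section.
    pub fn command(mut self, command: &str) -> Self {
        self.command = command.to_string();
        self
    }

    fn render(&self, out: &mut String) {
        out.push_str(&format!("task {} {{\n", self.name));
        out.push_str("    command <<<\n");
        out.push_str(&format!("        {}\n", self.command));
        out.push_str("    >>>\n}\n");
    }
}

/// Top-level builder that owns the per-node builders.
pub struct DocumentBuilder {
    version: String,
    tasks: Vec<TaskBuilder>,
}

impl DocumentBuilder {
    pub fn new(version: &str) -> Self {
        DocumentBuilder { version: version.to_string(), tasks: Vec::new() }
    }

    pub fn task(mut self, task: TaskBuilder) -> Self {
        self.tasks.push(task);
        self
    }

    /// Renders the document; a fuller version would emit CST nodes instead.
    pub fn build(&self) -> String {
        let mut out = format!("version {}\n", self.version);
        for task in &self.tasks {
            out.push('\n');
            task.render(&mut out);
        }
        out
    }
}
```

For example, `DocumentBuilder::new("1.1").task(TaskBuilder::new("hello").command("echo hi")).build()` produces a one-task document. Tools that only analyze source never touch this API.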

I honestly can't say if rowan is definitely the right call for our CST representation, but if we do plan on someday having the need for a live editing parser, the immutable nature of the CST will be of great benefit.

* I realize some linters may want to auto-correct issues, but that is possible with an immutable CST without a builder API; edits simply require replacing nodes from the edited node all the way to root to produce a new tree; that is still very efficient as the unaffected subtrees are reused between the two trees.
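The footnote's point about edits can be sketched with reference-counted nodes. This is not rowan's API; `Green`, `replace`, and `text` are hypothetical names, and `Rc` stands in for whatever sharing mechanism a real green tree uses. Only the nodes on the path from the edit to the root are rebuilt; every other subtree is shared between the old and new trees.

```rust
use std::rc::Rc;

/// An immutable tree whose subtrees are shared via reference counting.
#[derive(Debug)]
pub enum Green {
    Token(String),
    Node(Vec<Rc<Green>>),
}

/// Returns a new tree in which the element at `path` (child indices from the
/// root) is replaced by `new`. Returns `None` if the path does not lead to an
/// element. The original tree is left untouched.
pub fn replace(root: &Rc<Green>, path: &[usize], new: Rc<Green>) -> Option<Rc<Green>> {
    let (index, rest) = match path.split_first() {
        Some((&index, rest)) => (index, rest),
        None => return Some(new), // empty path: replace this element itself
    };
    match &**root {
        Green::Token(_) => None, // tokens have no children to descend into
        Green::Node(children) => {
            let replaced = replace(children.get(index)?, rest, new)?;
            // Cloning the vector clones `Rc` pointers, not the subtrees.
            let mut copy = children.clone();
            copy[index] = replaced;
            Some(Rc::new(Green::Node(copy)))
        }
    }
}

/// Concatenates the text of all tokens below `node`.
pub fn text(node: &Green) -> String {
    match node {
        Green::Token(t) => t.clone(),
        Green::Node(children) => children.iter().map(|c| text(c)).collect(),
    }
}
```

After an edit, `Rc::ptr_eq` on an untouched child of the old and new roots returns `true`, which is what keeps auto-correcting lints cheap even on large documents.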

claymcleod May 7, 2024
Maintainer

Thanks for the reply: admittedly, I didn't look closely at the ungrammar crate in my first reply, but I see what you mean now after looking at it more closely.

My feeling is that we should definitely make the switch in the long term. The real question is whether this is the right thing to focus on in the short term, particularly for the next six weeks. The major guiding factor there that I can see is the trade-off between these two things:

How much time is going to be wasted if we keep Pest for now and implement, say, the 30 or so proposed rules and eventually convert them over to logos vs.
How much are we going to learn during the implementation of these rules that might be easier to draft out with Pest since it requires less custom coding.

I actually do not have a preference. If it were me, I'd probably just go ahead and do the conversion, but I'm also totally okay with not doing the conversion just to get something out there and make the tool more useful in the short term.

peterhuene · 2024-06-18T19:35:04Z

peterhuene
Jun 18, 2024
Maintainer Author

Closing this discussion as the new parser implementation is now present in the wdl family of crates and used from sprocket 0.3.0.

0 replies
