Migration from `pest` to `logos` + `rowan`? #29
Replies: 3 comments 2 replies
-
It sounds to me like the right call is to ditch `pest`. I think there's a good chance we regret it further down the line if we stick with `pest`. @peterhuene I appreciate that this was a thorough write-up, and you've preemptively answered a lot of questions I would have asked.
-
Two quick questions (sorry, I was not able to give this deep thought):
Overall, I'm not super concerned about the downsides; we're still v0.x, so we'd be okay to change IMO.
-
Closing this discussion as the new parser implementation is now present in the
-
Overview
As development on top of the `wdl` family of crates ramps up in terms of linters, formatters, documenters, and, perhaps one day, an LSP server for live analysis of WDL documents, I wanted to discuss the use of `pest` and whether or not it makes sense to migrate to a different parser implementation.
Definitions of a few terms that are relevant to this discussion:
- `pest` - a PEG-based parser that uses a grammar definition file to generate a parser via a Rust procedural macro; this crate is currently used by `wdl-grammar`.
- `logos` - a lexer that uses procedural macros containing regular expressions and compiles those regular expressions into a DFA that drives tokenization; as a result, it has very good performance for incrementally classifying text into a stream of tokens.
- `rowan` - a crate created for use in `rust-analyzer` that implements a "red-green" tree data structure; this data structure is meant for efficient "live" analysis and updates that are required from interactive parses such as from within a code editor.
One developer's Rust-based parser journey
For one of the (entirely unrelated) languages I've implemented in the past, I started with `pest` because it was pretty easy to get going: it handled both lexing and parsing in a single PEG grammar definition.
As time went on, I realized that the parse errors produced by `pest` weren't great (and confused users by referencing rule names that are internal implementation details), the performance of the parse itself could be better, the rules were difficult to debug, and it was very challenging to preserve whitespace and comments at any location in the grammar, leading to overly complicated rules and parsing bugs involving whitespace. So I decided to abandon `pest` in favor of `logos` with a hand-crafted parser that consumes tokens from the stream with a single lookahead.
The immediate benefits were the improved diagnostics (this change also coincided with switching to `thiserror` and `miette` for defining lexing/parsing errors) and consistent handling of trivia during the parse. Additionally, the rules were easier to debug as they were implemented directly in source and not hidden behind `pest` proc-macro output. The obvious downside was that it was a lot more code to implement the parser than a proc-macro invocation.
As I started to implement an LSP (said LSP is a side project that is still nowhere near ready), I realized that I needed an infallible parser, one capable of always producing a CST no matter what the input was. So the hand-crafted parser was once again re-implemented in the manner by which `rust-analyzer` constructs its parse tree: a series of "rule" functions that consume tokens and, at certain points in the grammar, attempt recovery when they encounter errors (the general strategy is to "abandon" the node currently being constructed and look for where it might start again on the next node depending on the current context).
For example, here's what one such rule function looks like in my parser:
Which parses this rule:
Note that trivia is attached to the node currently being parsed by the parser implementation, so the rules don't concern themselves with it; additionally, it provides detailed "expected/found" diagnostics when it encounters unexpected tokens.
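As a rough illustration of the approach described above (this is not the author's actual parser), here is a minimal std-only sketch of a rule function that attaches trivia to the node being built, reports "expected/found" diagnostics, and recovers by abandoning the current node. Every name in it (`Kind`, `Element`, `Parser`, `let_stmt`, `parse`) and the toy `let` grammar are hypothetical, and the hand-written `lex` function only stands in for the tokenizer that `logos` would generate.

```rust
// Hypothetical sketch: none of these names come from the author's parser.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Let, Ident, Eq, Number, Semi, Trivia, Unknown, Eof }

#[derive(Debug, PartialEq)]
enum Element<'a> {
    Token(Kind, &'a str),
    Node(&'static str, Vec<Element<'a>>),
}

/// Length in bytes of the leading run of characters matching `pred`.
fn run_len(s: &str, pred: fn(char) -> bool) -> usize {
    s.find(|ch: char| !pred(ch)).unwrap_or(s.len())
}

/// Hand-written stand-in for a generated lexer; whitespace is kept as trivia.
fn lex(src: &str) -> Vec<(Kind, &str)> {
    let mut out = Vec::new();
    let mut rest = src;
    while let Some(c) = rest.chars().next() {
        let (kind, len) = if c.is_whitespace() {
            (Kind::Trivia, run_len(rest, char::is_whitespace))
        } else if c.is_ascii_digit() {
            (Kind::Number, run_len(rest, |ch| ch.is_ascii_digit()))
        } else if c.is_ascii_alphabetic() {
            let len = run_len(rest, |ch| ch.is_ascii_alphanumeric());
            (if &rest[..len] == "let" { Kind::Let } else { Kind::Ident }, len)
        } else {
            (match c { '=' => Kind::Eq, ';' => Kind::Semi, _ => Kind::Unknown }, c.len_utf8())
        };
        out.push((kind, &rest[..len]));
        rest = &rest[len..];
    }
    out
}

struct Parser<'a> {
    tokens: Vec<(Kind, &'a str)>,
    pos: usize,
    children: Vec<Element<'a>>, // children of the node currently being built
    errors: Vec<String>,
}

impl<'a> Parser<'a> {
    /// Attaches any trivia at the cursor to the current node, then peeks.
    fn peek(&mut self) -> Kind {
        while let Some(&(Kind::Trivia, text)) = self.tokens.get(self.pos) {
            self.children.push(Element::Token(Kind::Trivia, text));
            self.pos += 1;
        }
        self.tokens.get(self.pos).map_or(Kind::Eof, |t| t.0)
    }

    /// Consumes the current token into the current node.
    fn bump(&mut self) {
        let (kind, text) = self.tokens[self.pos];
        self.children.push(Element::Token(kind, text));
        self.pos += 1;
    }

    /// Consumes `kind` or records an "expected/found" diagnostic.
    fn expect(&mut self, kind: Kind) -> bool {
        let found = self.peek();
        if found == kind {
            self.bump();
            return true;
        }
        self.errors.push(format!("expected {:?}, found {:?}", kind, found));
        false
    }

    /// Rule: let_stmt = "let" IDENT "=" NUMBER ";"
    fn let_stmt(&mut self) {
        let outer = std::mem::take(&mut self.children);
        let ok = [Kind::Let, Kind::Ident, Kind::Eq, Kind::Number, Kind::Semi]
            .iter()
            .all(|&k| self.expect(k));
        if !ok {
            // Recovery: abandon this node, skipping to just past the next `;`
            // or stopping before the next `let`.
            loop {
                match self.peek() {
                    Kind::Eof | Kind::Let => break,
                    Kind::Semi => { self.bump(); break; }
                    _ => self.bump(),
                }
            }
        }
        let inner = std::mem::replace(&mut self.children, outer);
        self.children.push(Element::Node(if ok { "LetStmt" } else { "Error" }, inner));
    }
}

/// Infallible entry point: always returns a tree plus any diagnostics.
fn parse(src: &str) -> (Vec<Element<'_>>, Vec<String>) {
    let mut p = Parser { tokens: lex(src), pos: 0, children: Vec::new(), errors: Vec::new() };
    while p.peek() != Kind::Eof {
        p.let_stmt();
    }
    (p.children, p.errors)
}
```

In this sketch, parsing `let x = 1; let = 2; let y = 3;` still yields a tree: the malformed middle statement becomes an `Error` node carrying its tokens, one diagnostic is recorded, and the statements on either side parse normally.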
In addition to the need for an infallible parser, I needed a proper representation of the CST that allowed for incremental re-parses, incremental tree updates, and fast span lookup. Again I asked myself "what would `rust-analyzer` do?" and the answer was `rowan`, which implements a red-green tree that can scale to very large parse trees. So my AST representation was converted to be an adapter over a `rowan::SyntaxNode` (basically a red tree node) which knows how to walk the CST to provide only semantically necessary elements.
The discussion topic
From the lessons I've learned, I wanted to propose the following question for discussion: should we continue to base `wdl` off of `pest` or should we consider a different parser implementation?
For what it's worth, here's my opinionated list of pros and cons for `pest`:
Pros
- `pest` is currently used by `wdl-grammar` and is exposed publicly in its API, which means switching to a new parser implementation would be a very disruptive breaking change; to me, this is by far the biggest pro.
- `pest` uses so it does require `pest` familiarity to understand.
Cons
- `(WHITESPACE | COMMENT)` appear in the current WDL `pest` grammar.
- `pest` has historically been not all that great.
And for a different parser implementation:
Pros
- `thiserror` and `miette` for defining errors, making error handling better from both an API perspective and from a UX perspective).
Cons
- `wdl-grammar` users will need to use `rowan::SyntaxNode` for traversing all elements in the CST.
- `wdl-ast` users will need to use our provided AST adapters that adapt `SyntaxNode` into something that walks the CST for them based on WDL semantics.
- `pest` is an entirely adequate parser implementation for our current needs.
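To make the `SyntaxNode` and AST-adapter points above more concrete, here is a minimal std-only sketch of the red-green idea together with a typed adapter over it. It is a simplification under stated assumptions and not `rowan`'s API: the `Green`, `Red`, `Decl`, and `Kind` types, the `sample` tree, and the toy declaration syntax are all hypothetical. Green elements store only kinds, text, and lengths; red views add absolute offsets on demand; the adapter skips trivia for its callers.

```rust
use std::rc::Rc;

// Hypothetical, simplified red-green tree; rowan's real types are richer.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Root, Decl, Ident, Eq, Number, Whitespace, Comment }

/// "Green" element: immutable and position-independent, so it can be shared.
enum Green {
    Token { kind: Kind, text: String },
    Node { kind: Kind, len: usize, children: Vec<Rc<Green>> },
}

impl Green {
    fn token(kind: Kind, text: &str) -> Rc<Green> {
        Rc::new(Green::Token { kind, text: text.to_string() })
    }
    fn node(kind: Kind, children: Vec<Rc<Green>>) -> Rc<Green> {
        // A node caches the total length of its subtree.
        let len = children.iter().map(|c| c.len()).sum();
        Rc::new(Green::Node { kind, len, children })
    }
    fn len(&self) -> usize {
        match self {
            Green::Token { text, .. } => text.len(),
            Green::Node { len, .. } => *len,
        }
    }
    fn kind(&self) -> Kind {
        match self {
            Green::Token { kind, .. } | Green::Node { kind, .. } => *kind,
        }
    }
}

/// "Red" view: a green element plus its absolute offset, computed on demand.
struct Red { green: Rc<Green>, offset: usize }

impl Red {
    fn root(green: Rc<Green>) -> Red { Red { green, offset: 0 } }
    fn span(&self) -> (usize, usize) { (self.offset, self.offset + self.green.len()) }
    fn children(&self) -> Vec<Red> {
        let mut offset = self.offset;
        let mut out = Vec::new();
        if let Green::Node { children, .. } = &*self.green {
            for child in children {
                out.push(Red { green: Rc::clone(child), offset });
                offset += child.len();
            }
        }
        out
    }
    /// Reassembles the exact source text, trivia included.
    fn text(&self) -> String {
        match &*self.green {
            Green::Token { text, .. } => text.clone(),
            Green::Node { .. } => self.children().iter().map(Red::text).collect(),
        }
    }
}

/// AST adapter: a typed wrapper exposing only semantically relevant children.
struct Decl(Red);

impl Decl {
    fn cast(node: Red) -> Option<Decl> {
        (node.green.kind() == Kind::Decl).then(|| Decl(node))
    }
    fn child_text(&self, kind: Kind) -> Option<String> {
        self.0.children().into_iter().find(|c| c.green.kind() == kind).map(|c| c.text())
    }
    fn name(&self) -> Option<String> { self.child_text(Kind::Ident) }
    fn value(&self) -> Option<String> { self.child_text(Kind::Number) }
}

/// Hand-built tree for the text "x = 1 # one\ny = 22".
fn sample() -> Red {
    let decl = |name: &str, value: &str| Green::node(Kind::Decl, vec![
        Green::token(Kind::Ident, name), Green::token(Kind::Whitespace, " "),
        Green::token(Kind::Eq, "="), Green::token(Kind::Whitespace, " "),
        Green::token(Kind::Number, value),
    ]);
    Red::root(Green::node(Kind::Root, vec![
        decl("x", "1"),
        Green::token(Kind::Whitespace, " "),
        Green::token(Kind::Comment, "# one"),
        Green::token(Kind::Whitespace, "\n"),
        decl("y", "22"),
    ]))
}
```

A caller at the `wdl-grammar` level of this sketch would walk `Red::children()` and see every token, comments and whitespace included, while a caller at the `wdl-ast` level would use `Decl::cast` and then `name()` or `value()` without ever touching trivia.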