Tree-edit’s syntax generation has a fairly novel implementation (as far as I
know), and is also unfortunately complicated by some issues with the tree-sitter
API and some concessions made for performance. So this document serves to both
explain the high level architecture of tree-edit
and some of the complications
it faces, which hopefully can result in upstream fixes!
Tree-edit’s syntax generation consists of three main phases:
- Converting text into a syntax tree
- Modifying tokens in a syntactically correct manner
- Converting the changes back into text
To convert the text of the buffer into a syntax tree, elisp-tree-sitter is used. Tree-sitter only offers a way one way conversion from text to syntax tree, so the rest is handled by tree-edit.
To perform the syntax generation in tree-edit, reazon, an elisp implementation of the logic programming DSL miniKanren is used.
TL;DR on miniKanren: you can build functions and then ‘reverse’ them. So tree-edit defines a parser which can be used in reverse to generate all possible correct set of tokens for a node type. Tree-sitter generates a JSON file that describes the grammar of the language under parse, which tree-edit uses. Using this reverse parser along with some additional constraints can enable us to create structural editing functions:
- Insertion
- Insert a new node of node-type before or after node. This operation asserts that there are a number of new tokens to the left or right of node, which contain node-type.
- Deletion
- Delete node. This operation removes node, and all surrounding syntax – but syntax is allowed to repopulate if needed.
- Replacement
- Replace node with a new node of node-type. This operation simply checks if swapping node’s token for node-type still parses.
Syntax is never explicitly dealt with, it is simply added by the parser to meet the constraints. For example, the only way to add another expression in an argument list is to add a comma first, so if we assert that a new expression should exist in an argument list it will naturally follow.
These operations compose very well to construct more complex operations. For example, raise travels up the parents until it can find an ancestor that can be replaced with node, and then replaces it, Slurp/barf are composed of delete/insertion operations, etc.
See here for some interactive examples.
For newly inserted node, there’s no singular way to represent it as text. An
argument list could have 20 elements, or 0 – either way it’s an argument list!
So tree-edit-syntax-snippets
in the tree-edit-<language>.el
files define how
new nodes are represented.
Tree-edit interacts with different languages by using mode-local variables for
each language, defined in the tree-edit-<language>
files.
While tree-sitter emits JSON files, we don’t use them directly – instead we
preprocess the JSON and output it to a tree-edit-<language>-grammar.el
file.
This is handled by dev/tree-edit-generate-grammars.el. It performs some
transformations on the grammar for better use with the relational parser, and
performs some precalculations, like what node types show up for a given node,
supertypes and subtypes, etc.
Hidden nodes
Tree-sitter allows nodes to be hidden by prefixing their names with _
. This
can be disastrous for tree-edit in that it provides a false view of the tokens
for a given node, causing a seemingly correct node to fail to parse. Tree-edit
simply ignores the existence of these nodes in the parser.
For example, _newline
, _indent
, and _dedent
are hidden in the Python
grammar. Tree-edit hacks around this by using the whitespace rules, which works
fine for the most part.