Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract askama_parser from askama_derive #782

Closed
wants to merge 64 commits into from

Conversation

couchand
Copy link
Contributor

This builds on the great work by @Kijewski in #772 to make the parser code more maintainable, and replaces the draft PR on this matter at #748.

An attempt is made to keep the code unchanged where possible. Several minor changes were made to try to keep the dependencies and API surface minimal, including:

  • removing the span from CompileError (it's always Span::call_site anyway)
  • effectively reverting the revert at 747958b. I'm curious to know if any benchmarking has been done on the use of borrowed Strings in the Syntax struct -- in my back-of-the-napkin analysis I don't see how that one-time copy of six short strings would be performance-sensitive.

Next steps include bringing over the parser AST documentation that I had started on #748, and additional API cleanup as necessary.

c.f. #760

@couchand couchand mentioned this pull request Feb 28, 2023
@Kijewski Kijewski self-requested a review February 28, 2023 19:25
@Kijewski
Copy link
Contributor

Could you please split up the commit a little? Your explanations sound sensible, but it would be a bit easier to review if every change was its own commit (removing Span, making Config 'static again, moving the code).

Cc @GuillaumeGomez

@GuillaumeGomez
Copy link
Contributor

Seems like we can now create the parser outside of the proc macro so that looks good to me. 👍

@couchand couchand force-pushed the 2023-02/extract-parser branch 2 times, most recently from ee1994a to 109bb62 Compare February 28, 2023 22:22
@couchand
Copy link
Contributor Author

@Kijewski I've split things up here into hopefully bite-sized pieces. Please let me know if you there's anything else you think could use improvement.

@couchand
Copy link
Contributor Author

Upon further review, I'm not sure the cut is in exactly the right place here. Towit: Syntax should be in askama_parser but I now think Config should remain in askama_derive. I'm going to take a closer look at what it would take to split up the config module.

@couchand couchand force-pushed the 2023-02/extract-parser branch 2 times, most recently from 9136054 to 51d49e8 Compare March 1, 2023 05:18
@couchand
Copy link
Contributor Author

couchand commented Mar 1, 2023

Thanks for your patience @Kijewski @djc @GuillaumeGomez. I've re-split the crates per my previous comment, and I think the commit sequence looks good.

Please take another look and let me know any feedback you have, thank you!

Copy link
Owner

@djc djc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with most of the direction here, though I have bunch of (mostly stylistic) nits.

@@ -0,0 +1,22 @@
#[derive(Debug)]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like tiny modules like this. Please move it to the bottom of parser/mod.rs instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

use std::borrow::Cow;
use std::fmt;

#[derive(Debug, Clone)]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes no sense to stick CompileError in parser. We can just return Cow<'static, str> directly in the Result from parse(), instead.

Copy link
Contributor Author

@couchand couchand Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger that. I had not grokked that it's not used internally within the parser, just at that top-level parse method. After further reflection, I'd prefer an error type with public accessors to get the line and column numbers and input snippet, and perhaps in the future additional details.

After looking at the three error cases returned by the parse method, I believe they can be coalesced into a single case (logically, if not at the type level) with the combinators nom::combinator::all_consuming and nom::compbinator::complete, so I'd propose doing that as well.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with having a fully-formed Error type, if you want to wire something up -- it just shouldn't be the CompileError type that's used for all the other stuff in derive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1,7 +1,8 @@
use crate::config::{get_template_source, read_config_file, Config, WhitespaceHandling};
use crate::heritage::{Context, Heritage};
use crate::input::{Print, Source, TemplateInput};
use crate::parser::{parse, CompileError, Cond, CondTest, Expr, Loop, Node, Target, When, Whitespace, Ws};
use crate::parser::node::{Cond, CondTest, Loop, Node, Target, When, Whitespace, Ws};
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the current number of items, re-exporting items from the nested modules in the parser module seems like a better solution to me than exposing parser's internal module structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imported like this to make the change minimal, but if I had been writing the code from scratch all the references here would be in the style node::Cond, node::When, etc. But I hear you on the blanket imports here. If we're expecting them to be imported directly, would it be a good idea to make the names more verbose (e.g. LoopNode)?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so? Again, the number of types exposed here is pretty small, and pretty much all of them are nodes anyway. You added a bunch of documentation, I think that should be good enough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -35,17 +35,17 @@ pub(crate) enum Expr<'a> {
}

impl Expr<'_> {
pub(super) fn parse(i: &str) -> IResult<&str, Expr<'_>> {
pub(crate) fn parse(i: &str) -> IResult<&str, Expr<'_>> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are actually identical in this case, just leave them as-is for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

pub(crate) mod expr;
pub(crate) mod node;
pub(crate) mod syntax;
pub mod error;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be kept pub(crate).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer applicable.

//! which represent either some literal text in the template or one of the
//! three types of template tags: expressions, blocks, or comments.
//!
//! The main entry point to this crate is the [`parse()`](./fn.parse.html)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a very nice public API. Maybe change it to wrap a struct Ast<'a> { nodes: Vec<Node<'a>> } and change parse() to become Ast::from_str()? (In a separate commit.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree, and propose that the type be named Template.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like Template since that's already what the trait is called.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I was thinking std::str::FromStr, but upon review I noticed this:

FromStr does not have a lifetime parameter, and so you can only parse types that do not contain a lifetime parameter themselves.

Perhaps you mean a non-trait from_str method? Since it also takes a Syntax parameter, I'm wondering if from_str fully captures the meaning.

As to the new struct's name, some things stood out. The parse_template helper is what produces this top-level result and the bodies of loops, if and else clauses, etc. and it makes sense to unify the concepts. Those elements are frequently referred to as blocks. Perhaps struct Block<'a> { nodes : Vec<Node<'a>> }?

#[derive(Debug, PartialEq)]
pub enum Node<'a> {
/// Literal text to output directly.
///
/// The first and third tuple elements are the left- and right-side
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If making this API public, consider changing these variants to name their fields instead of using tuple variants. (Separate commit.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the API refactor, these tuple variants all seem self-explanatory, please advise if you disagree.

Name(&'a str),
/// Destruture a tuple value.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Done.

///
/// Second field is "minus/plus sign was used on the right part of the item".
/// First tuple value is the setting (`-`/`+`/`~`) for the left side of the block.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably name the fields here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

pub ws3: Ws,
}

/// A single branch of a match block.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very fond of type aliases these days. I think we should make When a proper struct with fields, and same for Cond.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 20 to 22
Err(e) => syn::Error::new(Span::call_site(), e)
.to_compile_error()
.into(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point I guess we could implement to_compile_error() manually (untested):

format!("::core::compile_error!({:?});", format_args!("{e}")).parse()

If we can omit the proc_macro2 dependency altogether in this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, although proc_macro2 is still required by a number of the integration crates.

@couchand
Copy link
Contributor Author

couchand commented Mar 3, 2023

For those following along at home, the requested changes not related to API improvements and documentation are up at main...couchand:askama:2023-03/extract-parser

@couchand couchand force-pushed the 2023-02/extract-parser branch 3 times, most recently from 3880bf4 to a1481da Compare March 6, 2023 20:00
@couchand
Copy link
Contributor Author

couchand commented Mar 6, 2023

I've rewritten the patch set with the idea that we want to keep everything within askama_derive until we've completely refactored the parser API to a public form. I've removed the net new stuff (capturing comment bodies and the interstitial between match and the first match arm), figuring that it's better to focus just on extracting the parser as it's used here.

There are a large number of commits that are each the implementation of some big picture idea for one of the AST node types, repeated for each node type. These goals are all in pursuit of making the AST well-structured, so that containment within sub-blocks is clearly maintained. The major sticking point to that was the treatment of Ws. Here are the various threads:

  • Redefine Ws semantics for nodes with nested blocks to keep the Ws for the "outside" of the node in one place, and associate the "inside" Ws with the sub-block that it logically contains.
  • Create the Tag struct to represent any Askama tag with its "outer" Ws, migrate non-Lit nodes to Tag.
  • Create the Block struct to represent a sub-block with its "inner" Ws, migrate all paired Vec<Node<'a>> and Ws to Block.
  • Extract structs for each non-trivial Tag.

I appreciate your patience reviewing this patch set, hopefully I have made each commit small enough that it is not too much of a bear to go through it. Happy to take any feedback or suggestions!

@Kijewski
Copy link
Contributor

Kijewski commented Mar 6, 2023

I like the PR very much! Every change is well thought out and well explained.

Actually the only thing I don't like is the name "Tag". I don't associate anything with the name, but maybe that's just me. I guess if we apply the names XML uses, then our "Tag" is an "Element", our "Lit" is "Data", and our "Node" is, well, a "Node".

@djc
Copy link
Owner

djc commented Mar 7, 2023

I appreciate your patience reviewing this patch set, hopefully I have made each commit small enough that it is not too much of a bear to go through it. Happy to take any feedback or suggestions!

I would prefer to merge something closer to the original scope of the PR, before going into increasing depth of changing the structure of the AST types to your preferences. In particular, it would be nice to get the documentation changes in separately, because one potentially useful input to judging AST type refactoring is to review the diff's line count changes.

In its current state (with a bunch of changes with naming choices I question and other choices that don't have obvious value to me), I don't think I'll be able to prioritize digesting this PR among the other ~70 GitHub notifications I should be attending to.

@couchand
Copy link
Contributor Author

couchand commented Mar 7, 2023

I like the PR very much! Every change is well thought out and well explained.

Thanks @Kijewski!

I would prefer to merge something closer to the original scope of the PR

I hear you @djc. If you'd prefer separate PRs for the refactorings rather than including it all here, I'm happy to split those commits out. Before doing so, I'd like to get general consensus on the overall direction so I'm not spinning my wheels. (It's a lot of commits because there are many variants but it's only a handful of "changes", and without the context of the end goal each can seem to lack purpose).

This PR works backwards from the last commit (which extracts the crate). Per previous discussion, it seems wise to figure out the public API before extracting the code rather than after. So based on your previous comments I worked to design that public API in a way that's internally consistent and exposes the structure a downstream consumer would need.

one potentially useful input to judging AST type refactoring is to review the diff's line count changes.

I'd suggest one of the best ways to judge this refactor is to look at askama_derive/src/generator.rs. That file lost 40 lines net, basically just by consolidating a bunch of handle_ws, prepare_ws, and flush_ws calls. These whitespace-handling calls are sprinkled about the generation code, somewhat inconsistently, and it's not clear what the principled place is for any one of them. The new AST structure pairs the whitespace handling to the template structure it corresponds to (rather than merely its location in the textual source), so it's clear where whitespace handling must be done.

with a bunch of changes with naming choices I question

Happy to take naming suggestions. The structs currently called Tag and Block are big and visible so I'm not surprised we'd want to bikeshed on them for a bit. Otherwise new names tend to follow directly from context, but please do leave comments on specific examples.

@djc
Copy link
Owner

djc commented Mar 8, 2023

I would prefer to merge something closer to the original scope of the PR

I hear you @djc. If you'd prefer separate PRs for the refactorings rather than including it all here, I'm happy to split those commits out. Before doing so, I'd like to get general consensus on the overall direction so I'm not spinning my wheels. (It's a lot of commits because there are many variants but it's only a handful of "changes", and without the context of the end goal each can seem to lack purpose).

I'm not sure what overall direction you mean. If by overall direction you mean a bunch of the refactoring of the Node type then there is no good way to get to consensus without me having to plow through all the changes to get a good sense of what they are, so that doesn't help me deal with reviewing this.

This PR works backwards from the last commit (which extracts the crate). Per previous discussion, it seems wise to figure out the public API before extracting the code rather than after. So based on your previous comments I worked to design that public API in a way that's internally consistent and exposes the structure a downstream consumer would need.

"internally consistent" is too abstract to be meaningful to me, and "the structure a downstream consumer" seems quite specific to your opinions -- I think the existing code generator is a downstream consumer and while it might not be the prettiest code, it does okay.

I'd suggest one of the best ways to judge this refactor is to look at askama_derive/src/generator.rs. That file lost 40 lines net, basically just by consolidating a bunch of handle_ws, prepare_ws, and flush_ws calls. These whitespace-handling calls are sprinkled about the generation code, somewhat inconsistently, and it's not clear what the principled place is for any one of them. The new AST structure pairs the whitespace handling to the template structure it corresponds to (rather than merely its location in the textual source), so it's clear where whitespace handling must be done.

I can get behind the idea that the whitespace handling could be more principled. Maybe that should be a separate first PR, with minimal changes to the rest of the structure and naming?

with a bunch of changes with naming choices I question

Happy to take naming suggestions. The structs currently called Tag and Block are big and visible so I'm not surprised we'd want to bikeshed on them for a bit. Otherwise new names tend to follow directly from context, but please do leave comments on specific examples.

I don't like Tag and I don't like Block in the way it currently seems to be used. In general, it's not obvious me to me that splitting the Node type in Tag and Block makes things clearer. In particular, there's the notion of a block in the semantics here which we would not want to confuse with the notion of a Block here, which is different. You kept the Node type but it actually no longer represents the highest-level type in the AST type hierarchy, which IMO is confusing. In the old structure, things with {% %} were nodes and things with {{ }} were expressions, which seems conceptually clear. The distinction between Node, Tag and Block seems pretty unclear to me (and comments saying that blocks come in three flavors certainly don't make it seem like the hierarchy you've picked is clearly represented in the type system).

The way to avoid bikeshedding is to avoid renaming things for now while we get the structure right, then renaming things can come last.

@couchand
Copy link
Contributor Author

couchand commented Mar 8, 2023

Ok, to be honest I'm struggling with what's actionable here. Do you want me to:

  • rename things before you review in detail?
  • submit a refactoring PR first?
    • if so, do you have a suggestion about which commits should be included?
  • just be patient?

I'm not sure what overall direction you mean.

I mean the shape of the new public API. Here is one proposal, if it's not right let's discuss where it falls short so that my next attempt can meet your goals.

"the structure a downstream consumer" seems quite specific to your opinions -- I think the existing code generator is a downstream consumer and while it might not be the prettiest code, it does okay.

These changes are mostly driven by the generator code, the whitespace handling in particular. I've even removed the new things I'd added for other potential users. But, I'd point out, we also shouldn't totally ignore potential new users when designing a new public API.

I can get behind the idea that the whitespace handling could be more principled. Maybe that should be a separate first PR, with minimal changes to the rest of the structure and naming?

I hear what you're saying, but I'm not sure if there's a big win to be had there? I don't know if it really makes sense without the structural changes -- the structure is what makes it principled, just as with structured programming generally.

By my count, that PR would have 51 of the 64 commits from this one, everything from 1366d70 to 09a264a. Is there a smaller subset of these commits that you'd suggest go first?

In particular, there's the notion of a block in the semantics here which we would not want to confuse with the notion of a Block here, which is different.

Fair point, though OTOH, it's already a bit muddled given usage like:

pub(crate) else_block: Vec<Node<'a>>,
Some(_) => return Err("multiple extend blocks found".into()),
"extends, macro or import blocks not allowed below top level".into(),
Node::Loop(ref loop_block) => {

You kept the Node type but it actually no longer represents the highest-level type in the AST type hierarchy, which IMO is confusing.

FWIW, Node was not previously the highest-level in a template's AST either, that honor goes to Vec<Node>, which as you previously commented is a bit awkward.

In the old structure, things with {% %} were nodes and things with {{ }} were expressions

Well, not exactly. In the old structure, {{ }} became Node::Expr(Expr), {# #} became Node::Comment, and {% %} became Node::AnyOtherVariantExceptLit, none of which has changed much.

The distinction between Node, Tag and Block seems pretty unclear to me

If I may, a Node is either Literal text or an Askama Tag, and a Block is a sequence of nodes. All Tags have "outer" Ws that applies to the last whitespace before and the first whitespace after. All Blocks have "inner" Ws that applies to the first and last whitespace within.

comments saying that blocks come in three flavors certainly don't make it seem like the hierarchy you've picked is clearly represented in the type system

I played around with having an enum with variants for these (Jinja docs call them "delimiters"), but I didn't find it to be practically useful either for the generator code or for writing clear documentation, so I would now consider that a non-goal.

What is incredibly useful is correlating whitespace suppression to the structure it concerns, which is why my focus is there.

The way to avoid bikeshedding is to avoid renaming things for now while we get the structure right, then renaming things can come last.

I strongly agree. Let's focus on the purpose of the structures first and then later identify the right names for them.

@djc
Copy link
Owner

djc commented Mar 13, 2023

"just be patient" won't get us going anywhere. Honestly, I don't have clear idea on how to navigate this either. Maybe an idea would be to first transpose the prev_ws/next_ws structure to outer_ws/inner_ws, while otherwise keeping the current structure? Would that be enough to create more of the structure you got to in this PR?

Another approach would be to start by landing smaller parts that we have consensus on already, to at least make some progress.

FWIW, Node was not previously the highest-level in a template's AST either, that honor goes to Vec<Node>, which as you previously commented is a bit awkward.

It's not obvious to me that it's useful/necessary to wrap (the equivalent of) Vec<Node> in another type (except at the top level, where we can attach a parse() function to it).

If I may, a Node is either Literal text or an Askama Tag, and a Block is a sequence of nodes. All Tags have "outer" Ws that applies to the last whitespace before and the first whitespace after. All Blocks have "inner" Ws that applies to the first and last whitespace within.

In my mind what you call Tag is closer to what was referred to as Block before. Some additional structure might be needed in between because that's a helpful way to track whitespace.

@djc
Copy link
Owner

djc commented Jul 31, 2023

#834 extracts a parser crate and improves the parser API a bunch.

@djc
Copy link
Owner

djc commented Jul 31, 2023

I'm going to close this, as the stated goal (extract parser crate) has been achieved and the remaining changes will need substantial work to rebase -- it probably makes sense to start a new PR instead.

@djc djc closed this Jul 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants