Creating a "catch all" token #2017

Closed
knpwrs opened this issue Jan 26, 2024 · 5 comments

knpwrs commented Jan 26, 2024

I am using Chevrotain to try and make a lexer for the liquid templating language. Consider the following template (including the comment):

<!-- if array = [1,2,3,4,5,6] -->
{% for item in array limit:2 %}
  {{ item }}
{% endfor %}

As a first step, I am making a multi-mode lexer. The first mode, main, has three tokens:

export const ObjectStart = createToken({
  name: 'ObjectStart',
  pattern: /{{-?/,
  push_mode: MODE_OBJECT,
})

export const TagStart = createToken({
  name: 'TagStart',
  pattern: /{%-?/,
  push_mode: MODE_TAG,
})

export const Text = createToken({
  name: 'Text',
  pattern: /[\s\S]+/,
  line_breaks: true,
})

const lexer = new Lexer({
  modes: {
    'main': [ObjectStart, TagStart, Text],
    'object': [/* not relevant to issue */],
    'tag': [/* not relevant to issue */],
  },
  defaultMode: 'main',
})

Already you can see a problem with my Text token in that it will consume everything since ObjectStart and TagStart don't match. Essentially I want to match everything up until either {{ opens a liquid object or {% opens a liquid tag. I've tried /(?!{{|{%)+/ but this pattern matches empty strings. /(.+)(?:{{|{%)?/ appears to work, but in every case, including /[\s\S]+/, I am hitting something that I simply do not understand.
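The empty matches from /(?!{{|{%)+/ follow from lookaheads being zero-width: repeating one still consumes no characters. A minimal sketch in plain JavaScript, outside Chevrotain; the `sample` string and the second ("tempered") pattern are illustrations that do not appear in this thread, and whether Chevrotain handles that pattern well was not checked here:

```javascript
const sample = "text {{ item }}";

// A lookahead consumes nothing, so repeating it alone still matches the empty string.
const lookaheadOnly = /(?!{{|{%)+/.exec(sample);

// Pairing the lookahead with one consumed character per repetition yields real text.
const tempered = /(?:(?!{{|{%)[\s\S])+/.exec(sample);

console.log(JSON.stringify(lookaheadOnly[0])); // ""
console.log(JSON.stringify(tempered[0]));      // "text "
```

The second pattern stops right before the first {{ or {% because the lookahead is re-checked before every consumed character.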

My lexer returns the following errors:

{
  "errors": [
    {
      "column": 1,
      "length": 1,
      "line": 1,
      "message": "unexpected character: -><<- at offset: 0, skipped 1 characters.",
      "offset": 0,
    },
    {
      "column": 2,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->!<- at offset: 1, skipped 1 characters.",
      "offset": 1,
    },
    {
      "column": 3,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->-<- at offset: 2, skipped 1 characters.",
      "offset": 2,
    },
    {
      "column": 4,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->-<- at offset: 3, skipped 1 characters.",
      "offset": 3,
    },
  ]
}

The initial <!-- does not match, and then the tokens start at if array. With pattern set to /(.+)(?:{{|{%)?/, I get the following errors:

  "errors": [
    {
      "column": 34,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 33, skipped 1 characters.",
      "offset": 33,
    },
    {
      "column": 66,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 65, skipped 1 characters.",
      "offset": 65,
    },
    {
      "column": 79,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 78, skipped 1 characters.",
      "offset": 78,
    },
    {
      "column": 92,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 91, skipped 1 characters.",
      "offset": 91,
    },
  ]

Basically every new line is unexpected.
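These newline errors are consistent with how `.` behaves in JavaScript regular expressions: without the `s` (dotAll) flag it does not match line terminators, so a `.+` match ends at the end of a line and the `\n` itself is left over. A minimal sketch in plain JavaScript, outside Chevrotain (the `sample` string is made up for illustration):

```javascript
const sample = "line one\nline two";

// `.` stops at line terminators unless the `s` (dotAll) flag is set.
const dotMatch = /.+/.exec(sample);

// [\s\S] matches any character, newlines included.
const anyCharMatch = /[\s\S]+/.exec(sample);

console.log(JSON.stringify(dotMatch[0]));     // "line one"
console.log(JSON.stringify(anyCharMatch[0])); // "line one\nline two"
```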

I've also tried a variation on moving the /(.+)(?:{{|{%)?/ pattern to the front of the mode, but that's producing errors of its own.

What is the best way to create a "catch all" token that captures everything up until another token in the current mode would be valid?

Semantically, in a liquid template everything that is outside of an object ({{ }}) or a tag ({% %}) is just text.

EDIT: I've also tried /([\s\S]+)(?:{{|{%)?/ and this appears to also produce the same errors as originally.

bd82 (Member) commented Jan 27, 2024

Hello @knpwrs

> What is the best way to create a "catch all" token that captures everything up until another token in the current mode would be valid?

Your approach: Greedy Matching and Lookahead

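Many of the regexp patterns you have tried still allow the main part of the pattern to match these two character sequences ({{ and {%), and then end with an optional trailing group. Note that (?:...) is a non-capturing group, which consumes input, rather than a lookahead assertion: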

  • /(.+)(?:{{|{%)?/
  • /([\s\S]+)(?:{{|{%)?/

I have not tested this, so there may be other issues, but quantifiers in JavaScript regular expressions are greedy by default (they attempt the longest match). The pattern would therefore match the longest substring of the input that fits, instead of the shortest substring up to the next {{ or {%.

Using non-greedy quantifiers (+? *?) may help, but I'm still wary of combining them with the optional trailing group:

  • (?:{{|{%)?
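Both concerns can be reproduced in plain JavaScript, outside Chevrotain. In this small sketch (the `input` string is made up for illustration) the greedy version swallows the delimiters, while the non-greedy version stops after a single character, because the optional group is allowed to match nothing:

```javascript
const input = "abc{{ x }}def";

// Greedy: [\s\S]+ runs to the end of the input, then the optional group matches nothing.
const greedy = /([\s\S]+)(?:{{|{%)?/.exec(input);

// Non-greedy: +? is satisfied by one character, and the optional group may be skipped.
const lazy = /([\s\S]+?)(?:{{|{%)?/.exec(input);

console.log(JSON.stringify(greedy[0])); // "abc{{ x }}def"
console.log(JSON.stringify(lazy[0]));   // "a"
```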

Suggestion (try this)

My default approach in this case would be to not allow the pattern to match the two characters sequence
which marks the beginning of the "meaningful" part of the template.
So I would define the "free text" part as a sequence of:

  • single non { characters
  • two character pairs of { followed by something which is not { or %

e.g: /([^{]|({[^%{]))+/
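A quick way to see what this pattern accepts, sketched in plain JavaScript without Chevrotain. The sticky (`y`) flag is added here only to mimic a lexer that must match at the current offset, and `matchTextAt` is a hypothetical helper, not part of any library:

```javascript
const template =
  "<!-- if array = [1,2,3,4,5,6] -->\n" +
  "{% for item in array limit:2 %}\n" +
  "  {{ item }}\n" +
  "{% endfor %}\n";

// Suggested pattern plus the sticky flag, so a match must start exactly at lastIndex.
const textPattern = /([^{]|({[^%{]))+/y;

// Hypothetical helper: returns the matched text at `offset`, or null if there is none.
function matchTextAt(input, offset) {
  textPattern.lastIndex = offset;
  const m = textPattern.exec(input);
  return m === null ? null : m[0];
}

const tagOffset = template.indexOf("{%");

console.log(JSON.stringify(matchTextAt(template, 0)));         // "<!-- if array = [1,2,3,4,5,6] -->\n"
console.log(matchTextAt(template, tagOffset));                 // null
console.log(JSON.stringify(matchTextAt("a { b {{ c", 0)));     // "a { b "
```

The first match includes the newline, stops right before {%, and a lone { inside the text is kept as text.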

Edge Case

There is still an edge case where the last token in the input is free text that ends with a single { character.
And I don't think Chevrotain allows you to include the end of input anchor ($) in the pattern regexps.
But that could potentially be handled by simple pre-lexing input processing (appending another character if the input ends with {).
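The edge case and the padding workaround, sketched in plain JavaScript without Chevrotain. The helper name `matchTextAt`, the sample input, and the choice of "\n" as the padding character are assumptions made for illustration:

```javascript
const textPattern = /([^{]|({[^%{]))+/y;

// Hypothetical helper: returns the matched text at `offset`, or null if there is none.
function matchTextAt(input, offset) {
  textPattern.lastIndex = offset;
  const m = textPattern.exec(input);
  return m === null ? null : m[0];
}

const input = "trailing brace {";

// The final `{` has no following character, so neither alternative can consume it.
const unpadded = matchTextAt(input, 0);

// Pre-lexing step: pad the input so the final `{` is followed by some character.
const padded = input.endsWith("{") ? input + "\n" : input;
const withPadding = matchTextAt(padded, 0);

console.log(JSON.stringify(unpadded));    // "trailing brace "
console.log(JSON.stringify(withPadding)); // "trailing brace {\n"
```

With padding, the last token's image is one character longer than the real input, so whatever consumes the tokens would need to account for that.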

knpwrs (Author) commented Jan 29, 2024

I wound up trying a custom token:

export const Text = createToken({
  name: 'Text',
  line_breaks: true,
  pattern: {
    exec: (text, startOffset) => {
      let endOffset = startOffset
      let charCode = text.charCodeAt(endOffset)
      let nextCharCode = text.charCodeAt(endOffset + 1)

      while (
        !Number.isNaN(charCode) &&
        !Number.isNaN(nextCharCode) &&
        charCode !== OpenBrace &&
        nextCharCode !== OpenBrace &&
        nextCharCode !== PercentSign
      ) {
        endOffset += 1
        charCode = text.charCodeAt(endOffset)
        nextCharCode = text.charCodeAt(endOffset + 1)
      }

      if (endOffset === startOffset) {
        return null
      }

      const match = text.substring(startOffset, endOffset)
      return [match]
    },
  },
})

And I am very confused by this output:

{
  "errors": [
    {
      "column": 34,
      "length": 1,
      "line": 1,
      "message": "unexpected character: ->
<- at offset: 33, skipped 1 characters.",
      "offset": 33,
    },
    {
      "column": 2,
      "length": 1,
      "line": 2,
      "message": "unexpected character: -> <- at offset: 67, skipped 1 characters.",
      "offset": 67,
    },
    {
      "column": 13,
      "length": 1,
      "line": 2,
      "message": "unexpected character: ->
<- at offset: 78, skipped 1 characters.",
      "offset": 78,
    },
    {
      "column": 26,
      "length": 1,
      "line": 2,
      "message": "unexpected character: ->
<- at offset: 91, skipped 1 characters.",
      "offset": 91,
    },
  ],
  "groups": {},
  "tokens": [
    {
      "endColumn": 33,
      "endLine": 1,
      "endOffset": 32,
      "image": "<!-- if array = [1,2,3,4,5,6] -->",
      "startColumn": 1,
      "startLine": 1,
      "startOffset": 0,
      "tokenType": {
        "CATEGORIES": [],
        "LINE_BREAKS": true,
        "PATTERN": {
          "exec": [Function],
        },
        "categoryMatches": [],
        "categoryMatchesMap": {},
        "isParent": false,
        "name": "Text",
        "tokenTypeIdx": 11,
      },
      "tokenTypeIdx": 11,
    },
    {
      "endColumn": 36,
      "endLine": 1,
      "endOffset": 35,
      "image": "{%",
      "startColumn": 35,
      "startLine": 1,
      "startOffset": 34,
      "tokenType": {
        "CATEGORIES": [],
        "PATTERN": /\\{%-\\?/,
        "PUSH_MODE": "tag",
        "categoryMatches": [],
        "categoryMatchesMap": {},
        "isParent": false,
        "name": "TagStart",
        "tokenTypeIdx": 10,
      },
      "tokenTypeIdx": 10,
    },

Why would the line breaks be unexpected? I have line_breaks: true.

I'm also thinking perhaps it would be beneficial for Chevrotain to ship an official Mustache Template Syntax lexer/parser. Mustache is the simplest language I'm aware of for this style of templates and it would demonstrate how to work around this problem for all similar languages.

bd82 (Member) commented Jan 31, 2024

line_breaks: true does not make the token able to match line breaks.
Instead, it tells Chevrotain that the token may contain line breaks, so it should update the line/column trackers.

If you want your Text token to handle newlines, you have to explicitly implement that in your custom token code.
Although I suspect your code does handle it.

I suspect you may have a logical bug where your loop halts one index before the expected position, e.g:

  • When the charCode is \n and the nextCharCode is { the loop will halt even though we have not reached a {% or {{.

You should also test the edge case of a Text token which is the last token in the input.

<!-- if array = [1,2,3,4,5,6] -->
{% for item in array limit:2 %}
  {{ item }}
{% endfor %}
123456
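One possible shape for a corrected matcher, written as a standalone function with the same (text, startOffset) arguments as the custom exec above. This is a sketch of the fix described in this comment, not the code the author ended up using; the constant names are hypothetical. In the original loop, the three character checks are independent, so a newline followed by { stops the scan before the newline, and the scan also stops one character before the end of the input. Here the scan stops only when the current character is { and the next one is { or %:

```javascript
const OPEN_BRACE = "{".charCodeAt(0);
const PERCENT_SIGN = "%".charCodeAt(0);

// Returns [matchedText], or null when no text starts at startOffset.
function matchText(text, startOffset) {
  let endOffset = startOffset;
  while (endOffset < text.length) {
    if (text.charCodeAt(endOffset) === OPEN_BRACE) {
      // charCodeAt returns NaN past the end, so a trailing lone `{` stays in the text.
      const next = text.charCodeAt(endOffset + 1);
      if (next === OPEN_BRACE || next === PERCENT_SIGN) break;
    }
    endOffset += 1;
  }
  if (endOffset === startOffset) return null;
  return [text.substring(startOffset, endOffset)];
}

const template =
  "<!-- if array = [1,2,3,4,5,6] -->\n" +
  "{% for item in array limit:2 %}\n" +
  "  {{ item }}\n" +
  "{% endfor %}\n" +
  "123456";

console.log(JSON.stringify(matchText(template, 0)[0]));  // "<!-- if array = [1,2,3,4,5,6] -->\n"
console.log(matchText(template, 34));                    // null
console.log(JSON.stringify(matchText(template, template.lastIndexOf("}") + 1)[0])); // "\n123456"
```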

bd82 (Member) commented Jan 31, 2024

> I'm also thinking perhaps it would be beneficial for Chevrotain to ship an official Mustache Template Syntax lexer/parser. Mustache is the simplest language I'm aware of for this style of templates and it would demonstrate how to work around this problem for all similar languages.

"Official" and "ship" are beyond the scope of the provided examples, as most of those are not production-quality examples.

But a PR with a smaller, more focused "catch all" token example would be positively reviewed if you are interested in contributing it.

knpwrs (Author) commented Feb 1, 2024

My custom pattern wound up being problematic, so I used your suggested pattern and it's working well so far.

I'd love to contribute an example, maybe after this project wraps and I gain some confidence in how it's all working together.

Thank you for your help!
