On Reducing Inline Quotes False-Positives #36

tajmone · 2021-07-30T03:39:32Z

tajmone
Jul 30, 2021
Maintainer

Just dumping here some considerations that crossed my mind...

I've noticed that, even though I've put quite a bit of energy into improving the inline "quote" elements with paired delimiters (e.g. bold, italic, and similar), I still get the occasional document breakage due to literal occurrences of these symbols being falsely matched as the opening delimiter of a quote formatting element.

I remember that when the original syntax mismatched a (well-formed) literal * as an opening bold delimiter, the whole document would break up due to out-of-sync delimiters and, often, the parsing stack being trapped in the bold context.

Adding a rule that force-pops a quote element upon reaching the EOL was a huge improvement for such elements, since those false positives would then disrupt only a single line.

There was also the assumption that idiomatic AsciiDoc demands splitting text one-line-per-sentence (and not, like we often see in Markdown, wrapping at 80 columns, semantic wrapping, etc.). I believe that this has proven to be a good choice. Lacking a full parser, how could we safely handle an opening inline delimiter that starts mid-sentence if its closing counterpart is found on another line? It would open the door to the original problem, and be a non-idiomatic case too.

So, thinking along the same lines, I was wondering whether we could further improve those inline elements ("quotes", as they were once called in the docs) by using a lookahead to ensure there's a closing counterpart on the same line, before actually entering the matching context.

E.g., the bold element is currently defined as:

  strong:
    - match: |
        (?x)
        (\[[^\]]*?\])?      # might start with an attributes list
        (?<=^|\W)(?<!\\|})  # must be preceded by non-word char, and not by escape or } (attribute)
        (\*)(?=\S)          # star delimiter must be followed by a non-space char
      captures:

The above could be roughly changed to something like:

  strong:
    - match: |
        (?x)
        (?=\[[^\]]*?\])?
        (?<=^|\W)(?<!\\|})
        (?=\*[^\*]+\*)  # Ensure there's a pair of stars!
      push: strong_begin

i.e. only start capturing the opening delimiter (and in this case, the attributes list) if a valid opening delimiter and its closing counterpart are both present in the current line being parsed. We'd be basically renaming the original strong context to strong_begin and replacing strong with a lookahead trigger.

Of course, the above RegEx could be smoothed out, without all those pesky checks (attributes lists should be defined elsewhere, and just included where needed), but the whole point here is to demonstrate that a non-consuming lookahead could be used as a trigger for the actual context, thus avoiding capturing anything unless a pair of delimiters is known to be present.

Surely, the presumed closing delimiter might actually be a literal character instead (escaped), or some other construct (depending on the quote at hand), but this should still be a safeguard able to reduce false-positives and document breakage.
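To make the idea a bit more concrete, here is a minimal sketch of how the trigger and the renamed context might fit together. It is only an illustration under some assumptions: the attributes list is left out, the `strong_body` context and all scope names are hypothetical placeholders rather than what the package actually uses, and the closing-star rule is deliberately simplified.

```yaml
contexts:
  strong:
    # Trigger only: nothing is consumed here, so no scope is assigned
    # unless both an opening and a closing star exist on this line.
    - match: |
        (?x)
        (?<=^|\W)(?<!\\|})   # same guards as the current rule
        (?=\*\S[^*]*\*)      # lookahead: opening star, non-space char, closing star
      push: strong_begin

  strong_begin:
    # The lookahead guarantees a star at this position; consume it
    # and hand over to the body context.
    - match: \*
      scope: punctuation.definition.bold.begin.asciidoc   # hypothetical scope name
      set: strong_body

  strong_body:   # hypothetical context name
    - meta_scope: markup.bold.asciidoc                    # hypothetical scope name
    - match: \*
      scope: punctuation.definition.bold.end.asciidoc     # hypothetical scope name
      pop: true
    # Existing safety net: never let the context leak past the end of the line.
    - match: $
      pop: true
```

With something along these lines, a lone literal star on a line would never enter `strong_begin`, while the EOL force-pop still guards the cases where the presumed closing star turns out to be escaped or part of another construct.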

It's worth trying out and validating through thorough tests against the known contexts that are currently problematic.
