Use LaTeX and TeX delimiters by default #5515

rda0 · 2020-12-21T14:09:39Z

This PR fixes element-hq/element-web#15846 and I believe it addresses all suggestions in the comments.

I tried to find regular expressions for the $...$ and $$...$$ parsing to avoid interference for normal users as much as possible. The regex allows to escape $ characters inside and outside of the maths pattern, disallows multi-line inline maths and has some more restrictions to avoid unwanted mangling of the input (see code comments).

This PR proposes to switch the default delimiters to use LaTeX style delimiters for sending and editing messages:

inline math: $...$
display math: \[...\]

Alternatively TeX style delimiters can be used for sending maths:

inline math: $...$
display math: $$...$$

According to Are $ and $ preferable to dollar signs for math mode? and Why is \[ … \] preferable to $$ … $$?:

$ ... $ is LaTeX syntax. $ ... $ is TeX syntax.

To avoid mangled code blocks, the input is converted to markdown twice, once using latex-then-markdown and once using markdown-only parsing. The clean code blocks are then copied back to the combined output.

The configuration options introduced in #5244 may not be compatible anymore, since the regex replace code for inline maths has slightly changed. You can reproduce the default behavior in config.json now as follows:

    "latex_maths_delims": {
        "inline": {
            "left": "\\(",
            "right": "\\)",
            "pattern": {
                "latex": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
                "tex": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$"
            }
        },
        "display": {
            "left": "\\[",
            "right": "\\]",
            "pattern": {
                "latex": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
                "tex": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
            }
        }
    },

Signed-off-by: Sven Mäder maeder@phys.ethz.ch

Since parsing for $'s as maths delimiters is tricky, switch the default to $...$ for inline and \[...\] for display maths as it is used in LaTeX. Add /tex command to explicitly parse in TeX mode, which uses $...$ for inline and $$...$$ for display maths. Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>

Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>

jryans · 2020-12-22T15:19:12Z

Adding @uhoreg separately to get his opinion on these change as a user of LaTeX input in Matrix.

rda0 · 2021-01-04T10:27:28Z

If the proposal using a /tex command would be acceptable for LaTeX users, the regex for $ parsing could also be less restrictive. Maybe the only requirement should be to backslash-escape literal $. So that you could also write /tex $ \TeX $. This will however break for example having more than one permalink in such a message, but this could be acceptable here.

I only got feedback from two of our interested Matrix/LaTeX users and both reported that /tex $ \TeX $ doesn't work but /tex $\TeX$ does.

An additional configuration option for the regex in the /tex command could also be added. But maybe it is better for Element if the command works the same on all installations.

uhoreg · 2021-01-15T17:19:55Z

Even though  and \[ \] may be technically better, I think that a lot of people are used to using $, and expect it to work like that, so I'd like to give it at shot at making it parse $ better. If it ends up being too unwieldy, then this approach sounds like it would be a reasonable solution.

uhoreg · 2021-01-15T23:29:57Z

Here's the result of my attempts today at playing with regular expressions:

const testCases = [
    // inline math:
    ["$x^2$", '<span data-mx-maths="x^2"></span>'],
    // math can be followed by punctuation
    ["$x^2$?", '<span data-mx-maths="x^2"></span>?'],
    // escaped dollar sign inside math
    ["$\\$$", '<span data-mx-maths="\\$"></span>'],
    ["$3 + \\$3$", '<span data-mx-maths="3 + \\$3"></span>'],
    // math can be preceded and followed by whitespace
    ["hello $x^2$ world", 'hello <span data-mx-maths="x^2"></span> world'],
    ["hello\n$x^2$\nworld", 'hello\n<span data-mx-maths="x^2"></span>\nworld'],
    // when using \( \), be less strict (e.g. allow spaces after opening
    // delimiter, before delimiter, and newlines inside)
    ["a\\( x^2 \\)b", 'a<span data-mx-maths=" x^2 "></span>b'],
    ["a\\(x\ny\\)b", 'a<span data-mx-maths="x\ny"></span>b'],
    // display math: (mostly similar to inline, but a bit less strict)
    ["$$x^2$$", '<div data-mx-maths="x^2">\n\n</div>'],
    // allow newlines inside display math
    ["$$x\ny$$", '<div data-mx-maths="x\ny">\n\n</div>'],
    ["hello $$x^2$$", 'hello <div data-mx-maths="x^2">\n\n</div>'],
    ["hello\n$$x^2$$", 'hello\n<div data-mx-maths="x^2">\n\n</div>'],
    ["a\\[ x^2 \\]b", 'a<div data-mx-maths=" x^2 ">\n\n</div>b'],
    ["a\\[x\ny\\]b", 'a<div data-mx-maths="x\ny">\n\n</div>b'],
    // not math:
    ["$$", '$$'],
    ["$ $", '$ $'],
    // don't allow newlines inside inline math with dollars -- if you want to
    // have newlines, use \(...\)
    ["$x\ny$", '$x\ny$'],
    // don't allow space after first $ or before second $
    ["$ x$", '$ x$'],
    ["$x $", '$x $'],
    // first $ has to be preceded by whitespace or be at beginning of message,
    // second $ has to be followed by whitespace or be at end of message
    ["$x = $y", '$x = $y'],
    ["1$ and 2$", '1$ and 2$'],
    // escaped dollar signs
    ["\\$x$", '$x$'],
    ["$x\\$", '$x$'],
];

const nonSpaceMathChar = "(?:[^$\\\\\\s\\n]|\\\\.)";
const nonNewlineMathChar = "(?:[^$\\\\\\n]|\\\\.)";
const mathChar = "(?:[^$\\\\]|\\\\.)";
const inlineDollar = "(^|\\s)\\$(" +
      nonSpaceMathChar +
      "|(?:" + nonSpaceMathChar + nonNewlineMathChar + "*" + nonSpaceMathChar +
      "))\\$(?=$|\\s|[.,!?:;])";
const displayDollar = "(^|\\s)\\$\\$(" + mathChar + "+)\\$\\$(?=$|\\s|[.,!?:;])";
const inlineBracket = "\\\\\\((" + mathChar + "+)\\\\\\)";
const displayBracket = "\\\\\\[(" + mathChar + "+)\\\\\\]";
const latexRegexp = RegExp(
    "(" + inlineDollar + "|" + displayDollar + "|" + inlineBracket + "|" +
    displayBracket + "|\\\\\\$)",
    "gm",
);

function replaceLatex(str) {
    return str.replace(
        latexRegexp,
        function(m, p1, p2, p3, p4, p5, p6, p7) {
            if (p1 === "\\$") {
                return "$";
            } else if (p3) {
                // FIXME: p3 should be HTML escaped -- same mith p3, p6, p7 below
                return `${p2}<span data-mx-maths="${p3}"></span>`;
            } else if (p5) {
                return `${p4}<div data-mx-maths="${p5}">\n\n</div>`;
            } else if (p6) {
                return `<span data-mx-maths="${p6}"></span>`;
            } else {
                return `<div data-mx-maths="${p7}">\n\n</div>`;
            }
        },
    );
}

for (const [input, expected] of testCases) {
    const result = replaceLatex(input);
    if (result === expected) {
        console.log("✔️", JSON.stringify(result));
    } else {
        console.log("got", result);
    }
}

Basically, it allows users to use both $...$ /$$...$$ and $...$/\[...\]-style notation. It's more strict with the dollar signs: with the dollar signs, you need whitespace before the first delimiter and after the second delimiter, but cannot have whitespace after the first delimiter or before the second delimiter, and for inline math, you can't have a newline inside it. Literal dollar signs can be escaped in case something would otherwise be interpreted as math. With the bracket notation, it's less strict, so it doesn't care about whitespace.

jryans · 2021-01-28T12:23:34Z

@rda0 What do you think about changing to the regexs @uhoreg proposed above?

rda0 · 2021-01-28T16:51:05Z

@jryans Sorry for the delay, I will have a look at it soon. I think the code @uhoreg posted looks cleaner. But I have some more test cases I would like to run it against.

rda0 · 2021-01-28T23:28:23Z

I think with c4f0987 it now provides the same functionality as in #5515 (comment). It also passes all tests in that comment. The following changed:

Use both $...$ /$$...$$ and $...$/\[...\]-style notation for serialize
Disallow inline multi-line maths
Allow inline display maths
No restrictions using /tex command
Default behavior is now (config.json):

"latex_maths_delims": {
  "inline_left": "\\(",
  "inline_right": "\\)",
  "inline_pattern": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
  "display_left": "\\[",
  "display_right": "\\]",
  "display_pattern": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]"
}

I will try to use the code in #5515 (comment) tomorrow and run some more tests. If both solutions are equivalent, then I do not have a preference which one to use.

@uhoreg What do think about still providing the /tex command, but without restrictions for inline maths?

uhoreg · 2021-01-29T00:39:16Z

What do think about still providing the /tex command, but without restrictions for inline maths?

I am indifferent. I think that with allowing both $...$ and $...$, there is less need for it, but if other people feel it's necessary, I'm not opposed to it.

rda0 · 2021-01-29T13:18:54Z

@uhoreg I tested the code from #5515 (comment) in Element, and I think the code/regex in the PR as it is now works better. My tests are not as nicely done as yours, I just send a huge message with all kinds of maths/markdown combinations (resulting in the screenshot) and a huge message with random tests.

I tested this with:

export function htmlSerializeIfNeeded(model: EditorModel, {forceHTML = false} = {}) {
    let md = mdSerialize(model);
    // copy of raw input to remove unwanted math later
    const orig = md;

    if (SettingsStore.getValue("feature_latex_maths")) {
        const nonSpaceMathChar = "(?:[^$\\\\\\s\\n]|\\\\.)";
        const nonNewlineMathChar = "(?:[^$\\\\\\n]|\\\\.)";
        const mathChar = "(?:[^$\\\\]|\\\\.)";
        const inlineDollar = "(^|\\s)\\$(" +
              nonSpaceMathChar +
              "|(?:" + nonSpaceMathChar + nonNewlineMathChar + "*" + nonSpaceMathChar +
              "))\\$(?=$|\\s|[.,!?:;])";
        const displayDollar = "(^|\\s)\\$\\$(" + mathChar + "+)\\$\\$(?=$|\\s|[.,!?:;])";
        const inlineBracket = "\\\\\\((" + mathChar + "+)\\\\\\)";
        const displayBracket = "\\\\\\[(" + mathChar + "+)\\\\\\]";
        const latexRegexp = RegExp(
            "(" + inlineDollar + "|" + displayDollar + "|" + inlineBracket + "|" +
            displayBracket + "|\\\\\\$)",
            "gm",
        );

        md = md.replace(
            latexRegexp,
            function(m, p1, p2, p3, p4, p5, p6, p7) {
                if (p1 === "\\$") {
                    return "$";
                } else if (p3) {
                    const p3e = AllHtmlEntities.encode(p3);
                    return `${p2}<span data-mx-maths="${p3e}"></span>`;
                } else if (p5) {
                    const p5e = AllHtmlEntities.encode(p5);
                    return `${p4}<div data-mx-maths="${p5e}">\n\n</div>`;
                } else if (p6) {
                    const p6e = AllHtmlEntities.encode(p6);
                    return `<span data-mx-maths="${p6e}"></span>`;
                } else {
                    const p7e = AllHtmlEntities.encode(p7);
                    return `<div data-mx-maths="${p7e}">\n\n</div>`;
                }
            },
        );

        // make sure div tags always start on a new line, otherwise it will confuse
        // the markdown parser
        md = md.replace(/(.)<div/g, function(m, p1) { return `${p1}\n<div`; });
    }

    const parser = new Markdown(md);
    if (!parser.isPlainText() || forceHTML) {
        // feed Markdown output to HTML parser
        const phtml = cheerio.load(parser.toHTML(),
            { _useHtmlParser2: true, decodeEntities: false });

        if (SettingsStore.getValue("feature_latex_maths")) {
            // original Markdown without LaTeX replacements
            const parserOrig = new Markdown(orig);
            const phtmlOrig = cheerio.load(parserOrig.toHTML(),
                { _useHtmlParser2: true, decodeEntities: false });

            // since maths delimiters are handled before Markdown,
            // code blocks could contain mangled content.
            // replace code blocks with original content
            phtml('code').contents('div, span').each(function(i) {
                const origData = phtmlOrig('code').contents('div, span')[i].data;
                phtml('code').contents('div, span')[i].data = origData;
            });

            // add fallback output for latex math, which should not be interpreted as markdown
            phtml('div, span').each(function(i, e) {
                const tex = phtml(e).attr('data-mx-maths')
                if (tex) {
                    phtml(e).html(`<code>${tex}</code>`)
                }
            });
        }
        return phtml.html();
    }
    // ensure removal of escape backslashes in non-Markdown messages
    if (md.indexOf("\\") > -1) {
        return parser.toPlaintext();
    }
}

Test messages:

markdown_latex.txt
markdown_tex.txt
random_latex.txt
random_tex.txt

Test results:

markdown_latex_result_pr.txt
markdown_latex_result_uhoreg.txt
markdown_tex_result_pr.txt
markdown_tex_result_uhoreg.txt
random_latex_result_pr.txt
random_latex_result_uhoreg.txt
random_tex_result_pr.txt
random_tex_result_uhoreg.txt

From what I can see, only markdown_tex_result_* look equivalent.

Could be that I did something wrong trying to integrate the code into the SDK.

rda0 · 2021-01-29T14:58:20Z

ac1f9b4 makes the alternative pattern (using $...$ /$$...$$) configurable, default config:

"latex_maths_delims": {
    "inline_left": "\\(",
    "inline_right": "\\)",
    "inline_pattern": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
    "inline_pattern_alternative": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$",
    "display_left": "\\[",
    "display_right": "\\]",
    "display_pattern": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
    "display_pattern_alternative": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
},

src/editor/serialize.ts

behrmann · 2021-03-26T09:24:54Z

Sorry to ask, but is there anything to help this PR along, e.g. with testing in the field? We have LaTeX support enabled by default on our Element instance, since people are more likely to need LaTeX than $ sigils on code blocks, but I'm still bitten by this daily. :)

jryans

I believe we're quite close to something that could land here. Let's try refactoring a bit to reduce duplication, and then try it out.

jryans · 2021-03-31T16:09:23Z

src/editor/serialize.ts

+        // conditions for display math detection $$...$$:
+        // - pattern starts at beginning of line or is not prefixed with backslash or dollar
+        // - left delimiter ($$) is not escaped by backslash
+        const displayPatternAlternative = (SdkConfig.get()['latex_maths_delims'] ||


It looks we have two similar chunks of code, one for the alternate patterns and another for default. Rather than duplicating the logic in two chunks, could you refactor it out so the common work is de-duplicated to function the takes e.g. the patterns it needs?

I refactored the code and hope it is what you expected.

rda0 · 2021-04-01T10:31:20Z

In 73130ca I also changed the pattern naming in the config to be more specific:

"latex_maths_delims": {
    "inline_left": "\\(",
    "inline_right": "\\)",
    "inline_pattern_latex": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
    "inline_pattern_tex": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$",
    "display_left": "\\[",
    "display_right": "\\]",
    "display_pattern_latex": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
    "display_pattern_tex": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
},

jryans · 2021-04-06T09:41:43Z

@rda0 Thanks for the refactoring! It seems like the config might be a bit nicer to use if inline vs. display were grouped in the their own nested objects, sort of similar to what you've done with the defaults. Do you agree?

There are also some test failures... perhaps try merging develop?

rda0 · 2021-04-06T13:27:05Z

@jryans The config is now:

    "latex_maths_delims": {
        "inline": {
            "left": "\\(",
            "right": "\\)",
            "pattern": {
                "latex": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
                "tex": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$"
            }
        },
        "display": {
            "left": "\\[",
            "right": "\\]",
            "pattern": {
                "latex": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
                "tex": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
            }
        }
    },

I also merged develop without problems except for new linting errors appearing now, which I do not know how to resolve. I am sorry, but my skills as developer and especially using this language is very limited😉 Would be great if someone could help out.

jryans · 2021-04-07T09:28:34Z

Hmm, looks like we may need a few tweaks to adjust to the cheerio API:

For example, it seems contents doesn't take any arguments so the div, span could be removed... unless you need to filter out other elements, in which case you may need a different API.

About .data, what does reading and writing that actually do...? I admit I am not quite sure what it's trying to achieve when looking closely... 😅

rda0 · 2021-04-08T09:57:29Z

@jryans Thanks to a friend who figured out how to use the cheerio api correctly, we were able to remove my hack via the .data attribute.

jryans

Thanks for all the work here, let's give it a go! 😄

rda0 mentioned this pull request Dec 21, 2020

LaTeX parsing cannot discern $ when not intended to be used element-hq/element-web#15846

Closed

rda0 added 2 commits December 22, 2020 12:18

Improve inline latex regex matching

fb57123

Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>

Fix linting error

d5c72b9

Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>

jryans requested review from a team and uhoreg December 22, 2020 15:12

Use TeX and LaTeX delimiters by default for serialize

c4f0987

Remove /tex command

bcc0697

Add config keys for alternative patterns

ac1f9b4

rda0 changed the title ~~Use LaTeX delimiters by default, add /tex command~~ Use LaTeX and TeX delimiters by default Jan 29, 2021

jryans reviewed Feb 4, 2021

View reviewed changes

src/editor/serialize.ts Outdated Show resolved Hide resolved

jryans self-requested a review March 30, 2021 10:56

jryans suggested changes Mar 31, 2021

View reviewed changes

jryans removed the request for review from uhoreg March 31, 2021 16:10

Refactor latex replace code

73130ca

rda0 added 2 commits April 6, 2021 12:34

Merge remote-tracking branch 'upstream/develop' into maths-parsing-latex

3c1169c

Refactor maths config options to nested structure

1d70045

rda0 added 2 commits April 7, 2021 17:22

Remove unused function arguments

7205995

Use cheerio api for code replacements

31e85ec

jryans approved these changes Apr 9, 2021

View reviewed changes

jryans merged commit ae2082b into matrix-org:develop Apr 9, 2021

rda0 deleted the maths-parsing-latex branch April 9, 2021 10:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use LaTeX and TeX delimiters by default #5515

Use LaTeX and TeX delimiters by default #5515

rda0 commented Dec 21, 2020 •

edited

Loading

jryans commented Dec 22, 2020

rda0 commented Jan 4, 2021 •

edited

Loading

uhoreg commented Jan 15, 2021

uhoreg commented Jan 15, 2021

jryans commented Jan 28, 2021 •

edited

Loading

rda0 commented Jan 28, 2021 •

edited

Loading

rda0 commented Jan 28, 2021 •

edited

Loading

uhoreg commented Jan 29, 2021

rda0 commented Jan 29, 2021 •

edited

Loading

rda0 commented Jan 29, 2021 •

edited

Loading

behrmann commented Mar 26, 2021

jryans left a comment

jryans Mar 31, 2021

rda0 Apr 1, 2021

rda0 commented Apr 1, 2021 •

edited

Loading

jryans commented Apr 6, 2021

rda0 commented Apr 6, 2021

jryans commented Apr 7, 2021

rda0 commented Apr 8, 2021

jryans left a comment

Use LaTeX and TeX delimiters by default #5515

Use LaTeX and TeX delimiters by default #5515

Conversation

rda0 commented Dec 21, 2020 • edited Loading

jryans commented Dec 22, 2020

rda0 commented Jan 4, 2021 • edited Loading

uhoreg commented Jan 15, 2021

uhoreg commented Jan 15, 2021

jryans commented Jan 28, 2021 • edited Loading

rda0 commented Jan 28, 2021 • edited Loading

rda0 commented Jan 28, 2021 • edited Loading

uhoreg commented Jan 29, 2021

rda0 commented Jan 29, 2021 • edited Loading

rda0 commented Jan 29, 2021 • edited Loading

behrmann commented Mar 26, 2021

jryans left a comment

Choose a reason for hiding this comment

jryans Mar 31, 2021

Choose a reason for hiding this comment

rda0 Apr 1, 2021

Choose a reason for hiding this comment

rda0 commented Apr 1, 2021 • edited Loading

jryans commented Apr 6, 2021

rda0 commented Apr 6, 2021

jryans commented Apr 7, 2021

rda0 commented Apr 8, 2021

jryans left a comment

Choose a reason for hiding this comment

rda0 commented Dec 21, 2020 •

edited

Loading

rda0 commented Jan 4, 2021 •

edited

Loading

jryans commented Jan 28, 2021 •

edited

Loading

rda0 commented Jan 28, 2021 •

edited

Loading

rda0 commented Jan 28, 2021 •

edited

Loading

rda0 commented Jan 29, 2021 •

edited

Loading

rda0 commented Jan 29, 2021 •

edited

Loading

rda0 commented Apr 1, 2021 •

edited

Loading