Skip to content
This repository has been archived by the owner on Sep 11, 2024. It is now read-only.

Use LaTeX and TeX delimiters by default #5515

Merged
merged 11 commits into from
Apr 9, 2021

Conversation

rda0
Copy link
Contributor

@rda0 rda0 commented Dec 21, 2020

This PR fixes element-hq/element-web#15846 and I believe it addresses all suggestions in the comments.

I tried to find regular expressions for the $...$ and $$...$$ parsing to avoid interference for normal users as much as possible. The regex allows to escape $ characters inside and outside of the maths pattern, disallows multi-line inline maths and has some more restrictions to avoid unwanted mangling of the input (see code comments).

This PR proposes to switch the default delimiters to use LaTeX style delimiters for sending and editing messages:

  • inline math: \(...\)
  • display math: \[...\]

Alternatively TeX style delimiters can be used for sending maths:

  • inline math: $...$
  • display math: $$...$$

According to Are \( and \) preferable to dollar signs for math mode? and Why is \[ … \] preferable to $$ … $$?:

\( ... \) is LaTeX syntax. $ ... $ is TeX syntax.

To avoid mangled code blocks, the input is converted to markdown twice, once using latex-then-markdown and once using markdown-only parsing. The clean code blocks are then copied back to the combined output.

The configuration options introduced in #5244 may not be compatible anymore, since the regex replace code for inline maths has slightly changed. You can reproduce the default behavior in config.json now as follows:

    "latex_maths_delims": {
        "inline": {
            "left": "\\(",
            "right": "\\)",
            "pattern": {
                "latex": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
                "tex": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$"
            }
        },
        "display": {
            "left": "\\[",
            "right": "\\]",
            "pattern": {
                "latex": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
                "tex": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
            }
        }
    },

image

Signed-off-by: Sven Mäder maeder@phys.ethz.ch

Since parsing for $'s as maths delimiters is tricky, switch the default
to \(...\) for inline and \[...\] for display maths as it is used in
LaTeX. Add /tex command to explicitly parse in TeX mode, which uses
$...$ for inline and $$...$$ for display maths.

Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>
rda0 added 2 commits December 22, 2020 12:18
Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>
Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>
@jryans jryans requested review from a team and uhoreg December 22, 2020 15:12
@jryans
Copy link
Collaborator

jryans commented Dec 22, 2020

Adding @uhoreg separately to get his opinion on these change as a user of LaTeX input in Matrix.

@rda0
Copy link
Contributor Author

rda0 commented Jan 4, 2021

If the proposal using a /tex command would be acceptable for LaTeX users, the regex for $ parsing could also be less restrictive. Maybe the only requirement should be to backslash-escape literal $. So that you could also write /tex $ \TeX $. This will however break for example having more than one permalink in such a message, but this could be acceptable here.

I only got feedback from two of our interested Matrix/LaTeX users and both reported that /tex $ \TeX $ doesn't work but /tex $\TeX$ does.

An additional configuration option for the regex in the /tex command could also be added. But maybe it is better for Element if the command works the same on all installations.

@uhoreg
Copy link
Member

uhoreg commented Jan 15, 2021

Even though \( \) and \[ \] may be technically better, I think that a lot of people are used to using $, and expect it to work like that, so I'd like to give it at shot at making it parse $ better. If it ends up being too unwieldy, then this approach sounds like it would be a reasonable solution.

@uhoreg
Copy link
Member

uhoreg commented Jan 15, 2021

Here's the result of my attempts today at playing with regular expressions:

const testCases = [
    // inline math:
    ["$x^2$", '<span data-mx-maths="x^2"></span>'],
    // math can be followed by punctuation
    ["$x^2$?", '<span data-mx-maths="x^2"></span>?'],
    // escaped dollar sign inside math
    ["$\\$$", '<span data-mx-maths="\\$"></span>'],
    ["$3 + \\$3$", '<span data-mx-maths="3 + \\$3"></span>'],
    // math can be preceded and followed by whitespace
    ["hello $x^2$ world", 'hello <span data-mx-maths="x^2"></span> world'],
    ["hello\n$x^2$\nworld", 'hello\n<span data-mx-maths="x^2"></span>\nworld'],
    // when using \( \), be less strict (e.g. allow spaces after opening
    // delimiter, before delimiter, and newlines inside)
    ["a\\( x^2 \\)b", 'a<span data-mx-maths=" x^2 "></span>b'],
    ["a\\(x\ny\\)b", 'a<span data-mx-maths="x\ny"></span>b'],
    // display math: (mostly similar to inline, but a bit less strict)
    ["$$x^2$$", '<div data-mx-maths="x^2">\n\n</div>'],
    // allow newlines inside display math
    ["$$x\ny$$", '<div data-mx-maths="x\ny">\n\n</div>'],
    ["hello $$x^2$$", 'hello <div data-mx-maths="x^2">\n\n</div>'],
    ["hello\n$$x^2$$", 'hello\n<div data-mx-maths="x^2">\n\n</div>'],
    ["a\\[ x^2 \\]b", 'a<div data-mx-maths=" x^2 ">\n\n</div>b'],
    ["a\\[x\ny\\]b", 'a<div data-mx-maths="x\ny">\n\n</div>b'],
    // not math:
    ["$$", '$$'],
    ["$ $", '$ $'],
    // don't allow newlines inside inline math with dollars -- if you want to
    // have newlines, use \(...\)
    ["$x\ny$", '$x\ny$'],
    // don't allow space after first $ or before second $
    ["$ x$", '$ x$'],
    ["$x $", '$x $'],
    // first $ has to be preceded by whitespace or be at beginning of message,
    // second $ has to be followed by whitespace or be at end of message
    ["$x = $y", '$x = $y'],
    ["1$ and 2$", '1$ and 2$'],
    // escaped dollar signs
    ["\\$x$", '$x$'],
    ["$x\\$", '$x$'],
];

const nonSpaceMathChar = "(?:[^$\\\\\\s\\n]|\\\\.)";
const nonNewlineMathChar = "(?:[^$\\\\\\n]|\\\\.)";
const mathChar = "(?:[^$\\\\]|\\\\.)";
const inlineDollar = "(^|\\s)\\$(" +
      nonSpaceMathChar +
      "|(?:" + nonSpaceMathChar + nonNewlineMathChar + "*" + nonSpaceMathChar +
      "))\\$(?=$|\\s|[.,!?:;])";
const displayDollar = "(^|\\s)\\$\\$(" + mathChar + "+)\\$\\$(?=$|\\s|[.,!?:;])";
const inlineBracket = "\\\\\\((" + mathChar + "+)\\\\\\)";
const displayBracket = "\\\\\\[(" + mathChar + "+)\\\\\\]";
const latexRegexp = RegExp(
    "(" + inlineDollar + "|" + displayDollar + "|" + inlineBracket + "|" +
    displayBracket + "|\\\\\\$)",
    "gm",
);

function replaceLatex(str) {
    return str.replace(
        latexRegexp,
        function(m, p1, p2, p3, p4, p5, p6, p7) {
            if (p1 === "\\$") {
                return "$";
            } else if (p3) {
                // FIXME: p3 should be HTML escaped -- same mith p3, p6, p7 below
                return `${p2}<span data-mx-maths="${p3}"></span>`;
            } else if (p5) {
                return `${p4}<div data-mx-maths="${p5}">\n\n</div>`;
            } else if (p6) {
                return `<span data-mx-maths="${p6}"></span>`;
            } else {
                return `<div data-mx-maths="${p7}">\n\n</div>`;
            }
        },
    );
}

for (const [input, expected] of testCases) {
    const result = replaceLatex(input);
    if (result === expected) {
        console.log("✔️", JSON.stringify(result));
    } else {
        console.log("got", result);
    }
}

Basically, it allows users to use both $...$/$$...$$ and \(...\)/\[...\]-style notation. It's more strict with the dollar signs: with the dollar signs, you need whitespace before the first delimiter and after the second delimiter, but cannot have whitespace after the first delimiter or before the second delimiter, and for inline math, you can't have a newline inside it. Literal dollar signs can be escaped in case something would otherwise be interpreted as math. With the bracket notation, it's less strict, so it doesn't care about whitespace.

@jryans
Copy link
Collaborator

jryans commented Jan 28, 2021

@rda0 What do you think about changing to the regexs @uhoreg proposed above?

@rda0
Copy link
Contributor Author

rda0 commented Jan 28, 2021

@jryans Sorry for the delay, I will have a look at it soon. I think the code @uhoreg posted looks cleaner. But I have some more test cases I would like to run it against.

@rda0
Copy link
Contributor Author

rda0 commented Jan 28, 2021

I think with c4f0987 it now provides the same functionality as in #5515 (comment). It also passes all tests in that comment. The following changed:

  • Use both $...$/$$...$$ and \(...\)/\[...\]-style notation for serialize
  • Disallow inline multi-line maths
  • Allow inline display maths
  • No restrictions using /tex command
  • Default behavior is now (config.json):
"latex_maths_delims": {
  "inline_left": "\\(",
  "inline_right": "\\)",
  "inline_pattern": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
  "display_left": "\\[",
  "display_right": "\\]",
  "display_pattern": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]"
}

I will try to use the code in #5515 (comment) tomorrow and run some more tests. If both solutions are equivalent, then I do not have a preference which one to use.

@uhoreg What do think about still providing the /tex command, but without restrictions for inline maths?

@uhoreg
Copy link
Member

uhoreg commented Jan 29, 2021

What do think about still providing the /tex command, but without restrictions for inline maths?

I am indifferent. I think that with allowing both $...$ and \(...\), there is less need for it, but if other people feel it's necessary, I'm not opposed to it.

@rda0
Copy link
Contributor Author

rda0 commented Jan 29, 2021

@uhoreg I tested the code from #5515 (comment) in Element, and I think the code/regex in the PR as it is now works better. My tests are not as nicely done as yours, I just send a huge message with all kinds of maths/markdown combinations (resulting in the screenshot) and a huge message with random tests.

I tested this with:

export function htmlSerializeIfNeeded(model: EditorModel, {forceHTML = false} = {}) {
    let md = mdSerialize(model);
    // copy of raw input to remove unwanted math later
    const orig = md;

    if (SettingsStore.getValue("feature_latex_maths")) {
        const nonSpaceMathChar = "(?:[^$\\\\\\s\\n]|\\\\.)";
        const nonNewlineMathChar = "(?:[^$\\\\\\n]|\\\\.)";
        const mathChar = "(?:[^$\\\\]|\\\\.)";
        const inlineDollar = "(^|\\s)\\$(" +
              nonSpaceMathChar +
              "|(?:" + nonSpaceMathChar + nonNewlineMathChar + "*" + nonSpaceMathChar +
              "))\\$(?=$|\\s|[.,!?:;])";
        const displayDollar = "(^|\\s)\\$\\$(" + mathChar + "+)\\$\\$(?=$|\\s|[.,!?:;])";
        const inlineBracket = "\\\\\\((" + mathChar + "+)\\\\\\)";
        const displayBracket = "\\\\\\[(" + mathChar + "+)\\\\\\]";
        const latexRegexp = RegExp(
            "(" + inlineDollar + "|" + displayDollar + "|" + inlineBracket + "|" +
            displayBracket + "|\\\\\\$)",
            "gm",
        );

        md = md.replace(
            latexRegexp,
            function(m, p1, p2, p3, p4, p5, p6, p7) {
                if (p1 === "\\$") {
                    return "$";
                } else if (p3) {
                    const p3e = AllHtmlEntities.encode(p3);
                    return `${p2}<span data-mx-maths="${p3e}"></span>`;
                } else if (p5) {
                    const p5e = AllHtmlEntities.encode(p5);
                    return `${p4}<div data-mx-maths="${p5e}">\n\n</div>`;
                } else if (p6) {
                    const p6e = AllHtmlEntities.encode(p6);
                    return `<span data-mx-maths="${p6e}"></span>`;
                } else {
                    const p7e = AllHtmlEntities.encode(p7);
                    return `<div data-mx-maths="${p7e}">\n\n</div>`;
                }
            },
        );

        // make sure div tags always start on a new line, otherwise it will confuse
        // the markdown parser
        md = md.replace(/(.)<div/g, function(m, p1) { return `${p1}\n<div`; });
    }

    const parser = new Markdown(md);
    if (!parser.isPlainText() || forceHTML) {
        // feed Markdown output to HTML parser
        const phtml = cheerio.load(parser.toHTML(),
            { _useHtmlParser2: true, decodeEntities: false });

        if (SettingsStore.getValue("feature_latex_maths")) {
            // original Markdown without LaTeX replacements
            const parserOrig = new Markdown(orig);
            const phtmlOrig = cheerio.load(parserOrig.toHTML(),
                { _useHtmlParser2: true, decodeEntities: false });

            // since maths delimiters are handled before Markdown,
            // code blocks could contain mangled content.
            // replace code blocks with original content
            phtml('code').contents('div, span').each(function(i) {
                const origData = phtmlOrig('code').contents('div, span')[i].data;
                phtml('code').contents('div, span')[i].data = origData;
            });

            // add fallback output for latex math, which should not be interpreted as markdown
            phtml('div, span').each(function(i, e) {
                const tex = phtml(e).attr('data-mx-maths')
                if (tex) {
                    phtml(e).html(`<code>${tex}</code>`)
                }
            });
        }
        return phtml.html();
    }
    // ensure removal of escape backslashes in non-Markdown messages
    if (md.indexOf("\\") > -1) {
        return parser.toPlaintext();
    }
}

Test messages:

markdown_latex.txt
markdown_tex.txt
random_latex.txt
random_tex.txt

Test results:

markdown_latex_result_pr.txt
markdown_latex_result_uhoreg.txt
markdown_tex_result_pr.txt
markdown_tex_result_uhoreg.txt
random_latex_result_pr.txt
random_latex_result_uhoreg.txt
random_tex_result_pr.txt
random_tex_result_uhoreg.txt

From what I can see, only markdown_tex_result_* look equivalent.

Could be that I did something wrong trying to integrate the code into the SDK.

@rda0 rda0 changed the title Use LaTeX delimiters by default, add /tex command Use LaTeX and TeX delimiters by default Jan 29, 2021
@rda0
Copy link
Contributor Author

rda0 commented Jan 29, 2021

ac1f9b4 makes the alternative pattern (using $...$/$$...$$) configurable, default config:

"latex_maths_delims": {
    "inline_left": "\\(",
    "inline_right": "\\)",
    "inline_pattern": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
    "inline_pattern_alternative": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$",
    "display_left": "\\[",
    "display_right": "\\]",
    "display_pattern": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
    "display_pattern_alternative": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
},

src/editor/serialize.ts Outdated Show resolved Hide resolved
@behrmann
Copy link

Sorry to ask, but is there anything to help this PR along, e.g. with testing in the field? We have LaTeX support enabled by default on our Element instance, since people are more likely to need LaTeX than $ sigils on code blocks, but I'm still bitten by this daily. :)

@jryans jryans self-requested a review March 30, 2021 10:56
Copy link
Collaborator

@jryans jryans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we're quite close to something that could land here. Let's try refactoring a bit to reduce duplication, and then try it out.

// conditions for display math detection $$...$$:
// - pattern starts at beginning of line or is not prefixed with backslash or dollar
// - left delimiter ($$) is not escaped by backslash
const displayPatternAlternative = (SdkConfig.get()['latex_maths_delims'] ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks we have two similar chunks of code, one for the alternate patterns and another for default. Rather than duplicating the logic in two chunks, could you refactor it out so the common work is de-duplicated to function the takes e.g. the patterns it needs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored the code and hope it is what you expected.

@jryans jryans removed the request for review from uhoreg March 31, 2021 16:10
@rda0
Copy link
Contributor Author

rda0 commented Apr 1, 2021

In 73130ca I also changed the pattern naming in the config to be more specific:

"latex_maths_delims": {
    "inline_left": "\\(",
    "inline_right": "\\)",
    "inline_pattern_latex": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
    "inline_pattern_tex": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$",
    "display_left": "\\[",
    "display_right": "\\]",
    "display_pattern_latex": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
    "display_pattern_tex": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
},

@jryans
Copy link
Collaborator

jryans commented Apr 6, 2021

@rda0 Thanks for the refactoring! It seems like the config might be a bit nicer to use if inline vs. display were grouped in the their own nested objects, sort of similar to what you've done with the defaults. Do you agree?

There are also some test failures... perhaps try merging develop?

@rda0
Copy link
Contributor Author

rda0 commented Apr 6, 2021

@jryans The config is now:

    "latex_maths_delims": {
        "inline": {
            "left": "\\(",
            "right": "\\)",
            "pattern": {
                "latex": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
                "tex": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$"
            }
        },
        "display": {
            "left": "\\[",
            "right": "\\]",
            "pattern": {
                "latex": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
                "tex": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
            }
        }
    },

I also merged develop without problems except for new linting errors appearing now, which I do not know how to resolve. I am sorry, but my skills as developer and especially using this language is very limited😉 Would be great if someone could help out.

@jryans
Copy link
Collaborator

jryans commented Apr 7, 2021

Hmm, looks like we may need a few tweaks to adjust to the cheerio API:

image

For example, it seems contents doesn't take any arguments so the div, span could be removed... unless you need to filter out other elements, in which case you may need a different API.

About .data, what does reading and writing that actually do...? I admit I am not quite sure what it's trying to achieve when looking closely... 😅

@rda0
Copy link
Contributor Author

rda0 commented Apr 8, 2021

@jryans Thanks to a friend who figured out how to use the cheerio api correctly, we were able to remove my hack via the .data attribute.

Copy link
Collaborator

@jryans jryans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work here, let's give it a go! 😄

@jryans jryans merged commit ae2082b into matrix-org:develop Apr 9, 2021
@rda0 rda0 deleted the maths-parsing-latex branch April 9, 2021 10:38
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LaTeX parsing cannot discern $ when not intended to be used
4 participants