-
-
Notifications
You must be signed in to change notification settings - Fork 829
Use LaTeX and TeX delimiters by default #5515
Conversation
Since parsing for $'s as maths delimiters is tricky, switch the default to \(...\) for inline and \[...\] for display maths as it is used in LaTeX. Add /tex command to explicitly parse in TeX mode, which uses $...$ for inline and $$...$$ for display maths. Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>
Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>
Signed-off-by: Sven Mäder <maeder@phys.ethz.ch>
Adding @uhoreg separately to get his opinion on these change as a user of LaTeX input in Matrix. |
If the proposal using a I only got feedback from two of our interested Matrix/LaTeX users and both reported that An additional configuration option for the regex in the |
Even though |
Here's the result of my attempts today at playing with regular expressions: const testCases = [
// inline math:
["$x^2$", '<span data-mx-maths="x^2"></span>'],
// math can be followed by punctuation
["$x^2$?", '<span data-mx-maths="x^2"></span>?'],
// escaped dollar sign inside math
["$\\$$", '<span data-mx-maths="\\$"></span>'],
["$3 + \\$3$", '<span data-mx-maths="3 + \\$3"></span>'],
// math can be preceded and followed by whitespace
["hello $x^2$ world", 'hello <span data-mx-maths="x^2"></span> world'],
["hello\n$x^2$\nworld", 'hello\n<span data-mx-maths="x^2"></span>\nworld'],
// when using \( \), be less strict (e.g. allow spaces after opening
// delimiter, before delimiter, and newlines inside)
["a\\( x^2 \\)b", 'a<span data-mx-maths=" x^2 "></span>b'],
["a\\(x\ny\\)b", 'a<span data-mx-maths="x\ny"></span>b'],
// display math: (mostly similar to inline, but a bit less strict)
["$$x^2$$", '<div data-mx-maths="x^2">\n\n</div>'],
// allow newlines inside display math
["$$x\ny$$", '<div data-mx-maths="x\ny">\n\n</div>'],
["hello $$x^2$$", 'hello <div data-mx-maths="x^2">\n\n</div>'],
["hello\n$$x^2$$", 'hello\n<div data-mx-maths="x^2">\n\n</div>'],
["a\\[ x^2 \\]b", 'a<div data-mx-maths=" x^2 ">\n\n</div>b'],
["a\\[x\ny\\]b", 'a<div data-mx-maths="x\ny">\n\n</div>b'],
// not math:
["$$", '$$'],
["$ $", '$ $'],
// don't allow newlines inside inline math with dollars -- if you want to
// have newlines, use \(...\)
["$x\ny$", '$x\ny$'],
// don't allow space after first $ or before second $
["$ x$", '$ x$'],
["$x $", '$x $'],
// first $ has to be preceded by whitespace or be at beginning of message,
// second $ has to be followed by whitespace or be at end of message
["$x = $y", '$x = $y'],
["1$ and 2$", '1$ and 2$'],
// escaped dollar signs
["\\$x$", '$x$'],
["$x\\$", '$x$'],
];
const nonSpaceMathChar = "(?:[^$\\\\\\s\\n]|\\\\.)";
const nonNewlineMathChar = "(?:[^$\\\\\\n]|\\\\.)";
const mathChar = "(?:[^$\\\\]|\\\\.)";
const inlineDollar = "(^|\\s)\\$(" +
nonSpaceMathChar +
"|(?:" + nonSpaceMathChar + nonNewlineMathChar + "*" + nonSpaceMathChar +
"))\\$(?=$|\\s|[.,!?:;])";
const displayDollar = "(^|\\s)\\$\\$(" + mathChar + "+)\\$\\$(?=$|\\s|[.,!?:;])";
const inlineBracket = "\\\\\\((" + mathChar + "+)\\\\\\)";
const displayBracket = "\\\\\\[(" + mathChar + "+)\\\\\\]";
const latexRegexp = RegExp(
"(" + inlineDollar + "|" + displayDollar + "|" + inlineBracket + "|" +
displayBracket + "|\\\\\\$)",
"gm",
);
function replaceLatex(str) {
return str.replace(
latexRegexp,
function(m, p1, p2, p3, p4, p5, p6, p7) {
if (p1 === "\\$") {
return "$";
} else if (p3) {
// FIXME: p3 should be HTML escaped -- same mith p3, p6, p7 below
return `${p2}<span data-mx-maths="${p3}"></span>`;
} else if (p5) {
return `${p4}<div data-mx-maths="${p5}">\n\n</div>`;
} else if (p6) {
return `<span data-mx-maths="${p6}"></span>`;
} else {
return `<div data-mx-maths="${p7}">\n\n</div>`;
}
},
);
}
for (const [input, expected] of testCases) {
const result = replaceLatex(input);
if (result === expected) {
console.log("✔️", JSON.stringify(result));
} else {
console.log("got", result);
}
} Basically, it allows users to use both |
I think with c4f0987 it now provides the same functionality as in #5515 (comment). It also passes all tests in that comment. The following changed:
"latex_maths_delims": {
"inline_left": "\\(",
"inline_right": "\\)",
"inline_pattern": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
"display_left": "\\[",
"display_right": "\\]",
"display_pattern": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]"
} I will try to use the code in #5515 (comment) tomorrow and run some more tests. If both solutions are equivalent, then I do not have a preference which one to use. @uhoreg What do think about still providing the |
I am indifferent. I think that with allowing both |
@uhoreg I tested the code from #5515 (comment) in Element, and I think the code/regex in the PR as it is now works better. My tests are not as nicely done as yours, I just send a huge message with all kinds of maths/markdown combinations (resulting in the screenshot) and a huge message with random tests. I tested this with: export function htmlSerializeIfNeeded(model: EditorModel, {forceHTML = false} = {}) {
let md = mdSerialize(model);
// copy of raw input to remove unwanted math later
const orig = md;
if (SettingsStore.getValue("feature_latex_maths")) {
const nonSpaceMathChar = "(?:[^$\\\\\\s\\n]|\\\\.)";
const nonNewlineMathChar = "(?:[^$\\\\\\n]|\\\\.)";
const mathChar = "(?:[^$\\\\]|\\\\.)";
const inlineDollar = "(^|\\s)\\$(" +
nonSpaceMathChar +
"|(?:" + nonSpaceMathChar + nonNewlineMathChar + "*" + nonSpaceMathChar +
"))\\$(?=$|\\s|[.,!?:;])";
const displayDollar = "(^|\\s)\\$\\$(" + mathChar + "+)\\$\\$(?=$|\\s|[.,!?:;])";
const inlineBracket = "\\\\\\((" + mathChar + "+)\\\\\\)";
const displayBracket = "\\\\\\[(" + mathChar + "+)\\\\\\]";
const latexRegexp = RegExp(
"(" + inlineDollar + "|" + displayDollar + "|" + inlineBracket + "|" +
displayBracket + "|\\\\\\$)",
"gm",
);
md = md.replace(
latexRegexp,
function(m, p1, p2, p3, p4, p5, p6, p7) {
if (p1 === "\\$") {
return "$";
} else if (p3) {
const p3e = AllHtmlEntities.encode(p3);
return `${p2}<span data-mx-maths="${p3e}"></span>`;
} else if (p5) {
const p5e = AllHtmlEntities.encode(p5);
return `${p4}<div data-mx-maths="${p5e}">\n\n</div>`;
} else if (p6) {
const p6e = AllHtmlEntities.encode(p6);
return `<span data-mx-maths="${p6e}"></span>`;
} else {
const p7e = AllHtmlEntities.encode(p7);
return `<div data-mx-maths="${p7e}">\n\n</div>`;
}
},
);
// make sure div tags always start on a new line, otherwise it will confuse
// the markdown parser
md = md.replace(/(.)<div/g, function(m, p1) { return `${p1}\n<div`; });
}
const parser = new Markdown(md);
if (!parser.isPlainText() || forceHTML) {
// feed Markdown output to HTML parser
const phtml = cheerio.load(parser.toHTML(),
{ _useHtmlParser2: true, decodeEntities: false });
if (SettingsStore.getValue("feature_latex_maths")) {
// original Markdown without LaTeX replacements
const parserOrig = new Markdown(orig);
const phtmlOrig = cheerio.load(parserOrig.toHTML(),
{ _useHtmlParser2: true, decodeEntities: false });
// since maths delimiters are handled before Markdown,
// code blocks could contain mangled content.
// replace code blocks with original content
phtml('code').contents('div, span').each(function(i) {
const origData = phtmlOrig('code').contents('div, span')[i].data;
phtml('code').contents('div, span')[i].data = origData;
});
// add fallback output for latex math, which should not be interpreted as markdown
phtml('div, span').each(function(i, e) {
const tex = phtml(e).attr('data-mx-maths')
if (tex) {
phtml(e).html(`<code>${tex}</code>`)
}
});
}
return phtml.html();
}
// ensure removal of escape backslashes in non-Markdown messages
if (md.indexOf("\\") > -1) {
return parser.toPlaintext();
}
} Test messages: markdown_latex.txt Test results: markdown_latex_result_pr.txt From what I can see, only Could be that I did something wrong trying to integrate the code into the SDK. |
ac1f9b4 makes the alternative pattern (using "latex_maths_delims": {
"inline_left": "\\(",
"inline_right": "\\)",
"inline_pattern": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
"inline_pattern_alternative": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$",
"display_left": "\\[",
"display_right": "\\]",
"display_pattern": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
"display_pattern_alternative": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
}, |
Sorry to ask, but is there anything to help this PR along, e.g. with testing in the field? We have LaTeX support enabled by default on our Element instance, since people are more likely to need LaTeX than $ sigils on code blocks, but I'm still bitten by this daily. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we're quite close to something that could land here. Let's try refactoring a bit to reduce duplication, and then try it out.
src/editor/serialize.ts
Outdated
// conditions for display math detection $$...$$: | ||
// - pattern starts at beginning of line or is not prefixed with backslash or dollar | ||
// - left delimiter ($$) is not escaped by backslash | ||
const displayPatternAlternative = (SdkConfig.get()['latex_maths_delims'] || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks we have two similar chunks of code, one for the alternate patterns and another for default. Rather than duplicating the logic in two chunks, could you refactor it out so the common work is de-duplicated to function the takes e.g. the patterns it needs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored the code and hope it is what you expected.
In 73130ca I also changed the pattern naming in the config to be more specific: "latex_maths_delims": {
"inline_left": "\\(",
"inline_right": "\\)",
"inline_pattern_latex": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
"inline_pattern_tex": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$",
"display_left": "\\[",
"display_right": "\\]",
"display_pattern_latex": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
"display_pattern_tex": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
}, |
@rda0 Thanks for the refactoring! It seems like the config might be a bit nicer to use if inline vs. display were grouped in the their own nested objects, sort of similar to what you've done with the defaults. Do you agree? There are also some test failures... perhaps try merging develop? |
@jryans The config is now: "latex_maths_delims": {
"inline": {
"left": "\\(",
"right": "\\)",
"pattern": {
"latex": "(^|[^\\\\])\\\\\\((?!\\\\\\))(.*?)\\\\\\)",
"tex": "(^|\\s|[.,!?:;])(?!\\\\)\\$(?!\\s)(([^$\\n]|\\\\\\$)*([^\\\\\\s\\$]|\\\\\\$)(?:\\\\\\$)?)\\$"
}
},
"display": {
"left": "\\[",
"right": "\\]",
"pattern": {
"latex": "(^|[^\\\\])\\\\\\[(?!\\\\\\])(.*?)\\\\\\]",
"tex": "(^|[^\\\\$])\\$\\$(([^$]|\\\\\\$)+?)\\$\\$"
}
}
}, I also merged develop without problems except for new linting errors appearing now, which I do not know how to resolve. I am sorry, but my skills as developer and especially using this language is very limited😉 Would be great if someone could help out. |
Hmm, looks like we may need a few tweaks to adjust to the For example, it seems About |
@jryans Thanks to a friend who figured out how to use the cheerio api correctly, we were able to remove my hack via the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work here, let's give it a go! 😄
This PR fixes element-hq/element-web#15846 and I believe it addresses all suggestions in the comments.
I tried to find regular expressions for the
$...$
and$$...$$
parsing to avoid interference for normal users as much as possible. The regex allows to escape$
characters inside and outside of the maths pattern, disallows multi-line inline maths and has some more restrictions to avoid unwanted mangling of the input (see code comments).This PR proposes to switch the default delimiters to use LaTeX style delimiters for sending and editing messages:
\(...\)
\[...\]
Alternatively TeX style delimiters can be used for sending maths:
$...$
$$...$$
According to Are \( and \) preferable to dollar signs for math mode? and Why is \[ … \] preferable to $$ … $$?:
To avoid mangled code blocks, the input is converted to markdown twice, once using latex-then-markdown and once using markdown-only parsing. The clean code blocks are then copied back to the combined output.
The configuration options introduced in #5244 may not be compatible anymore, since the regex replace code for inline maths has slightly changed. You can reproduce the default behavior in
config.json
now as follows:Signed-off-by: Sven Mäder maeder@phys.ethz.ch