MSC2191: Markup for mathematical messages #2191

Merged: 7 commits, Apr 15, 2024
Changes from 5 commits
138 changes: 138 additions & 0 deletions proposals/2191-maths.md
@@ -0,0 +1,138 @@
# MSC2191: Markup for mathematical messages

Some people write using an odd language that has strange symbols. No, I'm not
talking about computer programmers; I'm talking about mathematicians. In order
to aid these people in communicating, Matrix should define a standard way of
including mathematical notation in messages.
Comment on lines +3 to +6

I would argue that the vast majority of users will never send mathematical expressions to each other. Is the complexity really worth it? LaTeX is non-trivial to parse and to render.

Also, if Matrix is going to accept mathematical notations, what about other domains, like chemistry, physics, … and all the myriad of other notations?

Member

Clients aren't required to render or parse the notation, which is why a fallback is present. Several clients do wish to represent mathematical expressions to users though, and having a consistent and standardized way to do so is important.

MSCs for other notations are equally accepted, provided they have similar fallback mechanics.


This proposal presents a format using LaTeX, in contrast with a [previous
proposal](https://github.com/matrix-org/matrix-doc/pull/1722/) that used
MathML.

See also:

- https://github.com/vector-im/riot-web/issues/1945


## Proposal

A new attribute `data-mx-maths` will be added for use in `<span>` or `<div>`
elements. Its value will be mathematical notation in LaTeX format. `<span>`
is used for inline math, and `<div>` for display math. The contents of the
`<span>` or `<div>` will be a fallback representation of the desired notation
for clients that do not support mathematical display, or that are unable to
render the entire `data-mx-maths` attribute. The fallback representation is
left up to the sending client and could be, for example, an image, an HTML
approximation, or the raw LaTeX source. When using an image as a fallback, the
sending client should be aware of issues that may arise from the receiving
client using a different background colour.

Example (with line breaks and indentation added to `formatted_body` for clarity):

```json
{
    "content": {
        "body": "This is an equation: sin(x)=a/b",
        "format": "org.matrix.custom.html",
        "formatted_body": "This is an equation:
            <span data-mx-maths=\"\\sin(x)=\\frac{a}{b}\">
                sin(<i>x</i>)=<sup><i>a</i></sup>/<sub><i>b</i></sub>
            </span>",
        "msgtype": "m.text"
    },
    "event_id": "$eventid:example.com",
    "origin_server_ts": 1234567890,
    "sender": "@alice:example.com",
    "type": "m.room.message",
    "room_id": "!soomeroom:example.com"
}
```
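
In the example, the `data-mx-maths` attribute carries the LaTeX source and the body of the `<span>` is the fallback. As a rough illustration of how a sending client might assemble such content, here is a minimal Python sketch using only the standard library. The helper names `maths_span` and `maths_message` are hypothetical and not part of this proposal, and the fallback HTML is assumed to be already sanitised by the caller.

```python
import html
import json


def maths_span(latex: str, fallback_html: str, display: bool = False) -> str:
    """Wrap fallback HTML in a <span> (inline) or <div> (display) element
    whose data-mx-maths attribute holds the LaTeX source."""
    tag = "div" if display else "span"
    # Escape &, <, > and quotes so the LaTeX is a valid attribute value.
    attr = html.escape(latex, quote=True)
    return f'<{tag} data-mx-maths="{attr}">{fallback_html}</{tag}>'


def maths_message(prefix: str, latex: str, plain: str, fallback_html: str) -> dict:
    """Build m.room.message content: plain-text body plus formatted_body."""
    return {
        "msgtype": "m.text",
        "body": prefix + plain,
        "format": "org.matrix.custom.html",
        "formatted_body": html.escape(prefix) + maths_span(latex, fallback_html),
    }


content = maths_message(
    "This is an equation: ",
    r"\sin(x)=\frac{a}{b}",
    "sin(x)=a/b",
    "sin(<i>x</i>)=<sup><i>a</i></sup>/<sub><i>b</i></sub>",
)
# json.dumps adds the JSON-level escaping of quotes and backslashes.
print(json.dumps(content, indent=2))
```

Serialising with `json.dumps` produces the doubled backslashes and escaped quotes seen in the example above, so the sending code never has to escape them by hand.
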
Member

@anoadragon453 Feb 20, 2024

Could you show an example of the event JSON that a sending client would use if they were including a fallback for the receiver?

Member Author

@uhoreg Feb 20, 2024

I'm not sure that I understand what you're asking. The example given already includes the fallback. The <span data-mx-maths=\"\\sin(x)=\\frac{a}{b}\">sin(<i>x</i>)=<sup><i>a</i></sup>/<sub><i>b</i></sub></span> includes the LaTeX notation (\sin(x)=\frac{a}{b}) and the fallback (sin(<i>x</i>)=<sup><i>a</i></sup>/<sub><i>b</i></sub>, which in this case is an HTML rendering of the equation)

Member

Oh right! I had entirely missed that the body of the <span> was the fallback representation. Including the second half of your comment just under the example would have helped nail that point home in the MSC.



## Other solutions

[MSC1722](https://github.com/matrix-org/matrix-doc/pull/1722/) proposes using
MathML as the format for transporting mathematical notation. It also summarizes
some other solutions in its "Other Solutions" section.

In comparison with MathML, LaTeX has several advantages and disadvantages.

The first advantage, which is quite obvious, is that LaTeX is much less verbose
and more readable than MathML. In many cases, the LaTeX code is a suitable
fallback for the rendered notation.

LaTeX is a suitable input method for many people, and so converting from a
user's input to the message format would be a no-op.

However, balanced against these advantages, LaTeX has several disadvantages as
a message format. Some of these are covered in the "Potential issues" and
"Security considerations" sections.


## Potential issues

### "LaTeX" as a format is poorly defined

There are several extensions to LaTeX that are commonly used, such as
AMS-LaTeX. It is unclear which extensions should be supported, and which
should not be supported. Different LaTeX-rendering libraries support different
sets of commands.

This proposal suggests that the receiving client should render the LaTeX
version if possible, but if it contains unsupported commands, then it should
display the fallback. Thus, it is up to the receiving client to decide what
commands it will support, rather than dictating what commands must be
supported. This comes at a cost of possible inconsistency between clients, but
is somewhat mitigated by the use of a fallback. Clients should, however, aim
to support, at minimum, the basic LaTeX2e maths commands and the TeX maths
commands, with the possible exception of commands that could be security risks
(see below).

To improve compatibility, the sender's client may warn the sender if they are
using a command that comes from another package, such as AMS-LaTeX.
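
The receiving behaviour described above might look roughly like the following Python sketch, which uses only the standard library. `MathsExtractor`, `extract_maths`, `render_or_fallback` and `UnsupportedLatex` are hypothetical names for illustration, and `render` stands in for whatever LaTeX rendering library the client uses. For brevity the sketch keeps only the text of the fallback; a real client would keep the sanitised fallback HTML.

```python
from html.parser import HTMLParser


class UnsupportedLatex(Exception):
    """Hypothetical error raised by a renderer for LaTeX it cannot handle."""


class MathsExtractor(HTMLParser):
    """Collects (latex, fallback_text) pairs from a formatted_body."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._latex = None  # LaTeX of the maths element being read, if any
        self._tag = None    # tag name of that element ("span" or "div")
        self._depth = 0     # nesting depth of same-named child elements
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._latex is not None:
            if tag == self._tag:
                self._depth += 1
            return
        latex = dict(attrs).get("data-mx-maths")
        if tag in ("span", "div") and latex is not None:
            self._latex, self._tag, self._depth, self._text = latex, tag, 0, []

    def handle_endtag(self, tag):
        if self._latex is None or tag != self._tag:
            return
        if self._depth:
            self._depth -= 1
            return
        self.items.append((self._latex, "".join(self._text).strip()))
        self._latex = None

    def handle_data(self, data):
        if self._latex is not None:
            self._text.append(data)


def extract_maths(formatted_body):
    parser = MathsExtractor()
    parser.feed(formatted_body)
    parser.close()
    return parser.items


def render_or_fallback(latex, fallback, render):
    """Show the rendered LaTeX if the renderer accepts it, else the fallback."""
    try:
        return render(latex)
    except UnsupportedLatex:
        return fallback
```

Because the decision is made per element, a message containing several formulas can show some of them rendered and others as fallbacks.
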

### Lack of libraries for displaying mathematics

See the corresponding section in [MSC1722](https://github.com/matrix-org/matrix-spec-proposals/pull/1722/files#diff-4a271297299040dbfa622bfc6d2aab02f9bc82be0b28b2a92ce30b14c5621f94R148-R164).


## Security considerations
Member

I've done little with LaTeX, but it does a lot more than just math symbols -- it is a whole typesetting system. Embedding it in an HTML attribute sounds confusing, especially since you have to escape backslashes (which are used a lot in LaTeX).

I was curious how Wikipedia handled formulas, since they have to render untrusted input as well. tl;dr is that you need to install and use a whole heap of software to do this correctly, including texvc (which uses OCaml), LaTeX itself, etc.

Contributor

I assume that most implementations will use MathJax or similar which as far as I know just implements a subset of LaTeX specifically geared towards math.

It might be worth explicitly recommending the use of a narrowly-scoped LaTeX rendering library.

Member Author

Yes, this proposal is solely about the math part of TeX/LaTeX, and not about any of the other document processing bits. I can try to clarify it.

I do recommend against running the latex command to render it. I can explicitly recommend some software (MathJax and KaTeX are the main ones). I'm surprised Wikipedia still uses images for math.

Member

> Yes, this proposal is solely about the math part of TeX/LaTeX, and not about any of the other document processing bits. I can try to clarify it.

I think the proposal is clear, I'm just concerned that it would be an easy vector to add security vulnerabilities to applications.

> I do recommend against running the latex command to render it. I can explicitly recommend some software (MathJax and KaTeX are the main ones). I'm surprised Wikipedia still uses images for math.

Looking through the implementations it doesn't seem they attempt to sanitize input or anything -- maybe this is OK though since MathJax and KaTeX only handle math anyway? Looking at the MathJax docs it does seem to allow e.g. defining macros by default.

This might just need a big warning in the spec PR that says to be careful, but it seems a bit weird that the spec is very explicit about what HTML tags/attributes to support but here we just shrug and don't give real advice.

Member Author

Yes, it looks like MathJax and KaTeX both allow defining macros, but they limit recursion, which is the main issue that macros can cause.

> This might just need a big warning in the spec PR that says to be careful, but it seems a bit weird that the spec is very explicit about what HTML tags/attributes to support but here we just shrug and don't give real advice.

Yeah. Part of the reason here is that there is a huge number of LaTeX commands -- mostly for specific symbols. There are also extensions that define their own commands that some renderers may want to support. Another reason for being lax with specifying what to support is the tooling, or lack thereof. While there are a lot of HTML sanitizers that allow you to specify exactly what's allowed, there is a general lack of LaTeX sanitizers. Instead, the rendering libraries generally provide options for what unsafe things to allow, if any. So if we tried to tell clients what commands to support and what commands not to support, client authors might need to write their own LaTeX parsers, which would not be pleasant.


LaTeX is a [Turing complete programming
language](https://web.archive.org/web/20160110102145/http://en.literateprograms.org/Turing_machine_simulator_%28LaTeX%29);
it is possible to write a LaTeX document that contains an infinite loop, or
that will require large amounts of memory. While it may be fun to write a
[LaTeX file that can control a Mars
Rover](https://wiki.haskell.org/wikiupload/8/85/TMR-Issue13.pdf#chapter.2), it
is not desirable for a mathematical formula embedded in a Matrix message to
control a Mars Rover. Clients should take precautions when rendering LaTeX.
Clients that use a rendering library should only use one that can process the
LaTeX safely.

Clients should not render mathematics by calling the `latex` executable without
proper sandboxing, as the `latex` executable was not written to handle
untrusted input. (See, for example, <https://hovav.net/ucsd/dist/texhack.pdf>,
<https://0day.work/hacking-with-latex/>, and
<https://hovav.net/ucsd/dist/tex-login.pdf>.) Some LaTeX rendering libraries
are better suited for processing untrusted input.

Certain commands, such as [those that can create
macros](https://katex.org/docs/supported#macros), are potentially dangerous;
clients should either decline to process those commands, or should take care to
ensure that they are handled in safe ways (such as by limiting recursion). In
general, LaTeX commands should be filtered by allowing known-good commands
rather than forbidding known-bad commands. Some LaTeX libraries may have
options for doing this.
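
For clients whose rendering library has no such option, an allow-list check could be sketched along these lines in Python. Everything here is an assumption made for illustration: the contents of `ALLOWED_COMMANDS` and `ALLOWED_SYMBOLS`, the `MAX_LENGTH` limit, and the function names. The regular expression only finds control sequences; it is not a LaTeX parser, so it should be treated as a pre-filter in front of a safe renderer rather than a replacement for one.

```python
import re

# Hypothetical allow-list; a real client would choose its own set.
ALLOWED_COMMANDS = frozenset({
    "sin", "cos", "tan", "log", "frac", "sqrt", "sum", "int",
    "alpha", "beta", "pi", "left", "right", "cdot", "times",
})
# Single-character control symbols such as \, \; \{ \} and \\.
ALLOWED_SYMBOLS = frozenset({",", ";", "!", "{", "}", "\\", " "})
# Assumed upper bound on the size of the LaTeX a client will accept.
MAX_LENGTH = 2000

# A backslash followed by a run of letters, or by any one other character.
_CONTROL_SEQUENCE = re.compile(r"\\([A-Za-z]+|.)", re.DOTALL)


def rejected_commands(latex: str) -> set:
    """Return the control sequences in `latex` that are not allow-listed."""
    names = {m.group(1) for m in _CONTROL_SEQUENCE.finditer(latex)}
    return names - ALLOWED_COMMANDS - ALLOWED_SYMBOLS


def is_safe_to_render(latex: str) -> bool:
    """Accept only short input made entirely of allow-listed commands."""
    return len(latex) <= MAX_LENGTH and not rejected_commands(latex)
```

Since macro-defining commands such as `\def` or `\newcommand` are simply absent from the allow-list, they are rejected without having to be enumerated, which is the point of allowing known-good commands rather than forbidding known-bad ones.
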

In general, LaTeX places a heavy burden on client authors to ensure that it is
processed safely. Some LaTeX rendering libraries provide security advice, for
example, <https://github.com/KaTeX/KaTeX/blob/main/docs/security.md>.


## Conclusion

Math(s) is hard, but LaTeX makes it easier to write mathematical notation.
However, using LaTeX as a format for including mathematics in Matrix messages
has some serious downsides. Nevertheless, if clients handle the LaTeX
carefully, or rely on the fallback representation, the concerns can be
addressed.