Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RTE inserts spurious backslashes into URLs (including room aliases, etc) in code blocks #4809

Closed
ara4n opened this issue Aug 17, 2017 · 35 comments
Labels
P1 S-Minor Impairs non-critical functionality or suitable workarounds exist T-Defect X-Regression

Comments

@ara4n
Copy link
Member

ara4n commented Aug 17, 2017

It shows up in the timeline as #freenode\_#foo:matrix.org

@superdump
Copy link
Contributor

It's not just rooms. It's a bunch of situations with underscores.

@lukebarnard1
Copy link
Contributor

Only happens when something gets linkified because we escape URLs in MD, regardless of whether they're in a code block. So actual = expected behaviour outside of code blocks only.

@lampholder
Copy link
Member

@lukebarnard1 came to discuss this IRL - we floated the idea of

  • when in markdown mode
  • all links (or linkifiable strings) that would need escaping will be flagged with an icon
  • clicking that icon excapes the link

@lampholder lampholder added T-Defect S-Minor Impairs non-critical functionality or suitable workarounds exist P2 type:rte labels Aug 18, 2017
@richvdh richvdh changed the title pasting a room alias like #freenode_#foo:matrix.org into a triple-tick block in RTE results in spuriously escaped underscore RTE inserts spurious underscores into URLs (including room aliases, etc) Sep 7, 2017
@richvdh richvdh changed the title RTE inserts spurious underscores into URLs (including room aliases, etc) RTE inserts spurious backslashes into URLs (including room aliases, etc) Sep 7, 2017
@richvdh richvdh changed the title RTE inserts spurious backslashes into URLs (including room aliases, etc) RTE inserts spurious backslashes into URLs (including room aliases, etc) in code blocks Sep 7, 2017
@richvdh
Copy link
Member

richvdh commented Sep 7, 2017

it feels wrong to be making the user figure out whether the URL should be escaped or not. The user really shouldn't have to be making that sort of decision.

The "right" solution seems to be that we shouldn't be escaping underscores in URLs in code-blocks. Is the reason for dismissing that that it's Hard?

@lampholder
Copy link
Member

I think I need to talk to @lukebarnard1 about this one again when he gets back...

@lukebarnard1
Copy link
Contributor

Is the reason for dismissing that that it's Hard?

Basically, yes. The solution that involves parsing the MD and only escaping URLs outside of code blocks seems tractable but fiddly. The alternative being we un-escape stuff in code blocks regardless of what's in them, but that sounds worse.

@ara4n
Copy link
Member Author

ara4n commented Sep 27, 2017

this is really really annoying, and it's a regression - it stops you doing completely reasonable things like:

screen shot 2017-09-27 at 09 53 37

...which means that folks can't then copy-paste the block to get a working config.

Why are we escaping URLs in the first place?

@t3chguy
Copy link
Member

t3chguy commented Sep 27, 2017

so urls like http://google.com/_thing_ don't end up with thing (italics)

@ara4n
Copy link
Member Author

ara4n commented Sep 27, 2017

okay, so we've broken one thing (URLs in codeblocks) with a somewhat questionable fix for another thing (escaping underscores in random inline URLs).

I don't think we should be inventing our own MD dialect by escaping underscores in the first place. If users are in MD mode, we should accept normal commonmark rather than trying to pass it through a custom preprocessor. In commonmark, http://google.com/_thing_ is indeed italics, and users need to know to put it in backticks to avoid it. In practice Joe Public should be using the RTE rather than markdown anyway, leaving MD to the geeks who should be aware of this, imo.

Another solution would be to use a better MD dialect than commonmark - github flavoured markdown gets the http://google.com/_thing_ right without needing backticks. But of course we moved away from GFM to commonmark in order to be more consistent cross-platform.

So, TL;DR: i think we should remove the escaping across the board, as adding fixes which introduce new problems are probably the wrong fixes in the first place.

@richvdh
Copy link
Member

richvdh commented Sep 27, 2017

ftr, the thing that we fixed was #3428. In particular it was painful that URLs like https://riot.im/app/#room/#_foo_bar:matrix.org (outside codeblocks) got turned into https://riot.im/app/#room/#<italics>foo<italics>bar:matrix.org.

@maxgrenderjones
Copy link

Does that count as a bug in common mark, given how unlikely it is that the italics would be the intended behaviour? Perhaps you could persuade them to fix?

@t3chguy
Copy link
Member

t3chguy commented Sep 27, 2017

@maxgrenderjones
In part, for urls but this issue also applies to formattification of matrix id's such as #_foo_bar:matrix.org

@t3chguy
Copy link
Member

t3chguy commented Sep 27, 2017

Which gfm somehow handles sanely fwiw

@ara4n
Copy link
Member Author

ara4n commented Sep 28, 2017

So after further discussion this morning, my proposal doesn't work given how common underscores are in our URLs, as Rich points out. I see three possible solutions:

  1. move away from _ in URLs, shifting to dots for URL separators instead. This feels preferable anyway, given _'s disappear when hyperlinks get underlined, and the whole thing looks rather ugly. #irc.freenode.matrix:matrix.org looks much prettier aesthetically than #_freenode_matrix:matrix.org imo.
  2. move away from commonmark to another markdown dialect which doesn't turn https://foo.com/_bar_ into italics. My vote would be GFM, which served us well at the time (and gives us strikethru and tables and other stuff)... other than:
  1. hack all our commonmark libraries to linkify URLs which aren't in code-blocks before sending. This means doing so after having parsed the MD into code blocks, and technically means we're speaking our own dialect - at which point many of the benefits of commonmark fall away.

Thoughts?

@turt2live
Copy link
Member

I like option 1 regardless of this issue. It kinda sucks that the migration path would be manual and time consuming, if a migration path is desired.

GFM sounds like a reasonable choice, as it is reasonably well known, reasonably documented, and mostly okay. The changing spec and lack of updates is a bit worrying though. If web goes with GFM, I'd also consider it reasonable to have the mobile apps not be GFM for the sake of resources. GFM gives more features, but users are (hopefully) unlikely to do any serious markdown in the mobile apps.

Just my 2c.

@maxgrenderjones
Copy link

I'd be for 1. All the _s are an eyesore and matrix look unnecessarily weird & non-standard

@ara4n
Copy link
Member Author

ara4n commented Oct 5, 2017

i'm also torn between whether errant escaped underscores in codeblocks is more or less offensive than errant italics in room aliases...

@richvdh
Copy link
Member

richvdh commented Oct 5, 2017

I wouldn't be against 1, but (a) I'm not sure dots are very much better; (b) migrating sounds very painful.

The absence of maintained parsers for GFM makes it something of a non-starter for me. It's certainly a good step in the right direction that there is a spec, but there aren't any parsers I really trust. Tables are nice but feel overspecced for an instant messenger. Strikethrough is <del>...</del> in CM, which isn't super convenient but it is there.

I'd be interested to know how much work it would be to patch the CM parsers. I'd hope it wouldn't be too much. Speaking a slightly different variant doesn't sound awful to me.

@ara4n
Copy link
Member Author

ara4n commented Nov 14, 2017

i'm seeing yet more errant escaping going on unrelated to backslashes:

you just hyperlink to `https://matrix.to/#/@matthew:matrix.org` in the HTML formatted bit of the message

ends up getting rendered as:

you just hyperlink to <code>https://matrix.to/#/@matthew:matrix.org\\</code> in the HTML formatted bit of the message

and I have no idea where that backslash is coming from >:(

@richvdh
Copy link
Member

richvdh commented Nov 14, 2017

i'm seeing yet more errant escaping going on unrelated to backslashes:

do you mean underscores? In general it's known that the problem isn't limited to underscores; though I must say your example above seems a particular mystery.

@ara4n
Copy link
Member Author

ara4n commented Jan 3, 2018

i would really like to fix this by returning to GFM rather than inventing our own dialect of markdown, especially as GFM has a spec now: https://github.github.com/gfm/. does anyone object to actually doing this?

@ara4n
Copy link
Member Author

ara4n commented Jan 4, 2018

The reason for this is:

  • We've already customised commonmark into its own dialect:
    • to (badly) autoescape underscores in things which linkify thinks are links
    • to strip paragraph tags from oneliners
    • to stop blockquotes all being on oneline
  • As well as a bunch of other custom behaviour for whitelisting particular tags etc.
  • GFM has a spec now.

Whereas the only problem I actually know we had with GFM is that it turned -_- into an hr - which empirically github and other GFM editors get right these days. Plus marked didn't have a proper html parser, but I'm not sure we actually need it in the end?

@dbkr, wdyt?

@ara4n
Copy link
Member Author

ara4n commented Jan 4, 2018

here's another broken example related to this:

i.e. `for i in \`seq 0 1500\`; do curl https://ldbco.de/_matrix/.../createRoom; done`

was rendered as:

screen shot 2018-01-04 at 11 26 08

@maxgrenderjones
Copy link

maxgrenderjones commented Jan 4, 2018

Given GFM spec is built on top of commonmark, I don't think your problems go away entirely. Granted autolinks means that http://foo.com/_example_/bar should do the right thing outside a codeblock, but #_freenode_matrix:matrix.org won't because it doesn't look sufficiently URLy. (Of course if you ban/avoid doubled _ in channel names that would ameliorate the problem. Or extend autolinks)

@ara4n
Copy link
Member Author

ara4n commented Jan 4, 2018

untrue: #_freenode_matrix:matrix.org (no escapes) works fine.

@maxgrenderjones
Copy link

I stand corrected

@ara4n
Copy link
Member Author

ara4n commented Jan 4, 2018

(i think it used to get this wrong a few years ago before we switched to commonmark, but happily GFM has been updated since then - presumably because of matrix room aliases ;P)

@dbkr
Copy link
Member

dbkr commented Jan 4, 2018

Ftr, matrix-org/matrix-react-sdk#575 summaries the reasons we switched to commonmark - now it has a spec, one of those is invalid. I share @richvdh 's concern about lack of parsers (presumably the lack of an android impl is still a concern?)

I would say that all of the things we've done to commonmark are all presentation-level things, ie. you can write the same markdown and it will still mean the same thing, so I'd say we are still using standard commonmark over the wire (apart from the whitelisted tags). Also bear in mind that I wrote a bunch of extra stuff on top of gfm, eg. to avoid it wrapping everything in an extra layer of html tags and to disable linkification (see stuff removed in matrix-org/matrix-react-sdk@4d29264) so moving to GFM is probably still not without its foibles.

I think ideally I would lean towards fixing commonmark to follow the way GFM does formatting suppression, which seems like a well defined and achievable goal, with the hope of maybe nudging the commonmark spec in that direction? The main thing drawing me to this is that basically the dialect over the wire stays the same.

@t3chguy
Copy link
Member

t3chguy commented Jan 4, 2018

presumably the lack of an android impl is still a concern?

riot-android is still using a JS implementation of GFM right now :L

@ara4n
Copy link
Member Author

ara4n commented Jan 4, 2018

I would say that all of the things we've done to commonmark are all presentation-level things

Isn't this whole issue about an entirely non-presentation-level thing? That if you say moo_foo_ then with commonmark you get moofoo, but with our mutant commonmark, if you say #moo_foo:matrix.org it's magically not mangled? (Unless it's inside a code block of some kind, where it then sprouts backslashes everywhere)?

@dbkr
Copy link
Member

dbkr commented Jan 4, 2018

Well, I was classing that a a presentation-level thing on the assumption that the semantics are normally well defined for those things and we just have to display them correctly, in the sense that it's probably a safe assumption nobody's saying #moo_foo_:matrix.org and intending for the foo to be underlined. As opposed to something like strikethrough being <del> rather than ~~.

@richvdh
Copy link
Member

richvdh commented Jan 5, 2018

It does look as though GFM has moved on somewhat, and the arguments against it are now weaker than they were before - there are now new JS parser impls, and it's not hard to imagine a java one appearing in less time than it has taken us to fail to move android over to CM.

However, as Dave says, GFM is not without its foibles, and needed some customisation too. Also, I am not keen on vacillating between formats every year. If we're seriously considering switching back, I'd find it helpful if somebody could gather (somewhere that isn't github) a document setting out the problems with each format, and the customisations we have made so that we can better understand the background.

@ara4n
Copy link
Member Author

ara4n commented Jan 7, 2018

Okay. I've had more of a dig to see if we can fix the actual problem here without shifting to GFM. Findings are:

I think that the breakage caused by introducing backslashes into URLs in code blocks is WAY worse than the fix caused by fairly obscure URLs like http://google.com/_thing_ breaking, so I'm going to roll back matrix-org/matrix-react-sdk#1270 (like I should have months ago).

However, I still feel that it's very annoying that commonmark gets this wrong, and there's no easy way to do a partial-parse of the MD to fix it. Patching commonmark to do the right thing looks scary in the extreme, and definitely causes a major fork. So then we're back with me saying "gah, why can't we just use GFM, which gets this right, and has no known disadvantages?".

Eitherway, we can close this up by reverting matrix-org/matrix-react-sdk#1270, albeit reopening #4674

@ara4n
Copy link
Member Author

ara4n commented Jan 7, 2018

going to close this as the bug in question is now fixed; suggest future discussion goes on #4674

@ara4n
Copy link
Member Author

ara4n commented Jan 7, 2018

(the seq example given in #4809 (comment) is actually MD working as intended - https://meta.stackexchange.com/questions/82718/how-do-i-escape-a-backtick-in-markdown)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P1 S-Minor Impairs non-critical functionality or suitable workarounds exist T-Defect X-Regression
Projects
None yet
Development

No branches or pull requests

9 participants