Unescaped left angle brackets in link destination #193

ScottAbbey · 2017-04-30T04:45:32Z

The CommonMark spec has this to say about link destinations:

A link destination consists of either

a sequence of zero or more characters between an opening < and a closing > that contains no >spaces, line breaks, or unescaped < or > characters, or

Given the following input:

[a]

[a]: <te<st>

this implementation outputs:

<p><a href="te%3Cst">a</a></p>

However, the Javascript reference implementation outputs:

<p><a href="%3Cte%3Cst%3E">a</a></p>

Note the %3C characters surrounding te%3Cst are missing in the output from this implementation. They should be present, as this should be treated as the non-<...> type of link definition.

This also seems to affect inline link destinations.

The text was updated successfully, but these errors were encountered:

jgm · 2017-04-30T08:13:51Z

Thanks, bug confirmed.

jgm · 2017-07-13T06:43:04Z

The fix was wrong.

Spec says there are two types of link destination:

a sequence of zero or more characters between an opening < and a
closing > that contains no spaces, line breaks, or unescaped
< or > characters, or
a nonempty sequence of characters that does not include
ASCII space or control characters, and includes parentheses
only if (a) they are backslash-escaped or (b) they are part of
a balanced pair of unescaped parentheses.

The point of this bug report was that <te<st> should be treated as a type 2 link destination, not as a type 1 destination. But the "fix" makes it a type 1 link destination. Expected behavior is to get a link with URL "%3Cte%3Cst%3E".

Commit 14ea489

jgm · 2017-07-13T07:05:14Z

I reverted the completely mistaken fix. Now we have a failing test.

What we need is a fix for manual_scan_link_url. In writing this, I previously assumed that we'd be able to tell at the outset whether we were scanning for a type 1 or type 2 URL. But this isn't right, if <te<st> should be a type 2 URL. Scanning gets a bit tricky if we can't tell till the end whether we have type 1 or type 2, because type 1 doesn't require parentheses to be balanced.

jgm · 2017-07-13T07:06:13Z

Now I think I understand why I originally disallowed unescaped nested parens! This made scanning much easier.

jgm · 2017-07-13T07:07:28Z

We could always have two scanners: first try with a type 1 (if < is the first character), and if that fails, try with type 2. This will mean backtracking, but only on unusual inputs, and only once, so it shouldn't cause performance issues.

kivikakk · 2017-07-14T04:53:58Z

That seems quite acceptable! Here's a pretty gross diff that does just that:

diff --git a/src/inlines.c b/src/inlines.c
index e30c2af..0500177 100644
--- a/src/inlines.c
+++ b/src/inlines.c
@@ -859,10 +859,14 @@ static bufsize_t manual_scan_link_url(cmark_chunk *input, bufsize_t offset) {
         i += 2;
       else if (cmark_isspace(input->data[i]))
         return -1;
-      else
+      else if (input->data[i] == '<') {
+        i = offset;
+        goto type2;
+      } else
         ++i;
     }
   } else {
+type2:
     while (i < input->len) {
       if (input->data[i] == '\\' &&
          i + 1 < input-> len &&

Is this the kind of thing you mean?

jgm · 2017-07-14T14:19:37Z

+++ Yuki Izumi [Jul 14 17 04:53 ]:

Is this the kind of thing you mean?

Yes, but I wouldn't want to use goto in this way. I think it's cleaner to have two scanners, one for <..> form, the other for others. The main scanner could apply the first scanner, and if it didn't match anything, then apply the second scanner.

kivikakk · 2017-07-17T05:42:41Z

#219 is a PR for a better fix.

jgm · 2017-07-25T20:14:32Z

Reopening because we don't correctly handle this case:

[hi](<aaa)

which should be a link with URL "%3caaa", but is not parsed as a link at all.
There should be a regression test, or better a spec test, for this.

mity · 2017-08-06T07:09:23Z

Note the PR #219 also does not fix the following case:

[a](<x>X)

The problem is that if the parsing of the whole link (or ref. def.) with link destination type 1 fails, parsing of whole link (or ref. def.) with the type 2 should be retried. Doing the retry only on the destination itself is not enough.

~~BTW, MD4C also fails on this.~~
EDIT: MD4C now has this fixed but my proposal below to simplify it still holds.

Although it is fixable, IMHO the question is whether it is worth it. Take this as my vote to ban the retrying completely in the specs. To be more precise, I vote for something as follows (bold text is new):

A link destination consists of either

a sequence of zero or more characters between an opening < and a closing > that contains no spaces, line breaks, or unescaped < or > characters, or

a nonempty sequence of characters that does not include ASCII space or control characters, does not begin with unescaped <, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses. (Implementations may impose limits on parentheses nesting to avoid performance issues, but at least three levels of nesting should be supported.)

This would allow the parser to decide how to parse by checking the first character of the maybe-link-destination string.

Rationale:

Link URLs starting with < are rare so usage of it for type 2 is likely minimal.
With such URL, users may need to escape < anyway even now if the URL also ends with >.
Simplification of the implementation to the level which would make similar regressions unlikely.

Note the ambiguity implied in the point 2 is unfixable without change in specification because, with its current wording, <xxxx> simply fulfills both types of link destinations.

kivikakk · 2018-01-25T02:19:42Z

+1 to banning retrying. I can't think of many useful URIs that begin with <.

mity · 2019-03-26T13:27:57Z

This issue should be likely closed, as a result of commonmark/commonmark-spec@c1e0183. That actually bans the retrying as asked for above.

jgm · 2019-03-26T18:46:40Z

Yes, we can close this now.

jgm closed this as completed in 14ea489 Jun 2, 2017

jgm mentioned this issue Jul 13, 2017

Pathological input: Unclosed inline links (another) #218

Closed

jgm reopened this Jul 13, 2017

jgm added a commit that referenced this issue Jul 13, 2017

Reverted mistaken fix to #193.

1ea9cd8

Commit 14ea489

kivikakk mentioned this issue Jul 17, 2017

Fix URL scanner #219

Merged

jgm closed this as completed in #219 Jul 25, 2017

jgm reopened this Jul 25, 2017

mity mentioned this issue Aug 6, 2017

Angle brackets in link destinations mity/md4c#24

Closed

mity mentioned this issue Aug 10, 2017

[foo](<bar>"baz") interpretation #229

Closed

mity mentioned this issue Mar 26, 2019

[WIP] Updates for spec 0.29 mity/md4c#67

Merged

jgm closed this as completed Mar 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unescaped left angle brackets in link destination #193

Unescaped left angle brackets in link destination #193

ScottAbbey commented Apr 30, 2017

jgm commented Apr 30, 2017 via email

jgm commented Jul 13, 2017

jgm commented Jul 13, 2017

jgm commented Jul 13, 2017

jgm commented Jul 13, 2017

kivikakk commented Jul 14, 2017

jgm commented Jul 14, 2017 via email

kivikakk commented Jul 17, 2017

jgm commented Jul 25, 2017 •

edited

Loading

mity commented Aug 6, 2017 •

edited

Loading

kivikakk commented Jan 25, 2018

mity commented Mar 26, 2019

jgm commented Mar 26, 2019

Unescaped left angle brackets in link destination #193

Unescaped left angle brackets in link destination #193

Comments

ScottAbbey commented Apr 30, 2017

jgm commented Apr 30, 2017 via email

jgm commented Jul 13, 2017

jgm commented Jul 13, 2017

jgm commented Jul 13, 2017

jgm commented Jul 13, 2017

kivikakk commented Jul 14, 2017

jgm commented Jul 14, 2017 via email

kivikakk commented Jul 17, 2017

jgm commented Jul 25, 2017 • edited Loading

mity commented Aug 6, 2017 • edited Loading

kivikakk commented Jan 25, 2018

mity commented Mar 26, 2019

jgm commented Mar 26, 2019

jgm commented Jul 25, 2017 •

edited

Loading

mity commented Aug 6, 2017 •

edited

Loading