Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unescaped left angle brackets in link destination #193

Closed
ScottAbbey opened this issue Apr 30, 2017 · 13 comments
Closed

Unescaped left angle brackets in link destination #193

ScottAbbey opened this issue Apr 30, 2017 · 13 comments

Comments

@ScottAbbey
Copy link

The CommonMark spec has this to say about link destinations:

A link destination consists of either

a sequence of zero or more characters between an opening < and a closing > that contains no >spaces, line breaks, or unescaped < or > characters, or

Given the following input:

[a]

[a]: <te<st>

this implementation outputs:

<p><a href="te%3Cst">a</a></p>

However, the Javascript reference implementation outputs:

<p><a href="%3Cte%3Cst%3E">a</a></p>

Note the %3C characters surrounding te%3Cst are missing in the output from this implementation. They should be present, as this should be treated as the non-<...> type of link definition.

This also seems to affect inline link destinations.

@jgm
Copy link
Member

jgm commented Apr 30, 2017 via email

@jgm
Copy link
Member

jgm commented Jul 13, 2017

The fix was wrong.

Spec says there are two types of link destination:

  1. a sequence of zero or more characters between an opening < and a
    closing > that contains no spaces, line breaks, or unescaped
    < or > characters, or

  2. a nonempty sequence of characters that does not include
    ASCII space or control characters, and includes parentheses
    only if (a) they are backslash-escaped or (b) they are part of
    a balanced pair of unescaped parentheses.

The point of this bug report was that <te<st> should be treated as a type 2 link destination, not as a type 1 destination. But the "fix" makes it a type 1 link destination. Expected behavior is to get a link with URL "%3Cte%3Cst%3E".

jgm added a commit that referenced this issue Jul 13, 2017
@jgm
Copy link
Member

jgm commented Jul 13, 2017

I reverted the completely mistaken fix. Now we have a failing test.

What we need is a fix for manual_scan_link_url. In writing this, I previously assumed that we'd be able to tell at the outset whether we were scanning for a type 1 or type 2 URL. But this isn't right, if <te<st> should be a type 2 URL. Scanning gets a bit tricky if we can't tell till the end whether we have type 1 or type 2, because type 1 doesn't require parentheses to be balanced.

@jgm
Copy link
Member

jgm commented Jul 13, 2017

Now I think I understand why I originally disallowed unescaped nested parens! This made scanning much easier.

@jgm
Copy link
Member

jgm commented Jul 13, 2017

We could always have two scanners: first try with a type 1 (if < is the first character), and if that fails, try with type 2. This will mean backtracking, but only on unusual inputs, and only once, so it shouldn't cause performance issues.

@kivikakk
Copy link
Contributor

That seems quite acceptable! Here's a pretty gross diff that does just that:

diff --git a/src/inlines.c b/src/inlines.c
index e30c2af..0500177 100644
--- a/src/inlines.c
+++ b/src/inlines.c
@@ -859,10 +859,14 @@ static bufsize_t manual_scan_link_url(cmark_chunk *input, bufsize_t offset) {
         i += 2;
       else if (cmark_isspace(input->data[i]))
         return -1;
-      else
+      else if (input->data[i] == '<') {
+        i = offset;
+        goto type2;
+      } else
         ++i;
     }
   } else {
+type2:
     while (i < input->len) {
       if (input->data[i] == '\\' &&
          i + 1 < input-> len &&

Is this the kind of thing you mean?

@jgm
Copy link
Member

jgm commented Jul 14, 2017 via email

@kivikakk
Copy link
Contributor

#219 is a PR for a better fix.

@jgm jgm closed this as completed in #219 Jul 25, 2017
@jgm jgm reopened this Jul 25, 2017
@jgm
Copy link
Member

jgm commented Jul 25, 2017

Reopening because we don't correctly handle this case:

[hi](<aaa)

which should be a link with URL "%3caaa", but is not parsed as a link at all.
There should be a regression test, or better a spec test, for this.

@mity
Copy link
Contributor

mity commented Aug 6, 2017

Note the PR #219 also does not fix the following case:

[a](<x>X)

The problem is that if the parsing of the whole link (or ref. def.) with link destination type 1 fails, parsing of whole link (or ref. def.) with the type 2 should be retried. Doing the retry only on the destination itself is not enough.

BTW, MD4C also fails on this.
EDIT: MD4C now has this fixed but my proposal below to simplify it still holds.

Although it is fixable, IMHO the question is whether it is worth it. Take this as my vote to ban the retrying completely in the specs. To be more precise, I vote for something as follows (bold text is new):

A link destination consists of either

  • a sequence of zero or more characters between an opening < and a closing > that contains no spaces, line breaks, or unescaped < or > characters, or

  • a nonempty sequence of characters that does not include ASCII space or control characters, does not begin with unescaped <, and includes parentheses only if (a) they are backslash-escaped or (b) they are part of a balanced pair of unescaped parentheses. (Implementations may impose limits on parentheses nesting to avoid performance issues, but at least three levels of nesting should be supported.)

This would allow the parser to decide how to parse by checking the first character of the maybe-link-destination string.

Rationale:

  1. Link URLs starting with < are rare so usage of it for type 2 is likely minimal.
  2. With such URL, users may need to escape < anyway even now if the URL also ends with >.
  3. Simplification of the implementation to the level which would make similar regressions unlikely.

Note the ambiguity implied in the point 2 is unfixable without change in specification because, with its current wording, <xxxx> simply fulfills both types of link destinations.

@kivikakk
Copy link
Contributor

+1 to banning retrying. I can't think of many useful URIs that begin with <.

@mity
Copy link
Contributor

mity commented Mar 26, 2019

This issue should be likely closed, as a result of commonmark/commonmark-spec@c1e0183. That actually bans the retrying as asked for above.

@jgm
Copy link
Member

jgm commented Mar 26, 2019

Yes, we can close this now.

@jgm jgm closed this as completed Mar 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants