Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix autolink literals in link labels #6

Merged
merged 2 commits into from
Jan 29, 2021
Merged

Conversation

jimmybergman
Copy link
Contributor

@jimmybergman jimmybergman commented Jan 22, 2021

For markdown such as:

[Visit www.example.com](http://www.example.com) please.

The following HTML is generated:

<p>[Visit <a href="http://www.example.com%5D(http://www.example.com">www.example.com](http://www.example.com</a>) please.</p>\n

With the fix instead it behaves as github.com

@wooorm
Copy link
Member

wooorm commented Jan 22, 2021

Awesome! Thanks for working on this!

I do believe the issue is a bit more complex (PS props for figuring out how this works, because it isn‘t really documented yet): labelLink is used for a [, but it’s not required that there’s another ][label] or ](reference) following it. And now you’re checking whether there’s a [ somewhere in the (roughly) paragraph, whereas this autolink could be in something after a link (e.g., [link]() <http://autolink>) would fail with this code I think.

@codecov-io
Copy link

codecov-io commented Jan 25, 2021

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@jimmybergman
Copy link
Contributor Author

@wooorm I added a test for

[link]() <http://autolink>

and it does seem to work as expected.

I kind of expected that https://github.com/micromark/micromark/blob/5a90585c1ba8414f05888329265440c1aa112311/lib/tokenize/label-end.mjs#L42 intended to splice away the event for the [ when the ] arrived, but honestly I did not completely understand the logic and I didn't verify my assumption.

I see what you mean though, a potential problem would be:

[ is a nice char, it would be nice to have a www.example.com/autolink.

So lets try how github does:
[ is a nice char, it would be nice to have a www.example.com/autolink.

compared to:
it would be nice to have a www.example.com/autolink.

@wooorm
Copy link
Member

wooorm commented Jan 25, 2021

So lets try how github does:
[ is a nice char, it would be nice to have a www.example.com/autolink

This is blowing my mind! I knew github was weird, but this is weird!!!

@jimmybergman
Copy link
Contributor Author

This is blowing my mind! I knew github was weird, but this is weird!!!

I would guess it is for the same reason as for this PR behaving like that. :-)

I guess if you want to support the "is a nice char" thingy anyway, then I could have a stab at it. What do you say?

@wooorm wooorm changed the title Don't autolink in link labels Fix autolink literals in link labels Jan 29, 2021
@wooorm wooorm merged commit f5b9aaa into micromark:main Jan 29, 2021
@wooorm
Copy link
Member

wooorm commented Jan 29, 2021

It gets weirder:

Link start.

[https://a.com

[http://a.com

[www.a.com

[a@b.c

Link start and label end.

[https://a.com]

[http://a.com]

[www.a.com]

[a@b.c]

Link label with reference.

[https://a.com][x]

[http://a.com][x]

[www.a.com][x]

[a@b.c][x]

[x]: #

Link label with resource.

[https://a.com]()

[http://a.com]()

[www.a.com]()

[a@b.c]()

Autolink literal after link.

[a]() https://a.com

[a]() http://a.com

[a]() www.a.com

[a]() a@b.c

->

Link start.

[https://a.com

[http://a.com

[www.a.com

[a@b.c

Link start and label end.

[https://a.com]

[http://a.com]

[www.a.com]

[a@b.c]

Link label with reference.

https://a.com

http://a.com

www.a.com

a@b.c

Link label with resource.

https://a.com

http://a.com

www.a.com

a@b.c

Autolink literal after link.

a https://a.com

a http://a.com

a www.a.com

a a@b.c


but, as a gist (probably as a readme), it’s turned into: https://gist.github.com/wooorm/8199cc9e70e37e01b8470293c119fb28

@wooorm
Copy link
Member

wooorm commented Jan 29, 2021

Alright, so I’m thinking about this, and it’s going to get a bit complex:

First, it’s not just plain links, it’s also images, and perhaps even <a href="..."> (this used to be supported in marked, not sure if GH (still) does it)

Second, we don’t know if something is a []() link until at the ]() part (which could also be ][label], ][], or ], depending on whether the text references something).

Perhaps, it’s just impossible to know in this project, so maybe it could be added in micromark?
Or, perhaps better, in https://github.com/syntax-tree/mdast-util-gfm-autolink-literal?

@wooorm
Copy link
Member

wooorm commented Jan 31, 2021

Have been digging in some more. I’m now under the assumption that cmark-gfm is used by comments (and probably /issues/prs/releases), where the naïve behavior of this PR is used, through these lines:

https://github.com/github/cmark-gfm/blob/85d895289c5ab67f988ca659493a64abb5fec7b4/src/inlines.c

Which are used here:

https://github.com/github/cmark-gfm/blob/85d895289c5ab67f988ca659493a64abb5fec7b4/extensions/autolink.c#L255-L257


But, the actual behavior on Gists (and probably readmes and such) uses a different compiler, is more complex as stated before.

It seems that the behavior of autolink literal is done after parsing, similar probably to mentions (similar to mentions: https://github.com/remarkjs/remark-github), because for example in www.example.com/b](#), that’s a whole link, but for [www.example.com/b](#), it’s a link to # with the example url as plain text.

That does make it harder to handle the “final character reference” case described in the spec:

www.example.com/b&amp;](#)

www.example.com/b](#)&amp;

[www.example.com/b&amp;](#)

[www.example.com/b](#)&amp;

www.example.com/b&amp;](#)

www.example.com/b](#)&

www.example.com/b&

www.example.com/b&

^-- which Is behavior which I don’t get 🤷‍♂️

@wooorm
Copy link
Member

wooorm commented Feb 1, 2021

Some more behavior that I can’t wrap my head around:

https://gist.github.com/wooorm/fcacba9afeffcdcd0b1c6bf0e74e598f

So, the opening bracket does have some affect. Even though it won’t end up forming an actual link.

@wooorm
Copy link
Member

wooorm commented Feb 1, 2021

I think I finally found something: GH uses two algorithms. One at parsing, and one that matches on the AST.

Take this example: https://gist.github.com/wooorm/580c02085cd9c96d3c3c2a31f1cd1e5c

Note that when the bracket count is balanced (all the open ones are closed), the character reference does not work (is not decoded): the tokenizer just treats it as more raw characters. My assumption is this tokeniser doesn‘t do anything if there are open brackets. And instead that there’s some more final processing, because the character reference works (is decoded) in the more-open-brackets-case.
Extra interesting is the H1 and H5 examples because there the http: vs. www. style autolink literals behave differently. Probably a bug in their code where their tokenizer does not work for www., and thus leaves it to something that operates on the syntax tree

wooorm added a commit that referenced this pull request Feb 3, 2021
Related to GH-5.
Related to GH-6.
wooorm added a commit to syntax-tree/mdast-util-gfm-autolink-literal that referenced this pull request Feb 3, 2021
wooorm added a commit to syntax-tree/mdast-util-gfm-autolink-literal that referenced this pull request Aug 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants