Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

turndown not escaping links properly #459

Closed
shaipetel opened this issue Apr 2, 2024 · 5 comments · May be fixed by EugeniyKiyashko/DevExtreme#6 or alexslavr/DevExtreme#4
Closed

turndown not escaping links properly #459

shaipetel opened this issue Apr 2, 2024 · 5 comments · May be fixed by EugeniyKiyashko/DevExtreme#6 or alexslavr/DevExtreme#4

Comments

@shaipetel
Copy link

I noticed this scenario and think it would be useful for others if turndown would handle this better.

Background:
( and ) are valid characters in a URL, and won't get escaped in normal HTML.
In markdown, links are surrounded by ( and ), if your link needs to have ( and ) you will need to escape them with \

For this input html:

<a href="https://google.com/file(1).jpg">link</a>

A valid markdown should be:

[link](https://google.com\(1\).jpg)

However turndown returns:

[link](https://google.com(1).jpg)

I'm now pre-fixing image and links in my html prior to calling the turndown service, but I really think this should be handled in the parser.
When you do, be sure not to double-escape URLs if there were already escaped (meaning, if the URL has "(1).jpg" there is no need to double-escape it.

@pavelhoral
Copy link
Collaborator

pavelhoral commented Apr 3, 2024

Good catch. Btw. according to the spec it seems that parentheses does not have to be escaped as long as they are balanced. So in your example you should be fine without any preprocessing.

I don't understand your last comment about "double-escaping URLs". Can you elaborate a bit more?

@shaipetel
Copy link
Author

Yeah, sure.
If someone (like me lol) loads an html:

<a href="https://google.com/file(1).jpg">link</a>

But, protected it before calling turndown, due to this issue:

<a href="https://google.com/file\(1\).jpg">link</a>

See that the parentheses were already escaped and leave them be.

FYI - I tried 2 markdown editors, they both automatically escape links that have (1).
You are right, they can load a markdown file that have (1) in a link - but they escape it as soon as its loaded, and if I save the file they will save it escaped.

@pavelhoral
Copy link
Collaborator

<a href="https://google.com/file\(1\).jpg">link</a>

I see. But that would be invalid URL in the HTML input. I don't think this library should take something like that into account.

I think we should add "aggressive" parenthesis escaping (i.e. don't bother checking if the parenthesis are balanced, simply escape them every time). Such behavior is in line with https://github.com/mixmark-io/turndown?tab=readme-ov-file#escaping-markdown-characters.

Such change might be quite simple - adding test case to [test/index.html](https://github.com/mixmark-io/turndown/blob/master/test/index.html and adding replace to commonmark-rules.js#.

pavelhoral added a commit to orchitech/turndown that referenced this issue Apr 3, 2024
pavelhoral added a commit to orchitech/turndown that referenced this issue Apr 5, 2024
pavelhoral added a commit to orchitech/turndown that referenced this issue Apr 8, 2024
martincizek added a commit that referenced this issue Apr 8, 2024
Escape parenthesis in link URLs. Fix #459.
@shaipetel
Copy link
Author

<a href="https://google.com/file\(1\).jpg">link</a>

I see. But that would be invalid URL in the HTML input. I don't think this library should take something like that into account.

I think we should add "aggressive" parenthesis escaping (i.e. don't bother checking if the parenthesis are balanced, simply escape them every time). Such behavior is in line with https://github.com/mixmark-io/turndown?tab=readme-ov-file#escaping-markdown-characters.

Such change might be quite simple - adding test case to [test/index.html](https://github.com/mixmark-io/turndown/blob/master/test/index.html and adding replace to commonmark-rules.js#.

Yeah, totally agree!
thanks.

@shaipetel
Copy link
Author

shaipetel commented Jul 16, 2024

I finally got around to verify this issue, however it was fixed for links but not for images.

this html:

<img src="https://google.com/file 1).jpg" />

produces this markdown:

![](https://google.com/file 1).jpg)

In this example I used an unbalanced bracket, as per earlier comments. I can confirm the fix for links handles both balanced and unbalanced brackets, which would be my preferred fix.

using 7.2.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants