Differences in URL Encoding for links, text and Ids #380

Closed
RickStrahl opened this issue Oct 29, 2019 · 6 comments

@RickStrahl
Contributor

RickStrahl commented Oct 29, 2019

I'm running into issues trying to consolidate links that require encoding and jumping between them in a page. The problem is that the encoding for links ([]()), for generated text and, more importantly, for element IDs is not handled in the same way.

No good way to show the ID handling in Babelmark, but the problem shows itself in actual text rendering. Notice the difference in the text/URL encoding for the hash above vs. the code:

https://babelmark.github.io/?text=*+%5BNamenskonventionen+f%C3%BCr+Forms%5D(%23namenskonventionen-f%C3%BCr-f%22%2C%26orms)%0A%0A%23%23%23+Namenskonventionen+f%C3%BCr+F%22%2C%26orms

If you render with Auto-Ids the IDs use the same encoding as the text on the bottom which doesn't match the link encoding used above.

As a use case: if I generate a link in a page and want to link it automatically to a header below, there's no single way to encode that link. The specific scenario is a TOC generator where I pick out all the topic headers and then generate a TOC of links that point to those same headers. But because the encoding is different, the links don't work.

* [Namenskonventionen für Forms](#namenskonventionen-für-forms)

### Namenskonventionen für Forms

turns into:

<li><a href="#namenskonventionen-f%C3%BCr-forms">Namenskonventionen für Forms</a></li>

<h3 id="namenskonventionen-für-forms">Namenskonventionen für Forms</h3>

The differences in encoding cause the link to not navigate.
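
To make the mismatch concrete, here is a minimal sketch in plain JavaScript. It assumes that `encodeURI` is a reasonable stand-in for the escaping applied to the href; it is not Markdig's code, and `escapeHref` and `pointsAt` are hypothetical helper names.

```javascript
// Stand-in for the renderer's href escaping (an assumption, not Markdig's code):
// encodeURI percent-encodes non-ASCII characters as UTF-8 bytes
// but leaves "#" and "-" untouched.
function escapeHref(href) {
  return encodeURI(href);
}

// Compare a link target and an element id after undoing percent-encoding.
function pointsAt(href, id) {
  return decodeURIComponent(href.replace(/^#/, "")) === id;
}

const id = "namenskonventionen-für-forms";
const href = escapeHref("#" + id);

console.log(href);               // #namenskonventionen-f%C3%BCr-forms
console.log(href === "#" + id);  // false: the raw strings differ
console.log(pointsAt(href, id)); // true: they match once decoded
```

The two strings name the same target only after decoding, so any code that compares them literally will not find the heading.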

There are a number of differences in how things are encoded, but in the above the umlaut probably shouldn't be encoded.

So the question is - should there be a consistent way to encode links that matches what the id generators are using?

Edge case for sure, but this has bitten me for a number of things related to creating reliable intra-document cross links. As it is I have to take over link navigation manually in my document solutions, but I'm not sure how to deal with the above.
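
For reference, the TOC scenario described above can be sketched roughly as follows. This is an illustration only: `slugify` is a hypothetical GitHub-like rule (lowercase, drop punctuation, spaces to hyphens, non-ASCII letters kept) and may not match the IDs Markdig actually generates. The heading scan is also naive and does not skip fenced code blocks.

```javascript
// Hypothetical GitHub-like slug rule (assumption): keeps non-ASCII letters.
function slugify(text) {
  return text
    .trim()
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s-]/gu, "") // drop punctuation and symbols
    .replace(/\s+/g, "-");             // whitespace runs become hyphens
}

// Collect ATX headings and emit a nested markdown list of links to them.
function buildToc(markdown) {
  const headings = [];
  for (const line of markdown.split(/\r?\n/)) {
    const m = /^(#{1,6})\s+(.+?)\s*$/.exec(line);
    if (m) headings.push({ level: m[1].length, text: m[2] });
  }
  if (headings.length === 0) return "";
  // Indent relative to the shallowest heading so the list starts at column 0.
  const top = Math.min(...headings.map(h => h.level));
  return headings
    .map(h => `${"  ".repeat(h.level - top)}* [${h.text}](#${slugify(h.text)})`)
    .join("\n");
}

console.log(buildToc("### Namenskonventionen für Forms"));
// * [Namenskonventionen für Forms](#namenskonventionen-für-forms)
```

The generated link destination contains the raw umlaut, which is exactly the form that is then percent-encoded in the rendered href.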

@MihaZupan
Collaborator

MihaZupan commented Oct 30, 2019

I believe the HTML you are seeing is correct and it works in regular browsers.

If you look at the html source GitHub is using you can see that it also encodes the href while leaving the characters in the id as-is.

> There are a number of differences in how things are encoded, but in the above the umlaut probably shouldn't be encoded.

The escaping in the href is done according to the CommonMark spec and I believe these characters should be escaped.

> So the question is - should there be a consistent way to encode links that matches what the id generators are using?

I would recommend using the AutoLink functionality of AutoIdentifiersExtension instead of trying to guess what the generated id will be like.

When using the default UseAutoIdentifiers you avoid this problem as it uses AllowOnlyAscii, normalizing the id and thus avoiding character escaping in the href.

Since you are using the GitHub way of generating the id, non-ascii characters are preserved in the id and then escaped in the href.
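
As a rough illustration of why ASCII-only IDs sidestep the escaping, here is a sketch of that kind of normalization in JavaScript. It is an approximation under stated assumptions, not Markdig's algorithm: `asciiSlug` is a hypothetical name, and the real AllowOnlyAscii rules may treat punctuation and other characters differently.

```javascript
// Approximate ASCII-only slug (assumption, not Markdig's implementation):
// decompose accented letters, drop the combining marks, then keep a-z, 0-9
// and hyphens only.
function asciiSlug(heading) {
  return heading
    .normalize("NFD")                 // "ü" becomes "u" + combining diaeresis
    .replace(/[\u0300-\u036f]/g, "")  // remove combining diacritical marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")      // everything else collapses to "-"
    .replace(/^-+|-+$/g, "");         // trim leading/trailing hyphens
}

const slug = asciiSlug("Namenskonventionen für Forms");
console.log(slug);                     // namenskonventionen-fur-forms
console.log(encodeURI(slug) === slug); // true: nothing left to escape
```

Because the slug contains only ASCII characters, the href and the id are byte-for-byte identical.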

> Edge case for sure

Does the preview renderer you are using correctly establish the link if the html looks like the following - with the value of the heading id also url encoded, thus matching the href?

<li><a href="#namenskonventionen-f%C3%BCr-forms">Namenskonventionen für Forms</a></li>

<h3 id="namenskonventionen-f%C3%BCr-forms">Namenskonventionen für Forms</h3>

If that does work in your use-case, an extra setting controlling whether heading IDs are URL-encoded could be exposed, off by default (I tested it locally and the change needed is rather trivial).

@RickStrahl
Contributor Author

I am already using the AutoLinks pipeline extension and that's how the ID gets generated, but as mentioned, the IDs are not getting URL encoded the same way as the links. Note that there is some URL encoding happening.

Checked out your example, and sure enough, even if the URL encoding matches it doesn't work, so encoding isn't a solution either.

doesn't work:

<li><a href="#namenskonventionen-f%C3%BCr-forms">Namenskonventionen für Forms</a></li>

<h3 id="namenskonventionen-f%C3%BCr-forms">Namenskonventionen für Forms</h3>

this works:

<li id="pragma-line-0"><a href="#namenskonventionen-fuer-forms">Namenskonventionen für Forms</a></li>

<h3 id="namenskonventionen-fuer-forms">Namenskonventionen für Forms</h3>

this also works:

<li id="pragma-line-0"><a href="#namenskonventionen-für-forms">Namenskonventionen für Forms</a></li>

<h3 id="namenskonventionen-für-forms">Namenskonventionen für Forms</h3>

So it looks like if the link umlaut is URL Encoded the navigation just doesn't work.

Note: although I'm using a tool for previewing (Markdown Monster) which uses the IE WebBrowser control in WPF, the same behavior happens in Chrome, both with local file URLs and when running against local Web URLs.

@RickStrahl
Contributor Author

RickStrahl commented Oct 30, 2019

Sigh...

More info: it looks like the <base href="" /> tag in the document interferes with all of this. If I don't have a base tag in the document at all, then a URL-encoded link to a raw-text ID works, and so does a URL-encoded link to a URL-encoded ID.

In the application previewer the base tag is required in order to properly find all the related resources relative to the document. However, with the base tag the navigation fails as soon as the hash is URL encoded. No encoded characters - it works fine.

I already intercept navigation of the tag and manually try to locate elements, so I guess it's possible to do a bit more work to normalize the IDs and URLs by explicitly url-decoding them, but that will then fail if somebody just dumps out the preview locally. Exports try to avoid the base tag, so that's all good and on a typical Web page there likely won't be a base tag.

While I still think that it would be better to not URL encode upper Unicode characters (just for the sheer overhead of it), I think that Markdig is actually doing the right thing, and I'm dealing with a HTML DOM quirk related to the <base> tag.

After some more thought I think we can probably close this but I'll leave it open a little longer in case somebody has any other ideas on a good way to deal with this.

At the end of the day this may bite others as well: any time a page has a base tag plus URL-encoded hash content in a link, this problem will show up. Based on the observations above, I don't think there's a good workaround short of using {#explicit-id} with extra attributes.
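
For anyone who hits the same thing, the sketch below shows how a fragment-only href resolves once a base URL is in play, using the standard `URL` class. The URLs are made up for illustration. It demonstrates URL resolution only and does not by itself explain why the encoded hash fails in the previewer.

```javascript
// Resolve an href the way a URL parser does when a <base href> is present.
function resolveAgainstBase(href, baseHref) {
  return new URL(href, baseHref).href;
}

// Hypothetical URLs, for illustration only.
const documentUrl = "https://example.com/docs/page.html";
const baseHref = "https://example.com/assets/";

const resolved = resolveAgainstBase("#namenskonventionen-f%C3%BCr-forms", baseHref);

// The link targets the current document only if the two URLs are equal
// once their fragments are removed.
const sameDocument = resolved.split("#")[0] === documentUrl.split("#")[0];

console.log(resolved);     // https://example.com/assets/#namenskonventionen-f%C3%BCr-forms
console.log(sameDocument); // false: the hash link now points at the base URL
```

With a base tag that differs from the document URL, a bare `#hash` link no longer refers to the page it sits in, which is why intercepting the navigation and locating the element manually is a reasonable approach.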

@MihaZupan
Collaborator

MihaZupan commented Oct 30, 2019

Since your preview differs from the actual export, there is a way (a bit of a hack).

1. Don't manually add a link destination when referring to a header.

- [Namenskonventionen für Forms](#namenskonventionen-für-forms)
+ [Namenskonventionen für Forms]

# Namenskonventionen für Forms

2. In the preview pipeline, use

.UseAutoIdentifiers()
// which is the same as
.UseAutoIdentifiers(AutoIdentifierOptions.AllowOnlyAscii | AutoIdentifierOptions.AutoLink)

and in the release/export pipeline, use

.UseAutoIdentifiers(AutoIdentifierOptions.GitHub | AutoIdentifierOptions.AutoLink)

The html will obviously differ in such a case between the pipelines, but characters like umlauts will be normalized during preview. Preview HTML looks like

<p><a href="#namenskonventionen-fur-forms">Namenskonventionen für Forms</a></p>
<h1 id="namenskonventionen-fur-forms">Namenskonventionen für Forms</h1>

And the release HTML stays the same

<p><a href="#namenskonventionen-f%C3%BCr-forms">Namenskonventionen für Forms</a></p>
<h1 id="namenskonventionen-für-forms">Namenskonventionen für Forms</h1>

3. While this does mean markdown like the following can't work in the preview, as there will be normalization happening, it doesn't work right now either, so I don't see this as a real regression.

* [Namenskonventionen für Forms](#namenskonventionen-für-forms)

### Namenskonventionen für Forms

@RickStrahl
Contributor Author

@MihaZupan Thank you, yes, that would work. However, the easier solution was to modify the render script that drives the preview and already intercepts hash navigation, which is inconsistent anyway due to the file-based nature (file:/// links) of the previewer.

The solution was actually quite simple: URL-decode the hash. Since auto-linking tends to strip spaces, quotes and other symbols, the only encoded content should be Unicode characters, so decoding should work fine.

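if (hash) {
    // The href is percent-encoded but the id is not, so decode before matching.
    hash = decodeURIComponent(hash);
    // Match either a named anchor or an element with that id.
    var sel = "a[name='" + hash.slice(1) + "']," + hash;
    var $el = $(sel);
    // Only scroll when a target was found; offset() is undefined otherwise.
    if ($el.length) {
        $("html,body").scrollTop($el.offset().top - 100);
    }
    return false;
}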
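
A slightly more defensive variant of the decoding step is sketched below, without jQuery, in case it helps someone else. `targetIdFromHash` is a hypothetical helper name. The fallback to the raw text is an assumption about what is sensible when the hash is not valid percent-encoding, since `decodeURIComponent` throws a `URIError` in that case.

```javascript
// Turn a location hash into the element id it refers to.
// Returns null when there is no usable hash.
function targetIdFromHash(hash) {
  if (!hash || hash === "#") return null;
  const raw = hash.startsWith("#") ? hash.slice(1) : hash;
  try {
    return decodeURIComponent(raw);
  } catch (e) {
    // Malformed percent-encoding (e.g. a lone "%"): fall back to the raw text.
    return raw;
  }
}

console.log(targetIdFromHash("#namenskonventionen-f%C3%BCr-forms")); // namenskonventionen-für-forms
console.log(targetIdFromHash("#100%"));                              // 100%
```

In a browser, the decoded id could then be passed to `document.getElementById`, which takes the id as a plain string and so avoids having to escape special characters for a CSS selector.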

@MihaZupan
Collaborator

Glad to hear you've found a solution
