URL encoding suggestion pathname's unintended side effect #958

rgaudin · 2023-06-12T13:48:12Z

Here's another tricky behavior.

In this ZIM, because it's a single page app, we have ZIM front entries that are stub HTML documents simply redirecting to the SPA's URL with appropriate fragment.

For instance, one content is available from:

index.html#english/python-for-everybody/comparing-and-sorting-tuples

We thus have a front article at redirect/english/python-for-everybody/comparing-and-sorting-tuples.

This works fine with and without the viewer. The random feature works as expected, serving up the redirects.

The problem is with the Suggestions. As you can see, the suggestion engines returns correct results and those are displayed fine but click on any leads to a 404.

Instead of arriving (search for “comp”) at the expected location, one arrives at https://dev.library.kiwix.org/viewer#index.html

As you can see, the bookName part is gone and one would expect the redirect to have gone to far. It's not the case.

It's not a problem with how the location is changed (https://github.com/kiwix/libkiwix/blob/main/static/skin/viewer.js#L50).

What happens is that the suggestion code urlencodes the result turning redirect/english/python-for-everybody/comparing-and-sorting-tuples into redirect%2Fenglish%2Fpython-for-everybody%2Fcomparing-and-sorting-tuples.

Now read carefully:

Browser is fine with that encoded URL and properly retrieves the entry.
Redirect is performed but the scope/source location is incorrect and we end-up elsewhere.

This is not linked to the viewer and the iframe. You can reproduce on /content:

Counter-intuitively, the browser does not translates the %2F into / and moves there but instead it queries the server with the %2F, gets the response and renders it from it's redirectXXX location, not accounting those %2F as path segments.
That's exactly the purpose of URL-encoding and all browsers do the same 👍

My question at this point would be: why are we URL-encoding this information? From what I understand, this is a ZIM-Entry path that is sent to us via the Suggestion API and we just use it as source of the iframe (prefixed with root if any).
Need for URL-encoding would be to prevent characters from that ZIM entry path to be interpreted as special-meaning for being reserved characters.
While ZIM path are entirely free, ZIM creators already limit themselves and will do as long as ZIM are mainly run in a Web context.

Article Format
URLs
Reserved characters: ␣ ! " # $ % & ' ( ) * + , / : ; = ? @ [ ]

The text was updated successfully, but these errors were encountered:

mgautierfr · 2023-06-12T14:07:10Z

My question at this point would be: why are we URL-encoding this information? From what I understand, this is a ZIM-Entry path that is sent to us via the Suggestion API and we just use it as source of the iframe (prefixed with root if any).
Need for URL-encoding would be to prevent characters from that ZIM entry path to be interpreted as special-meaning for being reserved characters.

We url encode as we may have ? or # being part of the entry path. If we don't url encode the path, the path foo/bar?option=true#section1 would lead to a request for foo/bar, with query string option=true and without section has every thing after the # is not send to the server.

While ZIM path are entirely free, ZIM creators already limit themselves and will do as long as ZIM are mainly run in a Web context.

At least zimit stores the querystring as part of the path (foo/bar?option=true).

At first sight, I would said that we should not url encode / in the path.

kelson42 · 2023-06-20T12:07:23Z

@veloman-yunkan This is a must fix before releasing 12.1.0... wonder a bit that we still have URL encoding issues.

veloman-yunkan · 2023-07-01T12:47:13Z

At first sight, I would said that we should not url encode / in the path.

Building suggestion links using URI-encoded components (book name and the article path) was introduced in #860. The main targets were only a few dangerous symbols, however in the absence of a standard function for selective URI-encoding all reserved symbols were encoded leading to this issue.

wonder a bit that we still have URL encoding issues.

@kelson42 The non-homogeneous structure of URLs but the natural desire to keep things simple has resulted in an encoding scheme that makes the world go nuts when one URL has to be embedded in another URL. Therefore there is no surprise that a human mind cannot comprehend all possible interactions of various scenarios (especially when pieces of code from different people having different assumptions meet in one place). Hence, please be prepared for a few more issues of this kind with exponentially growing weirdness until fixing the last one will produce a space-time singularity that will destroy our Universe. 😄

Before this fix suggestion links were built out of fully URI-encoded book name and article path components despite the fact that this measure was taken against only a few dangerous symbols such as '#', '?', '"' and '\'. However, URI-encoding the slash symbols in the path has some undesirable side-effects (see #958). Henceforth only the problematic symbols are encoded in the article path component. The book name is still fully URI-encoded since I don't see any counter-arguments.

rgaudin added bug question kiwix-serve labels Jun 12, 2023

kelson42 added this to the 12.1.0 milestone Jun 12, 2023

kelson42 assigned veloman-yunkan Jun 12, 2023

veloman-yunkan mentioned this issue Jul 1, 2023

Quasi-URI-encoded suggestion links #963

Merged

kelson42 closed this as completed in #963 Jul 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

URL encoding suggestion pathname's unintended side effect #958

URL encoding suggestion pathname's unintended side effect #958

rgaudin commented Jun 12, 2023

mgautierfr commented Jun 12, 2023 •

edited

Loading

kelson42 commented Jun 20, 2023

veloman-yunkan commented Jul 1, 2023

URL encoding suggestion pathname's unintended side effect #958

URL encoding suggestion pathname's unintended side effect #958

Comments

rgaudin commented Jun 12, 2023

mgautierfr commented Jun 12, 2023 • edited Loading

kelson42 commented Jun 20, 2023

veloman-yunkan commented Jul 1, 2023

mgautierfr commented Jun 12, 2023 •

edited

Loading