Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL encoding suggestion pathname's unintended side effect #958

Closed
rgaudin opened this issue Jun 12, 2023 · 3 comments · Fixed by #963
Closed

URL encoding suggestion pathname's unintended side effect #958

rgaudin opened this issue Jun 12, 2023 · 3 comments · Fixed by #963

Comments

@rgaudin
Copy link
Member

rgaudin commented Jun 12, 2023

Here's another tricky behavior.

In this ZIM, because it's a single page app, we have ZIM front entries that are stub HTML documents simply redirecting to the SPA's URL with appropriate fragment.

For instance, one content is available from:

We thus have a front article at redirect/english/python-for-everybody/comparing-and-sorting-tuples.

This works fine with and without the viewer. The random feature works as expected, serving up the redirects.

The problem is with the Suggestions. As you can see, the suggestion engines returns correct results and those are displayed fine but click on any leads to a 404.

Instead of arriving (search for “comp”) at the expected location, one arrives at https://dev.library.kiwix.org/viewer#index.html

As you can see, the bookName part is gone and one would expect the redirect to have gone to far. It's not the case.

It's not a problem with how the location is changed (https://github.com/kiwix/libkiwix/blob/main/static/skin/viewer.js#L50).

What happens is that the suggestion code urlencodes the result turning redirect/english/python-for-everybody/comparing-and-sorting-tuples into redirect%2Fenglish%2Fpython-for-everybody%2Fcomparing-and-sorting-tuples.

Now read carefully:

  • Browser is fine with that encoded URL and properly retrieves the entry.
  • Redirect is performed but the scope/source location is incorrect and we end-up elsewhere.

This is not linked to the viewer and the iframe. You can reproduce on /content:

Counter-intuitively, the browser does not translates the %2F into / and moves there but instead it queries the server with the %2F, gets the response and renders it from it's redirectXXX location, not accounting those %2F as path segments.
That's exactly the purpose of URL-encoding and all browsers do the same 👍

My question at this point would be: why are we URL-encoding this information? From what I understand, this is a ZIM-Entry path that is sent to us via the Suggestion API and we just use it as source of the iframe (prefixed with root if any).
Need for URL-encoding would be to prevent characters from that ZIM entry path to be interpreted as special-meaning for being reserved characters.
While ZIM path are entirely free, ZIM creators already limit themselves and will do as long as ZIM are mainly run in a Web context.

@mgautierfr
Copy link
Member

mgautierfr commented Jun 12, 2023

My question at this point would be: why are we URL-encoding this information? From what I understand, this is a ZIM-Entry path that is sent to us via the Suggestion API and we just use it as source of the iframe (prefixed with root if any).
Need for URL-encoding would be to prevent characters from that ZIM entry path to be interpreted as special-meaning for being reserved characters.

We url encode as we may have ? or # being part of the entry path. If we don't url encode the path, the path foo/bar?option=true#section1 would lead to a request for foo/bar, with query string option=true and without section has every thing after the # is not send to the server.

While ZIM path are entirely free, ZIM creators already limit themselves and will do as long as ZIM are mainly run in a Web context.

At least zimit stores the querystring as part of the path (foo/bar?option=true).


At first sight, I would said that we should not url encode / in the path.

@kelson42 kelson42 added this to the 12.1.0 milestone Jun 12, 2023
@kelson42
Copy link
Collaborator

@veloman-yunkan This is a must fix before releasing 12.1.0... wonder a bit that we still have URL encoding issues.

@veloman-yunkan
Copy link
Collaborator

At first sight, I would said that we should not url encode / in the path.

Building suggestion links using URI-encoded components (book name and the article path) was introduced in #860. The main targets were only a few dangerous symbols, however in the absence of a standard function for selective URI-encoding all reserved symbols were encoded leading to this issue.

wonder a bit that we still have URL encoding issues.

@kelson42 The non-homogeneous structure of URLs but the natural desire to keep things simple has resulted in an encoding scheme that makes the world go nuts when one URL has to be embedded in another URL. Therefore there is no surprise that a human mind cannot comprehend all possible interactions of various scenarios (especially when pieces of code from different people having different assumptions meet in one place). Hence, please be prepared for a few more issues of this kind with exponentially growing weirdness until fixing the last one will produce a space-time singularity that will destroy our Universe. 😄

veloman-yunkan added a commit that referenced this issue Jul 1, 2023
Before this fix suggestion links were built out of fully URI-encoded
book name and article path components despite the fact that this measure
was taken against only a few dangerous symbols such as '#', '?', '"' and
'\'.  However, URI-encoding the slash symbols in the path has some
undesirable side-effects (see #958).

Henceforth only the problematic symbols are encoded in the article path
component. The book name is still fully URI-encoded since I don't see
any counter-arguments.
veloman-yunkan added a commit that referenced this issue Jul 1, 2023
Before this fix suggestion links were built out of fully URI-encoded
book name and article path components despite the fact that this measure
was taken against only a few dangerous symbols such as '#', '?', '"' and
'\'.  However, URI-encoding the slash symbols in the path has some
undesirable side-effects (see #958).

Henceforth only the problematic symbols are encoded in the article path
component. The book name is still fully URI-encoded since I don't see
any counter-arguments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants