Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quick-fix of issues caused by URL encoding #870

Merged
merged 16 commits into from
Jan 25, 2023
Merged

Conversation

veloman-yunkan
Copy link
Collaborator

@veloman-yunkan veloman-yunkan commented Jan 18, 2023

Fixes kiwix/kiwix-tools#594 (hopefully, without introducing another bug).

I started working on a more robust solution to #775 on branch robust_uri_encoding but then started having a feeling of overdesigning and overcomplicating things. So I tried a quick hack (see the last commit in this PR) and it seemed to work, so I decided to package the new unit-tests, fixes to a couple of bugs revealed by those tests and the hack into a PR of its own.

@codecov
Copy link

codecov bot commented Jan 18, 2023

Codecov Report

Base: 71.45% // Head: 72.01% // Increases project coverage by +0.55% 🎉

Coverage data is based on head (ca079a7) compared to base (cf59a93).
Patch coverage: 92.85% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #870      +/-   ##
==========================================
+ Coverage   71.45%   72.01%   +0.55%     
==========================================
  Files          53       53              
  Lines        3749     3745       -4     
  Branches     2096     2093       -3     
==========================================
+ Hits         2679     2697      +18     
+ Misses       1068     1046      -22     
  Partials        2        2              
Impacted Files Coverage Δ
src/tools/otherTools.cpp 83.76% <ø> (-0.41%) ⬇️
src/tools/stringTools.h 100.00% <ø> (ø)
src/tools/stringTools.cpp 62.61% <87.50%> (+9.83%) ⬆️
src/opds_dumper.cpp 99.18% <100.00%> (ø)
src/search_renderer.cpp 95.45% <100.00%> (+0.04%) ⬆️
src/server/internalServer.cpp 88.52% <100.00%> (ø)
src/server/request_context.cpp 86.06% <100.00%> (ø)
src/server/request_context.h 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@veloman-yunkan
Copy link
Collaborator Author

@mgautierfr Please assess if the quick-fix has any undesirable side-effects. I can think of only subtle ones due to broken interplay between ZIM creation software that might put %2F (URL-encoded /) sequences in item names relying on the fact that it isn't decoded in kiwix-serve.

@mgautierfr
Copy link
Member

If I resume the change:

Old New
EncodeReserved: false
/ -> / / -> /
# -> %23 # -> #
EncodeReserved: true
/ -> %2f / -> /
# -> %23 # -> %23

So with this change and encoded reserved to true, what changes is that we don't encode /
with this change and encoded reserved to false, what changes is that we don't encode #

(Please confirm I right)

urlEncode(..., false) is used for :

  • build the redirect. This is indeed what we want
  • encode the item's path when building the absolute path of a search result (we are good here)
  • encode the query in the ODPSFeed2. We are NOT good as # is a begging of the fragment, and the fragment is not send to the server
  • encode the category name in the categories opds feed. We should encode the # here as category may contain it. But I'm not aware of categories containing a #. @kelson42 @rgaudin ?

urlEncode(..., true) is used for:

  • Build a search pattern in handle_content if we cannot find a book or an article. I think we are good here to not encode /
  • Rebuild the queryString we received (parsed) from libmicrohttpd. I think we are good if we not encode /
  • Build querystring in search_renderer. We are good also
  • Encode the zim name when building the absolute path of search result. As zim name should not contains / we should be good also
  • Encode the contentId (book name) in ods_dumper fullEntryXML. Same has above
  • In urlencoded filter for mustache template. We probably want to encode / here (although I haven't found a use of this filter)

@rgaudin
Copy link
Member

rgaudin commented Jan 24, 2023

  • encode the category name in the categories opds feed. We should encode the # here as category may contain it. But I'm not aware of categories containing a #. @kelson42 @rgaudin ?

As you know, any value but ; is allowed in Tags but _category is private and we should thus only consider our own use, which doesn't make uses of #. 👍

@veloman-yunkan
Copy link
Collaborator Author

@mgautierfr Thanks for the meticulous analysis!

So with this change and encoded reserved to true, what changes is that we don't encode / with this change and encoded reserved to false, what changes is that we don't encode #

(Please confirm I right)

Confirming

urlEncode(..., false) is used for :

...

  • encode the item's path when building the absolute path of a search result (we are good here)

This one is a bug. URI delimiters occurring in the path must be encoded! Fixed.

  • encode the query in the ODPSFeed2. We are NOT good as # is a begging of the fragment, and the fragment is not send to the server

Good catch. While working on the robust solution, I changed RequestContext::get_query() so that it returns a properly URI-encoded representation of the query, but since that change broke more unit-tests than I was willing to accept, I had to stash it away. Now I see that I will have to deal with the same issue in this quick-fix PR.

  • encode the category name in the categories opds feed. We should encode the # here as category may contain it. But I'm not aware of categories containing a #. @kelson42 @rgaudin ?

Fixed. Category name is now fully URI-encoded.

urlEncode(..., true) is used for:

...

  • In urlencoded filter for mustache template. We probably want to encode / here (although I haven't found a use of this filter)

Dropped this one.

@veloman-yunkan
Copy link
Collaborator Author

  • encode the query in the ODPSFeed2. We are NOT good as # is a begging of the fragment, and the fragment is not send to the server

Good catch. While working on the robust solution, I changed RequestContext::get_query() so that it returns a properly URI-encoded representation of the query, but since that change broke more unit-tests than I was willing to accept, I had to stash it away. Now I see that I will have to deal with the same issue in this quick-fix PR.

Done

@veloman-yunkan
Copy link
Collaborator Author

Also fixed usage of urlEncode() in kiwix::getSeachUrl(), whereupon no instance of urlEncode(..., false) is left (other than in the unit-test of urlEncode()). Should we drop the extra parameter of urlEncode()?

@mgautierfr
Copy link
Member

encode the item's path when building the absolute path of a search result (we are good here)

This one is a bug. URI delimiters occurring in the path must be encoded! Fixed.

We should not. The search result link is in the form : http://content/<bookName>/<path_to_article>.
If we encode / in path_to_article we will break relative links.

Also fixed usage of urlEncode() in kiwix::getSeachUrl(), whereupon no instance of urlEncode(..., false) is left (other than in the unit-test of urlEncode()). Should we drop the extra parameter of urlEncode()?

Yes. And we could call it encodeURIComponent.
For the encoding of a item path (keeping the /) we could have a method with this (pseudo) code :

string encodeItemPath(string itemPath) {
    auto components = itemPath.split('/');
    string encoded;
    char separator = '';
    for (component: components) {
        encoded += separator + encodeUrIComponent(component);
        separator = '/'
    }
   return encoded;
}

This way we preserve / in path and we still encode any #/? in article path which will break the url.

@veloman-yunkan
Copy link
Collaborator Author

encode the item's path when building the absolute path of a search result (we are good here)

This one is a bug. URI delimiters occurring in the path must be encoded! Fixed.

We should not. The search result link is in the form : http://content/<bookName>/<path_to_article>. If we encode / in path_to_article we will break relative links.

But that's what is changed by this PR! / is now never encoded by urlEncode().

@mgautierfr
Copy link
Member

But that's what is changed by this PR! / is now never encoded by urlEncode().

🤦🏽‍♂️ My bad. I've mixed too much things.

So I think we are good. We just have to remove the encodeReserved argument. But it should be functionally good.


Just want to mention for potential issue later: We are ok to not encode / in zim name because the zim name is inferred from the zim filename and so doesn't contain /. But if it changes, we would have to adapt again and encode / in bookname (for exemple, we may want to use the name in the library as @rgaudin thank at first sight in kiwix/kiwix-tools#596)
In fact, we should already encode the / as the zim name may be any name if the libkiwix user provide its own namemapper.
But for a quick-fix we are good. To keep in mind for the "final" solution of #775.

Added a unit-test for urlEncode() that passes for its current
implementation despite the two bugs that were revealed while creating
the unit-test.
Replaced tabs with spaces.
This change is a quick hack solving known issues with URI-encoding in
libkiwix.

This change removes the slash character from the list of URL separator
symbols in URL encoding/decoding utilities, and makes it a symbol that
is safe to leave unencoded.

Effects:

- `urlEncode()` never encodes the '/' symbol (even when it is requested
  to encode the URL separator symbols too).

- `urlDecode(str)`/`urlDecode(..., false)` will now decode %2F to '/';
  other encoded URL separator symbols are NOT decoded when the second
  argument of `urlDecode()` is set to false (which is the default).
Special URI symbols occurring in the item path part of the search result
link were NOT encoded, because that would also encode the path separator (/)
symbol. Now that `urlEncode()` never encodes the / symbol, it is safe to
encode all other URI-special symbols in the path.
This is a precautionary step before dropping the said parameter.
`urlEncode(str)` is now equivalent to the previous `urlEncode(str, true)`.
@veloman-yunkan
Copy link
Collaborator Author

So I think we are good. We just have to remove the encodeReserved argument. But it should be functionally good.

Done (also rebased).

we may want to use the name in the library as @rgaudin thank at first sight in kiwix/kiwix-tools#596

I stumbled upon "thank" in this sentence, but then figured out that you meant "thought" :)

@mgautierfr mgautierfr merged commit 76dfc03 into main Jan 25, 2023
@mgautierfr mgautierfr deleted the urlEncode_quickfix branch January 25, 2023 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression with URL encoding: Zimit ZIM files are not readable anymore
3 participants