Quick-fix of issues caused by URL encoding #870

veloman-yunkan · 2023-01-18T13:50:41Z

Fixes kiwix/kiwix-tools#594 (hopefully, without introducing another bug).

I started working on a more robust solution to #775 on branch robust_uri_encoding but then started having a feeling of overdesigning and overcomplicating things. So I tried a quick hack (see the last commit in this PR) and it seemed to work, so I decided to package the new unit-tests, fixes to a couple of bugs revealed by those tests and the hack into a PR of its own.

codecov · 2023-01-18T13:55:36Z

Codecov Report

Base: 71.45% // Head: 72.01% // Increases project coverage by +0.55% 🎉

Coverage data is based on head (ca079a7) compared to base (cf59a93).
Patch coverage: 92.85% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #870      +/-   ##
==========================================
+ Coverage   71.45%   72.01%   +0.55%     
==========================================
  Files          53       53              
  Lines        3749     3745       -4     
  Branches     2096     2093       -3     
==========================================
+ Hits         2679     2697      +18     
+ Misses       1068     1046      -22     
  Partials        2        2

Impacted Files	Coverage Δ
src/tools/otherTools.cpp	`83.76% <ø> (-0.41%)`	⬇️
src/tools/stringTools.h	`100.00% <ø> (ø)`
src/tools/stringTools.cpp	`62.61% <87.50%> (+9.83%)`	⬆️
src/opds_dumper.cpp	`99.18% <100.00%> (ø)`
src/search_renderer.cpp	`95.45% <100.00%> (+0.04%)`	⬆️
src/server/internalServer.cpp	`88.52% <100.00%> (ø)`
src/server/request_context.cpp	`86.06% <100.00%> (ø)`
src/server/request_context.h	`100.00% <100.00%> (ø)`

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

veloman-yunkan · 2023-01-18T13:59:11Z

@mgautierfr Please assess if the quick-fix has any undesirable side-effects. I can think of only subtle ones due to broken interplay between ZIM creation software that might put %2F (URL-encoded /) sequences in item names relying on the fact that it isn't decoded in kiwix-serve.

mgautierfr · 2023-01-24T17:48:47Z

If I resume the change:

Old	New
EncodeReserved: false
`/` -> `/`	`/` -> `/`
`#` -> `%23`	`#` -> `#`
EncodeReserved: true
`/` -> `%2f`	`/` -> `/`
`#` -> `%23`	`#` -> `%23`

So with this change and encoded reserved to true, what changes is that we don't encode /
with this change and encoded reserved to false, what changes is that we don't encode #

(Please confirm I right)

urlEncode(..., false) is used for :

build the redirect. This is indeed what we want
encode the item's path when building the absolute path of a search result (we are good here)
encode the query in the ODPSFeed2. We are NOT good as # is a begging of the fragment, and the fragment is not send to the server
encode the category name in the categories opds feed. We should encode the # here as category may contain it. But I'm not aware of categories containing a #. @kelson42 @rgaudin ?

urlEncode(..., true) is used for:

Build a search pattern in handle_content if we cannot find a book or an article. I think we are good here to not encode /
Rebuild the queryString we received (parsed) from libmicrohttpd. I think we are good if we not encode /
Build querystring in search_renderer. We are good also
Encode the zim name when building the absolute path of search result. As zim name should not contains / we should be good also
Encode the contentId (book name) in ods_dumper fullEntryXML. Same has above
In urlencoded filter for mustache template. We probably want to encode / here (although I haven't found a use of this filter)

rgaudin · 2023-01-24T17:55:09Z

encode the category name in the categories opds feed. We should encode the # here as category may contain it. But I'm not aware of categories containing a #. @kelson42 @rgaudin ?

As you know, any value but ; is allowed in Tags but _category is private and we should thus only consider our own use, which doesn't make uses of #. 👍

veloman-yunkan · 2023-01-25T12:33:14Z

@mgautierfr Thanks for the meticulous analysis!

So with this change and encoded reserved to true, what changes is that we don't encode / with this change and encoded reserved to false, what changes is that we don't encode #

(Please confirm I right)

Confirming

urlEncode(..., false) is used for :

...

encode the item's path when building the absolute path of a search result (we are good here)

This one is a bug. URI delimiters occurring in the path must be encoded! Fixed.

encode the query in the ODPSFeed2. We are NOT good as # is a begging of the fragment, and the fragment is not send to the server

Good catch. While working on the robust solution, I changed RequestContext::get_query() so that it returns a properly URI-encoded representation of the query, but since that change broke more unit-tests than I was willing to accept, I had to stash it away. Now I see that I will have to deal with the same issue in this quick-fix PR.

encode the category name in the categories opds feed. We should encode the # here as category may contain it. But I'm not aware of categories containing a #. @kelson42 @rgaudin ?

Fixed. Category name is now fully URI-encoded.

urlEncode(..., true) is used for:

...

In urlencoded filter for mustache template. We probably want to encode / here (although I haven't found a use of this filter)

Dropped this one.

veloman-yunkan · 2023-01-25T12:54:34Z

encode the query in the ODPSFeed2. We are NOT good as # is a begging of the fragment, and the fragment is not send to the server

Good catch. While working on the robust solution, I changed RequestContext::get_query() so that it returns a properly URI-encoded representation of the query, but since that change broke more unit-tests than I was willing to accept, I had to stash it away. Now I see that I will have to deal with the same issue in this quick-fix PR.

Done

veloman-yunkan · 2023-01-25T13:21:11Z

Also fixed usage of urlEncode() in kiwix::getSeachUrl(), whereupon no instance of urlEncode(..., false) is left (other than in the unit-test of urlEncode()). Should we drop the extra parameter of urlEncode()?

mgautierfr · 2023-01-25T13:42:11Z

encode the item's path when building the absolute path of a search result (we are good here)

This one is a bug. URI delimiters occurring in the path must be encoded! Fixed.

We should not. The search result link is in the form : http://content/<bookName>/<path_to_article>.
If we encode / in path_to_article we will break relative links.

Also fixed usage of urlEncode() in kiwix::getSeachUrl(), whereupon no instance of urlEncode(..., false) is left (other than in the unit-test of urlEncode()). Should we drop the extra parameter of urlEncode()?

Yes. And we could call it encodeURIComponent.
For the encoding of a item path (keeping the /) we could have a method with this (pseudo) code :

string encodeItemPath(string itemPath) {
    auto components = itemPath.split('/');
    string encoded;
    char separator = '';
    for (component: components) {
        encoded += separator + encodeUrIComponent(component);
        separator = '/'
    }
   return encoded;
}

This way we preserve / in path and we still encode any #/? in article path which will break the url.

veloman-yunkan · 2023-01-25T13:57:02Z

encode the item's path when building the absolute path of a search result (we are good here)

This one is a bug. URI delimiters occurring in the path must be encoded! Fixed.

We should not. The search result link is in the form : http://content/<bookName>/<path_to_article>. If we encode / in path_to_article we will break relative links.

But that's what is changed by this PR! / is now never encoded by urlEncode().

mgautierfr · 2023-01-25T14:32:00Z

But that's what is changed by this PR! / is now never encoded by urlEncode().

🤦🏽‍♂️ My bad. I've mixed too much things.

So I think we are good. We just have to remove the encodeReserved argument. But it should be functionally good.

Just want to mention for potential issue later: We are ok to not encode / in zim name because the zim name is inferred from the zim filename and so doesn't contain /. But if it changes, we would have to adapt again and encode / in bookname (for exemple, we may want to use the name in the library as @rgaudin thank at first sight in kiwix/kiwix-tools#596)
In fact, we should already encode the / as the zim name may be any name if the libkiwix user provide its own namemapper.
But for a quick-fix we are good. To keep in mind for the "final" solution of #775.

Added a unit-test for urlEncode() that passes for its current implementation despite the two bugs that were revealed while creating the unit-test.

Replaced tabs with spaces.

This change is a quick hack solving known issues with URI-encoding in libkiwix. This change removes the slash character from the list of URL separator symbols in URL encoding/decoding utilities, and makes it a symbol that is safe to leave unencoded. Effects: - `urlEncode()` never encodes the '/' symbol (even when it is requested to encode the URL separator symbols too). - `urlDecode(str)`/`urlDecode(..., false)` will now decode %2F to '/'; other encoded URL separator symbols are NOT decoded when the second argument of `urlDecode()` is set to false (which is the default).

Special URI symbols occurring in the item path part of the search result link were NOT encoded, because that would also encode the path separator (/) symbol. Now that `urlEncode()` never encodes the / symbol, it is safe to encode all other URI-special symbols in the path.

This is a precautionary step before dropping the said parameter.

`urlEncode(str)` is now equivalent to the previous `urlEncode(str, true)`.

veloman-yunkan · 2023-01-25T15:16:42Z

So I think we are good. We just have to remove the encodeReserved argument. But it should be functionally good.

Done (also rebased).

we may want to use the name in the library as @rgaudin thank at first sight in kiwix/kiwix-tools#596

I stumbled upon "thank" in this sentence, but then figured out that you meant "thought" :)

veloman-yunkan requested a review from mgautierfr January 18, 2023 13:50

veloman-yunkan added 16 commits January 25, 2023 19:15

Added a unit-test of buggy urlEncode()

07c7d39

Added a unit-test for urlEncode() that passes for its current implementation despite the two bugs that were revealed while creating the unit-test.

Fixed urlEncode() for chars below 0x10

e49081d

'#' is a URI delimiter symbol

82d4770

Fixed indentation

aa2e443

Replaced tabs with spaces.

Added a unit-test for urlDecode()

822fb37

Extracted isHarmlessUriChar()

c5ccbd3

Removed unused code

bad13d7

Category name is fully URI-encoded

772243e

RequestContext::get_query() is fully URI-encoded

63e0d5c

Demonstrating bugs of kiwix::getSearchUrl()

82dcba5

Proper URI-encoding in kiwix::getSearchUrl()

ec81d59

Made 2nd param of urlEncode() mandatory

3bf8211

This is a precautionary step before dropping the said parameter.

Dropped the 2nd param of urlEncode()

471c5b8

`urlEncode(str)` is now equivalent to the previous `urlEncode(str, true)`.

Some clean-up

ca079a7

veloman-yunkan force-pushed the urlEncode_quickfix branch from 125a576 to ca079a7 Compare January 25, 2023 15:15

mgautierfr approved these changes Jan 25, 2023

View reviewed changes

mgautierfr merged commit 76dfc03 into main Jan 25, 2023

mgautierfr deleted the urlEncode_quickfix branch January 25, 2023 15:41

veloman-yunkan mentioned this pull request Feb 10, 2023

kiwix-serve and url encoding #775

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quick-fix of issues caused by URL encoding #870

Quick-fix of issues caused by URL encoding #870

veloman-yunkan commented Jan 18, 2023 •

edited

Loading

codecov bot commented Jan 18, 2023 •

edited

Loading

veloman-yunkan commented Jan 18, 2023

mgautierfr commented Jan 24, 2023

rgaudin commented Jan 24, 2023

veloman-yunkan commented Jan 25, 2023

veloman-yunkan commented Jan 25, 2023

veloman-yunkan commented Jan 25, 2023

mgautierfr commented Jan 25, 2023

veloman-yunkan commented Jan 25, 2023

mgautierfr commented Jan 25, 2023

veloman-yunkan commented Jan 25, 2023

Quick-fix of issues caused by URL encoding #870

Quick-fix of issues caused by URL encoding #870

Conversation

veloman-yunkan commented Jan 18, 2023 • edited Loading

codecov bot commented Jan 18, 2023 • edited Loading

Codecov Report

veloman-yunkan commented Jan 18, 2023

mgautierfr commented Jan 24, 2023

rgaudin commented Jan 24, 2023

veloman-yunkan commented Jan 25, 2023

veloman-yunkan commented Jan 25, 2023

veloman-yunkan commented Jan 25, 2023

mgautierfr commented Jan 25, 2023

veloman-yunkan commented Jan 25, 2023

mgautierfr commented Jan 25, 2023

veloman-yunkan commented Jan 25, 2023

veloman-yunkan commented Jan 18, 2023 •

edited

Loading

codecov bot commented Jan 18, 2023 •

edited

Loading