urljoin since 3.5 incorrectly filters out double slashes #84774
In Python 3.5 the `urljoin` function was rewritten to be RFC 3986 compliant and to fix long-standing issues. The initial rewrite accidentally added duplicate slashes, so code was added to prevent that. The discussion is here: https://bugs.python.org/issue22118

The code within `urljoin` is this:

```python
# filter out elements that would cause redundant slashes on
# re-joining the resolved_path
segments[1:-1] = filter(None, segments[1:-1])
```

This seems sensible; you would not want double slashes in a URL, right? The problem is that double slashes are perfectly legal in a URI/URL and, for reasons I don't understand, they are in use in the wild. The code above was written to remove the slashes `urljoin` accidentally introduced, but it also removes intentional double slashes:

```pycon
>>> urljoin("http://www.example.com/", "this//double/path")
'http://www.example.com/this/double/path'
```

whereas the expected result should be:

```
'http://www.example.com/this//double/path'
```

I suggest that the fix is to remove the aforementioned filter code:

```python
# filter out elements that would cause redundant slashes on
# re-joining the resolved_path
segments[1:-1] = filter(None, segments[1:-1])
```

...and also remove this code:

```python
if base_parts[-1] != '':
    # the last item is not a directory, so will not be taken into account
    # in resolving the relative path
    del base_parts[-1]
```

...and instead simply use:

```python
del base_parts[-1]
```

...because the last part of the split base URL should always be deleted. If the last element of the list (the split base URL) is an empty string, then the URL ended with a slash, so we should remove the last element; otherwise a double slash will occur when we combine it with the second argument to `urljoin`. If the last element is not an empty string, then the last part of the URL was not a directory and should be removed. Thus the last element should always be removed. Following this logic, the "remove all double slashes" filter is unnecessary, because the cause of the accidental double slashes is gone.
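The behaviour described above is easy to reproduce (the collapsed output below is quoted from the issue and still occurs on current CPython):

```python
from urllib.parse import urljoin

# The empty segment between the two slashes is filtered out by the
# "redundant slashes" code in urljoin's relative-reference branch.
joined = urljoin("http://www.example.com/", "this//double/path")
print(joined)  # 'http://www.example.com/this/double/path' -- '//' collapsed
```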
I can take this.
And what is the de facto behavior of browsers when a URL with a double slash is used?
This statement should be supported by a) an RFC. Please share any references you have, and we should conform to those.
According to section 3.3 of RFC 3986 [1], and also RFC 2396 [2] which preceded it, a path is made up of zero or more consecutive "/" + segment pieces, and a segment may be empty. In other words, paths may include consecutive slashes. Several StackOverflow answers to questions on this subject (e.g. [3] and [4]) also agree that consecutive slashes are valid, based on the aforementioned RFCs and other references.

.. [1]: https://tools.ietf.org/html/rfc3986#section-3.3
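Worth noting is that urllib's *parsing* side already treats empty segments as valid: `urlsplit` preserves them and `urlunsplit` round-trips them unchanged; only `urljoin`'s normalization collapses them. A quick check:

```python
from urllib.parse import urlsplit, urlunsplit

# urlsplit keeps the empty segment in the path...
parts = urlsplit("http://www.example.com/this//double/path")
print(parts.path)  # '/this//double/path'

# ...and urlunsplit reassembles the URL without collapsing it.
print(urlunsplit(parts))  # the original URL, unchanged
```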
Hi Tal, thank you. To be specific, this is about parsing behavior, specifically `urljoin` behavior. There is no question about the validity of the double slash in a URL. If we find a source (like libcurl) that exhibits parsing behavior different from CPython's, or sections in docs that give examples our current test suite fails, those serve as good sources for expected results.
Putting aside backwards compatibility, I would argue the opposite: since consecutive slashes are valid, I suggest we would need to see that collapsing them into a single slash is the status quo; otherwise we should avoid such collapsing. Here's some evidence that we should keep consecutive slashes. The curl CLI does not appear to collapse consecutive slashes in URLs before sending them:

```shell
$ nc -l localhost 8000 &
[1] 39380
$ curl --dump-header - --proxy localhost:8000 --silent --max-time 1 http://www.example.com/this//double/path
GET http://www.example.com/this//double/path HTTP/1.1
Host: www.example.com
User-Agent: curl/7.64.1
Accept: */*
Proxy-Connection: Keep-Alive
[1]  + 39380 done       nc -l localhost 8000
```

With NodeJS v10.18.0 and v12.16.1:
For me this is evidence enough that urljoin should *not* be collapsing consecutive slashes.
To Senthil Kumaran: https://bugs.python.org/issue40594#msg371092
How did you test them?
https://about.google//products/ was redirecting and collapsing multiple slashes into a single one. Now, when I try this, it is not. Sorry, it is false to say that this was default client behavior in Chrome.
See also 25403, 37235.
Until the problem is fixed, I've identified a workaround (tested on Python 3.10.6): insert a slash (`/`) as the first character of the url argument (before the actual URL). Hope it helps.
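The workaround described above can be sketched as follows. A leading `/` makes the reference an absolute path, which sends `urljoin` down the branch that takes the path as-is, so the slash-collapsing filter is never applied. The trade-off (an assumption worth flagging): the absolute path replaces the base path entirely, so you must supply the full path rather than a reference relative to the base directory.

```python
from urllib.parse import urljoin

base = "http://www.example.com/some/dir/"

# Without the workaround: the relative-reference branch filters out
# the empty segment (and resolves against the base directory).
print(urljoin(base, "this//double/path"))

# With the workaround: the leading "/" skips the filter, preserving
# the double slash -- but the base directory is discarded, so the
# full path must be given.
print(urljoin(base, "/this//double/path"))
```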
Note this also applies if the double slashes are in the base URL, e.g.:

```pycon
>>> urllib.parse.urljoin('https://web.archive.org/web/https://example.com/a', 'c.html')
'https://web.archive.org/web/https:/example.com/c.html'
```

Archive.org is the most widespread user of double slashes in URLs that I know of; they nest the original URL inside their own. It seems like a perfect example of a good use case for double slashes in URLs. (Thanks @tqobqbq; from ArchiveBox/ArchiveBox#1411)
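To make the expected behavior concrete, here is a from-scratch sketch of reference resolution per RFC 3986 sections 5.2.4 and 5.3 with no slash-collapsing step anywhere. This is not CPython's implementation; `rfc3986_urljoin` and `remove_dot_segments` are names chosen for illustration, and the dot-segment handling is simplified relative to the full RFC algorithm.

```python
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(path):
    # Simplified RFC 3986 section 5.2.4: drop "." segments, let ".."
    # pop the previous segment, and keep empty segments intact.
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == ".":
            continue
        elif seg == "..":
            if len(out) > 1:
                out.pop()
        else:
            out.append(seg)
    if segments and segments[-1] in (".", ".."):
        out.append("")  # a trailing "." or ".." implies a trailing slash
    return "/".join(out)

def rfc3986_urljoin(base, url):
    # Reference resolution per RFC 3986 section 5.3 -- note there is
    # no filtering of empty segments anywhere in this algorithm.
    b, r = urlsplit(base), urlsplit(url)
    if r.scheme:
        return urlunsplit(r)
    if r.netloc:
        return urlunsplit((b.scheme, r.netloc,
                           remove_dot_segments(r.path), r.query, r.fragment))
    if not r.path:
        return urlunsplit((b.scheme, b.netloc, b.path,
                           r.query or b.query, r.fragment))
    if r.path.startswith("/"):
        path = r.path
    else:
        # Merge: everything up to and including the last "/" of the
        # base path, then the relative path -- empty segments survive.
        path = b.path[: b.path.rfind("/") + 1] + r.path
    return urlunsplit((b.scheme, b.netloc,
                       remove_dot_segments(path), r.query, r.fragment))

print(rfc3986_urljoin("http://www.example.com/", "this//double/path"))
print(rfc3986_urljoin("https://web.archive.org/web/https://example.com/a", "c.html"))
```

Both examples from the thread resolve with their empty segments intact, matching what the RFC's merge rules prescribe.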
urllib.parse.urljoin does not honor empty segments, python/cpython#84774 implement using joinpath change query string generation to not emit empty values as ""
Bumps [yarl](https://github.com/aio-libs/yarl) from 1.9.9 to 1.9.11. Release notes, sourced from [yarl's releases](https://github.com/aio-libs/yarl/releases):

**1.9.11** (2024-09-04)

- Bug fixes: fixed a `TypeError` with `MultiDictProxy` and Python 3.8 -- by @bdraco (#1084, #1105, #1107).
- Miscellaneous internal changes: improved performance of encoding hosts -- by @bdraco. Previously, the library would unconditionally try to parse a host as an IP address; it now avoids doing so if the string is not in one of the formats described in RFC 3986 section 3.2.2 (#1104).

**1.9.10** (2024-09-04)

- Bug fixes: `URL.join()` has been changed to match RFC 3986 and align with the `/` operation (`URL.__truediv__`) and `URL.joinpath()` when joining URLs with empty segments. Previously `urllib.parse.urljoin` was used, which has known issues with empty segments (python/cpython#84774). Due to the semantics of `URL.join()`, joining a URL with a scheme requires making it relative by prefixing it with `./`:

  ```pycon
  >>> URL("https://web.archive.org/web/").join(URL("./https://github.com/aio-libs/yarl"))
  URL('https://web.archive.org/web/https://github.com/aio-libs/yarl')
  ```

  Empty segments are honored in the base as well as the joined part:

  ```pycon
  >>> URL("https://web.archive.org/web/https://").join(URL("github.com/aio-libs/yarl"))
  URL('https://web.archive.org/web/https://github.com/aio-libs/yarl')
  ```

Relevant commit: `af585b2` "RFC3986 compatible URL.join honoring empty segments again" (#1082); see the full diff in the [compare view](https://github.com/aio-libs/yarl/compare/v1.9.9...v1.9.11).

Signed-off-by: dependabot[bot] <support@github.com>
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.