
urljoin since 3.5 incorrectly filters out double slashes #84774

Open
DavidBell mannequin opened this issue May 11, 2020 · 11 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

DavidBell mannequin commented May 11, 2020

BPO 40594
Nosy @orsenthil, @taleinat, @openandclose, @idomic, @iritkatriel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-05-11.11:53:50.366>
labels = ['3.7', '3.8', 'type-bug', 'library', '3.9']
title = 'urljoin since 3.5 incorrectly filters out double slashes'
updated_at = <Date 2021-12-01.23:04:01.306>
user = 'https://bugs.python.org/DavidBell'

bugs.python.org fields:

activity = <Date 2021-12-01.23:04:01.306>
actor = 'iritkatriel'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2020-05-11.11:53:50.366>
creator = 'David Bell'
dependencies = []
files = []
hgrepos = []
issue_num = 40594
keywords = []
message_count = 9.0
messages = ['368623', '370821', '371092', '371098', '371101', '371135', '371685', '371691', '407503']
nosy_count = 6.0
nosy_names = ['orsenthil', 'taleinat', 'op368', 'Ido Michael', 'David Bell', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue40594'
versions = ['Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9']

DavidBell mannequin commented May 11, 2020

In Python 3.5 the urljoin function was rewritten to be RFC 3986 compliant and fix long-standing issues. The initial rewrite added duplicate slashes by accident, and so code was added to prevent that. The discussion is here: https://bugs.python.org/issue22118

The code within urljoin is this:

# filter out elements that would cause redundant slashes on re-joining
# the resolved_path
segments[1:-1] = filter(None, segments[1:-1])

This seems sensible: you would not want double slashes in a URL, right? The problem is: double slashes are perfectly legal in a URI/URL, and for reasons I don't understand, they are in use in the wild. The code above was written to remove the duplicates that urljoin accidentally introduced, but it also removes intentional double slashes:

>>> urljoin("http://www.example.com/", "this//double/path")
'http://www.example.com/this/double/path'

Whereas the expected result should be:

'http://www.example.com/this//double/path'

I suggest that the fix for this is to remove the aforementioned filter code, e.g. remove this:

# filter out elements that would cause redundant slashes on re-joining
# the resolved_path
segments[1:-1] = filter(None, segments[1:-1])

...and remove this code too:

if base_parts[-1] != '':
    # the last item is not a directory, so will not be taken into account
    # in resolving the relative path
    del base_parts[-1]

and instead simply add:

del base_parts[-1]

...because the last part of the split base URL should always be deleted. If the last element of the list (the split base URL) is an empty string, then the URL ended with a slash, and we should remove that element; otherwise a double slash will occur when we combine it with the second parameter to urljoin. If the last element is not an empty string, then the last part of the URL was not a directory and should also be removed. Thus the last element should always be removed.

By following this logic the "remove all double slashes" filter is not necessary, because the cause of the accidental addition of double slashes is removed.
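By way of illustration, here is a minimal sketch of the proposed behavior. The function name urljoin_keep_empty and the simplified reference-resolution logic are mine, not from any patch; it skips several RFC 3986 cases that the real urljoin handles, and exists only to show the "always delete the last base segment, never filter empty segments" idea:

```python
from urllib.parse import urlsplit, urlunsplit

def urljoin_keep_empty(base, url):
    # Hypothetical sketch: always drop the last segment of the base path,
    # and never filter out empty segments. Not full RFC 3986 resolution.
    b, r = urlsplit(base), urlsplit(url)
    if r.scheme:
        return urlunsplit(r)
    if r.netloc or url.startswith('/'):
        return urlunsplit((b.scheme, r.netloc or b.netloc,
                           r.path, r.query, r.fragment))
    segments = b.path.split('/')
    # The last segment is either '' (base ended in '/') or a non-directory
    # name; in both cases it is dropped before merging.
    del segments[-1]
    merged = segments + url.split('/')
    resolved = []
    for seg in merged:
        if seg == '.':
            continue
        if seg == '..':
            if len(resolved) > 1:
                resolved.pop()
            continue
        resolved.append(seg)  # empty segments are kept, not filtered
    path = '/'.join(resolved)
    if b.netloc and not path.startswith('/'):
        path = '/' + path
    return urlunsplit((b.scheme, b.netloc, path, r.query, r.fragment))
```

With this sketch, urljoin_keep_empty("http://www.example.com/", "this//double/path") yields 'http://www.example.com/this//double/path', preserving the empty segment.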

@DavidBell DavidBell mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels May 11, 2020
idomic mannequin commented Jun 6, 2020

I can take this

orsenthil (Member) commented

The problem is: double slashes are perfectly legal in a URI/URL, and for reasons I don't understand, they are in use in the wild

And what is the de facto behavior of browsers when a URL with a double slash is used?
As far as I know, browsers remove them, collapsing the double slash into a single one.
(Tested on Chrome with some URLs to confirm, June 2020)


Whereas the expected result should be:

'http://www.example.com/this//double/path'

This statement should be supported by

a) an RFC,
b) or the behavior of some popular clients,
c) or libcurl exhibiting this behavior, which would lend strong support too.

Please share any references you have, and we should conform to those.


taleinat commented Jun 9, 2020

According to section 3.3 of RFC 3986[1], and also RFC 2396[2] which preceded it, a path is made up of zero or more consecutive "/" + segment pieces, and a segment may be empty. In other words, paths may include consecutive slashes.

Several StackOverflow answers to questions on this subject (e.g. this[3], and this[4]) also agree that consecutive slashes are valid, based on the aforementioned RFCs and other references.

.. [1]: https://tools.ietf.org/html/rfc3986#section-3.3
.. [2]: https://tools.ietf.org/html/rfc2396#section-3.3
.. [3]: https://stackoverflow.com/questions/20523318/is-a-url-with-in-the-path-section-valid
.. [4]: https://stackoverflow.com/questions/10161177/url-with-multiple-forward-slashes-does-it-break-anything

orsenthil (Member) commented

Hi Tal,

Thank you. To be specific, this is about parsing behavior, specifically urljoin behavior.

There is no question about the validity of the double-slash in the URL.

If we find a source (like libcurl) that exhibits parsing behavior different from CPython's, or sections in docs that give examples our current test suite fails, those would serve as good sources for expected results.

taleinat commented Jun 9, 2020

Putting aside backwards compatibility, I would argue the opposite: Since consecutive slashes are valid, I suggest we would need to see that collapsing them into a single slash is the status quo, otherwise we should avoid such collapsing.

Here's some evidence that we should keep consecutive slashes:

The curl CLI does not appear to collapse consecutive slashes in URLs before sending them:

$ nc -l localhost 8000 &
[1] 39380
$ curl --dump-header - --proxy localhost:8000 --silent --max-time 1 http://www.example.com/this//double/path
GET http://www.example.com/this//double/path HTTP/1.1
Host: www.example.com
User-Agent: curl/7.64.1
Accept: */*
Proxy-Connection: Keep-Alive

[1] + 39380 done nc -l localhost 8000

With NodeJS v10.18.0 and v12.16.1:


const url = require('url');
new url.URL("this//double/path", "http://www.example.com/").href
'http://www.example.com/this//double/path'

For me this is evidence enough that urljoin should *not* be collapsing consecutive slashes.
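For comparison, Python's own urljoin collapses the very same reference that curl and Node preserve, which makes the divergence easy to reproduce (behavior of the affected versions, 3.5 onward at the time of writing):

```python
from urllib.parse import urljoin

# curl and Node's URL keep the empty segment; urljoin drops it.
print(urljoin("http://www.example.com/", "this//double/path"))
# On affected versions: http://www.example.com/this/double/path
```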

openandclose mannequin commented Jun 16, 2020

To Senthil Kumaran:

https://bugs.python.org/issue40594#msg371092

As far as I know, the browsers remove them, combine them as a single URL.
(Tested on Chrome with some URLs to confirm on June 2020)

How did you test them?

orsenthil (Member) commented

How did you test them?

https://about.google//products/
https://www.google.com///search?q=something

were redirecting, collapsing the multiple slashes into single ones.

Now, when I try this one:

https://en.wikipedia.org/w//index.php?search=foobar+something

it was not collapsed.

Sorry. It is false to say that this is the default client behavior of Chrome.

iritkatriel (Member) commented

See also 25403, 37235.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
dccioata commented Nov 1, 2022

Until the problem is fixed, I've identified a workaround (tested on Python 3.10.6): insert a slash (/) as the first character of the url argument (before the actual URL):

>>> urllib.parse.urljoin("http://a.b", "c/d//e") == "http://a.b/c/d//e"
False
>>> urllib.parse.urljoin("http://a.b", "/c/d//e") == "http://a.b/c/d//e"
True

Hope it helps.

pirate commented Apr 25, 2024

Note this also applies if the double-slashes are in the base url, e.g.:

>>> urllib.parse.urljoin('https://web.archive.org/web/https://example.com/a', 'c.html')
'https://web.archive.org/web/https:/example.com/c.html'

Archive.org is the most widespread user of double slashes in URLs that I know of; they nest the original URL inside their own. It seems like a perfect example of a good use case for double slashes in URLs.

(thanks @tqobqbq, from: ArchiveBox/ArchiveBox#1411)
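For the base-URL case above, one way to sidestep urljoin entirely is to replace only the last path segment by hand. This helper is a workaround sketch (the name urljoin_same_dir is hypothetical, and it handles only plain same-directory file-name references); it leaves the rest of the base path, double slashes included, untouched:

```python
from urllib.parse import urlsplit, urlunsplit

def urljoin_same_dir(base, name):
    # Replace only the final path segment of `base` with `name`,
    # keeping the preceding path (double slashes included) intact.
    parts = urlsplit(base)
    head, _, _ = parts.path.rpartition('/')
    return urlunsplit((parts.scheme, parts.netloc, head + '/' + name, '', ''))

print(urljoin_same_dir('https://web.archive.org/web/https://example.com/a', 'c.html'))
# https://web.archive.org/web/https://example.com/c.html
```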

commonism added a commit to commonism/yarl that referenced this issue Jun 15, 2024
urllib.parse.urljoin does not honor empty segments,
python/cpython#84774

implement using joinpath

change query string generation to not emit empty values as ""
commonism added the same commit to commonism/yarl again on Jul 3, Jul 18, and Sep 2, 2024 (identical message each time).
github-actions bot pushed a commit to aio-libs/aiohttp that referenced this issue Sep 5, 2024

Bumps yarl from 1.9.9 to 1.9.11. Sourced from yarl's release notes:

1.9.11

Bug fixes:

- Fixed a TypeError with MultiDictProxy and Python 3.8 -- by bdraco. (#1084, #1105, #1107)

Miscellaneous internal changes:

- Improved performance of encoding hosts -- by bdraco. Previously, the library would unconditionally try to parse a host as an IP address. The library now avoids trying to parse a host as an IP address if the string is not in one of the formats described in RFC 3986 section 3.2.2. (#1104)

1.9.10

Bug fixes:

- URL.join() has been changed to match RFC 3986 and align with the / operation and URL.joinpath() when joining URLs with empty segments. Previously urllib.parse.urljoin was used, which has known issues with empty segments (python/cpython#84774).

  Due to the semantics of URL.join(), joining a URL with a scheme requires making it relative, prefixing it with ./:

  >>> URL("https://web.archive.org/web/").join(URL("./https://github.com/aio-libs/yarl"))
  URL('https://web.archive.org/web/https://github.com/aio-libs/yarl')

  Empty segments are honored in the base as well as the joined part:

  >>> URL("https://web.archive.org/web/https://").join(URL("github.com/aio-libs/yarl"))
  URL('https://web.archive.org/web/https://github.com/aio-libs/yarl')

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>