
urljoin since 3.5 incorrectly filters out double slashes #84774

Open
DavidBell mannequin opened this issue May 11, 2020 · 11 comments
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

DavidBell mannequin commented May 11, 2020

BPO 40594
Nosy @orsenthil, @taleinat, @openandclose, @idomic, @iritkatriel

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-05-11.11:53:50.366>
labels = ['3.7', '3.8', 'type-bug', 'library', '3.9']
title = 'urljoin since 3.5 incorrectly filters out double slashes'
updated_at = <Date 2021-12-01.23:04:01.306>
user = 'https://bugs.python.org/DavidBell'

bugs.python.org fields:

activity = <Date 2021-12-01.23:04:01.306>
actor = 'iritkatriel'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2020-05-11.11:53:50.366>
creator = 'David Bell'
dependencies = []
files = []
hgrepos = []
issue_num = 40594
keywords = []
message_count = 9.0
messages = ['368623', '370821', '371092', '371098', '371101', '371135', '371685', '371691', '407503']
nosy_count = 6.0
nosy_names = ['orsenthil', 'taleinat', 'op368', 'Ido Michael', 'David Bell', 'iritkatriel']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue40594'
versions = ['Python 3.5', 'Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9']

DavidBell mannequin commented May 11, 2020

In Python 3.5 the urljoin function was rewritten to be RFC 3986 compliant and fix long-standing issues. The initial rewrite added duplicate slashes by accident, and so code was added to prevent that. The discussion is here: https://bugs.python.org/issue22118

The code within urljoin is this:

# filter out elements that would cause redundant slashes on re-joining
# the resolved_path
segments[1:-1] = filter(None, segments[1:-1])

This seems sensible: you would not want double slashes in a URL, right? The problem is: double slashes are perfectly legal in a URI/URL, and for reasons I don't understand, they are in use in the wild. The code above was written to remove the duplicates that urljoin accidentally introduced, but it also removes intentional double slashes:

>>> urljoin("http://www.example.com/", "this//double/path")
'http://www.example.com/this/double/path'

Whereas the expected result should be:

'http://www.example.com/this//double/path'

I suggest that the fix for this is to remove the aforementioned filter code, e.g. remove this:

# filter out elements that would cause redundant slashes on re-joining
# the resolved_path
segments[1:-1] = filter(None, segments[1:-1])

...and remove this code too:

if base_parts[-1] != '':
    # the last item is not a directory, so will not be taken into account
    # in resolving the relative path
    del base_parts[-1]

and instead simply add:

del base_parts[-1]

...because the last part of the split base URL should always be deleted. If the last element of the list (the split base URL) is an empty string, then the URL ended with a slash, and we should remove that element; otherwise a double slash will occur when we combine it with the second parameter to urljoin. If the last element is not an empty string, then the last part of the URL was not a directory and should also be removed. Thus the last element should always be removed.

By following this logic the "remove all double slashes" filter is not necessary, because the cause of the accidental addition of double slashes is removed.
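By way of illustration, here is a minimal sketch of the proposed behavior. The function name urljoin_keep_empty and the simplified reference-resolution logic are mine, not from any patch; it skips several RFC 3986 cases that the real urljoin handles, and exists only to show the "always delete the last base segment, never filter empty segments" idea:

```python
from urllib.parse import urlsplit, urlunsplit

def urljoin_keep_empty(base, url):
    # Hypothetical sketch: always drop the last segment of the base path,
    # and never filter out empty segments. Not full RFC 3986 resolution.
    b, r = urlsplit(base), urlsplit(url)
    if r.scheme:
        return urlunsplit(r)
    if r.netloc or url.startswith('/'):
        return urlunsplit((b.scheme, r.netloc or b.netloc,
                           r.path, r.query, r.fragment))
    segments = b.path.split('/')
    # The last segment is either '' (base ended in '/') or a non-directory
    # name; in both cases it is dropped before merging.
    del segments[-1]
    merged = segments + url.split('/')
    resolved = []
    for seg in merged:
        if seg == '.':
            continue
        if seg == '..':
            if len(resolved) > 1:
                resolved.pop()
            continue
        resolved.append(seg)  # empty segments are kept, not filtered
    path = '/'.join(resolved)
    if b.netloc and not path.startswith('/'):
        path = '/' + path
    return urlunsplit((b.scheme, b.netloc, path, r.query, r.fragment))
```

With this sketch, urljoin_keep_empty("http://www.example.com/", "this//double/path") yields 'http://www.example.com/this//double/path', preserving the empty segment.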

@DavidBell DavidBell mannequin added 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels May 11, 2020
idomic mannequin commented Jun 6, 2020

I can take this

orsenthil (Member) commented

The problem is: double slashes are perfectly legal in a URI/URL, and for reasons I don't understand, they are in use in the wild

And what is the de facto behavior of browsers when a URL with a double slash is used?
As far as I know, browsers remove them, collapsing the double slash into a single one.
(Tested on Chrome with some URLs to confirm, June 2020)


Whereas the expected result should be:

'http://www.example.com/this//double/path'

This statement should be supported by

a) an RFC,
b) or the behavior of some popular clients,
c) or libcurl exhibiting this behavior, which would lend strong support too.

Please share any references you have, and we should conform to those.


taleinat commented Jun 9, 2020

According to section 3.3 of RFC 3986[1], and also RFC 2396[2] which preceded it, a path is made up of zero or more consecutive "/" + segment pieces, and a segment may be empty. In other words, paths may include consecutive slashes.

Several StackOverflow answers to questions on this subject (e.g. this[3], and this[4]) also agree that consecutive slashes are valid, based on the aforementioned RFCs and other references.

.. [1]: https://tools.ietf.org/html/rfc3986#section-3.3
.. [2]: https://tools.ietf.org/html/rfc2396#section-3.3
.. [3]: https://stackoverflow.com/questions/20523318/is-a-url-with-in-the-path-section-valid
.. [4]: https://stackoverflow.com/questions/10161177/url-with-multiple-forward-slashes-does-it-break-anything

orsenthil (Member) commented

Hi Tal,

Thank you. To be specific, this is about parsing behavior, specifically urljoin behavior.

There is no question about the validity of the double-slash in the URL.

If we find a source (like libcurl) that exhibits parsing behavior different from CPython's, or sections in docs that give examples our current test suite fails, those would serve as good sources for expected results.

taleinat commented Jun 9, 2020

Putting aside backwards compatibility, I would argue the opposite: Since consecutive slashes are valid, I suggest we would need to see that collapsing them into a single slash is the status quo, otherwise we should avoid such collapsing.

Here's some evidence that we should keep consecutive slashes:

The curl CLI does not appear to collapse consecutive slashes in URLs before sending them:

$ nc -l localhost 8000 &
[1] 39380
$ curl --dump-header - --proxy localhost:8000 --silent --max-time 1 http://www.example.com/this//double/path
GET http://www.example.com/this//double/path HTTP/1.1
Host: www.example.com
User-Agent: curl/7.64.1
Accept: */*
Proxy-Connection: Keep-Alive

[1] + 39380 done nc -l localhost 8000

With NodeJS v10.18.0 and v12.16.1:


const url = require('url');
new url.URL("this//double/path", "http://www.example.com/").href
'http://www.example.com/this//double/path'

For me this is evidence enough that urljoin should *not* be collapsing consecutive slashes.
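For comparison, Python's own urljoin collapses the very same reference that curl and Node preserve, which makes the divergence easy to reproduce (behavior of the affected versions, 3.5 onward at the time of writing):

```python
from urllib.parse import urljoin

# curl and Node's URL keep the empty segment; urljoin drops it.
print(urljoin("http://www.example.com/", "this//double/path"))
# On affected versions: http://www.example.com/this/double/path
```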

openandclose mannequin commented Jun 16, 2020

To Senthil Kumaran:

https://bugs.python.org/issue40594#msg371092

As far as I know, the browsers remove them, combine them as a single URL.
(Tested on Chrome with some URLs to confirm on June 2020)

How did you test them?

orsenthil (Member) commented

How did you test them?

https://about.google//products/
https://www.google.com///search?q=something

were redirecting, collapsing the multiple slashes into single ones.

Now, when I try this one:

https://en.wikipedia.org/w//index.php?search=foobar+something

it was not collapsed.

Sorry. It is false to say that this is the default client behavior of Chrome.

iritkatriel (Member) commented

See also 25403, 37235.

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
dccioata commented Nov 1, 2022

Until the problem is fixed, I've identified a workaround (tested on Python 3.10.6): insert a slash (/) as the first character of the url argument (before the actual URL):

>>> urllib.parse.urljoin("http://a.b", "c/d//e") == "http://a.b/c/d//e"
False
>>> urllib.parse.urljoin("http://a.b", "/c/d//e") == "http://a.b/c/d//e"
True

Hope it helps.

pirate commented Apr 25, 2024

Note this also applies if the double-slashes are in the base url, e.g.:

>>> urllib.parse.urljoin('https://web.archive.org/web/https://example.com/a', 'c.html')
'https://web.archive.org/web/https:/example.com/c.html'

Archive.org is the most widespread user of double slashes in URLs that I know of; they nest the original URL inside their own. It seems like a perfect example of a good use case for double slashes in URLs.

(thanks @tqobqbq, from: ArchiveBox/ArchiveBox#1411)
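For the base-URL case above, one way to sidestep urljoin entirely is to replace only the last path segment by hand. This helper is a workaround sketch (the name urljoin_same_dir is hypothetical, and it handles only plain same-directory file-name references); it leaves the rest of the base path, double slashes included, untouched:

```python
from urllib.parse import urlsplit, urlunsplit

def urljoin_same_dir(base, name):
    # Replace only the final path segment of `base` with `name`,
    # keeping the preceding path (double slashes included) intact.
    parts = urlsplit(base)
    head, _, _ = parts.path.rpartition('/')
    return urlunsplit((parts.scheme, parts.netloc, head + '/' + name, '', ''))

print(urljoin_same_dir('https://web.archive.org/web/https://example.com/a', 'c.html'))
# https://web.archive.org/web/https://example.com/c.html
```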

commonism added a commit to commonism/yarl that referenced this issue Jun 15, 2024
urllib.parse.urljoin does not honor empty segments,
python/cpython#84774

implement using joinpath

change query string generation to not emit empty values as ""
commonism added the same commit to commonism/yarl again on Jul 3, Jul 18, and Sep 2, 2024 (identical message each time).
github-actions bot pushed a commit to aio-libs/aiohttp that referenced this issue Sep 5, 2024

Bumps yarl from 1.9.9 to 1.9.11. Sourced from yarl's release notes:

1.9.11

Bug fixes:

- Fixed a TypeError with MultiDictProxy and Python 3.8 -- by bdraco. (#1084, #1105, #1107)

Miscellaneous internal changes:

- Improved performance of encoding hosts -- by bdraco. Previously, the library would unconditionally try to parse a host as an IP address. The library now avoids trying to parse a host as an IP address if the string is not in one of the formats described in RFC 3986 section 3.2.2. (#1104)

1.9.10

Bug fixes:

- URL.join() has been changed to match RFC 3986 and align with the / operation and URL.joinpath() when joining URLs with empty segments. Previously urllib.parse.urljoin was used, which has known issues with empty segments (python/cpython#84774).

  Due to the semantics of URL.join(), joining a URL with a scheme requires making it relative, prefixing it with ./:

  >>> URL("https://web.archive.org/web/").join(URL("./https://github.com/aio-libs/yarl"))
  URL('https://web.archive.org/web/https://github.com/aio-libs/yarl')

  Empty segments are honored in the base as well as the joined part:

  >>> URL("https://web.archive.org/web/https://").join(URL("github.com/aio-libs/yarl"))
  URL('https://web.archive.org/web/https://github.com/aio-libs/yarl')

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>