GH-125866: RFC 8089 file URIs in urllib.request
Adjust `urllib.request.pathname2url()` and `url2pathname()` to generate and
accept file URIs as described in RFC 8089.

`pathname2url()` gains a new *include_scheme* argument, which defaults to
false. When set to true, the returned URL includes a `file:` prefix.

`url2pathname()` now automatically removes a `file:` prefix if present.

On Windows, `pathname2url()` now generates URIs that begin with two slashes
rather than four when given a UNC path.

On other platforms, `pathname2url()` now generates URIs that begin with
three slashes rather than one when given an absolute path. `url2pathname()`
now performs the opposite transformation, so `file:///etc/hosts` becomes
`/etc/hosts`. Furthermore, `url2pathname()` now ignores local hosts (like
"localhost" or any alias) and raises `URLError` for non-local hosts.
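
The non-Windows behaviour of `pathname2url()` described above can be sketched as follows. This is a minimal stand-in, not the committed code: `sketch_pathname2url_posix` is a hypothetical name, and it reimplements the rule with `urllib.parse.quote()` so that it runs on interpreters that do not include this change.

```python
from urllib.parse import quote


def sketch_pathname2url_posix(path, include_scheme=False):
    """Hypothetical stand-in for the non-Windows branch of pathname2url()."""
    prefix = 'file:' if include_scheme else ''
    # An absolute path is preceded by an empty authority ("//"),
    # so "/etc/hosts" becomes "///etc/hosts".
    if path.startswith('/'):
        prefix += '//'
    # quote() percent-encodes everything except "/" and unreserved characters.
    return prefix + quote(path)


print(sketch_pathname2url_posix('/etc/hosts'))                       # ///etc/hosts
print(sketch_pathname2url_posix('/etc/hosts', include_scheme=True))  # file:///etc/hosts
print(sketch_pathname2url_posix('/a/b%#c'))                          # ///a/b%25%23c
```

Relative paths have no root, so the sketch returns them quoted but without any leading slashes.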
barneygale committed Oct 29, 2024
1 parent 6742f14 commit fb92f42
Showing 6 changed files with 217 additions and 55 deletions.
31 changes: 23 additions & 8 deletions Doc/library/urllib.request.rst
@@ -147,18 +147,33 @@ The :mod:`urllib.request` module defines the following functions:
attribute to modify its position in the handlers list.


.. function:: pathname2url(path)
.. function:: pathname2url(path, include_scheme=False)

Convert the pathname *path* from the local syntax for a path to the form used in
the path component of a URL. This does not produce a complete URL. The return
value will already be quoted using the :func:`~urllib.parse.quote` function.
Convert the local pathname *path* to a percent-encoded URL. If
*include_scheme* is false (the default), the URL is returned without a
``file:`` scheme prefix; set this argument to true to generate a complete
URL.

.. versionchanged:: 3.14
The *include_scheme* argument was added.

.. function:: url2pathname(path)
.. versionchanged:: 3.14
Generates :rfc:`8089`-compliant file URLs for absolute paths. URLs for
UNC paths on Windows systems begin with two slashes (previously four).
URLs for absolute paths on non-Windows systems begin with three slashes
(previously one).


.. function:: url2pathname(url)

Convert the percent-encoded *url* to a local pathname.

.. versionchanged:: 3.14
Supports :rfc:`8089`-compliant file URLs. Raises :exc:`URLError` if a
scheme other than ``file:`` is used. If the URL uses a non-local
authority, then on Windows a UNC path is returned, and on other
platforms a :exc:`URLError` exception is raised.
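
The scheme handling described in this note can be illustrated with a short sketch. It assumes the approach visible in the `Lib/urllib/request.py` hunk of this commit, where `urlsplit()` is called with ``scheme='file'`` as the default; `sketch_split_file_url` is a hypothetical helper name and the sketch stops before any host or platform handling.

```python
from urllib.error import URLError
from urllib.parse import unquote, urlsplit


def sketch_split_file_url(url):
    """Return (authority, decoded path) for a file URL, with or without scheme."""
    # A URL without a scheme is treated as if it used "file".
    scheme, authority, path = urlsplit(url, scheme='file')[:3]
    if scheme != 'file':
        raise URLError(f'URI does not use "file" scheme: {url!r}')
    return authority, unquote(path)


print(sketch_split_file_url('file:///etc/hosts'))     # ('', '/etc/hosts')
print(sketch_split_file_url('//localhost/foo/bar'))   # ('localhost', '/foo/bar')

try:
    sketch_split_file_url('http://example.com/index.html')
except URLError as exc:
    print('rejected:', exc.reason)
```

Whether the returned authority is accepted then depends on the platform: the note above says Windows turns a non-local authority into a UNC path, while other platforms raise :exc:`URLError`.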

Convert the path component *path* from a percent-encoded URL to the local syntax for a
path. This does not accept a complete URL. This function uses
:func:`~urllib.parse.unquote` to decode *path*.

.. function:: getproxies()

22 changes: 22 additions & 0 deletions Doc/whatsnew/3.14.rst
@@ -447,6 +447,28 @@ unittest
(Contributed by Jacob Walls in :gh:`80958`.)


urllib.request
--------------

* Improve support for ``file:`` URIs in :mod:`urllib.request`:

* :func:`~urllib.request.pathname2url` accepts an *include_scheme*
argument, which defaults to false. When set to true, a complete URL
with a ``file:`` prefix is returned.
* :func:`~urllib.request.url2pathname` discards a ``file:`` prefix if given.
* On Windows, :func:`~urllib.request.pathname2url` generates URIs that
begin with two slashes (rather than four) when given a UNC path.
* On non-Windows platforms, :func:`~urllib.request.pathname2url` generates
URIs that begin with three slashes (rather than one) when given an
absolute path. :func:`~urllib.request.url2pathname` performs the opposite
transformation, so ``file:///etc/hosts`` becomes ``/etc/hosts``.
* On non-Windows platforms, :func:`~urllib.request.url2pathname` raises
:exc:`urllib.error.URLError` if the URI includes a non-local authority,
like ``file://other-machine/etc/hosts``.

(Contributed by Barney Gale in :gh:`125866`.)


.. Add improved modules above alphabetically, not here at the end.
Optimizations
4 changes: 2 additions & 2 deletions Lib/nturl2path.py
@@ -1,9 +1,9 @@
"""Convert a NT pathname to a file URL and vice versa.
This module only exists to provide OS-specific code
This module previously provided OS-specific code
for urllib.request, thus do not use directly.
"""
# Testing is done through test_urllib.
# Testing is done through test_nturl2path.

def url2pathname(url):
"""OS-specific conversion from a relative URL of the 'file' scheme
111 changes: 111 additions & 0 deletions Lib/test/test_nturl2path.py
@@ -0,0 +1,111 @@
import nturl2path
import unittest
import urllib.parse


class nturl2path_Tests(unittest.TestCase):
"""Test pathname2url() and url2pathname()"""

def test_basic(self):
# Make sure simple tests pass
expected_path = "parts\\of\\a\\path"
expected_url = "parts/of/a/path"
result = nturl2path.pathname2url(expected_path)
self.assertEqual(expected_url, result,
"pathname2url() failed; %s != %s" %
(result, expected_url))
result = nturl2path.url2pathname(expected_url)
self.assertEqual(expected_path, result,
"url2pathname() failed; %s != %s" %
(result, expected_path))

def test_quoting(self):
# Test automatic quoting and unquoting works for pathname2url() and
# url2pathname() respectively
given = "needs\\quot=ing\\here"
expect = "needs/%s/here" % urllib.parse.quote("quot=ing")
result = nturl2path.pathname2url(given)
self.assertEqual(expect, result,
"pathname2url() failed; %s != %s" %
(expect, result))
expect = given
result = nturl2path.url2pathname(result)
self.assertEqual(expect, result,
"url2pathname() failed; %s != %s" %
(expect, result))
given = "make sure\\using_quote"
expect = "%s/using_quote" % urllib.parse.quote("make sure")
result = nturl2path.pathname2url(given)
self.assertEqual(expect, result,
"pathname2url() failed; %s != %s" %
(expect, result))
given = "make+sure/using_unquote"
expect = "make+sure\\using_unquote"
result = nturl2path.url2pathname(given)
self.assertEqual(expect, result,
"url2pathname() failed; %s != %s" %
(expect, result))

def test_pathname2url(self):
# Test special prefixes are correctly handled in pathname2url()
fn = nturl2path.pathname2url
self.assertEqual(fn('\\\\?\\C:\\dir'), '///C:/dir')
self.assertEqual(fn('\\\\?\\unc\\server\\share\\dir'), '/server/share/dir')
self.assertEqual(fn("C:"), '///C:')
self.assertEqual(fn("C:\\"), '///C:')
self.assertEqual(fn('C:\\a\\b.c'), '///C:/a/b.c')
self.assertEqual(fn('C:\\a\\b%#c'), '///C:/a/b%25%23c')
self.assertEqual(fn('C:\\a\\b\xe9'), '///C:/a/b%C3%A9')
self.assertEqual(fn('C:\\foo\\bar\\spam.foo'), "///C:/foo/bar/spam.foo")
# Long drive letter
self.assertRaises(IOError, fn, "XX:\\")
# No drive letter
self.assertEqual(fn("\\folder\\test\\"), '/folder/test/')
self.assertEqual(fn("\\\\folder\\test\\"), '////folder/test/')
self.assertEqual(fn("\\\\\\folder\\test\\"), '/////folder/test/')
self.assertEqual(fn('\\\\some\\share\\'), '////some/share/')
self.assertEqual(fn('\\\\some\\share\\a\\b.c'), '////some/share/a/b.c')
self.assertEqual(fn('\\\\some\\share\\a\\b%#c\xe9'), '////some/share/a/b%25%23c%C3%A9')
# Round-tripping
urls = ['///C:',
'/////folder/test/',
'///C:/foo/bar/spam.foo']
for url in urls:
self.assertEqual(fn(nturl2path.url2pathname(url)), url)

def test_url2pathname_win(self):
fn = nturl2path.url2pathname
self.assertEqual(fn('/C:/'), 'C:\\')
self.assertEqual(fn("///C|"), 'C:')
self.assertEqual(fn("///C:"), 'C:')
self.assertEqual(fn('///C:/'), 'C:\\')
self.assertEqual(fn('/C|//'), 'C:\\')
self.assertEqual(fn('///C|/path'), 'C:\\path')
# No DOS drive
self.assertEqual(fn("///C/test/"), '\\\\\\C\\test\\')
self.assertEqual(fn("////C/test/"), '\\\\C\\test\\')
# DOS drive paths
self.assertEqual(fn('C:/path/to/file'), 'C:\\path\\to\\file')
self.assertEqual(fn('C|/path/to/file'), 'C:\\path\\to\\file')
self.assertEqual(fn('/C|/path/to/file'), 'C:\\path\\to\\file')
self.assertEqual(fn('///C|/path/to/file'), 'C:\\path\\to\\file')
self.assertEqual(fn("///C|/foo/bar/spam.foo"), 'C:\\foo\\bar\\spam.foo')
# Non-ASCII drive letter
self.assertRaises(IOError, fn, "///\u00e8|/")
# UNC paths
self.assertEqual(fn('//server/path/to/file'), '\\\\server\\path\\to\\file')
self.assertEqual(fn('////server/path/to/file'), '\\\\server\\path\\to\\file')
self.assertEqual(fn('/////server/path/to/file'), '\\\\\\server\\path\\to\\file')
# Localhost paths
self.assertEqual(fn('//localhost/C:/path/to/file'), 'C:\\path\\to\\file')
self.assertEqual(fn('//localhost/C|/path/to/file'), 'C:\\path\\to\\file')
# Round-tripping
paths = ['C:',
r'\\\C\test\\',
r'C:\foo\bar\spam.foo']
for path in paths:
self.assertEqual(fn(nturl2path.pathname2url(path)), path)


if __name__ == '__main__':
unittest.main()
14 changes: 7 additions & 7 deletions Lib/test/test_urllib.py
@@ -1551,9 +1551,9 @@ def test_pathname2url_win(self):
'test specific to POSIX pathnames')
def test_pathname2url_posix(self):
fn = urllib.request.pathname2url
self.assertEqual(fn('/'), '/')
self.assertEqual(fn('/a/b.c'), '/a/b.c')
self.assertEqual(fn('/a/b%#c'), '/a/b%25%23c')
self.assertEqual(fn('/'), '///')
self.assertEqual(fn('/a/b.c'), '///a/b.c')
self.assertEqual(fn('/a/b%#c'), '///a/b%25%23c')

@unittest.skipUnless(sys.platform == 'win32',
'test specific to Windows pathnames.')
@@ -1595,10 +1595,10 @@ def test_url2pathname_win(self):
def test_url2pathname_posix(self):
fn = urllib.request.url2pathname
self.assertEqual(fn('/foo/bar'), '/foo/bar')
self.assertEqual(fn('//foo/bar'), '//foo/bar')
self.assertEqual(fn('///foo/bar'), '///foo/bar')
self.assertEqual(fn('////foo/bar'), '////foo/bar')
self.assertEqual(fn('//localhost/foo/bar'), '//localhost/foo/bar')
self.assertRaises(urllib.error.URLError, fn, '//foo/bar')
self.assertEqual(fn('///foo/bar'), '/foo/bar')
self.assertEqual(fn('////foo/bar'), '//foo/bar')
self.assertEqual(fn('//localhost/foo/bar'), '/foo/bar')

class Utility_Tests(unittest.TestCase):
"""Testcase to test the various utility functions in the urllib."""
90 changes: 52 additions & 38 deletions Lib/urllib/request.py
@@ -1448,16 +1448,6 @@ def parse_http_list(s):
return [part.strip() for part in res]

class FileHandler(BaseHandler):
# Use local file or FTP depending on form of URL
def file_open(self, req):
url = req.selector
if url[:2] == '//' and url[2:3] != '/' and (req.host and
req.host != 'localhost'):
if not req.host in self.get_names():
raise URLError("file:// scheme is supported only on localhost")
else:
return self.open_local_file(req)

# names for the localhost
names = None
def get_names(self):
@@ -1474,8 +1464,7 @@ def get_names(self):
def open_local_file(self, req):
import email.utils
import mimetypes
host = req.host
filename = req.selector
filename = req.full_url
localfile = url2pathname(filename)
try:
stats = os.stat(localfile)
@@ -1485,24 +1474,22 @@ def open_local_file(self, req):
headers = email.message_from_string(
'Content-type: %s\nContent-length: %d\nLast-modified: %s\n' %
(mtype or 'text/plain', size, modified))
if host:
host, port = _splitport(host)
if not host or \
(not port and _safe_gethostbyname(host) in self.get_names()):
if host:
origurl = 'file://' + host + filename
else:
origurl = 'file://' + filename
return addinfourl(open(localfile, 'rb'), headers, origurl)
return addinfourl(open(localfile, 'rb'), headers, filename)
except OSError as exp:
raise URLError(exp)
raise URLError('file not on local host')

def _safe_gethostbyname(host):
file_open = open_local_file


def _is_local_host(host):
if not host or host == 'localhost':
return True
try:
return socket.gethostbyname(host)
name = socket.gethostbyname(host)
except socket.gaierror:
return None
return False
return name in FileHandler().get_names()
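
A hedged illustration of the host check above: the sketch below mirrors the logic of `_is_local_host()` but takes the set of local addresses and the resolver as parameters, so it can be exercised without DNS. `sketch_is_local_host`, `fake_resolver` and the host names and addresses are hypothetical; the committed function uses `socket.gethostbyname()` and `FileHandler().get_names()` directly.

```python
import socket


def sketch_is_local_host(host, local_addresses, resolve=socket.gethostbyname):
    """Return True if *host* is empty, 'localhost', or resolves to a local address."""
    if not host or host == 'localhost':
        return True
    try:
        address = resolve(host)
    except socket.gaierror:
        # Unresolvable names are treated as non-local.
        return False
    return address in local_addresses


def fake_resolver(host):
    """Deterministic stand-in for socket.gethostbyname() (illustration only)."""
    table = {'my-alias': '127.0.0.1', 'other-machine': '203.0.113.7'}
    try:
        return table[host]
    except KeyError:
        raise socket.gaierror(f'unknown host: {host}') from None


local = {'127.0.0.1'}
print(sketch_is_local_host('', local, fake_resolver))               # True
print(sketch_is_local_host('my-alias', local, fake_resolver))       # True
print(sketch_is_local_host('other-machine', local, fake_resolver))  # False
```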


class FTPHandler(BaseHandler):
def ftp_open(self, req):
@@ -1649,19 +1636,46 @@ def data_open(self, req):

MAXFTPCACHE = 10 # Trim the ftp cache beyond this size

# Helper for non-unix systems
if os.name == 'nt':
from nturl2path import url2pathname, pathname2url
else:
def url2pathname(pathname):
"""OS-specific conversion from a relative URL of the 'file' scheme
to a file system path; not recommended for general use."""
return unquote(pathname)

def pathname2url(pathname):
"""OS-specific conversion from a file system path to a relative URL
of the 'file' scheme; not recommended for general use."""
return quote(pathname)
def pathname2url(path, include_scheme=False):
"""Convert the local pathname *path* to a percent-encoded URL."""
prefix = 'file:' if include_scheme else ''
if os.name == 'nt':
path = path.replace('\\', '/')
drive, root, tail = os.path.splitroot(path)
if drive:
if drive[1:2] == ':':
prefix += '///'
elif root:
prefix += '//'
tail = quote(tail)
return prefix + drive + root + tail

def url2pathname(url):
"""Convert the percent-encoded URL *url* to a local pathname."""
scheme, authority, path = urlsplit(url, scheme='file')[:3]
if scheme != 'file':
raise URLError(f'URI does not use "file" scheme: {url!r}')
if os.name == 'nt':
path = unquote(path)
if authority and authority != 'localhost':
# e.g. file://server/share/path
path = f'//{authority}{path}'
elif path.startswith('///'):
# e.g. file://///server/share/path
path = path[1:]
else:
if path[0:1] == '/' and path[2:3] in ':|':
# e.g. file:////c:/path
path = path[1:]
if path[1:2] == '|':
# e.g. file:///c|path
path = path[:1] + ':' + path[2:]
path = path.replace('/', '\\')
else:
if not _is_local_host(authority):
raise URLError(f'file URI not on local host: {url!r}')
path = unquote(path)
return path
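
The Windows branch of `url2pathname()` above only manipulates strings, so it can be tried on any platform. The sketch below is a reading of that branch, not the committed code: the diff view has lost its indentation, so the nesting is an assumption, `sketch_url2pathname_nt` is a hypothetical name, the scheme check is omitted, and the drive test uses tuple membership instead of `in ':|'` so that an empty slice does not match.

```python
from urllib.parse import unquote, urlsplit


def sketch_url2pathname_nt(url):
    """Platform-independent sketch of the Windows branch of url2pathname()."""
    _scheme, authority, path = urlsplit(url, scheme='file')[:3]
    path = unquote(path)
    if authority and authority != 'localhost':
        # e.g. file://server/share/path becomes a UNC path
        path = f'//{authority}{path}'
    elif path.startswith('///'):
        # e.g. file://///server/share/path
        path = path[1:]
    else:
        if path[0:1] == '/' and path[2:3] in (':', '|'):
            # e.g. file:///c:/path -> drop the slash before the drive
            path = path[1:]
        if path[1:2] == '|':
            # e.g. file:///c|/path -> legacy "|" drive separator
            path = path[:1] + ':' + path[2:]
    return path.replace('/', '\\')


print(sketch_url2pathname_nt('file:///C:/path/to/file'))      # C:\path\to\file
print(sketch_url2pathname_nt('file://server/share/file'))     # \\server\share\file
print(sketch_url2pathname_nt('//localhost/C|/path/to/file'))  # C:\path\to\file
```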


ftpcache = {}
