downloading of large files fails with urllib.request with recent Python 3.x #3455

Closed
zao opened this issue Oct 1, 2020 · 8 comments · Fixed by #3614

@zao
Contributor

zao commented Oct 1, 2020

In download_file we use urllib.request, which raises an OverflowError in url_fd.read() when trying to read a large file all at once.

It can be reproduced with this script:

import urllib.request

# Any response body larger than 2 GiB triggers the error; this CUDA installer is ~2.9 GB.
x = urllib.request.urlopen('https://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run')
x.read()  # raises OverflowError on Python 3.8
Traceback (most recent call last):
  File "./foo.py", line 5, in <module>
    x.read()
  File "/usr/lib/python3.8/http/client.py", line 467, in read
    s = self._safe_read(self.length)
  File "/usr/lib/python3.8/http/client.py", line 608, in _safe_read
    data = self.fp.read(amt)
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
OverflowError: signed integer is greater than maximum

While urllib is bugged like this, we need to either read the response in chunks ourselves or skip the naive read entirely and combine it with the subsequent write to file via shutil.copyfileobj:

import shutil

with open('/dev/shm/out.raw', 'wb') as fh:
    shutil.copyfileobj(x, fh)

As a bonus, we won't be reading the whole thing into memory, possibly exhausting it.
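For reference, a minimal sketch of the chunked-read alternative mentioned above; the URL, output path, and chunk size here are placeholders, not from the original report:

import urllib.request

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB per read, comfortably below the 2 GiB limit

with urllib.request.urlopen('https://example.com/big-file') as resp, \
        open('/dev/shm/out.raw', 'wb') as fh:
    while True:
        chunk = resp.read(CHUNK_SIZE)
        if not chunk:  # empty read means end of stream
            break
        fh.write(chunk)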

@zao
Contributor Author

zao commented Oct 1, 2020

This is Ubuntu 20.04 with:

Python 3.8.2 (default, Jul 16 2020, 14:00:26)
[GCC 9.3.0] on linux

@boegel
Member

boegel commented Oct 1, 2020

Seems like a good idea to always do it?

How big does the file need to be to trigger this issue? Would be good to get this covered by the tests...

@boegel boegel added this to the next release (4.3.1) milestone Oct 1, 2020
@zao
Contributor Author

zao commented Oct 1, 2020

CUDA is 2924 MB, so guessing wildly at a 2 GiB threshold.

@zao
Contributor Author

zao commented Oct 1, 2020

2048 megabytes doesn't trigger it, but 2049 megabytes does, i.e. the boundary sits right at 2 GiB, consistent with the signed 32-bit length limit behind the OverflowError.

@Flamefire
Contributor

I think we could change

with open(path, mode) as handle:
    handle.write(data)

to handle file-like objects, so we don't need to pass x.read() to it. I'm not sure how to uniformly handle both, but it could well be worth it, since similar issues apply when copying a file via write_file.
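One possible shape for this, as a minimal sketch only; the write_file signature and mode handling here are assumptions, not EasyBuild's actual implementation:

import shutil

def write_file(path, data, mode='wb'):
    # 'data' may be bytes/str or a file-like object such as the urlopen response.
    # 'mode' must match the data type: 'wb' for bytes and binary streams, 'w' for str.
    with open(path, mode) as handle:
        if hasattr(data, 'read'):
            # Stream file-like objects to disk in chunks instead of loading
            # the whole thing into memory first.
            shutil.copyfileobj(data, handle)
        else:
            handle.write(data)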

@verdurin
Member

verdurin commented Mar 3, 2021

Seeing this with MCR, after fixing the URLs to be https.

@boegel changed the title from "Download of large file fails with urllib.request" to "downloading of large files fails with urllib.request with recent Python 3.x" on Mar 16, 2021
Flamefire added a commit to Flamefire/easybuild-framework that referenced this issue Mar 17, 2021
Fixes downloads of large files where the size exceeds the maximum integer size
Fixes easybuilders#3455
@boegel
Member

boegel commented Mar 17, 2021

I'm looking into a test that triggers this issue (to include in PR #3614 which fixes this).

Some more context here:

Especially the last part makes this difficult to reproduce in a test, since you can't trigger the problem with a big enough local file (unless you somehow serve that over HTTPS)... I don't think downloading a 2049MB file every time we run the tests is a good idea. :)
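Purely as an illustration of the "serve it over HTTPS" idea, a sketch of a local TLS file server using only the standard library; the certificate paths and port are assumptions, and this is not necessarily the approach the PR's test takes:

import http.server
import ssl

# Assumes a self-signed cert/key pair exists, e.g. created with:
#   openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem \
#       -days 1 -nodes -subj /CN=localhost
httpd = http.server.HTTPServer(('localhost', 8443),
                               http.server.SimpleHTTPRequestHandler)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain('cert.pem', 'key.pem')
httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
httpd.serve_forever()  # serves the current directory at https://localhost:8443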

@Flamefire
Contributor

I wouldn't make it so complicated. Reading a downloaded file completely into memory before storing it on disk was IMO never a good idea; the PR fixes that, which, as a side effect, also avoids the problem here. ;)
We should probably have used a higher-level function like urllib.request.urlretrieve, which already buffers reads and stores directly to a file.
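For reference, urlretrieve copies the response to the target file in fixed-size blocks, so the whole body is never held in memory at once; the URL and target path here are placeholders:

import urllib.request

urllib.request.urlretrieve(
    'https://example.com/big-file',  # placeholder URL
    '/tmp/out.raw',                  # placeholder target path
)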

bartoldeman pushed a commit to ComputeCanada/easybuild-framework that referenced this issue Apr 26, 2021
Fixes downloads of large files where the size exceeds the maximum integer size
Fixes easybuilders#3455