downloading of large files fails with urllib.request with recent Python 3.x #3455

Closed
zao opened this issue Oct 1, 2020 · 8 comments · Fixed by #3614

@zao
Contributor

zao commented Oct 1, 2020

In download_file we use urllib.request, which raises an OverflowError in url_fd.read() when trying to read a large file all at once.

It can be reproduced with this script:

import urllib.request

# Any response body larger than 2 GiB triggers the error; this CUDA installer is ~2.9 GB.
x = urllib.request.urlopen('https://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run')
x.read()  # raises OverflowError on Python 3.8
Traceback (most recent call last):
  File "./foo.py", line 5, in <module>
    x.read()
  File "/usr/lib/python3.8/http/client.py", line 467, in read
    s = self._safe_read(self.length)
  File "/usr/lib/python3.8/http/client.py", line 608, in _safe_read
    data = self.fp.read(amt)
  File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.8/ssl.py", line 1241, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.8/ssl.py", line 1099, in read
    return self._sslobj.read(len, buffer)
OverflowError: signed integer is greater than maximum

While urllib is bugged like this, we need to either read the response in chunks ourselves or skip the naive read entirely and combine it with the subsequent write to file via shutil.copyfileobj:

import shutil

with open('/dev/shm/out.raw', 'wb') as fh:
    shutil.copyfileobj(x, fh)

As a bonus, we won't be reading the whole thing into memory, possibly exhausting it.
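For reference, a minimal sketch of the chunked-read alternative mentioned above; the URL, output path, and chunk size here are placeholders, not from the original report:

import urllib.request

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB per read, comfortably below the 2 GiB limit

with urllib.request.urlopen('https://example.com/big-file') as resp, \
        open('/dev/shm/out.raw', 'wb') as fh:
    while True:
        chunk = resp.read(CHUNK_SIZE)
        if not chunk:  # empty read means end of stream
            break
        fh.write(chunk)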

@zao
Contributor Author

zao commented Oct 1, 2020

This is Ubuntu 20.04 with:

Python 3.8.2 (default, Jul 16 2020, 14:00:26)
[GCC 9.3.0] on linux

@boegel
Member

boegel commented Oct 1, 2020

Seems like a good idea to always do it?

How big does the file need to be to trigger this issue? Would be good to get this covered by the tests...

@boegel boegel added this to the next release (4.3.1) milestone Oct 1, 2020
@zao
Contributor Author

zao commented Oct 1, 2020

CUDA is 2924 MB, so guessing wildly at a 2 GiB threshold.

@zao
Contributor Author

zao commented Oct 1, 2020

2048 megabytes doesn't trigger it, but 2049 megabytes does, i.e. the boundary sits right at 2 GiB, consistent with the signed 32-bit length limit behind the OverflowError.

@Flamefire
Contributor

I think we could change

with open(path, mode) as handle:
    handle.write(data)

to handle file-like objects, so we don't need to pass x.read() to it. I'm not sure how to uniformly handle both, but it could well be worth it, since similar issues apply when copying a file via write_file.
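One possible shape for this, as a minimal sketch only; the write_file signature and mode handling here are assumptions, not EasyBuild's actual implementation:

import shutil

def write_file(path, data, mode='wb'):
    # 'data' may be bytes/str or a file-like object such as the urlopen response.
    # 'mode' must match the data type: 'wb' for bytes and binary streams, 'w' for str.
    with open(path, mode) as handle:
        if hasattr(data, 'read'):
            # Stream file-like objects to disk in chunks instead of loading
            # the whole thing into memory first.
            shutil.copyfileobj(data, handle)
        else:
            handle.write(data)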

@verdurin
Member

verdurin commented Mar 3, 2021

Seeing this with MCR, after fixing the URLs to be https.

@boegel changed the title from "Download of large file fails with urllib.request" to "downloading of large files fails with urllib.request with recent Python 3.x" on Mar 16, 2021
Flamefire added a commit to Flamefire/easybuild-framework that referenced this issue Mar 17, 2021
Fixes downloads of large files where the size exceeds the maximum integer size
Fixes easybuilders#3455
@boegel
Member

boegel commented Mar 17, 2021

I'm looking into a test that triggers this issue (to include in PR #3614 which fixes this).

Some more context here:

Especially the last part makes this difficult to reproduce in a test, since you can't trigger the problem with a big enough local file (unless you somehow serve that over HTTPS)... I don't think downloading a 2049MB file every time we run the tests is a good idea. :)
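Purely as an illustration of the "serve it over HTTPS" idea, a sketch of a local TLS file server using only the standard library; the certificate paths and port are assumptions, and this is not necessarily the approach the PR's test takes:

import http.server
import ssl

# Assumes a self-signed cert/key pair exists, e.g. created with:
#   openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem \
#       -days 1 -nodes -subj /CN=localhost
httpd = http.server.HTTPServer(('localhost', 8443),
                               http.server.SimpleHTTPRequestHandler)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.load_cert_chain('cert.pem', 'key.pem')
httpd.socket = ctx.wrap_socket(httpd.socket, server_side=True)
httpd.serve_forever()  # serves the current directory at https://localhost:8443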

@Flamefire
Contributor

I wouldn't make it so complicated. Reading a downloaded file completely into memory before storing it on disk was IMO never a good idea; the PR fixes that, which, as a side effect, also avoids the problem here. ;)
We should probably have used a higher-level function like urllib.request.urlretrieve, which already buffers reads and stores directly to a file.
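For reference, urlretrieve copies the response to the target file in fixed-size blocks, so the whole body is never held in memory at once; the URL and target path here are placeholders:

import urllib.request

urllib.request.urlretrieve(
    'https://example.com/big-file',  # placeholder URL
    '/tmp/out.raw',                  # placeholder target path
)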

bartoldeman pushed a commit to ComputeCanada/easybuild-framework that referenced this issue Apr 26, 2021
Fixes downloads of large files where the size exceeds the maximum integer size
Fixes easybuilders#3455