This repository has been archived by the owner on Feb 2, 2024. It is now read-only.

web2py + IE8 + Downloading Files > 64kb = Corrupt Downloads #1

Open
explorigin opened this issue Jan 21, 2012 · 4 comments

@explorigin
Owner

web2py users have reported corrupt downloads using Rocket. It seems that only IE8 (and lower versions) are affected. I can reproduce this with web2py but I cannot reproduce it with Rocket alone.

web2py with any other webserver does not exhibit this issue. There is some interaction between Rocket and web2py that causes this.

This also only happens when downloading files. Uploaded files seem to be unaffected.

In this scenario, files are sent to the browser in 64 KB blocks. In what seem to be random circumstances, 4 KB may be missing from the beginning of a block. I've never seen this happen with the first block; it typically first shows up on the 4th or 5th block.
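For context, this kind of block-wise streaming can be sketched as a generator that yields the file in fixed-size parts, each of which the server then sends separately. This is a minimal sketch with an assumed 64 KB chunk size, not web2py's actual code:

```python
def stream_file(path, chunk_size=65536):
    # Minimal sketch of a web2py-style streamer: yield the file in
    # 64 KB blocks; the server sends each block with its own sendall(),
    # so losing one send loses a whole (or partial) block mid-stream.
    with open(path, 'rb') as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                return
            yield block
```

If any one of those per-block sends fails silently, the client sees a file with a hole in the middle, which matches the corruption described above.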

The steps to reproduce this in web2py are detailed here: http://groups.google.com/group/web2py/browse_thread/thread/d7f6faddb841790b/d67ed796649fc3f1?pli=1

Any help with this issue would be much appreciated.

@nick-name

I have seen this issue happen before, but unfortunately wasn't able to reproduce it predictably.

Prompted by this request, after reading the code for a couple of hours, I'm almost certain the problem is an interaction between Python's socket.py wrapper (its write() and/or sendall()) and Rocket's exception handling, which would eventually close() a timed-out socket and thereby flush the remainder of the failed sendall(), or something along those lines. I haven't worked out the exact scenario, but from reading the code it seems impossible to recover from a Python sendall() timeout, which Rocket nevertheless tries to do (by not immediately terminating the connection).

Python's sendall() is not atomic with respect to timeouts; any part of the data might already have been sent when the timeout exception is raised. This is further complicated by the socket.py wrapper, which adds buffering to write()/writelines() and performs a flush() at close() -- so if a sequence of write()s was followed by a sendall() that timed out, a later close() would flush() the remains of the write()s, and the resulting output is unpredictable. (If I'm reading the source correctly, it would duplicate some bytes rather than skip some, but perhaps I'm misreading it.)
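The non-atomicity can be shown with a toy demonstration. FlakySocket below is a hypothetical stand-in for a real connection whose sendall() transmits a 4 KB prefix and then times out; a caller that swallows the timeout (as Rocket effectively does) silently drops the unsent remainder of each block:

```python
import socket

class FlakySocket:
    """Toy stand-in for a real socket (hypothetical, for illustration):
    sendall() transmits part of the data and then raises socket.timeout,
    just as the real sendall() can when the peer stalls."""
    def __init__(self):
        self.wire = bytearray()  # bytes that actually reached the peer

    def sendall(self, data):
        self.wire += data[:4096]           # a prefix goes out...
        raise socket.timeout('timed out')  # ...then the call fails

sock = FlakySocket()
blocks = [b'A' * 65536, b'B' * 65536]
sent_blocks = 0
for block in blocks:
    try:
        sock.sendall(block)
        sent_blocks += 1
    except socket.timeout:
        # Rocket-style "recovery": swallow the timeout and keep going.
        # The 61440 unsent bytes of this block are silently lost.
        pass

# The peer received only a 4 KB prefix of each 64 KB block:
assert bytes(sock.wire) == b'A' * 4096 + b'B' * 4096
```

The caller has no way to learn that only 4096 bytes of each block went out, which is exactly why continuing to send after a sendall() timeout produces missing-middle corruption.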

In case that is not the problem, the following observations might still help, however:

  • The transfer encoding when this happened wasn't chunked.
  • The problem occurs with multiple clients (Firefox, wget, curl).
  • The missing block is not always at the same place, nor always of the same length.
  • The chunk size passed to the web2py streamer affects how many bytes arrive before the first corruption: the bigger the chunk size, the more bytes get transferred before the first missing block. However, this is purely statistical, and a block won't always get lost. The size of the missing block is not necessarily a multiple of the chunk size, but it was always a "round" binary number (I've seen 1024, 2048, 4096, 65536 with various chunk sizes; I think also with round decimal chunk sizes like 40000, but I'm not sure). I'll be able to try again next week.
  • The download speed appears to play a part: if the download is slow (between two hosts with ~50 KB/s bandwidth), it is much more likely to happen. If the download is immediate (e.g. localhost, no throttling), it is very unlikely to happen.
  • Rocket/web2py is sure it sends the entire file. However, wget on the other side will notice that not all of the file made it and will download the remaining bytes again (with a byte-range request), which is not the right thing to do, of course. So if you use wget for testing, the file length will always come out right; compare content instead.
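Following the last point, a length check is useless once wget has patched the file back up with a byte-range request; comparing checksums catches the corruption. A small helper for that (the two file names at the end are placeholders, not files this issue provides):

```python
import hashlib

def sha256_of(path):
    # Hash the file in chunks so large downloads aren't read into
    # memory all at once.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

# 'original.data' / 'downloaded.data' are placeholder names:
# assert sha256_of('original.data') == sha256_of('downloaded.data')
```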

While trying to figure this out, I noticed two small problems in rocket.py, neither of which (as far as I can tell) is the cause, but which you might want to fix:

  • in WSGIWorker.write(), there's a sendall(data) which should probably be sendall(b(data)) (but I'm not sure).
  • in WSGIWorker.run_app(), the makefile() call in the PY3K branch uses Python 2.x arguments, and the other branch is a bug. (Note that there are other makefile() calls that use the 2.x arguments unconditionally.)

While I haven't been able to find the reason, my suspects are:

  • sendall() - I've read the C source that implements it a hundred times or so, and it seems fine -- and yet, something there might not be right. If I manage to reproduce the bug predictably, I'd try rewriting sendall() in Python using send().
  • internal_select() - also in the C source, and also used by send(), so if the problem is here, the above won't help. However, setting the socket timeout to 0.0 would skip select() entirely and rely on the OS to block.
  • some kind of race condition that happens when the browser/wget does a TCP half-close of its reading side. I can't find one, but e.g. blocking_read will silently turn any exception into a "socket closed" response, even if it is EINTR or the like.
  • the standard library's own socket.py "fileobject" wrapper: I suspect flush() has a bug that interacts with timeouts. If the timeout is reached inside the sendall() call, but after some of the data has already been successfully written by that sendall(), the finally clause would ignore that part of the data and repeat it if the timeout is ever recovered from -- but then data would be duplicated, rather than missing.
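The send()-based rewrite mentioned in the first suspect can be sketched as follows. Unlike the stdlib sendall(), this helper can report how many bytes were actually delivered when a timeout hits, so the caller can resume or abort cleanly. SlowSocket is a hypothetical stand-in used only to exercise the helper:

```python
import socket

def sendall_tracked(sock, data):
    """Sketch of a send()-based sendall() replacement: on timeout it
    returns the number of bytes actually delivered, so the caller can
    resume from data[total:] instead of guessing (hypothetical helper,
    not Rocket code)."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        try:
            total += sock.send(view[total:])
        except socket.timeout:
            return total  # partial: caller may retry from data[total:]
    return total

class SlowSocket:
    """Toy socket (hypothetical): send() accepts at most 10 bytes per
    call and times out once after 20 bytes have gone out."""
    def __init__(self):
        self.wire = bytearray()
        self._hiccup = True

    def send(self, data):
        if self._hiccup and len(self.wire) >= 20:
            self._hiccup = False
            raise socket.timeout('timed out')
        chunk = bytes(data[:10])
        self.wire += chunk
        return len(chunk)

sock = SlowSocket()
payload = b'x' * 55
n = sendall_tracked(sock, payload)       # first call hits the timeout
n += sendall_tracked(sock, payload[n:])  # resume exactly where we left off
assert n == 55 and bytes(sock.wire) == payload
```

The point is not this particular API, but that send() in a loop preserves the byte count that sendall() throws away when it raises.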

@nick-name

OK, the culprit is definitely ignoring exceptions raised in sendall().

How to reproduce: you need a WSGI worker that produces output in parts (that is, returns a list or yields parts as a generator), e.g. web2py's "static" file server (which goes through WSGI and does not use the FileSystemWorker).

  1. Make sure a large payload is produced, and that it is made of many small parts, e.g. put a 10 MB file in web2py/applications/welcome/static/file10mb.data (web2py will use 64K parts by default).
  2. Consume the file slowly, e.g. wget --limit-rate=100k http://localhost:8000/welcome/static/file10mb.data; this takes 100 seconds to download the whole file even on localhost.
  3. Let the file download for 10 seconds, then pause wget (e.g. suspend it with Ctrl-Z on Linux/OS X).
  4. Wait 20 seconds.
  5. Let it continue (e.g. type 'fg' if you suspended it with Ctrl-Z).
  6. Notice that when it reaches the end, wget will complain about missing bytes, reconnect, and download the rest of the file (and will be happy with it). However, the file will be corrupt: one or more blocks will be missing from the middle, and the last few blocks will be repeated (by the second wget connection; if you prevent wget from resuming, the file will just be shorter).

A better idea of where the problem is can be seen from the following ugly patch (applied against web2py's "one file" rocket.py):

@@ -1929,6 +1929,9 @@ class WSGIWorker(Worker):
                 self.conn.sendall(b('%x\r\n%s\r\n' % (len(data), data)))
             else:
                 self.conn.sendall(data)
+        except socket.timeout:
+            self.closeConnection = True
+            print 'Exception lost'
         except socket.error:
             # But some clients will close the connection before that
             # resulting in a socket error.

Running the same experiment with the patched rocket.py shows that files get corrupted whenever 'Exception lost' is printed to web2py's terminal.

Discussion: the only way to use sendall() reliably is to terminate the connection immediately upon any error (including timeout), since there is no way to know how many bytes were sent. (That there is no way to know how many bytes were sent is clearly stated in the documentation; the implication that it is impossible to recover reliably is not.) However, there are sendall() calls all over rocket.py, and some result in additional sendall() calls following a failed one. The worst offender seems to be WSGIWorker.write(), but I'm not sure the other sendall() calls are safe either.
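That "terminate immediately" policy can be sketched as a small wrapper that never allows another write after a failed sendall() on the same connection (the helper name is hypothetical, not Rocket's API):

```python
import socket

def send_or_die(conn, data):
    """Per the discussion: after *any* sendall() failure, including a
    timeout, the stream position is unknown, so the only safe move is
    to close the connection and propagate the error (sketch only)."""
    try:
        conn.sendall(data)
    except OSError:  # covers socket.error and socket.timeout on Python 3
        try:
            conn.close()
        finally:
            raise
```

Any higher-level code then treats the propagated exception as "this connection is dead" rather than attempting further sendall() calls on it.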

Temporary workarounds: increase SOCKET_TIMEOUT significantly (the default is 1 second; bump it to e.g. 10), and don't swallow socket.timeout in WSGIWorker.write().

Increasing the chunk size is NOT helpful, because it only changes the number of bytes before the first loss (at a given bandwidth); from that point on, the problem is the same.

@explorigin
Owner Author

Many thanks. I'll dig into this soon.

@nick-name

Any updates / insights?

Thanks in advance
