Reading files with HTTP fails if range headers are not supported #411
Conversation
Example traceback with debugging printout, without this fix applied:

    Got response of size 20166303
    chunksize is 1048576
    -1765915941 -30694 -1671616656 4090602292 -7813 16221 -118717542 708660270
    self._compressedbytes, self._uncompressedbytes = -1765908128 -1671616656
    cursor.index - start 0
    self._compressedbytes -1765908128
    self._uncompressed = None
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "uproot/rootio.py", line 60, in open
        return http(path, httpsource=httpsource, **options)
      File "uproot/rootio.py", line 89, in http
        return ROOTDirectory.read(openfcn(path), **options)
      File "uproot/rootio.py", line 154, in read
        streamerinfos, streamerinfosmap, streamerrules = _readstreamers(streamerkey._source, streamerkey._cursor, streamercontext, None)
      File "uproot/rootio.py", line 556, in _readstreamers
        tlist = TList.read(source, cursor, context, parent)
      File "uproot/rootio.py", line 949, in read
        out = cls._readinto(out, source, cursor, context, parent)
      File "uproot/rootio.py", line 1455, in _readinto
        start, cnt, self._classversion = _startcheck(source, cursor)
      File "uproot/rootio.py", line 426, in _startcheck
        cnt, vers = cursor.fields(source, _startcheck._format_cntvers)
      File "uproot/source/cursor.py", line 46, in fields
        return format.unpack(source.data(start, stop))
      File "uproot/source/compressed.py", line 183, in data
        return self._uncompressed[start:stop]
    TypeError: 'NoneType' object is not subscriptable
Just some comments inline. Only one of them is a "request changes": I don't want any warnings in the codebase. I like the idea of warnings as a middle ground between things-one-must-fix and everything-is-okay, but in practice, they happen deeply enough that most users don't know what they mean or what to do to fix them, if it's even possible. Also, this warning is just a "performance may be degraded," and there are many ways other than this one for that to happen: we can't catch them all (nor do we want to; there are times when a user wants something that can only be done slowly).
When you remove the warning and are happy with the changes, give me the go-ahead and I'll merge the PR. I'm happy with the rest of the changes, too.
By the way, about SyntaxErrors: the only way to add code that would be a syntax error in some versions of Python is through `eval` or `exec`. I'm willing to do that (sparingly). Here is an example.
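For concreteness, a small sketch of that kind of `exec` guard (my illustration, not the example linked in the comment): syntax that older interpreters cannot parse is hidden behind a version check.

```python
import sys

# Sketch only: f-string syntax is a SyntaxError on Python < 3.6, so the
# 3.6+ definition is kept inside exec() and only compiled when the running
# interpreter can actually parse it.
if sys.version_info >= (3, 6):
    exec('def describe(name, value):\n    return f"{name} = {value!r}"\n')
else:
    def describe(name, value):
        return "{0} = {1!r}".format(name, value)

print(describe("chunkbytes", 1048576))  # chunkbytes = 1048576
```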
I think the reason performance is usually treated as secondary is that it's so dependent on the use case. What you're calling a "performance degradation" here is based on the assumption that the user will want to read nearby arrays from the file. (And, with that basket-sized cache blocks improvement we talked about, "the same array more than once.") This is a common use case, though not the only one, and a warning about performance for those users would be very confusing. What I had in mind for remote files was to reproduce the memory-map API that works so well on local files. If the file is too big for memory (a common situation in analysis, though not tutorials), the cache will thrash. This is the case with all kinds of caches, and it's the reason for single-pass processing (e.g. doing all your work in an …).

Is the situation here that the cache loaded to read basket TKeys will get evicted before we can read the array data? It's one thing if the cache thrashes when the user asks for the same data twice, but quite another if it thrashes while servicing a single request.
Oh! What's happening is that you ask HTTP for a chunk, but it answers by sending the whole file? If we know that, why don't we chop it up into right-sized chunks and use them to fill the cache? In other words, take the error as a gift?

Why would HTTP do that? Does it only happen for small files? I imagine it would be a problem if asking for a byte range results in downloading TB of unwanted data.

To do this, we'd have to be certain of where the downloaded part starts. If …
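As a rough sketch of that "take the error as a gift" idea (the names and the chunk-index-keyed cache are assumptions for illustration, not uproot's internals):

```python
import math

def fill_cache_from_full_response(cache, data, chunkbytes):
    # If the server ignored the Range header and sent the whole file, slice
    # the body into chunk-sized blocks keyed by chunk index, so that later
    # reads of nearby byte ranges can be served from the cache.
    numchunks = int(math.ceil(len(data) / float(chunkbytes)))
    for i in range(numchunks):
        cache[i] = data[i * chunkbytes : (i + 1) * chunkbytes]
    return numchunks

cache = {}
fill_cache_from_full_response(cache, b"x" * 2500, chunkbytes=1024)  # 3 chunks
```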
I should have been clearer, as this is not a solution for #393; it's just a bug I noticed when starting to work on it.
Yes! Sorry, I really should have said that in the first comment of this PR. I'll blame jet lag, as I've just got back from CHEP 😄
It depends on the server. I've not looked into it enough to know exactly, but as range requests are just a header in the request, a very simple HTTP server (like …) can just ignore the header and return the whole file.
That is what is currently done in this PR. The big performance issue is if the cache is smaller than the file: with the default cache size of 100MB, the cache becomes useless when trying to read almost anything in a 4GB file, and you'll download 4GB for every single call to …
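A back-of-the-envelope illustration of that blow-up, using the sizes quoted above and assuming the 1 MB chunk size that appears in the debug output:

```python
filebytes  = 4 * 1024**3    # 4 GB file
cachebytes = 100 * 1024**2  # 100 MB default cache
chunkbytes = 1024**2        # assumed 1 MB chunks

print(filebytes // chunkbytes)   # 4096 chunks in the file
print(cachebytes // chunkbytes)  # only 100 of them fit in the cache at once

# Without range support, every cache miss re-downloads the entire file, so
# n scattered reads cost roughly n * filebytes of network traffic.
n = 50
print(n * filebytes // 1024**3, "GB downloaded for", n, "cache misses")  # 200 GB
```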
This is why I suggested a warning in option 1: it's not just "bad performance for some usage patterns", it's repeatedly reading millions of times more bytes than you need and not being able to cache them effectively unless you only access the same chunk or one at the end of the file. The alternative I see is to ignore the cache size when this happens and just use a normal dictionary (option 2). It's not nice though, as memory usage might blow up. Personally I find that more acceptable to do without a warning, as it's easier to debug when it causes a problem.
Empirically this seems to start at zero, as if the header was ignored, and I don't see how there could be any other reasonable behaviour. Not being sure about this was another reason to include the warning. Are there any magic bytes at the start of a ROOT file which I can look for to check this assumption?
Yes!

    % hexdump -C issue412.root | head -1
    00000000 72 6f 6f 74 00 10 33 a8 00 00 00 64 00 00 00 00 |root..3....d....|

The first four characters of every ROOT file are guaranteed to be `root`. The …

Your suggestion of increasing the cache size is dangerous because people do run out of memory, and these cases are hard to track down. Making the cache behavior more complex will make those cases more difficult to debug. However, if we're in this situation where we've unexpectedly downloaded the whole file and Python is still alive, then we've demonstrably fit the whole thing into memory. Casting it as an array with …

How about this: if HTTP returns a whole file instead of the requested part, which we can determine by …

So, yes, that's option 2. I'm on board now.
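A minimal sketch of that detection (function and parameter names are hypothetical, not uproot's internals):

```python
def looks_like_whole_file(response_bytes, requested_start):
    # Every ROOT file begins with the magic bytes b"root" (72 6f 6f 74, as in
    # the hexdump above). If we requested a range that does not start at 0
    # but the response begins with that magic, the server most likely ignored
    # the Range header and returned the entire file from offset 0.
    return requested_start != 0 and response_bytes[:4] == b"root"

looks_like_whole_file(b"root\x00\x10\x33\xa8", requested_start=20166303)  # True
```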
Thank you for the interesting write-up. (As always!) Checking for … The response from …
@jpivarski Assuming you're happy with it, this can be merged. As nobody has reported it as a bug, I think it can wait for the next release instead of having a dedicated one made.
Okay, thanks!
Uproot fails to read files from servers that don't support range requests. An example traceback with extra debugging printout, without this fix applied, is shown above.
The error can be seen when using Python's built-in HTTP server to open a "large" file (i.e. more than one chunk).
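A minimal sketch of that kind of reproduction (the file name and port here are placeholders I chose, not taken from the original):

```python
import http.server
import socketserver
import threading

import uproot  # uproot 3.x, as in the traceback above

# Python's built-in request handler does not honour Range headers: it always
# streams the whole file, which is exactly the server behaviour at issue here.
server = socketserver.TCPServer(
    ("localhost", 8000), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any ROOT file larger than one chunk in the served directory will do;
# "issue412.root" is just the file name used in the comments above.
f = uproot.open("http://localhost:8000/issue412.root")
```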
Also includes some minor fixes for linter errors I stumbled upon. It might be necessary to use a more general exception in some cases. I've not checked closely, but perhaps `SyntaxError` needs to be caught in some cases?