This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Reading files with HTTP fails if range headers are not supported #411

Merged
8 commits merged into scikit-hep:master from missing-range-header-support on Dec 3, 2019

Conversation

chrisburr (Member):

Uproot fails to read files from servers that don't support range requests. Example traceback, with extra debugging printout, without this fix applied:

Got response of size 20166303 chunksize is 1048576
-1765915941 -30694 -1671616656 4090602292 -7813 16221 -118717542 708660270
self._compressedbytes, self._uncompressedbytes = -1765908128 -1671616656
cursor.index - start 0
self._compressedbytes -1765908128
self._uncompressed = None
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "uproot/rootio.py", line 60, in open
    return http(path, httpsource=httpsource, **options)
  File "uproot/rootio.py", line 89, in http
    return ROOTDirectory.read(openfcn(path), **options)
  File "uproot/rootio.py", line 154, in read
    streamerinfos, streamerinfosmap, streamerrules = _readstreamers(streamerkey._source, streamerkey._cursor, streamercontext, None)
  File "uproot/rootio.py", line 556, in _readstreamers
    tlist = TList.read(source, cursor, context, parent)
  File "uproot/rootio.py", line 949, in read
    out = cls._readinto(out, source, cursor, context, parent)
  File "uproot/rootio.py", line 1455, in _readinto
    start, cnt, self._classversion = _startcheck(source, cursor)
  File "uproot/rootio.py", line 426, in _startcheck
    cnt, vers = cursor.fields(source, _startcheck._format_cntvers)
  File "uproot/source/cursor.py", line 46, in fields
    return format.unpack(source.data(start, stop))
  File "uproot/source/compressed.py", line 183, in data
    return self._uncompressed[start:stop]
TypeError: 'NoneType' object is not subscriptable

The error can be seen by using Python's built-in HTTP server to serve a "large" file (i.e. one spanning more than one chunk):

python -m http.server 8000 --directory tests/samples
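
With that server running, a minimal trigger from the Python side might look like the following (the file name is a placeholder; any file in tests/samples larger than one chunk should do):

import uproot

# Without this fix, a server that ignores Range headers makes this fail with the
# TypeError shown in the traceback above.
f = uproot.open("http://localhost:8000/some-large-file.root")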

This PR also includes some minor fixes for linter errors I stumbled upon. It might be necessary to use a more general exception in a few places; I've not checked closely, but perhaps SyntaxError needs to be caught in some cases?

@jpivarski (Member) left a comment:

Just some comments in line. Only one of them is a "request changes": I don't want any warnings in the codebase. I like the idea of warnings as a middle-ground between things-one-must-fix and everything-is-okay, but in practice, they happen deeply enough that most users don't know what they mean or what to do to fix them, if it's even possible. Also, this warning is just a "performance may be degraded," and there are many ways other than this one for that to happen: we can't catch them all (nor do we want to—there are times when a user wants something that can only be done slowly).

When you remove the warning and are happy with the changes, give me the go-ahead and I'll merge the PR. I'm happy with the rest of the changes, too.

By the way, about SyntaxErrors, the only way to add code that would be a syntax error in some versions of Python is through eval or exec. I'm willing to do that (sparingly). Here is an example.
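
For illustration only (a generic sketch, not the example linked above): the version-specific syntax stays inside a string and is only compiled on interpreters that accept it, here assuming f-strings, which are a SyntaxError before Python 3.6:

import sys

if sys.version_info >= (3, 6):
    # the f-string would be a SyntaxError if written literally on older Pythons,
    # but inside exec it is only compiled when this branch actually runs (module level)
    exec("def describe(x): return f'value={x}'")
else:
    def describe(x):
        return "value={0}".format(x)

print(describe(42))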

Review threads: uproot/source/chunked.py (outdated), uproot/tree.py, uproot/source/compressed.py, uproot/source/chunked.py (outdated), uproot/interp/jagged.py, uproot/interp/auto.py
@jpivarski (Member):

I think the reason performance is usually treated as secondary is that it's so dependent on the use-case. What you're calling a "performance degradation" here is based on the assumption that the user will want to read nearby arrays from the file. (And with that basket-sized cache blocks improvement we talked about, "the same array more than once.") This is a common use-case, though not the only one, and a warning about performance for those users would be very confusing.

What I had in mind for remote files was to reproduce the memory map API that works so well on local files. If the file is too big for memory—a common situation in analysis, though not tutorials—the cache will thrash. This is the case with all kinds of caches, and it's the reason that a single pass (e.g. doing all your work in an uproot.iterate loop) is faster.

Is the situation here that the cache loaded to read basket TKeys will get evicted before we can read the array data? It's one thing if the cache thrashes when the user asks for the same data twice, but quite another if it thrashes while servicing a single request.

@jpivarski (Member):

Oh! What's happening is that you ask HTTP for a chunk, but it answers by sending the whole file? If we know that, why don't we chop it up into right-sized chunks and fill the cache? In other words, take the error as a gift?

Why would HTTP do that? Does it only happen for small files? I imagine it would be a problem if asking for a byte range results in downloading TB of unwanted data.

To do this, we'd have to be certain of where the downloaded part starts. If len(chunk) > chunksize, does that mean that it starts at 0 or at the requested start?

@chrisburr (Member, Author):

I should have been clearer: this is not a solution for #393, it's just a bug I noticed when starting to work on it.

Oh! What's happening is that you ask HTTP for a chunk, but it answers by sending the whole file?

Yes! Sorry I really should have said that in the first comment of this PR. I'll blame jet lag as I've just got back from CHEP 😄

Why would HTTP do that? Does it only happen for small files?

It depends on the server. I've not looked into it enough to know exactly, but since range requests are just a header on the request, if the HTTP server is very simple like python -m http.server (or if something is misconfigured and the header gets lost), the header is ignored and you get the entire file.
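
To illustrate the underlying behaviour (a sketch using the requests library; the URL and file name are placeholders), a server that honours Range answers 206 Partial Content with only the requested bytes, while one that ignores the header answers 200 with the whole file:

import requests

r = requests.get(
    "http://localhost:8000/some-large-file.root",   # placeholder URL
    headers={"Range": "bytes=0-1048575"},            # ask for the first 1 MiB only
)
# 206 and 1048576 bytes if ranges are honoured; 200 and the full file size if not
print(r.status_code, len(r.content))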

If we know that, why don't we chop it up into right-sized chunks and fill the cache? In other words, take the error as a gift?

That is what is currently done in this PR. The big performance issue is when the cache is smaller than the file: with the default cache size of 100MB, the cache becomes useless when trying to read almost anything in a 4GB file, and you'll download 4GB for every single call to Source.data. The newest two commits make it a little better if you only read nearby data (which isn't the case in my test files).

I imagine it would be a problem if asking for a byte range results in downloading TB of unwanted data.

This is why I suggested a warning in option 1: it's not just "bad performance for some usage patterns", it's repeatedly reading millions of times more bytes than you need and not being able to cache them effectively unless you only access the same chunk, or one at the end of the file.

The alternative I see is to ignore the cache size when this happens and just use a normal dictionary (option 2). It's not nice though, as memory usage might blow up. Personally I find that more acceptable to do without a warning, as it's easier to debug when it causes a problem.

To do this, we'd have to be certain of where the downloaded part starts. If len(chunk) > chunksize, does that mean that it starts at 0 or at the requested start?

Empirically this seems to start at zero, as if the header had simply been ignored, and I don't see how there could be any other reasonable behaviour. Not being sure about this was another reason to include the warning. Are there any magic bytes at the start of a ROOT file which I can look for to check this assumption?

@jpivarski (Member):

Are there any magic bytes at the start of a ROOT file which I can look for to check this assumption?

Yes!

% hexdump -C issue412.root | head -1
00000000  72 6f 6f 74 00 10 33 a8  00 00 00 64 00 00 00 00  |root..3....d....|

The first four characters of every ROOT file are guaranteed to be "root" (that aspect of the format is specified). If character data in a ROOT file were uniformly random, the chance of hitting these four characters by accident would be about 1 in 4 billion. It's not uniformly random; there are a few places in the file where the filename is stored, which probably ends in ".root". (Surprise! If you name your file a dirty word while filling it, then change its name on the commandline, it will have that dirty word embedded inside the file. Another fun quirk.) If HTTP returns a random fragment of the file (which I'd consider a bug), then there is a small chance that those non-beginning bytes are "root". Not as small as 1 in 4 billion, but negligible.
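
(For the record, that figure is 256^4 = 4,294,967,296 ≈ 4.3 × 10^9 possible four-byte strings, exactly one of which is b"root".)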

The "root" magic is followed by a 4-byte big-endian version number (such as 61800 for 6.18/00, possibly with 1000000 added if it's a big file, like 1061800). The next 4 bytes (if small) or 8 bytes (if big) is fBEGIN, which is almost always 100. Maybe these could be used to disambiguate—it's not likely to be randomly after "root", but perhaps that's too complex to check for.

Your suggestion of increasing the cache size is dangerous because people do run out of memory, and these cases are hard to track down. Making the cache behavior more complex will make those cases more difficult to debug. However, if we're in this situation where we've unexpectedly downloaded the whole file and Python is still alive, then we've demonstrably fit the whole thing into memory. Casting it as an array with numpy.frombuffer and slicing it into tiles is a zero-copy operation that wouldn't add to the total usage.
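
A quick sanity check of that claim (a sketch with arbitrary sizes):

import numpy as np

data = b"\x00" * (4 * 1048576)                # stand-in for a fully downloaded file
whole = np.frombuffer(data, dtype=np.uint8)   # wraps the bytes without copying
tile = whole[1048576:2 * 1048576]             # slicing produces a view, not a copy
print(np.shares_memory(whole, tile))          # True: tiling adds no memory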

How about this: if HTTP returns a whole file instead of the requested part, which we can determine by len(chunk) > chunksize and chunk[:4] == b"root", then let's chop it up into the correct chunksize (assuming that it starts at 0) and put all of the chunks into a simple-dictionary cache, rather than the eviction-happy cache. After all, evicting a view doesn't even reduce memory usage, so the normal rules don't apply here anyway.
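
A minimal sketch of that fallback (the function and variable names here are illustrative, not uproot's actual ones):

import numpy as np

def fill_cache_from_whole_file(data, chunkbytes, cache):
    """Chop an unexpectedly whole-file HTTP response into chunk-sized, zero-copy views."""
    whole = np.frombuffer(data, dtype=np.uint8)
    if whole[:4].tobytes() != b"root":
        # can't be sure the response starts at byte 0, so give up rather than guess
        raise NotImplementedError("server ignored the Range header but did not return a ROOT file")
    for index, start in enumerate(range(0, len(whole), chunkbytes)):
        cache[index] = whole[start:start + chunkbytes]   # views share the already-downloaded memory

Here cache would be a plain dict keyed by chunk index, so nothing is ever evicted.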

So, yes, that's option 2. I'm on board now.

@chrisburr (Member, Author):

Thank you for the interesting write-up. (As always!)

Checking for b"root" should be adequate, as checking for a sensible version number seems either overly general or fragile. If the check doesn't pass, a NotImplementedError is raised.

The response from Source._read is already put through return numpy.frombuffer(data, dtype=numpy.uint8), so I think this is now ready.

@chrisburr (Member, Author):

@jpivarski Assuming you're happy with it, this can be merged.

As nobody has reported it as a bug, I think this can wait for the next release instead of having a dedicated one made.

@jpivarski (Member):

Okay, thanks!

@jpivarski jpivarski merged commit 163bf0a into scikit-hep:master Dec 3, 2019
@chrisburr chrisburr deleted the missing-range-header-support branch December 3, 2019 13:44