This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Reading files with HTTP fails if range headers are not supported #411

Merged
8 commits merged into scikit-hep:master from missing-range-header-support on Dec 3, 2019

Conversation

chrisburr (Member):

Uproot fails to read files from servers that don't support range requests. Example traceback, with extra debugging printout, without this fix applied:

Got response of size 20166303 chunksize is 1048576
-1765915941 -30694 -1671616656 4090602292 -7813 16221 -118717542 708660270
self._compressedbytes, self._uncompressedbytes = -1765908128 -1671616656
cursor.index - start 0
self._compressedbytes -1765908128
self._uncompressed = None
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "uproot/rootio.py", line 60, in open
    return http(path, httpsource=httpsource, **options)
  File "uproot/rootio.py", line 89, in http
    return ROOTDirectory.read(openfcn(path), **options)
  File "uproot/rootio.py", line 154, in read
    streamerinfos, streamerinfosmap, streamerrules = _readstreamers(streamerkey._source, streamerkey._cursor, streamercontext, None)
  File "uproot/rootio.py", line 556, in _readstreamers
    tlist = TList.read(source, cursor, context, parent)
  File "uproot/rootio.py", line 949, in read
    out = cls._readinto(out, source, cursor, context, parent)
  File "uproot/rootio.py", line 1455, in _readinto
    start, cnt, self._classversion = _startcheck(source, cursor)
  File "uproot/rootio.py", line 426, in _startcheck
    cnt, vers = cursor.fields(source, _startcheck._format_cntvers)
  File "uproot/source/cursor.py", line 46, in fields
    return format.unpack(source.data(start, stop))
  File "uproot/source/compressed.py", line 183, in data
    return self._uncompressed[start:stop]
TypeError: 'NoneType' object is not subscriptable

The error can be seen by using Python's built-in HTTP server to serve a "large" file (i.e. one spanning more than one chunk):

python -m http.server 8000 --directory tests/samples
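
With that server running, a minimal trigger from the Python side might look like the following (the file name is a placeholder; any file in tests/samples larger than one chunk should do):

import uproot

# Without this fix, a server that ignores Range headers makes this fail with the
# TypeError shown in the traceback above.
f = uproot.open("http://localhost:8000/some-large-file.root")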

This PR also includes some minor fixes for linter errors I stumbled upon. It might be necessary to use a more general exception in a few places; I've not checked closely, but perhaps SyntaxError needs to be caught in some cases?

@jpivarski (Member) left a comment:

Just some comments in line. Only one of them is a "request changes": I don't want any warnings in the codebase. I like the idea of warnings as a middle-ground between things-one-must-fix and everything-is-okay, but in practice, they happen deeply enough that most users don't know what they mean or what to do to fix them, if it's even possible. Also, this warning is just a "performance may be degraded," and there are many ways other than this one for that to happen: we can't catch them all (nor do we want to—there are times when a user wants something that can only be done slowly).

When you remove the warning and are happy with the changes, give me the go-ahead and I'll merge the PR. I'm happy with the rest of the changes, too.

By the way, about SyntaxErrors, the only way to add code that would be a syntax error in some versions of Python is through eval or exec. I'm willing to do that (sparingly). Here is an example.
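
For illustration only (a generic sketch, not the example linked above): the version-specific syntax stays inside a string and is only compiled on interpreters that accept it, here assuming f-strings, which are a SyntaxError before Python 3.6:

import sys

if sys.version_info >= (3, 6):
    # the f-string would be a SyntaxError if written literally on older Pythons,
    # but inside exec it is only compiled when this branch actually runs (module level)
    exec("def describe(x): return f'value={x}'")
else:
    def describe(x):
        return "value={0}".format(x)

print(describe(42))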

Review threads: uproot/source/chunked.py (outdated), uproot/tree.py, uproot/source/compressed.py, uproot/source/chunked.py (outdated), uproot/interp/jagged.py, uproot/interp/auto.py
@jpivarski (Member):

I think the reason performance is usually treated as secondary is that it's so dependent on the use-case. What you're calling a "performance degradation" here is based on the assumption that the user will want to read nearby arrays from the file. (And with that basket-sized cache blocks improvement we talked about, "the same array more than once.") This is a common use-case, though not the only one, and a warning about performance for those users would be very confusing.

What I had in mind for remote files was to reproduce the memory map API that works so well on local files. If the file is too big for memory—a common situation in analysis, though not tutorials—the cache will thrash. This is the case with all kinds of caches, and it's the reason that a single pass (e.g. doing all your work in an uproot.iterate loop) is faster.

Is the situation here that the cache loaded to read basket TKeys will get evicted before we can read the array data? It's one thing if the cache thrashes when the user asks for the same data twice, but quite another if it thrashes while servicing a single request.

@jpivarski (Member):

Oh! What's happening is that you ask HTTP for a chunk, but it answers by sending the whole file? If we know that, why don't we chop it up into right-sized chunks and fill the cache? In other words, take the error as a gift?

Why would HTTP do that? Does it only happen for small files? I imagine it would be a problem if asking for a byte range results in downloading TB of unwanted data.

To do this, we'd have to be certain of where the downloaded part starts. If len(chunk) > chunksize, does that mean that it starts at 0 or at the requested start?

@chrisburr (Member, Author):

I should have been clearer: this is not a solution for #393, it's just a bug I noticed when starting to work on it.

Oh! What's happening is that you ask HTTP for a chunk, but it answers by sending the whole file?

Yes! Sorry I really should have said that in the first comment of this PR. I'll blame jet lag as I've just got back from CHEP 😄

Why would HTTP do that? Does it only happen for small files?

It depends on the server. I've not looked into it enough to know exactly, but since range requests are just a header on the request, if the HTTP server is very simple like python -m http.server (or if something is misconfigured and the header gets lost), the header is ignored and you get the entire file.
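
To illustrate the underlying behaviour (a sketch using the requests library; the URL and file name are placeholders), a server that honours Range answers 206 Partial Content with only the requested bytes, while one that ignores the header answers 200 with the whole file:

import requests

r = requests.get(
    "http://localhost:8000/some-large-file.root",   # placeholder URL
    headers={"Range": "bytes=0-1048575"},            # ask for the first 1 MiB only
)
# 206 and 1048576 bytes if ranges are honoured; 200 and the full file size if not
print(r.status_code, len(r.content))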

If we know that, why don't we chop it up into right-sized chunks and fill the cache? In other words, take the error as a gift?

That is what is currently done in this PR. The big performance issue is when the cache is smaller than the file: with the default cache size of 100MB, the cache becomes useless when trying to read almost anything in a 4GB file, and you'll download 4GB for every single call to Source.data. The newest two commits make it a little better if you only read nearby data (which isn't the case in my test files).

I imagine it would be a problem if asking for a byte range results in downloading TB of unwanted data.

This is why I suggested a warning in option 1: it's not just "bad performance for some usage patterns", it's repeatedly reading millions of times more bytes than you need and not being able to cache them effectively unless you only access the same chunk, or one at the end of the file.

The alternative I see is to ignore the cache size when this happens and just use a normal dictionary (option 2). It's not nice though, as memory usage might blow up. Personally I find that more acceptable to do without a warning, as it's easier to debug when it causes a problem.

To do this, we'd have to be certain of where the downloaded part starts. If len(chunk) > chunksize, does that mean that it starts at 0 or at the requested start?

Empirically this seems to start at zero, as if the header had simply been ignored, and I don't see how there could be any other reasonable behaviour. Not being sure about this was another reason to include the warning. Are there any magic bytes at the start of a ROOT file which I can look for to check this assumption?

@jpivarski (Member):

Are there any magic bytes at the start of a ROOT file which I can look for to check this assumption?

Yes!

% hexdump -C issue412.root | head -1
00000000  72 6f 6f 74 00 10 33 a8  00 00 00 64 00 00 00 00  |root..3....d....|

The first four characters of every ROOT file are guaranteed to be "root" (that aspect of the format is specified). If character data in a ROOT file were uniformly random, the chance of hitting these four characters by accident would be about 1 in 4 billion. It's not uniformly random; there are a few places in the file where the filename is stored, which probably ends in ".root". (Surprise! If you name your file a dirty word while filling it, then change its name on the commandline, it will have that dirty word embedded inside the file. Another fun quirk.) If HTTP returns a random fragment of the file (which I'd consider a bug), then there is a small chance that those non-beginning bytes are "root". Not as small as 1 in 4 billion, but negligible.
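
(For the record, that figure is 256^4 = 4,294,967,296 ≈ 4.3 × 10^9 possible four-byte strings, exactly one of which is b"root".)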

The "root" magic is followed by a 4-byte big-endian version number (such as 61800 for 6.18/00, possibly with 1000000 added if it's a big file, like 1061800). The next 4 bytes (if small) or 8 bytes (if big) is fBEGIN, which is almost always 100. Maybe these could be used to disambiguate—it's not likely to be randomly after "root", but perhaps that's too complex to check for.

Your suggestion of increasing the cache size is dangerous because people do run out of memory, and these cases are hard to track down. Making the cache behavior more complex will make those cases more difficult to debug. However, if we're in this situation where we've unexpectedly downloaded the whole file and Python is still alive, then we've demonstrably fit the whole thing into memory. Casting it as an array with numpy.frombuffer and slicing it into tiles is a zero-copy operation that wouldn't add to the total usage.
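
A quick sanity check of that claim (a sketch with arbitrary sizes):

import numpy as np

data = b"\x00" * (4 * 1048576)                # stand-in for a fully downloaded file
whole = np.frombuffer(data, dtype=np.uint8)   # wraps the bytes without copying
tile = whole[1048576:2 * 1048576]             # slicing produces a view, not a copy
print(np.shares_memory(whole, tile))          # True: tiling adds no memory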

How about this: if HTTP returns a whole file instead of the requested part, which we can determine by len(chunk) > chunksize and chunk[:4] == b"root", then let's chop it up into the correct chunksize (assuming that it starts at 0) and put all of the chunks into a simple-dictionary cache, rather than the eviction-happy cache. After all, evicting a view doesn't even reduce memory usage, so the normal rules don't apply here anyway.
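
A minimal sketch of that fallback (the function and variable names here are illustrative, not uproot's actual ones):

import numpy as np

def fill_cache_from_whole_file(data, chunkbytes, cache):
    """Chop an unexpectedly whole-file HTTP response into chunk-sized, zero-copy views."""
    whole = np.frombuffer(data, dtype=np.uint8)
    if whole[:4].tobytes() != b"root":
        # can't be sure the response starts at byte 0, so give up rather than guess
        raise NotImplementedError("server ignored the Range header but did not return a ROOT file")
    for index, start in enumerate(range(0, len(whole), chunkbytes)):
        cache[index] = whole[start:start + chunkbytes]   # views share the already-downloaded memory

Here cache would be a plain dict keyed by chunk index, so nothing is ever evicted.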

So, yes, that's option 2. I'm on board now.

@chrisburr (Member, Author):

Thank you for the interesting write-up. (As always!)

Checking for b"root" should be adequate, as checking for a sensible version number seems either overly general or fragile. If the check doesn't pass, a NotImplementedError is raised.

The response from Source._read is already put through return numpy.frombuffer(data, dtype=numpy.uint8), so I think this is now ready.

@chrisburr (Member, Author):

@jpivarski Assuming you're happy with it, this can be merged.

As nobody has reported it as a bug, I think this can wait for the next release instead of having a dedicated one made.

@jpivarski (Member):

Okay, thanks!

@jpivarski jpivarski merged commit 163bf0a into scikit-hep:master Dec 3, 2019
@chrisburr chrisburr deleted the missing-range-header-support branch December 3, 2019 13:44