Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading files through HTTPS protocol results TypeError: a bytes-like object is required, not 'ServerDisconnectedError' #1233

Open
oshadura opened this issue Jun 20, 2024 · 10 comments
Labels
bug The problem described is something that must be fixed

Comments

@oshadura
Copy link
Collaborator

Reading files through HTTPS protocol results: TypeError: a bytes-like object is required, not 'ServerDisconnectedError'

Traceback: https://gist.github.com/oshadura/df21fcdf12c8ad9bac6b759ca164064c

>>> import uproot
>>> uproot.__version__
'5.3.7'

Reproducer:

import numpy as np
import awkward as ak
import matplotlib.pyplot as plt

import uproot

import dask
import hist
from hist.dask import Hist
from coffea.nanoevents import NanoEventsFactory
#prefix = 'root://xcache//store/user/ncsmith/samplegame/'
prefix = 'https://xrootd-local.unl.edu:1094//store/user/AGC/samplegame/'

samples = [
    uproot.dask(prefix + "sample%d.root" % i, open_files=False)
    for i in range(6)
]
h = (
    Hist.new
    .IntCat(range(6), label="Sample")
    .Reg(100, 0, 500, label="Jet $p_T$")
    .Double()
)
for i, sample in enumerate(samples):
    h.fill(i, ak.flatten(sample.Jet_pt))

fig, ax = plt.subplots()
h, *_ = dask.compute(h)
h.plot1d(ax=ax)
ax.set_yscale("log")
ax.legend(title="Sample")

@nsmith-

@alexander-held
Copy link
Member

Smaller reproducer:

import uproot

with uproot.open("https://xrootd-local.unl.edu:1094//store/user/AGC/samplegame/sample1.root") as f:
    f["Events"]  # this works
    f["Events"].arrays("Jet_pt")  # this breaks

@jpivarski
Copy link
Member

The small reproducer doesn't do it:

>>> import uproot
>>> with uproot.open("https://xrootd-local.unl.edu:1094//store/user/AGC/samplegame/sample1.root") as f:
...     f["Events"]  # this works
...     f["Events"].arrays("Jet_pt")  # this breaks
... 
<TTree 'Events' (431 branches) at 0x6ffffbf43290>
<Array [{Jet_pt: [57.5, ..., 15.3]}, ...] type='500000 * {Jet_pt: var * flo...'>

but the big one does. Or it's random (because it's a ServerDisconnectedError) and that's what I got on one invocation of each.

If the server disconnected, then some error has to be raised about that (unless it's a rare failure to be skipped and filed in a report). The failing line is

  File "/home/jpivarski/irishep/uproot5/src/uproot/source/chunk.py", line 388, in wait
    self._raw_data = numpy.frombuffer(self._future.result(), dtype=self._dtype)
                                      ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/uproot5/src/uproot/source/coalesce.py", line 36, in result
    return self._parent.result(timeout=timeout)[self._s]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
TypeError: 'ServerDisconnectedError' object is not subscriptable

which is to say that

return self._parent.result(timeout=timeout)[self._s]

isn't checking for the possibility that the return value of result might be an exception object. (Why is the exception returned, rather than raised?)

The coalesce.py submodule was added by @nsmith-, so maybe this line just needs something to handle the exception case. I'm still confused by the fact that it's returning, rather than raising, the exception, since

try:
if self._task is None:
raise RuntimeError("cannot run Future twice")
self._result = self._task(*self._args)
except Exception:
self._excinfo = sys.exc_info()

puts the exception class, object, and traceback into Future._excinfo, not Future._result, and

def result(self, timeout=None):
"""
Waits until the task completes (with a ``timeout``) and returns its
result.
If the task raises an exception in its background thread, this function
raises that exception on the thread on which it is called.
"""
self._finished.wait(timeout=timeout)
if self._excinfo is None:
return self._result
else:
delayed_raise(*self._excinfo)

raises the exception, rather than returning it, here:

def delayed_raise(exception_class, exception_value, traceback):
"""
Raise an exception from a background thread on the main thread.
"""
raise exception_value.with_traceback(traceback)

Some SliceFuture._parent is not following this protocol. @nsmith-, are all possible values of SliceFuture._parent an Uproot future?

@jpivarski jpivarski added bug The problem described is something that must be fixed and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Jun 25, 2024
@nsmith-
Copy link
Member

nsmith- commented Jun 25, 2024

are all possible values of SliceFuture._parent an Uproot future?

In the case of FSSpec (the only source currently using the coalescing algorithm), none of them are. They are python futures as prepared by

return self._executor.submit(coroutine)

where self._executor is an instance of FSSpecLoopExecutor:
class FSSpecLoopExecutor(uproot.source.futures.Executor):
@property
def loop(self) -> asyncio.AbstractEventLoop:
return fsspec.asyn.get_loop()
def submit(self, coroutine) -> concurrent.futures.Future:
if not asyncio.iscoroutine(coroutine):
raise TypeError("loop executor can only submit coroutines")
return asyncio.run_coroutine_threadsafe(coroutine, self.loop)

I don't know yet why run_coroutine_threadsafe ends up setting the content instead of an exception

@oshadura
Copy link
Collaborator Author

The small reproducer doesn't do it:

>>> import uproot
>>> with uproot.open("https://xrootd-local.unl.edu:1094//store/user/AGC/samplegame/sample1.root") as f:
...     f["Events"]  # this works
...     f["Events"].arrays("Jet_pt")  # this breaks
... 
<TTree 'Events' (431 branches) at 0x6ffffbf43290>
<Array [{Jet_pt: [57.5, ..., 15.3]}, ...] type='500000 * {Jet_pt: var * flo...'>

but the big one does. Or it's random (because it's a ServerDisconnectedError) and that's what I got on one invocation of each.

If the server disconnected, then some error has to be raised about that (unless it's a rare failure to be skipped and filed in a report). The failing line is

  File "/home/jpivarski/irishep/uproot5/src/uproot/source/chunk.py", line 388, in wait
    self._raw_data = numpy.frombuffer(self._future.result(), dtype=self._dtype)
                                      ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpivarski/irishep/uproot5/src/uproot/source/coalesce.py", line 36, in result
    return self._parent.result(timeout=timeout)[self._s]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
TypeError: 'ServerDisconnectedError' object is not subscriptable

which is to say that

return self._parent.result(timeout=timeout)[self._s]

isn't checking for the possibility that the return value of result might be an exception object. (Why is the exception returned, rather than raised?)

The coalesce.py submodule was added by @nsmith-, so maybe this line just needs something to handle the exception case. I'm still confused by the fact that it's returning, rather than raising, the exception, since

try:
if self._task is None:
raise RuntimeError("cannot run Future twice")
self._result = self._task(*self._args)
except Exception:
self._excinfo = sys.exc_info()

puts the exception class, object, and traceback into Future._excinfo, not Future._result, and

def result(self, timeout=None):
"""
Waits until the task completes (with a ``timeout``) and returns its
result.
If the task raises an exception in its background thread, this function
raises that exception on the thread on which it is called.
"""
self._finished.wait(timeout=timeout)
if self._excinfo is None:
return self._result
else:
delayed_raise(*self._excinfo)

raises the exception, rather than returning it, here:

def delayed_raise(exception_class, exception_value, traceback):
"""
Raise an exception from a background thread on the main thread.
"""
raise exception_value.with_traceback(traceback)

Some SliceFuture._parent is not following this protocol. @nsmith-, are all possible values of SliceFuture._parent an Uproot future?

I will ask Carl to check server side logs and post them here...

@oshadura
Copy link
Collaborator Author

oshadura commented Jun 26, 2024

240626 15:46:12.195505 32239 acc_Audit: unknown.898250:416@c2409.shor.hcc grant https *@[::ffff:172.30.24.9] stat /store/user/AGC/samplegame/sample1.root
240626 15:46:12.195658 2192 cms_Decode: xrootd-local redirects unknown.898250:416@c2409.shor.hcc to red-xfer5.unl.edu:1094 /store/user/AGC/samplegame/sample1.root
240626 15:46:12.195710 32239 unknown.898250:416@c2409.shor.hcc Xrootd_Protocol: rc=-256 stat /store/user/AGC/samplegame/sample1.root

Same log entry on xfer5. The only thing catching my attention is :

5:50
240626 15:46:07.113526 17535 multiuser_UserSentry: Anonymous client; no user set, cannot change FS UIDs

From Carl: maybe https needs FS UID set?

@jthiltges
Copy link

I opened PR xrootd/xrootd#2290 to add keep-alive support when XRootD is replying with HTTP redirections. This seemed to resolve the issue in limited local testing.

When redirecting clients, XRootD indicates keep-alive support in the HTTP reply headers, but then immediately closes the connection. I'm guessing that if aiohttp tries to reuse the connection before the closure has been handled up the stack, we get the ServerDisconnectedError.

There's a fix on the aiohttp side, aio-libs/aiohttp#7363, which adds a retry. In testing, it did help, but didn't completely resolve the issue. The second attempt could also end up reusing a closed connection and fail.

@oshadura
Copy link
Collaborator Author

oshadura commented Jul 22, 2024

At the end @jthiltges 's xrootd patch fixed the issue on our side (the notebook is working back). As I understood it still needs to be tested further because of some failing tests, @jthiltges?

@jthiltges
Copy link

jthiltges commented Jul 22, 2024

As I understood it still needs to be tested further because of some failing tests, @jthiltges?

That's correct. The PR needs a bit more work to get the HTTP keepalive working as expected, with details in the xrootd PR. I'm hoping to have something ready for testing early this week.

@bockjoo
Copy link

bockjoo commented Aug 6, 2024

Hi @jthiltges @oshadura
Can the issue be reproduced if you directly use one of your servers instead of your site-local redirector?
Thanks,
Bockjoo

@jthiltges
Copy link

Hi @bockjoo,

I was only able to reproduce the issue when the redirector was involved and the client was using HTTP(S).

The xrootd PR to handle keepalive when redirecting has been merged, and appears on-track for release in v5.7.1.

Regards,
John

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The problem described is something that must be fixed
Projects
None yet
Development

No branches or pull requests

6 participants