Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Avoid ignoring asyncio exceptions in coroutines" #672

Merged
merged 2 commits into from
Apr 25, 2022

Conversation

simskij
Copy link
Contributor

@simskij simskij commented Apr 20, 2022

This reverts commit f7552be. The suggested fix did not really solve the issue it was supposed to solve as it instead fails with [Errno 9] Bad File Descriptor crashing most of our tests.

@jujubot
Copy link
Contributor

jujubot commented Apr 20, 2022

Can one of the admins verify this patch?

7 similar comments
@jujubot
Copy link
Contributor

jujubot commented Apr 20, 2022

Can one of the admins verify this patch?

@jujubot
Copy link
Contributor

jujubot commented Apr 20, 2022

Can one of the admins verify this patch?

@jujubot
Copy link
Contributor

jujubot commented Apr 20, 2022

Can one of the admins verify this patch?

@jujubot
Copy link
Contributor

jujubot commented Apr 20, 2022

Can one of the admins verify this patch?

@jujubot
Copy link
Contributor

jujubot commented Apr 20, 2022

Can one of the admins verify this patch?

@jujubot
Copy link
Contributor

jujubot commented Apr 20, 2022

Can one of the admins verify this patch?

@jujubot
Copy link
Contributor

jujubot commented Apr 20, 2022

Can one of the admins verify this patch?

@cderici
Copy link
Contributor

cderici commented Apr 20, 2022

Hey @simskij, sorry you're having trouble with that, though I'm surprised to hear that exceptions raised in the asyncio tasks that are internal to pylibjuju (and internal to asyncio/websocket connections) actually crashing your tests. Do your tests pass %100 after this revert change?

@rbarry82
Copy link

Hey @simskij, sorry you're having trouble with that, though I'm surprised to hear that exceptions raised in the asyncio tasks that are internal to pylibjuju (and internal to asyncio/websocket connections) actually crashing your tests. Do your tests pass %100 after this revert change?

Verified, yes.

Before that, the stack trace looks like:

Traceback (most recent call last):
  File "/home/runner/work/prometheus-scrape-config-k8s-operator/prometheus-scrape-config-k8s-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py", line 1056, in close_connection
    if (yield from self.wait_for_connection_lost()):
  File "/home/runner/work/prometheus-scrape-config-k8s-operator/prometheus-scrape-config-k8s-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py", line 1077, in wait_for_connection_lost
    yield from asyncio.wait_for(
  File "/usr/lib/python3.8/asyncio/tasks.py", line 472, in wait_for
    timeout_handle = loop.call_later(timeout, _release_waiter, waiter)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 687, in call_later
    timer = self.call_at(self.time() + delay, callback, *args,
  File "/usr/lib/python3.8/asyncio/base_events.py", line 698, in call_at
    self._check_closed()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Task was destroyed but it is pending!
task: <Task pending name='Task-1390' coro=<Connection._receiver() running at /home/runner/work/prometheus-scrape-config-k8s-operator/prometheus-scrape-config-k8s-operator/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py:524> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f246c07ec40>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-1769' coro=<WebSocketCommonProtocol.recv() running at /home/runner/work/prometheus-scrape-config-k8s-operator/prometheus-scrape-config-k8s-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py:416> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f246c118cd0>()]> cb=[create_task_with_handler.<locals>._task_result_exp_handler(task_name='tmp', logger=<Logger juju....ion (WARNING)>)() at /home/runner/work/prometheus-scrape-config-k8s-operator/prometheus-scrape-config-k8s-operator/.tox/integration/lib/python3.8/site-packages/juju/jasyncio.py:64]>
Task was destroyed but it is pending!
task: <Task pending name='Task-1770' coro=<Event.wait() running at /usr/lib/python3.8/asyncio/locks.py:309> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f24752e4b50>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-1803' coro=<WebSocketCommonProtocol.keepalive_ping() running at /home/runner/work/prometheus-scrape-config-k8s-operator/prometheus-scrape-config-k8s-operator/.tox/integration/lib/python3.8/site-packages/websockets/protocol.py:[977](https://github.com/canonical/prometheus-scrape-config-k8s-operator/runs/6079841693?check_suite_focus=true#step:4:977)> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f246e1b07c0>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-1810' coro=<Connection._pinger() running at /home/runner/work/prometheus-scrape-config-k8s-operator/prometheus-scrape-config-k8s-operator/.tox/integration/lib/python3.8/site-packages/juju/client/connection.py:572> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f246e1b0400>()]>>
Fatal error on SSL transport
protocol: <asyncio.sslproto.SSLProtocol object at 0x7f246e1833a0>
transport: <_SelectorSocketTransport closing fd=20>
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 910, in write
    n = self._sock.send(data)
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 685, in _process_write_backlog
    self._transport.write(chunk)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 916, in write
    self._fatal_error(exc, 'Fatal write error on socket transport')
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 711, in _fatal_error
    self._force_close(exc)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 723, in _force_close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 719, in call_soon
    self._check_closed()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

That OSError moves onto like 916, which passes it to asyncio.selector_events._SelectorSocketTransport._fatal_error, which tries to log it if it's an OSError and there's a loop with debug, then eventually it bombs through. It's actually just the single raise in jasyncio which saves it.

@juanmanuel-tirado
Copy link
Contributor

@rbarry82 could you provide us with some steps to reproduce this? We would like to discard other potential errors?

@cderici
Copy link
Contributor

cderici commented Apr 21, 2022

Hey @rbarry82, thanks for the explanation, @juanmanuel-tirado told me that this is blocking you, sorry to hear that, let's figure this out and help you guys asap.

So your explanation doesn't make sense to me for several reasons. First of all, what you see here is not an actual error (like, return 1), but just a printout of an internal error on stderr. And the f7552be you mentioned is about handling the exception when the tasks are destroyed, meaning those exceptions we see here are surfaced long after whatever was running is finished, so it shouldn't effect on the result of any tests whatsoever.

That OSError moves onto like 916, which passes it to asyncio.selector_events._SelectorSocketTransport._fatal_error, which tries to log it if it's an OSError and there's a loop with debug, then eventually it bombs through. It's actually just the single raise in jasyncio which saves it.

I don't fully understand what you mean by "OSError moves onto 916", and "there's a loop with debug". It would also help if you could elaborate on that.

In any case, we'd be fine to land this to help you guys out if this is absolutely needed asap, but I suspect that whatever is going on in your tests is misdiagnosed to be related to this and you'll still face it even if this revert is landed.

We'll be paying close attention to this to help you folks unblock asap 👍

@rbarry82
Copy link

rbarry82 commented Apr 21, 2022

This happens every single time a model is destroyed with pytest-operator. It's not specific to the tests (which pass), but in teardown of the model after the tests pass. See here, for example.

What we see here is the exception thrown by Python when you try to do anything to a closed socket. pytest-operator calls juju.model.connect() on the back, which filters up to here, hence the issue. Part of the thing is that every time the fixture is returned via yield, it connects, then it tries to clean up everything done with that object, so it destroys the models, but still.

Sorry for using bare line numbers. I mean that, in asyncio.selector_events.py, we end up here:

   907         if not self._buffer:
   908             # Optimization: try to send now.
   909             try:
   910                 n = self._sock.send(data)  "send" is not a known member of "None"
   911             except (BlockingIOError, InterruptedError):
   912                 pass
   913             except (SystemExit, KeyboardInterrupt):
   914                 raise
   915             except BaseException as exc:
   916                 self._fatal_error(exc, 'Fatal write error on socket transport')
   917                 return
   
   ...
   
      699     def _fatal_error(self, exc, message='Fatal error on transport'):
   700         # Should be called from exception handler only.
   701         if isinstance(exc, OSError):
E  702             if self._loop.get_debug():  Cannot access member "get_debug" for type "None"    Member "get_debug" is unknown
   703                 logger.debug("%r: %s", self, message, exc_info=True)
   704         else:
E  705             self._loop.call_exception_handler({  Cannot access member "call_exception_handler" for type "None"    Member "call_exception_handler" is unknown
   706                 'message': message,
   707                 'exception': exc,
   708                 'transport': self,
   709                 'protocol': self._protocol,
   710             })
   711         self._force_close(exc)
   712
   713     def _force_close(self, exc):
   714         if self._conn_lost:
   715             return
   716         if self._buffer:
   717             self._buffer.clear()
   718             self._loop._remove_writer(self._sock_fd)
   719         if not self._closing:
   720             self._closing = True
   721             self._loop._remove_reader(self._sock_fd) 
   722         self._conn_lost += 1
   723         self._loop.call_soon(self._call_connection_lost, exc)

This is just plain asyncio which... does not handle it.

Honestly, this is trivially reproducible anyway.

# In another window, create some model, start this script, then `juju destroy-model foo -y --destroy-storage`
from juju import jasyncio
from juju.model import Model


async def main():
    model = Model()
    for i in range(0, 3):
        await model.connect_current()
        await model.wait_for_idle(apps=["never"], status="active", timeout=60)

if __name__ == "__main__":
    jasyncio.run(main())

💣 💣 OSError: [Errno 9] ... and $? == 1

First you'll see a DeadEntityException, though.

Error in watcher
Traceback (most recent call last):
  File "/home/rbarry/charms/python-libjuju/juju/model.py", line 1125, in _all_watcher
    _post_step(new_obj)
  File "/home/rbarry/charms/python-libjuju/juju/model.py", line 1053, in _post_step
    if isinstance(obj, ModelInfo) and obj.safe_data is not None:
  File "/home/rbarry/charms/python-libjuju/juju/model.py", line 371, in safe_data
    raise DeadEntityException(
juju.exceptions.DeadEntityException: Entity model:f6c6501a-b8db-432f-801e-21f58616dc11 is dead - its attributes can no longer be accessed. Use the .previous() method on this object to get a copy of the object at its previous state.
Task exception was never retrieved
future: <Task finished name='Task-13' coro=<Model._watch.<locals>._all_watcher() done, defined at /home/rbarry/charms/python-libjuju/juju/model.py:1058> exception=DeadEntityException('Entity model:f6c6501a-b8db-432f-801e-21f58616dc11 is dead - its attributes can no longer be accessed. Use the .previous() method on this object to get a copy of the object at its previous state.')>
Traceback (most recent call last):
  File "/home/rbarry/charms/python-libjuju/juju/model.py", line 1125, in _all_watcher
    _post_step(new_obj)
  File "/home/rbarry/charms/python-libjuju/juju/model.py", line 1053, in _post_step
    if isinstance(obj, ModelInfo) and obj.safe_data is not None:
  File "/home/rbarry/charms/python-libjuju/juju/model.py", line 371, in safe_data
    raise DeadEntityException(
juju.exceptions.DeadEntityException: Entity model:f6c6501a-b8db-432f-801e-21f58616dc11 is dead - its attributes can no longer be accessed. Use the .previous() method on this object to get a copy of the object at its previous state.
Traceback (most recent call last):
  File "foo.py", line 21, in <module>
    jasyncio.run(main())
  File "/home/rbarry/charms/python-libjuju/juju/jasyncio.py", line 118, in run
    raise task.exception()
  File "foo.py", line 17, in main
    await model.wait_for_idle(apps=["never"], status="active", timeout=60)
  File "/home/rbarry/charms/python-libjuju/juju/model.py", line 2517, in wait_for_idle
    raise jasyncio.TimeoutError("Timed out waiting for model:\n" + busy)
asyncio.exceptions.TimeoutError: Timed out waiting for model:
never (missing)
Task was destroyed but it is pending!
task: <Task pending name='Task-12' coro=<Connection._pinger() running at /home/rbarry/charms/python-libjuju/juju/client/connection.py:572> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6fa89a1430>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-7' coro=<Connection._receiver() running at /home/rbarry/charms/python-libjuju/juju/client/connection.py:524> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6fa898dc70>()]>>
Task was destroyed but it is pending!
task: <Task pending name='Task-5' coro=<WebSocketCommonProtocol.keepalive_ping() running at /home/rbarry/charms/python-libjuju/venv/lib/python3.8/site-packages/websockets/protocol.py:977> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6fa89a1400>()]>>
Fatal error on SSL transport
protocol: <asyncio.sslproto.SSLProtocol object at 0x7f6fa89d2760>
transport: <_SelectorSocketTransport closing fd=9>
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 910, in write
    n = self._sock.send(data)
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/sslproto.py", line 685, in _process_write_backlog
    self._transport.write(chunk)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 916, in write
    self._fatal_error(exc, 'Fatal write error on socket transport')
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 711, in _fatal_error
    self._force_close(exc)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 723, in _force_close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 719, in call_soon
    self._check_closed()
  File "/usr/lib/python3.8/asyncio/base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
Task was destroyed but it is pending!
task: <Task pending name='Task-4' coro=<WebSocketCommonProtocol.transfer_data() running at /home/rbarry/charms/python-libjuju/venv/lib/python3.8/site-packages/websockets/protocol.py:674> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6fa89a13a0>()]> cb=[<TaskWakeupMethWrapper object at 0x7f6fa89d2b50>(), _wait.<locals>._on_completion() at /usr/lib/python3.8/asyncio/tasks.py:518]>
Task was destroyed but it is pending!
task: <Task pending name='Task-6' coro=<WebSocketCommonProtocol.close_connection() running at /home/rbarry/charms/python-libjuju/venv/lib/python3.8/site-packages/websockets/protocol.py:1019> wait_for=<Task pending name='Task-4' coro=<WebSocketCommonProtocol.transfer_data() running at /home/rbarry/charms/python-libjuju/venv/lib/python3.8/site-packages/websockets/protocol.py:674> wait_for=<Future pending cb=[<TaskWakeupMethW
rapper object at 0x7f6fa89a13a0>()]> cb=[<TaskWakeupMethWrapper object at 0x7f6fa89d2b50>(), _wait.<locals>._on_completion() at /usr/lib/python3.8/asyncio/tasks.py:518]>>
Exception ignored in: <generator object WebSocketCommonProtocol.close_connection at 0x7f6fa89f37b0>
Traceback (most recent call last):
  File "/home/rbarry/charms/python-libjuju/venv/lib/python3.8/site-packages/websockets/protocol.py", line 1056, in close_connection
  File "/home/rbarry/charms/python-libjuju/venv/lib/python3.8/site-packages/websockets/protocol.py", line 1077, in wait_for_connection_lost
  File "/usr/lib/python3.8/asyncio/tasks.py", line 472, in wait_for
  File "/usr/lib/python3.8/asyncio/base_events.py", line 687, in call_later
  File "/usr/lib/python3.8/asyncio/base_events.py", line 698, in call_at
  File "/usr/lib/python3.8/asyncio/base_events.py", line 508, in _check_closed
RuntimeError: Event loop is closed
Task was destroyed but it is pending!
task: <Task pending name='Task-66' coro=<WebSocketCommonProtocol.recv() running at /home/rbarry/charms/python-libjuju/venv/lib/python3.8/site-packages/websockets/protocol.py:416> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6fa898d5b0>()]> cb=[create_task_with_handler.<locals>._task_result_exp_handler(task_name='tmp', logger=<Logger juju....ion (WARNING)>)() at /home/rbarry/charms/python-libjuju/juju/j
asyncio.py:64]>
Task was destroyed but it is pending!
task: <Task pending name='Task-67' coro=<Event.wait() running at /usr/lib/python3.8/asyncio/locks.py:309> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6fa89a1490>()]>>

@cderici
Copy link
Contributor

cderici commented Apr 21, 2022

Hey @rbarry82, thanks for the example, this is exactly what I was talking about. First, I agree with you, that it looks like the attempt to reduce the background noise (f7552be) is not doing a proper job, as a result, we're getting a bunch of logs about the Task errors.

However, in your example, you are getting a DeadEntityException because we're destroying the model in the background in your scenario, in which case we should get that exception, and that returning 1 you report ($? == 1) is coming from that exception, not from the OSError you see in the logs.

Regardless, I agree with you for the first reason I mentioned, that we should land this PR and revert that change, as it appears to be not doing the job well. So I'm gonna go ahead and give this a 👍 . However, I still doubt that reverting this will stop you from getting an error, or those Task exception logs. I hope I'm wrong, but let's at least unblock you guys. If things persist, feel free to open an issue and we'll go from there 👍

@rbarry82
Copy link

It does, but please see here, which comes from this fixture calling _cleanup_models(), which tries to destroy models through the controller, and starts the exception chain in the first link, failing the tests.

That throws the websocket connection exception back at pytest-operator.

awaiting the "forever taks" in juju.connection.close() would probably close the issue this was originally intended to solve. Something like:

async def close(self, to_reconnect=False):
    ...
    
    active_tasks = [t for t in self._pinger_task, self._receiver_task, self._debug_log_task if t]
    map(lambda t: t.cancel(), active_tasks)
    for t in active_tasks:
        try:
            await t
        except asyncio.CancelledError:
            logger.debug("Task was succesfully cancelled: %s", t)

@simskij
Copy link
Contributor Author

simskij commented Apr 25, 2022

I'm not trusted to merge this, but the PR should be up-to-date now at least. 😅

@cderici
Copy link
Contributor

cderici commented Apr 25, 2022

$$merge$$

@jameinel jameinel merged commit 8c938c9 into juju:master Apr 25, 2022
@simskij simskij deleted the revert-python-libjuju branch April 25, 2022 20:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants