
Improve reconnect logic #23

Merged
merged 3 commits on Apr 2, 2024

Conversation

michaelarnauts
Owner

No description provided.

@litinoveweedle left a comment

Happy Easter, and thank you very much for a patch in such a short time.

First, I have to apologize: I was wrong in stating that there is no reconnect logic - there is/was one in aiocomfoconnect/main.py, but to my knowledge it is used only by the CLI. It was also based on a different exception type than the one that triggered the issue (IncompleteReadError, which also seems to have been incorrectly imported?).

Disclaimer: my Python skills are much lower than yours, so please take my following comments only as my best effort to help with the issue.

  1. In the error case described in issues #33, #36:

The asyncio.exceptions.IncompleteReadError exception is now caught in _process_message() and passed on to _read_messages() as an AioComfoConnectNotConnected exception. It is caught in bridge.py/_read_messages(), but I am confused by the syntax:

                except AioComfoConnectNotConnected:
                    # We have been disconnected
                    raise

i.e. re-raising it without any explicit context back to _connect(). Is this implicit syntax equivalent to the explicit one?

                except AioComfoConnectNotConnected as exc:
                    # We have been disconnected
                    raise exc

Anyway, the exception will cause bridge.py/_connect() to be exited, raising back into comfoconnect.py/_reconnect_loop(), where it is caught and simply continues to the next iteration of the reconnect loop, thereby solving the disconnect issue. So to me, with my limited Python skills, this sounds solid.

  2. Regarding the disconnect during _send() and the probable cause of issue #40: the check whether the connection is stale is still the same:

bridge.py/_send()

        if not self.is_connected():
            raise AioComfoConnectNotConnected()

Such a check was not sufficient in the past to detect the situation where asyncio raised IncompleteReadError during _read(). I understand that the connection will now be re-established from within the reconnect loop after some time, hopefully soon, but without any _disconnect() (i.e. asyncio writer.close()) triggered on a read error, this check will not catch this connection failure (see also one paragraph lower).

On the other hand, adding self._disconnect() to the _send() timeout could IMHO now potentially lead to a race condition with _connect() within _reconnect_loop(), since _send() and _process_message() are executed asynchronously. To my understanding, _disconnect() should never be called if _connect() could already be running. I do not know how to implement this effectively without a mutex, and that can open a different can of worms. Maybe the asyncio writer has such mutual exclusivity built in?
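
For illustration, I imagine something like an asyncio.Lock could serialize the two; the following is only a rough sketch under my own assumptions, not based on the current code:

    import asyncio

    class ConnectionGuard:
        """Hypothetical sketch: serialize connect/disconnect so they never overlap."""

        def __init__(self):
            self._lock = asyncio.Lock()
            self._writer = None  # placeholder for an asyncio.StreamWriter

        async def _connect(self, host: str, port: int):
            async with self._lock:
                # Any _disconnect() started earlier has finished before we get here.
                _reader, self._writer = await asyncio.open_connection(host, port)

        async def _disconnect(self):
            async with self._lock:
                if self._writer is not None:
                    self._writer.close()
                    await self._writer.wait_closed()
                    self._writer = None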

Also, you removed the reference to _disconnect() from within _connect(), probably when renaming bridge.py/disconnect() to _disconnect(). Now it is not called at all, which IMHO it should be (to clean up resources and mark the connection as stale for _send()):

bridge.py/_connect():

    async def _connect(self, uuid: str):
        """Connect to the bridge."""
+        await self._disconnect()

In addition, it would be great if _send() could also trigger a reconnect on timeout (at least try once) and then resend the given command after the connection is re-established.

Just some other random thoughts:

  1. main.py's run_show_sensor and run_show_sensors have some level of code duplication. Could they be implemented using a shared/common function?

  2. main.py's follow argument seems to be implemented only for run_show_sensor and not for run_show_sensors; would this be useful?

@michaelarnauts
Owner Author

michaelarnauts commented Mar 30, 2024

Happy Easter, and thank you very much for a patch in such a short time.

Thanks!

First, I have to apologize: I was wrong in stating that there is no reconnect logic - there is/was one in aiocomfoconnect/main.py, but to my knowledge it is used only by the CLI. It was also based on a different exception type than the one that triggered the issue (IncompleteReadError, which also seems to have been incorrectly imported?).

Correct, __main__.py is only used for the CLI application, so there was no real reconnect logic in the library itself; that was done in the Home Assistant integration.

That's why the Home Assistant integration also will need a small change (to not connect again when the keepalive has failed).

Disclaimer: my Python skills are much lower than yours, so please take my following comments only as my best effort to help with the issue.

You overestimate my Python skills :) Especially asyncio is still a bit "magic" sometimes.

  1. In the error case described in issues #33, #36:

The asyncio.exceptions.IncompleteReadError exception is now caught in _process_message() and passed on to _read_messages() as an AioComfoConnectNotConnected exception. It is caught in bridge.py/_read_messages(), but I am confused by the syntax:

                except AioComfoConnectNotConnected:
                    # We have been disconnected
                    raise

i.e. re-raising it without any explicit context back to _connect(). Is this implicit syntax equivalent to the explicit one?

                except AioComfoConnectNotConnected as exc:
                    # We have been disconnected
                    raise exc

This is indeed the same. Note that I'm not handling the exception, so I could just leave out the block, but for clarity I've left it there so it's visible that we do know about the AioComfoConnectNotConnected exception and that it's handled further down the line.
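
For reference, a tiny standalone example (not the library code) that shows both forms behave the same; a bare raise simply re-raises the exception that is currently being handled:

    # Minimal illustration: a bare `raise` inside an `except` block re-raises the
    # exception currently being handled, so both forms below propagate the same
    # exception object to the caller.
    class AioComfoConnectNotConnected(Exception):
        """Stand-in for the library's exception."""

    def implicit():
        try:
            raise AioComfoConnectNotConnected("disconnected")
        except AioComfoConnectNotConnected:
            # We have been disconnected
            raise  # re-raises the active exception unchanged

    def explicit():
        try:
            raise AioComfoConnectNotConnected("disconnected")
        except AioComfoConnectNotConnected as exc:
            # We have been disconnected
            raise exc  # same exception object, equivalent behaviour here

    for fn in (implicit, explicit):
        try:
            fn()
        except AioComfoConnectNotConnected as exc:
            print(f"{fn.__name__}: caught {exc!r}")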

Anyway, the exception will cause bridge.py/_connect() to be exited, raising back into comfoconnect.py/_reconnect_loop(), where it is caught and simply continues to the next iteration of the reconnect loop, thereby solving the disconnect issue. So to me, with my limited Python skills, this sounds solid.

Important here is that _connect() in bridge.py exits early. When it is called, we make a connection, start the _read_messages background loop to process inbound messages, and exit the method. This is needed because, when we return to the _reconnect_loop in connect() in comfoconnect.py, we log in to the bridge by calling cmd_start_session(), and this command needs a reply, so the _read_messages background loop needs to be running already at that point.

The _connect() method does return a reference to the _read_messages background read task, and when this is exited (because we are shutting down, or because we caught an AioComfoConnectNotConnected due to an incomplete read in _process_message), we do another cycle in the _reconnect_loop to reconnect.

Note that the _reconnect_loop is also run in the background, so it does not block the call to connect() itself.

I also have the connected future, which completes as soon as we are connected and logged in (and so ready to process actual messages). That's why we await connected at the end of connect().
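
To sketch roughly how these pieces fit together (heavily simplified, with the details assumed rather than copied from the actual code):

    import asyncio

    class AioComfoConnectNotConnected(Exception):
        """Stand-in for the library's exception."""

    class BridgeSketch:
        """Rough sketch of the flow described above; not the real implementation."""

        async def connect(self):
            self._connected = asyncio.get_running_loop().create_future()
            self._reconnect_task = asyncio.create_task(self._reconnect_loop())
            await self._connected  # resolves once we are connected and logged in

        async def _reconnect_loop(self):
            while True:
                read_task = await self._connect()  # opens the socket, starts _read_messages()
                await self._cmd_start_session()    # needs _read_messages() running for its reply
                if not self._connected.done():
                    self._connected.set_result(True)
                try:
                    await read_task                # runs until shutdown or a disconnect
                    return                         # clean shutdown
                except AioComfoConnectNotConnected:
                    continue                       # reconnect on the next iteration

        async def _connect(self):
            # Placeholder: open the TCP connection and return the background read task.
            return asyncio.create_task(self._read_messages())

        async def _read_messages(self):
            """Placeholder for the background read loop."""

        async def _cmd_start_session(self):
            """Placeholder for the login command."""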

  2. Regarding the disconnect during _send() and the probable cause of issue #40: the check whether the connection is stale is still the same:

bridge.py/_send()

        if not self.is_connected():
            raise AioComfoConnectNotConnected()

Such a check was not sufficient in the past to detect the situation where asyncio raised IncompleteReadError during _read(). I understand that the connection will now be re-established from within the reconnect loop after some time, hopefully soon, but without any _disconnect() (i.e. asyncio writer.close()) triggered on a read error, this check will not catch this connection failure (see also one paragraph lower).

I've noticed that a disconnect wasn't actually picked up in _read() (at least during my testing yesterday); it just kept reading. That's why _send() does a _disconnect() when it gets a timeout for a command (it can be any command, not just keepalives). This then makes sure the read loop gets the exception, causing the reconnect logic to kick in (as described above).

To make sure we don't keep on writing to a closed connection, I've kept the if not self.is_connected() check at the top of _send(). Note that this exception doesn't trigger a reconnect; it isn't handled and is propagated back to the application down the line.
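
In (simplified, partly assumed) code, the idea looks roughly like this; _register_reply() is a made-up placeholder for however the reply future is tracked:

    import asyncio

    TIMEOUT = 5  # seconds; value assumed for illustration

    class AioComfoConnectNotConnected(Exception):
        """Stand-in for the library's exception."""

    class BridgeSendSketch:
        """Rough sketch of the _send() behaviour described above; not the real code."""

        async def _send(self, command: bytes):
            if not self.is_connected():
                # Not handled internally; propagates back to the application.
                raise AioComfoConnectNotConnected()

            fut = self._register_reply(command)  # hypothetical helper: future resolved by the read loop
            self._writer.write(command)
            try:
                return await asyncio.wait_for(fut, TIMEOUT)
            except asyncio.TimeoutError:
                # No reply in time (keepalive or any other command): close the socket so the
                # read loop hits IncompleteReadError and the reconnect logic kicks in.
                await self._disconnect()
                raise

        # is_connected(), _register_reply(), _writer and _disconnect() are assumed to exist
        # on the real bridge object and are omitted here.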

On the other hand, adding self._disconnect() to the _send() timeout could IMHO now potentially lead to a race condition with _connect() within _reconnect_loop(), since _send() and _process_message() are executed asynchronously. To my understanding, _disconnect() should never be called if _connect() could already be running. I do not know how to implement this effectively without a mutex, and that can open a different can of worms. Maybe the asyncio writer has such mutual exclusivity built in?

Hmm, you might be right about this, although when we get disconnected inside the _reconnect_loop, we will get an AioComfoConnectNotConnected from the read task anyway and will retry connecting in the next loop iteration.

I'm not sure how to properly test this though.

Also, you removed the reference to _disconnect() from within _connect(), probably when renaming bridge.py/disconnect() to _disconnect(). Now it is not called at all, which IMHO it should be (to clean up resources and mark the connection as stale for _send()):

I don't think that was necessary; the only thing _disconnect() does is actually disconnect the socket, there is no cleaning up to be done anymore. I think there isn't any scenario anymore where we end up in _connect() while still having an open socket. And even if we do, I think Python will disconnect, since we lose the reference to the old self._reader and self._writer by opening a new connection.

But, this might need some more testing. I'm not 100% sure about this.

bridge.py/_connect():

    async def _connect(self, uuid: str):
        """Connect to the bridge."""
+        await self._disconnect()

In addition, it would be great if _send() could also trigger a reconnect on timeout (at least try once) and then resend the given command after the connection is re-established.

This is hard to do, since _send() is triggered by an actual call made by the application, and it can't communicate with my _reconnect_loop to do a reconnection. That's why we simply disconnect, so the read loop can pick this up.

Just some other random thoughts:

  1. main.py's run_show_sensor and run_show_sensors have some level of code duplication. Could they be implemented using a shared/common function?
  2. main.py's follow argument seems to be implemented only for run_show_sensor and not for run_show_sensors; would this be useful?

The default of run_show_sensor is to get the value of a specific sensor and return; run_show_sensors always follows (i.e. keeps the connection open and shows new sensor values).

I'm not happy with how __main__.py turned out. It used to be a simple CLI tool for some tests, but some people like to use it in other tools or shell scripts. It should probably be rewritten :)

@litinoveweedle

Hello,

Thank you for the detailed explanations.

  1. I now hopefully fully understand your implementation, and IMHO it should solve the reconnection issue.

  2. I would still consider adding an explicit _disconnect() to _process_message() just before raising the AioComfoConnectNotConnected exception. This would help to minimize the window for a potential race condition from _send(), as it would not pass the _is_connected() check.

Based on my logs, the connection is lost every 5 minutes for about 17 seconds. As the fix only helps to reconnect (we do not yet know why the connection drops regularly), the time to reconnect is critical. As the proposed solution is based on the timeouts of the asyncio read or _send(), I am afraid it would still leave a window with no active connection that is proportionally significant compared to the 5-minute disconnection interval. IMHO, if _send() is called during such a window, it would still fail on timeout and trigger an HA error/exception.

You mentioned that _disconnect() was added to _send() to "...make sure the read loop gets the exception, causing the reconnect logic to kick in...". I am afraid that in such a case the reconnect loop only kicks in when _disconnect() causes some read to fail on the connection closed by _send(). And this is neither predictable nor fast enough. Or am I missing something?

Thinking more about how to minimize such a reconnect delay, I am more inclined to have a completely independent async reconnect loop that would only check the connection state using _is_connected(). This loop should be executed with a minimal sleep between cycles, calling _connect() to re-establish the connection if there is no active connection.

In addition, if any read or write command fails, it should immediately call _disconnect() to signal to the reconnect loop that there is no active connection anymore.

Using such logic, implementing a retry in _send() should be simple: if the _is_connected() check at the top of _send() fails, just wait a little bit and retry the check after some time. If there is no connection now, a new one would probably be established by the reconnect loop in a short time, so why not retry at least once? This is a simpler solution, as you do not need to implement a retry on the _send() timeout, only on the _is_connected() check. It is not bulletproof, but it decreases the probability of _send() failing.
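
In rough code, what I have in mind is something like this (only a sketch; bridge, is_connected() and _send() are just the names from the discussion above):

    import asyncio

    RETRY_DELAY = 1.0  # seconds; value assumed for illustration

    async def send_with_retry(bridge, command):
        """Sketch of the idea: give the reconnect loop a chance before giving up."""
        if not bridge.is_connected():
            # Wait briefly so the reconnect loop can re-establish the connection,
            # then let _send() perform its own is_connected() check again.
            await asyncio.sleep(RETRY_DELAY)
        return await bridge._send(command)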

Again, without some additional logic there is potential for a race condition. Maybe store a timestamp (in µs) of when the connection was established and disallow _disconnect() on a connection that is younger than the initiation timestamp of the failing _send() request, which is now timing out. A crude solution, but you would probably come up with something better. 😉

Anyway, do you think it would be possible for you to release a beta, so I can help with testing it on my setup? This would probably be more helpful than my theoretical fairy tales. 😄 Thank you.

@michaelarnauts
Owner Author

Hello,

Thank you for the detailed explanations.

I can say the same. Thanks for taking the time to look at my comments and changes!

  1. I now hopefully fully understand your implementation, and IMHO it should solve the reconnection issue.
  2. I would still consider adding an explicit _disconnect() to _process_message() just before raising the AioComfoConnectNotConnected exception. This would help to minimize the window for a potential race condition from _send(), as it would not pass the _is_connected() check.

Based on my logs, the connection is lost every 5 minutes for about 17 seconds. As the fix only helps to reconnect (we do not yet know why the connection drops regularly), the time to reconnect is critical. As the proposed solution is based on the timeouts of the asyncio read or _send(), I am afraid it would still leave a window with no active connection that is proportionally significant compared to the 5-minute disconnection interval. IMHO, if _send() is called during such a window, it would still fail on timeout and trigger an HA error/exception.

I don't see the 5-minute drop, but I did read that in some firmware versions (I'm not sure which version you have on your bridge), doing a DHCP renew causes the connection to be dropped. You might want to increase your DHCP lease time to see if this improves your timings?

You are correct that sending a command while the connection is gone will trigger an exception towards Home Assistant. The goal is thus indeed to detect a disconnect as soon as possible.

But... I haven't found a way to detect that the connection is dropped other than when I try to send something (hence the keepalive), although this could also be due to how I test it: I add an iptables rule to drop the traffic. It could be that the bridge actually disconnects nicely and closes the TCP session. If this is picked up in a read, then the reconnection should be almost instantaneous. I'll need to check this, but I'm not sure how to do that.

You mentioned that _disconnect() was added to _send() to "...make sure the read loop gets the exception, causing the reconnect logic to kick in...". I am afraid that in such a case the reconnect loop only kicks in when _disconnect() causes some read to fail on the connection closed by _send(). And this is neither predictable nor fast enough. Or am I missing something?

I don't fully get what you are saying here. I currently don't see a way to detect a non-responding bridge by simply reading. According to the operating system, the TCP session is still alive, so we just didn't get any data. Therefore there will always be a delay before the reconnection happens, until we do a keepalive. If we don't get a reply to our keepalive, we know that the connection has been dropped, so we do the _disconnect() to actually close the TCP session. It's from this point that we can actually detect this in our read loop (we get the IncompleteReadError when this happens) and reconnect.
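
Schematically (heavily simplified; the names on bridge and the exact timeout behaviour are my assumptions, not the actual code):

    import asyncio

    KEEPALIVE_INTERVAL = 30  # seconds; value assumed for illustration

    async def keepalive_loop(bridge):
        """Sketch of the detection path described above."""
        while True:
            await asyncio.sleep(KEEPALIVE_INTERVAL)
            try:
                # cmd_keepalive() goes through _send(), which waits for a reply with a timeout.
                await bridge.cmd_keepalive()
            except asyncio.TimeoutError:
                # _send() has already called _disconnect() at this point, so the read loop
                # will hit IncompleteReadError and the reconnect logic takes over.
                return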

Thinking more about how to minimize such a reconnect delay, I am more inclined to have a completely independent async reconnect loop that would only check the connection state using _is_connected(). This loop should be executed with a minimal sleep between cycles, calling _connect() to re-establish the connection if there is no active connection.

That's the whole problem: _is_connected() doesn't pick up that we got disconnected. It's the explicit _disconnect() call, made when we notice that we don't get a reply to the keepalive, that makes sure we can detect this.

In addition, if any read or write command fails, it should immediately call _disconnect() to signal to the reconnect loop that there is no active connection anymore.

A write won't fail; it will just send the data to the bridge, but the bridge will not respond to it. The same goes for a read: it doesn't know that the connection is gone until the TCP connection is actually properly closed.

Using such logic, implementing a retry in _send() should be simple: if the _is_connected() check at the top of _send() fails, just wait a little bit and retry the check after some time. If there is no connection now, a new one would probably be established by the reconnect loop in a short time, so why not retry at least once? This is a simpler solution, as you do not need to implement a retry on the _send() timeout, only on the _is_connected() check. It is not bulletproof, but it decreases the probability of _send() failing.

I guess this could be added, waiting for the connection to be re-established instead of throwing the exception back to Home Assistant immediately, although I don't really like this, as it adds more complexity to the code.

Again, without some additional logic there is potential for a race condition. Maybe store a timestamp (in µs) of when the connection was established and disallow _disconnect() on a connection that is younger than the initiation timestamp of the failing _send() request, which is now timing out. A crude solution, but you would probably come up with something better. 😉

I don't see how we can get a race condition, unless the connection drops out again a few milliseconds after we actually got connected.

Anyway, do you think it would be possible for you to release a beta, so I can help with testing it on my setup? This would probably be more helpful than my theoretical fairy tales. 😄 Thank you.

It would indeed be a good idea to get this out somehow, but I want to test a few things first...

Since you indicate that your bridge gets disconnected every 5 minutes, can you make a TCP dump of this? I really want to know if the TCP session is disconnected from the bridge's side when this happens. If that is the case, things will be a lot simpler.

@litinoveweedle

Hello,

I have good news and bad news. ;-)

First, thank you for pointing to the potential FW issue.

  • My DHCP lease time is 10 minutes. Exactly at the halfway point (5 minutes), the gateway renews the DHCP lease. At that time, even though the gateway continued to respond to ping without problems, it stopped sending any communication to HA (observed with tcpdump) and this message was logged:

INFO (MainThread) [aiocomfoconnect.bridge] "The connection was closed."

After some time (varying between 5 and 25 seconds) an exception was raised and the following messages were logged in close succession:

WARNING (MainThread) [aiocomfoconnect.bridge] Timeout while waiting for response from bridge
ERROR (MainThread) [homeassistant.helpers.entity] Update for select.comfoairq_balance_mode fails
Traceback (most recent call last):
File "/usr/local/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
return await fut
^^^^^^^^^
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/aiocomfoconnect/bridge.py", line 168, in _send
return await asyncio.wait_for(fut, TIMEOUT)
....
WARNING (MainThread) [aiocomfoconnect.bridge] Timeout while waiting for response from bridge
  • Immediately afterwards, the connection was probably re-established and traffic started flowing again between HA and the gateway.

  • Increasing the DHCP lease time seemed to prolong the disconnection interval, but I didn't test it thoroughly.

  • I've installed the Zehnder ComfoControl phone app (never used it before), but it struggled with the concurrent connections from HA. I disabled the comfoconnect integration in HA, connected the app to the gateway, and it proposed a firmware upgrade. My original version was U1.1.10, the proposed version U1.5.1. The release notes also mentioned: "Improved ComfoConnect operation" and "Improved network stability".

Now the good news and the bad news. It helped, and there are no more issues with the GW ceasing to communicate during the DHCP renew. So no more exceptions in HA. The bad news is that my setup was optimal for testing your connection improvements, and I have lost it. :-o

I will write a follow-up on your previous comment as well.
