Improve reconnect logic #23
Conversation
Happy Easter, and thank you very much for a patch in such a short time.
First, I have to apologize: I was wrong when I stated that there is no reconnect logic. There is (or was) one in aiocomfoconnect/main.py, but to my knowledge it is only used by the CLI. It was also based on a different exception type than the one triggering this issue (IncompleteReadError, which also seems to have been imported incorrectly?).
Disclaimer: my Python skills are much lower than yours, so please take the following comments only as my best effort to help with the issue.
The asyncio.exceptions.IncompleteReadError exception is now caught in _process_message() and passed on to _read_messages() as an AioComfoConnectNotConnected exception. It is caught in bridge.py/_read_messages(), but I am confused by the syntax:
    except AioComfoConnectNotConnected:
        # We have been disconnected
        raise
i.e. it is re-raised without any explicit reference back to _connect(). Is this implicit syntax equivalent to the explicit one?
    except AioComfoConnectNotConnected as exc:
        # We have been disconnected
        raise exc
Anyway, the exception will cause bridge.py/_connect() to be exited, raising back into comfoconnect.py/_reconnect_loop(), where it is caught and the loop simply continues with its next iteration, thereby solving the disconnect issue. So to me, with my limited Python skills, this sounds solid.
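If it helps, this is how I convinced myself that both forms behave the same: the bare raise re-raises the exception currently being handled, traceback included. A minimal, self-contained illustration (the function bodies are simplified stand-ins, not the actual library code):

```python
import asyncio


class AioComfoConnectNotConnected(Exception):
    """Stand-in for the library's exception discussed above."""


async def _process_message():
    # Simplified stand-in: pretend the read hit an IncompleteReadError
    # and was translated into the library exception.
    raise AioComfoConnectNotConnected()


async def _read_messages():
    try:
        await _process_message()
    except AioComfoConnectNotConnected:
        # A bare `raise` re-raises the exception that is currently being
        # handled, original traceback included, so it propagates to the
        # caller just like `except ... as exc: raise exc` would.
        raise


async def main():
    try:
        await _read_messages()
    except AioComfoConnectNotConnected:
        print("caught by the caller; the reconnect loop would continue here")


asyncio.run(main())
```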
- Regarding the disconnect during _send() and the probable cause of issue #40: the check that the connection is not stale is still the same in bridge.py/_send():

    if not self.is_connected():
        raise AioComfoConnectNotConnected()

Such a check was not sufficient in the past to detect the situation where asyncio raised IncompleteReadError during _read(). I understand that the connection will now at some point be re-established from within the reconnect loop, hopefully soon, but without a _disconnect() (i.e. asyncio writer.close()) being triggered on the read error, this check will not catch the connection failure (see also one paragraph lower).
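If a _disconnect() / writer.close() on read error were added, I imagine it could look roughly like this. This is only a sketch based on my own assumptions; the attribute names (_reader, _writer) and method bodies are not the actual bridge.py code:

```python
import asyncio
from typing import Optional


class AioComfoConnectNotConnected(Exception):
    """Stand-in for the library exception."""


class BridgeSketch:
    """Illustrative fragment only; not the real bridge.py implementation."""

    def __init__(self) -> None:
        self._reader: Optional[asyncio.StreamReader] = None
        self._writer: Optional[asyncio.StreamWriter] = None

    def is_connected(self) -> bool:
        # The connection counts as stale once the writer is closed (or was never opened).
        return self._writer is not None and not self._writer.is_closing()

    async def _process_message(self) -> None:
        assert self._reader is not None
        # readexactly() raises asyncio.IncompleteReadError when the peer closes mid-message.
        await self._reader.readexactly(4)

    async def _read_messages(self) -> None:
        try:
            while True:
                await self._process_message()
        except asyncio.IncompleteReadError:
            # Close the writer *before* re-raising, so a concurrent _send()
            # sees is_connected() == False instead of writing into a dead socket.
            if self._writer is not None:
                self._writer.close()
            raise AioComfoConnectNotConnected()
```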
On the other hand, adding self._disconnect() to the _send() timeout handling could IMHO now potentially lead to a race condition with _connect() within _reconnect_loop(), as _send() and _process_message() are executed asynchronously. To my understanding, _disconnect() should never be called while _connect() could already be running. I do not know how to implement this effectively without a mutex, and that could open a different can of worms. Maybe the asyncio writer has such mutual exclusivity built in?
Also, you removed the reference to _disconnect() from within _connect(), probably when renaming bridge.py/disconnect() to _disconnect(). Now it is not called at all, which IMHO it should be (to clear resources and mark the connection as stale for _send()):

bridge.py/_connect():

    async def _connect(self, uuid: str):
        """Connect to the bridge."""
    +   await self._disconnect()
In addition, it would be great if _send() could also trigger a reconnect on timeout (at least try once) and then resend the given command after the connection is re-established. A rough sketch of what I mean follows below.
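To make that suggestion a bit more concrete, a retry-once wrapper could look something like the following. This is purely hypothetical; the names, the 5-second window and the 0.5-second poll interval are placeholders, and it is not how the library currently behaves:

```python
import asyncio


class AioComfoConnectNotConnected(Exception):
    """Stand-in for the library exception."""


async def send_with_retry(bridge, command, retry_window: float = 5.0):
    """Hypothetical wrapper: if sending fails because the connection is gone,
    give the reconnect loop a short window to re-establish it and retry once."""
    try:
        return await bridge._send(command)
    except AioComfoConnectNotConnected:
        loop = asyncio.get_running_loop()
        deadline = loop.time() + retry_window
        while not bridge.is_connected():
            if loop.time() >= deadline:
                raise  # still not reconnected, propagate the original error
            await asyncio.sleep(0.5)
        return await bridge._send(command)  # retry exactly once
```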
Just some other random thoughts:
- main.py: run_show_sensor and run_show_sensors have some code duplication. Could they be implemented using a shared/common function? (A rough sketch follows after this list.)
- main.py: the follow argument seems to be implemented only for run_show_sensor and not for run_show_sensors; would it be useful there as well?
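On the duplication point, something along these lines might work. It is only a rough sketch; the real signatures and the sensor registration/printing logic in main.py will differ, and ALL_SENSOR_IDS is a placeholder for whatever sensor list main.py already uses:

```python
ALL_SENSOR_IDS: list = []  # placeholder; main.py would use its existing sensor definitions


async def _show_sensors(comfoconnect, sensor_ids, follow: bool = False):
    """Hypothetical shared helper: register the given sensors and print their
    values, either once or continuously when follow=True."""
    for sensor_id in sensor_ids:
        # the existing register/print logic from main.py would go here
        ...
    if follow:
        # keep printing updates until cancelled
        ...


async def run_show_sensor(comfoconnect, sensor_id, follow: bool = False):
    await _show_sensors(comfoconnect, [sensor_id], follow=follow)


async def run_show_sensors(comfoconnect, follow: bool = False):
    # Passing follow here would also cover the second point above.
    await _show_sensors(comfoconnect, ALL_SENSOR_IDS, follow=follow)
```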
Thanks!
Correct. That's why the Home Assistant integration will also need a small change (to not connect again when the keepalive has failed).
You overestimate my Python skills :) Especially asyncio is still a bit "magic" sometimes.
This is indeed the same. Note that I'm not handling the exception, so I could just leave out the block, but for clarity I've left it there so that it's visible that we do know about the AioComfoConnectNotConnected exception.
Important here is that […]. Note that the […]. I also have the […].
I've noticed that a disconnect wasn't actually picked up in […]. To make sure we don't keep on writing to a closed connection, I've kept the […].
Hmm, you might be right about this, although when we got disconnected inside the […]. I'm not sure how to properly test this, though.
I don't think that was necessary; the only thing […]. But this might need some more testing, I'm not 100% sure about this.
This is hard to do, since _send() is triggered by an actual call made by the application, and it can't communicate with my _reconnect_loop to do a reconnection. That's why we simply disconnect, so the read loop can pick this up.
The default of […]. I'm not happy with what […].
Hello, thank you for the detailed explanations.

Based on my logs, the connection is lost every 5 minutes for about 17 seconds. As the fix only helps to reconnect (we do not yet know why the connection drops regularly), the time to reconnect is critical. As the proposed solution is based on the timeouts of the asyncio read or of _send(), I am afraid it would still leave a window with no active connection that is proportionally significant compared to the 5-minute disconnection interval. IMHO, if _send() is called during such a window, it would still fail on timeout and trigger an HA error / exception.

You mentioned that _disconnect() was added to _send() to "...make sure the read loop gets the exception, causing the reconnect logic to kick in...". I am afraid that in such a case the reconnect loop only kicks in once the _disconnect() causes some read to fail on the connection closed by _send(), and this is neither predictable nor fast enough. Or am I missing something?

Thinking more about how to minimize the reconnect delay, I am more inclined towards a completely independent async reconnect loop, which would only check the connection state using _is_connected(). This loop would run with a minimal sleep between cycles and call _connect() whenever there is no active connection, to re-establish it. In addition, if any read or write command failed, it would immediately call _disconnect() to signal to the reconnect loop that there is no active connection anymore. (A rough sketch of this idea follows below.)

With such logic, implementing a retry in _send() should be simple: if the _is_connected() check at the top of _send() fails, just wait a little and retry the check. If there is no connection right now, a new one will probably be established by the reconnect loop shortly, so why not retry at least once? This is a simpler solution, as you do not need to implement a retry on the _send() timeout, only on the _is_connected() check. It is not bullet-proof, but it decreases the probability of _send() failing.

Again, without some additional logic there is potential for a race condition. Maybe store a timestamp (in microseconds) of when the connection was established and disallow _disconnect() on a connection that is younger than the initiation timestamp of the _send() request that is currently timing out. A crude solution, but you would probably come up with something better. 😉

Anyway, do you think it would be possible for you to release a beta, so I can help with testing it on my setup? That would probably be more helpful than my theoretical fairy tales. 😄 Thank you.
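To sketch the independent reconnect loop I have in mind (hypothetical; is_connected()/_connect() follow the names used in this discussion, while the function itself, the uuid parameter and the 1-second interval are placeholders):

```python
import asyncio


async def connection_watchdog(bridge, uuid: str, check_interval: float = 1.0):
    """Hypothetical watchdog: poll the connection state and re-establish the
    connection as soon as it is gone, independently of the read loop or of
    any pending _send()."""
    while True:
        if not bridge.is_connected():
            try:
                await bridge._connect(uuid)
            except OSError:
                # Bridge still unreachable; try again on the next cycle.
                pass
        await asyncio.sleep(check_interval)
```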
I can say the same. Thanks for taking the time to look at my comments and changes!
I don't see the 5-minute drop, but I did read that in some firmware versions (I'm not sure what version you have on your bridge), doing a DHCP renew causes the connection to be dropped. You might want to increase your DHCP lease to see if this improves your timings?

You are correct that sending a command while the connection is gone will trigger an exception towards Home Assistant. The goal is thus indeed to detect a disconnect as soon as possible. But... I haven't found a way to detect that the connection is dropped other than when I try to send something (hence the keepalive), although this could also be down to how I test this: I add an iptables rule to drop the traffic. It could be that the bridge actually nicely disconnects or closes the TCP session. If that is picked up in a read, then the reconnection should be almost instantaneous. I'll need to check this, but I'm not sure how to do this.
I don't fully get what you are saying here. I currently don't see a way to detect a non-responding bridge by simply reading. According to the operating system, the TCP session is still alive, so we just didn't get any data. Therefore there will always be a delay between the reconnection attempts until we do a keepalive. If we don't get a reply to our keepalive, we know that the connection has been dropped, so we do the disconnect.
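Roughly, what I mean is something like the following sketch. The send_keepalive() name is only a placeholder for the actual keepalive command, and the timings are arbitrary:

```python
import asyncio


async def keepalive_loop(bridge, interval: float = 30.0, reply_timeout: float = 5.0):
    """Hypothetical illustration of the keepalive idea: the OS still considers
    the TCP session alive, so the only signal we get is a missing keepalive reply."""
    while bridge.is_connected():
        try:
            # send_keepalive() is a placeholder name for the real keepalive command.
            await asyncio.wait_for(bridge.send_keepalive(), timeout=reply_timeout)
        except asyncio.TimeoutError:
            # No reply: treat the connection as dead and let the read loop /
            # reconnect logic take over.
            await bridge._disconnect()
            return
        await asyncio.sleep(interval)
```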
That's the whole problem, the […].
A write won't fail, it will just send the data to the bridge, but the bridge will not respond to it. Same with read, it doesn't know that the connection is gone until the TCP connection is actually correctly disconnected.
I guess this could be added, waiting for the connection to be re-established instead of throwing the Exception back to Home Assistant immediately, although I don't really like this as it adds more complexity to the code.
I don't see how we can get a race condition, unless the connection drops out again a few milliseconds after we actually got connected.
It would indeed be a good idea to get this out somehow, but I want to test a few things first... Since you indicate that your bridge gets disconnected every 5 minutes, can you make a TCP dump of this? I really want to know if the TCP session is disconnected from the bridge when this happens. If it is, things will be a lot simpler.
Hello, I have good news and bad news. ;-) First, thank you for pointing to the potential FW issue.
After some time (it varied between 5 and 25 seconds) an exception was raised and the following messages were logged in close succession: […]
Now the good news and the bad news. It helped, and there are no more issues with the GW stopping communication during the DHCP renew. So no more exceptions in HA. The bad news is that my setup was optimal for testing your connection improvements, and I have lost it. :-o I will write a follow-up on your previous comment as well.