Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

websocket-client socket mode client doesn't handle reconnects properly #1379

Closed
leifwalsh opened this issue May 31, 2023 · 5 comments · Fixed by #1380
Closed

websocket-client socket mode client doesn't handle reconnects properly #1379

leifwalsh opened this issue May 31, 2023 · 5 comments · Fixed by #1380
Assignees
Labels
bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented socket-mode Version: 3x
Milestone

Comments

@leifwalsh
Copy link

leifwalsh commented May 31, 2023

(Filling out the following details about bugs will help us solve your issue sooner.)

Reproducible in:

pip freeze | grep slack
python --version
sw_vers && uname -v # or `ver`

slack-sdk==3.21.3
Python 3.11.3
Debian 10

The Slack SDK version

slack-sdk==3.21.3

Python runtime version

Python 3.11.3

OS info

Debian 10

Steps to reproduce:

(Share the commands to run, source code, and project settings (e.g., setup.py))

  1. Use slack_sdk.socket_mode.websocket_client.SocketModeClient
  2. Let the app run for long enough
  3. Receive a websocket-level disconnect

Expected result:

The client should properly reconnect

Actual result:

The client repeatedly tries to reconnect without re-issuing a wss url, causing all connections to fail in a loop forever

The logs are roughly the same as described in slackapi/bolt-python#470. It repeatedly gets WebSocketConnectionClosedException. Specifically, websocket logs Connection to remote host was lost. - goodbye and slack_sdk.socket_mode.websocket_client logs on_error invoked (error: WebSocketConnectionClosedException, message: Connection to remote host was lost.)

I believe the problem is that SocketModeClient.is_connected() just checks whether self.current_session is not None (https://github.com/slackapi/python-slack-sdk/blob/v3.21.3/slack_sdk/socket_mode/websocket_client/__init__.py#L139). If you compare that with the builtin client, it also checks whether connection.is_active() (https://github.com/slackapi/python-slack-sdk/blob/v3.21.3/slack_sdk/socket_mode/builtin/client.py#L152) which checks if self.sock is not None (https://github.com/slackapi/python-slack-sdk/blob/v3.21.3/slack_sdk/socket_mode/builtin/connection.py#L198). I think what happens is that in a certain type of disconnect scenario, BaseSocketModeClient.connect_to_new_endpoint() doesn't call issue_new_wss_url() when it should, because it thinks we're still connected (https://github.com/slackapi/python-slack-sdk/blob/v3.21.3/slack_sdk/socket_mode/client.py#L75).

I'm not sure how reliable that is in theory, there might be race conditions still. However, empirically, it seems that websocket-client's connection class does a reasonable job of setting its own .sock attribute to None when handling disconnects. I subclassed slack_sdk.socket_mode.websocket_client.SocketModeClient and overrode is_connected() with a version that also checks if self.current_session.sock is falsey, and ran that in prod for a couple weeks, and it seems like when this happens it does correctly reconnect. Running the app in a HA configuration (at least two instances running) means that, as long as they both don't disconnect simultaneously, I think we retain at least one active connection consistently and don't drop messages.

I think the fix is probably simple: just check that relying on checking websocket._app.WebSocketApp.sock is something you're comfortable doing, and then add that condition to slack_sdk.socket_mode.websocket_client.SocketModeClient.is_connected().

@seratch seratch added bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented Version: 3x socket-mode and removed untriaged labels May 31, 2023
@seratch seratch added this to the 3.22.0 milestone May 31, 2023
@seratch seratch self-assigned this May 31, 2023
@seratch
Copy link
Member

seratch commented May 31, 2023

Hi @leifwalsh, thank you very much for taking the time to report the issue!

Your suggestion looks great to me 👍 May I ask you to send a pull request to merge the changes in your subclass? Otherwise, I can make a PR with the change when I have time for it.

@leifwalsh
Copy link
Author

Thanks! If you could make the change (or figure out a better way to detect a broken connection), that would be great. Otherwise, I can send a PR but I have to go through an internal review process at my company to get permission to send you one, which could take a few days/weeks.

@seratch
Copy link
Member

seratch commented May 31, 2023

@leifwalsh Thanks for your quick reply! if that's the case, I am happy to make the change for you! (also, I will explore alternatives when working on it)

@seratch
Copy link
Member

seratch commented Jun 1, 2023

@leifwalsh Here is my PR to fix this issue: #1380 If you have any sugggestions, please feel free to leave comments on the PR!

@leifwalsh
Copy link
Author

Looks great, thank you!

seratch added a commit that referenced this issue Jun 1, 2023
…ts properly (#1380)

* Fix #1379 websocket-client socket mode client doesn't handle reconnects properly
* Adjust test code
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented socket-mode Version: 3x
Projects
None yet
2 participants