Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core Library] ZMQ Network Recovery after Network Loss #355

Merged
merged 54 commits into from
Mar 6, 2025

Conversation

ChrisYx511
Copy link
Contributor

@ChrisYx511 ChrisYx511 commented Feb 4, 2025

Description

Currently, there is an issue whereReceiver, which extends OmnibusCommunicator, will deadlock when network connection is changed or lost when connecting to an external Omnibus Server. As a result, sinks stop working / fail silently. The solution currently envisioned is to reset the ZMQ socket whenever no messages are being received, so they can recover automatically after connection loss.

  • Added new reset() function to Receiver class in Omnibus Core library, which re-establishes the connection to the Omnibus server
  • Setup the receive message function to run reset() after not receiving messages for over 5 seconds (by default, 2 seconds for globallog), and reset every 5 seconds until a new message is received.
  • Prints warnings and info logs when network is offline / back online
  • Added startup screen incl. build number and date to Omnibus startup
    • BREAKING gitpython has been added as a dependency in order to print the build number. It is technically optional and nothing will crash without it, but it should be installed for all operation (so you can get the build number readout)
  • Measured performance of reset function, usually can start receiving messages roughly 0.0025 seconds after reset. Therefore maximum theoretical data loss during network failure and reconnect is roughly 5 seconds (2 seconds for globallog) + 0.0025.
  • Updated Parsley source to respect this behaviour by listening for Healthcheck as well on the Receiver, not just CAN/Commands.
  • Various other minor fixes, incl. adding a pyproject.toml to replace deprecated setup.py develop
  • Incremented core library version to 1.1.0
  • Fixed GitHub workflow enforcing version changes

This PR closes #302 .

Developer Testing

Here's what I did to test my changes:

  • Tested various disconnection methods, incl. server died, client died, client network changed, client network off using the actual DAQ computer at the bay using all sinks, focusing on Dashboard and Globallog
  • Wrote and ran several unit tests that verify message integrity before and after a reset, as well as simulating an auto-reset

Reviewer Testing

Here's what you should do to quickly validate my changes:

  • Run pytest for omnibus, ensure that all tests pass, including old and new ones
  • Test disconnecting and reconnecting the server, your client, in various ways you can think of
  • Use omnibus normally and report any other regressions as you see fit

This change is Reviewable

@ChrisYx511 ChrisYx511 self-assigned this Feb 4, 2025
@ChrisYx511
Copy link
Contributor Author

Alright, this should be feature complete!

@ChrisYx511 ChrisYx511 marked this pull request as ready for review February 20, 2025 23:23
@ChrisYx511 ChrisYx511 requested review from a team as code owners February 20, 2025 23:23
Copy link
Member

@BluCodeGH BluCodeGH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core feature looks good, but I have a handful of thoughts on the surrounding code changes.

@ChrisYx511
Copy link
Contributor Author

@BluCodeGH Thank you so much for the insight! I just had some quick clarifying questions that I left as comments to your review and I'll get right to implementing the changes once they are answered.
Thanks again!

@ChrisYx511 ChrisYx511 requested a review from BluCodeGH February 27, 2025 20:45
@ChrisYx511
Copy link
Contributor Author

@BluCodeGH @zangjiucheng pls re-review

Copy link
Member

@zangjiucheng zangjiucheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Chris that's really amazing... I will even say it's 2.0 instead of 1.1 for core library lol. I leave a few comments there.

Code makes sense to me, I will test on that with usb debugger tmr to confirm it keeps working with real parsley signal. As soon as tomorrow an approve of this pr will be send.

@zangjiucheng
Copy link
Member

image

For feature improvement,

  1. for the point when server got disconnected, maybe we need a better identification for plot.
  2. Also dashboard title maybe can give a UI reflection say server is down.
  3. Make sure in that case sender is not be able to send any signal (Maybe we should stash command need to be send in the buffer area when server is down)

@zangjiucheng
Copy link
Member

Please create relative issues before we merge this in (Make sure we won't forgot those features wait to be fixed)

(Could be just a title, I am going to give detail description later)

Copy link
Member

@zangjiucheng zangjiucheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, after testing on the GPS parser after message go offline. It can back online as expected.

@ChrisYx511
Copy link
Contributor Author

@zangjiucheng Did a minor typing fix, could you re-approve? 🙏 Also left a comment #355 (comment).

Copy link
Member

@BluCodeGH BluCodeGH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! In the future it would be better to split stuff like this into multiple more focused PRs to make review easier (eg ZMQ changes, readme updates, yaml formatting, build info, etc as separate PRs).

Copy link
Member

@zangjiucheng zangjiucheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ChrisYx511 ChrisYx511 merged commit ba5d9df into master Mar 6, 2025
2 checks passed
@ChrisYx511 ChrisYx511 deleted the c424yang/302-zmq-network-recovery branch March 6, 2025 03:10
ChrisYx511 added a commit that referenced this pull request Mar 7, 2025
On the receiver, patch to reset the ZMQ socket whenever no messages are being received, so they can start receiving messages again automatically after connection loss. And other misc. changes.

---------

Co-authored-by: waterloorocketry-common <132830721+waterloorocketry-common@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ZMQ Network Down Detection / Dropped Recovery
4 participants