Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V0.20 (session persistence) Release Discussion #207

Closed
MattBrittan opened this issue Nov 20, 2023 · 12 comments
Closed

V0.20 (session persistence) Release Discussion #207

MattBrittan opened this issue Nov 20, 2023 · 12 comments

Comments

@MattBrittan
Copy link
Contributor

Creating this issue to provide a place to discuss the next release (including session persistence). This release will contain major changes (a few of which are likely to break existing users code - permitted as we are still pre 1.0).

Due to the extent of the changes (particularly session persistence) we are keen to receive feedback from anyone using @master (the more people we know are using this successfully, the more confidence we will have that it's ready for release).

Issue creation prompted by an email to the paho-dev mailing list (an issue here seems the best place to discuss this).

@MattBrittan MattBrittan changed the title V0.2 (session persistence) Release Discussion V0.20 (session persistence) Release Discussion Nov 20, 2023
@MattBrittan
Copy link
Contributor Author

I'm now running @master in production with:

  • 8 remote clients (mix of x86 PC's and mips based routers) - these connect via a range of methods (cell, starlink, fibre) and comms are often unreliable (so a good test of the reconnection/session resumption functionality). Some of these have now been running for a month (but I have pushed out updates from time to time). Publish frequency ranges from multiple messages a second through to a message every 10 minutes (they also receive commands sporadically).
  • A few core services (ec2 - handling circa 200 messages a minute).

All connections run QOS1 and the remote clients utilise autopaho PublishViaQueue (with disk based persistence) and disk based session state.

To date the only real issue I've encountered was due to a corrupted queue (addressed in #194 and #199 ). This will have been caused by the power being pulled (was a component in a larger system being built in a customers factory, so regular power outages are to be expected); upcoming change to add Sync call in the queue/store may have helped prevent this too.

Apart from the one issue, I'd say that things are running better than the v3 client (got out-of-order packets fairly frequently with that following loss of connection, have not had the same issue with the V5 client). My app does perform checks for missing and out of order messages, and I've not seen any alerts, so fairly happy that the session management piece is working OK. The only downtime I've suffered was due to the corrupted queue issue (as this was a test site I was able to leave it as-is until the fix was available so have confirmation that the fix works).

@vishnureddy17
Copy link
Contributor

Thanks for all the work you've put in to get to this point, it's really appreciated!

Since the next release is going to break existing user code already, here are some additional breaking changes I think should be considered before the next release.

  • Revisit the logging situation. I think using log/slog for logging is worth considering, and I opened Use log/slog for logging  #169 to track that.
  • Fix the PingHandler (I believe it still does not function properly, see PingHandler will not time time out #137). While that fix is being done, revisit the Pinger interface to add a method that gets called to reset the timeout any time a packet is sent or received.

I can take a stab at one or both of these and submit a PR if that'd be useful.

@MattBrittan
Copy link
Contributor Author

My thoughts on to-do's for the next release:

  • Sort out the router (pr Remove router from most tests and examples. #215 should pretty much complete this - we could remove ClientConfig.Router but I think I'd prefer to do the release and then remove it as this gives users a bit more time to update their code).
  • Add a PacketSent call to the pinger interface. As per comment from @vishnureddy17 above (not too worried about changes to the implementation as long as we get the interface change done).
  • Make the config private (issue Client.ClientConfig public - invites race conditions #210 - have discussed with @alsm and think we are in agreement on this; the current setup invites data races).

I'm ambivalent on the logging change (this is primarily for library development and is not something most users will need to touch). Anyway looks like a release is not far off; would like to see this happen in Jan.

@MattBrittan
Copy link
Contributor Author

  • Fix the PingHandler (I believe it still does not function properly, see PingHandler will not time time out #137). While that fix is being done, revisit the Pinger interface to add a method that gets called to reset the timeout any time a packet is sent or received.
    I can take a stab at one or both of these and submit a PR if that'd be useful.

@vishnureddy17 would love to see a PR for this.

@vishnureddy17
Copy link
Contributor

@MattBrittan Sounds good. I'd be happy to do the pinger PR, and I can get that out next weekend (Jan 6-7).

Another change I think is important is to allow users to get back the ack on an async publish. One way to achieve this is to change AddToSession() in SessionManager to return a function that will block until the ack is received. The paho client would just return that function to the user. The function would be valid for use for as long as the session is.

This is quite important for my use-case. I can work on a PR for that and make an effort to get that out in January.

The autopaho manual ack issue (#160) is something to look at as well, but I think it is less important since users can implement their own managed client if autopaho doesn't meet their needs.

@MattBrittan
Copy link
Contributor Author

Another change I think is important is to allow users to get back the ack on an async publish.

My thought on this was to add callbacks to PublishOptions; we could then store the PublishOptions in state.clientPackets (making it easy to call when an associated event occurs). I prefer the callback approach because I think it's easier for the end user (and avoids the need to have a lot of go-routines hanging around waiting for acknowledgements). One note of caution is that the session may outlast the client instance (and I can't think of a good way of handling that!).

The autopaho manual ack issue (#160) is something to look at as well

Yep - I would like to see a solution for this but struggle with the logic around reconnections (I'm thinking more of the logic that would be required in users code than the logic in the client). It would be good to see use-cases from someone who needs this functionality.

@vishnureddy17
Copy link
Contributor

vishnureddy17 commented Dec 29, 2023

One note of caution is that the session may outlast the client instance (and I can't think of a good way of handling that!).

I think that's expected and necessary. As a user, I'd want to know when the publish was handled in the session (regardless of how many reconnections happened in the interim). In the solution you suggest, that would mean that the callback can potentially be called long after a particular client is dead.

I see how that could be confusing to the user, and that behavior would need to be documented.

@romansitina
Copy link

Hello all, appreciate your work! Would love to see PingHandler fix in the next release - we recently found out that this is causing issues in our deployment, resulting in more than 5 minutes lags when connection problem is unnoticed.

@MattBrittan
Copy link
Contributor Author

Update on the to-do list:

Done:

In progress:

Backlog / Undecided:

  • Revisit the logging situation Use log/slog for logging  #169 (I'm OK with the current logger, or one that's slog based, but don't see this as a big issue as the logger is mostly for debugging and should not usually be running in production code)
  • Allow users to get back the ack on an async publish (Nice to have but can wait as it should not be a breaking change)
  • Autopaho manual ack issue Manual ACK's, session state, and connection reestablishment #160 (nice to have but should not be a breaking change)

My thought is that once we get the pinger sorted I'll run this on a few production workloads for a week or so before cutting a release. After that release it would be good to pull a "Before V1" list because I think we are fairly close.

@vishnureddy17
Copy link
Contributor

vishnureddy17 commented Jan 12, 2024

Allow users to get back the ack on an async publish (Nice to have but can wait as it should not be a breaking change)

It wouldn't break users who are using paho defaults. It would be a breaking change to the SessionManager interface though, right?

Despite this, I think it's ok to defer this change to a future release as long as it's before V1. I don't anticipate that changes to SessionManager would be breaking to that many users. Just asking to clarify.

@MattBrittan
Copy link
Contributor Author

MattBrittan commented Jan 12, 2024

It would be a breaking change to the SessionManager interface though, right?

You are right (however realistically I don't see anyone else implementing this interface for some time :-) ).

With the introduction of the new Pinger I think we have everything needed for the release. I'll push the current @master out to a few non-critical systems and leave it running for a week as a final test (happy to see further changes but will probably
delay anything major).

Update 20th Jan: Further changes (pinger and clean shutdown process ref issue #227) will delay this slightly; I am running the current master (ef0065f) on a few machines in production without issue. Really appreciate any other feedback from testing (every use-case differs so I will not catch all bugs!).

@MattBrittan MattBrittan pinned this issue Jan 20, 2024
@MattBrittan
Copy link
Contributor Author

v0.20 released so I'll close this issue. Thanks very much to everyone who contributed!

@MattBrittan MattBrittan unpinned this issue Feb 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants