CassandraSinkCluster lost messages #1843

Closed
rukai opened this issue Nov 26, 2024 · 0 comments · Fixed by #1845
Labels
bug Something isn't working

Comments

@rukai
Member

rukai commented Nov 26, 2024

Lost messages can occur as a result of shotover closing outgoing connections as part of its logic for handling USE statements.
Cassandra USE statements set per-connection state. To avoid a situation where the outgoing connections behind one incoming connection have different keyspace states, we close all outgoing connections and reopen them with the new USE keyspace.
Running USE on the existing connections, without closing them, would be better, but it would require significant refactors to CassandraSinkCluster to avoid returning duplicate responses to the client.

Consider the following scenario, which is causing intermittent test failures in CI and locally:

  1. client sends a USE statement
  2. client sends a prepare
  3. shotover duplicates the prepare request, so there is now a USE statement in between some of the duplicated prepare requests.
  4. shotover sends one prepare request.
  5. shotover clears open connections for the USE statement; this closes the connection on which we are currently waiting for a response.
  6. shotover sends the other 2 prepare requests.
  7. shotover only receives 2/3 of the prepare responses, so it never responds to the client with the combined prepare response.
  8. The client eventually times out after 10s.
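
To make step 7 concrete, here is a minimal sketch of "only respond once every duplicated prepare has been answered". `PendingPrepare`, `on_response` and `replay_lost_prepare` are hypothetical names invented for illustration and are not shotover's actual types; the real logic is async and more involved.

```rust
/// Hypothetical tracker for one client prepare that was duplicated to several nodes.
struct PendingPrepare {
    expected: usize,
    received: Vec<String>,
}

impl PendingPrepare {
    fn new(expected: usize) -> Self {
        PendingPrepare { expected, received: Vec::new() }
    }

    /// Record one node's response. The combined response is only produced
    /// once every duplicated request has been answered.
    fn on_response(&mut self, node_response: &str) -> Option<String> {
        self.received.push(node_response.to_string());
        if self.received.len() == self.expected {
            Some(self.received.join(","))
        } else {
            None
        }
    }
}

/// Replays steps 4 to 7: three prepares are expected, but the connection
/// carrying the first one is closed before its response arrives.
fn replay_lost_prepare() -> Option<String> {
    let mut pending = PendingPrepare::new(3);
    let mut combined = None;
    // Only the two requests sent after the USE handling are ever answered.
    for response in ["prepared@node2", "prepared@node3"] {
        combined = pending.on_response(response);
    }
    combined
}
```

With only two of the three responses delivered, `replay_lost_prepare()` never yields a combined response, which is what the client observes as a timeout.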

On my local machine I can reproduce this scenario within 10 tries by running `cargo nextest run cassandra_int_tests::cassandra_5_cluster::case_2_cdrs` in a loop.

But it should be possible to reproduce the issue by simply doing:

  1. send query
  2. send use statement
  3. shotover sends query to a connection
  4. shotover kills all outgoing connections as per use logic.
    1. the response to the query is lost.
  5. client times out waiting for response to query
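
The simpler reproduction above can be modelled roughly as follows. This is a synchronous, std-only sketch under the assumption that each outgoing connection tracks its unanswered requests; `Sink`, `OutgoingConnection`, `send_query` and `handle_use` are hypothetical names, not shotover's real API.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for one outgoing connection to a Cassandra node.
struct OutgoingConnection {
    /// Request ids that were sent but not yet answered.
    in_flight: VecDeque<u32>,
}

/// Hypothetical stand-in for the sink's set of outgoing connections.
struct Sink {
    connections: Vec<OutgoingConnection>,
    keyspace: Option<String>,
}

impl Sink {
    fn new() -> Self {
        Sink { connections: Vec::new(), keyspace: None }
    }

    /// Step 3: send a query, opening a connection lazily if none exists.
    fn send_query(&mut self, request_id: u32) {
        if self.connections.is_empty() {
            self.connections.push(OutgoingConnection { in_flight: VecDeque::new() });
        }
        self.connections[0].in_flight.push_back(request_id);
    }

    /// Step 4: kill all outgoing connections as part of the USE logic.
    /// Returns the request ids whose responses can no longer be received.
    fn handle_use(&mut self, keyspace: &str) -> Vec<u32> {
        let lost: Vec<u32> = self
            .connections
            .drain(..)
            .flat_map(|connection| connection.in_flight)
            .collect();
        self.keyspace = Some(keyspace.to_string());
        lost
    }
}
```

Calling `send_query(7)` followed by `handle_use("ks")` returns `[7]`: the query was sent, but nothing is left that could deliver its response.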

Possible solution

The simplest possible solution is to flush the outgoing connections before killing them as part of the USE statement logic.
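
A rough sketch of what "flush before killing" could look like, using std threads and channels to simulate a node. Everything here (`OutgoingConnection`, `flush`, `handle_use`, the 10ms delay) is a hypothetical illustration; shotover's real connections are async, so the actual fix will look different.

```rust
use std::collections::HashSet;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical outgoing connection: requests go to a simulated node thread,
/// responses come back on a channel.
struct OutgoingConnection {
    to_node: Sender<u32>,
    from_node: Receiver<u32>,
    in_flight: HashSet<u32>,
}

impl OutgoingConnection {
    fn open() -> Self {
        let (to_node, node_rx) = channel::<u32>();
        let (node_tx, from_node) = channel::<u32>();
        // Simulated node: answers each request id after a short delay and
        // exits once the connection (and with it the sender) is dropped.
        thread::spawn(move || {
            for request_id in node_rx {
                thread::sleep(Duration::from_millis(10));
                if node_tx.send(request_id).is_err() {
                    break;
                }
            }
        });
        OutgoingConnection { to_node, from_node, in_flight: HashSet::new() }
    }

    fn send(&mut self, request_id: u32) {
        self.to_node.send(request_id).expect("simulated node is running");
        self.in_flight.insert(request_id);
    }

    /// Block until every in-flight request has been answered.
    fn flush(&mut self) -> Vec<u32> {
        let mut responses = Vec::new();
        while !self.in_flight.is_empty() {
            let id = self.from_node.recv().expect("simulated node is running");
            self.in_flight.remove(&id);
            responses.push(id);
        }
        responses
    }
}

/// USE handling with the proposed fix: flush first, then drop the connections.
fn handle_use(connections: &mut Vec<OutgoingConnection>) -> Vec<u32> {
    let mut responses = Vec::new();
    for connection in connections.iter_mut() {
        responses.extend(connection.flush());
    }
    connections.clear(); // to be reopened later with the new keyspace
    responses
}
```

The trade-off in this sketch is that the USE statement is delayed until the slowest pending response arrives, in exchange for not losing any response.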

@rukai rukai added the bug Something isn't working label Nov 26, 2024