Messages can be lost when shotover closes outgoing connections as part of its logic for handling USE statements.
Cassandra USE statements set per-connection state. To avoid a situation where one incoming connection is backed by outgoing connections in different states, we close all outgoing connections and reopen them with the new USE keyspace.
Running USE on the existing connections, without closing them, would be better, but it would require significant refactors to CassandraSinkCluster to avoid returning duplicate responses to the client.
Consider the following scenario, which is causing intermittent test failures in CI and locally:
client sends use statement
client sends prepare
shotover duplicates the prepare request into 3 requests; the use statement now sits in between some of the duplicates.
shotover sends one prepare request.
shotover clears its open connections for the use statement; this closes the connection on which we are still waiting for a prepare response.
shotover sends the other 2 prepare requests.
shotover receives only 2 of the 3 prepare responses, so it never responds to the client with the combined prepare response.
The client eventually times out after 10s.
On my local machine I can reproduce this scenario within 10 tries by running cargo nextest run cassandra_int_tests::cassandra_5_cluster::case_2_cdrs in a loop.
But it should be possible to reproduce the issue by simply doing:
client sends query
client sends use statement
shotover sends query to a connection
shotover kills all outgoing connections as per use logic.
the response to the query is lost.
client times out waiting for response to query
Possible solution
The simplest possible solution is to flush the outgoing connections before killing them as part of the USE statement logic.
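To make the proposed fix concrete, here is a minimal sketch that models the difference between closing connections immediately and flushing them first. It is not shotover's actual code: `OutgoingConnection`, `handle_use`, and the `flush_first` flag are hypothetical names invented for illustration, and the "node" is simulated by assuming every in-flight request gets answered during a flush.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for one outgoing connection to a Cassandra node.
struct OutgoingConnection {
    /// Stream ids of requests that were sent but not yet answered.
    in_flight: VecDeque<u16>,
}

impl OutgoingConnection {
    fn new() -> Self {
        OutgoingConnection { in_flight: VecDeque::new() }
    }

    /// Record that a request was sent on this connection.
    fn send(&mut self, stream_id: u16) {
        self.in_flight.push_back(stream_id);
    }

    /// Simulated flush: wait until every pending request has been answered
    /// and return the stream ids of the responses that were received.
    fn flush(&mut self) -> Vec<u16> {
        self.in_flight.drain(..).collect()
    }

    /// Close the connection. Any request still in flight never gets a
    /// response, so its stream id is returned as lost.
    fn close(self) -> Vec<u16> {
        self.in_flight.into_iter().collect()
    }
}

/// Hypothetical USE handling: close every outgoing connection so that it can
/// be reopened with the new keyspace. Returns (received, lost) stream ids.
fn handle_use(
    connections: &mut Vec<OutgoingConnection>,
    flush_first: bool,
) -> (Vec<u16>, Vec<u16>) {
    let mut received = Vec::new();
    let mut lost = Vec::new();
    for mut conn in connections.drain(..) {
        if flush_first {
            // Proposed fix: collect pending responses before closing.
            received.extend(conn.flush());
        }
        lost.extend(conn.close());
    }
    (received, lost)
}
```

In this model, sending a query with stream id 1 and then calling `handle_use(&mut connections, false)` reports that id as lost, which corresponds to the client timing out. Calling it with `flush_first` set to `true` reports the id as received and nothing as lost.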