
Targeted sweep tombstones causes perf degradation #7198

Open
fsamuel-bs opened this issue Jul 22, 2024 · 4 comments

Comments

@fsamuel-bs
Contributor

  1. Currently, deletes happen at ALL. If one Cassandra node is down (e.g. because the Cassandra cluster is restarting), tombstones will be added on the nodes which are up, but the delete will fail for the client with an InsufficientConsistencyException.
  2. Deletes will then be retried, causing tombstones to pile up until Cassandra compacts them away (a sketch of this retry loop follows the list).
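
For illustration, a rough sketch of that retry loop - the names here are made up, not AtlasDB's or Cassandra's actual API:

```java
// Illustrative only: each attempt at ALL still writes a range tombstone on the live
// replicas before the client sees the failure, so immediate retries multiply tombstones
// for as long as one node is down.
enum ConsistencyLevel { ALL, QUORUM }

interface CassandraClient {
    // Deletes all versions of (row, column) below maxTimestamp, i.e. writes a range tombstone.
    void rangedDelete(String row, String column, long maxTimestamp, ConsistencyLevel cl);
}

final class NaiveSweepDeleter {
    private final CassandraClient client;

    NaiveSweepDeleter(CassandraClient client) {
        this.client = client;
    }

    void deleteWithRetries(String row, String column, long maxTimestamp) {
        while (true) {
            try {
                // Lands a tombstone on every reachable replica...
                client.rangedDelete(row, column, maxTimestamp, ConsistencyLevel.ALL);
                return;
            } catch (RuntimeException insufficientConsistency) {
                // ...then fails because one replica is down, and the retry writes the
                // same tombstone again on the replicas that are up.
            }
        }
    }
}
```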

With targeted sweep, writes also generate a tombstone:

  • for example, let's say we have a table with a key (series name) and dynamic columns (offsets) to model queues
  • a write to series1 + offset1 at commit timestamp t1 will cause a ranged tombstone to be added covering series1 + offset1 + [< t1] (see the sketch after this list)
  • reads which range-scan past this row will traverse over this tombstone; if several tombstones pile up, reads can fail.
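
To make the shape of that tombstone concrete, here is roughly what it corresponds to in CQL terms, assuming a simplified (key, column, ts) clustering layout - this is illustrative, not the actual schema or code path:

```java
// Illustrative only: a ranged delete over the timestamp clustering column for one cell.
// One such range tombstone is written per swept write, so a range scan over series1 has
// to step over all of them until compaction purges them.
String rangedTombstoneCql =
        "DELETE FROM queue_table "
      + "WHERE key = :series1 AND column = :offset1 AND ts < :t1";
```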

Solutions that come to mind are:

  • fix deletes to no longer happen at ALL, but at QUORUM - in progress (sketched after this list)
  • not add a range-delete tombstone if there are no rows to delete? Not sure we can know that there are no rows to delete, though - sweep would either have to be informed by the client or work it out at read time
  • reduce the rate of retries - if deletes are failing due to InsufficientConsistencyException, retry only 1h from now?
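
A minimal sketch of the first option, reusing the illustrative CassandraClient from the earlier sketch (again, not the real API):

```java
// Deleting at QUORUM tolerates a single down replica, so a rolling restart no longer
// turns every sweep delete into an InsufficientConsistencyException plus retries.
void sweepDelete(CassandraClient client, String row, String column, long maxTimestamp) {
    client.rangedDelete(row, column, maxTimestamp, ConsistencyLevel.QUORUM);
}
```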
@fsamuel-bs
Contributor Author

It seems the last one is already covered by .insufficientConsistency_(getInsufficientConsistencyPauseAndCalculateNext()), so we might just need to make it more aggressive.
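
For example, something along these lines - the constants and names are made up, not the values we currently use; "more aggressive" here means reaching the cap after fewer consecutive failures:

```java
import java.time.Duration;

// Illustrative backoff for deletes that fail with InsufficientConsistencyException.
final class InsufficientConsistencyBackoff {
    private static final Duration INITIAL_PAUSE = Duration.ofSeconds(30);
    private static final Duration MAX_PAUSE = Duration.ofMinutes(30);

    private int consecutiveFailures = 0;

    Duration getPauseAndCalculateNext() {
        // Doubles on every consecutive failure: 30s, 1m, 2m, ..., capped at 30m.
        Duration pause = INITIAL_PAUSE.multipliedBy(1L << Math.min(consecutiveFailures, 6));
        consecutiveFailures++;
        return pause.compareTo(MAX_PAUSE) < 0 ? pause : MAX_PAUSE;
    }

    void reset() {
        consecutiveFailures = 0;
    }
}
```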

@fsamuel-bs fsamuel-bs changed the title from "Targeted sweep tombstones causing perf degradation" to "Targeted sweep tombstones causes perf degradation" on Jul 22, 2024
@tpetracca
Contributor

Just to clarify generally: this issue is meant to be a follow-up to PDS-554079.

@jeremyk-91
Contributor

Not sure if we can know there are no rows to delete though - either would have to be informed by client or understood at read-time

Yep, exactly.

reduce the rate of retries - if deletes are failing due to InsufficientConsistencyException, retry only 1h from now?

Interesting. I think I see where this is coming from, though I'd be a little concerned about automatically extending Sweep delays by an hour - the SLAs can probably take it, but clients might not, and if we have a few rolls in close-ish proximity this could have bad consequences, as Sweep might remain inactive for the entire period. We currently have an exponential backoff that goes up to 30m - I'm not opposed to relaxing the parameters on that a bit, especially for early attempts, if we think that'll be useful.

Maybe we can do something speculative where an InsufficientConsistencyException on a delete puts us in a suspected non-ALL state? While in that state, we try a CL ALL get (or some other non-destructive operation) on the same set of nodes every 5m or so, and only resume normal sweep once this succeeds.
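
Something like this, purely as a sketch (names and the 5m interval are illustrative):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// After an InsufficientConsistencyException we enter a "suspected non-ALL" state and stop
// issuing deletes at ALL; every ~5 minutes we run a cheap, non-destructive read at CL ALL
// against the same nodes, and only resume normal sweep once that probe succeeds.
final class SuspectedNonAllState {
    private static final Duration PROBE_INTERVAL = Duration.ofMinutes(5);

    private final BooleanSupplier probeReadAtAllSucceeds;
    private boolean suspectedNonAll = false;
    private Instant nextProbe = Instant.MIN;

    SuspectedNonAllState(BooleanSupplier probeReadAtAllSucceeds) {
        this.probeReadAtAllSucceeds = probeReadAtAllSucceeds;
    }

    void onInsufficientConsistency() {
        suspectedNonAll = true;
        nextProbe = Instant.now().plus(PROBE_INTERVAL);
    }

    /** Whether sweep may issue deletes at ALL right now. */
    boolean mayDeleteAtAll() {
        if (!suspectedNonAll) {
            return true;
        }
        if (Instant.now().isBefore(nextProbe)) {
            return false;
        }
        if (probeReadAtAllSucceeds.getAsBoolean()) {
            suspectedNonAll = false;
            return true;
        }
        nextProbe = Instant.now().plus(PROBE_INTERVAL);
        return false;
    }
}
```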

@tpetracca
Contributor

In practice I think we just press on with "sweep at quorum", and either as part of that work or in a follow-up we essentially try to poll the cluster's health to understand whether we're healthy enough for "sweep at ALL" to continue, rather than just attempting to do it with actual deletes being issued?

Which I guess is vaguely similar to what you describe. But yeah, I think it just comes down to:

  • how to decide when to move from ALL to QUORUM
  • how to decide when to move from QUORUM to ALL - and for this one in particular, can we do so without building up tons of duplicate "failed at ALL" deletes on the other two replicas? (a rough sketch follows)
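
A rough sketch of what that decision could look like if it is driven by a health poll rather than by deletes failing at ALL (all names hypothetical):

```java
import java.util.function.BooleanSupplier;

// The consistency level for sweep deletes is chosen from a periodic cluster-health poll,
// so moving back from QUORUM to ALL never requires piling up "failed at ALL" deletes on
// the replicas that are up.
enum SweepDeleteMode { ALL, QUORUM }

final class SweepDeleteModeChooser {
    private final BooleanSupplier allReplicasHealthy; // e.g. fed by a ring/gossip health poller

    SweepDeleteModeChooser(BooleanSupplier allReplicasHealthy) {
        this.allReplicasHealthy = allReplicasHealthy;
    }

    SweepDeleteMode currentMode() {
        // ALL -> QUORUM as soon as the poll reports a down replica;
        // QUORUM -> ALL only once the poll reports the ring fully healthy again.
        return allReplicasHealthy.getAsBoolean() ? SweepDeleteMode.ALL : SweepDeleteMode.QUORUM;
    }
}
```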
