This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

continuous write queue tuning #125

Closed
Dieterbe opened this issue Feb 1, 2016 · 2 comments

Comments

@Dieterbe
Contributor

Dieterbe commented Feb 1, 2016

as I pointed out in #105

consider:

  • AM aka active metrics in metrictank: this constantly changes dynamically based on how many different keys are being sent into the system
  • AGGX aka the aggregation multiplier (in our case 19, because we have 3 rollup bands with 6 series each, plus the original series)
  • cassandra-write-queue-size as configured: this setting dictates how many chunks can be buffered in the write queue before it starts blocking the callers

you want to tune cassandra-write-queue-size to always

  • be, at the very minimum, AM * AGGX, otherwise message handlers will always block at time boundaries where all chunks for all metrics (including all rollup chunks) are being flushed (which was a cause of #105: "MT: load spike every 6 hours (mostly/only dev)")
  • .. but also not much more than that. Specifically, based on the aggregation settings, you want incoming metrics to block well in time, so that no data loss occurs if you can't fix a cassandra problem in time.
    Let's say MT is configured to remove data from memory after 6 hours; then the write queue should never accumulate 7 hours' worth of data before it starts blocking, because if the primary crashes, the secondary cannot fully resume the write duties: some data will be lost. In other words, the write queue should never contain data that has already been removed from the primary's or secondary's main AggMetric index. I haven't fully thought the details of this reasoning through, but it seems more or less right to me.
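The two bounds above can be sketched as a small helper. This is hypothetical illustration code, not metrictank code; assumptions not stated in the issue are that chunks are flushed once per chunk span, and that the upper bound caps the queue at the number of chunks produced during the in-memory retention window (so queued data can never be older than what the primary/secondary still hold in memory).

```go
package main

import "fmt"

// writeQueueBounds computes an illustrative valid range for
// cassandra-write-queue-size from the live active-metric count.
//   - lower: AM * AGGX, so a flush of all chunks at one time
//     boundary fits without blocking the message handlers
//   - upper: the chunks produced during the in-memory retention
//     window, so the queue never holds data already evicted from
//     the AggMetric index
func writeQueueBounds(activeMetrics, aggMult, chunkSpanMin, retentionMin int) (lower, upper int) {
	lower = activeMetrics * aggMult
	cycles := retentionMin / chunkSpanMin // flush cycles within the retention window
	upper = lower * cycles
	return lower, upper
}

func main() {
	// e.g. 100k active metrics, AGGX=19, 30min chunk span, 6h in-memory retention
	lower, upper := writeQueueBounds(100000, 19, 30, 360)
	fmt.Println(lower, upper) // 1900000 22800000
}
```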

So basically, since AM is always changing, the valid range for cassandra-write-queue-size is always changing as well.
This means that ideally we're constantly adding new nodes and promoting them, with new settings, to accommodate the ever-changing live AM value. We should also add monitoring that compares cassandra-write-queue-size against AM and AGGX to make sure it makes sense, and add alerting on top of it.
This is all a little clunky.
Note also that if AM rises very fast, we may not be able to promote a primary in time to avoid the blocking problem.
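The monitoring comparison could be as simple as the check below. This is a hypothetical sketch, not metrictank code; `upperFactor` is an assumed slack multiplier (how far above the AM * AGGX minimum the configured value may sit before alerting), not something from the issue.

```go
package main

import "fmt"

// queueSizeOK reports whether the configured cassandra-write-queue-size
// still makes sense for the live AM and AGGX values: at least the
// AM * AGGX minimum, but not more than upperFactor times that minimum.
// A monitoring job could evaluate this periodically and alert on false.
func queueSizeOK(configured, activeMetrics, aggMult, upperFactor int) bool {
	min := activeMetrics * aggMult
	return configured >= min && configured <= min*upperFactor
}

func main() {
	// e.g. 2M configured, 100k active metrics, AGGX=19, allow up to 4x the minimum
	fmt.Println(queueSizeOK(2000000, 100000, 19, 4)) // true
}
```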

The alternative would be adding a custom queue that is more dynamic and adjusts itself, but in such a scenario we don't have an upper bound for memory usage, which is also a problem.
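The dynamic alternative could look roughly like the sketch below: a queue whose admission limit is recomputed from the live active-metric count instead of a fixed config value. All names here are illustrative, not from the codebase; the comment marks where the memory concern from the paragraph above shows up.

```go
package main

import (
	"fmt"
	"sync"
)

// dynQueue is a self-adjusting write queue: instead of a fixed
// cassandra-write-queue-size, its limit tracks the live AM value.
type dynQueue struct {
	mu      sync.Mutex
	pending int // stand-in for the number of queued chunks
	aggMult int // AGGX
}

// tryEnqueue admits a chunk only while the queue is below the current
// dynamic limit (activeMetrics * aggMult); otherwise the caller must
// block and retry. Note the drawback the issue points out: the limit,
// and hence memory usage, grows without bound as activeMetrics grows.
func (q *dynQueue) tryEnqueue(activeMetrics int) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	limit := activeMetrics * q.aggMult
	if q.pending >= limit {
		return false
	}
	q.pending++
	return true
}

func main() {
	q := &dynQueue{aggMult: 19}
	fmt.Println(q.tryEnqueue(1)) // limit 19, queue empty -> true
}
```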

@Dieterbe
Contributor Author

note: after 23f7489 it's only 4 series per rollup anymore (min, max, sum, count), and we now have 2 rollup bands, so 9 series in total.

@stale

stale bot commented Apr 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 4, 2020
@stale stale bot closed this as completed Apr 11, 2020