As I pointed out in #105, consider:

- `AM`, aka active metrics in metrictank: this changes constantly, based on how many different keys are being sent into the system.
- `AGGX`, aka the aggregation multiplier: in our case 19, because we have 3 rollup bands of 6 series each, plus the original (3 * 6 + 1 = 19).
- `cassandra-write-queue-size`, as configured: this setting dictates how many chunks can be buffered in the write queue before it starts blocking the callers.
You want to tune `cassandra-write-queue-size` to always be, at the very minimum, AM * AGGX, otherwise message handlers will always block at time boundaries where all chunks for all metrics (including all rollup chunks) are being flushed (which was a cause of #105, "MT: load spike every 6 hours (mostly/only dev)").
But you also don't want it to be much more than that: based on the aggregation settings, you want incoming metrics to block well in time, so that no data loss occurs if you can't fix a cassandra problem in time.
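To make the lower bound concrete (the AM figure here is just an illustrative number, not a measurement from any deployment):

```
AGGX = 3 rollup bands * 6 series each + 1 raw series = 19
AM   = 100,000 active metrics                  (example; changes constantly)
minimum cassandra-write-queue-size = AM * AGGX = 1,900,000 chunks
```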
Let's say MT is configured to remove data from memory after 6 hours; then the write queue should never be able to accumulate 7 hours' worth of data before it starts blocking, because if the primary crashes, the secondary cannot fully resume the write duties: some data will be lost. In other words, the write queue should never contain data that has already been removed from the primary's or secondary's main AggMetric index. I haven't fully thought the details of this reasoning through, but it seems more or less right to me.
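A minimal sketch of the resulting valid range, with hypothetical names (none of these identifiers exist in metrictank; this just restates the reasoning above, assuming each series keeps a fixed number of chunks in its in-memory ring buffer):

```go
package main

import "fmt"

// validQueueRange restates the reasoning above: the queue must be able to hold
// at least one chunk per series (raw + all rollup series) so a chunkspan boundary
// doesn't block the handlers, but it should not be able to hold more chunks than
// the primary/secondary still keep in their in-memory AggMetric ring buffers,
// otherwise a primary crash means the secondary can't re-save that data.
func validQueueRange(activeMetrics, aggMult, memChunksPerSeries int) (min, max int) {
	min = activeMetrics * aggMult                      // lower bound: AM * AGGX
	max = activeMetrics * aggMult * memChunksPerSeries // rough upper bound: data still held in memory
	return min, max
}

func main() {
	// example values: 100k active metrics, AGGX = 19, 6 chunks kept in memory per series
	lo, hi := validQueueRange(100_000, 19, 6)
	fmt.Printf("cassandra-write-queue-size should be >= %d and well below %d chunks\n", lo, hi)
}
```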
So basically, since AM is always changing, there's always a (changing) valid range for `cassandra-write-queue-size` as well.
This means that ideally we're constantly adding new nodes and promoting them, with new settings, to accommodate the ever-changing live AM value. We should also add monitoring that compares `cassandra-write-queue-size` against AM and AGGX to make sure it still makes sense, and add alerting on top of it.
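Something like the sketch below is what I have in mind for the monitoring/alerting part; the function, the way AM is obtained, and the alert wording are all made up for illustration, the point is just the comparison of the configured size against the live AM * AGGX:

```go
package main

import "fmt"

// checkWriteQueueSize compares the configured cassandra-write-queue-size against
// the live AM * AGGX product and alerts when the config has drifted out of the
// valid range. In practice metrictank would read AM from its index stats and
// emit this via its normal stats/alerting pipeline; sendAlert is a placeholder.
func checkWriteQueueSize(configured, activeMetrics, aggMult, memChunksPerSeries int, sendAlert func(string)) {
	min := activeMetrics * aggMult
	max := activeMetrics * aggMult * memChunksPerSeries
	switch {
	case configured < min:
		sendAlert(fmt.Sprintf("write queue %d < AM*AGGX = %d: handlers will block at chunk boundaries", configured, min))
	case configured >= max:
		sendAlert(fmt.Sprintf("write queue %d >= %d: queue can hold data already evicted from memory, risking loss on primary crash", configured, max))
	}
}

func main() {
	// example: queue configured at 1M chunks while AM has grown to 100k series
	checkWriteQueueSize(1_000_000, 100_000, 19, 6, func(msg string) { fmt.Println("ALERT:", msg) })
}
```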
This is all a little clunky.
Note also that if AM rises very fast, we may not be able to promote a primary in time to avoid the blocking problem.
The alternative would be adding a custom queue that is more dynamic and adjusts itself, but in that scenario we don't have an upper bound on memory usage, which is also a problem.
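For completeness, the kind of self-adjusting queue I mean would be something like the slice-backed buffer below (purely a hypothetical sketch, not existing metrictank code); it never blocks the callers, which is exactly why its memory use has no upper bound when cassandra falls behind:

```go
package main

import "sync"

// dynamicQueue grows instead of blocking: pushes always succeed, so there is no
// need to tune its size against AM * AGGX. The flip side is that if cassandra
// can't keep up, the backing slice (and thus memory usage) grows without bound.
type dynamicQueue struct {
	mu     sync.Mutex
	chunks []string // stand-in for the chunk type metrictank would queue
}

func (q *dynamicQueue) push(c string) {
	q.mu.Lock()
	q.chunks = append(q.chunks, c) // never blocks, just grows
	q.mu.Unlock()
}

func (q *dynamicQueue) pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.chunks) == 0 {
		return "", false
	}
	c := q.chunks[0]
	q.chunks = q.chunks[1:]
	return c, true
}

func main() {
	q := &dynamicQueue{}
	q.push("chunk-1")
	q.pop()
}
```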