drop measurement causes cache maximum memory size exceeded #7161
@sebito91 If it's currently locked up, can you attach these profiles of the instance to help identify the cause of the lockup?
The heap is on your SFTP, under /uploads/issue_7161
If a delete takes a long time to process while writes to the shard are occurring, it was possible for the cache to fill up and writes to be rejected. This occurred because we disabled all compactions while writing the tombstone file to prevent deleted data from re-appearing after a compaction completed. Instead, we now only disable the level compactions and allow snapshot compactions to continue. Snapshots already handle deleted data in the cache and WAL. Fixes #7161
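Roughly, the idea is this (a heavily simplified sketch with hypothetical names, not the actual tsm1 code): only level compactions are paused while the tombstones are written, so snapshot compactions can keep draining the cache and WAL to TSM files during a long delete.

```go
package main

import "math"

// Hypothetical stand-ins for the engine's components; names are
// illustrative only, not the real influxdb internals.
type fileStore struct{}

func (f *fileStore) WriteTombstones(keys [][]byte) error { return nil }

type cache struct{}

func (c *cache) DeleteRange(keys [][]byte, min, max int64) {}

type Engine struct {
	fs    *fileStore
	cache *cache
}

func (e *Engine) disableLevelCompactions() {} // level (TSM-to-TSM) compactions paused
func (e *Engine) enableLevelCompactions()  {} // resumed once the delete finishes

// deleteSeries sketches the approach in the patch: only level compactions
// are paused while tombstones are written; snapshot compactions keep
// flushing the cache/WAL to TSM files, so a slow delete no longer fills
// the cache and rejects incoming writes.
func (e *Engine) deleteSeries(keys [][]byte) error {
	// Before the patch, all compactions (snapshots included) were disabled
	// here, so the cache could only grow for the duration of the delete.
	e.disableLevelCompactions()
	defer e.enableLevelCompactions()

	// Writing tombstones can be slow on large shards; concurrent writes
	// keep landing in the cache and WAL in the meantime.
	if err := e.fs.WriteTombstones(keys); err != nil {
		return err
	}

	// Deleted keys are also removed from the cache, so snapshots taken
	// while the delete is running cannot resurrect the data.
	e.cache.DeleteRange(keys, math.MinInt64, math.MaxInt64)
	return nil
}

func main() {
	e := &Engine{fs: &fileStore{}, cache: &cache{}}
	_ = e.deleteSeries([][]byte{[]byte("cpu,host=a")})
}
```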
@jwilder thank you, should we build and test or is this in the current nightly?
@sebito91 This is not merged to master yet. If you are able to test this PR and let me know if it helps, that would be great.
@jwilder will do and get back to you |
Unfortunately still seeing the same issue :/
I'll take a look some more. You could also try setting the cache max memory size (cache-max-memory-size) to 0, which should disable that limit.
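For reference, that setting lives under the [data] section of influxdb.conf; the snippet below is illustrative (adjust to your own config layout), and influxd needs a restart to pick up the change:

```toml
[data]
  # Setting this to 0 disables the cache memory limit entirely;
  # restart influxd for the change to take effect.
  cache-max-memory-size = 0
```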
@sebito91 So this PR is working better? Or did you disable the max size of the cache as well?
@jwilder this is before touching the max size...still the same setting as before. It's a bit concerning that we block metrics for an hour during deletion though; shouldn't that be caught within the WAL?
Your cache was likely already backed up from the prior failures. I would expect new deletes to not block writes with this patch. |
This was our 'backup' node, which we didn't actually run the deletes on. I'll try a fresh start and do the following:
This is our build of yesterday's HEAD + your patch.
Been out for a couple of days, sorry for the delay. I implemented the fix again against HEAD on Friday and noticed the following:
I would have thought we'd be able to cache metrics into the WAL even though we're blocking writes out to TSM, then ultimately flush from the WAL. I understand the race condition, which is basically that we might still be receiving measurements that we're deleting, but if they're in the WAL and ONLY the existing TSM files are queried for deletes, then this shouldn't result in lost data. Thoughts?
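To illustrate my mental model (a rough Go sketch with made-up names, not the real tsm1 types): as I understand it, every write lands in both the WAL and the in-memory cache, and the cache only shrinks when a snapshot compaction flushes it to TSM, so if snapshots are paused the size check eventually rejects everything:

```go
package main

import (
	"errors"
	"fmt"
)

var errCacheFull = errors.New("cache maximum memory size exceeded")

// shard is a toy model of the write path: the WAL append is assumed to
// happen regardless, but the write as a whole is rejected if the cache
// cannot accept the points.
type shard struct {
	cacheSize    uint64 // bytes currently held in the cache
	cacheMaxSize uint64 // cache-max-memory-size (0 = unlimited)
}

func (s *shard) writePoints(batchSize uint64) error {
	if s.cacheMaxSize > 0 && s.cacheSize+batchSize > s.cacheMaxSize {
		return errCacheFull // this is what blocks ingest while snapshots are paused
	}
	s.cacheSize += batchSize
	return nil
}

// snapshot simulates a snapshot compaction draining the cache to TSM.
func (s *shard) snapshot() { s.cacheSize = 0 }

func main() {
	s := &shard{cacheMaxSize: 4}
	for i := 0; i < 6; i++ {
		if err := s.writePoints(1); err != nil {
			fmt.Println("write", i, "rejected:", err)
		}
	}
	s.snapshot() // once a snapshot finally runs, writes are accepted again
	fmt.Println("after snapshot, write error:", s.writePoints(1))
}
```

If that model is right, the WAL alone can't absorb the backlog, which would explain why ingest stalls until a snapshot finally runs.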
Looking closer at this one, it appears that ANY drop measurement blocks all incoming data for exactly one hour before finally returning (allowing new metrics to come in). We just tested this on another measurement that had been collecting for about 5 minutes. To clarify, the CLI returns relatively quickly and the database is still functional and interactive...but the table we delete from becomes completely blocked for precisely one hour due to the cache max being exceeded. Once the hour passes, data resumes ingesting on the table, but that window is empty for all measurements. We will try now with cache-max-memory-size = 0 and report back.
@sebito91 Hey, any update on this particular problem? Did you try with the cache-max-memory-size config? |
Bug report
Encountered stop-the-world data ingest issues after issuing multiple drop measurement commands on limited series. These series were part of a test collector, and during cleanup the primary ingest mechanisms became completely blocked and did not unlock. The problem is getting worse...
We see many, many of these errors in the logs:
SYSTEM VERSION
System info:
Steps to reproduce:
Expected behavior: dropping a measurement should not block ALL incoming metrics indefinitely; the block should clear.
Actual behavior: incoming metrics are blocked indefinitely and the block never clears.
Additional info: [Include gist of relevant config, logs, etc.]
SUPPORT FILES ADDED TO FTP!