Persistent storage in queued_retry, backed by file storage extension #3274
Conversation
Force-pushed from decaa82 to 939c199
Reviewed partially.
exporter/exporterhelper/README.md (outdated):

> When `wal_enabled` is set, the queue is being buffered to disk by the storage extension. This has some limitations
> currently. The items that have been passed to consumer for the actual exporting are removed from WAL.
> In effect, in case of a sudden shutdown, they might be lost.
Thank you for the nice diagram!
exporter/exporterhelper/README.md (outdated):

> ### WAL
>
> When `wal_enabled` is set, the queue is being buffered to disk by the storage extension. This has some limitations
> currently. The items that have been passed to consumer for the actual exporting are removed from WAL.
Can we change this behavior and remove the item only after it is confirmed to be delivered? Otherwise we risk losing data if we crash/restart, which defeats the purpose of the persistent queue.
Yes, I believe we must do it. I didn't want to make this PR too complicated, and taking consumers into play adds some complexity. (Actually, I am not sure what the best implementation is; so far I have thought about a separate set of keys just for the consumers, to keep the currently processed items there.)
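A minimal sketch of that idea, assuming a hypothetical storage client with `Set`/`Delete` methods and a `di/<id>` key namespace (all names here are illustrative, not the PR's actual API):

```go
package exporterhelper

import (
	"context"
	"fmt"
)

// storageClient stands in for the file storage extension client
// (method shapes assumed for illustration).
type storageClient interface {
	Set(ctx context.Context, key string, value []byte) error
	Delete(ctx context.Context, key string) error
}

type persistentQueue struct {
	client storageClient
}

// markDispatched records an item under a "dispatched" key the moment
// it is handed to a consumer, so a restart can re-enqueue anything
// that was still in flight.
func (pq *persistentQueue) markDispatched(ctx context.Context, id uint64, body []byte) error {
	return pq.client.Set(ctx, fmt.Sprintf("di/%d", id), body)
}

// confirmDelivery removes the key only after the export succeeded;
// only then is the item truly gone from persistent storage.
func (pq *persistentQueue) confirmDelivery(ctx context.Context, id uint64) error {
	return pq.client.Delete(ctx, fmt.Sprintf("di/%d", id))
}
```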
```go
@@ -66,7 +74,8 @@ func DefaultQueueSettings() QueueSettings {
	// This is a pretty decent value for production.
	// User should calculate this from the perspective of how many seconds to buffer in case of a backend outage,
	// multiply that by the number of requests per second.
	QueueSize: 5000,
```
Is this the in-memory size when WAL is enabled? Do we need a second setting to limit the size of the persisted data?
I was thinking about another setting for persistent data (since the number would probably be much different), but then maybe we should just keep the same setting to keep it simple.
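For illustration, a sketch of what a separate on-disk limit could look like; the `PersistentQueueSize` field is hypothetical and not part of this PR:

```go
// QueueSettings sketch with a hypothetical second limit for persisted
// data alongside the existing in-memory one.
type QueueSettings struct {
	Enabled      bool `mapstructure:"enabled"`
	NumConsumers int  `mapstructure:"num_consumers"`
	// QueueSize limits the number of items held in memory.
	QueueSize int `mapstructure:"queue_size"`
	// PersistentQueueSize (hypothetical) would cap the items kept on
	// disk, since disk can typically hold far more than memory.
	PersistentQueueSize int `mapstructure:"persistent_queue_size"`
}
```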
exporter/exporterhelper/wal.go (outdated):

```go
// Produce adds an item to the queue and returns true if it was accepted
func (wq *WALQueue) Produce(item interface{}) bool {
```
Should this be changed to be more type-safe?
```diff
- func (wq *WALQueue) Produce(item interface{}) bool {
+ func (wq *WALQueue) Produce(item request) bool {
```
Actually, I kept it like that to match the BoundedQueue interface.
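For context, a sketch of the untyped queue contract being matched, with method names assumed from Jaeger's `BoundedQueue` rather than copied from the collector:

```go
// boundedQueue sketches the contract: Produce takes interface{} so any
// request type can flow through, which is what keeps WALQueue a
// drop-in replacement.
type boundedQueue interface {
	// StartConsumers launches num goroutines that invoke callback for
	// every item taken off the queue.
	StartConsumers(num int, callback func(item interface{}))
	// Produce adds an item and reports whether it was accepted.
	Produce(item interface{}) bool
	// Stop drains the queue and terminates the consumers.
	Stop()
}
```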
This PR was marked stale due to lack of activity. It will be closed in 7 days.
@pmm-sumo ping me when it is time to review again.
Got a bit delayed, but I am getting back to this now and should have an update by the end of the week.
Force-pushed from 939c199 to 4328cc6
@tigrannajaryan I have updated the codebase. It is now based on just the file storage interface (#3425); in effect, it currently requires OpenTelemetry Collector Contrib to run. Let me know if you would like to arrange it differently (and move most of the code here to Contrib; I wasn't sure how to do that without making it much more complex). I renamed it to "Persistent Queue" and also addressed most of the comments. The major one still missing is persistence for the consumers (i.e. items that have already been passed to consumers for exporting).
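For reference, a sketch of the storage client shape the queue is written against; the method set is assumed to follow the file storage extension's key/value style and may not match #3425 verbatim:

```go
import "context"

// Client is a component-scoped key/value store with []byte values,
// as exposed by the storage extension (sketch, names assumed).
type Client interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error
	Delete(ctx context.Context, key string) error
}
```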
@pmm-sumo sorry for the delayed review. I think one typical approach to fix this problem is to have a queue with 3 pointers: write, read, and delete. Reading does not delete the item from the queue. So the consuming end looks something like this:
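The comment's original pseudocode did not survive extraction; below is a minimal single-consumer sketch of that loop in Go, reusing the `Client` shape from the previous sketch, with the key layout and helper names as assumptions:

```go
import (
	"context"
	"fmt"
)

// pointerQueue is hypothetical: items are appended at writePtr, read
// at readPtr, and physically removed only up to deletePtr.
type pointerQueue struct {
	client                       Client
	writePtr, readPtr, deletePtr uint64
	export                       func(context.Context, []byte) error
}

func itemKey(i uint64) string { return fmt.Sprintf("item/%d", i) }

// consume reads without deleting; the delete pointer advances only
// after delivery is confirmed, so a crash between read and delete
// leaves those items on disk to be replayed at startup.
func (q *pointerQueue) consume(ctx context.Context) error {
	for q.readPtr < q.writePtr {
		item, err := q.client.Get(ctx, itemKey(q.readPtr))
		if err != nil {
			return err
		}
		q.readPtr++ // the item stays on disk for now

		if err := q.export(ctx, item); err != nil {
			return err // not deleted, so it survives a restart
		}

		// Delivery confirmed: erase everything up to the read pointer.
		for q.deletePtr < q.readPtr {
			if err := q.client.Delete(ctx, itemKey(q.deletePtr)); err != nil {
				return err
			}
			q.deletePtr++
		}
	}
	return nil
}
```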
I think it is important to validate that the current storage interface allows such behavior to be supported efficiently.
Thank you @tigrannajaryan. With multiple consumers this gets a bit more complicated, as the range between the delete and read pointers can contain a mix of items that have already been delivered and items that are still in flight.
Force-pushed from 939849f to e4ffa81
Hi @tigrannajaryan, can we get another review on this PR? @pmm-sumo is out for a couple of days now, but we'll happily address any outstanding issues if there are any.
Hi @tigrannajaryan, do we want to resolve the issue/challenge of batching in this PR, or do we want to park it until later (perhaps create an issue for it)? Once that's decided, we can think about resolving the conflicts and fixing the tests which got broken in 0.29, if I understand the changelog correctly (#3455 seems to be the main culprit).
Can we discuss a few possible … I would not want to merge …
Commit: …fering capability (Co-authored-by: Dominik Rosiek <drosiek@sumologic.com>)
Force-pushed from 0758e0f to 5326031
Force-pushed from 2cf71cc to 0a65246
I have created a few more targets in …
@bogdandrutu PTAL.
@open-telemetry/collector-approvers if you reviewed this, please approve if you are OK with the PR.
LGTM
Since we have just made a release, now is a good time to merge, so we have enough time until the next release to fix any issues.
@pmm-sumo thank you for your patience and the great effort to get this done!
Description:
Persistent queue implementation within queued_retry, aimed at being compatible with Jaeger's BoundedQueue interface (providing a simple replacement) and backed by the file storage extension for storing the WAL.
Currently, to run the persistent queue, OpenTelemetry Collector Contrib built with the `enable_unstable` build tag is required.
Link to tracking Issue: #2285
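To illustrate what the `enable_unstable` gating means in practice, a sketch of a build-tag-guarded Go file (the file contents are assumed, not taken from the PR):

```go
//go:build enable_unstable
// +build enable_unstable

// This file compiles only when the collector is built with
// `go build -tags enable_unstable`; without the tag, the persistent
// queue implementation is excluded from the binary.
package exporterhelper
```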
Design doc
Testing: Unit Tests and manual testing, more to come
Documentation: README.md updated, including an example