Persistent storage in queued_retry, backed by file storage extension #3274

Merged

Conversation

@pmm-sumo (Contributor) commented May 24, 2021:

Description:

Persistent queue implementation within queued_retry, aimed at being compatible with Jaeger's BoundedQueue interface (providing a simple replacement) and backed by the file storage extension for storing the WAL.

Currently, running the persistent queue requires OpenTelemetry Collector Contrib built with the `enable_unstable` build tag.

Link to tracking Issue: #2285

Design doc

Testing: Unit Tests and manual testing, more to come

Documentation: README.md updated, including an example
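
For background on the `enable_unstable` gate: Go build tags typically guard such files along the following lines (a generic illustration, not a file taken from this PR):

```go
//go:build enable_unstable
// +build enable_unstable

// A generic illustration of how an enable_unstable-gated file is laid
// out: it is compiled only when the collector is built with
// `-tags enable_unstable`, keeping experimental code such as the
// persistent queue out of default builds.
package exporterhelper
```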

@pmm-sumo force-pushed the queued-retry-storage-extension-wal branch 4 times, most recently from decaa82 to 939c199 (May 25, 2021 12:15)
@tigrannajaryan (Member) left a review:

Reviewed partially.


> When `wal_enabled` is set, the queue is being buffered to disk by the storage extension. This has some limitations
> currently. The items that have been passed to consumer for the actual exporting are removed from WAL.
> In effect, in case of a sudden shutdown, they might be lost.
Member:

Thank you for the nice diagram!

> ### WAL
>
> When `wal_enabled` is set, the queue is being buffered to disk by the storage extension. This has some limitations
> currently. The items that have been passed to consumer for the actual exporting are removed from WAL.
Member:

Can we change this behavior and remove an item only after it is confirmed to be delivered? Otherwise we risk losing data if we crash/restart, which defeats the purpose of the persistent queue.

@pmm-sumo (Contributor, Author):

Yes, I believe we must do it. I didn't want to make this PR too complicated, and taking consumers into play adds some level of complexity. (Actually, I am not sure what the best implementation is; so far I have thought about a separate set of keys just for the consumers, to keep the currently processed items there.)

exporter/exporterhelper/README.md (comments resolved)
@@ -66,7 +74,8 @@ func DefaultQueueSettings() QueueSettings {
// This is a pretty decent value for production.
// User should calculate this from the perspective of how many seconds to buffer in case of a backend outage,
// multiply that by the number of requests per seconds.
QueueSize: 5000,
Member:

Is this the in-memory size when WAL is enabled? Do we need a second setting to limit the size of the persisted data?

@pmm-sumo (Contributor, Author):

I was thinking about another setting for persistent data (since the number would probably be quite different), but then maybe we should just keep a single setting, to keep it simple.
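
Purely to illustrate the trade-off being discussed, a split setting could look roughly like the sketch below (the existing fields are abridged from memory and `PersistentQueueSize` is hypothetical, not something this PR introduces):

```go
package exporterhelper

// QueueSettings sketch with a hypothetical separate limit for the
// on-disk queue; the alternative discussed above is to reuse
// QueueSize for both the in-memory and the persistent case.
type QueueSettings struct {
	// Enabled indicates whether to enqueue batches before sending.
	Enabled bool `mapstructure:"enabled"`
	// NumConsumers is the number of consumers reading from the queue.
	NumConsumers int `mapstructure:"num_consumers"`
	// QueueSize limits the number of items held in memory.
	QueueSize int `mapstructure:"queue_size"`
	// PersistentQueueSize (hypothetical) would limit the number of
	// items buffered on disk when the persistent queue is enabled.
	PersistentQueueSize int `mapstructure:"persistent_queue_size"`
}
```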

exporter/exporterhelper/logs.go (comments resolved)
exporter/exporterhelper/wal.go (comments resolved)
}

// Produce adds an item to the queue and returns true if it was accepted
func (wq *WALQueue) Produce(item interface{}) bool {
Member:

Should this be changed to be more type safe?

Suggested change:
- func (wq *WALQueue) Produce(item interface{}) bool {
+ func (wq *WALQueue) Produce(item request) bool {

@pmm-sumo (Contributor, Author):

Actually, I kept it like that to match the BoundedQueue interface.
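
For context, the Jaeger-style BoundedQueue that `Produce(item interface{})` is matching looks roughly like this (method names are approximate, given as an assumption rather than the exact interface vendored in the collector):

```go
package queue

// boundedQueue sketches the Jaeger-style interface that WALQueue aims
// to be a drop-in replacement for; keeping Produce untyped preserves
// compatibility with existing call sites.
type boundedQueue interface {
	// Produce adds an item to the queue; it returns false if the item
	// was rejected (for example because the queue is full or stopped).
	Produce(item interface{}) bool
	// StartConsumers launches num goroutines, each invoking callback
	// for items taken off the queue.
	StartConsumers(num int, callback func(item interface{}))
	// Stop stops accepting new items and shuts down the consumers.
	Stop()
	// Size returns the number of items currently in the queue.
	Size() int
}
```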

@github-actions bot (Contributor) commented Jun 3, 2021:

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Jun 3, 2021
@bogdandrutu bogdandrutu removed the Stale label Jun 7, 2021
@tigrannajaryan (Member):

@pmm-sumo ping me when it is time to review again.

@pmm-sumo (Contributor, Author) commented Jun 9, 2021:

> @pmm-sumo ping me when it is time to review again.

Got a bit delayed, but I am getting back to this now and should have an update by the end of the week.

@pmm-sumo force-pushed the queued-retry-storage-extension-wal branch from 939c199 to 4328cc6 (June 16, 2021 10:07)
@pmm-sumo (Contributor, Author):

@tigrannajaryan I have updated the codebase. It is now based on just the file storage interface (#3425). In effect, it currently requires OpenTelemetry Collector Contrib to run. Let me know if you would like to arrange it differently (and move most of the code here to Contrib; I wasn't sure how to do that without making it much more complex).

I renamed it to "Persistent Queue" and also addressed most of the comments. The major one missing is persistence for the consumers (i.e. when the request is taken from the queue). I was thinking about having separate items for each of the consumers, backed by the queue; that would require extra marshaling/unmarshaling but is probably the simplest/cleanest approach.
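
For reference, the file storage client used here exposes, roughly, a small key/value interface along these lines (a sketch from memory; the exact signatures in the version this PR targets may differ):

```go
package storage

import "context"

// Client sketches the key/value surface the persistent queue relies
// on: byte-slice values addressed by string keys.
type Client interface {
	// Get returns the value stored under key, or nil if absent.
	Get(ctx context.Context, key string) ([]byte, error)
	// Set stores value under key, overwriting any previous value.
	Set(ctx context.Context, key string, value []byte) error
	// Delete removes key; deleting a missing key is not an error.
	Delete(ctx context.Context, key string) error
	// Close releases any resources held by the client.
	Close(ctx context.Context) error
}
```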

@tigrannajaryan tigrannajaryan self-assigned this Jun 16, 2021
@tigrannajaryan (Member):

> The major one missing is persistence for the consumers (i.e. when the request is taken from the queue).

@pmm-sumo sorry for the delayed review.

I think one typical approach to fix this problem is to have a queue with 3 pointers: write, read and delete. Reading does not delete the item from the queue. So the consuming end looks something like this:

for {
  item := read(queue)
  err := send(item) // this is a blocking call that only returns when sending is done or fails
  if err == nil {
    delete(queue, item)
  } else {
    // retry logic
  }
}

I think it is important to validate that the current storage interface can support such behavior efficiently.

@pmm-sumo (Contributor, Author):

> The major one missing is persistence for the consumers (i.e. when the request is taken from the queue).
>
> @pmm-sumo sorry for the delayed review.
>
> I think one typical approach to fix this problem is to have a queue with 3 pointers: write, read and delete. Reading does not delete the item from the queue. So the consuming end looks something like this:
> (...)

Thank you @tigrannajaryan

With multiple consumers this gets a bit more complicated, as the range between delete and read is not contiguous. I am finalising an implementation right now that uses callbacks and another key/value entry which stores the addresses of items that are currently being processed. Going to run a couple of tests and update the PR soon.
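
Roughly, the idea is to keep the keys of currently dispatched items under a dedicated storage entry and to delete an item only after its consumer callback reports success. A simplified sketch of that bookkeeping (identifiers are hypothetical, not the PR's actual ones):

```go
package queue

import (
	"context"
	"encoding/json"
)

// storageClient is the subset of the storage extension client needed
// for this sketch (see the interface sketch earlier in the thread).
type storageClient interface {
	Set(ctx context.Context, key string, value []byte) error
	Delete(ctx context.Context, key string) error
}

// dispatchedItemsKey is a hypothetical key that holds the list of
// items currently handed over to consumers.
const dispatchedItemsKey = "dispatched_items"

type persistentQueueSketch struct {
	client   storageClient
	inFlight map[string]struct{}
}

// onItemDispatched records that itemKey was handed to a consumer, so
// that after a crash the item can be re-queued instead of being lost.
func (pq *persistentQueueSketch) onItemDispatched(ctx context.Context, itemKey string) error {
	pq.inFlight[itemKey] = struct{}{}
	return pq.storeInFlight(ctx)
}

// onProcessingFinished is the callback a consumer invokes once an item
// has been delivered; only then is the item removed from storage.
func (pq *persistentQueueSketch) onProcessingFinished(ctx context.Context, itemKey string) error {
	delete(pq.inFlight, itemKey)
	if err := pq.client.Delete(ctx, itemKey); err != nil {
		return err
	}
	return pq.storeInFlight(ctx)
}

// storeInFlight persists the current set of dispatched item keys.
func (pq *persistentQueueSketch) storeInFlight(ctx context.Context) error {
	keys := make([]string, 0, len(pq.inFlight))
	for k := range pq.inFlight {
		keys = append(keys, k)
	}
	buf, err := json.Marshal(keys)
	if err != nil {
		return err
	}
	return pq.client.Set(ctx, dispatchedItemsKey, buf)
}
```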

@pmm-sumo force-pushed the queued-retry-storage-extension-wal branch from 939849f to e4ffa81 (June 23, 2021 16:59)
@pmm-sumo pmm-sumo marked this pull request as ready for review June 23, 2021 19:20
@pmm-sumo pmm-sumo requested review from a team and jpkrohling June 23, 2021 19:20
@pmalek-sumo (Contributor):

Hi @tigrannajaryan,

Can we get another review on this PR?

@pmm-sumo is out for a couple of days now but we'll happily address any outstanding issues if there are any.

exporter/exporterhelper/persistent_storage.go (several comments resolved)
@pmalek-sumo (Contributor) commented Jul 5, 2021:

Hi @tigrannajaryan,

Do we want to resolve the issue/challenge of batching in this PR or do we want to park this until later on? (perhaps create an issue for this)

If that's decided, we can then think of resolving conflicts and fixing tests which got introduced in 0.29, if I understand the changelog correctly (#3455 seems to be the main culprit).

@tigrannajaryan (Member):

> Do we want to resolve the issue/challenge of batching in this PR or do we want to park this until later on? (perhaps create an issue for this)

Can we discuss a few possible Storage interfaces that support batched Set or Get or even transactions with multiple operations mixed?

I would not want to merge a Storage interface which we know is not good enough. It would be best to have the right interface from the beginning. This PR does not necessarily need to be refactored to take advantage of the improved Storage, that can be a separate PR, but it would again be great to prove that the improved Storage serves our needs.
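
One possible shape for such a batch-capable interface, offered purely as a discussion aid (not an API taken from this PR): mixed Get/Set/Delete operations submitted in a single call, ideally executed as one transaction.

```go
package storage

import "context"

// OpType distinguishes the kinds of operations a batch may contain.
type OpType int

const (
	OpGet OpType = iota
	OpSet
	OpDelete
)

// Operation is one element of a batch; for Get operations the Value
// field is filled in by the backend when the batch returns.
type Operation struct {
	Type  OpType
	Key   string
	Value []byte
}

// BatchingClient extends a plain key/value client with a Batch call
// that executes several operations in one round trip.
type BatchingClient interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Set(ctx context.Context, key string, value []byte) error
	Delete(ctx context.Context, key string) error
	Batch(ctx context.Context, ops ...*Operation) error
}
```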

@pmm-sumo force-pushed the queued-retry-storage-extension-wal branch from 0758e0f to 5326031 (September 3, 2021 15:38)
@pmm-sumo force-pushed the queued-retry-storage-extension-wal branch from 2cf71cc to 0a65246 (September 3, 2021 15:51)
@pmm-sumo (Contributor, Author) commented Sep 3, 2021:

> Please also add unstable build and test to GitHub actions (if not already).

I have created a few more targets in the Makefile, and currently `make gotest` executes both `test` and `test-unstable`. As for the GitHub Actions, `build-and-test` also creates an unstable set of binaries (though to be honest they are not very useful; one needs a persistent storage implementation, which is available in Contrib).

@tigrannajaryan (Member):

@bogdandrutu PTAL.

@tigrannajaryan (Member):

@open-telemetry/collector-approvers if you reviewed this please approve if you are OK with the PR.

@djaglowski (Member) left a review:

LGTM

@tigrannajaryan (Member):

Since we have just made a release, now is a good time to merge, so we have enough time until the next release to fix any issues.
I am going to merge the PR; if Bogdan or other approvers have comments, they can be addressed in a future PR.

@tigrannajaryan tigrannajaryan merged commit eca1ba7 into open-telemetry:main Sep 10, 2021
@tigrannajaryan (Member):

@pmm-sumo thank you for your patience and the great effort to get this done!
