Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/stanza: add backpropagation of number of processed entries #16452

Closed

Conversation

sumo-drosiek
Copy link
Member

@sumo-drosiek sumo-drosiek commented Nov 23, 2022

Description:

This is fix for #15378

Idea is to backpropagate information how many entries has been processed by pipeline

Link to tracking Issue:
#15378

Testing:
Manual
Updated unit tests

Documentation:
In code comments

@github-actions github-actions bot added pkg/stanza processor/logstransform Logs Transform processor labels Nov 23, 2022
@sumo-drosiek sumo-drosiek marked this pull request as ready for review November 24, 2022 14:50
@sumo-drosiek sumo-drosiek requested a review from a team November 24, 2022 14:50
@sumo-drosiek sumo-drosiek changed the title pkg/stanza: attempt to fix logstransformprocessor flow pkg/stanza: add backpropagation of number of processed entries Nov 25, 2022
Copy link
Member

@djaglowski djaglowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me in general.

The unit tests are great, but I think given the overall purpose of this change, we should also have a few tests cases that validate this the return values are propagated back through an entire pipeline.

@sumo-drosiek
Copy link
Member Author

@djaglowski Do you have an idea how to do it technically? Not really sure how the test scenario should look like.

Do you want me to create a receiver which consists of several operators, and then validate return value of the last receiver?

@djaglowski
Copy link
Member

You can build and start a pipeline at the pkg/stanza level that is basically just a slice of operators. (Default operator not required)

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@runforesight
Copy link

runforesight bot commented Dec 27, 2022

Foresight Summary

    
Major Impacts

build-and-test-windows duration(4 seconds) has decreased 45 minutes 13 seconds compared to main branch avg(45 minutes 17 seconds).
View More Details

⭕  build-and-test-windows workflow has finished in 4 seconds (45 minutes 13 seconds less than main branch avg.) and finished at 29th Dec, 2022.


Job Failed Steps Tests
windows-unittest-matrix -     🔗  N/A See Details
windows-unittest -     🔗  N/A See Details

✅  tracegen workflow has finished in 1 minute 5 seconds (2 minutes 54 seconds less than main branch avg.) and finished at 29th Dec, 2022.


Job Failed Steps Tests
build-dev -     🔗  N/A See Details
publish-latest -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details

✅  check-links workflow has finished in 1 minute 37 seconds (1 minute 59 seconds less than main branch avg.) and finished at 29th Dec, 2022.


Job Failed Steps Tests
changed files -     🔗  N/A See Details
check-links -     🔗  N/A See Details

✅  changelog workflow has finished in 2 minutes 12 seconds (5 minutes 51 seconds less than main branch avg.) and finished at 29th Dec, 2022.


Job Failed Steps Tests
changelog -     🔗  N/A See Details

✅  prometheus-compliance-tests workflow has finished in 3 minutes 36 seconds (6 minutes 38 seconds less than main branch avg.) and finished at 29th Dec, 2022.


Job Failed Steps Tests
prometheus-compliance-tests -     🔗  ✅ 21  ❌ 0  ⏭ 0    🔗 See Details

✅  build-and-test workflow has finished in 35 minutes 41 seconds (22 minutes 39 seconds less than main branch avg.) and finished at 29th Dec, 2022.


Job Failed Steps Tests
unittest-matrix (1.19, internal) -     🔗  ✅ 597  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, internal) -     🔗  ✅ 597  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, processor) -     🔗  ✅ 1476  ❌ 0  ⏭ 0    🔗 See Details
correctness-metrics -     🔗  ✅ 2  ❌ 0  ⏭ 0    🔗 See Details
correctness-traces -     🔗  ✅ 17  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, processor) -     🔗  ✅ 1476  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, extension) -     🔗  ✅ 528  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, extension) -     🔗  ✅ 528  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, receiver-0) -     🔗  ✅ 2563  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, receiver-0) -     🔗  ✅ 2563  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, exporter) -     🔗  ✅ 2450  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, exporter) -     🔗  ✅ 2450  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, other) -     🔗  ✅ 4401  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, other) -     🔗  ✅ 4401  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.18, receiver-1) -     🔗  ✅ 1886  ❌ 0  ⏭ 0    🔗 See Details
unittest-matrix (1.19, receiver-1) -     🔗  ✅ 1886  ❌ 0  ⏭ 0    🔗 See Details
integration-tests -     🔗  ✅ 55  ❌ 0  ⏭ 0    🔗 See Details
setup-environment -     🔗  N/A See Details
check-collector-module-version -     🔗  N/A See Details
check-codeowners -     🔗  N/A See Details
lint-matrix (receiver-0) -     🔗  N/A See Details
lint-matrix (receiver-1) -     🔗  N/A See Details
lint-matrix (processor) -     🔗  N/A See Details
lint-matrix (exporter) -     🔗  N/A See Details
lint-matrix (extension) -     🔗  N/A See Details
lint-matrix (internal) -     🔗  N/A See Details
lint-matrix (other) -     🔗  N/A See Details
build-examples -     🔗  N/A See Details
checks -     🔗  N/A See Details
lint -     🔗  N/A See Details
unittest (1.19) -     🔗  N/A See Details
unittest (1.18) -     🔗  N/A See Details
cross-compile (darwin, amd64) -     🔗  N/A See Details
cross-compile (darwin, arm64) -     🔗  N/A See Details
cross-compile (linux, 386) -     🔗  N/A See Details
cross-compile (linux, amd64) -     🔗  N/A See Details
cross-compile (linux, arm) -     🔗  N/A See Details
cross-compile (linux, arm64) -     🔗  N/A See Details
cross-compile (linux, ppc64le) -     🔗  N/A See Details
cross-compile (windows, 386) -     🔗  N/A See Details
cross-compile (windows, amd64) -     🔗  N/A See Details
build-package (deb) -     🔗  N/A See Details
build-package (rpm) -     🔗  N/A See Details
windows-msi -     🔗  N/A See Details
publish-check -     🔗  N/A See Details
publish-dev -     🔗  N/A See Details
publish-stable -     🔗  N/A See Details

✅  load-tests workflow has finished in 8 minutes 21 seconds (9 minutes 11 seconds less than main branch avg.) and finished at 29th Dec, 2022.


Job Failed Steps Tests
loadtest (TestIdleMode) -     🔗  ✅ 1  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestTraceAttributesProcessor) -     🔗  ✅ 3  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestMetric10kDPS|TestMetricsFromFile) -     🔗  ✅ 6  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestTraceNoBackend10kSPS|TestTrace1kSPSWithAttrs) -     🔗  ✅ 8  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestMetricResourceProcessor|TestTrace10kSPS) -     🔗  ✅ 12  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestTraceBallast1kSPSWithAttrs|TestTraceBallast1kSPSAddAttrs) -     🔗  ✅ 10  ❌ 0  ⏭ 0    🔗 See Details
loadtest (TestBallastMemory|TestLog10kDPS) -     🔗  ✅ 19  ❌ 0  ⏭ 0    🔗 See Details
setup-environment -     🔗  N/A See Details

🔎 See details on Foresight

*You can configure Foresight comments in your organization settings page.

@djaglowski djaglowski removed the Stale label Dec 27, 2022
@sumo-drosiek
Copy link
Member Author

@djaglowski Is this change still wanted according to the #17079?

@djaglowski
Copy link
Member

My understanding is that the issue is resolved by #17079.

The ConsumeLogs function is now only blocked when waiting to write to a channel, whereas we formerly were blocking with an expectation of being able to read from a channel. I would certainly appreciate your take on this as well, if you have time. Either way, thanks for all the effort you put into this PR.

@sumo-drosiek
Copy link
Member Author

@djaglowski I rebased the PR. Backpropagation is added, but it's not used in logstransformprocessor for now.

Looking at the #17079 change, I'm not sure if asynchronous processors are align with the pipeline architecture. In current state it will take all logs and claim that they are processed correctly:
image

I can revert this to the desired state of my commit, but not sure what is the decision here.

@djaglowski
Copy link
Member

I'm not sure if asynchronous processors are align with the pipeline architecture.

I may be wrong, but I don't think processors are necessarily expected to be synchronous. For example, batchprocessor is one of only two in the core collector and it is asynchronous.

In current state it will take all logs and claim that they are processed correctly

Can you elaborate on this? I'm not familiar with how this is communicated and what impact it has.

@sumo-drosiek
Copy link
Member Author

sumo-drosiek commented Dec 30, 2022

In current state it will take all logs and claim that they are processed correctly

Can you elaborate on this? I'm not familiar with how this is communicated and what impact it has.

I'm not sure if asynchronous processors are align with the pipeline architecture.

I may be wrong, but I don't think processors are necessarily expected to be synchronous. For example, batchprocessor is one of only two in the core collector and it is asynchronous.

There is an error backpropagation:

https://github.com/open-telemetry/opentelemetry-collector/blob/e4e76dec22ecf8cf1d5b70af6d74215f0e05f0c0/processor/processorhelper/logs.go#L65-L71

Not really sure who and how handle the errors, but reminds me there is some issue with batchprocessor do not backpropagating the errors:

When exporter cannot handle more data and has to drop them it returns error to batchprocessor, but it is not backpropagated to receiver, so we cannot handle it (stop receiving data, ask client to stop sending them for some time). (@swiatekm-sumo correct me if I'm wrong)


Anyway, I think the PR is ready and making logstransformprocessor sync or not should be matter of another PR.

@djaglowski
Copy link
Member

In order to justify the additional complexity, we need to have a clear design for how this would be used. It's no longer clear to me what these changes would accomplish. What will we do with the returned integer, specifically?

@swiatekm
Copy link
Contributor

swiatekm commented Jan 11, 2023

When exporter cannot handle more data and has to drop them it returns error to batchprocessor, but it is not backpropagated to receiver, so we cannot handle it (stop receiving data, ask client to stop sending them for some time). (@swiatekm-sumo correct me if I'm wrong)

Yeah, this is correct, and tracked in open-telemetry/opentelemetry-collector#4646.

In general, if we have an asynchronous component, that component essentially acts as a buffer and needs to take care of its own retry logic. Being asynchronous, it can't return errors to the caller, but it can and should retry on errors it gets from the rest of the pipeline. Otherwise it just drops data without providing any feedback and stops conducting backpressure.

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 27, 2023
@sumo-drosiek
Copy link
Member Author

@djaglowski any input here? I don't feel this change makes sense in other scenarios (maybe some metric to point how many data has been dropped in the queue)

@djaglowski
Copy link
Member

In general, if we have an asynchronous component, that component essentially acts as a buffer and needs to take care of its own retry logic. Being asynchronous, it can't return errors to the caller, but it can and should retry on errors it gets from the rest of the pipeline. Otherwise it just drops data without providing any feedback and stops conducting backpressure.

Thanks for explaining this, it makes sense to me now why this processor would need special considerations for handling backpressure.

Do we have the same problem in the pkg/stanza receivers? They all share the same conversion/batching strategy. This design has been questioned in the past, mostly for its complexity. I wonder if the correct solution here is to make the shared converter synchronous.

@github-actions github-actions bot removed the Stale label Jan 31, 2023
@sumo-drosiek
Copy link
Member Author

Do we have the same problem in the pkg/stanza receivers? They all share the same conversion/batching strategy. This design has been questioned in the past, mostly for its complexity. I wonder if the correct solution here is to make the shared converter synchronous.

It seems that errors are only logged out by the receiver. There is no try to handle recoverable errors or to inform client that data has been dropped

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/adapter/receiver.go#L134-L139

@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label May 26, 2023
@sumo-drosiek
Copy link
Member Author

It seems that errors are only logged out by the receiver. There is no try to handle recoverable errors or to inform client that data has been dropped

@djaglowski Should I create separate issue for this?

@djaglowski
Copy link
Member

@sumo-drosiek, I think a separate issue makes sense.

@github-actions github-actions bot removed the Stale label Jun 1, 2023
@github-actions
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Jun 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants