Component(s)
exporter/loadbalancing
What happened?
Description
We had the previous issue: #35378. However, this is unfortunately not quite solved.
There is the top-level exporter queue and the sub-exporter queues. To avoid data loss on an autoscale-down event, there can't be queues on the sub-exporters. However, if you disable this queue, exporting becomes extremely slow because the endpoints are exported to synchronously, in series.
So, basically: if we have a queue, we get data loss on scaling. If we don't have a queue, we get data loss from performance issues because exporting is too slow. This problem gets worse the more downstream endpoints you have.
I've annotated the problematic code here with comments:
for exp, td := range exporterSegregatedTraces {
    start := time.Now()
    // with no queue, this blocks.
    // with queue, this doesn't block but causes data loss when the downstream gets scaled down.
    err := exp.ConsumeTraces(ctx, td)
    exp.consumeWG.Done()
    errs = multierr.Append(errs, err)
    ...
}
Steps to Reproduce
Disable the sub-exporter queue:
loadbalancing:
  # top-level exporter queue
  sending_queue:
    queue_size: 6000
    num_consumers: 200 # even with high consumer amount the problem persisted for us
  protocol:
    otlp:
      sending_queue:
        enabled: false
Then throw some load at the collector.
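For example, load can be generated with the telemetrygen tool from opentelemetry-collector-contrib (the endpoint, rate, and duration below are illustrative values, not taken from our setup; check the tool's current flags):

telemetrygen traces --otlp-endpoint localhost:4317 --otlp-insecure --rate 1000 --duration 60s --workers 8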
The result is severe performance degradation and very high export times.
Proposed solution
I think we need a way of disabling the sub-exporter queue without blocking in series here: an errgroup or similar, so the consumers can export to the different endpoints in parallel without relying on the sub-exporter queue. Open to other ideas to solve the problem, but this seems simplest to me.
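As a rough illustration of that idea (a sketch only, reusing the names from the loop above; the real change would need to fit the component's existing error handling and any concurrency limits):

// Sketch: fan out the per-endpoint exports instead of looping in series.
var g errgroup.Group // golang.org/x/sync/errgroup
for exp, td := range exporterSegregatedTraces {
    exp, td := exp, td // capture loop variables (needed before Go 1.22)
    g.Go(func() error {
        defer exp.consumeWG.Done()
        // Each endpoint exports concurrently, so one slow endpoint no longer
        // stalls the rest when the sub-exporter queue is disabled.
        return exp.ConsumeTraces(ctx, td)
    })
}
// g.Wait() returns only the first non-nil error; collecting every endpoint's
// error (e.g. per-goroutine with multierr) would preserve today's reporting.
errs = multierr.Append(errs, g.Wait())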
Collector version
v0.115.0
Environment information
No response
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response
@jpkrohling the failoverconnector idea is interesting; I'd be interested to hear more about that.
I think solving this would be relatively easy though: if there is no queue on the sub-exporter, we should just fork-and-wait the exporters using an errgroup or similar.
I would also like to solve the problem that when a single sub-exporter fails, the whole batch is retried instead of just the failed parts. Would this be solved by something like the failover connector?
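For the partial-retry part, one possible shape (purely a hypothetical sketch, separate from the failover connector question, kept sequential here only for brevity; it could be combined with the errgroup fan-out above) would be to hand back only the failed endpoints' data via consumererror.NewTraces:

// Hypothetical: accumulate only the data whose endpoint export failed, so the
// top-level queue/retry resends just that part instead of the whole batch.
failed := ptrace.NewTraces()
var errs error
for exp, td := range exporterSegregatedTraces {
    if err := exp.ConsumeTraces(ctx, td); err != nil {
        // Move this endpoint's spans into the "failed" payload.
        td.ResourceSpans().MoveAndAppendTo(failed.ResourceSpans())
        errs = multierr.Append(errs, err)
    }
}
if errs != nil {
    // consumererror.NewTraces attaches the failed data to the error, so the
    // retry machinery can resend only that data.
    return consumererror.NewTraces(errs, failed)
}
return nil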