
[exporter/loadbalancing] Data loss still unavoidable under high load and autoscaling #36717

Open
jamesmoessis opened this issue Dec 9, 2024 · 4 comments
Labels: bug (Something isn't working), exporter/loadbalancing, needs triage (New item requiring triage)

@jamesmoessis (Contributor) commented Dec 9, 2024

Component(s)

exporter/loadbalancing

What happened?

Description

We had a previous issue about this: #35378

However, that unfortunately did not quite solve the problem.

There is the top-level exporter queue and the per-endpoint sub-exporter queues. To avoid data loss on a scale-down event, there can't be queues on the sub-exporters. However, if you disable that queue, exporting becomes extremely slow because the endpoints are exported to synchronously, one after another.

So, basically: if we have a sub-exporter queue, we get data loss on scaling; if we don't have one, we get data loss from performance issues because exporting is too slow to keep up. The problem gets worse the more downstream endpoints you have.

I've annotated the problematic code here with comments:

for exp, td := range exporterSegregatedTraces {
  start := time.Now()
  // with no queue, this blocks.
  // with queue, this doesn't block but causes data loss when the downstream gets scaled down.
  err := exp.ConsumeTraces(ctx, td) 
  exp.consumeWG.Done()
  errs = multierr.Append(errs, err)
  ...
}

Steps to Reproduce

Disable the sub-exporter queue:

  loadbalancing:
    # top-level exporter queue
    sending_queue:
      queue_size: 6000
      num_consumers: 200 # even with a high consumer count, the problem persisted for us
    protocol:
      otlp:
        sending_queue:
          enabled: false

Then throw some load at the collector.

The result is severe performance degradation and high export times.

Proposed solution

I think we need a way of disabling the sub-exporter queue without blocking in series here, for example an errgroup or similar so the consumers can export to the different endpoints in parallel without relying on the sub-exporter queue. I'm open to other ideas to solve the problem, but this seems simplest to me; a rough sketch is below.
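Something like the following (a minimal, untested sketch, not the component's actual code; it assumes the same exporterSegregatedTraces, exp.consumeWG, and ctx as the loop above, and uses golang.org/x/sync/errgroup):

// Fan the per-endpoint exports out to goroutines instead of exporting in
// series, so disabling the sub-exporter queue no longer serializes exports.
var g errgroup.Group
for exp, td := range exporterSegregatedTraces {
  exp, td := exp, td // capture the loop variables for the goroutine
  g.Go(func() error {
    defer exp.consumeWG.Done()
    return exp.ConsumeTraces(ctx, td)
  })
}
// Wait blocks until every export has finished and returns the first non-nil
// error; aggregating all errors (as the current multierr loop does) would
// need a little extra bookkeeping.
err := g.Wait()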

Collector version

v0.115.0

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

jamesmoessis added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Dec 9, 2024
github-actions bot commented Dec 9, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jamesmoessis (Contributor, Author) commented:
cc @MovieStoreGuy @Fiery-Fenix

@jpkrohling (Member) commented:
I wonder if it's time to rethink the way we are handling failures at the load-balancing exporter and delegate some of that to the failoverconnector.
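Very roughly, one possible shape (an untested sketch; the pipeline names and the otlp/backup exporter are made up for illustration, and the failover connector settings should be checked against its README): the failover connector routes data to prioritized pipelines, trying the load-balanced pipeline first and falling back to a plain OTLP exporter when it fails.

connectors:
  failover:
    # ordered fallback: try the load-balanced pipeline first,
    # then the backup pipeline if exporting through it fails
    priority_levels:
      - [traces/loadbalanced]
      - [traces/backup]

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [failover]
    traces/loadbalanced:
      receivers: [failover]
      exporters: [loadbalancing]
    traces/backup:
      receivers: [failover]
      exporters: [otlp/backup]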

@jamesmoessis (Contributor, Author) commented:
@jpkrohling the failoverconnector idea is interesting, I'd be interested to hear more about that.

I think solving this would be relatively easy though: if there is no queue on the sub-exporter, we should just fork and wait on the exporters using an errgroup or similar, as sketched above.

I'd also like to solve the problem where, if a single sub-exporter fails, the whole batch is retried instead of just the failed parts. Would something like the failover connector solve that?
