
[exporter/loadbalancing] Data loss still unavoidable under high load and autoscaling #36717

Open
jamesmoessis opened this issue Dec 9, 2024 · 4 comments
Labels: bug (Something isn't working), exporter/loadbalancing, needs triage (New item requiring triage)

@jamesmoessis (Contributor) commented Dec 9, 2024

Component(s)

exporter/loadbalancing

What happened?

Description

We had a previous issue about this: #35378

However, that unfortunately did not quite solve the problem.

There is the top-level exporter queue and the per-endpoint sub-exporter queues. To avoid data loss on a scale-down event, there can't be queues on the sub-exporters. However, if you disable that queue, exporting becomes extremely slow because the endpoints are exported to synchronously, one after another.

So, basically: if we have a sub-exporter queue, we get data loss on scaling; if we don't have one, we get data loss from performance issues because exporting is too slow to keep up. The problem gets worse the more downstream endpoints you have.

I've annotated the problematic code here with comments:

for exp, td := range exporterSegregatedTraces {
  start := time.Now()
  // with no queue, this blocks.
  // with queue, this doesn't block but causes data loss when the downstream gets scaled down.
  err := exp.ConsumeTraces(ctx, td) 
  exp.consumeWG.Done()
  errs = multierr.Append(errs, err)
  ...
}

Steps to Reproduce

Disable the sub-exporter queue:

  loadbalancing:
    # top-level exporter queue
    sending_queue:
      queue_size: 6000
      num_consumers: 200 # even with a high consumer count, the problem persisted for us
    protocol:
      otlp:
        sending_queue:
          enabled: false

Then throw some load at the collector.

The result is severe performance degradation and high export times.

Proposed solution

I think we need a way of disabling the sub-exporter queue without blocking in series here, for example an errgroup or similar so the consumers can export to the different endpoints in parallel without relying on the sub-exporter queue. I'm open to other ideas to solve the problem, but this seems simplest to me; a rough sketch is below.
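Something like the following (a minimal, untested sketch, not the component's actual code; it assumes the same exporterSegregatedTraces, exp.consumeWG, and ctx as the loop above, and uses golang.org/x/sync/errgroup):

// Fan the per-endpoint exports out to goroutines instead of exporting in
// series, so disabling the sub-exporter queue no longer serializes exports.
var g errgroup.Group
for exp, td := range exporterSegregatedTraces {
  exp, td := exp, td // capture the loop variables for the goroutine
  g.Go(func() error {
    defer exp.consumeWG.Done()
    return exp.ConsumeTraces(ctx, td)
  })
}
// Wait blocks until every export has finished and returns the first non-nil
// error; aggregating all errors (as the current multierr loop does) would
// need a little extra bookkeeping.
err := g.Wait()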

Collector version

v0.115.0

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

jamesmoessis added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Dec 9, 2024
github-actions bot commented Dec 9, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jamesmoessis (Contributor, Author) commented:
cc @MovieStoreGuy @Fiery-Fenix

@jpkrohling (Member) commented:
I wonder if it's time to rethink the way we are handling failures at the load-balancing exporter and delegate some of that to the failoverconnector.
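Very roughly, one possible shape (an untested sketch; the pipeline names and the otlp/backup exporter are made up for illustration, and the failover connector settings should be checked against its README): the failover connector routes data to prioritized pipelines, trying the load-balanced pipeline first and falling back to a plain OTLP exporter when it fails.

connectors:
  failover:
    # ordered fallback: try the load-balanced pipeline first,
    # then the backup pipeline if exporting through it fails
    priority_levels:
      - [traces/loadbalanced]
      - [traces/backup]

service:
  pipelines:
    traces/in:
      receivers: [otlp]
      exporters: [failover]
    traces/loadbalanced:
      receivers: [failover]
      exporters: [loadbalancing]
    traces/backup:
      receivers: [failover]
      exporters: [otlp/backup]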

@jamesmoessis (Contributor, Author) commented:
@jpkrohling the failoverconnector idea is interesting, I'd be interested to hear more about that.

I think solving this would be relatively easy though: if there is no queue on the sub-exporter, we should just fork and wait on the exporters using an errgroup or similar, as sketched above.

I'd also like to solve the problem where, if a single sub-exporter fails, the whole batch is retried instead of just the failed parts. Would something like the failover connector solve that?
