[exporter/datadog] Log events are lost in case of network issues without retry #24550

Closed
anmalysh-yb opened this issue Jul 25, 2023 · 10 comments
Labels
bug Something isn't working data:logs Logs related issues exporter/datadog Datadog components

Comments

@anmalysh-yb

Component(s)

exporter/datadog

What happened?

Currently the Datadog exporter treats network errors as permanent errors, so it does not queue the affected log events for retry.

Probably introduced with this diff: a21144b

Steps to Reproduce

1. Configure the collector as shown in this ticket.
2. Start writing logs.
3. Turn off the network connection.
4. Turn the network connection back on after some time.

Expected Result

All the logs are exported to Datadog.

Actual Result

Logs written during network outage are lost.

Collector version

0.81.0

Environment information

Environment

OS: MacOS
Compiler(if manually compiled): N/A

OpenTelemetry Collector configuration

receivers:
  filelog:
    include: [ /Users/amalysh86/otelcol/simple.log*]
    start_at: beginning
    storage: file_storage/pts
    operators:
      - type: regex_parser
        regex: '^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<sev>[A-Z]*) (?P<msg>.*)$'
        timestamp:
          parse_from: attributes.time
          layout: '%Y-%m-%d %H:%M:%S'
        severity:
          parse_from: attributes.sev

exporters:
  file:
    path: /Users/amalysh86/otelcol/out.json
  datadog:
    api:
      site: us3.datadoghq.com
      key: <API KEY HERE>
    retry_on_failure:
      enabled: true
      initial_interval: 1m
      max_interval: 1800m
    sending_queue:
      storage: file_storage/psq

extensions:
  file_storage/pts:
    directory: /Users/amalysh86/otelcol/pts
    timeout: 10s # in what time a file lock should be obtained

  file_storage/psq:
    directory: /Users/amalysh86/otelcol/psq
    timeout: 10s # in what time a file lock should be obtained
    compaction:
      directory: /Users/amalysh86/otelcol/psq
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 5
      rebound_trigger_threshold_mib: 3

service:
  extensions:
  - file_storage/pts
  - file_storage/psq
  pipelines:
    logs:
      receivers: [filelog]
      processors: []
      exporters: [datadog]
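
To make the `retry_on_failure` values in the configuration above more concrete, here is a rough Go sketch of the wait times that an `initial_interval` of 1m and a `max_interval` of 1800m would bound. It assumes a plain doubling schedule without jitter, which is a simplification for illustration and not the collector's actual backoff implementation. The function names `backoffSchedule` and `issueSchedule` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoffSchedule returns the first n wait intervals of a doubling backoff
// that starts at initial and never exceeds maxWait.
// Assumption: doubling without jitter, chosen only for illustration.
func backoffSchedule(initial, maxWait time.Duration, n int) []time.Duration {
	out := make([]time.Duration, 0, n)
	wait := initial
	for i := 0; i < n; i++ {
		if wait > maxWait {
			wait = maxWait // cap the interval at the configured maximum
		}
		out = append(out, wait)
		wait *= 2
	}
	return out
}

// issueSchedule applies the values from the configuration in this issue:
// initial_interval: 1m, max_interval: 1800m.
func issueSchedule(n int) []time.Duration {
	return backoffSchedule(time.Minute, 1800*time.Minute, n)
}

func main() {
	for i, d := range issueSchedule(5) {
		fmt.Printf("retry %d after %v\n", i+1, d)
	}
}
```

None of these intervals come into play if the exporter reports the failure as permanent, which is the behavior this issue describes.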

Log output

2023-06-30T19:05:28.650+0300	error	exporterhelper/queued_retry.go:391	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "logs", "name": "datadog", "error": "Permanent error: Post \"https://http-intake.logs.us3.datadoghq.com/api/v2/logs?ddtags=otel_source%3Adatadog_exporter\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "dropped_items": 1}
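
The error in the log line above is a client-side timeout, so no response was ever received. As a hedged illustration, the following Go sketch shows one way to recognize such an error using only the standard library. The helpers `isTimeout`, `sampleTimeoutErr`, and `samplePlainErr` are hypothetical names, and the sample error is only shaped like the logged one; this is not the exporter's code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/url"
)

// isTimeout reports whether err, or any error it wraps, represents a timeout.
func isTimeout(err error) bool {
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

// sampleTimeoutErr builds an error shaped like the one in the log output:
// a failed POST whose underlying cause is an exceeded deadline.
func sampleTimeoutErr() error {
	return &url.Error{
		Op:  "Post",
		URL: "https://http-intake.logs.us3.datadoghq.com/api/v2/logs",
		Err: context.DeadlineExceeded,
	}
}

// samplePlainErr builds an error that is unrelated to timeouts.
func samplePlainErr() error {
	return errors.New("400 Bad Request")
}

func main() {
	fmt.Println(isTimeout(sampleTimeoutErr())) // true
	fmt.Println(isTimeout(samplePlainErr()))   // false
}
```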

Additional context

No response

@anmalysh-yb anmalysh-yb added bug Something isn't working needs triage New item requiring triage labels Jul 25, 2023
@github-actions github-actions bot added the exporter/datadog Datadog components label Jul 25, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.


@crobert-1
Member

crobert-1 commented Oct 12, 2023

Hello @anmalysh-yb, it looks like this was intended from the change here: #16390. I'll defer to the code owners to confirm if this was intended or not.

Note: Looks like another user is running into the same issue, maybe this is an unintended consequence.

@mx-psi
Member

mx-psi commented Oct 13, 2023

@dineshg13 can this be closed or is there work left to do?

@anmalysh-yb
Author

anmalysh-yb commented Oct 13, 2023

@mx-psi I think this is a real issue and should be fixed.
It prevents reliable log delivery to DD.
If the network becomes unavailable, we start dropping all log events instead of retrying them.

anmalysh-yb added a commit to anmalysh-yb/opentelemetry-collector-contrib that referenced this issue Oct 23, 2023
@siarhei-kharchanka-cko
Contributor

siarhei-kharchanka-cko commented Oct 27, 2023

Hey @mx-psi @anmalysh-yb, as users of the Datadog exporter we are also losing log records because of network blips. I believe this is a regression and needs to be fixed. How can I help move this issue forward?

@mx-psi mx-psi added the data:logs Logs related issues label Oct 27, 2023
@songy23
Member

songy23 commented Oct 27, 2023

👋 #27450 is expected to fix the issue.

@mx-psi mx-psi linked a pull request Oct 27, 2023 that will close this issue
@crobert-1 crobert-1 removed the needs triage New item requiring triage label Oct 27, 2023
@siarhei-kharchanka-cko
Contributor

Hey @songy23, thanks for the update. The referenced PR is quite a heavy refactoring. How feasible would it be to fix this issue within the current log-pushing approach, without waiting for the refactoring to be released? I assume it might be a fairly quick, light fix (I am happy to create a PR). Or perhaps you could suggest a workaround to prevent data loss while waiting for the refactoring to land in main?

@songy23
Member

songy23 commented Oct 27, 2023

@siarhei-kharchanka-cko It would be awesome if you can send a quick fix!

Context: this issue only affects the HTTP log client that the Datadog log exporter currently uses. #27450 migrates the Datadog log exporter from the HTTP log client to the logs agent, which has built-in retries.

@siarhei-kharchanka-cko
Contributor

Thank you @songy23! I've linked a PR.

mx-psi pushed a commit that referenced this issue Oct 30, 2023
…esponse (#28672)

**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
The Datadog exporter treats network/connectivity errors (the HTTP client
doesn't receive a response) as permanent errors, which can lead to log
record loss. This change makes these errors retryable.

**Link to tracking Issue:** #24550

**Testing:** <Describe what testing was performed and which tests were
added.>

**Documentation:** <Describe the documentation added.>
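
The idea in the commit description above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the exporter's actual code: `permanentError` is a hypothetical stand-in for the collector's permanent-error wrapper (the real collector provides `consumererror.NewPermanent`), `classify` and the two case helpers are made-up names, and the sketch assumes that errors arriving together with a response remain permanent.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// permanentError marks an error that must not be retried.
// Hypothetical stand-in for the collector's permanent-error wrapper.
type permanentError struct{ err error }

func (p permanentError) Error() string { return "Permanent error: " + p.err.Error() }
func (p permanentError) Unwrap() error { return p.err }

// isPermanent reports whether err was marked as permanent anywhere in its chain.
func isPermanent(err error) bool {
	var p permanentError
	return errors.As(err, &p)
}

// classify leaves the error retryable when no response was received
// (network or connectivity failure) and marks it permanent otherwise.
func classify(resp *http.Response, err error) error {
	if err == nil {
		return nil
	}
	if resp == nil {
		return err // no response: let the retry queue try again
	}
	return permanentError{err: err}
}

// noResponseCase simulates a request that timed out before any response.
func noResponseCase() error {
	return classify(nil, errors.New("context deadline exceeded"))
}

// rejectedCase simulates a request that the server answered with an error.
func rejectedCase() error {
	return classify(&http.Response{StatusCode: 400}, errors.New("400 Bad Request"))
}

func main() {
	fmt.Println(isPermanent(noResponseCase())) // false
	fmt.Println(isPermanent(rejectedCase()))   // true
}
```

With a classification like this, a queue that drops only permanent errors would keep the records from a network outage and retry them later.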
@mx-psi mx-psi closed this as completed Oct 30, 2023
jmsnll pushed a commit to jmsnll/opentelemetry-collector-contrib that referenced this issue Nov 12, 2023
…esponse (open-telemetry#28672)

RoryCrispin pushed a commit to ClickHouse/opentelemetry-collector-contrib that referenced this issue Nov 24, 2023
…esponse (open-telemetry#28672)


5 participants