Not able to send metrics to Datadog #6093

Closed
harshitdx29 opened this issue Jul 9, 2019 · 15 comments
Labels: bug (unexpected problem or unintended behavior)

Comments

@harshitdx29

Relevant telegraf.conf:

# Configuration for DataDog API to send metrics to.
[[outputs.datadog]]
  ## Datadog API key
  apikey = "**************************************"

  ## Connection timeout.
  timeout = "5s"

Expected behavior:

I expect my metrics to be sent to Datadog.

Actual behavior:

I get a timeout error. However, when I hit the Datadog API directly from the same host it works, so this is certainly not a connectivity issue.
The API call I am making is:

currenttime=$(date +%s)
curl -X POST -H "Content-type: application/json" \
  -d "{ \"series\" :
         [{\"metric\":\"test.metric\",
          \"points\":[[$currenttime, 20]],
          \"type\":\"rate\",
          \"interval\": 20,
          \"host\":\"test.example.com\",
          \"tags\":[\"environment:test\"]}
        ]
      }" \
  'https://api.datadoghq.com/api/v1/series?api_key=<YOUR_API_KEY>'

Additional info:

2019-07-09T07:37:15Z E! Error writing to output [datadog]: error POSTing metrics, Post https://app.datadoghq.com/api/v1/series?api_key=****************: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

@danielnelson
Contributor

@harshitdx29 I checked that this plugin is working for me, and I also took a quick look over the code for this plugin, but I don't see any obvious cause other than a timeout/networking issue. Any chance you are using a proxy?

@danielnelson added the "bug" and "need more info" labels Jul 9, 2019
@harshitdx29
Author

It can't be a networking issue because I am able to hit the Datadog URL directly using curl.

Also, we are not using any proxy for this.

@danielnelson
Contributor

Any chance we just need to increase the timeout option? Telegraf is likely sending a much larger payload than the curl command.

Otherwise, first try running Telegraf in the foreground (not as a service) from the same shell as the curl command. If that still doesn't help, could you try from a different computer, preferably at a different location?
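
For reference, the timeout is set in the output plugin's own section of telegraf.conf. A minimal sketch, assuming the same [[outputs.datadog]] block as in the report; the 30s value is only an illustrative assumption, not a recommended setting:

```toml
[[outputs.datadog]]
  ## Datadog API key (placeholder; substitute your own)
  apikey = "<YOUR_API_KEY>"

  ## Connection timeout, raised from the 5s in the report.
  ## 30s is an arbitrary illustrative value.
  timeout = "30s"
```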

@harshitdx29
Author

I increased the timeout to 60s, but without success. I also reconfirmed with our devops team that we are not using any proxy.

@harshitdx29
Author

I am now getting this in the Telegraf logs:

2019-07-31T10:09:02Z E! Error writing to output [datadog]: received bad status code, 413

@danielnelson
Contributor

This means Datadog rejected the payload because it was too large. What is your agent metric_batch_size? Try reducing it and see if it helps.
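
The batch size is set in the [agent] section of telegraf.conf. A minimal sketch, assuming the rest of the agent settings are left at their defaults; the value 500 is just a starting point to experiment with, not a known-good limit for Datadog:

```toml
[agent]
  ## Maximum number of metrics sent to each output in one write.
  ## The default is 1000; smaller batches mean smaller request bodies.
  metric_batch_size = 500
```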

@harshitdx29
Author

It was the default of 1000; I reduced it to 500. I also noticed another problem in all of our services: for an initial period after starting the Telegraf container, I get the following error:

Error writing to output [datadog]: error POSTing metrics, Post https://app.datadoghq.com/api/v1/series?api_key=****************: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

After that, it establishes a connection.

@harshitdx29
Author

I also started getting a lot of errors:

2019-07-31T16:54:20Z W! Skipping a scheduled flush because there is already a flush ongoing.
2019-07-31T16:54:20Z E! Error in plugin [inputs.system]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.disk]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.diskio]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.net]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.netstat]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.mem]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.processes]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.kernel]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.docker]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.swap]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.conntrack]: took longer to collect than collection interval (10s)
2019-07-31T16:54:20Z E! Error in plugin [inputs.cpu]: took longer to collect than collection interval (10s)

@danielnelson
Contributor

The first error is a timeout: Datadog took too long to respond (60s?). The second is a common error in older Telegraf versions when the output takes a long time to write; it will probably go away if you update.

@harshitdx29
Author

For the first one, why does Datadog take so long to respond when using Telegraf? If I hit the Datadog APIs directly, they respond quickly.

For the second one, can you suggest how to update the version? I am using Docker to run the Telegraf container. Also, these errors started appearing when I reduced metric_batch_size to 500. They were not appearing earlier.

@harshitdx29
Author

Reducing the metric batch size worked. Thanks :)

@danielnelson
Contributor

Great, I'm going to close this issue then. I think the full batch sizes were just too large for Datadog to process quickly, or they were taking too long to upload. If you want further visibility into this, you could enable the internal plugin, which will output metrics about how long it takes to write each batch.
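
Enabling the internal plugin is a small addition to telegraf.conf. A sketch, assuming the collected metrics should simply go to the outputs that are already configured; to my knowledge the write timings appear in the internal_write measurement (for example the write_time_ns field), but check the plugin README for your Telegraf version:

```toml
[[inputs.internal]]
  ## Skip Go runtime memory statistics to keep the extra metric volume small.
  collect_memstats = false
```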

@harshitdx29
Author

I'm getting this now :(

2019-08-01T03:47:10Z E! [agent] Error writing to output [datadog]: unable to marshal TimeSeries, json: unsupported value: NaN

@danielnelson
Contributor

I imagine this should be an easy fix. Could you do me a favor and open a new issue for it? That way, when we reference the bug in the changelog, it will be clear what the issue was and it won't be muddied by the previous conversation.

@harshitdx29
Author

Done: #6191

2 participants