Datadog Agent won't flush when receiving SIGTERM #3940

Open
chenluo1234 opened this issue Jul 25, 2019 · 15 comments
Labels
[deprecated] team/agent-core Deprecated. Use metrics-logs / shared-components labels instead..

Comments

@chenluo1234

Describe what happened:
When datadog agent receives SIGTERM, it won't flush the metrics before it dies.

Describe what you expected:
When it receives SIGTERM, I would expect the Datadog Agent to flush all currently aggregated metrics before it gracefully shuts down.

Steps to reproduce the issue:
Step 1: Start the Datadog Agent
Step 2: Emit a metric to Datadog Agent
Step 3: Send a SIGTERM signal to the Datadog Agent before the 15-second flush interval ends.

Metrics emitted at step 2 are simply lost and never become available.
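A minimal reproduction sketch of the steps above, assuming the Python `datadog` DogStatsD client and an agent listening on the default DogStatsD port; the agent PID is a placeholder you would look up on your host:

```python
import os
import signal
import time

from datadog import initialize, statsd

# Step 1 assumed: the Datadog Agent is already running locally.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Step 2: emit a metric to the agent.
statsd.increment("repro.counter", tags=["source:sigterm-test"])

# Give the UDP packet a moment to arrive, but stay well inside the flush interval.
time.sleep(1)

# Step 3: terminate the agent before the flush interval ends.
agent_pid = 12345  # placeholder; replace with the real agent PID
os.kill(agent_pid, signal.SIGTERM)

# Expected: "repro.counter" is flushed during shutdown.
# Observed: the metric never reaches the Datadog backend.
```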

Additional environment details (Operating System, Cloud provider, etc):

@KSerrania KSerrania added the [deprecated] team/agent-core Deprecated. Use metrics-logs / shared-components labels instead.. label Aug 5, 2019
@KSerrania
Contributor

Hello @chenluo1234,

Thanks for reporting this.
We're tracking this feature request in our backlog. This issue will be updated when we start working on it; until then, feel free to contribute if you have some time 🙂!

@truthbk
Member

truthbk commented Jan 22, 2021

Please note that this was partially addressed by #4129; the solution doesn't flush open dogstatsd time buckets, but it does flush everything else:

  • closed time buckets.
  • check metrics.

@miketheman
Contributor

the solution doesn't flush open dogstatsd time buckets ...

@truthbk et al - is there any news on whether open buckets will be addressed any time soon? We're seeing that if we emit a dogstatsd metric to a Datadog container sidecar and the main task exits, the metrics don't consistently make it back to the Datadog API.

@iMacTia

iMacTia commented May 5, 2022

I've encountered this issue while using https://github.com/DataDog/agent-github-action in our CI.
The Action "gracefully" terminates the container running the agent, but the latter does not flush the metrics I sent to it (in my case statsd gauges) before exiting.

The result is that the metrics don't appear in Datadog and the only solution I found so far was to make my CI process sleep for 10 seconds, which is less than ideal 😅 (and a waste of money).

I'm using v7.35.1 of the agent, so it should include #4129, but that doesn't seem to help.
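For anyone else hitting this in CI, the sleep workaround boils down to something like this sketch (assuming the Python `datadog` client; the gauge name, tags, and the 10-second pause are illustrative):

```python
import time

from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

# The metric the CI job actually cares about.
statsd.gauge("ci.build.duration_seconds", 123.4, tags=["pipeline:main"])

# Pad the job so the agent's flush interval elapses before the action
# tears the agent container down. Wasteful, but without it the gauge
# above is silently dropped.
time.sleep(10)
```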

@mmercurio

This is currently a big problem for us running the agent on Heroku. We're using version 7.38.0. When the agent is stopped (using the stop command), the most recent metrics sent by our app and received by the agent are not forwarded to the Datadog backend unless we wait an additional 30 seconds before stopping it.

In our testing, reducing the wait time to 10 or 15 seconds always leaves some metrics missing and never forwarded to the Datadog backend. Waiting an additional 30 seconds is excessive and hardly reasonable.

I'm also in agreement with the previous comment from @iMacTia that the fix for #4129 does not appear to help. I would go further and say that #4129 is not fixed (at least in version 7.38.0), because when the agent stops it does not appear to forward received metrics to the Datadog backend.

The only way we can reliably ensure all metrics sent by our app have been received by the Datadog Agent and forwarded to the Datadog backend is to wait 30 seconds after our app terminates before initiating shutdown of the agent.

@twe4ked

twe4ked commented Oct 4, 2022

@KSerrania any chance of an update on this? The issue was added to the backlog in August 2019; is it still there? It looks like there's a deprecated label on this issue too.

@mmercurio

Any hope of ever getting this fixed? We're waiting up to 30 seconds before shutting down, and occasionally that's not enough. This is a tough pill to swallow and not an acceptable solution at all.

@mmercurio

Just to clarify, I was mistaken in this comment:

we're waiting for up to 30 seconds before shutting down

We have a script that implements two different time intervals in an attempt to work around this issue and ensure all metrics are forwarded during shutdowns. The first interval is the time to wait between our app terminating and the Datadog agent terminating. This needs to be more than 10 seconds (e.g., 15 seconds) because the flush interval used by the agent is 10 seconds.

We also have a second timeout interval that gives the agent up to 30 seconds to shutdown cleanly. I mistakenly incremented the second timeout interval when I meant to increment the time to wait before shutting down the agent.

For anyone else that might be experiencing issues with missing metrics due to the Datadog agent not forwarding metrics during shutdown, try adding a delay between when your app sends its last metric and when the Datadog agent terminates. This delay must be greater than 10 seconds, in order to give the agent enough time to flush metrics and forward them on to the Datadog backend.
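In outline, the shutdown script does the equivalent of this sketch (the `datadog-agent stop` command and the exact values are placeholders; adapt them for your platform):

```python
import subprocess
import time

FLUSH_PAD_SECONDS = 15    # interval 1: must exceed the agent's 10-second flush interval
AGENT_STOP_TIMEOUT = 30   # interval 2: upper bound for the agent's own shutdown

# Interval 1: after the app has sent its last metric and exited,
# wait out the agent's flush interval.
time.sleep(FLUSH_PAD_SECONDS)

# Interval 2: ask the agent to stop, giving it up to 30 seconds to exit cleanly.
try:
    subprocess.run(["datadog-agent", "stop"], timeout=AGENT_STOP_TIMEOUT, check=False)
except subprocess.TimeoutExpired:
    print("agent did not stop within the timeout; recent metrics may still be lost")
```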

I'd really like this issue to be addressed so we don't need to worry about such matters. The agent should forward all metrics received before terminating. Period. I don't understand why this issue is still not resolved after nearly 5 years.

@Knifa

Knifa commented Jun 21, 2023

Just got hit with this myself. We have a number of VMs that spin up and down running our application and the agent, and we noticed we were losing the last couple of metrics emitted by the application on exit. After lots of faff, we finally traced it back to this.

Our solution is much like the others': delay shutdown by a few moments. But it's not ideal.

I'm truly begging for this to be prioritised as critical. You would hope that an observability platform would have "don't ever lose data" as one of its core principles. At least emit a warning or message that metrics are being lost, goodness me!

@awconstable

This is still an issue; we had to follow the advice above and delay the agent container shutdown, otherwise we were missing metrics.

However, we were only running the container to send a handful of business metrics, so we switched to using the Datadog HTTP API instead and avoided the need to run an agent container at all.
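In case it helps others, submitting directly over HTTP looks roughly like the sketch below (using the v2 series endpoint and the Python `requests` package; the metric name, type, and tags are illustrative, so check the current API reference for the exact schema). Because the request is synchronous, there is no agent-side flush interval to race against.

```python
import os
import time

import requests

payload = {
    "series": [
        {
            "metric": "business.orders.processed",  # illustrative metric name
            "type": 1,  # 1 = count in the v2 series schema
            "points": [{"timestamp": int(time.time()), "value": 42}],
            "tags": ["env:ci"],
        }
    ]
}

resp = requests.post(
    "https://api.datadoghq.com/api/v2/series",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],  # assumes the API key is in the environment
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=10,
)
resp.raise_for_status()
```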

@mmercurio

Got bit by this again today, even after waiting 30 seconds before shutting down the agent. Apparently 30 seconds is not always long enough to ensure the agent forwards all metrics.

I'm tempted to look at using the HTTP API as @awconstable suggests above, but I imagine we'd run into similar issues with connectivity and ensuring metrics are sent successfully.

@jaredpetersen

jaredpetersen commented Apr 23, 2024

This is still happening and is particularly problematic with cron jobs. Recording metrics too close to a shutdown mostly results in those metrics being dropped and lost forever. The only workaround seems to be moving the operation earlier, which isn't always possible.

Related issue: #1547

@bensherman

I am also hitting this in GitHub Actions when sending metrics as the last step of the action. I'd like to avoid adding a sleep to our workflow; flushing on a TERM would be tops.

@shermosa

I am having the same issue, and for now the only solution we have found is adding a Thread.sleep in the code. A way of forcing metrics to be sent before the application dies would be very useful.

@chconr

chconr commented Aug 23, 2024

Please consider fixing this issue soon. The lack of having all relevant data makes debugging issues extremely difficult.

UPDATE:
We use Karpenter on AWS in conjunction with Datadog; our Datadog agents are configured as a DaemonSet. There was an issue with Karpenter in which Datadog agent pods were destroyed before the service pods terminated, preventing the remaining service pods from sending the rest of their logs.

kubernetes-sigs/karpenter#1133
