Datadog Agent won't flush when receiving SIGTERM #3940

Open
chenluo1234 opened this issue Jul 25, 2019 · 15 comments
Labels
[deprecated] team/agent-core Deprecated. Use metrics-logs / shared-components labels instead..

Comments

@chenluo1234

Describe what happened:
When datadog agent receives SIGTERM, it won't flush the metrics before it dies.

Describe what you expected:
When it receives SIGTERM, I would expect the Datadog Agent to flush all currently aggregated metrics before it gracefully shuts down.

Steps to reproduce the issue:
Step 1: Start the Datadog Agent
Step 2: Emit a metric to Datadog Agent
Step 3: Send a SIGTERM signal to the Datadog Agent before the 15-second flush interval ends.

Metrics emitted at step 2 are simply lost and never become available.
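A minimal reproduction sketch of the steps above, assuming the Python `datadog` DogStatsD client and an agent listening on the default DogStatsD port; the agent PID is a placeholder you would look up on your host:

```python
import os
import signal
import time

from datadog import initialize, statsd

# Step 1 assumed: the Datadog Agent is already running locally.
initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Step 2: emit a metric to the agent.
statsd.increment("repro.counter", tags=["source:sigterm-test"])

# Give the UDP packet a moment to arrive, but stay well inside the flush interval.
time.sleep(1)

# Step 3: terminate the agent before the flush interval ends.
agent_pid = 12345  # placeholder; replace with the real agent PID
os.kill(agent_pid, signal.SIGTERM)

# Expected: "repro.counter" is flushed during shutdown.
# Observed: the metric never reaches the Datadog backend.
```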

Additional environment details (Operating System, Cloud provider, etc):

@KSerrania KSerrania added the [deprecated] team/agent-core Deprecated. Use metrics-logs / shared-components labels instead.. label Aug 5, 2019
@KSerrania
Contributor

Hello @chenluo1234,

Thanks for reporting this.
We're tracking this feature request in our backlog. This issue will be updated when we start working on it; until then, feel free to contribute if you have some time 🙂!

@truthbk
Member

truthbk commented Jan 22, 2021

Please note that this was partially addressed by #4129; the solution doesn't flush open dogstatsd time buckets, but it does flush everything else:

  • closed time buckets.
  • check metrics.

@miketheman
Contributor

the solution doesn't flush open dogstatsd time buckets ...

@truthbk et al - is there any news on whether open buckets will be addressed any time soon? We're seeing that if we emit a dogstatsd metric to a Datadog container sidecar and the main task exits, the metrics don't consistently make it back to the Datadog API.

@iMacTia

iMacTia commented May 5, 2022

I've encountered this issue while using https://github.com/DataDog/agent-github-action in our CI.
The Action "gracefully" terminates the container running the agent, but the latter does not flush the metrics I sent to it (in my case statsd gauges) before exiting.

The result is that the metrics don't appear in Datadog and the only solution I found so far was to make my CI process sleep for 10 seconds, which is less than ideal 😅 (and a waste of money).

I'm using v7.35.1 of the agent, so it should include #4129, but that doesn't seem to help.
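For anyone else hitting this in CI, the sleep workaround boils down to something like this sketch (assuming the Python `datadog` client; the gauge name, tags, and the 10-second pause are illustrative):

```python
import time

from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

# The metric the CI job actually cares about.
statsd.gauge("ci.build.duration_seconds", 123.4, tags=["pipeline:main"])

# Pad the job so the agent's flush interval elapses before the action
# tears the agent container down. Wasteful, but without it the gauge
# above is silently dropped.
time.sleep(10)
```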

@mmercurio

This is currently a big problem for us running the agent on Heroku. We're using version 7.38.0. When the agent is stopped (using the stop command), the most recent metrics sent by our app and received by the agent are not forwarded to the Datadog backend unless we wait an additional 30 seconds before stopping it.

In our testing, reducing the wait time to 10 or 15 seconds always leaves some metrics missing and never forwarded to the Datadog backend. Waiting an additional 30 seconds is excessive and hardly reasonable.

I'm also in agreement with the previous comment from @iMacTia that the fix for #4129 does not appear to help. I would go further and say that #4129 is not fixed (at least in version 7.38.0), because when the agent stops it does not appear to forward received metrics to the Datadog backend.

The only way we can reliably ensure all metrics sent by our app have been received by the Datadog Agent and forwarded to the Datadog backend is to wait 30 seconds after our app terminates before initiating shutdown of the agent.

@twe4ked

twe4ked commented Oct 4, 2022

@KSerrania any chance of an update on this? The issue was added to the backlog in August 2019; is it still there? It looks like there's a deprecated label on this issue too.

@mmercurio

Any hope of ever getting this fixed? We're waiting up to 30 seconds before shutting down, and occasionally that's not enough. This is a tough pill to swallow and not an acceptable solution at all.

@mmercurio

Just to clarify, I was mistaken in this comment:

we're waiting for up to 30 seconds before shutting down

We have a script that implements two different time intervals in an attempt to work around this issue and ensure all metrics are forwarded during shutdowns. The first interval is the time to wait between our app terminating and the Datadog agent terminating. This needs to be more than 10 seconds (e.g., 15 seconds) because the flush interval used by the agent is 10 seconds.

We also have a second timeout interval that gives the agent up to 30 seconds to shutdown cleanly. I mistakenly incremented the second timeout interval when I meant to increment the time to wait before shutting down the agent.

For anyone else that might be experiencing issues with missing metrics due to the Datadog agent not forwarding metrics during shutdown, try adding a delay between when your app sends its last metric and when the Datadog agent terminates. This delay must be greater than 10 seconds, in order to give the agent enough time to flush metrics and forward them on to the Datadog backend.
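In outline, the shutdown script does the equivalent of this sketch (the `datadog-agent stop` command and the exact values are placeholders; adapt them for your platform):

```python
import subprocess
import time

FLUSH_PAD_SECONDS = 15    # interval 1: must exceed the agent's 10-second flush interval
AGENT_STOP_TIMEOUT = 30   # interval 2: upper bound for the agent's own shutdown

# Interval 1: after the app has sent its last metric and exited,
# wait out the agent's flush interval.
time.sleep(FLUSH_PAD_SECONDS)

# Interval 2: ask the agent to stop, giving it up to 30 seconds to exit cleanly.
try:
    subprocess.run(["datadog-agent", "stop"], timeout=AGENT_STOP_TIMEOUT, check=False)
except subprocess.TimeoutExpired:
    print("agent did not stop within the timeout; recent metrics may still be lost")
```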

I'd really like this issue to be addressed so we don't need to worry about such matters. The agent should forward all metrics received before terminating. Period. I don't understand why this issue is still not resolved after nearly 5 years.

@Knifa

Knifa commented Jun 21, 2023

Just got hit with this myself. We have a number of VMs that spin up and down running our application and the agent, and we noticed we were losing the last couple of metrics emitted by the application on exit. After lots of faff, we finally traced it back to this.

Our solution is much like the others': delay shutdown by a few moments. But it's not ideal.

I'm truly begging for this to be prioritised as critical. You would hope that an observability platform would have "don't ever lose data" as one of its core principles. At least emit a warning or message that metrics are being lost, goodness me!

@awconstable

This is still an issue; we had to follow the advice above and delay the agent container shutdown, otherwise we were missing metrics.

However, we were only running the container to send a handful of business metrics, so we switched to using the Datadog HTTP API instead and avoided the need to run an agent container at all.
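In case it helps others, submitting directly over HTTP looks roughly like the sketch below (using the v2 series endpoint and the Python `requests` package; the metric name, type, and tags are illustrative, so check the current API reference for the exact schema). Because the request is synchronous, there is no agent-side flush interval to race against.

```python
import os
import time

import requests

payload = {
    "series": [
        {
            "metric": "business.orders.processed",  # illustrative metric name
            "type": 1,  # 1 = count in the v2 series schema
            "points": [{"timestamp": int(time.time()), "value": 42}],
            "tags": ["env:ci"],
        }
    ]
}

resp = requests.post(
    "https://api.datadoghq.com/api/v2/series",
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],  # assumes the API key is in the environment
        "Content-Type": "application/json",
    },
    json=payload,
    timeout=10,
)
resp.raise_for_status()
```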

@mmercurio

Got bit by this again today, even after waiting 30 seconds before shutting down the agent. Apparently 30 seconds is not always long enough to ensure the agent forwards all metrics.

I'm tempted to look at using the HTTP API as @awconstable suggests above, but I imagine we'd run into similar issues with connectivity and ensuring metrics are sent successfully.

@jaredpetersen

jaredpetersen commented Apr 23, 2024

This is still happening and is particularly problematic with cron jobs. Recording metrics too close to a shutdown mostly results in those metrics being dropped and lost forever. The only workaround seems to be moving the operation earlier, which isn't always possible.

Related issue: #1547

@bensherman

I am also hitting this in GitHub Actions when sending metrics as the last step of the action. I'd like to avoid adding a sleep to our workflow; flushing on a TERM would be tops.

@shermosa

I am having the same issue, and for now the only solution we have found is adding a Thread.sleep in the code. A way of forcing metrics to be sent before the application dies would be very useful.

@chconr

chconr commented Aug 23, 2024

Please consider fixing this issue soon. The lack of having all relevant data makes debugging issues extremely difficult.

UPDATE:
We use Karpenter on AWS in conjunction with Datadog; our Datadog agents are configured as a DaemonSet. There was an issue with Karpenter in which Datadog agent pods were destroyed before the service pods terminated, preventing the remaining service pods from sending the rest of their logs.

kubernetes-sigs/karpenter#1133
