Datadog Agent won't flush when receiving SIGTERM #3940
Comments
Hello @chenluo1234, thanks for reporting this.
Please note that this was partially addressed in #4129; the fix doesn't flush open DogStatsD time buckets, but it flushes everything else.
@truthbk et al, is there any news on whether open buckets will be addressed any time soon? We're seeing that if we emit a DogStatsD metric to a Datadog container sidecar and the main task exits, the metrics aren't consistently making it back to the Datadog API.
I've encountered this issue while using https://github.com/DataDog/agent-github-action in our CI. The result is that the metrics don't appear in Datadog, and the only solution I've found so far is to make my CI process wait before exiting. I'm using v7.35.1 of the agent, so that should include #4129, but that doesn't seem to help.
This is currently a big problem for us running the agent on Heroku. We're using version 7.38.0. In our testing, when the agent stops and shuts down, attempting to reduce the wait time to 10 or 15 seconds always leaves some metrics unforwarded to the Datadog backend. Waiting an additional 30 seconds is an excessive and hardly reasonable delay.

I also agree with the previous comment from @iMacTia that the fix for #4129 does not appear to help. I would go further and say that #4129 is not fixed (at least in version 7.38.0), because when the agent stops it does not appear to forward received metrics to the Datadog backend. The only way we can reliably ensure that all metrics sent by our app have been received by the Datadog Agent and forwarded to the Datadog backend is to wait 30 seconds after our app terminates before initiating shutdown of the agent.
@KSerrania, any chance of an update on this? The issue was added to a backlog in August 2019; is it still in the backlog? It looks like there's a deprecated label on this issue too.
Any hope of ever getting this fixed? We're waiting for up to 30 seconds before shutting down and occasionally it's not enough. This is really a tough pill to swallow and not an acceptable solution at all.
Just to clarify, I was mistaken in my earlier comment:
We have a script that implements two different time intervals in an attempt to work around this issue and ensure all metrics are forwarded during shutdowns. The first interval is the time to wait between our app terminating and the Datadog Agent terminating; this needs to be more than 10 seconds (e.g., 15 seconds), because the flush interval used by the agent is 10 seconds. The second interval is a timeout that gives the agent up to 30 seconds to shut down cleanly. I mistakenly incremented the second timeout interval when I meant to increment the time to wait before shutting down the agent.

For anyone else experiencing missing metrics because the Datadog Agent does not forward metrics during shutdown, try adding a delay between when your app sends its last metric and when the Datadog Agent terminates. This delay must be greater than 10 seconds, in order to give the agent enough time to flush metrics and forward them to the Datadog backend. A sketch of our approach is below.

I'd really like for this issue to be addressed so we don't need to worry about such matters. The agent should forward all metrics received before terminating. Period. I don't understand why this issue is still not resolved after nearly 5 years.
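For illustration, here is a minimal Python sketch of such a shutdown wrapper. The PIDs and timings are placeholders mirroring the two intervals described above, not an official Datadog recommendation.

```python
# Hypothetical shutdown wrapper; PIDs and timings are illustrative only.
import os
import signal
import time


def pid_running(pid: int) -> bool:
    """Return True if a process with this PID is still alive."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
        return True
    except ProcessLookupError:
        return False


def shutdown(app_pid: int, agent_pid: int) -> None:
    # Stop the application first and wait for it to exit.
    os.kill(app_pid, signal.SIGTERM)
    while pid_running(app_pid):
        time.sleep(1)

    # Interval 1: wait longer than the agent's 10-second flush interval so the
    # last metrics the app emitted are aggregated and forwarded.
    time.sleep(15)

    # Interval 2: stop the agent, but give it at most 30 seconds to exit cleanly.
    os.kill(agent_pid, signal.SIGTERM)
    for _ in range(30):
        if not pid_running(agent_pid):
            return
        time.sleep(1)
    os.kill(agent_pid, signal.SIGKILL)  # force-stop if it never exits
```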
Just got hit with this myself. We have a number of VMs that spin up and down to run our application and the agent, and we noticed we were losing the last couple of metrics emitted from the application on exit. After lots of faff, we finally traced it back to this issue.

Our solution is much like the others: delay shutdown by a few moments. But it's not ideal. I'm truly begging for this to be prioritised as critical. You would hope that an observability platform would have "don't ever lose data" as one of its core principles. At least emit a warning or message that metrics are being lost, goodness me!
This is still an issue; we had to follow the advice above and delay the agent container shutdown, otherwise we were missing metrics. However, we were only running the container to send a handful of business metrics, so we switched to using the Datadog HTTP API instead and avoided the need to run an agent container.
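For anyone considering the same route, here is a minimal sketch of submitting a count metric directly to the Datadog HTTP API (v1 series endpoint) without a local agent. The metric name and tags are illustrative, and it assumes a `DD_API_KEY` environment variable.

```python
# Minimal sketch: send a count metric straight to the Datadog v1 series API.
# Metric name and tags are illustrative; DD_API_KEY must be set.
import os
import time

import requests


def submit_count(metric: str, value: float, tags: list[str]) -> None:
    payload = {
        "series": [
            {
                "metric": metric,
                "points": [[int(time.time()), value]],
                "type": "count",
                "tags": tags,
            }
        ]
    }
    resp = requests.post(
        "https://api.datadoghq.com/api/v1/series",
        headers={"DD-API-KEY": os.environ["DD_API_KEY"]},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()  # surface delivery failures instead of dropping data silently


submit_count("ci.business.metric", 1, ["env:ci"])
```

Because the request is synchronous and checked, delivery failures show up immediately instead of metrics silently disappearing during an agent shutdown.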
Got bit by this again today, even after waiting 30 seconds before shutting down the agent. Apparently 30 seconds is not always long enough to ensure the agent forwards all metrics. I'm tempted to look at using the HTTP API as @awconstable suggests above, but I have to imagine running into similar issues with connectivity and ensuring metrics are sent successfully.
This is still happening and is particularly problematic with cron jobs. Recording metrics too close to a shutdown mostly results in those metrics being dropped and lost forever. The only workaround seems to be moving the operation earlier, which isn't always possible. Related issue: #1547
I am also hitting this in GitHub Actions when sending metrics as the last bit of the action. I'd like to avoid adding a sleep to our workflow; flushing on a TERM would be tops.
I am having the same issue, and for now the only solution we have found is adding a Thread.sleep in the code. A way of forcing metrics to be sent before the application dies would be very useful.
Please consider fixing this issue soon. Not having all the relevant data makes debugging issues extremely difficult. UPDATE:
Describe what happened:
When the Datadog Agent receives SIGTERM, it doesn't flush its metrics before it dies.
Describe what you expected:
When it receives SIGTERM, I would expect the Datadog Agent to flush all currently aggregated metrics before it shuts down gracefully.
Steps to reproduce the issue:
Step 1: Start the Datadog Agent
Step 2: Emit a metric to the Datadog Agent
Step 3: Send a SIGTERM signal to the Datadog Agent before the 15-second flush interval ends.
The metric emitted at step 2 will simply be lost and unavailable.
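A minimal sketch of the reproduction, assuming the agent is listening for DogStatsD traffic on UDP 127.0.0.1:8125; the metric name and agent PID are placeholders.

```python
# Reproduction sketch: emit one DogStatsD metric, then SIGTERM the agent
# before the flush interval elapses. AGENT_PID is a placeholder.
import os
import signal
import socket

AGENT_PID = 12345  # hypothetical; substitute the actual agent PID

# Step 2: emit one DogStatsD counter metric.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"example.sigterm.counter:1|c", ("127.0.0.1", 8125))
sock.close()

# Step 3: terminate the agent before the flush interval ends.
# The counter above never makes it to the Datadog backend.
os.kill(AGENT_PID, signal.SIGTERM)
```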
Additional environment details (Operating System, Cloud provider, etc):